Special Session 122: Understanding the Learning of Deep Networks: Expressivity, Optimization, and Generalization

Functional neural network on infinite-dimensional data

Jun Fan
Hong Kong Baptist University
Hong Kong
Co-Author(s):    
Abstract:
Neural networks have proven their versatility in approximating continuous functions, but their capabilities extend far beyond this classical setting. In this talk, we delve into the realm of functional neural networks, which offer a promising approach for approximating nonlinear smooth functionals. By investigating the convergence rates of the approximation and generalization errors under different regularity conditions, we gain insights into the theoretical properties of these networks under the empirical risk minimization framework. This analysis contributes to a deeper understanding of functional neural networks and opens up new possibilities for their effective application in domains such as functional data analysis and scientific machine learning.
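As a rough illustration of the objects involved (one common construction in the literature, not necessarily the architecture analyzed in the talk), a functional neural network can approximate a nonlinear functional $F$ of an input function $f$ by first extracting finitely many linear features of $f$ and then applying an ordinary deep network $g_\theta$:
$$ F(f) \;\approx\; g_\theta\big(\langle f,\phi_1\rangle_{L^2},\dots,\langle f,\phi_d\rangle_{L^2}\big), \qquad \langle f,\phi_j\rangle_{L^2}=\int f(t)\,\phi_j(t)\,dt, $$
where the $\phi_j$ are fixed or learned basis functions; the attainable approximation and generalization rates then hinge on the smoothness (regularity) of $F$ and on the choice of the feature dimension $d$.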

On the Expressivity of Neural Networks and Its Applications

Juncai He
King Abdullah University of Science and Technology
Saudi Arabia
Co-Author(s):    Jinchao Xu and Lin Li
Abstract:
In this talk, I will present some recent results on the expressivity of neural networks and its applications. First, we will illustrate the connections between linear finite elements and ReLU DNNs, as well as between spectral methods and ReLU$^k$ DNNs. Second, we will share our latest findings regarding the open question of whether DNNs can precisely recover piecewise polynomials of arbitrary order on any simplicial mesh in any dimension. Then, we will discuss a specific result on the optimal expressivity of ReLU DNNs and its application in combination with the Kolmogorov-Arnold representation theorem. Finally, I will offer a remark on the study of convolutional neural networks from an expressivity perspective.
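To make the finite element connection concrete, recall the standard one-dimensional identity (a textbook example included here for orientation, not a result specific to the talk): the piecewise linear hat function with nodes $0,1,2$ is reproduced exactly by a one-hidden-layer ReLU network with three neurons,
$$ \Lambda(x) \;=\; \mathrm{ReLU}(x) \;-\; 2\,\mathrm{ReLU}(x-1) \;+\; \mathrm{ReLU}(x-2), $$
which equals $x$ on $[0,1]$, $2-x$ on $[1,2]$, and $0$ elsewhere; since linear finite element functions are linear combinations of such hats, every continuous piecewise linear function on a uniform one-dimensional mesh is realized exactly by a shallow ReLU network.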

Benign Overfitting of Vision Transformers

Wei Huang
RIKEN AIP
Japan
Co-Author(s):    
Abstract:
Transformers have demonstrated great power in the recent development of large foundation models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of computer vision, achieving remarkable empirical success. However, their theoretical capabilities, particularly in terms of generalization when trained to overfit the training data, are still not fully understood. To address this gap, this work examines transformers in vision through the lens of benign overfitting. To this end, we study the optimization of a Transformer composed of a softmax self-attention layer followed by a fully connected layer, trained by gradient descent on a certain data distribution model. By developing techniques that address the challenges posed by the softmax and by the interdependence of multiple weight matrices in transformer optimization, we characterize the training dynamics and establish generalization guarantees for the trained model. Our results establish a sharp condition, based on the signal-to-noise ratio of the data model, that distinguishes the small test error regime from the large test error regime. The theoretical results are further verified by numerical simulations. To the best of our knowledge, this is the first work to characterize benign overfitting for Transformers.
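For orientation, a common formalization of such a one-layer model (the exact parameterization in the talk may differ) maps input tokens or patches $X=(x_1,\dots,x_L)$ to
$$ f_\theta(X) \;=\; \sum_{i=1}^{L} v^\top \phi\Big(\textstyle\sum_{j=1}^{L}\operatorname{softmax}_j\big(\langle W_Q x_i,\,W_K x_j\rangle\big)\,W_V x_j\Big), $$
where $W_Q$, $W_K$, $W_V$ are the query, key, and value matrices, $\phi$ is the activation of the fully connected layer, and $v$ collects its output weights; the coupling of the softmax scores with the value and output weights is exactly the interdependence that complicates the gradient descent analysis.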

Optimization and Generalization of Gradient Descent for Shallow ReLU Networks

Yunwen Lei
The University of Hong Kong
People's Rep of China
Co-Author(s):    Puyu Wang, Yiming Ying, Ding-Xuan Zhou
Abstract:
Understanding the generalization and optimization of neural networks is a longstanding problem in modern learning theory. Prior analyses often lead to risk bounds of order $1/\sqrt{n}$ for ReLU networks, where $n$ is the sample size. In this talk, we present a general optimization and generalization analysis for gradient descent applied to shallow ReLU networks. We develop convergence rates of order $1/T$ for gradient descent with $T$ iterations, and show that the gradient descent iterates remain inside local balls around either an initialization point or a reference point. We also develop improved Rademacher complexity estimates by exploiting the activation pattern of the ReLU function within these local balls.
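A standard instance of this setting (the precise scaling and loss in the talk may differ) is the one-hidden-layer network and empirical risk
$$ f_W(x) \;=\; \frac{1}{\sqrt{m}}\sum_{k=1}^{m} a_k\,\mathrm{ReLU}(\langle w_k,x\rangle), \qquad \widehat{L}_S(W) \;=\; \frac{1}{n}\sum_{i=1}^{n}\ell\big(f_W(x_i),y_i\big), $$
trained by gradient descent $W_{t+1}=W_t-\eta\,\nabla_W\widehat{L}_S(W_t)$ for $t=0,\dots,T-1$; the $1/T$ rate concerns these iterates, and the Rademacher complexity estimates apply to networks whose weights stay in a local ball $\{W:\|W-W_0\|\le R\}$ around the initialization $W_0$ (or a reference point).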

Faster Convergence and Acceleration for Diffusion-Based Generative Models

Gen Li
The Chinese University of Hong Kong
Hong Kong
Co-Author(s):    
Abstract:
Diffusion models, which generate new data instances by learning to reverse a Markov diffusion process from noise, have become a cornerstone of contemporary generative modeling. While their practical power is now widely recognized, the theoretical underpinnings of mainstream samplers remain underdeveloped. Moreover, despite the recent surge of interest in accelerating diffusion-based samplers, convergence theory for these acceleration techniques remains limited. In this talk, I will introduce a new suite of non-asymptotic results aimed at better understanding popular samplers such as DDPM and DDIM in discrete time, offering significantly improved convergence guarantees over previous work. Our theory accommodates $L^2$-accurate score estimates and does not require log-concavity or smoothness of the target distribution. Building on these insights, we propose training-free algorithms that provably accelerate diffusion-based samplers, leveraging higher-order approximation ideas similar to those used in high-order ODE solvers such as DPM-Solver. Our acceleration algorithms achieve state-of-the-art sample quality compared with existing methods.
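For reference, the discrete-time recursions behind these samplers take the familiar form (standard DDPM notation, included for orientation rather than taken from the new results): the forward process corrupts data as
$$ x_t \;=\; \sqrt{\bar\alpha_t}\,x_0 \;+\; \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon\sim\mathcal N(0,I_d), \qquad \bar\alpha_t=\prod_{s=1}^{t}\alpha_s, $$
and a DDPM-type sampler reverses it with an estimated score $s_\theta$,
$$ x_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}\Big(x_t+(1-\alpha_t)\,s_\theta(x_t,t)\Big)+\sigma_t z_t, \qquad z_t\sim\mathcal N(0,I_d); $$
the convergence question is how quickly the law of the final sample approaches the target when $s_\theta$ is only $L^2$-accurate, and the acceleration schemes replace this first-order update with higher-order corrections in the spirit of high-order ODE solvers.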

Exploiting Low-dimensional Data Structures by Deep Neural Networks with Applications in Operator Learning

Hao Liu
Hong Kong Baptist University
Hong Kong
Co-Author(s):    
Abstract:
Deep neural networks have demonstrated great success in many applications, especially for problems with high-dimensional data sets. Despite this, most existing statistical theories are cursed by the data dimension and cannot explain such success. To bridge the gap between theory and practice, we exploit the low-dimensional structures of data sets and establish theoretical guarantees with fast rates that are cursed only by the intrinsic dimension of the data. Autoencoders are a powerful tool for exploring low-dimensional data structures. In our work, we analyze the approximation and generalization errors of autoencoders and their application to operator learning. Our results provide fast rates depending on the intrinsic dimension of data sets and show that deep neural networks are adaptive to the low-dimensional structures of data sets.
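Schematically (a generic formulation; the exact exponents depend on the regularity assumptions in the talk), an autoencoder pairs an encoder $E:\mathbb{R}^D\to\mathbb{R}^d$ with a decoder $G:\mathbb{R}^d\to\mathbb{R}^D$ and is trained by minimizing the empirical reconstruction loss
$$ \min_{E,G}\; \frac{1}{n}\sum_{i=1}^{n}\big\|G\big(E(x_i)\big)-x_i\big\|^2; $$
when the data concentrate near a $d$-dimensional manifold embedded in $\mathbb{R}^D$ with $d\ll D$, rates of the typical nonparametric form $n^{-\frac{2\alpha}{2\alpha+d}}$ for $\alpha$-smooth targets illustrate what a rate cursed only by the intrinsic dimension looks like: the exponent involves $d$ rather than the ambient dimension $D$.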

Overcoming High-Frequency Challenges: From Shallow to Multi-layer Neural Networks

Shijun Zhang
The Hong Kong Polytechnic University
Hong Kong
Co-Author(s):    Hongkai Zhao, Yimin Zhong, Haomin Zhou
Abstract:
This talk explores the limitations of shallow neural networks in handling high-frequency functions and presents a solution through a novel multi-layer, multi-component neural network (MMNN) architecture. We show how shallow networks act as low-pass filters, struggling with high-frequency components due to machine precision and slow learning dynamics. The MMNN architecture addresses these challenges by efficiently decomposing complex functions, significantly improving accuracy and reducing computational costs. Numerical experiments demonstrate the effectiveness of this approach in capturing fine details in oscillatory functions.
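A classical illustration of the frequency-versus-depth phenomenon (a Telgarsky-type composition argument offered for intuition, not the MMNN construction itself): the tent map on $[0,1]$,
$$ T(x) \;=\; 2\,\mathrm{ReLU}(x) \;-\; 4\,\mathrm{ReLU}\big(x-\tfrac12\big), $$
is realized by two ReLU neurons, and its $k$-fold composition $T^{\circ k}$ oscillates on the order of $2^{k}$ times on $[0,1]$; a deep composition thus captures such rapid oscillation with a number of neurons proportional to $k$, whereas a shallow ReLU network needs width comparable to the number of oscillations to represent the same profile.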