Special Session 73: Data-driven methods in dynamical systems

Global Convergence of Gradient Descent for Multi-Layer ResNets with Homogeneous Activation Functions in the Mean-Field Regime

Shi Chen
University of Wisconsin-Madison
USA
Co-Author(s): Zhiyan Ding, Qin Li, Stephen J. Wright
Abstract:
Finding the optimal configuration of parameters in ResNet is a nonconvex minimization problem, but first-order methods often find the global optimum when the network is overparameterized and the training algorithm is run for sufficiently many iterations. We study this phenomenon in the mean-field regime, where the network can be described approximately by an ordinary differential equation (ODE) and the training process of ResNet becomes a gradient-flow partial differential equation (PDE). Under the condition that the activation function is $2$-homogeneous or partially $1$-homogeneous, we show that this gradient-flow PDE converges to the global minimum. This result suggests that when the ResNet is sufficiently large, first-order optimization methods can likewise find global minimizers that fit the training data. Further, by controlling the generalization error, we prove that the gradient-flow PDE is stable with respect to perturbations of its cost function. This result implies that a finite but sufficiently large dataset drawn from the underlying data distribution suffices to exhibit the properties of the continuous limit. We give lower bounds on the depth and width of the network for the gradient-flow approximation to hold, and we also lower-bound the size of the training dataset needed to attain an accurate approximation.
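
To fix notation, a schematic version of this mean-field picture reads as follows; the scalings and symbols below are illustrative choices consistent with the mean-field literature, not quoted from the abstract:

\begin{align*}
  z_{l+1} &= z_l + \frac{1}{LM}\sum_{m=1}^{M} f\bigl(z_l,\theta_{l,m}\bigr), && l=0,\dots,L-1,\\
  \frac{\mathrm{d}z(t)}{\mathrm{d}t} &= \int f\bigl(z(t),\theta\bigr)\,\rho(\mathrm{d}\theta,t), && t\in[0,1],\\
  \partial_s \rho_s &= \nabla_\theta\cdot\Bigl(\rho_s\,\nabla_\theta\,\frac{\delta E}{\delta\rho}[\rho_s]\Bigr).
\end{align*}

Here the first line is the update of an $L$-layer ResNet with $M$ residual neurons per layer, the second is its continuous-depth limit (the ODE) driven by the parameter distribution $\rho(\cdot,t)$, and the third is the gradient flow of the cost $E$ over $\rho$ in training time $s$ (the gradient-flow PDE). For example, a residual block $f(z,\theta)=a\,\sigma(w^{\top}z)$ with $\theta=(a,w)$ is $2$-homogeneous in $\theta$ whenever $\sigma$ is positively $1$-homogeneous, e.g. ReLU.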
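
As a concrete illustration of the finite-size object whose $L,M\to\infty$ limit is sketched above, the following minimal JAX script trains a depth-$L$, width-$M$ ResNet with the $1/(LM)$ scaling by plain gradient descent; all names, hyperparameters, and the squared loss are illustrative assumptions, not taken from the abstract.

import jax
import jax.numpy as jnp

d, L, M, lr, steps = 2, 16, 64, 1.0, 2000    # data dim, depth, neurons/layer

key = jax.random.PRNGKey(0)
kx, ka, kw = jax.random.split(key, 3)
X = jax.random.normal(kx, (100, d))          # toy inputs
y = jnp.sin(X[:, 0])                         # toy regression targets

# One residual neuron is theta = (a, w); f(z, theta) = a * relu(w^T z) is
# 2-homogeneous in theta, since relu is positively 1-homogeneous.
params = (jax.random.normal(ka, (L, M, d)) / jnp.sqrt(M),   # output weights a
          jax.random.normal(kw, (L, M, d)))                 # input weights w

def forward(params, X):
    A, W = params
    def block(Z, aw):
        a, w = aw
        # discrete update z_{l+1} = z_l + (1/(LM)) sum_m f(z_l, theta_{l,m})
        return Z + (jnp.maximum(Z @ w.T, 0.0) @ a) / (L * M), None
    Z, _ = jax.lax.scan(block, X, (A, W))
    return Z

def loss(params, X, y):
    return jnp.mean((forward(params, X)[:, 0] - y) ** 2)

# Plain gradient descent on all neurons: a particle discretization of the
# gradient-flow PDE on the parameter distribution.
grad_fn = jax.jit(jax.grad(loss))
for _ in range(steps):
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g,
                                    params, grad_fn(params, X, y))

Increasing $L$ and $M$ moves this discrete training dynamics toward the gradient-flow PDE regime analyzed in the talk.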