Special Session 185: Multiscale Analysis: Geometry and Evolution Problems (mSPACE)

Concentration phenomena of self-attention dynamics
Leon Bungert
University of Wuerzburg
Germany
Co-Author(s): Albert Alcalde, Konstantin Riedl, Tim Roith
Abstract:
In this talk I will discuss concentration phenomena of self-attention transformers in the regimes of infinitely many layers and tokens. The dynamics are described by the Fokker--Planck equation \begin{align}\label{eq:Fokker-Planck} \partial_t\rho_t^\beta(x) = -\operatorname{div}\Big(\rho_t^\beta(x)P_{x}V\mathsf{m}_{\beta}[\rho_t^\beta](x)\Big),\qquad (t,x)\in[0,T]\times\mathbb{S}^{d-1}, \end{align} where $\mathbb{S}^{d-1}:=\{x\in\mathbb{R}^d\,:\,|x|=1\}$ is the unit sphere in $\mathbb{R}^d$, $T>0$ is a time horizon, $P_x:\mathbb{R}^d\to\mathbb{R}^{d}$, $y\mapsto y-\langle x,y\rangle x$ is the orthogonal projection onto the tangent space $T_x\mathbb{S}^{d-1}$, and \begin{align}\label{eq:consensus-point} \mathsf{m}_\beta[\rho_t^\beta](x) := \frac{\int_{\mathbb{S}^{d-1}}e^{\beta\langle By,x\rangle}y\,\mathrm{d}\rho_t^\beta(y)}{\int_{\mathbb{S}^{d-1}}e^{\beta\langle By,x\rangle}\,\mathrm{d}\rho_t^\beta(y)} \end{align} involves the inverse temperature parameter $\beta>0$. The matrices $V,B\in\mathbb{R}^{d\times d}$ contain learned parameters and are assumed to be constant in time. It is known that, for $\beta\to\infty$, solutions of \eqref{eq:Fokker-Planck} converge to solutions of a linear PDE, the solutions of which concentrate as $T\to\infty$ on the dominating eigendirections of the matrix $VB^\top$. In our work we quantify these results by exploiting a striking similarity between \eqref{eq:Fokker-Planck} and the so-called polarized consensus-based optimization (CBO) method for global optimization. Using a CBO-inspired analysis, we give explicit bounds for the Wasserstein-2 distance between the solution of \eqref{eq:Fokker-Planck} and a suitable target measure. The proof relies on an application of a quantitative Laplace principle to \eqref{eq:consensus-point} as well as a Lyapunov-type analysis for the time asymptotics. Our result sheds more light on the internal dynamics of self-attention transformers and might help identify reduced effective models.
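As a minimal sketch of how \eqref{eq:Fokker-Planck} relates to finitely many tokens, assume that $\rho_t^\beta$ is an empirical measure $\frac{1}{n}\sum_{i=1}^n\delta_{x_i(t)}$ of $n$ tokens $x_1(t),\dots,x_n(t)\in\mathbb{S}^{d-1}$; the symbols $n$ and $x_i$ are notation introduced here for illustration only. The integrals in \eqref{eq:consensus-point} then become finite sums, and \eqref{eq:Fokker-Planck} formally reduces to the interacting particle system \begin{align*} \dot{x}_i(t) = P_{x_i(t)}V\,\frac{\sum_{j=1}^n e^{\beta\langle Bx_j(t),x_i(t)\rangle}x_j(t)}{\sum_{j=1}^n e^{\beta\langle Bx_j(t),x_i(t)\rangle}},\qquad i=1,\dots,n, \end{align*} in which the normalized exponential weights play the role of softmax attention scores and the time variable $t$ plays the role of layer depth.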