| Abstract: |
| In this talk I will discuss concentration phenomena in self-attention transformers in the regime of infinitely many layers and tokens.
The dynamics are described by the Fokker--Planck equation
\begin{align}\label{eq:Fokker-Planck}
\partial_t\rho_t^\beta(x) = -\operatorname{div}\Big(\rho_t^\beta(x)P_{x}V\mathsf{m}_{\beta}[\rho_t^\beta](x)\Big),\qquad (t,x)\in[0,T]\times\mathbb{S}^{d-1},
\end{align}
where $\mathbb{S}^{d-1}:=\{x\in\mathbb{R}^d\,:\,|x|=1\}$ is the unit sphere in $\mathbb{R}^d$, $T>0$ is a time horizon, $P_x:\mathbb{R}^d\to\mathbb{R}^{d}$, $y\mapsto y-\langle x,y\rangle x$ is the orthogonal projection onto the tangent space $T_x\mathbb{S}^{d-1}$, and
\begin{align}\label{eq:consensus-point}
\mathsf{m}_\beta[\rho_t^\beta](x) :=
\frac{\int_{\mathbb{S}^{d-1}}e^{\beta\langle By,x\rangle}y\,\mathrm{d}\rho_t^\beta(y)}{\int_{\mathbb{S}^{d-1}}e^{\beta\langle By,x\rangle}\,\mathrm{d}\rho_t^\beta(y)}
\end{align}
involves the inverse temperature parameter $\beta>0$.
The matrices $V,B\in\mathbb{R}^{d\times d}$ contain learned parameters and are assumed to be constant in time.
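As an illustrative sketch (the token count $N$ and the trajectories $x_i(t)$ are notation introduced here, not part of the statement above), if $\rho_t^\beta=\frac{1}{N}\sum_{j=1}^N\delta_{x_j(t)}$ is the empirical measure of $N$ tokens, then \eqref{eq:Fokker-Planck} formally reduces to the interacting particle system
\begin{align*}
\dot x_i(t) = P_{x_i(t)}V\,\frac{\sum_{j=1}^N e^{\beta\langle Bx_j(t),x_i(t)\rangle}x_j(t)}{\sum_{j=1}^N e^{\beta\langle Bx_j(t),x_i(t)\rangle}},\qquad i=1,\dots,N,
\end{align*}
in which the softmax weights of self-attention appear explicitly.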
It is known that, as $\beta\to\infty$, solutions of \eqref{eq:Fokker-Planck} converge to solutions of a linear PDE whose solutions concentrate, as $T\to\infty$, on the dominant eigendirections of the matrix $VB^\top$.
In our work we quantify these results by exploiting a striking similarity between \eqref{eq:Fokker-Planck} and the so-called polarized consensus-based optimization (CBO) method for global optimization.
Using a CBO-inspired analysis, we give explicit bounds for the Wasserstein-2 distance between the solution of \eqref{eq:Fokker-Planck} and a suitable target measure.
The proof relies on an application of a quantitative Laplace principle to \eqref{eq:consensus-point} as well as a Lyapunov-type analysis for the time asymptotics.
Our result sheds more light on the internal dynamics of self-attention transformers and might help identify reduced effective models. |
|