Abstract:
The Transformer architecture underlying modern large language models, and specifically its self-attention component, was recently recast as the multi-agent system
\begin{equation}
\dot{x}_k(t) = \operatorname{P}^{\perp}_{x_k(t)} \left( \sum_{j=1}^n \exp \left(\langle Qx_k(t), K x_j(t) \rangle \right) V x_j(t) \right), \quad k=1, \dots, n,
\end{equation}
where $x_k(t) \in {\mathbb S}^{d-1}$ is the state of the $k$-th token at layer (time) $t$, $Q,K,V \in \mathbb{R}^{d\times d}$ are the trained query, key, and value matrices, and $\operatorname{P}^{\perp}_{x_k(t)}$ is the orthogonal projection onto the tangent space of ${\mathbb S}^{d-1}$ at $x_k(t)$. The output of the model is determined by the long-time state of the system, making its asymptotic behavior a central object of study. The goal of this talk is to illustrate how methods from collective dynamics can help analyze large language models.
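For concreteness, at a point $x \in {\mathbb S}^{d-1}$ this tangent-space projection acts on a vector $y \in \mathbb{R}^d$ as
\begin{equation}
\operatorname{P}^{\perp}_{x}(y) = y - \langle y, x \rangle x,
\end{equation}
which removes the radial component of $y$ and thus keeps each token on the sphere along the evolution.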
In this talk I present a dynamical systems perspective on a 2D linearized transformer architecture, in which the evolution of tokens across attention layers can be reformulated as a Kuramoto-type interacting particle system. Using the Ott-Antonsen ansatz, the high-dimensional dynamics admits a low-dimensional reduction describing the evolution in a reduced-vocabulary regime. I will focus on the stability of the reduced dynamics and on how this behavior persists beyond the Ott-Antonsen manifold, yielding stability of the full particle system.
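As a rough illustration of the Kuramoto-type reformulation (a sketch under simplifying assumptions, e.g. $Q = K = V = \mathrm{Id}$ and uniform normalization, which need not match the exact setting of the talk), writing each 2D token as $x_k = (\cos\theta_k, \sin\theta_k)$ and linearizing the attention weights gives angular dynamics of the form
\begin{equation}
\dot{\theta}_k(t) = \frac{1}{n} \sum_{j=1}^{n} \sin\bigl(\theta_j(t) - \theta_k(t)\bigr), \qquad z(t) = \frac{1}{n} \sum_{j=1}^{n} e^{i\theta_j(t)},
\end{equation}
the classical Kuramoto model with identical frequencies, whose order parameter $z(t)$ (with $|z|=1$ at full synchronization) is the natural low-dimensional quantity tracked by Ott-Antonsen-type reductions.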