Special Session 22: Models of emergence and collective dynamics

Self-Attention as a Multi-Agent System: a Dynamical Perspective on Transformers
Jan Peszek
University of Warsaw
Poland
Co-Author(s):    
Abstract:
The transformer, the architecture underlying modern large language models, and in particular its self-attention component, was recently recast as the multi-agent system
\begin{equation}
\dot{x}_k(t) = \operatorname{P}^{\perp}_{x_k(t)} \left( \sum_{j=1}^n \exp \left( \langle Q x_k(t), K x_j(t) \rangle \right) V x_j(t) \right), \quad k = 1, \dots, n,
\end{equation}
where $x_k(t) \in {\mathbb S}^{d-1}$ represents the state of the $k$-th token after the $t$-th attention layer, $Q, K, V \in \mathbb{R}^{d\times d}$ are the trained query, key, and value matrices, and $\operatorname{P}^{\perp}_{x_k(t)}$ is the orthogonal projection onto the tangent space of ${\mathbb S}^{d-1}$ at $x_k(t)$. The output of the model is determined by the long-time state of the system, making its asymptotic behavior a central object of study. The goal of this talk is to illustrate how methods from collective dynamics can help analyze large language models. I present a dynamical systems perspective on a 2D linearized transformer architecture, in which the evolution of tokens across attention layers can be reformulated as a Kuramoto-type interacting particle system. Using the Ott-Antonsen ansatz, the high-dimensional dynamics admits a low-dimensional reduction describing the evolution in a reduced-vocabulary regime. I will focus on the stability of the reduced dynamics and on how this behavior persists beyond the Ott-Antonsen ansatz, resulting in the stability of the full particle system.
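As a minimal illustration of the system above (not part of the talk), the dynamics can be integrated numerically with a projected Euler scheme: at each step, the attention-weighted drift is projected onto the tangent space of the sphere and the result is retracted back onto ${\mathbb S}^{d-1}$. The function name, step size, and random initialization below are assumptions of this sketch, not part of the source.

```python
import numpy as np

def attention_dynamics_step(X, Q, K, V, dt):
    """One projected-Euler step of the self-attention dynamics on the sphere.

    X : (n, d) array whose rows x_k lie on the unit sphere S^{d-1}
    Q, K, V : (d, d) query, key, and value matrices
    """
    # logits[k, j] = <Q x_k, K x_j>
    logits = (X @ Q.T) @ (X @ K.T).T
    W = np.exp(logits)                       # unnormalized attention weights
    F = W @ (X @ V.T)                        # F[k] = sum_j exp(<Q x_k, K x_j>) V x_j
    # tangent-space projection at each x_k: P^perp_x(v) = v - <v, x> x
    F_tan = F - np.sum(F * X, axis=1, keepdims=True) * X
    X_new = X + dt * F_tan
    # retract back onto the sphere after the Euler step
    return X_new / np.linalg.norm(X_new, axis=1, keepdims=True)
```

With $Q = K = V = I$, iterating this step drives the tokens toward clustering on the sphere, which is the kind of asymptotic behavior the abstract refers to.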