Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under …