Dissecting the Interplay of Attention Paths in a Statistical Mechanics
Theory of Transformers
Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under …
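To make the object of study concrete, the following is a minimal sketch of a single multi-head self-attention layer of the kind the abstract refers to. It is a plain NumPy forward pass with hypothetical placeholder weights and dimensions, not the paper's exact architecture or parameterization:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """One multi-head self-attention layer (forward pass only).

    X:          (T, d)       input sequence of T tokens
    Wq, Wk, Wv: (H, d, d_h)  per-head query/key/value projections
    Wo:         (H*d_h, d)   output projection
    All weights here are illustrative placeholders, not taken from the paper.
    """
    H, d, d_h = Wq.shape
    heads = []
    for h in range(H):
        Q = X @ Wq[h]                             # (T, d_h)
        K = X @ Wk[h]                             # (T, d_h)
        V = X @ Wv[h]                             # (T, d_h)
        scores = Q @ K.T / np.sqrt(d_h)           # (T, T) attention logits
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)     # row-wise softmax
        heads.append(A @ V)                       # (T, d_h)
    # concatenate heads and project back to model dimension
    return np.concatenate(heads, axis=-1) @ Wo    # (T, d)

# tiny usage example with random placeholder weights
rng = np.random.default_rng(0)
T, d, H, d_h = 4, 8, 2, 4
X = rng.normal(size=(T, d))
W = lambda *s: rng.normal(size=s) / np.sqrt(s[-2])
out = multi_head_self_attention(
    X, W(H, d, d_h), W(H, d, d_h), W(H, d, d_h), W(H * d_h, d)
)
print(out.shape)  # (4, 8)
```

Stacking several such layers yields a deep multi-head self-attention network; the paper's theory concerns the Bayesian posterior over the weights of such a model, not this particular forward pass.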