Implicit Bias and Fast Convergence Rates for Self-attention

Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance. Towards developing the fundamental optimization principles of self-attention, we investigate the implicit bias of gradient descent (GD) in training a self-attention layer with a fixed linear decoder for binary classification. Drawing inspiration from the …
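To make the setup concrete, here is a minimal sketch of the kind of training problem described above: a single softmax self-attention layer parameterized by a trainable key-query matrix, followed by a fixed linear decoder, optimized with plain gradient descent on the logistic loss for binary classification. The dimensions, pooling choice, and initialization are illustrative assumptions, not the paper's exact model.

```python
import torch

torch.manual_seed(0)

# Toy data: n sequences of T tokens in R^d, labels in {-1, +1}.
n, T, d = 32, 8, 4
X = torch.randn(n, T, d)
y = torch.where(torch.randn(n) > 0, 1.0, -1.0)

# Single self-attention layer with a combined key-query matrix W (trained)
# and a FIXED linear decoder u (not trained).
W = torch.zeros(d, d, requires_grad=True)
u = torch.randn(d)

def model(X, W):
    # Softmax attention scores over tokens, then attend and decode.
    scores = torch.softmax(X @ W @ X.transpose(1, 2), dim=-1)  # (n, T, T)
    attended = scores @ X                                       # (n, T, d)
    # Mean-pool over tokens, apply the fixed decoder to get a scalar logit.
    return (attended @ u).mean(dim=1)                           # (n,)

eta = 0.1
for step in range(200):
    logits = model(X, W)
    loss = torch.log1p(torch.exp(-y * logits)).mean()  # logistic loss
    loss.backward()
    with torch.no_grad():
        W -= eta * W.grad   # plain gradient descent step on W only
        W.grad.zero_()
```

In this regime only the attention weights W move during training, which is what makes it possible to ask how GD implicitly selects among the many W that separate the data.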