On the Optimization and Generalization of Two-layer Transformers with
Sign Gradient Descent
The Adam optimizer is widely used for training transformers in practice, which makes understanding its underlying optimization mechanisms an important problem. However, due to Adam's complexity, a theoretical analysis of how it optimizes transformers remains challenging. Fortunately, Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. …
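For context, SignGD updates each parameter using only the sign of its gradient; the sketch below states the update in generic notation (learning rate $\eta$, loss $L$, parameters $w_t$), which is not necessarily the paper's own. Adam with $\beta_1 = \beta_2 = 0$ and $\epsilon = 0$ reduces to exactly this update for nonzero gradients, since $m_t/\sqrt{v_t} = g_t/|g_t| = \operatorname{sign}(g_t)$, which is the usual sense in which SignGD acts as a surrogate for Adam.

% SignGD update in generic notation (assumed, not taken from the paper):
% every coordinate moves a fixed step eta in the direction opposite to the
% sign of its gradient, discarding the gradient's magnitude.
\[
  w_{t+1} \;=\; w_t \;-\; \eta\,\operatorname{sign}\!\bigl(\nabla_w L(w_t)\bigr),
  \qquad
  \operatorname{sign}(x) =
  \begin{cases}
    +1 & x > 0,\\
    \phantom{+}0 & x = 0,\\
    -1 & x < 0.
  \end{cases}
\]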