Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

We present a theoretical explanation of the "grokking" phenomenon, in which a model generalizes long after overfitting, for the originally studied problem of modular addition. First, we show that early in gradient descent, when the "kernel regime" approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees …