Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition
We present a theoretical explanation of the ``grokking'' phenomenon, where a model generalizes long after overfitting, for the originally studied problem of modular addition. First, we show that early in gradient descent, when the ``kernel regime'' approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees …