1:30 pm MCP 201
A solvable model of the grokking transition in neural networks.
Deep neural networks are effective at approximating highly complex real-world data because they successfully extract (or learn) the relevant features of the data and use them to make predictions. This feature learning is a mysterious, highly non-linear process: it is hard to predict when and how it happens on real-world tasks, in part because we do not know the relevant features ourselves (which is what motivates the use of machine learning in the first place). The only solvable, "mean-field" approximation to the dynamics of neural networks neglects feature learning altogether, while the corrections that allow small amounts of feature learning are very hard to compute.
In this talk, after giving a thorough introduction, I will consider data generated by a simple deterministic, yet non-linear, rule, such as various modular arithmetic operations. In this case, it is possible to find the necessary features analytically. I will show that the solutions found by optimization are approximately the same as the analytic ones. Curiously, the relevant features are all learnt at the same time, leading to a spontaneous jump in the accuracy of the network from 0% to 100% after many epochs of training.
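As a rough illustration of the kind of data described above, here is a minimal sketch of a modular addition dataset. The choice of addition as the operation, the one-hot input encoding, and the function names `modular_addition_dataset` and `split_train_test` are assumptions made for this example; the abstract does not specify the setup used in the talk.

```python
import numpy as np


def modular_addition_dataset(p):
    """Return every pair (a, b) with 0 <= a, b < p and its label (a + b) mod p.

    Inputs are the concatenated one-hot encodings of a and b (an assumed
    encoding, chosen only for illustration).
    """
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    a, b = a.ravel(), b.ravel()
    labels = (a + b) % p                    # deterministic, non-linear rule
    eye = np.eye(p)
    inputs = np.concatenate([eye[a], eye[b]], axis=1)  # shape (p*p, 2*p)
    return inputs, labels


def split_train_test(inputs, labels, train_fraction, seed=0):
    """Randomly hold out part of the p*p examples for measuring test accuracy."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    n_train = int(train_fraction * len(labels))
    train, test = order[:n_train], order[n_train:]
    return (inputs[train], labels[train]), (inputs[test], labels[test])
```

Because the rule is deterministic and the full table has only p*p entries, the held-out examples give an exact measure of whether a network has learnt the rule rather than memorized the training pairs.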