Lessons learnt when doing theory

I am deeply grateful to my advisors, Surbhi and Enric, as well as to other faculty members at Penn, including Jason and Weijie, for their invaluable guidance and insights on doing theory research.


Several classical papers I plan to go through carefully in Spring 2026:

Gradient Descent Provably Optimizes Over-parameterized Neural Networks. Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh.

Idea: In the kernel regime, the Gram matrix stays close to its value at initialization throughout training; the convergence guarantee is derived from this fact. Overparametrization and random initialization are crucial.
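
Since I want to remember the shape of the argument, here is a rough sketch of the Gram matrix and the resulting guarantee in my own notation; the constants and precise conditions in the paper are more careful than this.

```latex
% Rough sketch of the kernel-regime statement, in my own notation.
% Two-layer ReLU network: f(x; W, a) = (1/sqrt(m)) * sum_r a_r * ReLU(w_r^T x),
% with predictions u_i(t) = f(x_i; W(t), a) on training inputs x_1, ..., x_n.
\[
  H_{ij}(t) \;=\; \frac{1}{m}\, x_i^\top x_j
  \sum_{r=1}^{m} \mathbf{1}\!\left\{ w_r(t)^\top x_i \ge 0,\; w_r(t)^\top x_j \ge 0 \right\}
\]
% For sufficiently wide networks with random initialization, H(t) stays close to H(0)
% along the gradient-flow trajectory, so its least eigenvalue remains bounded below by
% some \lambda_0 > 0, which yields linear convergence of the squared training loss:
\[
  \bigl\| y - u(t) \bigr\|_2^2 \;\le\; e^{-\lambda_0 t}\, \bigl\| y - u(0) \bigr\|_2^2 .
\]
```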

On Lazy Training in Differentiable Programming. Lenaic Chizat, Edouard Oyallon, Francis Bach.

Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. Song Mei, Theodor Misiakiewicz, Andrea Montanari.

Idea: Grönwall's inequality (bounds between the PDE, nonlinear dynamics, particle dynamics, GD, and SGD). TO-DO: Understand the Wasserstein gradient flow in Appendices F and G.
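
Because I keep forgetting the exact form, here is the standard statement of Grönwall's inequality that I believe the coupling argument relies on (my paraphrase, not the paper's exact lemma).

```latex
% Differential form of Gronwall's inequality, stated for scalars:
% if u is differentiable on [0, T], beta is continuous, and u'(t) <= beta(t) u(t), then
\[
  u(t) \;\le\; u(0)\, \exp\!\left( \int_0^t \beta(s)\, \mathrm{d}s \right),
  \qquad 0 \le t \le T .
\]
% The perturbed version with a constant error term is the one I associate with trajectory
% comparisons: if u'(t) <= C u(t) + \varepsilon with C > 0, then
\[
  u(t) \;\le\; u(0)\, e^{C t} \;+\; \frac{\varepsilon}{C}\left( e^{C t} - 1 \right).
\]
% Taking u(t) to be the distance between two of the dynamics (e.g., particle dynamics
% vs. GD), this turns a pointwise drift bound into a bound over the whole time horizon.
```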

Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification. Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford.

Reading…

Max-Margin Token Selection in Attention Mechanism. Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak.