Duality, Weight Decay, and Metrized Deep Learning

Newhouse, Laker

Author(s)

Newhouse, Laker

DownloadThesis PDF (2.274Mb)

Advisor

Isola, Phillip

Terms of use

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc/4.0/

Metadata

Show full item record

Abstract

The Muon optimizer has shown convincing evidence that it is faster and more scalable than AdamW for deep learning training, setting speed records for training NanoGPT and scaling up to models with 16B parameters. The theory that led to Muon is called metrized deep learning, a method that suggests assigning norms to each part of a neural network. Chapter 1 begins with an accessible explanation of metrized deep learning, including one of its recurring tools: odd polynomial iterations that act directly on singular values. Chapter 2 reviews duality, a way to modify the gradient that seeks to decrease the loss the most while disturbing the model the least. Pedagogically, duality links four popular optimizers—SGD, Adam, Shampoo, and Muon—under a common framework, steepest descent under a norm. Practically, experiments suggest that duality-based optimizers train faster than AdamW and transfer learning rate across width. Chapter 3 develops tools to enforce weight norm constraints during training, conferring provable and upfront Lipschitz guarantees for transformers. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard weight regularization methods—weight decay and spectral normalization—allowing models to reach equal performance with a lower Lipschitz bound. Leveraging that Muon’s update has a fixed spectral norm, we co-design a weight constraint method called spectral cap that improves the Lipschitz vs. performance tradeoff for MLPs and 2M parameter transformers. Our 4-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 600-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^274. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and tanh logit softcapping.

Date issued

2025-05

URI

https://hdl.handle.net/1721.1/162956

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses