Show simple item record

dc.contributor.advisor: Isola, Phillip
dc.contributor.author: Newhouse, Laker
dc.date.accessioned: 2025-10-06T17:36:44Z
dc.date.available: 2025-10-06T17:36:44Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:03:08.568Z
dc.identifier.uri: https://hdl.handle.net/1721.1/162956
dc.description.abstract: The Muon optimizer has shown convincing evidence that it is faster and more scalable than AdamW for deep learning training, setting speed records for training NanoGPT and scaling up to models with 16B parameters. The theory that led to Muon is called metrized deep learning, a framework that assigns norms to each part of a neural network. Chapter 1 begins with an accessible explanation of metrized deep learning, including one of its recurring tools: odd polynomial iterations that act directly on singular values. Chapter 2 reviews duality, a way to modify the gradient that seeks to decrease the loss the most while disturbing the model the least. Pedagogically, duality links four popular optimizers—SGD, Adam, Shampoo, and Muon—under a common framework: steepest descent under a norm. Practically, experiments suggest that duality-based optimizers train faster than AdamW and transfer learning rates across model width. Chapter 3 develops tools to enforce weight norm constraints during training, conferring provable and upfront Lipschitz guarantees for transformers. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard weight regularization methods—weight decay and spectral normalization—allowing models to reach equal performance with a lower Lipschitz bound. Leveraging the fact that Muon's update has a fixed spectral norm, we co-design a weight constraint method called spectral cap that improves the Lipschitz vs. performance tradeoff for MLPs and 2M parameter transformers. Our 4-Lipschitz transformer on Shakespeare text reaches 60% validation accuracy. Scaling to 145M parameters, our 600-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^274. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and tanh logit softcapping.
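The "odd polynomial iterations that act directly on singular values" mentioned in the abstract can be illustrated with a short sketch. This is a hedged illustration, not the thesis's exact procedure: it uses a Newton-Schulz-style quintic with the coefficients popularized in public Muon implementations, which are an assumption here. Because the polynomial is odd in the matrix (only odd powers of X appear), each iteration leaves the singular vectors untouched and applies the scalar polynomial to every singular value, pushing them toward 1.

```python
import numpy as np

def orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G via an odd polynomial iteration.

    Each step computes X <- a*X + b*(X X^T)X + c*(X X^T)^2 X, which acts
    on singular values as the scalar map s -> a*s + b*s^3 + c*s^5.
    """
    # Normalize so all singular values lie in (0, 1], the basin where
    # the iteration drives them toward 1.
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm
    # Quintic coefficients from public Muon implementations (assumption).
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```

A quick check: the singular vectors of the output match those of the input, while all singular values land near 1, so the result approximates the nearest semi-orthogonal matrix without ever computing an SVD.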
dc.publisher: Massachusetts Institute of Technology
dc.rights: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://creativecommons.org/licenses/by-nc/4.0/
dc.title: Duality, Weight Decay, and Metrized Deep Learning
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

