DSpace@MIT (MIT Libraries)

Duality, Weight Decay, and Metrized Deep Learning

Author(s)
Newhouse, Laker
Download
Thesis PDF (2.274 MB)
Advisor
Isola, Phillip
Terms of use
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Copyright retained by author(s). https://creativecommons.org/licenses/by-nc/4.0/
Abstract
The Muon optimizer has shown convincing evidence that it is faster and more scalable than AdamW for deep learning training, setting speed records for training NanoGPT and scaling up to models with 16B parameters. The theory that led to Muon is called metrized deep learning, a method that suggests assigning norms to each part of a neural network. Chapter 1 begins with an accessible explanation of metrized deep learning, including one of its recurring tools: odd polynomial iterations that act directly on singular values. Chapter 2 reviews duality, a way to modify the gradient that seeks to decrease the loss the most while disturbing the model the least. Pedagogically, duality links four popular optimizers—SGD, Adam, Shampoo, and Muon—under a common framework, steepest descent under a norm. Practically, experiments suggest that duality-based optimizers train faster than AdamW and transfer learning rate across width. Chapter 3 develops tools to enforce weight norm constraints during training, conferring provable and upfront Lipschitz guarantees for transformers. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard weight regularization methods—weight decay and spectral normalization—allowing models to reach equal performance with a lower Lipschitz bound. Leveraging that Muon’s update has a fixed spectral norm, we co-design a weight constraint method called spectral cap that improves the Lipschitz vs. performance tradeoff for MLPs and 2M parameter transformers. Our 4-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 600-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^274. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and tanh logit softcapping.
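As an illustration of the "odd polynomial iterations that act directly on singular values" mentioned in the abstract, here is a minimal sketch in plain Python. It is not the thesis's implementation: it assumes the textbook cubic Newton-Schulz polynomial p(X) = 1.5 X - 0.5 X Xᵀ X, and the helper names (`matmul`, `transpose`, `orthogonalize`) and the example matrix are hypothetical. If X = U S Vᵀ, then X Xᵀ X = U S³ Vᵀ, so each step maps every singular value s to 1.5 s - 0.5 s³ while leaving U and V untouched. Repeating the step pushes singular values in (0, √3) toward 1, which yields an update with a fixed spectral norm.

```python
# Sketch only: textbook cubic Newton-Schulz iteration, assumed for illustration.
# Practical optimizers may use different, tuned polynomial coefficients.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=30):
    """Drive all nonzero singular values of g toward 1, keeping U and V fixed."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # inside the region where the cubic iteration converges.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        # Odd polynomial step: X <- 1.5 * X - 0.5 * X X^T X
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

# Hypothetical "gradient" with singular values 5 and 2.
grad = [[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]]
update = orthogonalize(grad)
gram = matmul(update, transpose(update))  # close to the 2x2 identity
```

Because the rows of `update` end up orthonormal, `gram` is approximately the identity, meaning both singular values have been mapped to 1 regardless of their original sizes.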
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/162956
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
