Making Sense of Training Large AI Models

Author(s)
Ahn, Kwangjun
Download: Thesis PDF (7.320 MB)
Advisor
Sra, Suvrit
Jadbabaie, Ali
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Today, one of the most impressive applications of optimization is the training of large AI models. Yet such models are currently trained with ad hoc heuristics at a very large computational cost, mainly due to a lack of understanding of their working mechanisms. In this thesis, we conduct a systematic study of large-model optimization, crucially informed by practical applications. The first part investigates two interesting phenomena in the optimization of Transformers, one of the most popular architectures for language modeling: we study how training Transformer-based models can give rise to remarkable properties such as in-context learning, and we further discuss the main challenges associated with Transformer training. The second part of this thesis focuses on understanding the Adam optimizer, one of the most popular algorithms for training large models. We offer a new view of Adam based on an online learning perspective that elucidates the importance of Adam's algorithmic components. Building on this perspective, we also prove that Adam achieves the optimal convergence rate in various non-convex optimization settings, both smooth and non-smooth. The third part of this thesis focuses on the unstable convergence phenomenon in training large models. We identify its main characteristics from first principles and discuss its causes and implications for learning. We then relate this phenomenon to popular flat-minima optimization algorithms and initiate a formal study of them by defining a notion of flat minima and analyzing the complexity of finding them.
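For context, the "algorithmic components" of Adam mentioned in the abstract are its momentum (first-moment) estimate and its adaptive per-coordinate step size (second-moment estimate), together with bias correction. The sketch below is a minimal single-parameter illustration of the standard Adam update of Kingma and Ba, not the analysis developed in the thesis; the function name `adam_step` and its default hyperparameters are illustrative assumptions.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a single scalar parameter.

    m: running average of gradients (momentum component)
    v: running average of squared gradients (adaptive scaling component)
    t: 1-based step counter, used for bias correction
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected momentum
    v_hat = v / (1 - beta2 ** t)                # bias-corrected scaling
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```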
Date issued
2024-09
URI
https://hdl.handle.net/1721.1/158484
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
