dc.contributor.advisor: Sra, Suvrit
dc.contributor.advisor: Jadbabaie, Ali
dc.contributor.author: Ahn, Kwangjun
dc.date.accessioned: 2025-03-12T16:55:02Z
dc.date.available: 2025-03-12T16:55:02Z
dc.date.issued: 2024-09
dc.date.submitted: 2025-03-04T18:28:45.909Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158484
dc.description.abstract: Today, one of the most impressive applications of optimization is the training of large AI models. Yet such models are currently trained with ad hoc heuristics at very large computational cost, mainly due to a lack of understanding of their working mechanisms. In this thesis, we conduct a systematic study of large-model optimization, crucially informed by practical applications. The first part examines two interesting phenomena in the optimization of Transformer-based models, one of the most popular architectures for language modeling. We investigate how training Transformer-based models can lead to remarkable properties such as in-context learning, and we further discuss the main challenges associated with Transformer training. The second part of this thesis focuses on understanding the Adam optimizer, one of the most popular algorithms for training large models. We offer a new view of Adam based on an online learning perspective that elucidates the importance of Adam's algorithmic components. Building on this perspective, we also prove that Adam achieves the optimal convergence rate in various non-convex optimization settings, both smooth and non-smooth. The third part of this thesis focuses on the unstable convergence phenomenon in training large models. We identify its main characteristics from first principles and discuss its causes and implications for learning. We then relate it to popular flat-minima optimization algorithms and initiate a formal study of them by defining a notion of flat minima and analyzing the complexity of finding them.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Making Sense of Training Large AI Models
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.orcid: https://orcid.org/0000-0001-5516-5775
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy

