dc.contributor.advisor: Sra, Suvrit
dc.contributor.advisor: Jadbabaie, Ali
dc.contributor.author: Ahn, Kwangjun
dc.date.accessioned: 2025-03-12T16:55:02Z
dc.date.available: 2025-03-12T16:55:02Z
dc.date.issued: 2024-09
dc.date.submitted: 2025-03-04T18:28:45.909Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158484
dc.description.abstract: Today, one of the most impressive applications of optimization is the training of large AI models. Yet such models are currently trained with ad hoc heuristics at very large computational cost, mainly due to a lack of understanding of their working mechanisms. In this thesis, we conduct a systematic study of large-model optimization, crucially informed by practical applications. The first part examines two interesting phenomena in the optimization of Transformer-based models, one of the most popular architectures for language modeling. We investigate how training Transformer-based models can lead to remarkable properties such as in-context learning, and we further discuss the main challenges associated with Transformer training. The second part of this thesis focuses on understanding the Adam optimizer, one of the most popular algorithms for training large models. We offer a new view of Adam based on an online learning perspective that elucidates the importance of Adam's algorithmic components. Building on this perspective, we also prove that Adam achieves the optimal convergence rate in various non-convex optimization settings, both smooth and non-smooth. The third part of this thesis focuses on the unstable convergence phenomenon in training large models. We identify its main characteristics from first principles and discuss its causes and implications for learning. We then relate it to popular flat-minima optimization algorithms and initiate a formal study of them by defining a notion of flat minima and analyzing the complexity of finding them.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Making Sense of Training Large AI Models
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.orcid: https://orcid.org/0000-0001-5516-5775
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy

