Making Sense of Training Large AI Models
Author(s)
Ahn, Kwangjun
Advisor
Sra, Suvrit
Jadbabaie, Ali
Abstract
Today, one of the most impressive applications of optimization is the training of large AI models. Currently, however, such models are trained with ad-hoc heuristics at very large computational cost, mainly due to a lack of understanding of their working mechanisms. In this thesis, we conduct a systematic study of large-model optimization, crucially informed by practical applications. The first part examines two phenomena in the optimization of Transformer-based models, one of the most popular architectures for language modeling: we investigate how training such models can give rise to remarkable properties such as in-context learning, and we discuss the main challenges associated with Transformer training. The second part focuses on understanding the Adam optimizer, one of the most popular algorithms for training large models. We offer a new view on Adam based on an online learning perspective that elucidates the importance of Adam’s algorithmic components. Building on this perspective, we also prove that Adam achieves the optimal convergence rate in various non-convex optimization settings, both smooth and non-smooth. The third part focuses on the unstable convergence phenomenon in training large models. We identify its main characteristics from first principles and discuss its causes and implications for learning. We then discuss its connection to popular flat-minima optimization algorithms, and initiate a formal study by defining a notion of flat minima and analyzing the complexity of finding them.
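For reference, the "algorithmic components" of Adam mentioned above are the two exponential moving averages and the coordinate-wise adaptive step. The sketch below is a minimal NumPy rendering of the standard Adam update (Kingma and Ba, 2015) with its common default hyperparameters; it is not the thesis's online-learning reformulation, only an illustration of the update the thesis analyzes.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the standard Adam update (illustrative sketch)."""
    # First moment: exponential moving average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment estimates (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Coordinate-wise adaptive update of the parameters.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```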
Date issued
2024-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology