Making Sense of Training Large AI Models
Author(s)
Ahn, Kwangjun
Advisor
Sra, Suvrit
Jadbabaie, Ali
Abstract
Today, one of the most impressive applications of optimization is the training of large AI models. Currently, however, such models are trained with ad-hoc heuristics at very large computational cost, mainly due to a lack of understanding of their working mechanisms. In this thesis, we conduct a systematic study of large-model optimization, crucially informed by practical applications. The first part examines two phenomena in the optimization of Transformer-based models, one of the most popular architectures for language modeling: we investigate how training such models can give rise to remarkable properties such as in-context learning, and we discuss the main challenges associated with Transformer training. The second part focuses on understanding the Adam optimizer, one of the most popular algorithms for training large models. We offer a new view on Adam based on an online learning perspective that elucidates the importance of Adam’s algorithmic components. Building on this perspective, we also prove that Adam achieves the optimal convergence rate in various non-convex optimization settings, both smooth and non-smooth. The third part focuses on the unstable convergence phenomenon in training large models. We identify its main characteristics from first principles and discuss its causes and implications for learning. We then discuss its connection to popular flat-minima optimization algorithms, and initiate a formal study by defining a notion of flat minima and analyzing the complexity of finding them.
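For reference, the "algorithmic components" of Adam mentioned above are the two exponential moving averages and the coordinate-wise adaptive step. The sketch below is a minimal NumPy rendering of the standard Adam update (Kingma and Ba, 2015) with its common default hyperparameters; it is not the thesis's online-learning reformulation, only an illustration of the update the thesis analyzes.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the standard Adam update (illustrative sketch)."""
    # First moment: exponential moving average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment estimates (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Coordinate-wise adaptive update of the parameters.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```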
Date issued
2024-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology