Show simple item record

dc.contributor.advisor: Poggio, Tomaso
dc.contributor.author: Nasimov, Umarbek
dc.date.accessioned: 2023-07-31T19:57:08Z
dc.date.available: 2023-07-31T19:57:08Z
dc.date.issued: 2023-06
dc.date.submitted: 2023-06-06T16:35:02.790Z
dc.identifier.uri: https://hdl.handle.net/1721.1/151660
dc.description.abstract: There is a recurring observation in deep learning that neural networks can be combined simply with arithmetic averages over their parameters. This observation has led to many new research directions in model ensembling, meta-learning, federated learning, and optimization. We investigate the evolution of this phenomenon during the training trajectory of neural network models initialized from a common set of parameters (parent). Surprisingly, the benefit of averaging the parameters persists over long child trajectories from parent parameters with minimal training. Furthermore, we find that the parent can be merged with a single child with significant improvement in both training and test loss. Through analysis of the loss landscape, we find that the loss becomes sufficiently convex early on in training, and, as a consequence, models obtained by averaging multiple children often outperform any individual child.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: How early can we average Neural Networks?
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science
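The abstract above describes combining networks by taking simple arithmetic averages of their parameters, with children fine-tuned from a shared parent checkpoint. The following is a minimal PyTorch-style sketch of that averaging operation, illustrative only: it assumes all models share one architecture, and the names used (average_state_dicts, child_a, child_b, parent) are hypothetical, not taken from the thesis.

import copy

import torch


def average_state_dicts(state_dicts, weights=None):
    # Arithmetic (optionally weighted) average of parameter tensors from
    # models sharing the same architecture; non-floating-point buffers
    # (e.g. BatchNorm's num_batches_tracked) are kept from the first model.
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        if torch.is_floating_point(merged[key]):
            merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged


# Hypothetical usage: child_a and child_b were trained from a common parent
# checkpoint; their averaged parameters are loaded into a model of the same
# architecture.
#   model.load_state_dict(
#       average_state_dicts([child_a.state_dict(), child_b.state_dict()]))
# Merging the parent with a single child, as mentioned in the abstract, is
# the two-model special case:
#   model.load_state_dict(
#       average_state_dicts([parent.state_dict(), child.state_dict()]))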

