Congestion Control in Machine Learning Clusters

Rajasekaran, Sudarsanan

dc.contributor.advisor	Ghobadi, Manya
dc.contributor.author	Rajasekaran, Sudarsanan
dc.date.accessioned	2024-08-21T18:55:56Z
dc.date.available	2024-08-21T18:55:56Z
dc.date.issued	2024-05
dc.date.submitted	2024-07-10T12:59:51.040Z
dc.identifier.uri	https://hdl.handle.net/1721.1/156313
dc.description.abstract	This paper argues that fair-sharing, the holy grail of congestion control algorithms for decades, is not necessarily a desirable property in Machine Learning (ML) training clusters. We demonstrate that for a specific combination of jobs, introducing unfairness improves the training time for all competing jobs. We call this specific combination of jobs compatible and define the compatibility criterion using a novel geometric abstraction. Our abstraction rolls time around a circle and rotates the communication phases of jobs to identify fully compatible jobs. Using this abstraction, we demonstrate up to 1.3× improvement in the average training iteration time of popular ML models. We advocate that resource management algorithms should take job compatibility on network links into account. We then propose three directions to ameliorate the impact of network congestion in ML training clusters: (i) an adaptively unfair congestion control scheme, (ii) priority queues on switches, and (iii) precise flow scheduling.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Congestion Control in Machine Learning Clusters
dc.type	Thesis
dc.description.degree	S.M.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Science in Electrical Engineering and Computer Science

Files in this item

Name: rajasekaran-rsudhir-sm-eecs-20 ...
Size: 2.080Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Graduate Theses
