AIKIDO : toward straggler mitigation for distributed machine learning training in cloud data centers

Sharma, Ayush, M. Eng. Massachusetts Institute of Technology.

dc.contributor.advisor	Manya Ghobadi.	en_US
dc.contributor.author	Sharma, Ayush, M. Eng. Massachusetts Institute of Technology.	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2020-09-15T22:01:56Z
dc.date.available	2020-09-15T22:01:56Z
dc.date.copyright	2020	en_US
dc.date.issued	2020	en_US
dc.identifier.uri	https://hdl.handle.net/1721.1/127520
dc.description	Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May, 2020	en_US
dc.description	Cataloged from the official PDF of thesis.	en_US
dc.description	Includes bibliographical references (pages 69-75).	en_US
dc.description.abstract	As artificial intelligence becomes a critical component of everyday life, the use of cloud data centers for training deep neural networks keeps growing. This poses a significant challenge for data center operators, because the network bandwidth is shared among multiple ML jobs as well as between ML jobs and other data center flows. At high loads, the network frequently experiences transient congestion events, which in turn delay the parameter updates between ML workers. Consequently, training convergence suffers: workers behind congested links straggle and fail to update the model parameters in time, delaying all other workers. We propose AIKIDO as a first step towards mitigating the impact of transient network-induced stragglers on training workloads caused by the dynamic nature of data center traffic. AIKIDO exploits the inherent robustness of ML training to the occasional loss of gradient updates and implements a Skip-Straggler communication strategy in which the updates from straggling workers are simply skipped. In addition, AIKIDO introduces an Active-Backup strategy as an improvement on the Skip method that maintains high-accuracy convergence while using fewer resources than full worker replication. In our experiments, we use the Google Cloud Engine environment to train ResNet-50 on ImageNet at various scales and demonstrate that AIKIDO mitigates the effect of stragglers and achieves the same time-to-accuracy as if there were no stragglers.	en_US
dc.description.statementofresponsibility	by Ayush Sharma.	en_US
dc.format.extent	75 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	AIKIDO : toward straggler mitigation for distributed machine learning training in cloud data centers	en_US
dc.title.alternative	Toward straggler mitigation for distributed machine learning training in cloud data centers	en_US
dc.type	Thesis	en_US
dc.description.degree	M. Eng.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.identifier.oclc	1193029416	en_US
dc.description.collection	M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science	en_US
dspace.imported	2020-09-15T22:01:56Z	en_US
mit.thesis.degree	Master	en_US
mit.thesis.department	EECS	en_US

Files in this item

Name: 1193029416-MIT.pdf
Size: 3.401Mb
Format: PDF

This item appears in the following Collection(s)

Graduate Theses
