DSpace@MIT

AIKIDO : toward straggler mitigation for distributed machine learning training in cloud data centers

Author(s)
Sharma, Ayush, M. Eng., Massachusetts Institute of Technology.
Alternative title
Toward straggler mitigation for distributed machine learning training in cloud data centers
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Manya Ghobadi.
Terms of use
MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582
Abstract
As artificial intelligence becomes a critical component of everyday life, the popularity of using cloud data centers for training deep neural networks is relentlessly growing. This poses a significant challenge for data center operators, because the network bandwidth is shared among multiple ML jobs as well as between ML jobs and other data center flows. At high loads, the network frequently experiences transient congestion events, which in turn delay the parameter updates between ML workers. Consequently, training convergence suffers: workers behind congested links straggle and fail to update the model parameters in time, which delays all workers. We propose AIKIDO as a first step towards mitigating the impact of transient network-induced stragglers on training workloads caused by the dynamic nature of data center traffic. AIKIDO exploits the inherent robustness of ML training to the occasional loss of gradient updates and implements a Skip-Straggler communication strategy in which the updates from straggling workers are simply skipped. In addition, AIKIDO introduces an Active-Backup strategy as an improvement to the Skip method that maintains high-accuracy convergence while using fewer resources than full worker replication. In our experiments, we use the Google Cloud Engine environment to train ResNet-50 on ImageNet at various scales and demonstrate that AIKIDO is able to mitigate the effect of stragglers and achieve the same time-to-accuracy as if there were no stragglers.
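
The Skip-Straggler idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not say how stragglers are detected, so the fixed per-step `deadline`, the function name `skip_straggler_average`, and the dictionary-based inputs are all assumptions made for illustration. The sketch only shows the core step of averaging the gradients that arrived on time and dropping the rest.

```python
from typing import Dict, List, Optional


def skip_straggler_average(
    gradients: Dict[int, List[float]],
    arrival_times: Dict[int, float],
    deadline: float,
) -> Optional[List[float]]:
    """Average gradients from workers whose update arrived by the deadline.

    Hypothetical helper: a fixed deadline is an assumed straggler criterion,
    not the detection rule used in the thesis.
    """
    # Keep only workers that reported in time; stragglers are simply skipped.
    on_time = [
        worker
        for worker, arrived in arrival_times.items()
        if arrived <= deadline and worker in gradients
    ]
    if not on_time:
        # No usable update this step; the caller decides how to proceed.
        return None

    dim = len(gradients[on_time[0]])
    total = [0.0] * dim
    for worker in on_time:
        for i, value in enumerate(gradients[worker]):
            total[i] += value

    # Normalize by the number of contributing workers, not the cluster size.
    return [s / len(on_time) for s in total]


# Example: worker 2 is behind a congested link and misses the deadline.
example_gradients = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [100.0, 100.0]}
example_arrivals = {0: 0.1, 1: 0.2, 2: 5.0}
averaged = skip_straggler_average(example_gradients, example_arrivals, deadline=1.0)
print(averaged)  # [2.0, 3.0]
```

Dividing by the number of on-time workers rather than the total worker count keeps the scale of the averaged gradient unchanged when an update is skipped; whether AIKIDO normalizes this way is not stated in the abstract.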
Description
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May, 2020
 
Cataloged from the official PDF of thesis.
 
Includes bibliographical references (pages 69-75).
 
Date issued
2020
URI
https://hdl.handle.net/1721.1/127520
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.

Collections
  • Graduate Theses
