MIT Libraries, DSpace@MIT

TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Machine Learning Training Jobs

Author(s)
Wang, Weiyang
Advisor
Ghobadi, Manya
Terms of use
In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/
Abstract
This thesis explores a novel approach for building direct-connect DNN training clusters. The proposed system, called TopoOpt, co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. TopoOpt uses a novel alternating optimization technique and a group theory-inspired algorithm to find the best network topology and routing plan, together with parallelization strategy, for distributed DNN training. To motivate this research, we measure the communication patterns of distributed DNN workloads at Meta. Simulations with six real distributed training models show that, compared to similar-cost Fat-tree interconnects, TopoOpt reduces DNN training time by up to 3.4× on a 128-server cluster. Importantly, TopoOpt’s performance matches an ideal network using an abstract full bisection bandwidth switch, which costs 3.2× more. Experiments with a 12-node prototype demonstrate the feasibility of TopoOpt. The prototype shows that with 4×25 Gbps interfaces, TopoOpt’s training throughput is comparable to the ideal baseline of a 100 Gbps full bisection bandwidth network. TopoOpt is the first system with entirely commodity hardware that co-optimizes topology and parallelization strategy for DNN workloads and is currently being evaluated for deployment at Meta.
Date issued
2022-09
URI
https://hdl.handle.net/1721.1/147321
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Content created by the MIT Libraries, CC BY-NC unless otherwise noted.