
dc.contributor.advisor: Ghobadi, Manya
dc.contributor.author: Rajasekaran, Sudarsanan
dc.date.accessioned: 2025-03-27T16:58:10Z
dc.date.available: 2025-03-27T16:58:10Z
dc.date.issued: 2025-02
dc.date.submitted: 2025-03-04T17:25:49.475Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158918
dc.description.abstract: The ever-growing dataset and model sizes in deep learning have created a massive demand for efficient GPU clusters. As the number of GPUs increases, the communication overhead of distributed Machine Learning (ML) training and fine-tuning workloads quickly takes up a significant portion of iteration time. Yet state-of-the-art ML schedulers tend to ignore the communication pattern of ML jobs when placing workers on GPUs. This thesis advocates for communication-aware resource scheduling as a critical approach to optimizing network utilization in ML clusters. We introduce a key idea for accelerating Deep Neural Network (DNN) jobs: interleaving the communication demands of different jobs sharing a network link. To illustrate this concept of interleaving, we first demonstrate how intentionally creating unfairness in the bandwidth share between different DNN jobs improves their iteration times. Building on this insight, we present two novel systems designed to minimize network congestion and accelerate DNN training and fine-tuning jobs. The first system, Cassini, achieves interleaving using a centralized approach; the second system, MLTCP, achieves the same goal using a distributed approach. Both systems are practical and readily deployable, depending on whether the service provider prefers a centralized or a distributed solution. In particular, Cassini is a centralized network-aware job scheduler for ML clusters. Cassini introduces a novel geometric abstraction to account for the communication pattern of different jobs while placing them on network links. To do so, Cassini uses an Affinity graph to find a series of time-shift values that adjust the communication phases of a subset of jobs so that the communication patterns of jobs sharing the same network link are interleaved with each other. The second system, MLTCP, is a distributed technique that approximates an interleaved centralized flow schedule. At the heart of MLTCP lies a straightforward principle based on a key conceptual insight: by scaling the congestion window size (or sending rate) based on the number of bytes sent in each iteration, MLTCP flows eventually converge to a schedule that reduces network contention. To evaluate these systems, we conduct experiments using real-world DNN models on a testbed with NVIDIA A100 GPUs. Cassini and MLTCP improve average iteration times by up to 1.6× and 1.9×, respectively, demonstrating their effectiveness in reducing network congestion and accelerating ML workloads.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Accelerating Distributed Deep Neural Network Training and Fine-Tuning Through Resource Interleaving
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy
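
As a companion to the abstract's description of interleaving, the following minimal Python sketch illustrates the underlying idea: when two jobs with periodic communication phases share a link, time-shifting one job's phase can keep their demands from colliding. The demand patterns, link capacity, and brute-force search over shifts are illustrative assumptions, not Cassini's Affinity-graph algorithm or the thesis's measured workloads.

# Toy illustration of communication interleaving via time shifts.
# All inputs below are hypothetical examples, not Cassini's actual algorithm.

def shifted(pattern, shift):
    """Circularly shift a periodic per-slot demand pattern in time."""
    return pattern[-shift:] + pattern[:-shift] if shift else list(pattern)

def excess_on_link(patterns, capacity=1.0):
    """Total demand exceeding the link capacity, summed over one period."""
    return sum(max(0.0, sum(slot) - capacity) for slot in zip(*patterns))

# Two jobs with the same iteration period: 1.0 = communication phase
# (link fully used), 0.0 = computation phase (link idle).
job_a = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
job_b = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

# Brute-force the time shift for job_b that minimizes contention with job_a.
best_shift = min(range(len(job_b)),
                 key=lambda s: excess_on_link([job_a, shifted(job_b, s)]))
print("best shift:", best_shift,
      "remaining excess:", excess_on_link([job_a, shifted(job_b, best_shift)]))

With these example patterns, shifting job_b's communication phase by two slots removes all contention on the shared link, which is the effect the abstract attributes to interleaving.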
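
Similarly, the short sketch below conveys MLTCP's stated principle of coupling a flow's congestion window to the bytes it has sent in the current training iteration. The class name, update rule, and constants are hypothetical stand-ins chosen for illustration; they are not MLTCP's actual TCP/RDMA mechanism.

# Illustrative sketch: grow the congestion window with per-iteration progress.
# The scaling rule and parameters here are assumptions, not MLTCP's formula.

class IterationAwareWindow:
    def __init__(self, base_cwnd=10, bytes_per_iteration=1_000_000, boost=1.0):
        self.base_cwnd = base_cwnd                    # segments
        self.bytes_per_iteration = bytes_per_iteration
        self.boost = boost                            # assumed scaling strength
        self.bytes_sent = 0                           # bytes sent this iteration

    def on_bytes_sent(self, n):
        self.bytes_sent += n

    def on_iteration_boundary(self):
        self.bytes_sent = 0                           # new training iteration

    def congestion_window(self):
        # Flows further along in their iteration get a larger window, so the
        # flow that starts first tends to finish first; over many iterations
        # competing jobs settle into a less contended, interleaved schedule.
        progress = self.bytes_sent / self.bytes_per_iteration
        return self.base_cwnd * (1 + self.boost * progress)

# Example: a flow halfway through its iteration gets a 1.5x larger window.
w = IterationAwareWindow()
w.on_bytes_sent(500_000)
print(w.congestion_window())   # 15.0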

