
dc.contributor.advisor: Ghobadi, Manya
dc.contributor.author: Rajasekaran, Sudarsanan
dc.date.accessioned: 2025-03-27T16:58:10Z
dc.date.available: 2025-03-27T16:58:10Z
dc.date.issued: 2025-02
dc.date.submitted: 2025-03-04T17:25:49.475Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158918
dc.description.abstract: The ever-growing dataset and model sizes in deep learning have created a massive demand for efficient GPU clusters. As the number of GPUs increases, the communication overhead of distributed Machine Learning (ML) training and fine-tuning workloads quickly takes up a significant portion of iteration time. Yet state-of-the-art ML schedulers tend to ignore the communication pattern of ML jobs when placing workers on GPUs. This thesis advocates for communication-aware resource scheduling as a critical approach to optimizing network utilization in ML clusters. We introduce a key idea for accelerating Deep Neural Network (DNN) jobs: interleaving the communication demands of different jobs sharing a network link. To illustrate this concept of interleaving, we first demonstrate how intentionally creating unfairness in the bandwidth share between different DNN jobs improves their iteration times. Building on this insight, we present two novel systems designed to minimize network congestion and accelerate DNN training and fine-tuning jobs. The first system, Cassini, achieves interleaving using a centralized approach; the second system, MLTCP, achieves the same goal using a distributed approach. Both systems are practical and readily deployable, depending on whether the service provider prefers a centralized or a distributed solution. In particular, Cassini is a centralized network-aware job scheduler for ML clusters. Cassini introduces a novel geometric abstraction to account for the communication pattern of different jobs while placing them on network links. To do so, Cassini uses an Affinity graph to find a series of time-shift values that adjust the communication phases of a subset of jobs so that the communication patterns of jobs sharing the same network link are interleaved with each other. The second system, MLTCP, is a distributed technique that approximates an interleaved centralized flow schedule. At the heart of MLTCP lies a straightforward principle based on a key conceptual insight: by scaling the congestion window size (or sending rate) based on the number of bytes sent in each iteration, MLTCP flows eventually converge to a schedule that reduces network contention. To evaluate these systems, we conduct experiments using real-world DNN models on a testbed with NVIDIA A100 GPUs. Cassini and MLTCP improve average iteration times by up to 1.6× and 1.9×, respectively, demonstrating their effectiveness in reducing network congestion and accelerating ML workloads.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Accelerating Distributed Deep Neural Network Training and Fine-Tuning Through Resource Interleaving
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy
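
As a companion to the abstract's description of interleaving, the following minimal Python sketch illustrates the underlying idea: when two jobs with periodic communication phases share a link, time-shifting one job's phase can keep their demands from colliding. The demand patterns, link capacity, and brute-force search over shifts are illustrative assumptions, not Cassini's Affinity-graph algorithm or the thesis's measured workloads.

# Toy illustration of communication interleaving via time shifts.
# All inputs below are hypothetical examples, not Cassini's actual algorithm.

def shifted(pattern, shift):
    """Circularly shift a periodic per-slot demand pattern in time."""
    return pattern[-shift:] + pattern[:-shift] if shift else list(pattern)

def excess_on_link(patterns, capacity=1.0):
    """Total demand exceeding the link capacity, summed over one period."""
    return sum(max(0.0, sum(slot) - capacity) for slot in zip(*patterns))

# Two jobs with the same iteration period: 1.0 = communication phase
# (link fully used), 0.0 = computation phase (link idle).
job_a = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
job_b = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

# Brute-force the time shift for job_b that minimizes contention with job_a.
best_shift = min(range(len(job_b)),
                 key=lambda s: excess_on_link([job_a, shifted(job_b, s)]))
print("best shift:", best_shift,
      "remaining excess:", excess_on_link([job_a, shifted(job_b, best_shift)]))

With these example patterns, shifting job_b's communication phase by two slots removes all contention on the shared link, which is the effect the abstract attributes to interleaving.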
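
Similarly, the short sketch below conveys MLTCP's stated principle of coupling a flow's congestion window to the bytes it has sent in the current training iteration. The class name, update rule, and constants are hypothetical stand-ins chosen for illustration; they are not MLTCP's actual TCP/RDMA mechanism.

# Illustrative sketch: grow the congestion window with per-iteration progress.
# The scaling rule and parameters here are assumptions, not MLTCP's formula.

class IterationAwareWindow:
    def __init__(self, base_cwnd=10, bytes_per_iteration=1_000_000, boost=1.0):
        self.base_cwnd = base_cwnd                    # segments
        self.bytes_per_iteration = bytes_per_iteration
        self.boost = boost                            # assumed scaling strength
        self.bytes_sent = 0                           # bytes sent this iteration

    def on_bytes_sent(self, n):
        self.bytes_sent += n

    def on_iteration_boundary(self):
        self.bytes_sent = 0                           # new training iteration

    def congestion_window(self):
        # Flows further along in their iteration get a larger window, so the
        # flow that starts first tends to finish first; over many iterations
        # competing jobs settle into a less contended, interleaved schedule.
        progress = self.bytes_sent / self.bytes_per_iteration
        return self.base_cwnd * (1 + self.boost * progress)

# Example: a flow halfway through its iteration gets a 1.5x larger window.
w = IterationAwareWindow()
w.on_bytes_sent(500_000)
print(w.congestion_window())   # 15.0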

