
Efficient Network Systems Design for Machine Learning

Author(s)
Yang, Mingran
Advisor
Ghobadi, Manya
Terms of use
Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). Copyright retained by author(s). https://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract
Machine learning (ML) is transforming modern life by powering a diverse range of groundbreaking applications. As ML models and datasets expand, the scale of training and inference workloads in modern datacenters is increasing at an unprecedented pace. As the demand for computing resources grows, the need for low-latency and energy-efficient network systems becomes increasingly urgent. This thesis introduces efficient network systems designed to support machine learning workloads. It presents three key systems: Trio-ML, which accelerates ML training; Lightning, which enhances ML inference efficiency; and on-fiber photonic computing, a forward-looking vision for next-generation computing systems.

The first system, Trio-ML, accelerates data-parallel distributed ML training by leveraging in-network computing on Juniper Networks' programmable chipset Trio. Trio-ML features two key designs: in-network aggregation, which utilizes Trio packet processing threads to aggregate gradients directly inside the network, and in-network straggler mitigation, which utilizes Trio timer threads to detect and address stragglers. We prototype Trio-ML on a testbed with three real DNN models (ResNet50, DenseNet161, and VGG11) to demonstrate its effectiveness in mitigating stragglers while performing in-network aggregation. Our evaluations show that when stragglers occur in the cluster, Trio-ML outperforms today's state-of-the-art in-network aggregation solutions by up to 1.8x.

The second system, Lightning, is the first reconfigurable photonic-electronic smartNIC to serve real-time ML inference requests. Lightning uses a fast datapath to feed traffic from the NIC into the photonic domain without creating digital packet processing and data movement bottlenecks. To do so, Lightning leverages a novel reconfigurable count-action abstraction that keeps track of the required computation operations of each inference packet. Our count-action abstraction decouples the compute control plane from the data plane by counting the number of operations in each task and triggers the execution of the next task(s) without interrupting the dataflow. We evaluate Lightning's performance using four platforms: prototype, chip synthesis, emulations, and simulations. Our simulations with large DNN models show that compared to Nvidia A100 GPU, A100X DPU, and Brainwave smartNIC, Lightning accelerates the average inference serve time by 337x, 329x, and 42x, while consuming 352x, 419x, and 54x less energy, respectively.

Building on the in-network computing and photonic computing concepts discussed in Trio-ML and Lightning, we present a forward-looking vision for future computing systems. We argue that pluggable transponders are a prime platform for performing photonic computing inside the network without having to replace networking switches and routers. Optical transponders are ubiquitous in today's wide-area and datacenter networks, giving us a unique opportunity to re-purpose them for photonic computing. To this end, we introduce on-fiber photonic computing, explore key research challenges in bringing this vision to reality, and discuss real-world applications.
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/164120
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
