
dc.contributor.author    Jeon, Byungsoo
dc.contributor.author    Wu, Mengdi
dc.contributor.author    Cao, Shiyi
dc.contributor.author    Kim, Sunghyun
dc.contributor.author    Park, Sunghyun
dc.contributor.author    Aggarwal, Neeraj
dc.contributor.author    Unger, Colin
dc.contributor.author    Arfeen, Daiyaan
dc.contributor.author    Liao, Peiyuan
dc.contributor.author    Miao, Xupeng
dc.contributor.author    Alizadeh, Mohammad
dc.contributor.author    Ganger, Gregory
dc.contributor.author    Chen, Tianqi
dc.contributor.author    Jia, Zhihao
dc.date.accessioned    2025-05-09T15:33:17Z
dc.date.available    2025-05-09T15:33:17Z
dc.date.issued    2025-02-03
dc.identifier.isbn    979-8-4007-0698-1
dc.identifier.uri    https://hdl.handle.net/1721.1/159248
dc.description    ASPLOS ’25, March 30–April 3, 2025, Rotterdam, Netherlands    en_US
dc.description.abstract    Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device (e.g. GPU). Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN computation for different micro-batches of training samples in a pipeline fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally-independent operators, resulting in reduced memory requirement and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6×. GraphPipe also reduces the search time by 9-21× compared to PipeDream and Piper.    en_US
dc.publisher    ACM|Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1    en_US
dc.relation.isversionof    https://doi.org/10.1145/3669940.3707220    en_US
dc.rights    Creative Commons Attribution    en_US
dc.rights.uri    https://creativecommons.org/licenses/by/4.0/    en_US
dc.source    Association for Computing Machinery    en_US
dc.title    GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism    en_US
dc.type    Article    en_US
dc.identifier.citation    Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, and Zhihao Jia. 2025. GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS '25). Association for Computing Machinery, New York, NY, USA, 557–571.    en_US
dc.contributor.department    Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory    en_US
dc.identifier.mitlicense    PUBLISHER_CC
dc.eprint.version    Final published version    en_US
dc.type.uri    http://purl.org/eprint/type/ConferencePaper    en_US
eprint.status    http://purl.org/eprint/status/NonPeerReviewed    en_US
dc.date.updated    2025-04-01T07:48:46Z
dc.language.rfc3066    en
dc.rights.holder    The author(s)
dspace.date.submission    2025-04-01T07:48:47Z
mit.license    PUBLISHER_CC
mit.metadata.status    Authority Work and Publication Information Needed    en_US
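
To make the idea in the abstract above concrete, the following is a minimal, self-contained Python sketch, not GraphPipe's actual code or API: the stage names (embed, vision_branch, text_branch, fusion, head) and the schedule_waves helper are hypothetical. It models a DNN as a directed acyclic graph of pipeline stages and groups mutually independent stages into "waves" that could execute concurrently on different GPUs, illustrating the model-parallel opportunity that graph pipeline parallelism preserves and that a strictly sequential pipeline discards.

    # Illustrative sketch of graph pipeline parallelism (GPP); hypothetical
    # stage graph, NOT GraphPipe's actual implementation.
    # Each stage lists the stages it depends on.
    stage_deps = {
        "embed":         [],
        "vision_branch": ["embed"],          # independent of text_branch
        "text_branch":   ["embed"],          # independent of vision_branch
        "fusion":        ["vision_branch", "text_branch"],
        "head":          ["fusion"],
    }

    def schedule_waves(deps):
        """Group stages into 'waves' by topological level; every stage in a
        wave has all predecessors finished, so the wave can run concurrently
        across devices."""
        remaining = dict(deps)
        done, waves = set(), []
        while remaining:
            ready = sorted(s for s, ds in remaining.items()
                           if all(d in done for d in ds))
            if not ready:
                raise ValueError("stage graph contains a cycle")
            waves.append(ready)
            done.update(ready)
            for s in ready:
                del remaining[s]
        return waves

    if __name__ == "__main__":
        for i, wave in enumerate(schedule_waves(stage_deps)):
            print(f"wave {i}: run concurrently -> {wave}")
        # A sequential pipeline would impose an arbitrary total order on
        # vision_branch and text_branch, serializing independent work.

Running this prints vision_branch and text_branch in the same wave; a sequential pipeline-parallel scheme would place them in separate, ordered stages even though no data dependency requires it.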

