PipeRAG: Fast Retrieval-Augmented Generation via Adaptive Pipeline Parallelism

Jiang, Wenqi; Zhang, Shuai; Han, Boran; Wang, Jie; Wang, Bernie; Kraska, Tim

dc.contributor.author	Jiang, Wenqi
dc.contributor.author	Zhang, Shuai
dc.contributor.author	Han, Boran
dc.contributor.author	Wang, Jie
dc.contributor.author	Wang, Bernie
dc.contributor.author	Kraska, Tim
dc.date.accessioned	2025-08-12T15:41:55Z
dc.date.available	2025-08-12T15:41:55Z
dc.date.issued	2025-07-20
dc.identifier.isbn	979-8-4007-1245-6
dc.identifier.uri	https://hdl.handle.net/1721.1/162351
dc.description	KDD ’25, August 3–7, 2025, Toronto, ON, Canada	en_US
dc.description.abstract	Retrieval-augmented generation (RAG) can enhance the generation quality of large language models (LLMs) by incorporating external token databases. However, retrievals from large databases can constitute a substantial portion of the overall generation time, particularly when retrievals are periodically performed to align the retrieved content with the latest states of generation. In this paper, we introduce PipeRAG, a novel algorithm-system co-design approach to reduce generation latency and enhance generation quality. PipeRAG integrates (1) pipeline parallelism to enable concurrent retrieval and generation processes, (2) flexible retrieval intervals to maximize the efficiency of pipeline parallelism, and (3) a performance model to automatically balance retrieval quality and latency based on the generation states and underlying hardware. Our evaluation shows that, by combining the three aforementioned methods, PipeRAG achieves up to 2.6× speedup in end-to-end generation latency while improving generation quality. These promising results showcase the effectiveness of co-designing algorithms with underlying systems, paving the way for the adoption of PipeRAG in future RAG systems.	en_US
dc.publisher	ACM|Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1	en_US
dc.relation.isversionof	https://doi.org/10.1145/3690624.3709194	en_US
dc.rights	Creative Commons Attribution-Noncommercial	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/	en_US
dc.source	Association for Computing Machinery	en_US
dc.title	PipeRAG: Fast Retrieval-Augmented Generation via Adaptive Pipeline Parallelism	en_US
dc.type	Article	en_US
dc.identifier.citation	Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, and Tim Kraska. 2025. PipeRAG: Fast Retrieval-Augmented Generation via Adaptive Pipeline Parallelism. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD '25). Association for Computing Machinery, New York, NY, USA, 589–600.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.identifier.mitlicense	PUBLISHER_POLICY
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2025-08-01T07:54:41Z
dc.language.rfc3066	en
dc.rights.holder	The author(s)
dspace.date.submission	2025-08-01T07:54:42Z
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name: 3690624.3709194.pdf
Size: 3.174 MB
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles
