| dc.contributor.author | Yao, Xiaozhe | |
| dc.contributor.author | Hu, Qinghao | |
| dc.contributor.author | Klimovic, Ana | |
| dc.date.accessioned | 2025-05-09T16:51:40Z | |
| dc.date.available | 2025-05-09T16:51:40Z | |
| dc.date.issued | 2025-03-30 | |
| dc.identifier.isbn | 979-8-4007-1196-1 | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/159252 | |
| dc.description | EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands | en_US |
| dc.description.abstract | Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different LLMs. To bridge this gap, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10× while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves 2× to 12× improvement in throughput compared to the state-of-the-art systems. | en_US |
| dc.publisher | ACM\|Twentieth European Conference on Computer Systems | en_US |
| dc.relation.isversionof | https://doi.org/10.1145/3689031.3717468 | en_US |
| dc.rights | Creative Commons Attribution | en_US |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_US |
| dc.source | Association for Computing Machinery | en_US |
| dc.title | DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs | en_US |
| dc.type | Article | en_US |
| dc.identifier.citation | Xiaozhe Yao, Qinghao Hu, and Ana Klimovic. 2025. DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25). Association for Computing Machinery, New York, NY, USA, 110–127. | en_US |
| dc.contributor.department | Massachusetts Institute of Technology. Research Laboratory of Electronics | en_US |
| dc.identifier.mitlicense | PUBLISHER_POLICY | |
| dc.eprint.version | Final published version | en_US |
| dc.type.uri | http://purl.org/eprint/type/ConferencePaper | en_US |
| eprint.status | http://purl.org/eprint/status/NonPeerReviewed | en_US |
| dc.date.updated | 2025-04-01T07:49:37Z | |
| dc.language.rfc3066 | en | |
| dc.rights.holder | The author(s) | |
| dspace.date.submission | 2025-04-01T07:49:37Z | |
| mit.license | PUBLISHER_CC | |
| mit.metadata.status | Authority Work and Publication Information Needed | en_US |