PockEngine: Sparse and Efficient Fine-tuning in a Pocket

Zhu, Ligeng; Hu, Lanxiang; Lin, Ji; Chen, Wei-Ming; Wang, Wei-Chen; Gan, Chuang; Han, Song

dc.contributor.author	Zhu, Ligeng
dc.contributor.author	Hu, Lanxiang
dc.contributor.author	Lin, Ji
dc.contributor.author	Chen, Wei-Ming
dc.contributor.author	Wang, Wei-Chen
dc.contributor.author	Gan, Chuang
dc.contributor.author	Han, Song
dc.date.accessioned	2024-01-03T18:41:43Z
dc.date.available	2024-01-03T18:41:43Z
dc.date.issued	2023-10-28
dc.identifier.isbn	979-8-4007-0329-4
dc.identifier.uri	https://hdl.handle.net/1721.1/153267
dc.description.abstract	On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack the optimizations for learning on the edge, which faces challenges of resource limitations and edge hardware diversity. We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. First, PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model with measured memory saving and latency reduction while maintaining the model quality. Second, PockEngine is compilation first: the entire training graph (including forward, backward and optimization steps) is derived at compile-time, which reduces the runtime overhead and brings opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, including operator reordering and backend switching, and thus can further reduce the training cost. PockEngine supports diverse applications, frontends and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to 15× speedup over off-the-shelf TensorFlow (Raspberry Pi) and 5.6× memory-saving backpropagation (Jetson AGX Orin). Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9× faster than PyTorch.	en_US
dc.publisher	ACM|56th Annual IEEE/ACM International Symposium on Microarchitecture	en_US
dc.relation.isversionof	https://doi.org/10.1145/3613424.3614307	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en_US
dc.title	PockEngine: Sparse and Efficient Fine-tuning in a Pocket	en_US
dc.type	Article	en_US
dc.identifier.citation	Zhu, Ligeng, Hu, Lanxiang, Lin, Ji, Chen, Wei-Ming, Wang, Wei-Chen et al. 2023. "PockEngine: Sparse and Efficient Fine-tuning in a Pocket."
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.contributor.department	MIT-IBM Watson AI Lab
dc.identifier.mitlicense	PUBLISHER_CC
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2024-01-01T08:48:08Z
dc.language.rfc3066	en
dc.rights.holder	The author(s)
dspace.date.submission	2024-01-01T08:48:09Z
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name: license_rdf
Size: 40 bytes
Format: application/rdf+xml

Name: 3613424.3614307.pdf
Size: 7.026Mb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles
