dc.contributor.author: Zhu, Ligeng
dc.contributor.author: Hu, Lanxiang
dc.contributor.author: Lin, Ji
dc.contributor.author: Chen, Wei-Ming
dc.contributor.author: Wang, Wei-Chen
dc.contributor.author: Gan, Chuang
dc.contributor.author: Han, Song
dc.date.accessioned: 2024-01-03T18:41:43Z
dc.date.available: 2024-01-03T18:41:43Z
dc.date.issued: 2023-10-28
dc.identifier.isbn: 979-8-4007-0329-4
dc.identifier.uri: https://hdl.handle.net/1721.1/153267
dc.description.abstract: On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack optimizations for learning on the edge, where resources are limited and hardware is diverse. We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. First, PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, yielding measured memory savings and latency reductions while maintaining model quality. Second, PockEngine is compilation-first: the entire training graph (including forward, backward and optimization steps) is derived at compile time, which reduces runtime overhead and creates opportunities for graph transformations. PockEngine also integrates a rich set of training-graph optimizations, including operator reordering and backend switching, which further reduce training cost. PockEngine supports diverse applications, frontends and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/JAX and deploys binaries to mobile CPUs/GPUs/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to a 15× speedup over off-the-shelf TensorFlow (Raspberry Pi) and a 5.6× memory saving in backpropagation (Jetson AGX Orin). Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9× faster than PyTorch. [en_US]
dc.publisher: ACM|56th Annual IEEE/ACM International Symposium on Microarchitecture [en_US]
dc.relation.isversionof: https://doi.org/10.1145/3613424.3614307 [en_US]
dc.rights: Creative Commons Attribution [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en_US]
dc.title: PockEngine: Sparse and Efficient Fine-tuning in a Pocket [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Zhu, Ligeng, Hu, Lanxiang, Lin, Ji, Chen, Wei-Ming, Wang, Wei-Chen et al. 2023. "PockEngine: Sparse and Efficient Fine-tuning in a Pocket."
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.contributor.department: MIT-IBM Watson AI Lab
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dc.date.updated: 2024-01-01T08:48:08Z
dc.language.rfc3066: en
dc.rights.holder: The author(s)
dspace.date.submission: 2024-01-01T08:48:09Z
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]