Show simple item record

dc.contributor.advisor      Ragan-Kelley, Jonathan
dc.contributor.author       Ouyang, Anne
dc.date.accessioned         2023-07-31T19:47:28Z
dc.date.available           2023-07-31T19:47:28Z
dc.date.issued              2023-06
dc.date.submitted           2023-06-06T16:35:24.190Z
dc.identifier.uri           https://hdl.handle.net/1721.1/151543
dc.description.abstract     State-of-the-art results in natural language processing have been obtained by scaling up transformer-based machine learning models, which can have more than a hundred billion parameters. Training and deploying these models can be difficult and extremely expensive, so performance engineering efforts to improve their latency and throughput are crucial to enabling widespread applications. We developed an analytical model for studying the performance of transformer inference and combined it with empirical studies using existing frameworks to gain insight into the performance characteristics of transformers and the efficiency of existing implementations. The findings revealed the contribution of the different operations to the total parameter count, floating-point operation count, and activation memory. A comparison between the prefill and generation stages highlighted differences in performance characteristics, with generation being slower due to low-arithmetic-intensity operations. Empirical studies with an existing implementation on a single GPU showed high roofline utilization but low FLOPs utilization during the generation stage, indicating that the implementation is reasonably efficient but that the low arithmetic intensity of autoregressive generation is an inherent limitation of transformer-based architectures. We also experimented with various parallelism strategies for different inference workloads and distilled our observations into recommendations for using parallelism effectively. We found that the best parallelism strategy depends on the specific workload (batch size and input and output sequence lengths). We also found that model parallelism can be useful for reasons beyond fitting the model in GPU memory: for example, even when a model fits on a single GPU, tensor parallelism can decrease generation latency in small-batch settings. We hope that a comprehensive understanding of the performance characteristics and trade-offs can serve as a guide for researchers to optimize hardware resource utilization and enhance the efficiency of large language models.
dc.publisher                Massachusetts Institute of Technology
dc.rights                   In Copyright - Educational Use Permitted
dc.rights                   Copyright retained by author(s)
dc.rights.uri               https://rightsstatements.org/page/InC-EDU/1.0/
dc.title                    Understanding the Performance of Transformer Inference
dc.type                     Thesis
dc.description.degree       M.Eng.
dc.contributor.department   Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree           Master
thesis.degree.name          Master of Engineering in Electrical Engineering and Computer Science
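
To make the abstract's arithmetic-intensity point concrete, the sketch below (Python) gives a rough illustration under assumed values; it is not part of the thesis record. It estimates per-layer FLOPs, weight-memory traffic, and arithmetic intensity for the weight matrix multiplies of a dense decoder-only transformer layer, comparing a prefill pass over a prompt with a single autoregressive decode step. The hidden size, fp16 weights, and the 4x MLP expansion are illustrative assumptions.

# Illustrative sketch only (not from the thesis). Estimates per-layer FLOPs,
# weight traffic, and arithmetic intensity for prefill vs. one decode step.

def layer_matmul_stats(d_model, seq_len, batch, bytes_per_param=2):
    """Rough per-layer estimates for the projection and MLP matmuls only
    (attention scores, softmax, layer norms, and activation traffic are ignored)."""
    # Weight elements per layer: 4*d^2 for the Q/K/V/output projections
    # plus 8*d^2 for a two-matrix MLP with 4x expansion (assumed).
    weight_elems = 4 * d_model**2 + 8 * d_model**2
    # Each weight element contributes one multiply-add per token processed.
    flops = 2 * weight_elems * batch * seq_len
    # Weights are read from memory once per forward pass (fp16 assumed).
    bytes_moved = weight_elems * bytes_per_param
    return flops, bytes_moved, flops / bytes_moved

if __name__ == "__main__":
    d = 12288  # hidden size of a ~175B-parameter model (assumed for illustration)

    # Prefill: all 512 prompt tokens are processed in one pass.
    flops, nbytes, ai = layer_matmul_stats(d_model=d, seq_len=512, batch=1)
    print(f"prefill: {flops / 1e12:.1f} TFLOPs/layer, "
          f"{nbytes / 1e9:.1f} GB of weights read, ~{ai:.0f} FLOPs/byte")

    # Decode: the same weights are re-read for a single new token.
    flops, nbytes, ai = layer_matmul_stats(d_model=d, seq_len=1, batch=1)
    print(f"decode : {flops / 1e9:.1f} GFLOPs/layer, "
          f"{nbytes / 1e9:.1f} GB of weights read, ~{ai:.0f} FLOPs/byte")

Under these assumptions the arithmetic intensity works out to roughly batch x sequence-length FLOPs per byte, so single-batch decode sits near 1 FLOP/byte while prefill over a 512-token prompt reaches hundreds of FLOPs per byte, which is consistent with the abstract's observation that generation is slower because it is dominated by low-arithmetic-intensity operations.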

