dc.contributor.advisor | Ragan-Kelley, Jonathan | |
dc.contributor.author | Ouyang, Anne | |
dc.date.accessioned | 2023-07-31T19:47:28Z | |
dc.date.available | 2023-07-31T19:47:28Z | |
dc.date.issued | 2023-06 | |
dc.date.submitted | 2023-06-06T16:35:24.190Z | |
dc.identifier.uri | https://hdl.handle.net/1721.1/151543 | |
dc.description.abstract | State-of-the-art results on natural language processing tasks have been obtained by scaling up transformer-based machine learning models, which can have more than a hundred billion parameters. Training and deploying these models is difficult and extremely expensive, and performance engineering efforts to improve their latency and throughput are crucial to enabling widespread applications.
We developed an analytical model for studying the performance of transformer inference and combined it with empirical studies using existing frameworks to gain insights into the performance characteristics of transformers and the efficiency of existing implementations. The findings revealed the contribution of the different operations to the total parameter count, floating-point operation count, and activation memory. A comparison between the prefilling and generation stages highlighted differences in performance characteristics, with generation being slower due to low-arithmetic-intensity operations. Empirical studies with existing implementations on single GPUs showed high roofline utilization but low FLOPs utilization during the generation stage, which indicates that the implementations are reasonably efficient and that the low arithmetic intensity of autoregressive generation is an inherent limitation of transformer-based architectures.
We also experimented with various parallelism strategies for different inference workloads and distilled our observations into recommendations for effectively using parallelism. We found that the best parallelism strategy depends on the specific workload (batch size and input and output sequence lengths). We also found that model parallelism can be useful for reasons beyond fitting the model in GPU memory: for example, even when a model fits on a single GPU, tensor parallelism can decrease generation-stage latency in small-batch settings.
We hope that a comprehensive understanding of the performance characteristics and trade-offs can serve as a guide for researchers to optimize hardware resource utilization and enhance the efficiency of large language models. | |
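To make the arithmetic-intensity argument in the abstract concrete, the following is a minimal roofline-style sketch, not taken from the thesis, comparing the prefilling stage with per-token autoregressive generation. The model and hardware numbers (GPT-3-scale dimensions, A100-class peak FLOP/s and memory bandwidth) and the 2 x parameters x tokens FLOP approximation are assumptions chosen purely for illustration.

def transformer_flops_and_bytes(d_model, n_layers, n_tokens, batch, dtype_bytes=2):
    # Rough matmul-dominated cost of one forward pass: the common
    # 2 * parameters * tokens FLOP approximation, counting weight traffic only
    # (KV-cache and activation traffic are ignored in this sketch).
    params = n_layers * 12 * d_model ** 2        # attention + MLP weights
    flops = 2 * params * batch * n_tokens        # one multiply-accumulate = 2 FLOPs
    weight_bytes = params * dtype_bytes          # weights streamed once per pass
    return flops, weight_bytes

def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    # Runtime is bounded by the slower of compute and memory traffic.
    return max(flops / peak_flops, bytes_moved / peak_bw)

PEAK_FLOPS = 312e12   # FLOP/s, assumed A100-class FP16 tensor-core peak
PEAK_BW = 2.0e12      # bytes/s, assumed A100-class HBM bandwidth
D_MODEL, N_LAYERS, BATCH = 12288, 96, 1   # assumed GPT-3-scale model, batch size 1

# Prefilling: 512 prompt tokens processed in one pass -> large, compute-bound matmuls.
f, b = transformer_flops_and_bytes(D_MODEL, N_LAYERS, n_tokens=512, batch=BATCH)
print("prefill : %4.0f FLOP/byte, %.3f s" % (f / b, roofline_time(f, b, PEAK_FLOPS, PEAK_BW)))

# Generation: one token per step -> the same weights are re-read for tiny matmuls,
# arithmetic intensity collapses, and each step is memory-bandwidth bound.
f, b = transformer_flops_and_bytes(D_MODEL, N_LAYERS, n_tokens=1, batch=BATCH)
print("generate: %4.0f FLOP/byte, %.3f s per token" % (f / b, roofline_time(f, b, PEAK_FLOPS, PEAK_BW)))

With these assumed numbers the prefilling pass lands near the compute roof while the generation step is limited by streaming the weights from memory, which is the sense in which high roofline utilization can coexist with low FLOPs utilization.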
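The observation that tensor parallelism can help latency even when the model fits on one GPU can be illustrated with the same kind of back-of-the-envelope estimate. The sketch below, again not from the thesis, assumes a bandwidth-bound generation step, a hypothetical ~13B-parameter FP16 model that fits on a single 80 GB GPU, and a flat per-layer all-reduce cost; splitting the weights over N GPUs divides the per-GPU weight traffic by N at the price of communication.

def per_token_latency_s(weight_bytes, peak_bw, n_gpus, n_layers, allreduce_s):
    # Bandwidth-bound weight streaming per GPU, plus two all-reduces per layer
    # (one after attention, one after the MLP) when tensor parallelism is used.
    stream = (weight_bytes / n_gpus) / peak_bw
    comm = 0.0 if n_gpus == 1 else 2 * n_layers * allreduce_s
    return stream + comm

PEAK_BW = 2.0e12        # bytes/s, assumed A100-class HBM bandwidth
WEIGHT_BYTES = 26e9     # assumed ~13B parameters in FP16
N_LAYERS = 40           # assumed layer count for a model of that size
ALLREDUCE_S = 10e-6     # assumed small-message all-reduce time over NVLink

for n in (1, 2, 4, 8):
    t_ms = per_token_latency_s(WEIGHT_BYTES, PEAK_BW, n, N_LAYERS, ALLREDUCE_S) * 1e3
    print(f"{n} GPU(s): ~{t_ms:.1f} ms/token")

Whether the communication term stays small enough for the split to pay off depends on batch size, sequence lengths, and the interconnect, which is consistent with the abstract's conclusion that the best parallelism strategy is workload-dependent.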
dc.publisher | Massachusetts Institute of Technology | |
dc.rights | In Copyright - Educational Use Permitted | |
dc.rights | Copyright retained by author(s) | |
dc.rights.uri | https://rightsstatements.org/page/InC-EDU/1.0/ | |
dc.title | Understanding the Performance of Transformer Inference | |
dc.type | Thesis | |
dc.description.degree | M.Eng. | |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
mit.thesis.degree | Master | |
thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |