Extending memory system semantics to accelerate irregular applications
Author(s)
Zhang, Guowei,Ph. D.Massachusetts Institute of Technology.
Download1252062370-MIT.pdf (3.034Mb)
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Daniel Sanchez.
Terms of use
Metadata
Show full item recordAbstract
Computer systems are increasingly bottlenecked by data movement, and rely on sophisticated memory hierarchies to address this issue. However, conventional memory systems suffer from poor performance on many irregular access patterns. This is because memory systems use an inexpressive interface that does not convey sufficient program semantics: they organize data in fixed-sized chunks and access data with only reads and writes. As a result, memory systems incur significant performance loss on several common patterns. In this thesis, we identify three such patterns: accesses to small data fragments suffer poor locality; concurrent updates introduce excessive traffic and serialization; and dependent reads incur long latencies that are on the critical path. To tackle these issues, this thesis proposes techniques that extend the semantics of the memory system. We apply this insight to address each of the three issues and propose solutions with different degrees of generality. COUP and COMMTM provide general architectural support by exploiting commutative updates to reduce communication and synchronization. COUP supports strict single-instruction commutativity by extending the cache coherence protocol, while COMMTM supports multi-instruction and semantic commutativity by leveraging hardware transactional memory. Whereas COUP and COMMTM are general, HTA and GAMMA target a specific data structure and a specific application, respectively. HTA addresses the inefficiencies of small fragments in the context of hash tables. It exploits the associativity in hash tables and leverages caches to reduce runtime overheads and to improve spatial locality. GAMMA is a sparse matrix-matrix multiplication accelerator. Its novel storage idiom, FIBERCACHE, combines caching and decoupled execution to ensure low latency for dependent reads with irregular reuse. This enables GAMMA to adopt an efficient dataflow, Gustavson's algorithm, to minimize off-chip traffic. In return, these techniques improve the performance and reduce the data movement of challenging applications significantly.
Description
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021 Cataloged from the official PDF of thesis. Includes bibliographical references (pages 109-128).
Date issued
2021Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.