DSpace@MIT

Compiler-Hardware Co-Design for Pervasive Parallelization

Author(s)
Ying, Victor A.
Download Thesis PDF (2.516 MB)
Advisor
Sanchez, Daniel
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Modern computer systems have hundreds of processor cores, so highly parallel programs are critical to achieve high performance. But parallel programming remains difficult on current systems, so many programs are still sequential. This dissertation presents new compilers and hardware architectures that can parallelize complex programs while retaining the simplicity of sequential code. Our new systems allow real-world programs to use hundreds of cores without burdening programmers with concurrency, deadlock, or data races.

This dissertation follows a novel approach that eliminates the burden of explicit parallel programming to make parallel execution pervasive. This approach relies on four guiding principles. First, exploiting implicit parallelism preserves the simplicity of sequential execution. Second, dividing computation into tiny tasks, as short as tens of instructions each, unlocks plentiful fine-grained parallelism in challenging programs. Hardware-compiler co-design techniques can create many tasks in parallel and reduce per-task overheads to make tiny tasks scale to many cores. Third, new hardware and software mechanisms can compose parallelism across entire programs, removing serializing barriers to overlap executions of nested parallel subroutines. Finally, exploiting static and dynamic information for data locality reduces data movement costs while maintaining load balance on large multicore systems.

This dissertation presents three systems that embody these four principles. First, T4 introduces automatic program transformations that exploit a novel hardware architecture to parallelize sequential programs. As a result, T4 scales hard-to-parallelize real-world programs to tens of cores, resulting in order-of-magnitude speedups. Second, S5 builds on T4 with novel transformations to remove needless serialization in a broad class of challenging data structures. Thus, S5 scales complex real-world programs to hundreds of cores, delivers additional order-of-magnitude speedups over T4, and outperforms manually parallelized code tuned by experts. Finally, ASH is an accelerator that demonstrates that the same approach can be applied with simpler mechanisms tailored for digital circuit simulation. A small ASH implementation is 32x faster than a large multicore CPU running a state-of-the-art parallel simulator.
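To make the "tiny tasks" idea in the abstract concrete, the sketch below is a purely hypothetical C++ illustration, not code from the thesis: it contrasts a sequential linked-list walk (one long dependence chain) with the same work decomposed into small tasks on an explicit worklist. The Task struct, worklist, and function names are invented for this sketch; in T4/S5 the decomposition is done by compiler transformations and the tasks are scheduled speculatively by hardware, not drained by a single software loop as here.

```cpp
// Hypothetical sketch: decomposing a sequential traversal into tiny tasks.
// Not the thesis's actual compiler output or hardware interface.
#include <cstdio>
#include <deque>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Sequential version: one long dependence chain, hard to parallelize directly.
long sum_sequential(Node* head) {
    long total = 0;
    for (Node* n = head; n != nullptr; n = n->next)
        total += n->value;
    return total;
}

// Task-based version: each iteration becomes a tiny task (tens of instructions).
// A hardware task scheduler could run independent tasks from such a queue in
// parallel; here a single loop drains the queue just to show the decomposition.
struct Task {
    Node* node;
};

long sum_as_tiny_tasks(Node* head) {
    std::deque<Task> worklist;
    if (head) worklist.push_back({head});

    long total = 0;
    while (!worklist.empty()) {
        Task t = worklist.front();
        worklist.pop_front();
        total += t.node->value;                   // the task's tiny body
        if (t.node->next)
            worklist.push_back({t.node->next});   // spawn the successor task
    }
    return total;
}

int main() {
    // Build a small 8-node list with values 1..8.
    std::vector<Node> nodes(8);
    for (int i = 0; i < 8; ++i) {
        nodes[i].value = i + 1;
        nodes[i].next = (i + 1 < 8) ? &nodes[i + 1] : nullptr;
    }
    std::printf("sequential: %ld\n", sum_sequential(&nodes[0]));
    std::printf("tiny tasks: %ld\n", sum_as_tiny_tasks(&nodes[0]));
    return 0;
}
```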
Date issued
2023-09
URI
https://hdl.handle.net/1721.1/164488
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
