
dc.contributor.advisor: Sanchez, Daniel
dc.contributor.author: Ying, Victor A.
dc.date.accessioned: 2026-01-12T19:40:00Z
dc.date.available: 2026-01-12T19:40:00Z
dc.date.issued: 2023-09
dc.date.submitted: 2023-09-21T14:26:39.410Z
dc.identifier.uri: https://hdl.handle.net/1721.1/164488
dc.description.abstract: Modern computer systems have hundreds of processor cores, so highly parallel programs are critical to achieving high performance. But parallel programming remains difficult on current systems, so many programs are still sequential. This dissertation presents new compilers and hardware architectures that can parallelize complex programs while retaining the simplicity of sequential code. Our new systems allow real-world programs to use hundreds of cores without burdening programmers with concurrency, deadlock, or data races.

This dissertation follows a novel approach that eliminates the burden of explicit parallel programming to make parallel execution pervasive. This approach relies on four guiding principles. First, exploiting implicit parallelism preserves the simplicity of sequential execution. Second, dividing computation into tiny tasks, as short as tens of instructions each, unlocks plentiful fine-grained parallelism in challenging programs. Hardware-compiler co-design techniques can create many tasks in parallel and reduce per-task overheads to make tiny tasks scale to many cores. Third, new hardware and software mechanisms can compose parallelism across entire programs, removing serializing barriers to overlap executions of nested parallel subroutines. Finally, exploiting static and dynamic information for data locality reduces data movement costs while maintaining load balance on large multicore systems.

This dissertation presents three systems that embody these four principles. First, T4 introduces automatic program transformations that exploit a novel hardware architecture to parallelize sequential programs. As a result, T4 scales hard-to-parallelize real-world programs to tens of cores, resulting in order-of-magnitude speedups. Second, S5 builds on T4 with novel transformations to remove needless serialization in a broad class of challenging data structures. Thus, S5 scales complex real-world programs to hundreds of cores, delivers additional order-of-magnitude speedups over T4, and outperforms manually parallelized code tuned by experts. Finally, ASH is an accelerator that demonstrates that the same approach can be applied with simpler mechanisms tailored for digital circuit simulation. A small ASH implementation is 32x faster than a large multicore CPU running a state-of-the-art parallel simulator.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Compiler-Hardware Co-Design for Pervasive Parallelization
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.orcid: https://orcid.org/0000-0001-9660-7082
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy

