Vector-thread architecture and implementation
Author(s)Krashinsky, Ronny (Ronny Meir), 1978-
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
MetadataShow full item record
This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors. The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow. The Scale VT architecture is an instantiation of the vector-thread paradigm designed for low-power and high-performance embedded systems. Scale includes a scalar RISC control processor and a four-lane vector-thread unit that can execute 16 operations per cycle and supports up to 128 simultaneously active virtual processor threads. Scale provides unit-stride and strided-segment vector loads and stores, and it implements cache refill/access decoupling. The Scale memory system includes a four-port, non-blocking, 32-way set-associative, 32 KB cache. A prototype Scale VT processor was implemented in 180 nm technology using an ASIC-style design flow. The chip has 7.1 million transistors and a core area of 16.6 mm2, and it runs at 260 MHz while consuming 0.4-1.1 W. This thesis evaluates Scale using a diverse selection of embedded benchmarks, including example kernels for image processing, audio processing, text and data processing, cryptography, network processing, and wireless communication.(cont.) Larger applications also include a JPEG image encoder and an IEEE 802.11 la wireless transmitter. Scale achieves high performance on a range of different types of codes, generally executing 3-11 compute operations per cycle. Unlike other architectures which improve performance at the expense of increased energy consumption, Scale is generally even more energy efficient than a scalar RISC processor.
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Includes bibliographical references (p. 181-186).
DepartmentMassachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.