
dc.contributor.advisor: Saman Amarasinghe
dc.contributor.author: Amarasinghe, Saman [en_US]
dc.contributor.author: Rabbah, Rodric [en_US]
dc.contributor.author: Larsen, Samuel [en_US]
dc.contributor.other: Computer Architecture [en]
dc.date.accessioned: 2009-12-18T19:30:12Z
dc.date.available: 2009-12-18T19:30:12Z
dc.date.issued: 2009-12-18
dc.identifier.uri: http://hdl.handle.net/1721.1/50235
dc.description.abstract: Multimedia extensions are nearly ubiquitous in today's general-purpose processors. These extensions consist primarily of a set of short-vector instructions that apply the same opcode to a vector of operands. Vector instructions introduce a data-parallel component to processors that exploit instruction-level parallelism, and present an opportunity for increased performance. In fact, ignoring a processor's vector opcodes can leave a significant portion of the available resources unused. In order for software developers to find short-vector instructions generally useful, however, the compiler must target these extensions with complete transparency and consistent performance. This paper describes selective vectorization, a technique for balancing computation across a processor's scalar and vector units. Current approaches for targeting short-vector instructions directly adopt vectorizing technology first developed for supercomputers. Traditional vectorization, however, can lead to a performance degradation since it fails to account for a processor's scalar resources. We formulate selective vectorization in the context of software pipelining. Our approach creates software pipelines with shorter initiation intervals, and therefore, higher performance. A key aspect of selective vectorization is its ability to manage transfer of operands between vector and scalar instructions. Even when operand transfer is expensive, our technique is sufficiently sophisticated to achieve significant performance gains. We evaluate selective vectorization on a set of SPEC FP benchmarks. On a realistic VLIW processor model, the approach achieves whole-program speedups of up to 1.35x over existing approaches. For individual loops, it provides speedups of up to 1.75x. [en_US]
dc.format.extent: 25 p. [en_US]
dc.relation.ispartofseries: MIT-CSAIL-TR-2009-064
dc.rights: Creative Commons Attribution 3.0 Unported [en]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/
dc.subject: SIMD [en_US]
dc.subject: Vectorization [en_US]
dc.subject: Compiler [en_US]
dc.title: Selective Vectorization for Short-Vector Instructions [en_US]
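
The abstract above frames selective vectorization as splitting a loop's work between a processor's scalar and vector units rather than vectorizing everything. The C sketch below is an illustration only, not code from the report: it hand-partitions a saxpy-style loop so that each iteration issues four elements as SSE vector operations and four as scalar operations, keeping both units busy. The fixed 4/4 split and the use of SSE intrinsics are assumptions made for this example; the report's technique chooses the partition automatically inside a software pipeliner and also models the cost of transferring operands between scalar and vector registers.

/* Illustrative sketch only (not from the report): a hand-made partial
 * vectorization of y[i] = a * x[i] + y[i]. Half of each iteration's work
 * goes to the SIMD unit, half stays on the scalar FPU. */
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stddef.h>

void saxpy_selective(float a, const float *x, float *y, size_t n)
{
    __m128 va = _mm_set1_ps(a);   /* broadcast the scalar a into a vector */
    size_t i = 0;

    /* Process 8 elements per iteration: 4 on the vector unit, 4 scalar. */
    for (; i + 8 <= n; i += 8) {
        /* Vector half: elements i .. i+3 via SSE multiply and add. */
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        _mm_storeu_ps(&y[i], _mm_add_ps(_mm_mul_ps(va, vx), vy));

        /* Scalar half: elements i+4 .. i+7 on the scalar FPU. */
        for (size_t j = i + 4; j < i + 8; j++)
            y[j] = a * x[j] + y[j];
    }

    /* Plain scalar epilogue for any remaining elements. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}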

