Improving the Performance of Parallel Loops in
OpenCilk
by
Luka Govedič
S.B. in Electrical Engineering and Computer Science
Massachusetts Institute of Technology(2022)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2023
© 2023 Luka Govedič. All rights reserved.
The author hereby grants to MIT a nonexclusive, worldwide, irrevocable,
royalty-free license to exercise any and all rights under copyright, including to
reproduce, preserve, distribute and publicly display copies of the thesis, or release
the thesis under an open-access license.
Authored by: Luka Govedič
Department of Electrical Engineering and Computer Science
May 19, 2023
Certified by: Tao B. Schardl
Research Scientist
Thesis Supervisor
Accepted by: Katrina LaCurts
Chair, Master of Engineering Thesis Committee
2
Improving the Performance of Parallel Loops in OpenCilk
by
Luka Govedič
Submitted to the Department of Electrical Engineering and Computer Science
on May 19, 2023, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
For good performance, parallel loop scheduling must achieve low scheduling overheads
and multidimensional locality in nested loops. This thesis explores both challenges
and contributes an extension to randomized work-stealing for first-class loop support
that reduces scheduling overheads.
Randomized work-stealing schedulers traditionally execute parallel-for loops us-
ing parallel divide-and-conquer recursion, which is theoretically efficient and scalable
but can incur substantial overheads in practice. This thesis extends randomized
work-stealing with a custom work-stealing protocol called on-the-fly loop splitting. I
introduce loop frames to make work stealing on parallel-for loops more efficient and
flexible.
Loop frames make two key changes to work stealing for parallel-for loops. First,
loop frames extend work stealing by directly encoding information about intervals of
loop iterations in the runtime. Loop frames add first-class support to work stealing for
parallel-for loops that composes with classical randomized work stealing. Second, loop
frames allow intervals of loop iterations to be split on-the-fly, such that worker threads
attempt to steal half of the unexecuted loop iterations rather than a deterministically
constructed partition of loop iterations. On-the-fly loop splitting allows for more
flexible dynamic load balancing of loop iterations while keeping the work overheads
low and maintaining the theoretical efficiency of divide-and-conquer.
I evaluate loop frames in practice by implementing loop frames in the OpenCilk
runtime system. In particular, loop frames augment the THE protocol from Cilk
to coordinate updates to loop frames. I observe that loop frames and on-the-fly
loop splitting incur substantially less overhead than the divide-and-conquer algorithm
without sacrificing parallel scalability.
Finally, I study the impacts of increased locality in more than one dimension in
nested loop applications. Results show that both cache-aware and cache-oblivious
reordering of nested loop iterations can result in performance benefits up to a factor
of 1.7×.
Thesis Supervisor: Tao B. Schardl
3
Title: Research Scientist
4
Acknowledgments
While I received support from many, no one was as instrumental during this research
as Dr. Tao B. Schardl. I am extremely grateful for his continued guidance and
countless insightful discussions we had over the course of over three years. He helped
ignite my passion for performance engineering research and was indispensable in my
becoming the researcher I am today.
I would also like to express my gratitude to Prof. Charles Leiserson for supporting
me in every aspect of my academic and professional career and being a limitless source
of advice. I am thankful to all other Supertech research group members and Prof.
Rezaul Chowdhury for their repeated feedback and interest in my project.
I cannot thank Sam D’Alonzo enough for being a constant source of support. I
want to thank all of my friends and communities I have been a part of, and my family
for helping this dream happen in the first place.
I would also like to acknowledge that I used ChatGPT to proofread parts of my
thesis.
This research was sponsored in part by the United States Air Force Research
Laboratory under Cooperative Agreement Number FA8750-19-2-1000. The views
and conclusions contained in this document are those of the author and should not
be interpreted as representing the official policies, either expressed or implied, of the
United States Air Force or the U.S. Government. The U.S. Government is authorized
to reproduce and distribute reprints for Government purposes notwithstanding any
copyright notation herein.
5
6
Contents
1 Introduction 15
2 Background 23
2.1 The dag model of multithreading . . . . . . . . . . . . . . . . . . . . 23
2.2 Randomized work stealing . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Cache-oblivious and cache-aware algorithms . . . . . . . . . . . . . . 25
3 Loop frames 27
3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Synchronization protocol . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 Empirical evaluation of loop frames 35
4.1 Microbenchmark experiments . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Application experiments . . . . . . . . . . . . . . . . . . . . . . . . . 37
5 Study of improved locality in nested loop applications 41
5.1 Locality in multidimensional loops . . . . . . . . . . . . . . . . . . . . 42
5.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3 Benchmarks and performance results . . . . . . . . . . . . . . . . . . 46
5.3.1 Matrix multiply . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.2 Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.3 Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7
6 Related work 59
7 Conclusion 61
8
List of Figures
1-1 A DAG for a simple fork-join program and the corresponding pseu-
docode. In this example, there are 4 serial tasks, A-D, that need to
be executed. A needs to be executed before B and C (which can be
executed in parallel). Both B and C need to finish before D. In the
pseudocode, A is called first. Then, B is spawned, meaning C can exe-
cute before B is done. Sync on line 4 forces both B and C to complete
before the execution moves on to D. . . . . . . . . . . . . . . . . . . . 16
1-2 Pseudocode for the typical divide-and-conquer implementation of a
parallel-for loop. The function Body encodes the body of the parallel-
for loop. The spawn and sync keywords allow for parallel execution
of the operations in the function. The constant G is used to coarsen
the recursion to mitigate performance overheads. . . . . . . . . . . . . 19
3-1 Pseudocode for loop frame operations. LFParFor is used by the
worker that enters the loop frame, while SimpleSteal is used by the
thief when stealing a loop frame from a victim. LFParForHelper is
used by both to actually execute all iterations and contains a familiar
branch, body, and increment that can be found in all for loops. Next
returns true if there are more iterations available, and false otherwise. 28
9
3-2 Pseudocode for the optimized synchronization protocol to coordinate
operations on loop frames. The argument 𝐷 denotes a victim’s deque.
The functions lockOwnDeque and unlockOwnDeque acquire
and release the lock on the executing worker’s deque. The function
isLoopFrame tests if the given frame is a loop frame. . . . . . . . . 31
4-1 Median speedup with respect to number of processors for the daxpy,
nqueens, and mandelbrot, appearing from left to right. Each plot
features loop frames in orange squares and the divide-and-conquer al-
gorithm in yellow triangles. The daxpy plot also includes the results
of the simpler protocol for loop frames (briefly mentioned in Chapter
3) in red diamonds. The nqueens plot includes in blue crosses an al-
ternative implementation of the parallel-for loop where each iteration
is spawned off in sequence. . . . . . . . . . . . . . . . . . . . . . . . . 36
5-1 Regular (left), tiled (middle), and Morton (right) iteration orders for
a two-dimensional iteration space. In this case, 𝑁 = 16 and 𝐺 = 4. . 45
5-2 Simplified nested loop implementation of matrix multiply. The outer
two of the three loops in the nest are parallel. . . . . . . . . . . . . . 48
5-3 Speedup of mm implementations with respect to grainsize, run on a
single worker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5-4 Speedup of mm implementations with respect to grainsize, run on 24
workers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5-5 Parallel speedup of mm implementations for 𝐺 = 128. . . . . . . . . . 51
5-6 Simplified nested loop implementation of blur. blurPixel performs
the "blurring" of a pixel and is factored out for better readability. . . 52
5-7 Speedup of blur implementations with respect to grainsize run on a
single worker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5-8 Parallel speedup of blur implementations for 𝐺 = 1. . . . . . . . . . 54
5-9 Simplified nested loop implementation of transpose. . . . . . . . . . 55
10
5-10 Speedup of transpose implementations with respect to grainsize, run
on 24 workers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
11
12
List of Tables
4.1 Comparison of running times of programs from PBBS [53] with grain-
sizes 64, 256, and 2048. Each set of rows contains median running times
of the benchmarks run on 1 or 20 workers using DACParFor (DAC)
and loop frames (LF), as well as the ratio (Ratio) of the DACParFor
running time divided by the loop-frame running times. A ratio greater
than 1 indicates that DAC is slower than LF. The running times in
italic represent the fastest run for that benchmark and worker count. 39
5.1 Benchmarks used in the study and their properties. Loop nest rank
shows the number of nested loops, which corresponds to the number
of dimensions in the iteration space, with the number in parenthesis
indicating how many of those loops are parallel loops. Total distinct
reads and total reads demonstrate whether there is reuse. . . . . . . . 42
5.2 Implementations of nested loops used in the study. Base case size
reduces both the recursive call overhead and parallelism. While neither
<name> nor <name>-tiled use recursion, <name>-tiled still has a base
case, which is one tile. <name>-best and <name>-dac-full only apply
to mm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
13
14
Chapter 1
Introduction
A parallel loop is a control-flow construct that contains a sequence of independent
iterations, which are allowed to be executed concurrently. Parallel loops are common
in high-performance parallel programs, and their performance is crucial for the overall
performance of parallel code. This thesis tackles two main performance challenges of
parallel loops: scheduling overheads and multidimensional locality in nested loops.
Parallel computing is the future of faster code. Multicore computers started ap-
pearing around 2005 [39], when clock speeds plateaued, limiting the speedup of serial
code and forcing programmers to turn to parallel computing. However, running code
in parallel is harder due to race conditions and other issues that occur due to con-
currency. Ideally, parallel programs are desired to achieve perfect linear speedup,
meaning the program runs 𝑃 -times faster on 𝑃 processors (compared to running on a
single processor). In reality, however, this isn’t possible due to scheduling overheads.
Fork-join parallelism is a flexible task-based approach to parallel computing.
Each program is modeled as a directed acyclic graph (dag), where nodes represent
fine-grained indivisible tasks and edges represent dependencies between them. The
fork-join paradigm supports two main constructs: spawn/sync and parallel-for
loops. Spawn and sync are demonstrated in Figure 1-1. Similar to how B and C
in Figure 1-1 can execute in parallel, parallel-for loops allow all iterations to execute
in parallel. Fork-join parallelism is task-based because the programmer only has to
specify tasks and their dependencies, while the scheduling decisions are left to the
15
ForkJoinDemo()
1 A();
2 spawn B();
3 C();
4 sync
5 D();
Figure 1-1: A DAG for a simple fork-join program and the corresponding pseudocode.
In this example, there are 4 serial tasks, A-D, that need to be executed. A needs to
be executed before B and C (which can be executed in parallel). Both B and C need
to finish before D. In the pseudocode, A is called first. Then, B is spawned, meaning
C can execute before B is done. Sync on line 4 forces both B and C to complete
before the execution moves on to D.
runtime system at the time of execution.
Randomized work stealing [5, 15, 20, 22, 23, 26, 31, 32, 33, 44, 58, 13, 25]
is a popular (and asymptotically optimal) scheduling and load-balancing algorithm
for parallel fork-join programs. Typically, a randomized work-stealing scheduler
will implement parallel-for loops using recursive divide-and-conquer , where the
iteration space is recursively split into halves. Although efficient in theory [41, 13, 5],
this approach can exhibit high overheads in practice. Those overheads are made
worse by representing loops with other fundamental fork-join building blocks instead
of treating them as first-class parallel constructs.
Multi-dimensional loops present additional challenges to efficient scheduling. Xu
[59] argues for a locality-first approach to algorithm design and shows that locality
can be more important for good parallel performance than parallelism. In the case of
a multidimensional iteration space, good locality across all (or at least multiple) di-
mensions is often required for fast execution, which can be achieved with a reordering
of iterations to achieve better cache efficiency.
Two common approaches to achieving cache efficiency are cache-oblivious and
cache-aware algorithms. Cache-aware algorithms achieve good cache locality by
taking cache size into account, while cache-oblivious algorithms achieve good cache
locality without any explicit knowledge of cache parameters. For multidimensional
16
iteration spaces, loop tiling and recursive divide-and-conquer are cache-aware
and cache-oblivious approaches to improving cache efficiency by reordering loop iter-
ations.
When programmers implement these cache-efficient approaches manually, they
tightly couple the algorithm to the implementation. That coupling essentially elimi-
nates opportunities for other optimizations that operate on loops, e.g. vectorization
and Loop-Invariant Code Motion, which avoids recomputation inside the loop by ex-
ecuting loop-invariant code before the loop. Instead, if an automated system, such
as a compiler or a runtime system, was capable of transforming nested parallel loops
into an implementation with the desired iteration order, programmers could maintain
the simplicity of a nested loop implementation while also reaping the benefits of the
increased locality. Because the loops are parallel, the compiler is allowed to reorder
iterations without expensive aliasing and dependency analysis.
In this thesis, I tackle these two challenges of scheduling parallel loops. To reduce
scheduling overheads of parallel loops, I introduce loop frames . Loop frames extend
work stealing by adding first-class support for parallel-for loops. To improve the
performance of nested parallel loops, I study the impacts of improved locality along
multiple dimensions on cache locality and overall performance for a few benchmarks.
Unlike the divide-and-conquer approach, where loops are implemented using exist-
ing fork-join constructs, loop frames give loops a special representation in the runtime,
allowing them to optimize work stealing of parallel-for-loop iterations. Loop frames
support on-the-fly loop splitting , where intervals of loop iterations are dynami-
cally split based on the unexecuted iterations when the split occurs. On-the-fly loop
splitting allows unexecuted computation to be dynamically distributed more evenly
than the divide-and-conquer algorithm, but it introduces non-determinism that com-
plicates the theoretical analysis of work stealing [13, 5]. This thesis evaluates loop
frames in practice and observes that loop frames incur substantially less overhead
than the divide-and-conquer algorithm across the board, showing improvements in
serial and parallel running times.
I report on a study that examines the impact of improved locality in more than one
17
dimension of a multidimensional iteration space. I evaluate both cache-oblivious
and cache-aware implementations. The results show that improved locality along
multiple dimensions increases cache efficiency and boosts performance. That moti-
vates a compiler transformation that automatically reorders nested loop iterations to
achieve good locality along all dimensions.
Divide-and-conquer parallel-for loops
To motivate loop frames, let us first examine the overheads in the traditional divide-
and-conquer implementation of parallel-for loops. Figure 1-2 presents pseudocode for
DACParFor, a typical implementation of a parallel-for loop [41, Sec. 8.3]. More
specifically, a parallel-for loop over iterations {0, 1, . . . , 𝑛− 1} is compiled to a call to
DACParFor(0, 𝑛,Body), where the function Body encodes the loop body. As
Figure 1-2 shows, DACParFor uses the parallel keywords spawn and sync to
process the loop iterations using parallel divide-and-conquer recursion. DACParFor
starts by splitting the loop into halves, until less than 𝐺 iterations are left in a single
batch. Here, 𝐺 is a constant used to coarsen the recursion. The spawn keyword
on line 3 allows the recursive call to execute iterations {𝑠, 𝑠 + 1, . . . ,𝑚 − 1} to run
in parallel with the continuation in the parent caller, which executes iterations
{𝑚,𝑚+1, . . . , 𝑒− 1}. The sync statement on line 7 acts as a local barrier that joins
together the child computations spawned within the function.
A work-span analysis [18, Ch. 27] of DACParFor shows that this algorithm
is theoretically efficient. Consider the execution of DACParFor(0, 𝑛,Body), and
for didactic simplicity, suppose that the every execution of Body performs Θ(1)
instructions. The work 𝑇1 of this execution — the total number of instructions
executed — is Θ(𝑛). The span 𝑇∞ of this execution — the length of a longest path
of dependencies — is Θ(lg 𝑛+𝐺). The parallelism of the execution, which bounds
the maximum possible speedup on parallel processors, is the ratio of the work divided
by the span, that is, Θ(𝑛/(lg 𝑛+𝐺)). Randomized work stealing bounds the execution
time of a program on 𝑃 processors by 𝑇𝑃 ≤ 𝑇1/𝑃 +𝑂(𝑇∞) [13, 5].
18
DACParFor(𝑠, 𝑒,Body)
1 while 𝑒− 𝑠 > G
2 𝑚 = (𝑠+ 𝑒)/2
3 spawn DACParFor(𝑠,𝑚,Body)
4 𝑠 = 𝑚
5 for 𝑖 ∈ {𝑠, 𝑠+ 1, . . . , 𝑒− 1}
6 Body(𝑖)
7 sync
Figure 1-2: Pseudocode for the typical divide-and-conquer implementation of a
parallel-for loop. The function Body encodes the body of the parallel-for loop.
The spawn and sync keywords allow for parallel execution of the operations in the
function. The constant G is used to coarsen the recursion to mitigate performance
overheads.
Figure 1-2 also illustrates how parallel-for loops can exhibit substantial overheads
in practice. As Figure 1-2 shows, a call to DACParFor(0, 𝑛,Body) spawns Θ(𝑛)
function calls. Each spawned function call incurs overheads due to the function
calls themselves and due to the operations each spawn performs to enable parallel
execution. These overheads are particularly large in comparison to an ordinary serial
loop, which typically performs just a few operations on registers per iteration to
control the loop’s execution. For example, I measured these overheads on a simple
daxpy microbenchmark, which computes 𝑦[𝑖] += 𝑎 ·𝑥[𝑖] over all elements of two given
arrays 𝑥 and 𝑦 and scalar value 𝑎. On 1 processor, daxpy ran over 25× slower when
implemented using DACParFor with 𝐺 = 1 compared to an ordinary serial-loop
implementation with identical compiler optimizations.
Previous research has examined ways to improve the performance of parallel-for
loops. Typically, the overheads of DACParFor can be mitigated by increasing
the constant 𝐺 for coarsening the recursion. Cilk, for example, typically sets 𝐺 =
min{2048, 𝑛/8𝑃}, where 𝑛 is the number of loop iterations and 𝑃 is the number
of worker threads [30]. But increasing 𝐺 reduces the parallelism of DACParFor,
reducing the maximum possible parallel speedup the loop can achieve. As Section 6
discusses, previous work has explored many other approaches to optimizing parallel
loops, including static and dynamic schemes to partition loop iterations [48, 28, 57,
19
40, 27, 7, 8], strategies to exploit common memory access patterns in loops [54], and
techniques to reduce the synchronization overheads of work stealing [55, 56, 2].
Loop frames
Loop frames provide an alternative to DACParFor to load balance parallel-for loops
dynamically using work stealing. Loop frames aim to address the shortcomings of
DACParFor while providing the same theoretical performance guarantees. Loop
frames introduce a simple data structure and work-stealing protocol to randomized
work-stealing. Rather than build upon spawn and sync constructs, the loop-frame
data structure concisely represents an interval of loop iterations using two integers.
In addition, the loop-frame protocol supports on-the-fly loop splitting, which allows
a processor to steal half of the frame’s unexecuted loop iterations. In contrast, DAC-
ParFor splits the 𝑛 iterations of a parallel-for loop at deterministic points, i.e., at
iterations 𝑛/2, 𝑛/4, 3𝑛/4, etc., as Figure 1-2 shows. Loop frames compose with tradi-
tional randomized work stealing and serve as a drop-in replacement for DACParFor
in a work-stealing scheduler.
On-the-fly loop splitting introduces non-determinism into the scheduling of
parallel-for loops, complicating their analysis. Regardless, [47] shows that, despite
their non-deterministic behavior, randomized work stealing with loop frames achieves
the same theoretical performance guarantees as DACParFor.
Implementation of loop frames
I implemented loop frames in a beta version of the OpenCilk runtime system [51].
As Chapter 3 describes, my implementation extends the THE protocol from Cilk [25]
to synchronize operations on loop frames efficiently. I compare the empirical perfor-
mance of loop frames versus the traditional divide-and-conquer algorithm on 3 mi-
crobenchmarks and 14 application benchmarks from the Problem-Based Benchmark
Suite [10]. I evaluated loop frames and the divide-and-conquer algorithm with vari-
20
ous coarsening values 𝐺. For the divide-and-conquer algorithm, 𝐺 is used to coarsen
the recursion as shown in Figure 1-2. For loop frames, 𝐺 is used to strip-mine the
loop, which batches iterations into groups of size 𝐺, thereby amortizing overheads
associated with loop frames. Empirical results show that loop frames significantly
reduce scheduling overheads of DACParFor, reducing the coarsening parameter 𝐺
required for the best performance.
Study of locality benefits in multidimensional loops
To explore the benefits of increased locality of multidimensional iteration spaces, I
performed a study using various implementations of simple benchmarks containing
nested loops. The study focused on cache-oblivious and cache-aware approaches,
comparing them to a naive nested loop implementation and analyzing their perfor-
mance. The benchmarks used include mm, blur, and transpose. All benchmarks
exhibit better cache locality when locality along multiple dimensions is improved.
The implementations used achieve the regular, tiled (cache-aware), and multidimen-
sional divide-and-conquer (cache-oblivious) iteration orders with a mix of recursive
and iterative approaches. To eliminate the effects of recursive overheads, the study
analyzes the effects of recursive coarsening. Results show that improved locality along
more than one dimension improves performance for two of the three benchmarks.
Contributions and outline
This thesis makes the following contributions.
• Loop frames, which extend randomized work stealing with first-class support for
parallel-for loops. They support on-the-fly loop splitting, which provides flexible
dynamic splitting of loop iterations and allows for scheduling and load balancing
with significantly less overhead than the traditional divide-and-conquer algorithm.
Finally, loop frames are fully compatible with an existing fork-join scheduler.
21
• An efficient implementation of loop frames based on the OpenCilk runtime sys-
tem [51]. This implementation extends the THE protocol from Cilk [25] to coordi-
nate parallel operations on loop frames efficiently.
• An empirical evaluation of loop frames. I observe that on microbenchmarks and
application benchmarks, loop frames and on-the-fly loop splitting incur substan-
tially less overhead than the divide-and-conquer algorithm without sacrificing par-
allel scalability.
• A study on the impacts of reordering iterations of a multidimensional iteration space
on cache locality and overall performance. The study measures the performance of
multiple benchmarks and various implementations that achieve different iteration
orders. It finds that for two out of the three benchmarks, reordering iterations can
result in a significant reduction in cache misses and an improvement in bottom-line
performance.
The rest of this thesis is organized as follows. Chapter 2 reviews the dag model of
multithreading, randomized work stealing, and cache-efficient algorithms. Chapter 3
describes loop frames and their custom stealing and synchronization protocols used
in the OpenCilk implementation. Chapter 4 presents the empirical evaluation of
loop frames. Chapter 5 presents the study of improved locality in multidimensional
iteration spaces. Chapter 6 discusses related work on optimizing work stealing and
parallel loops. Finally, Chapter 7 offers concluding remarks.
22
Chapter 2
Background
This chapter reviews the dag model of multithreading [12, 13], which represents the
execution of a recursive fork-join program and classical randomized work stealing.
This chapter also reviews cache-oblivious algorithms and recursive divide-and-conquer
in multiple dimensions.
2.1 The dag model of multithreading
An execution dag 𝐺 = (𝑉,𝐸) represents the execution of a recursive fork-join
program, where each vertex 𝑥 ∈ 𝑉 represents a strand — a sequence of serially
executed instructions containing no parallel control — and edges represent control
dependencies between strands. The execution of a spawn terminates the current
strand and produces a vertex with out-degree 2. A sync creates a vertex with in-
degree greater than 1. The execution dag 𝐺 is a series-parallel dag [21] with a single
source vertex and a single sink vertex. To simplify the analysis, this thesis shall
assume that each strand represents a single executed instruction.
The work and span of the execution of a program can be measured in terms of its
execution dag 𝐺. The work of 𝐺 is the total number of strands in 𝐺. The span of 𝐺
is the length of the longest path of dependencies in 𝐺.
Task-parallel programming platforms that support the dag-parallel model include
dialects of Cilk [11, 25, 19, 38, 36, 29], Fortress [4], Habanero [9], Habanero-Java [16],
23
Hood [14], HotSLAW [42], Java Fork/Join Framework [35], OpenMP [46, 6], Task
Parallel Library [37], Threading Building Blocks (TBB) [49] and X10 [17].
2.2 Randomized work stealing
The operation of a randomized work-stealing scheduler can be described in terms of
the execution dag 𝐺 = (𝑉,𝐸). A randomized work-stealing scheduler dynamically
load balances a parallel computation across available threads, called workers . At
each time step, each worker 𝑝 maintains an assigned strand , which is the strand
that 𝑝 executes on that time step. A strand 𝑠 is said to be ready if all its predecessors
in 𝐺 have been completed. Executing an assigned strand can enable a strand 𝑠′ that
is a direct successor of 𝑠 in 𝐺 by making 𝑠′ ready. Each worker maintains a deque –
a double-ended queue – of ready strands. Typically, a worker operates on its deque
like a stack, pushing and popping strands from the tail of its deque. When a worker
runs out of strands on its deque, it becomes a thief and randomly chooses another
worker as its victim . If the selected victim has excess strands on its deque, then the
thief can steal a strand from the head of the victim’s deque.
A classical randomized work-stealing scheduler keeps track of a single ready strand
using a spawn frame . Hence, the deque of ready strands is stored as a deque of
spawn frames. Each spawn frame maintains the necessary information for a thief to
begin executing that strand, e. g. the code address and register state to resume
execution of the strand. When a thief steals from a victim, it removes the spawn
frame at the head of the victim’s deque and assigns the stolen frame to itself. The
extended deque is defined to include the assigned strand along with the other spawn
frames in the deque.
Worker deques support three methods: pushBottom, popBottom, and pop-
Top. A spawn statement causes a worker to execute pushBottom to push its
current frame onto the bottom of the deque. A worker executes popBottom when
it returns from a spawned function. If popBottom runs on an empty deque, then
popBottom does not return but instead causes the worker to become a thief. A
24
thief steals a spawn frame by calling popTop to remove the frame at the top of the
deque.
2.3 Cache-oblivious and cache-aware algorithms
A cache-oblivious algorithm [24] is an algorithm that achieves good cache locality
without explicit knowledge of the cache parameters 𝐵 and 𝑀 , cache line length and
cache size, respectively. In contrast, cache-aware algorithms explicitly use those
cache parameters to achieve good cache-locality.
Cache-oblivious algorithms have many advantages over cache-aware algorithms.
Because they do not require precise tuning to the hardware they execute on, they are
very portable. Obliviousness to cache parameters also makes them perform well on
more complex memory hierarchies like multi-level caches and heterogeneous hardware.
Conversely, cache-aware algorithms must be explicitly tuned for each level of the cache
hierarchy.
Optimal cache-oblivious algorithms achieve the lower asymptotic bound of cache
misses. Several problems, including matrix multiplication, matrix transposition, and
sorting, have known optimal cache-oblivious algorithms. In contrast, others, like the
Cooley-Tukey FFT, are optimally cache-oblivious with specific parameter choices.
These algorithms may require fine-tuning on particular machines, but the overall
goal is to minimize the extent of tuning necessary.
Typically, a cache-oblivious algorithm employs a recursive divide-and-conquer ap-
proach, dividing a problem into smaller subproblems until the subproblem fits into the
cache, regardless of the cache size. Hybrid algorithms may also combine cache-aware
and cache-oblivious algorithms for specific cache sizes.
One cache-aware alternative to nested loops in a multidimensional iteration space
where iterations can be reordered is loop tiling [34]. Instead of iterating from start
to finish for each dimension, the iteration space is divided into tiles , which are mul-
tidimensional sub-parts of the iteration space. The size of each tile is tuned so that
each data accessed in a tile fits in cache. In serial execution, each tile is processed as a
25
whole before starting the next. Most parallel executions allow tiles to run in parallel
while each tile runs serially, which provides coarsening and guarantees the whole tile
is executed on a single worker. Multiple levels of tiling are required to exploit all
cache levels in a multi-level cache hierarchy.
The cache-oblivious replacement for loop tiling is multidimensional recursive
divide-and-conquer . It performs a series of splits, always by the largest dimension,
and recursively visits both halves. This approach will traverse the iterations in the
Morton order [43], sometimes referred to as the z-order.
26
Chapter 3
Loop frames
This chapter describes loop frames, which extend randomized work stealing to provide
first-class support for parallel loops in the runtime. Section 3.1 describes loop frames
and how they support on-the-fly loop splitting in a randomized work-stealing sched-
uler. Section 3.2 outlines the detailed synchronization protocol between the victim
and the thieves that maintains correctness during concurrent execution.
I present loop frames for parallel-for loops without any coarsening (i.e. executing
chunks of iterations at a time to reduce the scheduling overhead) for simplicity. In
the implementation, coarsening is implemented via loop strip-mining, where the loop
is divided into chunks and a remainder. Each chunk is then executed as a whole, and
the remainder is executed at the end.
3.1 Implementation
A loop frame is an extension of a spawn frame. All spawn frames contain the execution
context, which assists workers in pausing and resuming different sub-parts of the
computation. In addition to that, loop frames store a range of loop iterations that
remain to be executed and additional flags that assist synchronization. Just like
spawn frames, loop frames are pushed onto the deque and popped from it by workers
and thieves.
A loop frame extends a spawn frame with start and end variables, which encode
27
LFParFor(𝑛,Body) StealLF_simple(𝐹 )
1 𝐹 = LoopFrame(1, 𝑛,Body) 1 𝐹 ′ = CopyLoopFrame(𝐹 )
2 pushBottom(𝐹 ) 2 𝑚 = (𝐹 ′.start + 𝐹 ′.end)/2
3 LFParForHelper(𝐹, 0) 3 𝐹.end = 𝑚
4 popBottom() 4 𝐹 ′.start = 𝑚+ 1
5 pushBottom(𝐹 ′)
LFParForHelper(𝐹, 𝑖) 6 LFParForHelper(𝐹,𝑚)
1 while Next(𝐹 ) 7 popBottom()
2 𝐹.Body(𝑖)
3 𝑖 ++
Figure 3-1: Pseudocode for loop frame operations. LFParFor is used by the worker
that enters the loop frame, while SimpleSteal is used by the thief when stealing a
loop frame from a victim. LFParForHelper is used by both to actually execute
all iterations and contains a familiar branch, body, and increment that can be found
in all for loops. Next returns true if there are more iterations available, and false
otherwise.
the parallel loop iterations in the half-open interval [start , end) = {start , start +
1, . . . , end − 1}. At any time during the execution, start and end represent the itera-
tions that haven’t been executed yet. While the divide-and-conquer implementation
will push and pop spawn frames onto the worker’s ready deque while executing the
iterations, LFParFor only operates on the start and end variables, reducing the
scheduling overhead per iteration.
Figure 3-1 contains pseudocode for the operations on a loop frame. Here, I assume
that the Next function and lines 1–3 in StealLF_simple operate atomically. The
detailed implementation of the protocol that provides these atomicity guarantees is
presented in Section 3.2.
Let us walk through the pseudocode in Figure 3-1. A parallel-for loop over 𝑛
iterations is implemented using a call to LFParFor(𝑛,Body) followed by a sync,
where the function Body encodes the body of the parallel-for loop. The sync ensures
that all workers executing iterations of the loop have completed. When a worker
executes LFParFor(𝑛,Body), line 1 first creates a new loop frame 𝐹 representing
iterations [1, 𝑛). Line 2 pushes 𝐹 onto the worker’s deque, and then line 3 calls
LFParForHelper to direct the worker to start executing the loop from iteration 0.
28
These lines thus assign iteration 0 to the worker and allow iterations [1, 𝑛) to be
stolen. Upon returning from LFParForHelper, line 4 pops 𝑓 off the bottom of the
worker’s deque or, if the pop fails (because the loop frame has been stolen completely),
causes the worker to become a thief.
The function LFParForHelper(𝐹, 𝑖) causes the worker to execute loop itera-
tions starting from 𝑖 = 𝐹.start − 1 one at a time until reaching iteration 𝐹.end . In
particular, line 1 calls Next(𝐹 ) to check if there are more loop iterations to exe-
cute. Next(𝐹 ) increments 𝐹.start and returns 𝐹𝑎𝑙𝑠𝑒 if 𝐹.start > 𝐹.end , or 𝑇𝑟𝑢𝑒
otherwise.
To steal from the loop frame 𝐹 using on-the-fly loop splitting, a thief calls the func-
tion StealLF_simple(𝐹 ), which operates as follows. Line 1 first produces a copy 𝐹 ′
of 𝐹 , and then line 2 computes the midpoint 𝑚 between 𝐹 ′.start and 𝐹 ′.end . Line 3
then performs on-the-fly loop splitting: The thief claims the unexecuted iterations
[𝑚,𝐹 ′.end) by setting 𝐹.end = 𝑚, thereby stealing half of the unexecuted itera-
tions. Finally, like an ordinary worker, the thief pushes 𝐹 ′ representing the iterations
[𝑚+ 1, 𝐹 ′.end) onto its deque (lines 4 and 5) and then calls LFParForHelper on
line 6 to start executing its loop iterations from iteration 𝑚. Finally, line 7 pops 𝐹 ′
off the bottom of the thief’s deque and allows the thief to resume work stealing.
3.2 Synchronization protocol
This section describes the synchronization protocol used in the OpenCilk runtime sys-
tem [51] implementation of loop frames. Section 3.1 described loop frames assuming
that two functions, Next and StealLF, execute atomically. This section describes
how these functions implement an efficient protocol, based on the THE protocol in
Cilk [25], to synchronize updates to loop frames.
Let us first review the THE protocol for coordinating accesses onto worker de-
ques [25]. Each deque maintains a head pointer 𝐻 and a tail pointer 𝑇 to track the
head and tail of the deque, as well as a mutex lock 𝐿. Workers normally push and pop
frames onto the deque by updating 𝑇 . A thief can pop a frame from the top of the
29
deque by incrementing 𝐻. To optimize deque operations, the THE protocol allows the
worker to speculatively update 𝑇 , and only requires the worker to acquire 𝐿 when the
deque appears to be empty. After acquiring 𝐿, the worker can determine whether the
deque was really empty or there was an uncompleted steal attempt in progress that
has since failed. A thief, meanwhile, always acquires 𝐿 before updating 𝐻, ensuring
a single steal attempt happening at a time and providing a correctness guarantee for
the worker’s second, protected check.
To ensure the atomicity of Next and StealLF, a simple but inefficient approach
is to give a thief exclusive access to a victim’s loop frame 𝐹 during StealLF. In-
tuitively, after locking the deque, the thief would increment 𝐻, update 𝐹.end , then
decrement H and release the deque lock. The Next function, meanwhile, would
decrement 𝑇 before attempting to update 𝐹.start . If the worker discovered 𝐻 > 𝑇
after decrementing 𝑇 , it would acquire the deque lock before completing its update
to 𝐹.start . Although this approach is simple, it turns out to be inefficient, as the
worker will push and pop frames for every iteration.
Figure 3-2 presents pseudocode for the optimized synchronization protocol on loop
frames, which I found to be 40% faster than the simple protocol. Conceptually, rather
than use 𝐻 and 𝑇 to coordinate all operations on loop frames, the optimized protocol
operates on the start and end variables of a loop frame directly, similarly to how the
THE protocol operates on 𝐻 and 𝑇 . The Next function implements the routine
used in LFParForHelper for a worker to get the next loop iteration to execute
from a loop frame 𝐹 . A thief steals from a deque 𝐷 by calling Steal(𝐷), which calls
StealLF(𝐹,𝐷) on line 8 if line 7 discovers that the frame 𝐹 at the top of 𝐷 is a
loop frame. The Steal function thus extends the steal function in Cilk-5 [25] to
handle loop frames.
Let us walk through the pseudocode in Figure 3-2, starting with Next. Line 1
gets the next iteration index 𝑖 to execute and speculatively increments 𝐹.start . Line 2
checks if the loop frame is out of iterations by checking whether the new value of
𝐹.start exceeds 𝐹.end . If the test fails, the worker returns 𝑇𝑟𝑢𝑒 (line 8). If the test
succeeds, then the worker locks its own deque (line 3) to gain exclusive access to
30
Next(𝐹 ) StealLF(𝐹,𝐷)
1 𝑖 = 𝐹.start ++ 1 𝐹 ′ = CopyLoopFrame(𝐹 )
2 if 𝐹.start > 𝐹.end 2 𝑚 = (𝐹.start + 𝐹.end)/2
3 lockOwnDeque() 3 𝐹.end = 𝑚
4 if 𝐹.start > 𝐹.end 4 if 𝐹.start > 𝐹.end
5 unlockOwnDeque() 5 𝐹.end = 𝐹 ′.end
6 return 𝐹𝑎𝑙𝑠𝑒 6 𝐻 −−
7 unlockOwnDeque() 7 unlock(𝐷.L)
8 return 𝑇𝑟𝑢𝑒 8 return FAIL
9 if 𝐹.start < 𝐹.end
Steal(𝐷) 10 𝐻 −−
1 lock(𝐷.L) 11 unlock(𝐷.L)
2 𝐹 = 𝐻 ++ 12 𝐹
′.start = 𝑚
3 if 𝐻 > 𝑇 13 return SUCCESS
4 𝐻 −−
5 unlock(𝐷.L)
6 return FAIL
7 if isLoopFrame(𝐹 )
8 return StealLF(𝐹 )
9 unlock(𝐷.L)
10 return SUCCESS
Figure 3-2: Pseudocode for the optimized synchronization protocol to coordinate op-
erations on loop frames. The argument 𝐷 denotes a victim’s deque. The functions
lockOwnDeque and unlockOwnDeque acquire and release the lock on the exe-
cuting worker’s deque. The function isLoopFrame tests if the given frame is a loop
frame.
𝐹 . The worker rechecks if 𝐹.start > 𝐹.end on line 4 and then returns either 𝐹𝑎𝑙𝑠𝑒
(line 6), if the loop frame is out of iterations, or 𝑇𝑟𝑢𝑒 otherwise (line 8). In either
case, the worker unlocks its deque before returning.
Function StealLF(𝐹,𝐷) in Figure 3-2 is a more detailed version of
StealLF_simple in Figure 3-1, additionally containing the synchronization protocol
on lines 4–11. After setting 𝐹.end = 𝑚 on the victim loop frame 𝐹 , line 4 checks if
𝐹 now contains an invalid interval of loop iterations by testing 𝐹.start > 𝐹.end . If
so, then lines 5–8 restore 𝐹.end and 𝐻 to their previous values, unlock the victim’s
deque, and report a failed steal attempt. Lines 5–8 perform the same rollback as lines
4–6. Otherwise, line 9 checks if 𝐹 still contains any loop iterations after the steal by
31
testing 𝐹.start < 𝐹.end . If so, then line 10 restores the victim’s head pointer to leave
𝐹 at the top of the victim’s deque. If not, the thief leaves 𝐻 unchanged, effectively
removing 𝐹 from the top of the victim’s deque and exposing frames below 𝐹 on the
deque to be stolen. Finally, the thief reports a successful steal (line 13), after which
it will begin executing iteration 𝑚.
Note that line 12 sets 𝐹 ′.start = 𝑚 as opposed to Line 4 of the simplified protocol
(depicted in Figure 3-1), which sets 𝐹 ′.start = 𝑚 + 1. That’s because the former is
only stealing the upper half of the iterations while the latter then also immediately
takes the first iteration 𝑚 from the loop frame 𝐹 ′ before 𝐹 ′ is pushed onto its deque.
In other words, it summarizes both operations into one: stealing the upper half of
iterations and taking the first one to make it unavailable for stealing. I believe that
this distinction makes it clearer which iterations are actually getting stolen from 𝐹
in the synchronization protocol in Figure 3-2.
The definition of correctness for this protocol follows from preserving the semantics
of a for-loop: it executes every iteration in the specified range exactly once, and it
doesn’t execute any iterations outside this range. The proof of correctness mirrors
that of the THE protocol [25], assuming sequential consistency. While the THE
protocol guarantees that a frame will only be taken (and executed) by either a worker
or a thief, the loop frame protocol provides the same guarantee for iterations.
Because a thief first acquires the victim deque’s lock before attempting to steal,
only the worker and one thief can operate concurrently on a deque and, therefore, on
the same loop frame 𝐹 . If 𝐹 contains no iterations, then both the victim and thief
will fail to take any iterations from 𝐹 . Otherwise, suppose a thief attempts to steal
iterations [𝑚,𝐹.end) from 𝐹 . If 𝐹 contains more than 1 iteration, then 𝐹.start < 𝑚,
and the worker can update 𝐹.start without impinging on any iterations in [𝑚,𝐹.end),
Hence, both updates to 𝐹 can happen concurrently without issue. If 𝐹 contains 1
iteration, then the thief and victim must coordinate to determine which worker gets
iteration 𝑚. If the thief discovers 𝐹.start > 𝐹.end on line 4 of StealLF, then line 1
in Next happened before this check, and the thief will restore 𝐹.end to its original
value, allowing the victim to claim the iteration. Otherwise, line 1 in Next happens
32
after this check, causing the thief to steal iteration 𝑚 and the victim to discover that
𝐹 is out of iterations.
33
34
Chapter 4
Empirical evaluation of loop frames
This section presents my empirical evaluation of loop frames. I tested my implemen-
tation of loop frames in version 1.0 of the OpenCilk runtime [51] and compared its
performance to that of the traditional divide-and-conquer algorithm, DACParFor . I
evaluated both algorithms on both microbenchmarks and application benchmarks.
Results show that loop frames achieve significantly lower scheduling overheads
than the reference DacParFor implementation. On microbenchmarks, which used
no coarsening to directly expose the scheduling overheads, loop frames outperformed
the reference implementation by a factor of up to 2.7×. Additionally, application
benchmarks from the Problem-Based Benchmark Suite (PBBS) [53] were used to
demonstrate that loop frames also perform well in a real-world setting. Those results
show loop frames require less coarsening than the reference implementation to achieve
good performance, providing more evidence for a reduction in scheduling overheads.
All experiments in this chapter were run on compute nodes on the MIT Supercloud
system [50]. Each compute node is a dual-socket Intel Xeon Gold 6248 system with
a total of 384 GB of main memory. Each Xeon is a 2.50GHz 20-core CPU. To reduce
the effects of noise on the performance measurements, I disabled hyperthreading and
pinned all workers to cores on a single socket. All the benchmarks were compiled with
OpenCilk [51], based on LLVM 9, using the highest optimization level. All results
are median running times, aggregated from 10 trials. Specific compilation strategies
are described in each section.
35
Figure 4-1: Median speedup with respect to number of processors for the daxpy,
nqueens, and mandelbrot, appearing from left to right. Each plot features loop
frames in orange squares and the divide-and-conquer algorithm in yellow triangles.
The daxpy plot also includes the results of the simpler protocol for loop frames (briefly
mentioned in Chapter 3) in red diamonds. The nqueens plot includes in blue crosses
an alternative implementation of the parallel-for loop where each iteration is spawned
off in sequence.
4.1 Microbenchmark experiments
Figure 4-1 presents the performance results for the three microbenchmarks: daxpy,
nqueens, and mandelbrot. I compare the overheads of parallel-for loops implemented
using loop frames with the reference DacParFor implementation on these programs.
Because each program makes heavy use of parallel-for loops, the performance of each
program depends substantially on the overhead of the parallel-for-loop implementa-
tion. The parallel-for loops in each program were implemented by directly inserting
calls to runtime ABI functions into the code to avoid modifications to the Open-
Cilk compiler. The three programs vary in their computational intensity, that is, the
ratio of arithmetic operations to memory operations. To evaluate the overheads of
the parallel-for-loop implementations, I study these programs with no coarsening, i.e.,
with 𝐺 = 1 in DACParFor. I examine the results for each of these microbenchmarks
results in turn.
The results for the daxpy microbenchmark, which was briefly described in Chapter
1, are shown in the leftmost plot in Figure 4-1. All running times in this plot are nor-
malized to the 1-processor running time of the DACParFor implementation. daxpy
36
runs a geometric mean 2.7× faster when using loop frames instead of DACParFor
on all processor counts.
The middle plot in Figure 4-1 presents the results for nqueens. This program per-
forms a parallel recursive backtracking search to count the solutions to the 𝑛-queens
problem. At each level of recursion, the program performs a parallel-for over the 𝑛
positions in a given row. The figure presents results for 𝑛 = 13, meaning that each
parallel-for loop contains just 13 compute-intensive iterations. As the figure shows,
when using loop frames, nqueens runs a geometric mean 1.3× faster than the DAC-
ParFor version on all processor counts. Furthermore, the figure shows that loop
frames cut in half the performance gap between the DACParFor implementation
and an optimized version of nqueens in which the parallel-for loops spawn off each
iteration in sequence after performing an earlier viability check. This pruning strat-
egy reduces the parallel overhead of spawning short tasks that don’t contribute to the
parallelism and is not possible for parallel loops, which is why it outperforms both
loop-based implementations.
The rightmost plot in Figure 4-1 presents the performance results for mandelbrot,
which plots a picture of the Mandelbrot fractal with a specified size. The results in
the plot are normalized to the serial projection of the program [25, 52], in which
all parallel constructs are replaced with their serial counterparts. mandelbrot using
loop frames runs a geometric mean 2× faster compared to using DACParFor on all
processor counts.
4.2 Application experiments
I evaluated the performance of loop frames on 14 applications from the Problem-
Based Benchmark Suite (PBBS) [53]. In particular, I examined the performance of
loop frames and DACParFor on these applications with different coarsening val-
ues 𝐺. I selected 14 applications from PBBS for which changing 𝐺 for the default
DACParFor implementation of parallel-for loops from 𝐺 = 2048 to 𝐺 = 64 pro-
duced at least a 2% change in running time. This selection criterion aims to identify
37
programs for which I anticipate the performance differences between DACParFor
and loop frames to have a measurable impact on program running time.
To evaluate loop frames on these benchmarks, I modified the OpenCilk runtime
system to implement a runtime-ABI function for cilk_for loops [30] using loop
frames. I developed a similar runtime-ABI function using DACParFor to provide
a fair comparison between the algorithms. All PBBS benchmarks were compiled to
use this runtime-ABI function for all cilk_for loops. To test a coarsening value
𝐺, I directed the OpenCilk compiler to use 𝐺 when coarsening loop statically, via
loop strip-mining, and when using a runtime-computed coarsening for the cilk_for
loop [30]. In addition, I compiled all programs with exceptions disabled, due to lack
of exception support in the modified runtime.
Table 4.1 presents the performance results for the PBBS benchmarks. Many of the
PBBS benchmarks have parallel-for loops with large loop bodies, which minimizes the
total contribution of parallel-loop overhead to the total running time of the program.
Nevertheless, I found that, with smaller values of 𝐺, the benchmarks perform better on
both 1 and 20 workers when using loop frames than when using DACParFor. With
𝐺 = 64, for example, the benchmarks ran 2−6% faster when using loop frames versus
DACParFor. The performance difference between loop frames and DACParFor
decreases as 𝐺 is increased because a larger value of 𝐺 decreases the total contribution
of parallel-loop overhead to the program’s running time. Nevertheless, the results
indicate that loop frames allow applications to achieve efficiency and scalability with
smaller coarsening values. Good performance at lower coarsening could speed up
other applications that don’t contain as much parallelism as PBBS. However, such
applications would need to be benchmarked directly to confirm this hypothesis.
I note that Table 4.1 shows loop frames performing worse than DACParFor on
several benchmarks with 𝐺 = 2048. For 𝐺 = 2048, I believe that loop frames are
performing worse than intended due in part to a performance bug in the implemen-
tation. For a parallel-for loop over 𝑛 iterations, after strip-mining the loop by 𝐺, the
loop-frame implementation executes 𝑛 mod 𝐺 iterations serially after the rest of the
loop-frame performs a sync. In contrast, the DACParFor implementation allows
38
those same iterations to execute in parallel with the other iterations of the loop. Any
future implementation using loop frames should correct this bug.
detbfs ndbfs chull irefine dict imis imatch ndmatch pkruskal remdup ist ndst pks prange
DAC, 𝑇1, 64 15.333 12.266 9.440 48.000 2.045 2.890 10.380 6.343 42.666 3.131 16.666 16.200 26.183 22.018
LF, 𝑇1, 64 14.433 12.033 9.212 47.250 2.002 2.800 9.623 6.120 41.266 2.902 17.200 17.466 25.933 19.081
Ratio 1.062 1.019 1.025 1.016 1.021 1.032 1.079 1.036 1.034 1.079 0.969 0.928 1.010 1.154
DAC, 𝑇20, 64 0.976 1.038 0.743 3.150 0.120 0.186 0.611 0.366 3.146 0.199 1.194 9.473 1.892 1.309
LF, 𝑇20, 64 0.963 0.809 0.742 3.205 0.116 0.184 0.575 0.356 3.096 0.191 1.165 9.383 1.891 1.295
Ratio 1.013 1.009 1.001 0.983 1.034 1.022 1.062 1.028 1.016 1.042 1.025 1.010 1.001 1.011
DAC, 𝑇1, 256 14.700 12.033 9.193 47.750 2.004 2.866 9.500 6.050 41.566 2.995 15.966 15.900 25.716 19.533
LF, 𝑇1, 256 14.500 11.900 9.112 47.250 1.983 2.810 9.216 6.056 41.233 2.900 15.566 16.099 25.500 19.111
Ratio 1.014 1.011 1.009 1.011 1.011 1.020 1.031 0.999 1.008 1.033 1.026 0.988 1.008 1.022
DAC, 𝑇20, 256 3.100 0.816 0.743 3.165 0.117 0.186 0.572 0.353 3.126 0.194 1.162 8.943 1.895 1.313
LF, 𝑇20, 256 0.970 0.803 0.747 3.180 0.115 0.183 0.558 0.349 3.130 0.191 1.151 9.083 1.877 1.298
Ratio 1.000 1.004 0.995 0.995 1.017 1.016 1.025 1.011 0.999 1.016 1.010 0.985 1.010 1.012
DAC, 𝑇1, 2048 14.266 11.799 9.081 46.750 1.958 2.796 9.013 6.013 40.233 2.867 15.400 15.500 25.133 18.970
LF, 𝑇1, 2048 14.233 11.900 9.090 47.450 1.958 2.706 9.083 6.010 40.433 2.844 15.500 15.633 25.250 19.073
Ratio 1.002 0.992 0.999 0.985 1.000 1.033 0.992 1.000 0.995 1.008 0.994 0.991 0.995 0.995
DAC, 𝑇20, 2048 0.956 0.806 0.739 3.130 0.114 0.184 0.554 0.353 3.106 0.190 1.159 9.286 1.857 1.280
LF, 𝑇20, 2048 0.989 0.830 0.746 3.190 0.114 0.184 0.552 0.351 3.126 0.192 1.162 8.896 1.870 1.297
Ratio 0.967 0.971 0.991 0.981 1.000 1.000 1.004 1.006 0.994 0.990 0.997 1.044 0.993 0.987
Table 4.1: Comparison of running times of programs from PBBS [53] with grainsizes
64, 256, and 2048. Each set of rows contains median running times of the benchmarks
run on 1 or 20 workers using DACParFor (DAC) and loop frames (LF), as well as
the ratio (Ratio) of the DACParFor running time divided by the loop-frame running
times. A ratio greater than 1 indicates that DAC is slower than LF. The running
times in italic represent the fastest run for that benchmark and worker count.
39
40
Chapter 5
Study of improved locality in nested
loop applications
In this chapter, I present a study of the impact of increased cache locality on three
simple benchmarks that contain nested loops. I implemented each benchmark using
cache-oblivious and cache-aware approaches and compared their performance to a
naive nested loop implementation. The results show that improved locality in more
than one dimension can result in performance gains up to 1.7× when better temporal
and spatial locality can be achieved.
Table 5.1 summarizes the benchmarks used in the study. mm computes a product
of two matrices. blur performs a convolution of a trivial 2D filter on a 2D image.
transpose computes an in-place matrix transposition. mm reuses each element 𝑁
times, blur reuses each element 𝐾 times, and transpose contains no reuse.
All benchmarks satisfy several criteria that make them relevant to this study.
First, memory operations represent a significant proportion of the running time, so
better locality can measurably affect overall performance. Second, locality along
more than one dimension improves spatial or temporal locality. Otherwise, loops
in the nests can be reordered to guarantee locality in the desired dimension. This
criterion also implies the nested loop implementation does not achieve the optimal
number of cache misses.
This chapter is organized as follows. Section 5.1 explains the importance and
41
Property mm blur transpose
Parameters N N, K N
Loop nest rank (parallel) 3(2) 2(2) 2(2)
Total distinct reads 2𝑁2 𝑁2 𝑁2 −𝑁
Total reads 2𝑁3 𝑁2𝐾2 𝑁2 −𝑁
Table 5.1: Benchmarks used in the study and their properties. Loop nest rank shows
the number of nested loops, which corresponds to the number of dimensions in the
iteration space, with the number in parenthesis indicating how many of those loops
are parallel loops. Total distinct reads and total reads demonstrate whether there is
reuse.
terminology of locality in multidimensional loops. Section 5.2 lists the various im-
plementations used in the study and describes their properties. Section 5.3 describes
each benchmark in more detail and presents the empirical results for each. Finally,
Section 5.4 summarizes the takeaways we can draw from the study.
5.1 Locality in multidimensional loops
This section discusses locality in multidimensional loops and when it benefits perfor-
mance. It explains the relationship between iteration order and locality. The section
also explains the distinction between spatial and temporal locality and how they
impact cache operation. Finally, an example of iterating a 2D array illustrates the
importance of locality in different dimensions.
The traversal order of a multidimensional iteration space is what determines its
locality. Recursive and iterative methods traversing a multidimensional iteration
space execute in some linear order. Even when run in parallel, each parallel processor
will execute iterations one after the other.
We rely on parallel loops because they allow us to reorder iterations, but we only
analyze locality from the perspective of the linear order. In the fork-join parallel-
programming model, reordering iterations of nested loops is always legal if all loops
are parallel. Conversely, prior knowledge of application semantics or expensive depen-
dency and aliasing analysis are required to reorder serial loops. Finally, randomized
work-stealing bounds the additional cache misses incurred due to steals [1].
42
Locality in a particular dimension of an iteration space is beneficial if executing
iterations with different values of that coordinate would result in memory accesses
to locations far apart. This almost translates to "executing iterations for the same
values of the coordinate results in memory accesses close together or at the same
location in memory," except for that to be true, good locality in multiple dimensions
at a time is often required.
The distinction between spatial and temporal locality is essential. Temporal lo-
cality can only be achieved if there is reuse , meaning that a location in memory is
accessed multiple times, specifically during multiple iterations in an iteration space.
That reuse can result in temporal locality if those points are executed closely in the
resulting iteration order. On the other hand, spatial locality means nearby iterations
access nearby elements in memory.
This distinction is important because of the way caches operate. Both types of
locality increase the likelihood that a memory location is already in the cache. Still,
spatial locality does so by another element on the same cache line already residing
in the cache. Hence, the benefits of spatial locality generally do not extend past the
size of a cache line 𝐵.
For example, let us consider a simple task of iterating a 2D array laid out con-
tiguously in memory in row-major layout . Row-major layout means that the 𝑥
coordinate of the array is the fastest-running dimension. Let us assume we are iter-
ating the array using two nested loops with dimensions (indices) 𝑖 and 𝑗, where 𝑖 and
𝑗 correspond to 𝑥 and 𝑦 coordinates of the array, respectively. The best locality is
achieved if iterating over 𝑖 is done in the inner loop, as iterations with varying 𝑖 and
the same 𝑗 access nearby elements, while the converse is not true. Hence, locality
along 𝑗 is more vital because it results in good spatial locality. Of course, locality
along 𝑖 also matters as we want to access consecutive elements in memory in con-
secutive order. Still, there is no benefit to executing iterations with the same 𝑖 and
different 𝑗 soon after the other. Hopefully, this example is familiar to the reader and
helps clarify the concepts of locality in different dimensions, but it would not serve
well as a benchmark in this study.
43
5.2 Implementations
This section describes the various implementations used in the study that traverse
the multidimensional iteration space. The primary purpose was to study the effect
of changing the order in which iterations are executed. However, I also wanted to
eliminate other effects on the overall performance, namely function call overheads in
recursive implementation. Therefore, I added a "grainsize" parameter (𝐺) to coarsen
the recursion. Additionally, multiple implementations may visit the iteration space
in the same order but use different approaches to separate the effects of increased
locality from the impact of recursive call overhead.
For each benchmark, if the benchmark is called <name>, the following implemen-
tations are available:
• <name> is a simple nested loops implementation.
• <name>-tiled iterates the space by dividing the space into 𝐺×𝐺 tiles.
• <name>-dac uses multidimensional divide-and-conquer recursion, always splitting
the largest dimension. All dimensions are split down to size 𝐺.
• <name>-dac-loop uses divide-and-conquer recursion but fully splits the outer di-
mension first. Only the inner dimension is coarsened (split to size 𝐺); other dimen-
sions are split to size 1. It achieves the same iteration order as <name>.
• <name>-dac-loop-tiled is like <name>-dac-loop but both dimensions are coars-
ened. It achieves the same iteration order as <name>-tiled.
• <name>-dac-full and <name>-best also recursively split the third (serial) dimen-
sion and only apply to mm. They are described in more detail in Subsection 5.3.1.
Figure 5-1 shows the three distinct iteration orders in the implementations.
<name> and <name>-dac-loop achieve the regular order, <name>-tiled and
<name>-dac-loop-tiled achieve the tiled order, and <name>-dac achieves the Mor-
ton order (also referred to as z-order). <name>-dac-full and <name>-best achieve
the Morton order in all 3 dimensions.
44
Figure 5-1: Regular (left), tiled (middle), and Morton (right) iteration orders for a
two-dimensional iteration space. In this case, 𝑁 = 16 and 𝐺 = 4.
Implementation Base case size 𝐺 affects order
<name> 1 N/A
<name>-tiled 𝐺2 Yes
<name>-dac 𝐺2 Yes
<name>-dac-loop 𝐺 No
<name>-dac-loop-tiled 𝐺2 Yes
<name>-dac-full* 𝐺3 Yes
<name>-best* 𝐺3 Yes
Table 5.2: Implementations of nested loops used in the study. Base case size re-
duces both the recursive call overhead and parallelism. While neither <name> nor
<name>-tiled use recursion, <name>-tiled still has a base case, which is one tile.
<name>-best and <name>-dac-full only apply to mm.
The grainsize parameter 𝐺 may affect both recursion coarsening and iteration
order. 𝐺 is the size of the tile in both <name>-tiled and <name>-dac-loop-tiled,
but because the latter uses recursion, it also impacts the recursion base-case size,
which is equal to one tile. <name>-dac also uses a tile as a base case, which means that
the resulting order is not exactly the z-order . However, the locality remains similar
as long as one tile fits in cache. Finally, <name>-dac-loop only allows coarsening in
the innermost dimension, and 𝐺 only affects the coarsening. Table 5.2 summarizes
the effects of the grainsize parameter, 𝐺, on recursion coarsening and iteration order
for all implementations.
Despite having a coarsening parameter, <name>-dac is cache-oblivious, as 𝐺 is
only used to reduce the recursion overhead and not to optimize the residual size in
cache. However, if 𝐺 becomes too large, the cache efficiency of <name>-dac can be
lost.
45
5.3 Benchmarks and performance results
This section presents more detail on each benchmark and the performance results.
All benchmark implementations were compiled with version 2.0 of OpenCilk and -O3
-DNDEBUG optimization flags. I ran all experiments on an AWS c5.metal machine,
which features an Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz processor
with 48 physical cores across 2 NUMA nodes. Each core has its own 32 KiB L1d
and L1i caches and a 1 MiB L2 cache. Meanwhile, the 35.75 MiB L3 cache is shared
between all cores on one NUMA node. L1 and L2 are 8-way associative, L3 is 11-way
associative, and all contain 64 B cache lines. Because all benchmarks operate on
32-bit (4 B) values, the effective cache line size 𝐵 is 16.
For each benchmark, I explored the impact of 𝐺 on the overall performance in
both serial and parallel settings. I measured the median running time (minimum when
running on one worker) from 10 trials for all power-of-two values of 𝐺 from 1 to 128. I
repeated those measurements for all worker counts between 1 and 24, always utilizing
distinct physical cores on a single NUMA node to reduce performance variability. All
plots in this chapter plot the speedup (larger is better), which is calculated as 𝑆(𝑡) =
𝑡𝑛𝑜𝑟𝑚 , where 𝑡𝑛𝑜𝑟𝑚 is the running time used for normalization. Generally, 𝑡𝑛𝑜𝑟𝑚 is the𝑡
minimum running time of the serial projection of the nested loop implementation
<name>. One exception is when plotting the speedup with respect to grainsize on a
certain number of workers, where 𝑡𝑛𝑜𝑟𝑚 is the median running time of <name> for that
worker count.
Parallel scalability was not affected by 𝐺 for most implementations. Because the
problem sizes had to be large for the data to not fit in cache, ample parallelism was
always available. Therefore, I only evaluate some of the results to decrease the amount
of redundant information. Specifically, I only look at speedup with respect to 𝐺 for
1 and 24 workers and parallel scalability for 𝐺 ∈ {1, 128}.
I chose inputs with sizes that are not precisely powers of two for all benchmarks.
It is easiest to implement recursive divide-and-conquer codes if the sizes of inputs are
always powers of two, making it easy to divide the iteration space in half recursively.
46
However, such sizes can also result in conflict misses because real-world caches are
not fully associative, and their sizes are often a power of two. Therefore, two elements
in the same column of a 2D input might have the same low-order bits and map to
the same cache set. One potential solution is to pad the 2D memory to offset the
conflicts, but I opted for running the benchmarks on inputs with sizes that are not
powers of 2.
5.3.1 Matrix multiply
The matrix multiply benchmark mm computes the product of two row-major square
matrices 𝐴 and 𝐵 of size 𝑁 ×𝑁 and stores the result in matrix 𝐶 of the same size.
Because it performs 𝑂(𝑁3) arithmetic and 𝑂(𝑁3) memory operations on data of size
𝑂(𝑁2), it is an excellent benchmark for this study. It exhibits a factor of 𝑁 reuse
on both reads and writes and enables excellent spatial and temporal locality benefits.
Empirical results show that it can significantly benefit from improved locality in
multiple dimensions.
Figure 5-2 shows simplified code that implements mm using nested loops. It is a
three-dimensional loop nest with dimensions 𝑖, 𝑗, and 𝑘. Each loop only iterates over
two matrices and always refers to the same element in the last one. For example, the
𝑖 loop strides through 𝐴 and 𝐶 but always refers to the same element in 𝐵. It follows
that locality in all three dimensions is essential. On the other hand, the 𝑗 loop is the
only one that does not stride through either matrix, meaning that 𝑖 and 𝑘 locality
are more critical for spatial locality.
The order of loops in the nest may not achieve the best locality but is used for
consistent comparisons with other benchmarks. The code would be correct under
any permutation of the loop order in the nest. However, because the 𝑘 loop must be
serial, different loop orders might decrease the parallelism or increase the scheduling
overheads.
In addition to the common implementations, I implemented mm-dac-full and
mm-best. Those implementations also recursively split the 𝑘 dimension, achieving
good locality along all three dimensions instead of just 𝑖 and 𝑗. The difference between
47
Figure 5-2: Simplified nested loop implementation of matrix multiply. The outer two
of the three loops in the nest are parallel.
Figure 5-3: Speedup of mm implementations with respect to grainsize, run on a single
worker.
them is that mm-dac-full splits along one dimension (the largest one in the current
subproblem) for every level of recursion, while mm-best splits all three at the same
time. For square matrices, both implementations perform splits in the same order
but differ in the number of levels of recursion and spawned tasks by a constant factor.
For the evaluation, I chose a size of 𝑁 = 2071 to avoid conflict misses and make
sure that a column of a matrix does not fit in the L1 cache. A whole row does fit,
but when iterating over a column, each element is on its own cache line. To cache the
entire column, 𝐵 ·𝑁 = 64 bytes * 2071 memory needs to fit in cache, which is larger
than L1.
48
Figure 5-3 shows the impact of the grainsize parameter on the performance of
all implementations of mm on a single worker. The performance of mm-dac and
mm-dac-loop is independent of 𝐺. They run 1.21× and 0.95× faster than the serial
elision of mm. Because 𝐺 does not noticeably affect the locality of those implementa-
tions, this lack of performance dependence on 𝐺 means that recursive overhead does
not measurably contribute to the overall running time. The large size of the base
case can explain that: it performs 𝑂(𝐺2𝑁) work, which is 𝑂(𝑁) even with 𝐺 = 1.
On the other hand, larger 𝐺 improves the performance of mm-dac-loop-tiled and
mm-tiled because it improves locality along the 𝑗 dimension. With large enough
𝐺, mm-dac-loop-tiled achieves enough locality to catch up to mm-dac, which is
expected. Finally, mm-best and mm-dac-full perform much better with larger 𝐺
because their base case does 𝑂(𝐺3) work, which is not enough to amortize the cost of
recursive calls when 𝐺 is low. They both achieve the best performance with 𝐺 = 128:
mm-best outperforms the baseline by 1.64× while mm-dac-full outperforms it by
1.55×.
The performance difference between mm-dac-full and mm-best can likely be at-
tributed to a combination of performance variability and different overheads of re-
cursion and spawning parallel tasks. The serial projection of mm-best is sometimes
slower than the serial projection of mm-dac-full.
It is less clear why mm-tiled has similar performance to mm-dac-loop-tiled for
𝐺 = 1 but outperforms it at larger 𝐺. It also measurably dips in performance at very
large grainsizes. Unfortunately, I ran out of time to thoroughly investigate these de-
viations. Still, they could be explained by divergent compiler transformations arising
from the differences in control flow between mm-tiled and mm-dac-loop-tiled.
Figure 5-4 shows the relationship between performance and 𝐺, this time for exe-
cution on 24 workers. The plot looks almost identical except for mm-tiled performing
worse for 𝐺 ∈ {64, 128}.
The performance difference is even more apparent in Figure 5-5, which shows the
parallel scalability of all implementations for 𝐺 = 128. Except mm-tiled, all imple-
mentations achieve almost perfect linear speedup, with their self-relative speedups on
49
Figure 5-4: Speedup of mm implementations with respect to grainsize, run on 24
workers.
24 workers ranging from 22.7× to 24×.
5.3.2 Blur
blur is a common operation in computational photography and image processing.
The benchmark takes a two-dimensional image of size 𝑁 ×𝑁 and performs a simple
2D convolution that averages the pixels in the 2D neighborhood of size 𝐾 around
each pixel. Despite a high factor of reuse, empirical results show that it does not
benefit from improved locality in multiple dimensions. Figure 5-6 shows the simplified
nested-loop implementation.
Because each iteration of blur accesses nearby cells in both directions, improved
locality in both 𝑥 and 𝑦 dimensions reduces the number of cache misses. The regular
iteration order with 𝑥 (fastest-running dimension of the image) in the inner loop
reuses each element along the 𝑥 dimension 𝐾 times, and it only incurs one cache miss
for every 𝐵 elements due to excellent spatial locality. Assuming a row of the image
does not fully fit in cache, the number of misses is 𝑂(𝑁2𝐾 ), and the fraction of misses
𝐵
to reads is 𝑂( 1 ). Suppose the tiled order with a sufficient tile size or Morton order
𝐾𝐵
is used. Then, elements can also be reused along the 𝑦 axis due to better locality in
the 𝑥 dimension, and the number of misses is 2𝑂(𝑁 ). The fraction becomes 𝑂( 12 ),𝐵 𝐾 𝐵
50
Figure 5-5: Parallel speedup of mm implementations for 𝐺 = 128.
51
Figure 5-6: Simplified nested loop implementation of blur. blurPixel performs the
"blurring" of a pixel and is factored out for better readability.
an additional factor of 𝐾 reduction.
Even though blur could be written as a 4D loop-nest, the inner two loops are
smaller and likely wouldn’t benefit as much from improved locality. Their indices
represent a relative offset in the calculation of the memory location, meaning that
two different values of x and y with the same x_diff or y_diff don’t have anything
in common, unless x and y are close, in which case locality in those two dimensions
is already present.
I chose a size of 𝑁 = 16401 for this benchmark to make sure that a whole row
of the image would not fit in the L1 cache. In addition, I chose 𝐾 = 11 to get a
significant factor of reuse. However, in hindsight, a high 𝐾 also resulted in a low
number of cache misses, so reducing the cache misses had a minor impact on the
overall performance.
Figure 5-7 shows the performance of blur with respect to 𝐺 on a single worker.
All implementations achieve very similar performance. These results show that the
improved locality does not affect the running time. For 𝐺 = 1, blur-dac achieves
52
Figure 5-7: Speedup of blur implementations with respect to grainsize run on a single
worker.
a 0.77× speedup compared to the serial elision of blur, whike blur-dac-loop and
blur-dac-loop-tiled achieve a 0.80× speedup. The reason they all perform worse
than blur at low 𝐺 can be directly attributed to recursive overhead. As 𝐺 increases,
the difference shrinks, and the performance of those benchmarks approaches that
of blur. This approach happens more slowly for blur-dac-loop, as its coarsening
only occurs with a factor of 𝐺 instead of 𝐺2, and the amortized cost of recursion
approaches 0 more slowly.
Meanwhile, blur-tiled achieves 0.97× of the performance of the baseline even
for 𝐺 = 1, and quickly converges towards the baseline for larger 𝐺. It performs best
with 𝐺 = 64, where it outperforms the baseline by 0.67%. That improvement is not
significant enough to conclude that the improved locality was the reason for improved
performance.
Figure 5-8 shows the parallel scalability of all blur implementations. They all
scale exceptionally well, achieving a speedup of over 23.9 on 24 cores. This happens
due to ample parallelism and sufficient arithmetic intensity that does not overwhelm
the memory bandwidth of the CPU, which notoriously does not scale with the number
of processors.
53
Figure 5-8: Parallel speedup of blur implementations for 𝐺 = 1.
54
Figure 5-9: Simplified nested loop implementation of transpose.
5.3.3 Transpose
transpose calculates an in-place transpose of a square matrix. It differs from the
other two because it features a triangular iteration space , meaning the iteration
range of the inner loop depends on the index in the outer loop. That gives the
iteration space a triangular shape. Empirical results show that it can greatly benefit
from improved locality in both dimensions. Figure 5-9 shows the simplified nested-
loop implementation.
In transpose, locality along both dimensions results in good spatial locality be-
cause the code writes to the element with the switched coordinates of the element
it reads from. On the other hand, there is no reuse; every element is only read and
written once, and that read and write happen one after the other in all implementa-
tions.
I chose a size of 𝑁 = 16401 for this benchmark to ensure that a whole row of the
matrix would not fit in the L1 cache.
Figure 5-10 shows the performance of transpose with respect to 𝐺 on 24 work-
ers. Because the work per iteration is low, recursive overheads are pretty high.
With larger 𝐺, improved locality of transpose-dac outperforms other implemen-
tations, achieving up to a 1.7× speedup compared to transpose. The second best
is transpose-dac-loop-tiled with 𝐺 = 128, achieving a 1.62× speedup. From
cache-misses alone, transpose-dac-loop-tiled is expected to achieve similar per-
55
Figure 5-10: Speedup of transpose implementations with respect to grainsize, run
on 24 workers.
formance after 𝐺 > 𝐵, but it keeps noticeably improving in performance. Mean-
while, transpose-dac-loop plateaus entirely below them, which is expected due
to its worse locality. But it is not likely for it to outperform transpose. Finally,
transpose-tiled performs poorly with large 𝐺 like other <name>-tiled implemen-
tations.
Because each iteration of transpose performs so little work, other differences
between the implementations and overheads may affect these results. Further in-
vestigation would be required to uncover the cause of the unexpected side of these
results.
5.4 Takeaways
Three main takeaways can be drawn from this study. First, convolutions (like blur)
don’t benefit as much from reordering despite a high factor of reuse. Second, appli-
cations where locality along more than one dimension results in either temporal or
spatial cache-locality can benefit tremendously from reordering the iterations. Bench-
marks in this study achieved up to a factor of 1.7×. Third, manually optimizing
implementations and making them cache-efficient can require much work.
56
Careful tuning is required to achieve the best cache performance, which can also
be hard to measure. Cache-efficient implementations, whether cache-aware or cache-
oblivious, can get complex and must be tightly coupled to the application logic,
making the code prone to off-by-one errors and other bugs. This coupling further
justifies compiler transformations that can reorder iterations automatically without
programmer input.
I could not thoroughly investigate the poor scalability of <name>-tiled for all
three benchmarks. I believe that the OpenCilk compiler performed its own coarsening
of the parallel loops. This coarsening could result in puny parallelism when combined
with the manual coarsening done by the implementation.
Future studies should focus on experimenting with different input sizes for blur,
other base case approaches for mm, and further investigation of transpose perfor-
mance results. A lower 𝐾 would mean lower reuse and a smaller relative reduction
in cache misses for blur-dac, but it might increase the absolute decrease in cache
misses and significantly impact performance. When experimenting with different or-
ders of loops, I found that changing the order of nested loops inside mm yielded an
implementation that was even faster than mm-best, which was partially due to better
vectorization, but it certainly warrants a closer look.
57
58
Chapter 6
Related work
This section discusses related work on parallel loops and optimizations to randomized
work stealing. It also presents related work on improving locality of nested loops.
A substantial body of previous work has studied a variety of schemes for partition-
ing parallel loops to improve performance [48, 28, 57, 40, 27, 7, 8, 54]. One common
scheme involves statically partitioning the 𝑛 iterations of a parallel loop into 𝑃 chunks
of 𝑛/𝑃 iterations each, where 𝑃 is the number of processing threads, and then as-
signing each chunk to a processor [45, 3]. Although static partitioning incurs low
overheads, its performance suffers when the work of the loop is not balanced across
its iterations or the execution of the parallel loop does not have dedicated access to the
processors. In contrast, dynamic partitioning schemes, like DACParFor, partition
parallel loop iterations into small chunks whose size is independent of the number of
processing threads and then allows a scheduler to dynamically schedule chunks onto
processors, e.g., via work stealing or work sharing. Dynamic partitioning can over-
come the shortcomings of static partitioning by providing automatic load balancing
but at the cost of increased overhead. Other schemes, such as guided self-scheduling
[48] or factoring [28], partition the loop iterations into variable-sized chunks, whose
size is based on 𝑃 and then the number of unexecuted loop iterations remaining.
Previous work has also explored schemes to schedule and load balance parallel loops
based on statistics collected at runtime [7, 8] and based on memory-access patterns,
such as spatial locality across parallel loop iterations [54].
59
Loop frames implement dynamic partitioning of parallel loops that load balances
work automatically via work stealing. Loop frames and on-the-fly loop splitting also
differ from other schemes that dynamically split unexecuted loop iterations, e.g.,
[48, 28, 54], which typically use a central queue to manage the loop iterations. In
contrast, on-the-fly loop splitting distributes unexecuted loop iterations in a decen-
tralized fashion using worker-local deques. Furthermore, in contrast to previous work,
on-the-fly loop splitting presented in this thesis has a parallel running time bound
with work stealing, derived in [47].
Tzannes et al. introduce lazy binary splitting [55], and a generalization called lazy
scheduling [56], to reduce the overheads of work stealing. These schemes split parallel
loops in a dynamic, lazy fashion by pushing work onto a worker’s deque only when the
deque is empty. Lazy scheduling differs from loop frames in two ways. First, although
lazy scheduling pushes partitions of parallel-loop iterations lazily onto the deque, these
partitions are still created in a manner consistent with DACParFor. In contrast, on-
the-fly loop splitting partitions the unexecuted iterations in the loop frame at the time
of a steal. These partitions do not necessarily correspond with partitions generated
using DACParFor, which necessitates a separate analysis of work stealing with loop
frames. In addition, loop frames and lazy scheduling reduce work-stealing overheads
in different ways. Lazy scheduling reduces overheads by avoiding synchronization
operations on a nonempty deque. In contrast, loop frames reduce overheads by using
a concise representation of unexecuted parallel-loop iterations in a single deque frame.
It remains a topic of future work to see if these techniques can be combined.
Acar et al. explore another approach to reducing work-stealing overheads by study-
ing work-stealing with private deques [2]. Private deques allow a work-stealing sched-
uler to avoid unnecessary memory fences during execution, thereby reducing synchro-
nization overheads. It remains an open research problem to apply similar techniques
to loop frames. In particular, future work might investigate designating a subinterval
of a loop frame’s iterations to be private in order to reduce synchronization costs
involved with updating the start variable in a loop frame.
60
Chapter 7
Conclusion
This thesis introduces loop frames and on-the-fly loop splitting, which extend ran-
domized work stealing with first-class support for parallel-for loops. Loop frames and
on-the-fly loop splitting maintain the same theoretical guarantees as the traditional
divide-and-conquer algorithm for parallel-for loops while providing lower overheads in
practice and flexibility in dynamically splitting loop iterations. This particularly im-
pacts applications with lower parallelism where extensive coarsening cannot be used.
Although this work introduces on-the-fly loop splitting, it remains an interesting re-
search question to fully leverage the flexibility it provides.
This thesis also investigates the impacts of improved locality of multidimen-
sional iteration spaces on performance. While convolutions like blur don’t benefit as
much from reordering, applications with temporal or spatial cache-locality can ben-
efit tremendously. Cache-efficient implementations can be complex and error-prone,
justifying the use of compiler transformations that automatically reorder iterations.
Future studies should continue to explore different input sizes and approaches for
various benchmarks.
61
62
Bibliography
[1] U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work
stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and
Architectures (SPAA 2000), pages 1–12, 2000.
[2] Umut A. Acar, Arthur Chargueraud, and Mike Rainey. Scheduling parallel pro-
grams by work stealing with private deques. In PPoPP, page 219–228, 2013.
[3] Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, and Massimo Torquati.
FastFlow: High-Level and Efficient Streaming on Multicore, chapter 13, pages
261–280. John Wiley & Sons, Ltd, 2017.
[4] Eric Allen, David Chase, Joe Hallett, Victor Luchangco, Jan-Willem Maessen,
Sukyoung Ryu, Guy L. Steele Jr., and Sam Tobin-Hochstadt. The Fortress
Language Specification Version 1.0. Sun Microsystems, Inc., March 2008.
[5] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for
multiprogrammed multiprocessors. Theory Comput. Syst., 34(2):115–144, 2001.
[6] E. Ayguade, N. Copty, A. Duran, J. Hoeflinger, Yuan Lin, F. Massaioli, X. Teruel,
P. Unnikrishnan, and Guansong Zhang. The design of OpenMP tasks. IEEE
Transactions on Parallel and Distributed Systems, 20(3):404–418, 2009.
[7] I. Banicescu and Z. Liu. A dynamic scheduling method tuned to rate of weight
changes. In HPC, pages 122–129, 2000.
[8] Ioana Banicescu and Vijay Velusamy. Performance of scheduling scientific appli-
cations with adaptive weighted factoring. In IPDPS, page 84, USA, 2001. IEEE
Computer Society.
[9] Rajkishore Barik, Zoran Budimlić, Vincent Cavè, Sanjay Chatterjee, Yi Guo,
David Peixotto, Raghavan Raman, Jun Shirako, Sağnak Taşırlar, Yonghong Yan,
Yisheng Zhao, and Vivek Sarkar. The Habanero multicore software research
project. In Proceedings of the 24th ACM SIGPLAN Conference Companion on
Object Oriented Programming Systems Languages and Applications, OOPSLA
’09, pages 735–736, Orlando, Florida, USA, 2009. ACM.
[10] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun.
Internally deterministic parallel algorithms can be fast. In PPoPP, pages 181–
192, 2012.
63
[11] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E.
Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded
runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69,
1996.
[12] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multi-
threaded computations. SIAM Journal on Computing, 27(1):202–229, February
1998.
[13] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded compu-
tations by work stealing. Journal of the ACM, 46(5):720–748, 1999.
[14] Robert D. Blumofe and Dionisios Papadopoulos. Hood: A user-level threads
library for multiprogrammed multiprocessors. Technical Report, University of
Texas at Austin, 1999.
[15] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a vir-
tual tree of processors. In Proceedings of the 1981 Conference on Functional Pro-
gramming Languages and Computer Architecture, pages 187–194, Portsmouth,
New Hampshire, October 1981.
[16] Vincent Cavé, Jisheng Zhao, Jun Shirako, and Vivek Sarkar. Habanero-Java:
The new adventures of old X10. In PPPJ, pages 51–61, 2011.
[17] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Al-
lan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An
object-oriented approach to non-uniform cluster computing. In OOPSLA, 2005.
[18] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms. The MIT Press, third edition, 2009.
[19] John S. Danaher, I-Ting Angelina Lee, and Charles E. Leiserson. Programming
with exceptions in JCilk. Science of Computer Programming, 63(2):147–171,
December 2008.
[20] Rainer Feldmann, Peter Mysliwietz, and Burkhard Monien. Studying overheads
in massively parallel min/max-tree evaluation. In SPAA, 1994.
[21] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy
races in Cilk programs. Theory of Computing Systems, 32(3):301–326, 1999.
[22] Raphael Finkel and Udi Manber. DIB — A distributed implementation of back-
tracking. ACM TOPLAS, 9(2):235–256, April 1987.
[23] Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Distributed
Filaments: Efficient fine-grain parallelism on a cluster of workstations. In Pro-
ceedings of the First Symposium on Operating Systems Design and Implementa-
tion, pages 201–213, Monterey, California, November 1994.
64
[24] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran.
Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Com-
puter Science, pages 285–297, New York, New York, October 17–19 1999.
[25] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation
of the Cilk-5 multithreaded language. In PLDI, pages 212–223, 1998.
[26] Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic compu-
tation. ACM TOPLAS, 7(4):501–538, October 1985.
[27] Susan Flynn Hummel, Jeanette Schmidt, R. N. Uma, and Joel Wein. Load-
sharing in heterogeneous systems via weighted factoring. In Proceedings of the
Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, page
318–328, Padua, Italy, 1996. ACM.
[28] Susan Flynn Hummel, Edith Schonberg, and Lawrence E. Flynn. Factoring:
A method for scheduling parallel loops. Commun. ACM, 35(8):90–101, August
1992.
[29] Intel Corporation. Intel Cilk Plus Language Specification, 2010. Document
Number: 324396-001US. Available from http://software.intel.com/sites/
products/cilk-plus/cilk_plus_language_specification.pdf.
[30] Intel Corporation. Intel Cilk Plus Language Extension Specification, Version 1.2,
2013. Document 324396-003US.
[31] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for
backtrack search and branch-and-bound computation. Journal of the ACM,
40(3):765–789, July 1993.
[32] David A. Kranz, Robert H. Halstead, Jr., and Eric Mohr. Mul-T: A high-
performance parallel Lisp. In Proceedings of the SIGPLAN ’89 Conference on
Programming Language Design and Implementation, pages 81–90, June 1989.
[33] Bradley C. Kuszmaul. Synchronized MIMD Computing. PhD thesis, Depart-
ment of Electrical Engineering and Computer Science, Massachusetts Institute
of Technology, May 1994. Available as MIT Laboratory for Computer Science
Technical Report MIT/LCS/TR-645.
[34] Monica S. Lam, E. Rothberg, and Michael E. Wolf. The cache performance
and optimizations of blocked algortihms. In Fourth International Conference
on Architectural Support for Programming Languages and Operating Systems
(ASPLOS), pages 63–74, Santa Clara, CA, April 1991. ACM SIGPLAN Notices
26:4.
[35] Doug Lea. A Java fork/join framework. In ACM 2000 Conference on Java
Grande, pages 36–43, 2000.
65
[36] I-Ting Angelina Lee. Memory Abstractions for Parallel Programming. PhD
thesis, MIT Department of Electrical Engineering and Computer Science, 2012.
[37] Daan Leijen and Judd Hall. Optimize managed code for multi-core ma-
chines. MSDN Magazine, 2007. Available from http://msdn.microsoft.com/
magazine/.
[38] Charles E. Leiserson. The Cilk++ concurrency platform. Journal of Supercom-
puting, 51(3):244–257, 2010.
[39] Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul,
Butler W. Lampson, Daniel Sanchez, and Tao B. Schardl. There’s plenty of room
at the top: What will drive computer performance after moore’s law? volume
368. American Association for the Advancement of Science, 2020.
[40] Jie Liu, Vikram A. Saletore, and Ted G. Lewis. Safe self-scheduling: A par-
allel loop scheduling scheme for shared-memory multiprocessors. International
Journal of Parallel Programming, 22(6):589–616, 1994.
[41] Michael McCool, Arch D. Robison, and James Reinders. Structured Parallel
Programming: Patterns for Efficient Computation. Elsevier Science, 2012.
[42] Seung-Jai Min, Costin Iancu, and Katherine Yelick. Hierarchical work stealing
on manycore clusters. In Fifth Conference on Partitioned Global Address Space
Programming Models (PGAS ’11), October 2011.
[43] G.M. Morton. A computer oriented geodetic data base and a new technique in
file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, March 1966.
[44] Rishiyur S. Nikhil. Cid: A parallel, shared-memory C for distributed-memory
machines. In Proceedings of the Seventh Annual Workshop on Languages and
Compilers for Parallel Computing, August 1994.
[45] OpenMP Architecture Review Board. OpenMP Application Program Interface,
Version 4.5, July 2015. Available from http://www.openmp.org/wp-content/
uploads/openmp-4.5.pdf.
[46] OpenMP Application Program Interface, Version 3.0, May 2008.
[47] Nipun Pitimanaaree. Provably efficient randomized work stealing with first-
class parallel loops. Master’s thesis, Department of Electrical Engineering and
Computer Science, Massachusetts Institute of Technology, May 2019.
[48] C. D. Polychronopoulos and D. J. Kuck. Guided self-scheduling: A practical
scheduling scheme for parallel supercomputers. IEEE Transactions on Comput-
ers, C-36(12):1425–1439, 1987.
[49] James Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core
Processor Parallelism. O’Reilly Media, Inc., 2007.
66
[50] Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Ar-
cand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew
Hubbell, et al. Interactive supercomputing on 40,000 cores for machine learning
and data analysis. In 2018 IEEE High Performance extreme Computing Confer-
ence (HPEC), pages 1–6. IEEE, 2018.
[51] Tao B. Schardl, I-Ting Angelina Lee, and Charles E. Leiserson. Brief announce-
ment: Open cilk. In Proceedings of the 30th on Symposium on Parallelism in
Algorithms and Architectures, SPAA ’18, page 351–353, New York, NY, USA,
2018. Association for Computing Machinery.
[52] Tao B. Schardl, William S. Moses, and Charles E. Leiserson. Tapir: Embedding
fork-join parallelism into LLVM’s intermediate representation. In PPoPP, pages
249–265, 2017.
[53] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Ky-
rola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement:
The Problem Based Benchmark Suite. In SPAA, pages 68–70, 2012.
[54] Marc Tchiboukdjian, Vincent Danjean, Thierry Gautier, Fabien Le Mentec, and
Bruno Raffin. A work stealing scheduler for parallel loops on shared cache mul-
ticores. In Mario R. Guarracino, Frédéric Vivien, Jesper Larsson Träff, Mario
Cannatoro, Marco Danelutto, Anders Hast, Francesca Perla, Andreas Knüpfer,
Beniamino Di Martino, and Michael Alexander, editors, Euro-Par 2010 Parallel
Processing Workshops, pages 99–107, 2011.
[55] Alexandros Tzannes, George C. Caragea, Rajeev Barua, and Uzi Vishkin. Lazy
binary-splitting: A run-time adaptive work-stealing scheduler. In PPoPP, page
179–190, 2010.
[56] Alexandros Tzannes, George C. Caragea, Uzi Vishkin, and Rajeev Barua. Lazy
scheduling: A runtime adaptive scheduler for declarative parallelism. TOPLAS,
36(3), September 2014.
[57] T. H. Tzen and L. M. Ni. Trapezoid self-scheduling: A practical scheduling
scheme for parallel compilers. IEEE Trans. Parallel Distributed Syst., 4:87–98,
1993.
[58] Mark T. Vandevoorde and Eric S. Roberts. WorkCrews: An abstraction for con-
trolling parallelism. International Journal of Parallel Programming, 17(4):347–
366, August 1988.
[59] Helen Jiang Xu. The Locality-First Strategy for Developing Efficient Multicore
Algorithm. PhD thesis, Department of Electrical Engineering and Computer
Science, Massachusetts Institute of Technology, February 2022.
67