Improving the Performance of Parallel Loops in OpenCilk by Luka Govedič S.B. in Electrical Engineering and Computer Science Massachusetts Institute of Technology(2022) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2023 © 2023 Luka Govedič. All rights reserved. The author hereby grants to MIT a nonexclusive, worldwide, irrevocable, royalty-free license to exercise any and all rights under copyright, including to reproduce, preserve, distribute and publicly display copies of the thesis, or release the thesis under an open-access license. Authored by: Luka Govedič Department of Electrical Engineering and Computer Science May 19, 2023 Certified by: Tao B. Schardl Research Scientist Thesis Supervisor Accepted by: Katrina LaCurts Chair, Master of Engineering Thesis Committee 2 Improving the Performance of Parallel Loops in OpenCilk by Luka Govedič Submitted to the Department of Electrical Engineering and Computer Science on May 19, 2023, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science Abstract For good performance, parallel loop scheduling must achieve low scheduling overheads and multidimensional locality in nested loops. This thesis explores both challenges and contributes an extension to randomized work-stealing for first-class loop support that reduces scheduling overheads. Randomized work-stealing schedulers traditionally execute parallel-for loops us- ing parallel divide-and-conquer recursion, which is theoretically efficient and scalable but can incur substantial overheads in practice. This thesis extends randomized work-stealing with a custom work-stealing protocol called on-the-fly loop splitting. I introduce loop frames to make work stealing on parallel-for loops more efficient and flexible. Loop frames make two key changes to work stealing for parallel-for loops. First, loop frames extend work stealing by directly encoding information about intervals of loop iterations in the runtime. Loop frames add first-class support to work stealing for parallel-for loops that composes with classical randomized work stealing. Second, loop frames allow intervals of loop iterations to be split on-the-fly, such that worker threads attempt to steal half of the unexecuted loop iterations rather than a deterministically constructed partition of loop iterations. On-the-fly loop splitting allows for more flexible dynamic load balancing of loop iterations while keeping the work overheads low and maintaining the theoretical efficiency of divide-and-conquer. I evaluate loop frames in practice by implementing loop frames in the OpenCilk runtime system. In particular, loop frames augment the THE protocol from Cilk to coordinate updates to loop frames. I observe that loop frames and on-the-fly loop splitting incur substantially less overhead than the divide-and-conquer algorithm without sacrificing parallel scalability. Finally, I study the impacts of increased locality in more than one dimension in nested loop applications. Results show that both cache-aware and cache-oblivious reordering of nested loop iterations can result in performance benefits up to a factor of 1.7×. Thesis Supervisor: Tao B. Schardl 3 Title: Research Scientist 4 Acknowledgments While I received support from many, no one was as instrumental during this research as Dr. Tao B. Schardl. I am extremely grateful for his continued guidance and countless insightful discussions we had over the course of over three years. He helped ignite my passion for performance engineering research and was indispensable in my becoming the researcher I am today. I would also like to express my gratitude to Prof. Charles Leiserson for supporting me in every aspect of my academic and professional career and being a limitless source of advice. I am thankful to all other Supertech research group members and Prof. Rezaul Chowdhury for their repeated feedback and interest in my project. I cannot thank Sam D’Alonzo enough for being a constant source of support. I want to thank all of my friends and communities I have been a part of, and my family for helping this dream happen in the first place. I would also like to acknowledge that I used ChatGPT to proofread parts of my thesis. This research was sponsored in part by the United States Air Force Research Laboratory under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. 5 6 Contents 1 Introduction 15 2 Background 23 2.1 The dag model of multithreading . . . . . . . . . . . . . . . . . . . . 23 2.2 Randomized work stealing . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3 Cache-oblivious and cache-aware algorithms . . . . . . . . . . . . . . 25 3 Loop frames 27 3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2 Synchronization protocol . . . . . . . . . . . . . . . . . . . . . . . . . 29 4 Empirical evaluation of loop frames 35 4.1 Microbenchmark experiments . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Application experiments . . . . . . . . . . . . . . . . . . . . . . . . . 37 5 Study of improved locality in nested loop applications 41 5.1 Locality in multidimensional loops . . . . . . . . . . . . . . . . . . . . 42 5.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.3 Benchmarks and performance results . . . . . . . . . . . . . . . . . . 46 5.3.1 Matrix multiply . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.3.2 Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.3.3 Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.4 Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 7 6 Related work 59 7 Conclusion 61 8 List of Figures 1-1 A DAG for a simple fork-join program and the corresponding pseu- docode. In this example, there are 4 serial tasks, A-D, that need to be executed. A needs to be executed before B and C (which can be executed in parallel). Both B and C need to finish before D. In the pseudocode, A is called first. Then, B is spawned, meaning C can exe- cute before B is done. Sync on line 4 forces both B and C to complete before the execution moves on to D. . . . . . . . . . . . . . . . . . . . 16 1-2 Pseudocode for the typical divide-and-conquer implementation of a parallel-for loop. The function Body encodes the body of the parallel- for loop. The spawn and sync keywords allow for parallel execution of the operations in the function. The constant G is used to coarsen the recursion to mitigate performance overheads. . . . . . . . . . . . . 19 3-1 Pseudocode for loop frame operations. LFParFor is used by the worker that enters the loop frame, while SimpleSteal is used by the thief when stealing a loop frame from a victim. LFParForHelper is used by both to actually execute all iterations and contains a familiar branch, body, and increment that can be found in all for loops. Next returns true if there are more iterations available, and false otherwise. 28 9 3-2 Pseudocode for the optimized synchronization protocol to coordinate operations on loop frames. The argument 𝐷 denotes a victim’s deque. The functions lockOwnDeque and unlockOwnDeque acquire and release the lock on the executing worker’s deque. The function isLoopFrame tests if the given frame is a loop frame. . . . . . . . . 31 4-1 Median speedup with respect to number of processors for the daxpy, nqueens, and mandelbrot, appearing from left to right. Each plot features loop frames in orange squares and the divide-and-conquer al- gorithm in yellow triangles. The daxpy plot also includes the results of the simpler protocol for loop frames (briefly mentioned in Chapter 3) in red diamonds. The nqueens plot includes in blue crosses an al- ternative implementation of the parallel-for loop where each iteration is spawned off in sequence. . . . . . . . . . . . . . . . . . . . . . . . . 36 5-1 Regular (left), tiled (middle), and Morton (right) iteration orders for a two-dimensional iteration space. In this case, 𝑁 = 16 and 𝐺 = 4. . 45 5-2 Simplified nested loop implementation of matrix multiply. The outer two of the three loops in the nest are parallel. . . . . . . . . . . . . . 48 5-3 Speedup of mm implementations with respect to grainsize, run on a single worker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5-4 Speedup of mm implementations with respect to grainsize, run on 24 workers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5-5 Parallel speedup of mm implementations for 𝐺 = 128. . . . . . . . . . 51 5-6 Simplified nested loop implementation of blur. blurPixel performs the "blurring" of a pixel and is factored out for better readability. . . 52 5-7 Speedup of blur implementations with respect to grainsize run on a single worker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5-8 Parallel speedup of blur implementations for 𝐺 = 1. . . . . . . . . . 54 5-9 Simplified nested loop implementation of transpose. . . . . . . . . . 55 10 5-10 Speedup of transpose implementations with respect to grainsize, run on 24 workers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 11 12 List of Tables 4.1 Comparison of running times of programs from PBBS [53] with grain- sizes 64, 256, and 2048. Each set of rows contains median running times of the benchmarks run on 1 or 20 workers using DACParFor (DAC) and loop frames (LF), as well as the ratio (Ratio) of the DACParFor running time divided by the loop-frame running times. A ratio greater than 1 indicates that DAC is slower than LF. The running times in italic represent the fastest run for that benchmark and worker count. 39 5.1 Benchmarks used in the study and their properties. Loop nest rank shows the number of nested loops, which corresponds to the number of dimensions in the iteration space, with the number in parenthesis indicating how many of those loops are parallel loops. Total distinct reads and total reads demonstrate whether there is reuse. . . . . . . . 42 5.2 Implementations of nested loops used in the study. Base case size reduces both the recursive call overhead and parallelism. While neither nor -tiled use recursion, -tiled still has a base case, which is one tile. -best and -dac-full only apply to mm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 13 14 Chapter 1 Introduction A parallel loop is a control-flow construct that contains a sequence of independent iterations, which are allowed to be executed concurrently. Parallel loops are common in high-performance parallel programs, and their performance is crucial for the overall performance of parallel code. This thesis tackles two main performance challenges of parallel loops: scheduling overheads and multidimensional locality in nested loops. Parallel computing is the future of faster code. Multicore computers started ap- pearing around 2005 [39], when clock speeds plateaued, limiting the speedup of serial code and forcing programmers to turn to parallel computing. However, running code in parallel is harder due to race conditions and other issues that occur due to con- currency. Ideally, parallel programs are desired to achieve perfect linear speedup, meaning the program runs 𝑃 -times faster on 𝑃 processors (compared to running on a single processor). In reality, however, this isn’t possible due to scheduling overheads. Fork-join parallelism is a flexible task-based approach to parallel computing. Each program is modeled as a directed acyclic graph (dag), where nodes represent fine-grained indivisible tasks and edges represent dependencies between them. The fork-join paradigm supports two main constructs: spawn/sync and parallel-for loops. Spawn and sync are demonstrated in Figure 1-1. Similar to how B and C in Figure 1-1 can execute in parallel, parallel-for loops allow all iterations to execute in parallel. Fork-join parallelism is task-based because the programmer only has to specify tasks and their dependencies, while the scheduling decisions are left to the 15 ForkJoinDemo() 1 A(); 2 spawn B(); 3 C(); 4 sync 5 D(); Figure 1-1: A DAG for a simple fork-join program and the corresponding pseudocode. In this example, there are 4 serial tasks, A-D, that need to be executed. A needs to be executed before B and C (which can be executed in parallel). Both B and C need to finish before D. In the pseudocode, A is called first. Then, B is spawned, meaning C can execute before B is done. Sync on line 4 forces both B and C to complete before the execution moves on to D. runtime system at the time of execution. Randomized work stealing [5, 15, 20, 22, 23, 26, 31, 32, 33, 44, 58, 13, 25] is a popular (and asymptotically optimal) scheduling and load-balancing algorithm for parallel fork-join programs. Typically, a randomized work-stealing scheduler will implement parallel-for loops using recursive divide-and-conquer , where the iteration space is recursively split into halves. Although efficient in theory [41, 13, 5], this approach can exhibit high overheads in practice. Those overheads are made worse by representing loops with other fundamental fork-join building blocks instead of treating them as first-class parallel constructs. Multi-dimensional loops present additional challenges to efficient scheduling. Xu [59] argues for a locality-first approach to algorithm design and shows that locality can be more important for good parallel performance than parallelism. In the case of a multidimensional iteration space, good locality across all (or at least multiple) di- mensions is often required for fast execution, which can be achieved with a reordering of iterations to achieve better cache efficiency. Two common approaches to achieving cache efficiency are cache-oblivious and cache-aware algorithms. Cache-aware algorithms achieve good cache locality by taking cache size into account, while cache-oblivious algorithms achieve good cache locality without any explicit knowledge of cache parameters. For multidimensional 16 iteration spaces, loop tiling and recursive divide-and-conquer are cache-aware and cache-oblivious approaches to improving cache efficiency by reordering loop iter- ations. When programmers implement these cache-efficient approaches manually, they tightly couple the algorithm to the implementation. That coupling essentially elimi- nates opportunities for other optimizations that operate on loops, e.g. vectorization and Loop-Invariant Code Motion, which avoids recomputation inside the loop by ex- ecuting loop-invariant code before the loop. Instead, if an automated system, such as a compiler or a runtime system, was capable of transforming nested parallel loops into an implementation with the desired iteration order, programmers could maintain the simplicity of a nested loop implementation while also reaping the benefits of the increased locality. Because the loops are parallel, the compiler is allowed to reorder iterations without expensive aliasing and dependency analysis. In this thesis, I tackle these two challenges of scheduling parallel loops. To reduce scheduling overheads of parallel loops, I introduce loop frames . Loop frames extend work stealing by adding first-class support for parallel-for loops. To improve the performance of nested parallel loops, I study the impacts of improved locality along multiple dimensions on cache locality and overall performance for a few benchmarks. Unlike the divide-and-conquer approach, where loops are implemented using exist- ing fork-join constructs, loop frames give loops a special representation in the runtime, allowing them to optimize work stealing of parallel-for-loop iterations. Loop frames support on-the-fly loop splitting , where intervals of loop iterations are dynami- cally split based on the unexecuted iterations when the split occurs. On-the-fly loop splitting allows unexecuted computation to be dynamically distributed more evenly than the divide-and-conquer algorithm, but it introduces non-determinism that com- plicates the theoretical analysis of work stealing [13, 5]. This thesis evaluates loop frames in practice and observes that loop frames incur substantially less overhead than the divide-and-conquer algorithm across the board, showing improvements in serial and parallel running times. I report on a study that examines the impact of improved locality in more than one 17 dimension of a multidimensional iteration space. I evaluate both cache-oblivious and cache-aware implementations. The results show that improved locality along multiple dimensions increases cache efficiency and boosts performance. That moti- vates a compiler transformation that automatically reorders nested loop iterations to achieve good locality along all dimensions. Divide-and-conquer parallel-for loops To motivate loop frames, let us first examine the overheads in the traditional divide- and-conquer implementation of parallel-for loops. Figure 1-2 presents pseudocode for DACParFor, a typical implementation of a parallel-for loop [41, Sec. 8.3]. More specifically, a parallel-for loop over iterations {0, 1, . . . , 𝑛− 1} is compiled to a call to DACParFor(0, 𝑛,Body), where the function Body encodes the loop body. As Figure 1-2 shows, DACParFor uses the parallel keywords spawn and sync to process the loop iterations using parallel divide-and-conquer recursion. DACParFor starts by splitting the loop into halves, until less than 𝐺 iterations are left in a single batch. Here, 𝐺 is a constant used to coarsen the recursion. The spawn keyword on line 3 allows the recursive call to execute iterations {𝑠, 𝑠 + 1, . . . ,𝑚 − 1} to run in parallel with the continuation in the parent caller, which executes iterations {𝑚,𝑚+1, . . . , 𝑒− 1}. The sync statement on line 7 acts as a local barrier that joins together the child computations spawned within the function. A work-span analysis [18, Ch. 27] of DACParFor shows that this algorithm is theoretically efficient. Consider the execution of DACParFor(0, 𝑛,Body), and for didactic simplicity, suppose that the every execution of Body performs Θ(1) instructions. The work 𝑇1 of this execution — the total number of instructions executed — is Θ(𝑛). The span 𝑇∞ of this execution — the length of a longest path of dependencies — is Θ(lg 𝑛+𝐺). The parallelism of the execution, which bounds the maximum possible speedup on parallel processors, is the ratio of the work divided by the span, that is, Θ(𝑛/(lg 𝑛+𝐺)). Randomized work stealing bounds the execution time of a program on 𝑃 processors by 𝑇𝑃 ≤ 𝑇1/𝑃 +𝑂(𝑇∞) [13, 5]. 18 DACParFor(𝑠, 𝑒,Body) 1 while 𝑒− 𝑠 > G 2 𝑚 = (𝑠+ 𝑒)/2 3 spawn DACParFor(𝑠,𝑚,Body) 4 𝑠 = 𝑚 5 for 𝑖 ∈ {𝑠, 𝑠+ 1, . . . , 𝑒− 1} 6 Body(𝑖) 7 sync Figure 1-2: Pseudocode for the typical divide-and-conquer implementation of a parallel-for loop. The function Body encodes the body of the parallel-for loop. The spawn and sync keywords allow for parallel execution of the operations in the function. The constant G is used to coarsen the recursion to mitigate performance overheads. Figure 1-2 also illustrates how parallel-for loops can exhibit substantial overheads in practice. As Figure 1-2 shows, a call to DACParFor(0, 𝑛,Body) spawns Θ(𝑛) function calls. Each spawned function call incurs overheads due to the function calls themselves and due to the operations each spawn performs to enable parallel execution. These overheads are particularly large in comparison to an ordinary serial loop, which typically performs just a few operations on registers per iteration to control the loop’s execution. For example, I measured these overheads on a simple daxpy microbenchmark, which computes 𝑦[𝑖] += 𝑎 ·𝑥[𝑖] over all elements of two given arrays 𝑥 and 𝑦 and scalar value 𝑎. On 1 processor, daxpy ran over 25× slower when implemented using DACParFor with 𝐺 = 1 compared to an ordinary serial-loop implementation with identical compiler optimizations. Previous research has examined ways to improve the performance of parallel-for loops. Typically, the overheads of DACParFor can be mitigated by increasing the constant 𝐺 for coarsening the recursion. Cilk, for example, typically sets 𝐺 = min{2048, 𝑛/8𝑃}, where 𝑛 is the number of loop iterations and 𝑃 is the number of worker threads [30]. But increasing 𝐺 reduces the parallelism of DACParFor, reducing the maximum possible parallel speedup the loop can achieve. As Section 6 discusses, previous work has explored many other approaches to optimizing parallel loops, including static and dynamic schemes to partition loop iterations [48, 28, 57, 19 40, 27, 7, 8], strategies to exploit common memory access patterns in loops [54], and techniques to reduce the synchronization overheads of work stealing [55, 56, 2]. Loop frames Loop frames provide an alternative to DACParFor to load balance parallel-for loops dynamically using work stealing. Loop frames aim to address the shortcomings of DACParFor while providing the same theoretical performance guarantees. Loop frames introduce a simple data structure and work-stealing protocol to randomized work-stealing. Rather than build upon spawn and sync constructs, the loop-frame data structure concisely represents an interval of loop iterations using two integers. In addition, the loop-frame protocol supports on-the-fly loop splitting, which allows a processor to steal half of the frame’s unexecuted loop iterations. In contrast, DAC- ParFor splits the 𝑛 iterations of a parallel-for loop at deterministic points, i.e., at iterations 𝑛/2, 𝑛/4, 3𝑛/4, etc., as Figure 1-2 shows. Loop frames compose with tradi- tional randomized work stealing and serve as a drop-in replacement for DACParFor in a work-stealing scheduler. On-the-fly loop splitting introduces non-determinism into the scheduling of parallel-for loops, complicating their analysis. Regardless, [47] shows that, despite their non-deterministic behavior, randomized work stealing with loop frames achieves the same theoretical performance guarantees as DACParFor. Implementation of loop frames I implemented loop frames in a beta version of the OpenCilk runtime system [51]. As Chapter 3 describes, my implementation extends the THE protocol from Cilk [25] to synchronize operations on loop frames efficiently. I compare the empirical perfor- mance of loop frames versus the traditional divide-and-conquer algorithm on 3 mi- crobenchmarks and 14 application benchmarks from the Problem-Based Benchmark Suite [10]. I evaluated loop frames and the divide-and-conquer algorithm with vari- 20 ous coarsening values 𝐺. For the divide-and-conquer algorithm, 𝐺 is used to coarsen the recursion as shown in Figure 1-2. For loop frames, 𝐺 is used to strip-mine the loop, which batches iterations into groups of size 𝐺, thereby amortizing overheads associated with loop frames. Empirical results show that loop frames significantly reduce scheduling overheads of DACParFor, reducing the coarsening parameter 𝐺 required for the best performance. Study of locality benefits in multidimensional loops To explore the benefits of increased locality of multidimensional iteration spaces, I performed a study using various implementations of simple benchmarks containing nested loops. The study focused on cache-oblivious and cache-aware approaches, comparing them to a naive nested loop implementation and analyzing their perfor- mance. The benchmarks used include mm, blur, and transpose. All benchmarks exhibit better cache locality when locality along multiple dimensions is improved. The implementations used achieve the regular, tiled (cache-aware), and multidimen- sional divide-and-conquer (cache-oblivious) iteration orders with a mix of recursive and iterative approaches. To eliminate the effects of recursive overheads, the study analyzes the effects of recursive coarsening. Results show that improved locality along more than one dimension improves performance for two of the three benchmarks. Contributions and outline This thesis makes the following contributions. • Loop frames, which extend randomized work stealing with first-class support for parallel-for loops. They support on-the-fly loop splitting, which provides flexible dynamic splitting of loop iterations and allows for scheduling and load balancing with significantly less overhead than the traditional divide-and-conquer algorithm. Finally, loop frames are fully compatible with an existing fork-join scheduler. 21 • An efficient implementation of loop frames based on the OpenCilk runtime sys- tem [51]. This implementation extends the THE protocol from Cilk [25] to coordi- nate parallel operations on loop frames efficiently. • An empirical evaluation of loop frames. I observe that on microbenchmarks and application benchmarks, loop frames and on-the-fly loop splitting incur substan- tially less overhead than the divide-and-conquer algorithm without sacrificing par- allel scalability. • A study on the impacts of reordering iterations of a multidimensional iteration space on cache locality and overall performance. The study measures the performance of multiple benchmarks and various implementations that achieve different iteration orders. It finds that for two out of the three benchmarks, reordering iterations can result in a significant reduction in cache misses and an improvement in bottom-line performance. The rest of this thesis is organized as follows. Chapter 2 reviews the dag model of multithreading, randomized work stealing, and cache-efficient algorithms. Chapter 3 describes loop frames and their custom stealing and synchronization protocols used in the OpenCilk implementation. Chapter 4 presents the empirical evaluation of loop frames. Chapter 5 presents the study of improved locality in multidimensional iteration spaces. Chapter 6 discusses related work on optimizing work stealing and parallel loops. Finally, Chapter 7 offers concluding remarks. 22 Chapter 2 Background This chapter reviews the dag model of multithreading [12, 13], which represents the execution of a recursive fork-join program and classical randomized work stealing. This chapter also reviews cache-oblivious algorithms and recursive divide-and-conquer in multiple dimensions. 2.1 The dag model of multithreading An execution dag 𝐺 = (𝑉,𝐸) represents the execution of a recursive fork-join program, where each vertex 𝑥 ∈ 𝑉 represents a strand — a sequence of serially executed instructions containing no parallel control — and edges represent control dependencies between strands. The execution of a spawn terminates the current strand and produces a vertex with out-degree 2. A sync creates a vertex with in- degree greater than 1. The execution dag 𝐺 is a series-parallel dag [21] with a single source vertex and a single sink vertex. To simplify the analysis, this thesis shall assume that each strand represents a single executed instruction. The work and span of the execution of a program can be measured in terms of its execution dag 𝐺. The work of 𝐺 is the total number of strands in 𝐺. The span of 𝐺 is the length of the longest path of dependencies in 𝐺. Task-parallel programming platforms that support the dag-parallel model include dialects of Cilk [11, 25, 19, 38, 36, 29], Fortress [4], Habanero [9], Habanero-Java [16], 23 Hood [14], HotSLAW [42], Java Fork/Join Framework [35], OpenMP [46, 6], Task Parallel Library [37], Threading Building Blocks (TBB) [49] and X10 [17]. 2.2 Randomized work stealing The operation of a randomized work-stealing scheduler can be described in terms of the execution dag 𝐺 = (𝑉,𝐸). A randomized work-stealing scheduler dynamically load balances a parallel computation across available threads, called workers . At each time step, each worker 𝑝 maintains an assigned strand , which is the strand that 𝑝 executes on that time step. A strand 𝑠 is said to be ready if all its predecessors in 𝐺 have been completed. Executing an assigned strand can enable a strand 𝑠′ that is a direct successor of 𝑠 in 𝐺 by making 𝑠′ ready. Each worker maintains a deque – a double-ended queue – of ready strands. Typically, a worker operates on its deque like a stack, pushing and popping strands from the tail of its deque. When a worker runs out of strands on its deque, it becomes a thief and randomly chooses another worker as its victim . If the selected victim has excess strands on its deque, then the thief can steal a strand from the head of the victim’s deque. A classical randomized work-stealing scheduler keeps track of a single ready strand using a spawn frame . Hence, the deque of ready strands is stored as a deque of spawn frames. Each spawn frame maintains the necessary information for a thief to begin executing that strand, e. g. the code address and register state to resume execution of the strand. When a thief steals from a victim, it removes the spawn frame at the head of the victim’s deque and assigns the stolen frame to itself. The extended deque is defined to include the assigned strand along with the other spawn frames in the deque. Worker deques support three methods: pushBottom, popBottom, and pop- Top. A spawn statement causes a worker to execute pushBottom to push its current frame onto the bottom of the deque. A worker executes popBottom when it returns from a spawned function. If popBottom runs on an empty deque, then popBottom does not return but instead causes the worker to become a thief. A 24 thief steals a spawn frame by calling popTop to remove the frame at the top of the deque. 2.3 Cache-oblivious and cache-aware algorithms A cache-oblivious algorithm [24] is an algorithm that achieves good cache locality without explicit knowledge of the cache parameters 𝐵 and 𝑀 , cache line length and cache size, respectively. In contrast, cache-aware algorithms explicitly use those cache parameters to achieve good cache-locality. Cache-oblivious algorithms have many advantages over cache-aware algorithms. Because they do not require precise tuning to the hardware they execute on, they are very portable. Obliviousness to cache parameters also makes them perform well on more complex memory hierarchies like multi-level caches and heterogeneous hardware. Conversely, cache-aware algorithms must be explicitly tuned for each level of the cache hierarchy. Optimal cache-oblivious algorithms achieve the lower asymptotic bound of cache misses. Several problems, including matrix multiplication, matrix transposition, and sorting, have known optimal cache-oblivious algorithms. In contrast, others, like the Cooley-Tukey FFT, are optimally cache-oblivious with specific parameter choices. These algorithms may require fine-tuning on particular machines, but the overall goal is to minimize the extent of tuning necessary. Typically, a cache-oblivious algorithm employs a recursive divide-and-conquer ap- proach, dividing a problem into smaller subproblems until the subproblem fits into the cache, regardless of the cache size. Hybrid algorithms may also combine cache-aware and cache-oblivious algorithms for specific cache sizes. One cache-aware alternative to nested loops in a multidimensional iteration space where iterations can be reordered is loop tiling [34]. Instead of iterating from start to finish for each dimension, the iteration space is divided into tiles , which are mul- tidimensional sub-parts of the iteration space. The size of each tile is tuned so that each data accessed in a tile fits in cache. In serial execution, each tile is processed as a 25 whole before starting the next. Most parallel executions allow tiles to run in parallel while each tile runs serially, which provides coarsening and guarantees the whole tile is executed on a single worker. Multiple levels of tiling are required to exploit all cache levels in a multi-level cache hierarchy. The cache-oblivious replacement for loop tiling is multidimensional recursive divide-and-conquer . It performs a series of splits, always by the largest dimension, and recursively visits both halves. This approach will traverse the iterations in the Morton order [43], sometimes referred to as the z-order. 26 Chapter 3 Loop frames This chapter describes loop frames, which extend randomized work stealing to provide first-class support for parallel loops in the runtime. Section 3.1 describes loop frames and how they support on-the-fly loop splitting in a randomized work-stealing sched- uler. Section 3.2 outlines the detailed synchronization protocol between the victim and the thieves that maintains correctness during concurrent execution. I present loop frames for parallel-for loops without any coarsening (i.e. executing chunks of iterations at a time to reduce the scheduling overhead) for simplicity. In the implementation, coarsening is implemented via loop strip-mining, where the loop is divided into chunks and a remainder. Each chunk is then executed as a whole, and the remainder is executed at the end. 3.1 Implementation A loop frame is an extension of a spawn frame. All spawn frames contain the execution context, which assists workers in pausing and resuming different sub-parts of the computation. In addition to that, loop frames store a range of loop iterations that remain to be executed and additional flags that assist synchronization. Just like spawn frames, loop frames are pushed onto the deque and popped from it by workers and thieves. A loop frame extends a spawn frame with start and end variables, which encode 27 LFParFor(𝑛,Body) StealLF_simple(𝐹 ) 1 𝐹 = LoopFrame(1, 𝑛,Body) 1 𝐹 ′ = CopyLoopFrame(𝐹 ) 2 pushBottom(𝐹 ) 2 𝑚 = (𝐹 ′.start + 𝐹 ′.end)/2 3 LFParForHelper(𝐹, 0) 3 𝐹.end = 𝑚 4 popBottom() 4 𝐹 ′.start = 𝑚+ 1 5 pushBottom(𝐹 ′) LFParForHelper(𝐹, 𝑖) 6 LFParForHelper(𝐹,𝑚) 1 while Next(𝐹 ) 7 popBottom() 2 𝐹.Body(𝑖) 3 𝑖 ++ Figure 3-1: Pseudocode for loop frame operations. LFParFor is used by the worker that enters the loop frame, while SimpleSteal is used by the thief when stealing a loop frame from a victim. LFParForHelper is used by both to actually execute all iterations and contains a familiar branch, body, and increment that can be found in all for loops. Next returns true if there are more iterations available, and false otherwise. the parallel loop iterations in the half-open interval [start , end) = {start , start + 1, . . . , end − 1}. At any time during the execution, start and end represent the itera- tions that haven’t been executed yet. While the divide-and-conquer implementation will push and pop spawn frames onto the worker’s ready deque while executing the iterations, LFParFor only operates on the start and end variables, reducing the scheduling overhead per iteration. Figure 3-1 contains pseudocode for the operations on a loop frame. Here, I assume that the Next function and lines 1–3 in StealLF_simple operate atomically. The detailed implementation of the protocol that provides these atomicity guarantees is presented in Section 3.2. Let us walk through the pseudocode in Figure 3-1. A parallel-for loop over 𝑛 iterations is implemented using a call to LFParFor(𝑛,Body) followed by a sync, where the function Body encodes the body of the parallel-for loop. The sync ensures that all workers executing iterations of the loop have completed. When a worker executes LFParFor(𝑛,Body), line 1 first creates a new loop frame 𝐹 representing iterations [1, 𝑛). Line 2 pushes 𝐹 onto the worker’s deque, and then line 3 calls LFParForHelper to direct the worker to start executing the loop from iteration 0. 28 These lines thus assign iteration 0 to the worker and allow iterations [1, 𝑛) to be stolen. Upon returning from LFParForHelper, line 4 pops 𝑓 off the bottom of the worker’s deque or, if the pop fails (because the loop frame has been stolen completely), causes the worker to become a thief. The function LFParForHelper(𝐹, 𝑖) causes the worker to execute loop itera- tions starting from 𝑖 = 𝐹.start − 1 one at a time until reaching iteration 𝐹.end . In particular, line 1 calls Next(𝐹 ) to check if there are more loop iterations to exe- cute. Next(𝐹 ) increments 𝐹.start and returns 𝐹𝑎𝑙𝑠𝑒 if 𝐹.start > 𝐹.end , or 𝑇𝑟𝑢𝑒 otherwise. To steal from the loop frame 𝐹 using on-the-fly loop splitting, a thief calls the func- tion StealLF_simple(𝐹 ), which operates as follows. Line 1 first produces a copy 𝐹 ′ of 𝐹 , and then line 2 computes the midpoint 𝑚 between 𝐹 ′.start and 𝐹 ′.end . Line 3 then performs on-the-fly loop splitting: The thief claims the unexecuted iterations [𝑚,𝐹 ′.end) by setting 𝐹.end = 𝑚, thereby stealing half of the unexecuted itera- tions. Finally, like an ordinary worker, the thief pushes 𝐹 ′ representing the iterations [𝑚+ 1, 𝐹 ′.end) onto its deque (lines 4 and 5) and then calls LFParForHelper on line 6 to start executing its loop iterations from iteration 𝑚. Finally, line 7 pops 𝐹 ′ off the bottom of the thief’s deque and allows the thief to resume work stealing. 3.2 Synchronization protocol This section describes the synchronization protocol used in the OpenCilk runtime sys- tem [51] implementation of loop frames. Section 3.1 described loop frames assuming that two functions, Next and StealLF, execute atomically. This section describes how these functions implement an efficient protocol, based on the THE protocol in Cilk [25], to synchronize updates to loop frames. Let us first review the THE protocol for coordinating accesses onto worker de- ques [25]. Each deque maintains a head pointer 𝐻 and a tail pointer 𝑇 to track the head and tail of the deque, as well as a mutex lock 𝐿. Workers normally push and pop frames onto the deque by updating 𝑇 . A thief can pop a frame from the top of the 29 deque by incrementing 𝐻. To optimize deque operations, the THE protocol allows the worker to speculatively update 𝑇 , and only requires the worker to acquire 𝐿 when the deque appears to be empty. After acquiring 𝐿, the worker can determine whether the deque was really empty or there was an uncompleted steal attempt in progress that has since failed. A thief, meanwhile, always acquires 𝐿 before updating 𝐻, ensuring a single steal attempt happening at a time and providing a correctness guarantee for the worker’s second, protected check. To ensure the atomicity of Next and StealLF, a simple but inefficient approach is to give a thief exclusive access to a victim’s loop frame 𝐹 during StealLF. In- tuitively, after locking the deque, the thief would increment 𝐻, update 𝐹.end , then decrement H and release the deque lock. The Next function, meanwhile, would decrement 𝑇 before attempting to update 𝐹.start . If the worker discovered 𝐻 > 𝑇 after decrementing 𝑇 , it would acquire the deque lock before completing its update to 𝐹.start . Although this approach is simple, it turns out to be inefficient, as the worker will push and pop frames for every iteration. Figure 3-2 presents pseudocode for the optimized synchronization protocol on loop frames, which I found to be 40% faster than the simple protocol. Conceptually, rather than use 𝐻 and 𝑇 to coordinate all operations on loop frames, the optimized protocol operates on the start and end variables of a loop frame directly, similarly to how the THE protocol operates on 𝐻 and 𝑇 . The Next function implements the routine used in LFParForHelper for a worker to get the next loop iteration to execute from a loop frame 𝐹 . A thief steals from a deque 𝐷 by calling Steal(𝐷), which calls StealLF(𝐹,𝐷) on line 8 if line 7 discovers that the frame 𝐹 at the top of 𝐷 is a loop frame. The Steal function thus extends the steal function in Cilk-5 [25] to handle loop frames. Let us walk through the pseudocode in Figure 3-2, starting with Next. Line 1 gets the next iteration index 𝑖 to execute and speculatively increments 𝐹.start . Line 2 checks if the loop frame is out of iterations by checking whether the new value of 𝐹.start exceeds 𝐹.end . If the test fails, the worker returns 𝑇𝑟𝑢𝑒 (line 8). If the test succeeds, then the worker locks its own deque (line 3) to gain exclusive access to 30 Next(𝐹 ) StealLF(𝐹,𝐷) 1 𝑖 = 𝐹.start ++ 1 𝐹 ′ = CopyLoopFrame(𝐹 ) 2 if 𝐹.start > 𝐹.end 2 𝑚 = (𝐹.start + 𝐹.end)/2 3 lockOwnDeque() 3 𝐹.end = 𝑚 4 if 𝐹.start > 𝐹.end 4 if 𝐹.start > 𝐹.end 5 unlockOwnDeque() 5 𝐹.end = 𝐹 ′.end 6 return 𝐹𝑎𝑙𝑠𝑒 6 𝐻 −− 7 unlockOwnDeque() 7 unlock(𝐷.L) 8 return 𝑇𝑟𝑢𝑒 8 return FAIL 9 if 𝐹.start < 𝐹.end Steal(𝐷) 10 𝐻 −− 1 lock(𝐷.L) 11 unlock(𝐷.L) 2 𝐹 = 𝐻 ++ 12 𝐹 ′.start = 𝑚 3 if 𝐻 > 𝑇 13 return SUCCESS 4 𝐻 −− 5 unlock(𝐷.L) 6 return FAIL 7 if isLoopFrame(𝐹 ) 8 return StealLF(𝐹 ) 9 unlock(𝐷.L) 10 return SUCCESS Figure 3-2: Pseudocode for the optimized synchronization protocol to coordinate op- erations on loop frames. The argument 𝐷 denotes a victim’s deque. The functions lockOwnDeque and unlockOwnDeque acquire and release the lock on the exe- cuting worker’s deque. The function isLoopFrame tests if the given frame is a loop frame. 𝐹 . The worker rechecks if 𝐹.start > 𝐹.end on line 4 and then returns either 𝐹𝑎𝑙𝑠𝑒 (line 6), if the loop frame is out of iterations, or 𝑇𝑟𝑢𝑒 otherwise (line 8). In either case, the worker unlocks its deque before returning. Function StealLF(𝐹,𝐷) in Figure 3-2 is a more detailed version of StealLF_simple in Figure 3-1, additionally containing the synchronization protocol on lines 4–11. After setting 𝐹.end = 𝑚 on the victim loop frame 𝐹 , line 4 checks if 𝐹 now contains an invalid interval of loop iterations by testing 𝐹.start > 𝐹.end . If so, then lines 5–8 restore 𝐹.end and 𝐻 to their previous values, unlock the victim’s deque, and report a failed steal attempt. Lines 5–8 perform the same rollback as lines 4–6. Otherwise, line 9 checks if 𝐹 still contains any loop iterations after the steal by 31 testing 𝐹.start < 𝐹.end . If so, then line 10 restores the victim’s head pointer to leave 𝐹 at the top of the victim’s deque. If not, the thief leaves 𝐻 unchanged, effectively removing 𝐹 from the top of the victim’s deque and exposing frames below 𝐹 on the deque to be stolen. Finally, the thief reports a successful steal (line 13), after which it will begin executing iteration 𝑚. Note that line 12 sets 𝐹 ′.start = 𝑚 as opposed to Line 4 of the simplified protocol (depicted in Figure 3-1), which sets 𝐹 ′.start = 𝑚 + 1. That’s because the former is only stealing the upper half of the iterations while the latter then also immediately takes the first iteration 𝑚 from the loop frame 𝐹 ′ before 𝐹 ′ is pushed onto its deque. In other words, it summarizes both operations into one: stealing the upper half of iterations and taking the first one to make it unavailable for stealing. I believe that this distinction makes it clearer which iterations are actually getting stolen from 𝐹 in the synchronization protocol in Figure 3-2. The definition of correctness for this protocol follows from preserving the semantics of a for-loop: it executes every iteration in the specified range exactly once, and it doesn’t execute any iterations outside this range. The proof of correctness mirrors that of the THE protocol [25], assuming sequential consistency. While the THE protocol guarantees that a frame will only be taken (and executed) by either a worker or a thief, the loop frame protocol provides the same guarantee for iterations. Because a thief first acquires the victim deque’s lock before attempting to steal, only the worker and one thief can operate concurrently on a deque and, therefore, on the same loop frame 𝐹 . If 𝐹 contains no iterations, then both the victim and thief will fail to take any iterations from 𝐹 . Otherwise, suppose a thief attempts to steal iterations [𝑚,𝐹.end) from 𝐹 . If 𝐹 contains more than 1 iteration, then 𝐹.start < 𝑚, and the worker can update 𝐹.start without impinging on any iterations in [𝑚,𝐹.end), Hence, both updates to 𝐹 can happen concurrently without issue. If 𝐹 contains 1 iteration, then the thief and victim must coordinate to determine which worker gets iteration 𝑚. If the thief discovers 𝐹.start > 𝐹.end on line 4 of StealLF, then line 1 in Next happened before this check, and the thief will restore 𝐹.end to its original value, allowing the victim to claim the iteration. Otherwise, line 1 in Next happens 32 after this check, causing the thief to steal iteration 𝑚 and the victim to discover that 𝐹 is out of iterations. 33 34 Chapter 4 Empirical evaluation of loop frames This section presents my empirical evaluation of loop frames. I tested my implemen- tation of loop frames in version 1.0 of the OpenCilk runtime [51] and compared its performance to that of the traditional divide-and-conquer algorithm, DACParFor . I evaluated both algorithms on both microbenchmarks and application benchmarks. Results show that loop frames achieve significantly lower scheduling overheads than the reference DacParFor implementation. On microbenchmarks, which used no coarsening to directly expose the scheduling overheads, loop frames outperformed the reference implementation by a factor of up to 2.7×. Additionally, application benchmarks from the Problem-Based Benchmark Suite (PBBS) [53] were used to demonstrate that loop frames also perform well in a real-world setting. Those results show loop frames require less coarsening than the reference implementation to achieve good performance, providing more evidence for a reduction in scheduling overheads. All experiments in this chapter were run on compute nodes on the MIT Supercloud system [50]. Each compute node is a dual-socket Intel Xeon Gold 6248 system with a total of 384 GB of main memory. Each Xeon is a 2.50GHz 20-core CPU. To reduce the effects of noise on the performance measurements, I disabled hyperthreading and pinned all workers to cores on a single socket. All the benchmarks were compiled with OpenCilk [51], based on LLVM 9, using the highest optimization level. All results are median running times, aggregated from 10 trials. Specific compilation strategies are described in each section. 35 Figure 4-1: Median speedup with respect to number of processors for the daxpy, nqueens, and mandelbrot, appearing from left to right. Each plot features loop frames in orange squares and the divide-and-conquer algorithm in yellow triangles. The daxpy plot also includes the results of the simpler protocol for loop frames (briefly mentioned in Chapter 3) in red diamonds. The nqueens plot includes in blue crosses an alternative implementation of the parallel-for loop where each iteration is spawned off in sequence. 4.1 Microbenchmark experiments Figure 4-1 presents the performance results for the three microbenchmarks: daxpy, nqueens, and mandelbrot. I compare the overheads of parallel-for loops implemented using loop frames with the reference DacParFor implementation on these programs. Because each program makes heavy use of parallel-for loops, the performance of each program depends substantially on the overhead of the parallel-for-loop implementa- tion. The parallel-for loops in each program were implemented by directly inserting calls to runtime ABI functions into the code to avoid modifications to the Open- Cilk compiler. The three programs vary in their computational intensity, that is, the ratio of arithmetic operations to memory operations. To evaluate the overheads of the parallel-for-loop implementations, I study these programs with no coarsening, i.e., with 𝐺 = 1 in DACParFor. I examine the results for each of these microbenchmarks results in turn. The results for the daxpy microbenchmark, which was briefly described in Chapter 1, are shown in the leftmost plot in Figure 4-1. All running times in this plot are nor- malized to the 1-processor running time of the DACParFor implementation. daxpy 36 runs a geometric mean 2.7× faster when using loop frames instead of DACParFor on all processor counts. The middle plot in Figure 4-1 presents the results for nqueens. This program per- forms a parallel recursive backtracking search to count the solutions to the 𝑛-queens problem. At each level of recursion, the program performs a parallel-for over the 𝑛 positions in a given row. The figure presents results for 𝑛 = 13, meaning that each parallel-for loop contains just 13 compute-intensive iterations. As the figure shows, when using loop frames, nqueens runs a geometric mean 1.3× faster than the DAC- ParFor version on all processor counts. Furthermore, the figure shows that loop frames cut in half the performance gap between the DACParFor implementation and an optimized version of nqueens in which the parallel-for loops spawn off each iteration in sequence after performing an earlier viability check. This pruning strat- egy reduces the parallel overhead of spawning short tasks that don’t contribute to the parallelism and is not possible for parallel loops, which is why it outperforms both loop-based implementations. The rightmost plot in Figure 4-1 presents the performance results for mandelbrot, which plots a picture of the Mandelbrot fractal with a specified size. The results in the plot are normalized to the serial projection of the program [25, 52], in which all parallel constructs are replaced with their serial counterparts. mandelbrot using loop frames runs a geometric mean 2× faster compared to using DACParFor on all processor counts. 4.2 Application experiments I evaluated the performance of loop frames on 14 applications from the Problem- Based Benchmark Suite (PBBS) [53]. In particular, I examined the performance of loop frames and DACParFor on these applications with different coarsening val- ues 𝐺. I selected 14 applications from PBBS for which changing 𝐺 for the default DACParFor implementation of parallel-for loops from 𝐺 = 2048 to 𝐺 = 64 pro- duced at least a 2% change in running time. This selection criterion aims to identify 37 programs for which I anticipate the performance differences between DACParFor and loop frames to have a measurable impact on program running time. To evaluate loop frames on these benchmarks, I modified the OpenCilk runtime system to implement a runtime-ABI function for cilk_for loops [30] using loop frames. I developed a similar runtime-ABI function using DACParFor to provide a fair comparison between the algorithms. All PBBS benchmarks were compiled to use this runtime-ABI function for all cilk_for loops. To test a coarsening value 𝐺, I directed the OpenCilk compiler to use 𝐺 when coarsening loop statically, via loop strip-mining, and when using a runtime-computed coarsening for the cilk_for loop [30]. In addition, I compiled all programs with exceptions disabled, due to lack of exception support in the modified runtime. Table 4.1 presents the performance results for the PBBS benchmarks. Many of the PBBS benchmarks have parallel-for loops with large loop bodies, which minimizes the total contribution of parallel-loop overhead to the total running time of the program. Nevertheless, I found that, with smaller values of 𝐺, the benchmarks perform better on both 1 and 20 workers when using loop frames than when using DACParFor. With 𝐺 = 64, for example, the benchmarks ran 2−6% faster when using loop frames versus DACParFor. The performance difference between loop frames and DACParFor decreases as 𝐺 is increased because a larger value of 𝐺 decreases the total contribution of parallel-loop overhead to the program’s running time. Nevertheless, the results indicate that loop frames allow applications to achieve efficiency and scalability with smaller coarsening values. Good performance at lower coarsening could speed up other applications that don’t contain as much parallelism as PBBS. However, such applications would need to be benchmarked directly to confirm this hypothesis. I note that Table 4.1 shows loop frames performing worse than DACParFor on several benchmarks with 𝐺 = 2048. For 𝐺 = 2048, I believe that loop frames are performing worse than intended due in part to a performance bug in the implemen- tation. For a parallel-for loop over 𝑛 iterations, after strip-mining the loop by 𝐺, the loop-frame implementation executes 𝑛 mod 𝐺 iterations serially after the rest of the loop-frame performs a sync. In contrast, the DACParFor implementation allows 38 those same iterations to execute in parallel with the other iterations of the loop. Any future implementation using loop frames should correct this bug. detbfs ndbfs chull irefine dict imis imatch ndmatch pkruskal remdup ist ndst pks prange DAC, 𝑇1, 64 15.333 12.266 9.440 48.000 2.045 2.890 10.380 6.343 42.666 3.131 16.666 16.200 26.183 22.018 LF, 𝑇1, 64 14.433 12.033 9.212 47.250 2.002 2.800 9.623 6.120 41.266 2.902 17.200 17.466 25.933 19.081 Ratio 1.062 1.019 1.025 1.016 1.021 1.032 1.079 1.036 1.034 1.079 0.969 0.928 1.010 1.154 DAC, 𝑇20, 64 0.976 1.038 0.743 3.150 0.120 0.186 0.611 0.366 3.146 0.199 1.194 9.473 1.892 1.309 LF, 𝑇20, 64 0.963 0.809 0.742 3.205 0.116 0.184 0.575 0.356 3.096 0.191 1.165 9.383 1.891 1.295 Ratio 1.013 1.009 1.001 0.983 1.034 1.022 1.062 1.028 1.016 1.042 1.025 1.010 1.001 1.011 DAC, 𝑇1, 256 14.700 12.033 9.193 47.750 2.004 2.866 9.500 6.050 41.566 2.995 15.966 15.900 25.716 19.533 LF, 𝑇1, 256 14.500 11.900 9.112 47.250 1.983 2.810 9.216 6.056 41.233 2.900 15.566 16.099 25.500 19.111 Ratio 1.014 1.011 1.009 1.011 1.011 1.020 1.031 0.999 1.008 1.033 1.026 0.988 1.008 1.022 DAC, 𝑇20, 256 3.100 0.816 0.743 3.165 0.117 0.186 0.572 0.353 3.126 0.194 1.162 8.943 1.895 1.313 LF, 𝑇20, 256 0.970 0.803 0.747 3.180 0.115 0.183 0.558 0.349 3.130 0.191 1.151 9.083 1.877 1.298 Ratio 1.000 1.004 0.995 0.995 1.017 1.016 1.025 1.011 0.999 1.016 1.010 0.985 1.010 1.012 DAC, 𝑇1, 2048 14.266 11.799 9.081 46.750 1.958 2.796 9.013 6.013 40.233 2.867 15.400 15.500 25.133 18.970 LF, 𝑇1, 2048 14.233 11.900 9.090 47.450 1.958 2.706 9.083 6.010 40.433 2.844 15.500 15.633 25.250 19.073 Ratio 1.002 0.992 0.999 0.985 1.000 1.033 0.992 1.000 0.995 1.008 0.994 0.991 0.995 0.995 DAC, 𝑇20, 2048 0.956 0.806 0.739 3.130 0.114 0.184 0.554 0.353 3.106 0.190 1.159 9.286 1.857 1.280 LF, 𝑇20, 2048 0.989 0.830 0.746 3.190 0.114 0.184 0.552 0.351 3.126 0.192 1.162 8.896 1.870 1.297 Ratio 0.967 0.971 0.991 0.981 1.000 1.000 1.004 1.006 0.994 0.990 0.997 1.044 0.993 0.987 Table 4.1: Comparison of running times of programs from PBBS [53] with grainsizes 64, 256, and 2048. Each set of rows contains median running times of the benchmarks run on 1 or 20 workers using DACParFor (DAC) and loop frames (LF), as well as the ratio (Ratio) of the DACParFor running time divided by the loop-frame running times. A ratio greater than 1 indicates that DAC is slower than LF. The running times in italic represent the fastest run for that benchmark and worker count. 39 40 Chapter 5 Study of improved locality in nested loop applications In this chapter, I present a study of the impact of increased cache locality on three simple benchmarks that contain nested loops. I implemented each benchmark using cache-oblivious and cache-aware approaches and compared their performance to a naive nested loop implementation. The results show that improved locality in more than one dimension can result in performance gains up to 1.7× when better temporal and spatial locality can be achieved. Table 5.1 summarizes the benchmarks used in the study. mm computes a product of two matrices. blur performs a convolution of a trivial 2D filter on a 2D image. transpose computes an in-place matrix transposition. mm reuses each element 𝑁 times, blur reuses each element 𝐾 times, and transpose contains no reuse. All benchmarks satisfy several criteria that make them relevant to this study. First, memory operations represent a significant proportion of the running time, so better locality can measurably affect overall performance. Second, locality along more than one dimension improves spatial or temporal locality. Otherwise, loops in the nests can be reordered to guarantee locality in the desired dimension. This criterion also implies the nested loop implementation does not achieve the optimal number of cache misses. This chapter is organized as follows. Section 5.1 explains the importance and 41 Property mm blur transpose Parameters N N, K N Loop nest rank (parallel) 3(2) 2(2) 2(2) Total distinct reads 2𝑁2 𝑁2 𝑁2 −𝑁 Total reads 2𝑁3 𝑁2𝐾2 𝑁2 −𝑁 Table 5.1: Benchmarks used in the study and their properties. Loop nest rank shows the number of nested loops, which corresponds to the number of dimensions in the iteration space, with the number in parenthesis indicating how many of those loops are parallel loops. Total distinct reads and total reads demonstrate whether there is reuse. terminology of locality in multidimensional loops. Section 5.2 lists the various im- plementations used in the study and describes their properties. Section 5.3 describes each benchmark in more detail and presents the empirical results for each. Finally, Section 5.4 summarizes the takeaways we can draw from the study. 5.1 Locality in multidimensional loops This section discusses locality in multidimensional loops and when it benefits perfor- mance. It explains the relationship between iteration order and locality. The section also explains the distinction between spatial and temporal locality and how they impact cache operation. Finally, an example of iterating a 2D array illustrates the importance of locality in different dimensions. The traversal order of a multidimensional iteration space is what determines its locality. Recursive and iterative methods traversing a multidimensional iteration space execute in some linear order. Even when run in parallel, each parallel processor will execute iterations one after the other. We rely on parallel loops because they allow us to reorder iterations, but we only analyze locality from the perspective of the linear order. In the fork-join parallel- programming model, reordering iterations of nested loops is always legal if all loops are parallel. Conversely, prior knowledge of application semantics or expensive depen- dency and aliasing analysis are required to reorder serial loops. Finally, randomized work-stealing bounds the additional cache misses incurred due to steals [1]. 42 Locality in a particular dimension of an iteration space is beneficial if executing iterations with different values of that coordinate would result in memory accesses to locations far apart. This almost translates to "executing iterations for the same values of the coordinate results in memory accesses close together or at the same location in memory," except for that to be true, good locality in multiple dimensions at a time is often required. The distinction between spatial and temporal locality is essential. Temporal lo- cality can only be achieved if there is reuse , meaning that a location in memory is accessed multiple times, specifically during multiple iterations in an iteration space. That reuse can result in temporal locality if those points are executed closely in the resulting iteration order. On the other hand, spatial locality means nearby iterations access nearby elements in memory. This distinction is important because of the way caches operate. Both types of locality increase the likelihood that a memory location is already in the cache. Still, spatial locality does so by another element on the same cache line already residing in the cache. Hence, the benefits of spatial locality generally do not extend past the size of a cache line 𝐵. For example, let us consider a simple task of iterating a 2D array laid out con- tiguously in memory in row-major layout . Row-major layout means that the 𝑥 coordinate of the array is the fastest-running dimension. Let us assume we are iter- ating the array using two nested loops with dimensions (indices) 𝑖 and 𝑗, where 𝑖 and 𝑗 correspond to 𝑥 and 𝑦 coordinates of the array, respectively. The best locality is achieved if iterating over 𝑖 is done in the inner loop, as iterations with varying 𝑖 and the same 𝑗 access nearby elements, while the converse is not true. Hence, locality along 𝑗 is more vital because it results in good spatial locality. Of course, locality along 𝑖 also matters as we want to access consecutive elements in memory in con- secutive order. Still, there is no benefit to executing iterations with the same 𝑖 and different 𝑗 soon after the other. Hopefully, this example is familiar to the reader and helps clarify the concepts of locality in different dimensions, but it would not serve well as a benchmark in this study. 43 5.2 Implementations This section describes the various implementations used in the study that traverse the multidimensional iteration space. The primary purpose was to study the effect of changing the order in which iterations are executed. However, I also wanted to eliminate other effects on the overall performance, namely function call overheads in recursive implementation. Therefore, I added a "grainsize" parameter (𝐺) to coarsen the recursion. Additionally, multiple implementations may visit the iteration space in the same order but use different approaches to separate the effects of increased locality from the impact of recursive call overhead. For each benchmark, if the benchmark is called , the following implemen- tations are available: • is a simple nested loops implementation. • -tiled iterates the space by dividing the space into 𝐺×𝐺 tiles. • -dac uses multidimensional divide-and-conquer recursion, always splitting the largest dimension. All dimensions are split down to size 𝐺. • -dac-loop uses divide-and-conquer recursion but fully splits the outer di- mension first. Only the inner dimension is coarsened (split to size 𝐺); other dimen- sions are split to size 1. It achieves the same iteration order as . • -dac-loop-tiled is like -dac-loop but both dimensions are coars- ened. It achieves the same iteration order as -tiled. • -dac-full and -best also recursively split the third (serial) dimen- sion and only apply to mm. They are described in more detail in Subsection 5.3.1. Figure 5-1 shows the three distinct iteration orders in the implementations. and -dac-loop achieve the regular order, -tiled and -dac-loop-tiled achieve the tiled order, and -dac achieves the Mor- ton order (also referred to as z-order). -dac-full and -best achieve the Morton order in all 3 dimensions. 44 Figure 5-1: Regular (left), tiled (middle), and Morton (right) iteration orders for a two-dimensional iteration space. In this case, 𝑁 = 16 and 𝐺 = 4. Implementation Base case size 𝐺 affects order 1 N/A -tiled 𝐺2 Yes -dac 𝐺2 Yes -dac-loop 𝐺 No -dac-loop-tiled 𝐺2 Yes -dac-full* 𝐺3 Yes -best* 𝐺3 Yes Table 5.2: Implementations of nested loops used in the study. Base case size re- duces both the recursive call overhead and parallelism. While neither nor -tiled use recursion, -tiled still has a base case, which is one tile. -best and -dac-full only apply to mm. The grainsize parameter 𝐺 may affect both recursion coarsening and iteration order. 𝐺 is the size of the tile in both -tiled and -dac-loop-tiled, but because the latter uses recursion, it also impacts the recursion base-case size, which is equal to one tile. -dac also uses a tile as a base case, which means that the resulting order is not exactly the z-order . However, the locality remains similar as long as one tile fits in cache. Finally, -dac-loop only allows coarsening in the innermost dimension, and 𝐺 only affects the coarsening. Table 5.2 summarizes the effects of the grainsize parameter, 𝐺, on recursion coarsening and iteration order for all implementations. Despite having a coarsening parameter, -dac is cache-oblivious, as 𝐺 is only used to reduce the recursion overhead and not to optimize the residual size in cache. However, if 𝐺 becomes too large, the cache efficiency of -dac can be lost. 45 5.3 Benchmarks and performance results This section presents more detail on each benchmark and the performance results. All benchmark implementations were compiled with version 2.0 of OpenCilk and -O3 -DNDEBUG optimization flags. I ran all experiments on an AWS c5.metal machine, which features an Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz processor with 48 physical cores across 2 NUMA nodes. Each core has its own 32 KiB L1d and L1i caches and a 1 MiB L2 cache. Meanwhile, the 35.75 MiB L3 cache is shared between all cores on one NUMA node. L1 and L2 are 8-way associative, L3 is 11-way associative, and all contain 64 B cache lines. Because all benchmarks operate on 32-bit (4 B) values, the effective cache line size 𝐵 is 16. For each benchmark, I explored the impact of 𝐺 on the overall performance in both serial and parallel settings. I measured the median running time (minimum when running on one worker) from 10 trials for all power-of-two values of 𝐺 from 1 to 128. I repeated those measurements for all worker counts between 1 and 24, always utilizing distinct physical cores on a single NUMA node to reduce performance variability. All plots in this chapter plot the speedup (larger is better), which is calculated as 𝑆(𝑡) = 𝑡𝑛𝑜𝑟𝑚 , where 𝑡𝑛𝑜𝑟𝑚 is the running time used for normalization. Generally, 𝑡𝑛𝑜𝑟𝑚 is the𝑡 minimum running time of the serial projection of the nested loop implementation . One exception is when plotting the speedup with respect to grainsize on a certain number of workers, where 𝑡𝑛𝑜𝑟𝑚 is the median running time of for that worker count. Parallel scalability was not affected by 𝐺 for most implementations. Because the problem sizes had to be large for the data to not fit in cache, ample parallelism was always available. Therefore, I only evaluate some of the results to decrease the amount of redundant information. Specifically, I only look at speedup with respect to 𝐺 for 1 and 24 workers and parallel scalability for 𝐺 ∈ {1, 128}. I chose inputs with sizes that are not precisely powers of two for all benchmarks. It is easiest to implement recursive divide-and-conquer codes if the sizes of inputs are always powers of two, making it easy to divide the iteration space in half recursively. 46 However, such sizes can also result in conflict misses because real-world caches are not fully associative, and their sizes are often a power of two. Therefore, two elements in the same column of a 2D input might have the same low-order bits and map to the same cache set. One potential solution is to pad the 2D memory to offset the conflicts, but I opted for running the benchmarks on inputs with sizes that are not powers of 2. 5.3.1 Matrix multiply The matrix multiply benchmark mm computes the product of two row-major square matrices 𝐴 and 𝐵 of size 𝑁 ×𝑁 and stores the result in matrix 𝐶 of the same size. Because it performs 𝑂(𝑁3) arithmetic and 𝑂(𝑁3) memory operations on data of size 𝑂(𝑁2), it is an excellent benchmark for this study. It exhibits a factor of 𝑁 reuse on both reads and writes and enables excellent spatial and temporal locality benefits. Empirical results show that it can significantly benefit from improved locality in multiple dimensions. Figure 5-2 shows simplified code that implements mm using nested loops. It is a three-dimensional loop nest with dimensions 𝑖, 𝑗, and 𝑘. Each loop only iterates over two matrices and always refers to the same element in the last one. For example, the 𝑖 loop strides through 𝐴 and 𝐶 but always refers to the same element in 𝐵. It follows that locality in all three dimensions is essential. On the other hand, the 𝑗 loop is the only one that does not stride through either matrix, meaning that 𝑖 and 𝑘 locality are more critical for spatial locality. The order of loops in the nest may not achieve the best locality but is used for consistent comparisons with other benchmarks. The code would be correct under any permutation of the loop order in the nest. However, because the 𝑘 loop must be serial, different loop orders might decrease the parallelism or increase the scheduling overheads. In addition to the common implementations, I implemented mm-dac-full and mm-best. Those implementations also recursively split the 𝑘 dimension, achieving good locality along all three dimensions instead of just 𝑖 and 𝑗. The difference between 47 Figure 5-2: Simplified nested loop implementation of matrix multiply. The outer two of the three loops in the nest are parallel. Figure 5-3: Speedup of mm implementations with respect to grainsize, run on a single worker. them is that mm-dac-full splits along one dimension (the largest one in the current subproblem) for every level of recursion, while mm-best splits all three at the same time. For square matrices, both implementations perform splits in the same order but differ in the number of levels of recursion and spawned tasks by a constant factor. For the evaluation, I chose a size of 𝑁 = 2071 to avoid conflict misses and make sure that a column of a matrix does not fit in the L1 cache. A whole row does fit, but when iterating over a column, each element is on its own cache line. To cache the entire column, 𝐵 ·𝑁 = 64 bytes * 2071 memory needs to fit in cache, which is larger than L1. 48 Figure 5-3 shows the impact of the grainsize parameter on the performance of all implementations of mm on a single worker. The performance of mm-dac and mm-dac-loop is independent of 𝐺. They run 1.21× and 0.95× faster than the serial elision of mm. Because 𝐺 does not noticeably affect the locality of those implementa- tions, this lack of performance dependence on 𝐺 means that recursive overhead does not measurably contribute to the overall running time. The large size of the base case can explain that: it performs 𝑂(𝐺2𝑁) work, which is 𝑂(𝑁) even with 𝐺 = 1. On the other hand, larger 𝐺 improves the performance of mm-dac-loop-tiled and mm-tiled because it improves locality along the 𝑗 dimension. With large enough 𝐺, mm-dac-loop-tiled achieves enough locality to catch up to mm-dac, which is expected. Finally, mm-best and mm-dac-full perform much better with larger 𝐺 because their base case does 𝑂(𝐺3) work, which is not enough to amortize the cost of recursive calls when 𝐺 is low. They both achieve the best performance with 𝐺 = 128: mm-best outperforms the baseline by 1.64× while mm-dac-full outperforms it by 1.55×. The performance difference between mm-dac-full and mm-best can likely be at- tributed to a combination of performance variability and different overheads of re- cursion and spawning parallel tasks. The serial projection of mm-best is sometimes slower than the serial projection of mm-dac-full. It is less clear why mm-tiled has similar performance to mm-dac-loop-tiled for 𝐺 = 1 but outperforms it at larger 𝐺. It also measurably dips in performance at very large grainsizes. Unfortunately, I ran out of time to thoroughly investigate these de- viations. Still, they could be explained by divergent compiler transformations arising from the differences in control flow between mm-tiled and mm-dac-loop-tiled. Figure 5-4 shows the relationship between performance and 𝐺, this time for exe- cution on 24 workers. The plot looks almost identical except for mm-tiled performing worse for 𝐺 ∈ {64, 128}. The performance difference is even more apparent in Figure 5-5, which shows the parallel scalability of all implementations for 𝐺 = 128. Except mm-tiled, all imple- mentations achieve almost perfect linear speedup, with their self-relative speedups on 49 Figure 5-4: Speedup of mm implementations with respect to grainsize, run on 24 workers. 24 workers ranging from 22.7× to 24×. 5.3.2 Blur blur is a common operation in computational photography and image processing. The benchmark takes a two-dimensional image of size 𝑁 ×𝑁 and performs a simple 2D convolution that averages the pixels in the 2D neighborhood of size 𝐾 around each pixel. Despite a high factor of reuse, empirical results show that it does not benefit from improved locality in multiple dimensions. Figure 5-6 shows the simplified nested-loop implementation. Because each iteration of blur accesses nearby cells in both directions, improved locality in both 𝑥 and 𝑦 dimensions reduces the number of cache misses. The regular iteration order with 𝑥 (fastest-running dimension of the image) in the inner loop reuses each element along the 𝑥 dimension 𝐾 times, and it only incurs one cache miss for every 𝐵 elements due to excellent spatial locality. Assuming a row of the image does not fully fit in cache, the number of misses is 𝑂(𝑁2𝐾 ), and the fraction of misses 𝐵 to reads is 𝑂( 1 ). Suppose the tiled order with a sufficient tile size or Morton order 𝐾𝐵 is used. Then, elements can also be reused along the 𝑦 axis due to better locality in the 𝑥 dimension, and the number of misses is 2𝑂(𝑁 ). The fraction becomes 𝑂( 12 ),𝐵 𝐾 𝐵 50 Figure 5-5: Parallel speedup of mm implementations for 𝐺 = 128. 51 Figure 5-6: Simplified nested loop implementation of blur. blurPixel performs the "blurring" of a pixel and is factored out for better readability. an additional factor of 𝐾 reduction. Even though blur could be written as a 4D loop-nest, the inner two loops are smaller and likely wouldn’t benefit as much from improved locality. Their indices represent a relative offset in the calculation of the memory location, meaning that two different values of x and y with the same x_diff or y_diff don’t have anything in common, unless x and y are close, in which case locality in those two dimensions is already present. I chose a size of 𝑁 = 16401 for this benchmark to make sure that a whole row of the image would not fit in the L1 cache. In addition, I chose 𝐾 = 11 to get a significant factor of reuse. However, in hindsight, a high 𝐾 also resulted in a low number of cache misses, so reducing the cache misses had a minor impact on the overall performance. Figure 5-7 shows the performance of blur with respect to 𝐺 on a single worker. All implementations achieve very similar performance. These results show that the improved locality does not affect the running time. For 𝐺 = 1, blur-dac achieves 52 Figure 5-7: Speedup of blur implementations with respect to grainsize run on a single worker. a 0.77× speedup compared to the serial elision of blur, whike blur-dac-loop and blur-dac-loop-tiled achieve a 0.80× speedup. The reason they all perform worse than blur at low 𝐺 can be directly attributed to recursive overhead. As 𝐺 increases, the difference shrinks, and the performance of those benchmarks approaches that of blur. This approach happens more slowly for blur-dac-loop, as its coarsening only occurs with a factor of 𝐺 instead of 𝐺2, and the amortized cost of recursion approaches 0 more slowly. Meanwhile, blur-tiled achieves 0.97× of the performance of the baseline even for 𝐺 = 1, and quickly converges towards the baseline for larger 𝐺. It performs best with 𝐺 = 64, where it outperforms the baseline by 0.67%. That improvement is not significant enough to conclude that the improved locality was the reason for improved performance. Figure 5-8 shows the parallel scalability of all blur implementations. They all scale exceptionally well, achieving a speedup of over 23.9 on 24 cores. This happens due to ample parallelism and sufficient arithmetic intensity that does not overwhelm the memory bandwidth of the CPU, which notoriously does not scale with the number of processors. 53 Figure 5-8: Parallel speedup of blur implementations for 𝐺 = 1. 54 Figure 5-9: Simplified nested loop implementation of transpose. 5.3.3 Transpose transpose calculates an in-place transpose of a square matrix. It differs from the other two because it features a triangular iteration space , meaning the iteration range of the inner loop depends on the index in the outer loop. That gives the iteration space a triangular shape. Empirical results show that it can greatly benefit from improved locality in both dimensions. Figure 5-9 shows the simplified nested- loop implementation. In transpose, locality along both dimensions results in good spatial locality be- cause the code writes to the element with the switched coordinates of the element it reads from. On the other hand, there is no reuse; every element is only read and written once, and that read and write happen one after the other in all implementa- tions. I chose a size of 𝑁 = 16401 for this benchmark to ensure that a whole row of the matrix would not fit in the L1 cache. Figure 5-10 shows the performance of transpose with respect to 𝐺 on 24 work- ers. Because the work per iteration is low, recursive overheads are pretty high. With larger 𝐺, improved locality of transpose-dac outperforms other implemen- tations, achieving up to a 1.7× speedup compared to transpose. The second best is transpose-dac-loop-tiled with 𝐺 = 128, achieving a 1.62× speedup. From cache-misses alone, transpose-dac-loop-tiled is expected to achieve similar per- 55 Figure 5-10: Speedup of transpose implementations with respect to grainsize, run on 24 workers. formance after 𝐺 > 𝐵, but it keeps noticeably improving in performance. Mean- while, transpose-dac-loop plateaus entirely below them, which is expected due to its worse locality. But it is not likely for it to outperform transpose. Finally, transpose-tiled performs poorly with large 𝐺 like other -tiled implemen- tations. Because each iteration of transpose performs so little work, other differences between the implementations and overheads may affect these results. Further in- vestigation would be required to uncover the cause of the unexpected side of these results. 5.4 Takeaways Three main takeaways can be drawn from this study. First, convolutions (like blur) don’t benefit as much from reordering despite a high factor of reuse. Second, appli- cations where locality along more than one dimension results in either temporal or spatial cache-locality can benefit tremendously from reordering the iterations. Bench- marks in this study achieved up to a factor of 1.7×. Third, manually optimizing implementations and making them cache-efficient can require much work. 56 Careful tuning is required to achieve the best cache performance, which can also be hard to measure. Cache-efficient implementations, whether cache-aware or cache- oblivious, can get complex and must be tightly coupled to the application logic, making the code prone to off-by-one errors and other bugs. This coupling further justifies compiler transformations that can reorder iterations automatically without programmer input. I could not thoroughly investigate the poor scalability of -tiled for all three benchmarks. I believe that the OpenCilk compiler performed its own coarsening of the parallel loops. This coarsening could result in puny parallelism when combined with the manual coarsening done by the implementation. Future studies should focus on experimenting with different input sizes for blur, other base case approaches for mm, and further investigation of transpose perfor- mance results. A lower 𝐾 would mean lower reuse and a smaller relative reduction in cache misses for blur-dac, but it might increase the absolute decrease in cache misses and significantly impact performance. When experimenting with different or- ders of loops, I found that changing the order of nested loops inside mm yielded an implementation that was even faster than mm-best, which was partially due to better vectorization, but it certainly warrants a closer look. 57 58 Chapter 6 Related work This section discusses related work on parallel loops and optimizations to randomized work stealing. It also presents related work on improving locality of nested loops. A substantial body of previous work has studied a variety of schemes for partition- ing parallel loops to improve performance [48, 28, 57, 40, 27, 7, 8, 54]. One common scheme involves statically partitioning the 𝑛 iterations of a parallel loop into 𝑃 chunks of 𝑛/𝑃 iterations each, where 𝑃 is the number of processing threads, and then as- signing each chunk to a processor [45, 3]. Although static partitioning incurs low overheads, its performance suffers when the work of the loop is not balanced across its iterations or the execution of the parallel loop does not have dedicated access to the processors. In contrast, dynamic partitioning schemes, like DACParFor, partition parallel loop iterations into small chunks whose size is independent of the number of processing threads and then allows a scheduler to dynamically schedule chunks onto processors, e.g., via work stealing or work sharing. Dynamic partitioning can over- come the shortcomings of static partitioning by providing automatic load balancing but at the cost of increased overhead. Other schemes, such as guided self-scheduling [48] or factoring [28], partition the loop iterations into variable-sized chunks, whose size is based on 𝑃 and then the number of unexecuted loop iterations remaining. Previous work has also explored schemes to schedule and load balance parallel loops based on statistics collected at runtime [7, 8] and based on memory-access patterns, such as spatial locality across parallel loop iterations [54]. 59 Loop frames implement dynamic partitioning of parallel loops that load balances work automatically via work stealing. Loop frames and on-the-fly loop splitting also differ from other schemes that dynamically split unexecuted loop iterations, e.g., [48, 28, 54], which typically use a central queue to manage the loop iterations. In contrast, on-the-fly loop splitting distributes unexecuted loop iterations in a decen- tralized fashion using worker-local deques. Furthermore, in contrast to previous work, on-the-fly loop splitting presented in this thesis has a parallel running time bound with work stealing, derived in [47]. Tzannes et al. introduce lazy binary splitting [55], and a generalization called lazy scheduling [56], to reduce the overheads of work stealing. These schemes split parallel loops in a dynamic, lazy fashion by pushing work onto a worker’s deque only when the deque is empty. Lazy scheduling differs from loop frames in two ways. First, although lazy scheduling pushes partitions of parallel-loop iterations lazily onto the deque, these partitions are still created in a manner consistent with DACParFor. In contrast, on- the-fly loop splitting partitions the unexecuted iterations in the loop frame at the time of a steal. These partitions do not necessarily correspond with partitions generated using DACParFor, which necessitates a separate analysis of work stealing with loop frames. In addition, loop frames and lazy scheduling reduce work-stealing overheads in different ways. Lazy scheduling reduces overheads by avoiding synchronization operations on a nonempty deque. In contrast, loop frames reduce overheads by using a concise representation of unexecuted parallel-loop iterations in a single deque frame. It remains a topic of future work to see if these techniques can be combined. Acar et al. explore another approach to reducing work-stealing overheads by study- ing work-stealing with private deques [2]. Private deques allow a work-stealing sched- uler to avoid unnecessary memory fences during execution, thereby reducing synchro- nization overheads. It remains an open research problem to apply similar techniques to loop frames. In particular, future work might investigate designating a subinterval of a loop frame’s iterations to be private in order to reduce synchronization costs involved with updating the start variable in a loop frame. 60 Chapter 7 Conclusion This thesis introduces loop frames and on-the-fly loop splitting, which extend ran- domized work stealing with first-class support for parallel-for loops. Loop frames and on-the-fly loop splitting maintain the same theoretical guarantees as the traditional divide-and-conquer algorithm for parallel-for loops while providing lower overheads in practice and flexibility in dynamically splitting loop iterations. This particularly im- pacts applications with lower parallelism where extensive coarsening cannot be used. Although this work introduces on-the-fly loop splitting, it remains an interesting re- search question to fully leverage the flexibility it provides. This thesis also investigates the impacts of improved locality of multidimen- sional iteration spaces on performance. While convolutions like blur don’t benefit as much from reordering, applications with temporal or spatial cache-locality can ben- efit tremendously. Cache-efficient implementations can be complex and error-prone, justifying the use of compiler transformations that automatically reorder iterations. Future studies should continue to explore different input sizes and approaches for various benchmarks. 61 62 Bibliography [1] U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000), pages 1–12, 2000. [2] Umut A. Acar, Arthur Chargueraud, and Mike Rainey. Scheduling parallel pro- grams by work stealing with private deques. In PPoPP, page 219–228, 2013. [3] Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, and Massimo Torquati. FastFlow: High-Level and Efficient Streaming on Multicore, chapter 13, pages 261–280. John Wiley & Sons, Ltd, 2017. [4] Eric Allen, David Chase, Joe Hallett, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu, Guy L. Steele Jr., and Sam Tobin-Hochstadt. The Fortress Language Specification Version 1.0. Sun Microsystems, Inc., March 2008. [5] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. Theory Comput. Syst., 34(2):115–144, 2001. [6] E. Ayguade, N. Copty, A. Duran, J. Hoeflinger, Yuan Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and Guansong Zhang. The design of OpenMP tasks. IEEE Transactions on Parallel and Distributed Systems, 20(3):404–418, 2009. [7] I. Banicescu and Z. Liu. A dynamic scheduling method tuned to rate of weight changes. In HPC, pages 122–129, 2000. [8] Ioana Banicescu and Vijay Velusamy. Performance of scheduling scientific appli- cations with adaptive weighted factoring. In IPDPS, page 84, USA, 2001. IEEE Computer Society. [9] Rajkishore Barik, Zoran Budimlić, Vincent Cavè, Sanjay Chatterjee, Yi Guo, David Peixotto, Raghavan Raman, Jun Shirako, Sağnak Taşırlar, Yonghong Yan, Yisheng Zhao, and Vivek Sarkar. The Habanero multicore software research project. In Proceedings of the 24th ACM SIGPLAN Conference Companion on Object Oriented Programming Systems Languages and Applications, OOPSLA ’09, pages 735–736, Orlando, Florida, USA, 2009. ACM. [10] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. Internally deterministic parallel algorithms can be fast. In PPoPP, pages 181– 192, 2012. 63 [11] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996. [12] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multi- threaded computations. SIAM Journal on Computing, 27(1):202–229, February 1998. [13] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded compu- tations by work stealing. Journal of the ACM, 46(5):720–748, 1999. [14] Robert D. Blumofe and Dionisios Papadopoulos. Hood: A user-level threads library for multiprogrammed multiprocessors. Technical Report, University of Texas at Austin, 1999. [15] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a vir- tual tree of processors. In Proceedings of the 1981 Conference on Functional Pro- gramming Languages and Computer Architecture, pages 187–194, Portsmouth, New Hampshire, October 1981. [16] Vincent Cavé, Jisheng Zhao, Jun Shirako, and Vivek Sarkar. Habanero-Java: The new adventures of old X10. In PPPJ, pages 51–61, 2011. [17] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Al- lan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA, 2005. [18] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, third edition, 2009. [19] John S. Danaher, I-Ting Angelina Lee, and Charles E. Leiserson. Programming with exceptions in JCilk. Science of Computer Programming, 63(2):147–171, December 2008. [20] Rainer Feldmann, Peter Mysliwietz, and Burkhard Monien. Studying overheads in massively parallel min/max-tree evaluation. In SPAA, 1994. [21] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy races in Cilk programs. Theory of Computing Systems, 32(3):301–326, 1999. [22] Raphael Finkel and Udi Manber. DIB — A distributed implementation of back- tracking. ACM TOPLAS, 9(2):235–256, April 1987. [23] Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In Pro- ceedings of the First Symposium on Operating Systems Design and Implementa- tion, pages 201–213, Monterey, California, November 1994. 64 [24] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Com- puter Science, pages 285–297, New York, New York, October 17–19 1999. [25] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212–223, 1998. [26] Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic compu- tation. ACM TOPLAS, 7(4):501–538, October 1985. [27] Susan Flynn Hummel, Jeanette Schmidt, R. N. Uma, and Joel Wein. Load- sharing in heterogeneous systems via weighted factoring. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, page 318–328, Padua, Italy, 1996. ACM. [28] Susan Flynn Hummel, Edith Schonberg, and Lawrence E. Flynn. Factoring: A method for scheduling parallel loops. Commun. ACM, 35(8):90–101, August 1992. [29] Intel Corporation. Intel Cilk Plus Language Specification, 2010. Document Number: 324396-001US. Available from http://software.intel.com/sites/ products/cilk-plus/cilk_plus_language_specification.pdf. [30] Intel Corporation. Intel Cilk Plus Language Extension Specification, Version 1.2, 2013. Document 324396-003US. [31] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, 40(3):765–789, July 1993. [32] David A. Kranz, Robert H. Halstead, Jr., and Eric Mohr. Mul-T: A high- performance parallel Lisp. In Proceedings of the SIGPLAN ’89 Conference on Programming Language Design and Implementation, pages 81–90, June 1989. [33] Bradley C. Kuszmaul. Synchronized MIMD Computing. PhD thesis, Depart- ment of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1994. Available as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-645. [34] Monica S. Lam, E. Rothberg, and Michael E. Wolf. The cache performance and optimizations of blocked algortihms. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 63–74, Santa Clara, CA, April 1991. ACM SIGPLAN Notices 26:4. [35] Doug Lea. A Java fork/join framework. In ACM 2000 Conference on Java Grande, pages 36–43, 2000. 65 [36] I-Ting Angelina Lee. Memory Abstractions for Parallel Programming. PhD thesis, MIT Department of Electrical Engineering and Computer Science, 2012. [37] Daan Leijen and Judd Hall. Optimize managed code for multi-core ma- chines. MSDN Magazine, 2007. Available from http://msdn.microsoft.com/ magazine/. [38] Charles E. Leiserson. The Cilk++ concurrency platform. Journal of Supercom- puting, 51(3):244–257, 2010. [39] Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, Daniel Sanchez, and Tao B. Schardl. There’s plenty of room at the top: What will drive computer performance after moore’s law? volume 368. American Association for the Advancement of Science, 2020. [40] Jie Liu, Vikram A. Saletore, and Ted G. Lewis. Safe self-scheduling: A par- allel loop scheduling scheme for shared-memory multiprocessors. International Journal of Parallel Programming, 22(6):589–616, 1994. [41] Michael McCool, Arch D. Robison, and James Reinders. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier Science, 2012. [42] Seung-Jai Min, Costin Iancu, and Katherine Yelick. Hierarchical work stealing on manycore clusters. In Fifth Conference on Partitioned Global Address Space Programming Models (PGAS ’11), October 2011. [43] G.M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, March 1966. [44] Rishiyur S. Nikhil. Cid: A parallel, shared-memory C for distributed-memory machines. In Proceedings of the Seventh Annual Workshop on Languages and Compilers for Parallel Computing, August 1994. [45] OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 4.5, July 2015. Available from http://www.openmp.org/wp-content/ uploads/openmp-4.5.pdf. [46] OpenMP Application Program Interface, Version 3.0, May 2008. [47] Nipun Pitimanaaree. Provably efficient randomized work stealing with first- class parallel loops. Master’s thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 2019. [48] C. D. Polychronopoulos and D. J. Kuck. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Transactions on Comput- ers, C-36(12):1425–1439, 1987. [49] James Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly Media, Inc., 2007. 66 [50] Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Ar- cand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, et al. Interactive supercomputing on 40,000 cores for machine learning and data analysis. In 2018 IEEE High Performance extreme Computing Confer- ence (HPEC), pages 1–6. IEEE, 2018. [51] Tao B. Schardl, I-Ting Angelina Lee, and Charles E. Leiserson. Brief announce- ment: Open cilk. In Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures, SPAA ’18, page 351–353, New York, NY, USA, 2018. Association for Computing Machinery. [52] Tao B. Schardl, William S. Moses, and Charles E. Leiserson. Tapir: Embedding fork-join parallelism into LLVM’s intermediate representation. In PPoPP, pages 249–265, 2017. [53] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Ky- rola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: The Problem Based Benchmark Suite. In SPAA, pages 68–70, 2012. [54] Marc Tchiboukdjian, Vincent Danjean, Thierry Gautier, Fabien Le Mentec, and Bruno Raffin. A work stealing scheduler for parallel loops on shared cache mul- ticores. In Mario R. Guarracino, Frédéric Vivien, Jesper Larsson Träff, Mario Cannatoro, Marco Danelutto, Anders Hast, Francesca Perla, Andreas Knüpfer, Beniamino Di Martino, and Michael Alexander, editors, Euro-Par 2010 Parallel Processing Workshops, pages 99–107, 2011. [55] Alexandros Tzannes, George C. Caragea, Rajeev Barua, and Uzi Vishkin. Lazy binary-splitting: A run-time adaptive work-stealing scheduler. In PPoPP, page 179–190, 2010. [56] Alexandros Tzannes, George C. Caragea, Uzi Vishkin, and Rajeev Barua. Lazy scheduling: A runtime adaptive scheduler for declarative parallelism. TOPLAS, 36(3), September 2014. [57] T. H. Tzen and L. M. Ni. Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers. IEEE Trans. Parallel Distributed Syst., 4:87–98, 1993. [58] Mark T. Vandevoorde and Eric S. Roberts. WorkCrews: An abstraction for con- trolling parallelism. International Journal of Parallel Programming, 17(4):347– 366, August 1988. [59] Helen Jiang Xu. The Locality-First Strategy for Developing Efficient Multicore Algorithm. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, February 2022. 67