MSc thesis defense by Rasmus Wriedt Larsen: Generating Efficient Code for Futhark's Segmented Redomap
Generating Efficient Code for Futhark's Segmented Redomap
The current code generation for computing segmented reductions -- reducing the innermost array of a multidimensional array -- in the compiler for the purely functional array programming language Futhark, is suboptimal. The Futhark compiler will use a segmented scan to compute segmented reductions, which requires us to store all the intermediate steps of computing the reduction for each segment. As Futhark aims at achieving high performance on general purpose graphics processing units (GPGPUs), using this technique is a problem: the performance of most reductions are in nature constrained by the memory bandwidth of the GPU device, so writing all intermediate results back to memory will be inefficient.
Futhark has a special internal construct to represent the fusion of a reduction on the result of a map, called a redomap. For segmented redomaps -- using a redomap within a map -- the Futhark compiler will use a single GPU thread to sequentially compute the redomap for a whole segment. This can give very good performance when we launch so many threads that we fully utilize the GPU device; however, when there are only few segments with many elements, this technique is utterly useless.
In this thesis, we explore how to extend the Futhark compiler to generate efficient code for both segmented reductions and segmented redomaps. As Futhark requires that all multidimensional arrays must be regular, all segments must have the same size. We exploit this fact, and create multiple GPU kernels that are specialized to some configurations of number of segments and segment sizes. At runtime, we can decide which kernel will be most suitable, given a configuration of number of segments and segment size.
We study the performance of this implementation using simple performance experiments. For these experiments, the new implementation has significantly better performance than the two baseline approaches outlined above. We evaluate the implementation on several benchmarks ported from the Rodinia and Parboil benchmark suites; However, only the Backprop and K-means benchmarks from Rodinia shows any speedup, due to the fact that not all of the benchmarks have significant computations within a segmented reduction or a segmented redomap. From evaluation on four different GPUs, we can demonstrate a speedup, over the unmodified Futhark compiler, by a harmonic-mean factor of 1.39x for Backprop, and by a harmonic-mean factor of 1.36x for K-means.
- Supervisor: Cosmin Eugen Oancea
- Co-supervisor: Troels Henriksen
- External Examiner: Michael Reichardt Hansen