Futhark Purely Functional GPU-programming with Nested Parallelism and … · 2018. 12. 8. ·...
Futhark: Purely Functional GPU-programming with Nested Parallelism and in-place Array Updates
Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, Cosmin Oancea
Presented by: Zaid Qureshi
Motivation
● GPUs are traditionally programmed using sequential programming languages
○ Requires expertise to exploit the parallelism provided by GPUs
● Functional programming languages provide parallelizable primitives (e.g. map, reduce, scan)
○ But when compiled naively, they perform poorly
Futhark
● Purely functional array programming language for GPUs
○ To ease GPU programming
● Expresses computation/parallelism using basic and streaming second-order array combinators (SOACs)
● Type system that allows expression of race-free in-place updates
● Compiler implements partial flattening to allow for more parallelism without destroying memory access patterns
Futhark Syntax
Basic SOACs
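The basic SOACs named in the Motivation slide (map, reduce, scan) can be modelled sequentially as follows. This is a minimal Python sketch of their meaning only, not Futhark code and not how the compiler executes them; the function names `soac_map`, `soac_reduce` and `soac_scan` are made up for illustration, and the operator passed to reduce and scan is assumed to be associative with neutral element `ne`.

```python
from functools import reduce as _reduce
from itertools import accumulate


def soac_map(f, xs):
    """map f [x1, ..., xn] = [f x1, ..., f xn]; each element is independent."""
    return [f(x) for x in xs]


def soac_reduce(op, ne, xs):
    """Combine all elements with op, starting from the neutral element ne."""
    return _reduce(op, xs, ne)


def soac_scan(op, ne, xs):
    """Inclusive prefix results of xs under op (ne is unused in this model)."""
    return list(accumulate(xs, op))
```

Because every `f(x)` in the map is independent and `op` is assumed associative, a parallel implementation is free to evaluate these in any grouping; the sequential model only fixes the result.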
Example Futhark Code
● INPUT: n×m matrix
● OUTPUT: tuple of an n×m matrix and an array of size n
● Map over the rows of the matrix
● Generate new row
● Get sum of row
● Return tuple of new row and sum
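The annotated steps of the example can be sketched in Python as below. The slide's code itself is not preserved in this transcript, so the details are assumptions: the "new row" is taken to add 1 to every element, the sum is taken over the new row, and the name `rows_and_sums` is hypothetical.

```python
def rows_and_sums(matrix):
    """Take an n x m matrix; return (n x m matrix, array of n row sums)."""
    new_rows = []
    sums = []
    for row in matrix:                  # map over the rows of the matrix
        new_row = [x + 1 for x in row]  # generate new row (assumed: add 1)
        row_sum = sum(new_row)          # get sum of row (assumed: new row)
        new_rows.append(new_row)        # collect the (new row, sum) pair
        sums.append(row_sum)
    return new_rows, sums
```

For instance, `rows_and_sums([[1, 2, 3], [4, 5, 6]])` yields `([[2, 3, 4], [5, 6, 7]], [9, 18])`. The outer loop and the inner comprehension correspond to two levels of nested parallelism.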
Parallel operator sFold and Streaming Operators
● stream_map: applies f to each partition and then concatenates the resulting partitions
● stream_red: extends stream_map by allowing each chunk to produce an additional output, which is reduced in parallel
Notation: # denotes concatenation, ε denotes the empty partition
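A rough Python model of these operators, assuming sFold combines the per-partition results of f with an associative operator and that f applied to the empty partition yields that operator's neutral element. The helper `chunks`, the `chunk_size` parameter and the exact argument order are illustrative choices, not the paper's definitions.

```python
from functools import reduce


def chunks(xs, size):
    """Split xs into consecutive partitions of at most `size` elements."""
    return [xs[i:i + size] for i in range(0, len(xs), size)]


def sfold(op, f, partitions):
    """f(p1) op f(p2) op ... op f(pk); f([]) is assumed neutral for op."""
    return reduce(op, (f(p) for p in partitions), f([]))


def stream_map(f, xs, chunk_size):
    """Apply f to each partition, then concatenate (#) the results."""
    return sfold(lambda a, b: a + b, f, chunks(xs, chunk_size))


def stream_red(op, f, xs, chunk_size):
    """Like stream_map, but f returns (extra, mapped_chunk) per partition;
    the extra outputs are combined with op, the chunks are concatenated."""
    def combine(left, right):
        return (op(left[0], right[0]), left[1] + right[1])
    return sfold(combine, f, chunks(xs, chunk_size))
```

In this model the result does not depend on `chunk_size`, which is the property that would let a compiler choose the partitioning to suit the hardware.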
Sequential Histogramming in Futhark
Parallel Histogramming in Futhark
Efficient Parallel Histogramming in Futhark
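One plausible reading of chunk-wise histogramming, sketched in Python: each chunk is histogrammed sequentially with in-place updates, and the per-chunk histograms are then added elementwise. This is an assumption about the general approach, not a reproduction of the slide's code; `sequential_histogram`, `chunked_histogram` and `chunk_size` are hypothetical names.

```python
def sequential_histogram(num_bins, indices):
    """One pass over the input, updating the histogram in place."""
    hist = [0] * num_bins
    for i in indices:
        hist[i] += 1
    return hist


def chunked_histogram(num_bins, indices, chunk_size):
    """Histogram each chunk independently, then sum the partial histograms."""
    partials = [
        sequential_histogram(num_bins, indices[s:s + chunk_size])
        for s in range(0, len(indices), chunk_size)
    ]
    total = [0] * num_bins
    for partial in partials:
        total = [a + b for a, b in zip(total, partial)]
    return total
```

The partial histograms are independent of each other, so they could be computed in parallel; the number of partial histograms is tied to the number of chunks rather than the number of input elements.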
In-Place Updates and Uniqueness Types
● In purely functional languages, an array update requires copying the array and updating the copy (to avoid side effects)
● If it is known that the original array won’t be used after the update, the update can occur in place
● Futhark has uniqueness types that allow the programmer to specify function arguments that won’t be referenced by the caller after the function call
○ The callee gains ownership of that argument
● An array is consumed when it is the source of an in-place update or when it is passed as a unique parameter.
● After the consumption point, the array or its aliases may not be used.
○ Type system checks this via aliasing rules
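The consumption rule can be illustrated with a small runtime model in Python. Futhark enforces this statically in its type system; the sketch below only mimics the rule dynamically, and `UniqueArray` and `ConsumedError` are invented names.

```python
class ConsumedError(Exception):
    """Raised when an array is used after its consumption point."""


class UniqueArray:
    def __init__(self, data):
        self._data = list(data)
        self._consumed = False

    def _check(self):
        if self._consumed:
            raise ConsumedError("array used after being consumed")

    def read(self, i):
        self._check()
        return self._data[i]

    def update(self, i, value):
        """In-place update: consumes this handle and returns a fresh handle
        that reuses the same storage, so no copy is made."""
        self._check()
        self._consumed = True
        self._data[i] = value
        fresh = UniqueArray.__new__(UniqueArray)
        fresh._data = self._data
        fresh._consumed = False
        return fresh
```

After `b = a.update(0, 9)`, reading from `b` works, while any further use of `a` raises `ConsumedError`. Since nothing can observe the old value, overwriting the storage is indistinguishable from copying.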
Aliasing Rules
● Alias sets for values produced by SOACs are empty (new copies)
● Scalar read from an array does not alias its origin array (alias set not modified)
● Array slicing aliases origin array
● Function application:
○ If the result being returned is unique the alias set is empty
○ Otherwise the result aliases all non-unique parameters
● Other rules can be found in Figure 5 of the paper
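The slicing and function-application rules listed above can be written down as set computations. This Python sketch covers only those two rules, with alias sets modelled as sets of variable names; the function names and the `(alias_set, param_is_unique)` encoding of arguments are illustrative assumptions.

```python
def slice_aliases(origin, origin_aliases):
    """A slice aliases its origin array (and whatever the origin aliases)."""
    return frozenset({origin}) | frozenset(origin_aliases)


def application_result_aliases(result_is_unique, args):
    """args is a list of (alias_set, param_is_unique) pairs, one per argument.

    A unique result has an empty alias set; otherwise the result aliases
    every argument passed for a non-unique parameter.
    """
    if result_is_unique:
        return frozenset()
    aliases = set()
    for alias_set, param_is_unique in args:
        if not param_is_unique:
            aliases |= set(alias_set)
    return frozenset(aliases)
```

The non-unique case is deliberately conservative: without inspecting the function body, the checker has to assume the result may share storage with any argument the callee did not take ownership of.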
In-Place Update Checking
● Each expression e has an observed set of variables (O) and a consumed set of variables (C)
○ The pair <C,O> forms the occurrence trace for e
● Inference rules are used to check uniqueness and parameter consumption (Figure 6)
Sequence Judgement
Inference Rule
If-then-else uniqueness inference rule
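A sketch of how sequencing two occurrence traces might be checked, in Python. The rule encoded here is an assumption based on the statement that a consumed array or its aliases may not be used afterwards: the second trace must not observe or consume anything the first trace consumed. Figure 6 of the paper remains the authority for the actual judgement, and `sequence` is a hypothetical name.

```python
def sequence(first, second):
    """Combine occurrence traces <C1,O1> then <C2,O2>.

    Each trace is a (consumed, observed) pair of sets of variable names.
    Raises ValueError if the later trace touches something already consumed.
    """
    c1, o1 = first
    c2, o2 = second
    clash = (set(c2) | set(o2)) & set(c1)
    if clash:
        raise ValueError(f"used after consumption: {sorted(clash)}")
    return (set(c1) | set(c2), set(o1) | set(o2))
```

Observing `a` and then consuming it is accepted, whereas consuming `a` and then observing it is rejected; the order of the two traces matters.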
In-Place Update Checking (Example)
This program passes as the function of the map consumes its parameter as
This program is rejected because d, which is bound outside the function of the map, would be consumed in every iteration of the map.
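The distinction between the two programs can be phrased as a simple check on the function passed to map: everything it consumes must be one of its own parameters. The Python sketch below encodes only that condition, with variables as strings; `check_map_lambda` and the variable name `row` are hypothetical.

```python
def check_map_lambda(params, consumed):
    """Return True if the lambda consumes only its own parameters.

    params:   names bound by the lambda itself
    consumed: names the lambda's body consumes
    """
    consumed_from_outside = set(consumed) - set(params)
    return not consumed_from_outside
```

A lambda with parameter `row` that consumes `row` passes, because each iteration consumes a different element. A lambda that consumes a variable `d` bound outside fails, because every iteration would consume the same array.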
Streaming SOAC Fusion
Streaming SOAC Fusion Example
Moderate Flattening
● Flattening algorithm based on map-loop interchange and map distribution
● Attempt to exploit partial top-level parallelism
○ Not seeking parallelism inside branches
○ Terminating map distribution when it would introduce irregular arrays
map f ◦ map g ⇒ map (f ◦ g)
let bss: [m][m]i32 = map (\(ps: [m]i32) -> loop (ws=ps) for i < n do map (\w -> w * 2) ws) pss
⇒
let bss: [m][m]i32 = loop (wss=pss) for i < n do map (\ws -> map (\w -> w * 2) ws) wss
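The two forms of the loop example can be transcribed into Python to see that map-loop interchange preserves the result. This is a sequential model only; `map_of_loops` and `loop_of_maps` are names chosen for this sketch.

```python
def map_of_loops(pss, n):
    """Outer map over rows, with a sequential loop inside each row."""
    out = []
    for ps in pss:
        ws = list(ps)
        for _ in range(n):
            ws = [w * 2 for w in ws]
        out.append(ws)
    return out


def loop_of_maps(pss, n):
    """After interchange: a sequential loop whose body is a nested map."""
    wss = [list(ps) for ps in pss]
    for _ in range(n):
        wss = [[w * 2 for w in ws] for ws in wss]
    return wss
```

Both functions return the same matrix for any input. In the second form, each loop iteration exposes a perfectly nested map over all elements, rather than only the outer map over rows.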
Locality of Reference Optimizations
● Naive translation of flattened and fused code can lead to bad memory access patterns
● Futhark compiler can optimize memory access patterns by transforming data
Transpose:
Tiling:
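Why transposing data can help is easiest to see from the addresses touched at one instant. The Python sketch below assumes n threads, where thread i walks sequentially over row i of an n×m array, and compares the flat addresses all threads read at step j under the original row-major layout and under a transposed layout. The function names are hypothetical and the model ignores element size and hardware details.

```python
def addresses_row_major(n, m, j):
    """Element (i, j) of a row-major [n][m] array lives at i*m + j."""
    return [i * m + j for i in range(n)]


def addresses_transposed(n, m, j):
    """With the array stored transposed as [m][n], (i, j) lives at j*n + i."""
    return [j * n + i for i in range(n)]
```

For n = 4 and m = 8 at step j = 0, the row-major layout gives addresses `[0, 8, 16, 24]` (stride m), while the transposed layout gives `[0, 1, 2, 3]` (stride 1). Neighbouring threads touching neighbouring addresses is the pattern that GPU memory systems can coalesce.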
Evaluation Methodology
● Tested with 2 GPUs
○ Nvidia GTX 780
○ AMD W8100
● Generated OpenCL code is run on both GPUs
● Baseline implementations taken from benchmark suites
○ Rodinia
○ FinPar
○ Parboil
○ Accelerate
Results
● Futhark performs better than other functional programming environments for GPUs due to higher-level optimizations
● Rodinia doesn’t implement all optimizations: sequential reductions (Backprop, NN), not parallelizing computation of new cluster centers (k-means), not coalescing all accesses (Myocyte)
● For OptionPricing, Futhark sequentializes excessive parallelism.
● Futhark gets around 70-80% of the performance of hand-tuned code.
Impact of Optimizations
● SOAC Fusion
○ K-means (x1.42), LavaMD (x4.55), Myocyte (x1.66), SRAD (x1.21), Crystal (x10.1), LocVolCalib (x9.4)
○ Without fusion, OptionPricing, N-body, and MRI-Q have too high memory requirements
● In-place Updates
○ K-means (x8.3), LocVolCalib (x1.7)
○ OptionPricing can’t even be implemented without in-place updates
● Coalescing
○ K-means (x9.26), Myocyte (x4.2), OptionPricing (x8.79), LocVolCalib (x8.4)
● Tiling
○ LavaMD (x1.35), MRI-Q (x1.33), N-body (x2.29)
Conclusion
Pros:
● Futhark code is independent of the underlying hardware
● Futhark’s type system allows expression of race-free in-place updates
● Optimizations done by compiler using higher level functions/reasoning
● Compiler implements partial flattening to allow for more parallelism without destroying memory access patterns
● Compiler can aggressively fuse and decompose code to best use available parallelism
Cons:
● Requires rewrite of applications
● Although it does optimizations like flattening and fusion, Futhark’s compiler can’t optimize all the time
○ E.g. it can’t convert inefficient histogramming to the efficient one
○ Still leaves a huge design space for the programmer to explore to write well-performing code
Thank you!
Other slides
Futhark Syntax
Results
Basic SOACs
Can be implemented with Parallel Operator fold
Futhark Compiler Architecture