Autotuning (1/2) - Vuduc
Transcript of Autotuning (1/2) - Vuduc
![Page 1: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/1.jpg)
Autotuning (1/2): Cache-oblivious algorithms
Prof. Richard Vuduc
Georgia Institute of Technology
CSE/CS 8803 PNA: Parallel Numerical Algorithms
[L.17] Tuesday, March 4, 2008
1
![Page 2: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/2.jpg)
Today’s sources
CS 267 (Demmel & Yelick @ UCB; Spring 2007)
“An experimental comparison of cache-oblivious and cache-conscious programs?” by Yotov, et al. (SPAA 2007)
“The memory behavior of cache oblivious stencil computations,” by Frigo & Strumpen (2007)
Talks by Matteo Frigo and Kaushik Datta at CScADS Autotuning Workshop (2007)
Demaine’s course @ MIT: http://courses.csail.mit.edu/6.897/spring03/scribe_notes
2
![Page 3: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/3.jpg)
Review: Tuning matrix multiply
3
![Page 4: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/4.jpg)
Tiled MM on AMD Opteron 2.2 GHz (4.4 Gflop/s peak), 1 MB L2 cache
< 25% peak! We evidently still have a lot of work to do...
4
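The tiled kernel behind these numbers is not shown on the slide. As a minimal sketch of the cache tiling being discussed (our code, not the benchmarked implementation), the tile size `B_SZ` is the tunable parameter the rest of the lecture revolves around:

```c
#include <stddef.h>

/* Cache-tiled matrix multiply, C += A * B, n x n row-major.
 * B_SZ is the tunable: pick it so three B_SZ x B_SZ tiles fit in
 * cache, i.e. 3 * B_SZ^2 * sizeof(double) <= cache capacity. */
#define B_SZ 32

void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += B_SZ)
      for (size_t kk = 0; kk < n; kk += B_SZ)
        for (size_t jj = 0; jj < n; jj += B_SZ) {
            /* clip tile edges so n need not be a multiple of B_SZ */
            size_t ie = ii + B_SZ < n ? ii + B_SZ : n;
            size_t ke = kk + B_SZ < n ? kk + B_SZ : n;
            size_t je = jj + B_SZ < n ? jj + B_SZ : n;
            for (size_t i = ii; i < ie; ++i)
              for (size_t k = kk; k < ke; ++k) {
                  double a = A[i * n + k];      /* reused across the j loop */
                  for (size_t j = jj; j < je; ++j)
                      C[i * n + j] += a * B[k * n + j];
              }
        }
}
```

Even so, as the slide notes, plain tiling like this typically stays well under 25% of peak without register-level tuning.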
![Page 5: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/5.jpg)
[Diagram: memory hierarchy, fast → slow — Registers, L1, TLB, L2, Main memory]
5
![Page 6: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/6.jpg)
[Diagram: blocked matrix multiply C = A · B]
6
![Page 7: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/7.jpg)
Software pipelining: Interleave iterations to delay dependent instructions
[Diagram: iterations i−4, i−3, …, i, i+1 interleaved in the pipeline]
Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
7
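As a concrete, if compiler-dependent, illustration of the idea: a dot product written so the loads for iteration i+1 are issued before the arithmetic for iteration i. This is our sketch, not code from Whaley's course — on real hardware the scheduling happens at the instruction level, which C can only suggest:

```c
#include <stddef.h>

/* Software-pipelined dot product: the loads for iteration i+1 are
 * issued before the multiply-add for iteration i, so the long-latency
 * loads overlap the arithmetic instead of stalling it. */
double dot_pipelined(size_t n, const double *a, const double *b)
{
    if (n == 0) return 0.0;
    double sum = 0.0;
    double ai = a[0], bi = b[0];        /* prologue: fill the pipeline */
    for (size_t i = 0; i + 1 < n; ++i) {
        double ai_next = a[i + 1];      /* loads for iteration i+1 ... */
        double bi_next = b[i + 1];
        sum += ai * bi;                 /* ... overlap this multiply-add */
        ai = ai_next;
        bi = bi_next;
    }
    sum += ai * bi;                     /* epilogue: drain the pipeline */
    return sum;
}
```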
![Page 8: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/8.jpg)
[Figure: Dense Matrix Multiply Performance (Square n×n Operands), 800 MHz Intel Pentium III-mobile — Performance (Mflop/s) and Fraction of Peak (0–0.875) vs. matrix dimension n (0–1200); series: Vendor, Goto-BLAS, Reg/insn-level + cache tiling + copy, Cache tiling + copy opt., Reference]
Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)
8
![Page 9: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/9.jpg)
Cache-oblivious matrix multiply
[Yotov, Roeder, Pingali, Gunnels, Gustavson (SPAA 2007)]
[Talk by M. Frigo at CScADS Autotuning Workshop 2007]
9
![Page 10: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/10.jpg)
Fast
Slow
Two-level memory hierarchy
M = capacity of cache (“fast”)
L = cache line size
Fully associative
Optimal replacement
Evicts most distant use
Sleator & Tarjan (CACM 1985): LRU and FIFO are within a constant factor of optimal, given a cache larger by a constant factor
“Tall cache”: M ≥ Ω(L²)
Limits: See Brodal & Fagerberg (STOC 2003)
When might this not hold?
Memory model for analyzing cache-oblivious algorithms
10
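The model's parameters can be made concrete with a tiny simulator. The sketch below (our construction, not from the lecture) counts misses for a fully associative cache of M words with L-word lines under LRU replacement — the policy Sleator & Tarjan showed is competitive with the optimal replacement the model assumes:

```c
#include <stddef.h>

/* Miss counter for a fully associative cache with LRU replacement:
 * capacity M words, line size L words, hence M/L lines. */
#define MAX_LINES 4096

typedef struct {
    size_t L, num_lines, used;
    long lines[MAX_LINES];   /* resident line tags, most recent at index 0 */
    size_t misses;
} lru_cache;

void lru_init(lru_cache *c, size_t M, size_t L)
{
    c->L = L; c->num_lines = M / L; c->used = 0; c->misses = 0;
}

void lru_access(lru_cache *c, size_t addr)   /* addr in words */
{
    long tag = (long)(addr / c->L);
    size_t i;
    for (i = 0; i < c->used; ++i)            /* linear scan: is it resident? */
        if (c->lines[i] == tag) break;
    if (i == c->used) {                      /* miss */
        c->misses++;
        if (c->used < c->num_lines) c->used++;
        i = c->used - 1;                     /* slot of the LRU victim */
    }
    for (; i > 0; --i)                       /* move-to-front */
        c->lines[i] = c->lines[i - 1];
    c->lines[0] = tag;
}
```

For example, streaming over 64 words with M = 32 and L = 8 costs 64/8 = 8 misses per pass, and a second pass misses again because the working set exceeds M.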
![Page 11: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/11.jpg)
A recursive algorithm for matrix-multiply
C = [C11 C12; C21 C22], A = [A11 A12; A21 A22], B = [B11 B12; B21 B22]
Divide all dimensions in half
Bilardi, et al.: Use Gray code ordering
Cost (flops):
$$T(n) = \begin{cases} 8\,T(n/2) & n > 1 \\ O(1) & n = 1 \end{cases} \;=\; O(n^3)$$
11
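In code, the recursion on the slide might look like the following sketch (our code, assuming n is a power of two; `ld` is the leading dimension so each quadrant can alias the parent matrix in place):

```c
#include <stddef.h>

/* Recursive matrix multiply, C += A * B, dividing every dimension in
 * half down to a 1x1x1 base case. Matrices are n x n, row-major with
 * leading dimension ld; n is assumed a power of two for brevity. */
void rmm(size_t n, size_t ld, const double *A, const double *B, double *C)
{
    if (n == 1) {
        C[0] += A[0] * B[0];
        return;
    }
    size_t h = n / 2;
    const double *A11 = A,          *A12 = A + h,
                 *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B,          *B12 = B + h,
                 *B21 = B + h * ld, *B22 = B + h * ld + h;
    double       *C11 = C,          *C12 = C + h,
                 *C21 = C + h * ld, *C22 = C + h * ld + h;
    /* 8 half-size products: T(n) = 8 T(n/2) + O(1) = O(n^3) flops */
    rmm(h, ld, A11, B11, C11);  rmm(h, ld, A12, B21, C11);
    rmm(h, ld, A11, B12, C12);  rmm(h, ld, A12, B22, C12);
    rmm(h, ld, A21, B11, C21);  rmm(h, ld, A22, B21, C21);
    rmm(h, ld, A21, B12, C22);  rmm(h, ld, A22, B22, C22);
}
```

No cache size appears anywhere in the code — that is what makes it cache-oblivious.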
![Page 12: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/12.jpg)
A recursive algorithm for matrix-multiply
C = [C11 C12; C21 C22], A = [A11 A12; A21 A22], B = [B11 B12; B21 B22]
Divide all dimensions in half
Bilardi, et al.: Use Gray-code ordering
I/O Complexity?
12
![Page 13: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/13.jpg)
A recursive algorithm for matrix-multiply
C = [C11 C12; C21 C22], A = [A11 A12; A21 A22], B = [B11 B12; B21 B22]
Divide all dimensions in half
Bilardi, et al.: Use Gray-code ordering
No. of misses, with tall-cache assumption:
$$Q(n) = \begin{cases} 8\,Q(n/2) & \text{if } n > \sqrt{M/3} \\ 3n^2/L & \text{otherwise} \end{cases} \;\le\; \Theta\!\left(\frac{n^3}{L\sqrt{M}}\right)$$
13
![Page 14: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/14.jpg)
Alternative: Divide longest dimension (Frigo, et al.)
[Diagram: split the longest of m, k, n in half — C into C1/C2 with A1/A2, the shared dimension k, or B into B1/B2 with C1/C2]
Cache misses:
$$Q(m,k,n) \le \begin{cases} \Theta\!\left(\dfrac{mk + kn + mn}{L}\right) & \text{if } mk + kn + mn \le \alpha M \\ 2\,Q(m/2,\, k,\, n) & \text{if } m \ge k,\ m \ge n \\ 2\,Q(m,\, k/2,\, n) & \text{if } k > m,\ k \ge n \\ 2\,Q(m,\, k,\, n/2) & \text{otherwise} \end{cases} \;=\; \Theta\!\left(\frac{mkn}{L\sqrt{M}}\right)$$
14
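Frigo et al.'s variant splits only the largest dimension, which handles arbitrary (non-square, non-power-of-two) shapes naturally. A sketch of that recursion (our code, with a 1×1×1 base case rather than a tuned cutoff):

```c
#include <stddef.h>

/* C += A * B with A m x k, B k x n, C m x n, row-major with leading
 * dimensions lda, ldb, ldc. Always split the largest of m, k, n. */
void rmm_longest(size_t m, size_t k, size_t n,
                 const double *A, size_t lda,
                 const double *B, size_t ldb,
                 double *C, size_t ldc)
{
    if (m == 1 && k == 1 && n == 1) {
        C[0] += A[0] * B[0];
    } else if (m >= k && m >= n) {          /* split rows of A and C */
        size_t h = m / 2;
        rmm_longest(h, k, n, A, lda, B, ldb, C, ldc);
        rmm_longest(m - h, k, n, A + h * lda, lda, B, ldb, C + h * ldc, ldc);
    } else if (k >= n) {                    /* split the shared dimension k */
        size_t h = k / 2;
        rmm_longest(m, h, n, A, lda, B, ldb, C, ldc);
        rmm_longest(m, k - h, n, A + h, lda, B + h * ldb, ldb, C, ldc);
    } else {                                /* split columns of B and C */
        size_t h = n / 2;
        rmm_longest(m, k, h, A, lda, B, ldb, C, ldc);
        rmm_longest(m, k, n - h, A, lda, B + h, ldb, C + h, ldc);
    }
}
```

Note that only the k-split reuses the same block of C in both recursive calls, which is why the case analysis in the miss bound treats it separately.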
![Page 15: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/15.jpg)
Relax tall-cache assumption using suitable layout
Source: Yotov, et al. (SPAA 2007) and Frigo, et al. (FOCS ’99)
Row-major: needs tall cache; Row-block-row: needs M ≥ Ω(L); Morton Z: no assumption
15
![Page 16: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/16.jpg)
[Diagram: loop indices I, J, K over blocked matrices]
Latency-centric vs. bandwidth-centric views of blocking
Latency view (assume computation and communication overlap perfectly):
$$\text{time per flop} \approx 1 + \frac{\alpha}{\tau}\cdot\frac{1}{b}, \qquad \frac{\alpha}{\tau}\cdot\frac{1}{\kappa} \le b \le \sqrt{\frac{M}{3}}$$
Bandwidth view: with peak flop/cycle ≡ φ and bandwidth (words/cycle) ≡ β,
$$\frac{2n^3}{b}\cdot\frac{1}{\beta} \;\le\; 2n^3\cdot\frac{1}{\varphi} \quad\Longrightarrow\quad b \ge \frac{\varphi}{\beta}$$
16
![Page 17: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/17.jpg)
[Diagram: Itanium 2 hierarchy — FPU ↔ Registers ↔ L1 ↔ L2 ↔ L3 ↔ Memory; 2 FMAs/cycle]
Registers: 1 ≤ b_R ≤ 6, 1.33 ≤ β(R, L2) ≤ 4
L2: 1 ≤ b_L2 ≤ 6, 1.33 ≤ β(L2, L3) ≤ 4
L3: 8 ≤ b_L3 ≤ 418, 0.02 ≤ β(L3, Memory) ≤ 0.5
Latency-centric vs. bandwidth-centric views of blocking
Example platform: Itanium 2
Consider L3 ←→ memory bandwidth
Φ = 4 flops / cycle; β = 0.5 words / cycle
L3 capacity = 4 MB (512 kwords)
Need 8 ≤ bL3 ≤ 418
Implications: Approximate cache-oblivious blocking works
Wide range of block sizes should be OK
If the upper bound exceeds 2× the lower bound, divide-and-conquer necessarily generates a block size in the valid range
Source: Yotov, et al. (SPAA 2007)
17
![Page 18: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/18.jpg)
Cache-oblivious vs. cache-aware
Does cache-oblivious perform as well as cache-aware?
If not, what can be done?
Next: Summary of Yotov, et al., study (SPAA 2007)
Stole slides liberally
18
![Page 19: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/19.jpg)
All- vs. largest-dimension
Similar; assume “all-dim”
19
![Page 20: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/20.jpg)
Data structures
Morton-Z complicated and yields same or worse performance, so assume row-block-row
20
![Page 21: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/21.jpg)
Example 1: Ultra IIIi
1 GHz ⇒ 2 Gflop/s peak
Memory hierarchy
32 registers
L1 = 64 KB, 4-way
L2 = 1 MB, 4-way
Sun compiler
21
![Page 22: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/22.jpg)
Iterative: triply nested loop
Recursive: down to 1 × 1 × 1
[Diagram: implementation space — Outer control structure: Iterative | Recursive]
22
![Page 23: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/23.jpg)
[Diagram: implementation space — Inner control structure: Statement | Recursive; Micro-kernel: None / Compiler]
Recursion down to NB; unfold completely below NB to get a basic block
Micro-kernel: the basic block, compiled with the native compiler
Best performance for NB = 12
Compiler unable to use registers
Unfolding reduces control overhead
Limited by I-cache
23
![Page 24: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/24.jpg)
Recursion down to NB; unfold completely below NB to get a basic block
Micro-kernel: scalarize all array references in the basic block, then compile with the native compiler
[Diagram: implementation space — Micro-kernel: Scalarized / Compiler]
24
![Page 25: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/25.jpg)
Recursion down to NB; unfold completely below NB to get a basic block
Micro-kernel: perform Belady’s register allocation on the basic block, then schedule using the BRILA compiler
[Diagram: implementation space — Micro-kernel: Belady / BRILA]
25
![Page 26: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/26.jpg)
Recursion down to NB; unfold completely below NB to get a basic block
Micro-kernel: construct a preliminary schedule, perform graph-coloring register allocation, then schedule using the BRILA compiler
[Diagram: implementation space — Micro-kernel: Coloring / BRILA]
26
![Page 27: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/27.jpg)
Recursion down to MU × NU × KU
Micro-kernel: completely unroll the MU × NU × KU triply nested loop, construct a preliminary schedule, perform graph-coloring register allocation, then schedule using the BRILA compiler
[Diagram: implementation space — Inner control structure: Iterative]
27
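The flavor of such a micro-kernel can be sketched in C with MU = NU = 2: the four accumulators stay in registers across the entire k loop. This is our illustration of the register-blocking idea, not the BRILA-generated kernel, whose instruction-level schedule C cannot express:

```c
#include <stddef.h>

/* 2x2 (MU x NU) register-blocked micro-kernel loop nest: C += A * B,
 * n x n row-major, n assumed a multiple of 2. The four accumulators
 * c00..c11 live in registers for the whole k loop, so each loaded
 * element of A and B feeds two multiply-adds. */
void matmul_2x2(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (size_t k = 0; k < n; ++k) {
                double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j]           += c00;
            C[i * n + j + 1]       += c01;
            C[(i + 1) * n + j]     += c10;
            C[(i + 1) * n + j + 1] += c11;
        }
}
```

Larger MU × NU tiles increase register reuse until the register file (or the compiler's allocator) runs out — which is exactly what the search over MU, NU, KU is probing.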
![Page 28: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/28.jpg)
Mini-Kernel
Recursion down to NB
Mini-kernel: NB × NB × NB triply nested loop, tiled for the L1 cache; body is the micro-kernel
[Diagram: implementation space with Mini-kernel node added]
28
![Page 29: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/29.jpg)
[Diagram: implementation space with ATLAS CGw/S and ATLAS Unleashed nodes added]
ATLAS CGw/S, ATLAS Unleashed: specialized code generator with search
29
![Page 30: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/30.jpg)
[Diagram: the complete implementation space, as built up on the preceding slides]
30
![Page 31: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/31.jpg)
Summary: Engineering considerations
Need to cut off recursion
Careful scheduling/tuning required at “leaves”
Yotov, et al., report that full recursion + tuned micro-kernel achieves at most 2/3 of the best performance
Open issues
Recursively scheduled kernels are worse than iteratively scheduled kernels — why?
Prefetching needed, but how best to apply in recursive case?
31
![Page 32: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/32.jpg)
Administrivia
32
![Page 33: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/33.jpg)
Upcoming schedule changes
Some adjustment of topics (TBD)
Tu 3/11 — Project proposals due
Th 3/13 — SIAM Parallel Processing (attendance encouraged)
Tu 4/1 — No class
Th 4/3 — Attend talk by Doug Post from DoD HPC Modernization Program
33
![Page 34: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/34.jpg)
Homework 1: Parallel conjugate gradients
Put name on write-up!
Grading: 100 pts max
Correct implementation — 50 pts
Evaluation — 30 pts
Tested on two sample matrices — 5
Implemented and tested on stencil — 10
“Explained” performance (e.g., per proc, load balance, comp. vs. comm) — 15
Performance model — 15 pts
Write-up “quality” — 5 pts
34
![Page 35: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/35.jpg)
Projects
Proposals due Tu 3/11
Your goal should be to do something useful, interesting, and/or publishable!
Something you’re already working on, suitably adapted for this course
Faculty-sponsored/mentored
Collaborations encouraged
35
![Page 36: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/36.jpg)
My criteria for “approving” your project
“Relevant to this course:” Many themes, so think (and “do”) broadly
Parallelism and architectures
Numerical algorithms
Programming models
Performance modeling/analysis
36
![Page 37: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/37.jpg)
General styles of projects
Theoretical: Prove something hard (high risk)
Experimental:
Parallelize something
Take existing parallel program, and improve it using models & experiments
Evaluate algorithm, architecture, or programming model
37
![Page 38: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/38.jpg)
Anything of interest to a faculty member/project outside CoC
Parallel sparse triple product (R·A·Rᵀ, used in multigrid)
Future FFT
Out-of-core or I/O-intensive data analysis and algorithms
Block iterative solvers (convergence & performance trade-offs)
Sparse LU
Data structures and algorithms (trees, graphs)
Look at mixed-precision
Discrete-event approaches to continuous systems simulation
Automated performance analysis and modeling, tuning
“Unconventional,” but related
Distributed deadlock detection for MPI
UPC language extensions (dynamic block sizes)
Exact linear algebra
Examples
38
![Page 39: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/39.jpg)
Switch: M. Frigo’s talk slides from the CScADS 2007 autotuning workshop — http://cscads.rice.edu/workshops/july2007/autotune-workshop-07
39
![Page 40: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/40.jpg)
Cache-oblivious stencil computations
[Frigo and Strumpen (ICS 2005)]
[Datta, et al. (2007)]
40
![Page 41: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/41.jpg)
Cache-oblivious stencil computation
[Space-time diagram: x = 0…16, t = 0…5]
41
![Page 42: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/42.jpg)
Cache-oblivious stencil computation
[Space-time diagram: trapezoid of width w and height h]
w < 2×h:
42
![Page 43: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/43.jpg)
Cache-oblivious stencil computation
w < 2×h ⇒ “Time-cut”:
[Space-time diagram: trapezoid cut at half its height]
43
![Page 44: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/44.jpg)
Cache-oblivious stencil computation
[Space-time diagram]
w ≥ 2×h:
44
![Page 45: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/45.jpg)
Cache-oblivious stencil computation
w ≥ 2×h ⇒ “Space-cut”:
[Space-time diagram: trapezoid cut along a sloped line in space]
45
![Page 46: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/46.jpg)
Cache-oblivious stencil computation
w ≥ 2×h ⇒ “Space-cut”:
[Space-time diagram: the two resulting trapezoids]
46
![Page 47: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/47.jpg)
Cache-oblivious stencil computation
w < 2×h ⇒ “Time-cut”:
[Space-time diagram: recursion continues on the sub-trapezoids]
47
![Page 48: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/48.jpg)
Cache-oblivious stencil computation
Theorem [Frigo & Strumpen (ICS 2005)]: for dimension d,
$$Q(n, t; d) = O\!\left(\frac{n^d\, t}{M^{1/d}}\right)$$
48
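The recursion behind these space/time cuts is short. The sketch below transcribes the 1-D `walk1` routine from Frigo & Strumpen (ICS 2005); the integer 3-point kernel, the fixed-boundary setup, and the array names are our additions for a self-contained example:

```c
/* 1-D cache-oblivious stencil via trapezoidal decomposition.
 * Two time-parity rows; x = 0 and x = N-1 hold fixed boundary values. */
#define N 64
static unsigned long u[2][N];

static void kernel(int t, int x)
{
    /* integer 3-point stencil, so results compare exactly */
    u[(t + 1) % 2][x] = u[t % 2][x - 1] + u[t % 2][x] + u[t % 2][x + 1];
}

/* Walk the space-time trapezoid: times [t0,t1), space [x0,x1) at t0,
 * with edge slopes dx0, dx1 in {-1, 0, 1}. */
void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1)
{
    int dt = t1 - t0;
    if (dt == 1) {
        for (int x = x0; x < x1; ++x)
            kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            /* wide trapezoid (w >= 2h): space cut along a sloped line */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            walk1(t0, t1, x0, dx0, xm, -1);
            walk1(t0, t1, xm, -1, x1, dx1);
        } else {
            /* tall trapezoid (w < 2h): time cut at half height */
            int s = dt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}
```

The sloped edges of each space cut are what guarantee that every point is computed after its time-(t−1) inputs and before they are overwritten, so the traversal matches plain step-by-step time marching exactly.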
![Page 49: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/49.jpg)
Source: Datta, et al. (2007)
Cache-oblivious stencil computation: Fewer misses but more time
49
![Page 50: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/50.jpg)
Cache-conscious algorithm
[Space-time diagram: blocked sweep with explicit block size b]
50
![Page 51: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/51.jpg)
Cache-conscious algorithm
Source: Datta, et al. (2007)
51
![Page 52: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/52.jpg)
“In conclusion…”
52
![Page 53: Autotuning (1/2) - Vuduc](https://reader034.fdocuments.us/reader034/viewer/2022050611/62738c9881ca6e6abb1bb5e8/html5/thumbnails/53.jpg)
Backup slides
53