Multithreaded Programming in Cilk — LECTURE 2

Charles E. Leiserson
Supercomputing Technologies Research Group
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

July 14, 2006 © 2006 by Charles E. Leiserson
Minicourse Outline
● LECTURE 1
  Basic Cilk programming: Cilk keywords, performance measures, scheduling.
● LECTURE 2
  Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction.
● LABORATORY
  Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul
● LECTURE 3
  Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
The Master Method
The Master Method for solving recurrences applies to recurrences of the form
  T(n) = a T(n/b) + f(n)* ,
where a ≥ 1, b > 1, and f is asymptotically positive.
IDEA: Compare n^(log_b a) with f(n).
*The unstated base case is T(n) = Θ(1) for sufficiently small n.
Master Method — CASE 1
T(n) = a T(n/b) + f(n)
n^(log_b a) ≫ f(n): specifically, f(n) = O(n^(log_b a – ε)) for some constant ε > 0.
Solution: T(n) = Θ(n^(log_b a)).
Master Method — CASE 2
T(n) = a T(n/b) + f(n)
n^(log_b a) ≈ f(n): specifically, f(n) = Θ(n^(log_b a) lg^k n) for some constant k ≥ 0.
Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).
Master Method — CASE 3
T(n) = a T(n/b) + f(n)
n^(log_b a) ≪ f(n): specifically, f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.
Solution: T(n) = Θ(f(n)).
Master Method Summary
T(n) = a T(n/b) + f(n)
CASE 1: f(n) = O(n^(log_b a – ε)), constant ε > 0 ⇒ T(n) = Θ(n^(log_b a)).
CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0 ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).
CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and regularity condition ⇒ T(n) = Θ(f(n)).
Master Method Quiz
• T(n) = 4 T(n/2) + n
  n^(log_b a) = n^2 ≫ n ⇒ CASE 1: T(n) = Θ(n^2).
• T(n) = 4 T(n/2) + n^2
  n^(log_b a) = n^2 = n^2 lg^0 n ⇒ CASE 2: T(n) = Θ(n^2 lg n).
• T(n) = 4 T(n/2) + n^3
  n^(log_b a) = n^2 ≪ n^3 ⇒ CASE 3: T(n) = Θ(n^3).
• T(n) = 4 T(n/2) + n^2/lg n
  Master method does not apply!
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Square-Matrix Multiplication
C = A × B, where C = [c_ij], A = [a_ij], and B = [b_ij] are n × n matrices, and

  c_ij = ∑_{k=1}^{n} a_ik b_kj .

Assume for simplicity that n = 2^k.
Recursive Matrix Multiplication
Divide and conquer —

  [C11 C12]   [A11 A12]   [B11 B12]
  [C21 C22] = [A21 A22] × [B21 B22]

            [A11B11 A11B12]   [A12B21 A12B22]
          = [A21B11 A21B12] + [A22B21 A22B22]

8 multiplications of (n/2) × (n/2) matrices.
1 addition of n × n matrices.
Matrix Multiply in Pseudo-Cilk

cilk void Mult(*C, *A, *B, n) {   // C = A·B
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  spawn Mult(C12,A11,B12,n/2);
  spawn Mult(C22,A21,B12,n/2);
  spawn Mult(C21,A21,B11,n/2);
  spawn Mult(T11,A12,B21,n/2);
  spawn Mult(T12,A12,B22,n/2);
  spawn Mult(T22,A22,B22,n/2);
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}

Note the absence of type declarations.
Matrix Multiply in Pseudo-Cilk
Coarsen base cases for efficiency.
Matrix Multiply in Pseudo-Cilk
Submatrices are produced by pointer calculation, not copying of elements. Also need a row-size argument for array indexing.
Matrix Multiply in Pseudo-Cilk

cilk void Add(*C, *T, n) {   // C = C + T
  ⟨base case & partition matrices⟩
  spawn Add(C11,T11,n/2);
  spawn Add(C12,T12,n/2);
  spawn Add(C21,T21,n/2);
  spawn Add(C22,T22,n/2);
  sync;
  return;
}
Work of Matrix Addition
Work: A1(n) = 4 A1(n/2) + Θ(1)
            = Θ(n^2)  — CASE 1
n^(log_b a) = n^(log_2 4) = n^2 ≫ Θ(1).
Span of Matrix Addition
Span: A∞(n) = A∞(n/2) + Θ(1)   (the four spawned quadrants run in parallel, so we take the maximum)
            = Θ(lg n)  — CASE 2
n^(log_b a) = n^(log_2 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg^0 n).
Work of Matrix Multiplication
Work: M1(n) = 8 M1(n/2) + A1(n) + Θ(1)
            = 8 M1(n/2) + Θ(n^2)
            = Θ(n^3)  — CASE 1
n^(log_b a) = n^(log_2 8) = n^3 ≫ Θ(n^2).
Span of Matrix Multiplication
Span: M∞(n) = M∞(n/2) + A∞(n) + Θ(1)
            = M∞(n/2) + Θ(lg n)
            = Θ(lg^2 n)  — CASE 2
n^(log_b a) = n^(log_2 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg^1 n).
Parallelism of Matrix Multiply
Work: M1(n) = Θ(n^3)
Span: M∞(n) = Θ(lg^2 n)
Parallelism: M1(n)/M∞(n) = Θ(n^3/lg^2 n)
For 1000 × 1000 matrices, parallelism ≈ (10^3)^3/10^2 = 10^7.
Stack Temporaries

float *T = Cilk_alloca(n*n*sizeof(float));

In hierarchical-memory machines (especially chip multiprocessors), memory accesses are so expensive that minimizing storage often yields higher performance.
IDEA: Trade off parallelism for less storage.
No-Temp Matrix Multiplication

cilk void MultA(*C, *A, *B, n) {   // C = C + A*B
  ⟨base case & partition matrices⟩
  spawn MultA(C11,A11,B11,n/2);
  spawn MultA(C12,A11,B12,n/2);
  spawn MultA(C22,A21,B12,n/2);
  spawn MultA(C21,A21,B11,n/2);
  sync;
  spawn MultA(C21,A22,B21,n/2);
  spawn MultA(C22,A22,B22,n/2);
  spawn MultA(C12,A12,B22,n/2);
  spawn MultA(C11,A12,B21,n/2);
  sync;
  return;
}

Saves space, but at what expense?
Work of No-Temp Multiply
Work: M1(n) = 8 M1(n/2) + Θ(1)
            = Θ(n^3)  — CASE 1
Span of No-Temp Multiply
Span: M∞(n) = 2 M∞(n/2) + Θ(1)   (the two groups of four parallel spawns run in series; within each group we take the maximum)
            = Θ(n)  — CASE 1
Parallelism of No-Temp Multiply
Work: M1(n) = Θ(n^3)
Span: M∞(n) = Θ(n)
Parallelism: M1(n)/M∞(n) = Θ(n^2)
For 1000 × 1000 matrices, parallelism ≈ (10^3)^3/10^3 = 10^6.
Faster in practice!
Testing Synchronization
Cilk language feature: A programmer can check whether a Cilk procedure is "synched" (without actually performing a sync) by testing the pseudovariable SYNCHED:
• SYNCHED = 0 ⇒ some spawned children might not have returned.
• SYNCHED = 1 ⇒ all spawned children have definitely returned.
Best of Both Worlds

cilk void Mult1(*C, *A, *B, n) {   // multiply & store
  ⟨base case & partition matrices⟩
  spawn Mult1(C11,A11,B11,n/2);    // multiply & store
  spawn Mult1(C12,A11,B12,n/2);
  spawn Mult1(C22,A21,B12,n/2);
  spawn Mult1(C21,A21,B11,n/2);
  if (SYNCHED) {
    spawn MultA1(C11,A12,B21,n/2); // multiply & add
    spawn MultA1(C12,A12,B22,n/2);
    spawn MultA1(C22,A22,B22,n/2);
    spawn MultA1(C21,A22,B21,n/2);
  } else {
    float *T = Cilk_alloca(n*n*sizeof(float));
    spawn Mult1(T11,A12,B21,n/2);  // multiply & store
    spawn Mult1(T12,A12,B22,n/2);
    spawn Mult1(T22,A22,B22,n/2);
    spawn Mult1(T21,A22,B21,n/2);
    sync;
    spawn Add(C,T,n);              // C = C + T
  }
  sync;
  return;
}

This code is just as parallel as the original, but it only uses more space if runtime parallelism actually exists.
Ordinary Matrix Multiplication

  c_ij = ∑_{k=1}^{n} a_ik b_kj

IDEA: Spawn n^2 inner products in parallel. Compute each inner product in parallel.
Work: Θ(n^3)
Span: Θ(lg n)
Parallelism: Θ(n^3/lg n)
BUT, this algorithm exhibits poor locality and does not exploit the cache hierarchy of modern microprocessors, especially CMPs.
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Merging Two Sorted Arrays
[Figure: merging A = ⟨3, 12, 19, 46⟩ with B = ⟨4, 14, 21, 23⟩, one element at a time.]

void Merge(int *C, int *A, int *B, int na, int nb) {
  while (na>0 && nb>0) {
    if (*A <= *B) {
      *C++ = *A++; na--;
    } else {
      *C++ = *B++; nb--;
    }
  }
  while (na>0) {
    *C++ = *A++; na--;
  }
  while (nb>0) {
    *C++ = *B++; nb--;
  }
}

Time to merge n elements = Θ(n).
Merge Sort

cilk void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn MergeSort(C, A, n/2);
    spawn MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

[Figure: merge-sort recursion tree on eight keys (3, 4, 12, 14, 19, 21, 33, 46), with three levels of merges.]
Work of Merge Sort
Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)  — CASE 2
n^(log_b a) = n^(log_2 2) = n ⇒ f(n) = Θ(n^(log_b a) lg^0 n).
Span of Merge Sort
Span: T∞(n) = T∞(n/2) + Θ(n)
            = Θ(n)  — CASE 3
n^(log_b a) = n^(log_2 1) = 1 ≪ Θ(n).
Parallelism of Merge Sort
Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(n)
Parallelism: T1(n)/T∞(n) = Θ(lg n)
We need to parallelize the merge!
Parallel Merge
[Figure: with na ≥ nb, split A at its median A[na/2]; binary-search B for the boundary j with B[0..j] ≤ A[na/2] ≤ B[j+1..nb–1]; recursively merge the two left pieces and, in parallel, the two right pieces.]
KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4)n. (Each recursive merge gets about half of A plus at most all of B; since na ≥ nb implies na ≥ n/2, that is at most n – na/2 ≤ (3/4)n.)
Parallel Merge

cilk void P_Merge(int *C, int *A, int *B, int na, int nb) {
  if (na < nb) {
    spawn P_Merge(C, B, A, nb, na);
  } else if (na==1) {
    if (nb == 0) {
      C[0] = A[0];
    } else {
      C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */
      C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */
    }
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);
    spawn P_Merge(C, A, B, ma, mb);
    spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
    sync;
  }
}

Coarsen base cases for efficiency.
Span of P_Merge
Span: T∞(n) = T∞(3n/4) + Θ(lg n)
            = Θ(lg^2 n)  — CASE 2
n^(log_b a) = n^(log_{4/3} 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg^1 n).
Work of P_Merge
Work: T1(n) = T1(αn) + T1((1–α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
CLAIM: T1(n) = Θ(n).
Analysis of Work Recurrence
T1(n) = T1(αn) + T1((1–α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
Substitution method: Inductive hypothesis is T1(k) ≤ c1 k – c2 lg k, where c1, c2 > 0. Prove that the relation holds, and solve for c1 and c2.
T1(n) = T1(αn) + T1((1–α)n) + Θ(lg n)
      ≤ c1(αn) – c2 lg(αn) + c1((1–α)n) – c2 lg((1–α)n) + Θ(lg n)
Analysis of Work Recurrence
T1(n) = T1(αn) + T1((1–α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
T1(n) ≤ c1(αn) – c2 lg(αn) + c1((1–α)n) – c2 lg((1–α)n) + Θ(lg n)
      ≤ c1 n – c2 lg(αn) – c2 lg((1–α)n) + Θ(lg n)
      ≤ c1 n – c2 (lg(α(1–α)) + 2 lg n) + Θ(lg n)
      ≤ c1 n – c2 lg n – (c2 (lg n + lg(α(1–α))) – Θ(lg n))
      ≤ c1 n – c2 lg n
by choosing c1 and c2 large enough.
Parallelism of P_Merge
Work: T1(n) = Θ(n)
Span: T∞(n) = Θ(lg^2 n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg^2 n)
Parallel Merge Sort

cilk void P_MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn P_MergeSort(C, A, n/2);
    spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  }
}
Work of Parallel Merge Sort
Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)  — CASE 2
Span of Parallel Merge Sort
Span: T∞(n) = T∞(n/2) + Θ(lg^2 n)
            = Θ(lg^3 n)  — CASE 2
n^(log_b a) = n^(log_2 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg^2 n).
Parallelism of Merge Sort
Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(lg^3 n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg^2 n)
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Tableau Construction
Problem: Fill in an n × n tableau A, where
  A[i, j] = f( A[i, j–1], A[i–1, j], A[i–1, j–1] ).
Dynamic programming:
• Longest common subsequence
• Edit distance
• Time warping
[Figure: an 8 × 8 tableau of cells (i, j); each cell depends on the cells to its left, above, and diagonally above-left.]
Work: Θ(n^2).
Recursive Construction
[Figure: the n × n tableau split into quadrants: I upper-left, II upper-right, III lower-left, IV lower-right.]
Cilk code:
  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  sync;
Recursive Construction
Work: T1(n) = 4 T1(n/2) + Θ(1)
            = Θ(n^2)  — CASE 1
Recursive Construction
Span: T∞(n) = 3 T∞(n/2) + Θ(1)   (I runs first, II and III run in parallel, then IV: three quadrants lie on the critical path)
            = Θ(n^(lg 3))  — CASE 1
Analysis of Tableau Construction
Work: T1(n) = Θ(n^2)
Span: T∞(n) = Θ(n^(lg 3)) ≈ Θ(n^1.58)
Parallelism: T1(n)/T∞(n) ≈ Θ(n^0.42)
A More-Parallel Construction
[Figure: the n × n tableau split into a 3 × 3 grid of subsquares I–IX, processed in antidiagonal waves.]
Cilk code:
  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  spawn V;
  spawn VI;
  sync;
  spawn VII;
  spawn VIII;
  sync;
  spawn IX;
  sync;
A More-Parallel Construction
Work: T1(n) = 9 T1(n/3) + Θ(1)
            = Θ(n^2)  — CASE 1
A More-Parallel Construction
Span: T∞(n) = 5 T∞(n/3) + Θ(1)   (the 3 × 3 grid is processed in five sequential antidiagonal waves, one subsquare per wave on the critical path)
            = Θ(n^(log_3 5))  — CASE 1
Analysis of Revised Construction
Work: T1(n) = Θ(n^2)
Span: T∞(n) = Θ(n^(log_3 5)) ≈ Θ(n^1.46)
Parallelism: T1(n)/T∞(n) ≈ Θ(n^0.54)
More parallel by a factor of Θ(n^0.54)/Θ(n^0.42) = Θ(n^0.12).
Puzzle
What is the largest parallelism that can be obtained for the tableau-construction problem using Cilk?
• You may only use basic Cilk control constructs (spawn, sync) for synchronization.
• No locks, synchronizing through memory, etc.
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Key Ideas
• Cilk is simple: cilk, spawn, sync, SYNCHED
• Recurrences, recurrences, recurrences, …
• Work & span, work & span, work & span, …