Multithreaded Programming in Cilk — LECTURE 2

Charles E. Leiserson
Supercomputing Technologies Research Group
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

July 14, 2006 © 2006 by Charles E. Leiserson
Minicourse Outline
● LECTURE 1
  Basic Cilk programming: Cilk keywords, performance measures, scheduling.
● LECTURE 2
  Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction.
● LABORATORY
  Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul
● LECTURE 3
  Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
The Master Method
The Master Method for solving recurrences applies to recurrences of the form
  T(n) = a T(n/b) + f(n)* ,
where a ≥ 1, b > 1, and f is asymptotically positive.
IDEA: Compare n^(log_b a) with f(n).
*The unstated base case is T(n) = Θ(1) for sufficiently small n.
Master Method — CASE 1
T(n) = a T(n/b) + f(n)
n^(log_b a) ≫ f(n): specifically, f(n) = O(n^(log_b a – ε)) for some constant ε > 0.
Solution: T(n) = Θ(n^(log_b a)).
Master Method — CASE 2
T(n) = a T(n/b) + f(n)
n^(log_b a) ≈ f(n): specifically, f(n) = Θ(n^(log_b a) lg^k n) for some constant k ≥ 0.
Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).
Master Method — CASE 3
T(n) = a T(n/b) + f(n)
n^(log_b a) ≪ f(n): specifically, f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.
Solution: T(n) = Θ(f(n)).
Master Method Summary
T(n) = a T(n/b) + f(n)
CASE 1: f(n) = O(n^(log_b a – ε)), constant ε > 0 ⇒ T(n) = Θ(n^(log_b a)).
CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0 ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).
CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and regularity condition ⇒ T(n) = Θ(f(n)).
Master Method Quiz
• T(n) = 4 T(n/2) + n
  n^(log_b a) = n^2 ≫ n ⇒ CASE 1: T(n) = Θ(n^2).
• T(n) = 4 T(n/2) + n^2
  n^(log_b a) = n^2 = n^2 lg^0 n ⇒ CASE 2: T(n) = Θ(n^2 lg n).
• T(n) = 4 T(n/2) + n^3
  n^(log_b a) = n^2 ≪ n^3 ⇒ CASE 3: T(n) = Θ(n^3).
• T(n) = 4 T(n/2) + n^2/lg n
  Master method does not apply!
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Square-Matrix Multiplication
C = A × B, where C = [c_ij], A = [a_ij], and B = [b_ij] are n × n matrices, and

  c_ij = ∑_{k=1}^{n} a_ik b_kj .

Assume for simplicity that n = 2^k.
Recursive Matrix Multiplication
Divide and conquer —

  [C11 C12]   [A11 A12]   [B11 B12]
  [C21 C22] = [A21 A22] × [B21 B22]

            [A11B11 A11B12]   [A12B21 A12B22]
          = [A21B11 A21B12] + [A22B21 A22B22]

8 multiplications of (n/2) × (n/2) matrices.
1 addition of n × n matrices.
Matrix Multiply in Pseudo-Cilk

cilk void Mult(*C, *A, *B, n) {   // C = A·B
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  spawn Mult(C12,A11,B12,n/2);
  spawn Mult(C22,A21,B12,n/2);
  spawn Mult(C21,A21,B11,n/2);
  spawn Mult(T11,A12,B21,n/2);
  spawn Mult(T12,A12,B22,n/2);
  spawn Mult(T22,A22,B22,n/2);
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}

Note the absence of type declarations.
Matrix Multiply in Pseudo-Cilk
Coarsen base cases for efficiency.
Matrix Multiply in Pseudo-Cilk
Submatrices are produced by pointer calculation, not copying of elements. Also need a row-size argument for array indexing.
Matrix Multiply in Pseudo-Cilk

cilk void Add(*C, *T, n) {   // C = C + T
  ⟨base case & partition matrices⟩
  spawn Add(C11,T11,n/2);
  spawn Add(C12,T12,n/2);
  spawn Add(C21,T21,n/2);
  spawn Add(C22,T22,n/2);
  sync;
  return;
}
Work of Matrix Addition
Work: A1(n) = 4 A1(n/2) + Θ(1)
            = Θ(n^2)  — CASE 1
n^(log_b a) = n^(log_2 4) = n^2 ≫ Θ(1).
Span of Matrix Addition
Span: A∞(n) = A∞(n/2) + Θ(1)   (the four spawned quadrants run in parallel, so we take the maximum)
            = Θ(lg n)  — CASE 2
n^(log_b a) = n^(log_2 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg^0 n).
Work of Matrix Multiplication
Work: M1(n) = 8 M1(n/2) + A1(n) + Θ(1)
            = 8 M1(n/2) + Θ(n^2)
            = Θ(n^3)  — CASE 1
n^(log_b a) = n^(log_2 8) = n^3 ≫ Θ(n^2).
Span of Matrix Multiplication
Span: M∞(n) = M∞(n/2) + A∞(n) + Θ(1)
            = M∞(n/2) + Θ(lg n)
            = Θ(lg^2 n)  — CASE 2
n^(log_b a) = n^(log_2 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg^1 n).
Parallelism of Matrix Multiply
Work: M1(n) = Θ(n^3)
Span: M∞(n) = Θ(lg^2 n)
Parallelism: M1(n)/M∞(n) = Θ(n^3/lg^2 n)
For 1000 × 1000 matrices, parallelism ≈ (10^3)^3/10^2 = 10^7.
Stack Temporaries

float *T = Cilk_alloca(n*n*sizeof(float));

In hierarchical-memory machines (especially chip multiprocessors), memory accesses are so expensive that minimizing storage often yields higher performance.
IDEA: Trade off parallelism for less storage.
No-Temp Matrix Multiplication

cilk void MultA(*C, *A, *B, n) {   // C = C + A*B
  ⟨base case & partition matrices⟩
  spawn MultA(C11,A11,B11,n/2);
  spawn MultA(C12,A11,B12,n/2);
  spawn MultA(C22,A21,B12,n/2);
  spawn MultA(C21,A21,B11,n/2);
  sync;
  spawn MultA(C21,A22,B21,n/2);
  spawn MultA(C22,A22,B22,n/2);
  spawn MultA(C12,A12,B22,n/2);
  spawn MultA(C11,A12,B21,n/2);
  sync;
  return;
}

Saves space, but at what expense?
Work of No-Temp Multiply
Work: M1(n) = 8 M1(n/2) + Θ(1)
            = Θ(n^3)  — CASE 1
Span of No-Temp Multiply
Span: M∞(n) = 2 M∞(n/2) + Θ(1)   (the two groups of four parallel spawns run in series; within each group we take the maximum)
            = Θ(n)  — CASE 1
Parallelism of No-Temp Multiply
Work: M1(n) = Θ(n^3)
Span: M∞(n) = Θ(n)
Parallelism: M1(n)/M∞(n) = Θ(n^2)
For 1000 × 1000 matrices, parallelism ≈ (10^3)^3/10^3 = 10^6.
Faster in practice!
Testing Synchronization
Cilk language feature: A programmer can check whether a Cilk procedure is "synched" (without actually performing a sync) by testing the pseudovariable SYNCHED:
• SYNCHED = 0 ⇒ some spawned children might not have returned.
• SYNCHED = 1 ⇒ all spawned children have definitely returned.
Best of Both Worlds

cilk void Mult1(*C, *A, *B, n) {   // multiply & store
  ⟨base case & partition matrices⟩
  spawn Mult1(C11,A11,B11,n/2);    // multiply & store
  spawn Mult1(C12,A11,B12,n/2);
  spawn Mult1(C22,A21,B12,n/2);
  spawn Mult1(C21,A21,B11,n/2);
  if (SYNCHED) {
    spawn MultA1(C11,A12,B21,n/2); // multiply & add
    spawn MultA1(C12,A12,B22,n/2);
    spawn MultA1(C22,A22,B22,n/2);
    spawn MultA1(C21,A22,B21,n/2);
  } else {
    float *T = Cilk_alloca(n*n*sizeof(float));
    spawn Mult1(T11,A12,B21,n/2);  // multiply & store
    spawn Mult1(T12,A12,B22,n/2);
    spawn Mult1(T22,A22,B22,n/2);
    spawn Mult1(T21,A22,B21,n/2);
    sync;
    spawn Add(C,T,n);              // C = C + T
  }
  sync;
  return;
}

This code is just as parallel as the original, but it only uses more space if runtime parallelism actually exists.
Ordinary Matrix Multiplication

  c_ij = ∑_{k=1}^{n} a_ik b_kj

IDEA: Spawn n^2 inner products in parallel. Compute each inner product in parallel.
Work: Θ(n^3)
Span: Θ(lg n)
Parallelism: Θ(n^3/lg n)
BUT, this algorithm exhibits poor locality and does not exploit the cache hierarchy of modern microprocessors, especially CMPs.
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Merging Two Sorted Arrays
[Figure: merging A = ⟨3, 12, 19, 46⟩ with B = ⟨4, 14, 21, 23⟩, one element at a time.]

void Merge(int *C, int *A, int *B, int na, int nb) {
  while (na>0 && nb>0) {
    if (*A <= *B) {
      *C++ = *A++; na--;
    } else {
      *C++ = *B++; nb--;
    }
  }
  while (na>0) {
    *C++ = *A++; na--;
  }
  while (nb>0) {
    *C++ = *B++; nb--;
  }
}

Time to merge n elements = Θ(n).
Merge Sort

cilk void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn MergeSort(C, A, n/2);
    spawn MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

[Figure: merge-sort recursion tree on eight keys (3, 4, 12, 14, 19, 21, 33, 46), with three levels of merges.]
Work of Merge Sort
Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)  — CASE 2
n^(log_b a) = n^(log_2 2) = n ⇒ f(n) = Θ(n^(log_b a) lg^0 n).
Span of Merge Sort
Span: T∞(n) = T∞(n/2) + Θ(n)
            = Θ(n)  — CASE 3
n^(log_b a) = n^(log_2 1) = 1 ≪ Θ(n).
Parallelism of Merge Sort
Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(n)
Parallelism: T1(n)/T∞(n) = Θ(lg n)
We need to parallelize the merge!
Parallel Merge
[Figure: with na ≥ nb, split A at its median A[na/2]; binary-search B for the boundary j with B[0..j] ≤ A[na/2] ≤ B[j+1..nb–1]; recursively merge the two left pieces and, in parallel, the two right pieces.]
KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4)n. (Each recursive merge gets about half of A plus at most all of B; since na ≥ nb implies na ≥ n/2, that is at most n – na/2 ≤ (3/4)n.)
Parallel Merge

cilk void P_Merge(int *C, int *A, int *B, int na, int nb) {
  if (na < nb) {
    spawn P_Merge(C, B, A, nb, na);
  } else if (na==1) {
    if (nb == 0) {
      C[0] = A[0];
    } else {
      C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */
      C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */
    }
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);
    spawn P_Merge(C, A, B, ma, mb);
    spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
    sync;
  }
}

Coarsen base cases for efficiency.
Span of P_Merge
Span: T∞(n) = T∞(3n/4) + Θ(lg n)
            = Θ(lg^2 n)  — CASE 2
n^(log_b a) = n^(log_{4/3} 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg^1 n).
Work of P_Merge
Work: T1(n) = T1(αn) + T1((1–α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
CLAIM: T1(n) = Θ(n).
Analysis of Work Recurrence
T1(n) = T1(αn) + T1((1–α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
Substitution method: Inductive hypothesis is T1(k) ≤ c1 k – c2 lg k, where c1, c2 > 0. Prove that the relation holds, and solve for c1 and c2.
T1(n) = T1(αn) + T1((1–α)n) + Θ(lg n)
      ≤ c1(αn) – c2 lg(αn) + c1((1–α)n) – c2 lg((1–α)n) + Θ(lg n)
Analysis of Work Recurrence
T1(n) = T1(αn) + T1((1–α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
T1(n) ≤ c1(αn) – c2 lg(αn) + c1((1–α)n) – c2 lg((1–α)n) + Θ(lg n)
      ≤ c1 n – c2 lg(αn) – c2 lg((1–α)n) + Θ(lg n)
      ≤ c1 n – c2 (lg(α(1–α)) + 2 lg n) + Θ(lg n)
      ≤ c1 n – c2 lg n – (c2 (lg n + lg(α(1–α))) – Θ(lg n))
      ≤ c1 n – c2 lg n
by choosing c1 and c2 large enough.
Parallelism of P_Merge
Work: T1(n) = Θ(n)
Span: T∞(n) = Θ(lg^2 n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg^2 n)
Parallel Merge Sort

cilk void P_MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn P_MergeSort(C, A, n/2);
    spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  }
}
Work of Parallel Merge Sort
Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)  — CASE 2
Span of Parallel Merge Sort
Span: T∞(n) = T∞(n/2) + Θ(lg^2 n)
            = Θ(lg^3 n)  — CASE 2
n^(log_b a) = n^(log_2 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg^2 n).
Parallelism of Merge Sort
Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(lg^3 n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg^2 n)
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Tableau Construction
Problem: Fill in an n × n tableau A, where
  A[i, j] = f( A[i, j–1], A[i–1, j], A[i–1, j–1] ).
Dynamic programming:
• Longest common subsequence
• Edit distance
• Time warping
[Figure: an 8 × 8 tableau of cells (i, j); each cell depends on the cells to its left, above, and diagonally above-left.]
Work: Θ(n^2).
Recursive Construction
[Figure: the n × n tableau split into quadrants: I upper-left, II upper-right, III lower-left, IV lower-right.]
Cilk code:
  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  sync;
Recursive Construction
Work: T1(n) = 4 T1(n/2) + Θ(1)
            = Θ(n^2)  — CASE 1
Recursive Construction
Span: T∞(n) = 3 T∞(n/2) + Θ(1)   (I runs first, II and III run in parallel, then IV: three quadrants lie on the critical path)
            = Θ(n^(lg 3))  — CASE 1
Analysis of Tableau Construction
Work: T1(n) = Θ(n^2)
Span: T∞(n) = Θ(n^(lg 3)) ≈ Θ(n^1.58)
Parallelism: T1(n)/T∞(n) ≈ Θ(n^0.42)
A More-Parallel Construction
[Figure: the n × n tableau split into a 3 × 3 grid of subsquares I–IX, processed in antidiagonal waves.]
Cilk code:
  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  spawn V;
  spawn VI;
  sync;
  spawn VII;
  spawn VIII;
  sync;
  spawn IX;
  sync;
A More-Parallel Construction
Work: T1(n) = 9 T1(n/3) + Θ(1)
            = Θ(n^2)  — CASE 1
A More-Parallel Construction
Span: T∞(n) = 5 T∞(n/3) + Θ(1)   (the 3 × 3 grid is processed in five sequential antidiagonal waves, one subsquare per wave on the critical path)
            = Θ(n^(log_3 5))  — CASE 1
Analysis of Revised Construction
Work: T1(n) = Θ(n^2)
Span: T∞(n) = Θ(n^(log_3 5)) ≈ Θ(n^1.46)
Parallelism: T1(n)/T∞(n) ≈ Θ(n^0.54)
More parallel by a factor of Θ(n^0.54)/Θ(n^0.42) = Θ(n^0.12).
Puzzle
What is the largest parallelism that can be obtained for the tableau-construction problem using Cilk?
• You may only use basic Cilk control constructs (spawn, sync) for synchronization.
• No locks, synchronizing through memory, etc.
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Key Ideas
• Cilk is simple: cilk, spawn, sync, SYNCHED
• Recurrences, recurrences, recurrences, …
• Work & span, work & span, work & span, …