Transcript of slides: The LRPD Test (Parasol Laboratory)

The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization

Lawrence Rauchwerger and David Padua
Center for Supercomputing Research & Development, University of Illinois at Urbana-Champaign
Computer & Systems Research Laboratory, 1308 West Main Street, Urbana, Illinois 61801-2987

The LRPD Test [1]
Parallelizing Compilers and Irregular Problems

- Viable Parallel Computing Requires Automatic Parallelization
  - automatic (vs. hand) parallelization of sequential codes is: portable, error free, easy to use
- Current Parallelizing Compilers Cannot Handle:
  - programs with difficult-to-analyze access patterns
  - programs with irregular access patterns
  - programs with dynamically changing access patterns
- A Large, Important Class of Problems with Irregular or Dynamic Domains (about 50% of all programs):
  - SPICE: circuit simulation
  - DYNA-3D, PRONTO-3D: structural mechanics
  - GAUSSIAN, DMOL: quantum mechanical simulation of molecules
  - CHARMM, DISCOVER: molecular dynamics simulations of organic systems
  - FIDAP: modeling complex fluid flows

New Compiler Methods are Needed for These Problems: Detect and Exploit Parallelism at Run-Time

Center for Supercomputing Research and Development (CSRD)

The LRPD Test [2]
The Problem

Automatic Loop Level Parallelization for Multiprocessors requires:
- data dependence analysis, followed by
- loop transformations (to remove dependences); most effective: privatization, reduction parallelization

   do i = 1, n
      A(f(i)) = ...
      ...     = A(g(i))
   enddo

Regular Problems: when f(i) and g(i) are "easy" to analyze, compilers can obtain reasonable performance.

Hard or Run-Time Determined Problems: compilers fail on
- nonlinear subscripts
- symbolic values that are difficult to evaluate
- subscripted subscripts (indirection, pointers)
- control flow dependent access patterns

Our Solution: Speculative Run-Time Techniques


The LRPD Test [3]
Speculative do Loop Parallelization

Our Basic Strategy
1. speculatively apply the most promising transformations
2. speculatively execute the loop in parallel (as a doall)
3. test subsequently whether:
   (a) the transformations were legal
   (b) the loop was truly parallel (no potential dependences)

For Speculative Parallel Execution we need:
- a checkpointing/restoration mechanism: to save the original values of program variables for the possible sequential re-execution
- an error (hazard) detection method: to test the validity of the speculative parallel execution
- an automatable strategy: to decide when to use speculative parallel execution


The LRPD Test [4]
Error Detection

Types of Errors During Parallel Execution
- writing a memory location in different iterations (an output dependence)
- reading and writing a memory location in different iterations (a flow and/or anti dependence)
- exceptions

Our General Approach for Detecting Errors
- to identify dependences:
  1. shadow arrays under test
  2. follow accesses to shadow structures
  3. detect errors by identifying multiple accesses to the same location
- exceptions: treat as errors and abort parallel execution


The LRPD Test [5]
Definitions: independent variables and privatization

Independent Shared Variables

   do i = 1,n
      f(i) = A(i)
      B(i) = g(i)
   enddo

A shared variable is independent if it is:
- read-only (e.g., A)
- accessed (written and read) in only one iteration (e.g., B)

Privatizable Shared Variables

   do i = 1,n
      A(l:m) = f(i)
      h(i)   = A(l:m)
   enddo

A shared array A can be privatized if and only if every read access to an element of A is preceded by a write access to that same element of A within the same iteration of the loop.
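The privatization condition above can be checked mechanically over a trace of the loop's accesses. A minimal Python sketch (illustrative only; the access-trace encoding and the function name are mine, not from the slides):

```python
def is_privatizable(iterations):
    """An array is privatizable iff, within every iteration, each element
    is written before it is read; a read of a stale value would create a
    cross-iteration dependence that privatization cannot remove."""
    for accesses in iterations:           # one list of accesses per iteration
        written = set()
        for op, elem in accesses:         # ('w', index) or ('r', index)
            if op == 'w':
                written.add(elem)
            elif elem not in written:     # read before any write this iteration
                return False
    return True

# A(l:m) written then read inside each iteration -> privatizable
print(is_privatizable([[('w', 1), ('r', 1)], [('w', 1), ('r', 1)]]))   # True
# read before write within an iteration -> not privatizable
print(is_privatizable([[('r', 1), ('w', 1)]]))                         # False
```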


The LRPD Test [6]
Error Detection: The Lazy Privatizing doall Test

I. The Marking Phase
- shadow arrays Ar[1:s], Aw[1:s], and Anp[1:s]
- initially assume all array elements are privatizable

1. definitions (done when the value is written):
   - mark Aw if this is the first write to this element in this iteration
   - increment tw(A), the count of definitions marked in this iteration
2. uses (done when the value that was read is used):
   - mark Ar if the element was never written in this iteration
   - mark Anp (not privatizable) if the element was not written before being read in this iteration

Example loop:

   do i = 1,5
      z = A[K[i]]
      if B[i] then
         A[L[i]] = z + C[i]
      endif
   enddo

Speculative version with marking (the read of A[K[i]] is marked lazily, at the use of z):

   doall i = 1,5
      z = A[K[i]]
      if B[i] then
         markread(K[i])
         markwrite(L[i])
         increment(tw_A)
         A[L[i]] = z + C[i]
      endif
   enddoall

Input data:

   B[1:5] = [1,0,1,0,1]
   K[1:5] = [1,2,3,4,1]
   L[1:5] = [2,2,4,4,2]

Resulting shadow arrays:

   shadow arrays    1 2 3 4    written tw(A)
   Aw[1:4]          0 1 0 1    3
   Ar[1:4]          1 0 1 0
   Anp[1:4]         1 0 1 0
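The marking rules can be modeled in a few lines of Python. This is an illustrative simulation of the shadow arrays, not the actual CSRD implementation; the trace encoding and names like `lpd_marking` are my own:

```python
def lpd_marking(n_elems, iterations):
    """Marking phase of the LPD test. iterations holds one access list per
    iteration, with entries ('r', idx) or ('w', idx), 0-based indices.
    Returns the shadow arrays Aw, Ar, Anp and the write count tw(A)."""
    Aw  = [0] * n_elems     # element written in some iteration
    Ar  = [0] * n_elems     # element read but never written in its iteration
    Anp = [0] * n_elems     # element NOT privatizable (read before write)
    tw = 0
    for accesses in iterations:
        written = set()                    # writes seen in this iteration
        for op, e in accesses:
            if op == 'w':
                if e not in written:       # first write this iteration
                    Aw[e] = 1
                    tw += 1
                    written.add(e)
            else:                          # lazy mark at the use of the value
                if e not in written:
                    Ar[e] = 1              # read, never written here
                    Anp[e] = 1             # read before write: not privatizable
    return Aw, Ar, Anp, tw

# The slide's example: only iterations with B[i] = 1 read K[i] and write L[i]
B, K, L = [1, 0, 1, 0, 1], [1, 2, 3, 4, 1], [2, 2, 4, 4, 2]
iters = []
for i in range(5):
    acc = []
    if B[i]:
        acc = [('r', K[i] - 1), ('w', L[i] - 1)]
    iters.append(acc)
print(lpd_marking(4, iters))   # ([0, 1, 0, 1], [1, 0, 1, 0], [1, 0, 1, 0], 3)
```

The printed tuple reproduces the slide's shadow-array table: Aw = [0,1,0,1], Ar = [1,0,1,0], Anp = [1,0,1,0], tw(A) = 3.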


The LRPD Test [7]
Error Detection: The Lazy Privatizing doall Test

II. The Analysis Phase
Array elements are classified as: read-only, independent, privatizable, or dependent.

1. compute tm(A) = sum(Aw[1:s]), the number of marks in Aw
2. if any(Aw[:] ∧ Ar[:]), then the loop was not a doall (at least one flow or anti dependence)
3. else if tw(A) = tm(A), then the loop was a doall (no dependences)
4. else if any(Aw[:] ∧ Anp[:]), then the loop was not a doall (at least one dependence was not removed by privatization)
5. otherwise, the do loop is transformed into a doall by privatizing A (all memory-related dependences are removed by privatization)

Example (same loop as before):

   do i = 1,5
      z = A[K[i]]
      if B[i] then
         A[L[i]] = z + C[i]
      endif
   enddo

   B[1:5] = [1,0,1,0,1]
   K[1:5] = [1,2,3,4,1]
   L[1:5] = [2,2,4,4,2]

   shadow array       1 2 3 4    attempted tw(A)   counted tm(A)   outcome
   Aw[1:4]            0 1 0 1    3                 2               fail rule 3
   Ar[1:4]            1 0 1 0
   Anp[1:4]           1 0 1 0
   Aw[:] ∧ Ar[:]      0 0 0 0                                      pass rule 2
   Aw[:] ∧ Anp[:]     0 0 0 0                                      pass rule 4
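The five analysis rules translate directly into code. A minimal Python sketch (names and return strings are mine; the rule numbering follows the slide):

```python
def lpd_analysis(Aw, Ar, Anp, tw):
    """Analysis phase of the LPD test, applying rules 1-5 in order."""
    tm = sum(Aw)                                    # rule 1: marks in Aw
    if any(w and r for w, r in zip(Aw, Ar)):
        return 'not a doall'                        # rule 2: flow/anti dep.
    if tw == tm:
        return 'doall'                              # rule 3: no dependences
    if any(w and np for w, np in zip(Aw, Anp)):
        return 'not a doall'                        # rule 4: dep. survives priv.
    return 'doall after privatizing A'              # rule 5

# Shadow arrays and tw(A) from the slide's example
print(lpd_analysis([0, 1, 0, 1], [1, 0, 1, 0], [1, 0, 1, 0], 3))
# -> 'doall after privatizing A' (fails rule 3, passes rules 2 and 4)
```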


The LRPD Test [8]
Implementation Notes

1. Private Marking Phase
   - private shadow arrays or private hash tables
   - iteration number used for marking
     - makes initialization between iterations unnecessary
   - last value assignment requires time-stamping
     - comes for free, since the "marks" are iteration numbers
     - rarely needed
   - Merge Private Structures into Global Structure
     - elements written to global storage without synchronization
     - when more information is needed: combined software/hardware support

2. Variants of the LPD Test
   - processor-wise test: only check cross-processor dependences
   - inspector/executor usage: if an efficient inspector exists


The LRPD Test [9]
Reduction Parallelization

   do i = 1, n
      do j = 1, m
         A(j) = A(j) + exp()
      enddo
   enddo

Definition: A reduction variable is a variable whose value is used in one associative and commutative operation of the form

   x = x ⊗ exp

where ⊗ is the associative and commutative operator and x does not occur in exp or anywhere else in the loop.

To Apply Reduction Parallelization We Need:
1. reduction variable recognition: static pattern matching (until now)
2. reduction parallelization: known parallel algorithms
3. reduction validation: use data dependence analysis to verify that the reduction variable is not referenced elsewhere in the loop
   - regular access patterns: static dependence analysis suffices
   - irregular problems require run-time validation
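Once a reduction is validated, it can be executed with the standard parallel algorithm the slide alludes to: each processor accumulates into a private copy of the array, and the private copies are combined afterwards. A minimal sequential Python simulation of that scheme (the update format and schedule are my own illustration):

```python
def parallel_reduction(m, updates, nprocs=4):
    """Simulate reduction parallelization of A(j) = A(j) + val:
    each 'processor' p accumulates into its private copy pA[p], and the
    private copies are summed afterwards (the cross-processor step).
    Associativity/commutativity of + makes the interleaving irrelevant."""
    pA = [[0.0] * m for _ in range(nprocs)]
    for p in range(nprocs):                  # simulate the doall
        for j, val in updates[p::nprocs]:    # cyclic schedule of iterations
            pA[p][j] += val
    # combine private copies into the shared result
    return [sum(pA[p][j] for p in range(nprocs)) for j in range(m)]

updates = [(j % 3, 1.0) for j in range(12)]  # 12 updates spread over 3 elements
print(parallel_reduction(3, updates))        # [4.0, 4.0, 4.0]
```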


The LRPD Test [10]
Run-Time Reduction Validation: The LRPD Test

Idea: speculatively assume all potential reductions are valid, and invalidate references that occur outside a reduction statement.

I. The Marking Phase
- shadow arrays Ar[1:s], Aw[1:s], Anp[1:s], and Anx[1:s]
1. definitions (done when the value is written): as in the LPD test
2. uses (done when the value that was read is used): as in the LPD test
3. definitions and uses: mark Anx if the element is not a valid reference in a reduction statement (the element is not a reduction variable)

Example loop:

   do i = 1,4
      A[K[i]] = ...
      A[R[i]] = A[R[i]] + exp
      ...     = A[L[i]]
   enddo

Speculative version with marking:

   doall i = 1,4
      markwrite(K[i])
      markredux(K[i])
      A[K[i]] = ...
      markwrite(R[i])
      A[R[i]] = A[R[i]] + exp
      markread(L[i])
      markredux(L[i])
      ...     = A[L[i]]
   enddoall

Input data:

   K[1:4] = [3,4,3,4]
   L[1:4] = [3,4,3,4]
   R[1:4] = [1,2,1,2]

Resulting shadow arrays:

   shadow arrays    1 2 3 4    written tw(A)
   Aw[1:4]          1 1 1 1    8
   Ar[1:4]          0 0 0 0
   Anp[1:4]         1 1 0 0
   Anx[1:4]         0 0 1 1


The LRPD Test [11]
Run-Time Reduction Validation: The LRPD Test

II. The Analysis Phase
Array elements are classified as: read-only, independent, privatizable, reduction operands, or dependent.

1. compute tm(A), the number of marks in Aw (as in the LPD test)
2. if any(Aw[:] ∧ Ar[:]), then the loop was not a doall (as in the LPD test)
3. else if tw(A) = tm(A), then the loop was a doall (as in the LPD test)
4. else if any(Aw[:] ∧ Anp[:] ∧ Anx[:]), then the loop was not a doall (some dependence was not removed by privatization or reduction parallelization)
5. otherwise, the do loop is transformed into a doall by privatizing A and applying reduction parallelization (all dependences removed)

Result for the previous slide's example:

   shadow array               1 2 3 4    tw(A)  tm(A)  outcome
   Aw[1:4]                    1 1 1 1    8      4      fail rule 3
   Ar[1:4]                    0 0 0 0
   Anp[1:4]                   1 1 0 0
   Anx[1:4]                   0 0 1 1
   Aw[:] ∧ Ar[:]              0 0 0 0                  pass rule 2
   Aw[:] ∧ Anp[:] ∧ Anx[:]    0 0 0 0                  pass rule 4
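The LRPD analysis differs from the LPD analysis only in rule 4, where the Anx shadow array exempts valid reduction references. An illustrative Python sketch (names are mine):

```python
def lrpd_analysis(Aw, Ar, Anp, Anx, tw):
    """Analysis phase of the LRPD test: an element that is neither
    privatizable (Anp) nor a valid reduction reference (Anx) makes the
    loop sequential; otherwise privatization + reduction parallelization
    remove the remaining dependences."""
    tm = sum(Aw)                                        # rule 1
    if any(w and r for w, r in zip(Aw, Ar)):
        return 'not a doall'                            # rule 2
    if tw == tm:
        return 'doall'                                  # rule 3
    if any(w and np and nx for w, np, nx in zip(Aw, Anp, Anx)):
        return 'not a doall'                            # rule 4
    return 'doall with privatization + reduction'       # rule 5

# Shadow arrays from the slide's example (tw(A) = 8, tm(A) = 4)
print(lrpd_analysis([1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], 8))
# -> 'doall with privatization + reduction'
```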


The LRPD Test [12]
More cases of pattern matched reductions

Original loop:

   do i = 1,n
      A[S[i]] = A[S[i]] + exp(i)
      A[R[i]] = A[R[i]] * exp(A[X[i]])
   enddo

Speculative version with marking. A_nx records which reduction operator each element participates in; an element touched by two different operators (or referenced outside a reduction statement) is not a reduction variable:

   init A_nx[:] = .false.
   doall i = 1,n
      markwrite(S[i])
      if (A_nx(S[i]) .ne. .false.) then
         if (A_nx(S[i]) .ne. '+') markredux(S[i])
      else
         A_nx[S[i]] = '+'
      endif
      A[S[i]] = A[S[i]] + exp(i)
      markwrite(R[i])
      if (A_nx(R[i]) .ne. .false.) then
         if (A_nx(R[i]) .ne. '*') markredux(R[i])
      else
         A_nx[R[i]] = '*'
      endif
      markread(X[i])
      markredux(X[i])
      A[R[i]] = A[R[i]] * exp(A[X[i]])
   enddoall


The LRPD Test [13]
Potential reductions that are not pattern matched

Original loop:

   do i = 1,n
      A[K[i]] = ...
      A[R[i]] = A[S[i]] + A[T[i]]
   enddo

Speculative version with marking. The private counter ct counts how many of S[i] and T[i] alias the target R[i]; the statement is a reduction x = x + exp only if exactly one operand equals R[i]:

   init A_nx[:] = .false.
   doall i = 1,n
      private int ct = 0
      markwrite(K[i])
      markredux(K[i])
      A[K[i]] = ...
      markread(S[i])
      if equal(S[i],R[i]) then
         ct = ct + 1
      else
         markredux(S[i])
      endif
      markread(T[i])
      if equal(T[i],R[i]) then
         ct = ct + 1
      else
         markredux(T[i])
      endif
      markwrite(R[i])
      if (ct .ne. 1) markredux(R[i])
      A[R[i]] = A[S[i]] + A[T[i]]
   enddoall


The LRPD Test [14]
Multiple statement reductions

An Expanded Reduction Statement (ERS)

Original loop:

   do i = 1,n
      z = A[K[i]]
      y = constant
      A[R[i]] = z + y
      t = z
      A[L[i]] = t + y
      if (exp) then
         B[f(i)] = t
      endif
   enddo

Speculative version with marking. A read and a write belong to the same reduction only if they reference the same element; a value that escapes outside A invalidates its source element:

   doall i = 1,n
      z = A[K[i]]
      markread(K[i])
      y = constant
      markwrite(R[i])
      if (K[i] .ne. R[i]) then
         markredux(K[i])
         markredux(R[i])
      endif
      A[R[i]] = z + y
      t = z
      markwrite(L[i])
      if (K[i] .ne. L[i]) then
         markredux(K[i])
         markredux(L[i])
      endif
      A[L[i]] = t + y
      if (exp) then
         markredux(K[i])
         B[f(i)] = t
      endif
   enddoall


The LRPD Test [15]
Control Flow Dependent ERSs

Original loop:

   do i = 1,n
      if (B1) then
         z = A[K[i]]
      else
         z = A[L[i]]
         t = A[J[i]]
      endif
      if (B2) t = z
      if (B3) A[R[i]] = A[R[i]] + z
      if (B4) Y[i] = t
   enddo

Speculative version with marking. The marks are placed lazily, at the uses; the guard expressions select which element the used value actually came from:

   doall i = 1,n
      if (B1) then
         z = A[K[i]]
      else
         z = A[L[i]]
         t = A[J[i]]
      endif
      if (B2) t = z
      if (B3) then
         markread(B1*K[i] + ~B1*L[i])
         markredux(B1*K[i] + ~B1*L[i])
         markwrite(R[i])
         A[R[i]] = A[R[i]] + z
      endif
      if (B4) then
         markread(B2*B1*K[i] + B2*~B1*L[i] + ~B1*J[i])
         markredux(B2*B1*K[i] + B2*~B1*L[i] + ~B1*J[i])
         Y[i] = t
      endif
   enddoall


The LRPD Test [16]
Checkpointing and Rollback

Checkpointing: saving program state for possible re-execution.

Techniques
1. use private copies of variables with (on-the-fly) copy-in/copy-out
2. use the original variables and save state in private or global storage

Methods for minimizing the time-space requirements
- the compiler identifies the point of minimum state
  - not necessarily at the start of the loop
- only shared, modified variables need to be saved
  - read-only and privatized variables don't need it
- sparse access patterns: save "on the fly" into hash tables
- strip-mining:
  - static: introduces global synchronization barriers
  - adaptive: use just enough strips to lower the storage requirements while keeping acceptable performance
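For sparse access patterns, on-the-fly saving amounts to a write barrier that records each element's original value, in a hash table, the first time it is overwritten; a failed test then restores exactly those values. An illustrative Python sketch (the class and method names are my own, not from the slides):

```python
class CheckpointedArray:
    """Save-on-first-write checkpointing: only shared, modified elements
    are saved, into a hash table, which suits sparse access patterns."""
    def __init__(self, data):
        self.data = data
        self.saved = {}                  # index -> original value

    def write(self, i, value):
        if i not in self.saved:          # first write: save original on the fly
            self.saved[i] = self.data[i]
        self.data[i] = value

    def commit(self):                    # test passed: keep new values
        self.saved.clear()

    def rollback(self):                  # test failed: restore saved state
        for i, v in self.saved.items():
            self.data[i] = v
        self.saved.clear()

a = CheckpointedArray([0, 0, 0, 0])
a.write(2, 99)
a.write(2, 100)                          # only the first write saves a value
a.rollback()
print(a.data)                            # [0, 0, 0, 0]
```

Only one table entry is created per modified element, so the rollback cost is proportional to the number of distinct locations written, matching the O(W/p) checkpointing bound quoted later in the talk.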


The LRPD Test [17]
Putting it all together

At Compile Time
1. a cost/performance analysis determines whether the loop should be:
   (a) speculatively executed in parallel,
   (b) parallelized with an inspector/executor version of the LRPD test, or
   (c) executed sequentially.
2. generate code for the speculative parallel execution.

At Run-Time
1. checkpoint if necessary (save the state of program variables)
2. execute the parallel version of the loop (including the marking phase)
3. execute the analysis phase of the test (result = pass/fail)
4. if the test passed: (a) copy out the values of live private variables; (b) complete reductions across processors and copy out
   if the test failed: (a) restore state; (b) execute the sequential version of the loop
5. collect statistics for use in:
   - schedule reuse in this run
   - future runs
   - future compilations
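The run-time protocol can be sketched as a small driver; the loop bodies and the test are placeholders standing in for compiler-generated code (this structure is my illustration of the five steps, not the actual generated code):

```python
def speculative_execute(loop_par, loop_seq, state, test_passes):
    """Run-time protocol: checkpoint, run the speculative doall with
    marking, run the analysis phase, then commit or restore and rerun
    sequentially."""
    checkpoint = dict(state)            # 1. save program state
    shadows = loop_par(state)           # 2. parallel loop + marking phase
    if test_passes(shadows):            # 3. analysis phase
        return state                    # 4a. commit (copy-out omitted here)
    state.clear()
    state.update(checkpoint)            # 4b. restore state ...
    return loop_seq(state)              #     ... and run sequentially

# Hypothetical loop whose marking reveals a dependence: the test fails,
# so the state is rolled back and the sequential version produces x = 42.
bad_par = lambda s: s.update(x=-1) or {'dep': True}
seq     = lambda s: s.update(x=42) or s
result = speculative_execute(bad_par, seq, {'x': 0},
                             lambda sh: not sh['dep'])
print(result)                           # {'x': 42}
```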


The LRPD Test [17] (cont.)
Implementation sketch: processor-wise marking and analysis for a reduction loop

Original loop (shared array A[1:m]):

   do i = 1,n
      A[R[i]] = A[R[i]] + exp
      ...     = A[L[i]]
   enddo

Transformed code:

   /* declare and init */
   A[m],    pA[m,nprocs]
   A_w[m],  pA_w[m,nprocs]
   A_r[m],  pA_r[m,nprocs]
   A_nx[m], pA_nx[m,nprocs]
   doall i = 1,nprocs
      init(pA,pA_w,pA_r,pA_nx)
   enddoall

   /* Marking Phase */
   doall i = 1,n
      p = get_proc_id()
      pA_w[R[i],p] = i
      pA[R[i],p] = pA[R[i],p] + exp
      if (pA_w[L[i],p] .ne. i) then
         pA_r[L[i],p] = i
         pA_nx[L[i],p] = .true.
      endif
      ... = pA[L[i],p]
   enddoall

   /* Analysis Phase */
   doall i = 1,nprocs
      A_w  = A_w  ∨ pA_w[1:m,i]
      A_r  = A_r  ∨ pA_r[1:m,i]
      A_nx = A_nx ∨ pA_nx[1:m,i]
   enddoall
   result = test(A_w,A_r,A_nx)
   if (result .eq. pass) then
      /* compute reductions */
      doall i = 1,m
         if (A_nx[i] .eq. .false.) A[i] = sum(pA[i,1:nprocs])
      enddoall
   else
      /* run sequential loop */
   endif


The LRPD Test [18]
Algorithm Complexity

   p : number of processors
   a : total number of accesses in the loop
   s : number of elements in the shared array
   m : size of the original subscript array
   W : size of the modified workspace

marking phase: O(a/p + log p) time, i.e., proportional to max(a/p, log p)
- p processors perform a accesses in O(a/p) time in private storage
- O(a/p + log p) to move the private shadow arrays to global storage (the log p term is due to common writes)

analysis phase: O(s/p + log p) time
- compare/count: s/p in private storage, log p in global storage

checkpointing: O(W/p) time
- nonselective (a priori) when the access pattern is dense
- selective (on-the-fly) when the access pattern is sparse

last value assignment / across-processor reduction: O(s + log p)
- not always needed

If s > a/p, use more sophisticated private shadow structures: hash tables.


The LRPD Test [19]
Cost/Performance Prediction

What is the potential speedup/slowdown of run-time parallelization?

The compiler can estimate (statically and/or at run-time):
- Tseq : sequential execution time
- Tdoall : ideal (no overhead) parallel execution time

attainable speedup (test passes):

   Sp = Tseq / (Tmark + Tanalysis + Tsave + Tdoall)

potential slowdown (test fails): the total time becomes

   Tseq + Tmark + Tanalysis + Ts/r + Tdoall

so the slowdown Sl is the overhead paid on top of Tseq.

worst case: a kernel loop with Tsave, Tmark, Tanalysis, Trestore each proportional to Tdoall:

   Sp ≈ Tseq / (4 Tdoall) = (1/4) Sp_ideal   =>  O(Sp_ideal)
   Sl ≈ (5/p) Tseq                           =>  O((1/p) Tseq)

Optimizations:
- schedule reuse
- statistics collection and use
- one/(p-1) processor solution
- decoupled execution of inspector/executor
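The speedup estimate is a one-line formula, and the worst-case claim follows by plugging in overheads comparable to Tdoall. A small Python illustration (the timing values are made up for the example):

```python
def attainable_speedup(t_seq, t_doall, t_mark, t_analysis, t_save):
    """Sp when the test passes: sequential time over speculative time."""
    return t_seq / (t_mark + t_analysis + t_save + t_doall)

# Worst case: Tmark, Tanalysis, Tsave each comparable to Tdoall, so Sp
# drops to about a quarter of the ideal speedup Tseq / Tdoall.
t_seq, t_doall = 100.0, 10.0                     # ideal speedup: 10x
sp = attainable_speedup(t_seq, t_doall, t_doall, t_doall, t_doall)
print(sp, (t_seq / t_doall) / 4)                 # 2.5 2.5
```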


The LRPD Test [20]
Speculative Run-Time Detection of Parallelism

Static automatic parallelization is limited to roughly 50% of applications.

Run-Time Methods
- can break this fundamental barrier
- are the only viable solution for irregular problems

Speculating about Parallelism at Run-Time Pays Off
- exploits the available parallelism (at least 25% of Sp_ideal)
- risks only small slowdowns (proportional to (1/p) Tseq)

provided that:
- we use parallel techniques
- we bias speculation with run-time collected statistics (profiling)
- we speculate only when enough parallelism is available
- (if we decouple the run-time dependence analysis and perform it in advance, it might even come for free)

More needs to be done!


The LRPD Test [21]
Conclusion: More work needs to be done

Experimental results indicate that run-time parallelization is promising. To exploit it, we need to develop/study:
- seamless integration of the methods into parallel compilers/computers
- compiler strategies for optimizing decisions at run-time
  - automatic application of the run-time tests
  - speculative execution of partially parallel loops
  - generating static "incomplete" information for run-time use
- a high level architectural model
  - for predicting the execution time of subprograms
  - for cost/performance analysis in the presence of speculation
- methods for statistics collection, validation and feedback
  - parallelism profiling is not a proven practice
- desirable: architectural support
- generalization for C


The LRPD Test [22]
Choice between strategies

Choosing between the two strategies:
- Can an inspector loop be extracted?
- Compare costs: (Tsave + Tmark) vs. Timark
- How much extra storage is required for speculative execution?
- Will the working set increase due to speculative execution?


The LRPD Test [23]
Allocating the Private Variables

The Inspector/Executor Strategy:
- privatize the entire array, or
- privatize only the array elements that are written

1. Allocate private storage PA for the written elements of A, i.e., those elements A[k] with Aw[k] = 1.
2. Determine the position of privatized element A[k] in PA from the k-th prefix sum value (s_k) of Aw.
3. Find the offset between the addresses of A and PA: d = &PA[0] - &A[0].
4. Allocate and compute the private subscript array PS[1:s]:
   - if A[S[k]] is not private, then PS[k] = S[k]
   - if A[S[k]] is private, then PS[k] = d + s_{S[k]}
5. Substitute accesses A[S[k]] with A[PS[k]].
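The allocation steps rest on a prefix sum over Aw. A Python illustration (since Python has no pointer arithmetic, the offset trick of step 3 is replaced here by a separate private array plus a tagged index map; the names are mine):

```python
from itertools import accumulate

def allocate_private(A, Aw, S):
    """Allocate private storage PA for the written elements of A
    (those with Aw[k] == 1) and remap the subscript array S so each
    access is redirected either into PA or back into A."""
    prefix = list(accumulate(Aw))            # s_k: running count of written elems
    PA = [A[k] for k in range(len(A)) if Aw[k]]
    # PS[j] is a tagged index: into PA if A[S[j]] is private, else into A
    PS = [('PA', prefix[S[j]] - 1) if Aw[S[j]] else ('A', S[j])
          for j in range(len(S))]
    return PA, PS

A  = [10, 20, 30, 40]
Aw = [0, 1, 0, 1]                            # elements 1 and 3 are written
PA, PS = allocate_private(A, Aw, [1, 2, 3])
print(PA)                                    # [20, 40]
print(PS)                                    # [('PA', 0), ('A', 2), ('PA', 1)]
```

Only the written elements get private copies, so the extra storage is proportional to the number of distinct written locations rather than to the full array size.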


The LRPD Test [24]
Modified Loop Example (from TRFD)

**Original Version**

      DO 540 I=1,NP
      DO 540 J=1,I
         IJ=IA(I)+J
         ....
         DO 540 K=1,I
            MAXL=K
            IF(K.EQ.I) MAXL=J
            DO 540 L=1,MAXL
               KL=IA(K)+L
               .....
C **DOALL Test checks the writes to X**
               X(IJ,KL)=....
  540 CONTINUE

**Augmented Loop for Marking Phase (Speculative Strategy)**

      DOALL I = 1,NP
C **private variables -- do once per processor**
         integer X_w(:,:), tw_i, IJ, MAXL, J, K, L, KL
C **do once per iteration (I)**
         tw_i = 0
         DO J=1,I
            IJ=IA(I)+J
            DO K=1,I
               MAXL=K
               IF(K.EQ.I) MAXL=J
               DO L=1,MAXL
                  KL=IA(K)+L
                  X(IJ,KL)=....
C **mark shadow array X_w (if not already marked)**
                  IF (X_w(IJ,KL) .NE. I) THEN
                     X_w(IJ,KL) = I
                     tw_i = tw_i + 1
                  ENDIF
               ENDDO
            ENDDO
         ENDDO
         tw(X) = tw_i + tw(X)
      ENDDOALL
