Parallel Processing: Why, When, How?
SimLab2010, Belgrade
October 5, 2010
Rodica Potolea
Parallel Processing: Why, When, How?

• Why?
  • Problems "too" costly to be solved with the classical approach
  • The need for results in a specific (or reasonable) time
• When?
  • Are there sequences which are better suited for parallel implementation?
  • Are there situations when it is better NOT to parallelize?
• How?
  • Which are the constraints (if any) when we need to pass from sequential to parallel implementation?
5 Oct. 10 Rodica Potolea - CS Dept., T.U. Cluj-Napoca, Romania 2
Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Techniques for dependences removal
• Systolic architectures & simulations
• Increase the parallel potential
• Examples on GRID
Computation paradigms

• Sequential model
  • Referred to as the Von Neumann architecture
  • Basis of comparison
  • He simultaneously proposed other models
  • Due to the technological constraints at that time, only the sequential model was implemented
  • Von Neumann could be considered the founder of parallel computers as well (conceptual view)
Computation paradigms contd.

• Flynn's taxonomy (according to single/multiple instruction/data flow)
  • SISD
  • SIMD: a single control unit dispatches instructions to each processing unit
  • MISD: the same data is analyzed from different points of view (MI)
  • MIMD: different instructions run on different data
Parallel computation

• Why parallel solutions?
  • Increase speed
  • Process huge amounts of data
  • Solve problems in real time
  • Solve problems in due time
Parallel computation: Efficiency

• Parameters to be evaluated (in the sequential view)
  • Time
  • Memory
• Cases to be considered
  • Best
  • Worst
  • Average
• Optimal algorithms
  • Ω() = O()
  • An algorithm is optimal if its worst-case running time is of the same order of magnitude as the function Ω of the problem to be solved
Parallel computation: Efficiency contd.

• Order of magnitude: a function which expresses the running time and Ω respectively
  • For a polynomial function, the polynomial should have the same maximum degree, regardless of the multiplicative constant
  • Ex: for Ω() = c1·n^p and O() = c2·n^p it is considered Ω = O, written as Ω(n^2) = O(n^2)
• Algorithms have O() >= Ω in the worst case
• O() <= Ω could be true for the best-case scenario, or even for the average case
  • Ex, sorting: Ω(n lg n), with an O(n) best-case scenario for some algorithms
Parallel computation: Efficiency contd.

• Do we really need parallel computers?
• Are there any "computing greedy" algorithms better suited for parallel implementation?
• Consider 2 classes of algorithms:
  • Alg1: polynomial
  • Alg2: exponential
• Assume the design of a new system which increases the speed of execution V times
• ? How does the maximum dimension of the problem that can be solved on the new system increase? i.e. express n = f(V)
Parallel computation: Efficiency contd.

Alg1: O(n^k)
           Oper.    Time
M1 (old):  n1^k     T
M2 (new):  n2^k     T/V

n2^k = V·n1^k = (V^(1/k)·n1)^k
n2 = V^(1/k)·n1  =>  n2 = c·n1
Parallel computation: Efficiency contd.

Alg2: O(2^n)
           Oper.    Time
M1 (old):  2^n1     T
M2 (new):  2^n2     T/V

2^n2 = V·2^n1 = 2^(lg V + n1)
n2 = n1 + lg V  =>  n2 = c + n1
Parallel computation: Efficiency contd.

Speed of the new machine in terms of the speed of the old machine: V2 = V·V1

Alg1: O(n^k): n2 = V^(1/k)·n1
Alg2: O(2^n): n2 = n1 + lg V

Conclusion: BAD NEWS!!
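The two results above can be checked numerically. This is a sketch with example values of my own (k = 3, V = 1000, n1 = 100; they are not from the slides): a 1000× faster machine multiplies the solvable problem size of a cubic algorithm by 10, but only adds about 10 to the solvable size of an exponential one.

```python
import math

V = 1000.0   # speedup factor of the new machine (example value)
n1 = 100.0   # problem size solvable on the old machine (example value)

# Alg1: O(n^k) with k = 3: n2 = V^(1/k) * n1 -> multiplicative gain
n2_poly = V ** (1 / 3) * n1
assert abs(n2_poly - 1000.0) < 1e-6

# Alg2: O(2^n): n2 = n1 + lg V -> only an additive gain of ~10
n2_exp = n1 + math.log2(V)
assert 109.9 < n2_exp < 110.0
```

The contrast is the "bad news": for exponential algorithms, faster hardware barely moves the feasible problem size.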
Parallel computation: Efficiency contd.

Are there algorithms that really need exponential running time?
  backtracking: avoid it! "Never" use!
Are there problems that are inherently exponential?
  NP-complete, NP-hard problems
  "difficult" problems – don't we face them in real-world situations often?
  Hamiltonian cycle, subset sum problem, …
Parallel computation: Efficiency contd.

• Parameters to be evaluated for parallel processing
  • Time
  • Memory
  • Number of processors involved
• Cost = Time * Number of processors (c = t * n)
• Efficiency
  • E = O()/cost
  • E ≤ 1
Parallel computation: Efficiency contd.

• Speedup
  • Δ = best sequential execution time / time of parallel execution
  • Δ ≤ n
• Optimal algorithms
  • E = 1, Δ = n
  • Cost = O()
  • A parallel algorithm is optimal if its cost is of the same order of magnitude as the best known algorithm to solve the problem in the worst case
• Other
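A toy check of the definitions above. The running times are made-up example values, not measurements; the point is only the relations cost = t·n, E = O()/cost ≤ 1 and Δ ≤ n.

```python
t_seq = 64.0   # best sequential time, standing in for O() (example value)
n_proc = 8     # number of processors (example value)
t_par = 10.0   # parallel time; > t_seq/n_proc because of communication

cost = t_par * n_proc        # c = t * n
efficiency = t_seq / cost    # E = O()/cost
speedup = t_seq / t_par      # Δ

assert efficiency <= 1
assert speedup <= n_proc
assert abs(efficiency - speedup / n_proc) < 1e-12  # E = Δ/n
```

With E = 1 (and hence Δ = n) the algorithm would be cost-optimal; here E = 0.8 reflects the communication overhead.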
Efficiency contd.

• Parallel running time
  • Computation time
  • Communication time (data transfer, partial results transfer, information communication)
• Computation time
  • As the number of processors involved increases, the computation time decreases
• Communication time
  • Quite the opposite!!!
Efficiency contd.

• Diagram of running time
[figure: total running time T vs. number of processors N]
Constraints in parallel computation

• Data/code to be distributed among the different processing units in the system
• Observe the precedence relations between statements/variables
• Avoid conflicts (data and control dependencies)
• Remove constraints (where possible)
Constraints in parallel computation contd.

• Data parallel
  • Input data split to be processed by different CPUs
  • Characterizes the SIMD model of computation
• Control parallel
  • Different sequences processed by different CPUs
  • Characterizes the MISD and MIMD models of computation
• Conflicts should be avoided

  if constraint removed then parallel sequence of code
  else sequential sequence of code
Dependence analysis

• Dependences represent the major limitation in parallelism
• Data dependencies
  • Refer to read/write conflicts
  • Data Flow dependence
  • Data Antidependence
  • Output dependence
• Control dependencies
  • Exist between statements
Dependence analysis contd.

• Data Flow dependence
  • Indicates a write before read ordering of operations which must be satisfied
  • Can NOT be avoided, therefore it represents the limitation of the parallelism's potential

Ex (assume the sequence is in a loop on i):
  S1: a(i) = x(i) – 3    //assign a(i), i.e. write to its memory location
  S2: b(i) = a(i) / c(i) //use a(i), i.e. read from the corresponding memory location

• a(i) should first be calculated in S1, and only then used in S2. Therefore, S1 and S2 cannot be processed in parallel!!!
Dependence analysis contd.

• Data Antidependence
  • Indicates a read before write ordering of operations which must be satisfied
  • Can be avoided (there are techniques to remove this type of dependence)

Ex (assume the sequence is in a loop on i):
  S1: a(i) = x(i) – 3      // write later
  S2: b(i) = c(i) + a(i+1) // read first

• a(i+1) should first be used (with its former value) in S2, and only then computed in S1, at the next execution of the loop. Therefore, several iterations of the loop cannot be processed in parallel!
Dependence analysis contd.

• Output dependence
  • Indicates a write before write ordering of operations which must be satisfied
  • Can be avoided (there are techniques to remove this type of dependence)

Ex (assume the sequence is in a loop on i):
  S1: c(i+4) = b(i) + a(i+1) //assign first here
  S2: c(i+1) = x(i)          //and here, after 3 executions of the loop

• c(i+4) is assigned in S1; after 3 executions of the loop, the same element of the array is assigned in S2. Therefore, several iterations of the loop cannot be processed in parallel
Dependence analysis contd.

• Each dependence defines a dependence vector
• They are defined by the distance between the loop iterations where they interfere

(data flow)     S1: a(i) = x(i) – 3            d = 0 on var a
                S2: b(i) = a(i) / c(i)
(data antidep.) S1: a(i) = x(i) – 3            d = 1 on var a
                S2: b(i) = c(i) + a(i+1)
(output dep.)   S1: c(i+4) = b(i) + a(i+1)     d = 3 on var c
                S2: c(i+1) = x(i)

• All dependence vectors within a sequence define a dependence matrix
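The distances quoted above follow from a simple rule: iterations i and i+d touch the same array element when i + off1 = (i + d) + off2, where off1 and off2 are the index offsets of the two conflicting accesses. A tiny helper of my own (not from the slides) recovers all three values:

```python
def distance(off1, off2):
    """Dependence distance between accesses a(i+off1) and a(i+off2)."""
    # same element touched when i + off1 == (i + d) + off2  =>  d = off1 - off2
    return abs(off1 - off2)

# data flow:   S1 writes a(i),   S2 reads a(i)    -> d = 0 on a
assert distance(0, 0) == 0
# antidep.:    S1 writes a(i),   S2 reads a(i+1)  -> d = 1 on a
assert distance(0, 1) == 1
# output dep.: S1 writes c(i+4), S2 writes c(i+1) -> d = 3 on c
assert distance(4, 1) == 3
```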
Dependence analysis contd.

• Data dependencies
  • Data Flow dependence (wbr) cannot be removed
  • Data Antidependence (rbw) can be removed
  • Output dependence (wbw) can be removed
Techniques for dependencies removal

• Many dependencies are introduced artificially by programmers
• Output dependencies and antidependencies = false dependencies
  • They occur not because data is passed, but because the same memory location is used in more than one place
• Removal of (some of) those dependencies increases the parallelization potential
Techniques for dependencies removal contd.

• Renaming
  S1: a = b * c
  S2: d = a + 1
  S3: a = a * d
• Dependence analysis:
  • S1->S2 (data flow on a)
  • S2->S3 (data flow on d)
  • S1->S3 (data flow on a)
  • S1->S3 (output dep. on a)
  • S2->S3 (antidependence on a)
Techniques for dependencies removal contd.

• Renaming
  S1: a = b * c
  S2: d = a + 1
  S'3: newa = a * d
• Dependence analysis:
  • S1->S2 (data flow on a) holds
  • S2->S'3 (data flow on d) holds
  • S1->S'3 (data flow on a) holds
  • S1->S'3 (output dep. on a) removed
  • S2->S'3 (antidependence on a) removed
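The renaming above can be exercised directly. A minimal sketch (b = 3, c = 4 are my own example values): writing the product into the fresh location newa produces the same value that S3 produced in a, while no longer overwriting a.

```python
b, c = 3.0, 4.0

# original sequence: S3 rewrites a, creating the output dep. and antidep.
a = b * c      # S1
d = a + 1      # S2
a = a * d      # S3
orig = a

# renamed sequence: S'3 writes a fresh location instead
a = b * c      # S1
d = a + 1      # S2
newa = a * d   # S'3 -- a keeps its S1 value, readers of the old a are safe

assert newa == orig   # same result, false dependencies gone
```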
Techniques for dependencies removal contd.

• Expansion
  for i = 1 to n
    S1: x = b(i) - 2
    S2: y = c(i) * b(i)
    S3: a(i) = x + y
• Dependence analysis:
  • S1->S3 (data flow on x – same iteration)
  • S2->S3 (data flow on y – same iteration)
  • S1->S1 (output dep. on x – consecutive iterations)
  • S2->S2 (output dep. on y – consecutive iterations)
  • S3->S1 (antidep. on x – consecutive iterations)
  • S3->S2 (antidep. on y – consecutive iterations)
Techniques for dependencies removal contd.

• Expansion (loop iterations become independent)
  for i = 1 to n
    S'1: x(i) = b(i) - 2
    S'2: y(i) = c(i) * b(i)
    S'3: a(i) = x(i) + y(i)
• Dependence analysis:
  • S'1->S'3 (data flow on x) holds
  • S'2->S'3 (data flow on y) holds
  • S'1->S'1 (output dep. on x) removed
  • S'2->S'2 (output dep. on y) removed
  • S'3->S'1 (antidep. on x) replaced by the same data flow on x
  • S'3->S'2 (antidep. on y) replaced by the same data flow on y
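A runnable sketch of the expansion above (n and the contents of b and c are my own example values): the scalars x and y become arrays, so each iteration owns its storage; to show that iterations are now independent, the expanded loop is run in reverse order and still matches the sequential original.

```python
n = 8
b = [float(i) for i in range(n)]
c = [2.0 * i for i in range(n)]

# sequential original, with shared scalars x and y
a_seq = [0.0] * n
for i in range(n):
    x = b[i] - 2         # S1
    y = c[i] * b[i]      # S2
    a_seq[i] = x + y     # S3

# expanded version: any iteration order is now legal (here: reversed)
x = [0.0] * n
y = [0.0] * n
a_par = [0.0] * n
for i in reversed(range(n)):
    x[i] = b[i] - 2      # S'1
    y[i] = c[i] * b[i]   # S'2
    a_par[i] = x[i] + y[i]  # S'3

assert a_par == a_seq
```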
Techniques for dependencies removal contd.

• Splitting
  for i = 1 to n
    S1: a(i) = b(i) + c(i)
    S2: d(i) = a(i) + 2
    S3: f(i) = d(i) + a(i+1)
• Dependence analysis:
  • S1->S2 (data flow on a(i) – same iteration)
  • S2->S3 (data flow on d(i) – same iteration)
  • S3->S1 (antidep. on a(i+1) – consecutive iterations)
• We have a cycle of dependencies: S1->S2->S3->S1
Techniques for dependencies removal contd.

• Splitting (cycle removed)
  for i = 1 to n
    S0: newa(i) = a(i+1) //copy the value obtained at the previous step
    S1: a(i) = b(i) + c(i)
    S2: d(i) = a(i) + 2
    S'3: f(i) = d(i) + newa(i)
• Dependence analysis:
  • S1->S2 (data flow on a(i) – same iteration)
  • S2->S'3 (data flow on d(i) – same iteration)
  • S'3->S1 (antidep. on a(i+1)) removed
  • S0->S'3 (data flow on a(i+1) – same iteration)
  • S0->S1 (antidep. on a(i+1) – consecutive iterations)
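The splitting above can be verified by hoisting the S0 copies into their own (fully parallel) loop, which is legal once the cycle is broken. This is a sketch with example data of my own; it checks that the split form computes the same f as the original.

```python
n = 6
a0 = [10.0 * i for i in range(n + 1)]   # initial contents of a
b = [float(i) for i in range(n + 1)]
c = [2.0 * i for i in range(n + 1)]

def original():
    a = a0[:]; d = [0.0] * n; f = [0.0] * n
    for i in range(n - 1):
        a[i] = b[i] + c[i]       # S1
        d[i] = a[i] + 2          # S2
        f[i] = d[i] + a[i + 1]   # S3 reads the OLD a(i+1)
    return f

def split():
    a = a0[:]; d = [0.0] * n; f = [0.0] * n
    # S0 as a separate copy loop: every iteration independent
    newa = [a[i + 1] for i in range(n - 1)]
    for i in range(n - 1):       # no dependence cycle left in this body
        a[i] = b[i] + c[i]       # S1
        d[i] = a[i] + 2          # S2
        f[i] = d[i] + newa[i]    # S'3
    return f

assert split() == original()
```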
Systolic architectures & simulations

• Systolic architecture
  • Special purpose architecture
  • A pipe of processing units called cells
• Advantages
  • Faster
  • Scalable
• Limitations
  • Expensive
  • Highly specialized for particular applications
  • Difficult to build => simulations
Systolic architectures & simulations

• Goals for simulation:
  • Verification and validation
  • Tests and availability of solutions
  • Efficiency
  • Proof of concept
Systolic architectures & simulations

• Models of systolic architectures
  • Many problems expressed in terms of systems of linear equations
  • Linear arrays (from special purpose architectures to multi-purpose architectures)
    • Computing values of a polynomial
    • Convolution product
    • Vector-matrix multiplication
  • Mesh arrays
    • Matrix multiplication
    • Speech recognition (simplified mesh array)
Systolic architectures & simulations

• All but one simulated within a software model in PARLOG
  • Easy to implement
  • Ready to run
  • Specification language
• Speech recognition in Pascal/C on a submesh array
  • Vocabulary of thousands of words
  • Real time recognition
Systolic architectures & simulations

• PARLOG example (convolution product)

  cel([],[],_,_,[],[]).
  cel([X|Tx],[Y|Ty],B,W,[Xo|Txo],[Yo|Tyo]) <-
      Yo is Y + W*X,
      Xo is B,
      cel(Tx,Ty,X,W,Txo,Tyo).
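The PARLOG cell above can be mimicked sequentially. This is my own re-expression of the idea, not code from the talk: each cell holds one weight W and a one-step delay register B; it adds W·X into the stream of partial sums and forwards the delayed X to the next cell, so a chain of cells computes the convolution product.

```python
def cel(xs, ys, b, w):
    """One systolic cell: returns (delayed xs, updated partial sums)."""
    xo, yo = [], []
    for x, y in zip(xs, ys):
        yo.append(y + w * x)   # Yo is Y + W*X
        xo.append(b)           # Xo is B (the delayed x)
        b = x                  # the register now holds the current x
    return xo, yo

def convolve(xs, ws):
    """Chain one cell per weight; pad so every product has a slot to land in."""
    n = len(xs) + len(ws) - 1
    xs = xs + [0] * (n - len(xs))
    ys = [0] * n
    for w in ws:
        xs, ys = cel(xs, ys, 0, w)
    return ys

assert convolve([1, 2, 3], [1, 1]) == [1, 3, 5, 3]
```

Cell k sees the input delayed k steps, so output t accumulates sum over k of w_k·x(t-k), which is exactly the convolution.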
Program transformations

• Loop transformation
  • To allow execution of parallel loops
  • Techniques borrowed from compiler techniques
Program transformations contd.

• Cycle shrinking
  • Applies when the dependence cycle of a loop involves dependencies between iterations far apart (say, at distance d)
  • The loop is executed in parallel within that distance d

  for i = 1 to n
    a(i+d) = b(i) – 1
    b(i+d) = a(i) + c(i)

Transforms into

  for i1 = 1 to n step d
    parallel for i = i1 to i1+d-1
      a(i+d) = b(i) – 1
      b(i+d) = a(i) + c(i)
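A runnable sketch of cycle shrinking (n = 12, d = 3 and the array contents are my own example values): within a window of d consecutive iterations, all writes land at indices i+d, outside the window's reads, so the window's iterations commute. This is checked by permuting iterations inside each window.

```python
n, d = 12, 3

def run(order_in_window):
    a = [float(i) for i in range(n + d)]
    b = [2.0 * i for i in range(n + d)]
    c = [1.0] * (n + d)
    for i1 in range(0, n, d):                      # serial outer loop, step d
        for i in order_in_window(range(i1, min(i1 + d, n))):
            a[i + d] = b[i] - 1                    # writes land d ahead...
            b[i + d] = a[i] + c[i]                 # ...of any read in the window
    return a, b

# forward and reversed window orders agree: the inner loop is parallelizable
assert run(list) == run(lambda w: list(reversed(w)))
```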
Program transformations contd.

• Loop interchanging
  • Consists of interchanging the loop indexes
  • It requires the new dependence vector to be positive

  for j = 1 to n
    for i = 1 to n
      a(i, j) = a(i-1, j) + b(i-2, j)

The inner loop cannot be vectorised (executed in parallel). After interchanging:

  for i = 1 to n
    parallel for j = 1 to n
      a(i, j) = a(i-1, j) + b(i-2, j)
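A sketch of the interchange above (my own example data; i starts at 2 so the i-1 and i-2 indices stay in range): the dependence runs along i only, so once j is the inner loop it touches independent columns and may run in any order. This is checked by running the interchanged inner loop backwards.

```python
n = 5

def run(interchanged):
    a = [[1.0] * (n + 1) for _ in range(n + 1)]
    b = [[2.0] * (n + 1) for _ in range(n + 1)]
    if interchanged:
        for i in range(2, n + 1):
            for j in reversed(range(1, n + 1)):    # any j order is legal
                a[i][j] = a[i - 1][j] + b[i - 2][j]
    else:
        for j in range(1, n + 1):                  # original j-outer nest
            for i in range(2, n + 1):
                a[i][j] = a[i - 1][j] + b[i - 2][j]
    return a

assert run(True) == run(False)
```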
Program transformations contd.

• Loop fusion
  • Transforms 2 adjacent loops into a single one
  • Reduces the overhead of loops
  • It requires not violating the ordering imposed by data dependencies

  for i = 1 to n
    a(i) = b(i) + c(i+1)
  for i = 1 to n
    c(i) = a(i-1) //write before read

Fuses into

  for i = 1 to n
    a(i) = b(i) + c(i+1)
    c(i) = a(i-1)
Program transformations contd.

• Loop collapsing
  • Combines 2 nested loops into one
  • Increases the vector length

  for i = 1 to n
    for j = 1 to m
      a(i,j) = a(i,j-1) + 1

Collapses into

  for ij = 1 to n*m
    i = ij div m
    j = ij mod m
    a(i,j) = a(i,j-1) + 1
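A sketch of the collapse above using 0-based index recovery (i = ij // m, j = ij % m; n and m are my own example values): the flat loop visits the same (i, j) pairs in the same order, so the recurrence along j is preserved.

```python
n, m = 4, 5

def nested():
    a = [[0.0] * (m + 1) for _ in range(n)]
    for i in range(n):
        for j in range(1, m + 1):
            a[i][j] = a[i][j - 1] + 1
    return a

def collapsed():
    a = [[0.0] * (m + 1) for _ in range(n)]
    for ij in range(n * m):          # one flat loop of length n*m
        i = ij // m                  # recover the row index
        j = ij % m + 1               # recover the column index
        a[i][j] = a[i][j - 1] + 1
    return a

assert collapsed() == nested()
```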
Program transformations contd.

• Loop partitioning
  • Partition into independent problems

  for i = 1 to n
    a(i) = a(i-g) + 1

The loop can be split into g independent problems

  for i = k to n step g
    a(i) = a(i-g) + 1

where k = 1, …, g. Each loop could be executed in parallel.
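A sketch of the partitioning above (n = 20, g = 4 and the initial array are my own example values): indices that are congruent mod g form independent chains, so the g chains may run in any order, here checked by reversing the chain order.

```python
n, g = 20, 4

def run(chain_order):
    a = [float(i) for i in range(n + g)]
    for k in chain_order(range(g)):            # each k is one independent chain
        for i in range(g + k, n + g, g):       # indices congruent to k mod g
            a[i] = a[i - g] + 1
    return a

def original():
    a = [float(i) for i in range(n + g)]
    for i in range(g, n + g):
        a[i] = a[i - g] + 1
    return a

# chains commute, and the partitioned form matches the original loop
assert run(list) == run(lambda r: list(reversed(r))) == original()
```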
GRID

• It provides access to
  • computational power
  • storage capacity
Examples on GRID

• Matrix multiplication
• Matrix power
• Gaussian elimination
• Sorting
• Graph algorithms
Examples on GRID: Matrix multiplication cont.

Analysis 1D (absolute analysis)
[figure: "Matrix Multiplication 1D" – time (log scale) vs. number of processors (0–40), for matrix dimensions 100, 300, 700, 1000]

• Optimal sol: p = f(n)
Examples on GRID: Matrix multiplication cont.

Analysis 1D (relative analysis)
[figure: "Matrix Multiplication 1D" – time (%) vs. number of processors (0–40), for matrix dimensions 100, 300, 700, 1000]

Computation time (for small dimensions) is small compared to communication time; hence the parallel 1D implementation is efficient only on very large matrix dimensions.
Examples on GRID: Matrix multiplication cont.

Analysis 2D (absolute analysis)
[figure: "Matrix Multiplication 2D" – time (log scale) vs. number of processors (1, 4, 9, 16, 25, 36), for matrix dimensions 100x100, 300x300, 700x700, 1000x1000]
Examples on GRID: Matrix multiplication cont.

Analysis 2D (relative analysis)
[figure: "Matrix Multiplication 2D" – time (%) vs. number of processors (0–40), for matrix dimensions 100, 300, 700, 1000]
Examples on GRID: Matrix multiplication cont.

Comparison 1D – 2D
[figure: "Comparison 1D - 2D" – time (log scale) vs. number of processors (0–40), for dimensions 100, 300, 700, 1000 in both the 1D and 2D decompositions]
Examples on GRID: Graph algorithms

• Minimum Spanning Trees
[figure: example graph with edge weights 7, 2, 3, 15, 1, 8, 10, 5, 7, 2, 5]

Graph with 7 vertices: the MST will have 6 edges with the sum of their weights minimum
Possible solutions: MST weight = 20
Examples on GRID: Graph algorithms cont.

• MST Absolute Analysis / MST Relative Analysis
[figures: "Minimum Spanning Tree" – absolute time (log scale) and relative time (%) vs. number of processors (1–25), for graphs with 500, 1000, 2000, 5000 vertices]

• Observe that the min value shifts to the right as the graph's dimension increases
• Observe the running time improvement of one order of magnitude (compared to the sequential approach) for graphs with n = 5000
Examples on GRID: Graph algorithms cont.

• Transitive Closure: source-partitioned
  • Each vertex vi is assigned to a process, and each process computes the shortest path from each vertex in its set to every other vertex in the graph, using the sequential algorithm (hence, data partitioning)
  • Requires no inter-process communication
[figure: "Transitive closure - source partition" – time (log scale) vs. number of processors (1–20), for graphs with 200, 500, 1000, 2000 vertices]
Examples on GRID: Graph algorithms cont.

• Transitive Closure: source-parallel
  • Assigns each vertex to a set of processes and uses the parallel formulation of the single-source algorithm (similar to MST)
  • Requires inter-process communication
[figure: "Transitive closure - source parallel" – time (log scale) vs. number of processors (1–16), for graphs with 200, 500, 1000, 2000 vertices]
Examples on GRID: Graph algorithms cont.

• Graph Isomorphism: exponential running time!
  • Search whether 2 graphs match
  • Trivial solution: generate all n! permutations of one graph and match against the other graph
  • Each process is responsible for computing (n-1)! permutations (data parallel approach)
Examples on GRID: Graph algorithms cont.

[figure: "Graph Isomorphism" – time (logarithmic scale) vs. number of processors (0–14), for graphs with 9, 10, 11, 12 vertices]
Conclusions

• The need for parallel processing (why)
  • Existence of problems that require a large amount of time to be solved (like NP-complete and NP-hard problems)
• The constraints in parallel processing (how)
  • Dependence analysis: defines the boundaries of the problem
• The particularities of the solution give the amount of parallelism (when)
  • The parabolic shape: more significant when large communication is involved
  • Shift of the optimal number of processors
  • How parallelization is done matters
  • Time decrease of up to 2 orders of magnitude
Thank you for your attention!
Questions?