Page 1:

Parallel Processing: Why, When, How?
SimLab2010, Belgrade
October 5, 2010
Rodica Potolea

Page 2:

Parallel Processing: Why, When, How?

• Why?
  • Problems "too" costly to be solved with the classical approach
  • The need for results within a specific (or reasonable) time
• When?
  • Are there sequences better suited for parallel implementation?
  • Are there situations when it is better NOT to parallelize?
• How?
  • What are the constraints (if any) when we pass from a sequential to a parallel implementation?

Page 3:

Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Techniques for dependence removal
• Systolic architectures & simulations
• Increase the parallel potential
• Examples on GRID

Page 4:

Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Techniques for dependence removal
• Systolic architectures & simulations
• Increase the parallel potential
• Examples on GRID

Page 5:

Computation paradigms

• Sequential model
  • referred to as the von Neumann architecture
  • the basis of comparison
  • von Neumann simultaneously proposed other models
  • due to the technological constraints of the time, only the sequential model was implemented
  • von Neumann could be considered the founder of parallel computers as well (from a conceptual view)

Page 6:

Computation paradigms contd.

• Flynn's taxonomy (according to single/multiple instruction/data flow)
  • SISD
  • SIMD: a single control unit dispatches instructions to each processing unit
  • MISD: the same data is analyzed from different points of view (MI)
  • MIMD: different instructions run on different data

Page 7:

Parallel computation

• Why parallel solutions?
  • Increase speed
  • Process huge amounts of data
  • Solve problems in real time
  • Solve problems in due time

Page 8:

Parallel computation: Efficiency

• Parameters to be evaluated (in the sequential view)
  • Time
  • Memory
• Cases to be considered
  • Best
  • Worst
  • Average
• Optimal algorithms
  • Ω() = O()
  • An algorithm is optimal if its running time in the worst case is of the same order of magnitude as the lower-bound function Ω of the problem to be solved

Page 9:

Parallel computation: Efficiency contd.

• Order of magnitude: a function which expresses the running time and Ω respectively
  • for polynomial functions, the polynomials should have the same maximum degree, regardless of the multiplicative constant
  • Ex: if Ω() = c1*n^p and O() = c2*n^p, it is considered that Ω = O, written for instance as Ω(n^2) = O(n^2)
• Algorithms have O() >= Ω in the worst case
• O() <= Ω can hold in the best case, or even in the average case
  • Ex, sorting: Ω(n lg n) in the worst case, yet O(n) in the best case for some algorithms (e.g., an already sorted input)

Page 10:

Parallel computation: Efficiency contd.

• Do we really need parallel computers?
• Are there any "computing greedy" algorithms better suited for parallel implementation?
• Consider 2 classes of algorithms:
  • Alg1: polynomial
  • Alg2: exponential
• Assume the design of a new system which increases the speed of execution V times
• Question: how does the maximum dimension of the problem that can be solved on the new system increase? i.e., express n = f(V)

Page 11:

Parallel computation: Efficiency contd.

Alg1: O(n^k)

             Oper.     Time
M1 (old):    n1^k      T
M2 (new):    V*n1^k    T

n2^k = V*n1^k = (V^(1/k)*n1)^k
n2 = V^(1/k)*n1  =>  n2 = c*n1

Page 12:

Parallel computation: Efficiency contd.

Alg2: O(2^n)

             Oper.     Time
M1 (old):    2^n1      T
M2 (new):    V*2^n1    T

2^n2 = V*2^n1 = 2^(lg V + n1)
n2 = n1 + lg V  =>  n2 = c + n1

Page 13:

Parallel computation: Efficiency contd.

Speed of the new machine in terms of the speed of the old machine: V2 = V * V1

Alg1: O(n^k): n2 = V^(1/k) * n1
Alg2: O(2^n): n2 = n1 + lg V

Conclusion: BAD NEWS!!
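To make the contrast concrete, here is a small C program (my own illustration, not from the slides; V, n1, and k are assumed values) that evaluates both formulas:

#include <stdio.h>
#include <math.h>

int main(void) {
    double V  = 1000.0;  /* assumed speedup of the new machine */
    double n1 = 100.0;   /* assumed problem size solvable on the old machine */
    int    k  = 3;       /* assumed polynomial degree for Alg1: O(n^k) */

    double n2_poly = pow(V, 1.0 / k) * n1;  /* Alg1: multiplicative gain */
    double n2_exp  = n1 + log2(V);          /* Alg2: additive gain only */

    printf("Alg1 O(n^%d): n grows from %.0f to %.0f\n", k, n1, n2_poly);
    printf("Alg2 O(2^n):  n grows from %.0f to about %.0f\n", n1, n2_exp);
    return 0;
}

With a machine 1000 times faster, the polynomial algorithm handles inputs 10 times larger, while the exponential one gains only about 10 extra input elements: the bad news of the slide.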

Page 14:

Parallel computation: Efficiency contd.

Are there algorithms that really need exponential running time?
  backtracking: avoid it! "Never" use it!
Are there problems that are inherently exponential?
  NP-complete and NP-hard problems
  "difficult" problems: do we face them often in real-world situations?
  Hamiltonian cycle, the sum problem, …

Page 15:

Parallel computation: Efficiency contd.

• Parameters to be evaluated for parallel processing
  • Time
  • Memory
  • Number of processors involved
• Cost = Time * Number of processors (c = t * n)
• Efficiency
  • E = O() / cost
  • E ≤ 1
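A worked example of these definitions (my own, not from the slides): summing n numbers with p = n processors in a balanced binary-tree fashion takes O(log n) parallel time, so

\[
\text{cost} = T \cdot p = n \log n, \qquad E = \frac{O(n)}{n \log n} = \frac{1}{\log n} \le 1
\]

With only p = n / log n processors the parallel time stays O(log n) while the cost drops to O(n), the sequential work, so the algorithm becomes cost-optimal (E = Θ(1)).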

Page 16:

Parallel computation: Efficiency contd.

• Speedup
  • Δ = best sequential execution time / parallel execution time
  • Δ ≤ n
• Optimal algorithms
  • E = 1, Δ = n
  • Cost = O()
  • A parallel algorithm is optimal if its cost is of the same order of magnitude as the best known algorithm solving the problem in the worst case
• Other

Page 17:

Efficiency cont.

• Parallel running time
  • Computation time
  • Communication time (data transfer, partial results transfer, information communication)
• Computation time
  • As the number of processors involved increases, the computation time decreases
• Communication time
  • Quite the opposite!!!

Page 18:

Efficiency cont.

• Diagram of running time
[Chart: total running time T versus number of processors N.]
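A minimal model of this trade-off (my own sketch, assuming computation divides evenly across N processors while communication cost grows linearly with N):

\[
T(N) = \frac{T_{\text{comp}}}{N} + cN, \qquad T'(N) = -\frac{T_{\text{comp}}}{N^2} + c = 0 \;\Rightarrow\; N^{*} = \sqrt{T_{\text{comp}}/c}
\]

Below N* adding processors helps; above it, communication dominates and the total time grows again, which gives the diagram its parabolic shape.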

Page 19:

Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Techniques for dependence removal
• Systolic architectures & simulations
• Increase the parallel potential
• Examples on GRID

Page 20:

Constraints in parallel computation

• Data/code to be distributed among the different processing units in the system
• Observe the precedence relations between statements/variables
• Avoid conflicts (data and control dependencies)
• Remove constraints (where possible)

Page 21:

Constraints in parallel computation contd.

• Data parallel
  • Input data is split to be processed by different CPUs
  • Characterizes the SIMD model of computation
• Control parallel
  • Different sequences are processed by different CPUs
  • Characterizes the MISD and MIMD models of computation
• Conflicts should be avoided:

  if constraint removed then parallel sequence of code
  else sequential sequence of code

Page 22:

Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Systolic architectures & simulations
• Techniques for dependence removal
• Increase the parallel potential
• Examples on GRID

Page 23:

Dependence analysis

• Dependences represent the major limitation on parallelism
• Data dependencies
  • Refer to read/write conflicts
  • Data Flow dependence
  • Data Antidependence
  • Output dependence
• Control dependencies
  • Exist between statements

Page 24:

Dependence analysis

• Data Flow dependence
  • Indicates a write-before-read ordering of operations which must be satisfied
  • Can NOT be avoided, and therefore represents the limit of the parallelism's potential

Ex (assume the sequence is inside a loop on i):

S1: a(i) = x(i) - 3    //assign a(i), i.e., write to its memory location
S2: b(i) = a(i) / c(i) //use a(i), i.e., read from the corresponding memory location

• a(i) must first be computed in S1, and only then used in S2. Therefore, S1 and S2 cannot be processed in parallel!!!

Page 25:

Parallel Algorithms cont.: Dependence analysis

• Data Antidependence
  • Indicates a read-before-write ordering of operations which must be satisfied
  • Can be avoided (there are techniques to remove this type of dependence)

Ex (assume the sequence is inside a loop on i):

S1: a(i) = x(i) - 3          // write later
S2: b(i) = c(i) + a(i+1)     // read first

• a(i+1) must first be used (with its former value) in S2, and only then computed in S1, at the next execution of the loop. Therefore, several iterations of the loop cannot be processed in parallel!

Page 26:

Parallel Algorithms cont.: Dependence analysis

• Output dependence
  • Indicates a write-before-write ordering of operations which must be satisfied
  • Can be avoided (there are techniques to remove this type of dependence)

Ex (assume the sequence is inside a loop on i):

S1: c(i+4) = b(i) + a(i+1)   //assign first here
S2: c(i+1) = x(i)            //and here, after 3 executions of the loop

• c(i+4) is assigned in S1; after 3 executions of the loop, the same element of the array is assigned in S2. Therefore, several iterations of the loop cannot be processed in parallel.

Page 27:

Parallel Algorithms cont.: Dependence analysis

• Each dependence defines a dependence vector
• They are defined by the distance between the loop iterations where they interfere

(data flow)      S1: a(i) = x(i) - 3          d = 0 on var a
                 S2: b(i) = a(i) / c(i)

(data antidep.)  S1: a(i) = x(i) - 3          d = 1 on var a
                 S2: b(i) = c(i) + a(i+1)

(output dep.)    S1: c(i+4) = b(i) + a(i+1)   d = 3 on var c
                 S2: c(i+1) = x(i)

• All dependence vectors within a sequence define a dependence matrix

Page 28:

Parallel Algorithms cont.: Dependence analysis

• Data dependencies
  • Data Flow dependence (wbr, write before read) cannot be removed
  • Data Antidependence (rbw, read before write) can be removed
  • Output dependence (wbw, write before write) can be removed

Page 29:

Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Techniques for dependence removal
• Systolic architectures & simulations
• Increase the parallel potential
• Examples on GRID

Page 30:

Techniques for dependencies removal

• Many dependencies are introduced artificially by programmers
• Output dependencies and antidependencies = false dependencies
  • They occur not because data is passed, but because the same memory location is used in more than one place
• Removal of (some of) those dependencies increases the parallelization potential

Page 31:

Techniques for dependencies removal contd.

• Renaming
S1: a = b * c
S2: d = a + 1
S3: a = a * d

• Dependence analysis:
  • S1->S2 (data flow on a)
  • S2->S3 (data flow on d)
  • S1->S3 (data flow on a)
  • S1->S3 (output dep. on a)
  • S2->S3 (antidependence on a)

Page 32:

Techniques for dependencies removal contd.

• Renaming
S1: a = b * c
S2: d = a + 1
S'3: newa = a * d

• Dependence analysis:
  • S1->S2 (data flow on a) holds
  • S2->S'3 (data flow on d) holds
  • S1->S'3 (data flow on a) holds
  • S1->S'3 (output dep. on a) removed
  • S2->S'3 (antidependence on a) removed

Page 33:

Techniques for dependencies removal contd.

• Expansion
for i = 1 to n
  S1: x = b(i) - 2
  S2: y = c(i) * b(i)
  S3: a(i) = x + y

• Dependence analysis:
  • S1->S3 (data flow on x, same iteration)
  • S2->S3 (data flow on y, same iteration)
  • S1->S1 (output dep. on x, consecutive iterations)
  • S2->S2 (output dep. on y, consecutive iterations)
  • S3->S1 (antidep. on x, consecutive iterations)
  • S3->S2 (antidep. on y, consecutive iterations)

Page 34:

Techniques for dependencies removal contd.

• Expansion (loop iterations become independent)
for i = 1 to n
  S'1: x(i) = b(i) - 2
  S'2: y(i) = c(i) * b(i)
  S'3: a(i) = x(i) + y(i)

• Dependence analysis:
  • S'1->S'3 (data flow on x) holds
  • S'2->S'3 (data flow on y) holds
  • S'1->S'1 (output dep. on x) removed
  • S'2->S'2 (output dep. on y) removed
  • S'3->S'1 (antidep. on x) replaced by the same-iteration data flow on x
  • S'3->S'2 (antidep. on y) replaced by the same-iteration data flow on y
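A sketch of the expanded loop in C with OpenMP (my own illustration; the array names follow the slide, the size N is assumed):

#define N 1000  /* assumed problem size */

double a[N], b[N], c[N];
double x[N], y[N];  /* expanded scalars: one slot per iteration */

void compute(void) {
    /* after expansion each iteration writes its own x(i) and y(i),
       so no cross-iteration conflicts remain */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        x[i] = b[i] - 2;       /* S'1 */
        y[i] = c[i] * b[i];    /* S'2 */
        a[i] = x[i] + y[i];    /* S'3 */
    }
}

OpenMP's private(x, y) clause would achieve the same effect as manual expansion without allocating the extra arrays.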

Page 35:

Techniques for dependencies removal contd.

• Splitting
for i = 1 to n
  S1: a(i) = b(i) + c(i)
  S2: d(i) = a(i) + 2
  S3: f(i) = d(i) + a(i+1)

• Dependence analysis:
  • S1->S2 (data flow on a(i), same iteration)
  • S2->S3 (data flow on d(i), same iteration)
  • S3->S1 (antidep. on a(i+1), consecutive iterations)

We have a cycle of dependencies: S1->S2->S3->S1

Page 36:

Techniques for dependencies removal contd.

• Splitting (cycle removed)
for i = 1 to n
  S0: newa(i) = a(i+1)   //copy the value obtained at the previous step
  S1: a(i) = b(i) + c(i)
  S2: d(i) = a(i) + 2
  S'3: f(i) = d(i) + newa(i)

• Dependence analysis:
  • S1->S2 (data flow on a(i), same iteration)
  • S2->S'3 (data flow on d(i), same iteration)
  • S0->S'3 (data flow on newa(i), same iteration)
  • S0->S1 (antidep. on a(i+1), consecutive iterations)

The cycle S1->S2->S'3->S1 is broken: no dependence leads back into S1 from S'3.
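One way to realize the split in C with OpenMP (my own sketch, assumed size N): the copies of a(i+1) are taken in a first pass, after which the main loop has no cross-iteration dependence.

#define N 1000  /* assumed problem size */
double a[N], b[N], c[N], d[N], f[N], newa[N];

void split(void) {
    /* pass 1 (S0): save the old neighbours before they are overwritten */
    #pragma omp parallel for
    for (int i = 0; i < N - 1; i++)
        newa[i] = a[i + 1];

    /* pass 2 (S1, S2, S'3): iterations are now fully independent */
    #pragma omp parallel for
    for (int i = 0; i < N - 1; i++) {
        a[i] = b[i] + c[i];
        d[i] = a[i] + 2;
        f[i] = d[i] + newa[i];
    }
}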

Page 37:

Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Techniques for dependence removal
• Systolic architectures & simulations
• Increase the parallel potential
• Examples on GRID

Page 38:

Systolic architectures & simulations

• Systolic architecture
  • Special-purpose architecture
  • a pipe of processing units called cells
• Advantages
  • Faster
  • Scalable
• Limitations
  • Expensive
  • Highly specialized for particular applications
  • Difficult to build => simulations

Page 39:

Systolic architectures & simulations

• Goals for simulation:
  • Verification and validation
  • Tests and availability of solutions
  • Efficiency
  • Proof of concept

Page 40:

Systolic architectures & simulations

• Models of systolic architectures
  • Many problems are expressed in terms of systems of linear equations
  • Linear arrays (from special-purpose architectures to multi-purpose architectures)
    • Computing the values of a polynomial
    • Convolution product
    • Vector-matrix multiplication
  • Mesh arrays
    • Matrix multiplication
    • Speech recognition (simplified mesh array)

Page 41:

Systolic architectures & simulations

• All but one simulated within a software model in PARLOG
  • Easy to implement
  • Ready to run
  • Specification language
• Speech recognition: Pascal/C sub-mesh array
  • Vocabulary of thousands of words
  • Real-time recognition

Page 42:

Systolic architectures & simulations

• PARLOG example (convolution product)

% cel(Xs, Ys, B, W, Xos, Yos): one systolic cell with weight W;
% x values stream in on Xs and are forwarded with a one-step delay
% (through the buffer B) on Xos, while each partial sum from Ys
% leaves on Yos increased by W*X.
cel([],[],_,_,[],[]).
cel([X|Tx],[Y|Ty],B,W,[Xo|Txo],[Yo|Tyo]) <-
    Yo is Y + W*X,
    Xo is B,
    cel(Tx,Ty,X,W,Txo,Tyo).

Page 43:

Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Techniques for dependence removal
• Systolic architectures & simulations
• Increase the parallel potential
• Examples on GRID

Page 44:

Program transformations

• Loop transformation
  • To allow execution of parallel loops
  • Techniques borrowed from compilers

Page 45:

Program transformations contd.

• Cycle shrinking
  • Applies when the dependence cycle of a loop involves dependencies between iterations far apart (say, at distance d)
  • The loop is then executed in parallel within that distance d

for i = 1 to n
  a(i+d) = b(i) - 1
  b(i+d) = a(i) + c(i)

Transforms into

for i1 = 1 to n step d
  parallel for i = i1 to i1+d-1
    a(i+d) = b(i) - 1
    b(i+d) = a(i) + c(i)
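The transformed loop in C with OpenMP (my own sketch; n and d are assumed, and the arrays are assumed to hold at least n + d + 1 elements, index 0 unused to mirror the slide's 1-based indexing):

void cycle_shrink(int n, int d, double *a, double *b, const double *c) {
    /* within one chunk of d iterations, every write lands at least d
       positions ahead of every read, so the chunk is conflict-free */
    for (int i1 = 1; i1 <= n; i1 += d) {
        #pragma omp parallel for
        for (int i = i1; i <= i1 + d - 1 && i <= n; i++) {
            a[i + d] = b[i] - 1;
            b[i + d] = a[i] + c[i];
        }
    }
}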

Page 46:

Program transformations contd.

• Loop interchanging
  • Consists of interchanging the loop indexes
  • It requires the new dependence vector to remain positive

for j = 1 to n
  for i = 1 to n
    a(i,j) = a(i-1,j) + b(i-2,j)

The inner loop cannot be vectorized (executed in parallel). After interchanging:

for i = 1 to n
  parallel for j = 1 to n
    a(i,j) = a(i-1,j) + b(i-2,j)
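The interchanged loop in C with OpenMP (my own sketch; the dimension N is assumed, 0-based indexing so the i loop starts where both neighbour rows exist):

#define N 512  /* assumed dimension */
double a[N][N], b[N][N];

void interchange(void) {
    for (int i = 2; i < N; i++) {
        /* a(i,j) depends only on rows i-1 and i-2, never on another j,
           so after the interchange the j loop is safe to run in parallel */
        #pragma omp parallel for
        for (int j = 0; j < N; j++)
            a[i][j] = a[i - 1][j] + b[i - 2][j];
    }
}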

Page 47:

Program transformations contd.

• Loop fusion
  • Transforms 2 adjacent loops into a single one
  • Reduces the overhead of the loops
  • It requires not violating the ordering imposed by the data dependencies

for i = 1 to n
  a(i) = b(i) + c(i+1)
for i = 1 to n
  c(i) = a(i-1)   //write before read

Fuses into

for i = 1 to n
  a(i) = b(i) + c(i+1)
  c(i) = a(i-1)

Page 48:

Program transformations contd.

• Loop collapsing
  • Combines 2 nested loops into a single one
  • Increases the vector length

for i = 1 to n
  for j = 1 to m
    a(i,j) = a(i,j-1) + 1

Collapses into

for ij = 1 to n*m
  i = ij div m
  j = ij mod m
  a(i,j) = a(i,j-1) + 1
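In C (my own sketch, 0-based indexing) the collapsed index decomposes with integer division and modulo; a guard replaces the j = 1 boundary of the original loop:

#define NROWS 100  /* assumed n */
#define MCOLS 200  /* assumed m */
double a[NROWS][MCOLS];

void collapse(void) {
    /* one flat loop over the n*m iteration space */
    for (int ij = 0; ij < NROWS * MCOLS; ij++) {
        int i = ij / MCOLS;  /* row    */
        int j = ij % MCOLS;  /* column */
        if (j > 0)           /* j = 0 has no left neighbour */
            a[i][j] = a[i][j - 1] + 1;
    }
}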

Page 49:

Program transformations contd.

• Loop partitioning
  • Partition into independent problems

for i = 1 to n
  a(i) = a(i-g) + 1

The loop can be split into g independent problems:

for i = k to n step g
  a(i) = a(i-g) + 1

where k = 1, ..., g. Each of the g loops can be executed in parallel.
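The g independent chains in C with OpenMP (my own sketch; n and g assumed): chain k only ever touches indices congruent to k modulo g, so the chains can run concurrently while each chain stays sequential.

void partition(int n, int g, double *a) {
    #pragma omp parallel for
    for (int k = 0; k < g; k++)
        /* each chain carries its own recurrence a(i) = a(i-g) + 1 */
        for (int i = k + g; i < n; i += g)
            a[i] = a[i - g] + 1;
}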

Page 50:

Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Systolic architectures & simulations
• Techniques for dependence removal
• Increase the parallel potential
• Examples on GRID

Page 51:

GRID

• It provides access to
  • computational power
  • storage capacity

Page 52:

Examples on GRID

• Matrix multiplication
• Matrix power
• Gaussian elimination
• Sorting
• Graph algorithms

Page 53:

Examples on GRID: Matrix multiplication cont.

Analysis 1D (absolute analysis)

[Chart: Matrix Multiplication 1D, running time (logarithmic scale) vs. number of processors (0 to 40), for matrix dimensions 100, 300, 700, 1000.]

• Optimal solution: p = f(n)

Page 54:

Examples on GRID: Matrix multiplication cont.

Analysis 1D (relative analysis)

[Chart: Matrix Multiplication 1D, relative time (%) vs. number of processors (0 to 40), for matrix dimensions 100, 300, 700, 1000.]

Computation time (for small dimensions) is small compared to communication time; hence the 1D parallel implementation is efficient only for very large matrix dimensions.

Page 55:

Examples on GRID: Matrix multiplication cont.

Analysis 2D (absolute analysis)

[Chart: Matrix Multiplication 2D, running time (logarithmic scale) vs. number of processors (1, 4, 9, 16, 25, 36), for matrix dimensions 100x100, 300x300, 700x700, 1000x1000.]

Page 56:

Examples on GRID: Matrix multiplication cont.

Analysis 2D (relative analysis)

[Chart: Matrix Multiplication 2D, relative time (%) vs. number of processors (0 to 40), for matrix dimensions 100, 300, 700, 1000.]

Page 57:

Examples on GRID: Matrix multiplication cont.

Comparison 1D vs. 2D

[Chart: running time (logarithmic scale) vs. number of processors (0 to 40) for the 1D and 2D decompositions, matrix dimensions 100, 300, 700, 1000.]

Page 58:

Examples on GRID: Graph algorithms

• Minimum Spanning Trees

[Figure: a weighted graph with 7 vertices, edge weights 7, 2, 3, 15, 1, 8, 10, 5, 7, 2, 5.]

A graph with 7 vertices: the MST will have 6 edges, with the sum of their weights minimal.
Possible solutions: MST weight = 20

Page 59:

Examples on GRID: Graph algorithms cont.

• MST Absolute Analysis / MST Relative Analysis

[Charts: Minimum Spanning Tree running time (logarithmic scale) and relative time (%) vs. number of processors (1 to 25), for graphs with 500, 1000, 2000, 5000 vertices.]

• Observe that the minimum value shifts to the right as the graph's dimension increases
• Observe the running-time improvement of one order of magnitude (compared to the sequential approach) for graphs with n = 5000

Page 60:

Examples on GRID: Graph algorithms cont.

• Transitive Closure: source-partitioned
  • Each vertex vi is assigned to a process, and each process computes the shortest paths from each vertex in its set to every other vertex in the graph, using the sequential algorithm (hence, data partitioning)
  • Requires no inter-process communication

[Chart: Transitive closure, source partition: running time (logarithmic scale) vs. number of processors (1 to 20), for graphs with 200, 500, 1000, 2000 vertices.]
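A sketch of the source-partitioned scheme in C with OpenMP (my own illustration; sssp_sequential is a hypothetical sequential single-source shortest-path routine, and N is an assumed maximum graph size):

#define N 2000  /* assumed maximum number of vertices */

/* hypothetical sequential routine: fills dist[] with shortest
   distances from source s over an n-vertex graph */
void sssp_sequential(int s, int n, const int graph[N][N], int dist[N]);

void closure_source_partitioned(int n, const int graph[N][N], int dist[N][N]) {
    /* each source vertex is an independent subproblem: pure data
       partitioning, no inter-process communication required */
    #pragma omp parallel for
    for (int s = 0; s < n; s++)
        sssp_sequential(s, n, graph, dist[s]);
}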

Page 61:

Examples on GRID: Graph algorithms cont.

• Transitive Closure: source-parallel
  • assigns each vertex to a set of processes and uses the parallel formulation of the single-source algorithm (similar to MST)
  • Requires inter-process communication

[Chart: Transitive closure, source parallel: running time (logarithmic scale) vs. number of processors (1 to 16), for graphs with 200, 500, 1000, 2000 vertices.]

Page 62:

Examples on GRID: Graph algorithms cont.

• Graph Isomorphism: exponential running time!
  • search whether 2 graphs match
  • trivial solution: generate all n! permutations of one graph and match each against the other graph
  • Each process is responsible for (n-1)! permutations (a data-parallel approach): the n! permutations split into n groups of (n-1)! each
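A sketch of that data-parallel split in C (my own illustration; N is an assumed graph size and the adjacency matrices are assumed to be filled elsewhere): slice p explores exactly the (N-1)! permutations that map vertex 0 to vertex p.

#include <stdbool.h>

#define N 8  /* assumed number of vertices */
bool g1[N][N], g2[N][N];  /* adjacency matrices, assumed filled */

/* does perm (vertex i of g1 -> vertex perm[i] of g2) preserve all edges? */
static bool check(const int perm[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (g1[i][j] != g2[perm[i]][perm[j]])
                return false;
    return true;
}

/* extend perm[0..k-1] with every unused vertex */
static bool search(int perm[N], bool used[N], int k) {
    if (k == N)
        return check(perm);
    for (int v = 0; v < N; v++) {
        if (!used[v]) {
            used[v] = true;
            perm[k] = v;
            if (search(perm, used, k + 1))
                return true;
            used[v] = false;
        }
    }
    return false;
}

/* slice p: the (N-1)! permutations with perm[0] = p */
bool iso_slice(int p) {
    int perm[N];
    bool used[N] = { false };
    perm[0] = p;
    used[p] = true;
    return search(perm, used, 1);
}

Running iso_slice(p) for p = 0..N-1 on N processes partitions the n! search space evenly, with no communication until a match is reported.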

Page 63:

Examples on GRID: Graph algorithms cont.

[Chart: Graph Isomorphism, running time (logarithmic scale) vs. number of processors (0 to 14), for graphs with 9, 10, 11, and 12 vertices.]

Page 64:

Conclusions

• The need for parallel processing (why)
  • Existence of problems that require a large amount of time to be solved (like NP-complete and NP-hard problems)
• The constraints in parallel processing (how)
  • Dependence analysis: defines the boundaries of the problem
• The particularities of the solution give the amount of parallelism (when)
  • The parabolic shape: more significant when heavy communication is involved
  • Shift of the optimal number of processors
• How parallelization is done matters
  • Time decreases by up to 2 orders of magnitude

Page 65:

Thank you for your attention!

Questions?
