Parallel Processing: Why, When, How?
SimLab2010, Belgrade
October 5, 2010
Rodica Potolea
Parallel Processing: Why, When, How?

• Why?
  • Problems "too" costly to be solved with the classical approach
  • The need for results in a specific (or reasonable) time
• When?
  • Are there sequences which are better suited for parallel implementation?
  • Are there situations when it is better NOT to parallelize?
• How?
  • Which are the constraints (if any) when we need to pass from sequential to parallel implementation?
5 Oct. 10 Rodica Potolea - CS Dept., T.U. Cluj-Napoca, Romania 2
Outline

• Computation paradigms
• Constraints in parallel computation
• Dependence analysis
• Techniques for dependences removal
• Systolic architectures & simulations
• Increase the parallel potential
• Examples on GRID
Computation paradigms

• Sequential model
  • Referred to as the Von Neumann architecture
  • Basis of comparison
  • He simultaneously proposed other models
  • Due to the technological constraints at that time, only the sequential model was implemented
  • Von Neumann could be considered the founder of parallel computers as well (conceptual view)
Computation paradigms contd.

• Flynn's taxonomy (according to single/multiple instruction/data flow)
  • SISD
  • SIMD: a single control unit dispatches instructions to each processing unit
  • MISD: the same data is analyzed from different points of view (MI)
  • MIMD: different instructions run on different data
Parallel computation

• Why parallel solutions?
  • Increase speed
  • Process huge amounts of data
  • Solve problems in real time
  • Solve problems in due time
Parallel computation: Efficiency

• Parameters to be evaluated (in the sequential view)
  • Time
  • Memory
• Cases to be considered
  • Best
  • Worst
  • Average
• Optimal algorithms
  • Ω() = O()
  • An algorithm is optimal if its worst-case running time is of the same order of magnitude as the function Ω of the problem to be solved
Parallel computation: Efficiency contd.

• Order of magnitude: a function which expresses the running time and Ω respectively
  • For a polynomial function, the polynomial should have the same maximum degree, regardless of the multiplicative constant
  • Ex: for Ω() = c1·n^p and O() = c2·n^p it is considered Ω = O, written as Ω(n^2) = O(n^2)
• Algorithms have O() >= Ω in the worst case
• O() <= Ω could be true for the best-case scenario, or even for the average case
  • Ex, sorting: Ω(n lg n), with an O(n) best-case scenario for some algorithms
Parallel computation: Efficiency contd.

• Do we really need parallel computers?
• Are there any "computing greedy" algorithms better suited for parallel implementation?
• Consider 2 classes of algorithms:
  • Alg1: polynomial
  • Alg2: exponential
• Assume the design of a new system which increases the speed of execution V times
• ? How does the maximum dimension of the problem that can be solved on the new system increase? i.e. express n = f(V)
Parallel computation: Efficiency contd.

Alg1: O(n^k)
           Oper.    Time
M1 (old):  n1^k     T
M2 (new):  n2^k     T/V

n2^k = V·n1^k = (V^(1/k)·n1)^k
n2 = V^(1/k)·n1  =>  n2 = c·n1
Parallel computation: Efficiency contd.

Alg2: O(2^n)
           Oper.    Time
M1 (old):  2^n1     T
M2 (new):  2^n2     T/V

2^n2 = V·2^n1 = 2^(lg V + n1)
n2 = n1 + lg V  =>  n2 = c + n1
Parallel computation: Efficiency contd.

Speed of the new machine in terms of the speed of the old machine: V2 = V·V1

Alg1: O(n^k): n2 = V^(1/k)·n1
Alg2: O(2^n): n2 = n1 + lg V

Conclusion: BAD NEWS!!
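The two results above can be checked numerically. This is a sketch with example values of my own (k = 3, V = 1000, n1 = 100; they are not from the slides): a 1000× faster machine multiplies the solvable problem size of a cubic algorithm by 10, but only adds about 10 to the solvable size of an exponential one.

```python
import math

V = 1000.0   # speedup factor of the new machine (example value)
n1 = 100.0   # problem size solvable on the old machine (example value)

# Alg1: O(n^k) with k = 3: n2 = V^(1/k) * n1 -> multiplicative gain
n2_poly = V ** (1 / 3) * n1
assert abs(n2_poly - 1000.0) < 1e-6

# Alg2: O(2^n): n2 = n1 + lg V -> only an additive gain of ~10
n2_exp = n1 + math.log2(V)
assert 109.9 < n2_exp < 110.0
```

The contrast is the "bad news": for exponential algorithms, faster hardware barely moves the feasible problem size.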
Parallel computation: Efficiency contd.

Are there algorithms that really need exponential running time?
  backtracking: avoid it! "Never" use!
Are there problems that are inherently exponential?
  NP-complete, NP-hard problems
  "difficult" problems – don't we face them in real-world situations often?
  Hamiltonian cycle, subset sum problem, …
Parallel computation: Efficiency contd.

• Parameters to be evaluated for parallel processing
  • Time
  • Memory
  • Number of processors involved
• Cost = Time * Number of processors (c = t * n)
• Efficiency
  • E = O()/cost
  • E ≤ 1
Parallel computation: Efficiency contd.

• Speedup
  • Δ = best sequential execution time / time of parallel execution
  • Δ ≤ n
• Optimal algorithms
  • E = 1, Δ = n
  • Cost = O()
  • A parallel algorithm is optimal if its cost is of the same order of magnitude as the best known algorithm to solve the problem in the worst case
• Other
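A toy check of the definitions above. The running times are made-up example values, not measurements; the point is only the relations cost = t·n, E = O()/cost ≤ 1 and Δ ≤ n.

```python
t_seq = 64.0   # best sequential time, standing in for O() (example value)
n_proc = 8     # number of processors (example value)
t_par = 10.0   # parallel time; > t_seq/n_proc because of communication

cost = t_par * n_proc        # c = t * n
efficiency = t_seq / cost    # E = O()/cost
speedup = t_seq / t_par      # Δ

assert efficiency <= 1
assert speedup <= n_proc
assert abs(efficiency - speedup / n_proc) < 1e-12  # E = Δ/n
```

With E = 1 (and hence Δ = n) the algorithm would be cost-optimal; here E = 0.8 reflects the communication overhead.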
Efficiency contd.

• Parallel running time
  • Computation time
  • Communication time (data transfer, partial results transfer, information communication)
• Computation time
  • As the number of processors involved increases, the computation time decreases
• Communication time
  • Quite the opposite!!!
Efficiency contd.

• Diagram of running time
[figure: total running time T vs. number of processors N]
Constraints in parallel computation

• Data/code to be distributed among the different processing units in the system
• Observe the precedence relations between statements/variables
• Avoid conflicts (data and control dependencies)
• Remove constraints (where possible)
Constraints in parallel computation contd.

• Data parallel
  • Input data split to be processed by different CPUs
  • Characterizes the SIMD model of computation
• Control parallel
  • Different sequences processed by different CPUs
  • Characterizes the MISD and MIMD models of computation
• Conflicts should be avoided

  if constraint removed then parallel sequence of code
  else sequential sequence of code
Dependence analysis

• Dependences represent the major limitation in parallelism
• Data dependencies
  • Refer to read/write conflicts
  • Data Flow dependence
  • Data Antidependence
  • Output dependence
• Control dependencies
  • Exist between statements
Dependence analysis contd.

• Data Flow dependence
  • Indicates a write before read ordering of operations which must be satisfied
  • Can NOT be avoided, therefore it represents the limitation of the parallelism's potential

Ex (assume the sequence is in a loop on i):
  S1: a(i) = x(i) – 3    //assign a(i), i.e. write to its memory location
  S2: b(i) = a(i) / c(i) //use a(i), i.e. read from the corresponding memory location

• a(i) should first be calculated in S1, and only then used in S2. Therefore, S1 and S2 cannot be processed in parallel!!!
Dependence analysis contd.

• Data Antidependence
  • Indicates a read before write ordering of operations which must be satisfied
  • Can be avoided (there are techniques to remove this type of dependence)

Ex (assume the sequence is in a loop on i):
  S1: a(i) = x(i) – 3      // write later
  S2: b(i) = c(i) + a(i+1) // read first

• a(i+1) should first be used (with its former value) in S2, and only then computed in S1, at the next execution of the loop. Therefore, several iterations of the loop cannot be processed in parallel!
Dependence analysis contd.

• Output dependence
  • Indicates a write before write ordering of operations which must be satisfied
  • Can be avoided (there are techniques to remove this type of dependence)

Ex (assume the sequence is in a loop on i):
  S1: c(i+4) = b(i) + a(i+1) //assign first here
  S2: c(i+1) = x(i)          //and here, after 3 executions of the loop

• c(i+4) is assigned in S1; after 3 executions of the loop, the same element of the array is assigned in S2. Therefore, several iterations of the loop cannot be processed in parallel
Dependence analysis contd.

• Each dependence defines a dependence vector
• They are defined by the distance between the loop iterations where they interfere

(data flow)     S1: a(i) = x(i) – 3            d = 0 on var a
                S2: b(i) = a(i) / c(i)
(data antidep.) S1: a(i) = x(i) – 3            d = 1 on var a
                S2: b(i) = c(i) + a(i+1)
(output dep.)   S1: c(i+4) = b(i) + a(i+1)     d = 3 on var c
                S2: c(i+1) = x(i)

• All dependence vectors within a sequence define a dependence matrix
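The distances quoted above follow from a simple rule: iterations i and i+d touch the same array element when i + off1 = (i + d) + off2, where off1 and off2 are the index offsets of the two conflicting accesses. A tiny helper of my own (not from the slides) recovers all three values:

```python
def distance(off1, off2):
    """Dependence distance between accesses a(i+off1) and a(i+off2)."""
    # same element touched when i + off1 == (i + d) + off2  =>  d = off1 - off2
    return abs(off1 - off2)

# data flow:   S1 writes a(i),   S2 reads a(i)    -> d = 0 on a
assert distance(0, 0) == 0
# antidep.:    S1 writes a(i),   S2 reads a(i+1)  -> d = 1 on a
assert distance(0, 1) == 1
# output dep.: S1 writes c(i+4), S2 writes c(i+1) -> d = 3 on c
assert distance(4, 1) == 3
```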
Dependence analysis contd.

• Data dependencies
  • Data Flow dependence (wbr) cannot be removed
  • Data Antidependence (rbw) can be removed
  • Output dependence (wbw) can be removed
Techniques for dependencies removal

• Many dependencies are introduced artificially by programmers
• Output dependencies and antidependencies = false dependencies
  • They occur not because data is passed, but because the same memory location is used in more than one place
• Removal of (some of) those dependencies increases the parallelization potential
Techniques for dependencies removal contd.

• Renaming
  S1: a = b * c
  S2: d = a + 1
  S3: a = a * d
• Dependence analysis:
  • S1->S2 (data flow on a)
  • S2->S3 (data flow on d)
  • S1->S3 (data flow on a)
  • S1->S3 (output dep. on a)
  • S2->S3 (antidependence on a)
Techniques for dependencies removal contd.

• Renaming
  S1: a = b * c
  S2: d = a + 1
  S'3: newa = a * d
• Dependence analysis:
  • S1->S2 (data flow on a) holds
  • S2->S'3 (data flow on d) holds
  • S1->S'3 (data flow on a) holds
  • S1->S'3 (output dep. on a) removed
  • S2->S'3 (antidependence on a) removed
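The renaming above can be exercised directly. A minimal sketch (b = 3, c = 4 are my own example values): writing the product into the fresh location newa produces the same value that S3 produced in a, while no longer overwriting a.

```python
b, c = 3.0, 4.0

# original sequence: S3 rewrites a, creating the output dep. and antidep.
a = b * c      # S1
d = a + 1      # S2
a = a * d      # S3
orig = a

# renamed sequence: S'3 writes a fresh location instead
a = b * c      # S1
d = a + 1      # S2
newa = a * d   # S'3 -- a keeps its S1 value, readers of the old a are safe

assert newa == orig   # same result, false dependencies gone
```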
Techniques for dependencies removal contd.

• Expansion
  for i = 1 to n
    S1: x = b(i) - 2
    S2: y = c(i) * b(i)
    S3: a(i) = x + y
• Dependence analysis:
  • S1->S3 (data flow on x – same iteration)
  • S2->S3 (data flow on y – same iteration)
  • S1->S1 (output dep. on x – consecutive iterations)
  • S2->S2 (output dep. on y – consecutive iterations)
  • S3->S1 (antidep. on x – consecutive iterations)
  • S3->S2 (antidep. on y – consecutive iterations)
Techniques for dependencies removal contd.

• Expansion (loop iterations become independent)
  for i = 1 to n
    S'1: x(i) = b(i) - 2
    S'2: y(i) = c(i) * b(i)
    S'3: a(i) = x(i) + y(i)
• Dependence analysis:
  • S'1->S'3 (data flow on x) holds
  • S'2->S'3 (data flow on y) holds
  • S'1->S'1 (output dep. on x) removed
  • S'2->S'2 (output dep. on y) removed
  • S'3->S'1 (antidep. on x) replaced by the same data flow on x
  • S'3->S'2 (antidep. on y) replaced by the same data flow on y
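A runnable sketch of the expansion above (n and the contents of b and c are my own example values): the scalars x and y become arrays, so each iteration owns its storage; to show that iterations are now independent, the expanded loop is run in reverse order and still matches the sequential original.

```python
n = 8
b = [float(i) for i in range(n)]
c = [2.0 * i for i in range(n)]

# sequential original, with shared scalars x and y
a_seq = [0.0] * n
for i in range(n):
    x = b[i] - 2         # S1
    y = c[i] * b[i]      # S2
    a_seq[i] = x + y     # S3

# expanded version: any iteration order is now legal (here: reversed)
x = [0.0] * n
y = [0.0] * n
a_par = [0.0] * n
for i in reversed(range(n)):
    x[i] = b[i] - 2      # S'1
    y[i] = c[i] * b[i]   # S'2
    a_par[i] = x[i] + y[i]  # S'3

assert a_par == a_seq
```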
Techniques for dependencies removal contd.

• Splitting
  for i = 1 to n
    S1: a(i) = b(i) + c(i)
    S2: d(i) = a(i) + 2
    S3: f(i) = d(i) + a(i+1)
• Dependence analysis:
  • S1->S2 (data flow on a(i) – same iteration)
  • S2->S3 (data flow on d(i) – same iteration)
  • S3->S1 (antidep. on a(i+1) – consecutive iterations)
• We have a cycle of dependencies: S1->S2->S3->S1
Techniques for dependencies removal contd.

• Splitting (cycle removed)
  for i = 1 to n
    S0: newa(i) = a(i+1) //copy the value obtained at the previous step
    S1: a(i) = b(i) + c(i)
    S2: d(i) = a(i) + 2
    S'3: f(i) = d(i) + newa(i)
• Dependence analysis:
  • S1->S2 (data flow on a(i) – same iteration)
  • S2->S'3 (data flow on d(i) – same iteration)
  • S'3->S1 (antidep. on a(i+1)) removed
  • S0->S'3 (data flow on a(i+1) – same iteration)
  • S0->S1 (antidep. on a(i+1) – consecutive iterations)
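The splitting above can be verified by hoisting the S0 copies into their own (fully parallel) loop, which is legal once the cycle is broken. This is a sketch with example data of my own; it checks that the split form computes the same f as the original.

```python
n = 6
a0 = [10.0 * i for i in range(n + 1)]   # initial contents of a
b = [float(i) for i in range(n + 1)]
c = [2.0 * i for i in range(n + 1)]

def original():
    a = a0[:]; d = [0.0] * n; f = [0.0] * n
    for i in range(n - 1):
        a[i] = b[i] + c[i]       # S1
        d[i] = a[i] + 2          # S2
        f[i] = d[i] + a[i + 1]   # S3 reads the OLD a(i+1)
    return f

def split():
    a = a0[:]; d = [0.0] * n; f = [0.0] * n
    # S0 as a separate copy loop: every iteration independent
    newa = [a[i + 1] for i in range(n - 1)]
    for i in range(n - 1):       # no dependence cycle left in this body
        a[i] = b[i] + c[i]       # S1
        d[i] = a[i] + 2          # S2
        f[i] = d[i] + newa[i]    # S'3
    return f

assert split() == original()
```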
Systolic architectures & simulations

• Systolic architecture
  • Special purpose architecture
  • A pipe of processing units called cells
• Advantages
  • Faster
  • Scalable
• Limitations
  • Expensive
  • Highly specialized for particular applications
  • Difficult to build => simulations
Systolic architectures & simulations

• Goals for simulation:
  • Verification and validation
  • Tests and availability of solutions
  • Efficiency
  • Proof of concept
Systolic architectures & simulations

• Models of systolic architectures
  • Many problems expressed in terms of systems of linear equations
  • Linear arrays (from special purpose architectures to multi-purpose architectures)
    • Computing values of a polynomial
    • Convolution product
    • Vector-matrix multiplication
  • Mesh arrays
    • Matrix multiplication
    • Speech recognition (simplified mesh array)
Systolic architectures & simulations

• All but one simulated within a software model in PARLOG
  • Easy to implement
  • Ready to run
  • Specification language
• Speech recognition in Pascal/C on a submesh array
  • Vocabulary of thousands of words
  • Real time recognition
Systolic architectures & simulations

• PARLOG example (convolution product)

  cel([],[],_,_,[],[]).
  cel([X|Tx],[Y|Ty],B,W,[Xo|Txo],[Yo|Tyo]) <-
      Yo is Y + W*X,
      Xo is B,
      cel(Tx,Ty,X,W,Txo,Tyo).
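The PARLOG cell above can be mimicked sequentially. This is my own re-expression of the idea, not code from the talk: each cell holds one weight W and a one-step delay register B; it adds W·X into the stream of partial sums and forwards the delayed X to the next cell, so a chain of cells computes the convolution product.

```python
def cel(xs, ys, b, w):
    """One systolic cell: returns (delayed xs, updated partial sums)."""
    xo, yo = [], []
    for x, y in zip(xs, ys):
        yo.append(y + w * x)   # Yo is Y + W*X
        xo.append(b)           # Xo is B (the delayed x)
        b = x                  # the register now holds the current x
    return xo, yo

def convolve(xs, ws):
    """Chain one cell per weight; pad so every product has a slot to land in."""
    n = len(xs) + len(ws) - 1
    xs = xs + [0] * (n - len(xs))
    ys = [0] * n
    for w in ws:
        xs, ys = cel(xs, ys, 0, w)
    return ys

assert convolve([1, 2, 3], [1, 1]) == [1, 3, 5, 3]
```

Cell k sees the input delayed k steps, so output t accumulates sum over k of w_k·x(t-k), which is exactly the convolution.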
Program transformations

• Loop transformation
  • To allow execution of parallel loops
  • Techniques borrowed from compiler techniques
Program transformations contd.

• Cycle shrinking
  • Applies when the dependence cycle of a loop involves dependencies between iterations far apart (say, at distance d)
  • The loop is executed in parallel within that distance d

  for i = 1 to n
    a(i+d) = b(i) – 1
    b(i+d) = a(i) + c(i)

Transforms into

  for i1 = 1 to n step d
    parallel for i = i1 to i1+d-1
      a(i+d) = b(i) – 1
      b(i+d) = a(i) + c(i)
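A runnable sketch of cycle shrinking (n = 12, d = 3 and the array contents are my own example values): within a window of d consecutive iterations, all writes land at indices i+d, outside the window's reads, so the window's iterations commute. This is checked by permuting iterations inside each window.

```python
n, d = 12, 3

def run(order_in_window):
    a = [float(i) for i in range(n + d)]
    b = [2.0 * i for i in range(n + d)]
    c = [1.0] * (n + d)
    for i1 in range(0, n, d):                      # serial outer loop, step d
        for i in order_in_window(range(i1, min(i1 + d, n))):
            a[i + d] = b[i] - 1                    # writes land d ahead...
            b[i + d] = a[i] + c[i]                 # ...of any read in the window
    return a, b

# forward and reversed window orders agree: the inner loop is parallelizable
assert run(list) == run(lambda w: list(reversed(w)))
```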
Program transformations contd.

• Loop interchanging
  • Consists of interchanging the loop indexes
  • It requires the new dependence vector to be positive

  for j = 1 to n
    for i = 1 to n
      a(i, j) = a(i-1, j) + b(i-2, j)

The inner loop cannot be vectorised (executed in parallel). After interchanging:

  for i = 1 to n
    parallel for j = 1 to n
      a(i, j) = a(i-1, j) + b(i-2, j)
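A sketch of the interchange above (my own example data; i starts at 2 so the i-1 and i-2 indices stay in range): the dependence runs along i only, so once j is the inner loop it touches independent columns and may run in any order. This is checked by running the interchanged inner loop backwards.

```python
n = 5

def run(interchanged):
    a = [[1.0] * (n + 1) for _ in range(n + 1)]
    b = [[2.0] * (n + 1) for _ in range(n + 1)]
    if interchanged:
        for i in range(2, n + 1):
            for j in reversed(range(1, n + 1)):    # any j order is legal
                a[i][j] = a[i - 1][j] + b[i - 2][j]
    else:
        for j in range(1, n + 1):                  # original j-outer nest
            for i in range(2, n + 1):
                a[i][j] = a[i - 1][j] + b[i - 2][j]
    return a

assert run(True) == run(False)
```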
Program transformations contd.

• Loop fusion
  • Transforms 2 adjacent loops into a single one
  • Reduces the overhead of loops
  • It requires not violating the ordering imposed by data dependencies

  for i = 1 to n
    a(i) = b(i) + c(i+1)
  for i = 1 to n
    c(i) = a(i-1) //write before read

Fuses into

  for i = 1 to n
    a(i) = b(i) + c(i+1)
    c(i) = a(i-1)
Program transformations contd.

• Loop collapsing
  • Combines 2 nested loops into one
  • Increases the vector length

  for i = 1 to n
    for j = 1 to m
      a(i,j) = a(i,j-1) + 1

Collapses into

  for ij = 1 to n*m
    i = ij div m
    j = ij mod m
    a(i,j) = a(i,j-1) + 1
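A sketch of the collapse above using 0-based index recovery (i = ij // m, j = ij % m; n and m are my own example values): the flat loop visits the same (i, j) pairs in the same order, so the recurrence along j is preserved.

```python
n, m = 4, 5

def nested():
    a = [[0.0] * (m + 1) for _ in range(n)]
    for i in range(n):
        for j in range(1, m + 1):
            a[i][j] = a[i][j - 1] + 1
    return a

def collapsed():
    a = [[0.0] * (m + 1) for _ in range(n)]
    for ij in range(n * m):          # one flat loop of length n*m
        i = ij // m                  # recover the row index
        j = ij % m + 1               # recover the column index
        a[i][j] = a[i][j - 1] + 1
    return a

assert collapsed() == nested()
```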
Program transformations contd.

• Loop partitioning
  • Partition into independent problems

  for i = 1 to n
    a(i) = a(i-g) + 1

The loop can be split into g independent problems

  for i = k to n step g
    a(i) = a(i-g) + 1

where k = 1, …, g. Each loop could be executed in parallel.
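A sketch of the partitioning above (n = 20, g = 4 and the initial array are my own example values): indices that are congruent mod g form independent chains, so the g chains may run in any order, here checked by reversing the chain order.

```python
n, g = 20, 4

def run(chain_order):
    a = [float(i) for i in range(n + g)]
    for k in chain_order(range(g)):            # each k is one independent chain
        for i in range(g + k, n + g, g):       # indices congruent to k mod g
            a[i] = a[i - g] + 1
    return a

def original():
    a = [float(i) for i in range(n + g)]
    for i in range(g, n + g):
        a[i] = a[i - g] + 1
    return a

# chains commute, and the partitioned form matches the original loop
assert run(list) == run(lambda r: list(reversed(r))) == original()
```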
GRID

• It provides access to
  • computational power
  • storage capacity
Examples on GRID

• Matrix multiplication
• Matrix power
• Gaussian elimination
• Sorting
• Graph algorithms
Examples on GRID: Matrix multiplication cont.

Analysis 1D (absolute analysis)
[figure: "Matrix Multiplication 1D" – time (log scale) vs. number of processors (0–40), for matrix dimensions 100, 300, 700, 1000]

• Optimal sol: p = f(n)
Examples on GRID: Matrix multiplication cont.

Analysis 1D (relative analysis)
[figure: "Matrix Multiplication 1D" – time (%) vs. number of processors (0–40), for matrix dimensions 100, 300, 700, 1000]

Computation time (for small dimensions) is small compared to communication time; hence the parallel 1D implementation is efficient only on very large matrix dimensions.
Examples on GRID: Matrix multiplication cont.

Analysis 2D (absolute analysis)
[figure: "Matrix Multiplication 2D" – time (log scale) vs. number of processors (1, 4, 9, 16, 25, 36), for matrix dimensions 100x100, 300x300, 700x700, 1000x1000]
Examples on GRID: Matrix multiplication cont.

Analysis 2D (relative analysis)
[figure: "Matrix Multiplication 2D" – time (%) vs. number of processors (0–40), for matrix dimensions 100, 300, 700, 1000]
Examples on GRID: Matrix multiplication cont.

Comparison 1D – 2D
[figure: "Comparison 1D - 2D" – time (log scale) vs. number of processors (0–40), for dimensions 100, 300, 700, 1000 in both the 1D and 2D decompositions]
Examples on GRID: Graph algorithms

• Minimum Spanning Trees
[figure: example graph with edge weights 7, 2, 3, 15, 1, 8, 10, 5, 7, 2, 5]

Graph with 7 vertices: the MST will have 6 edges with the sum of their weights minimum
Possible solutions: MST weight = 20
Examples on GRID: Graph algorithms cont.

• MST Absolute Analysis / MST Relative Analysis
[figures: "Minimum Spanning Tree" – absolute time (log scale) and relative time (%) vs. number of processors (1–25), for graphs with 500, 1000, 2000, 5000 vertices]

• Observe that the min value shifts to the right as the graph's dimension increases
• Observe the running time improvement of one order of magnitude (compared to the sequential approach) for graphs with n = 5000
Examples on GRID: Graph algorithms cont.

• Transitive Closure: source-partitioned
  • Each vertex vi is assigned to a process, and each process computes the shortest path from each vertex in its set to every other vertex in the graph, using the sequential algorithm (hence, data partitioning)
  • Requires no inter-process communication
[figure: "Transitive closure - source partition" – time (log scale) vs. number of processors (1–20), for graphs with 200, 500, 1000, 2000 vertices]
Examples on GRID: Graph algorithms cont.

• Transitive Closure: source-parallel
  • Assigns each vertex to a set of processes and uses the parallel formulation of the single-source algorithm (similar to MST)
  • Requires inter-process communication
[figure: "Transitive closure - source parallel" – time (log scale) vs. number of processors (1–16), for graphs with 200, 500, 1000, 2000 vertices]
Examples on GRID: Graph algorithms cont.

• Graph Isomorphism: exponential running time!
  • Search whether 2 graphs match
  • Trivial solution: generate all n! permutations of one graph and match against the other graph
  • Each process is responsible for computing (n-1)! permutations (data parallel approach)
Examples on GRID: Graph algorithms cont.

[figure: "Graph Isomorphism" – time (logarithmic scale) vs. number of processors (0–14), for graphs with 9, 10, 11, 12 vertices]
Conclusions

• The need for parallel processing (why)
  • Existence of problems that require a large amount of time to be solved (like NP-complete and NP-hard problems)
• The constraints in parallel processing (how)
  • Dependence analysis: defines the boundaries of the problem
• The particularities of the solution give the amount of parallelism (when)
  • The parabolic shape: more significant when large communication is involved
  • Shift of the optimal number of processors
  • How parallelization is done matters
  • Time decrease of up to 2 orders of magnitude
Thank you for your attention!
Questions?