Parallelizing the Jacobi Iteration Algorithm
Alberto Rodriguez
Higher Institute of Technologies and Applied Sciences (InSTEC)
October 14, 2016
Outline
1 Serial Approach
2 Using OpenMP
   1D Decomposition
   2D Decomposition
3 Using MPI
   1D Decomposition
Serial Approach
Analyzing the Code
With a serial code we have:
Time complexity: O(I · N²) (nested loops to recompute the matrix at each iteration).
Memory complexity: O(N²) (the size of the matrix).
We expect:
Long run times for larger matrices and/or a large number of iterations.
The matrix size is limited by the system (the amount of memory on the node).
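For concreteness, a minimal sketch of such a serial kernel (the variable names and the 4-point stencil are assumptions; the slides do not show the actual code):

    /* One serial Jacobi run: every interior cell is replaced by the
       average of its four neighbours, and the two grids are swapped
       after each sweep. Time is O(I * N^2) and memory O(N^2),
       matching the analysis above. */
    void jacobi_serial(double **Old, double **New, int Dimension, int Iterations)
    {
        for (int iCount = 1; iCount <= Iterations; iCount++) {
            for (int i = 1; i < Dimension - 1; i++)
                for (int j = 1; j < Dimension - 1; j++)
                    New[i][j] = 0.25 * (Old[i-1][j] + Old[i+1][j]
                                      + Old[i][j-1] + Old[i][j+1]);
            double **tmp = Old; Old = New; New = tmp;  /* reuse the two buffers */
        }
    }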
Results
[Plot: serial execution time t(s) vs. matrix size (128 to 8192), log-log scale, compiled with -O3.]
Results
[Plot: serial execution time t(s) vs. matrix size (128 to 8192), log-log scale, compiled with -O3 -mavx.]
Results
Data obtained with the perf tool.
[Plot: percentage of cache misses vs. matrix size (128 to 8192).]
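For reference, cache-miss figures like these can be collected with perf's stat subcommand; the binary name here is a placeholder:

    perf stat -e cache-references,cache-misses ./jacobi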
Using OpenMP
What part of the code needs parallelization?
    for (iCount = 1; iCount <= Iterations; iCount++)
        for (i = 0; i < Dimension; i++)
            for (j = 0; j < Dimension; j++) { ... }
The outer loop can't be parallelized because the new matrix depends on the old matrix from the previous iteration.
We can parallelize the inner loops.
Instructions using OpenMP
    for (iCount = 1; iCount <= Iterations; iCount++)
        #pragma omp parallel for private(i, j)
        for (i = 0; i < Dimension; i++)
            for (j = 0; j < Dimension; j++) { ... }
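Put together, a runnable version of that loop nest might look like the following sketch (grid allocation and boundary handling are assumed, not shown on the slide):

    #include <omp.h>

    void jacobi_omp(double **Old, double **New, int Dimension, int Iterations)
    {
        int i, j;
        for (int iCount = 1; iCount <= Iterations; iCount++) {
            /* Rows are split among the threads; i and j are private,
               so every thread iterates independently. */
            #pragma omp parallel for private(i, j)
            for (i = 1; i < Dimension - 1; i++)
                for (j = 1; j < Dimension - 1; j++)
                    New[i][j] = 0.25 * (Old[i-1][j] + Old[i+1][j]
                                      + Old[i][j-1] + Old[i][j+1]);
            double **tmp = Old; Old = New; New = tmp;  /* serial swap */
        }
    }

Note that a fresh parallel region is opened on every outer iteration; hoisting the parallel region outside the iteration loop and using a bare omp for inside it is a common way to cut that overhead.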
Results
[Plots: execution time t(s) vs. number of threads (up to 64) for matrix sizes 256, 1024, 4096, and 16384.]
Instructions using OpenMP
    for (iCount = 1; iCount <= Iterations; iCount++)
        #pragma omp parallel for collapse(2) private(i, j)
        for (i = 0; i < Dimension; i++)
            for (j = 0; j < Dimension; j++) { ... }

The collapse(2) clause fuses the two loops into a single iteration space of Dimension² cells, so the work is divided over individual (i, j) cells rather than whole rows; this is the 2D decomposition.
Results
[Plots: execution time t(s) vs. number of threads (up to 64) for matrix sizes 256, 1024, 4096, and 16384.]
Results
Comparison between 1D and 2D decomposition.
[Plots: 1D vs. 2D decomposition, execution time t(s) vs. number of threads for matrix sizes 256, 1024, 4096, and 16384.]
Using MPI
How to divide the matrix
We want:
To minimize cache misses:
Divide the matrix into groups of rows (matching C's row-major storage).
Every process to have roughly the same amount of work:
Every process gets the same number of rows; if that is not possible, the last ones get one more row each.
Every process to be able to access data stored in the previous process and in the next one:
We need to allocate 2 extra rows per process (ghost or boundary cells), as in the sketch below.
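A sketch of how each process could size and allocate its block under these rules (all names are illustrative, not taken from the slides):

    #include <mpi.h>
    #include <stdlib.h>

    /* Allocate this rank's rows plus 2 ghost rows for a 1D row
       decomposition of a Dimension x Dimension matrix. */
    double *alloc_local_block(int Dimension, int *local_rows)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int base = Dimension / size;   /* rows every process gets */
        int rem  = Dimension % size;   /* rows left over          */
        /* The last `rem` ranks take one extra row each. */
        *local_rows = base + (rank >= size - rem ? 1 : 0);

        /* +2 rows of ghost (boundary) cells from the neighbours. */
        return malloc((size_t)(*local_rows + 2) * Dimension * sizeof(double));
    }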
How to divide the matrix
[Diagram: the row-wise decomposition across processes, with ghost rows.]
Communications
How many communications are we going to have between processes?
Every process (except the first and the last) needs to send data to the previous process and to the next one.
Every process (except the first and the last) needs to receive data from the previous process and from the next one in order to update its ghost cells.
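A hedged sketch of that exchange using MPI_Sendrecv (row 0 and row local_rows + 1 are taken to be the ghost rows; the slides do not show the actual calls):

    #include <mpi.h>

    void exchange_ghost_rows(double *local, int local_rows, int Dimension,
                             int rank, int size)
    {
        /* MPI_PROC_NULL turns the sends/receives at the ends into
           no-ops, so the first and last processes need no special case. */
        int prev = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int next = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* First owned row goes up; the lower ghost row arrives from next. */
        MPI_Sendrecv(&local[1 * Dimension], Dimension, MPI_DOUBLE, prev, 0,
                     &local[(local_rows + 1) * Dimension], Dimension, MPI_DOUBLE,
                     next, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Last owned row goes down; the upper ghost row arrives from prev. */
        MPI_Sendrecv(&local[local_rows * Dimension], Dimension, MPI_DOUBLE, next, 1,
                     &local[0 * Dimension], Dimension, MPI_DOUBLE,
                     prev, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }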
Communications
How many values are communicated per iteration?
The program has N processes.
Every process needs to communicate four times (two sends and two receives).
In every communication an entire row is transferred.
Every row has M elements.
Math: N · 4 · 1 · M = 4 · N · M elements per iteration.
Results
[Plots: execution time t(s) vs. number of processes (50 to 550) for matrix sizes 512, 2048, and 8192.]
Results
Comparison between AMD and Intel machines.
[Plots: AMD vs. Intel machines, execution time t(s) vs. number of processes (60 to 260) for matrix sizes 512, 2048, and 8192.]