Post on 23-Aug-2018
Science Advisory Committee Meeting - 20 September 3, 2010 • Stanford University
1 04_Parallel Processing
Parallel Processing
Majid AlMeshari John W. Conklin
Science Advisory Committee Meeting - 20
2
September 3, 2010 • Stanford University
04_Parallel Processing
Outline • Challenge • Requirements • Resources • Approach • Status • Tools for Processing
Science Advisory Committee Meeting - 20
3
September 3, 2010 • Stanford University
04_Parallel Processing
Challenge
A computationally intensive algorithm is applied on a huge set of data. Verification of filter’s correctness takes a long time and hinders progress.
Science Advisory Committee Meeting - 20
4
September 3, 2010 • Stanford University
04_Parallel Processing
Requirements • Achieve a speed-up between 5-10x over serial version • Minimize parallelization overhead • Minimize time to parallelize new releases of code • Achieve reasonable scalability
Number of Processors
Tim
e (s
ec)
SAC 19
Science Advisory Committee Meeting - 20
5
September 3, 2010 • Stanford University
04_Parallel Processing
Resources • Hardware
– Computer Cluster (Regelation), a 44 64-bit CPUs » 64-bit enables us to address a memory space beyond 4 GB » Using this cluster because:
(a) likely will not need more than 44 processors (b) same platform as our Linux boxes
• Software – MatlabMPI
» Matlab parallel computing toolbox created by MIT Lincoln Lab
» Eliminates the need to port our code into another language
Science Advisory Committee Meeting - 20
6
September 3, 2010 • Stanford University
04_Parallel Processing
Approach - Serial Code Structure Initialization
Loop over iterations
Loop Gyros
Loop over orbits
Loop over Segments
Build h(x) (the right hand side)
Build Jacobian, J(x)
Load data (Z), perform fit Bayesian estimation,
display results
Computationally Intensive
Components
(Parallelize!!!)
Science Advisory Committee Meeting - 20
7
September 3, 2010 • Stanford University
04_Parallel Processing
Approach - Parallel Code Structure Initialization
Loop over iterations Loop Gyros
Loop over orbits Loop over Segments
Build h(x)
Build Jacobian, J(x)
Bayesian estimation, display results
Distribute J(x), all nodes
Load data (Z), perform fit Loop over partial orbits
Send information matrices to master & combine
In parallel, split up by columns of J(x)
In parallel, split up by orbits
Custom data distribution code
required (Overhead)
Science Advisory Committee Meeting - 20
8
September 3, 2010 • Stanford University
04_Parallel Processing
Code in Action
Proc
1
Proc
2
Proc
3
Proc
n
Proc
1
Proc
2
Proc
3
Proc
n
. . . . . . . . .
. . . . . . . . .
Go
for n
ext G
yro,
Seg
men
t
Go
for
anot
her
Iter
atio
n
Initialization
Compute h and Partial J
Sync Point Get Complete J
Perform Fit
Sync Point Combine Info Matrix
Bayesian Estimation Another Iteration?
Split by Columns of J
Split by Rows of J
Science Advisory Committee Meeting - 20
9
September 3, 2010 • Stanford University
04_Parallel Processing
Parallelization Techniques - 1 • Minimize inter-processor communication
– Use MatlabMPI for one-to-one communication only – Use a custom approach for one-to-many communication
0
200
400
600
800
1000
1200
1400
1600
1800
2000
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Comp J
Dist J (I/O)
Est
comp info (I/O)
0
200
400
600
800
1000
1200
1400
1600
1800
2000
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Comp J
Dist J (I/O)
Est
comp info (I/O)
Processors
Tim
e (S
ec) Comm time with
custom approach Comm time with
MPI
Science Advisory Committee Meeting - 20
10
September 3, 2010 • Stanford University
04_Parallel Processing
Parallelization Techniques - 2 • Balance processing load among processors
Load per processor = ∑ length(reducedVectork)*(computationWeightk + diskIOWeight)
k: {RT, Tel, Cg, TFM, S0}
0
200
400
600
800
1000
1200
1400
1600
1800
2000
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Comp J
Dist J (I/O)
Est
comp info (I/O)
Processors
Tim
e (S
ec) CPUs have
uniform load
Science Advisory Committee Meeting - 20
11
September 3, 2010 • Stanford University
04_Parallel Processing
Parallelization Techniques - 3 • Minimize memory footprint of slave code
– Tightly manage memory to prevent swapping to disk, or “out of memory” errors
– We managed to save ~40% of RAM per processor and never seen “out of memory” since
Science Advisory Committee Meeting - 20
12
September 3, 2010 • Stanford University
04_Parallel Processing
Parallelization Techniques - 4 • Find contention points
– Cache what can be cached – Optimize what can be optimized
0
500
1000
1500
2000
2500
3000
3500
4000
Old Code Optimized Code
Example: Compute h Optimization
Tim
e (S
ec)
Science Advisory Committee Meeting - 20
13
September 3, 2010 • Stanford University
04_Parallel Processing
Status – Current Speed up
1.00
4.26
6.50 7.22
1.00
4.08
7.19
9.62
1.00
5.76
10.83
14.50
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
1 CPU 8 CPUs 16 CPUs 20 CPUs
Sep. '09
May '10
July '10
Number of Processors
Spe
edup
Science Advisory Committee Meeting - 20
14
September 3, 2010 • Stanford University
04_Parallel Processing
Tools for Processing - 1 • Run Manager/Scheduler/Version Controller
– Built by Ahmad Aljadaan – Given a batch of options files, it schedules batches of runs,
serial or parallel, on cluster – Facilitates version control by creating a unique directory for
each run with its options file, code and results – Used with large number of test cases – Helped us schedule 1048+ runs since January 2010
» and counting…
Science Advisory Committee Meeting - 20
15
September 3, 2010 • Stanford University
04_Parallel Processing
Tools for Processing - 2 • Result Viewer
– Built by Ahmad Aljadaan – Allows searching for runs, displays results, plots related
charts – Facilitates comparison of results