IPDPS 2002
Parasol Laboratory, Texas A&M University

The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops

Francis Dang, Hao Yu, and Lawrence Rauchwerger
Department of Computer Science
Texas A&M University
Motivation
To maximize performance, extract the maximum available parallelism from loops.
Static compiler methods may be insufficient:
- Access patterns may be too complex.
- Required information is only available at run time.
Run-time methods are needed to extract loop parallelism:
- Inspector/Executor
- Speculative Parallelization
Speculative Parallelization: LRPD Test
Main idea:
- Execute the loop as a DOALL.
- Record memory references during execution.
- Check for data dependences.
- If there was a dependence, re-execute the loop sequentially.
Disadvantages:
- A single data dependence invalidates the speculative parallelization.
- The slowdown is proportional to the speculative parallel execution time.
- Partial parallelism is not exploited.
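The record-and-check idea can be sketched as follows. This is a simplified, sequential simulation of the dependence check only, not the authors' shadow-array implementation; the loop body `A[L[i]] = A[K[i]] + C[i]` is the one used in the example on the next slide, and `lrpd_is_doall` is a hypothetical name.

```python
# Simplified sketch of the LRPD idea: "execute" all iterations, record
# which elements each iteration reads and writes, then check whether any
# written element is also touched by a different iteration.

def lrpd_is_doall(n, K, L):
    """True if no element written by one iteration is touched by another."""
    readers = {}   # element -> set of iterations that read it
    writers = {}   # element -> set of iterations that write it
    for i in range(n):                     # simulated speculative execution
        readers.setdefault(K[i], set()).add(i)
        writers.setdefault(L[i], set()).add(i)
    for elem, w_iters in writers.items():
        r_iters = readers.get(elem, set())
        # a cross-iteration dependence: another iteration reads a written
        # element, or two iterations write the same element
        if (r_iters - w_iters) or len(w_iters) > 1:
            return False
    return True
```

The real LRPD test marks shadow arrays in parallel during the speculative run and also handles privatizable and reduction patterns; only the final dependence-check logic is shown here.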
Partially Parallel Loop Example
do i = 1, 8
   z = A[K[i]]
   A[L[i]] = z + C[i]
end do

K[1:8] = [1,2,3,1,4,2,1,1]   (read indices)
L[1:8] = [4,5,5,4,3,5,3,3]   (write indices)

Access pattern of A across iterations 1..8:

  iter:  1  2  3  4  5  6  7  8
  A(1):  R  .  .  R  .  .  R  R
  A(2):  .  R  .  .  .  R  .  .
  A(3):  .  .  R  .  W  .  W  W
  A(4):  W  .  .  W  R  .  .  .
  A(5):  .  W  W  .  .  W  .  .
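The access table above can be regenerated from the index arrays with a short script (a sketch; element numbering and the R/W marking follow the slide):

```python
# Rebuild the access-pattern table of A from the slide's index arrays.
K = [1, 2, 3, 1, 4, 2, 1, 1]   # iteration i reads  A[K[i]]
L = [4, 5, 5, 4, 3, 5, 3, 3]   # iteration i writes A[L[i]]

table = {a: ['.'] * 8 for a in range(1, 6)}
for i in range(8):
    table[K[i]][i] = 'R'
    table[L[i]][i] = 'W'

for a in range(1, 6):
    print(f"A({a}): {' '.join(table[a])}")
```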
The Recursive LRPD
Main idea:
- Transform a partially parallel loop into a sequence of fully parallel, block-scheduled loops.
- Iterations before the first data dependence are correct and committed.
- Re-apply the LRPD test on the remaining iterations.
Worst case: sequential execution time plus testing overhead.
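A minimal iteration-level sketch of this strategy follows. The actual test works at processor-block granularity and reuses the LRPD marking machinery; `r_lrpd_stages` is a hypothetical name, and the dependence check here is reduced to flow dependences on the example loop's index arrays.

```python
# Iteration-level sketch of the recursive LRPD: speculatively run the
# remaining iterations, find the first iteration that reads an element
# written by an earlier uncommitted iteration, commit the prefix, repeat.

def r_lrpd_stages(n, K, L):
    """Return the (first, last) iteration range committed by each stage."""
    start, stages = 0, []
    while start < n:
        first_write = {}          # element -> earliest writer this stage
        fail = n
        for i in range(start, n):
            if K[i] in first_write and first_write[K[i]] < i:
                fail = i          # flow dependence: commit everything before i
                break
            first_write.setdefault(L[i], i)
        stages.append((start, (fail if fail < n else n) - 1))
        start = fail
    return stages
```

In the worst case each stage commits a single iteration, which reproduces the "sequential time plus testing overhead" bound above.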
Algorithm
1. Initialize the test structures.
2. Checkpoint the arrays under test.
3. Execute the remaining iterations as a DOALL.
4. Analyze the recorded references.
5. On success: commit the results.
6. On failure: restore the checkpoint, reinitialize, and restart on the uncommitted iterations.
Implementation
Implemented as a run-time pass in Polaris, with additional hand-inserted code:
- Privatization with copy-in/copy-out for arrays under test.
- Replicated buffers for reductions.
- Backup arrays for checkpointing.
Recursive LRPD Example
do i = 1, 8
   z = A[K[i]]
   A[L[i]] = z + C[i]
end do

K[1:8] = [1,2,3,1,4,2,1,1]
L[1:8] = [4,5,5,4,2,5,3,3]

First stage (block-scheduled, P1: iterations 1-2, P2: 3-4, P3: 5-6, P4: 7-8):

  proc:  P1   P2   P3   P4
  iter:  1-2  3-4  5-6  7-8
  A(1):  R    R    .    R
  A(2):  R    .    W,R  .
  A(3):  .    R    .    W
  A(4):  W    W    R    .
  A(5):  W    W    W    .

A(4) is written by P1 and P2 and read by P3, so the first cross-processor dependence occurs at P3; iterations 1-4 are committed.

Second stage (remaining iterations, P3: 5-6, P4: 7-8):

  proc:  P3   P4
  iter:  5-6  7-8
  A(1):  .    R
  A(2):  W,R  .
  A(3):  .    W
  A(4):  R    .
  A(5):  W    .

No cross-processor dependence remains, so iterations 5-8 commit and the loop completes in two stages.
Heuristics
- Work redistribution
- Sliding window approach
- Data dependence graph extraction
Work Redistribution
Redistribute the remaining iterations across processors.
- The execution time of each stage will decrease.
Disadvantages:
- May uncover new dependences across processors.
- May incur remote cache misses from data redistribution.

[Diagram: iteration blocks on processors p1-p4 before and after the 1st and 2nd stages, shown with and without redistribution.]
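The redistribution step can be sketched as an even re-blocking of the uncommitted iterations (`redistribute` is a hypothetical name):

```python
# Evenly re-block the remaining iterations [start, n) across nprocs
# processors after a stage has committed a prefix of the loop.

def redistribute(start, n, nprocs):
    """Return half-open (begin, end) iteration blocks, one per processor."""
    remaining = n - start
    base, extra = divmod(remaining, nprocs)
    blocks, begin = [], start
    for p in range(nprocs):
        size = base + (1 if p < extra else 0)
        blocks.append((begin, begin + size))
        begin += size
    return blocks
```

Without redistribution, each processor keeps its original block, so processors whose iterations have already committed sit idle in later stages.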
Work Redistribution Example
do i = 1, 8
   z = A[K[i]]
   A[L[i]] = z + C[i]
end do

K[1:8] = [1,2,3,1,4,2,1,1]
L[1:8] = [4,5,5,4,2,5,3,3]

First stage (P1: iterations 1-2, P2: 3-4, P3: 5-6, P4: 7-8):

  proc:  P1   P2   P3   P4
  iter:  1-2  3-4  5-6  7-8
  A(1):  R    R    .    R
  A(2):  R    .    W,R  .
  A(3):  .    R    .    W
  A(4):  W    W    R    .
  A(5):  W    W    W    .

Iterations 1-4 commit; the remaining iterations 5-8 are redistributed, one per processor.

Second stage (P1: 5, P2: 6, P3: 7, P4: 8):

  proc:  P1  P2  P3  P4
  iter:  5   6   7   8
  A(1):  .   .   R   R
  A(2):  W   R   .   .
  A(3):  .   .   W   W
  A(4):  R   .   .   .
  A(5):  .   W   .   .

A(2) is written by P1 and read by P2, so only iteration 5 commits; iterations 6-8 are redistributed again.

Third stage (P1: 6, P2: 7, P3: 8):

  proc:  P1  P2  P3
  iter:  6   7   8
  A(1):  .   R   R
  A(2):  R   .   .
  A(3):  .   W   W
  A(4):  .   .   .
  A(5):  W   .   .
Redistribution Model
Redistribution may not always be beneficial. Stop redistributing when the cost of data redistribution outweighs the benefit of work redistribution.
A synthetic loop was used to model this adaptive method.
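The stop rule amounts to a simple cost comparison (a sketch; names are hypothetical, and the time estimates would come from the model or from measurements):

```python
# Adaptive rule: redistribute only while the expected reduction in stage
# time exceeds the cost of moving the data between processors.

def should_redistribute(time_without, time_with, data_move_cost):
    """All arguments are estimated times (seconds) for the next stage."""
    return (time_without - time_with) > data_move_cost
```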
[Chart: Time Breakdown of Model, 8 processors. For each of stages 1-4, bars compare the Never, Always, and Adaptive redistribution policies, broken down into redistribution overhead, synchronization, and speculative loop time (seconds).]

[Chart: Time Progression of Model, 8 processors. Cumulative time (seconds) versus LRPD stage (0-5) for the Always, Adaptive, and Never policies.]
Sliding Window R-LRPD
The R-LRPD can generate a sequential schedule for long dependence distributions.
- Strip-mine the speculative execution.
- Apply the R-LRPD to a contiguous window of iterations at a time.
- Only dependences within the window cause failures.
- Adds more global synchronizations and test overhead.

[Diagram: iteration windows on processors p1 and p2 during the 1st and 2nd stages, with the iterations committed after each stage.]
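Strip-mining the iteration space into windows is straightforward (a sketch; the experiments later use window sizes of 256 and 512 iterations):

```python
# Split the iteration space into contiguous windows; the R-LRPD is applied
# to one window at a time, so only dependences inside the current window
# can cause a restart.

def windows(n, window_size):
    """Return half-open (begin, end) iteration windows covering [0, n)."""
    return [(s, min(s + window_size, n)) for s in range(0, n, window_size)]
```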
DDG Extraction
The R-LRPD can generate sequential schedules for complex dependence distributions.
- Use the sliding-window R-LRPD scheme to extract the data dependence graph (DDG).
- Generate an optimized schedule from the DDG.
- This obtains the DDG for loops from which a proper inspector cannot be extracted.

[Diagram: iteration windows on processors p1 and p2 over two stages, with the DDG edges recovered after each stage.]
Performance Issues
Performance issues:
- Blocked scheduling is a potential cause of load imbalance.
- Checkpointing can be expensive.
Feedback-guided blocked scheduling:
- Use the timing information from the previous instantiation (Bull, Euro-Par '98).
- Estimate the processor chunk sizes for minimal load imbalance.
On-demand checkpointing:
- Checkpoint only the data modified during execution.
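The feedback-guided sizing can be sketched as follows (names hypothetical; following the spirit of Bull's Euro-Par '98 scheme, each processor's next chunk is sized in proportion to its measured execution rate from the previous instantiation):

```python
# Size each processor's next chunk in proportion to its measured rate
# (iterations per second) from the previous instantiation of the loop,
# so that all chunks take roughly equal time.

def feedback_chunks(n, prev_iters, prev_times):
    """prev_iters[p], prev_times[p]: work and time of processor p last time."""
    rates = [it / t for it, t in zip(prev_iters, prev_times)]
    total = sum(rates)
    sizes = [round(n * r / total) for r in rates]
    sizes[-1] += n - sum(sizes)      # absorb rounding error in the last chunk
    return sizes
```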
Experiments
Setup:
- 16-processor HP V-Class
- 4 GB memory
- HP-UX 11.0

Codes and loops:
- TRACK: NLFILT_do300, EXTEND_do400, FPTRAK_do300
- SPICE 2G6: DCDCMP_do15, DCDCMP_do70, BJT
- FMA3D: Quadrilateral Loop
Experimental Results – Input Profiles
[Chart: TRACK Input Profile. Percent of execution time in FPTRAK_do300, EXTEND_do400, NLFILT_do300, and other code for inputs 15-250, 16-400, 16-450, 5-400, and 50-100.]

[Chart: SPICE Input Profile. Percent of execution time in DCDCMP_DO70, DCDCMP_DO15, BJT, and other code for the 128-bit adder and Extended Reference inputs.]
Experimental Results - TRACK
[Chart: NLFILT_do300 Speedup versus processors (up to 16) for inputs 15-250, 16-400, 16-450, 5-400, and 50-100.]

Parallelism ratio = (number of instantiations - number of restarts) / number of instantiations

[Chart: NLFILT_do300 Parallelism Ratio (0.0-1.0) versus processors for the same inputs.]
Experimental Results - TRACK
[Chart: EXTEND_do400 Speedup versus processors for inputs 15-250, 16-400, 16-450, 5-400, and 50-100.]

[Chart: EXTEND_do400 Parallelism Ratio (0.0-1.0) versus processors for the same inputs.]
Experimental Results - TRACK
[Chart: FPTRAK_do300 Speedup versus processors for inputs 15-250, 16-400, 16-450, 5-400, and 50-100.]

[Chart: FPTRAK_do300 Parallelism Ratio (0.0-1.0) versus processors for the same inputs.]
Experimental Results - TRACK
[Chart: TRACK Program Speedup versus processors for inputs 15-250, 16-400, 16-450, 5-400, and 50-100.]

[Chart: NLFILT_do300 Optimization Contributions, input 16-400. Speedup versus processors with no optimizations; feedback-guided scheduling (FB); redistribution (RD); RD and FB; and RD, FB, and on-demand checkpointing (ODC).]
Experimental Results – Sliding Window
[Chart: NLFILT_do300 Speedup Comparison, input 15-250. R-LRPD with all optimizations versus sliding-window (SW) block sizes of 256 and 512.]

[Chart: NLFILT_do300 Parallelism Ratio Comparison, input 15-250, for the same three configurations.]
Experimental Results – Sliding Window
[Chart: NLFILT_do300 Speedup Comparison, input 16-400. R-LRPD with all optimizations versus sliding-window (SW) block sizes of 256 and 512.]

[Chart: NLFILT_do300 Parallelism Ratio Comparison, input 16-400, for the same three configurations.]
Experimental Results – FMA3D
[Chart: FMA3D Speedup, Quadrilateral Loop, versus processors (up to 16).]
Experimental Results – SPICE 2G6
[Chart: SPICE 2G6 Speedup, Extended Reference input, versus processors (up to 8) for DCDCMP_DO15, DCDCMP_DO70, BJT, and the whole application (ALL).]

[Chart: SPICE 2G6 Speedup, 128-bit adder input, for the same loops and the whole application.]