Reducing the Communication Cost via Chain Pattern Scheduling
Florina M. Ciorba, Theodore Andronikos, Ioannis Drositis,
George Papakonstantinou and Panayotis Tsanakas
National Technical University of Athens
Computing Systems Laboratory
July 29, 2005, NCA'05
Outline
• Introduction
• Definitions and notations
• Chain pattern scheduling
• unbounded #P – high communication
• fixed #P – moderate communication
• Performance results
• Conclusions
• Future work
Introduction
Motivation:
• Much work has been done on parallelizing loops with dependencies, but very little exists on explicitly minimizing the communication incurred by particular dependence vectors
Introduction
Contribution:
• Enhancing the data locality for loops with dependencies
• Reducing the communication cost by mapping iterations tied by certain dependence vectors to the same processor
• Applicability to various multiprocessor architectures
Definitions and notations
Algorithmic model:
FOR (i1=l1; i1<=u1; i1++)
  FOR (i2=l2; i2<=u2; i2++)
    ...
      FOR (in=ln; in<=un; in++)
        Loop Body
      ENDFOR
    ...
  ENDFOR
ENDFOR
• Perfectly nested loops
• Constant flow data dependencies
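The model can be made concrete with a small sketch. The array name, its size, and the zero-valued padding below are my own illustration; the dependence vectors are the ones used as the running example later in the talk.

```python
# Hypothetical concrete instance of the algorithmic model: a perfectly
# nested 2-D loop with constant flow dependencies d1=(1,3), d2=(2,2),
# d3=(4,1), d4=(4,3).  Array name and sizes are illustrative only.
N1, N2 = 8, 8
# Pad with 4 boundary rows/columns of pre-computed (here: zero) values,
# so every dependence reaches inside the array (the Pat0 points).
A = [[0] * (N2 + 4) for _ in range(N1 + 4)]
for i1 in range(4, N1 + 4):          # FOR (i1=l1; i1<=u1; i1++)
    for i2 in range(4, N2 + 4):      #   FOR (i2=l2; i2<=u2; i2++)
        A[i1][i2] = (A[i1 - 1][i2 - 3]    # dependence d1 = (1,3)
                     + A[i1 - 2][i2 - 2]  # dependence d2 = (2,2)
                     + A[i1 - 4][i2 - 1]  # dependence d3 = (4,1)
                     + A[i1 - 4][i2 - 3]  # dependence d4 = (4,3)
                     + 1)
```

Each iteration (i1, i2) reads values produced exactly d1, d2, d3 and d4 iterations earlier, which is what makes the dependencies uniform (constant).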
Definitions and notations
• J – the index space of an n-dimensional uniform dependence loop
• ECT – earliest computation time of an iteration (time-patterns)
• Patk – set of points (called pattern) of J with ECT k
• Pat0 – contains the boundary (pre-computed) points
• Pat1 – initial pattern
• patk – pattern outline (the upper boundary) of Patk
• Pattern points – the points that define the polygon shape of a
pattern
• Pattern vectors – the dependence vectors di whose endpoints are the pattern points of Pat1
• Chain of computations – a sequence of iterations executed by the
same processor (space-patterns)
Definitions and notations
• Index space of a loop with d1=(1,3), d2=(2,2), d3=(4,1), d4=(4,3)
• The pattern vectors are d1, d2, d3
• Pat1, Pat2, Pat3, pat1, pat2 and pat3 are shown
• A few chains of computations are shown
Definitions and notations
• dc – the communication vector (one of the pattern vectors)
• j = p + λdc, λ ∈ R – the family of lines of J formed by dc
• Cr = { j ∈ J | j = r + λdc, for some λ ∈ R } – a chain formed by dc
• |Cr| – the number of iteration points of Cr
• r – the starting point of a chain
• C – the set of chains Cr; |C| is the number of chains
• |CM| – the cardinality of the maximal chain
• Drin – the volume of "incoming" data for Cr
• Drout – the volume of "outgoing" data for Cr
• Drin + Drout – the total communication associated with Cr
• #P – the number of available processors
• m – the number of dependence vectors, excluding dc
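As a concrete illustration of the definitions above, the chains Cr can be enumerated by walking each line of the family j = r + λdc. The rectangular index space, grid size, and function names below are my own assumptions, not the paper's.

```python
# Illustrative sketch: group the points of a rectangular 2-D index
# space J into the chains C_r formed by the communication vector dc.
def chains(n1, n2, dc):
    """Return {chain start r -> list of points of C_r} for
    J = [0,n1) x [0,n2).  A point r starts a chain if r - dc
    falls outside J."""
    points = {(i, j) for i in range(n1) for j in range(n2)}
    result = {}
    for r in sorted(points):
        prev = (r[0] - dc[0], r[1] - dc[1])
        if prev in points:
            continue  # r lies inside some chain, not at its start
        chain, p = [], r
        while p in points:
            chain.append(p)
            p = (p[0] + dc[0], p[1] + dc[1])
        result[r] = chain
    return result

C = chains(6, 6, (2, 2))
num_chains = len(C)                          # |C|
max_len = max(len(c) for c in C.values())    # |CM|
```

For a 6×6 space with dc = (2,2) this yields 20 chains whose point sets partition J, the longest containing 3 iterations.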
Definitions and notations
• Communication vector is dc = d2 = (2,2)
• Cr=(0,0) communicates with Cr=(0,2), Cr=(1,0) and Cr=(3,0)
Chain pattern scheduling
Scenario 1: unbounded #P – high communication
• All points of a chain Cr are mapped to the same
processor
• #P is assumed to be unbounded
• Each chain is mapped to a different processor
Disadvantages
Unrealistic because for large index spaces the number of chains
formed, hence of processors needed, is prohibitive
Provides limited data locality (only for points tied by dc)
Total communication volume is V = (Drin + Drout)|C| ≈ 2m|CM||C|
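Scenario 1 can be simulated on a small example. The grid size, dependence set, and counting convention below are my assumptions: every chain along dc becomes its own processor, and each dependence edge whose endpoints lie in different chains costs one unit of communication.

```python
# Hedged simulation of Scenario 1: one processor per chain along dc;
# count the dependence edges that cross chains (the communication).
def owner(p, dc, J):
    """Walk back along dc to the chain's starting point r (= chain id)."""
    while (p[0] - dc[0], p[1] - dc[1]) in J:
        p = (p[0] - dc[0], p[1] - dc[1])
    return p

n = 6
J = {(i, j) for i in range(n) for j in range(n)}
dc = (2, 2)
deps = [(1, 3), (2, 2), (4, 1), (4, 3)]   # d2 = dc, so it never crosses
volume = 0
for p in J:
    for d in deps:
        q = (p[0] - d[0], p[1] - d[1])
        if q in J and owner(q, dc, J) != owner(p, dc, J):
            volume += 1                    # value sent between processors
num_procs = len({owner(p, dc, J) for p in J})  # one processor per chain
```

Only the communication along dc itself is eliminated (its edges never cross chains); every edge of the other m vectors can still cross, which is where the 2m|CM||C| bound and the prohibitive processor count come from.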
Chain pattern scheduling
Scenario 1: unbounded #P – high communication
Each chain is mapped to a different processor. 24 chains are formed.
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
• All points of a chain Cr are mapped to the same processor
• #P is arbitrarily chosen to be fixed
Mapping I: cyclic mapping [8]
Each chain from the pool of unassigned chains is mapped to a
processor in a cyclic fashion
Disadvantages:
Provides limited data locality
Total communication volume is a function of #P and r1,…,rm
Because one cannot predict for which dependence vectors the communication is eliminated, the total communication volume can only be bounded above, by V ≈ 2m|CM||C|
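Mapping I itself is simple to sketch: chains are taken from the pool of unassigned chains in order and dealt to the #P processors round-robin. The pool below is a hypothetical list of chain starting points; all names are illustrative.

```python
# Minimal sketch of Mapping I (cyclic mapping): deal the chains out to
# the #P processors like cards, in the order they appear in the pool.
def cyclic_mapping(chain_starts, num_procs):
    """Map each chain (identified by its start r) to processor k mod #P."""
    return {r: k % num_procs for k, r in enumerate(chain_starts)}

# Hypothetical pool of chain starting points (r values).
starts = [(0, j) for j in range(6)] + [(i, 0) for i in range(1, 6)]
assign = cyclic_mapping(starts, 5)
```

With 11 chains and #P = 5, chain 0 and chain 5 land on processor 0, chain 6 on processor 1, and so on; which dependence vectors this happens to zero depends entirely on the cycle, hence the limited data locality noted above.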
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping I: cyclic mapping
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
It zeroes the communication cost imposed by as many dependence
vectors as possible
#P is divided into a group of na processors used in the area above
dc, and another group of nb processors used in the area below dc
Chains above dc are cyclically mapped to the na processors
Chains below dc are cyclically mapped to the nb processors
This way communication cost is additionally zeroed along one
dependence vector in the area above dc, and along another
dependence vector in the area below dc
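The split described above can be sketched as follows. The above/below test (sign of the cross product with dc) and all names are my assumptions for a 2-D index space, not taken from the paper.

```python
# Hedged sketch of Mapping II (chain pattern mapping): #P = na + nb
# processors; chains starting above dc go cyclically to the na group,
# chains starting below dc go cyclically to the nb group.
def chain_pattern_mapping(chain_starts, dc, na, nb):
    above = [r for r in chain_starts if dc[0] * r[1] - dc[1] * r[0] > 0]
    below = [r for r in chain_starts if dc[0] * r[1] - dc[1] * r[0] <= 0]
    assign = {r: k % na for k, r in enumerate(above)}             # 0..na-1
    assign.update({r: na + k % nb for k, r in enumerate(below)})  # na..na+nb-1
    return assign

# Hypothetical pool of chain starting points (r values).
starts = [(0, j) for j in range(1, 6)] + [(i, 0) for i in range(6)]
a = chain_pattern_mapping(starts, (2, 2), na=2, nb=3)
```

Because each group cycles only over its own side of dc, the cycle size in each area can be chosen so that one extra dependence vector per area is zeroed, which is exactly the advantage over the single global cycle of Mapping I.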
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
na=2, nb=3
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
Total communication volume in this case is bounded above by V ≈ 2(m-2)|CM||C|
Differences from cyclic mapping
Processors do not span the entire index space, but
only a part of it
A different cycle size is chosen to map different areas
of the index space
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
Advantages
Provides better data locality than the cyclic mapping
Uses a more realistic #P than the cyclic mapping
Suitable for
Distributed memory systems (a chain is mapped to a single
processor)
Symmetric multiprocessor systems (a chain is mapped to a single
node, which may contain more than one processor)
Heterogeneous systems (longer chains are mapped to faster
processors, whereas shorter chains to slower processors)
Performance results
Simulation setup
• Simulation program written in C++
• The distributed memory system was emulated
• Index spaces range from 10×10 … 1000×1000 iterations
• Dependence vectors d1=(1,3), dc=d2=(2,2), d3=(4,1),
d4=(4,3)
• #P ranges from 5 … 8
• Comparison with the cyclic mapping
• Communication reduction achieved ranges from 15% to 35%
July 29, 2005 NCA'05 22
Performance results
[Charts: "Communication reduction with 5 processors" and "with 6 processors", panels (1) and (2) – communication volume of the Cyclic vs. Chain-Pattern mappings; panels (1) cover index space sizes 10×10 to 100×100, panels (2) larger index spaces.]
Performance results
[Charts: "Communication reduction with 7 processors" and "with 8 processors", panels (1) and (2) – communication volume of the Cyclic vs. Chain-Pattern mappings; panels (1) cover index space sizes 10×10 to 100×100, panels (2) larger index spaces.]
Conclusions
• The total communication cost can be
significantly reduced if the communication
incurred by certain dependence vectors is
eliminated
• The chain pattern mapping outperforms
other mapping schemes (e.g. cyclic
mapping) by enhancing the data locality
Future work
• Simulate other architectures (such as
shared memory systems, SMPs and
heterogeneous systems)
• Experiment also with the centralized (i.e.
master-slave) version of the chain pattern
scheduling scheme
Thank you
Questions?