Reducing the Communication Cost via Chain Pattern Scheduling
Florina M. Ciorba, Theodore Andronikos, Ioannis Drositis,
George Papakonstantinou and Panayotis Tsanakas
National Technical University of Athens
Computing Systems Laboratory
July 29, 2005, NCA'05
Outline
• Introduction
• Definitions and notations
• Chain pattern scheduling
• unbounded #P – high communication
• fixed #P – moderate communication
• Performance results
• Conclusions
• Future work
Introduction
Motivation:
• Much work has been done on parallelizing loops with dependencies, but very little exists on explicitly minimizing the communication incurred by particular dependence vectors
Introduction
Contribution:
• Enhancing the data locality for loops with dependencies
• Reducing the communication cost by mapping iterations tied by certain dependence vectors to the same processor
• Applicability to various multiprocessor architectures
Definitions and notations
Algorithmic model:
FOR (i1=l1; i1<=u1; i1++)
  FOR (i2=l2; i2<=u2; i2++)
    ...
      FOR (in=ln; in<=un; in++)
        Loop Body
      ENDFOR
    ...
  ENDFOR
ENDFOR
• Perfectly nested loops
• Constant flow data dependencies
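The model can be made concrete with a small sketch. The array name, its size, and the zero-valued padding below are my own illustration; the dependence vectors are the ones used as the running example later in the talk.

```python
# Hypothetical concrete instance of the algorithmic model: a perfectly
# nested 2-D loop with constant flow dependencies d1=(1,3), d2=(2,2),
# d3=(4,1), d4=(4,3).  Array name and sizes are illustrative only.
N1, N2 = 8, 8
# Pad with 4 boundary rows/columns of pre-computed (here: zero) values,
# so every dependence reaches inside the array (the Pat0 points).
A = [[0] * (N2 + 4) for _ in range(N1 + 4)]
for i1 in range(4, N1 + 4):          # FOR (i1=l1; i1<=u1; i1++)
    for i2 in range(4, N2 + 4):      #   FOR (i2=l2; i2<=u2; i2++)
        A[i1][i2] = (A[i1 - 1][i2 - 3]    # dependence d1 = (1,3)
                     + A[i1 - 2][i2 - 2]  # dependence d2 = (2,2)
                     + A[i1 - 4][i2 - 1]  # dependence d3 = (4,1)
                     + A[i1 - 4][i2 - 3]  # dependence d4 = (4,3)
                     + 1)
```

Each iteration (i1, i2) reads values produced exactly d1, d2, d3 and d4 iterations earlier, which is what makes the dependencies uniform (constant).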
Definitions and notations
• J – the index space of an n-dimensional uniform dependence loop
• ECT – earliest computation time of an iteration (time-patterns)
• Patk – set of points (called pattern) of J with ECT k
• Pat0 – contains the boundary (pre-computed) points
• Pat1 – initial pattern
• patk – pattern outline (the upper boundary) of Patk
• Pattern points – the points that define the polygon shape of a
pattern
• Pattern vectors – the dependence vectors di whose endpoints are the pattern points of Pat1
• Chain of computations – a sequence of iterations executed by the
same processor (space-patterns)
Definitions and notations
• Index space of a loop with d1=(1,3), d2=(2,2), d3=(4,1), d4=(4,3)
• The pattern vectors are d1, d2, d3
• Pat1, Pat2, Pat3, pat1, pat2 and pat3 are shown
• A few chains of computations are shown
Definitions and notations
• dc – the communication vector (one of the pattern vectors)
• j = p + λdc, λ ∈ R – the family of lines of J formed by dc
• Cr = { j ∈ J | j = r + λdc, for some λ ∈ R } – a chain formed by dc
• |Cr| – the number of iteration points of Cr
• r – the starting point of a chain
• C – the set of chains Cr; |C| is the number of chains
• |CM| – the cardinality of the maximal chain
• Drin – the volume of "incoming" data for Cr
• Drout – the volume of "outgoing" data for Cr
• Drin + Drout – the total communication associated with Cr
• #P – the number of available processors
• m – the number of dependence vectors, excluding dc
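As a concrete illustration of the definitions above, the chains Cr can be enumerated by walking each line of the family j = r + λdc. The rectangular index space, grid size, and function names below are my own assumptions, not the paper's.

```python
# Illustrative sketch: group the points of a rectangular 2-D index
# space J into the chains C_r formed by the communication vector dc.
def chains(n1, n2, dc):
    """Return {chain start r -> list of points of C_r} for
    J = [0,n1) x [0,n2).  A point r starts a chain if r - dc
    falls outside J."""
    points = {(i, j) for i in range(n1) for j in range(n2)}
    result = {}
    for r in sorted(points):
        prev = (r[0] - dc[0], r[1] - dc[1])
        if prev in points:
            continue  # r lies inside some chain, not at its start
        chain, p = [], r
        while p in points:
            chain.append(p)
            p = (p[0] + dc[0], p[1] + dc[1])
        result[r] = chain
    return result

C = chains(6, 6, (2, 2))
num_chains = len(C)                          # |C|
max_len = max(len(c) for c in C.values())    # |CM|
```

For a 6×6 space with dc = (2,2) this yields 20 chains whose point sets partition J, the longest containing 3 iterations.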
Definitions and notations
• Communication vector is dc = d2 = (2,2)
• Cr=(0,0) communicates with Cr=(0,2), Cr=(1,0) and Cr=(3,0)
Chain pattern scheduling
Scenario 1: unbounded #P – high communication
• All points of a chain Cr are mapped to the same
processor
• #P is assumed to be unbounded
• Each chain is mapped to a different processor
Disadvantages
Unrealistic because for large index spaces the number of chains
formed, hence of processors needed, is prohibitive
Provides limited data locality (only for points tied by dc)
Total communication volume is V = (Drin + Drout)|C| ≈ 2m|CM||C|
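Scenario 1 can be simulated on a small example. The grid size, dependence set, and counting convention below are my assumptions: every chain along dc becomes its own processor, and each dependence edge whose endpoints lie in different chains costs one unit of communication.

```python
# Hedged simulation of Scenario 1: one processor per chain along dc;
# count the dependence edges that cross chains (the communication).
def owner(p, dc, J):
    """Walk back along dc to the chain's starting point r (= chain id)."""
    while (p[0] - dc[0], p[1] - dc[1]) in J:
        p = (p[0] - dc[0], p[1] - dc[1])
    return p

n = 6
J = {(i, j) for i in range(n) for j in range(n)}
dc = (2, 2)
deps = [(1, 3), (2, 2), (4, 1), (4, 3)]   # d2 = dc, so it never crosses
volume = 0
for p in J:
    for d in deps:
        q = (p[0] - d[0], p[1] - d[1])
        if q in J and owner(q, dc, J) != owner(p, dc, J):
            volume += 1                    # value sent between processors
num_procs = len({owner(p, dc, J) for p in J})  # one processor per chain
```

Only the communication along dc itself is eliminated (its edges never cross chains); every edge of the other m vectors can still cross, which is where the 2m|CM||C| bound and the prohibitive processor count come from.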
Chain pattern scheduling
Scenario 1: unbounded #P – high communication
Each chain is mapped to a different processor. 24 chains are formed.
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
• All points of a chain Cr are mapped to the same processor
• #P is arbitrarily chosen to be fixed
Mapping I: cyclic mapping [8]
Each chain from the pool of unassigned chains is mapped to a
processor in a cyclic fashion
Disadvantages:
Provides limited data locality
Total communication volume is a function of #P and r1,…,rm
Because one cannot predict for which dependence vectors the communication is eliminated, the total communication volume can only be bounded above, by V ≈ 2m|CM||C|
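Mapping I itself is simple to sketch: chains are taken from the pool of unassigned chains in order and dealt to the #P processors round-robin. The pool below is a hypothetical list of chain starting points; all names are illustrative.

```python
# Minimal sketch of Mapping I (cyclic mapping): deal the chains out to
# the #P processors like cards, in the order they appear in the pool.
def cyclic_mapping(chain_starts, num_procs):
    """Map each chain (identified by its start r) to processor k mod #P."""
    return {r: k % num_procs for k, r in enumerate(chain_starts)}

# Hypothetical pool of chain starting points (r values).
starts = [(0, j) for j in range(6)] + [(i, 0) for i in range(1, 6)]
assign = cyclic_mapping(starts, 5)
```

With 11 chains and #P = 5, chain 0 and chain 5 land on processor 0, chain 6 on processor 1, and so on; which dependence vectors this happens to zero depends entirely on the cycle, hence the limited data locality noted above.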
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping I: cyclic mapping
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
It zeroes the communication cost imposed by as many dependence
vectors as possible
#P is divided into a group of na processors used in the area above
dc, and another group of nb processors used in the area below dc
Chains above dc are cyclically mapped to the na processors
Chains below dc are cyclically mapped to the nb processors
This way communication cost is additionally zeroed along one
dependence vector in the area above dc, and along another
dependence vector in the area below dc
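The split described above can be sketched as follows. The above/below test (sign of the cross product with dc) and all names are my assumptions for a 2-D index space, not taken from the paper.

```python
# Hedged sketch of Mapping II (chain pattern mapping): #P = na + nb
# processors; chains starting above dc go cyclically to the na group,
# chains starting below dc go cyclically to the nb group.
def chain_pattern_mapping(chain_starts, dc, na, nb):
    above = [r for r in chain_starts if dc[0] * r[1] - dc[1] * r[0] > 0]
    below = [r for r in chain_starts if dc[0] * r[1] - dc[1] * r[0] <= 0]
    assign = {r: k % na for k, r in enumerate(above)}             # 0..na-1
    assign.update({r: na + k % nb for k, r in enumerate(below)})  # na..na+nb-1
    return assign

# Hypothetical pool of chain starting points (r values).
starts = [(0, j) for j in range(1, 6)] + [(i, 0) for i in range(6)]
a = chain_pattern_mapping(starts, (2, 2), na=2, nb=3)
```

Because each group cycles only over its own side of dc, the cycle size in each area can be chosen so that one extra dependence vector per area is zeroed, which is exactly the advantage over the single global cycle of Mapping I.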
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
na=2, nb=3
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
Total communication volume in this case is bounded above by V ≈ 2(m-2)|CM||C|
Differences from cyclic mapping
Processors do not span the entire index space, but
only a part of it
A different cycle size is chosen to map different areas
of the index space
Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
Advantages
Provides better data locality than the cyclic mapping
Uses a more realistic #P than the cyclic mapping
Suitable for
Distributed memory systems (a chain is mapped to a single
processor)
Symmetric multiprocessor systems (a chain is mapped to a single
node, which may contain more than one processor)
Heterogeneous systems (longer chains are mapped to faster
processors, whereas shorter chains to slower processors)
Performance results
Simulation setup
• Simulation program written in C++
• The distributed memory system was emulated
• Index spaces range from 10×10 … 1000×1000 iterations
• Dependence vectors d1=(1,3), dc=d2=(2,2), d3=(4,1),
d4=(4,3)
• #P ranges from 5 … 8
• Comparison with the cyclic mapping
• Communication reduction achieved ranges from 15% to 35%
July 29, 2005 NCA'05 22
Performance results
[Charts: "Communication reduction with 5 processors" and "with 6 processors", panels (1) and (2) – communication volume of the Cyclic vs. Chain-Pattern mappings; panels (1) cover index space sizes 10×10 to 100×100, panels (2) larger index spaces.]
Performance results
[Charts: "Communication reduction with 7 processors" and "with 8 processors", panels (1) and (2) – communication volume of the Cyclic vs. Chain-Pattern mappings; panels (1) cover index space sizes 10×10 to 100×100, panels (2) larger index spaces.]
Conclusions
• The total communication cost can be
significantly reduced if the communication
incurred by certain dependence vectors is
eliminated
• The chain pattern mapping outperforms
other mapping schemes (e.g. cyclic
mapping) by enhancing the data locality
Future work
• Simulate other architectures (such as
shared memory systems, SMPs and
heterogeneous systems)
• Experiment also with the centralized (i.e.
master-slave) version of the chain pattern
scheduling scheme
Thank you
Questions?