
1 SIGMETRICS ‘96

Generalized Data Transfers At Memory Bandwidth

Peter A. Dinda, David R. O’Hallaron

Carnegie Mellon University

http://www.cs.cmu.edu/~pdinda

http://www.cs.cmu.edu/~droh

2 SIGMETRICS ‘96

Generalized Data Transfers

[Figure: sending node memory holding data items at addresses A, B, C; receiving node memory with destination addresses D, E, F.]

3 SIGMETRICS ‘96

Address Relations

R = {(x,y) | data item at address x on sender is copied to address y on receiver}

Example: R = {(A,F),(B,D),(C,E)}

[Figure: sending node memory holding A, B, C; receiving node memory holding D, E, F.]

4 SIGMETRICS ‘96

Send/Recv Implementation

Address relation: {(A,F), (B,D), (C,E)}

[Figure: message assembly gathers A, B, C from sending node memory into the message contents; the data transfer moves the message; message disassembly scatters the contents to F, D, E in receiving node memory.]

(Also applies to put and get communication models.)
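The assembly and disassembly steps on this slide can be sketched in Python. This is a minimal illustration, not the authors' library: memory is modeled as a list indexed by integer address, and the names `assemble`, `disassemble`, and the sample addresses are hypothetical.

```python
def assemble(relation, sender_memory):
    """Message assembly: gather the items named by the sender addresses."""
    return [sender_memory[x] for x, _ in relation]

def disassemble(relation, message, receiver_memory):
    """Message disassembly: scatter the items to the receiver addresses."""
    for (_, y), item in zip(relation, message):
        receiver_memory[y] = item

# Hypothetical integer addresses standing in for A, B, C and D, E, F.
A, B, C = 0, 1, 2
D, E, F = 0, 1, 2
relation = [(A, F), (B, D), (C, E)]

sender_memory = ["a", "b", "c"]
receiver_memory = [None, None, None]

message = assemble(relation, sender_memory)      # message contents
disassemble(relation, message, receiver_memory)  # after the data transfer
print(receiver_memory)  # ['b', 'c', 'a']
```

Both sides walk the relation in the same order, so the message itself carries no addresses.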

5 SIGMETRICS ‘96

Storing Address Relations

Compute Address Relation - “Inspector” (done once):

    while not done
        compute_address_pair(x,y)
        store_address_pair(x,y)
    end while

Assemble Message - “Executor” (repeated many times):

    while not done
        get_address_pair(x,y)
        buffer[i++]=data[x]
    end while
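The two loops can be sketched as a pair of Python functions. This is a sketch under stated assumptions: the relation computed here (transposing an n x n row-major array, stored in receiver order) is only an illustrative choice, and the names `inspector` and `executor` are borrowed from the slide labels rather than from any real API.

```python
def inspector(n):
    """Done once: compute and store the address relation."""
    stored = []
    for i in range(n):
        for j in range(n):
            x = i * n + j          # sender address (row-major)
            y = j * n + i          # receiver address (transposed)
            stored.append((x, y))
    stored.sort(key=lambda pair: pair[1])  # illustrative: receiver order
    return stored

def executor(stored, data):
    """Repeated many times: assemble a message from the stored relation."""
    buffer = [None] * len(stored)
    for i, (x, _) in enumerate(stored):
        buffer[i] = data[x]
    return buffer

stored = inspector(2)                  # the inspector runs once
for step in range(3):                  # the executor runs every iteration
    data = [step * 10 + k for k in range(4)]
    message = executor(stored, data)

print(stored)   # [(0, 0), (2, 1), (1, 2), (3, 3)]
print(message)  # [20, 22, 21, 23]
```

The cost of computing address pairs is paid once, while each later iteration only reads the stored pairs.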

6 SIGMETRICS ‘96

Inspector/Executor [Saltz, et al]

    do i=1,1000
        call Work()
        call COPY()
        call Work()
    enddo

[Figure: two timelines over iterations i=1, 2, 3. In-line Computation: the full copy is repeated in every iteration. Inspector/Executor: the inspector runs once, then only the executor runs in each iteration.]

7 SIGMETRICS ‘96

Context: Array Assignments

    dim A(N,N),B(N,N)

    do i=1,1000
        call Work(A)
        B=A
        call Work(B)
    end

Abstraction: Array A is assigned to Array B (B=A).

We concentrate on B=A and B=TRANSPOSE(A).

More general forms exist.

8 SIGMETRICS ‘96

Distributed Arrays

Regular block-cyclic distributions as in High Performance Fortran (HPF)

[Figure: for each distribution (*,BLOCK), (*,CYCLIC), and (*,CYCLIC(k)), the elements processor 0 owns and the local array on processor 0.]
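Which indices processor 0 owns under each distribution can be sketched as follows. The sketch assumes zero-based indices (HPF itself is one-based) and uses illustrative sizes; `owner_block` and `owner_cyclic` are hypothetical helper names.

```python
import math

def owner_block(i, n, p):
    """BLOCK: contiguous blocks of ceil(n/p) indices per processor."""
    return i // math.ceil(n / p)

def owner_cyclic(i, p, k=1):
    """CYCLIC(k): blocks of k indices dealt round-robin; CYCLIC is k=1."""
    return (i // k) % p

n, p = 16, 4  # 16 columns on 4 processors (illustrative sizes)
block = [j for j in range(n) if owner_block(j, n, p) == 0]
cyclic = [j for j in range(n) if owner_cyclic(j, p) == 0]
cyclic2 = [j for j in range(n) if owner_cyclic(j, p, k=2) == 0]

print(block)    # [0, 1, 2, 3]
print(cyclic)   # [0, 4, 8, 12]
print(cyclic2)  # [0, 1, 8, 9]
```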

9 SIGMETRICS ‘96

Representative Assignments

(BLOCK,*) to (*,BLOCK)

(BLOCK,*) to (CYCLIC,*)

(CYCLIC,*) to (BLOCK,*)

(*,CYCLIC) to (CYCLIC,*), Data Transpose
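As an illustration of how such an assignment induces an address relation, the sketch below computes the relation between one sender and one receiver for a (BLOCK,*) to (CYCLIC,*) redistribution. It is a simplification under stated assumptions: indices are zero-based, each row is treated as a single data item, and the function names are hypothetical.

```python
import math

def block_place(g, n, p):
    """(owner, local index) of global row g under BLOCK."""
    b = math.ceil(n / p)
    return g // b, g % b

def cyclic_place(g, p):
    """(owner, local index) of global row g under CYCLIC."""
    return g % p, g // p

def address_relation(n, p, sender, receiver):
    """Pairs (x, y): local row x on sender goes to local row y on receiver."""
    relation = []
    for g in range(n):
        s, x = block_place(g, n, p)
        r, y = cyclic_place(g, p)
        if s == sender and r == receiver:
            relation.append((x, y))
    return relation

# 32 rows on 4 processors: sender 1 owns global rows 8..15 under BLOCK,
# and receiver 0 owns global rows 0, 4, 8, ... under CYCLIC.
print(address_relation(32, 4, sender=1, receiver=0))  # [(0, 2), (4, 3)]
```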

10 SIGMETRICS ‘96

Representing Address Relations

General Purpose

Space Efficiency

Hardware Limited Performance

In-line expansion

11 SIGMETRICS ‘96

AAPAIR: Simple Representation

Simple sequence of pointer pairs

{(A,F),(B,D),(C,E)}

[Figure: sending node memory holding A, B, C and receiving node memory holding D, E, F, with the relation stored as a sequence of pointer pairs.]

PROBLEM: Space Efficiency

PROBLEM: Performance

12 SIGMETRICS ‘96

AABLK: Run-length Encoding

Sequence of pointer, pointer, length triples

{(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)}

[Figure: each run of two consecutive addresses is stored as one triple: (A,F,2), (B,D,2), (C,E,2).]

PROBLEM: Strided Access
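A run-length encoder of this kind can be sketched as below. This is an illustrative reading of the slide, not the authors' code: addresses are plain integers counted in data items, and `aablk_encode` is a hypothetical name.

```python
def aablk_encode(pairs):
    """Collapse runs of consecutive (x, y) pairs into (x, y, length) triples."""
    blocks = []
    for x, y in pairs:
        if blocks:
            bx, by, n = blocks[-1]
            # Extend the last block if both addresses continue it.
            if x == bx + n and y == by + n:
                blocks[-1] = (bx, by, n + 1)
                continue
        blocks.append((x, y, 1))
    return blocks

# Hypothetical addresses: A=100, B=108, C=116 and F=540, D=500, E=520.
pairs = [(100, 540), (101, 541),
         (108, 500), (109, 501),
         (116, 520), (117, 521)]
print(aablk_encode(pairs))
# [(100, 540, 2), (108, 500, 2), (116, 520, 2)]
```

When the accesses are strided rather than contiguous, every block has length 1 and the encoding is no smaller than the pointer-pair sequence.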

13 SIGMETRICS ‘96

DMRLE: Handling Strides

Sequence of offset, offset, length triples

{(A,F),(B,E),(C,D)}

B-A = C-B = g

E-F = D-E = h

[Figure: sender addresses A, B, C are spaced g apart and receiver addresses F, E, D are spaced h apart; the relation is stored as the triples (A,F,1), (g,h,2).]

PROBLEM: Repeated Strides
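One way to read this encoding is as a run-length-encoded difference map: each pair is replaced by its offset from the previous pair (the first pair is taken relative to (0,0)), and equal consecutive offsets are merged. The sketch below follows that reading; the function names and integer addresses are assumptions made for illustration.

```python
def dmrle_encode(pairs):
    """Encode (x, y) pairs as (offset, offset, length) triples."""
    runs, px, py = [], 0, 0
    for x, y in pairs:
        dx, dy = x - px, y - py
        if runs and runs[-1][:2] == (dx, dy):
            runs[-1] = (dx, dy, runs[-1][2] + 1)   # same stride: extend run
        else:
            runs.append((dx, dy, 1))
        px, py = x, y
    return runs

def dmrle_decode(runs):
    """Regenerate the (x, y) pairs by accumulating the offsets."""
    x = y = 0
    for dx, dy, length in runs:
        for _ in range(length):
            x, y = x + dx, y + dy
            yield x, y

# Hypothetical values: A=100, g=8 on the sender; F=540, h=-16 on the receiver.
pairs = [(100, 540), (108, 524), (116, 508)]
runs = dmrle_encode(pairs)
print(runs)                      # [(100, 540, 1), (8, -16, 2)]
print(list(dmrle_decode(runs)))  # [(100, 540), (108, 524), (116, 508)]
```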

14 SIGMETRICS ‘96

DMRLEC: Repeated Strides

Sequence of indices into table of offset, offset, length triples

{(A,F),(B,E),(C,D),(A’,F’),(B’,E’),(C’,D’)}

B-A = C-B = B’-A’ = C’-B’ = g

E-F = D-E = E’-F’ = D’-E’ = h

A’-C = u and F’-D = v

Table of triples: 0: (A,F,1), 1: (g,h,2), 2: (u,v,1)

Index sequence: 0 1 2 1
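The table-plus-indices idea can be sketched by deduplicating the offset triples. As before, this is an illustrative reading under assumed integer addresses, and the helper names are hypothetical.

```python
def run_length_diffs(pairs):
    """Offset triples (dx, dy, length) between consecutive pairs."""
    runs, px, py = [], 0, 0
    for x, y in pairs:
        dx, dy = x - px, y - py
        if runs and runs[-1][:2] == (dx, dy):
            runs[-1] = (dx, dy, runs[-1][2] + 1)
        else:
            runs.append((dx, dy, 1))
        px, py = x, y
    return runs

def dmrlec_encode(pairs):
    """Return (table of distinct triples, sequence of indices into it)."""
    table, index_of, indices = [], {}, []
    for run in run_length_diffs(pairs):
        if run not in index_of:
            index_of[run] = len(table)
            table.append(run)
        indices.append(index_of[run])
    return table, indices

# Hypothetical values: A=100, g=8, F=540, h=-16, u=1000, v=2000.
pairs = [(100, 540), (108, 524), (116, 508),
         (1116, 2508), (1124, 2492), (1132, 2476)]
table, indices = dmrlec_encode(pairs)
print(table)    # [(100, 540, 1), (8, -16, 2), (1000, 2000, 1)]
print(indices)  # [0, 1, 2, 1]
```

The repeated stride (g,h,2) is stored once in the table and referenced twice by index.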

15 SIGMETRICS ‘96

Address Relation Storage Costs

[Figure: total storage in bytes (log scale, 1 to 10,000,000) for AAPAIR, AABLK, DMRLE, and DMRLEC across various testcases.]

16 SIGMETRICS ‘96

Copying & Superscalar Plateau

Maximum number of non load/store instructions before copy bandwidth suffers

Plateau = np = 2*3 = 6

[Figure: timeline of the loads and stores of a copy loop separated by stalls, with dimensions labeled n and p; the slots issued at time t hold a load, a store, and free issue slots.]

17 SIGMETRICS ‘96

Paragon: No Superscalar Plateau

[Figure: copy rate (MB/s, 0 to 35) versus extra instructions in copy loop (0 to 70).]

18 SIGMETRICS ‘96

Pentium 90: Clear Plateau

[Figure: copy rate (MB/s, 0 to 18) versus extra instructions in copy loop (0 to 70).]

19 SIGMETRICS ‘96

DEC 3K/400: Complex Plateau

[Figure: copy rate (MB/s, 0 to 45) versus extra instructions in copy loop (0 to 70).]

20 SIGMETRICS ‘96

Measurement Details

Portable library written in C

Four representative assignments

512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors

Six machines

Assembly and disassembly rates

21 SIGMETRICS ‘96

Measurement Testcases

(BLOCK,*) to (*,BLOCK)

(BLOCK,*) to (CYCLIC,*)

(CYCLIC,*) to (BLOCK,*)

(*,CYCLIC) to (CYCLIC,*), Data Transpose

22 SIGMETRICS ‘96

Performance: DEC 3K/400

[Figure: message assembly rate (MB/s, 0 to 45) for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T. Series: AAPAIR, DMRLEC, Memory.]

23 SIGMETRICS ‘96

Performance: IBM 250 (PPC601)

[Figure: message assembly rate (MB/s, 0 to 35) for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T. Series: AAPAIR, DMRLEC, Memory.]

24 SIGMETRICS ‘96

Performance: IBM SP2 (PWR2)

[Figure: message assembly rate (MB/s, 0 to 70) for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T. Series: AAPAIR, DMRLEC, Memory.]

25 SIGMETRICS ‘96

Performance: Paragon

[Figure: message assembly rate (MB/s, 0 to 35) for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T. Series: AAPAIR, DMRLEC, Memory.]

26 SIGMETRICS ‘96

Performance: Pentium 90

[Figure: message assembly rate (MB/s, 0 to 20) for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T. Series: AAPAIR, AABLK, DMRLE, DMRLEC, Memory.]

27 SIGMETRICS ‘96

Performance: Pentium 133

[Figure: message assembly rate (MB/s, 0 to 50) for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T. Series: AAPAIR, AABLK, DMRLE, DMRLEC, Memory.]

28 SIGMETRICS ‘96

Conclusions

Exploit “Superscalar Plateau” using compact address relation encodings

Cheap enough even for scalar machines

Generalized data transfer with hardware-limited throughput

Many possible applications

29 SIGMETRICS ‘96

Copying with Address Relations

[Figure: an address relation decoder reads the address relation data (through address relation addresses) and supplies sender data addresses and receiver data addresses to a copy engine, which moves the data items.]

30 SIGMETRICS ‘96

A Simple Copy Engine

[Figure: on the sending side, a decoder reads the address relation data and supplies sender data addresses (Adx) to a copy engine, which passes data to the comm. system; on the receiving side, a second decoder supplies receiver data addresses to a copy engine, which stores the data.]
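A decoder-driven copy engine of this shape can be sketched in software. The sketch assumes the table-plus-indices encoding of offset, offset, length triples described earlier, models memory as Python lists, and uses hypothetical names throughout; it does not describe any hardware design.

```python
def decode(table, indices):
    """Address relation decoder: yield (x, y) pairs from table and indices."""
    x = y = 0
    for i in indices:
        dx, dy, length = table[i]
        for _ in range(length):
            x, y = x + dx, y + dy
            yield x, y

def sender_copy_engine(table, indices, memory):
    """Gather data at the decoded sender addresses into a message."""
    return [memory[x] for x, _ in decode(table, indices)]

def receiver_copy_engine(table, indices, message, memory):
    """Store incoming data at the decoded receiver addresses."""
    for (_, y), item in zip(decode(table, indices), message):
        memory[y] = item

# Illustrative encoding: three distinct triples, one of them reused.
table = [(0, 5, 1), (2, -1, 2), (1, 5, 1)]
indices = [0, 1, 2, 1]

sender_memory = list(range(100, 110))
receiver_memory = [None] * 9

message = sender_copy_engine(table, indices, sender_memory)
receiver_copy_engine(table, indices, message, receiver_memory)
print(message)          # [100, 102, 104, 105, 107, 109]
print(receiver_memory)  # [None, None, None, 104, 102, 100, 109, 107, 105]
```

Each side runs its own decoder over the same encoded relation, so only data crosses the communication system.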