
The University of Texas at Austin

Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures

Ernie Chan

MARC Symposium, November 9, 2010

How to Program SCC?

• 48 cores in 6×4 mesh with 2 cores per tile
• 4 DDR3 memory controllers

[Figure: SCC block diagram. A 6×4 mesh of tiles, each attached to a router (R) and containing Core 0 and Core 1 with their L2 caches (L2$0, L2$1) and a message passing buffer (MPB). Four DDR3 memory controllers and a system I/F sit at the edges of the mesh.]
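The programming model developed in the rest of the talk is message passing through the RCCE library. As a baseline, here is a minimal sketch of the shape of an RCCE program, using the core RCCE entry points (RCCE_init, RCCE_ue, RCCE_num_ues, RCCE_finalize); the print statement is purely illustrative.

#include <stdio.h>
#include "RCCE.h"

int main( int argc, char **argv )
{
    RCCE_init( &argc, &argv );      /* initialize the RCCE runtime         */

    int me = RCCE_ue();             /* id of this unit of execution (core) */
    int np = RCCE_num_ues();        /* number of participating cores       */
    printf( "core %d of %d\n", me, np );

    RCCE_finalize();                /* shut the runtime down               */
    return 0;
}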

Outline

• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion

Elemental

• New, Modern Distributed-Memory Dense Linear Algebra Library
  – Replacement for PLAPACK and ScaLAPACK
  – Object-oriented data structures for matrices
  – Coded in C++
  – Torus-wrap/elemental mapping of matrices to a two-dimensional process grid
  – Implemented entirely using bulk synchronous communication

Elemental

• Two-Dimensional Process Grid:
  – Tile the process grid over the matrix to assign each matrix element to a process

[Figure: a 2×3 process grid, ranks arranged as 0 2 4 over 1 3 5, tiled cyclically over the matrix elements.]
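The element-cyclic (torus-wrap) assignment above can be written down directly. A minimal sketch, assuming the column-major rank ordering shown in the 2×3 grid (0 2 4 over 1 3 5); the helper name is illustrative, not Elemental's API.

#include <stdio.h>

/* Owner of matrix element (i, j) on an r x c process grid under the
 * element-cyclic (torus-wrap) distribution, with ranks laid out
 * column-major within the grid: rank = grid_row + grid_col * r. */
static int owner_rank( int i, int j, int r, int c )
{
    return ( i % r ) + ( j % c ) * r;
}

int main( void )
{
    const int r = 2, c = 3;               /* the 2 x 3 grid from the slides  */
    for ( int i = 0; i < 4; i++ ) {       /* owners for a small 4 x 6 matrix */
        for ( int j = 0; j < 6; j++ )
            printf( "%d ", owner_rank( i, j, r, c ) );
        printf( "\n" );
    }
    return 0;
}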

Elemental

• Redistributing the Matrix Over a Process Grid
  – Collective communication

Outline

• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion

Collective Communication

• RCCE Message Passing API
  – Blocking send and receive

    int RCCE_send( char *buf, size_t num, int dest );
    int RCCE_recv( char *buf, size_t num, int src );

  – Potential for deadlock

[Figure: cores 0–5 arranged in a communication cycle.]
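To make the hazard concrete, here is a sketch (illustrative, not from the talk) of the naive exchange around a cycle: every core issues its blocking RCCE_send before its RCCE_recv, so each send can sit waiting for a receive that its neighbor never reaches.

#include <stddef.h>
#include "RCCE.h"

/* Naive exchange around a cycle of np cores: send to the right
 * neighbor, then receive from the left neighbor. Since RCCE_send and
 * RCCE_recv are blocking, every core can end up stuck in RCCE_send
 * while its neighbor is also stuck in RCCE_send, and the cycle
 * deadlocks. */
void naive_ring_exchange( char *sendbuf, char *recvbuf, size_t num,
                          int me, int np )
{
    int right = ( me + 1 ) % np;
    int left  = ( me - 1 + np ) % np;

    RCCE_send( sendbuf, num, right );   /* may block forever          */
    RCCE_recv( recvbuf, num, left );    /* never reached if all block */
}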

Collective Communication

• Avoiding Deadlock
  – Even number of cores in cycle

[Figure: two steps of the exchange among cores 0–5 arranged in a cycle.]

Collective Communication

• Avoiding Deadlock
  – Odd number of cores in cycle

[Figure: three steps of the exchange among cores 0–4 arranged in a cycle.]
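One standard way to break the cycle, matching the even and odd cases above, is to order the blocking calls by rank parity: even-ranked cores send before receiving and odd-ranked cores receive before sending. This is a minimal sketch, not the RCCE_comm source. With an even number of cores every send finds a posted receive immediately (two steps); with an odd number of cores the wrap-around edge serializes one extra step, as in the three-step figure.

#include <stddef.h>
#include "RCCE.h"

/* Deadlock-free neighbor exchange around a cycle of np cores: each
 * core sends num bytes to its right neighbor and receives num bytes
 * from its left neighbor, ordering the blocking calls by the parity
 * of its own rank so that no cyclic wait can form. */
void ring_exchange( char *sendbuf, char *recvbuf, size_t num,
                    int me, int np )
{
    int right = ( me + 1 ) % np;
    int left  = ( me - 1 + np ) % np;

    if ( me % 2 == 0 ) {                    /* even rank: send first   */
        RCCE_send( sendbuf, num, right );
        RCCE_recv( recvbuf, num, left );
    } else {                                /* odd rank: receive first */
        RCCE_recv( recvbuf, num, left );
        RCCE_send( sendbuf, num, right );
    }
}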

Collective Communication

• Scatter

    int RCCE_scatter( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm );

[Figures: buffer contents before and after the scatter.]
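A hedged usage sketch of RCCE_scatter with the prototype above. The assumptions, following MPI_Scatter conventions, are that num is the number of bytes delivered to each core and that RCCE_COMM_WORLD is the predefined global communicator; the example data is illustrative.

#include <stdlib.h>
#include "RCCE.h"

void scatter_example( size_t num /* bytes per core */ )
{
    int me   = RCCE_ue();
    int np   = RCCE_num_ues();
    int root = 0;

    /* The root's inbuf holds np blocks of num bytes, block i destined
     * for core i; every core allocates it here only to keep the
     * sketch simple. outbuf receives this core's own block. */
    char *inbuf  = malloc( num * np );
    char *outbuf = malloc( num );

    if ( me == root )
        for ( size_t i = 0; i < num * np; i++ )
            inbuf[i] = (char)( i / num );   /* block i filled with value i */

    RCCE_scatter( inbuf, outbuf, num, root, RCCE_COMM_WORLD );
    /* After the call, outbuf on core i holds block i of the root's inbuf. */

    free( inbuf );
    free( outbuf );
}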

Collective Communication

• Allgather

    int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );

[Figures: buffer contents before and after the allgather.]
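Similarly for RCCE_allgather: each core contributes num bytes and every core ends up with all np contributions concatenated in rank order. As above, the MPI-like meaning of num and the RCCE_COMM_WORLD communicator are assumptions of this sketch.

#include <stdlib.h>
#include <string.h>
#include "RCCE.h"

void allgather_example( size_t num /* bytes contributed per core */ )
{
    int me = RCCE_ue();
    int np = RCCE_num_ues();

    char *inbuf  = malloc( num );        /* this core's contribution    */
    char *outbuf = malloc( num * np );   /* room for every contribution */

    memset( inbuf, me, num );            /* tag the data with the rank  */

    RCCE_allgather( inbuf, outbuf, num, RCCE_COMM_WORLD );
    /* After the call, bytes [i*num, (i+1)*num) of outbuf came from core i. */

    free( inbuf );
    free( outbuf );
}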

Collective Communication

• Minimum Spanning Tree Algorithm
  – Scatter

[Figure: step-by-step illustration of the minimum spanning tree scatter.]
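A minimal recursive sketch (illustrative, not the RCCE_comm source) of the minimum spanning tree scatter on top of blocking RCCE_send and RCCE_recv: at each level the range of cores is halved, the current sub-root forwards the part of the buffer destined for the other half to a partner core, and both halves recurse. For simplicity every core addresses the full np-block buffer, with block i at offset i * num.

#include <stddef.h>
#include "RCCE.h"

/* Scatter over cores [left, right]: `root` holds valid blocks for the
 * whole range; block i lives at buf + i * num on whichever core holds
 * it. Every core calls this with the same range and root, plus its
 * own id `me`. */
void mst_scatter( char *buf, size_t num, int root,
                  int left, int right, int me )
{
    if ( left == right ) return;            /* one core: its block is in place */

    int mid = ( left + right ) / 2;         /* split the range in half         */

    if ( root <= mid ) {                    /* root sits in the left half      */
        int partner = mid + 1;              /* sub-root of the right half      */
        if ( me == root )
            RCCE_send( buf + ( mid + 1 ) * num, ( right - mid ) * num, partner );
        else if ( me == partner )
            RCCE_recv( buf + ( mid + 1 ) * num, ( right - mid ) * num, root );
        if ( me <= mid ) mst_scatter( buf, num, root,    left,    mid,   me );
        else             mst_scatter( buf, num, partner, mid + 1, right, me );
    } else {                                /* root sits in the right half     */
        int partner = left;                 /* sub-root of the left half       */
        if ( me == root )
            RCCE_send( buf + left * num, ( mid - left + 1 ) * num, partner );
        else if ( me == partner )
            RCCE_recv( buf + left * num, ( mid - left + 1 ) * num, root );
        if ( me <= mid ) mst_scatter( buf, num, partner, left,    mid,   me );
        else             mst_scatter( buf, num, root,    mid + 1, right, me );
    }
}

A scatter over all cores is then mst_scatter( buf, num, root, 0, RCCE_num_ues() - 1, RCCE_ue() ), after which core i finds its own block at buf + i * num.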

Collective Communication

• Cyclic (Bucket) Algorithm
  – Allgather

[Figure: step-by-step illustration of the cyclic (bucket) allgather.]
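A minimal sketch (again not the RCCE_comm source) of the cyclic (bucket) allgather: np - 1 ring steps, each core passing the block it most recently received to its right neighbor while receiving a new block from its left neighbor, using the parity ordering from the ring_exchange sketch to stay deadlock-free.

#include <string.h>
#include "RCCE.h"

void bucket_allgather( char *inbuf, char *outbuf, size_t num,
                       int me, int np )
{
    int right = ( me + 1 ) % np;
    int left  = ( me - 1 + np ) % np;

    memcpy( outbuf + me * num, inbuf, num );      /* own block in place  */

    for ( int s = 0; s < np - 1; s++ ) {
        int send_idx = ( me - s + np )     % np;  /* block forwarded now */
        int recv_idx = ( me - s - 1 + np ) % np;  /* block arriving now  */
        char *sbuf = outbuf + send_idx * num;
        char *rbuf = outbuf + recv_idx * num;

        if ( me % 2 == 0 ) {                      /* even: send first    */
            RCCE_send( sbuf, num, right );
            RCCE_recv( rbuf, num, left );
        } else {                                  /* odd: receive first  */
            RCCE_recv( rbuf, num, left );
            RCCE_send( sbuf, num, right );
        }
    }
}

Each core calls bucket_allgather with its own block in inbuf; after np - 1 steps outbuf holds every core's block in rank order, which is the post-condition of RCCE_allgather above.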


Outline

• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion

Off-Chip Shared-Memory

• Distributed vs. Shared-Memory

[Figure: the SCC block diagram with regions labeled Distributed Memory and Shared-Memory.]

Off-Chip Shared-Memory

• SuperMatrix
  – Map dense matrix computation to a directed acyclic graph
  – No matrix distribution
  – Store DAG and matrix on off-chip shared-memory

[Figure: the task DAG for a 3×3 blocked Cholesky factorization: CHOL0; TRSM1, TRSM2; SYRK3, GEMM4, SYRK5; CHOL6; TRSM7; SYRK8; CHOL9.]
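The DAG above is what falls out of a right-looking blocked Cholesky factorization. A small self-contained sketch (not SuperMatrix code) that enumerates the tasks in the same creation order:

#include <stdio.h>

/* Enumerate the tasks of a right-looking blocked Cholesky
 * factorization of an nb x nb blocked matrix, in creation order.
 * For nb = 3 this prints exactly the ten nodes of the DAG in the
 * figure: CHOL0, TRSM1, TRSM2, SYRK3, GEMM4, SYRK5, CHOL6, TRSM7,
 * SYRK8, CHOL9. */
static void enumerate_chol_tasks( int nb )
{
    int id = 0;
    for ( int k = 0; k < nb; k++ ) {
        printf( "CHOL%d  on A[%d][%d]\n", id++, k, k );
        for ( int i = k + 1; i < nb; i++ )
            printf( "TRSM%d  on A[%d][%d]\n", id++, i, k );
        for ( int j = k + 1; j < nb; j++ ) {
            printf( "SYRK%d  on A[%d][%d]\n", id++, j, j );
            for ( int i = j + 1; i < nb; i++ )
                printf( "GEMM%d  on A[%d][%d]\n", id++, i, j );
        }
    }
}

int main( void )
{
    enumerate_chol_tasks( 3 );   /* the 3 x 3 blocked case from the figure */
    return 0;
}

SuperMatrix records which blocks each task reads and writes; those data dependences are the edges of the DAG, and the DAG plus the matrix live in off-chip shared memory so that idle cores can pick up ready tasks.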

Off-Chip Shared-Memory

• Non-cacheable vs. Cacheable Shared-Memory
  – Non-cacheable
    • Allows for a simple programming interface
    • Poor performance
  – Cacheable
    • Needs a software-managed cache coherency mechanism
    • Execute on data stored in cache

• Interleave distributed and shared-memory programming concepts
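To illustrate the cacheable option above, a heavily hedged sketch of the software-managed coherence idea: the block type and the invalidate/flush hooks below are hypothetical placeholders (RCCE does not provide them), standing in for whatever mechanism discards stale cached lines and writes results back to the off-chip DDR3.

#include <stddef.h>

/* Hypothetical descriptor for a matrix block living in off-chip
 * shared memory. */
typedef struct { double *data; size_t bytes; } block_t;

/* Hypothetical coherence hooks: discard any cached copy of a block
 * (so the next read comes from DDR3) and write a cached block back. */
static void invalidate_block( const block_t *b ) { (void)b; }
static void flush_block     ( const block_t *b ) { (void)b; }

/* Execute one DAG task coherently: refresh the inputs, compute on
 * data that is now cached, then publish the output for other cores. */
static void run_task_coherently( block_t *a, block_t *b, block_t *c,
                                 void (*kernel)( block_t *, block_t *, block_t * ) )
{
    invalidate_block( a );       /* inputs may have been written elsewhere */
    invalidate_block( b );
    kernel( a, b, c );           /* e.g., a GEMM update on cached tiles    */
    flush_block( c );            /* make the result visible to other cores */
}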


Outline

• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion

Conclusion

• Distributed vs. Shared-Memory
  – Elemental vs. SuperMatrix?

• A Collective Communication Library for SCC
  – RCCE_comm: released under LGPL and available on the public Intel SCC software repository
    http://marcbug.scc-dc.com/svn/repository/trunk/rcce_applications/UT/RCCE_comm/

Acknowledgments

• We thank the other members of the FLAME team for their support
  – Bryan Marker, Jack Poulson, and Robert van de Geijn

• We thank Intel for access to SCC and their help
  – Timothy G. Mattson and Rob F. Van Der Wijngaart

• Funding
  – Intel Corporation
  – National Science Foundation

Conclusion

• More Information
  http://www.cs.utexas.edu/~flame

• Questions?
  echan@cs.utexas.edu