
Transcript of X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013

Page 1: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013

E.T. International, Inc.

X-Stack: Programming Challenges, Runtime Systems, and Tools

Brandywine Team, May 2013

DynAX: Innovations in Programming Models, Compilers and Runtime Systems for Dynamic Adaptive Event-Driven Execution Models

Page 2: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Objectives

• Scalability: expose, express, and exploit O(10^10) concurrency
• Locality: locality-aware data types, algorithms, and optimizations
• Programmability: easy expression of asynchrony, concurrency, and locality
• Portability: stack portability across heterogeneous architectures
• Energy efficiency: maximize static and dynamic energy savings while managing the tradeoff between energy efficiency, resilience, and performance
• Resilience: gradual degradation in the face of many faults
• Interoperability: leverage legacy code through a gradual transformation toward exascale performance
• Applications: support NWChem

Page 3: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Brandywine X-Stack Software Stack

• SWARM (Runtime System)
• SCALE (Compiler)
• HTA (Library)
• R-Stream (Compiler)
• NWChem + Co-Design Applications
• Rescinded Primitive Data Types

Page 4: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


SWARM

• MPI, OpenMP, OpenCL: communicating sequential processes; bulk-synchronous message passing

vs.

• SWARM: asynchronous event-driven tasks; dependencies, resources, active messages, control migration

[Figure: execution timelines for the two models, contrasting active threads with time spent waiting]

Page 5: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


SWARM: Principles of Operation

• Codelets: the basic unit of parallelism; nonblocking tasks, scheduled upon satisfaction of precedent constraints (a sketch follows below)
• Hierarchical locale tree: spatial position, data locality
• Lightweight synchronization
• Active Global Address Space (planned)
• Dynamics: asynchronous split-phase transactions (latency hiding), message-driven computation, control flow and dataflow, futures, error handling, fault tolerance (planned)
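To make the codelet model concrete, here is a minimal, single-threaded sketch of dependence-counted scheduling, assuming a toy runtime: the names codelet_t and codelet_satisfy are hypothetical and are not SWARM's actual API. A real runtime would hand ready codelets to worker threads rather than running them inline.

/* Toy codelet scheduling sketch; names are hypothetical, not SWARM's API. */
#include <stdio.h>

typedef struct codelet {
    void (*fn)(void *);      /* nonblocking body: runs to completion        */
    void *arg;               /* context passed to the body                  */
    int unmet;               /* precedent constraints not yet satisfied     */
    struct codelet *next;    /* successor enabled when this codelet fires   */
} codelet_t;

/* Satisfy one precedent constraint; run the codelet once all are met. */
static void codelet_satisfy(codelet_t *c) {
    if (--c->unmet == 0) {
        c->fn(c->arg);                          /* body never blocks        */
        if (c->next) codelet_satisfy(c->next);  /* enable the successor     */
    }
}

static void say(void *msg) { puts((const char *)msg); }

int main(void) {
    codelet_t c = { say, "C: runs last",    1, NULL };
    codelet_t b = { say, "B: runs after A", 2, &c };  /* two constraints */
    codelet_t a = { say, "A: runs first",   1, &b };

    codelet_satisfy(&a);  /* A's input arrives: A fires, enables B          */
    codelet_satisfy(&b);  /* B's second input arrives: B fires, enables C   */
    return 0;
}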

Page 6: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Cholesky DAG

• POTRF → TRSM
• TRSM → GEMM, SYRK
• SYRK → POTRF

[Figure: task DAG for a tiled Cholesky factorization across iterations 1 to 3, with POTRF, TRSM, SYRK, and GEMM tasks]

• Implementations: OpenMP, SWARM (an OpenMP tasking sketch follows below)
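As a concrete rendering of this DAG, below is a hedged sketch of tiled Cholesky using OpenMP task dependences with standard CBLAS/LAPACKE tile kernels. The tile count and size (NT, B) are demo assumptions; the SWARM version expresses the same dependences as codelets but is not shown here.

/* Tiled Cholesky (lower triangle) with OpenMP task dependences. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

#define NT 4   /* tiles per dimension (assumption for the demo) */
#define B  64  /* tile edge length  (assumption for the demo)   */

/* A[i][j] points to a row-major B x B tile; lower triangle is factored. */
static void tiled_cholesky(double *A[NT][NT]) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k])
        LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', B, A[k][k], B);

        for (int i = k + 1; i < NT; i++) {   /* POTRF -> TRSM */
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            cblas_dtrsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, B, B, 1.0, A[k][k], B, A[i][k], B);
        }
        for (int i = k + 1; i < NT; i++) {   /* TRSM -> SYRK, GEMM */
            #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
            cblas_dsyrk(CblasRowMajor, CblasLower, CblasNoTrans, B, B,
                        -1.0, A[i][k], B, 1.0, A[i][i], B);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, B, B, B,
                            -1.0, A[i][k], B, A[j][k], B, 1.0, A[i][j], B);
            }
        }
    }   /* implicit barrier: all tasks complete before the region ends */
}

int main(void) {
    static double *A[NT][NT];
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++) {
            A[i][j] = calloc((size_t)B * B, sizeof(double));
            if (i == j)                     /* diagonally dominant => SPD */
                for (int d = 0; d < B; d++) A[i][j][d * B + d] = 2.0 * NT * B;
        }
    tiled_cholesky(A);
    printf("L[0][0](0,0) = %g\n", A[0][0][0]);
    return 0;
}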

Page 7: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Cholesky Decomposition: Xeon

[Chart: speedup over serial vs. number of threads (1 to 12) for naïve OpenMP, tuned OpenMP, and SWARM implementations]

Page 8: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Cholesky Decomposition: Xeon Phi

[Chart: Cholesky performance for OpenMP vs. SWARM on Xeon Phi with 240 threads]

OpenMP fork-join programming suffers on many-core chips such as the Xeon Phi; SWARM removes these fork-join synchronizations. The sketch below illustrates the overhead in question.
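A generic OpenMP illustration of the fork-join cost (this is not SWARM code): the first version pays a full barrier between two dependent loops, while the second removes it with nowait, which is safe here only because both loops use the same static schedule, so each thread reads exactly the a[i] values it wrote.

#include <stdio.h>
#define N 1000000
static double a[N], b[N];

/* Version 1: two parallel regions; each ends in an implicit barrier,
   and on a 240-thread chip those barriers are expensive. */
static void fork_join(void) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) a[i] = i * 0.5;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
}

/* Version 2: one region, no barrier between the loops. Identical static
   schedules guarantee each thread reuses its own chunk of a[]. */
static void one_region(void) {
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < N; i++) a[i] = i * 0.5;
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
    }
}

int main(void) {
    fork_join();
    one_region();
    printf("b[2] = %f\n", b[2]);  /* expect 2.0 */
    return 0;
}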

Page 9: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Cholesky: SWARM vs. ScaLAPACK/MKL

16-node cluster: Intel Xeon E5-2670, 16 cores per node, 2.6 GHz

Asynchrony is key in large dense linear algebra.

[Chart: GFLOPS (0 to 16,000) vs. number of nodes (2 to 64) for ScaLAPACK/MKL and SWARM]

Page 10: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Code Transition to Exascale

1. Determine application execution, communication, and data access patterns.
2. Find ways to accelerate application execution directly.
3. Consider the data access pattern to better lay out data across distributed heterogeneous nodes.
4. Convert single-node synchronization to asynchronous control flow/dataflow (OpenMP -> asynchronous scheduling).
5. Remove bulk-synchronous communications where possible (MPI -> asynchronous communication); see the sketch after this list.
6. Synergize inter-node and intra-node code.
7. Determine further optimizations afforded by the asynchronous model.

This method was successfully deployed for the NWChem code transition.
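As an illustration of step 5, here is a hedged sketch of replacing a bulk-synchronous exchange with nonblocking MPI calls that overlap communication with computation. The ring-neighbor pattern and buffer names are hypothetical, not NWChem's actual communication structure.

#include <mpi.h>
#include <stdio.h>
#define N 4096

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size, left = (rank + size - 1) % size;
    static double sendbuf[N], recvbuf[N], interior[N];

    /* Bulk-synchronous style: everyone stalls in Sendrecv before computing. */
    MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, right, 0,
                 recvbuf, N, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Asynchronous style: post communication, compute on interior data,
       then wait only when the boundary values are actually needed. */
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left, 1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    for (int i = 0; i < N; i++) interior[i] *= 2.0;   /* overlapped work */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0) puts("exchange complete");
    MPI_Finalize();
    return 0;
}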

Page 11: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Self Consistent Field Module from NWChem

• NWChem is used by thousands of researchers.
• The code is designed to be highly scalable, to the petaflop scale.
• Thousands of man-hours have been expended on tuning and performance.
• The Self Consistent Field (SCF) module is a key component of NWChem.
• As part of the DOE X-Stack program, ETI has worked with PNNL to extract the SCF algorithm from NWChem and study how to improve it.

Page 12: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Serial Optimizations

[Chart: speedup from successive serial optimizations (original; symmetry of g(); BLAS/LAPACK; precomputed g values; Fock matrix symmetry), on a scale of 0 to 18x over the original. A sketch of the symmetry idea follows below.]
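A hedged illustration of the "Fock matrix symmetry" optimization: since F is symmetric, build only its lower triangle and mirror it, roughly halving the update work. The density matrix D and the integral function g() below are stand-ins, not NWChem's actual data structures, and the core-Hamiltonian term is omitted.

#include <stdio.h>
#define NB 8   /* number of basis functions (toy size) */

static double g(int i, int j, int k, int l) {
    /* placeholder two-electron integral; the real g has 8-fold
       permutational symmetry, e.g. g(i,j,k,l) == g(j,i,k,l) == ... */
    return 1.0 / (1 + i + j + k + l);
}

static void build_fock(const double D[NB][NB], double F[NB][NB]) {
    for (int i = 0; i < NB; i++)
        for (int j = 0; j <= i; j++) {        /* lower triangle only */
            double fij = 0.0;
            for (int k = 0; k < NB; k++)
                for (int l = 0; l < NB; l++)
                    fij += D[k][l] * (g(i, j, k, l) - 0.5 * g(i, k, j, l));
            F[i][j] = fij;
            F[j][i] = fij;                    /* mirror: F is symmetric */
        }
}

int main(void) {
    double D[NB][NB] = {{0}}, F[NB][NB];
    for (int i = 0; i < NB; i++) D[i][i] = 1.0;  /* toy density matrix */
    build_fock(D, F);
    printf("F[0][0] = %f, F[0][1] = %f\n", F[0][0], F[0][1]);
    return 0;
}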

Page 13: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Single Node Parallelization

[Chart: speedup vs. number of threads (1 to 31) for nine OpenMP variants (dynamic, guided, and static schedules, each in versions v1 to v3), SWARM, and the ideal curve, on a scale of 0 to 22x. The scheduling variants are illustrated below.]
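For readers unfamiliar with the OpenMP variants compared above, the schedule clause controls how loop iterations map to threads, which matters when iteration costs are irregular, as in SCF integral loops. The work() function here is a stand-in for the real kernel.

#include <math.h>
#include <stdio.h>
#define N 100000

static double work(int i) { return sqrt((double)i) * (i % 97); }

int main(void) {
    double sum = 0.0;

    /* static: fixed contiguous chunks; lowest overhead, worst balance
       when iteration costs vary. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++) sum += work(i);

    /* dynamic: threads grab chunks on demand; best balance, more overhead. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:sum)
    for (int i = 0; i < N; i++) sum += work(i);

    /* guided: large chunks first, shrinking over time; a compromise. */
    #pragma omp parallel for schedule(guided) reduction(+:sum)
    for (int i = 0; i < N; i++) sum += work(i);

    printf("sum = %f\n", sum);
    return 0;
}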

Page 14: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Multi-Node Parallelization

[Chart: SCF multi-node execution scaling: execution time in seconds (10 to 10,000, log scale) vs. number of cores (16 to 2048) for SWARM and MPI]

[Chart: SCF multi-node speedup over a single node (0.1 to 100, log scale) vs. number of cores (16 to 2048) for SWARM and MPI]

Page 15: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Information Repository

• All of this information is available in more detail at the X-Stack wiki: http://www.xstackwiki.com

Page 16: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Questions?

Page 17: X-Stack: Programming Challenges, Runtime Systems, and Tools, Brandywine Team, May 2013


Acknowledgements

• Co-PIs: Benoit Meister (Reservoir), David Padua (Univ. Illinois), John Feo (PNNL)
• Other team members:
  ETI: Mark Glines, Kelly Livingston, Adam Markey
  Reservoir: Rich Lethin
  Univ. Illinois: Adam Smith
  PNNL: Andres Marquez
• DOE: Sonia Sachs, Bill Harrod