Analyzing HPC Communication Requirements

Shoaib Kamil, Lenny Oliker, John Shalf, David Skinner
jshalf@lbl.gov
NERSC and Computational Research Division, Lawrence Berkeley National Laboratory

Presented at Brocade Networks, October 10, 2007

Transcript of "Analyzing HPC Communication Requirements"

Page 1: Title slide.

Page 2: Overview

• The CPU clock-scaling bonanza has ended
  – Heat density
  – New physics below 90 nm (departure from bulk material properties)
• Yet by the end of the decade, mission-critical applications are expected to have 100x the computational demands of current levels (PITAC Report, Feb 1999)
• The path forward for high-end computing is increasingly reliant on massive parallelism
  – Petascale platforms will likely have hundreds of thousands of processors
  – System costs and performance may soon be dominated by the interconnect
• What kind of interconnect is required for a >100k-processor system?
  – What topological requirements? (fully connected, mesh)
  – Bandwidth/latency characteristics?
  – Specialized support for collective communications?

Page 3: Questions (How do we determine appropriate interconnect requirements?)

• Topology: will the apps tell us what kind of topology to use?
  – Crossbars: not scalable
  – Fat-trees: cost scales superlinearly with the number of processors
  – Lower-degree interconnects (n-dimensional mesh, torus, hypercube, Cayley graph):
    • Costs scale linearly with the number of processors
    • Problems with application mapping/scheduling and fault tolerance
• Bandwidth/latency/overhead
  – Which is most important? (trick question: they are intimately connected)
  – Requirements for a "balanced" machine? (i.e., performance is not dominated by communication costs)
• Collectives
  – How important are they, and what type?
  – Do they deserve a dedicated interconnect?
  – Should we put floating-point hardware into the NIC?

Page 4: Approach

• Identify a candidate set of "Ultrascale applications" that span scientific disciplines
  – Applications demanding enough to require Ultrascale computing resources
  – Applications capable of scaling up to hundreds of thousands of processors
  – Not every app is "Ultrascale"!
• Find a communication-profiling methodology that is
  – Scalable: need to be able to run for a long time with many processors; traces are too large
  – Non-invasive: some of these codes are large and can be difficult to instrument even with automated tools
  – Low-impact on performance: full-scale apps, not proxies!

Page 5: IPM (the "hammer")

IPM: Integrated Performance Monitoring
• Portable, lightweight, scalable profiling
• Fast hash method
• Profiles MPI topology
• Profiles code regions (bracketed with MPI_Pcontrol calls; expanded in the sketch below)
• Open source

Region markers in the source code:

    MPI_Pcontrol(1, "W");
    /* ...code... */
    MPI_Pcontrol(-1, "W");

Sample IPM report (madbench.x on 256 tasks, ES/ESOS):

    ############################################
    # IPMv0.7 :: csnode041   256 tasks   ES/ESOS
    # madbench.x (completed)  10/27/04/14:45:56
    #
    #            <mpi>     <user>     <wall>  (sec)
    #           171.67     352.16     393.80
    # ...
    ############################################
    # W
    #            <mpi>     <user>     <wall>  (sec)
    #            36.40     198.00     198.36
    #
    # call           [time]      %mpi   %wall
    # MPI_Reduce     2.395e+01   65.8    6.1
    # MPI_Recv       9.625e+00   26.4    2.4
    # MPI_Send       2.708e+00    7.4    0.7
    # MPI_Testall    7.310e-02    0.2    0.0
    # MPI_Isend      2.597e-02    0.1    0.0
    ############################################

Developed by David Skinner, NERSC
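
A minimal sketch of how a code region is bracketed for IPM via the standard MPI_Pcontrol interface, following the "W" region shown above; the surrounding program and the MPI_Allreduce payload are illustrative assumptions, not taken from madbench.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        double local = 1.0, global = 0.0;

        /* Open an IPM region named "W"; IPM attributes all MPI time and
           call counts between the two MPI_Pcontrol calls to that region. */
        MPI_Pcontrol(1, "W");

        /* ...application work and communication to be profiled
           (an illustrative stand-in)... */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* Close the "W" region. */
        MPI_Pcontrol(-1, "W");

        MPI_Finalize();
        return 0;
    }

Beyond linking or preloading the IPM library, typically no further source changes are needed; the region then shows up as a separate block ("W") in the report above.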

Page 6: Application Overview (the "nails")

    NAME      Discipline         Problem/Method      Structure
    MADCAP    Cosmology          CMB Analysis        Dense Matrix
    FVCAM     Climate Modeling   AGCM                3D Grid
    CACTUS    Astrophysics       General Relativity  3D Grid
    LBMHD     Plasma Physics     MHD                 2D/3D Lattice
    GTC       Magnetic Fusion    Vlasov-Poisson      Particle in Cell
    PARATEC   Materials Science  DFT                 Fourier/Grid
    SuperLU   Multi-Discipline   LU Factorization    Sparse Matrix
    PMEMD     Life Sciences      Molecular Dynamics  Particle

Page 7: Latency Bound vs. Bandwidth Bound?

• How large does a message have to be in order to saturate a dedicated circuit on the interconnect?
  – N_1/2 from the early days of vector computing
  – Bandwidth-delay product in TCP

    System           Technology         MPI Latency   Peak Bandwidth   Bandwidth-Delay Product
    SGI Altix        NUMAlink-4         1.1 us        1.9 GB/s         2 KB
    Cray X1          Cray Custom        7.3 us        6.3 GB/s         46 KB
    NEC ES           NEC Custom         5.6 us        1.5 GB/s         8.4 KB
    Myrinet Cluster  Myrinet 2000       5.7 us        500 MB/s         2.8 KB
    Cray XD1         RapidArray/IB 4x   1.7 us        2 GB/s           3.4 KB

• Bandwidth bound if message size > bandwidth * delay
• Latency bound if message size < bandwidth * delay (a small worked example follows below)
  – Except if pipelined (unlikely with MPI due to overhead)
  – Cannot pipeline MPI collectives (but can in Titanium)
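
A back-of-the-envelope sketch, in C, of the classification rule above, using the SGI Altix row of the table; the sample message sizes are illustrative.

    #include <stdio.h>

    /* Classify messages as latency-bound or bandwidth-bound by comparing
       their size against the interconnect's bandwidth-delay product. */
    int main(void) {
        /* Figures from the SGI Altix / NUMAlink-4 row of the table above. */
        double latency_s = 1.1e-6;                 /* MPI latency: 1.1 us      */
        double bandwidth = 1.9e9;                  /* peak bandwidth: 1.9 GB/s */
        double bdp       = latency_s * bandwidth;  /* ~2 KB                    */

        double msg_sizes[] = { 64.0, 2048.0, 65536.0 };  /* bytes, illustrative */
        for (int i = 0; i < 3; i++) {
            const char *kind = (msg_sizes[i] > bdp) ? "bandwidth-bound"
                                                    : "latency-bound";
            printf("msg %8.0f B  vs  BDP %.0f B  ->  %s\n",
                   msg_sizes[i], bdp, kind);
        }
        return 0;
    }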

Page 8: Call Counts

[Figure: per-application breakdown of MPI call counts. The calls that dominate across the codes are point-to-point operations (Isend, Irecv, Send, Recv, SendRecv, Wait, WaitAll, WaitAny) and collectives (Allreduce, Reduce, Gather).]

Page 9: Diagram of Message Size Distribution Function

Page 10: Message Size Distributions

Page 11: P2P Buffer Sizes

Page 12: Collective Buffer Sizes

Page 13: Collective Buffer Sizes

95% Latency Bound!!!

Page 14: P2P Topology Overview

[Figure: point-to-point communication topology matrices; the color scale runs from 0 to the maximum total message volume.]

Page 15: Low-Degree Regular Mesh Communication Patterns

Page 16: Cactus Communication (PDE Solvers on Block-Structured Grids)


Page 17: LBMHD Communication

Page 18: GTC Communication

[Figure: GTC communication pattern and MPI call counts.]

Page 19: FVCAM Communication

Page 20: SuperLU Communication

Page 21: PMEMD Communication

[Figure: PMEMD communication pattern and MPI call counts.]

Page 22: PARATEC Communication

[Figure: PARATEC communication pattern (3D FFT).]

Page 23: Latency/Balance Diagram

[Figure: machine-balance diagram. Applications divide into computation-bound and communication-bound regions, and communication-bound codes further divide into latency-bound and bandwidth-bound; the corresponding remedies are faster processors, lower interconnect latency, or more interconnect bandwidth.]

Page 24: Summary of Communication Patterns

    Code (256 procs)  %P2P : %Coll   Avg. Coll Bufsize   Avg. P2P Bufsize       TDC@2k (max, avg)   %FCN Utilization
    GTC                40% : 60%     100                 128k                   10, 4               2%
    Cactus             99% :  1%     8                   300k                   6, 5                2%
    LBMHD              99% :  1%     8                   848k (3D) / 12k (2D)   12, 11.8            5% / 2%
    SuperLU            93% :  7%     24                  48                     30, 30              25%
    PMEMD              98% :  2%     768                 6k or 72               255, 55             22%
    PARATEC            99% :  1%     4                   64                     255, 255            100% (<10%)
    MADCAP-MG          78% : 22%     163k                1.2M                   44, 40              23%
    FVCAM              99% :  1%     8                   96k                    20, 15              16%

    (TDC = topological degree of communication, i.e., the number of distinct communication partners per process; FCN = fully connected network.)

Page 25: Requirements for Interconnect Topology

[Figure: applications plotted by communication intensity (number of neighbors) versus regularity of the communication topology. Intensity ranges from embarrassingly parallel up to fully connected; regularity ranges from regular to irregular. Codes placed on the chart: FVCAM, Cactus, 2D/3D LBMHD, CAM/GTC, MADCAP, SuperLU, PMEMD, PARATEC, Monte Carlo, and AMR (coming soon!).]

Page 26: Coverage by Interconnect Topologies

[Figure: the same intensity-vs-regularity chart, overlaid with coverage regions: a 2D mesh (labelled near CAM/GTC), a 3D mesh (labelled near MADCAP), and a fully connected network (fat-tree/crossbar) covering the remaining codes.]

Page 27: Coverage by Interconnect Topologies (continued)

[Figure: the same coverage chart, with question marks over the codes that are not cleanly covered by either the mesh regions or the fully connected network.]

Page 28: Revisiting Original Questions

• Topology
  – Most codes require far less than full connectivity
    • PARATEC is the only code requiring full connectivity
    • Many require low degree (<12 neighbors)
  – Low-TDC codes are not necessarily isomorphic to a mesh!
    • Non-isotropic communication patterns
    • Non-uniform requirements
• Bandwidth/delay/overhead requirements
  – Scalable codes send primarily bandwidth-bound messages
  – Average message sizes are several kilobytes
• Collectives
  – Most payloads are less than 1 KB (8-100 bytes!)
    • Well below the bandwidth-delay product
    • Primarily latency-bound, which calls for a different kind of interconnect (see the sketch below)
  – Math operations are limited primarily to reductions involving sum, max, and min
  – Collectives deserve a dedicated network (significantly different requirements)
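
To make the collectives point concrete, a hedged illustration in C/MPI: a reduction over a payload of a few doubles, well below any bandwidth-delay product in the earlier table, so its cost is dominated by latency rather than bandwidth. The payload size is illustrative, not drawn from a specific code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* A payload of a few values (24 bytes here) is typical of the
           collective buffer sizes reported above (8-100 bytes). */
        double local[3]  = { 1.0, 2.0, 3.0 };
        double global[3] = { 0.0, 0.0, 0.0 };

        /* 24 bytes is far below the bandwidth-delay product of any system
           in the earlier table, so the cost is dominated by latency. */
        MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("reduced value of first element: %g\n", global[0]);

        MPI_Finalize();
        return 0;
    }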

Page 29: Mitigation Strategies

• What does the data tell us to do?
  – P2P: focus on messages that are bandwidth-bound (i.e., larger than the bandwidth-delay product)
    • Switch latency = 50 ns
    • Propagation delay = 5 ns/meter
    • End-to-end latency = 1000-1500 ns for the very best interconnects! (a rough latency budget is sketched below)
  – Shunt collectives to their own tree network (as in BG/L)
  – Route latency-bound messages along non-dedicated links (multiple hops) or an alternate network (just like collectives)
  – Try to assign a direct/dedicated link to each of the distinct destinations that a process communicates with
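
A rough latency-budget sketch built from the per-hop figures above; the hop count, cable length, and NIC overhead are illustrative assumptions chosen to land in the quoted 1000-1500 ns range, not measured values.

    #include <stdio.h>

    /* Rough end-to-end latency budget: 50 ns per switch stage and
       5 ns per meter of cable, plus an assumed NIC send/receive cost. */
    int main(void) {
        double switch_ns     = 50.0;        /* latency per switch stage    */
        double prop_ns_per_m = 5.0;         /* propagation delay per meter */

        int    hops    = 5;                 /* assumed switch stages       */
        double cable_m = 40.0;              /* assumed total cable length  */
        double nic_ns  = 2.0 * 400.0;       /* assumed send + receive NIC cost */

        double total_ns = hops * switch_ns + cable_m * prop_ns_per_m + nic_ns;
        printf("end-to-end latency ~ %.0f ns\n", total_ns);   /* ~1250 ns */

        /* At 2 GB/s (2 bytes/ns), this corresponds to a bandwidth-delay
           product of roughly 2.5 KB: messages smaller than that stay
           latency-bound no matter how fast the links are. */
        printf("BDP at 2 GB/s ~ %.0f bytes\n", 2.0 * total_ns);
        return 0;
    }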

Page 30: Operating Systems for CMP

• Even cell phones will need an OS (and our idea of an OS is far too big!)
  – Mediating resources for many cores, protection from viruses, and managing increasing code complexity
  – But it has to be very small and modular! (see also embedded Linux)
• Old OS assumptions are bogus for hundreds of cores!
  – Assumes a limited number of CPUs that must be shared
    • Old OS: time-multiplexing (context switching and cache pollution!)
    • New OS: spatial partitioning
  – Greedy allocation of finite I/O device interfaces (e.g., 100 cores go after the network interface simultaneously)
    • Old OS: first process to acquire the lock gets the device (resource/lock contention! nondeterministic delay!)
    • New OS: QoS management for symmetric device access
  – Background task handling via threads and signals
    • Old OS: interrupts and threads (time-multiplexing) (inefficient!)
    • New OS: side cores dedicated to DMA and async I/O
  – Fault isolation
    • Old OS: CPU failure --> kernel panic (will happen with increasing frequency in future silicon!)
    • New OS: CPU failure --> partition restart (partitioned device drivers)
  – Old OS is invoked for any interprocessor communication or scheduling vs. direct HW access
• What will the new OS look like?
  – Whatever it is, it will probably look like Linux (or ISVs will make life painful)
  – Linux is too big, but a microkernel is not sufficiently robust
  – Modular kernels are commonly used in embedded applications (e.g., vxWorks running under a hypervisor, XEN, K42, D.K. Panda's side cores)

Page 31: I/O for Massive Concurrency

• Scalable I/O for massively concurrent systems!
  – Many issues with coordinating access to disk within a node (on-chip or CMP)
  – The OS will need to devote more attention to QoS for cores competing for a finite resource (mutex locks and greedy resource-allocation policies will not do!) (it is rugby, where device == the ball)

    nTasks   I/O Rate (16 tasks/node)   I/O Rate (8 tasks/node)
    8        -                          131 MB/s
    16       7 MB/s                     139 MB/s
    32       11 MB/s                    217 MB/s
    64       11 MB/s                    318 MB/s
    128      25 MB/s                    471 MB/s

Page 32: Other Topics for Discussion

• RDMA
• Low-overhead messaging
• Support for one-sided messages
  – Page-pinning issues
  – TLB peers
  – Side cores

Page 33: Conundrum

• We can't afford to continue with fat-trees or other fully connected networks (FCNs)
• We can't map many Ultrascale applications onto lower-degree networks like meshes, hypercubes, or tori
• How can we wire up a custom interconnect topology for each application?

Page 34: Switch Technology

• Packet switch:
  – Reads each packet header and decides where it should go, fast!
  – Requires expensive ASICs for line-rate switching decisions
  – Optical transceivers
• Circuit switch:
  – Establishes a direct circuit from point to point (telephone switchboard)
  – Commodity MEMS optical circuit switches
    • Common in the telecom industry
    • Scalable to large crossbars
  – Slow switching (~100 microseconds)
  – Blind to message boundaries

[Figures: Force10 E1200 packet switch (1260 x 1 GigE, 56 x 10 GigE); Movaz iWSS optical circuit switch (400x400, 1-40 GigE).]

Page 35: A Hybrid Approach to Interconnects: HFAST

• Hybrid Flexibly Assignable Switch Topology (HFAST)
  – Use optical circuit switches to create a custom interconnect topology for each application as it runs (adaptive topology)
  – Why? Because circuit switches are
    • Cheaper: much simpler, passive components
    • Scalable: already available in large crossbar configurations
    • Able to allow non-uniform assignment of switching resources
  – GMPLS manages changes to packet routing tables in tandem with circuit-switch reconfigurations

Page 36: HFAST

• HFAST solves some sticky issues with other low-degree networks
  – Fault tolerance: 100k processors means roughly 800k links between them in a 3D mesh (what is the probability of failures?)
  – Job scheduling: finding a right-sized slot
  – Job packing: n-dimensional Tetris…
  – Handles apps whose communication degree is low but not isomorphic to a mesh, or whose requirements are non-uniform
• How/when to assign the topology?
  – Job submit time: put topology hints in the batch script (BG/L, RS)
  – Runtime: provision a mesh topology and monitor with IPM, then use the data to reconfigure the circuit switch during a barrier
  – Runtime: pay attention to MPI topology directives, if used (see the sketch below)
  – Compile time: code analysis and/or instrumentation using UPC, CAF, or Titanium
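
A minimal sketch of the MPI topology directives mentioned above: MPI_Cart_create describes a 2-D nearest-neighbor layout that the runtime (or an HFAST-style system) could in principle use as a placement hint. The choice of a 2-D periodic grid is illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Describe a 2-D periodic process grid; reorder = 1 allows the MPI
           implementation to remap ranks to match the physical topology. */
        int dims[2]    = { 0, 0 };            /* let MPI factor nprocs */
        int periods[2] = { 1, 1 };
        MPI_Dims_create(nprocs, 2, dims);

        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

        /* Nearest neighbors along dimension 0; an interconnect honoring the
           topology hint could give these pairs direct/dedicated links. */
        int src, dest;
        MPI_Cart_shift(cart, 0, 1, &src, &dest);
        printf("rank %d: dim-0 neighbors are %d and %d\n", rank, src, dest);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }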

Page 37: HFAST Recent Work

• Clique mapping to improve switch-port utilization efficiency (Ali Pinar)
  – The general solution is NP-complete
  – Bounding the clique size yields a problem easier than the general NP-complete case, but still potentially very large
  – Examining good heuristics and solutions to restricted cases for a mapping that completes within our lifetime
• AMR and adaptive applications (Oliker, Lijewski)
  – Examined the evolution of AMR communication topology
  – The degree of communication is very low if filtered for high-bandwidth messages
  – Reconfiguration costs can be hidden behind computation
• Hot-spot monitoring (Shoaib Kamil)
  – Use circuit switches to provision an overlay network gradually as the application runs
  – Gradually adjust the topology to remove hot-spots

Page 38: Conclusions / Future Work

• Expansion of the IPM studies
  – More DOE codes (e.g., AMR: Cactus/SAMRAI, Chombo, Enzo)
  – Temporal changes in communication patterns (AMR examples)
  – More architectures (a comparative study like the Vector Evaluation project)
  – Put the results in the context of real DOE workload analysis
• HFAST
  – Performance prediction using discrete-event simulation
  – Cost analysis (price out the parts for a mock-up and compare to an equivalent fat-tree or torus)
  – Time-domain switching studies (e.g., how do we deal with PARATEC?)
• Probes
  – Use the results to create proxy applications/probes
  – Apply to the HPCC benchmarks (generates more realistic communication patterns than the "randomly ordered rings" without the complexity of the full application code)