ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16...

21
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DEDNAD0003525. Sandia’s ARM–centric Co3Design Strategy Introduction to the NNSA/ASC Vanguard Project James A. Ang, Ron Brightwell, Simon D. Hammond, K. Scott Hemmert. Robert J. Hoekstra, James H. Laros III, Kevin Pedretti and Arun F. Rodrigues Center for Computing Research Sandia National Laboratories Albuquerque, NM 8718531319 USA ARM Research Summit, Going ARM Workshop Robinson College, Cambridge, UK September 11313, 2017 Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DEDNAD0003525. 1 SAND2017D9713 C

Transcript of ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16...

Page 1: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Sandia&National&Laboratories&is&a&multimission laboratory&managed&and&operated&by&National&Technology&and&Engineering&Solutions&of&Sandia,&LLC,&a&wholly&owned&subsidiary&of&Honeywell&International,&Inc.,&for&the&U.S.&Department&of&Energy's&National&Nuclear&Security&Administration&under&contract&DEDNAD0003525.

Sandia’s(ARM–centric(Co3Design(StrategyIntroduction(to(the(NNSA/ASC(Vanguard(Project(

James(A.(Ang,(Ron(Brightwell,(Simon(D.(Hammond,(K.(Scott(Hemmert.(Robert(J.(Hoekstra,(James(H.(Laros III,(Kevin(Pedretti and(Arun F.(RodriguesCenter(for(Computing(ResearchSandia(National(LaboratoriesAlbuquerque,(NM(8718531319((USA

ARM(Research(Summit,(Going(ARM(WorkshopRobinson(College,(Cambridge,(UKSeptember(11313,(2017

Sandia&National&Laboratories&is&a&multimission laboratory&managed&and&operated&by&National&Technology&and&Engineering&Solutions&of&Sandia,&LLC,&a&wholly&owned&subsidiary&of&Honeywell&International,&Inc.,&for&the&U.S.&Department&of&Energy's&National&Nuclear&Security&Administration&under&contract&DEDNAD0003525.

1

SAND2017D9713&C

Page 2: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Sandia(ARM(R&D(Overview! Sandia's(ASC(Advanced(Architecture(Testbed(project

! For(co3design(early(access(to(new(architectural(features(is(critical(! Positive(ROI(for(the(pain(associated(with(early(hardware(&(immature(

software! ARM(testbed(experiences(– with(Cavium’s(permission(to(share(pre3

production(measurements(of(relative(performance

! Architectural(Simulators! Structural(Simulation(Toolkit((SST),(Open(architectural(simulation(

framework! Open(Source,(Open(Development,(catalyst(for(collaboration(that(can(

be(proprietary! Links(to(Sandia/DOE's(experience(with(application(drivers

! Sandia(is(responsible(for(the(NNSA/ASC's(first(large&scaleARM(platform! Overview(of(key(requirements(targeting(2019

2

Page 3: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

APM/HPE

Xgene+1

Cavium/HP3Labs

ThunderX1

=32019

=3Retired

=32015

=32017

TODAY

ARM

Next3Gen

ExascalePlatform

Sept32011

ARM

Next3Gen

Sandia’s(NNSA/ASC(ARM(Platforms

3

Small3Pre+

Vanguard3

early3access3

system3– FY18

Install3FY193Target3

ThunderX33(Triton)

Cavium&ThunderX1

PreDproduction&Cavium&ThunderX2

Page 4: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Memory(Bandwidth(Intensive(Kernels

4

0

2

4

6

8

10

12

0 16 32 48 64 80 96

Speedup3vs.3Pre+Prod3TX1

OpenMP Threads

STREAM3Triad

ThunderX1&D Final& ThunderX1&D PreDProductionThunderX2&PreDProduction&(Wave&1)

0

1

2

3

4

CG&Solve WAXPY DOT MATVEC

Speedup3vs.3Pre+Prod3TX1

MiniFE CG3Solve/Kernels

ThunderX1&D PreDProduction ThunderX1&Final&SiliconThunderX2&D PreDProduction&(Wave&1)

Benchmarked(on(maximum(number(of(cores((node3node(comparison)

https://github.com/Mantevo/miniFE (Sandia&National&Labs)

Significant(improvement(in(memory(bandwidth((~4X)

Individual(cores(more(powerful ~2.5X

Page 5: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Compute(Intensive(Kernels

5

0

1

2

3

Speedup3vs.3Pre+Prod3TX1

LULESH3OpenMP (Figure3of3Merit)

ThunderX1&D PreDProductionThunderX1&D Final&ThunderX2&PreDProduction&(Wave&1)

~2.7X

0

1

2

3

Speedup3vs.3Pre+Prod3TX1

PENNANT3(Hydro3Solve3Time)

ThunderX1&D PreDProduction ThunderX1&D Final&ThunderX2&D PreDProduction&(Wave&1)

https://github.com/lanl/PENNANT (Los&Alamos&Nat.&Lab) https://codesign.llnl.gov/lulesh.php (Lawrence&Livermore&Nat.&Lab)

~2.2X

Page 6: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Full(Engineering(Applications(on(ARM

! On(Pre3production(TX2,(Sandia(has(been(actively(working(on(libraries(and(packages(for(its(ARM(systems! Math(Libraries/Kernels((Trilinos)! I/O(Libraries((Exodus(Mesh)! YAML

! Ported(Sandia’s(open3source(Trinity&acceptance(Nalu application(to(ARM! Representative(of(some(SIERRA(engineering(

applications! Complex(mesh(handling,(load(balancing,(

solvers,(etc.6

https://github.com/NaluCFD

Page 7: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Architectural(Simulation(Framework:(SST! Use(Supercomputers(to(Design(Supercomputers! Parallel(Discrete3Event(Simulator(Framework

! Flexible(framework(allows(multitude(of(custom(simulators! Demonstrated(scaling(to(over(512(processors

! Comes(with(many(built3in(simulation(models! Processors,(Memory,(Network

! Open(API! Easily(extensible(with(new(models! Modular(framework! (Non3Viral)(Open3source(core

! Time3scale(independent(core! Handles(Micro3,(Meso3,(Macro3scale(simulations

! “Best(of(Breed”(– Bring(together(work(from(Labs,(Industry,(Academia

7

SST::Component SST::Component

SST3Core

Configuration

Partitioning

Link

Event

InstantiationTime3Coordination

Parallel3Communication

http://sst3simulator.org/

Page 8: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Example(SST(Element(Libraries! memHierarchy 3 Cache(and(Memory(! cassini 3 Cache(prefetchers! DRAMSim 3 DDR! NVDIMMSim 3 Emerging(Memories

! ariel 3 PIN3based(Tracing

! m5C 3 Gem5(integration(layer

! ember 3 State3machine(Message(generation! firefly 3 Communication(Protocols! hermes 3 MPI3like(interface

! merlin 3 Network(router(model(and(NIC

! scheduler 3 Job3scheduler(simulation(models8

Detailed(Memory(Models

Dynamic(Trace3based(Processor(Model

Cycle3based(Processor(Model

High3level(Program(Communication(Models

Cycle3based(Network(Model

High3level(System(Workflow(Model

Page 9: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

SST(Use(Cases

9

NVM-Based DIMM

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

Rank

Bank

NVM Internal Controller

Write Buffer

Requests Tracker

RequestBuffer Scheduler

Wear LevelerPower

Manager

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

Rank

Memory Bus

Network

P

Scratch

P P

"$"

Fast

SlowMemory

P

Scratch

P P

"$"

Fast

SlowMemory

P

Scratch

P P

"$"

Fast

SlowMemory Network

P

Scratch

P P

"$"

Fast

SlowMemory

P

Scratch

P P

"$"

Fast

SlowMemory

P

Scratch

P P

"$"

Fast

SlowMemory

L3

L2

L1

L0

Emerging&NV&Memory&Technologies

Photonic&Network&Topology&&&Routing

MultiDLevel&Memory(HBM+DDR+NV)

Disaggregated&Memory

Communication/&Interconnect&Modeling

Page 10: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

The(NNSA/ASC(Vanguard(Project! Expand'HPC'ecosystem'by(developing(ARM3based(system(necessary(to(enable(a(credible ARM3based(Advanced(Technology(System(offering(for(ASC(post32022! Mature(ARM(HPC(ecosystem

– Software(stack(and(toolchain((OS,(compilers,(scalable(MPI,(runtime,(system(management,(development(tools,(I/O,(…)

– On3package(high3bandwidth(memory– Advanced(HPC(interconnect

! Influence(Supplier/Integrator(community(to(accelerate(and(optimize(ARM(technologies(for(leadership(HPC,(and(foster(HPC(community(confidence

! Leverage(the(open(ARM(architecture(platform(to(enable(innovations(in(processor,(interconnect,(memory,(specialized(acceleration(and(packaging(technologies(that(focus&on&performance&and&scalability&of&our&multi3physics&production&codes&and(could(impact(2025(systems

10

Page 11: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

The(Vanguard(Project(Approach

! Bridge(the(gap(between(what(we(can(accomplish(with(Testbed(project(and(fielding(an(Advanced(Technology(System(platform

! Vanguard(is(a(project,(not(a(single(platform! Accelerate(maturity(of(emerging(technologies(for(ASC(program

11

Test3BedsVanguardATS/Capability

Platforms

Greater3Stability,3Larger3Scale

Higher3Risk,3Greater3Architectural3Choices

Page 12: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Vanguard(Approach((cont.)

! Appropriate scale(systems! Must(be(large(enough(to(serve(as(proof(of(concept(for(future(Advanced(Technology(Systems

! Sufficient(scale(to(interest(production(multi3physics(application(teams! Appropriate investment

! Gain(and(maintain(vendor(and(collaborator(attention! Appropriate level(of(risk

! Goals(target(mission(workloads(but(not(turn3key(production(mission(support

12

Page 13: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Software(Environment(Plan(3 Overview! Goal:(Accelerate(maturity(of(ARM(ecosystem(for(ASC(computing(mission! Need(an(integrated(software(stack(for(the(2019(ARM(Prototype(to(enable(

application(development(and(optimization! Programming(environment((compilers,(math(libs,(tools,(MPI,(OMP,…..)! Low3level(OS((optimized(Linux,(I/O,(network,(containers(+(VMs,(PowerAPI,(…)! Job(Scheduling(and(management((WLM(Slurm?),(app(launcher,(user(tools,(….)! System(Management((boot,(monitoring,(image(mgt.,(rapid(re3provisioning,(….)

! Focus(Areas! Integration(and(robustness,(overall(user(experience! Address(known(weaknesses:(compilers,(libs,(and(tools! Increase(modularization(and(openness(of(system(software(stack;(seek(“plugin”(capability(for(

externally(developed(components.

13

Page 14: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Software(Environment(Collaboration(Roles

14

! System(Vendor! Deliver(and(support(core(elements(of(the(software(environment(necessary(

for(a(viable(integrated(system((part(of(system(contract)! Tri3lab(team

! Integrate(system(into(our(computing(environment! Identify(and(resolve(SW(issues(in(collaboration(with(system(vendor! Contribute(tools(and(other(capabilities(to(fill(gaps(and(improve(the(overall(

computing(environment! Other(Eco3system(Vendors/Stakeholders

! Linaro! OpenHPC! ARM/Allinea! Others

Page 15: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Drivers(for(ARM(Software(Stack1.(Focus(on(NNSA/ASC’s(unique(requirements

! Many(programs(are(already(exploring(HPC(on(ARM((e.g.,(Mont3Blanc,(Post3K)! Need(to(demonstrate(full(multi3physics(applications(running(at(scale(on(ARM

2.(Build(an(integrated(team! Integrate(work(across(tri3labs(and(vendors,(managed(as(single(project! Focus(on(strengths(of(each(laboratory,(support(technology(deployment

3.(Provide(a(robust(“traditional”(HPC(software(stack! Meet(user(expectations,(provide(familiar(environment! ARM(is(just(an(ISA;(much(of(system(software(stack(should(“just(work”

4.(Engage(with(users! Enable(low3effort(issue(reporting,(rapidly(resolve,(get(and(provide(feedback! Build(community(through(talks,(training,(hackathons,(workshops,(etc.

5.(Look(forward! Improve(support(for(real3world(workflows(and(data3centric(computing! Seek(to(reduce(friction(points(for(users(when(moving(between(platforms! Leverage(software(technology(from(US(DOE(Exascale Computing(Project

15

Page 16: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Technology(Development(Areas

16

! 2019(ARM(platform(is(not a(production(system(– increased(latitude(for(experimentation(and(R&D(activities! Need(to(address(options(for(scheduling(and(types(of(access(to(the(system(to(

provide(for(full(scale(projects,(alternate(environments,(etc.! Plan(to(pursue(several(technology(development(areas:

! On3package(high3bandwidth(memory! Advanced(HPC(Interconnect! Compilers,(math(libraries,(tools,(task3based(runtimes,(...

! Vendor(framework(that(OpenHPC components(get(plugged(into! OpenHPC framework(that(vendor(components(get(plugged(into

! Experimental(OS/R(stacks! Lightweight(kernels((Hobbes/Kitten,(Intel(mOS,(RIKEN(McKernel)! KVM(support(for(cyber(“simulate(the(Internet”(frameworks! Support(for(HPC(+(Analytics((including(AI(and(ML)(workflows(and(compositions! More(interactive(IaaS3style(usage(models

Page 17: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Required(Vendor(Support

17

! Modular(system(software(environment(architected(to(facilitate(site(modifications(and(additions.

! Documented(APIs(for(integrating(new(software(components(with(vendor(stack((e.g.(RAS(APIs,(Network,(I/O,(Health(of(subsystems,…)

! Ability(to(partition(the(system(and(manage(multiple(OS(images! Carve(out(dedicated(resources(for(R&D(activities! Boot(new(or(experimental(software(environments

! Buildable(source(– High(Priority! Ability(for(labs(to(debug(and(fix(system(software(issues

! For(example,(be(able(to(debug(MPI,(PMI,(Lustre,(Burst(Buffers,(app(load,(…! Ability(to(rebuild(vendor(Linux(kernels(with(modified(config options

! Requires(vendor(to(provide(source(for(all(“binary3only”(kernel(drivers(and(all(non3standard(Linux(kernel(patches

! Ability(to(build(and(evaluate(experimental(OS/R(stacks

Page 18: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Recap:(Goals(of(NNSA/ASC(Vanguard(Project

! Key(Requirements(for(this(2019(Platform! System(prototype(for(future(leadership(class(DOE(platform! Competitive(HPC(64(Bit(ARM(processor(technology! Integrated(On3package(Memory! First(of(a(kind,(Advanced(Interconnect(technology! Large(focus(on(maturing(the(ARM(software(ecosystem

! Opportunities(for(holistic co1design&of innovative test(hardware! Logic(in(NIC(to(improve(interconnection(network(performance! Logic(in/near(memory(to(support(sparse(linear(algebra(acceleration! Other(advanced(processor(architectures

18

Page 19: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Acknowledgements

! NNSA/ASC3Program:3 Mark&Anderson,&Douglas&Wade,&Thuc Hoang! Advanced3Architecture3Testbeds3Operations3Team: John&Noe,&Jim&Brandt,&Ann&Gentile,&Victor&Kuhns,&Nate&Gauntt,&Mike&Aguilar

! SST3Developers:33K.&Scott&Hemmert,&Branden&Moore,&Gwen&Voskuilen,&AmroAwad,&Clay&Hughes,&Jeremy&Wilke

! SST3SQE3team:33Jon&Wilson,&Aaron&Levine,&John&Van&Dyke! Tri+lab Vanguard3team: Jim&Laros,&Kevin&Pedretti,&Si&Hammond,&Rob&Hoekstra,&Ron&Brightwell,&Ken&Alvin,&Trent&D’Hooge,&James&Foraker,&Rob&Neely,&Becky&Springmeyer,&Bronis&de&Supinski,&Mike&Lang,&Pat&McCormick

! Cavium:33Gopal&Hegde,&Rishi&Chugh,&Larry&Wikelius,&David&Hass,&Giri&Chukkapalli&! ARM:33Nigel&Paver,&Eric&Van&Hensbergen,&Geraint North&! HPE:33Mike&Vildibill,&Nic&Dube,&Kelly&Pracht! Cray:33Duncan&Roweth,&Dan&Ernst,&Larry&Kaplan,&Heidi&Poxon

19

Page 20: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&
Page 21: ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16 32 48 64 80 96 Speedup3vs.3Pre + Prod3TX1 OpenMPThreads STREAM3Triad ThunderX1&DFinal&

Current(NNSA(Architecture(Strategy

21

Advanced&Technology

Systems&&(ATS)

Fiscal&Year

‘13 ‘14 ‘15 ‘16 ‘17 ‘18

UseRetire

‘20‘19 ‘21

Commodity&

Technology

Systems&(CTS)&

Procure&&Deploy

Sequoia(((LLNL)

ATS'1'– Trinity''(LANL/SNL)

ATS(2(– Sierra((LLNL)

Tri3lab(Linux(Capacity(Cluster(II((TLCC(II)

CTS(1

CTS(2

‘22

SystemDelivery

ATS(3(– Crossroads((LANL/SNL)

ATS(4(– (LLNL)

‘23

ATS(5(– (LANL/SNL)