ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16...

Post on 12-Oct-2020

2 views 0 download

Transcript of ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16...

Sandia&National&Laboratories&is&a&multimission laboratory&managed&and&operated&by&National&Technology&and&Engineering&Solutions&of&Sandia,&LLC,&a&wholly&owned&subsidiary&of&Honeywell&International,&Inc.,&for&the&U.S.&Department&of&Energy's&National&Nuclear&Security&Administration&under&contract&DEDNAD0003525.

Sandia’s(ARM–centric(Co3Design(StrategyIntroduction(to(the(NNSA/ASC(Vanguard(Project(

James(A.(Ang,(Ron(Brightwell,(Simon(D.(Hammond,(K.(Scott(Hemmert.(Robert(J.(Hoekstra,(James(H.(Laros III,(Kevin(Pedretti and(Arun F.(RodriguesCenter(for(Computing(ResearchSandia(National(LaboratoriesAlbuquerque,(NM(8718531319((USA

ARM(Research(Summit,(Going(ARM(WorkshopRobinson(College,(Cambridge,(UKSeptember(11313,(2017

Sandia&National&Laboratories&is&a&multimission laboratory&managed&and&operated&by&National&Technology&and&Engineering&Solutions&of&Sandia,&LLC,&a&wholly&owned&subsidiary&of&Honeywell&International,&Inc.,&for&the&U.S.&Department&of&Energy's&National&Nuclear&Security&Administration&under&contract&DEDNAD0003525.

1

SAND2017D9713&C

Sandia(ARM(R&D(Overview! Sandia's(ASC(Advanced(Architecture(Testbed(project

! For(co3design(early(access(to(new(architectural(features(is(critical(! Positive(ROI(for(the(pain(associated(with(early(hardware(&(immature(

software! ARM(testbed(experiences(– with(Cavium’s(permission(to(share(pre3

production(measurements(of(relative(performance

! Architectural(Simulators! Structural(Simulation(Toolkit((SST),(Open(architectural(simulation(

framework! Open(Source,(Open(Development,(catalyst(for(collaboration(that(can(

be(proprietary! Links(to(Sandia/DOE's(experience(with(application(drivers

! Sandia(is(responsible(for(the(NNSA/ASC's(first(large&scaleARM(platform! Overview(of(key(requirements(targeting(2019

2

APM/HPE

Xgene+1

Cavium/HP3Labs

ThunderX1

=32019

=3Retired

=32015

=32017

TODAY

ARM

Next3Gen

ExascalePlatform

Sept32011

ARM

Next3Gen

Sandia’s(NNSA/ASC(ARM(Platforms

3

Small3Pre+

Vanguard3

early3access3

system3– FY18

Install3FY193Target3

ThunderX33(Triton)

Cavium&ThunderX1

PreDproduction&Cavium&ThunderX2

Memory(Bandwidth(Intensive(Kernels

4

0

2

4

6

8

10

12

0 16 32 48 64 80 96

Speedup3vs.3Pre+Prod3TX1

OpenMP Threads

STREAM3Triad

ThunderX1&D Final& ThunderX1&D PreDProductionThunderX2&PreDProduction&(Wave&1)

0

1

2

3

4

CG&Solve WAXPY DOT MATVEC

Speedup3vs.3Pre+Prod3TX1

MiniFE CG3Solve/Kernels

ThunderX1&D PreDProduction ThunderX1&Final&SiliconThunderX2&D PreDProduction&(Wave&1)

Benchmarked(on(maximum(number(of(cores((node3node(comparison)

https://github.com/Mantevo/miniFE (Sandia&National&Labs)

Significant(improvement(in(memory(bandwidth((~4X)

Individual(cores(more(powerful ~2.5X

Compute(Intensive(Kernels

5

0

1

2

3

Speedup3vs.3Pre+Prod3TX1

LULESH3OpenMP (Figure3of3Merit)

ThunderX1&D PreDProductionThunderX1&D Final&ThunderX2&PreDProduction&(Wave&1)

~2.7X

0

1

2

3

Speedup3vs.3Pre+Prod3TX1

PENNANT3(Hydro3Solve3Time)

ThunderX1&D PreDProduction ThunderX1&D Final&ThunderX2&D PreDProduction&(Wave&1)

https://github.com/lanl/PENNANT (Los&Alamos&Nat.&Lab) https://codesign.llnl.gov/lulesh.php (Lawrence&Livermore&Nat.&Lab)

~2.2X

Full(Engineering(Applications(on(ARM

! On(Pre3production(TX2,(Sandia(has(been(actively(working(on(libraries(and(packages(for(its(ARM(systems! Math(Libraries/Kernels((Trilinos)! I/O(Libraries((Exodus(Mesh)! YAML

! Ported(Sandia’s(open3source(Trinity&acceptance(Nalu application(to(ARM! Representative(of(some(SIERRA(engineering(

applications! Complex(mesh(handling,(load(balancing,(

solvers,(etc.6

https://github.com/NaluCFD

Architectural(Simulation(Framework:(SST! Use(Supercomputers(to(Design(Supercomputers! Parallel(Discrete3Event(Simulator(Framework

! Flexible(framework(allows(multitude(of(custom(simulators! Demonstrated(scaling(to(over(512(processors

! Comes(with(many(built3in(simulation(models! Processors,(Memory,(Network

! Open(API! Easily(extensible(with(new(models! Modular(framework! (Non3Viral)(Open3source(core

! Time3scale(independent(core! Handles(Micro3,(Meso3,(Macro3scale(simulations

! “Best(of(Breed”(– Bring(together(work(from(Labs,(Industry,(Academia

7

SST::Component SST::Component

SST3Core

Configuration

Partitioning

Link

Event

InstantiationTime3Coordination

Parallel3Communication

http://sst3simulator.org/

Example(SST(Element(Libraries! memHierarchy 3 Cache(and(Memory(! cassini 3 Cache(prefetchers! DRAMSim 3 DDR! NVDIMMSim 3 Emerging(Memories

! ariel 3 PIN3based(Tracing

! m5C 3 Gem5(integration(layer

! ember 3 State3machine(Message(generation! firefly 3 Communication(Protocols! hermes 3 MPI3like(interface

! merlin 3 Network(router(model(and(NIC

! scheduler 3 Job3scheduler(simulation(models8

Detailed(Memory(Models

Dynamic(Trace3based(Processor(Model

Cycle3based(Processor(Model

High3level(Program(Communication(Models

Cycle3based(Network(Model

High3level(System(Workflow(Model

SST(Use(Cases

9

NVM-Based DIMM

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

Rank

Bank

NVM Internal Controller

Write Buffer

Requests Tracker

RequestBuffer Scheduler

Wear LevelerPower

Manager

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

NVM Chip

Rank

Memory Bus

Network

P

Scratch

P P

"$"

Fast

SlowMemory

P

Scratch

P P

"$"

Fast

SlowMemory

P

Scratch

P P

"$"

Fast

SlowMemory Network

P

Scratch

P P

"$"

Fast

SlowMemory

P

Scratch

P P

"$"

Fast

SlowMemory

P

Scratch

P P

"$"

Fast

SlowMemory

L3

L2

L1

L0

Emerging&NV&Memory&Technologies

Photonic&Network&Topology&&&Routing

MultiDLevel&Memory(HBM+DDR+NV)

Disaggregated&Memory

Communication/&Interconnect&Modeling

The(NNSA/ASC(Vanguard(Project! Expand'HPC'ecosystem'by(developing(ARM3based(system(necessary(to(enable(a(credible ARM3based(Advanced(Technology(System(offering(for(ASC(post32022! Mature(ARM(HPC(ecosystem

– Software(stack(and(toolchain((OS,(compilers,(scalable(MPI,(runtime,(system(management,(development(tools,(I/O,(…)

– On3package(high3bandwidth(memory– Advanced(HPC(interconnect

! Influence(Supplier/Integrator(community(to(accelerate(and(optimize(ARM(technologies(for(leadership(HPC,(and(foster(HPC(community(confidence

! Leverage(the(open(ARM(architecture(platform(to(enable(innovations(in(processor,(interconnect,(memory,(specialized(acceleration(and(packaging(technologies(that(focus&on&performance&and&scalability&of&our&multi3physics&production&codes&and(could(impact(2025(systems

10

The(Vanguard(Project(Approach

! Bridge(the(gap(between(what(we(can(accomplish(with(Testbed(project(and(fielding(an(Advanced(Technology(System(platform

! Vanguard(is(a(project,(not(a(single(platform! Accelerate(maturity(of(emerging(technologies(for(ASC(program

11

Test3BedsVanguardATS/Capability

Platforms

Greater3Stability,3Larger3Scale

Higher3Risk,3Greater3Architectural3Choices

Vanguard(Approach((cont.)

! Appropriate scale(systems! Must(be(large(enough(to(serve(as(proof(of(concept(for(future(Advanced(Technology(Systems

! Sufficient(scale(to(interest(production(multi3physics(application(teams! Appropriate investment

! Gain(and(maintain(vendor(and(collaborator(attention! Appropriate level(of(risk

! Goals(target(mission(workloads(but(not(turn3key(production(mission(support

12

Software(Environment(Plan(3 Overview! Goal:(Accelerate(maturity(of(ARM(ecosystem(for(ASC(computing(mission! Need(an(integrated(software(stack(for(the(2019(ARM(Prototype(to(enable(

application(development(and(optimization! Programming(environment((compilers,(math(libs,(tools,(MPI,(OMP,…..)! Low3level(OS((optimized(Linux,(I/O,(network,(containers(+(VMs,(PowerAPI,(…)! Job(Scheduling(and(management((WLM(Slurm?),(app(launcher,(user(tools,(….)! System(Management((boot,(monitoring,(image(mgt.,(rapid(re3provisioning,(….)

! Focus(Areas! Integration(and(robustness,(overall(user(experience! Address(known(weaknesses:(compilers,(libs,(and(tools! Increase(modularization(and(openness(of(system(software(stack;(seek(“plugin”(capability(for(

externally(developed(components.

13

Software(Environment(Collaboration(Roles

14

! System(Vendor! Deliver(and(support(core(elements(of(the(software(environment(necessary(

for(a(viable(integrated(system((part(of(system(contract)! Tri3lab(team

! Integrate(system(into(our(computing(environment! Identify(and(resolve(SW(issues(in(collaboration(with(system(vendor! Contribute(tools(and(other(capabilities(to(fill(gaps(and(improve(the(overall(

computing(environment! Other(Eco3system(Vendors/Stakeholders

! Linaro! OpenHPC! ARM/Allinea! Others

Drivers(for(ARM(Software(Stack1.(Focus(on(NNSA/ASC’s(unique(requirements

! Many(programs(are(already(exploring(HPC(on(ARM((e.g.,(Mont3Blanc,(Post3K)! Need(to(demonstrate(full(multi3physics(applications(running(at(scale(on(ARM

2.(Build(an(integrated(team! Integrate(work(across(tri3labs(and(vendors,(managed(as(single(project! Focus(on(strengths(of(each(laboratory,(support(technology(deployment

3.(Provide(a(robust(“traditional”(HPC(software(stack! Meet(user(expectations,(provide(familiar(environment! ARM(is(just(an(ISA;(much(of(system(software(stack(should(“just(work”

4.(Engage(with(users! Enable(low3effort(issue(reporting,(rapidly(resolve,(get(and(provide(feedback! Build(community(through(talks,(training,(hackathons,(workshops,(etc.

5.(Look(forward! Improve(support(for(real3world(workflows(and(data3centric(computing! Seek(to(reduce(friction(points(for(users(when(moving(between(platforms! Leverage(software(technology(from(US(DOE(Exascale Computing(Project

15

Technology(Development(Areas

16

! 2019(ARM(platform(is(not a(production(system(– increased(latitude(for(experimentation(and(R&D(activities! Need(to(address(options(for(scheduling(and(types(of(access(to(the(system(to(

provide(for(full(scale(projects,(alternate(environments,(etc.! Plan(to(pursue(several(technology(development(areas:

! On3package(high3bandwidth(memory! Advanced(HPC(Interconnect! Compilers,(math(libraries,(tools,(task3based(runtimes,(...

! Vendor(framework(that(OpenHPC components(get(plugged(into! OpenHPC framework(that(vendor(components(get(plugged(into

! Experimental(OS/R(stacks! Lightweight(kernels((Hobbes/Kitten,(Intel(mOS,(RIKEN(McKernel)! KVM(support(for(cyber(“simulate(the(Internet”(frameworks! Support(for(HPC(+(Analytics((including(AI(and(ML)(workflows(and(compositions! More(interactive(IaaS3style(usage(models

Required(Vendor(Support

17

! Modular(system(software(environment(architected(to(facilitate(site(modifications(and(additions.

! Documented(APIs(for(integrating(new(software(components(with(vendor(stack((e.g.(RAS(APIs,(Network,(I/O,(Health(of(subsystems,…)

! Ability(to(partition(the(system(and(manage(multiple(OS(images! Carve(out(dedicated(resources(for(R&D(activities! Boot(new(or(experimental(software(environments

! Buildable(source(– High(Priority! Ability(for(labs(to(debug(and(fix(system(software(issues

! For(example,(be(able(to(debug(MPI,(PMI,(Lustre,(Burst(Buffers,(app(load,(…! Ability(to(rebuild(vendor(Linux(kernels(with(modified(config options

! Requires(vendor(to(provide(source(for(all(“binary3only”(kernel(drivers(and(all(non3standard(Linux(kernel(patches

! Ability(to(build(and(evaluate(experimental(OS/R(stacks

Recap:(Goals(of(NNSA/ASC(Vanguard(Project

! Key(Requirements(for(this(2019(Platform! System(prototype(for(future(leadership(class(DOE(platform! Competitive(HPC(64(Bit(ARM(processor(technology! Integrated(On3package(Memory! First(of(a(kind,(Advanced(Interconnect(technology! Large(focus(on(maturing(the(ARM(software(ecosystem

! Opportunities(for(holistic co1design&of innovative test(hardware! Logic(in(NIC(to(improve(interconnection(network(performance! Logic(in/near(memory(to(support(sparse(linear(algebra(acceleration! Other(advanced(processor(architectures

18

Acknowledgements

! NNSA/ASC3Program:3 Mark&Anderson,&Douglas&Wade,&Thuc Hoang! Advanced3Architecture3Testbeds3Operations3Team: John&Noe,&Jim&Brandt,&Ann&Gentile,&Victor&Kuhns,&Nate&Gauntt,&Mike&Aguilar

! SST3Developers:33K.&Scott&Hemmert,&Branden&Moore,&Gwen&Voskuilen,&AmroAwad,&Clay&Hughes,&Jeremy&Wilke

! SST3SQE3team:33Jon&Wilson,&Aaron&Levine,&John&Van&Dyke! Tri+lab Vanguard3team: Jim&Laros,&Kevin&Pedretti,&Si&Hammond,&Rob&Hoekstra,&Ron&Brightwell,&Ken&Alvin,&Trent&D’Hooge,&James&Foraker,&Rob&Neely,&Becky&Springmeyer,&Bronis&de&Supinski,&Mike&Lang,&Pat&McCormick

! Cavium:33Gopal&Hegde,&Rishi&Chugh,&Larry&Wikelius,&David&Hass,&Giri&Chukkapalli&! ARM:33Nigel&Paver,&Eric&Van&Hensbergen,&Geraint North&! HPE:33Mike&Vildibill,&Nic&Dube,&Kelly&Pracht! Cray:33Duncan&Roweth,&Dan&Ernst,&Larry&Kaplan,&Heidi&Poxon

19

Current(NNSA(Architecture(Strategy

21

Advanced&Technology

Systems&&(ATS)

Fiscal&Year

‘13 ‘14 ‘15 ‘16 ‘17 ‘18

UseRetire

‘20‘19 ‘21

Commodity&

Technology

Systems&(CTS)&

Procure&&Deploy

Sequoia(((LLNL)

ATS'1'– Trinity''(LANL/SNL)

ATS(2(– Sierra((LLNL)

Tri3lab(Linux(Capacity(Cluster(II((TLCC(II)

CTS(1

CTS(2

‘22

SystemDelivery

ATS(3(– Crossroads((LANL/SNL)

ATS(4(– (LLNL)

‘23

ATS(5(– (LANL/SNL)