ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16...
Transcript of ASC ARM OverviewGoingARM-FinalSmaller · Memory(Bandwidth(Intensive(Kernels 4 0 2 4 6 8 10 12 0 16...
Sandia&National&Laboratories&is&a&multimission laboratory&managed&and&operated&by&National&Technology&and&Engineering&Solutions&of&Sandia,&LLC,&a&wholly&owned&subsidiary&of&Honeywell&International,&Inc.,&for&the&U.S.&Department&of&Energy's&National&Nuclear&Security&Administration&under&contract&DEDNAD0003525.
Sandia’s(ARM–centric(Co3Design(StrategyIntroduction(to(the(NNSA/ASC(Vanguard(Project(
James(A.(Ang,(Ron(Brightwell,(Simon(D.(Hammond,(K.(Scott(Hemmert.(Robert(J.(Hoekstra,(James(H.(Laros III,(Kevin(Pedretti and(Arun F.(RodriguesCenter(for(Computing(ResearchSandia(National(LaboratoriesAlbuquerque,(NM(8718531319((USA
ARM(Research(Summit,(Going(ARM(WorkshopRobinson(College,(Cambridge,(UKSeptember(11313,(2017
Sandia&National&Laboratories&is&a&multimission laboratory&managed&and&operated&by&National&Technology&and&Engineering&Solutions&of&Sandia,&LLC,&a&wholly&owned&subsidiary&of&Honeywell&International,&Inc.,&for&the&U.S.&Department&of&Energy's&National&Nuclear&Security&Administration&under&contract&DEDNAD0003525.
1
SAND2017D9713&C
Sandia(ARM(R&D(Overview! Sandia's(ASC(Advanced(Architecture(Testbed(project
! For(co3design(early(access(to(new(architectural(features(is(critical(! Positive(ROI(for(the(pain(associated(with(early(hardware(&(immature(
software! ARM(testbed(experiences(– with(Cavium’s(permission(to(share(pre3
production(measurements(of(relative(performance
! Architectural(Simulators! Structural(Simulation(Toolkit((SST),(Open(architectural(simulation(
framework! Open(Source,(Open(Development,(catalyst(for(collaboration(that(can(
be(proprietary! Links(to(Sandia/DOE's(experience(with(application(drivers
! Sandia(is(responsible(for(the(NNSA/ASC's(first(large&scaleARM(platform! Overview(of(key(requirements(targeting(2019
2
APM/HPE
Xgene+1
Cavium/HP3Labs
ThunderX1
=32019
=3Retired
=32015
=32017
TODAY
ARM
Next3Gen
ExascalePlatform
Sept32011
ARM
Next3Gen
Sandia’s(NNSA/ASC(ARM(Platforms
3
Small3Pre+
Vanguard3
early3access3
system3– FY18
Install3FY193Target3
ThunderX33(Triton)
Cavium&ThunderX1
PreDproduction&Cavium&ThunderX2
Memory(Bandwidth(Intensive(Kernels
4
0
2
4
6
8
10
12
0 16 32 48 64 80 96
Speedup3vs.3Pre+Prod3TX1
OpenMP Threads
STREAM3Triad
ThunderX1&D Final& ThunderX1&D PreDProductionThunderX2&PreDProduction&(Wave&1)
0
1
2
3
4
CG&Solve WAXPY DOT MATVEC
Speedup3vs.3Pre+Prod3TX1
MiniFE CG3Solve/Kernels
ThunderX1&D PreDProduction ThunderX1&Final&SiliconThunderX2&D PreDProduction&(Wave&1)
Benchmarked(on(maximum(number(of(cores((node3node(comparison)
https://github.com/Mantevo/miniFE (Sandia&National&Labs)
Significant(improvement(in(memory(bandwidth((~4X)
Individual(cores(more(powerful ~2.5X
Compute(Intensive(Kernels
5
0
1
2
3
Speedup3vs.3Pre+Prod3TX1
LULESH3OpenMP (Figure3of3Merit)
ThunderX1&D PreDProductionThunderX1&D Final&ThunderX2&PreDProduction&(Wave&1)
~2.7X
0
1
2
3
Speedup3vs.3Pre+Prod3TX1
PENNANT3(Hydro3Solve3Time)
ThunderX1&D PreDProduction ThunderX1&D Final&ThunderX2&D PreDProduction&(Wave&1)
https://github.com/lanl/PENNANT (Los&Alamos&Nat.&Lab) https://codesign.llnl.gov/lulesh.php (Lawrence&Livermore&Nat.&Lab)
~2.2X
Full(Engineering(Applications(on(ARM
! On(Pre3production(TX2,(Sandia(has(been(actively(working(on(libraries(and(packages(for(its(ARM(systems! Math(Libraries/Kernels((Trilinos)! I/O(Libraries((Exodus(Mesh)! YAML
! Ported(Sandia’s(open3source(Trinity&acceptance(Nalu application(to(ARM! Representative(of(some(SIERRA(engineering(
applications! Complex(mesh(handling,(load(balancing,(
solvers,(etc.6
https://github.com/NaluCFD
Architectural(Simulation(Framework:(SST! Use(Supercomputers(to(Design(Supercomputers! Parallel(Discrete3Event(Simulator(Framework
! Flexible(framework(allows(multitude(of(custom(simulators! Demonstrated(scaling(to(over(512(processors
! Comes(with(many(built3in(simulation(models! Processors,(Memory,(Network
! Open(API! Easily(extensible(with(new(models! Modular(framework! (Non3Viral)(Open3source(core
! Time3scale(independent(core! Handles(Micro3,(Meso3,(Macro3scale(simulations
! “Best(of(Breed”(– Bring(together(work(from(Labs,(Industry,(Academia
7
SST::Component SST::Component
SST3Core
Configuration
Partitioning
Link
Event
InstantiationTime3Coordination
Parallel3Communication
http://sst3simulator.org/
Example(SST(Element(Libraries! memHierarchy 3 Cache(and(Memory(! cassini 3 Cache(prefetchers! DRAMSim 3 DDR! NVDIMMSim 3 Emerging(Memories
! ariel 3 PIN3based(Tracing
! m5C 3 Gem5(integration(layer
! ember 3 State3machine(Message(generation! firefly 3 Communication(Protocols! hermes 3 MPI3like(interface
! merlin 3 Network(router(model(and(NIC
! scheduler 3 Job3scheduler(simulation(models8
Detailed(Memory(Models
Dynamic(Trace3based(Processor(Model
Cycle3based(Processor(Model
High3level(Program(Communication(Models
Cycle3based(Network(Model
High3level(System(Workflow(Model
SST(Use(Cases
9
NVM-Based DIMM
NVM Chip
NVM Chip
NVM Chip
NVM Chip
NVM Chip
NVM Chip
NVM Chip
NVM Chip
Rank
Bank
NVM Internal Controller
Write Buffer
Requests Tracker
RequestBuffer Scheduler
Wear LevelerPower
Manager
NVM Chip
NVM Chip
NVM Chip
NVM Chip
NVM Chip
NVM Chip
NVM Chip
NVM Chip
Rank
Memory Bus
Network
P
Scratch
P P
"$"
Fast
SlowMemory
P
Scratch
P P
"$"
Fast
SlowMemory
P
Scratch
P P
"$"
Fast
SlowMemory Network
P
Scratch
P P
"$"
Fast
SlowMemory
P
Scratch
P P
"$"
Fast
SlowMemory
P
Scratch
P P
"$"
Fast
SlowMemory
L3
L2
L1
L0
Emerging&NV&Memory&Technologies
Photonic&Network&Topology&&&Routing
MultiDLevel&Memory(HBM+DDR+NV)
Disaggregated&Memory
Communication/&Interconnect&Modeling
The(NNSA/ASC(Vanguard(Project! Expand'HPC'ecosystem'by(developing(ARM3based(system(necessary(to(enable(a(credible ARM3based(Advanced(Technology(System(offering(for(ASC(post32022! Mature(ARM(HPC(ecosystem
– Software(stack(and(toolchain((OS,(compilers,(scalable(MPI,(runtime,(system(management,(development(tools,(I/O,(…)
– On3package(high3bandwidth(memory– Advanced(HPC(interconnect
! Influence(Supplier/Integrator(community(to(accelerate(and(optimize(ARM(technologies(for(leadership(HPC,(and(foster(HPC(community(confidence
! Leverage(the(open(ARM(architecture(platform(to(enable(innovations(in(processor,(interconnect,(memory,(specialized(acceleration(and(packaging(technologies(that(focus&on&performance&and&scalability&of&our&multi3physics&production&codes&and(could(impact(2025(systems
10
The(Vanguard(Project(Approach
! Bridge(the(gap(between(what(we(can(accomplish(with(Testbed(project(and(fielding(an(Advanced(Technology(System(platform
! Vanguard(is(a(project,(not(a(single(platform! Accelerate(maturity(of(emerging(technologies(for(ASC(program
11
Test3BedsVanguardATS/Capability
Platforms
Greater3Stability,3Larger3Scale
Higher3Risk,3Greater3Architectural3Choices
Vanguard(Approach((cont.)
! Appropriate scale(systems! Must(be(large(enough(to(serve(as(proof(of(concept(for(future(Advanced(Technology(Systems
! Sufficient(scale(to(interest(production(multi3physics(application(teams! Appropriate investment
! Gain(and(maintain(vendor(and(collaborator(attention! Appropriate level(of(risk
! Goals(target(mission(workloads(but(not(turn3key(production(mission(support
12
Software(Environment(Plan(3 Overview! Goal:(Accelerate(maturity(of(ARM(ecosystem(for(ASC(computing(mission! Need(an(integrated(software(stack(for(the(2019(ARM(Prototype(to(enable(
application(development(and(optimization! Programming(environment((compilers,(math(libs,(tools,(MPI,(OMP,…..)! Low3level(OS((optimized(Linux,(I/O,(network,(containers(+(VMs,(PowerAPI,(…)! Job(Scheduling(and(management((WLM(Slurm?),(app(launcher,(user(tools,(….)! System(Management((boot,(monitoring,(image(mgt.,(rapid(re3provisioning,(….)
! Focus(Areas! Integration(and(robustness,(overall(user(experience! Address(known(weaknesses:(compilers,(libs,(and(tools! Increase(modularization(and(openness(of(system(software(stack;(seek(“plugin”(capability(for(
externally(developed(components.
13
Software(Environment(Collaboration(Roles
14
! System(Vendor! Deliver(and(support(core(elements(of(the(software(environment(necessary(
for(a(viable(integrated(system((part(of(system(contract)! Tri3lab(team
! Integrate(system(into(our(computing(environment! Identify(and(resolve(SW(issues(in(collaboration(with(system(vendor! Contribute(tools(and(other(capabilities(to(fill(gaps(and(improve(the(overall(
computing(environment! Other(Eco3system(Vendors/Stakeholders
! Linaro! OpenHPC! ARM/Allinea! Others
Drivers(for(ARM(Software(Stack1.(Focus(on(NNSA/ASC’s(unique(requirements
! Many(programs(are(already(exploring(HPC(on(ARM((e.g.,(Mont3Blanc,(Post3K)! Need(to(demonstrate(full(multi3physics(applications(running(at(scale(on(ARM
2.(Build(an(integrated(team! Integrate(work(across(tri3labs(and(vendors,(managed(as(single(project! Focus(on(strengths(of(each(laboratory,(support(technology(deployment
3.(Provide(a(robust(“traditional”(HPC(software(stack! Meet(user(expectations,(provide(familiar(environment! ARM(is(just(an(ISA;(much(of(system(software(stack(should(“just(work”
4.(Engage(with(users! Enable(low3effort(issue(reporting,(rapidly(resolve,(get(and(provide(feedback! Build(community(through(talks,(training,(hackathons,(workshops,(etc.
5.(Look(forward! Improve(support(for(real3world(workflows(and(data3centric(computing! Seek(to(reduce(friction(points(for(users(when(moving(between(platforms! Leverage(software(technology(from(US(DOE(Exascale Computing(Project
15
Technology(Development(Areas
16
! 2019(ARM(platform(is(not a(production(system(– increased(latitude(for(experimentation(and(R&D(activities! Need(to(address(options(for(scheduling(and(types(of(access(to(the(system(to(
provide(for(full(scale(projects,(alternate(environments,(etc.! Plan(to(pursue(several(technology(development(areas:
! On3package(high3bandwidth(memory! Advanced(HPC(Interconnect! Compilers,(math(libraries,(tools,(task3based(runtimes,(...
! Vendor(framework(that(OpenHPC components(get(plugged(into! OpenHPC framework(that(vendor(components(get(plugged(into
! Experimental(OS/R(stacks! Lightweight(kernels((Hobbes/Kitten,(Intel(mOS,(RIKEN(McKernel)! KVM(support(for(cyber(“simulate(the(Internet”(frameworks! Support(for(HPC(+(Analytics((including(AI(and(ML)(workflows(and(compositions! More(interactive(IaaS3style(usage(models
Required(Vendor(Support
17
! Modular(system(software(environment(architected(to(facilitate(site(modifications(and(additions.
! Documented(APIs(for(integrating(new(software(components(with(vendor(stack((e.g.(RAS(APIs,(Network,(I/O,(Health(of(subsystems,…)
! Ability(to(partition(the(system(and(manage(multiple(OS(images! Carve(out(dedicated(resources(for(R&D(activities! Boot(new(or(experimental(software(environments
! Buildable(source(– High(Priority! Ability(for(labs(to(debug(and(fix(system(software(issues
! For(example,(be(able(to(debug(MPI,(PMI,(Lustre,(Burst(Buffers,(app(load,(…! Ability(to(rebuild(vendor(Linux(kernels(with(modified(config options
! Requires(vendor(to(provide(source(for(all(“binary3only”(kernel(drivers(and(all(non3standard(Linux(kernel(patches
! Ability(to(build(and(evaluate(experimental(OS/R(stacks
Recap:(Goals(of(NNSA/ASC(Vanguard(Project
! Key(Requirements(for(this(2019(Platform! System(prototype(for(future(leadership(class(DOE(platform! Competitive(HPC(64(Bit(ARM(processor(technology! Integrated(On3package(Memory! First(of(a(kind,(Advanced(Interconnect(technology! Large(focus(on(maturing(the(ARM(software(ecosystem
! Opportunities(for(holistic co1design&of innovative test(hardware! Logic(in(NIC(to(improve(interconnection(network(performance! Logic(in/near(memory(to(support(sparse(linear(algebra(acceleration! Other(advanced(processor(architectures
18
Acknowledgements
! NNSA/ASC3Program:3 Mark&Anderson,&Douglas&Wade,&Thuc Hoang! Advanced3Architecture3Testbeds3Operations3Team: John&Noe,&Jim&Brandt,&Ann&Gentile,&Victor&Kuhns,&Nate&Gauntt,&Mike&Aguilar
! SST3Developers:33K.&Scott&Hemmert,&Branden&Moore,&Gwen&Voskuilen,&AmroAwad,&Clay&Hughes,&Jeremy&Wilke
! SST3SQE3team:33Jon&Wilson,&Aaron&Levine,&John&Van&Dyke! Tri+lab Vanguard3team: Jim&Laros,&Kevin&Pedretti,&Si&Hammond,&Rob&Hoekstra,&Ron&Brightwell,&Ken&Alvin,&Trent&D’Hooge,&James&Foraker,&Rob&Neely,&Becky&Springmeyer,&Bronis&de&Supinski,&Mike&Lang,&Pat&McCormick
! Cavium:33Gopal&Hegde,&Rishi&Chugh,&Larry&Wikelius,&David&Hass,&Giri&Chukkapalli&! ARM:33Nigel&Paver,&Eric&Van&Hensbergen,&Geraint North&! HPE:33Mike&Vildibill,&Nic&Dube,&Kelly&Pracht! Cray:33Duncan&Roweth,&Dan&Ernst,&Larry&Kaplan,&Heidi&Poxon
19
Current(NNSA(Architecture(Strategy
21
Advanced&Technology
Systems&&(ATS)
Fiscal&Year
‘13 ‘14 ‘15 ‘16 ‘17 ‘18
UseRetire
‘20‘19 ‘21
Commodity&
Technology
Systems&(CTS)&
Procure&&Deploy
Sequoia(((LLNL)
ATS'1'– Trinity''(LANL/SNL)
ATS(2(– Sierra((LLNL)
Tri3lab(Linux(Capacity(Cluster(II((TLCC(II)
CTS(1
CTS(2
‘22
SystemDelivery
ATS(3(– Crossroads((LANL/SNL)
ATS(4(– (LLNL)
‘23
ATS(5(– (LANL/SNL)