Powering the Road to National HPC Leadership ... ORNL is managed by UT-Battelle ...

Jack C. Wells, Director of Science Oak Ridge Leadership Computing Facility/Oak Ridge National Laboratory Join the Conversation #OpenPOWERSummit Powering the Road to National HPC Leadership

Transcript of Powering the Road to National HPC Leadership ... ORNL is managed by UT-Battelle ...

Page 1:

Jack C. Wells, Director of Science
Oak Ridge Leadership Computing Facility/Oak Ridge National Laboratory

Join the Conversation #OpenPOWERSummit

Powering the Road to National HPC Leadership

Page 2:

ORNL is managed by UT-Battelle for the US Department of Energy

Powering the Road to National HPC Leadership

Jack C. Wells
Director of Science
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory

2018 OpenPOWER Summit
Las Vegas
19 March 2018

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Some of the work presented here is from the TOTAL and Oak Ridge National Laboratory collaboration which is done under the CRADA agreement NFE-14-05227. Some of the experiments were supported by an allocation of advanced computing resources provided by the National Science Foundation. The computations were performed on Nautilus at the National Institute for Computational Sciences.

Page 3:

A Little About ORNL…

Oak Ridge, Tennessee

Oak Ridge National Laboratory is the largest US Department of Energy (DOE) open science laboratory

Page 4:

What is a Leadership Computing Facility (LCF)?

• Collaborative DOE Office of Science user-facility program at ORNL and ANL

• Mission: Provide the computational and data resources required to solve the most challenging problems.

• 2 centers/2 architectures to address diverse and growing computational needs of the scientific community

• Highly competitive user allocation programs (INCITE, ALCC).

• Projects receive 10x to 100x more resources than at other generally available centers.

• LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts).

Page 5:

ORNL has systematically delivered a series of leadership-class systems
On scope • On budget • Within schedule

Titan, five years old in October 2017, continues to deliver world-class science research in support of our user community. We will operate Titan through 2019, when it will be decommissioned.

1000-fold improvement in 8 years

[Timeline figure; project labels shown: OLCF-1, OLCF-2, OLCF-3]

• 2004 Cray X1E Phoenix: 18.5 TF
• 2005 Cray XT3 Jaguar: 25 TF
• 2006 Cray XT3 Jaguar: 54 TF
• 2007 Cray XT4 Jaguar: 62 TF
• 2008 Cray XT4 Jaguar: 263 TF
• 2008 Cray XT5 Jaguar: 1 PF
• 2009 Cray XT5 Jaguar: 2.5 PF
• 2012 Cray XK7 Titan: 27 PF
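The growth quoted on this slide can be checked with a few lines of arithmetic. The sketch below is an illustration only: it assumes the comparison runs from the first timeline entry (2004, 18.5 TF) to the last (2012, 27 PF), which the slide does not state, and the variable names are made up for the example.

```python
# Peak-performance figures as listed on the timeline, in teraflops (TF).
TF_PER_PF = 1000.0

phoenix_2004_tf = 18.5            # 2004 Cray X1E Phoenix
titan_2012_tf = 27 * TF_PER_PF    # 2012 Cray XK7 Titan (27 PF)

# Assumption: the slide's "8 years" spans 2004 to 2012.
years = 2012 - 2004
fold = titan_2012_tf / phoenix_2004_tf

# Equivalent constant year-over-year growth factor.
annual_growth = fold ** (1.0 / years)

print(f"{fold:.0f}-fold over {years} years")
print(f"about {annual_growth:.2f}x per year")
```

Under these assumed endpoints the ratio comes out somewhat above 1000, which is consistent with the slide's rounded "1000-fold" claim.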

Page 6:

We are building on this record of success to enable exascale in 2021

500-fold improvement in 9 years

• 2012 Cray XK7 Titan: 27 PF
• 2018 IBM Summit (OLCF-4): 200 PF
• 2021 Frontier (OLCF-5): ~1 EF

Page 7:

Coming in 2018: Summit will replace Titan as the OLCF’s leadership supercomputer

Summit, slated to be more powerful than any other existing supercomputer, is the Department of Energy’s Oak Ridge National Laboratory’s newest supercomputer for open science.

Page 8:

Summit Overview

Components

IBM POWER9
• 22 Cores
• 4 Threads/core
• NVLink

NVIDIA GV100
• 7 TF
• 16 GB @ 0.9 TB/s
• NVLink

Compute Node
• 2 x POWER9
• 6 x NVIDIA GV100
• NVMe-compatible PCIe 1600 GB SSD
• 25 GB/s EDR IB (2 ports)
• 512 GB DRAM (DDR4)
• 96 GB HBM (3D Stacked)
• Coherent Shared Memory

Compute Rack
• 18 Compute Servers
• Warm water (70°F direct-cooled components)
• RDHX for air-cooled components
• 39.7 TB Memory/rack
• 55 KW max power/rack

Compute System
• 10.2 PB Total Memory
• 256 compute racks
• 4,608 compute nodes
• Mellanox EDR IB fabric
• 200 PFLOPS
• ~13 MW

GPFS File System
• 250 PB storage
• 2.5 TB/s read, 2.5 TB/s write
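The rack and system figures on this slide follow from the per-node numbers. The sketch below reproduces them under two assumptions that the slide does not spell out: that "memory" counts DRAM, HBM and the NVMe SSD together, and that the peak figure counts only the GPUs at 7 TF each. Decimal units (1 TB = 1000 GB) are also assumed.

```python
# Per-node figures from the "Compute Node" box.
GPUS_PER_NODE = 6
TF_PER_GPU = 7
DRAM_GB = 512
HBM_GB = 96
NVME_GB = 1600

# Layout from the "Compute Rack" and "Compute System" boxes.
NODES_PER_RACK = 18
RACKS = 256

nodes = NODES_PER_RACK * RACKS

# Assumption: the slide's memory totals include the NVMe SSD.
node_memory_gb = DRAM_GB + HBM_GB + NVME_GB

rack_memory_tb = NODES_PER_RACK * node_memory_gb / 1000
system_memory_pb = nodes * node_memory_gb / 1e6

# Assumption: GPU-only peak; CPU contribution is ignored here.
system_peak_pf = nodes * GPUS_PER_NODE * TF_PER_GPU / 1000

print(f"nodes:         {nodes}")
print(f"memory/rack:   {rack_memory_tb:.1f} TB")
print(f"system memory: {system_memory_pb:.1f} PB")
print(f"GPU-only peak: {system_peak_pf:.1f} PF")
```

With these assumptions the node count, memory per rack and total memory match the slide, and the GPU-only peak lands a little under the rounded 200 PFLOPS headline.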

Page 9:

Summit Node Overview

[Node diagram: two POWER9 (P9) CPUs, each with 256 GB DRAM; six GPUs, each 7 TF with 16 GB HBM; one NIC; NVM]

Node totals:
• TF: 42 TF (6x7 TF)
• HBM: 96 GB (6x16 GB)
• DRAM: 512 GB (2x16x16 GB)
• NET: 25 GB/s (2x12.5 GB/s)
• MMsg/s: 83

Legend: HBM/DRAM Bus (aggregate B/W), NVLINK, X-Bus (SMP), PCIe Gen4, EDR IB

Bandwidth labels in the diagram: 900 GB/s (x6), 50 GB/s (x10), 135 GB/s (x2), 64 GB/s, 16 GB/s (x2), 12.5 GB/s (x2)

NVM: 6.0 GB/s Read, 2.2 GB/s Write

HBM & DRAM speeds are aggregate (Read+Write). All other speeds (X-Bus, NVLink, PCIe, IB) are bi-directional.
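The node totals on this slide are simple products of the per-component figures. The sketch below recomputes them; the class and field names are hypothetical, and reading "2x16x16 GB" as 2 CPUs x 16 memory modules x 16 GB is an assumption, since the slide gives only the factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummitNode:
    """Per-component figures taken from the node overview slide."""
    gpus: int = 6
    tf_per_gpu: float = 7.0
    hbm_gb_per_gpu: int = 16
    cpus: int = 2
    modules_per_cpu: int = 16     # assumption: middle factor of "2x16x16 GB"
    gb_per_module: int = 16
    ib_ports: int = 2
    gbs_per_ib_port: float = 12.5

    def totals(self) -> dict:
        # Each total is a product of the per-component values above.
        return {
            "TF": self.gpus * self.tf_per_gpu,
            "HBM_GB": self.gpus * self.hbm_gb_per_gpu,
            "DRAM_GB": self.cpus * self.modules_per_cpu * self.gb_per_module,
            "NET_GBs": self.ib_ports * self.gbs_per_ib_port,
        }


node_totals = SummitNode().totals()
for name, value in node_totals.items():
    print(f"{name}: {value}")
```

The results agree with the totals listed on the slide: 42 TF, 96 GB HBM, 512 GB DRAM and 25 GB/s network injection.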

Page 10:

Coming in 2018: Summit will replace Titan as the OLCF’s leadership supercomputer

• Many fewer nodes

• Much more powerful nodes

• Much more memory per node and total system memory

• Faster interconnect

• Much higher bandwidth between CPUs and GPUs

• Much larger and faster file system

Feature | Titan | Summit
Application Performance | Baseline | 5-10x Titan
Number of Nodes | 18,688 | 4,608
Node performance | 1.4 TF | 42 TF
Memory per Node | 32 GB DDR3 + 6 GB GDDR5 | 512 GB DDR4 + 96 GB HBM2
NV memory per Node | 0 | 1600 GB
Total System Memory | 710 TB | >10 PB DDR4 + HBM2 + Non-volatile
System Interconnect | Gemini (6.4 GB/s) | Dual Rail EDR-IB (25 GB/s)
Interconnect Topology | 3D Torus | Non-blocking Fat Tree
Bi-Section Bandwidth | 15.6 TB/s | 115.2 TB/s
Processors | 1 AMD Opteron™, 1 NVIDIA Kepler™ | 2 IBM POWER9™, 6 NVIDIA Volta™
File System | 32 PB, 1 TB/s, Lustre® | 250 PB, 2.5 TB/s, GPFS™
Power Consumption | 9 MW | 13 MW
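The bullet points above ("many fewer nodes", "much more powerful nodes", and so on) can be put into numbers by dividing the Summit column by the Titan column. This is a minimal sketch using only values from the comparison table; the dictionary keys are made-up names for the example.

```python
# Values copied from the Titan vs. Summit comparison table.
titan = {
    "nodes": 18688,
    "node_tf": 1.4,
    "bisection_tbs": 15.6,
    "fs_capacity_pb": 32,
    "fs_bandwidth_tbs": 1.0,
    "power_mw": 9,
}
summit = {
    "nodes": 4608,
    "node_tf": 42,
    "bisection_tbs": 115.2,
    "fs_capacity_pb": 250,
    "fs_bandwidth_tbs": 2.5,
    "power_mw": 13,
}

# Summit-to-Titan ratio for every feature present in both columns.
ratios = {key: summit[key] / titan[key] for key in titan}

for key, ratio in ratios.items():
    print(f"{key:>18}: {ratio:5.2f}x")
```

A ratio below 1 for `nodes` alongside a ratio of about 30 for `node_tf` is the "fewer but much more powerful nodes" trade-off in one line, while power consumption grows by well under a factor of two.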

Page 11:

What is CORAL? The program through which Summit & Sierra are procured.

• Several DOE labs have strong supercomputing programs and facilities.

• To bring the next generation of leading supercomputers to these labs, DOE created CORAL (the Collaboration of Oak Ridge, Argonne, and Livermore) to jointly procure these systems, and in so doing, align strategy and resources across the DOE enterprise.

• Collaboration grouping of DOE labs was done based on common acquisition timings. Collaboration is a win-win for all parties.

“Summit” System, “Sierra” System

OpenPOWER Technologies: IBM POWER CPUs, NVIDIA Tesla GPUs, Mellanox EDR 100Gb/s InfiniBand

Paving The Road to Exascale Performance

Page 12:

OLCF Program to Ready Application Developers and Users

• We are preparing users through:
  – Application Readiness and Early Science through Center for Accelerated Application Readiness (CAAR)
  – Training and web-based documentation
  – Early access on SummitDev and Summit Phase I system (already accepted)
  – Access for broader user base on final, accepted Phase II system

• Goals:
  – Early science achievements
  – Demonstrate application readiness
  – Prepare INCITE & ALCC proposals
  – Harden Summit for full-user operations

Page 13:

Summit Early Science Program (ESP)

• We put out a Call for Proposals in December 2017
  – Resulted in 62 Letters of Intent (LOI) received by year’s end.
    • 27 are from PIs at universities
    • 32 are from PIs at national laboratories or research institutions (DOE, NASA)
    • 14 are CAAR project-related LOIs
    • 27 have had past INCITE allocations
    • 9 have had past ALCC allocations
    • 15 have connections to the US DOE Exascale Computing Project
    • 9 are AI or deep learning-related
  – Proposals are due at the beginning of June
  – ESP Users will gain full access to Summit for early science later this year

Page 14:

Summit will be the world’s smartest supercomputer for open science
But what makes a supercomputer smart?

Summit provides unprecedented opportunities for the integration of artificial intelligence (AI) and scientific discovery. Here’s why:

• GPU Brawn: Summit links more than 27,000 deep-learning optimized NVIDIA GPUs with the potential to deliver exascale-level performance (a billion-billion calculations per second) for AI applications.

• High-speed Data Movement: NVLink high-bandwidth technology built into all of Summit’s processors supplies the next-generation “information superhighways” needed to train deep learning algorithms for challenging science problems quickly.

• Memory Where it Matters: Summit’s sizable local memory gives AI researchers a convenient launching point for data-intensive tasks, an asset that allows for faster AI training and greater algorithmic accuracy.

Photo caption: One of Summit’s 4,600 IBM AC922 nodes. Each node contains six NVIDIA Volta GPUs and two IBM Power9 CPUs, giving scientists new opportunities to automate, accelerate and drive understanding using artificial intelligence techniques.
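The "more than 27,000 GPUs" figure follows from the node count and the GPUs per node. A minimal check, assuming the exact node count of 4,608 given in the Summit overview rather than the rounded 4,600 in the caption:

```python
NODES = 4608          # compute nodes, from the Summit overview slide
GPUS_PER_NODE = 6     # six NVIDIA Volta GPUs per node

total_gpus = NODES * GPUS_PER_NODE
print(f"{total_gpus:,} GPUs")   # prints "27,648 GPUs"
```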

Page 15:

Summit will be the world’s smartest supercomputer for open science
But what can a smart supercomputer do?

Science challenges for a smart supercomputer:

Identifying Next-generation Materials
By training AI algorithms to predict material properties from experimental data, longstanding questions about material behavior at atomic scales could be answered for better batteries, more resilient building materials, and more efficient semiconductors.

Combating Cancer
Through the development of scalable deep neural networks, scientists at the US Department of Energy and the National Cancer Institute are making strides in improving cancer diagnosis and treatment.

Deciphering High-energy Physics Data
With AI supercomputing, physicists can lean on machines to identify important pieces of information, data that’s too massive for any single human to handle and that could change our understanding of the universe.

Predicting Fusion Energy
Predictive AI software is already helping scientists anticipate disruptions to the volatile plasmas inside experimental reactors. Summit’s arrival allows researchers to take this work to the next level and further integrate AI with fusion technology.

Page 16:

Summit is still under construction

• We expect to accept the machine in Summer of 2018, allow early users on this year, and allocate our first users through the INCITE program in January 2019.

• We are continuing node and file storage installation and software testing.

Page 17:

Questions?

Jack Wells

[email protected]