Transcript of GPU Clusters in HPC - GPU Technology...
GPU Clusters for HPC
Bill Kramer
Director of Blue Waters
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
National Center for Supercomputing Applications: 30 years of leadership
• NCSA
• R&D unit of the University of Illinois at Urbana-Champaign
• One of original five NSF-funded supercomputing centers
• Mission: Provide state-of-the-art computing capabilities (hardware, software, HPC expertise) to the nation's scientists and engineers
• The Numbers
• Approximately 200 staff (160+ technical/professional staff)
• Approximately 15 graduate students (+ new SPIN program), 15 undergrad students
• Two major facilities (NCSA Building, NPCF)
• Operating NSF’s most powerful computing system: Blue Waters
• Managing NSF’s national cyberinfrastructure: XSEDE
Source: Thom Dunning
Petascale Computing Facility: Home to Blue Waters
• Modern Data Center
• 90,000+ ft² total
• 30,000 ft² raised floor
• 20,000 ft² machine room gallery
• Energy Efficiency
• LEED Gold certified
• Power Usage Effectiveness (PUE) = 1.1–1.2
• Blue Waters
• 13 PF, 1,500 TB memory, 300 PB archive
• >1 PF sustained on real applications (NAMD, MILC, WRF, PPM, NWChem, etc.)
Data Intensive Computing
• LSST, DES
• Personalized Medicine w/ Mayo
NCSA's Industrial Partners
NCSA, NVIDIA and GPUs
• NCSA and NVIDIA have been partners for over a
decade, building the expertise, experience and
technology.
• The efforts were at first exploratory and small scale, but have now blossomed into the largest GPU production resource in US academic cyberinfrastructure
• Today, we are focusing on helping world class science
and engineering teams decrease their time to insight for
some of the world’s most important and challenging
computational and data analytical problems
Imaginations unbound
Original Blue Waters Goals
• Deploy a computing system capable of sustaining more than one petaflops for a broad range of applications
  • Cray system achieves this goal using well-defined metrics
• Enable the Science Teams to take full advantage of the sustained petascale computing system
  • Blue Waters Team has established strong partnerships with Science Teams, helping them to improve the performance and scalability of their applications
• Enhance the operation and use of the sustained petascale system
  • Blue Waters Team is developing tools, libraries and other system software to aid in operation of the system and to help scientists and engineers make effective use of the system
• Provide a world-class computing environment for the petascale computing system
  • The NPCF is a modern, energy-efficient data center with a rich WAN environment (100-400 Gbps) and data archive (>300 PB)
• Exploit advances in innovative computing technology
  • Proposal anticipated the rise of heterogeneous computing and planned to help the computational community transition to new modes for computational and data-driven science and engineering
Blue Waters Computing System (system diagram)
• Sonexion: 26 usable PB (>1 TB/sec)
• Spectra Logic: 300 usable PB (120+ Gb/sec)
• 10/40/100 Gb Ethernet switch (100 GB/sec)
• 100-300 Gbps WAN
• IB switch and external servers
• Aggregate memory: 1.6 PB
Details of Blue Waters
Computation by Discipline on Blue Waters (actual usage by discipline)
• Astronomy and Astrophysics: 17.8%
• Atmospheric and Climate Sciences: 10.4%
• Biology and Biophysics: 23.6%
• Chemistry: 6.5%
• Computer Science: 0.5%
• Earth Sciences: 2.0%
• Engineering: 0.05%
• Fluid Systems: 5.1%
• Geophysics: 1.3%
• Humanities: 0.0002%
• Materials Science: 3.3%
• Mechanical and Dynamic Systems: 0.03%
• Nuclear Physics: 0.7%
• Particle Physics: 25.9%
• Physics: 2.5%
• Social Sciences: 0.3%
• STEM Education: 0.01%
XK7 Usage by NSF PRAC teams – A Behavior Experiment – First Year
[Chart: XK7 usage by team, ordered by increasing allocation size; labeled codes include MD – Gromacs, QCD – MILC and Chroma, MD – NAMD/VMD, MD – Amber]
• An observed experiment – teams self-select what type of node is most useful
• First year of usage
Production Computational Science with XK nodes
• The Computational Microscope
• PI – Klaus Schulten
• Simulated flexibility of the ribosome trigger factor complex at full length and obtained a better starting configuration of the trigger factor model (simulated to 80 ns)
• 100 ns simulation of a cylindrical HIV 'capsule' of CA proteins revealed it is stabilized by hydrophobic interactions between CA hexamers; maturation involves detailed remodeling rather than disassembly/re-assembly of the CA lattice, as had been proposed
• 200 ns simulation of a CA pentamer surrounded by CA hexamers suggested the interfaces in hexamer-hexamer and hexamer-pentamer pairings involve different patterns of interactions
• Simulated the photosynthetic membrane of a chromatophore in the bacterium Rps. photometricum for 20 ns; a simulation of a few hundred nanoseconds will be needed
Images from Klaus Schulten and John Stone, University of Illinois at Urbana-Champaign
Production Computational Science with XK nodes
• Lattice QCD on Blue Waters
• PI - Robert Sugar, University of California, Santa Barbara
• The USQCD Collaboration, which consists of nearly all of
the high-energy and nuclear physicists in the United States
working on the numerical study of quantum
chromodynamics (QCD), will use Blue Waters to study the
theory of the strong interactions of sub-atomic physics,
including simulations at the physical masses of the up and
down quarks, the two lightest of the six quarks that are the
fundamental constituents of strongly interacting matter
Production Computational Science with XK nodes
• Hierarchical molecular dynamics sampling for assessing pathways and free energies of RNA catalysis, ligand binding, and conformational change
• PI – Thomas Cheatham, University of Utah
• Attempting to decipher the full landscape of RNA structure and function
• Challenging because:
  • RNA requires modeling the flexibility and subtle balance between charge, stacking and other molecular interactions
  • the structure of RNA is highly sensitive to its surroundings, and RNA can adopt multiple functionally relevant conformations
• Goal – Fully map out the conformational, energetic and chemical landscape of RNA
• "Essentially we are able to push enhanced sampling methodologies for molecular
dynamics simulation, specifically replica-exchange, to complete convergence for
conformational ensembles (which hasn't really been investigated previously) and
perform work that normally would take 6 months to years in weeks. This is
critically important for validating and assessing the force fields for nucleic acids,” -
Cheatham.
Images courtesy – T Cheatham
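The replica-exchange methodology Cheatham describes hinges on a Metropolis swap test between neighboring temperature replicas. A minimal illustrative sketch of that acceptance criterion (not the team's production AMBER workflow; the function name and two-replica setup are hypothetical):

```python
import math
import random

def swap_accepted(beta_i, beta_j, energy_i, energy_j, rng=random.random):
    """Metropolis criterion for exchanging two replicas held at inverse
    temperatures beta_i, beta_j with potential energies energy_i, energy_j."""
    # Detailed balance gives acceptance min(1, exp[(beta_i - beta_j)*(energy_i - energy_j)])
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return delta >= 0 or rng() < math.exp(delta)
```

A cold replica (high beta) that happens to hold higher energy than a hot one is always swapped, which is what lets conformational ensembles cross barriers and converge.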
Most Recent Computational Use of XK nodes
[Chart: Node*Hours per team, y-axis 0 to 9,000,000. Title: Teams with both XE and XK usage - July 1, 2014 to Sept 30, 2014. Series: Total Node*hrs, XK Node Hrs, XE Node Hrs. Teams: Karimabadi-3D Kinetic Sims. of…; Sugar-Lattice QCD; Yeung-Complex Turbulent Flows…; Schulten-The Computational…; Cheatham-MD Pathways and…; Aksimentiev-Pioneering…; Shapiro-Signatures of Compact…; Mori-Plasma Physics Sims. using…; Ott-CCSNe, Hypermassive…; Voth-Multiscale Sims. of…; Tajkhorshid-Complex Biology in…; Glotzer-Many-GPU Sims. of Soft…; Woosley-Type Ia Supernovae; Jordan-Earthquake System…; Aluru-QMC of H2O-Graphene,…; Tomko-Redesigning Comm. and…; Bernholc-Quantum Sims.…; Pande-Simulating Vesicle Fusion; Kasson-Influenza Fusion…; Chemla-Chemla; Lusk-Sys. Software for Scalable…; Thomas-QC during Steel…; Fields-Benchmark Human Variant…; Makri-QCPI Proton & Electron…; Hirata-Predictive Comp. of…; Elghobashi-DNS of Vaporizing…; Jongeneel-Accurate Gene…; Lazebnik-Large-Scale…; Beltran-Spot Scanning Proton…; Woodward-Turbulent Stellar…]
Evolving XK7 Use on BW - Major Advance in Understanding of Collisionless Plasmas Enabled through Petascale Kinetic Simulations
• PI: Homayoun Karimabadi, University of California, San Diego
• Major results to date:
  • Global fully kinetic simulations of magnetic reconnection
  • First large-scale 3D simulations of decaying collisionless plasma turbulence
  • 3D global hybrid simulations addressing coupling between shock physics & magnetosheath turbulence
• Fully kinetic simulation (all species kinetic; code: VPIC): up to ~10^10 cells, up to ~4x10^12 particles, ~120 TB of memory, ~10^7 CPU-hours, up to ~500,000 cores
• Large-scale hybrid kinetic simulation (kinetic ions + fluid electrons; codes: H3D, HYPERES): up to ~1.7x10^10 cells, up to ~2x10^12 particles, ~130 TB of memory
Slide courtesy of H Karimabadi
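As a quick consistency check on the fully kinetic run quoted above (my arithmetic, not from the slide), the memory footprint divided by the particle count lands in the range typical of compact single-precision PIC particle records:

```python
# Back-of-envelope check of the VPIC run sizes quoted on the slide.
# Decimal terabytes are assumed; the slide does not state its unit convention.
memory_bytes = 120e12        # ~120 TB of memory
particles = 4e12             # ~4x10^12 particles
bytes_per_particle = memory_bytes / particles   # 30.0 bytes/particle
```

Roughly 30 bytes per particle is plausible for a position + momentum + weight record in single precision, which suggests the quoted figures are self-consistent.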
Evolving XK7 Use on BW - Petascale Particle-in-Cell Simulations of Kinetic Effects in Plasmas
• PI – Warren Mori; Presenter – Frank Tsung
• Use six parallel particle-in-cell (PIC) codes to investigate four key science areas:
  • Can fast ignition be used to develop inertial fusion energy?
  • What is the source of the most energetic particles in the cosmos?
  • Can plasma-based acceleration be the basis of new compact accelerators for use at the energy frontier, in medicine, in probing materials, and in novel light sources?
  • What processes trigger substorms in the magnetotail?
• Evaluating new particle-in-cell (PIC) algorithms on GPU and comparing to standard implementations
  • Electromagnetic case: 2-1/2D EM benchmark with 2048x2048 grid, 150,994,944 particles, 36 particles/cell; optimal block size = 128, optimal tile size = 16x16; single precision; Fermi M2090 GPU
• First result: OSIRIS sustained 2 PF on BW
• Complex interaction could not be understood without the simulations performed on BW
Image and Information courtesy of
Warren Mori and Frank Tsung
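The PIC loop these codes implement (deposit charge, solve for the field, gather, push) can be illustrated with a toy 1D electrostatic step. This is a hedged sketch under simplified assumptions, not OSIRIS or the team's GPU kernels; note, though, that the quoted benchmark is self-consistent: 2048 x 2048 cells x 36 particles/cell = 150,994,944 particles.

```python
import numpy as np

def pic_step(x, v, q_over_m, grid_n, length, dt):
    """One 1D electrostatic PIC step on a periodic domain:
    deposit charge, solve Poisson spectrally, gather the field, push."""
    dx = length / grid_n
    idx = np.floor(x / dx).astype(int) % grid_n
    # nearest-grid-point charge deposition with a neutralizing background
    rho = np.bincount(idx, minlength=grid_n).astype(float)
    rho -= rho.mean()
    # spectral Poisson solve: d^2(phi)/dx^2 = -rho  =>  phi_k = rho_k / k^2
    k = 2 * np.pi * np.fft.fftfreq(grid_n, d=dx)
    k[0] = 1.0                      # placeholder; the mean mode is zeroed below
    phi_k = np.fft.fft(rho) / k**2
    phi_k[0] = 0.0
    E = np.real(np.fft.ifft(-1j * k * phi_k))   # E = -d(phi)/dx
    # gather E at each particle's cell, then push and wrap positions
    v = v + q_over_m * E[idx] * dt
    x = (x + v * dt) % length
    return x, v
```

Production codes replace every line here with something harder: higher-order particle shapes, full electromagnetics, and the tiled/blocked data layouts the GPU benchmark above is tuning.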
Evolving XK7 Use on BW - Comparison of 1D and 3D CyberShake Models for the Los Angeles Region
[Figure: hazard maps for CVM-S4.26 (3D) vs BBP-1D, with regions 1-4 annotated]
1. lower near-fault intensities due to 3D scattering
2. much higher intensities in near-fault basins
3. higher intensities in the Los Angeles basins
4. lower intensities in hard-rock areas
Slide courtesy of T Jordan - SCEC
XK7 For Visualization on Blue Waters
• Many visualization utilities rely on the OpenGL API for
hardware-accelerated rendering
• Unsupported by default XK7 system software
• Enabling NVIDIA’s OpenGL required that we:
  • Change the operating mode of the XK7 GPU firmware
  • Develop a custom X11 stack
  • Work with Cray to acquire an alternate driver package from NVIDIA
• Blue Waters was the first Cray to offer this functionality, which has since been distributed to other systems
Impact: VMD
• Molecular dynamics analysis and
visualization tool used by “The
Computational Microscope”
science team (PI Klaus Schulten)
• 10X to 50X rendering speedup in
VMD
• Interactive rate visualization
• Drastic reduction in required time to
fine tune parameters for production
visualization
Impact of integrated system reduces data
movement
Computational fluid dynamics volume renderer used by “Petascale Simulation of Turbulent Stellar Hydrodynamics” science team (PI Paul R. Woodward)
Visualization created on Blue Waters:
• 10,560³ grid inertial confinement fusion (ICF) calculation (26 TB)
• 13,688 frames at 2048x1080 pixels
• 711 frame stereo movie (2 views) at
4096x2160 pixels
• Total rendering time: 24 hours
• Estimated time to just ship the data to the team’s remote site, where they had been doing visualization (no rendering): 15 days
• 20-30x improvement in time to insight
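The shipping estimate above implies a strikingly low effective WAN rate for the team's old remote workflow. A quick check (my arithmetic, assuming decimal terabytes; the slide gives only the totals):

```python
# 26 TB shipped in 15 days vs. 24 hours of in-place rendering on Blue Waters
dataset_bytes = 26e12
ship_seconds = 15 * 86400
effective_rate_mb_s = dataset_bytes / ship_seconds / 1e6   # ~20 MB/s
transfer_vs_render = (15 * 24) / 24                        # 15x on transfer alone
```

Transfer alone accounts for a 15x gap; the slide's 20-30x "time to insight" figure presumably also counts the remote rendering time that shipping would have preceded.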
Summary
• UIUC, NCSA and NVIDIA have had a very strong partnership for some time
• NCSA has helped move GPU computing into the
mainstream for several discipline areas
• Molecular dynamics, particle physics, seismic, …
• NCSA is leading innovation in use of GPUs for grand
challenges
• Blue Waters has unique capabilities for computation and data analysis
• There is still much work to do in order to make GPU
processing a standard way of doing real computational
science and modeling for all disciplines
Backup Other Slides
Science Area | Teams | Codes | Methods used (of: Struct Grids, Unstruct Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, Significant I/O)
Climate and Weather | 3 | CESM, GCRM, CM1/WRF, HOMME | X X X X X
Plasmas/Magnetosphere | 2 | H3D(M), VPIC, OSIRIS, Magtail/UPIC | X X X X
Stellar Atmospheres and Supernovae | 5 | PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS | X X X X X X
Cosmology | 2 | Enzo, pGADGET | X X X
Combustion/Turbulence | 2 | PSDNS, DISTUF | X X
General Relativity | 2 | Cactus, Harm3D, LazEV | X X
Molecular Dynamics | 4 | AMBER, Gromacs, NAMD, LAMMPS | X X X
Quantum Chemistry | 2 | SIAL, GAMESS, NWChem | X X X X X
Material Science | 3 | NEMOS, OMEN, GW, QMCPACK | X X X X
Earthquakes/Seismology | 2 | AWP-ODC, HERCULES, PLSQR, SPECFEM3D | X X X X
Quantum Chromodynamics | 1 | Chroma, MILC, USQCD | X X X X X
Social Networks | 1 | EPISIMDEMICS |
Evolution | 1 | Eve |
Engineering/System of Systems | 1 | GRIPS, Revisit | X
Computer Science | 1 | | X X X X X
Blue Waters Symposium
• May 12-15 – after the first year of full service
• https://bluewaters.ncsa.illinois.edu/symposium-2014-schedule
• About 180 people attended – over 120 from outside Illinois
• 54 individual science talks
Climate – courtesy of Don Weubbles
Petascale Simulations of Complex Biological Behavior in Fluctuating Environments
• Project PI: Ilias Tagkopoulos, University of California, Davis
• Simulated 128,000 organisms
• Previous best was 200 on Blue Gene
Image and Information courtesy of Ilias Tagkopoulos
Selected Highlights
• PI - Keith Bisset, of the Network
Dynamics and Simulation Science
Laboratory at Virginia Tech
• Simulated 280 million people (the US population) for 120 days on 352,000 cores (11,000 nodes) on Blue Waters
• The simulation took 12 seconds
• Estimated that the world population would take 6-10 minutes per scenario
• Emphasized that a realistic assessment of disease threat would require many such runs
Image and Information courtesy of K Bisset
P.K. Yeung – DNS Turbulence – Topology
• 8,192³ grid points – 0.5 trillion
Slide courtesy of P.K. Yeung
Inference Spiral of System Science
(PI T Jordan)
• As models become more complex and new data bring in more
information, we require ever increasing computational resources
Jordan et al. (2010)
Slide courtesy of T Jordan - SCEC
CyberShake Time-to-Solution Comparison

CyberShake Application Metrics (Hours) | 2008 (Mercury, normalized) | 2009 (Ranger, normalized) | 2013 (Blue Waters / Stampede) | 2014 (Blue Waters)
Application Core Hours | 19,488,000 (CPU) | 16,130,400 (CPU) | 12,200,000 (CPU) | 15,800,000 (CPU+GPU)
Application Makespan | 70,165 | 6,191 | 1,467 | 342

Los Angeles Region Hazard Models (1144 sites)
Metric | 2013 (Study 13.4) | 2014 (Study 14.2)
Simultaneous processors | 21,100 (CPU) | 46,720 (CPU) + 160 (GPU)
Concurrent Workflows | 5.8 | 26.2
Job Failure Rate | 2.6% | 1.3%
Data transferred | 57 TB | 12 TB

4.2x quicker time to insight
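The year-over-year gain quoted beneath the table follows directly from the makespan row (my check of the slide's arithmetic; the quoted 4.2x presumably reflects unrounded inputs):

```python
# Application makespan in hours, from the CyberShake comparison table above
makespan = {2008: 70_165, 2009: 6_191, 2013: 1_467, 2014: 342}
speedup_2014_vs_2013 = makespan[2013] / makespan[2014]   # ~4.3x
overall_2008_to_2014 = makespan[2008] / makespan[2014]   # ~205x
```

The striking figure is the cumulative one: six years of platform and workflow improvements cut the hazard-model makespan by roughly two hundred times.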
Slide courtesy of T Jordan - SCEC
OTHER FUN DATA
Q1 2014 XE Scale
• 128,000 integer cores
Q1 2014 XK Scale