Challenges using Roadrunner
Transcript of Challenges using Roadrunner
Slide 1
Challenges using Roadrunner
Ken Koch
Los Alamos National Laboratory
Operated by the Los Alamos National Security, LLC for the DOE/NNSA
Slide 2
The Roadrunner Petaflop System
More at http://www.lanl.gov/roadrunner
PowerXCell 8i: an improved Cell processor
• 8 optimized vector cores (SPEs/SPUs) plus a PowerPC CPU (PPE)
• Vastly improved double-precision performance
• Larger DDR2-based memory
• Connections to PCIe and to DDR2 memory
Triblade node with PCIe-connected Cells
• LS21 (2 Opterons) with expansion blade
• Two QS22's (2 Cells each)
• Design objective: one Cell processor for every Opteron core, plus the same memory footprint for each (4 GB each), with the fastest feasible interconnects
Connected Unit (CU) cluster
• 180 compute nodes w/ Cells + 12 I/O nodes
• 12 links per CU to each of 8 switches
Full system
• 17 CUs, 3,264 nodes
• Eight 2nd-stage 288-port IB 4X DDR switches
• 12,240 PowerXCell 8i chips ⇒ 1.33 PF, 49 TB
• 6,120 dual-core Opterons ⇒ 44 TF, 49 TB
Three programs work together
• SPE program on the SPEs (8): SPE compiler; POSIX threads, DMA, IBM ALF, LANL CML; reached over the on-chip ring bus
• PPE program on the PPE: PowerPC compiler; IBM DaCS, IBM ALF, LANL CML; reached over PCIe
• x86 program on the Opteron: x86 compiler; MPI (cluster) over IB (one per node)
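The system totals on this slide follow from the per-node figures. The sketch below redoes that arithmetic in C. It assumes that only the 180 compute nodes per CU carry Opterons and Cells that count toward the totals, which is how the slide's numbers work out; the function names are illustrative and not part of any Roadrunner software. The per-chip peak is only derived here from the slide's own totals, not taken from a datasheet.

```c
#include <assert.h>

/* Per-node and per-CU figures as stated on the slide. */
enum {
    NUM_CUS              = 17,
    COMPUTE_NODES_PER_CU = 180,
    IO_NODES_PER_CU      = 12,
    OPTERONS_PER_NODE    = 2,  /* one LS21 blade */
    CELLS_PER_NODE       = 4,  /* two QS22 blades, 2 Cells each */
    GB_PER_CELL          = 4
};

/* All nodes, compute plus I/O. */
int total_nodes(void)
{
    return NUM_CUS * (COMPUTE_NODES_PER_CU + IO_NODES_PER_CU);
}

/* Assumption: only compute nodes contribute Cells and counted Opterons. */
int total_cells(void)
{
    return NUM_CUS * COMPUTE_NODES_PER_CU * CELLS_PER_NODE;
}

int total_opterons(void)
{
    return NUM_CUS * COMPUTE_NODES_PER_CU * OPTERONS_PER_NODE;
}

int cell_memory_gb(void)
{
    return total_cells() * GB_PER_CELL;
}

/* Peak per chip implied by a system-wide peak given in TF. */
double implied_gflops_per_chip(double peak_tflops, int chips)
{
    assert(chips > 0);
    return peak_tflops * 1000.0 / chips;
}
```

With these definitions, total_nodes() gives 3,264, total_cells() gives 12,240, total_opterons() gives 6,120, and cell_memory_gb() gives 48,960 GB (about 49 TB), matching the slide. Dividing 1.33 PF by 12,240 chips implies roughly 109 GF per PowerXCell 8i.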
Slide 3
IBM created the PowerXCell 8i, an improved variant of the PlayStation 3 Cell processor.
• Cell Broadband Engine (CBE*) developed by Sony-Toshiba-IBM
  • used in Sony PlayStation 3
• 8 Synergistic Processing Elements (SPEs)
  • 128-bit vector cores
  • 256 kB local memory (LS = Local Store)
  • Direct Memory Access (DMA) engine (25.6 GB/s each)
• Chip interconnect (EIB)
• Run SPE code as POSIX threads (SPMD, MPMD, streaming)
• 1 PowerPC PPE runs Linux OS
• Design: SPEs provide optimal flop/s per watt in minimal area
  • This is an Exascale trend
[Diagram labels: 8 optimized vector cores (SPE/SPU), PowerPC CPU, to PCIe, to memory]
* trademark of Sony Computer Entertainment, Inc.
Slide 4
Three types of processors are programmed to work together.
• Parallel computing on Cell
  • data partitioning & work queue pipelining
  • process management & synchronization
• Remote communication to/from Cell
  • data communication & synchronization
  • process management & synchronization
  • computationally-intense offload
• MPI remains as the foundation
[Diagram: SPE (8) ⇔ PPE over the on-chip ring bus (SPE compiler; POSIX threads, DMA, IBM ALF, LANL CML); PPE ⇔ Opteron over PCIe (PowerPC compiler; IBM DaCS, IBM ALF, LANL CML); Opteron ⇔ cluster over IB, one per node (x86 compiler; MPI)]
Slide 5
Several Challenges seen on Roadrunner
• Exposing on-node parallelism for Cell's 8 SPUs
  • Threads are used for 8 SPUs
• Tiling work to fit and stream in and out of the 256KB local memories
  • Breaking up the data into chunks
  • Using DMA engine to overlap read-ahead/compute/write-behind
• Breaking application into three collaborating parts (Opteron, PPC, SPUs)
  • Most MPI programs are SPMD and bulk synchronous
  • Relay for MPI messages (Cell → Opteron → IB and back up)
Slide 6
How do you keep the 256KB SPUs busy?
• Break the work into a stream of pieces
  • grid tiles or particle bundles (can include ghost zones)
• Data chunks stream in & out of 8 SPEs using asynchronous DMAs and triple-buffering
[Diagram: the problem domain of a Cell processor is cut into pieces that are distributed to its 8 SPEs]
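As a rough illustration of the sizing question behind this slide, the sketch below estimates how many elements fit in one work buffer when the 256 KB local store is shared by several buffers, and how many chunks a problem then breaks into. The amount of local store reserved for code and stack (reserved_bytes) is a hypothetical parameter, and both function names are invented for this example; real codes would also have to respect DMA size and alignment limits, which are not modeled here.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_STORE_BYTES (256u * 1024u)  /* SPE local store size from the slides */

/* Largest element count one work buffer can hold.
 * reserved_bytes: hypothetical space set aside for SPE code, stack, etc.
 * num_buffers:    3 for the triple-buffering named on the slide. */
size_t max_chunk_elems(size_t reserved_bytes, size_t num_buffers,
                       size_t elem_bytes)
{
    assert(reserved_bytes < LOCAL_STORE_BYTES);
    assert(num_buffers > 0 && elem_bytes > 0);
    size_t per_buffer = (LOCAL_STORE_BYTES - reserved_bytes) / num_buffers;
    return per_buffer / elem_bytes;
}

/* Number of chunks needed to stream total_elems (rounds up). */
size_t num_chunks(size_t total_elems, size_t chunk_elems)
{
    assert(chunk_elems > 0);
    return (total_elems + chunk_elems - 1) / chunk_elems;
}
```

For example, reserving 64 KB and splitting the rest across three buffers of 8-byte values leaves 8,192 elements per buffer, so one million values would stream through as 123 chunks.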
Slide 7
Put it all together: MPI+DaCS+DMA+SIMD
[Diagram: Host CPU ⇔ Cell PPE ⇔ 8 SPEs, with pipelined work units; upload and download between host and Cell use DaCS; the host uses MPI and performs a "relay" of DaCS ⇔ MPI messages]
• DMAs are simply block memory transfers
  • HW asynchronous (no SPE stalls)
  • DDR2 memory latency and BW performance
• SPE work-unit pipeline
  • DMA Get (first prefetch)
  • Switch work buffers
  • DMA Get (prefetch)
  • DMA Wait (complete current)
  • Compute
  • DMA Put (store behind)
  • DMA Wait (previous put)
  • Switch work buffers
  • DMA Wait (put)
• Compute & memory DMA transfers are overlapped in HW!
• DMA Get: mfc_get( LS_addr, Mem_addr, size, tag, 0, 0);
• DMA Put: mfc_put( LS_addr, Mem_addr, size, tag, 0, 0);
• DMA Wait: mfc_write_tag_mask(1<<tag); mfc_read_tag_status_all();
• MPI & DaCS can also be fully asynchronous
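The get/compute/put pipeline on this slide can be sketched on an ordinary host, with memcpy standing in for the DMA calls. This is only a simulation of the control flow under stated assumptions: memcpy is synchronous, so the DMA Wait steps become comments marking where mfc_write_tag_mask and mfc_read_tag_status_all would go on a real SPE. The names stream_scale and sim_get, the chunk size, and the scaling kernel are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_BUFFERS 3      /* triple-buffering, as named on the slides */
#define CHUNK_ELEMS 1024   /* hypothetical chunk size */

/* Stands in for work buffers in the SPE local store. */
static double work[NUM_BUFFERS][CHUNK_ELEMS];

/* Stand-in for mfc_get: copy up to one chunk starting at mem[first]. */
static size_t sim_get(double *ls, const double *mem, size_t first, size_t n)
{
    size_t count = (n - first < CHUNK_ELEMS) ? n - first : CHUNK_ELEMS;
    memcpy(ls, mem + first, count * sizeof *ls);
    return count;
}

/* Stream n values through the buffers, multiplying each by factor. */
void stream_scale(const double *in, double *out, size_t n, double factor)
{
    assert(in != NULL && out != NULL);
    if (n == 0)
        return;

    size_t counts[NUM_BUFFERS] = {0};
    size_t cur = 0;
    size_t cur_first = 0;

    /* DMA Get (first prefetch) */
    counts[cur] = sim_get(work[cur], in, 0, n);
    size_t next_first = counts[cur];

    while (counts[cur] > 0) {
        size_t nxt = (cur + 1) % NUM_BUFFERS;

        /* DMA Wait (previous put): a real SPE would wait here until the
         * earlier put that last used buffer nxt has finished. */

        /* DMA Get (prefetch) of the next chunk, if any remains. */
        counts[nxt] = (next_first < n)
                          ? sim_get(work[nxt], in, next_first, n)
                          : 0;

        /* DMA Wait (complete current): no-op, memcpy already finished. */

        /* Compute on the current buffer. */
        for (size_t i = 0; i < counts[cur]; i++)
            work[cur][i] *= factor;

        /* DMA Put (store behind); stand-in for mfc_put. */
        memcpy(out + cur_first, work[cur], counts[cur] * sizeof(double));

        /* Switch work buffers. */
        cur_first = next_first;
        next_first += counts[nxt];
        cur = nxt;
    }
    /* DMA Wait (put): a real SPE would drain outstanding puts here. */
}
```

With real asynchronous DMA, the third buffer is what lets a put drain from one buffer while the next chunk is computed in a second and a prefetch fills the third.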
Slide 8
Other Challenges
• Lack of programming tools beyond compilers
  • Like the earliest days of parallel computing
• Busy code developers worried about portability and longevity of an exotic platform
• Social acceptance
  • "Too busy. Too difficult. Not mainstream."
Slide 9
Roadrunner Open Science
• Ten Science projects targeted Roadrunner
  • February through September 2009
  • Significant system and Panasas integration and stabilization work was ongoing during this same time
• 7 of 10 Projects made extensive Science runs
  • 2 had good code running but ran out of time
  • 1 ran into technical and staffing difficulties
  • A couple are running again with the machine in Classified mode
Slide 10
The ten Roadrunner Open Science projects
Science (code) | Description | Status
Laser Plasma Instabilities (VPIC) | Study the nonlinear physics of laser backscatter energy transfer and plasma instabilities related to the National Ignition Facility (NIF). | Completed
Magnetic Reconnection (VPIC) | Study the continuous breaking and rearrangement of magnetic field lines in plasmas relevant to both space and laboratory applications. | Completed
Thermonuclear Burn Kinetics (VPIC) | Study how the TN burn process impacts the velocity distributions of the reacting particle populations and the impact that has on sustaining the burn. (ASC effort) | Code complete; Open science incomplete
Spall and Ejecta (SPaSM) | Study how materials break up internally (spall) and how pieces fly off (ejecta) as shock waves force the material to break apart at the atomic scale. (ASC Weapons Science effort) | Mostly completed
HIV Phylogenetics (ML) | Determine "best" evolutionary relationship trees (phylogenetic trees) from a large set of actual HIV genetic sequences for HIV vaccine targeting. | Completed
Properties of Metallic Nanowires (ParRep) | Apply the parallel-replica approach at the atomistic scale for simulating material properties of nanowires crucial for switches in future nanodevices. | Completed
DNS of Reacting Turbulence (CFDNS) | Study thermonuclear burning in turbulent conditions in Type Ia supernovae using Direct Numerical Simulations (DNS) with full rad-hydro. | Completed
The Roadrunner Universe (RRU) | Create a repository of particle simulations of the distribution of matter in the universe to look at galaxy-scale concentrations and structures (dark matter halos). | Mostly completed
Supernovae Light-Curves (Cassio) | Study the impact of 2D asymmetries on the radiative light output in core collapse supernovae. Coupled RAGE on Opteron-only with Jayenne-Milagro IMC (accelerated). | Code complete; Open science incomplete
Cellulosomes (Gromacs) | Study the effectiveness of the decomposition of cellulosic sheets of plant fiber by cellulosome bacteria related to biofuels (cellulosic alcohol) production. | Code work stopped due to performance issues & manpower