SciDAC Annual Meeting June 2007 1
Harnessing the Power of Emerging Petascale Platforms
John Mellor-Crummey
Department of Computer Science
Rice University
Center for Scalable Application Development Software
Where’s my PetaFLOP?
“My code runs glacially slow”
• Whose fault is it?
– mine?
– the compiler’s?
– the architecture?
• How can I tell?
• What can I do about it?
• node performance
– algorithm
– data structure
– code shape
• parallelization
– load balance, serialization
– communication frequency and volume
– lack of latency tolerance
• inadequate vectorization
• instruction mix deficiencies
• ineffective tiling for cache and TLB
• ineffective implementation of SSE
• cache organization
• low memory bandwidth
Performance Challenges
Gap between typical and peak performance is growing
• Modern parallel architectures are harder to program effectively
– complex microprocessors
• deeply pipelined, out of order, superscalar
– complex memory hierarchy
• non-blocking, multi-level caches, TLB
– direct interconnection networks
• Often, low performance results from interaction effects
– example: sparse-matrix vector multiply in LANL’s SAGE AMR code
• microprocessor architecture with limited memory bandwidth
• rows have different lengths
• typical row is short: most have 7 non-zeros
• compiler-based software pipelining is ineffective for CSR
• remedy: change data structure and change code shape so the compiler is more effective
• up to 2x improvement
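The interaction can be illustrated with a minimal Python sketch of sparse matrix-vector multiply over compressed sparse row (CSR) storage. The function name `csr_spmv` and the tiny matrix are illustrative assumptions, not taken from SAGE. The point is the shape of the inner loop: its trip count equals the number of non-zeros in a row, so with rows of about 7 non-zeros it is short and varies from row to row.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        total = 0.0
        # The trip count here is the number of non-zeros in this row:
        # short and irregular, which is what defeats software pipelining.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_idx[k]]
        y[row] = total
    return y


# Hypothetical 3x3 matrix:
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = csr_spmv(row_ptr, col_idx, values, x)
```

A layout that groups rows of equal length would give the inner loop a fixed trip count; that is one possible reading of "change data structure", not a description of the actual SAGE change.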
Talk Structure
• Case study: analysis and tuning of S3D (DOE Joule code)
– introduction to S3D
– S3D node performance analysis with HPCToolkit
– tuning S3D kernels with LoopTool
– S3D scalability issues
• automatic identification of scalability bottlenecks with HPCToolkit
• scalability concerns for the NCCS Cray XT3/XT4
• A plan for action
– enabling technology research and development
– application engagement
Theme: enabling technologies for performance analysis and tuning
– performance measurement and analysis (HPCToolkit)
– source-to-source optimization of Fortran (LoopTool)
– automatic identification of scalability bottlenecks (HPCToolkit)
S3D
• Direct numerical simulation (DNS) of turbulent combustion
– state-of-the-art code developed at CRF/Sandia
• PI: Jacqueline H. Chen, SNL
– 2007 INCITE award: 6M hours on XT3/4 at NCCS
– Tier 1 pioneering application for 250TF system
• Why DNS?
– study micro-physics of turbulent reacting flows
• full access to time resolved fields
• physical insight into chemistry-turbulence interactions
– develop and validate reduced model descriptions used in macro-scale simulations of engineering-level systems
[Figure labels: DNS; Physical Models; Engineering CFD codes (RANS, LES)]
Text and figures courtesy of Jacqueline H. Chen, SNL
Text and figures courtesy of Jacqueline H. Chen, SNL
S3D - DNS Solver
• Solves compressible reacting Navier-Stokes equations
• High fidelity numerical methods
– 8th order finite-difference
– 4th order explicit RK integrator
• Hierarchy of molecular transport models
• Detailed chemistry
• Multi-physics (sprays, radiation and soot)
– from SciDAC-TSTC (Terascale Simulation of Turbulent Combustion)
S3D Parallelization
Fortran90 + MPI
• 3D domain decomposition
– each MPI process manages a piece of the domain
• All processes have same number of grid points and same computational load
• Inter-processor communication only between nearest neighbors in 3D mesh
– large messages; non-blocking sends and receives
• All-to-all communication only required for monitoring and synchronization ahead of I/O
$$\frac{\text{Communication}}{\text{Computation}} = \frac{kN^2}{kN^3} = O\!\left(\frac{1}{N}\right)$$
S3D logical topology
Text courtesy of Jacqueline H. Chen, SNL
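As a rough worked example of this surface-to-volume argument, the sketch below counts halo points exchanged against grid points updated for a cubic block of N x N x N points per process. The function name and the halo width of 4 are assumptions for illustration (a centered 8th order difference reaches 4 points to each side); S3D's actual message sizes are not given in the slides.

```python
def comm_to_comp_ratio(n, halo_width=4):
    """Ratio of halo points exchanged to grid points updated for an n^3 block."""
    surface_points = 6 * halo_width * n * n  # six faces, halo_width layers each
    volume_points = n ** 3
    return surface_points / volume_points


# Doubling the block edge halves the ratio, matching the O(1/N) trend.
ratios = {n: comm_to_comp_ratio(n) for n in (25, 50, 100)}
```

Under these assumptions the 50 x 50 x 50 block used in the node study has a ratio of 0.48, and a 100 x 100 x 100 block has 0.24.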
S3D Node Performance Study
Experimental Setup
• Model problem
– pressure wave test (S3D-harness/Test1)
– 1 processor execution
– 50 x 50 x 50 domain
– 40 iterations (normal test case = 200)
• reduced iterations suffice for analysis
• System
– Cray XD1 (2.2 GHz Opteron 275; 6.4 GB/s DDR 400 memory)
• Cray XD1 node serves as a model for dual-core Cray XT3 node
• Overall node performance of S3D code provided to us (Feb 2007)
– 0.305 FLOPs/cycle, 15% of peak
• Can performance be improved? If so, how?
Rice’s HPCToolkit Performance Tools
• Work at binary level for language independence
– support multi-lingual codes with external binary-only libraries
• Profile rather than adding code instrumentation
– minimize measurement overhead and distortion
– enable data collection for large-scale parallelism
• Collect and correlate multiple performance measures
– can’t diagnose a problem with only one species of event
• Compute derived metrics to aid analysis
• Support top down performance analysis
– intuitive enough for scientists and engineers to use
– detailed enough to meet the needs of compiler writers
• Aggregate events for loops and procedures
– accurate despite approximate event attribution from counters
– loop-level info is more important than line-level info
HPCToolkit Workflow
[Figure: HPCToolkit workflow diagram. Labels: application source; compilation, linking; binary object code; profile execution; performance profile; binary analysis; program structure; interpret profile; source correlation; hyperlinked database; hpcviewer]
– launch unmodified, optimized application binaries
– collect statistical profiles of events of interest
– decode instructions and combine with profile data
– extract loop nesting & inlining from executables
– synthesize new metrics by combining metrics
– relate metrics and structure to program source
– support top-down analysis with interactive viewer
– analyze results anytime, anywhere
hpcviewer User Interface
[Screenshot: hpcviewer window with labeled source pane, navigation pane, metric pane, and view control]
hpcviewer Views
• Calling context view
– top-down view shows dynamic calling contexts in which costs were incurred
• Caller’s view
– bottom-up view apportions costs incurred in a routine to the routine’s dynamic calling contexts
• Flat view
– aggregates all costs incurred by a routine in any context and shows the details of where they were incurred within the routine
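The relationship between the three views can be sketched in a few lines of Python. This is a simplified illustration, not hpcviewer's implementation: the sample data, routine names, and function names are hypothetical, the bottom-up view is reduced to immediate callers only, and the flat view omits the per-line detail within a routine.

```python
from collections import defaultdict

# Hypothetical samples: (call path from main to the sampled routine, cost).
samples = [
    (("main", "solve", "rhs"), 6),
    (("main", "solve", "exchange"), 3),
    (("main", "init", "exchange"), 1),
]


def calling_context_view(samples):
    """Top-down: inclusive cost of every call-path prefix."""
    costs = defaultdict(int)
    for path, cost in samples:
        for depth in range(1, len(path) + 1):
            costs[path[:depth]] += cost
    return dict(costs)


def callers_view(samples):
    """Bottom-up: cost of each routine split by its immediate caller."""
    costs = defaultdict(int)
    for path, cost in samples:
        caller = path[-2] if len(path) > 1 else None
        costs[(path[-1], caller)] += cost
    return dict(costs)


def flat_view(samples):
    """Context-free: cost of each routine summed over all calling contexts."""
    costs = defaultdict(int)
    for path, cost in samples:
        costs[path[-1]] += cost
    return dict(costs)
```

In this toy data `exchange` costs 4 units in the flat view, while the bottom-up view shows that 3 of them come from calls made by `solve` and 1 from calls made by `init`.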
S3D Performance at the Loop Level
Derived metric, Wasted Opportunity: $\dfrac{(\text{maximum FLOP rate} \times \text{cycles}) - \text{actual FLOPs}}{\text{total waste}}$
– highlighted loop accounts for 11.4% of total program waste
Overall performance (15% of peak): $2.05 \times 10^{11}$ FLOPs / $6.73 \times 10^{11}$ cycles = 0.305 FLOPs/cycle
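The derived metric can be checked against the whole-program totals quoted on this slide. The sketch below assumes a peak of 2 FLOPs per cycle, which is inferred from 0.305 FLOPs/cycle being reported as 15% of peak and is not stated in the slides; the function names are illustrative.

```python
PEAK_FLOPS_PER_CYCLE = 2.0  # assumption inferred from "15% of peak"


def wasted_flops(cycles, actual_flops, peak=PEAK_FLOPS_PER_CYCLE):
    """FLOPs that could have been issued at the peak rate but were not."""
    return peak * cycles - actual_flops


total_flops = 2.05e11   # from the slide
total_cycles = 6.73e11  # from the slide

flops_per_cycle = total_flops / total_cycles
fraction_of_peak = flops_per_cycle / PEAK_FLOPS_PER_CYCLE
total_waste = wasted_flops(total_cycles, total_flops)


def waste_share(scope_cycles, scope_flops):
    """Fraction of total program waste attributable to one loop or routine."""
    return wasted_flops(scope_cycles, scope_flops) / total_waste
```

Ranking loops by `waste_share` rather than by raw time favors loops that are both expensive and far from peak, which is how a single loop can be singled out as holding 11.4% of the total waste.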
S3D: What Opportunities Exist?
[Annotated source listing. Annotations: initialize; update; 5D loop nest (2D explicit loops plus 3D F90 vector syntax); reuse (marked three times); performance problem: data streams in/out of memory]
LoopTool: Loop Optimization of Fortran
Rice University’s tool for source-to-source transformation of Fortran
(transformation subset shown)
• Loop Unswitching
• Controlled Loop Fusion
• Unroll and Jam
– do k = 1,n becomes do k = 1,n-1,2
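To show what unroll-and-jam does to a loop nest, here is a small Python sketch. LoopTool operates on Fortran; Python is used here only to keep the example executable, and the loop body and function names are hypothetical. The outer loop is unrolled by 2 and the two copies of the inner loop are fused ("jammed"), so each inner iteration reuses `c[j]` for two values of `k`; a peeled iteration handles odd trip counts.

```python
def original(a, b, n, m):
    """c[j] = sum over k of a[k] * b[k][j], with k as the outer loop."""
    c = [0] * m
    for k in range(n):
        for j in range(m):
            c[j] += a[k] * b[k][j]
    return c


def unroll_and_jam(a, b, n, m):
    """Same computation with the k loop unrolled by 2 and the j loops jammed."""
    c = [0] * m
    # Stride-2 outer loop, analogous to "do k = 1,n-1,2" in 1-based Fortran.
    for k in range(0, n - 1, 2):
        for j in range(m):
            c[j] += a[k] * b[k][j]
            c[j] += a[k + 1] * b[k + 1][j]
    # Peeled cleanup iteration when n is odd.
    if n % 2 == 1:
        k = n - 1
        for j in range(m):
            c[j] += a[k] * b[k][j]
    return c


a = [1, 2, 3, 4, 5]
b = [[k * 3 + j for j in range(3)] for k in range(5)]
```

Both versions update `c[j]` in the same order for each `j`, so the results are identical; the benefit in compiled code comes from keeping `c[j]` in a register across the two updates.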
Markup of S3D Diffusive Flux Loop
Add LoopTool directives to source program: unroll and jam directives (uj), unswitching directives (unswitch), controlled fusion directives (fuse)

!dir$ uj 3
    do m=1,3 ! DIRECTION
!dir$ uj 2
      do n=1,n_spec-1 ! SPECIES

!dir$ unswitch 2
        if (baro_switch) then
          ! driving force includes gradient in mole fraction and baro-diffusion:
!dir$ fuse 1 1 1
          diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) &
              + Ys(:,:,:,n) * ( grad_mixMW(:,:,:,m) &
              + (1 - molwt(n)*avmolwt) * grad_P(:,:,:,m)/Press))
        else
          ! driving force is just the gradient in mole fraction:
!dir$ fuse 1 1 1
          diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) &
              + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
        endif

        ! Add thermal diffusion:
!dir$ unswitch 2
        if (thermDiff_switch) then
!dir$ fuse 1 1 1
          diffFlux(:,:,:,n,m) = diffFlux(:,:,:,n,m) &
              - Ds_mixavg(:,:,:,n) * Rs_therm_diff(:,:,:,n) * molwt(n) * avmolwt * grad_T(:,:,:,m) / Temp
        endif

        ! compute contribution to nth species diffusive flux
        ! this will ensure that the sum of the diffusive fluxes is zero.
!dir$ fuse 1 1 1
        diffFlux(:,:,:,n_spec,m) = diffFlux(:,:,:,n_spec,m) - diffFlux(:,:,:,n,m)

      enddo ! SPECIES
    enddo ! DIRECTION
LoopTool Optimization of S3D Diffusive Flux Loop
[Figure: code shape before and after LoopTool. Before (35 lines): loops m=1,3 and n=1,nspec-1 enclosing the tests on BS and TD. After (445 lines): the tests on BS and TD select among four loop nests over n=1,nspec-2,2]
Transformation Log:
– scalarization (4 stmts)
– loop unswitching (2 conditions)
– fusion (loops within 4 outer nests)
– unroll-and-jam (2 loops)
– peeling excess iterations (4 nests)
2.94x faster than original (6.7% total savings)
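Loop unswitching, the transformation that multiplies the code size here, can be sketched on a toy loop. The formula below is a deliberately simplified stand-in and not S3D's diffusive flux expression; the function and parameter names are hypothetical. A loop-invariant test is hoisted out of the loop, leaving one specialized loop per branch.

```python
def flux_original(ys, grad, grad_p, use_baro):
    """Loop-invariant test on use_baro is re-evaluated in every iteration."""
    out = []
    for y, g, gp in zip(ys, grad, grad_p):
        if use_baro:
            out.append(-(g + y * gp))
        else:
            out.append(-g)
    return out


def flux_unswitched(ys, grad, grad_p, use_baro):
    """Test hoisted out of the loop: each branch gets its own branch-free loop."""
    if use_baro:
        return [-(g + y * gp) for y, g, gp in zip(ys, grad, grad_p)]
    return [-g for _, g, _ in zip(ys, grad, grad_p)]


ys = [1, 2, 3]
grad = [4, 5, 6]
grad_p = [7, 8, 9]
```

Unswitching two independent conditions yields four specialized loop nests, which is consistent with the four nests in the transformation log and helps explain the growth from 35 to 445 lines.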
S3D: An Unexpected Bottleneck
• 5.4% of time: an implicit loop that copies a non-contiguous 4D slice of 5D data to contiguous storage
• Approach: adjust routine interfaces to avoid copy
• 100% faster
S3D Node Performance Tuning Summary
• More opportunities remain
– register reuse and tiling of stencil computations
– inlining + fusion + array contraction of temporary variables
• Further improvements require more changes
– lots of potential smaller improvements
Enabling technologies contributions
– HPCToolkit made it possible to identify and assess bottlenecks
– LoopTool helped automate tedious code transformations
Achieved ~12.7% overall improvement
– boosted node performance from 15% of peak to 17.4% of peak
– estimated savings on planned 2M CPU hour run: 254K CPU hours
The Lump Under the Rug: Scaling Bottlenecks
[Figure: synthetic example plotting efficiency (0 to 1) against CPUs (1 to 65536, powers of 4), comparing ideal efficiency with actual efficiency; the growing gap is marked "?". Note: higher is better]
S3D Weak Scaling Performance
• Studied up to 20,000 cores on Cray XT3/XT4 at NCCS
• Cost per grid point increases > 50% as system size scales from 1 to 20,000 cores on Cray XT3/XT4
[Graph: cost per grid point versus number of cores; lower is better. Graph courtesy of Jacqueline H. Chen, SNL]
A Qualitative Understanding of S3D Scaling
[Figure: execution time breakdown for S3D using weak scaling (Cray XT3/XT4, NCCS); labeled bands include MPI wait and LUSTRE write. A widening color band indicates a non-scalable cost]
Courtesy of Sameer Shende, University of Oregon
(Measured with Oregon’s Tau using procedure- and loop-level instrumentation)
Pinpointing Scalability Bottlenecks Automatically
Challenges
• Applications
– modern software uses layers of libraries
– performance is often context dependent
• Monitoring
– bottleneck nature: computation, data movement, synchronization?
– size of petascale platforms
Example climate code skeleton
[Figure: call tree in which main calls ocean, atmosphere, sea ice, and land, and each of those calls wait]
Call Path Profiling: Understanding Costs in Context
Event-based sampling method for performance measurement
• When a profile event occurs, e.g. a timer expires
– determine context in which cost is incurred
• unwind call stack to determine set of active procedure frames
– attribute cost of sample to PC in calling context
• Benefits
– monitor unmodified fully optimized code
– language independent: C/C++, Fortran, assembly code, …
– accurate
– low overhead (1K samples per second has ~3-5% overhead)
[Froyd et al., ICS 05]
Pinpointing Scalability Bottlenecks Automatically
• Execute code on p and q processors; without loss of generality, p < q
• Let $T_i$ = total execution time on i processors
• For corresponding nodes $n_q$ and $n_p$
– let $C(n_q)$ and $C(n_p)$ be the costs of nodes $n_q$ and $n_p$
• Performance expectation for weak scaling
– work increases linearly with # processors
– execution time is same as that on a single processor
• Expectation: $C(n_q) = C(n_p)$
• Fraction of excess work (parallel overhead / total time): $X_w(n_q) = \dfrac{C(n_q) - C(n_p)}{T_q}$
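The excess-work formula can be applied directly to two call path profiles. The sketch below assumes each profile is a dictionary mapping a call path to its inclusive cost, with costs expressed per process so that equal values mean perfect weak scaling. The profile numbers, routine names, and function name are hypothetical.

```python
def excess_work(costs_p, costs_q, total_time_q):
    """X_w(n_q) = (C(n_q) - C(n_p)) / T_q for every corresponding node."""
    return {
        node: (cost_q - costs_p.get(node, 0.0)) / total_time_q
        for node, cost_q in costs_q.items()
    }


# Hypothetical inclusive costs in seconds, keyed by call path.
costs_p = {
    ("main",): 100.0,
    ("main", "compute"): 88.0,
    ("main", "MPI_Wait"): 10.0,
    ("main", "write"): 2.0,
}
costs_q = {
    ("main",): 125.0,
    ("main", "compute"): 88.0,
    ("main", "MPI_Wait"): 30.0,
    ("main", "write"): 7.0,
}
t_q = costs_q[("main",)]  # total execution time on q processors

xw = excess_work(costs_p, costs_q, t_q)

# Rank non-root contexts by their share of the scaling loss.
ranked = sorted((n for n in xw if len(n) > 1), key=xw.get, reverse=True)
```

In this made-up data the run on q processors loses 20% of its time to excess work overall, 16% of it in the `MPI_Wait` context, while `compute` scales perfectly.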
LANL’s Parallel Ocean Program (POP)
• successive global reductions on scalars degrade parallel efficiency (7 total)
• 12% loss in scaling due to scalar reductions
• 7% in this routine alone
Why Does S3D Performance Degrade?
Let’s explore the nature of the problem …
Communication overhead is an interaction between
– logical communication topology of S3D
– network topology of the Cray XT3/XT4
– mapping S3D’s logical topology onto the Cray XT3/XT4
– other factors …
• link latency and bandwidth
• communication volume
• fraction of message latency that is overlapped with computation
Bisection Bandwidth on a Torus Network
Consider: ideal embedding of S3D mesh in the torus
How much communication bandwidth crosses between halves?
YZ x “bandwidth between a pair of comm. partners”
[Figure: torus with dimensions X, Y, Z]
Bisection Bandwidth on a Torus Network
Consider: random embedding of S3D mesh in the torus
How much communication bandwidth crosses between halves?
O(XYZ) x “bandwidth between a pair of comm. partners”
[Figure: torus with dimensions X, Y, Z]
Mapping as a Potential Scalability Issue?
• Communication crossing between halves for different mappings
– ideal: YZ x “bandwidth between a pair of comm. partners”
– random: O(XYZ) x “bandwidth between a pair of comm. partners”
• Moral
– a bad mapping could increase communication significantly
• random mapping yields O(X) times the bisection communication
• Next steps
– investigate impact of logical to physical node mapping on Cray XT
• issues
– congestion: max number of logical links that map to a physical link
– dilation: longest path between a pair of communication partners
• assess impact of congestion and dilation on performance
– explore better node mapping (and perhaps allocation) strategies
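The ideal and random cases can be compared with a small counting experiment. The sketch below is an illustration under stated assumptions, not a model of the Cray XT network: the logical mesh is taken as non-periodic, the torus has the same 8 x 8 x 8 shape as the mesh, the machine is cut into halves along X, and all function names are hypothetical. It counts logical links that cross the cut and computes dilation as defined above; congestion is omitted because it depends on the routing algorithm.

```python
import random
from itertools import product


def mesh_links(dims):
    """Nearest-neighbor links of a non-periodic 3D logical mesh."""
    links = []
    for node in product(*(range(d) for d in dims)):
        for axis in range(3):
            if node[axis] + 1 < dims[axis]:
                neighbor = list(node)
                neighbor[axis] += 1
                links.append((node, tuple(neighbor)))
    return links


def crossing_links(links, placement, dims):
    """Count logical links whose endpoints land in different halves along X."""
    half = dims[0] // 2
    return sum(
        (placement[a][0] < half) != (placement[b][0] < half) for a, b in links
    )


def dilation(links, placement, dims):
    """Longest torus hop distance between any pair of communication partners."""
    def hops(p, q):
        return sum(min(abs(u - v), d - abs(u - v)) for u, v, d in zip(p, q, dims))
    return max(hops(placement[a], placement[b]) for a, b in links)


dims = (8, 8, 8)
nodes = list(product(*(range(d) for d in dims)))
links = mesh_links(dims)

ideal = {node: node for node in nodes}        # logical (x, y, z) -> same torus node
rng = random.Random(0)                        # fixed seed so the sketch is repeatable
shuffled = nodes[:]
rng.shuffle(shuffled)
scrambled = dict(zip(nodes, shuffled))        # random logical -> physical mapping

ideal_cross = crossing_links(links, ideal, dims)
random_cross = crossing_links(links, scrambled, dims)
```

With the identity mapping exactly Y x Z = 64 logical links cross the cut and the dilation is 1. Under a random mapping roughly half of the 1344 logical links are expected to cross, which is the O(XYZ) behavior the slide warns about.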
A Plan for Action (Part 1)
Enabling technologies for petascale computing
• Enhance and deploy performance measurement and analysis tools
– sampling-based tools for measuring application performance
– automatic analysis of scalability bottlenecks
– cluster analysis of ensembles of processes
– insights into node performance bottlenecks
• Enhance node compiler technology for scientific systems
– source-to-source tools for optimizing Fortran loop nests
– analysis and source-to-source code generation for multicore processors
• Co-array Fortran compiler for Cray XT and IBM Blue Gene
– CAF refinements for expressiveness and performance
A Plan for Action (Part 2)
Application engagement
• S3D
– improve mapping: logical topology → physical nodes
– analyze and exploit opportunities for tailoring loop nests
– explore alternatives for derivative computations
• XGC1
– identify and exploit opportunities for tuning node performance
• GTC
– use space filling curves to reorder particles to improve data locality
GTC: Gyrokinetic Toroidal Code
• Charged particles follow spiral paths around magnetic field lines
• Plasma turbulence arises from temperature difference between outer and inner regions
– provides means for particles in the plasma to move toward the outer edges of the reactor rather than fusing with other particles
• Major challenge: use simulations to better understand and minimize the problem of turbulence
– theory and experimental results differ; use simulation to gain insight
Developed by SciDAC-funded Gyrokinetic Particle Simulation Center
Figure Credit: SciDAC final report, 2006
GTC: Boosting Locality by Ordering Particles
• Proposed approach
– order particles in the plasma by their position along a space-filling curve
• Expected benefits
– better locality of access for particle to cell and cell to particle interactions
[Figures: top view of particles in 1/8 tokamak; 3D view of particles in 1/8 tokamak; Hilbert order of particles in 1/8 tokamak]
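A minimal sketch of the reordering idea follows. It uses a 2D Hilbert curve over an n x n grid of cells (n a power of two) for brevity, whereas the figures show particles in three dimensions; the function names and the particle representation as `(cell_x, cell_y)` tuples are assumptions, and nothing here is taken from the GTC source. The index computation is the standard iterative mapping from cell coordinates to distance along the curve.

```python
def hilbert_index(n, x, y):
    """Distance of cell (x, y) along a Hilbert curve covering an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the next level sees a canonical layout.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def sort_particles(particles, n):
    """Order particles, given as (cell_x, cell_y) tuples, along the curve."""
    return sorted(particles, key=lambda p: hilbert_index(n, p[0], p[1]))


particles = [(3, 0), (0, 0), (2, 2), (0, 3)]
ordered = sort_particles(particles, 4)
```

Cells that are consecutive along the curve are always neighbors in the grid, so particles stored in this order tend to touch nearby grid cells in sequence, which is the locality benefit the slide anticipates.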
Acknowledgments
• HPCToolkit Team
– Michael Fagan
– Mark Krentel
– Nathan Tallent
• LoopTool Team
– Apan Qasem
• S3D Studies
– Yuan Zhao
– Apan Qasem
• GTC study
– Guohua Jin
– Gabriel Marin