Page 1

Harnessing the Power of Emerging Petascale Platforms

John Mellor-Crummey

Department of Computer Science

Rice University

Center for Scalable Application Development Software

Page 2

Where’s my PetaFLOP?

“My code runs glacially slow”

• Whose fault is it?

– mine?

– the compiler’s?

– the architecture?

• How can I tell?

• What can I do about it?

• node performance
– algorithm
– data structure
– code shape

• parallelization
– load balance, serialization
– communication frequency and volume
– lack of latency tolerance

• inadequate vectorization
• instruction mix deficiencies
• ineffective tiling for cache and TLB

• ineffective implementation of SSE
• cache organization
• low memory bandwidth

Page 3

Performance Challenges

Gap between typical and peak performance is growing

• Modern parallel architectures are harder to program effectively
– complex microprocessors

• deeply pipelined, out-of-order, superscalar

– complex memory hierarchy
• non-blocking, multi-level caches, TLB

– direct interconnection networks

• Often, low performance results from interaction effects
– example: sparse matrix-vector multiply in LANL’s SAGE AMR code

• microprocessor architecture with limited memory bandwidth

• rows have different lengths

• typical row is short: most have 7 non-zeros

• compiler-based software pipelining is ineffective for CSR

Changing the data structure and code shape makes the compiler more effective: up to 2x improvement. (A minimal CSR sketch follows below.)
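For reference, a minimal compressed sparse row (CSR) matrix-vector multiply sketch (illustrative code, not SAGE's): the inner loop's trip count equals the row length, so with a typical row of 7 non-zeros there is too little, and too variable, work per row for software pipelining to hide the latency of the indirect loads.

      ! Minimal CSR sparse matrix-vector multiply (illustrative names).
      ! The inner loop runs rowptr(i+1)-rowptr(i) times, typically ~7,
      ! so software pipelining cannot hide the latency of x(colind(j)).
      subroutine spmv_csr(n, rowptr, colind, val, x, y)
        integer, intent(in)  :: n
        integer, intent(in)  :: rowptr(n+1), colind(*)
        real(8), intent(in)  :: val(*), x(*)
        real(8), intent(out) :: y(n)
        integer :: i, j
        do i = 1, n
          y(i) = 0.0d0
          do j = rowptr(i), rowptr(i+1) - 1
            y(i) = y(i) + val(j) * x(colind(j))
          end do
        end do
      end subroutine spmv_csr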

Page 4

Talk Structure

• Case study: analysis and tuning of S3D (DOE Joule code)
– introduction to S3D
– S3D node performance analysis with HPCToolkit
– tuning S3D kernels with LoopTool
– S3D scalability issues

• automatic identification of scalability bottlenecks with HPCToolkit
• scalability concerns for the NCCS Cray XT3/XT4

• A plan for action
– enabling technology research and development
– application engagement

Theme: enabling technologies for performance analysis and tuning
– performance measurement and analysis (HPCToolkit)
– source-to-source optimization of Fortran (LoopTool)
– automatic identification of scalability bottlenecks (HPCToolkit)

Page 5

S3D

• Direct numerical simulation (DNS) of turbulent combustion
– state-of-the-art code developed at CRF/Sandia

• PI: Jacqueline H. Chen, SNL

– 2007 INCITE award: 6M hours on XT3/4 at NCCS
– Tier 1 pioneering application for 250TF system

• Why DNS?
– study the micro-physics of turbulent reacting flows

• full access to time-resolved fields

• physical insight into chemistry-turbulence interactions

– develop and validate reduced model descriptions used in macro-scale simulations of engineering-level systems

[Figure: DNS informs physical models, which feed engineering CFD codes (RANS, LES)]

Text and figures courtesy of Jacqueline H. Chen, SNL

Page 6

Text and figures courtesy of Jacqueline H. Chen, SNL

S3D - DNS Solver

• Solves compressible reacting Navier-Stokes equations
• High-fidelity numerical methods

– 8th-order finite difference
– 4th-order explicit RK integrator

• Hierarchy of molecular transport models
• Detailed chemistry
• Multi-physics (sprays, radiation, and soot)

– from SciDAC-TSTC (Terascale Simulation of Turbulent Combustion)

Page 7

S3D Parallelization

Fortran90 + MPI

• 3D domain decomposition

– each MPI process manages a piece of the domain

• All processes have the same number of grid points and the same computational load

• Inter-processor communication only between nearest neighbors in 3D mesh
– large messages; non-blocking sends and receives

• All-to-all communication only required for monitoring and synchronization ahead of I/O

\[
\frac{\mathrm{Communication}}{\mathrm{Computation}} \;=\; \frac{kN^{2}}{kN^{3}} \;=\; O\!\left(\frac{1}{N}\right)
\]

S3D logical topology

Text courtesy of Jacqueline H. Chen, SNL
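Spelling out the ratio (a sketch assuming an $N \times N \times N$ block of grid points per process and halo exchanges of depth $g$ on each of its 6 faces, with $g = 4$ for an 8th-order centered difference):

\[
\frac{\mathrm{Communication}}{\mathrm{Computation}} \;\approx\; \frac{6\,g\,N^{2}}{N^{3}} \;=\; \frac{6g}{N} \;=\; O\!\left(\frac{1}{N}\right),
\]

so communication becomes relatively cheaper as the per-process block grows.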

Page 8

S3D Node Performance Study

Experimental Setup

• Model problem: pressure wave test (S3D-harness/Test1)
– 1-processor execution
– 50 x 50 x 50 domain
– 40 iterations (normal test case = 200)

• reduced iterations suffice for analysis

• System
– Cray XD1 (2.2 GHz Opteron 275; 6.4 GB/s DDR 400 memory)

• Cray XD1 node serves as a model for a dual-core Cray XT3 node

• Overall node performance of S3D code provided to us (Feb 2007)
– 0.305 FLOPs/cycle, 15% of peak

• Can performance be improved? If so, how?

Page 9

Rice’s HPCToolkit Performance Tools

• Work at binary level for language independence
– support multi-lingual codes with external binary-only libraries

• Profile rather than adding code instrumentation
– minimize measurement overhead and distortion
– enable data collection for large-scale parallelism

• Collect and correlate multiple performance measures
– can’t diagnose a problem with only one species of event

• Compute derived metrics to aid analysis
• Support top-down performance analysis
– intuitive enough for scientists and engineers to use
– detailed enough to meet the needs of compiler writers

• Aggregate events for loops and procedures
– accurate despite approximate event attribution from counters
– loop-level info is more important than line-level info

Page 10

HPCToolkit Workflow

[Workflow diagram: application source is compiled and linked into binary object code; the binary is run under profile execution to produce a performance profile, and is separately fed to binary analysis to recover program structure; interpret profile / source correlation combines the performance profile, program structure, and application source into a hyperlinked database viewed with hpcviewer]

Page 11

HPCToolkit Workflow

[Workflow diagram repeated]

– launch unmodified, optimized application binaries
– collect statistical profiles of events of interest

Page 12

HPCToolkit Workflow

[Workflow diagram repeated]

– decode instructions and combine with profile data

Page 13

HPCToolkit Workflow

[Workflow diagram repeated]

– extract loop nesting & inlining from executables

Page 14

HPCToolkit Workflow

[Workflow diagram repeated]

– synthesize new metrics by combining metrics
– relate metrics and structure to program source

Page 15

HPCToolkit Workflow

[Workflow diagram repeated]

– support top-down analysis with interactive viewer
– analyze results anytime, anywhere

Page 16

hpcviewer User Interface

[Screenshot: hpcviewer showing the source pane, navigation pane, metric pane, and view control]

Page 17

hpcviewer User Interface

Page 18

hpcviewer Views

• Calling context view
– top-down view shows the dynamic calling contexts in which costs were incurred

• Caller’s view
– bottom-up view apportions costs incurred in a routine to the routine’s dynamic calling contexts

• Flat view
– aggregates all costs incurred by a routine in any context and shows the details of where they were incurred within the routine

Page 19

S3D Performance at the Loop Level

Wasted opportunity = (maximum FLOP rate × cycles − actual FLOPs) / total waste

The highlighted loop accounts for 11.4% of total program waste.

Overall performance (15% of peak): $2.05 \times 10^{11}$ FLOPs / $6.73 \times 10^{11}$ cycles $= 0.305$ FLOPs/cycle
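For concreteness, assuming the Opteron's peak of 2 FLOPs/cycle (consistent with 0.305 FLOPs/cycle being about 15% of peak), the whole-program waste works out to

\[
2 \times 6.73 \times 10^{11} \;-\; 2.05 \times 10^{11} \;\approx\; 1.14 \times 10^{12} \ \text{FLOPs}.
\]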

Page 20

S3D: What Opportunities Exist?

[Code figure: an initialize and an update phase over a 5D loop nest (2D explicit loops around 3D F90 vector syntax); several arrays are reused across statements, but the data streams in and out of memory between uses, a reuse performance problem]

Page 21

LoopTool: Loop Optimization of Fortran

Rice University’s tool for source-to-source transformation of Fortran (transformation subset shown):

• Loop Unswitching
• Controlled Loop Fusion
• Unroll and Jam (e.g., do k = 1,n becomes do k = 1,n-1,2)
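To make the do k = 1,n → do k = 1,n-1,2 step concrete, here is a minimal unroll-and-jam sketch (illustrative arrays, not S3D code): the outer k loop is unrolled by 2 and the copies of the inner loop are fused, so each element a(i) is loaded and stored once per two k iterations, improving register reuse.

      ! Unroll-and-jam sketch (illustrative, not S3D code).
      subroutine axpy_uj(m, n, a, b, c)
        integer, intent(in)    :: m, n
        real(8), intent(inout) :: a(m)
        real(8), intent(in)    :: b(m,n), c(n)
        integer :: i, k
        do k = 1, n-1, 2                  ! outer loop unrolled by 2
          do i = 1, m                     ! jammed inner loop body
            a(i) = a(i) + b(i,k)*c(k) + b(i,k+1)*c(k+1)
          end do
        end do
        if (mod(n,2) == 1) then           ! peel leftover iteration if n is odd
          do i = 1, m
            a(i) = a(i) + b(i,n)*c(n)
          end do
        end if
      end subroutine axpy_uj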

Page 22

Markup of S3D Diffusive Flux Loop

!dir$ uj 3
      do m = 1, 3                    ! DIRECTION
!dir$ uj 2
        do n = 1, n_spec-1           ! SPECIES

!dir$ unswitch 2
          if (baro_switch) then
            ! driving force includes gradient in mole fraction and baro-diffusion:
!dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m)   &
                 + Ys(:,:,:,n) * ( grad_mixMW(:,:,:,m)                          &
                 + (1 - molwt(n)*avmolwt) * grad_P(:,:,:,m)/Press))
          else
            ! driving force is just the gradient in mole fraction:
!dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m)   &
                 + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
          endif

          ! Add thermal diffusion:
!dir$ unswitch 2
          if (thermDiff_switch) then
!dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = diffFlux(:,:,:,n,m) - Ds_mixavg(:,:,:,n)      &
                 * Rs_therm_diff(:,:,:,n) * molwt(n) * avmolwt                  &
                 * grad_T(:,:,:,m) / Temp
          endif

          ! compute contribution to nth species diffusive flux;
          ! this will ensure that the sum of the diffusive fluxes is zero.
!dir$ fuse 1 1 1
          diffFlux(:,:,:,n_spec,m) = diffFlux(:,:,:,n_spec,m) - diffFlux(:,:,:,n,m)

        enddo   ! SPECIES
      enddo     ! DIRECTION

(The !dir$ lines are LoopTool directives added to the source program: unswitching, controlled fusion, and unroll-and-jam.)
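The unswitch directives ask LoopTool to hoist the loop-invariant tests (baro_switch, thermDiff_switch) out of the enclosing loops. A minimal sketch of the effect (illustrative code, not the generated S3D source):

      ! Loop unswitching sketch: the invariant test "baro" is evaluated
      ! once instead of once per iteration, and each specialized loop
      ! body is branch-free.
      subroutine flux_unswitched(ns, f, g, baro)
        integer, intent(in)    :: ns
        real(8), intent(inout) :: f(ns)
        real(8), intent(in)    :: g(ns)
        logical, intent(in)    :: baro
        integer :: n
        if (baro) then
          do n = 1, ns
            f(n) = f(n) + 2.0d0*g(n)
          end do
        else
          do n = 1, ns
            f(n) = f(n) + g(n)
          end do
        end if
      end subroutine flux_unswitched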

Page 23

Optimization of S3D Diffusive Flux Loop

[Diagram: LoopTool transforms the original nest (m=1,3 over n=1,n_spec-1, with the baro_switch and thermDiff_switch tests inside) into unswitched variants for each combination of the two tests, each with the species loop unrolled by 2: n=1,n_spec-2,2]

Transformation Log:
– scalarization (4 stmts)
– loop unswitching (2 conditions)
– fusion (loops within 4 outer nests)
– unroll-and-jam (2 loops)
– peeling excess iterations (4 nests)

2.94x faster than original (6.7% total savings)

(35 lines of marked-up source expand to 445 lines of transformed code)

Page 24

S3D: An Unexpected Bottleneck

An implicit loop that copies a non-contiguous 4D slice of 5D data to contiguous storage accounted for 5.4% of execution time.

Approach: adjust routine interfaces to avoid the copy (100% faster).
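The copy arises when a non-contiguous array section is passed to a routine whose dummy argument requires contiguous storage; a minimal sketch of the pattern (hypothetical names and shapes, not S3D's actual interfaces):

      ! Sketch (hypothetical names): q(:,:,:,2,:) is a non-contiguous 4D
      ! slice of the 5D array q, so passing it to an explicit-shape dummy
      ! makes the compiler copy it to a contiguous temporary on entry and
      ! back on return. Passing the whole array plus an index avoids the
      ! temporary, which is the spirit of the interface fix above.
      program slice_copy
        implicit none
        integer, parameter :: nx=50, ny=50, nz=50, nvar=4, nspec=11
        real(8) :: q(nx,ny,nz,nvar,nspec)
        q = 1.0d0
        call work(q(:,:,:,2,:), nx, ny, nz, nspec)  ! copy-in/copy-out here
      contains
        subroutine work(u, n1, n2, n3, n4)
          integer, intent(in)    :: n1, n2, n3, n4
          real(8), intent(inout) :: u(n1,n2,n3,n4)  ! contiguous dummy
          u = 2.0d0 * u
        end subroutine work
      end program slice_copy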

Page 25

S3D Node Performance Tuning Summary

• More opportunities remain
– register reuse and tiling of stencil computations
– inlining + fusion + array contraction of temporary variables

• Further improvements require more changes
– lots of potential smaller improvements

Enabling technologies contributions:
– HPCToolkit made it possible to identify and assess bottlenecks
– LoopTool helped automate tedious code transformations

Achieved ~12.7% overall improvement:
– boosted node performance from 15% of peak to 17.4% of peak
– estimated savings on planned 2M CPU-hour run: 254K CPU hours

Page 26

The Lump Under the Rug: Scaling Bottlenecks

[Graph (synthetic example): parallel efficiency (0 to 1) versus CPUs (1, 4, 16, …, 65536); ideal efficiency holds at 1 while actual efficiency falls off, raising the question of where the loss comes from; note: higher is better]
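Read as the standard weak-scaling efficiency (an assumption; the slide defines the metric only graphically), the plotted quantity is

\[
E(p) \;=\; \frac{T_{1}}{T_{p}},
\]

which is 1 when the time on $p$ processors matches the single-processor time and drops below 1 as non-scalable costs grow.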

Page 27

S3D Weak Scaling Performance

Graph courtesy of Jacqueline H. Chen, SNL (lower is better)

Studied up to 20,000 cores on Cray XT3/XT4 at NCCS.

Cost per grid point increases by more than 50% as the system scales from 1 to 20,000 cores on Cray XT3/XT4.

Page 28

A Qualitative Understanding of S3D Scaling

[Figure: execution time breakdown for S3D under weak scaling (Cray XT3/XT4, NCCS), courtesy of Sameer Shende, University of Oregon; measured with Oregon’s TAU using procedure- and loop-level instrumentation. A widening color band indicates a non-scalable cost; the MPI wait and Lustre write bands widen.]

Page 29

Pinpointing Scalability Bottlenecks Automatically

Challenges

• Applications
– modern software uses layers of libraries
– performance is often context-dependent

• Monitoring
– bottleneck nature: computation, data movement, synchronization?
– size of petascale platforms

[Diagram: example climate code skeleton; main calls ocean, atmosphere, sea ice, and land components, each ending in a wait]

Page 30

Call Path Profiling: Understanding Costs in Context

Event-based sampling method for performance measurement

• When a profile event occurs, e.g., a timer expires
– determine the context in which the cost is incurred

• unwind the call stack to determine the set of active procedure frames

– attribute the cost of the sample to the PC in its calling context

• Benefits
– monitor unmodified, fully optimized code
– language independent: C/C++, Fortran, assembly code, …
– accurate
– low overhead (1K samples per second has ~3-5% overhead)

[Froyd et al., ICS ’05]

Page 31

Pinpointing Scalability Bottlenecks Automatically

Performance expectation for weak scaling:
– work increases linearly with the number of processors
– execution time is the same as that on a single processor

• Execute the code on p and q processors; without loss of generality, p < q

• Let $T_i$ = total execution time on i processors

• For corresponding nodes $n_q$ and $n_p$, let $C(n_q)$ and $C(n_p)$ be the costs of nodes $n_q$ and $n_p$

• Expectation: $C(n_q) = C(n_p)$

• Fraction of excess work (parallel overhead over total time):

\[
X_w(n_q) \;=\; \frac{C(n_q) - C(n_p)}{T_q}
\]
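As a toy illustration of computing the metric over matched nodes (a sketch with hypothetical arrays, not HPCToolkit's implementation):

      ! Toy sketch (not HPCToolkit code): compute the fraction of excess
      ! work X_w(n) = (C_q(n) - C_p(n)) / T_q for nnodes corresponding
      ! calling-context-tree nodes. Entries well above zero flag nodes
      ! that violate the weak-scaling expectation C_q(n) = C_p(n).
      subroutine excess_work(nnodes, cp, cq, tq, xw)
        integer, intent(in)  :: nnodes
        real(8), intent(in)  :: cp(nnodes), cq(nnodes), tq
        real(8), intent(out) :: xw(nnodes)
        xw = (cq - cp) / tq
      end subroutine excess_work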

Page 32

LANL’s Parallel Ocean Program (POP)

Successive global reductions on scalars (7 total) degrade parallel efficiency: a 12% loss in scaling is due to scalar reductions, 7% in this routine alone.
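The slide diagnoses the problem rather than prescribing a fix, but the textbook remedy, where the reductions are independent of one another, is to batch the scalars into one vector reduction so a single network latency is paid instead of seven; a sketch (illustrative, not POP's code):

      ! Batch seven independent scalar all-reduces into one vector
      ! MPI_Allreduce (illustrative sketch, not POP's code).
      subroutine batched_sums(vals, sums, ierr)
        use mpi
        implicit none
        real(8), intent(in)  :: vals(7)
        real(8), intent(out) :: sums(7)
        integer, intent(out) :: ierr
        call MPI_Allreduce(vals, sums, 7, MPI_DOUBLE_PRECISION, &
                           MPI_SUM, MPI_COMM_WORLD, ierr)
      end subroutine batched_sums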

Page 33

Why Does S3D Performance Degrade?

Let’s explore the nature of the problem …

Communication overhead is an interaction between:
– the logical communication topology of S3D
– the network topology of the Cray XT3/XT4
– the mapping of S3D’s logical topology onto the Cray XT3/XT4
– other factors …

• link latency and bandwidth

• communication volume

• fraction of message latency that is overlapped with computation

Page 34

Bisection Bandwidth on a Torus Network

How much communication bandwidth crosses between halves?

Consider an ideal embedding of the S3D mesh in the torus.

[Figure: X × Y × Z torus bisected into halves]

Bandwidth crossing between halves: YZ × “bandwidth between a pair of comm. partners”

Page 35

Bisection Bandwidth on a Torus Network

How much communication bandwidth crosses between halves?

Consider a random embedding of the S3D mesh in the torus.

[Figure: X × Y × Z torus bisected into halves]

Bandwidth crossing between halves: O(XYZ) × “bandwidth between a pair of comm. partners”

Page 36

Mapping as a Potential Scalability Issue?

• Communication crossing between halves for different mappings
– ideal: YZ × “bandwidth between a pair of comm. partners”
– random: O(XYZ) × “bandwidth between a pair of comm. partners”

• Moral
– a bad mapping could increase communication significantly

• a random mapping yields O(X) times the bisection communication

• Next steps
– investigate the impact of logical-to-physical node mapping on the Cray XT

• issues

– congestion: the maximum number of logical links that map to a physical link

– dilation: the longest path between a pair of communication partners

• assess the impact of congestion and dilation on performance

– explore better node mapping (and perhaps allocation) strategies

Page 37

A Plan for Action (Part 1)

Enabling technologies for petascale computing

• Enhance and deploy performance measurement and analysis tools
– sampling-based tools for measuring application performance
– automatic analysis of scalability bottlenecks
– cluster analysis of ensembles of processes
– insights into node performance bottlenecks

• Enhance node compiler technology for scientific systems
– source-to-source tools for optimizing Fortran loop nests
– analysis and source-to-source code generation for multicore processors

• Co-array Fortran compiler for Cray XT and IBM Blue Gene
– CAF refinements for expressiveness and performance

Page 38

A Plan for Action (Part 2)

Application engagement

• S3D
– improve mapping: logical topology → physical nodes
– analyze and exploit opportunities for tailoring loop nests
– explore alternatives for derivative computations

• XGC1
– identify and exploit opportunities for tuning node performance

• GTC
– use space-filling curves to reorder particles to improve data locality

Page 39

GTC: Gyrokinetic Toroidal Code

• Charged particles follow spiral paths around magnetic field lines

• Plasma turbulence arises from the temperature difference between outer and inner regions
– provides a means for particles in the plasma to move toward the outer edges of the reactor rather than fusing with other particles

• Major challenge: use simulations to better understand and minimize the problem of turbulence
– theory and experimental results differ; use simulation to gain insight

Developed by SciDAC-funded Gyrokinetic Particle Simulation Center

Figure Credit: SciDAC final report, 2006

Page 40

GTC: Boosting Locality by Ordering Particles

• Proposed approach
– order particles in the plasma by their position along a space-filling curve

• Expected benefits
– better locality of access for particle-to-cell and cell-to-particle interactions

[Figures: top view of particles in 1/8 tokamak; 3D view of particles in 1/8 tokamak; Hilbert order of particles in 1/8 tokamak]
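As a sketch of the ordering idea (the slide proposes Hilbert order; the simpler Morton/Z-order key below is a stand-in, not GTC code): map each particle's cell coordinates to a 1D key by bit interleaving, then sort particles by key so spatial neighbors become memory neighbors.

      ! Morton (Z-order) key sketch, a stand-in for the Hilbert order the
      ! slide proposes: interleave the bits of the cell indices so that
      ! sorting particles by key groups spatial neighbors together.
      function morton_key(ix, iy) result(key)
        implicit none
        integer, intent(in) :: ix, iy   ! cell indices in [0, 2**16)
        integer(8) :: key
        integer :: b
        key = 0_8
        do b = 0, 15
          if (btest(ix, b)) key = ibset(key, 2*b)     ! even bits from ix
          if (btest(iy, b)) key = ibset(key, 2*b+1)   ! odd bits from iy
        end do
      end function morton_key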

Page 41

Acknowledgments

• HPCToolkit Team
– Michael Fagan
– Mark Krentel
– Nathan Tallent

• LoopTool Team
– Apan Qasem

• S3D Studies
– Yuan Zhao
– Apan Qasem

• GTC Study
– Guohua Jin
– Gabriel Marin