Adams_SIAMCSE15
Mark F. Adams
Jed Brown
Matt Knepley
Segmental Refinement:
A Multigrid Technique for Data Locality
Communication costs at all levels of memory hierarchy have been increasing relative to processor speeds for 30+ years
Energy constraints exacerbate the cost of memory movement
• Tipping point where radical ideas should be investigated
Brandt (1977) proposed Segmental Refinement (SR)
Serial ultra-low memory multigrid method
Brandt & Diskin (1994) applied the idea to parallel processing
We continue this work
• First published multilevel numerical results
SR uses buffering and does not “communicate” on the finest levels
• Does not replicate multigrid semantics exactly
Not brute force “communication avoiding”
• Game: keep textbook MG efficiency, asymptotically exact
Rethinking multigrid solvers for new architectures
Multigrid Communication Patterns: vertical (tree) & horiz.
FMG starts with accurate solve on coarsest grid
Nearest neighbor intra grid comm.
Inter grid tree++ comm.
Refine grid, split processes, building tree
FMG goes back to coarse grid after each new level
Back down – Two level FMG
refine & populate procs
Populate & refine
Fully populated – continue refinement
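The FMG traversal described on these slides (accurate solve on the coarsest grid, then a return to the coarse grid after each new level) can be sketched as a list of visited levels. This is only an illustration of the visiting order, not the authors' solver; the function name `fmg_level_schedule` and the parameter `v_cycles_per_level` are hypothetical, and level 0 is assumed to be the coarsest grid.

```python
def fmg_level_schedule(num_levels, v_cycles_per_level=1):
    """Return the order in which grid levels are visited by an FMG cycle.

    Level 0 is the coarsest grid (assumption for this sketch).
    """
    if num_levels < 1:
        raise ValueError("need at least one level")
    visits = [0]  # FMG starts with a solve on the coarsest grid
    for finest in range(1, num_levels):
        visits.append(finest)  # FMG prolongation to the new level
        for _ in range(v_cycles_per_level):
            # V-cycle: restrict down to the coarsest grid ...
            visits.extend(range(finest - 1, -1, -1))
            # ... then prolong back up to the current finest level
            visits.extend(range(1, finest + 1))
    return visits


if __name__ == "__main__":
    print(fmg_level_schedule(3))
```

Every transition between adjacent levels in this list is a vertical (tree) communication step; the work done on each visited level is where the horizontal, nearest-neighbor communication occurs.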
Nearest neighbor intra grid comm.
Inter grid tree+ comm.
SR removes horizontal communication at some scale
Segmental refinement removes communication on finest levels
SR uses a conventional FMG solver as the “coarse” grid solver
Buffer cells added to finer grids & not updated
Error decays
Experimentally find adequate buffer schedule for acceptable level of accuracy
Segmental refinement technique – Brandt & Diskin
O(1)
O(log N)
Use linear buffer schedule: number of buffer cells J on level i is J = A + B(K – i), i > 0
• K SR grids, i = 0 transition level
• A & B independent integer parameters ≥ 0
• Only few (A) buffer cells on fine grid, more on coarse grids
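The linear buffer schedule above can be written out directly. The sketch below is a minimal illustration; the function name `buffer_schedule` is hypothetical, and it assumes level K is the finest SR grid, consistent with the note that only A buffer cells are used on the fine grid.

```python
def buffer_schedule(a, b, k):
    """Buffer cells per SR level: J(i) = A + B * (K - i) for i = 1..K.

    a, b: non-negative integer parameters A and B from the slide.
    k: number of SR grids K (level K is assumed to be the finest).
    """
    if a < 0 or b < 0 or k < 1:
        raise ValueError("require A >= 0, B >= 0, K >= 1")
    return {i: a + b * (k - i) for i in range(1, k + 1)}


if __name__ == "__main__":
    # Example with A = 4, B = 1, K = 4 (values chosen for illustration)
    for level, cells in buffer_schedule(4, 1, 4).items():
        print(f"level {level}: {cells} buffer cells")
```

With B = 0 every level gets the same A buffer cells; a larger B widens the buffers on the coarser SR levels only.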
Model problem
• Cell-centered finite difference 27-point stencil Laplacian
• Cartesian isotropic grids
• Piecewise constant Restriction, linear Prolongation
• 2nd order Chebyshev smoother in V(2,2) cycles
• Full multigrid with linear FMG prolongation
• u = (x^4 - L_x x^2)(y^4 - L_y y^2)(z^4 - L_z z^2); L = (2, 1, 1)
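As a rough illustration of the grid-transfer operators named in the model problem, here is a one-dimensional, cell-centered sketch. The slides use 3D grids; the 1D reduction, the function names, the averaging scaling of the restriction, and the constant extension at the domain ends are all assumptions made for this example.

```python
def restrict_piecewise_constant(fine):
    """Coarse cell value = average of its two fine children (1D sketch)."""
    if len(fine) % 2 != 0:
        raise ValueError("fine grid must have an even number of cells")
    return [0.5 * (fine[2 * i] + fine[2 * i + 1]) for i in range(len(fine) // 2)]


def prolong_linear(coarse):
    """Cell-centered linear interpolation: 3/4 parent + 1/4 nearest neighbor.

    At the ends of the domain the missing neighbor is replaced by the
    parent value itself (an assumption, not a boundary condition from the talk).
    """
    n = len(coarse)
    fine = []
    for i, c in enumerate(coarse):
        left = coarse[i - 1] if i > 0 else c
        right = coarse[i + 1] if i < n - 1 else c
        fine.append(0.75 * c + 0.25 * left)   # left child
        fine.append(0.75 * c + 0.25 * right)  # right child
    return fine


if __name__ == "__main__":
    print(restrict_piecewise_constant([1.0, 3.0, 5.0, 7.0]))
    print(prolong_linear([2.0, 6.0]))
```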
Segmental refinement parameters of interest:
• A: Number of buffer cells in finest grid
• B: Increase buffer cells per level
• N0: Size of transition level sub-domain
Accuracy w.r.t. segmental refinement parameters
Probe the 5D parameter space (A, B, N0, K, esr)
Fine grid buffer size A = 4
Increment B    N0 (K) = 16 (6)    8 (5)    4 (4)
0              5.7                2.6      1.2
1              2.0                1.4      1.0
2              1.3                1.1      NA
3              1.1                1.0      NA
Ratio (esr/econv) of SR error to conventional MG solver error
Define acceptable error as ~10% (esr/econv ~1.1)
Observe isosurfaces …
Implies N0 & B increase with K
Dependence on N0 not recognized in Brandt & Diskin analysis
• This needs corroboration & possibly amelioration
Implies new data model for asymptotic analysis …
More data in paper
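The acceptance criterion (esr/econv of about 1.1) can be applied mechanically to the tabulated ratios for A = 4. The sketch below copies the numbers from the table on this slide; the names `RATIOS`, `K_FOR_N0`, and `smallest_acceptable_increment` are hypothetical, and treating the threshold as "ratio <= 1.1" is an assumption about how the approximate criterion would be applied.

```python
# esr/econv for A = 4, indexed as RATIOS[N0][B]; None marks the "NA" entries.
RATIOS = {
    16: {0: 5.7, 1: 2.0, 2: 1.3, 3: 1.1},
    8: {0: 2.6, 1: 1.4, 2: 1.1, 3: 1.0},
    4: {0: 1.2, 1: 1.0, 2: None, 3: None},
}

# Number of SR levels K paired with each N0 in the table header.
K_FOR_N0 = {16: 6, 8: 5, 4: 4}


def smallest_acceptable_increment(n0, tolerance=1.1):
    """Smallest buffer increment B whose error ratio is within tolerance."""
    for b in sorted(RATIOS[n0]):
        ratio = RATIOS[n0][b]
        if ratio is not None and ratio <= tolerance:
            return b
    return None  # no tabulated B meets the tolerance


if __name__ == "__main__":
    for n0 in sorted(RATIOS, reverse=True):
        b = smallest_acceptable_increment(n0)
        print(f"N0 = {n0}, K = {K_FOR_N0[n0]}: smallest acceptable B = {b}")
```

Read this way, the tabulated data give a larger required B for the columns with larger K, which is the trend noted on the slide.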
Data movement complexity analysis – new data model
Machine model (that is proper “basis” for new DM)
• Q words (small patches) memory and processes on fine grid
• √Q memory partitions
• √Q words per memory partition
New DM: transition level fits on one memory partition
Log(NK)/2 + 1
Log(NK)/2
Data movement complexity analysis
Complexity model (again proper “basis” for new DM):
• Near – intra-partition – communication
• Far – inter-partition – communication
• Horizontal communication (residual, etc.): cH
• Vertical communication (Restrict & Prolong): cV
Comm. type          Near (L2/8)      Far (L2/8)
Coarse grids        3(6cH + 2cV)     0
Conv. fine grids    6cH              6cH + 2cV
SR fine grids       6cH              0 + 2cV
SR removes fine grid communication
Unlike conventional communication avoiding, SR reduces bisection bandwidth. In 3D: N^2 -> N log2(N)
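The entries of the communication table can be evaluated for given unit costs to see where SR saves. This is a small sketch that only transcribes the table; the function name `communication_costs`, the dictionary keys, and the example unit costs are hypothetical, and the common factor shown in the column headers is left out.

```python
def communication_costs(c_h, c_v):
    """Near and far communication terms from the table, per grid class.

    c_h: unit cost of horizontal communication (cH on the slide).
    c_v: unit cost of vertical communication (cV on the slide).
    """
    return {
        "coarse": {"near": 3 * (6 * c_h + 2 * c_v), "far": 0},
        "conv_fine": {"near": 6 * c_h, "far": 6 * c_h + 2 * c_v},
        "sr_fine": {"near": 6 * c_h, "far": 2 * c_v},
    }


if __name__ == "__main__":
    costs = communication_costs(c_h=1.0, c_v=1.0)  # illustrative unit costs
    saved = costs["conv_fine"]["far"] - costs["sr_fine"]["far"]
    print("far cost, conventional fine grids:", costs["conv_fine"]["far"])
    print("far cost, SR fine grids:", costs["sr_fine"]["far"])
    print("far horizontal cost removed by SR:", saved)
```

The difference between the two fine-grid rows is exactly the far horizontal term 6cH; the vertical term 2cV remains in both.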
Conclusion: SR severs horizontal comm. at some scale
log NK1-K2
K2= O(1)
Designed two SR data models
• Each removes horizontal communication at a level in the memory hierarchy
Multiple SR models combined address all levels of interest …
Future work:
• Corroborate: try higher-order prolongation, vertex-centered FE
• Extend to more apps
• New SR data models
• ...
Weak scaling: 2 to 8K sockets on Edison (Cray XC30)
Some indication of SR catching up (4 SR levels)
O(1) SR levels, not log(N)
Thank you
SISC paper, code, data, parse & run scripts:
https://bitbucket.org/madams/srgmg