Iterative Partition Improvement using Mesh Topology for Parallel Adaptive Analysis
M.S. Shephard, C. Smith, M. Zhou
Scientific Computation Research Center, RPI
K.E. Jansen
University of Colorado
Outline
  Status of parallel adaptive simulation
  Tools to support parallel mesh operations
  Partition improvement via mesh adjacencies
Parallel Adaptive Analysis
Components
  Solver
    Form the system of equations
    Solve the system of equations
  Parallel mesh representation/manipulation
    Partitioned mesh management
  Mesh adaptation procedure
    Driven by error estimates and/or correction indicators
    Mesh improvement procedures
  Dynamic load balancing
    Need to regain load balance
All components must operate in parallel
Prefer to exercise parallel control of all operations with the same overall parallel structure – improves scalability
(Figure: initial mesh and adapted mesh)
Parallel Adaptive Analysis Tools Used
An implicit stabilized finite element code (PHASTA)
  Supports simulations on general unstructured meshes including boundary layers and anisotropic meshes
  Stabilized methods can provide high-order convergence while maintaining good numerical properties
  Equations solved using iterative methods
Parallel mesh representation/manipulation
  ITAPS iMeshP/FMDB
  Utilities for partition management and predictive load balancing
Mesh adaptation based on mesh modification
  Avoids the accuracy and data problems associated with full remeshing
  With a sufficient number of modification operators and control algorithms, mesh modification can support general changes to the mesh
"Graph-based" dynamic load balancing
  Classic partition graph for initial distribution
  Graph defined by mesh topology for partition improvement
NS Flow Solver
Implicit non-linear FEM solver with two phases of computation:
  Equation formation (Eqn. form.) – depends on elements
  Equation solution – depends on degrees-of-freedom (dofs)
Weak form – defined as a sum of element integrals
Quadrature – simply adds a loop inside the element loop
Assembly – as elements are processed, their element matrices and vectors are assembled
Iterative solver
…
Parallel strategy: both compute stages operate off the same mesh partition – critical to scalability
Partition defines inter-part relations (part-to-part comm.)
Parallelization
(Figure: the same three-part partition, Parts A, B, and C, used during Eqn. form. and during Eqn. sol.)
Locally, incomplete values (in b, A, q, etc.) for shared dofs.
Apply communications to complete values/entries (in b, q only).
Current Approach – Parallelization
Communications to complete values/entries and norms (same overall calculations as methods based on a global matrix):
  values accumulated on owners, values updated on non-owners
  dofs are shared between parts
(Figure: Parts A, B, and C; control relationship between images, solid dots indicate owner images)
  complete b
  complete q (for on-part q = Ap)
  on-part norm for q and all-reduce (use complete q)
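A minimal sketch of the owner-based completion of shared-dof values described above (illustrative only, not PHASTA code; the part/ownership layout is made up): partial sums from all images of a shared dof are accumulated on the owner image, and the completed value is then copied back to the non-owner images.

```cpp
// Sketch: complete shared-dof values by accumulating partial sums on the
// owner image and copying the result back to the non-owner images.
#include <cstdio>
#include <map>
#include <vector>

struct DofImage { int part; bool owner; double partial; };

int main() {
  // Global dof id -> its images on the parts that share it (illustrative data).
  std::map<int, std::vector<DofImage>> shared = {
      {7,  {{0, true, 1.5}, {1, false, 0.5}, {2, false, 0.25}}},
      {11, {{1, true, 2.0}, {2, false, 1.0}}}};

  for (auto& [dof, images] : shared) {
    // 1) accumulate: owner collects the partial values of all images
    double complete = 0.0;
    for (const auto& im : images) complete += im.partial;
    // 2) update: completed value is pushed back to the non-owner images
    for (auto& im : images) im.partial = complete;
    std::printf("dof %d completed value %.2f\n", dof, complete);
  }
  return 0;
}
```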
Parallel Implicit Flow Solver – Incompressible
Abdominal Aorta Aneurysm (AAA) 105 Million Elements
IBM BG/L (RPI-CCNI)
Cores (avg. elems./core)    t (secs.)    scale factor
   512 (204800)              2119.7      1 (base)
  1024 (102400)              1052.4      1.01
  2048 (51200)                529.1      1.00
  4096 (25600)                267.0      0.99
  8192 (12800)                130.5      1.02
 16384 (6400)                  64.5      1.03
 32768 (3200)                  35.6      0.93
32K parts show modest degradation due to 15% node imbalance (with only about 600 mesh-nodes/part)
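For reference, the scale factors in the table appear to be core-seconds relative to the 512-core base run (perfect strong scaling gives 1.0); a small sketch of that arithmetic, with the timings copied from the table:

```cpp
// Sketch: scale factor = (base cores * base time) / (cores * time).
#include <cstdio>

int main() {
  const struct { int cores; double t; } runs[] = {
      {512, 2119.7}, {1024, 1052.4}, {2048, 529.1}, {4096, 267.0},
      {8192, 130.5}, {16384, 64.5},  {32768, 35.6}};
  const double base = runs[0].cores * runs[0].t;  // core-seconds at 512 cores
  for (const auto& r : runs)
    std::printf("%6d cores: scale factor %.2f\n", r.cores,
                base / (r.cores * r.t));
  return 0;
}
```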
(Figure: per-part region/element and node ratios; x-axis: proc number, 0 to 1000; y-axis: ratio, 0.88 to 1.16; '+' marks the region ratio and 'x' marks the node ratio.)
Rgn./elem. ratio_i = rgns_i / avg_rgns
Node ratio_i = nodes_i / avg_nodes
Strong Scaling – 1B Mesh up to 160k Cores
AAA 1B elements: further scaling analysis (t_tot = t_comm + t_comp)
comm-to-comp ratio increases
AAA 1B elements: three supercomputers up to full-system scale
Strong Scaling – 1B Mesh up to 288k Cores
18.67 secs on the XT5 and 14.39 secs on the BG/P for 20 time steps
AAA 5B elements: full-system scale on Jugene (IBM BG/P)
Strong Scaling – 5B Mesh up to 288k Cores
Without PIMA the strong scaling factor is 0.88 (time is 70.5 secs); for production runs the savings can be 43 CPU-years.
initial mesh (20,067 tets)
adapted mesh (approx. 2M tets)
Mesh Adaptation by Local Mesh Modification
Controlled application of mesh modification operations, including dealing with curved geometries and anisotropic meshes
Base operators: swap, collapse, split, move
Compound operators chain single-step operators:
  Double split collapse operator
  Swap(s) followed by collapse operator
  Split, then move the created vertex
  Etc.
(Figure: edge collapse, edge split, face split)
Double split collapse to remove the red sliver
Matching Boundary Layer and Interior Mesh
A modification operation on any layer is propagated through the stack and interface elements to preserve attributes of layered elements.
Parallelization of refinement: perform on each part and synchronize at inter-part boundaries.
Parallelization of coarsening and swapping: migrate cavity (on-the-fly) and perform operation locally on one part.
Support for parallel mesh modification requires update of evolving communication-links between parts and dynamic mesh partitioning.
Parallel Mesh Adaptation
Tightly coupled
  Adv: Computationally efficient if done well
  Disadv: More complex code development
  Example: Explicit solution of cannon blasts
Loosely coupled
  Adv: Ability to use existing analysis codes
  Disadv: Overhead of multiple structures and data conversion
  Example: Implicit high-order FE code developed by DOE for EM simulations of linear accelerators
(Figure: solution snapshots at t=0.0, t=2e-4, t=5e-4)
Adaptive Loop Construction
The initial mesh has 7.1 million regions
The local mesh size is between 0.03 cm and 0.1 cm
The initial mesh is isotropic
Initial mesh boundary layer with 0.004 cm as the height of each layer
Patient Specific Vascular Surgical Planning
4 mesh adaptation iterations
The adapted mesh:42.8 million regions
7.1M->10.8M->21.2M->33.0M->42.8M
Boundary layer based mesh adaptation
Mesh is anisotropic
The minimum local size: 0.004 cm, maximum local size: 0.8 cm, and the height of the boundary layer: 0.004 cm.
Note: the inflow diameter is 3 cm, and the total model length is more than 150 cm.
Mesh adaptation driven by 2nd derivatives of appropriate solution field (velocity and pressure in current case)
Anisotropic adapted mesh
(Figure labels: Brain-left, Spleen, SMA)
Patient Specific Vascular Surgical Planning
Parallel Boundary Layer Adaptation
Initial mesh of 450k elements; final mesh of 4.5M elements
Example of Anisotropic Adaptation
Tools to Support Parallel Mesh Operations
Unstructured mesh tools used to support parallel adaptive analysis on massively parallel computers:
  Distributed mesh representation
  Mesh migration
  Dynamic load balancing
  Multiple parts per process
  Predictive load balancing
  Partition improvement
Distributed Mesh
  Mesh divided into parts for distribution onto nodes/cores
  Part P_i consists of the mesh entities assigned to the ith part.
Partition Object
  Basic unit assigned to a part
  A mesh entity to be partitioned
  An entity set (where the set of entities is required to be kept on the same part)
Residence Part
  Operator P[M_i^d] returns the set of part ids where M_i^d exists (e.g., P[M_1^0] = {P0, P1, P2})
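A tiny illustrative sketch of the residence-part operator P[M_i^d]: each entity maps to the set of part ids holding an image of it. The data layout here is hypothetical, not FMDB's internal structure.

```cpp
// Sketch: residence-part lookup for distributed mesh entities.
#include <cstdio>
#include <map>
#include <set>
#include <utility>

using Entity = std::pair<int, int>;  // (dimension d, index i) -> M_i^d

int main() {
  // Entity -> parts holding an image of it: partition-boundary entities
  // appear on several parts, interior entities on exactly one.
  std::map<Entity, std::set<int>> residence = {
      {{0, 1}, {0, 1, 2}},   // vertex M_1^0 shared by parts P0, P1, P2
      {{3, 42}, {1}}};       // region M_42^3 interior to part P1

  for (const auto& [e, parts] : residence) {
    std::printf("P[M_%d^%d] = {", e.second, e.first);
    for (int p : parts) std::printf(" P%d", p);
    std::printf(" }\n");
  }
  return 0;
}
```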
Distributed Mesh Data Structure
(Figure: entities M_i^0 and M_j^1 distributed across parts P0, P1, P2; the partition boundary is marked)
A Partition Model
Interoperable partition model implementation – iMeshP
  Defined by the DOE SciDAC center on Interoperable Tools for Advanced Petascale Simulations
  Supports unstructured meshes on distributed-memory parallel computers
  Focuses on supporting parallel interactions of serial (iMesh) on-part meshes
  Multiple implementations underway
  The implementation that supports parallel mesh adaptation is used here.
Mesh Migration
Movement of mesh entities between parts
  Dictated by the operation – in swap and collapse it is the mesh entities on other parts needed to complete the mesh modification cavity
  Information determined based on mesh adjacencies
Complexity of mesh migration is a function of the mesh representation
  A complete representation can provide any adjacency without mesh traversal – a requirement for satisfactory efficiency
  Both full and reduced representations can be complete
  Full representation – all mesh entities explicitly represented
  Reduced representation – requires all mesh vertices and the mesh entities of the same order as the model (or partition) entity they are classified on
Dynamic Partitioning of Unstructured Meshes
Load balance is lost during mesh modification
  Need to rebalance in parallel on an already distributed mesh
  Want equal "work load" with minimum inter-processor communications
Graph-based methods are flexible and provide the best results
Zoltan Dynamic Services (http://www.cs.sandia.gov/Zoltan/)
  Supports multiple dynamic partitioners
  General control of the definition of part objects and weights supported
  Under active development
Multiple Parts per Process
Support more than one part per process
  Key to changing the number of parts
  Also used to deal with the problem that current graph-based partitioners tend to fail on really large numbers of processors
• 1 billion region mesh starts as a well-balanced mesh on 2048 parts
• Each part splits to 64 parts – get 128K parts via 'local' partitioning
• Region imbalance: 1.0549
• Time usage: < 3 mins on Kraken
• This is the partition used for the scaling test on Intrepid
Multiple Parts per Process
Supports global and local graphs to create mesh partitions at extreme scale (>=100K parts):
Mesh and its initial partition (partition from 3 parts to 6 parts)
(Figure: global graph and global partition; local graphs and local partitions)
Predictive Load Balancing
Mesh modification before load balancing can lead to memory problems
Employ predictive load balancing to avoid the problem
  Assign weights based on what will be refined/coarsened
  Apply dynamic load balancing using those weights
  Perform mesh modifications
  May want to do some local migration
Algorithm (a sketch follows this list)
  The mesh metric field at any point P is decomposed into three unit directions (e1, e2, e3) and the desired length (h1, h2, h3) in each corresponding direction.
  The volume of the desired element (tetrahedron): h1h2h3/6
  Estimate the number of elements to be generated
  "num" is scaled to a good range before it is specified as a weight on the graph nodes
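A minimal sketch of the predictive weight estimation, under the assumption that the element count estimate "num" is the current element volume divided by the desired element volume h1h2h3/6; the slides do not show the exact formula or the scaling used, so both are assumptions here.

```cpp
// Sketch: predictive load-balancing weight per element (assumed details).
#include <cstdio>
#include <vector>

struct Elem { double volume; double h[3]; };  // current volume, desired sizes

double predictedElems(const Elem& e) {
  const double desiredVol = e.h[0] * e.h[1] * e.h[2] / 6.0;  // desired tet volume
  return e.volume / desiredVol;  // "num": expected element count after adaptation
}

int main() {
  std::vector<Elem> elems = {{1.0e-3, {0.05, 0.05, 0.05}},
                             {1.0e-3, {0.02, 0.02, 0.10}}};
  const double scale = 1.0e-2;  // scaling of "num" into a usable weight range (assumed)
  for (std::size_t i = 0; i < elems.size(); ++i) {
    const double num = predictedElems(elems[i]);
    std::printf("elem %zu: predicted %.0f elements, graph weight %.2f\n",
                i, num, num * scale);
  }
  return 0;
}
```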
Predictive Load Balancing
(Figure: number of regions per part)
  Initial: 8.7M; adapted: 80.5M – PredLB run on 128 cores
  Initial: 8.7M; adapted: 187M – test ran out of memory without PredLB on 200 cores
Partitioning Improvement via Mesh Adjacencies
Observations on graph-based dynamic balancing
  Parallel construction and balancing of a graph with small cuts takes reasonable time
  In the case of unstructured meshes, the graph is defined in terms of:
    Graph nodes as the mesh entity type to be balanced (e.g., regions, faces, edges or vertices)
    Graph edges based on the mesh adjacencies taken into account (e.g., graph edge between regions sharing a vertex, between regions sharing a face, etc.)
  Accounting for multiple criteria and/or multiple interactions is not obvious
  Hypergraphs, which allow edges to connect more than two vertices, may be of use – they have been used to help account for migration communication costs
Partition Improvement via Mesh Adjacencies
Mesh adjacencies for a complete representation can be viewed as a complete graph
  All entities can be "graph-nodes" at will
  Any desired graph edge obtained in O(1) time
Possible advantages
  Avoid graph construction (assuming you have the needed adjacencies)
  Account for multiple entity types – important for the solve process, typically the most computationally expensive step
Disadvantage
  Lack of well-developed algorithms for parallel graph operations
Easy to use simple iterative diffusive procedures
  Not ideal for "global" balancing
  Can be fast for improving partitions
Partition Improvement via Mesh Topology (PIMA)
Basic idea is to perform iterations exchanging mesh entities between neighboring parts to improve scaling
Current algorithm is focused on gaining better node balance to improve scalability of the solve by accounting for multiple criteria
  Example: Reduce the small number of node imbalance peaks at the cost of some increase in element imbalance
Similar approaches can be used to:
  Improve balance when using multiple parts per process – may be as good as a full rebalance for lower total cost
  Improve balance during mesh adaptation – likely want extensions past simple diffusive methods
Diffusive PIMA – advantages
  Imbalance in partitions obtained by graph/hypergraph-based methods, even when considering multiple criteria, is limited to a small number of heavily loaded parts, referred to as spikes, which limit the scalability of applications
  Uses mesh adjacencies – richer information provides the chance to produce better multi-criteria partitions
  All adjacencies are obtainable in O(1) operations (not a function of mesh size) – a requirement for efficiency
  Takes advantage of neighborhood communications – works well on massively parallel computations, since more controlled communication is used even at extreme scale
Diffusive PIMA is designed to migrate a small number of mesh entities on inter-part boundaries from heavily loaded parts to lightly loaded neighbors to improve load balance
PIMA: algorithm
Input:
  Types of mesh entities that need to be balanced (Rgn, Face, Edge, Vtx)
  The relative importance (priority) between them (= or >), e.g., "Vtx=Edge>Rgn" or "Rgn>Face=Edge>Vtx", etc.
  The balance of entities not specified in the input is not explicitly improved or preserved
  Mesh with complete representation
Algorithm:
  From high to low priority if separated by ">" (different groups)
  From low to high dimension, based on entity topologies, if separated by "=" (same group)
  Improve partition balance while reducing edge-cut and data migration costs for the current mesh entity
Example: "Rgn>Face=Edge>Vtx" is the user's input
  Step 1: improve balance for mesh regions
  Step 2.1: improve balance for mesh edges
  Step 2.2: improve balance for mesh faces
  Step 3: improve balance for mesh vertices
PIMA: algorithm
For each priority level, beginning with the highest priority
  For each mesh entity type, beginning with the lowest-dimension entity
    Compute the weighted partition connectivity matrix [HuBlake98]
    Compute the RHS vector
    Compute the diffusive solution; solve
    Select entities for migration:
      On the source part i, iterate over regions on the part boundary with the destination part j
        Let M = Ø  // set of entities to migrate
        While iterator->next and |markedEntities| < λi – λj
          If (candidateEntity(*iterator)) M = M ∪ *iterator
    Migrate mesh entities
What is a good choice for α ??
Can migrate via p2p communications between procs i and j – does FMDB use collective communication to do migration?
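A minimal serial sketch of the selection loop above (illustrative only, not FMDB/PIMA source): candidateEntity() stands in for the gain/density test of the algorithm, and the part loads and entity weights are made up.

```cpp
// Sketch: on the heavier source part i, walk regions on the part boundary
// shared with a lighter destination part j and mark candidates for
// migration until the load difference lambda_i - lambda_j is covered.
#include <cstdio>
#include <vector>

struct Region { int id; double weight; bool onBoundaryWithDest; };

// Hypothetical candidate test; the real test uses gain and density.
bool candidateEntity(const Region& r) { return r.onBoundaryWithDest; }

int main() {
  std::vector<Region> boundary = {{1, 1.0, true}, {2, 1.0, false},
                                  {3, 1.0, true}, {4, 1.0, true}};
  const double lambdaI = 10.0, lambdaJ = 7.0;  // source and destination loads

  std::vector<int> toMigrate;   // M: set of entities to migrate
  double marked = 0.0;
  for (const auto& r : boundary) {
    if (marked >= lambdaI - lambdaJ) break;   // enough load marked for migration
    if (candidateEntity(r)) {
      toMigrate.push_back(r.id);
      marked += r.weight;
    }
  }
  for (int id : toMigrate) std::printf("migrate region %d\n", id);
  return 0;
}
```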
PIMA: Candidate Parts
Mesh entities have weights associated with them representing computational load
Define the average computational load as …
Define the computational load on a part i as …
Part j is absolutely lightly loaded if …
Part j is relatively lightly loaded w.r.t. part i if …
PIMA: Candidate Mesh Entities
On the inter-part boundary between a source and a destination part, mesh regions are selected to be migrated s.t. the migration reduces the load imbalance, and the communication and data migration costs
Vertices have weights associated with them representing communication cost // let all entities have communication costs?
Regions have weights associated with them representing migration cost // let all entities have migration costs, possibly just regions since we only migrate regions?
The gain of a region measures the change in communication cost if the region were migrated from its source part to the destination part
  Accounts for the topological adjacencies of the region
  If the communication cost c is uniform, then high-gain entities are the spike and ridge nodes identified in the next slide
  The gain concept was defined in the context of a graph vertex: Kernighan, B. W., & Lin, S. (1969). An Efficient Heuristic Procedure for Partitioning Graphs.
PIMA: Candidate Mesh Entities
The gain(I) is the total gain of the part boundary; likewise for density(I) [line 5]
A region is only marked for migration if it has above-average gain and density [line 5]
The average gain and density are updated with the definition of the part boundary as entities are marked for migration [line 8]
The gain is updated for regions adjacent to the region marked for migration [line 11]
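A small illustrative example of the gain idea for a tetrahedral region, assuming a uniform communication cost per mesh face; this is a sketch of the Kernighan-Lin-style gain, not the exact cost model used by PIMA.

```cpp
// Sketch: migrating a tetrahedron to the destination part removes its faces
// currently on the part boundary from the cut and adds its faces shared with
// regions staying on the source part; the gain is the net reduction.
#include <cstdio>

struct Tet {
  int facesOnBoundaryWithDest;  // faces already shared with the destination part
  int facesInteriorToSource;    // faces shared with regions staying on the source part
};

int gain(const Tet& t) {
  // Positive gain: migration reduces the number of cut faces.
  return t.facesOnBoundaryWithDest - t.facesInteriorToSource;
}

int main() {
  const Tet spike = {3, 1};  // a "spike" region: 3 of 4 faces on the boundary
  const Tet flat  = {1, 3};  // a region with a single boundary face
  std::printf("spike gain = %d, flat gain = %d\n", gain(spike), gain(flat));
  return 0;
}
```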
PIMA: Candidate mesh entities
Vertex: The vertices on inter-part boundaries bounding a small number of regions on source part P0; tips of 'spikes'
Edge: The edges on inter-part boundaries bounding a small number of faces; ‘ridge’ edges with (a) 2 bounding faces, and (b) 3 bounding faces on source part P0
Face/Region: Regions which have two or three faces on inter-part boundaries; (a) ‘spike’ region (b) region on a ‘ridge’
Partition Improvement
Worst loaded part dictates the performance
Element-based partitioning results in spikes of dofs
(Figure: Parts A, B, and C; worst dof imbalance: 14.7%)
Partition Improvement
Heaviest loaded part dictates the performance
Element-based partitioning results in spikes of dofs
Parallel local iterative partition modifications
  Local mesh migration is applied from relatively heavy to light parts
  dof imbalance reduced from 14.7% to 4.92%
  element imbalance increased from 2.64% to 4.54%
Strong Scaling – 1B Mesh up to 160k Cores
AAA 1B elements: effective partitioning at extreme scale, with and without partition modification (PIMA)
(Figure: strong scaling with and without partition modification (PMod) up to the full system; see graph)
PIMA-Tests
133M region mesh on 16k parts
Table 1: User's input
Table 2: Balance of partitions
Table 3: Time usage and iterations (tests on Jaguar Cray XT5 system)
Closing Remarks
***** Needs update
Parallel adaptive unstructured mesh simulation is progressing toward petascale
  Solver scales well (on good machines)
  Mesh adaptation can run on petascale machines
    How well can we make them scale?
    How well do they need to scale?
Execution on massively parallel computers
  Introduces several new considerations
  Specific tools being developed to address these issues