Parallel Performance Diagnosis


Transcript of Parallel Performance Diagnosis

Page 1: Parallel Performance Diagnosis

Automatic Performance Diagnosis of Parallel Computations with Compositional Models

Li Li, Allen D. Malony
{lili, malony}@cs.uoregon.edu

Performance Research Laboratory
Department of Computer and Information Science
University of Oregon

Page 2: Parallel Performance Diagnosis

Parallel Performance Diagnosis

• Performance tuning process
  – Process to find and fix performance problems
  – Performance diagnosis: detect and explain problems
  – Performance optimization: repair the problems found
  – Diagnosis is critical to the efficiency of performance tuning

• Focus on performance diagnosis
  – Capture diagnosis processes
  – Integrate with performance experimentation and evaluation
  – Formalize the (expert) performance cause inference
  – Support diagnosis in an automated manner

Page 3: Parallel Performance Diagnosis

Generic Performance Diagnosis Process

• Design and run performance experiments
  – Observe performance under a specific circumstance
  – Generate the desired performance evaluation data

• Find symptoms (see the sketch after this list)
  – Observations deviating from performance expectations
  – Detect them by evaluating performance metrics

• Infer causes of symptoms
  – Relate symptoms to the program
  – Interpret symptoms at different levels of abstraction

• Iterate the process to refine the performance bug search
  – Refine the performance hypothesis based on the symptoms found
  – Generate more data to validate the hypothesis
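As a concrete illustration of the symptom-finding step, here is a minimal sketch assuming a flat profile given as per-process computation and communication times; the metric (communication fraction) and the 50% threshold are illustrative assumptions, not the evaluation rules used in this work.

```python
# Minimal sketch of the "find symptoms" step: evaluate a metric per process
# and flag deviations from an expectation. The profile layout, the metric,
# and the threshold are illustrative assumptions.

def find_symptoms(profile, comm_threshold=0.5):
    """Flag processes whose communication fraction exceeds the expectation."""
    symptoms = []
    for proc, times in profile.items():
        total = times["comp_time"] + times["comm_time"]
        comm_fraction = times["comm_time"] / total if total else 0.0
        if comm_fraction > comm_threshold:
            symptoms.append((proc, "high_communication", round(comm_fraction, 3)))
    return symptoms

# Example: process 0 spends most of its time communicating.
profile = {0: {"comp_time": 12.0, "comm_time": 50.0},
           1: {"comp_time": 30.0, "comm_time": 8.0}}
print(find_symptoms(profile))   # [(0, 'high_communication', 0.806)]
```

In a full diagnosis loop, such symptoms would drive the next, more detailed experiment rather than end the analysis.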

Page 4: Parallel Performance Diagnosis

Knowledge-Based Automatic Performance Diagnosis

• Experts analyze systematically and use experience
  – Implicitly use knowledge of code structure and parallelism
  – Are guided by that knowledge in conducting diagnostic analysis

• Knowledge-based approach
  – Capture knowledge about performance problems
  – Capture knowledge about how to detect and explain them
  – Apply the knowledge to performance diagnosis

• Performance knowledge (an encoding sketch follows this list)
  – Experiment design and specifications
  – Performance models
  – Performance metrics and evaluation rules
  – High-level performance factors / design parameters (causes)
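To make "metrics and evaluation rules" concrete, the sketch below encodes rules as plain data and applies them to measured metric values. The specific rules, thresholds, and factor descriptions are illustrative assumptions, not the actual Hercule knowledge base.

```python
# Sketch of encoding evaluation rules as data: each rule pairs a metric name
# with a predicate and the high-level performance factor it points to.
# The rules and thresholds here are illustrative assumptions.

evaluation_rules = [
    ("comm_fraction",  lambda v: v > 0.5, "communication-bound: inspect message frequency and volume"),
    ("load_imbalance", lambda v: v > 0.2, "uneven workload: inspect the balancing strategy"),
    ("barrier_wait",   lambda v: v > 0.3, "synchronization delay: inspect the dependency structure"),
]

def infer_factors(metrics):
    """Return the performance factors whose rules fire on the measured metrics."""
    return [factor for name, fires, factor in evaluation_rules
            if name in metrics and fires(metrics[name])]

print(infer_factors({"comm_fraction": 0.81, "load_imbalance": 0.05}))
```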

Page 5: Parallel Performance Diagnosis

Implications

• Where does the knowledge come from?
  – Extract it from parallel computational models
    Structural and operational characteristics
    Reusable parallel design patterns
  – Associate computational models with performance models
    Well-defined computation and communication patterns

• Model examples
  – Single models: Master-worker, Pipeline, AMR, ...
  – Compositional models

• Use model knowledge to diagnose performance problems
  – Engineer model knowledge
  – Integrate model knowledge with cause inference

Page 6: Parallel Performance Diagnosis

[Figure: Model-based Generic Knowledge Generation and Algorithm-specific Knowledge Extension. The generic pipeline proceeds through behavioral modeling, performance modeling, metrics definition, and inference modeling, producing abstract events, model-based metrics, a performance factor library, and metric-driven inference for performance bug search and cause inference. Algorithmic performance modeling and performance composition/coupling descriptions instantiate the generic stages, while algorithm-specific events, metrics, and factors extend and refine them.]

Page 7: Parallel Performance Diagnosis

Hercule Automatic Performance Diagnosis System

• Goals: automation, adaptability, extension, and reuse

[Figure: Hercule architecture. Computational models contribute model knowledge, augmented with algorithm-specific information, to a knowledge base of inference rules. The parallel program runs under a measurement system according to experiment specifications; an event recognizer and a metric evaluator process the resulting performance data, and the inference engine produces the diagnosis results: problems and their explanations.]

Page 8: Parallel Performance Diagnosis


Single Model Knowledge Engineered

• Master-worker

• Divide-and-conquer

• Wavefront (2D pipeline)

• Adaptive Mesh Refinement

• Parallel Recursive Tree

• Geometric Decomposition

• Related publications
  – L. Li and A. D. Malony, "Model-based Performance Diagnosis of Master-worker Parallel Computations", in Proceedings of Euro-Par 2006.
  – L. Li, A. D. Malony, and K. Huck, "Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations", in Proceedings of HPCC 2006.
  – L. Li and A. D. Malony, "Knowledge Engineering for Automatic Parallel Performance Diagnosis", to appear in Concurrency and Computation: Practice and Experience.

Page 9: Parallel Performance Diagnosis

Characteristics of Model Composition

• Compositional model
  – Combines two or more models
  – Interaction changes the individual models' behaviors
  – The composition pattern affects performance

• Model abstraction for describing composition (a representation sketch follows this list)
  – Computational component set {C1, C2, ..., Ck}
  – Relative control order F(C1, C2, ..., Ck)
  – Integrate the component sets in a compositional model

• Composition forms
  – Model nesting
  – Model restructuring
  – Different implications for performance knowledge engineering
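The abstraction above, a component set plus a relative control order F over it, can be written down directly. The sketch below uses a plain dictionary representation with a master-worker example; the component names and the list-based encoding of F are simplifying assumptions.

```python
# Sketch of the model abstraction: a component set {C1, ..., Ck} plus a
# relative control order F, encoded here (simplistically) as an ordered list.
# Model and component names are illustrative assumptions.

master_worker = {
    "name": "Master-worker",
    "components": {"assign_task", "compute_task", "return_result"},
    # F(C1, ..., Ck): the relative order in which the components are exercised.
    "control_order": ["assign_task", "compute_task", "return_result"],
}

pipeline = {
    "name": "Pipeline",
    "components": {"stage_1", "stage_2", "stage_3"},
    "control_order": ["stage_1", "stage_2", "stage_3"],
}
```

The two composition forms that follow, nesting and restructuring, can then be seen as operations over such descriptions.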

Page 10: Parallel Performance Diagnosis

Model Nesting

• Formal representation
  – Two models: root F(C1, C2, ..., Ck) and child G(D1, D2, ..., Dl)

    F(C1, C2, ..., Ck) + G(D1, D2, ..., Dl) →
        F(C1{G(D1, D2, ..., Dl)}, C2{G(D1, D2, ..., Dl)}, ..., Ck{G(D1, D2, ..., Dl)})

    where Ci{G(D1, D2, ..., Dl)} means that component Ci implements the G model.
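Reusing the dictionary representation from the earlier sketch, nesting replaces each root component Ci with Ci{G(D1, ..., Dl)}, i.e., Ci implemented by the child model. The AMR/PRT pairing below only illustrates the shape of the result; the component names are assumptions.

```python
# Sketch of the nesting transformation: each root component Ci becomes
# Ci{G(D1, ..., Dl)}, i.e., Ci is implemented by the child model G.
# The AMR and PRT component names are illustrative.

def nest(root, child):
    """Return F(C1{G}, ..., Ck{G}) for root model F and child model G."""
    return {
        "name": f"{root['name']}{{{child['name']}}}",
        "components": [{"component": c, "implemented_by": child["control_order"]}
                       for c in root["control_order"]],
    }

amr = {"name": "AMR", "control_order": ["refine", "guardcell_fill", "restrict"]}
prt = {"name": "PRT", "control_order": ["comm_to_parent", "comm_to_child", "comm_to_sibling"]}

nested = nest(amr, prt)
print(nested["name"])            # AMR{PRT}
print(nested["components"][0])   # 'refine', implemented by the PRT operations
```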

Page 11: Parallel Performance Diagnosis

Model Nesting (contd.)

• Examples
  – Iterative, multi-phase applications
  – FLASH, developed by the DOE-supported ASC/Alliances Center for Astrophysical Thermonuclear Flashes

• Implications for performance diagnosis
  – The hierarchical model structure dictates the analysis order
  – Refine problem discovery from the root to the child
  – Preserve the performance features of the individual models

Page 12: Parallel Performance Diagnosis

Model Restructuring

• Formal representation
  – Two models: F(C1, C2, ..., Ck) and G(D1, D2, ..., Dl)

    F(C1, C2, ..., Ck) + G(D1, D2, ..., Dl) →
        H(({C1^F, ..., Ck^F} | {D1^G, ..., Dl^G})+)

    where {C1^F, ..., Ck^F} | {D1^G, ..., Dl^G} selects a component Ci^F or Dj^G while preserving the relative component order within F and within G, and H is the new control function ruling all components.
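One way to make the restructuring constraint tangible: H may interleave the components of F and G, but the relative order within each original model must be preserved. The sketch below checks that property for candidate schedules; the representation and the example schedules are illustrative assumptions.

```python
# Sketch of the restructuring constraint: H yields an interleaving of the
# components of F and G that preserves each model's internal relative order.
# The models and candidate schedules are illustrative.

def preserves_order(schedule, model):
    """True if the model's components appear in `schedule` in their original order."""
    own = [c for c in schedule if c in model["components"]]
    return own == model["components"]

f = {"name": "F", "components": ["C1", "C2", "C3"]}
g = {"name": "G", "components": ["D1", "D2"]}

valid_schedule   = ["C1", "D1", "C2", "D2", "C3"]   # a legal ordering under some H
invalid_schedule = ["D2", "C1", "D1", "C2", "C3"]   # violates G's internal order

print(preserves_order(valid_schedule, f), preserves_order(valid_schedule, g))   # True True
print(preserves_order(invalid_schedule, g))                                     # False
```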

Page 13: Parallel Performance Diagnosis

Adapt Performance Knowledge to Composition

• Objective: discover and interpret performance effects caused by model interaction

• Model nesting (a merge sketch follows this list)
  – Behavioral modeling
    Derive F(C1, C2, ..., Ck) from single-model behaviors
    Replace the affected root components with the child model's behaviors
  – Performance modeling and metric formulation
    Unite overhead categories according to the nesting hierarchy
    Evaluate overheads according to the model hierarchy
  – Inference modeling
    Represent the inference process as an inference tree
    Merge the inference steps of the participating models
    Extend root-model inferences with the implementing child model's inferences
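The inference-modeling step above can be pictured as grafting the child model's inference subtrees under the root-model nodes that the child implements. The sketch below uses a {node: [children]} encoding; the node names are illustrative and only loosely follow the AMR and PRT trees shown later.

```python
# Sketch of extending a root inference tree with child-model subtrees at the
# nodes the child model implements. Trees are {node: [child nodes]}; the
# node names are illustrative.

def extend_inference_tree(root_tree, child_tree, attach_points):
    """Attach the child model's top-level subtrees under selected root nodes."""
    merged = {node: list(children) for node, children in root_tree.items()}
    # Top-level nodes of the child tree: nodes that appear under no other node.
    child_roots = [n for n in child_tree
                   if not any(n in kids for kids in child_tree.values())]
    for node in attach_points:
        merged[node] = merged.get(node, []) + child_roots
    for node, children in child_tree.items():
        merged.setdefault(node, list(children))
    return merged

amr_tree = {"low_speedup": ["AMR_Guardcell", "AMR_Restrict"],
            "AMR_Guardcell": ["data_fetch"],
            "AMR_Restrict": []}
prt_tree = {"parent_comm": ["tree_depth"],
            "sibling_comm": ["tree_node_contiguity"]}

flash_tree = extend_inference_tree(amr_tree, prt_tree,
                                   attach_points=["AMR_Guardcell", "AMR_Restrict"])
print(flash_tree["AMR_Guardcell"])   # ['data_fetch', 'parent_comm', 'sibling_comm']
```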

Page 14: Parallel Performance Diagnosis

Model Nesting Case Study: FLASH

• FLASH
  – Parallel simulations in astrophysical hydrodynamics
  – Uses Adaptive Mesh Refinement (AMR) to manage meshes
  – Uses a Parallel Recursive Tree (PRT) to manage mesh data
  – Model nesting: root AMR model, child PRT model; AMR implements the PRT data operations

Page 15: Parallel Performance Diagnosis

Single Model Characteristics

• AMR operations
  – AMR_Refinement: refine a mesh grid
  – AMR_Derefinement: coarsen a mesh grid
  – AMR_LoadBalancing: even out the workload after refinement or derefinement
  – AMR_Guardcell: update the guard cells at the boundary of every grid block with data from the neighbors
  – AMR_Prolong: prolong the solution to newly created leaf blocks after refinement
  – AMR_Restrict: restrict the solution up the block tree after derefinement
  – AMR_MeshRedistribution: redistribute the mesh when balancing the workload

• PRT operations (a mapping sketch follows this list)
  – PRT_comm_to_parent: communicate with the parent processor
  – PRT_comm_to_child: communicate with a child processor
  – PRT_comm_to_sibling: communicate with a sibling processor
  – PRT_build_tree: initialize the tree structure, or migrate part of the tree to another processor and rebuild the connection
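Connecting these model-level operations to measurement requires an event recognizer that maps low-level trace events onto them. The sketch below shows the idea with a hand-written mapping; the event names and the mapping are hypothetical placeholders, not FLASH's or Hercule's actual event tables.

```python
# Sketch of an event-recognizer mapping: accumulate measured event durations
# into the model-level AMR and PRT operations listed above. The event names
# and the mapping are hypothetical placeholders.

OPERATION_OF_EVENT = {
    "guardcell_fill": "AMR_Guardcell",
    "prolong":        "AMR_Prolong",
    "restrict":       "AMR_Restrict",
    "msg_to_parent":  "PRT_comm_to_parent",
    "msg_to_child":   "PRT_comm_to_child",
    "msg_to_sibling": "PRT_comm_to_sibling",
}

def time_per_operation(trace):
    """Sum (event_name, duration) records into model-level operation totals."""
    totals = {}
    for event_name, duration in trace:
        operation = OPERATION_OF_EVENT.get(event_name, "other")
        totals[operation] = totals.get(operation, 0.0) + duration
    return totals

trace = [("guardcell_fill", 3.2), ("msg_to_sibling", 1.1), ("restrict", 0.7)]
print(time_per_operation(trace))
```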

Page 16: Parallel Performance Diagnosis

AMR Inference Tree

[Figure: AMR inference tree. The symptom "low speedup" splits into computation and communication; AMR communication is refined into AMR_Guardcell, AMR_Refine, AMR_Derefine, AMR_Prolong, AMR_Restrict, and AMR_Workbalance subtrees (with intermediate observations such as check_refine, check_neighbor, inform_neighbor, data_fetch, leaf_restrict, parent_prolong, calculate_blocks_weight, sort_blocks, migrate_blocks, and rebuild_block_connection), leading to performance factors such as refine_levels, refine_freq., guardcell_size, balance_strategy, block_weight_assign_method, data_contiguity_in_cache, and physical_block_contiguity. Legend: symptoms, intermediate observations, performance factors, inference direction.]

Page 17: Parallel Performance Diagnosis

PRT Inference Tree

[Figure: PRT inference tree. The symptom "low speedup" splits into computation and communication; PRT communication is refined into parent_comm, child_comm, sibling_comm, other_comm, and build_tree subtrees (numbered 1 to 5). Lookup and data-transfer observations lead to performance factors such as tree_node_contiguity, tree_depth, data_contiguity, fetch_freq., and migrate_strategy; build_tree covers init._tree, migrate_subtree, and rebuild_connection. Legend: symptoms, intermediate observations, performance factors, inference direction.]

Page 18: Parallel Performance Diagnosis

FLASH Inference Tree

[Figure: FLASH inference tree, obtained by extending the AMR inference tree with the PRT inference subtrees. Nodes such as AMR_Guardcell, AMR_Refine, AMR_Workbalance, and AMR_comm. are annotated with the numbers (1 to 5) of the relevant PRT subtrees, which refine the performance-problem search into the PRT tree; performance factors include refine_levels, refine_freq., guardcell_size, balance_strategy, block_weight_assign_method, physical_block_contiguity, and data_contiguity_in_cache.]

Page 19: Parallel Performance Diagnosis

Experiment with FLASH v3.0

• Sedov explosion simulation in FLASH3
• Test platform: IBM pSeries 690 SMP cluster, using 8 processors
• Execution profiles of a problematic run (ParaProf view)

Page 20: Parallel Performance Diagnosis

Diagnosis Results Output (Steps 1 & 2)

• Step 1: find the performance symptom
• Step 2: look at the root AMR model's performance

Begin diagnosing ...
========================================================
Begin diagnosing AMR program ...
Level 1 experiment -- collect performance profiles with respect to
computation and communication.
______________________________________________________________
do experiment 1 ...
Communication accounts for 80.70% of run time.
Communication cost of the run degrades performance.
========================================================

=========================================================
Level 2 experiment -- collect performance profiles with respect to
AMR refine, derefine, guardcell-fill, prolong, and workload-balance.
________________________________________________________________
do experiment 2 ...
Processes spent 4.35% of communication time in checking refinement,
2.22% in refinement, 13.83% in checking derefinement (coarsening),
1.43% in derefinement, 49.44% in guardcell filling,
3.44% in prolongating data, and
9.43% in dealing with work balancing.
=========================================================

Page 21: Parallel Performance Diagnosis

Step 3: Interpret Expensive guardcell_filling with PRT Performance

====================================================================
Level 3 experiment for diagnosing grid guardcell-filling related problems --
collect performance event trace with respect to restriction, intra-level and
inter-level commu. associated with the grid block tree.
____________________________________________________________________
do experiment 3 ...

Among the guardcell-filling communication, 53.01% is spent restricting the
solution up the block tree, 8.27% is spent in building tree connections
required by guardcell-filling (updating the neighbor list in terms of morton
order), and 38.71% in transferring guardcell data among grid blocks.
____________________________________________________________________
The restriction communication time consists of 94.77% in transferring
physical data among grid blocks, and 5.23% in building tree connections.

Among the restriction communication, 92.26% is spent in collective
communications.

Looking at the performance of data transfer in restrictions from the PRT
perspective, remote fetch parent data comprises 0.0%, remote fetch sibling
comprises 0.0%, and remote fetch child comprises 100%.
Improving block contiguity at the inter-level of the PRT will reduce
restriction data communication.
____________________________________________________________________
Among the guardcell data transfer, 65.78% is spent in collective
communications.

Looking at the performance of guardcell data transfer from the PRT
perspective, remote fetch parent data comprises 3.42%, remote fetch sibling
comprises 85.93%, and remote fetch child comprises 10.64%.
Improving block contiguity at the intra-level of the PRT will reduce
guardcell data communication.
====================================================================

(Slide callouts link the output sections to the AMR model performance, the PRT operation performance in AMR_Restrict, and the PRT operation performance in transferring guardcell data.)

Page 22: Parallel Performance Diagnosis

Conclusion and Future Directions

• Model-based performance diagnosis approach
  – Provide performance feedback at a high level of abstraction
  – Support automatic problem discovery and interpretation
  – Enable novice programmers to use established expertise

• Compositional model diagnosis
  – Adapt the knowledge engineering approach to model integration
  – Disentangle cross-model performance effects
  – Enhance the applicability of the model-based approach

• Future directions
  – Automate performance knowledge adaptation
    Algorithmic knowledge, compositional model knowledge
  – Incorporate a system utilization model
    Reveal the interplay between the programming model and system utilization
    Explain performance with the model-system relationship