Parallel Performance Diagnosis

Transcript of "Automatic Performance Diagnosis of Parallel Computations with Compositional Models"

Li Li, Allen D. Malony
{lili, malony}@cs.uoregon.edu
Performance Research Laboratory
Department of Computer and Information Science
University of Oregon
HIPS'07
Parallel Performance Diagnosis
• Performance tuning process
  – Process to find and fix performance problems
  – Performance diagnosis: detect and explain problems
  – Performance optimization: repair found problems
  – Diagnosis is critical to the efficiency of performance tuning
• Focus on performance diagnosis
  – Capture diagnosis processes
  – Integrate with performance experimentation and evaluation
  – Formalize the (expert) performance cause inference
  – Support diagnosis in an automated manner
Generic Performance Diagnosis Process
• Design and run performance experiments– Observe performance under a specific circumstance
– Generate desirable performance evaluation data
• Find symptoms – Observation deviating from performance expectation
– Detect by evaluating performance metrics
• Infer causes of symptoms– Relate symptoms to program
– Interpret symptoms at different levels of abstraction
• Iterate the process to refine performance bug search– Refine performance hypothesis based on symptoms found
– Generate more data to validate the hypothesis
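The four steps above form a loop that can be sketched as follows. This is a minimal illustration, not Hercule's actual implementation: every function name and the toy profile numbers are hypothetical.

```python
# Sketch of the generic diagnosis loop: experiment -> symptoms -> causes
# -> refine. The "profiles" dictionary stands in for a real measurement
# system; all names and numbers are illustrative.

def run_experiment(hypothesis):
    # Stand-in for instrumenting and running the program: returns the
    # fraction of run time spent in each activity the hypothesis targets.
    profiles = {
        "whole_program": {"computation": 0.19, "communication": 0.81},
        "communication": {"guardcell_fill": 0.49, "workload_balance": 0.09,
                          "refinement": 0.02, "other": 0.40},
    }
    return profiles[hypothesis]

def find_symptoms(data, threshold=0.3):
    # A symptom is an observation deviating from expectation; here,
    # crudely, any non-computation activity above a fixed threshold.
    return [k for k, v in data.items() if v > threshold and k != "computation"]

def diagnose(hypothesis):
    trail = []  # record of (hypothesis, symptoms) at each refinement step
    while True:
        data = run_experiment(hypothesis)       # design and run experiment
        symptoms = find_symptoms(data)          # detect symptoms
        if not symptoms:
            return trail                        # expectation met: done
        trail.append((hypothesis, symptoms))
        if symptoms[0] == "communication":      # refine the bug search
            hypothesis = "communication"        # generate more data below
        else:
            return trail                        # leaf reached: report causes
```

Calling `diagnose("whole_program")` descends from the whole-program view into communication and ends at `guardcell_fill` as the dominant suspect, mirroring the iterative refinement described above.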
Knowledge-Based Automatic Performance Diagnosis
• Experts analyze systematically and use experience
  – Implicitly use knowledge of code structure and parallelism
  – Guided by this knowledge in conducting diagnostic analysis
• Knowledge-based approach
  – Capture knowledge about performance problems
  – Capture knowledge about how to detect and explain them
  – Apply the knowledge to performance diagnosis
• Performance knowledge
  – Experiment design and specifications
  – Performance models
  – Performance metrics and evaluation rules
  – High-level performance factors/design parameters (causes)
Implications
• Where does the knowledge come from?
  – Extract from parallel computational models
    Structural and operational characteristics
    Reusable parallel design patterns
  – Associate computational models with performance models
    Well-defined computation and communication patterns
• Model examples
  – Single models: Master-worker, Pipeline, AMR, ...
  – Compositional models
• Use model knowledge to diagnose performance problems
  – Engineer model knowledge
  – Integrate model knowledge with cause inference
[Figure: Model-based Generic Knowledge Generation and Algorithm-specific Knowledge Extension. The generic pipeline (Behavioral Modeling, Performance Modeling, Metrics Definition, Inference Modeling) produces abstract events (event1, event2, ...), model-based metrics, a performance factor library, and metric-driven inference. These are instantiated and extended with algorithmic performance modeling, algorithm-specific events, algorithm-specific metrics, algorithm-specific factors, and performance composition and coupling descriptions, which refine the performance bug search and cause inference.]
Hercule Automatic Performance Diagnosis System
• Goals of automation, adaptability, extension, and reuse
[Figure: Hercule architecture. Computational models supply model knowledge, and the parallel program supplies algorithm-specific information, to a knowledge base. An inference engine applies inference rules, driving an event recognizer and a metric evaluator over performance data produced by the measurement system according to experiment specifications, and outputs diagnosis results: problems and explanations.]
Single Model Knowledge Engineered
• Master-worker
• Divide-and-conquer
• Wavefront (2D pipeline)
• Adaptive Mesh Refinement
• Parallel Recursive Tree
• Geometric Decomposition
• Related publications
  – L. Li and A. D. Malony, "Model-based Performance Diagnosis of Master-worker Parallel Computations", in Proceedings of Euro-Par 2006.
  – L. Li, A. D. Malony, and K. Huck, "Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations", in Proceedings of HPCC 2006.
  – L. Li and A. D. Malony, "Knowledge Engineering for Automatic Parallel Performance Diagnosis", to appear in Concurrency and Computation: Practice and Experience.
Characteristics of Model Composition
• Compositional model
  – Combine two or more models
  – Interaction changes individual model behaviors
  – Composition pattern affects performance
• Model abstraction for describing composition
  – Computational component set {C1, C2, ..., Ck}
  – Relative control order F(C1, C2, ..., Ck)
  – Integrate component sets in a compositional model
• Composition forms
  – Model nesting
  – Model restructuring
  – Different implications for performance knowledge engineering
Model Nesting
• Formal representation
  – Two models: root F(C1, C2, ..., Ck) and child G(D1, D2, ..., Dl)

    F(C1, C2, ..., Ck) + G(D1, D2, ..., Dl) ->
        F(C1{G(D1, D2, ..., Dl)},
          C2{G(D1, D2, ..., Dl)},
          ...,
          Ck{G(D1, D2, ..., Dl)})

    where Ci{G(D1, D2, ..., Dl)} means Ci implements the G model.
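The nesting rule can be made concrete with a small sketch in which a model is a name plus an ordered component list. The representation and the AMR/PRT component names used here are illustrative abbreviations, not the system's actual data structures.

```python
# Sketch of model nesting: each root component Ci is replaced by
# Ci{G(D1, ..., Dl)}, i.e. a pair recording which child model implements
# it. Model representation and names are illustrative.

def nest(root, child):
    """F(C1, ..., Ck) + G(D1, ..., Dl) -> F(C1{G}, ..., Ck{G})."""
    name_f, comps_f = root
    # Pair every root component with the child model it implements.
    return (name_f, [(c, child) for c in comps_f])

amr = ("AMR", ["Refine", "Guardcell", "Prolong"])
prt = ("PRT", ["comm_to_parent", "comm_to_child", "comm_to_sibling"])

nested = nest(amr, prt)
# nested[1][0] is now the pair ("Refine", <the PRT model>): the root
# component carries the child model that implements it.
```

The root model's control order is preserved; only the components' internals change, which is why nesting lets the diagnosis descend from root-model problems to child-model causes.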
Model Nesting (contd.)
• Examples
  – Iterative, multi-phase applications
  – FLASH, developed by the DOE-supported ASC/Alliances Center for Astrophysical Thermonuclear Flashes
• Implications for performance diagnosis
  – Hierarchical model structure dictates analysis order
  – Refine problem discovery from root to child
  – Preserve performance features of individual models
Model Restructuring
• Formal representation
  – Two models: F(C1, C2, ..., Ck) and G(D1, D2, ..., Dl)

    F(C1, C2, ..., Ck) + G(D1, D2, ..., Dl) ->
        H(({C1^F, ..., Ck^F} | {D1^G, ..., Dl^G})+)

    where {C1^F, ..., Ck^F} | {D1^G, ..., Dl^G} selects a component Ci^F
    or Dj^G while preserving the relative component order within F and G,
    and H is the new function ruling all components.
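The restructuring rule can likewise be sketched. Here H is reduced to an explicit interleaving schedule; this is a simplification for illustration (in general H can impose any control function over the selected components), and all names are hypothetical.

```python
# Sketch of model restructuring: H selects components from F or G while
# preserving each model's internal component order. H is modeled here as
# an explicit schedule of "F"/"G" tags; names are illustrative.

def restructure(f_comps, g_comps, schedule):
    """Interleave F's and G's components per `schedule`, keeping the
    relative order inside each model."""
    f_iter, g_iter = iter(f_comps), iter(g_comps)
    return [next(f_iter) if tag == "F" else next(g_iter) for tag in schedule]

f = ["C1", "C2", "C3"]
g = ["D1", "D2"]
merged = restructure(f, g, ["F", "G", "F", "G", "F"])
print(merged)  # ['C1', 'D1', 'C2', 'D2', 'C3']
```

Because both models' internal orders survive, single-model performance knowledge still applies to each component; what changes, and what restructuring-aware knowledge must capture, is the cross-model interleaving imposed by H.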
Adapt Performance Knowledge to Composition
• Objective: discover and interpret performance effects caused by model interaction
• Model nesting
  – Behavioral modeling
    Derive F(C1, C2, ..., Ck) from single-model behaviors
    Replace affected root components with child model behaviors
  – Performance modeling and metric formulation
    Unite overhead categories according to the nesting hierarchy
    Evaluate overheads according to the model hierarchy
  – Inference modeling
    Represent the inference process with an inference tree
    Merge inference steps of participant models
    Extend root model inferences with the implementing child model's inferences
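Extending root-model inferences with the child model's inferences can be sketched on tiny inference trees represented as nested dicts. The tree contents are abbreviated from the AMR and PRT trees shown later; the representation itself is hypothetical.

```python
# Sketch of inference-tree extension for model nesting: attach the child
# model's subtrees under the root-model node they implement. Trees are
# nested dicts (node name -> subtree); contents are abbreviated.

amr_tree = {"low_speedup": {"communication": {"AMR_Guardcell": {},
                                              "AMR_Restrict": {}}}}
prt_tree = {"low_speedup": {"communication": {"parent_comm": {},
                                              "sibling_comm": {}}}}

def extend(root, child_subtrees, at):
    """Attach `child_subtrees` under every root node named `at`."""
    for key, sub in root.items():
        if key == at:
            sub.update(child_subtrees)   # continue the search in the child model
        else:
            extend(sub, child_subtrees, at)
    return root

# AMR_Restrict is implemented with PRT communication, so its problem
# search continues into the relevant PRT subtrees:
flash_tree = extend(amr_tree,
                    prt_tree["low_speedup"]["communication"],
                    "AMR_Restrict")
```

This is exactly the refinement the FLASH inference tree performs: root-model symptoms are explained, where appropriate, by descending into the child model's inference subtrees.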
Model Nesting Case Study - FLASH
• FLASH
  – Parallel simulations in astrophysical hydrodynamics
  – Uses Adaptive Mesh Refinement (AMR) to manage meshes
  – Uses a Parallel Recursive Tree (PRT) to manage mesh data
  – Model nesting: root AMR model, child PRT model; AMR implements PRT data operations
Single Model Characteristics
• AMR operations
  – AMR_Refinement – refine a mesh grid
  – AMR_Derefinement – coarsen a mesh grid
  – AMR_LoadBalancing – even out the workload after refinement or derefinement
  – AMR_Guardcell – update guard cells at the boundary of every grid block with data from the neighbors
  – AMR_Prolong – prolong the solution to newly created leaf blocks after refinement
  – AMR_Restrict – restrict the solution up the block tree after derefinement
  – AMR_MeshRedistribution – redistribute the mesh when balancing workload
• PRT operations
  – PRT_comm_to_parent – communicate with the parent processor
  – PRT_comm_to_child – communicate with a child processor
  – PRT_comm_to_sibling – communicate with a sibling processor
  – PRT_build_tree – initialize the tree structure, or migrate part of the tree to another processor and rebuild the connection
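Hercule's event recognizer maps raw measurement events onto abstract model operations like those listed above. A toy version of that mapping could look as follows; the event names, contexts, and table entries are entirely hypothetical, chosen only to show the shape of the idea.

```python
# Hypothetical sketch of recognizing abstract model operations from raw
# trace events. A real recognizer works on measurement traces; this
# lookup table and every event name in it are illustrative.

EVENT_MAP = {
    ("mpi_send", "guardcell"): "AMR_Guardcell",
    ("mpi_recv", "guardcell"): "AMR_Guardcell",
    ("mpi_send", "restrict"):  "AMR_Restrict",
    ("mpi_send", "parent"):    "PRT_comm_to_parent",
    ("mpi_send", "sibling"):   "PRT_comm_to_sibling",
}

def recognize(trace):
    """Map (event, context) pairs to abstract model operations."""
    return [EVENT_MAP.get((ev, ctx), "unknown") for ev, ctx in trace]

trace = [("mpi_send", "guardcell"), ("mpi_send", "parent"), ("compute", "flux")]
print(recognize(trace))  # ['AMR_Guardcell', 'PRT_comm_to_parent', 'unknown']
```

Lifting events to model operations this way is what lets the metric evaluator attribute time to AMR and PRT operations rather than to raw MPI calls.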
AMR Inference Tree
[Figure: AMR inference tree. The symptom "low speedup" splits into computation and communication; communication branches into AMR_Guardcell, AMR_Refine, AMR_Derefine, AMR_Prolong, AMR_Restrict, AMR_Workbalance, AMR_comm., and other comm. Intermediate observations include guardcell_size, check_neighbor/inform_neighbor, data_fetch, check_refine, parent_prolong, leaf_restrict, calculate_blocks_weight, sort_blocks, migrate_blocks, and rebuild_block_connection. Leaf performance factors include refine_levels, refine_freq., balance_strategy, block_weight_assign_method, data_contiguity_in_cache, and physical_block_contiguity. Legend: symptoms; intermediate observations; performance factors; inference direction.]
PRT Inference Tree
[Figure: PRT inference tree. The symptom "low speedup" splits into computation and communication; communication branches into parent_comm (1), child_comm (2), sibling_comm (3), build_tree (4), and other comm. (5). The communication branches pass through lookup_parent, lookup_child, lookup_sib., and data_transfer; build_tree covers init._tree, migrate_subtree, rebuild_connection, and link_parent/link_child/link_sib. Leaf performance factors include fetch_freq., tree_node_contiguity, tree_depth, data_contiguity, and migrate_strategy. The numbered subtrees (1-5) are referenced by the FLASH inference tree. Legend: symptoms; intermediate observations; performance factors; inference direction.]
FLASH Inference Tree
[Figure: FLASH inference tree. The AMR inference tree (low speedup split into computation and communication; communication branching into AMR_Guardcell, AMR_Refine, AMR_Workbalance, AMR_comm., and others) is annotated at observation nodes marked A: the performance problem search is refined by following the subtrees of the PRT inference tree relevant to that node, with numbers (1,3; 1,2,3; 3; 4; 5) identifying the corresponding PRT subtrees. Leaf performance factors include refine_levels, refine_freq., guardcell_size, balance_strategy, block_weight_assign_method, data_contiguity_in_cache, and physical_block_contiguity.]
Experiment with FLASH v3.0
• Sedov explosion simulation in FLASH3
• Test platform: IBM pSeries 690 SMP cluster with 8 processors
• Execution profiles of a problematic run (ParaProf view)
Diagnosis Results Output (Step 1&2)
• Step 1: find performance symptom
• Step 2: look at root AMR model performance
  Begin diagnosing ...
  ========================================================
  Begin diagnosing AMR program ...
  Level 1 experiment -- collect performance profiles with respect to
  computation and communication.
  ______________________________________________________________
  do experiment 1 ...
  Communication accounts for 80.70% of run time.
  Communication cost of the run degrades performance.
  ========================================================

  ========================================================
  Level 2 experiment -- collect performance profiles with respect to AMR
  refine, derefine, guardcell-fill, prolong, and workload-balance.
  ______________________________________________________________
  do experiment 2 ...
  Processes spent 4.35% of communication time in checking refinement,
  2.22% in refinement, 13.83% in checking derefinement (coarsening),
  1.43% in derefinement, 49.44% in guardcell filling,
  3.44% in prolongating data,
  9.43% in dealing with work balancing.
  ========================================================
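The evaluation rules behind these two steps can be sketched as simple threshold tests over profile fractions. The threshold, the rule wording, and the function names are illustrative assumptions, not Hercule's actual rule base; the numbers are taken from the output above.

```python
# Sketch of metric-driven evaluation for Steps 1 and 2: rules fire on
# profile fractions and name the suspect to refine next. Threshold and
# rule text are assumptions, not Hercule's actual rules.

COMM_THRESHOLD = 0.5  # assumed cutoff for "communication-bound"

def step1(profile):
    # Level 1 rule: flag the run if communication dominates run time.
    comm = profile["communication"]
    if comm > COMM_THRESHOLD:
        return f"Communication accounts for {comm:.2%} of run time."
    return None

def step2(comm_breakdown):
    # Level 2 rule: pick the dominant AMR phase within communication time.
    return max(comm_breakdown.items(), key=lambda kv: kv[1])

profile = {"computation": 0.193, "communication": 0.807}
breakdown = {"check_refine": 0.0435, "refine": 0.0222,
             "check_derefine": 0.1383, "derefine": 0.0143,
             "guardcell_fill": 0.4944, "prolong": 0.0344,
             "work_balance": 0.0943}

print(step1(profile))      # Communication accounts for 80.70% of run time.
print(step2(breakdown)[0]) # guardcell_fill
```

Step 2 singling out guardcell filling is what triggers the Level 3 experiment on the next slide.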
Step 3: Interpret Expensive guardcell_filling with PRT Performance
  ====================================================================
  Level 3 experiment for diagnosing grid guardcell-filling related
  problems -- collect performance event trace with respect to restriction,
  intra-level and inter-level commu. associated with the grid block tree.
  ____________________________________________________________________
  do experiment 3 ...
  Among the guardcell-filling communication, 53.01% is spent restricting
  the solution up the block tree, 8.27% is spent in building tree
  connections required by guardcell-filling (updating the neighbor list in
  terms of Morton order), and 38.71% in transferring guardcell data among
  grid blocks.
  ____________________________________________________________________
  The restriction communication time consists of 94.77% in transferring
  physical data among grid blocks, and 5.23% in building tree connections.
  Among the restriction communication, 92.26% is spent in collective
  communications.
  Looking at the performance of data transfer in restrictions from the PRT
  perspective, remote fetch parent data comprises 0.0%, remote fetch
  sibling comprises 0.0%, and remote fetch child comprises 100%.
  Improving block contiguity at the inter-level of the PRT will reduce
  restriction data communication.
  ____________________________________________________________________
  Among the guardcell data transfer, 65.78% is spent in collective
  communications.
  Looking at the performance of guardcell data transfer from the PRT
  perspective, remote fetch parent data comprises 3.42%, remote fetch
  sibling comprises 85.93%, and remote fetch child comprises 10.64%.
  Improving block contiguity at the intra-level of the PRT will reduce
  guardcell data communication.
  ====================================================================

  (Annotations on the output: AMR model performance; PRT operation
  performance in AMR_Restrict; PRT operation performance in transferring
  guardcell data.)
Conclusion and Future Directions
• Model-based performance diagnosis approach
  – Provides performance feedback at a high level of abstraction
  – Supports automatic problem discovery and interpretation
  – Enables novice programmers to use established expertise
• Compositional model diagnosis
  – Adapts the knowledge engineering approach to model integration
  – Disentangles cross-model performance effects
  – Enhances the applicability of the model-based approach
• Future directions
  – Automate performance knowledge adaptation
    Algorithmic knowledge, compositional model knowledge
  – Incorporate a system utilization model
    Reveal the interplay between programming model and system utilization
    Explain performance with the model-system relationship