Transcript of the final Vrisha HPDC'11 presentation
Vrisha: Using Scaling Properties of Parallel Programs for Bug Detection and Localization
Bowen Zhou, Milind Kulkarni and Saurabh Bagchi
School of Electrical and Computer Engineering, Purdue University
Scale-dependent Bugs in Parallel Programs
• What are they?
– Bugs that often arise in large-scale runs while staying invisible in small-scale ones
– Examples: race conditions, integer overflows, etc.
– Severe impact on HPC systems: performance degradation, silent data corruption
• Difficult to detect, let alone localize
Scale-dependent Bugs in Parallel Programs
• Why do parallel programs have such bugs?
– Coded and tested on small-scale development machines with small-scale test cases
– Deployed on large-scale production systems
When Statistical Debugging Meets Scale-dependent Bugs
[Figure: communication volume of training, normal, and buggy runs]
Previous Statistical Debugging
• Previous methods of statistical debugging for parallel programs [Mirgorodskiy SC'06] [Gao SC'07] [Kasick FAST'10]
When Statistical Debugging Meets Scale-dependent Bugs
[Figure: communication behavior vs. scale for training, normal, and buggy runs]
Previous Statistical Debugging vs. Our Proposed Method
• To address scale-dependent bugs with previous methods, you must either:
– be very restricted in your feature selection, or
– have access to a bug-free run on the large-scale system
• Our method solves both problems:
– it can model the scaling behavior of a parallel program
– it only needs bug-free runs from small-scale systems
Key Insights
• The "natural scaling behavior" of an application can be captured and used for bug detection
• Instead of looking for deviations from scale-invariant behavior, we look for deviations from the scaling trend
Contributions
• We built Vrisha, which uses the scale of a run as a parameter to a statistical model for debugging scale-dependent bugs
• Vrisha builds a model that deduces the correlation between the scale of a run and program behavior
• Vrisha detects both application and library bugs with low overhead, as validated with two real bugs from a popular MPI implementation
Example of Scale-dependent Bugs
• A bug in MPI_Allgather in MPICH2-1.1
– Allgather is a collective communication operation that lets every process gather data from all participating processes
[Figure: Allgather among processes P1, P2, P3 — each process ends up with the data of all three]
Example of Scale-dependent Bugs
• MPICH2 uses distinct algorithms to do Allgather in different situations
• The optimal algorithm is selected based on the total amount of data received by each process
[Figure: recursive doubling vs. ring algorithm among processes P0–P7]
Example of Scale-dependent Bugs

    int MPIR_Allgather (
        ...
        int recvcount, MPI_Datatype recvtype, MPID_Comm *comm_ptr )
    {
        int comm_size, rank;
        int curr_cnt, dst, type_size, left, right, jnext, comm_size_is_pof2;
        ...
        if ((recvcount*comm_size*type_size < MPIR_ALLGATHER_LONG_MSG) &&
            (comm_size_is_pof2 == 1)) {
            /* Short or medium size message and power-of-two no. of processes.
             * Use recursive doubling algorithm */
            ...
        }
        else if (recvcount*comm_size*type_size < MPIR_ALLGATHER_SHORT_MSG) {
            /* Short message and non-power-of-two no. of processes. Use
             * Bruck algorithm (see description above). */
            ...
        }
        else {
            /* Long message or medium-size message and non-power-of-two
             * no. of processes. Use ring algorithm. */
            ...
        }

recvcount*comm_size*type_size can easily overflow a 32-bit integer on large systems and make the if statement take the wrong branch.
Key Observation
• We observe that, in parallel programs, some program properties are predictable from the scale of the run
• We call these scale-determined properties
[Figure: number of communication partners and total bytes sent grow with scale across 1-, 2-, 4-, and 8-node runs]
[Figure: a scale-determined property — communication vs. scale, with training runs at smaller node counts, a fitted prediction extending to larger scales (up to 65536 nodes), and an observation marked "x"]
Bug Detection
• A large gap between predicted and observed behavior means a bug is detected!
• Localization of the bug is done by identifying the properties that cause the gap and the code regions that affect those properties
Vrisha's Workflow
a. Collect bug-free data from training runs at different scales
b. Aggregate the data into scale and program-property features
c. Build a model from scale and properties
d. Collect data from the production run
e. Perform detection and diagnosis on the production run
[Figure: Vrisha's workflow — (a) training runs at n=16, 32, 64; (b) data collection into scale and behavior; (c) model building; (d) production run at n=128; (e) detection and diagnosis]
Challenges
• What features should we use to build the model of scaling behavior?
Control Feature
• We generalize the concept of "scale" to control features
• A set of parameters given by the system or user:
– number of processors in the system
– command-line arguments
– input size
• Control features are the predictors of program behavior
Observational Feature
• To characterize program behavior, we use observational features
• A set of vantage points in the source code to profile various runtime properties
• Observational features are the manifestations of program behavior
Observational Feature
• What does Vrisha use as observational features?
– Each unique call stack at the socket level under the MPI library serves as a distinct vantage point
– The amount of data communicated at each vantage point is recorded as an observational feature
• Why?
– Captures both application and library bugs
– No need to modify application code
– Imposes low overhead
– Provides source-level diagnosis
Challenges
• What model should we use to capture the relationship between scale and behavior?
– It must be able to describe both linear and non-linear relationships
Model: Canonical Correlation Analysis
• Project the control features X by u and the observational features Y by v so that the correlation between the projections Xu and Yv is maximized:

    \max_{u,v} \mathrm{corr}(Xu, Yv) \quad \text{such that} \quad \|u\| = \|v\| = 1
Model: Kernel CCA
• We use the "kernel trick", a popular machine-learning technique, to transform non-linear relationships into linear ones via a non-linear mapping φ(·)
• The control features X map to φ(X) and the observational features Y map to φ(Y); CCA then maximizes the correlation between φ(X)u and φ(Y)v
Challenges
• How does Vrisha do bug detection and diagnosis?
Bug Detection
• We declare a bug in the production run if the correlation between the control features X′ and the observational features Y′, in the subspace defined by u and v, falls below a certain threshold
• In a buggy run, the projections φ(X′)u and φ(Y′)v are poorly correlated
Bug Localization
• Normalize the communication volumes of the training and production runs to the same scale
• Compare the training (bug-free) runs with the production (buggy) run
• Identify the call stacks with the biggest changes as the potential bug locations
Evaluation
• We validated Vrisha with two real bugs from the MPICH2 library; Vrisha was able to detect and localize both of them
• Experiment setup:
– 16-node cluster
– dual-socket 2.2 GHz quad-core CPUs
– 512 KB L2 cache, 8 GB RAM
– MPICH2-1.1.1
Detect the Bug in Allgather
• The bug is configured to trigger only in the 16-node run
• A KCCA model is built solely on 4- to 15-node runs
• Vrisha reveals this scale-dependent bug through the very low correlation in the 16-node buggy run
Locate the Bug in Allgather
• All the training runs, e.g. the 4- and 8-node runs, share the same pattern
• The 16-node run shows a different pattern
• The most radical difference occurs between features #9 and #16, which leads us to find that the program makes the wrong choice at the buggy if statement
Vrisha's Overhead
• We measured the average overhead of Vrisha with the NAS Parallel Benchmarks
• Vrisha's instrumentation overhead is less than 8% of execution time on average
• Vrisha takes less than 30 ms to build the model and less than 5 ms to detect a bug, equivalent to around 1/10000 of the running time of these benchmarks on average
Conclusion
• Some program properties in parallel programs are scale-determined and can be exploited to detect and localize scale-dependent bugs
• We built a system called Vrisha that leverages this idea and successfully detected two real bugs from the MPICH2 library with low overhead