Reduce Instrumentation Predictors Using Random Forests
Presented By Bin Zhao
Department of Computer Science, University of Maryland
May 3, 2005
2
Motivation
Crash reports: too late, because program information cannot be collected until the program crashes
Testing: large number of test cases. Can we focus on the failing cases?
3
Motivation – failure prediction
Instrument the program to monitor its behavior
Predict whether the program is going to fail
Collect program data if the program is predicted as likely to fail
Stop running the test if the test program is not likely to fail
5
The questions to answer
Can a good model be found for predicting failing runs based on all available data?
Can an equally good model be created based on a random selection of k% of the predictors?
6
Experiment
Instrumentation on a calculator program
295 predictors
Instrumentation data collected every 50 milliseconds
100 runs: 81 success, 19 failure
Numbers of predictors tried: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10
7
Sample data

Pass run
Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
1    pass  1    3244      0          3244          0
1    pass  2    3206      0          3206          0
1    pass  3    3232      0          3232          0
1    pass  4    3203      0          3203          0
1    pass  5    3243      0          3243          0

Failure run
Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
10   fail  1    3200      0          3200          0
10   fail  2    3200      0          3200          0
10   fail  3    3251      0          3251          0
10   fail  4    3251      0          3251          0
10   fail  5    3248      0          3248          0
8
Background – Random Forests
Many classification trees
Each tree gives a classification (a vote)
The classification is chosen by the most votes
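The voting step can be sketched in a few lines. This is only an illustration of majority voting, not the R implementation used in the talk; the function name `forest_vote` and the seven example votes are assumptions made up for this sketch.

```python
from collections import Counter


def forest_vote(tree_votes):
    """Return the class label that received the most votes from the trees."""
    counts = Counter(tree_votes)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from a seven-tree forest for a single test record.
votes = ["pass", "fail", "pass", "pass", "fail", "pass", "pass"]
prediction = forest_vote(votes)
print(prediction)
```

With an odd number of trees and two classes there is never a tie; a real forest needs a tie-breaking rule otherwise.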
9
Background – Random Forests
A training set is needed to grow the forest
mtry predictors are randomly selected at each node as candidates to split the node
About one-third of the training data is left out of each tree (out-of-bag, oob) and is used to estimate the error
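To make the two sources of randomness concrete, here is a minimal sketch, assuming each tree is grown on a bootstrap sample (drawn with replacement) of the training records. The helper name `bootstrap_split`, the record count, and the seed are hypothetical; the predictor count of 295 and an mtry of 17 are used only as example values.

```python
import random


def bootstrap_split(n_records, rng):
    """Draw a bootstrap sample; records never drawn are out-of-bag (oob)."""
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    in_bag_set = set(in_bag)
    oob = [i for i in range(n_records) if i not in in_bag_set]
    return in_bag, oob


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_records = 5000        # assumed training-set size, for illustration only
in_bag, oob = bootstrap_split(n_records, rng)
oob_fraction = len(oob) / n_records
print(f"out-of-bag fraction: {oob_fraction:.3f}")

# At each node, only mtry randomly chosen predictors are considered for the split.
n_predictors = 295
mtry = 17
candidates = rng.sample(range(n_predictors), mtry)
```

The out-of-bag fraction comes out close to one-third because the chance that a record is never drawn in n draws approaches 1/e, roughly 0.37.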
10
Background – Random Forests
To classify a test run as pass or fail
Sample model estimation
OOB error rate: 0.0044

       fail  pass  class.error
fail   933   17    0.0178947368421053
pass   5     4045  0.00123456790123455
11
Background - R
Software for data manipulation, analysis and calculation
Provides scripting capability
Provides an implementation of Random Forests
12
Experiment steps
1. Determine which slice of the data to use for modeling and which for testing
2. Find which parameters (ntree, mtry) affect the model
3. Find the optimal parameter values for all the random models
4. Build the random models by randomly picking N predictors
5. Verify the random models by prediction
14
Influential parameters in Random Forest
Two possible parameters: ntree and mtry
Build models by fixing either ntree or mtry and varying the other
ntree: 200 to 1000
mtry: 10 to 295
Only mtry matters
15
Optimal mtry
Need to decide the optimal mtry for each number of predictors (N)
The default mtry is the square root of N
For the different numbers of predictors (295 – 10): N/2 – 3N
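As a reference point for the tuning, the default mtry for each predictor count can be tabulated. This is a small sketch, assuming the square root is rounded down (which is what R's randomForest does for classification); the names `SUBSET_SIZES` and `default_mtry` are made up here.

```python
import math

# Predictor counts used in the experiment, including the full set of 295.
SUBSET_SIZES = [295, 275, 250, 225, 200, 175, 150, 125, 100,
                90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10]


def default_mtry(n_predictors):
    """Default mtry for classification: square root of N, rounded down (assumed)."""
    return max(1, math.isqrt(n_predictors))


defaults = {n: default_mtry(n) for n in SUBSET_SIZES}
for n, mtry in defaults.items():
    print(n, mtry)
```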
16
Random model
Randomly pick the predictors from the full set of predictors
Generate 5 sets of data for each number of predictors
Use the 5 sets of data to build random forest models and average the results
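The selection and averaging can be sketched as follows. This is a minimal illustration, not the script used in the experiment: the predictor names `P0` to `P294`, the seed, and both helper functions are hypothetical, and `stand_in_score` is a placeholder that does not train anything.

```python
import random
from statistics import mean


def random_predictor_sets(all_predictors, n_pick, n_sets, rng):
    """Draw n_sets independent random subsets of n_pick predictors each."""
    return [rng.sample(all_predictors, n_pick) for _ in range(n_sets)]


def average_result(score_fn, predictor_sets):
    """Score each predictor subset with score_fn and average the results."""
    return mean(score_fn(subset) for subset in predictor_sets)


# Hypothetical predictor names standing in for the 295 real ones.
all_predictors = [f"P{i}" for i in range(295)]
rng = random.Random(42)
sets_of_50 = random_predictor_sets(all_predictors, 50, 5, rng)


def stand_in_score(subset):
    """Placeholder only: a real score_fn would train a forest and return its error."""
    return float(len(subset))


averaged = average_result(stand_in_score, sets_of_50)
```

Averaging over several random draws reduces the chance that one lucky or unlucky subset dominates the result for a given number of predictors.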
17
Random prediction
For each trained random forest, run prediction on a completely different set of test data (records 401 to 450)
18
Random Prediction Result
[Figure: Fail Error Rate (%) versus Number of Predictors; x-axis ticks 0 to 300, y-axis ticks 50 to 80]
19
Analysis of the random model
Why not linear?
[Figure: Fail Error Rate (%) for each of the five experiments (Exp1 to Exp5); y-axis ticks 0 to 80]
20
Important predictors
Random Forests can give an importance to each predictor: the number of correct votes involving the predictor
Top 20 important predictors

DataItem11      RT-DataItem11   PC-DataItem11   MSF-DataItem11
AC-DataItem11   RT-DataItem9    RT-DataItem6    PC-DataItem6
AC-DataItem6    MSF-DataItem9   MSF-DataItem6   PC-DataItem9
DataItem9       AC-DataItem9    DataItem6       DataItem12
MSF-DataItem12  AC-DataItem12   RT-DataItem12   PC-DataItem12
21
Top model
Pick the top important predictors from the full set of predictors to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)
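Selecting the top predictors amounts to sorting by importance and truncating. The sketch below assumes the importance scores are available as a name-to-score mapping; the predictor names and scores shown are invented for illustration and are not results from the experiment.

```python
def top_predictors(importance, k):
    """Return the k predictor names with the highest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


# Invented names and scores, for illustration only.
importance = {"pred_a": 0.10, "pred_b": 0.90, "pred_c": 0.40, "pred_d": 0.50}

TOP_SIZES = [100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10]
top2 = top_predictors(importance, 2)
print(top2)
```

In the experiment the same ranking would be truncated at each size in `TOP_SIZES` to produce one predictor set per model.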
22
Top model prediction result
[Figure: Fail Error Rate (%) versus Number of Predictors; x-axis ticks 20 to 100, y-axis ticks 35 to 50]
23
Observation and analysis
The fail error rate is still high (> 30%)
Not all the runs fail at the same time
Fail:Success = 19:81 (too few fail cases to build a good model)
Some predictors are raw, while others are derived: MSF, AC, PC, RT
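The fail error rate is the per-class error on failing records, which can stay high even when the overall error looks small because failing records are rare. A minimal sketch of the computation, using the counts from the sample model estimation earlier in the talk and assuming rows are the true class and columns the predicted class; the function names are made up for this sketch.

```python
# Counts from the sample model estimation; rows = true class, columns = predicted
# class (assumed layout).
confusion = {
    "fail": {"fail": 933, "pass": 17},
    "pass": {"fail": 5, "pass": 4045},
}


def class_error(confusion, actual):
    """Fraction of records of class `actual` that were misclassified."""
    row = confusion[actual]
    wrong = sum(n for predicted, n in row.items() if predicted != actual)
    return wrong / sum(row.values())


def overall_error(confusion):
    """Fraction of all records that were misclassified."""
    total = sum(sum(row.values()) for row in confusion.values())
    wrong = sum(n for actual, row in confusion.items()
                for predicted, n in row.items() if predicted != actual)
    return wrong / total


fail_error = class_error(confusion, "fail")
pass_error = class_error(confusion, "pass")
overall = overall_error(confusion)
print(fail_error, pass_error, overall)
```

Here the fail error is more than ten times the pass error, which is why the fail error rate is reported separately rather than relying on the overall rate.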
24
Improvements
Get the last N records for a particular run
For a set of data, randomly drop some pass data and duplicate the fail data
Randomly pick raw predictors, then include all of their derived predictors
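The three improvements can be sketched as small helpers. This is an illustration under stated assumptions rather than the procedure actually used: the record layout (a dict with a `result` field), the keep fraction, the duplication factor, and all function names are hypothetical. The derived prefixes MSF-, AC-, PC- and RT- follow the predictor names shown in the talk.

```python
import random

DERIVED_PREFIXES = ["MSF-", "AC-", "PC-", "RT-"]


def last_n_records(run_records, n):
    """Keep only the last n records of one run."""
    return run_records[-n:]


def rebalance(records, keep_pass_fraction, fail_copies, rng):
    """Randomly drop some pass records and duplicate every fail record."""
    balanced = []
    for rec in records:
        if rec["result"] == "fail":
            balanced.extend([rec] * fail_copies)
        elif rng.random() < keep_pass_fraction:
            balanced.append(rec)
    return balanced


def pick_with_derived(raw_predictors, n_raw, rng):
    """Pick n_raw raw predictors at random, then add all their derived predictors."""
    columns = []
    for raw in rng.sample(raw_predictors, n_raw):
        columns.append(raw)
        columns.extend(prefix + raw for prefix in DERIVED_PREFIXES)
    return columns


rng = random.Random(1)
# Toy data with the 81:19 pass/fail ratio; the record layout is assumed.
records = [{"result": "pass"}] * 81 + [{"result": "fail"}] * 19
balanced = rebalance(records, keep_pass_fraction=0.5, fail_copies=2, rng=rng)

raw = [f"DataItem{i}" for i in range(1, 13)]
columns = pick_with_derived(raw, 3, rng)
```

Keeping a raw predictor together with its derived versions means a subset never contains a derived value without the measurement it came from.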
25
Improved random prediction result
[Figure: Fail Error Rate (%) versus Number of Predictors; x-axis ticks 0 to 300, y-axis ticks 30 to 90]
26
Improved top prediction result
[Figure: Fail Error Rate (%) versus Number of Predictors; x-axis ticks 20 to 100, y-axis ticks 15 to 35]
27
Conclusion so far
Random selection does not achieve a good error rate
Some predictors have stronger prediction power
A small set of important predictors can achieve a good error rate
28
Future work
Why do some predictors have stronger prediction power?
Is there any pattern among the important predictors?
How many important predictors should we pick?
How soon can we predict a failing run before it actually fails?
30
Random model estimation result
[Figure: Fail Error Rate (%) versus Number of Predictors; x-axis ticks 0 to 300, y-axis ticks 0 to 80]
31
Top model estimation result
[Figure: Fail Error Rate (%) versus Number of Predictors; x-axis ticks 20 to 100, y-axis ticks 1 to 5]