
Reduce Instrumentation Predictors Using Random Forests

Presented By Bin Zhao

Department of Computer Science, University of Maryland

May 3, 2005


Motivation

Crash report – too late to collect program information until the program crashes
Testing – large number of test cases. Can we focus on the failing cases?


Motivation – failure prediction

Instrument the program to monitor its behavior
Predict whether the program is going to fail
Collect program data if the program is predicted to be likely to fail
Stop running the test if the program is not likely to fail


The problem

Large number of instrumentation predictors
Which instrumentation predictors should be picked?


The questions to answer

Can a good model be found for predicting failing runs based on all available data?
Can an equally good model be created based on a random selection of k% of the predictors?


Experiment

Instrumentation on a calculator program: 295 predictors
Instrumentation data collected every 50 milliseconds
100 runs – 81 successes, 19 failures
Predictor subset sizes tried: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10


Sample data

Pass run:
Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
1    pass  1    3244      0          3244          0
1    pass  2    3206      0          3206          0
1    pass  3    3232      0          3232          0
1    pass  4    3203      0          3203          0
1    pass  5    3243      0          3243          0

Failure run:
Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
10   fail  1    3200      0          3200          0
10   fail  2    3200      0          3200          0
10   fail  3    3251      0          3251          0
10   fail  4    3251      0          3251          0
10   fail  5    3248      0          3248          0


Background – Random Forests

A random forest is made up of many classification trees
Each tree gives a classification – a vote
The final classification is the one with the most votes


Background – Random Forests

A training set is needed to grow the forest
At each node, mtry predictors are randomly selected to split the node
One-third of the training data (the out-of-bag, or OOB, data) is used to get an error estimate
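
A minimal sketch of growing such a forest with the randomForest package in R (the implementation referenced later in this talk). The runs data frame is synthetic stand-in data; the names runs and Res are assumptions, not the talk's actual variables:

    library(randomForest)

    # Synthetic stand-in for the instrumentation data: 100 runs of 10
    # numeric predictors plus a pass/fail label (81:19, as in the experiment).
    set.seed(1)
    runs <- data.frame(matrix(rnorm(100 * 10), ncol = 10))
    runs$Res <- factor(c(rep("pass", 81), rep("fail", 19)))

    # Grow the forest. mtry is the number of predictors tried at each split;
    # for classification it defaults to the square root of the predictor count.
    rf <- randomForest(Res ~ ., data = runs, ntree = 500, importance = TRUE)
    print(rf)  # reports the OOB error estimate and a confusion matrix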


Background – Random Forests

To classify a test run as pass or fail
Sample model estimation:

OOB error rate: 0.0044

      fail  pass  class.error
fail   933    17  0.0179
pass     5  4045  0.0012


Background – R

Software for data manipulation, analysis, and calculation
Provides scripting capability
Provides an implementation of Random Forests (the randomForest package)


Experiment steps

1. Determine which slice of the data to use for modeling and testing
2. Find which parameters (ntree, mtry) affect the model
3. Find the optimal parameter values for all the random models
4. Build the random models by randomly picking N predictors
5. Verify the random models by prediction


Find the good data


Influential parameters in Random Forests

Two candidate parameters – ntree and mtry
Build models by fixing either ntree or mtry and varying the other
ntree: 200 – 1000; mtry: 10 – 295
Only mtry matters
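
A sketch of that sweep on the synthetic data from above, holding one parameter fixed and varying the other while recording the OOB error (the candidate values here are scaled down to the 10-predictor stand-in data):

    for (nt in seq(200, 1000, by = 200)) {
      m <- randomForest(Res ~ ., data = runs, ntree = nt)
      cat("ntree =", nt, " OOB error =", tail(m$err.rate[, "OOB"], 1), "\n")
    }
    for (mt in c(2, 3, 5, 8, 10)) {
      m <- randomForest(Res ~ ., data = runs, ntree = 500, mtry = mt)
      cat("mtry =", mt, " OOB error =", tail(m$err.rate[, "OOB"], 1), "\n")
    }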


Optimal mtry

Need to decide the optimal mtry for each number of predictors (N)
The default mtry is the square root of N
For each number of predictors (295 down to 10), mtry values in the range N/2 – 3N were tried
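
The talk runs its own search; for reference, the randomForest package also ships a helper, tuneRF, that steps mtry up and down from a starting value and keeps moves that improve the OOB error. A sketch on the synthetic data:

    x <- runs[, setdiff(names(runs), "Res")]
    y <- runs$Res
    # Start at the default sqrt(N) and step by 1.5x while OOB error improves.
    tuneRF(x, y, mtryStart = floor(sqrt(ncol(x))), ntreeTry = 500,
           stepFactor = 1.5, improve = 0.01)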


Random model

Randomly pick predictors from the full set of predictors
Generate 5 sets of data for each number of predictors
Use the 5 sets of data to build random forest models and average the results
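
A sketch of one cell of that experiment, with the subset size k scaled down to fit the synthetic stand-in data:

    # Draw 5 random predictor subsets of size k, fit a forest on each,
    # and average the OOB error for the fail class.
    k <- 5
    errs <- replicate(5, {
      cols <- sample(setdiff(names(runs), "Res"), k)
      m <- randomForest(x = runs[, cols], y = runs$Res, ntree = 500)
      tail(m$err.rate[, "fail"], 1)
    })
    mean(errs)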


Random prediction

For each trained random forest, do prediction on a totally different set of test data (records 401 – 450)
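
A sketch of the prediction step; test stands in for records 401 – 450 and is generated here rather than taken from the real traces:

    test <- data.frame(matrix(rnorm(50 * 10), ncol = 10))
    test$Res <- factor(c(rep("pass", 40), rep("fail", 10)))
    pred <- predict(rf, newdata = test)
    table(predicted = pred, actual = test$Res)  # prediction confusion matrix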


Random Prediction Result

[Figure: fail error rate (%) vs. number of predictors]


Analysis of the random model

Why not linear?

[Figure: fail error rate (%) for experiments Exp1 – Exp5]


Important predictors

Random Forests can give an importance measure for each predictor – the number of correct votes involving the predictor

Top 20 important predictors:
DataItem11, RT-DataItem11, PC-DataItem11, MSF-DataItem11, AC-DataItem11, RT-DataItem9, RT-DataItem6, PC-DataItem6, AC-DataItem6, MSF-DataItem9, MSF-DataItem6, PC-DataItem9, DataItem9, AC-DataItem9, DataItem6, DataItem12, MSF-DataItem12, AC-DataItem12, RT-DataItem12, PC-DataItem12
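
In the randomForest package, the closest measure is the permutation importance recorded when the forest is grown with importance = TRUE; a sketch:

    imp <- importance(rf, type = 1)  # type 1 = mean decrease in accuracy
    # Rank predictors from most to least important (varImpPlot(rf) plots this).
    imp[order(imp, decreasing = TRUE), , drop = FALSE]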


Top model

Pick the top important predictors from the full set of predictors to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)
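
A sketch of refitting on only the k most important predictors, assuming imp from the previous snippet:

    k <- 5  # the talk uses 100 down to 10; 5 fits the stand-in data
    top_cols <- rownames(imp)[order(imp, decreasing = TRUE)][1:k]
    rf_top <- randomForest(x = runs[, top_cols], y = runs$Res, ntree = 500)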


Top model prediction result

[Figure: fail error rate (%) vs. number of predictors]


Observation and analysis

The fail error rate is still high (> 30%)
Not all the runs fail at the same time
Fail:Success = 19:81 (too few fail cases to build a good model)
Some predictors are raw, while others are derived – MSF, AC, PC, RT


Improvements

Get the last N records for a particular run
For a set of data, randomly drop some pass data and duplicate the fail data (see the sketch below)
Randomly pick the raw predictors, then include all of their derived predictors
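
A sketch of the rebalancing step; the fraction of pass data dropped is an assumption, not a value given in the talk:

    pass_rows <- which(runs$Res == "pass")
    fail_rows <- which(runs$Res == "fail")
    keep_pass <- sample(pass_rows, length(pass_rows) %/% 2)  # drop half the pass data
    balanced  <- runs[c(keep_pass, fail_rows, fail_rows), ]  # duplicate the fail data
    table(balanced$Res)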


Improved random prediction result

[Figure: fail error rate (%) vs. number of predictors]


Improved top prediction result

[Figure: fail error rate (%) vs. number of predictors]


Conclusion so far

Random selection does not achieve a good error rate
Some predictors have stronger prediction power
A small set of important predictors can achieve a good error rate


Future work

Why do some predictors have stronger prediction power?
Is there any pattern to the important predictors?
How many important predictors should we pick?
How soon can we predict a failing run before it actually fails?



Random model estimation result

[Figure: fail error rate (%) vs. number of predictors]


Top model estimation result

[Figure: fail error rate (%) vs. number of predictors]


Improved random model

[Figure: fail error rate (%) vs. number of predictors]