Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)
Transcript of the keynote slides.
Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically
Dongsun Kim, The University of Luxembourg
Interdisciplinary Centre for Security, Reliability and Trust
2 Dec 2013
Serval Team
Hunting
Hunting 101
1. Seeking
2. Selecting
3. Shooting
Debugging 101
1. Localizing
2. Prioritizing
3. Fixing
About This Talk
Three debugging techniques based on SW repository mining
Quick Tips on mining
Future Directions
Three techniques
1. Two-phase recommendation model for bug localization
2. Early prediction model for bug prioritization
3. Pattern-based program repair for bug fixing
Where Should We Fix This Bug? A Two-phase Recommendation Model
Dongsun Kim, Yida Tao, Sunghun Kim (The Hong Kong University of Science and Technology, China)
Andreas Zeller (Saarland University)
IEEE Transactions on Software Engineering, Vol. 39, No. 11
Fault Localization vs. Bug (File) Localization

|                         | Input                                                | Output                     |
|-------------------------|------------------------------------------------------|----------------------------|
| Fault localization      | Test cases (passing/failing), program (class/module) | Faulty statement/predicate |
| Bug (file) localization | Bug report, program                                  | Buggy file                 |
Bug Report
Bug Description
Comments
ML Classification 101

Training: feature vectors with known classes, e.g.
(1, 90, 21, A, text, ..., 58) → [Type 4]
(2, 12, 100, E, aaa, ..., 76) → [Type 2]
are used to train a classifier (model).

Classification: the trained classifier then assigns a class to a new feature vector, e.g.
(4, 39, 5, K, text, ..., 32) → [Type 3]
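The train-then-classify workflow on the slide can be sketched with a toy 1-nearest-neighbour "model" (training just memorises labelled vectors; classification returns the label of the closest one). This is only a minimal stand-in to show the data flow, not the classifiers used in the talk; the distance function and its categorical-mismatch penalty are made up for illustration.

```python
def distance(a, b):
    """Absolute difference for numeric fields, fixed penalty for categorical mismatch."""
    d = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            d += abs(x - y)
        else:
            d += 0 if x == y else 100  # assumed penalty, chosen arbitrarily
    return d

def train(examples):
    """'Training' for 1-NN is simply storing the labelled examples."""
    return list(examples)

def classify(model, vector):
    """Return the class of the nearest training vector."""
    return min(model, key=lambda ex: distance(ex[0], vector))[1]

# Feature vectors and classes from the slide (the elided "..." fields are dropped).
training = [
    ((1, 90, 21, "A", "text", 58), "Type 4"),
    ((2, 12, 100, "E", "aaa", 76), "Type 2"),
]
model = train(training)
print(classify(model, (4, 39, 5, "K", "text", 32)))  # Type 4
```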
Bug Localization using ML
Feature vector: meta + textual information in bug reports
Class: changed files in bug reports
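A sketch of that mapping: features come from the report's metadata and text, and the class is the file changed by the fix. All field names, the bag-of-words feature scheme, and the file name below are hypothetical, chosen only to echo the Mozilla example used later in the talk.

```python
from collections import Counter

def features(report):
    """Meta features (component, priority) plus a bag of words from the summary."""
    words = Counter(report["summary"].lower().split())
    return {"component": report["component"],
            "priority": report["priority"],
            **{f"word:{w}": n for w, n in words.items()}}

report = {
    "component": "Bookmarks & History",
    "priority": "P1",
    "summary": "Places killed all my history",
    "fixed_files": ["nsNavHistoryExpire.cpp"],  # hypothetical class label
}

x = features(report)        # feature vector (as a dict)
y = report["fixed_files"]   # class: files changed to fix the bug
print(x["word:history"], y)
```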
Intuitive Approach
Fig. 1: One-phase prediction model. Bug reports are fed directly to a prediction model, which outputs the predicted files to fix (e.g., a.cpp, b.cpp, m.c, h.c).
Bug 403040 – Places killed all my history >2 days ago
Status: VERIFIED FIXED; Keywords: dataloss, regression; Product: Firefox; Component: Bookmarks & History; Importance: P1 critical; Target Milestone: Firefox 3 beta2; Assigned To: Dietrich Ayala; Reported: 2007-11-08 by Reed Loden
Attachments: fix v1.1 (patch, Dietrich Ayala); set mExpireVisits to default if not set in prefs (patch, Marco Bonardo)
Description (Reed Loden): "So, I noticed last night that I was missing three days or so in my history sidebar (I think days 3, 4, and 5), and now when I check today, the only thing in my history sidebar is today and yesterday. What happened to my history? :( I have no idea how this happened, so I can't give good STR, sorry. I have had to kill Firefox several times lately, so it may be related to some type of shutdown sqlite save or expiration or something?"
Comment 1 (Dietrich Ayala): "What's your browser.history_expire_days value?"
Comment 2 (Dietrich Ayala): No other bugs reported about this. Killing Firefox shouldn't matter: the shutdown work would not have occurred on a force-kill, and SQLite is (theoretically) immune to data corruption from unexpected shutdown since it runs in the safest possible mode (synchronous = full). "Are all your bookmarks still there?"
Comment 3 (Reed Loden): Both browser.history_expire_days and browser.history_expire_days.mirror are 180 days; all bookmarks are still there.
Comment 4 (Robert Kaiser): Loses Places history about daily with recent self-built SeaMonkey builds, though older builds worked; it isn't at every restart.
Comment 5 (Marco Bonardo): "i'm not sure if that could be the problem, but nsNavHistoryExpire::FindVisits it's looking strange"
(https://bugzilla.mozilla.org/show_bug.cgi?id=403040)
Fig. 2: An uninformative bug report. This is an excerpt from Mozilla Bug #403040, written by the bug submitter. This description is not informative, and the bug reviewer indeed had to ask the submitter for further elaboration on his browser's history and bookmark settings.
The model further computes each file's probability of being a file to fix. Based on this probability, the top-k files are recommended to developers as the prediction result.
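The top-k recommendation step amounts to ranking files by their predicted probability. A minimal sketch, with made-up probabilities:

```python
import heapq

# Illustrative per-file probabilities of being a file to fix (not real data).
probs = {"a.cpp": 0.42, "b.cpp": 0.31, "m.c": 0.15, "h.c": 0.12}

# Recommend the k files with the highest probability.
top_k = heapq.nlargest(2, probs, key=probs.get)
print(top_k)  # ['a.cpp', 'b.cpp']
```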
3.3 Two-phase Prediction Model
As Hooimeijer et al. [49] and Bettenburg et al. [50] noticed, the quality of bug reports can vary considerably. Some bug reports may not have enough information to predict files to fix. Our evaluation of one-phase prediction (Section 5) confirms this conjecture: bug reports whose files are not successfully predicted usually have insufficient information (e.g., no initial description). In other words, including uninformative bug reports might yield poor prediction performance.
Figure 2 shows an example of an uninformative bug report. In this report, the submitter describes a problem faced when using Firefox. However, this description is very general and contains few informative keywords that indicate the problematic modules. Therefore, it is not helpful for developers to locate the files to fix. Similarly, our one-phase prediction model does not perform well with such uninformative bug reports.
Hence, it is desirable to filter out uninformative bug reports before the actual prediction process. Based on this observation, we propose the two-phase prediction model that has two classification phases: binary and multi-class classification (Figure 3). The model first filters out uninformative reports (Section 3.3.1) and then predicts files to fix (Section 3.3.2).
Fig. 3: Two-phase prediction model. Phase 1's prediction model classifies incoming bug reports as predictable (P) or deficient (D). Deficient reports are set aside; predictable reports go to Phase 2, whose prediction model outputs the predicted files to fix (e.g., a.cpp, b.cpp, m.c, h.c). This model recommends files to fix only when the given bug report is determined to have sufficient information.
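The pipeline in Fig. 3 can be sketched as two chained classifiers. Both models below are toy rule-based stand-ins (a word-count threshold and a keyword match, neither from the paper); the real system trains ML classifiers for each phase.

```python
def phase1_is_predictable(report):
    """Binary classification: does the report carry enough information?
    Toy rule: at least 10 words of text (threshold chosen arbitrarily)."""
    return len(report["text"].split()) >= 10

def phase2_predict_files(report):
    """Multi-class classification: predict files to fix.
    Toy keyword rule with hypothetical file names."""
    return ["history.cpp"] if "history" in report["text"] else ["other.cpp"]

def two_phase(report):
    """Fig. 3 data flow: filter first, predict only for predictable reports."""
    if not phase1_is_predictable(report):
        return None  # deficient: ask the submitter for more information
    return phase2_predict_files(report)

print(two_phase({"text": "crash"}))  # deficient -> None
```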
3.3.1 Phase 1
Phase 1 filters out uninformative bug reports before predicting files to fix. Its prediction model classifies a given bug report as "predictable" or "deficient" (binary classification), as shown in Figure 3. Only bug reports classified as "predictable" are taken up for the Phase 2 prediction.
The prediction model in Phase 1 leverages prediction history. The training dataset of this model uses a set of bug reports that have already been resolved. Let B = {b1, b2, ..., bn} be a set of n resolved bug reports chronologically sorted by their filing date. V(bi) is the i-th bug report's feature vector, which is extracted as described in Section 3.1. P(bi) is the set of actual files changed to fix the bug (i.e., the files in the bug's patch), which can be obtained as well from report bi. For each report, its label ("predictable" or "deficient") is determined by the following process: for an arbitrary report bj ∈ B, a one-phase prediction model Mj is trained on {(V(b1), P(b1)), (V(b2), P(b2)), ..., (V(b(j-1)), P(b(j-1)))} to predict files to fix for bj. If the prediction result hits any file in P(bj), bj is labeled as "predictable"; otherwise, it is labeled as "deficient". Now, let L(b) be the label of report b. By applying the above process to all reports in B \ {b1}, we can obtain the training dataset {(V(b2), L(b2)), (V(b3), L(b3)), ..., (V(bn), L(bn))} for the prediction model of Phase 1. Note that no training dataset is built for b1 since there is no bug report before b1 to create (V(b1), L(b1)).
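The chronological labelling loop can be sketched as follows. Here `train`/`predict` are placeholders for the one-phase model Mj: they merely recommend the historically most-often-fixed files, which is not the paper's model, only a stand-in to make the loop concrete.

```python
from collections import Counter

def train(history):
    """Stand-in for training Mj: count how often each file was fixed so far."""
    counts = Counter()
    for _, fixed_files in history:
        counts.update(fixed_files)
    return counts

def predict(model, k=2):
    """Stand-in for Mj's prediction: the top-k most frequently fixed files."""
    return [f for f, _ in model.most_common(k)]

def label_reports(reports):
    """reports: chronologically sorted list of (V(b), P(b)) pairs.
    Returns labels for b2..bn; b1 gets no label (no prior history)."""
    labels = []
    for j in range(1, len(reports)):
        model = train(reports[:j])                 # train Mj on b1..b(j-1)
        hit = set(predict(model)) & set(reports[j][1])
        labels.append("predictable" if hit else "deficient")
    return labels

# Toy feature vectors ("v1", ...) and fixed-file sets.
reports = [
    ("v1", ["a.cpp"]),
    ("v2", ["a.cpp", "m.c"]),  # a.cpp was fixed before -> hit -> predictable
    ("v3", ["h.c"]),           # h.c never fixed before -> deficient
]
print(label_reports(reports))  # ['predictable', 'deficient']
```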
When a new bug report is submitted, the prediction model classifies it as either "predictable" or "deficient". If the report is classified as "predictable", it is passed on to Phase 2 prediction; otherwise, no further prediction is conducted. In the latter case, developers may ask the report submitter to give more information about the bug.
3.3.2 Phase 2
The Phase 2 model accepts "predictable" bug reports obtained from Phase 1 as the input. It extracts features ...
→ Low precision and recall
Quality Matters

| Good report | Bad report |
|---|---|
| "When I did B and C after A, it crashes with this stack trace" | "not working" |
| "My bookmark item is deleted if I try this link" | "error message" |
| "no response after clicking button A and B" | "there is a glitch at the toolbar" |

Hooimeijer and Weimer, "Modeling bug report quality," ASE 2007; Zimmermann et al., "What makes a good bug report?" TSE 2010
21
![Page 44: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/44.jpg)
22
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 5
PredictionModel
a.cppa.cppa.cppb.cppb.cppb.cpp
Bug ReportsPredicted
m.cm.cm.ch.ch.ch.c
PredictedFiles to Fix
Fig. 1: One-phase Prediction Model.
Bug 403040 - Places killed all my history >2 days ago Last Comment
Status: VERIFIED FIXED Whiteboard:
Keywords: dataloss, regression
Product: FirefoxComponent: Bookmarks & History
Version: Trunk Platform: All All
Importance: P1 critical (vote) Target Milestone: Firefox 3 beta2
Assigned To: Dietrich Ayala (:dietrich)QA Contact: bookmarks
URL:
Depends on: Blocks:
Show dependency tree / graph
Reported: 2007-11-08 08:47 PST by Reed Loden [:reed] (very busy)
Modified: 2010-12-17 06:29 PST (History)
CC List: 14 users (show)
Flags: mconnor: blocking-firefox3+ (more flags)
See Also:
Crash Signature:
Summon comment box
Attachments
fix v1.1 (1.31 KB, patch)2007-11-09 11:51 PST, Dietrich Ayala (:dietrich)
set mExpireVisits to default if not set in prefs (3.37 KB, patch)2007-11-15 02:31 PST, Marco Bonardo [:mak]
Add an attachment (proposed patch, testcase, etc.)
2007-11-08 08:47:34 PST
So, I noticed last night that I was missing three days or so in my history sidebar (I think days 3, 4, and 5), and now when I check today, the only thing in my history sidebar is today and yesterday. What happened to my history? :(
I have no idea how this happened, so I can't give good STR, sorry. I have had to kill Firefox several times lately, so it may be related to some type of shutdown sqlite save or expiration or something?
Description
Dietrich Ayala (:dietrich) 2007-11-08 09:04:47 PST
What's your browser.history_expire_days value?
Comment 1
Dietrich Ayala (:dietrich) 2007-11-08 09:20:58 PST
Hrm, no other bugs reported about this. I also searched the build forums for the last few days, no comments about anything like this.
Killing Firefox shouldn't matter: the shutdown work would not have occurred if your force-killed it, and SQLite is (theoretically) immune to data corruption from unexpected shutdown given that we run it in the safest possible mode (synchronous = full).
Are all your bookmarks still there?
Comment 2
Reed Loden [:reed] (very busy) 2007-11-08 09:48:13 PST
(In reply to comment #1) > What's your browser.history_expire_days value?
Both browser.history_expire_days and browser.history_expire_days.mirror are 180 days.
(In reply to comment #2) > Are all your bookmarks still there?
Yes, all my bookmarks are still there.
Comment 3
Robert Kaiser (:[email protected]) 2007-11-09 08:13:20 PST
I'm using places history with my self-built SeaMonkey builds, and lose my places history about daily in the last few days, though it worked perfectly before in older builds. I first thought it would be lost on shutdown/restart, but I noticed that I had a few visited links left from a last session a few times, so it at least isn't at every restart.
Comment 4
Marco Bonardo [:mak] 2007-11-09 09:31:10 PST
i'm not sure if that could be the problem, but nsNavHistoryExpire::FindVisits it's looking strange:
Comment 5
Page 1 of 5Bug 403040 – Places killed all my history >2 days ago
2012-01-23https://bugzilla.mozilla.org/show_bug.cgi?id=403040
Fig. 2: An uninformative bug report. This is an excerptfrom Mozilla Bug #403040, written by the bug submit-ter. This description is not informative and the bugreviewer indeed had to ask the submitter for furtherelaboration on his browser’s history and bookmarksettings.
further computes each file’s probability of being afile to fix. Based on this probability, top-k files arerecommended to developers as the prediction result.
3.3 Two-phase Prediction Model
As Hooimeijer et al. [49] and Bettenburg et al. [50]noticed, the quality of bug reports can vary consider-ably. Some bug reports may not have enough infor-mation to predict files to fix. Our evaluation of one-phase prediction (Section 5) confirms this conjecture:bug reports whose files are not successfully predictedusually have insufficient information (e.g., no initialdescription). In other words, including uninformativebug reports might yield poor prediction performance.
Figure 2 shows an example of an uninformativebug report. In this report, the submitter describesa problem faced when using Firefox. However, thisdescription is very general and contains few informa-tive keywords that indicate the problematic modules.Therefore, it is not helpful for developers to locate thefiles to fix. Similarly, our one-phase prediction modeldoes not perform well with such uninformative bugreports.
Hence, it is desirable to filter out uninformative bug reports before the actual prediction process. Based on this observation, we propose the two-phase prediction model that has two classification phases: binary and multi-class classification (Figure 3). The model first filters out uninformative reports (Section 3.3.1) and then predicts files to fix (Section 3.3.2).
[Fig. 3 diagram: Bug Reports → Phase 1 Prediction Model → Predictable Reports or Deficient Reports; Predictable Reports → Phase 2 Prediction Model → Predicted Files to Fix (a.cpp, b.cpp, m.c, h.c)]
Fig. 3: Two-phase prediction model. This model recommends files to fix only when the given bug report is determined to have sufficient information.
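A minimal sketch of the Figure 3 pipeline, assuming duck-typed `phase1_model` and `phase2_model` objects (hypothetical interfaces, not the paper's actual classes):

```python
def two_phase_predict(report_vector, phase1_model, phase2_model, k=10):
    """Two-phase prediction as in Fig. 3: filter, then recommend.

    phase1_model.classify(v) returns "predictable" or "deficient" (binary);
    phase2_model.rank_files(v) returns candidate files, most likely first
    (multi-class). Both interfaces are stand-ins for the trained models.
    """
    if phase1_model.classify(report_vector) == "deficient":
        return None  # no recommendation; ask the submitter for more detail
    return phase2_model.rank_files(report_vector)[:k]
```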
3.3.1 Phase 1
Phase 1 filters out uninformative bug reports before predicting files to fix. Its prediction model classifies a given bug report as “predictable” or “deficient” (binary classification) as shown in Figure 3. Only bug reports classified as “predictable” are taken up for the Phase 2 prediction.
The prediction model in Phase 1 leverages prediction history. The training dataset of this model uses a set of bug reports that have already been resolved. Let B = {b1, b2, . . . , bn} be a set of n resolved bug reports chronologically sorted by their filing date. V(bi) is the i-th bug report’s feature vector, which is extracted as described in Section 3.1. P(bi) is the set of actual files changed to fix the bug (i.e., the files in the bug’s patch), which can be obtained as well from report bi. For each report, its label (“predictable” or “deficient”) is determined by the following process: for an arbitrary report bj ∈ B, a one-phase prediction model Mj is trained on {(V(b1), P(b1)), (V(b2), P(b2)), . . . , (V(bj−1), P(bj−1))} to predict files to fix for bj. If the prediction result hits any file in P(bj), bj is labeled as “predictable”; otherwise, it is labeled as “deficient”. Now, let L(b) be the label of report b. By applying the above process to all reports in B − {b1}, we can obtain the training dataset {(V(b2), L(b2)), (V(b3), L(b3)), . . . , (V(bn), L(bn))} for the prediction model of Phase 1. Note that no training dataset is built for b1 since there is no bug report before b1 to create (V(b1), L(b1)).
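The labeling procedure above can be sketched as follows; `train_model` is a hypothetical stand-in for training the one-phase model Mj on the reports filed before bj:

```python
def label_reports(reports, train_model, k=10):
    """Label resolved reports as 'predictable' or 'deficient'.

    reports: list of (feature_vector, fixed_files) pairs sorted by filing
    date, i.e., (V(b1), P(b1)), ..., (V(bn), P(bn)).
    train_model(history) must return an object with a
    predict(vector) -> ranked-file-list method; this interface is a
    stand-in for the one-phase model Mj.
    b1 is skipped: there is no earlier history to train on.
    """
    labeled = []
    for j in range(1, len(reports)):
        v_j, p_j = reports[j]
        model = train_model(reports[:j])      # Mj trained on b1 .. b_{j-1}
        top_k = model.predict(v_j)[:k]        # predicted files for bj
        hit = any(f in p_j for f in top_k)    # does any file hit the patch?
        labeled.append((v_j, "predictable" if hit else "deficient"))
    return labeled
```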
When a new bug report is submitted, the prediction model classifies it as either “predictable” or “deficient”. If the report is classified as “predictable”, it is passed on to Phase 2 prediction; otherwise, no further prediction is conducted. In the latter case, developers may ask the report submitter to give more information about the bug.
3.3.2 Phase 2
The Phase 2 model accepts “predictable” bug reports obtained from Phase 1 as the input. It extracts features
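The multi-class idea of Phase 2, mapping report features to likely files, can be illustrated with a deliberately simple word-overlap ranker (not the paper's actual classifier; the file names and training data below are made up):

```python
from collections import Counter, defaultdict

def train_file_ranker(history):
    """Train a deliberately simple multi-class ranker: for each file, count
    how often each report word co-occurred with a fix touching that file.
    (A stand-in for a real classifier; the interface is what matters.)"""
    word_counts = defaultdict(Counter)          # file -> word frequencies
    for words, fixed_files in history:
        for f in fixed_files:
            word_counts[f].update(words)

    def rank(words):
        # Score each known file by the overlap between the new report's
        # words and the words seen with past fixes to that file.
        scores = {f: sum(c[w] for w in words) for f, c in word_counts.items()}
        return sorted(scores, key=scores.get, reverse=True)

    return rank

# Hypothetical training data: (report words, files changed by the fix).
history = [(["history", "expire", "lost"], {"nsNavHistoryExpire.cpp"}),
           (["bookmark", "toolbar", "icon"], {"browser-places.js"})]
rank = train_file_ranker(history)
rank(["history", "lost"])  # -> ["nsNavHistoryExpire.cpp", "browser-places.js"]
```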
Bad Report
22
23
Two-phase Recommendation Model
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
[Fig. 1 diagram: Bug Reports → Prediction Model → Predicted Files to Fix (a.cpp, b.cpp, m.c, h.c)]
Fig. 1: One-phase Prediction Model.
Bug 403040 - Places killed all my history >2 days ago
Status: VERIFIED FIXED Whiteboard:
Keywords: dataloss, regression
Product: Firefox
Component: Bookmarks & History
Version: Trunk Platform: All All
Importance: P1 critical (vote) Target Milestone: Firefox 3 beta2
Assigned To: Dietrich Ayala (:dietrich)
QA Contact: bookmarks
URL:
Depends on: Blocks:
Show dependency tree / graph
Reported: 2007-11-08 08:47 PST by Reed Loden [:reed] (very busy)
Modified: 2010-12-17 06:29 PST (History)
CC List: 14 users (show)
Flags: mconnor: blocking-firefox3+ (more flags)
See Also:
Crash Signature:
Summon comment box
Attachments
fix v1.1 (1.31 KB, patch), 2007-11-09 11:51 PST, Dietrich Ayala (:dietrich)
set mExpireVisits to default if not set in prefs (3.37 KB, patch), 2007-11-15 02:31 PST, Marco Bonardo [:mak]
Add an attachment (proposed patch, testcase, etc.)
2007-11-08 08:47:34 PST
So, I noticed last night that I was missing three days or so in my history sidebar (I think days 3, 4, and 5), and now when I check today, the only thing in my history sidebar is today and yesterday. What happened to my history? :(
I have no idea how this happened, so I can't give good STR, sorry. I have had to kill Firefox several times lately, so it may be related to some type of shutdown sqlite save or expiration or something?
Description
Dietrich Ayala (:dietrich) 2007-11-08 09:04:47 PST
What's your browser.history_expire_days value?
Comment 1
Dietrich Ayala (:dietrich) 2007-11-08 09:20:58 PST
Hrm, no other bugs reported about this. I also searched the build forums for the last few days, no comments about anything like this.
Killing Firefox shouldn't matter: the shutdown work would not have occurred if your force-killed it, and SQLite is (theoretically) immune to data corruption from unexpected shutdown given that we run it in the safest possible mode (synchronous = full).
Are all your bookmarks still there?
Comment 2
Reed Loden [:reed] (very busy) 2007-11-08 09:48:13 PST
(In reply to comment #1) > What's your browser.history_expire_days value?
Both browser.history_expire_days and browser.history_expire_days.mirror are 180 days.
(In reply to comment #2) > Are all your bookmarks still there?
Yes, all my bookmarks are still there.
Comment 3
Noise Filtering / File Recommendation
23
24
Noise Filtering
24
25
Noise Filtering - Classifying existing reports
[Diagram: bug reports #1 through N−1 train a model; the model then predicts files for report N. A match with any actual fixed file labels N as [predictable]; no match labels N as [deficient].]
25
26
[Diagram: reports labeled [predictable] or [deficient] up to report N form the training data for the Phase 1 model.]
Noise Filtering - Training Phase 1 model
26
27
[Diagram: the Phase 1 model classifies a new report as [predictable] or [deficient].]
Noise Filtering - Using Phase 1 model
27
28
File Recommendation
28
29
Evaluation
Two-phase Model
Comparative Study
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 5
Fig. 1: One-phase prediction model. Bug reports are fed directly to a single prediction model, which outputs the predicted files to fix (e.g., a.cpp, b.cpp, m.c, h.c).
Bug 403040 - "Places killed all my history >2 days ago" (Mozilla Bugzilla, reported 2007-11-08 by Reed Loden; fixed for Firefox 3 beta2)

Description: "So, I noticed last night that I was missing three days or so in my history sidebar (I think days 3, 4, and 5), and now when I check today, the only thing in my history sidebar is today and yesterday. What happened to my history? :( I have no idea how this happened, so I can't give good STR, sorry. I have had to kill Firefox several times lately, so it may be related to some type of shutdown sqlite save or expiration or something?"

Comment 1 (Dietrich Ayala): "What's your browser.history_expire_days value?"

Fig. 2: An uninformative bug report. This is an excerpt from Mozilla Bug #403040, written by the bug submitter. This description is not informative and the bug reviewer indeed had to ask the submitter for further elaboration on his browser's history and bookmark settings.
further computes each file's probability of being a file to fix. Based on this probability, the top-k files are recommended to developers as the prediction result.
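The ranking step described above can be sketched as follows; the probability values and file names here are hypothetical, standing in for whatever the prediction model actually outputs:

```python
def recommend_top_k(file_probs, k):
    """Rank candidate files by their predicted probability of being
    a file to fix, and return the names of the top-k files."""
    ranked = sorted(file_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _prob in ranked[:k]]

# Hypothetical per-file probabilities produced by a prediction model.
probs = {"a.cpp": 0.61, "b.cpp": 0.22, "m.c": 0.09, "h.c": 0.08}
print(recommend_top_k(probs, 2))  # → ['a.cpp', 'b.cpp']
```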
3.3 Two-phase Prediction Model
As Hooimeijer et al. [49] and Bettenburg et al. [50] noticed, the quality of bug reports can vary considerably. Some bug reports may not have enough information to predict files to fix. Our evaluation of one-phase prediction (Section 5) confirms this conjecture: bug reports whose files are not successfully predicted usually have insufficient information (e.g., no initial description). In other words, including uninformative bug reports might yield poor prediction performance.
Figure 2 shows an example of an uninformative bug report. In this report, the submitter describes a problem faced when using Firefox. However, this description is very general and contains few informative keywords that indicate the problematic modules. Therefore, it is not helpful for developers to locate the files to fix. Similarly, our one-phase prediction model does not perform well with such uninformative bug reports.
Hence, it is desirable to filter out uninformative bug reports before the actual prediction process. Based on this observation, we propose the two-phase prediction model that has two classification phases: binary and multi-class classification (Figure 3). The model first filters out uninformative reports (Section 3.3.1) and then predicts files to fix (Section 3.3.2).
Fig. 3: Two-phase prediction model. Bug reports first pass through the Phase 1 prediction model, which separates predictable reports from deficient ones; predictable reports then go to the Phase 2 prediction model, which outputs the predicted files to fix. This model recommends files to fix only when the given bug report is determined to have sufficient information.
3.3.1 Phase 1
Phase 1 filters out uninformative bug reports before predicting files to fix. Its prediction model classifies a given bug report as "predictable" or "deficient" (binary classification), as shown in Figure 3. Only bug reports classified as "predictable" are taken up for the Phase 2 prediction.
The prediction model in Phase 1 leverages prediction history. The training dataset of this model uses a set of bug reports that have already been resolved. Let B = {b1, b2, ..., bn} be a set of n resolved bug reports chronologically sorted by their filing date. V(bi) is the i-th bug report's feature vector, which is extracted as described in Section 3.1. P(bi) is the set of actual files changed to fix the bug (i.e., the files in the bug's patch), which can be obtained as well from report bi. For each report, its label ("predictable" or "deficient") is determined by the following process: for an arbitrary report bj ∈ B, a one-phase prediction model Mj is trained on {(V(b1), P(b1)), (V(b2), P(b2)), ..., (V(b_{j-1}), P(b_{j-1}))} to predict files to fix for bj. If the prediction result hits any file in P(bj), bj is labeled as "predictable"; otherwise, it is labeled as "deficient". Now, let L(b) be the label of report b. By applying the above process to all reports in B \ {b1}, we can obtain the training dataset {(V(b2), L(b2)), (V(b3), L(b3)), ..., (V(bn), L(bn))} for the prediction model of Phase 1. Note that no training instance is built for b1 since there is no bug report before b1 to create (V(b1), L(b1)).
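The labeling process above can be sketched as follows. `train_one_phase` and `predict_files` stand in for the one-phase model of Section 3.2; the naive stand-ins and the sample reports below are illustrative assumptions, not the paper's actual model:

```python
def build_phase1_training_set(reports, train_one_phase, predict_files, k=10):
    """Label each resolved report 'predictable' or 'deficient' by replaying
    history: train a one-phase model M_j on reports b_1..b_{j-1}, then check
    whether its top-k prediction for b_j hits any actually fixed file in
    P(b_j). The first report has no history, so it is skipped."""
    dataset = []
    for j in range(1, len(reports)):
        history = [(r["V"], r["P"]) for r in reports[:j]]
        model = train_one_phase(history)
        predicted = predict_files(model, reports[j]["V"], k)
        hit = bool(set(predicted) & set(reports[j]["P"]))
        dataset.append((reports[j]["V"], "predictable" if hit else "deficient"))
    return dataset

# Naive stand-ins for demonstration only: the "model" just remembers every
# file ever fixed, and "predicts" the k most recently seen ones.
def train_one_phase(history):
    seen = []
    for _vec, fixed in history:
        seen.extend(fixed)
    return seen

def predict_files(model, _vec, k):
    return model[-k:]

reports = [
    {"V": "crash on startup", "P": ["a.cpp"]},
    {"V": "history lost",     "P": ["a.cpp", "b.cpp"]},
    {"V": "toolbar glitch",   "P": ["ui.c"]},
]
labels = build_phase1_training_set(reports, train_one_phase, predict_files, k=2)
```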
When a new bug report is submitted, the prediction model classifies it as either "predictable" or "deficient". If the report is classified as "predictable", it is passed on to Phase 2 prediction; otherwise, no further prediction is conducted. In the latter case, developers may ask the report submitter to give more information about the bug.
3.3.2 Phase 2
The Phase 2 model accepts "predictable" bug reports obtained from Phase 1 as its input. It extracts features
One-phase Model
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 9
[Fig. 4 plot data: eight panels, one per module, each plotting likelihood (%) against k = 1..10 for the Usual Suspects, One-phase, BugScout, and Two-phase models. Per-panel statistics: ff-bookmark (431 test cases, 196 predictable, feedback 45.5%), ff-general (216, 158, 73.2%), core-js (517, 446, 78.1%), core-dom (251, 98, 39.0%), core-layout (471, 208, 44.2%), core-style (171, 127, 74.3%), core-xpcom (202, 41, 20.3%), core-xul (202, 37, 18.3%).]

Fig. 4: Prediction likelihood for each module shown in Table 1. The Y-axis represents the likelihood values computed by Equation (1). The X-axis represents the k values described in Section 3. In the upper-left corner of each plot, the total number of bug reports in the test set, the number of predictable bug reports, and the feedback value computed by Equation (5) are shown.
Results - Likelihood
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 8
differences are statistically significant with 95% confidence [57]. We chose this non-parametric test method instead of a parametric test method such as the t-test because the distribution of our evaluation results may not be normal.
In addition, we used Feedback [25] to compute the ratio of bug reports classified as predictable after Phase 1 prediction. Let NP denote the number of predictable bug reports and ND denote the number of deficient ones. Feedback is computed as follows:

Feedback = NP / (NP + ND)    (5)
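Equation (5) is straightforward to compute. A minimal sketch, using the ff-bookmark numbers from Fig. 4 (196 predictable reports out of 431 test cases) as a worked example:

```python
def feedback(num_predictable, num_deficient):
    """Feedback = NP / (NP + ND): the fraction of bug reports
    that Phase 1 classifies as predictable."""
    return num_predictable / (num_predictable + num_deficient)

# ff-bookmark: 196 predictable out of 431 test cases.
print(round(feedback(196, 431 - 196) * 100, 1))  # → 45.5
```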
5 RESULTS
This section reports the evaluation results. Sections 5.1 and 5.2 report the prediction performance and compare the results of four different models with their statistical significance (RQ1). We discuss the feedback (RQ2) in Section 5.3, and present the sensitivity analysis in Section 5.4 to compare the prediction power of individual features (RQ3). Section 5.5 shows examples of usage to demonstrate how our approach can improve developers' bug-fixing practice (RQ4).
5.1 Performance
We first address RQ1: What is the predictive power of the two-phase model in recommending files to fix? We present the likelihood, precision, and recall values in Figures 4, 5, and 6, respectively. Since the model recommends the top-k files, the performance depends on the value of k. The X-axis of the figures represents the k value, which ranges from 1 to 10.
When recommending only the top one file (i.e., k = 1), the two-phase model's likelihood ranges from 19% to 57%. The likelihood value grows as k increases. When k = 10, the two-phase model yields likelihood between 52% and 88%. Suppose there are 10 bug reports. In the best scenario, our two-phase prediction model is able to successfully recommend at least one file to fix for 6 to 9 out of 10 reports, which is very promising.
When k = 1, the two-phase model's precision ranges from 6% to 47%, with an average of 23%. The precision ranges from 7% to 11% when k = 10. These values indicate that the two-phase model can make correct predictions even with a small k.
The average recall of the two-phase model increases from 9% to 33% as k grows from 1 to 10. This indicates that when recommending the top ten files, our model can correctly suggest on average one third of the files which need to be fixed for a given bug. In addition, the two-phase model achieves a 60% recall value for ff-bookmark when k = 10.
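The per-report metrics behind these figures can be sketched as follows. Equations (1)-(3) are not reproduced in this excerpt, so the sketch uses the standard top-k definitions the text describes: a "hit" (the basis of the likelihood measure) means at least one recommended file is actually fixed.

```python
def topk_metrics(predicted, actual, k):
    """Standard top-k metrics for one bug report: hit (any recommended
    file was actually fixed), precision (hits among the k recommended),
    and recall (hits among the files actually fixed)."""
    top = set(predicted[:k])
    act = set(actual)
    overlap = len(top & act)
    return {
        "hit": overlap > 0,
        "precision": overlap / k,
        "recall": overlap / len(act),
    }

# Hypothetical recommendation list vs. actually fixed files.
m = topk_metrics(["a.cpp", "b.cpp", "m.c"], ["b.cpp", "x.c"], k=2)
```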
Our two-phase model successfully predicts files to fix for 52% to 88% of all bug reports, with an average of 70%.
5.2 Comparison
As shown in Figure 4, the two-phase model outperforms the one-phase model in prediction likelihood. For example, when recommending the top 10 files, the likelihood of the two-phase model for the eight modules ranges from 52% to 88%, with an average value of 70%. The one-phase model, on the other hand, has an average likelihood of only 44% when k = 10, which is even less than the lowest prediction likelihood of the two-phase model.
To counteract the problem that rare events are likely to be observed in multiple comparisons, we used Bonferroni correction [58], so that a p-value less than 0.05/4 = 0.0125 indicates a significant difference between the corresponding pair of models. As shown in Table 2, the two-phase model significantly outperforms the one-phase model for half of the modules.
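The Bonferroni-corrected significance check is simple to state in code; the p-values below are hypothetical, for illustration only:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni correction: with m comparisons, a result is significant
    only if p < alpha / m. For the four model pairs compared here,
    the threshold is 0.05 / 4 = 0.0125."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values for four pairwise model comparisons.
print(bonferroni_significant([0.001, 0.03, 0.0125, 0.2]))
# → [True, False, False, False]
```

Note that a p-value exactly equal to the threshold (0.0125 here) does not count as significant under the strict inequality.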
The two-phase model also manifests higher precision and recall than the one-phase model, as shown in Figures 5 and 6.

The two-phase model outperforms the one-phase model in prediction likelihood, precision, and recall.
The one-phase model, on the other hand, manifests prediction performance comparable to the Usual Suspects model: the last column of Table 2 shows that the p-values between these two models are greater than 0.0125 for all eight modules. BugScout also shows performance similar to the Usual Suspects, as shown in Figures 4, 5 and 6. One possible reason is that BugScout leverages the defect-proneness information to recommend files to fix, an idea similar to the Usual Suspects model.

Only the two-phase model outperforms the Usual Suspects model, while the one-phase model and BugScout are both on par with the Usual Suspects model.
We also compared the average rank of correctly predicted files for each model (Equation 4). As shown in Table 3, the two-phase model has the highest average rank among the four prediction models for 6 out of 8 modules (except for core-js and core-xul). This implies that compared to the other three models, developers might have more confidence in using the two-phase model since it ranks correctly predicted files at a higher position, which could potentially save their inspection time.
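Equation (4) is not reproduced in this excerpt. Since the text treats a higher average rank as better, one common realization of such a measure is a reciprocal-rank score; the sketch below is an assumed realization under that reading, not the paper's exact formula:

```python
def mean_reciprocal_rank(predicted, actual):
    """Score each correctly predicted file by 1/position (1-based) and
    average the scores; a higher value means correct files sit nearer
    the top of the recommendation list. This is an assumed stand-in
    for the paper's Equation (4)."""
    act = set(actual)
    scores = [1.0 / (i + 1) for i, f in enumerate(predicted) if f in act]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical list: correct files at positions 1 and 4.
print(mean_reciprocal_rank(["a.cpp", "b.cpp", "m.c", "h.c"], ["a.cpp", "h.c"]))
# → 0.625
```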
Which Crashes Should I Fix First? Crashing Bug Prioritization

Dongsun Kim, Xinming Wang, Sunghun Kim, S. C. Cheung (The Hong Kong University of Science and Technology, China)
Andreas Zeller (Saarland University)
Sooyong Park (Sogang University)

IEEE Transactions on Software Engineering, May/June 2011 (selected as the featured article of the issue)
31
Crashes
33
Crash Reporting System
Apple Crash Report
Dr. Watson
Breakpad + Socorro
Bucketing
35
Top Crashes
of crash reports, we sorted crashes by their frequency of being reported, and then counted the percentage of crash reports accounted for in each interval of 10 crashes. The bar chart in Fig. 5 shows the results. For example, the leftmost bar indicates that the top-10 crashes accounted for more than 50 percent of the Firefox crash reports and more than 35 percent of the Thunderbird crash reports. Fig. 5 provides the initial validation of our hypothesis: for example, the top-20 crashes account for 72 and 55 percent of the crash reports for Firefox and Thunderbird, respectively.
Note that such a trend has also been observed in commercial software. For example, by analyzing crash reporting data, Microsoft has found that a small set of defects is responsible for the vast majority of its code-related problems: "fixing 20 percent of code defects can eliminate 80 percent or more of the problems users encounter" [1]. This indicates that identifying top crashes is important for commercial products as well as open source projects.
Moreover, such a phenomenon is not restricted to crash-related failures. For example, Adams [2] observed that most operational system failures are caused by a small proportion of latent faults. Goseva and Hamill [23], [25] observed that a few small regions in a program could account for the reliability of the whole program. Our finding here is consistent with these studies.
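The counting procedure described above (sort crashes by report frequency, then sum each rank group's share of all reports) can be sketched as follows; the report counts are hypothetical:

```python
def share_per_rank_group(crash_counts, group=10):
    """Sort crashes by how often each was reported, then compute the
    percentage of all crash reports contributed by each consecutive
    group of `group` ranks (top 1-10, 11-20, ...)."""
    counts = sorted(crash_counts, reverse=True)
    total = sum(counts)
    shares = []
    for start in range(0, len(counts), group):
        shares.append(100.0 * sum(counts[start:start + group]) / total)
    return shares

# Hypothetical report counts for 30 distinct crashes: a few crashes
# dominate, mirroring the skew observed for Firefox and Thunderbird.
counts = [500, 300, 200] + [50] * 7 + [20] * 10 + [5] * 10
print(share_per_rank_group(counts))  # → [84.375, 12.5, 3.125]
```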
3.2 Limitation of Current Practice
Top crashes need to be fixed as soon as possible. Given a top crash, how long does it take for developers to start working on it? Ideally, a top crash should be handled immediately once it is reported. In other words, the date of a first crash report should be close to the date when developers begin to work on the crash. To verify whether this is the case in the real world, we investigated the crashes and bug-fixing activities of Firefox 3.5.
One issue here is how to determine the time when developers begin to work on a crash. In Mozilla projects such as Firefox and Thunderbird, management policy mandates that any bug-fixing activity for a crash in the crash repository must begin with the creation of a bug report using Bugzilla [10] by the developer. Thus, when the developer creates a bug report for a crash, we assume that he or she is ready to work on this crash. Therefore, we regard the time when its corresponding bug report is created as the time when developers begin to work on this crash. With this information, we calculated the number of days it took for a
KIM ET AL.: WHICH CRASHES SHOULD I FIX FIRST?: PREDICTING TOP CRASHES AT AN EARLY STAGE TO PRIORITIZE DEBUGGING... 433
Fig. 4. Number of crash reports for Firefox 3.5 per day since its release (30 June 2009). Between 14,000 and 24,000 crash reports have been reported per day. The number of crash reports indicates that users experienced at least the same number of failures (abrupt program termination). Note that 750 crashes (crash points) are reported for Firefox 3.5.
Fig. 5. Number of crash reports ranked in groups of 10 for Firefox and Thunderbird. Firefox 3.0 and Thunderbird 3.0 crash reports were collected for July 2008-December 2008 and January 2009-May 2009, respectively. The top-10 crashes accounted for more than 35 percent (Thunderbird) and 50 percent (Firefox) of the total number of crash reports.
Top-20 crashes account for 55% to 72% of all crash reports
36
developer to start working on a top crash. Fig. 6 shows the results for the top-100 crashes of Firefox 3.5.
From Fig. 6, we can observe that the real situation is far from ideal: on average, developers waited 40 days until they started to work on a top-10 crash. This is unfortunate because, given the frequency of these top crashes, such a delay would mean hundreds of thousands of crash occurrences.
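The delay measured here (days between the first crash report for a crash signature and the creation of its Bugzilla bug) is a simple date subtraction; the dates below are hypothetical, chosen to match the 40-day average:

```python
from datetime import date

def days_to_action(first_crash_seen, bug_filed):
    """Days between the first crash report for a crash signature and
    the creation of its bug report, i.e., how long developers waited
    before starting to work on the crash."""
    return (bug_filed - first_crash_seen).days

# Hypothetical dates for one top crash (Firefox 3.5 released 30 June 2009).
print(days_to_action(date(2009, 6, 30), date(2009, 8, 9)))  # → 40
```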
So why did Mozilla developers allow such a long delay in handling top crashes? One might blame this delay on insufficient motivation for maintenance. However, our personal communication with Mozilla development team members Gary Kong and Channy Yun suggests otherwise: Mozilla developers are generally eager to work on top crashes. However, they are conservative in acknowledging a crash as a top crash, even if it appears at the top of the list for the moment. This conservativeness is driven by the concern that, at the early stage when crashes are first reported (e.g., in the alpha- and beta-testing phases), the frequency of a crash might be substantially different from its frequency at the later stage. Therefore, developers prefer to "wait and see" until there are sufficient crash reports to support a crash being a top crash.
What if Mozilla developers were less conservative? Let us assume that they had used the data at an early stage, the alpha-testing phase, to determine top crashes. Using the 5,199 crash reports submitted during the alpha-testing phase of Firefox 3.5, they would treat those crashes that occurred most frequently in this stage as the top crashes. However, are these crashes really the top crashes? Fig. 7 illustrates the ranking of these crashes in terms of their actual occurrence frequencies, which are derived from all 415,351 crash reports submitted during the main life span of Firefox 3.5 (from the start of alpha testing to the day when the next version was released). In this figure, each bar represents a k-most-frequent crash in the alpha-testing phase. For example, the leftmost bar indicates that the most-frequent crash in the alpha-testing phase is ranked 162nd in terms of actual occurrence frequency.
From Fig. 7, we can observe that the k-most-frequent crashes in the alpha-testing phase are poor indicators of actual top crashes: only two of them (k = 3 and k = 10) are top-20 crashes, while most of the others are actually infrequent crashes. In fact, the 20 most-frequent crashes in the alpha-testing phase can account for only 13.35 percent of all the crash reports of Firefox 3.5, whereas the actual top-20 crashes account for 78.26 percent. The key reason, as pointed out by Fenton and Neil [19], is that the failure rate of a fault at the early stage (prerelease) can be significantly different from its failure rate after release. In practice, the goal of internal and volunteer alpha testers is to expose the largest number of bugs with the least number of test cases. Therefore, they usually tend not to repeat already-exercised crashing test cases even though these test cases might trigger top crashes.
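The coverage comparison behind this paragraph can be sketched as follows: take the k crashes that were most frequent during alpha testing and measure what share of all lifetime crash reports they account for. All counts and signature names here are hypothetical, illustrating how an alpha-phase top-k can cover only a small fraction of lifetime reports:

```python
def coverage_of_alpha_top_k(alpha_counts, lifetime_counts, k):
    """Share (%) of all lifetime crash reports accounted for by the k
    crashes that were most frequent during alpha testing."""
    alpha_top = sorted(alpha_counts, key=alpha_counts.get, reverse=True)[:k]
    total = sum(lifetime_counts.values())
    return 100.0 * sum(lifetime_counts.get(c, 0) for c in alpha_top) / total

# Hypothetical per-crash report counts, keyed by crash signature.
alpha = {"sigA": 9, "sigB": 7, "sigC": 2}
lifetime = {"sigA": 10, "sigB": 40, "sigC": 900, "sigD": 50}
print(round(coverage_of_alpha_top_k(alpha, lifetime, 2), 1))  # → 5.0
```

Here the alpha-phase top-2 (sigA, sigB) covers only 5% of lifetime reports, while the actually dominant crash (sigC) was barely seen during alpha testing, mirroring the Firefox 3.5 observation.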
The above discussion highlights the dilemma of the current practice: by being more conservative in determining top crashes, developers delay bug fixing, but by being less conservative in determining top crashes, developers miss the actual top crashes. The core of the problem is that current practice relies on hindsight to identify top crashes; that is, we can accurately identify top crashes only after they have already caused significant trouble for the users.
It should be noted that most of the top crashes do occur in the early phase, although they are not frequent. For example, 16 of the top-20 crashes of Firefox 3.5 occurred at least once during the alpha testing (shown in the bottom-right Gantt chart of Fig. 8). This indicates an opportunity for improving current practice (see Section 6.7 for more discussion on this topic).
3.3 How Can Prediction Improve the Current Practice?
To address this problem of current practice, we advocate a prediction-based approach that does not rely on hindsight to identify top crashes. With our approach, it becomes feasible to identify top crashes during prerelease testing (i.e., alpha or beta testing), and also to react as soon as the first crash reports are received. Rather than waiting for a number of crashes to occur, developers can identify and address the most pressing problems without delay.
To see the benefit of our approach, let us assume that we have an "ideal top-crashes predictor" that can accurately
434 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 37, NO. 3, MAY/JUNE 2011
Fig. 6. Number of days for crashes to be reported as bugs (Firefox 3.5). We measured the number of days between the first crash report for each crash and its bug report. There was a correlation between the crash's ranking and the time taken for bug reporting.
Fig. 7. The ranking of most-frequent crashes in the alpha-testing phase.
Time to Action
36
![Page 65: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/65.jpg)
36
developer to start working on a top crash. Fig. 6 shows the results for the top-100 crashes of Firefox 3.5.

From Fig. 6, we can observe that the real situation is far from ideal: On average, developers waited 40 days until they started to work on a top-10 crash. This is unfortunate because, given the frequency of these top crashes, such a delay would mean hundreds of thousands of crash occurrences.

So why did Mozilla developers allow such a long delay in handling top crashes? One might blame this delay on insufficient motivation for maintenance. However, our personal communication with Mozilla development team members Gary Kong and Channy Yun suggests otherwise: Mozilla developers are generally eager to work on top crashes. However, they are conservative in acknowledging a crash as a top crash, even if it appears at the top of the list for the moment. This conservativeness is driven by the concern that, at the early stage when crashes are first reported (e.g., in the alpha- and beta-testing phases), the frequency of a crash might be substantially different from its frequency at the later stage. Therefore, developers prefer to "wait and see" until there are sufficient crash reports to support a crash being a top crash.

What if Mozilla developers were less conservative? Let us assume that they had used the data at an early stage, the alpha-testing phase, to determine top crashes. Using the 5,199 crash reports submitted during the alpha-testing phase of Firefox 3.5, they would regard those crashes that occurred most frequently in this stage as top crashes. However, are these crashes really the top crashes? Fig. 7 illustrates the ranking of these crashes in terms of their actual occurrence frequencies, which are derived from all 415,351 crash reports submitted during the main life span of Firefox 3.5 (from the start of alpha testing to the day when the next version was released). In this figure, each bar represents a k-most-frequent crash in the alpha-testing phase. For example, the leftmost bar indicates that the most-frequent crash in the alpha-testing phase is ranked 162nd in terms of actual occurrence frequency.

From Fig. 7, we can observe that the k-most-frequent crashes in the alpha-testing phase are poor indicators of actual top crashes: Only two of them (k = 3 and k = 10) are top-20 crashes, while most of the others are actually infrequent crashes. In fact, the 20 most-frequent crashes in the alpha-testing phase account for only 13.35 percent of all the crash reports of Firefox 3.5, whereas the actual top-20 crashes account for 78.26 percent. The key reason, as pointed out by Fenton and Neil [19], is that the failure rate of a fault at the early stage (prerelease) can be significantly different from its failure rate after release. In practice, the goal of internal and volunteer alpha testers is to expose the most bugs with the fewest test cases. Therefore, they usually tend not to repeat already-exercised crashing test cases, even though these test cases might trigger top crashes.

The above discussion highlights the dilemma of the current practice: By being more conservative in determining top crashes, developers delay bug fixing, but by being less conservative, developers miss the actual top crashes. The core of the problem is that current practice relies on hindsight to identify top crashes; that is, we can accurately identify top crashes only after they have already caused significant trouble for the users.
Faster but not so fast!
37
To address this challenge, we adopt a learning-based approach, summarized in Fig. 2. From an earlier release, we know which crash reports are "top" (frequent) and which ones are "bottom" (infrequent). We extract the top and bottom stack traces as well as their method signatures. The features of these signatures are then passed to a machine learner. The learner can then immediately classify a crash summarized by a new incoming crash report as frequent (a top crash) or not. As shown in Section 3, the deployment of an accurate top-crash predictor may reduce the number of crash reports in Firefox 3.5 by at least 36 percent if developers fix top crashes first.

We employ features from crash reports and source code to train a machine learner. Our preliminary observations and insights led us to focus on three types of features that form the core of our approach:
• First, we observed that statistical characteristics can indicate whether a crash is a top or bottom crash: In particular, methods in stack traces of top crashes appear again in other top crashes. This motivated us to extract historical features from crash reports.

• Second, intramethod characteristics can also indicate whether a method belongs to frequent crashes; complex methods may crash more often. This motivated us to employ complexity metrics (CM) features, such as lines of code and the number of paths, for top-crash prediction.

• Third, intermethod characteristics can describe crash frequency; well-connected methods in call graphs may crash often. To measure connectedness, we employ social network analysis (SNA) features such as centrality.
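As a rough sketch of how such per-method feature values could be turned into a per-trace feature vector (the approach accumulates feature values per trace as sums and averages, as the Fig. 2 caption notes), consider the following; the class and method names are illustrative, not the paper's.

```java
import java.util.List;

public class TraceFeatures {
    // Per-trace accumulation sketch: each method on a crash stack trace
    // has a vector of per-method feature values (history, CM, SNA); the
    // trace's feature vector concatenates the sums and the averages.
    static double[] accumulate(List<double[]> methodFeatures) {
        int n = methodFeatures.get(0).length;
        double[] vec = new double[2 * n]; // [sums..., averages...]
        for (double[] f : methodFeatures)
            for (int i = 0; i < n; i++)
                vec[i] += f[i];
        for (int i = 0; i < n; i++)
            vec[n + i] = vec[i] / methodFeatures.size();
        return vec;
    }
}
```

Accumulating per trace keeps the vector length fixed even though stack traces have varying depth, which is what lets a standard classifier consume them.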
To validate our approach, we investigate the crash report repositories of the Firefox Web browser as well as the Thunderbird e-mail client. We use a very small training set of only 150-250 crash reports from a prior release (that is, the crash reports received within 10-15 minutes after release). Given the small size of the set, the machine learner can then classify crash reports for the new release immediately, that is, with the very first crash report. This classification method has a high accuracy: In Firefox, 75 percent of all incoming reports are correctly classified; in Thunderbird, the accuracy rises to 90 percent. These accurate prediction results can provide valuable information for developers to prioritize their defect-fixing efforts, improve quality at an early stage, and improve the overall user experience.
From a technical standpoint, this paper makes the following contributions:

1. We present a novel technique to predict whether a crash will be frequent (a "top crash") or not.
2. We evaluate our approach on the crash report repositories of Thunderbird and Mozilla, demonstrating that it scales to real-life software.
3. We show that our approach is efficient, as it requires only a small training set from the previous release. This implies that it can be applied at an early stage of development, e.g., during alpha or beta testing.
4. We show that our approach is effective, as it predicts top crashes with high accuracy. This means that effort on addressing the predicted problems is well spent.
5. We discuss and investigate under which circumstances our approach works best; in particular, we investigate which features of crash reports are most sensitive for successful prediction.
KIM ET AL.: WHICH CRASHES SHOULD I FIX FIRST?: PREDICTING TOP CRASHES AT AN EARLY STAGE TO PRIORITIZE DEBUGGING... 431
Fig. 2. Approach overview. Our approach has three steps: extracting traces from top and bottom crash reports, creating training data from the traces, and predicting unknown crashes. The first step classifies top and bottom crashes and extracts stack traces from their reports. The second step extracts methods from the stack traces and characterizes these methods using feature data, which are extracted from source code repositories. Feature values are then accumulated per trace. These are used for training a machine learner. In the prediction step, the machine learner takes an unknown crash stack trace and classifies it as a top or bottom trace. (a) Extracting crash traces. (b) Creating corpus. (c) Prediction.
Fig. 1. A Firefox crash message from a user’s perspective.
Approach
ML Classification
Using three feature groups
• History
• Complexity
• Social Network Analysis (SNA) Measures
38
History Features
[Diagram: f( ) appears in four crash traces; g( ) appears in only one.]

"f()" is seen in crash traces more frequently, and is therefore more vulnerable to crashes.
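This history intuition can be sketched as a simple frequency count over the stack traces of known top crashes; the class and method names are illustrative, not the paper's.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HistoryFeatures {
    // Toy history feature: how often each method appears across the
    // stack traces of known top crashes. A method like f( ) that recurs
    // in many top-crash traces gets a high feature value.
    static Map<String, Integer> topCrashFrequency(List<List<String>> topTraces) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> trace : topTraces)
            for (String method : trace)
                freq.merge(method, 1, Integer::sum);
        return freq;
    }
}
```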
39
Complexity Features
[Diagram: f( ) contrasted with g( ).]

"f()" is more complex, and therefore more vulnerable to crashes.
40
SNA Features
[Call-graph diagram: f( ) is connected to g( ), h( ), k( ), x( ), y( ), z( ), and r( ).]

"f()" is well connected, and therefore more vulnerable to crashes.
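One simple instance of such an SNA measure is degree centrality: the number of call-graph neighbors of a method. The paper uses a set of SNA features such as centrality; this minimal illustrative sketch just counts direct callers and callees.

```java
import java.util.List;
import java.util.Map;

public class SnaFeatures {
    // Toy SNA feature: degree centrality of a method in a static call
    // graph (map from method to its callees). A well-connected method
    // like f( ) scores high.
    static int degreeCentrality(Map<String, List<String>> callGraph, String method) {
        int degree = callGraph.getOrDefault(method, List.of()).size(); // callees
        for (Map.Entry<String, List<String>> e : callGraph.entrySet())
            if (!e.getKey().equals(method) && e.getValue().contains(method))
                degree++; // callers
        return degree;
    }
}
```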
41
Evaluation - Preprocessing
[Diagram: Crash Reports are split by Our Approach into Top Crashes and Bottom Crashes.]
42
crashes, they motivate us to investigate three feature groups.
5 EVALUATION

We present the experimental evaluation of our approach in this section. Five research questions will be evaluated:

• RQ1: Is history information indicative of top crashes?
• RQ2: Is the complexity of a method indicative of its chance of triggering top crashes?
• RQ3: Does the connectedness of a method correlate with its chance of occurring in top crashes?
• RQ4: Is the size of the training data relevant to the accuracy of top-crash prediction?
• RQ5: Which feature is more indicative than the other features?

This section describes the experiment setup used to evaluate our research questions and reports the experimental results.
5.1 Experiment Setup
For our experiments, we used real crash reports from two open source systems: Firefox and Thunderbird. To demonstrate the effectiveness of our approach toward unknown stack traces, we explicitly separated the training set and the testing set. For example, we collected a training set from Firefox 3.0.9 and a testing set from Firefox 3.0.10. Sometimes, crashes may not be fixed in the following versions. For example, the crash "_PR_MD_SEND" in Firefox 3.0.9 was not fixed in Firefox 3.0.10. As a result, we find that some crashes are reported across different software versions. For fair experiments, we ensured that reports of the same crash did not appear in both the training and the testing sets by removing these reports from our experiments.
Table 4 describes the data sets (corpus) used in our experiments. We collected crash reports for four programs (two versions of Firefox and two versions of Thunderbird). The two Firefox projects had more than 1,000 data instances (i.e., trace-based feature vectors) extracted from the stack trace database, while the two Thunderbird projects had around 590 data instances. Each project had the same number of top and bottom crashes. Each instance was characterized by 10 history, 28 CM, and five SNA features, as described in Section 4.2, and had 86 elements (sum and average of features), as described in Section 4.3.
Specifically, we created training sets as follows:

1. Sort crashes and choose the top-20 crashes.
2. Randomly select n (e.g., 40 in the case of Firefox 3.0.9) stack traces for each crash.
3. Choose the bottom-20 crashes and select all of their traces, as these crashes had fewer than 10 crash reports (sometimes only one).
4. Select the additional bottom 20+k crashes and select all traces until the number of traces is equal to the number of top traces.

The testing sets were also created in the same manner.
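The four steps above can be sketched roughly as follows, assuming crashes arrive pre-sorted by report frequency; the class, parameter names, and flat trace representation are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainingSetBuilder {
    // Balanced top/bottom trace selection, following the steps above:
    // sample n traces from each of the top-K crashes, then take all
    // traces from the least-frequent crashes until the classes balance.
    static List<String> build(List<List<String>> crashesByFreq,
                              int topK, int n, Random rng) {
        List<String> top = new ArrayList<>();
        for (int i = 0; i < topK; i++) {                       // steps 1-2
            List<String> traces = new ArrayList<>(crashesByFreq.get(i));
            Collections.shuffle(traces, rng);
            top.addAll(traces.subList(0, Math.min(n, traces.size())));
        }
        List<String> bottom = new ArrayList<>();
        for (int i = crashesByFreq.size() - 1;                 // steps 3-4
             i >= topK && bottom.size() < top.size(); i--)
            bottom.addAll(crashesByFreq.get(i));
        List<String> all = new ArrayList<>(top);
        all.addAll(bottom.subList(0, Math.min(top.size(), bottom.size())));
        return all;
    }
}
```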
We only used history information in the training set to create our testing set, as we assumed that we did not know the history information of the testing set. For example, we counted how many times a method appeared in top crashes for the training set. It is possible that some methods in the testing set did not appear in the training set. In this case, we set the corresponding history features as missing values [37].
For a machine learner, we used two machine learning algorithms: Naive Bayes (NB) [45] and multilayer perceptron (MLP) [52]. Naive Bayes is a simple probabilistic classification algorithm based on Bayes' theorem [6] with strong naive independence assumptions. It takes training data and calculates probabilities from them. When a new instance is presented, it predicts the target value of the new instance. It is adopted for our evaluation because of its simple structure and fast learning.

MLP is a feedforward artificial neural network [27]. It has several layers of perceptrons, which are simple binary classifiers. Learning in MLP occurs by changing the connection weights between perceptrons after the training data are processed. MLP was chosen for our evaluation because it can efficiently classify nonlinear problems [52] (we assumed that it is difficult to learn the features in trace-based feature vectors using linear functions).
In addition, we applied the feature selection algorithm proposed by Shivaji et al. [53], which is based on a backward wrapped feature selection technique [47]. First, we put the features in order according to their predictive power as measured by the information gain ratio [34], a well-known measure of the amount by which a given feature contributes information to a classification decision. Then, we removed the least significant feature from the feature set and measured the top/bottom crash prediction accuracy. Next, we continually removed the next weakest feature and measured the accuracy until there was only one feature left in the feature set. After this iteration, it was possible to identify the best prediction accuracy and the feature set that yielded the best accuracy.
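That backward elimination loop can be sketched as follows, with a scoring function standing in for a full train-and-evaluate cycle; the class and method names are illustrative, and features are assumed to be pre-ranked by information gain ratio (strongest first).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class FeatureElimination {
    // Backward wrapped feature selection sketch: repeatedly drop the
    // weakest remaining feature, score each subset, and return the
    // best-scoring subset seen.
    static List<String> bestSubset(List<String> rankedFeatures,
                                   ToDoubleFunction<List<String>> accuracy) {
        List<String> current = new ArrayList<>(rankedFeatures);
        List<String> best = new ArrayList<>(current);
        double bestAcc = accuracy.applyAsDouble(current);
        while (current.size() > 1) {
            current.remove(current.size() - 1); // drop least significant
            double acc = accuracy.applyAsDouble(current);
            if (acc > bestAcc) {
                bestAcc = acc;
                best = new ArrayList<>(current);
            }
        }
        return best;
    }
}
```

In the paper's setting the scoring function would be a train-and-test run of the classifier on the candidate feature subset.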
Although our application scenarios consider prediction at an early stage (e.g., the alpha- or beta-testing phases), our evaluation concerns two subsequent official release versions (Firefox) because we focused on a performance comparison between our approach and the wait-and-see approach. In
TABLE 4. Data Set Used in Our Experiments
Experiment Subjects
other words, we cannot compare the performance if we predict the alpha-version crash stack traces, as stated in the background (Section 3); the wait-and-see approach does not work for the alpha version. Note that the stack traces of alpha versions are the same as those of official versions. Therefore, our evaluation deals with the correct subjects.

In the case of Thunderbird, we adopted two subsequent alpha versions for our evaluation because these versions are quasi-official versions, which contain sufficient crash reports. In addition, crash reports of the latest official version (Thunderbird 2.0) are currently not available. Therefore, no crash reports of that version could be collected.
To implement all the machine learning algorithms mentioned above, we used the Weka [56] library.
5.2 Evaluation Measures
Applying a machine learner to a top-crash prediction problem can result in four possible outcomes:

1. predicting a top stack trace as a top stack trace (T→T),
2. predicting a top stack trace as a bottom stack trace (T→B),
3. predicting a bottom stack trace as a top stack trace (B→T), and
4. predicting a bottom stack trace as a bottom stack trace (B→B).

Items 1 and 4 are correct predictions, while the others are incorrect.

We used the above outcomes to evaluate the classification with the following four measures [3], [31], [48]:
• Accuracy: the number of correctly classified stack traces divided by the total number of traces. This is a good overall measure of classification performance.

  Accuracy = (N(T→T) + N(B→B)) / (N(T→T) + N(T→B) + N(B→T) + N(B→B))    (1)

• Precision: the number of stack traces correctly classified as the expected class (N(T→T) or N(B→B)) over the number of all traces classified as top or bottom stack traces (N(T→T) + N(B→T) or N(B→B) + N(T→B)).

  Precision of top crash traces:    P(T) = N(T→T) / (N(T→T) + N(B→T))    (2)
  Precision of bottom crash traces: P(B) = N(B→B) / (N(B→B) + N(T→B))    (3)

• Recall: the number of traces correctly classified as top or bottom traces (N(T→T) or N(B→B)) over the number of actual top or bottom stack traces.

  Top trace recall:    R(T) = N(T→T) / (N(T→T) + N(T→B))    (4)
  Bottom trace recall: R(B) = N(B→B) / (N(B→B) + N(B→T))    (5)

• F-score: a composite measure of precision P(*) and recall R(*) for each class (top and bottom).

  F(*) = 2 · P(*) · R(*) / (P(*) + R(*))    (6)
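Implemented directly, equations (1)-(6) are only a few lines; a minimal sketch (the class and method names are mine, not the paper's), where the four arguments are the confusion-matrix counts, e.g., nTB is the number of top traces predicted as bottom:

```java
public class Measures {
    static double accuracy(int nTT, int nTB, int nBT, int nBB) {
        return (double) (nTT + nBB) / (nTT + nTB + nBT + nBB); // (1)
    }
    static double precisionTop(int nTT, int nBT) {
        return (double) nTT / (nTT + nBT);                     // (2)
    }
    static double precisionBottom(int nBB, int nTB) {
        return (double) nBB / (nBB + nTB);                     // (3)
    }
    static double recallTop(int nTT, int nTB) {
        return (double) nTT / (nTT + nTB);                     // (4)
    }
    static double recallBottom(int nBB, int nBT) {
        return (double) nBB / (nBB + nBT);                     // (5)
    }
    static double fScore(double p, double r) {
        return 2 * p * r / (p + r);                            // (6)
    }
}
```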
5.3 Prediction Results

This section reports our prediction results. First, we applied our approach to two subsequent versions. For example, we trained a model with Firefox and then applied the model to a subsequent version of Firefox. Second, we applied our approach across projects: We trained a model on Firefox and applied it to Thunderbird, and vice versa. Table 5 shows
TABLE 5. Prediction Results

Experiments were conducted for four subjects: two same-project subjects and two cross-project subjects. For each subject, Naive Bayes, NB with feature selection, multilayer perceptron, and MLP with FS were used to classify top and bottom crashes. Four criteria were measured: accuracy, precision, recall, and F-score. In terms of accuracy, MLP outperformed Naive Bayes except for the fourth subject, and MLP with FS outperformed MLP and Naive Bayes for all subjects.
Results
the overall results. These results may answer RQ1, 2, and 3. For more details (i.e., the predictive power of individual feature groups), see Section 5.5.

For the subsequent-version prediction, our approach predicted top or bottom crashes with >75 percent accuracy, which is sufficiently high to be useful in practice. Note that the accuracy of a random guess would be around 50 percent, since our testing sets were evenly distributed, as shown in Table 5. In terms of top-crash precision, the accuracy of our model was around 90 percent for Thunderbird and 75 percent for Firefox. Overall, we believe our approach is effective and accurate at identifying top crashes as soon as a new crash report arrives.

For the cross-project prediction, the accuracy was around 70 percent, which is slightly lower than that of the subsequent-version prediction. However, an accuracy of 70 percent is still considerably better than that of a random prediction. These results suggest that our trained prediction model can be applied to new projects. For example, suppose that the Mozilla group releases a new product. It is possible to predict the new product's crashes as top or bottom using our prediction model trained from Firefox crashes.

MLP mostly outperformed Naive Bayes. We obtained the best results when we used MLP with feature selection. This implies that using appropriate combinations of features increased the prediction accuracy. We discuss the predictive power of various training data sizes in Section 5.4, and of each feature and feature group in Section 5.5.
5.4 Size of Training Data

In this experiment, we evaluate the impact of training set size to measure the training data size necessary (i.e., the number of crash instances represented in the feature vectors described in Section 4.3) for yielding a reasonable prediction accuracy (around 70 percent) [43] (RQ4). We trained our prediction model using various training set sizes and measured the accuracy. Figs. 10 and 11 show the prediction accuracy with various sizes of training data. We also used different feature groups (history, SNA, CM, and all) to measure the accuracy.

In the case of Firefox (Fig. 10), the accuracy jittered when our model was trained with fewer than 200 training instances. However, after 250 training instances, the results stabilized and reached a reasonable accuracy. Similarly, the accuracy for Thunderbird (Fig. 11) settled after 150 training instances.
5.5 Feature Sensitivity Analysis

In this section, we measure and discuss the sensitivity (predictive power) of feature groups and individual features (RQ1, 2, 3, and 5).

To measure the predictive power of each feature group, we trained our prediction model with three different feature groups: history, CM, and SNA (as described in Section 4.2, these feature groups had 10, 28, and five features, respectively). The results are shown in Figs. 10 and 11.

In the case of Firefox, CM features outperformed the other feature groups. They were more than 70 percent accurate and close to the accuracy of all features (for some training data sizes, they even outperformed the accuracy of all features together). The history feature group showed around 65 percent accuracy after 200 training instances. However, the SNA feature group performed worse than a random guess.

In the case of Thunderbird, all three feature groups showed more than 60 percent accuracy, and the history and SNA feature groups showed more than 70 percent accuracy after 600 training instances. The history feature group even outperformed the case in which all features were used.
Fig. 10. Prediction accuracy using various training data sizes (Firefox 3.0.10, training on Firefox 3.0.9). This graph shows the accuracy on the basis of different feature groups: social network analysis, complexity metrics, history, and all. At the beginning, the accuracy jitters, but it stabilizes after 250 training instances.

Fig. 11. Prediction accuracy using various training data sizes (Thunderbird 3.0a2, training on Thunderbird 3.0a1). This graph shows accuracy on the basis of different feature groups, the same as Fig. 10. This also has some jitter, but the accuracy stabilized after 150 training instances. Compared to Fig. 10, the accuracy for all four feature groups increased gradually.
Automatic Patch Generation Learned from Human-Written Patches
Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim
The Hong Kong University of Science and Technology, China
the 35th International Conference on Software Engineering (ICSE 2013)
ACM SIGSOFT Distinguished Paper Award
44
45
GenProg
C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer, “A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each,” in ICSE ’12.
GenProg
• State-of-the-art
• Genetic Programming
• Random Mutation
• Systematically Evaluated
46
Buggy Code
1500  num = state.parenCount;
1501  int kidMatch = matchRENodes(state, (RENode)ren.kid,
1502                              stop, index);
1503  if (kidMatch != -1) return kidMatch;
1504  for (int i = num; i < state.parenCount; i++)
1505      state.parens[i].length = 0;
1506  state.parenCount = num;
in Interpreter.java reported as Mozilla Bug #76683
Null Pointer Exception
47
GenProg repairs bugs
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502 ����� � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� {�1506� � //�deleted.�1507� }�1508� state.parenCount�=�num;�
�
47
GenProg repairs bugs
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
Buggy Code
GenProg
47
![Page 86: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/86.jpg)
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502 ����� � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� {�1506� � //�deleted.�1507� }�1508� state.parenCount�=�num;�
�
47
GenProg repairs bugs
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
Buggy Code
GenProg
47
![Page 87: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/87.jpg)
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502 ����� � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� {�1506� � //�deleted.�1507� }�1508� state.parenCount�=�num;�
�
47
GenProg repairs bugs
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
Buggy Code
GenProg
This patch passes ALL test cases.
47
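The repair above comes from GenProg's generate-and-validate loop: mutate the program (for example, delete a statement), then accept any variant that passes every test. Below is a minimal, hypothetical Java sketch of that loop; all names are illustrative, and the real GenProg uses genetic search over C ASTs with fault-localization-weighted mutation, not a list of statement strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Toy generate-and-validate repair loop in the spirit of GenProg.
// A "program" is a list of statements; a "test" is a predicate over it.
public class ToyGenProg {
    static List<String> repair(List<String> buggy,
                               List<Predicate<List<String>>> tests,
                               long seed, int maxTries) {
        Random rnd = new Random(seed);
        for (int t = 0; t < maxTries; t++) {
            List<String> candidate = new ArrayList<>(buggy);
            // Mutation: delete one randomly chosen statement.
            candidate.remove(rnd.nextInt(candidate.size()));
            // Validation: accept the first variant that passes every test.
            if (tests.stream().allMatch(p -> p.test(candidate)))
                return candidate;
        }
        return null; // no repair found within the budget
    }

    public static void main(String[] args) {
        List<String> buggy = List.of("a = 1;", "crash();", "return a;");
        // A weak test suite: it only demands that crash() is gone,
        // so deleting code counts as a perfectly "valid" repair.
        Predicate<List<String>> noCrash = p -> !p.contains("crash();");
        System.out.println(repair(buggy, List.of(noCrash), 42, 100));
    }
}
```

Because validation is only as strong as the test suite, deleting the faulty statement, as in the Rhino patch above, can pass all tests while silently discarding intended behavior.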
![Page 88: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/88.jpg)
48
GenProg repairs bugs1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502 ����� � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� {�1506� � //�deleted.�1507� }�1508� state.parenCount�=�num;�
�
48
![Page 89: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/89.jpg)
48
GenProg repairs bugs1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502 ����� � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� {�1506� � //�deleted.�1507� }�1508� state.parenCount�=�num;�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502 ����� � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� {�1506� � //�deleted.�1507� }�1508� state.parenCount�=�num;�
�48
![Page 90: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/90.jpg)
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
48
GenProg repairs bugs1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502 ����� � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� {�1506� � //�deleted.�1507� }�1508� state.parenCount�=�num;�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502������ � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� � state.parens[i].length�=�0;�1506� state.parenCount�=�num;�
�
�
1500� num�=�state.parenCount;�1501� int�kidMatch�=�matchRENodes(state,�(RENode)ren.kid,�1502 ����� � � � � stop,�index);�1503� if�(kidMatch�!=�Ş1)�return�kidMatch;�1504� for�(int�i�=�num;�i�<�state.parenCount;�i++)�1505� {�1506� � //�deleted.�1507� }�1508� state.parenCount�=�num;�
�48
![Page 91: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/91.jpg)
49

Would you accept?

1500  num = state.parenCount;
1501  int kidMatch = matchRENodes(state, (RENode)ren.kid,
1502                              stop, index);
1503  if (kidMatch != -1) return kidMatch;
1504  for (int i = num; i < state.parenCount; i++)
1505      state.parens[i].length = 0;
1506  state.parenCount = num;

→

1500  num = state.parenCount;
1501  int kidMatch = matchRENodes(state, (RENode)ren.kid,
1502                              stop, index);
1503  if (kidMatch != -1) return kidMatch;
1504  for (int i = num; i < state.parenCount; i++)
1505  {
1506      // do nothing.
1507  }
1508  state.parenCount = num;

Survey of 17 students and 68 developers: 9.4% would accept, 90.6% would not.
50

Human-written Patches
Readable
Natural
Easy to understand

We can learn how to generate patches from human knowledge.
51

JDT: >60,000 patches
Manual classification
(bar chart: number of patches per fix pattern)
Top frequent patterns account for >20–30%
52

Common Fix Patterns

Altering method parameters:
obj.method(v1, v2) → obj.method(v1, v3)
Adding a checker:
obj.m1() → if (obj != null) { obj.m1() }
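The "Adding a checker" pattern can be sketched as a purely string-level rewrite. This is a hypothetical illustration of the transformation only; PAR itself performs such edits on Java ASTs, not strings.

```java
// Toy illustration of the "Adding a checker" fix pattern:
// wrap a statement that dereferences a receiver in a null guard.
public class NullCheckerPattern {
    // Given "obj.m1();", produce "if (obj != null) { obj.m1(); }".
    static String addNullChecker(String stmt) {
        int dot = stmt.indexOf('.');
        if (dot < 0) return stmt;                 // no receiver: nothing to guard
        String receiver = stmt.substring(0, dot).trim();
        return "if (" + receiver + " != null) { " + stmt.trim() + " }";
    }

    public static void main(String[] args) {
        System.out.println(addNullChecker("obj.m1();"));
        // → if (obj != null) { obj.m1(); }
    }
}
```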
53

PAR
Pattern-based Automatic Program Repair
54

Using Human Knowledge for patch generation

Fix Templates (program edit scripts):
10 templates, manually created from the fix patterns (from the JDT patch study)
Highly reusable
55

PAR Workflow

Buggy Program, e.g.:
if (lhs == DBL_MRK) lhs = ...;
if (lhs == undefined) { lhs = strings[pc + 1]; }
Scriptable calleeScope = ...;

(a) Fault Localization → Fault Location
(b) Template-based Patch Candidate Generation: Fix Template + Fault Location → Patch Candidate
(c) Patch Evaluation: Fail → try the next candidate; Pass → Repaired Program
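The three phases can be sketched as one loop: try each fix template at each suspicious location and keep the first candidate that passes all tests. A minimal illustration with invented names follows; the real PAR ranks locations with spectrum-based fault localization and edits ASTs rather than statement strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// (a) fault localization gives suspicious line indices,
// (b) templates instantiated at each location yield patch candidates,
// (c) a candidate that passes all tests becomes the repaired program.
public class ParPipelineSketch {
    static Optional<List<String>> repair(List<String> program,
                                         List<Integer> suspiciousLines,
                                         List<UnaryOperator<String>> templates,
                                         Predicate<List<String>> testsPass) {
        for (int line : suspiciousLines) {                // (a)
            for (UnaryOperator<String> t : templates) {   // (b)
                List<String> candidate = new ArrayList<>(program);
                candidate.set(line, t.apply(candidate.get(line)));
                if (testsPass.test(candidate))            // (c)
                    return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        UnaryOperator<String> addGuard = s -> "if (x != null) " + s;
        Optional<List<String>> fixed = repair(
            List.of("x.f();"), List.of(0), List.of(addGuard),
            p -> p.get(0).startsWith("if"));
        System.out.println(fixed.orElse(null));
        // → [if (x != null) x.f();]
    }
}
```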
56

Using a Fix Template: An Example

Faulty code:
1500  num = state.parenCount;
1501  int kidMatch = matchRENodes(state, (RENode)ren.kid,
1502                              stop, index);
1503  if (kidMatch != -1) return kidMatch;
1504  for (int i = num; i < state.parenCount; i++)
1505      state.parens[i].length = 0;
1506  state.parenCount = num;

Fix Template: Null Pointer Checker
obj ref.: state, parens[i], ...
Check obj ref.: PASS
Edit: Insert
+ if (state != null && state.parens[i] != null) {
      state.parens[i].length = 0;
+ }

Patched code:
1500  num = state.parenCount;
1501  int kidMatch = matchRENodes(state, (RENode)ren.kid,
1502                              stop, index);
1503  if (kidMatch != -1) return kidMatch;
1504  for (int i = num; i < state.parenCount; i++)
1505  {
1506      if (state != null && state.parens[i] != null)
1507          state.parens[i].length = 0;
1508  }
1509  state.parenCount = num;

Compare GenProg's patch for the same location, which simply deleted the loop body.
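A fix template in this walkthrough pairs an applicability check (is there an object reference to guard?) with an edit script. Here is a hypothetical sketch of that shape; the guard expression is hard-coded to match the example above, whereas a real template derives the receivers to check from the faulty statement itself.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// A fix template = applicability check + edit script (illustrative only).
public class FixTemplateSketch {
    record FixTemplate(String name,
                       Predicate<String> applicable,
                       UnaryOperator<String> edit) {}

    static final FixTemplate NULL_POINTER_CHECKER = new FixTemplate(
        "Null Pointer Checker",
        // Applicable only if the faulty statement dereferences something.
        stmt -> stmt.contains("."),
        // Edit: guard the statement. The receivers to check are hard-coded
        // here to match the example; PAR extracts them from the AST.
        stmt -> "if (state != null && state.parens[i] != null) " + stmt);

    public static void main(String[] args) {
        String faulty = "state.parens[i].length = 0;";
        if (NULL_POINTER_CHECKER.applicable().test(faulty))
            System.out.println(NULL_POINTER_CHECKER.edit().apply(faulty));
        // → if (state != null && state.parens[i] != null) state.parens[i].length = 0;
    }
}
```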
57
List of Templates
Parameter Replacer
Method Replacer
Parameter Adder and Remover
Expression Replacer
Expression Adder and Remover
Object Initializer
Range Checker
Collection Size Checker
Null Pointer Checker
Class Cast Checker
58

Evaluation: Experiment Design
PAR vs. GenProg
59

Evaluation: Research Questions
RQ1 (Fixability): How many bugs are fixed successfully?
RQ2 (Acceptability): Which approach can generate more acceptable bug patches?
60

Experiment Subjects

Subject       # bugs      LOC   # test cases
Rhino             17   51,001          5,578
AspectJ           18  180,394          1,602
log4j             15   27,855            705
Math              29  121,168          3,538
Lang              20   54,537          2,051
Collections       20   48,049         11,577
Total            119  351,406         25,051
61

RQ1: Fixability
(bar chart: bugs fixed) PAR: 27, GenProg: 16
PAR (27) > GenProg (16)
62
RQ2: Acceptability
[Bar charts of user-study responses (%): PAR vs. human-written patch — bars labeled PAR, Human, Both, Not Sure with values 21, 28, 37, 14, and a 49% callout; GenProg vs. human-written patch — bars labeled GenProg, Human, Both, Not Sure with values 20, 12, 51, 17, and a 32% callout.]
PAR generates more acceptable patches than GenProg.
![Page 145: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/145.jpg)
63
Quick Tips on Mining
Repositories hate massive crawlers.
Data formats can change frequently.
Noise filtering [ICSE2011, ICSE2013] is very important.
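The first tip above — don't hammer the repository you are mining — can be sketched as a throttled fetch loop with simple backoff. This is a minimal illustration, not code from the talk; `polite_fetch` and its parameters are hypothetical names, and a real crawler should additionally honor the service's documented rate-limit headers:

```python
import time

def polite_fetch(urls, fetch, delay=1.0, max_retries=3):
    """Fetch each URL via `fetch` (any callable url -> data),
    pausing between requests and backing off on transient errors."""
    results = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except IOError:
                # Transient failure: wait longer on each retry.
                time.sleep(delay * (attempt + 1))
        # URLs that fail all retries are simply skipped here.
        time.sleep(delay)  # throttle: never issue back-to-back requests
    return results
```

Passing the fetcher in as a callable also makes the loop easy to test with a fake that simulates network failures.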
![Page 146: Good Hunting: Locating, Prioritizing, and Fixing Bugs Automatically (Keynote, IWESEP 2013)](https://reader033.fdocuments.us/reader033/viewer/2022052901/5568ed85d8b42a287a8b5573/html5/thumbnails/146.jpg)
64
Future Directions
Automatic Fix Template Identification
Tangled Changes [MSR2013]
Build scripts (e.g., Ant, Maven, and Gradle)