
A replicated study on duplicate detection:

Using Apache Lucene to search among Android defects

Jens Johansson, Markus Borg, Per Runeson and Mika Mäntylä

Published 2014. DOI: 10.1145/2652524.2652556

Abstract

Duplicate detection is a fundamental part of issue management. Systems able to predict whether a new defect report will be closed as a duplicate may decrease costs by limiting rework and collecting related pieces of information. Previous work relies on the textual content of the defect reports, often assuming that better results are obtained if the title is weighted as more important than the description. We conduct a conceptual replication of a well-cited study conducted at Sony Ericsson, using Apache Lucene for searching in the public Android defect repository. In line with the original study, we explore how varying the weighting of the title and the description affects the accuracy. Our work shows the potential of using Lucene as a scalable solution for duplicate detection. Also, we show that Lucene obtains the best results when the defect report title is weighted three times higher than the description, a bigger difference than has been previously acknowledged.

1 Introduction

Issue management is an important part of large-scale software evolution. Development organizations typically manage thousands of defect reports in issue repositories such as Jira or Bugzilla. In large projects, the continuous inflow of defect reports can be hard to overview [2]. An important step in early issue triaging is to determine whether an issue has been reported before. This activity, known as duplicate detection, is typically a manual process. Previous studies show that the fraction of duplicate defect reports ranges from 10% to 30%, both in proprietary and Open Source Software (OSS) development [1, 7, 5]. Moreover, in projects where end users can also submit defect reports, i.e., not only developers and testers, the fraction of duplicates may be even higher [12].

In 2007, we implemented automated duplicate detection at Sony Ericsson [7]. We showed that our prototype tool, based on Information Retrieval (IR) techniques, discovered about 40% of the true duplicates in a large project at the company. Furthermore, we argued that the approach had the potential to find about 2/3 of the duplicate issue reports. Also, we estimated that the tool support could save about 20 hours of issue triaging per 1,000 submitted defect reports.

In our previous study, we also discovered a number of IR configurations that have guided subsequent research on duplicate detection. We found that weighting the title twice as important as the description of a defect report yielded improved results, a finding that has been confirmed in several later studies [12, 5, 9]. Moreover, we found that most duplicate defect reports were submitted within three months of the master report, thus providing an option to filter the duplicate detection accordingly.

We now return to two of the directions pointed out as future work in the original study. First, we argued that the IR approach used in our prototype tool should be compared to an off-the-shelf search engine. While several papers have been published on duplicate detection in recent years [3], such an evaluation has not yet been conducted. Second, we argued that our work should be replicated on another dataset. Several researchers have presented work on other datasets (see Section 2), and also confirmed the value of up-weighting the title; however, a deeper study of the title vs. description balance has not been presented. Also, the usefulness of filtering duplicates based on differences in submission dates has not been confirmed in later studies.

In this study, we evaluate using the search engine Apache Lucene (http://lucene.apache.org/) for duplicate detection. Lucene is an OSS state-of-the-art search engine library, often integrated in large-scale full-text search solutions in industry. We study a large set (n=20,175) of defect reports collected from the Android development project, i.e., the dataset originates from the domain of mobile devices for telecommunication, in line with our original study.

The following Research Questions (RQs), the same as in the original study, guide our work:

RQ1 How many duplicate defect reports arefound by Apache Lucene?

RQ2 What is the relative importance of the titlevs. the description?

RQ3 How can the submission date be used tofilter the results?

2 Background and related work

When IR is used for duplicate detection, the query consists of the textual content of the newly submitted defect report, and the IR system searches the issue repository for potential duplicates and lists them accordingly. Several researchers have published studies on duplicate detection since our original study.


Wang et al. complemented the textual content of defect reports with stack traces [12]. While their evaluation contained fewer than 2,000 defect reports, they report a Rc@10 (defined in Section 3.3) as high as 93% on data collected from the development of Firefox. Sun et al. proposed including also categorical features to represent defect reports [9]. Also, they considered both unigrams (single terms in isolation) and bigrams (sequences of two terms) in the textual description, reaching a Rc@10 of 65-70% on the three OSS projects Firefox, OpenOffice, and Mozilla. Sureka et al. presented a different approach to text representation, considering defect descriptions on the character level [11]. They did not perform any pre-processing, and thus allow duplicate detection across languages. They evaluated their approach on a random sample of 2,270 defect reports from the Eclipse project, and obtained a Rc@10 of 40%.

Other researchers have instead treated duplicate detection as a classification problem. A classifier can be trained on historical defect reports to answer the question "is this new defect report a duplicate of an already known issue?". Jalbert and Weimer complemented IR techniques with training a linear regression model using both textual information and categorical features [5]. The regression model then learns a threshold that distinguishes duplicates from unique issues, reaching a Rc@10 of 45%. Sun et al. [10] instead used SVM, a standard classifier, to detect duplicated defect reports. By considering both unigrams and bigrams in the title and description, they achieve a Rc@10 of about 60%.

3 Method

This section presents the IR tool used, thedataset, and the experimental design.

3.1 Apache Lucene for duplicate detection

We use Apache Lucene version 3.6.1 to develop a prototype tool for duplicate detection.


Lucene provides a scalable solution to index and search text, in our case to identify textually similar defect reports. For each defect report in our dataset, Lucene creates a document consisting of fields that represent the defect report. The documents are stored in an index. Finally, we use other defect reports as search queries to the index, and Lucene returns a ranked list of the search results, i.e., potential duplicate defect reports.
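
To make this pipeline concrete, the following is a minimal sketch of how such an index could be built with the Lucene 3.6 API. The field names (id, title, description), the input format, and the choice of an in-memory index are our own illustration, not code from the original study.

```java
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class DefectIndexer {
    /** Builds an in-memory index with one Lucene document per defect report. */
    public static Directory buildIndex(Iterable<String[]> reports) throws Exception {
        Directory dir = new RAMDirectory();
        // EnglishAnalyzer tokenizes, removes stop words, and applies stemming.
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_36, new EnglishAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, cfg);
        for (String[] r : reports) {  // r[0] = id, r[1] = title, r[2] = description
            Document doc = new Document();
            doc.add(new Field("id", r[0], Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", r[1], Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("description", r[2], Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
        return dir;
    }
}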

Lucene matches queries and documents by calculating similarities. Each search result is given a raw score, i.e., a floating-point number ≥ 0.0, where higher numbers indicate better matches. It is also possible to set boost factors in Lucene, weighting features as more or less important. The default boost factor for all fields is 1.0, but it can be any value ≥ 0. General practice is to find the best boost factors by trial and error.
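
As an illustration of per-field boosting, the sketch below queries the index built above with the title weighted three times the description, using MultiFieldQueryParser from Lucene 3.6. The concrete boost values are examples, not necessarily the configuration used in the study.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class DuplicateSearcher {
    /** Returns the top-10 most similar reports, weighting title 3x the description. */
    public static ScoreDoc[] topCandidates(Directory dir, String queryText) throws Exception {
        Map<String, Float> boosts = new HashMap<String, Float>();
        boosts.put("title", 3.0f);        // boost factors are illustrative
        boosts.put("description", 1.0f);
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                Version.LUCENE_36,
                new String[] {"title", "description"},
                new EnglishAnalyzer(Version.LUCENE_36),
                boosts);
        // Escape query syntax, since defect text may contain special characters.
        Query query = parser.parse(QueryParser.escape(queryText));
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        return searcher.search(query, 10).scoreDocs;
    }
}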

3.2 Dataset and feature selection

We study a set of 20,175 Android defect reports submitted between November 2007 and September 2011, publicly available as part of the MSR Mining Challenge 2012 [8]. We represent a defect report by the textual content of the title and the description, i.e., as two separate bags-of-words. We apply stop word removal and stemming to the text, using built-in functionality in Lucene. Moreover, we also consider the submission date. We planned to include categorical features such as the severity and component as well, but pilot runs showed that they did not add any predictive value for the duplicate detection task.
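
The paper does not state which analyzer performed the stop word removal and stemming; as an assumption, the snippet below runs sample text through Lucene 3.6's EnglishAnalyzer, one built-in option that does both, and prints the resulting index terms.

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_36);
        TokenStream ts = analyzer.tokenStream("title", new StringReader(
                "The camera crashes when recording videos"));  // hypothetical title
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // Stop words ("the", "when") are dropped; remaining terms are stemmed,
            // e.g. "camera", "crash", "record", "video".
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}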

The dataset contains 1,158 defect reports with duplicates. However, a defect report can have several duplicates. Also, we observed that some bug reports have high numbers of duplicates; 160 defect reports have as many as 20 duplicates or more. In the dataset, duplicates are signaled using one-way references to the master report. By translating all duplicate links to two-way relations, and by using the transitive property of equality (if a is a duplicate of b, and b is a duplicate of c, then a is a duplicate of c), we obtain in total 16,050 duplicate relations. More than half of the duplicate defect reports were submitted within 20 days, and almost 75% within 200 days.
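
One way to compute such duplicate sets is a union-find pass over the one-way links, treating them as symmetric and transitive. The sketch below is our own illustration of that closure, not the authors' implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicateSets {
    private final Map<Integer, Integer> parent = new HashMap<Integer, Integer>();

    /** Finds the set representative of a report id, with path compression. */
    int find(int x) {
        Integer p = parent.get(x);
        if (p == null) { parent.put(x, x); return x; }
        if (p != x) { p = find(p); parent.put(x, p); }
        return p;
    }

    /** Merges the sets of a duplicate report and its master report. */
    void union(int duplicate, int master) {
        parent.put(find(duplicate), find(master));
    }

    public static void main(String[] args) {
        DuplicateSets sets = new DuplicateSets();
        sets.union(1, 2);  // report 1 is marked as a duplicate of report 2
        sets.union(2, 3);  // report 2 is marked as a duplicate of report 3
        // By transitivity, reports 1, 2 and 3 end up in the same duplicate set.
        System.out.println(sets.find(1) == sets.find(3));  // prints: true
    }
}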

3.3 Measures

The most commonly reported measure for evaluating duplicate detection in issue management is the recall when considering the top-N recommendations (Rc@N) [3], i.e., how many of the duplicates are found within the top 1, 2, 3, ..., 10 recommendations (as a cut-off). Thus, Rc@N shows the fraction of the true duplicate reports actually presented among the top N recommendations. We restrict the cut-off level to 10, as that is a reasonable number of duplicate candidates for a developer to assess.
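
A per-query version of this measure could look as follows; this is our own sketch, and the paper's overall Rc@10 aggregates over all 16,050 duplicate relations rather than averaging per query.

```java
import java.util.List;
import java.util.Set;

public class RecallAtN {
    /** Fraction of a query's true duplicates found among the top-N results. */
    static double recallAtN(List<Integer> ranked, Set<Integer> trueDuplicates, int n) {
        if (trueDuplicates.isEmpty()) return 0.0;
        int found = 0;
        for (int i = 0; i < Math.min(n, ranked.size()); i++) {
            if (trueDuplicates.contains(ranked.get(i))) found++;
        }
        return (double) found / trueDuplicates.size();
    }
}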

We complement this measure with Mean Average Precision (MAP), a popular IR measure that provides a single-figure result reflecting the quality of a ranked list of search results. Within IR research, MAP is known to be a stable measure, good at discriminating between different approaches [6]. Also, we have previously argued that MAP, a so-called secondary measure for IR evaluation, should be used more often in software engineering research [4].

We define Average Precision (AP) at the cut-off N as:

AP@N = \frac{\sum_{k=1}^{N} P(k) \times rel(k)}{\text{number of relevant documents}}    (1)

where P(k) is the precision at cut-off k, and rel(k) equals 1 if the search result at rank k is relevant, and 0 otherwise. The MAP is calculated over all the queries in the dataset. For duplicate detection, the relevant documents are the true duplicates.
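
Equation 1 translates directly into code. The sketch below is our reading of it; MAP@10 is then obtained by averaging apAtN over all queries in the dataset.

```java
import java.util.List;
import java.util.Set;

public class AveragePrecision {
    /** AP@N per Equation 1: sums precision at each rank k holding a true
     *  duplicate, normalized by the number of relevant documents. */
    static double apAtN(List<Integer> ranked, Set<Integer> relevant, int n) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int k = 1; k <= Math.min(n, ranked.size()); k++) {
            if (relevant.contains(ranked.get(k - 1))) {
                hits++;
                sum += (double) hits / k;  // P(k) * rel(k), with rel(k) = 1
            }
        }
        return sum / relevant.size();
    }
    // MAP@N is the mean of apAtN over all queries.
}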

3.4 Experimental design and validity

We perform two controlled experiments, ExpA and ExpB, to tackle our RQs. For both experiments we use all 20,175 defect reports as search queries in Lucene once, and evaluate the ranked output at the cut-off 10. In both experiments, the dependent variables are Rc@10 and MAP@10.

First, ExpA evaluates the performance of Lucene-based duplicate detection among Android defect reports (RQ1). To study the relative importance of the title vs. the description (RQ2), we use the Lucene boost factors as independent variables. We explore the effect of different boost factors in a systematic manner, varying the boost factors from 0 to 4.9 in steps of 0.1. For each combination of boost factors we perform an experimental run with Lucene, in total 2,500 experimental runs.
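
A plausible harness for this grid search is sketched below. The paper does not publish its experiment code, so runLuceneExperiment is a hypothetical stand-in for indexing, querying, and scoring one boost configuration.

```java
public class BoostGridSearch {
    public static void main(String[] args) throws Exception {
        double bestRecall = -1.0;
        // 50 x 50 = 2,500 runs: both boosts vary from 0.0 to 4.9 in steps of 0.1.
        for (int t = 0; t < 50; t++) {
            for (int d = 0; d < 50; d++) {
                float titleBoost = t / 10.0f;
                float descBoost = d / 10.0f;
                double recall = runLuceneExperiment(titleBoost, descBoost);
                if (recall > bestRecall) {
                    bestRecall = recall;
                    System.out.printf("Rc@10 %.3f at title=%.1f, desc=%.1f%n",
                            recall, titleBoost, descBoost);
                }
            }
        }
    }

    /** Hypothetical stand-in: index, query, and score one configuration. */
    static double runLuceneExperiment(float titleBoost, float descBoost) {
        return 0.0;  // placeholder
    }
}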

ExpB addresses RQ3, i.e., whether it is useful to filter the results based on submission dates. We evaluate removing duplicate candidates returned by Lucene, keeping only duplicate candidates within a specific time window, in line with the original study [7]. We fix the boost factors based on the results of ExpA. Then we use the size of the time window as the independent variable, varying it between 1 and 1,141 days (the longest time between a duplicate and a master defect report in the dataset). We consider the defect report used as the query to be in the middle of the time window.
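
The filtering step might look as follows; this is our own sketch, assuming per-candidate submission dates, with the query report placed in the middle of the window as described.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class TimeWindowFilter {
    /** Keeps only candidates submitted within +/- windowDays/2 of the query
     *  report, i.e., the query sits in the middle of the time window. */
    static List<Integer> filter(Date queryDate, List<Integer> candidates,
                                List<Date> candidateDates, int windowDays) {
        long halfWindowMs = windowDays * 24L * 60 * 60 * 1000 / 2;
        List<Integer> kept = new ArrayList<Integer>();
        for (int i = 0; i < candidates.size(); i++) {
            long delta = Math.abs(candidateDates.get(i).getTime() - queryDate.getTime());
            if (delta <= halfWindowMs) kept.add(candidates.get(i));
        }
        return kept;
    }
}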

We consider the following validity threats as the most relevant for our study: 1) The share of duplicate defects in the Android database is only roughly 5%, whereas prior work reports that typically 10-30% are duplicates. This atypical share of duplicate defects might make comparison to prior work difficult. 2) The way we establish our gold standard is different. Our dataset does not have fine-granular time stamps, i.e., only dates are available; thus we cannot determine a master report as in previous work. In our sets of duplicates, typically several defect reports have the same submission date. We are restricted to a set-based evaluation, which is likely to boost precision values but decrease recall. Still, we argue that a high-level comparison to the results in previous work can be made.

Rc@10   Title boost   Desc. boost   Title/Desc.
0.138   1.9           0.8           2.375
0.138   2.6           1.1           2.364
0.138   3.4           1.1           3.091
0.138   3.7           1.2           3.083
0.138   3.1           1.3           2.385
0.138   3.3           1.4           2.357
0.138   3.8           1.6           2.375
0.138   4.0           1.7           2.353
0.138   4.3           1.8           2.389
0.138   4.5           1.9           2.368

Baseline
0.134   1.0           1.0           1.0

Table 1: Best relative weightings of title vs. description wrt. Rc@10. The baseline reflects the Rc@10 when using balanced weighting.

4 Results and discussion

In ExpA, the highest Rc@10 we achieve is 0.138, i.e., detecting 2,215 out of 16,050 cases where duplicates are listed among the top ten candidates. Ten different combinations of boost factors yield this result, as presented in Table 1. Thus, our answer to RQ1 is: Lucene finds more than 10% of the true duplicates among the top-10 duplicate candidates.

The performance of our approach appears substantially lower than what has been reported in previous work. However, while previous work has assumed the simplification that every duplicate report has exactly one master report, typically the first one submitted, we instead consider sets of duplicates. Thus, every duplicated defect report might have several different reports to find, many more than under the simplifying assumption. The Rc@10 we obtain reflects the more challenging gold standard used in our evaluation.

ExpA confirms the claim made in previous research that up-weighting the title over the description provides better results in defect duplicate detection. Although the improvement from up-weighting the title is small, 3.0% and 2.8% for Rc@10 and MAP@10, respectively, those improvements are greater than in our original work, where the improvement was 1.2% [7]. Table 2 lists the ten best combinations of boost factors wrt. MAP@10. We see that the best results are achieved when the title is given three times as high a weight as the description, a bigger difference than has been acknowledged in previous work.


MAP@10   Title boost   Desc. boost   Title/Desc.
0.633    4.9           1.6           3.063
0.633    4.6           1.5           3.067
0.633    4.3           1.4           3.071
0.632    3.1           1.0           3.100
0.632    2.5           0.8           3.125
0.632    2.2           0.7           3.143
0.632    4.4           1.4           3.143
0.632    4.7           1.5           3.133
0.632    2.8           0.9           3.111
0.632    2.9           0.9           3.222

Baseline
0.616    1.0           1.0           1.0

Table 2: Best relative weightings of title vs. description wrt. MAP@10. The baseline reflects the MAP@10 when using balanced weighting.

Figure 1: Rc@10 and MAP@10 when considering duplicate candidates within a varying time window.

Our answer to RQ2 is: the title is more important than the description when detecting duplicates using IR; for Android defect reports, it should be given three times the weight.

We hypothesize two main reasons for this difference. First, submitters of defect reports appear to put more thought into formulating the title, carefully writing a condensed summary of the issue. Second, less precise information is added in the description. Stack traces, snippets from log files, and steps to reproduce the issue might all be useful for resolving the defect report, but for duplicate detection based on IR they appear to introduce noise. On the other hand, it is still beneficial to use the textual content of the description, but it should not be given the same weight.

ExpB uses output from ExpA to set promising boost factors. We use two weight configurations to up-weight the title in relation to the description: 2.5 (ConfA) and 3.1 (ConfB). Figure 1 shows the effect of using different time windows. The left subfigure shows the effect on Rc@10 for ConfA; the right subfigure displays how MAP@10 is affected for ConfB. Our results do not show any improvements of practical significance from filtering duplicates based on the submission time. Instead, it appears that considering the full dataset as potential duplicates gives the best results. Our results do not confirm the finding from the original study that only considering an interval of 60 days is beneficial [7]. We answer RQ3 as follows: duplicate detection among Android defect reports does not benefit from filtering results based on the submission date.

As reported in Section 3.2, more than 50% of the duplicate defect reports in the Android dataset were submitted within 20 days, and almost 75% within 200 days. Thus, we expected filtering to have a positive effect on the precision of our ranked lists. This was not at all the case; instead, it appeared consistently better to skip filtering, in contrast to the finding in the original study. The filter appears not to remove any meaningful noise in our case, but instead removes meaningful candidates. This could be caused either by differences in the dataset or by the IR tool used. In any case, our recommendation is: if the textual similarity is high, consider the defect report a potential duplicate even if it was submitted at a distant point in time.

5 Summary and future work

In this paper, we make three important findings in duplicate defect detection. First, we confirm the prior finding that weighting the title as more important than the description can significantly improve duplicate defect detection. Since this confirmation comes from a different dataset and with the use of another tool, we think this result might be generalizable. Yet, future studies are needed before we can draw definitive conclusions. Second, we present conflicting evidence on the benefits of filtering defect reports based on the submission date. This suggests that the benefits of time-based filtering are dependent on the context. Third, we open a new avenue for future work by considering all duplicate relations, whereas prior work has only considered that every defect has a single master report. The use of all duplicate relations led to much poorer Rc@10 compared with prior work. However, since our approach is more realistic, we suggest that future work follows our more complex scenario.

Acknowledgement

This work was funded by the Industrial Excellence Center EASE – Embedded Applications Software Engineering.

References

[1] J. Anvik, L. Hiew, and G. Murphy. Coping with an open bug repository. In Proc. of the OOPSLA Workshop on Eclipse Technology eXchange, pages 35–39, 2005.

[2] N. Bettenburg, R. Premraj, T. Zimmermann, and S. Kim. Duplicate bug reports considered harmful... really? In Proc. of the Int'l Conf. on Software Maintenance, pages 337–345, 2008.

[3] M. Borg and P. Runeson. Changes, evolution and bugs - recommendation systems for issue management. In M. Robillard, W. Maalej, R. Walker, and T. Zimmermann, editors, Recommendation Systems in Software Engineering. Springer, 2014.

[4] M. Borg, P. Runeson, and L. Broden. Evaluation of traceability recovery in context: a taxonomy for information retrieval tools. In Proc. of the Int'l Conf. on Evaluation & Assessment in Software Engineering, pages 111–120, 2012.

[5] N. Jalbert and W. Weimer. Automated duplicate detection for bug tracking systems. In Proc. of the Int'l Conf. on Dependable Systems and Networks, pages 52–61, 2008.

[6] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[7] P. Runeson, M. Alexandersson, and O. Nyholm. Detection of duplicate defect reports using natural language processing. In Proc. of the Int'l Conf. on Software Engineering, pages 499–510, 2007.

[8] E. Shihab, Y. Kamei, and P. Bhattacharya. Mining challenge 2012: The Android platform. In Proc. of the Working Conf. on Mining Software Repositories, pages 112–115, 2012.

[9] C. Sun, D. Lo, S.-C. Khoo, and J. Jiang. Towards more accurate retrieval of duplicate bug reports. In Proc. of the Int'l Conf. on Automated Software Engineering, pages 253–262, 2011.

[10] C. Sun, D. Lo, X. Wang, J. Jiang, and S.-C. Khoo. A discriminative model approach for accurate duplicate bug report retrieval. In Proc. of the Int'l Conf. on Software Engineering, pages 45–54, 2010.

[11] A. Sureka and P. Jalote. Detecting duplicate bug report using character n-gram-based features. In Proc. of the Asia Pacific Software Engineering Conf., pages 366–374, 2010.

[12] X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun. An approach to detecting duplicate bug reports using natural language and execution information. In Proc. of the Int'l Conf. on Software Engineering, pages 461–470, 2008.
