A Lightweight Approach to Uncover Technical Information in Unstructured Data

42
A Lightweight Approach to Uncover Technical Information in Unstructured Data Nicolas Bettenburg , Bram Adams, Ahmed E. Hassan Queen’s University, Kingston, ON Michel Smidt University of Bremen, Germany 1

description

Talk given at the 2011 International Conference on Program Comprehension in Kingston, ON, Canada.

Transcript of A Lightweight Approach to Uncover Technical Information in Unstructured Data

Page 1: A Lightweight Approach to Uncover Technical Information in Unstructured Data

A Lightweight Approach to Uncover Technical

Information in Unstructured Data

Nicolas Bettenburg, Bram Adams, Ahmed E. HassanQueen’s University, Kingston, ON

Michel SmidtUniversity of Bremen, Germany

1

Page 2: A Lightweight Approach to Uncover Technical Information in Unstructured Data

2

Page 3: A Lightweight Approach to Uncover Technical Information in Unstructured Data

The code after "if callback.isAcceleratorInUse (SWT.ALT | character ) )" inside Eclipse's MenuManager.java removes the mnemonic, but it seems like Eclipse should be checking isAcceleratorInUse.

2

Page 4: A Lightweight Approach to Uncover Technical Information in Unstructured Data

“Our developers have a severe problem.”

3

Page 5: A Lightweight Approach to Uncover Technical Information in Unstructured Data

“Our developers have a severe problem.”

[ Our_PRP$ developers_NNS ]) <: have_VBP :> ([ a_DT severe_JJ problem_NN ])._.

NLP

3

Page 6: A Lightweight Approach to Uncover Technical Information in Unstructured Data

“Our developers have a severe problem.”

[ Our_PRP$ developers_NNS ]) <: have_VBP :> ([ a_DT severe_JJ problem_NN ])._.

pronoun, possessive

NLP

3

Page 7: A Lightweight Approach to Uncover Technical Information in Unstructured Data

“Our developers have a severe problem.”

[ Our_PRP$ developers_NNS ]) <: have_VBP :> ([ a_DT severe_JJ problem_NN ])._.

pronoun, possessive

noun, common, plural

NLP

3

Page 8: A Lightweight Approach to Uncover Technical Information in Unstructured Data

“Our developers have a severe problem.”

[ Our_PRP$ developers_NNS ]) <: have_VBP :> ([ a_DT severe_JJ problem_NN ])._.

pronoun, possessive

noun, common, plural

verb, present tense

determiner

adjective, ordinal

noun, common, singular

NLP

3

Page 9: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Structured Text

4

Page 10: A Lightweight Approach to Uncover Technical Information in Unstructured Data

NLP?can’t deal withthe source code parts

PARSERS?can’t deal withthe natural language parts

5

Page 11: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Past Solutions

infoZillaBettenburg et al. - MSR’08

•Based on heuristics• Stack Traces•Code snippets at block level•Patches

6

Page 12: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Past Solutions

MilerBacchelli et al. - ICSE’10, ICPC’10

•Based on heuristics•Classifies lines as source code•Classifies documents as containing code• Finds class names

7

Page 13: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Past Solutions

MilerBacchelli et al. - ICPC’10

infoZillaBettenburg et al. - MSR’08

•only specific kinds of technical information!• new heuristics to extend (complex, error prone)• for some kind of technical information, infeasible

8

Page 14: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Build ID: M20070212-1330 Steps To Reproduce: 1. Create a plugin for eclipse that includes a key binding for "M1+S" (ie. Alt+S) where S is any letter that is used as a mnemonic in one of the top level menus. Since eclipse uses "S" as the mnemonic for Help > &Software Updates, "S" is sufficient. 2. Launch the plugin as part of Eclipse IDE 3. Press Alt+H to bring down the Help menu (to go along with our example in #1) BUG: Notice "Software Updates" is missing its mnemonic. More information: The code after "if (callback.isAcceleratorInUse(SWT.ALT | character))" inside Eclipse's MenuManager.java removes the mnemonic, but it seems like Eclipse should be checking "isAcceleratorInUse" only for top level menumanagers like File,Edit,...,Help, etc. : /* (non-Javadoc) * @see org.eclipse.jface.action.IContributionItem#update(java.lang.String) */ public void update(String property) { IContributionItem items[] = getItems(); for (int i = 0; i < items.length; i++) { items[i].update(property); } [...] } Any status on this bug? I'd consider any contributions for M6 (API) or M7 (non-API) [...] A 3.5 fix would be to make that behaviour optional in MenuManager with API and off by default early in 3.5, and to have the WorkbenchActionBuilder contributed MenuManagers and actionSets/editorActions contributed MenuManagers turn it on (if I can find MenuManagers in the correct place). I'd like us to work with the SWT team to make sure we understand what the correct platform behavior is, and make sure that we aren't getting in the way of that. The current behavior (i.e. turning off mnemonics) seems odd to me, in general. If we're going to fix this, we should fix it properly.

9

Page 15: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Build ID: M20070212-1330 Steps To Reproduce: 1. Create a plugin for eclipse that includes a key binding for "M1+S" (ie. Alt+S) where S is any letter that is used as a mnemonic in one of the top level menus. Since eclipse uses "S" as the mnemonic for Help > &Software Updates, "S" is sufficient. 2. Launch the plugin as part of Eclipse IDE 3. Press Alt+H to bring down the Help menu (to go along with our example in #1) BUG: Notice "Software Updates" is missing its mnemonic. More information: The code after "if (callback.isAcceleratorInUse(SWT.ALT | character))" inside Eclipse's MenuManager.java removes the mnemonic, but it seems like Eclipse should be checking "isAcceleratorInUse" only for top level menumanagers like File,Edit,...,Help, etc. : /* (non-Javadoc) * @see org.eclipse.jface.action.IContributionItem#update(java.lang.String) */ public void update(String property) { IContributionItem items[] = getItems(); for (int i = 0; i < items.length; i++) { items[i].update(property); } [...] } Any status on this bug? I'd consider any contributions for M6 (API) or M7 (non-API) [...] A 3.5 fix would be to make that behaviour optional in MenuManager with API and off by default early in 3.5, and to have the WorkbenchActionBuilder contributed MenuManagers and actionSets/editorActions contributed MenuManagers turn it on (if I can find MenuManagers in the correct place). I'd like us to work with the SWT team to make sure we understand what the correct platform behavior is, and make sure that we aren't getting in the way of that. The current behavior (i.e. turning off mnemonics) seems odd to me, in general. If we're going to fix this, we should fix it properly.

9

Page 16: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Build ID: M20070212-1330 Steps To Reproduce: 1. Create a plugin for eclipse that includes a key binding for "M1+S" (ie. Alt+S) where S is any letter that is used as a mnemonic in one of the top level menus. Since eclipse uses "S" as the mnemonic for Help > &Software Updates, "S" is sufficient. 2. Launch the plugin as part of Eclipse IDE 3. Press Alt+H to bring down the Help menu (to go along with our example in #1) BUG: Notice "Software Updates" is missing its mnemonic. More information: The code after "if (callback.isAcceleratorInUse(SWT.ALT | character))" inside Eclipse's MenuManager.java removes the mnemonic, but it seems like Eclipse should be checking "isAcceleratorInUse" only for top level menumanagers like File,Edit,...,Help, etc. : /* (non-Javadoc) * @see org.eclipse.jface.action.IContributionItem#update(java.lang.String) */ public void update(String property) { IContributionItem items[] = getItems(); for (int i = 0; i < items.length; i++) { items[i].update(property); } [...] } Any status on this bug? I'd consider any contributions for M6 (API) or M7 (non-API) [...] A 3.5 fix would be to make that behaviour optional in MenuManager with API and off by default early in 3.5, and to have the WorkbenchActionBuilder contributed MenuManagers and actionSets/editorActions contributed MenuManagers turn it on (if I can find MenuManagers in the correct place). I'd like us to work with the SWT team to make sure we understand what the correct platform behavior is, and make sure that we aren't getting in the way of that. The current behavior (i.e. turning off mnemonics) seems odd to me, in general. If we're going to fix this, we should fix it properly.

Challenge!

9

Page 17: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Spelling and Grammar Checkers

... are really good at finding “what’s not right“ in natural language text!

10

Page 18: A Lightweight Approach to Uncover Technical Information in Unstructured Data

A 3.5 fix would be to make taht behaviour optional in MenuManager with API and off by default early in 3.5, and to have the WorkbenchActionBuilder contributed MenuManagers and actionSets/editorActions contributed MenuManagers turn it on (if I can find MenuManagers in the corect place).

11

Page 19: A Lightweight Approach to Uncover Technical Information in Unstructured Data

A 3.5 fix would be to make taht behaviour optional in MenuManager with API and off by default early in 3.5, and to have the WorkbenchActionBuilder contributed MenuManagers and actionSets/editorActions contributed MenuManagers turn it on (if I can find MenuManagers in the corect place).

12

Page 20: A Lightweight Approach to Uncover Technical Information in Unstructured Data

A 3.5 fix would be to make taht behaviour optional in MenuManager with API and off by default early in 3.5, and to have the WorkbenchActionBuilder contributed MenuManagers and actionSets/editorActions contributed MenuManagers turn it on (if I can find MenuManagers in the corect place).

Actual Spelling Mistakes!

13

Page 21: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Add Heuristics

14

Page 22: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Add Heuristics

H1: Camel CasecamelCase, CamelCase, CamelCASE, ...

14

Page 23: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Add Heuristics

H1: Camel CasecamelCase, CamelCase, CamelCASE, ...

H2: Programming Language Keywordsprintf, fork, fi, ...

14

Page 24: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Add Heuristics

H1: Camel CasecamelCase, CamelCase, CamelCASE, ...

H2: Programming Language Keywordsprintf, fork, fi, ...

H3: Special Characterstree();

14

Page 25: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Evaluation

Manually annotated 20 complete Bug Reports and Discussions from ECLIPSE.

Manually annotated 20 complete Email Discussions from POSTGRESQL developers Mailing List.

15

Page 26: A Lightweight Approach to Uncover Technical Information in Unstructured Data

1

2

3

Annotation GUI

16

Page 27: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Precision / Recall

Precision(Si) =TPSi

TPSi+FPSi

Recall(Si) =TPSi

TPSi+FNSi

TP = We annotated and tool annotatedFP = Tool annotated, we did notFN = We annotated, tool did not

17

Page 28: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Results

Spellchecker Precision Recall

JOrtho 88.01% 64.31%

Jazzy 84.16% 68.30%

Hunspell 86.40% 68.34%

18

Page 29: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Results

Spellchecker Precision Recall

JOrtho 88.01% 64.31%

Jazzy 84.16% 68.30%

Hunspell 86.40% 68.34%

18

Page 30: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Results

Spellchecker Precision Recall

JOrtho 88.01% 64.31%

Jazzy 84.16% 68.30%

Hunspell 86.40% 68.34%

Hunspell used by OpenOffice and Mozilla Suite

18

Page 31: A Lightweight Approach to Uncover Technical Information in Unstructured Data

ComparisonLine-wise classification of source code

1 Launch the plugin as part of Eclipse IDE 3. Press Alt+H to 2 bring down the Help menu (to go along with our example in #1)34 BUG: Notice "Software Updates" is missing its mnemonic.5 6 public void update(String property) { 7 IContributionItem items[] = getItems();8 for (int i = 0; i < items.length; i++) { 9 items[i].update(property); 10 } 11 }1213 Any status on this bug?

19

Page 32: A Lightweight Approach to Uncover Technical Information in Unstructured Data

ComparisonLine-wise classification of source code

1 Launch the plugin as part of Eclipse IDE 3. Press Alt+H to 2 bring down the Help menu (to go along with our example in #1)34 BUG: Notice "Software Updates" is missing its mnemonic.5 6 public void update(String property) { 7 IContributionItem items[] = getItems();8 for (int i = 0; i < items.length; i++) { 9 items[i].update(property); 10 } 11 }1213 Any status on this bug?

20

Page 33: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Comparison

Precision Recall

Our approach 89.27% 86.46%

State-of-the-Art 66.13% 69.37%

Line-wise classification of source code

21

Page 34: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Comparison

Precision Recall

Our approach 89.27% 86.46%

State-of-the-Art 66.13% 69.37%

Line-wise classification of source code

21

Page 35: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Summary

22

Page 36: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Summary

22

Page 37: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Summary

22

Page 38: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Summary

22

Page 39: A Lightweight Approach to Uncover Technical Information in Unstructured Data

Summary

22

Page 40: A Lightweight Approach to Uncover Technical Information in Unstructured Data

mozilla

paulc

zhangchunlin

kbrosnan

sdwilsh

samuel.sidler+oldhasham8888

myles7897

deletesoftwareabillings

eddy_nigg

jmjeffery

sgautherie.bz

john.p.baker

l10n

adelfino

jo.hermans

jruderman

nightstalkerz

alice0775

hskupin

mmortal03

tchung

marcia

me.at.work

fittysix

steve.england

cbook

tonglebeak

ctalbert

VYV03354

ehsan

alex

nrthomas

aarobertxtr

smichaud shaver

johnjbartonmanujsabarwal

jdaggett

matt

bzbarsky

dtownsend

davemgarrett

info

stephen.donner

elmar.ludwig sdaugherty

mak77jdarmochwal

polidobj

vseerrortwalker

dietrich

mconnorbeltzner

steffen.wilberg

mano

highmind63

ria.klaassen

robert.bugzilla

edilee

kliu

faaborg

marco.zehesylvain.pasche bugzilla

rotisuliss

cl-bugs-new2

anselm.meyer

timwi

RainerStroebel

tomer

gavin.sharp

jbecerra

johnath

kev

martijn.martijn

cwwmozilla

longsonr

m-wada

zenikodveditz

matspal

philringnalda

zurtex

bomfog

cjcypoi02 corevette

masayukireed

phiw

timeless

matti

mh+mozilla

dao

klaas1988

sziadeh mark.finkle

23

Page 41: A Lightweight Approach to Uncover Technical Information in Unstructured Data

mozilla

paulc

zhangchunlin

kbrosnan

sdwilsh

samuel.sidler+oldhasham8888

myles7897

deletesoftwareabillings

eddy_nigg

jmjeffery

sgautherie.bz

john.p.baker

l10n

adelfino

jo.hermans

jruderman

nightstalkerz

alice0775

hskupin

mmortal03

tchung

marcia

me.at.work

fittysix

steve.england

cbook

tonglebeak

ctalbert

VYV03354

ehsan

alex

nrthomas

aarobertxtr

smichaud shaver

johnjbartonmanujsabarwal

jdaggett

matt

bzbarsky

dtownsend

davemgarrett

info

stephen.donner

elmar.ludwig sdaugherty

mak77jdarmochwal

polidobj

vseerrortwalker

dietrich

mconnorbeltzner

steffen.wilberg

mano

highmind63

ria.klaassen

robert.bugzilla

edilee

kliu

faaborg

marco.zehesylvain.pasche bugzilla

rotisuliss

cl-bugs-new2

anselm.meyer

timwi

RainerStroebel

tomer

gavin.sharp

jbecerra

johnath

kev

martijn.martijn

cwwmozilla

longsonr

m-wada

zenikodveditz

matspal

philringnalda

zurtex

bomfog

cjcypoi02 corevette

masayukireed

phiw

timeless

matti

mh+mozilla

dao

klaas1988

sziadeh mark.finkle

UIJavaScript

Engine

XML Parser

Internet Explorer

23

Page 42: A Lightweight Approach to Uncover Technical Information in Unstructured Data

24