PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"


Transcript of PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Page 1: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"


Handling missing data in software effort prediction with naive Bayes and EM algorithm

Wen Zhang, Ye Yang, Qing Wang

Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences

Beijing 100190, P.R. China
{zhangwen,ye,wq}@itechs.iscas.ac.cn

7th International Conference on Predictive Models in Software Engineering (PROMISE), 2011

Page 2: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 3: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Effort prediction with missing data.

The knowledge of software project effort stored in historical datasets can be used to develop predictive models, e.g., with statistical methods such as linear regression and correlation analysis, to predict the effort of new incoming projects.

Usually, most historical effort datasets contain a large amount of missing data.

Page 4: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Effort prediction with missing data.

Due to the small size of most historical databases, the common practice of ignoring projects with missing data will lead to biased and inaccurate prediction models.

For these reasons, how to handle missing data in software effort datasets has become an important problem.

Page 5: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Sample data

The historical effort data of projects are organized as shown in the following table.

Table: The sample data in historical project dataset.

D     X1    ...   Xj    ...   Xn    H
D1    x11   ...   x1j   ...   x1n   h1
...   ...   ...   ...   ...   ...   ...
Di    xi1   ...   xij   ...   xin   hi
...   ...   ...   ...   ...   ...   ...
Dm    xm1   ...   xmj   ...   xmn   hm

Xj (1 ≤ j ≤ n) denotes an attribute of project Di (1 ≤ i ≤ m). hi is the effort class label of Di and is derived from the real effort of project Di.

Page 6: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Sample data.

There are l effort classes for all the projects in a dataset; that is, hi is equal to one of the elements in {c1, ..., cl}.

The attributes Xj are assumed to be independent of each other and, when no data are missing, each takes a Boolean value, i.e., xij ∈ {0, 1}.
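To make this setting concrete, here is a minimal sketch (illustrative only, not from the paper) of how such a dataset could be held in NumPy: a Boolean attribute matrix with np.nan marking missing values, and a label vector with -1 marking unlabeled projects.

```python
import numpy as np

# Hypothetical toy dataset: m = 4 projects, n = 3 Boolean attributes.
# X[i, j] holds x_ij; np.nan marks a missing value (handled later).
X = np.array([
    [1.0, 0.0, 1.0],      # D1
    [0.0, np.nan, 1.0],   # D2 (attribute X2 missing)
    [1.0, 1.0, 0.0],      # D3
    [np.nan, 0.0, 0.0],   # D4 (attribute X1 missing)
])

# Effort class labels h_i drawn from {c1, ..., cl}; here l = 3 classes
# encoded as integers 0, 1, 2. A value of -1 marks an unlabeled project.
h = np.array([0, 2, 1, -1])

observed = ~np.isnan(X)   # the D_obs / D_mis split used by the two strategies
print(observed)
```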

Page 7: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Formulation of the problem.

An effort dataset Ycom contains m historical projects, Ycom = (D1, ..., Di, ..., Dm)^T, where Di (1 ≤ i ≤ m) is a historical project represented by n attributes Xj (1 ≤ j ≤ n), i.e., Di = (xi1, ..., xij, ..., xin)^T.

hi denotes the effort class label of project Di. Each xij, the value of attribute Xj (1 ≤ j ≤ n) on Di, is either observed or missing.

Cross-validation on effort prediction is used to evaluate the performance of the missing data handling techniques.

Page 8: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Motivation.

The EM (Expectation Maximization) algorithm is a method for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical models.

The motivation for applying EM to naive Bayes is to augment the labeled data set with the unlabeled projects and their estimated effort class labels.

Thus, the performance of classification would be improved by using more data to train the prediction model.

Page 9: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Labeled projects and unlabeled projects.

For a labeled project Di^L, its effort class membership P(hi = ct ∣ Di^L) ∈ {0, 1} is determinate.

For an unlabeled project Di^U, its label P(hi = ct ∣ Di^U) is unknown.

However, if we can assign a predicted effort class to Di^U, then Di^U can also be used to update the estimates P(Xj = 0 ∣ ct), P(Xj = 1 ∣ ct) and P(ct), and further to refine the effort prediction model P(ct ∣ Di). This process is described in Equations 1, 2, 3 and 4.

Page 10: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(Xj = 1 ∣ ct).

The likelihood of occurrence of Xj with respect to ct at the (τ+1)-th iteration is updated by Equation 1, using the estimates at the τ-th iteration.

P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}    (1)

In practice, we interpret P^(τ+1)(Xj = 1 ∣ ct) as the probability of attribute Xj appearing in a project whose effort class is ct.
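As an illustration, a vectorized sketch of this update is shown below; the array names, shapes, and the representation of P^(τ)(hi = ct ∣ Di) as a responsibility matrix are assumptions for the sketch, not the authors' code.

```python
import numpy as np

def update_likelihood(X, resp):
    """Equation 1 sketch: Laplace-smoothed estimate of P^(tau+1)(X_j = 1 | c_t).

    X    : (m, n) array of Boolean attribute values x_ij in {0, 1}
    resp : (m, l) array of responsibilities P^(tau)(h_i = c_t | D_i)
    Returns an (l, n) array whose [t, j] entry is P(X_j = 1 | c_t);
    Equation 2 is then simply 1 minus this result.
    """
    n = X.shape[1]
    weighted = resp.T @ X   # [t, j] = sum_i x_ij * P(h_i = c_t | D_i)
    return (1.0 + weighted) / (n + weighted.sum(axis=1, keepdims=True))
```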

Page 11: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(Xj = 0 ∣ ct).

Accordingly, the likelihood of non-occurrence of Xj with respect to ct at the (τ+1)-th iteration, P^(τ+1)(Xj = 0 ∣ ct), is estimated by Equation 2.

P^{(\tau+1)}(X_j = 0 \mid c_t) = 1 - P^{(\tau+1)}(X_j = 1 \mid c_t)    (2)

Page 12: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(ct).

Second, the effort class prior probability P^(τ+1)(ct) is updated in the same manner by Equation 3, using the estimates at the τ-th iteration. In practice, we may regard P^(τ+1)(ct) as the prior probability of class label ct appearing among all the software projects.

P^{(\tau+1)}(c_t) = \frac{1 + \sum_{i=1}^{m} P^{(\tau)}(h_i = c_t \mid D_i)}{l + m}    (3)

Page 13: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(hi′ = ct ∣ Di′).

Third, the posterior probability of an unlabeled project Di′ belonging to effort class ct at the (τ+1)-th iteration, P^(τ+1)(hi′ = ct ∣ Di′), is updated using Equation 4.

P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}    (4)

Page 14: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(hi′ = ct ∣ Di′).

Hereafter:
For labeled projects, if xij = 1 then P^(τ)(xij ∣ ct) = P^(τ)(Xj = 1 ∣ ct); otherwise (xij = 0), P^(τ)(xij ∣ ct) = P^(τ)(Xj = 0 ∣ ct).
For unlabeled projects, if xi′j = 1 then P^(τ)(xi′j ∣ ct) = P^(τ)(Xj = 1 ∣ ct); otherwise (xi′j = 0), P^(τ)(xi′j ∣ ct) = P^(τ)(Xj = 0 ∣ ct).

Here, P^(0)(Xj = 1 ∣ ct) and P^(0)(ct) are initially estimated from the labeled projects alone in the first iteration, and the unlabeled projects are appended to the learning process after they are assigned probabilistic effort classes by P^(1)(hi′ = ct ∣ Di′).
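A sketch of this posterior update (Equation 4) is given below; the variable names and the log-space computation (added for numerical stability) are assumptions of the sketch, not part of the slides.

```python
import numpy as np

def e_step(X, prior, lik1):
    """Equation 4 sketch: P^(tau+1)(h_i = c_t | D_i) for every project.

    X     : (m, n) Boolean attribute matrix (no missing values here)
    prior : (l,)   class priors P^(tau)(c_t) from Equation 3
    lik1  : (l, n) likelihoods P^(tau)(X_j = 1 | c_t) from Equation 1
    """
    # P(x_ij | c_t) is P(X_j = 1 | c_t) when x_ij = 1 and P(X_j = 0 | c_t) otherwise.
    log_joint = (np.log(prior)
                 + X @ np.log(lik1).T                  # terms with x_ij = 1
                 + (1.0 - X) @ np.log(1.0 - lik1).T)   # terms with x_ij = 0 (Equation 2)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exponentiating
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)      # normalize over the l classes
```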

Page 15: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Predicting the effort class of unlabeled projects.

We iterate Equations 1, 2, 3 and 4 until the estimates converge to stable values.

Then, P^(τ+1)(hi′ = ct ∣ Di′) is used to predict the effort class of Di′.

The ct ∈ {c1, ..., cl} that maximizes P^(τ+1)(hi′ = ct ∣ Di′) is regarded as the effort class of Di′.
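Putting Equations 1-4 together, a compact, self-contained sketch of the whole iteration in the complete-data case (no missing values) might look as follows; the initialization from labeled projects and the argmax prediction follow the slides, while the variable names, tolerance, and iteration cap are assumptions.

```python
import numpy as np

def nb_em(X, h, l, max_iter=100, tol=1e-6):
    """Naive Bayes + EM sketch for Equations 1-4 (illustrative, complete-data case).

    X : (m, n) Boolean attribute matrix without missing values
    h : (m,)   effort class labels in {0, ..., l-1}; -1 marks an unlabeled project
    l : number of effort classes
    """
    m, n = X.shape
    labeled = h >= 0

    # Responsibilities P(h_i = c_t | D_i): hard 0/1 for the labeled projects.
    resp = np.zeros((m, l))
    resp[np.where(labeled)[0], h[labeled]] = 1.0

    # P^(0) estimates from the labeled projects only (Equations 1 and 3).
    prior = (1.0 + resp[labeled].sum(axis=0)) / (l + labeled.sum())
    lik1 = likelihood(X[labeled], resp[labeled], n)

    for _ in range(max_iter):
        # E-step (Equation 4): posterior class membership of every project.
        log_joint = (np.log(prior)
                     + X @ np.log(lik1).T
                     + (1.0 - X) @ np.log(1.0 - lik1).T)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        new_resp = np.exp(log_joint)
        new_resp /= new_resp.sum(axis=1, keepdims=True)
        new_resp[labeled] = resp[labeled]          # known labels stay fixed

        # M-step (Equations 1-3) over labeled and probabilistically labeled projects.
        new_prior = (1.0 + new_resp.sum(axis=0)) / (l + m)
        new_lik1 = likelihood(X, new_resp, n)

        converged = np.abs(new_lik1 - lik1).max() < tol
        prior, lik1, resp = new_prior, new_lik1, new_resp
        if converged:
            break

    # Predicted effort class of project D_i is argmax_t resp[i, t].
    return prior, lik1, resp.argmax(axis=1)


def likelihood(X, resp, n):
    """Equation 1: P(X_j = 1 | c_t) with Laplace smoothing."""
    weighted = resp.T @ X
    return (1.0 + weighted) / (n + weighted.sum(axis=1, keepdims=True))
```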

Page 16: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 17: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Initial setting.

When we use Equation 1 to estimate the likelihood of Xj with respect to ct, P(Xj = 1 ∣ ct) or P(Xj = 0 ∣ ct), we do not consider the missing values among the xij (1 ≤ i ≤ m).

For each Xj, we can divide the whole historical dataset D into two subsets, i.e., D = {Dobs,j ∣ Dmis,j}, where Dobs,j is the set of projects whose values on attribute Xj are observed and Dmis,j is the set of projects whose values on attribute Xj are unobserved.

We may also divide the attributes of a project Di into two subsets, i.e., Di = {Xobs,i ∣ Xmis,i}, where Xobs,i is the set of attributes whose values are observed in project Di and Xmis,i is the set of attributes whose values are unobserved in project Di.

Page 18: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data toleration strategy.

This strategy is very similar to the method adopted by C4.5 to handle missing data; that is, we ignore missing values when training the prediction model.

To estimate P^(τ+1)(Xj = 1 ∣ ct) under this strategy, we rewrite Equation 1 as Equation 5.

P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}    (5)

Page 19: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data toleration strategy.

The difference between Equations 1 and 5 is that only the projects with observed values on attribute Xj, i.e., Dobs,j, are used to estimate P^(τ+1)(Xj = 1 ∣ ct).

Equation 2 can still be used here to estimate P^(τ+1)(Xj = 0 ∣ ct), and Equation 3 can still be used to estimate P^(τ+1)(ct).
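A sketch of the toleration update is given below; encoding missing values as np.nan so that they contribute nothing to the sums is an assumption of the sketch, and it corresponds to restricting the sums to Dobs,j in Equation 5.

```python
import numpy as np

def update_likelihood_tolerate(X, resp):
    """Equation 5 sketch: as Equation 1, but only projects in D_obs,j
    (observed values of X_j) contribute to the sums for attribute X_j.

    X    : (m, n) attribute matrix with np.nan marking missing x_ij
    resp : (m, l) responsibilities P^(tau)(h_i = c_t | D_i)
    """
    n = X.shape[1]
    X_obs = np.where(np.isnan(X), 0.0, X)    # a missing x_ij adds 0, i.e. it is ignored
    weighted = resp.T @ X_obs                # [t, j] = sum over D_obs,j of x_ij P(h_i = c_t | D_i)
    return (1.0 + weighted) / (n + weighted.sum(axis=1, keepdims=True))
```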

Page 20: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data toleration strategy.

Accordingly, the prediction model is adapted from Equation 4 to Equation 6.

P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(\tau)}(x_{i'j} \mid c_t)}{\prod_{j=1}^{|X_{obs,i'}|} \sum_{t=1}^{l} P^{(\tau)}(c_t)\, P^{(\tau)}(x_{i'j} \mid c_t)}    (6)

Page 21: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 22: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data imputation strategy.

The basic idea of this strategy is that the unobserved attribute values can be imputed using the observed values. Then, both the observed and the imputed values are used to construct the prediction model.

Page 23: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data imputation strategy.

This strategy is embedded in the naive Bayes and EM processing, and we rewrite Equation 1 as Equation 7 to estimate P^(τ+1)(Xj = 1 ∣ ct).

P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \bar{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s)}{n + \sum_{j=1}^{n} \left\{ \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \bar{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s) \right\}}    (7)

Page 24: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data imputation strategy.

The missing value xsj, i.e., the value of attribute Xj on project Ds, is imputed as x̄sj using Equation 8.

\bar{x}_{sj} = \frac{\sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{\sum_{i=1}^{|D_{obs,j}|} P^{(\tau)}(h_i = c_t \mid D_i)}    (8)

x̄sj is a constant independent of Ds, given ct.

We stipulate that x̄sj is rounded to 1 if x̄sj ≥ 0.5; otherwise, x̄sj is rounded to 0.

Here, we also use Equation 3 to estimate P^(τ+1)(ct).
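The sketch below mirrors Equation 8 and the 0.5 rounding rule; the (l, m, n) output layout, the np.nan encoding of missing values, and the guard against an empty Dobs,j are choices made for the sketch, not prescribed by the slides.

```python
import numpy as np

def impute_missing(X, resp):
    """Equation 8 sketch: class-conditional imputation of missing values.

    X    : (m, n) attribute matrix with np.nan marking missing x_sj
    resp : (m, l) responsibilities P^(tau)(h_i = c_t | D_i)
    Returns an (l, m, n) array: for each class c_t, a copy of X in which
    every missing x_sj is replaced by the imputed value, rounded to 1 if
    it is >= 0.5 and to 0 otherwise.
    """
    observed = ~np.isnan(X)
    X_filled = np.where(observed, X, 0.0)
    l = resp.shape[1]
    completed = np.empty((l,) + X.shape)
    for t in range(l):
        w = resp[:, t][:, None]                          # (m, 1)
        num = (w * X_filled * observed).sum(axis=0)      # sum over D_obs,j of x_ij P(h_i = c_t | D_i)
        den = (w * observed).sum(axis=0)                 # sum over D_obs,j of P(h_i = c_t | D_i)
        xbar = num / np.maximum(den, 1e-12)              # guard against an empty D_obs,j
        xbar = (xbar >= 0.5).astype(float)               # rounding rule from the slide
        completed[t] = np.where(observed, X, xbar)       # impute only the missing entries
    return completed
```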

Page 25: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data imputation strategy.

As for the prediction model, P^(τ+1)(ct ∣ Di) can be constructed by Equation 9, taking the missing values into account.

P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\prod_{j=1}^{n} \sum_{t=1}^{l} P^{(\tau)}(c_t)\, P^{(\tau)}(x_{i'j} \mid c_t)}    (9)

Note that if xi′j is unobserved, its value is substituted with x̄i′j given by Equation 8.

Page 26: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 27: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

The ISBSG dataset.

The ISBSG data set (http://www.isbsg.org) has 70 attributes, and many attribute values are missing.

We extract 188 projects with 16 attributes, using the criterion that each project has at least 2/3 of its attribute values observed and that, for each attribute, its value is observed in at least 2/3 of the projects.

13 attributes are nominal and 3 attributes are continuous.

Page 28: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

The ISBSG dataset.

We use Equation 10 to normalize the efforts of the projects into l (= 3) classes.

c_t = \left\lfloor \frac{l \times (\mathrm{effort}_{D_i} - \mathrm{effort}_{\min})}{\mathrm{effort}_{\max} - \mathrm{effort}_{\min}} \right\rfloor + 1    (10)

Table: The effort classes in ISBSG data set.

Class No.   # of projects   Label
1           85              Low
2           76              Medium
3           27              High
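A direct transcription of Equation 10 is shown below; the effort numbers in the example are made up, and clamping the maximum-effort project to class l is an assumption (the raw formula would place it in class l + 1).

```python
import math

def effort_class(effort, effort_min, effort_max, l=3):
    """Equation 10: map a raw effort value to a class label in {1, ..., l}.

    The project with effort == effort_max would fall into class l + 1 by the
    raw formula, so it is clamped to l here (an assumption of this sketch).
    """
    c = math.floor(l * (effort - effort_min) / (effort_max - effort_min)) + 1
    return min(c, l)

# Example with made-up numbers: efforts ranging from 100 to 10000 person-hours;
# a 4000-hour project falls into class 2 ("Medium").
print(effort_class(4000, 100, 10000))   # -> 2
```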

Page 29: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

The CSBSG dataset.

The CSBSG data set contains 1103 projects collected from 140 organizations and 15 regions across China by the Chinese association of software industry. We extract 94 projects and 21 attributes (15 nominal and 6 continuous) using the same selection criterion as for the ISBSG data set, and we use Equation 10 to normalize the efforts of the projects into l (= 3) classes.

Table: The effort classes in CSBSG data set.

Class No.   # of projects   Label
1           27              Low
2           31              Medium
3           36              High

Page 30: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 31: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Experiment setup.

To evaluate the proposed method comparatively, we adopt MI and MINI to impute the missing values of the ISBSG and CSBSG data sets.

A BPNN (back-propagation neural network) is used to classify the projects in the data sets after imputation.

Our experiments are conducted with the 10-fold cross-validation technique.
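For reference, a generic 10-fold cross-validation harness along these lines might look as follows; the `fit_predict` callback, the accuracy metric, and the shuffling seed are placeholders, and the MI, MINI and BPNN implementations themselves are not shown.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, h, fit_predict, n_splits=10, seed=0):
    """10-fold cross-validation sketch.

    fit_predict(X_train, h_train, X_test) should return predicted effort
    classes for X_test (e.g. naive Bayes + EM-T/EM-I, or imputation + BPNN).
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accuracies = []
    for train_idx, test_idx in kf.split(X):
        pred = fit_predict(X[train_idx], h[train_idx], X[test_idx])
        accuracies.append(float(np.mean(pred == h[test_idx])))
    return float(np.mean(accuracies))
```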

Page 32: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 33: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on ISBSG dataset.

The following figure illustrates the performance of the missing data toleration strategy (hereafter called EM-T) and the missing data imputation strategy (hereafter called EM-I) in handling missing data for effort prediction on the ISBSG data set.

Page 34: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on ISBSG dataset.

[Plot omitted: accuracy (0.6-0.8) versus number of unlabeled projects (0-20); curves for EM-I, EM-T, BPNN+MI and BPNN+MINI.]

Figure: Performances of naive Bayes with EM-I and EM-T in comparison with BPNN on effort prediction using the ISBSG data set.

Page 35: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on ISBSG dataset.

What we can see from the figure.

Both EM-I and EM-T perform better than BPNN with either MI or MINI on classifying the projects in the ISBSG data set.

The performance of naive Bayes and EM improves when unlabeled projects are appended. This outcome illustrates that semi-supervised learning can improve the prediction of software effort.

Page 36: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on ISBSG dataset.

What we can see from the figure.

If supervised learning were used for software effort prediction, the MINI method would be favorable for imputing the missing values, but the missing data toleration strategy may not be desirable for handling missing values.

The imputation strategy for missing data is more effective than the toleration strategy when naive Bayes and EM are used to predict ISBSG software effort.

Page 37: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on CSBSG dataset.

EM-T and EM-I in handling missing data for effort prediction on the CSBSG data set.

[Plot omitted: accuracy (0.5-0.8) versus number of unlabeled projects (0-8); curves for EM-I, EM-T, BPNN+MI and BPNN+MINI.]

Figure: Performances of EM-I and EM-T in comparison with BPNN on predicting effort with different numbers of unlabeled projects using the CSBSG data set.

Page 38: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on CSBSG dataset.

What we can see from the above figure.

The better performance of EM-I compared with EM-T is also observed on the CSBSG data set, just as on the ISBSG data set. This further validates our conjecture that EM-I outperforms EM-T in software effort prediction.

EM-T performs better than EM-I when the number of unlabeled projects is larger than the number at the accuracy maximum, which differs from the ISBSG results. We attribute this result to the relatively small size of the CSBSG data set, where the imputation strategy is more prone than the toleration strategy to introduce bias into the prediction.

Page 39: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

More experiments and hypothesis testing.

More experimental results, with explanations, are detailed in the paper. We also conduct hypothesis testing to examine the significance of the conclusions drawn from our experiments. Those interested may refer to the paper.

Page 40: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

The threat to external validity primarily concerns the degree to which the attributes we used to describe the projects, and the ISBSG and CSBSG sample data sets, are representative.

The threat to internal validity concerns measurement and data effects that can bias our results, such as the use of accuracy as the performance measure.

The threat to construct validity is that our experiments use clipped subsets of attributes and project data from both the ISBSG and CSBSG data sets.

Page 41: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Conclusion

Semi-supervised learning, in the form of naive Bayes and EM, is employed to predict software effort.

We propose two strategies embedded in naive Bayes and EM to handle missing data.

Page 42: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Future work

We plan to compare the proposed techniques with other missing data imputation techniques, such as FIML and MSWR.

We will develop more missing data handling techniques embedded in naive Bayes and EM for software effort prediction.

We have already investigated the underlying mechanism of missingness (structural or unstructured missing) in software effort data. With this progress, we will improve the missing data handling strategies so that they target the underlying missing mechanism of the software effort data.

Page 43: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Thanks

Any further questions about the content of the slides and the paper can be sent to Mr. Wen Zhang.
Email: [email protected]
