PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"


Transcript of PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Page 1: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"


Handling missing data in software effort prediction with naive Bayes and EM algorithm

Wen Zhang, Ye Yang, Qing Wang

Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences

Beijing 100190, P.R. China
{zhangwen,ye,wq}@itechs.iscas.ac.cn

7th International Conference on Predictive Models in Software Engineering (PROMISE), 2011

Page 2: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 3: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Effort prediction with missing data.

The knowledge of software project effort stored in historical datasets can be used to develop predictive models, e.g., with statistical methods such as linear regression and correlation analysis, to predict the effort of new incoming projects.

Usually, most historical effort datasets contain a large amount of missing data.

Page 4: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Effort prediction with missing data.

Due to the small size of most historical databases, the common practice of ignoring projects with missing data will lead to biased and inaccurate prediction models.

For these reasons, how to handle missing data in software effort datasets has become an important problem.

Page 5: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Sample data

The historical effort data of projects are organized as shown in the following table.

Table: The sample data in historical project dataset.

D     X1    ...   Xj    ...   Xn    H
D1    x11   ...   x1j   ...   x1n   h1
...   ...   ...   ...   ...   ...   ...
Di    xi1   ...   xij   ...   xin   hi
...   ...   ...   ...   ...   ...   ...
Dm    xm1   ...   xmj   ...   xmn   hm

Xj (1 ≤ j ≤ n) denotes an attribute of project Di (1 ≤ i ≤ m). hi is the effort class label of Di and is derived from the real effort of project Di.

Page 6: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Sample data.

There are l effort classes for all the projects in a dataset; that is, hi is equal to one of the elements in {c1, ..., cl}.

The attributes Xj are assumed to be independent of each other and, when no data are missing, each takes a Boolean value, i.e., xij ∈ {0, 1}.
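To make this setting concrete, here is a minimal sketch (illustrative only, not from the paper) of how such a dataset could be held in NumPy: a Boolean attribute matrix with np.nan marking missing values, and a label vector with -1 marking unlabeled projects.

```python
import numpy as np

# Hypothetical toy dataset: m = 4 projects, n = 3 Boolean attributes.
# X[i, j] holds x_ij; np.nan marks a missing value (handled later).
X = np.array([
    [1.0, 0.0, 1.0],      # D1
    [0.0, np.nan, 1.0],   # D2 (attribute X2 missing)
    [1.0, 1.0, 0.0],      # D3
    [np.nan, 0.0, 0.0],   # D4 (attribute X1 missing)
])

# Effort class labels h_i drawn from {c1, ..., cl}; here l = 3 classes
# encoded as integers 0, 1, 2. A value of -1 marks an unlabeled project.
h = np.array([0, 2, 1, -1])

observed = ~np.isnan(X)   # the D_obs / D_mis split used by the two strategies
print(observed)
```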

Page 7: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Formulation of the problem.

An effort dataset Ycom contains m historical projects, Ycom = (D1, ..., Di, ..., Dm)^T, where Di (1 ≤ i ≤ m) is a historical project represented by n attributes Xj (1 ≤ j ≤ n), i.e., Di = (xi1, ..., xij, ..., xin)^T.

hi denotes the effort class label of project Di. Each xij, the value of attribute Xj (1 ≤ j ≤ n) on Di, is either observed or missing.

Cross-validation on effort prediction is used to evaluate the performance of the missing data handling techniques.

Page 8: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Motivation.

The EM (Expectation Maximization) algorithm is a method for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical models.

The motivation for applying EM to naive Bayes is to augment the labeled data set with the unlabeled projects and their estimated effort class labels.

Thus, the performance of classification would be improved by using more data to train the prediction model.

Page 9: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Labeled projects and unlabeled projects.

For a labeled project Di^L, its effort class membership P(hi = ct ∣ Di^L) ∈ {0, 1} is determinate.

For an unlabeled project Di^U, its label P(hi = ct ∣ Di^U) is unknown.

However, if we can assign a predicted effort class to Di^U, then Di^U can also be used to update the estimates P(Xj = 0 ∣ ct), P(Xj = 1 ∣ ct) and P(ct), and further to refine the effort prediction model P(ct ∣ Di). This process is described in Equations 1, 2, 3 and 4.

Page 10: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(Xj = 1 ∣ ct).

The likelihood of occurrence of Xj with respect to ct at the (τ+1)-th iteration is updated by Equation 1, using the estimates at the τ-th iteration.

P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}    (1)

In practice, we interpret P^(τ+1)(Xj = 1 ∣ ct) as the probability of attribute Xj appearing in a project whose effort class is ct.
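As an illustration, a vectorized sketch of this update is shown below; the array names, shapes, and the representation of P^(τ)(hi = ct ∣ Di) as a responsibility matrix are assumptions for the sketch, not the authors' code.

```python
import numpy as np

def update_likelihood(X, resp):
    """Equation 1 sketch: Laplace-smoothed estimate of P^(tau+1)(X_j = 1 | c_t).

    X    : (m, n) array of Boolean attribute values x_ij in {0, 1}
    resp : (m, l) array of responsibilities P^(tau)(h_i = c_t | D_i)
    Returns an (l, n) array whose [t, j] entry is P(X_j = 1 | c_t);
    Equation 2 is then simply 1 minus this result.
    """
    n = X.shape[1]
    weighted = resp.T @ X   # [t, j] = sum_i x_ij * P(h_i = c_t | D_i)
    return (1.0 + weighted) / (n + weighted.sum(axis=1, keepdims=True))
```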

Page 11: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(Xj = 0 ∣ ct).

Accordingly, the likelihood of non-occurrence of Xj with respect to ct at the (τ+1)-th iteration, P^(τ+1)(Xj = 0 ∣ ct), is estimated by Equation 2.

P^{(\tau+1)}(X_j = 0 \mid c_t) = 1 - P^{(\tau+1)}(X_j = 1 \mid c_t)    (2)

Page 12: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(ct).

Second, the effort class prior probability P^(τ+1)(ct) is updated in the same manner by Equation 3, using the estimates at the τ-th iteration. In practice, we may regard P^(τ+1)(ct) as the prior probability of class label ct appearing among all the software projects.

P^{(\tau+1)}(c_t) = \frac{1 + \sum_{i=1}^{m} P^{(\tau)}(h_i = c_t \mid D_i)}{l + m}    (3)

Page 13: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(hi′ = ct ∣ Di′).

Third, the posterior probability of an unlabeled project Di′ belonging to effort class ct at the (τ+1)-th iteration, P^(τ+1)(hi′ = ct ∣ Di′), is updated using Equation 4.

P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}    (4)

Page 14: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Estimating P^(τ+1)(hi′ = ct ∣ Di′).

Hereafter:
For labeled projects, if xij = 1 then P^(τ)(xij ∣ ct) = P^(τ)(Xj = 1 ∣ ct); otherwise (xij = 0), P^(τ)(xij ∣ ct) = P^(τ)(Xj = 0 ∣ ct).
For unlabeled projects, if xi′j = 1 then P^(τ)(xi′j ∣ ct) = P^(τ)(Xj = 1 ∣ ct); otherwise (xi′j = 0), P^(τ)(xi′j ∣ ct) = P^(τ)(Xj = 0 ∣ ct).

Here, P^(0)(Xj = 1 ∣ ct) and P^(0)(ct) are initially estimated from the labeled projects alone in the first iteration, and the unlabeled projects are appended to the learning process after they are assigned probabilistic effort classes by P^(1)(hi′ = ct ∣ Di′).
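A sketch of this posterior update (Equation 4) is given below; the variable names and the log-space computation (added for numerical stability) are assumptions of the sketch, not part of the slides.

```python
import numpy as np

def e_step(X, prior, lik1):
    """Equation 4 sketch: P^(tau+1)(h_i = c_t | D_i) for every project.

    X     : (m, n) Boolean attribute matrix (no missing values here)
    prior : (l,)   class priors P^(tau)(c_t) from Equation 3
    lik1  : (l, n) likelihoods P^(tau)(X_j = 1 | c_t) from Equation 1
    """
    # P(x_ij | c_t) is P(X_j = 1 | c_t) when x_ij = 1 and P(X_j = 0 | c_t) otherwise.
    log_joint = (np.log(prior)
                 + X @ np.log(lik1).T                  # terms with x_ij = 1
                 + (1.0 - X) @ np.log(1.0 - lik1).T)   # terms with x_ij = 0 (Equation 2)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exponentiating
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)      # normalize over the l classes
```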

Page 15: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Predicting the effort class of unlabeled projects.

We iterate Equations 1, 2, 3 and 4 until the estimates converge to stable values.

Then, P^(τ+1)(hi′ = ct ∣ Di′) is used to predict the effort class of Di′.

The ct ∈ {c1, ..., cl} that maximizes P^(τ+1)(hi′ = ct ∣ Di′) is regarded as the effort class of Di′.
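Putting Equations 1-4 together, a compact, self-contained sketch of the whole iteration in the complete-data case (no missing values) might look as follows; the initialization from labeled projects and the argmax prediction follow the slides, while the variable names, tolerance, and iteration cap are assumptions.

```python
import numpy as np

def nb_em(X, h, l, max_iter=100, tol=1e-6):
    """Naive Bayes + EM sketch for Equations 1-4 (illustrative, complete-data case).

    X : (m, n) Boolean attribute matrix without missing values
    h : (m,)   effort class labels in {0, ..., l-1}; -1 marks an unlabeled project
    l : number of effort classes
    """
    m, n = X.shape
    labeled = h >= 0

    # Responsibilities P(h_i = c_t | D_i): hard 0/1 for the labeled projects.
    resp = np.zeros((m, l))
    resp[np.where(labeled)[0], h[labeled]] = 1.0

    # P^(0) estimates from the labeled projects only (Equations 1 and 3).
    prior = (1.0 + resp[labeled].sum(axis=0)) / (l + labeled.sum())
    lik1 = likelihood(X[labeled], resp[labeled], n)

    for _ in range(max_iter):
        # E-step (Equation 4): posterior class membership of every project.
        log_joint = (np.log(prior)
                     + X @ np.log(lik1).T
                     + (1.0 - X) @ np.log(1.0 - lik1).T)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        new_resp = np.exp(log_joint)
        new_resp /= new_resp.sum(axis=1, keepdims=True)
        new_resp[labeled] = resp[labeled]          # known labels stay fixed

        # M-step (Equations 1-3) over labeled and probabilistically labeled projects.
        new_prior = (1.0 + new_resp.sum(axis=0)) / (l + m)
        new_lik1 = likelihood(X, new_resp, n)

        converged = np.abs(new_lik1 - lik1).max() < tol
        prior, lik1, resp = new_prior, new_lik1, new_resp
        if converged:
            break

    # Predicted effort class of project D_i is argmax_t resp[i, t].
    return prior, lik1, resp.argmax(axis=1)


def likelihood(X, resp, n):
    """Equation 1: P(X_j = 1 | c_t) with Laplace smoothing."""
    weighted = resp.T @ X
    return (1.0 + weighted) / (n + weighted.sum(axis=1, keepdims=True))
```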

Page 16: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 17: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Initial setting.

When we use Equation 1 to estimate the likelihood of Xj with respect to ct, P(Xj = 1 ∣ ct) or P(Xj = 0 ∣ ct), we do not consider the missing values among the xij (1 ≤ i ≤ m).

For each Xj, we can divide the whole historical dataset D into two subsets, i.e., D = {Dobs,j ∣ Dmis,j}, where Dobs,j is the set of projects whose values on attribute Xj are observed and Dmis,j is the set of projects whose values on attribute Xj are unobserved.

We may also divide the attributes of a project Di into two subsets, i.e., Di = {Xobs,i ∣ Xmis,i}, where Xobs,i is the set of attributes whose values are observed in project Di and Xmis,i is the set of attributes whose values are unobserved in project Di.

Page 18: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data toleration strategy.

This strategy is very similar to the method adopted by C4.5 to handle missing data; that is, we ignore missing values when training the prediction model.

To estimate P^(τ+1)(Xj = 1 ∣ ct) under this strategy, we rewrite Equation 1 as Equation 5.

P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}    (5)

Page 19: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data toleration strategy.

The difference between Equations 1 and 5 is that only the projects with observed values on attribute Xj, i.e., Dobs,j, are used to estimate P^(τ+1)(Xj = 1 ∣ ct).

Equation 2 can still be used here to estimate P^(τ+1)(Xj = 0 ∣ ct), and Equation 3 can still be used to estimate P^(τ+1)(ct).
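A sketch of the toleration update is given below; encoding missing values as np.nan so that they contribute nothing to the sums is an assumption of the sketch, and it corresponds to restricting the sums to Dobs,j in Equation 5.

```python
import numpy as np

def update_likelihood_tolerate(X, resp):
    """Equation 5 sketch: as Equation 1, but only projects in D_obs,j
    (observed values of X_j) contribute to the sums for attribute X_j.

    X    : (m, n) attribute matrix with np.nan marking missing x_ij
    resp : (m, l) responsibilities P^(tau)(h_i = c_t | D_i)
    """
    n = X.shape[1]
    X_obs = np.where(np.isnan(X), 0.0, X)    # a missing x_ij adds 0, i.e. it is ignored
    weighted = resp.T @ X_obs                # [t, j] = sum over D_obs,j of x_ij P(h_i = c_t | D_i)
    return (1.0 + weighted) / (n + weighted.sum(axis=1, keepdims=True))
```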

Page 20: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data toleration strategy.

Accordingly, the prediction model is adapted from Equation 4 to Equation 6.

P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(\tau)}(x_{i'j} \mid c_t)}{\prod_{j=1}^{|X_{obs,i'}|} \sum_{t=1}^{l} P^{(\tau)}(c_t)\, P^{(\tau)}(x_{i'j} \mid c_t)}    (6)

Page 21: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 22: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data imputation strategy.

The basic idea of this strategy is that the unobserved attribute values can be imputed using the observed values. Then, both the observed and the imputed values are used to construct the prediction model.

Page 23: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data imputation strategy.

This strategy is embedded in the naive Bayes and EM processing, and we rewrite Equation 1 as Equation 7 to estimate P^(τ+1)(Xj = 1 ∣ ct).

P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \bar{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s)}{n + \sum_{j=1}^{n} \left\{ \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \bar{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s) \right\}}    (7)

Page 24: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data imputation strategy.

The missing value xsj, i.e., the value of attribute Xj on project Ds, is imputed as x̄sj using Equation 8.

\bar{x}_{sj} = \frac{\sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{\sum_{i=1}^{|D_{obs,j}|} P^{(\tau)}(h_i = c_t \mid D_i)}    (8)

x̄sj is a constant independent of Ds, given ct.

We stipulate that x̄sj is rounded to 1 if x̄sj ≥ 0.5; otherwise, x̄sj is rounded to 0.

Here, we also use Equation 3 to estimate P^(τ+1)(ct).
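The sketch below mirrors Equation 8 and the 0.5 rounding rule; the (l, m, n) output layout, the np.nan encoding of missing values, and the guard against an empty Dobs,j are choices made for the sketch, not prescribed by the slides.

```python
import numpy as np

def impute_missing(X, resp):
    """Equation 8 sketch: class-conditional imputation of missing values.

    X    : (m, n) attribute matrix with np.nan marking missing x_sj
    resp : (m, l) responsibilities P^(tau)(h_i = c_t | D_i)
    Returns an (l, m, n) array: for each class c_t, a copy of X in which
    every missing x_sj is replaced by the imputed value, rounded to 1 if
    it is >= 0.5 and to 0 otherwise.
    """
    observed = ~np.isnan(X)
    X_filled = np.where(observed, X, 0.0)
    l = resp.shape[1]
    completed = np.empty((l,) + X.shape)
    for t in range(l):
        w = resp[:, t][:, None]                          # (m, 1)
        num = (w * X_filled * observed).sum(axis=0)      # sum over D_obs,j of x_ij P(h_i = c_t | D_i)
        den = (w * observed).sum(axis=0)                 # sum over D_obs,j of P(h_i = c_t | D_i)
        xbar = num / np.maximum(den, 1e-12)              # guard against an empty D_obs,j
        xbar = (xbar >= 0.5).astype(float)               # rounding rule from the slide
        completed[t] = np.where(observed, X, xbar)       # impute only the missing entries
    return completed
```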

Page 25: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Missing data imputation strategy.

As for the prediction model, P^(τ+1)(ct ∣ Di) can be constructed by Equation 9, taking the missing values into account.

P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\prod_{j=1}^{n} \sum_{t=1}^{l} P^{(\tau)}(c_t)\, P^{(\tau)}(x_{i'j} \mid c_t)}    (9)

Note that if xi′j is unobserved, its value is substituted with x̄i′j given by Equation 8.

Page 26: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 27: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

The ISBSG dataset.

The ISBSG data set (http://www.isbsg.org) has 70 attributes, and many attribute values are missing.

We extract 188 projects with 16 attributes, using the criterion that each project has at least 2/3 of its attribute values observed and that, for each attribute, its value is observed in at least 2/3 of the projects.

13 attributes are nominal and 3 attributes are continuous.

Page 28: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

The ISBSG dataset.

We use Equation 10 to normalize the efforts of the projects into l (= 3) classes.

c_t = \left\lfloor \frac{l \times (\mathrm{effort}_{D_i} - \mathrm{effort}_{\min})}{\mathrm{effort}_{\max} - \mathrm{effort}_{\min}} \right\rfloor + 1    (10)

Table: The effort classes in ISBSG data set.

Class No.   # of projects   Label
1           85              Low
2           76              Medium
3           27              High
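A direct transcription of Equation 10 is shown below; the effort numbers in the example are made up, and clamping the maximum-effort project to class l is an assumption (the raw formula would place it in class l + 1).

```python
import math

def effort_class(effort, effort_min, effort_max, l=3):
    """Equation 10: map a raw effort value to a class label in {1, ..., l}.

    The project with effort == effort_max would fall into class l + 1 by the
    raw formula, so it is clamped to l here (an assumption of this sketch).
    """
    c = math.floor(l * (effort - effort_min) / (effort_max - effort_min)) + 1
    return min(c, l)

# Example with made-up numbers: efforts ranging from 100 to 10000 person-hours;
# a 4000-hour project falls into class 2 ("Medium").
print(effort_class(4000, 100, 10000))   # -> 2
```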

Page 29: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

The CSBSG dataset.

The CSBSG data set contains 1103 projects collected from 140 organizations and 15 regions across China by the Chinese association of software industry. We extract 94 projects and 21 attributes (15 nominal and 6 continuous) using the same selection criterion as for the ISBSG data set, and we use Equation 10 to normalize the efforts of the projects into l (= 3) classes.

Table: The effort classes in CSBSG data set.

Class No.   # of projects   Label
1           27              Low
2           31              Medium
3           36              High

Page 30: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 31: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Experiment setup.

To evaluate the proposed method comparatively, we adopt MI and MINI to impute the missing values of the ISBSG and CSBSG data sets.

A BPNN (back-propagation neural network) is used to classify the projects in the data sets after imputation.

Our experiments are conducted with the 10-fold cross-validation technique.
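For reference, a generic 10-fold cross-validation harness along these lines might look as follows; the `fit_predict` callback, the accuracy metric, and the shuffling seed are placeholders, and the MI, MINI and BPNN implementations themselves are not shown.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, h, fit_predict, n_splits=10, seed=0):
    """10-fold cross-validation sketch.

    fit_predict(X_train, h_train, X_test) should return predicted effort
    classes for X_test (e.g. naive Bayes + EM-T/EM-I, or imputation + BPNN).
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accuracies = []
    for train_idx, test_idx in kf.split(X):
        pred = fit_predict(X[train_idx], h[train_idx], X[test_idx])
        accuracies.append(float(np.mean(pred == h[test_idx])))
    return float(np.mean(accuracies))
```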

Page 32: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Outline

1 Introduction

2 Naive Bayes and EM for software effort prediction

3 Missing data handling strategies
   Missing data toleration strategy
   Missing data imputation strategy

4 Experiments
   The datasets
   Experiment setup
   Experimental results

5 Threats to validity

6 Conclusion and future work

Page 33: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on ISBSG dataset.

The following figure illustrates the performance of the missing data toleration strategy (hereafter called EM-T) and the missing data imputation strategy (hereafter called EM-I) in handling missing data for effort prediction on the ISBSG data set.

Page 34: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on ISBSG dataset.

[Plot omitted: accuracy (0.6-0.8) versus number of unlabeled projects (0-20); curves for EM-I, EM-T, BPNN+MI and BPNN+MINI.]

Figure: Performances of naive Bayes with EM-I and EM-T in comparison with BPNN on effort prediction using the ISBSG data set.

Page 35: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on ISBSG dataset.

What we can see from the figure.

Both EM-I and EM-T perform better than BPNN with either MI or MINI on classifying the projects in the ISBSG data set.

The performance of naive Bayes and EM improves when unlabeled projects are appended. This outcome illustrates that semi-supervised learning can improve the prediction of software effort.

Page 36: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on ISBSG dataset.

What we can see from the figure.

If supervised learning were used for software effort prediction, the MINI method would be favorable for imputing the missing values, but the missing data toleration strategy may not be desirable for handling missing values.

The imputation strategy for missing data is more effective than the toleration strategy when naive Bayes and EM are used to predict ISBSG software effort.

Page 37: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on CSBSG dataset.

EM-T and EM-I in handling missing data for effort prediction on the CSBSG data set.

[Plot omitted: accuracy (0.5-0.8) versus number of unlabeled projects (0-8); curves for EM-I, EM-T, BPNN+MI and BPNN+MINI.]

Figure: Performances of EM-I and EM-T in comparison with BPNN on predicting effort with different numbers of unlabeled projects using the CSBSG data set.

Page 38: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

EM-T and EM-I on CSBSG dataset.

What we can see from the above figure.

The better performance of EM-I compared with EM-T is also observed on the CSBSG data set, just as on the ISBSG data set. This further validates our conjecture that EM-I outperforms EM-T in software effort prediction.

EM-T performs better than EM-I when the number of unlabeled projects is larger than the number at the accuracy maximum, which differs from the ISBSG results. We attribute this result to the relatively small size of the CSBSG data set, where the imputation strategy is more prone than the toleration strategy to introduce bias into the prediction.

Page 39: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

More experiments and hypothesis testing.

More experimental results, with explanations, are detailed in the paper. We also conduct hypothesis testing to examine the significance of the conclusions drawn from our experiments. Those interested may refer to the paper.

Page 40: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

The threat to external validity primarily concerns the degree to which the attributes we used to describe the projects, and the ISBSG and CSBSG sample data sets, are representative.

The threat to internal validity concerns measurement and data effects that can bias our results, such as the use of accuracy as the performance measure.

The threat to construct validity is that our experiments use clipped subsets of attributes and project data from both the ISBSG and CSBSG data sets.

Page 41: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Conclusion

Semi-supervised learning, in the form of naive Bayes and EM, is employed to predict software effort.

We propose two strategies embedded in naive Bayes and EM to handle missing data.

Page 42: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Future work

We plan to compare the proposed techniques with other missing data imputation techniques, such as FIML and MSWR.

We will develop more missing data handling techniques embedded in naive Bayes and EM for software effort prediction.

We have already investigated the underlying mechanism of missingness (structural or unstructured missing) in software effort data. With this progress, we will improve the missing data handling strategies so that they target the underlying missing mechanism of the software effort data.

Page 43: PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

Thanks

Any further questions about the content of the slides and the paper can be sent to Mr. Wen Zhang.
Email: [email protected]
