Guidelines for Conducting and Evaluating Empirical Studies

Barbara Kitchenham

Keele University

Agenda

• Introduction
• Guideline topics
– Experimental context
– Study design
– Data collection
– Analysis
– Result presentation
– Interpretation
• Conclusions

Introduction

• Empirical research in applied disciplines is weak
– Medical research
• Yancey reviewed the American Journal of Surgery and found
– “methodologic errors so as to render invalid the conclusions of the authors”
• Welch & Gabbe reviewed the American Journal of Obstetrics and Gynecology and found
– Half the papers could not be assessed because of poor reporting
– A third of the papers used statistics inappropriately

Empirical Software Engineering

• In my experience (as a reviewer and researcher)
– Just as bad as (if not worse than) medicine
• David Hoaglin, former vice president of the American Statistical Association
– Reviewed 8 papers from IEEE Transactions on Software Engineering
• Poor study design
• Inappropriate use of statistical techniques
• Conclusions that don’t follow from results

Possible Solution

• Medical researchers have produced statistical guidelines for researchers
– CONSORT
• Adopted by 70 journals
– International guidelines on statistical principles for clinical trials (ICH E9)
• Guidelines might help SE research
– Improve empirical studies
– Assist reviewers
– Enable meta-analysis

Interest Group

• Dr Shari Lawrence Pfleeger
– Assistant editor of IEEE Transactions on Software Engineering
• Dr Lesley Pickard
– Statistician & SE researcher
• Prof Peter Jones
– Professor of Statistics
• Jarret Rosenberg
– Statistician working for Sun
• Dr David Hoaglin


Context Guidelines

• CG1: Be sure to specify as much of the software engineering context as possible

• CG2: State the hypothesis so that the study’s objectives are clear

• CG3: Discuss the theory from which the hypothesis is derived

Empirical Context

• Two types of context
– “Extraneous” factors that affect software development
• Product & process variety
• Corporate culture
• Client expectations
• Staff motivation
– Theoretical background to the empirical study
• Both have implications for empirical research

Product & Process Diversity

• Immense variety in products, methods, procedures, culture
– How can we tell whether results obtained in one environment will apply in another?
• Need richer descriptions of context than are normally provided
– “Company X, a large multinational telecommunications company, ...”

Maintenance Ontology

• Aimed to identify concepts that affect empirical maintenance studies
– 12 “concepts”, including
• product
• procedure
• human resources
• service level agreement
– 23 concept properties that might affect results, including
• product size, maturity, quality, age
• user type and population size

Theoretical Context

• Scientific method
– Observe behaviour
– Develop a theory
– Test the theory
• The study hypothesis should derive from the theory
– Early empirical studies didn’t state a hypothesis at all

Shallow Hypotheses

• Most researchers now state their hypotheses, but they are usually not derived from a theory
– Collect complexity metrics and fault counts and do a correlation analysis
– Null hypothesis: “There is no correlation between complexity and number of faults”
– The hypothesis has no explanatory power
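The pattern just described can be made concrete. Below is a minimal sketch (plain Python, entirely hypothetical module data) of exactly such a correlation analysis, testing the null hypothesis of no correlation with a permutation test. Note that even a significant result here explains nothing about *why* complexity might breed faults.

```python
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def permutation_p_value(xs, ys, n_perm=10_000, seed=1):
    """Two-sided p-value for H0 'no correlation', by shuffling ys."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    ys = list(ys)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(ys)
        if abs(pearson_r(xs, ys)) >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical per-module data: complexity metric vs. fault counts.
complexity = [3, 7, 2, 9, 12, 5, 8, 4, 10, 6]
faults     = [1, 3, 0, 4,  6, 2, 3, 1,  5, 2]
print(pearson_r(complexity, faults))            # strength of association
print(permutation_p_value(complexity, faults))  # evidence against H0
```

A theory-driven study would instead predict *which* kinds of complexity cause faults, and why, before collecting the data.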

Deep Hypothesis

• Vinter, Loomes & Kornbrot
– Interested in the validity of claims made about formal methods
– Investigated cognitive psychology research concerned with logic errors
– Studied logic errors made by people with a formal methods background
• Null hypothesis: the errors would be the same as those observed for naïve subjects

Design Guidelines - 1

• DG1: Identify the population from which subjects/objects are drawn

• DG2: Define the process by which subjects/objects were selected

• DG3: Define the process by which subjects/objects are assigned to treatments

Design Guidelines - 2

• DG4: Perform a pre-experiment or pre-calculation to identify the required sample size and experimental power

• DG5: Justify the choice of outcome measures in terms of their relevance to the objectives of the empirical study

• DG6: Restrict yourself to simple experiments or, at least, to designs that are fully defined in the literature

Design Guidelines - 3

• DG7: Explain how you handle possible subject bias

• DG8: Avoid evaluating your own inventions, or make explicit any vested interests

Subject Selection

• Sample (subject) selection
– How was the sample obtained?
• Was there any bias?
– All responses to a Web posting
• Self-selecting samples are not random
– From what population was it derived?
• Statistical results can only apply to the defined population
– 2nd-year undergraduates are only representative of 2nd-year undergraduates

Randomisation Pitfalls

• Was the randomisation process suitable?
– Assigning students in one class to one treatment and students in another class to another
• The experimental “unit” is the class, not the student
• 0 degrees of freedom to test the treatment
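To avoid the class-as-unit pitfall, randomisation has to happen at the level of the individual subject. A minimal sketch (hypothetical subject names, plain Python) of balanced individual-level assignment:

```python
import random

def randomise(subjects, treatments, seed=42):
    """Randomly assign each subject (the experimental unit) to a
    treatment, keeping group sizes as balanced as possible."""
    rng = random.Random(seed)
    pool = list(subjects)
    rng.shuffle(pool)  # random order of subjects
    # Deal treatments out round-robin over the shuffled order.
    return {s: treatments[i % len(treatments)] for i, s in enumerate(pool)}

subjects = [f"student{i:02d}" for i in range(20)]
assignment = randomise(subjects, ["A", "B"])
print(assignment)  # 10 subjects per treatment, in random assignment
```

Each student, not each class, now contributes a degree of freedom to the treatment comparison.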

Sample Size

• Was the sample size appropriate?
– Should perform a pre-study
• Assess the likely size of the effect
• Identify an appropriate sample size for the full experiment
• Identify the power of the experiment
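The pre-calculation in DG4 can be sketched as follows. This is a rough normal-approximation power calculation, not something from the talk; the effect size d = 0.5 is a hypothetical pre-study estimate.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sample(d, n):
    """Approximate power of a two-sided, two-sample z-test for a
    standardised effect size d with n subjects per group
    (normal approximation; the far tail is ignored)."""
    z_crit = 1.959964  # critical z for alpha = 0.05, two-sided
    return phi(d * math.sqrt(n / 2.0) - z_crit)

def required_n(d, target=0.9):
    """Smallest per-group n giving at least the target power."""
    n = 2
    while power_two_sample(d, n) < target:
        n += 1
    return n

# Suppose the pre-study suggests a medium effect, d = 0.5:
print(power_two_sample(0.5, 30))  # under 0.5: 30 per group is far too few
print(required_n(0.5))            # per-group n for power >= 0.9 (about 85)
```

The calculation makes the slide's point vivid: intuitively "reasonable" sample sizes often give an experiment less than an even chance of detecting the effect it was run to find.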

Type II Errors & Power

(Figure: plot illustrating Type II errors and statistical power; only the axis values survive in this transcript.)

Importance of Power

• Need to ensure that power is acceptable
– power > 0.9
• Power depends on
– the specific alternative hypothesis
– sample size
• Non-parametric tests are less powerful than parametric tests
– if the parametric assumptions are true

Surrogate Measures

• Are the outcome measures appropriate?
– Use of surrogate measures can be misleading
• Defect counts for quality
• Pre-release defect rates instead of post-release rates

Complex Designs

• The experimental design determines the appropriate analysis
– Some designs are too complex to analyse
• Software tasks are affected by subject experience & capability
– Cross-over designs apply the different treatments to the same subject
• Allow for experience but add design complexity
– The order of treatments may affect subject experience
• So order needs to be randomised
• Which adds to the complexity of the cross-over experiment

Experimenter Influence

• Cannot perform blind or double-blind experiments in SE
– A subject must by definition know which treatment he/she is assigned to
• May be affected by expectation
– How do we stop experimenter influence?
• No way of addressing this problem
• Beware vested interests

Data Collection Guidelines

• DC1: For surveys,
– specify the response rate
– discuss the representativeness of the responses
– discuss the impact of non-response

• DC2: Define all software metrics fully

Data Collection Guidelines - 2

• DC3: For experiments, record data on subjects who drop out of experiments

• DC4: For experiments, record data about other performance measures that you do not want to be adversely affected by the treatment, even if they are not the main focus of the study

Metrics Definitions

• Software metrics are not well defined
– Need to specify entity, attribute and counting rules
– Counting rules
• Define the where, when & how of collecting measures
– Defect counts are not comparable unless you know
• when in the development process counting started

Extra Data

• Avoid sub-optimization
– Many software engineering factors are related
• If we test a hypothesis about productivity, we should consider the impact on quality


Analysis Guidelines

• AG1: Analyze data in accordance with the experimental design

• AG2: Justify any non-standard analysis

• AG3: Identify whether your statistics are inferential or descriptive


Analysis Guidelines - 2

• AG4: Adjust significance levels if performing many significance tests on the same dataset

• AG5: If possible, use blind analysis

• AG6: Perform sensitivity analyses

Multiple Tests

• Researchers often perform many tests on the same data set
– With many tests, some “significant” results will occur by chance
• 20 tests at p = 0.05: at least one spurious significant result is likely (probability ≈ 0.64)
– Need to adjust significance levels when performing multiple tests
• Bonferroni adjustment
– 10 tests: for 0.05 overall, use 0.005 for individual tests

Analysis Pitfalls

• Experimenters often fish for results
– Considering different subsets of the data until they get the result they want
– This problem is reduced by blind analysis
• The analyst doesn’t know which treatment is which

• Software datasets often have outliers
– Sensitivity analysis ensures results are not due to outliers
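One common form of sensitivity analysis, the leave-one-out check, can be sketched in a few lines (hypothetical effort data; not an example from the talk):

```python
def mean(xs):
    return sum(xs) / len(xs)

def leave_one_out(xs, statistic=mean):
    """Recompute a statistic with each observation dropped in turn,
    to see whether any single point (e.g. an outlier) drives the result."""
    return [statistic(xs[:i] + xs[i + 1:]) for i in range(len(xs))]

# Hypothetical effort data (person-hours) with one outlier:
effort = [12, 15, 11, 14, 13, 95]
print(mean(effort))           # ~26.7: dominated by the outlier
print(leave_one_out(effort))  # dropping 95 drags the mean down to 13.0
```

If the conclusion changes when a single observation is removed, the result is being driven by that point rather than by the treatment.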

Presentation Guidelines

• PG1: Describe or reference all statistical procedures used

• PG2: Present analyses that are relevant to the hypothesis

• PG3: Present quantitative results as well as significance levels. Quantitative results should show the magnitude of effects and confidence limits
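PG3 can be illustrated with a minimal sketch (hypothetical data; a normal-approximation interval rather than a t-interval) that reports the magnitude of an effect with confidence limits instead of a bare "p < 0.05":

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def diff_ci(a, b, z=1.959964):
    """Difference in group means with an approximate 95% confidence
    interval (normal approximation, unequal variances)."""
    na, nb = len(a), len(b)
    va = sum((x - mean(a)) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mean(b)) ** 2 for x in b) / (nb - 1)
    d = mean(a) - mean(b)
    se = math.sqrt(va / na + vb / nb)  # standard error of the difference
    return d, (d - z * se, d + z * se)

# Hypothetical defect densities under two inspection techniques:
technique_a = [4.1, 3.8, 5.0, 4.4, 3.9, 4.6]
technique_b = [3.2, 2.9, 3.6, 3.1, 3.4, 2.8]
d, (lo, hi) = diff_ci(technique_a, technique_b)
print(d, lo, hi)  # magnitude of the effect, with confidence limits
```

A reader can judge practical importance from the size of the difference and its interval; a significance level alone cannot convey that.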

Presentation Guidelines - 2

• PG4: Present raw data or, otherwise, confirm that it is available for confidential review by reviewers and independent auditors

Raw Data

• It is important to allow readers to draw their own conclusions
– They need access to the raw data
• Yancey: “When science stops being public it stops being science”
– Most software engineering results are “company confidential”
• Reviewers or auditors should still be able to view the data

Interpretation of Results

• IG1: Define the population to which inferential statistics apply

• IG2: Differentiate between statistical significance and practical importance

• IG3: Define the type of study

• IG4: Specify any limitations of the study

• IG5: Ensure conclusions arise from the presented results

Study Types

• There are differences between the types of inference you can make from different types of study
– Yancey: “Only truly randomized tightly controlled prospective studies provide an opportunity for cause and effect statements”
• Regression and correlation studies can only lead to weak conclusions

Conclusions

• Empirical software engineering needs to improve

• Guidelines offer a means of propagating good practice
– Need to be accepted by researchers
– Need to be adopted by journal editors

• These guidelines are a starting point
– Need input from a wider group

References

• B.A. Kitchenham, R.T. Hughes and S.G. Linkman. Modeling software measurement data. IEEE Transactions on Software Engineering (in press).

• B.A. Kitchenham, S.G. Linkman and D.T. Law. Critical review of quantitative assessment. Software Engineering Journal 9(2), 1994, pp. 43-53.

• B.A. Kitchenham, G. Travassos, A. von Mayrhauser, F. Niessink, N.F. Schneidewind, J. Singer, S. Takada, R. Vehvilainen and H. Yang. Towards an ontology of software maintenance. Journal of Software Maintenance (in press).

• L.M. Pickard, B.A. Kitchenham and P. Jones. Combining empirical results in software engineering. Information and Software Technology 40(14), 1998, pp. 811-821.

• G.A. Milliken and D.E. Johnson. Analysis of Messy Data, Volume 1: Designed Experiments. Chapman & Hall, 1992, Chapters 5 & 32.

References

• W.F. Rosenberger. Dealing with multiplicities in pharmacoepidemiologic studies. Pharmacoepidemiology and Drug Safety 5, 1996, pp. 95-100.

• R. Vinter, M. Loomes and D. Kornbrot. Applying software metrics to formal specifications: a cognitive approach. Proceedings of the 5th International Software Metrics Symposium, IEEE Computer Society Press, 1998, pp. 216-223.

• G.E. Welch and S.G. Gabbe. Review of statistics usage in the American Journal of Obstetrics and Gynecology. American Journal of Obstetrics and Gynecology 175(5), 1996, pp. 1138-1141.

• J.M. Yancey. Ten rules for reading clinical research reports (guest editorial). American Journal of Orthodontics and Dentofacial Orthopedics 109(5), May 1996, pp. 558-564.