Guidelines for Conducting and Evaluating Empirical Studies

Barbara Kitchenham

Keele University

Agenda

• Introduction
• Guideline topics
– Experimental context
– Study design
– Data collection
– Analysis
– Result presentation
– Interpretation
• Conclusions

Introduction

• Empirical research in applied disciplines is weak
– Medical research
• Yancey reviewed the American Journal of Surgery and found
– “methodologic errors so as to render invalid the conclusions of the authors”
• Welch & Gabbe reviewed the American Journal of Obstetrics and Gynecology and found
– Half the papers could not be assessed because of poor reporting
– A third of the papers used statistics inappropriately

Empirical Software Engineering

• In my experience (as a reviewer and researcher)
– Just as bad as (if not worse than) medicine
• David Hoaglin, former vice president of the American Statistical Association
– Reviewed 8 papers from IEEE Transactions on Software Engineering
• Poor study design
• Inappropriate use of statistical techniques
• Conclusions that don’t follow from results

Possible Solution

• Medical researchers have produced statistical guidelines for researchers
– CONSORT
• Adopted by 70 journals
– International guidelines on statistical principles for clinical trials (ICH E9)
• Guidelines might help SE research
– Improve empirical studies
– Assist reviewers
– Enable meta-analysis

Interest Group

• Dr Shari Lawrence Pfleeger
– Assistant editor of IEEE Transactions on Software Engineering
• Dr Lesley Pickard
– Statistician & SE researcher
• Prof Peter Jones
– Professor of Statistics
• Jarret Rosenberg
– Statistician working for Sun
• Dr David Hoaglin


Context Guidelines

• CG1: Be sure to specify as much of the software engineering context as possible

• CG2: State the hypothesis so that the study’s objectives are clear

• CG3: Discuss the theory from which the hypothesis is derived

Empirical Context

• Two types of context
– “Extraneous” factors that affect software development
• Product & process variety
• Corporate culture
• Client expectations
• Staff motivation
– Theoretical background to the empirical study
• Both have implications for empirical research

Product & Process Diversity

• Immense variety in products, methods, procedures, culture
– How can we tell whether results obtained in one environment will apply in another?
• Need richer descriptions of context than are normally provided
– “Company X, a large multinational telecommunications company, ...”

Maintenance Ontology

• Aimed to identify concepts that affect empirical maintenance studies
– 12 “concepts”, including
• product
• procedure
• human resources
• service level agreement
– 23 concept properties that might affect results, including
• product size, maturity, quality, age
• user type and population size

Theoretical Context

• Scientific method
– Observe behaviour
– Develop a theory
– Test the theory
• The study hypothesis should derive from the theory
– Early empirical studies didn’t state a hypothesis at all

Shallow Hypotheses

• Most researchers now state their hypotheses, but they are usually not derived from a theory
– Collect complexity metrics and fault counts and do a correlation analysis
– Null hypothesis: “There is no correlation between complexity and number of faults”
– The hypothesis has no explanatory power
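The pattern just described can be made concrete. Below is a minimal sketch (plain Python, entirely hypothetical module data) of exactly such a correlation analysis, testing the null hypothesis of no correlation with a permutation test. Note that even a significant result here explains nothing about *why* complexity might breed faults.

```python
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def permutation_p_value(xs, ys, n_perm=10_000, seed=1):
    """Two-sided p-value for H0 'no correlation', by shuffling ys."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    ys = list(ys)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(ys)
        if abs(pearson_r(xs, ys)) >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical per-module data: complexity metric vs. fault counts.
complexity = [3, 7, 2, 9, 12, 5, 8, 4, 10, 6]
faults     = [1, 3, 0, 4,  6, 2, 3, 1,  5, 2]
print(pearson_r(complexity, faults))            # strength of association
print(permutation_p_value(complexity, faults))  # evidence against H0
```

A theory-driven study would instead predict *which* kinds of complexity cause faults, and why, before collecting the data.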

Deep Hypothesis

• Vinter, Loomes & Kornbrot
– Interested in the validity of claims made about formal methods
– Investigated cognitive psychology research concerned with logic errors
– Studied logic errors made by people with a formal methods background
• Null hypothesis: the errors would be the same as those observed for naïve subjects

Design Guidelines - 1

• DG1: Identify the population from which subjects/objects are drawn

• DG2: Define the process by which subjects/objects were selected

• DG3: Define the process by which subjects/objects are assigned to treatments

Design Guidelines - 2

• DG4: Perform a pre-experiment or pre-calculation to identify the required sample size and experimental power

• DG5: Justify the choice of outcome measures in terms of their relevance to the objectives of the empirical study

• DG6: Restrict yourself to simple experiments or, at least, to designs that are fully defined in the literature

Design Guidelines - 3

• DG7: Explain how you handle possible subject bias

• DG8: Avoid evaluating your own inventions, or make explicit any vested interests

Subject Selection

• Sample (subject) selection
– How was the sample obtained?
• Was there any bias?
– All responses to a Web posting
• Self-selecting samples are not random
– From what population was it derived?
• Statistical results can only apply to the defined population
– 2nd-year undergraduates are only representative of 2nd-year undergraduates

Randomisation Pitfalls

• Was the randomisation process suitable?
– Assigning students in one class to one treatment and students in another class to another
• The experimental “unit” is the class, not the student
• 0 degrees of freedom to test the treatment
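To avoid the class-as-unit pitfall, randomisation has to happen at the level of the individual subject. A minimal sketch (hypothetical subject names, plain Python) of balanced individual-level assignment:

```python
import random

def randomise(subjects, treatments, seed=42):
    """Randomly assign each subject (the experimental unit) to a
    treatment, keeping group sizes as balanced as possible."""
    rng = random.Random(seed)
    pool = list(subjects)
    rng.shuffle(pool)  # random order of subjects
    # Deal treatments out round-robin over the shuffled order.
    return {s: treatments[i % len(treatments)] for i, s in enumerate(pool)}

subjects = [f"student{i:02d}" for i in range(20)]
assignment = randomise(subjects, ["A", "B"])
print(assignment)  # 10 subjects per treatment, in random assignment
```

Each student, not each class, now contributes a degree of freedom to the treatment comparison.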

Sample Size

• Was the sample size appropriate?
– Should perform a pre-study
• Assess the likely size of the effect
• Identify an appropriate sample size for the full experiment
• Identify the power of the experiment
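The pre-calculation in DG4 can be sketched as follows. This is a rough normal-approximation power calculation, not something from the talk; the effect size d = 0.5 is a hypothetical pre-study estimate.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sample(d, n):
    """Approximate power of a two-sided, two-sample z-test for a
    standardised effect size d with n subjects per group
    (normal approximation; the far tail is ignored)."""
    z_crit = 1.959964  # critical z for alpha = 0.05, two-sided
    return phi(d * math.sqrt(n / 2.0) - z_crit)

def required_n(d, target=0.9):
    """Smallest per-group n giving at least the target power."""
    n = 2
    while power_two_sample(d, n) < target:
        n += 1
    return n

# Suppose the pre-study suggests a medium effect, d = 0.5:
print(power_two_sample(0.5, 30))  # under 0.5: 30 per group is far too few
print(required_n(0.5))            # per-group n for power >= 0.9 (about 85)
```

The calculation makes the slide's point vivid: intuitively "reasonable" sample sizes often give an experiment less than an even chance of detecting the effect it was run to find.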

Type II Errors & Power

(Figure: plot illustrating Type II errors and statistical power; only the axis values survive in this transcript.)

Importance of Power

• Need to ensure that power is acceptable
– power > 0.9
• Power depends on
– the specific alternative hypothesis
– sample size
• Non-parametric tests are less powerful than parametric tests
– if the parametric assumptions are true

Surrogate Measures

• Are the outcome measures appropriate?
– Use of surrogate measures can be misleading
• Defect counts for quality
• Pre-release defect rates instead of post-release rates

Complex Designs

• The experimental design determines the appropriate analysis
– Some designs are too complex to analyse
• Software tasks are affected by subject experience & capability
– Cross-over designs apply the different treatments to the same subject
• Allow for experience but add design complexity
– The order of treatments may affect subject experience
• So order needs to be randomised
• Which adds to the complexity of the cross-over experiment

Experimenter Influence

• Cannot perform blind or double-blind experiments in SE
– A subject must by definition know which treatment he/she is assigned to
• May be affected by expectation
– How do we stop experimenter influence?
• No way of addressing this problem
• Beware vested interests

Data Collection Guidelines

• DC1: For surveys,
– specify the response rate
– discuss the representativeness of the responses
– discuss the impact of non-response

• DC2: Define all software metrics fully

Data Collection Guidelines - 2

• DC3: For experiments, record data on subjects who drop out of experiments

• DC4: For experiments, record data about other performance measures that you do not want to be adversely affected by the treatment, even if they are not the main focus of the study

Metrics Definitions

• Software metrics are not well defined
– Need to specify entity, attribute and counting rules
– Counting rules
• Define the where, when & how of collecting measures
– Defect counts are not comparable unless you know
• when in the development process counting started

Extra Data

• Avoid sub-optimization
– Many software engineering factors are related
• If we test a hypothesis about productivity, we should consider the impact on quality


Analysis Guidelines

• AG1: Analyze data in accordance with the experimental design

• AG2: Justify any non-standard analysis

• AG3: Identify whether your statistics are inferential or descriptive


Analysis Guidelines - 2

• AG4: Adjust significance levels if performing many significance tests on the same dataset

• AG5: If possible, use blind analysis

• AG6: Perform sensitivity analyses

Multiple Tests

• Researchers often perform many tests on the same data set
– With many tests, some “significant” results will occur by chance
• 20 tests at p = 0.05: at least one spurious significant result is likely (probability ≈ 0.64)
– Need to adjust significance levels when performing multiple tests
• Bonferroni adjustment
– 10 tests: for 0.05 overall, use 0.005 for individual tests

Analysis Pitfalls

• Experimenters often fish for results
– Considering different subsets of the data until they get the result they want
– This problem is reduced by blind analysis
• The analyst doesn’t know which treatment is which

• Software datasets often have outliers
– Sensitivity analysis ensures results are not due to outliers
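One common form of sensitivity analysis, the leave-one-out check, can be sketched in a few lines (hypothetical effort data; not an example from the talk):

```python
def mean(xs):
    return sum(xs) / len(xs)

def leave_one_out(xs, statistic=mean):
    """Recompute a statistic with each observation dropped in turn,
    to see whether any single point (e.g. an outlier) drives the result."""
    return [statistic(xs[:i] + xs[i + 1:]) for i in range(len(xs))]

# Hypothetical effort data (person-hours) with one outlier:
effort = [12, 15, 11, 14, 13, 95]
print(mean(effort))           # ~26.7: dominated by the outlier
print(leave_one_out(effort))  # dropping 95 drags the mean down to 13.0
```

If the conclusion changes when a single observation is removed, the result is being driven by that point rather than by the treatment.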

Presentation Guidelines

• PG1: Describe or reference all statistical procedures used

• PG2: Present analyses that are relevant to the hypothesis

• PG3: Present quantitative results as well as significance levels. Quantitative results should show the magnitude of effects and confidence limits
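PG3 can be illustrated with a minimal sketch (hypothetical data; a normal-approximation interval rather than a t-interval) that reports the magnitude of an effect with confidence limits instead of a bare "p < 0.05":

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def diff_ci(a, b, z=1.959964):
    """Difference in group means with an approximate 95% confidence
    interval (normal approximation, unequal variances)."""
    na, nb = len(a), len(b)
    va = sum((x - mean(a)) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mean(b)) ** 2 for x in b) / (nb - 1)
    d = mean(a) - mean(b)
    se = math.sqrt(va / na + vb / nb)  # standard error of the difference
    return d, (d - z * se, d + z * se)

# Hypothetical defect densities under two inspection techniques:
technique_a = [4.1, 3.8, 5.0, 4.4, 3.9, 4.6]
technique_b = [3.2, 2.9, 3.6, 3.1, 3.4, 2.8]
d, (lo, hi) = diff_ci(technique_a, technique_b)
print(d, lo, hi)  # magnitude of the effect, with confidence limits
```

A reader can judge practical importance from the size of the difference and its interval; a significance level alone cannot convey that.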

Presentation Guidelines - 2

• PG4: Present raw data or, otherwise, confirm that it is available for confidential review by reviewers and independent auditors

Raw Data

• It is important to allow readers to draw their own conclusions
– They need access to the raw data
• Yancey: “When science stops being public it stops being science”
– Most software engineering results are “company confidential”
• Reviewers or auditors should still be able to view the data

Interpretation of Results

• IG1: Define the population to which inferential statistics apply

• IG2: Differentiate between statistical significance and practical importance

• IG3: Define the type of study

• IG4: Specify any limitations of the study

• IG5: Ensure conclusions arise from the presented results

Study Types

• There are differences between the types of inference you can make from different types of study
– Yancey: “Only truly randomized tightly controlled prospective studies provide an opportunity for cause and effect statements”
• Regression and correlation studies can only lead to weak conclusions

Conclusions

• Empirical software engineering needs to improve

• Guidelines offer a means of propagating good practice
– Need to be accepted by researchers
– Need to be adopted by journal editors

• These guidelines are a starting point
– Need input from a wider group

References

• B.A. Kitchenham, R.T. Hughes and S.G. Linkman. Modeling software measurement data. IEEE Transactions on Software Engineering (in press).

• B.A. Kitchenham, S.G. Linkman and D.T. Law. Critical review of quantitative assessment. Software Engineering Journal 9(2), 1994, pp. 43-53.

• B.A. Kitchenham, G. Travassos, A. von Mayrhauser, F. Niessink, N.F. Schneidewind, J. Singer, S. Takada, R. Vehvilainen and H. Yang. Towards an ontology of software maintenance. Journal of Software Maintenance (in press).

• L.M. Pickard, B.A. Kitchenham and P. Jones. Combining empirical results in software engineering. Information and Software Technology 40(14), 1998, pp. 811-821.

• G.A. Milliken and D.E. Johnson. Analysis of Messy Data, Volume 1: Designed Experiments. Chapman & Hall, 1992, Chapters 5 & 32.

References

• W.F. Rosenberger. Dealing with multiplicities in pharmacoepidemiologic studies. Pharmacoepidemiology and Drug Safety 5, 1996, pp. 95-100.

• R. Vinter, M. Loomes and D. Kornbrot. Applying software metrics to formal specifications: a cognitive approach. Proceedings of the 5th International Software Metrics Symposium, IEEE Computer Society Press, 1998, pp. 216-223.

• G.E. Welch and S.G. Gabbe. Review of statistics usage in the American Journal of Obstetrics and Gynecology. American Journal of Obstetrics and Gynecology 175(5), 1996, pp. 1138-1141.

• J.M. Yancey. Ten rules for reading clinical research reports (guest editorial). American Journal of Orthodontics and Dentofacial Orthopedics 109(5), May 1996, pp. 558-564.