Experimentation in Software Engineering: an introduction

Mariano Ceccato

FBK

Fondazione Bruno Kessler

[email protected]

Announcement: special exercise session

Obfuscation: April 16th, 17th and 23rd

Web testing: May 21st, 22nd and 28th

Motivation

To better understand software engineering …

Why should we perform empirical software engineering?

The major reason for carrying out empirical studies is the opportunity to obtain statistically significant results regarding the understanding, prediction, and improvement of software development.

Empirical studies are an important input to decision-making in an organization.

Experimental questions

Experimental studies try to answer experimental questions.

Examples, from general to specific:
- Do ‘stereotypes’ improve the understandability of UML diagrams?
- Do ‘design patterns’ improve the maintainability of code?
- Is ‘Rational Rose’ more usable than ‘Omondo’ in the development work of software house X?
- Does the use of ‘JUnit’ reduce the number of defects in the code of industry Y?

Researchers carry out an empirical study to answer a question.

Our research question

Are stereotyped reverse-engineered UML diagrams (Conallen's proposal) useful in Web application comprehension and maintenance tasks?

Experiment process

Idea → Definition → Planning → Operation → Analysis & interpretation → Presentation & package → Conclusions

“Is technique A better than B?”

No!!

Definition

In this activity the hypothesis that we want to test has to be stated clearly.

Usually the goal definition template is used:

Object of the study. The entity that is studied in the experiment (code, process, design documents, …).

Purpose. What is the intention of the experiment? (e.g. to evaluate the impact of two different techniques)

Quality focus. The effect under study in the experiment (e.g. cost, reliability, …).

Perspective. The viewpoint from which the experiment results are interpreted (developer, project manager, researcher, …).

Context. The “environment” in which the experiment is run. It defines the subjects and objects used.

Conallen vs. Pure UML: definition

Object of the study. Design documents (Conallen and pure UML) of Web applications.

Purpose. Evaluating the usefulness of Conallen diagrams in Web application comprehension, impact analysis and maintenance tasks.

Quality focus. Comprehensibility and maintainability.

Perspective. Multiple.
- Researcher: evaluate how effective Conallen diagrams are during maintenance.
- Project manager: evaluate the possibility of adopting a Web application reverse engineering tool (Conallen notation) in his/her organization.

Context.
- Subjects: 13 Computer Science students.
- Objects: two simple Web applications using JSP and servlets (Claros, WfMS).

Planning

A very important activity!

Definition determines “why the experiment is conducted”.

Planning prepares for “how the experiment is conducted”.

We have to state clearly: research questions, context, subjects, variables, metrics, design of the experiment, …

The result of the experiment can be disturbed or even destroyed if it is not planned properly …

Planning phase overview

Experiment planning (diagram): Definition → Context selection → Hypothesis formulation → Variable selection → Selection of subjects → Experiment design → Instrumentation → Validity evaluation → Experiment design (the output of planning)

Planning activities (1)

Context selection. We have four dimensions: off-line vs. on-line, student vs. professional, toy vs. real problems, specific context (e.g. only a particular industry) vs. general context.

Hypothesis formulation. The hypothesis of the experiment is stated formally, including a null hypothesis and an alternative hypothesis.

Variables selection. Determine independent variables (inputs) and dependent variables (outputs). The effect of the treatments is measured in the dependent variable (or variables). Determine the values the variables actually can take.

Selection of subjects. In order to generalize the results to the desired population, the selection must be representative of that population. The size of the sample also impacts the results when generalizing.

Planning activities (2)

Experiment design. To draw conclusions from an experiment, we apply statistical analysis methods (tests) on the collected data to interpret the results. Statistical analysis methods depend on the chosen design (one factor with two treatments, one factor with more than two treatments, …)

Instrumentation. In this activity, guidelines are defined to guide the participants in the experiment. Materials (training, questionnaires, diagrams, …) are prepared and measurement procedures (metrics) are defined.

Validity evaluation. A fundamental question concerning results from an experiment is how valid the results are. External validity: can the result of the study be generalized outside the scope of our study?

Threats to validity. Compiling a list of possible threats …

Threats to validity

Examples:
- Violated assumptions of statistical tests.
- Researchers may influence the results by looking for a specific outcome.
- If the group is very heterogeneous, there is a risk that the variation due to individual differences is larger than the variation due to the treatment.
- Badly designed experiment (materials, …).
- Confounding factors.
- …

Hypothesis formulation

Two hypotheses have to be formulated:

1. H0. The null hypothesis. H0 states that there are no real underlying trends in the experiment; the only reasons for differences in our observations are coincidental. This is the hypothesis that the experimenter wants to reject with as high significance as possible.

2. H1. The alternative hypothesis. This is the hypothesis in favor of which the null hypothesis is rejected.

H0: a new inspection method finds on average the same number of faults as the old one.
H1: a new inspection method finds on average more faults than the old one.

Conallen vs. Pure UML: hypotheses

H01: When doing a comprehension task, the use of stereotyped reverse-engineered class diagrams (versus non-stereotyped reverse-engineered class diagrams) does not significantly affect the comprehension level.

Ha1: When doing a comprehension task, the use of stereotyped reverse-engineered class diagrams (versus non-stereotyped reverse-engineered class diagrams) significantly affects the comprehension level.

H02: When doing a maintenance task, the use of stereotyped reverse-engineered class diagrams does not significantly affect the effectiveness of the task.

Ha2: When doing a maintenance task, the use of stereotyped reverse-engineered class diagrams significantly affects the effectiveness of the task.

Variables selection

Before any design can start we have to choose the dependent and independent variables.

The independent variables (inputs) are those variables that we can control and change in the experiment. They describe the treatments and are thus the variables for which the effects should be evaluated.

The dependent variables (outputs) are the response variables that describe the effects of the treatments described by the independent variables.

Often there is only one dependent variable, and it should therefore be derived directly from the hypothesis.

Experiment (diagram): independent variables → Experiment → dependent variables

Confounding factors

Pay attention to the confounding factors!

A confounding factor is defined as a variable that can hide a genuine association between factors and confuse the conclusions of the experiment.

For example, it may be difficult to tell if a better result depends on the tool or the experience of the user of the tool …

C versus C++

Research question: Is C++ better than C?

The independent variable of interest in this study is the choice of programming language (C++ or C).

Dependent variables:
- Total time required to develop programs
- Total time required for testing
- Total defects
- …

Potential confounding factor: different experience of the programmers …

The experiment is not valid if confounding factors are present …

Standard design types

- One factor with two treatments
- One factor with more than two treatments
- Two factors with two treatments
- More than two factors, each with two treatments
- …

The design and the statistical analysis are closely related.

We decide the design type considering: the objects and subjects we are able to use, the hypothesis, and the measurements chosen.

One factor with two treatments (1)

We want to compare the two treatments against each other.

Example: to investigate if a new design method produces software with higher quality than the previously used design method.

Factor = design method. Treatments = new/old.

Completely randomized design: subjects 1 to 6 are each assigned (X) to exactly one of the two treatments, New design or Old design.

If we have the same number of subjects per treatment, the design is balanced.
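
A completely randomized, balanced assignment like the one described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `completely_randomized` and the fixed seed are hypothetical choices made here, not part of the slides, and the groups it prints are not the assignment used in the original table.

```python
import random


def completely_randomized(subjects, treatments, seed=None):
    """Randomly split subjects into equally sized groups, one per treatment."""
    subjects = list(subjects)
    if len(subjects) % len(treatments) != 0:
        raise ValueError("a balanced design needs the same number of subjects per treatment")
    rng = random.Random(seed)  # the seed only makes this illustration repeatable
    rng.shuffle(subjects)
    size = len(subjects) // len(treatments)
    # Consecutive slices of the shuffled list become the treatment groups.
    return {
        treatment: sorted(subjects[i * size:(i + 1) * size])
        for i, treatment in enumerate(treatments)
    }


assignment = completely_randomized(range(1, 7), ["new design", "old design"], seed=0)
for treatment, group in assignment.items():
    print(treatment, group)
```

Shuffling first and slicing afterwards guarantees that every subject receives exactly one treatment and that both groups have three subjects, which is what makes the design balanced.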

One factor with two treatments (2)

Example: to investigate if a new design method produces software with higher quality than the previously used design method.

The dependent variable can be the number of faults found in the development.

To understand if the new method is better than the old we use: .

Hypothesis:
H0: μ1 = μ2
H1: μ1 > μ2

μi = the average number of faults for treatment i
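
The slide does not preserve which statistical test is applied to these hypotheses. As a stand-in, the sketch below uses an exact one-sided permutation test, which is a different technique from the t-test mentioned later in the course and is chosen here only because it needs no external library. The fault counts are invented purely to show the mechanics; they are not data from any experiment.

```python
from itertools import combinations
from statistics import mean


def one_sided_permutation_test(group1, group2):
    """Exact p-value for H1: mean(group1) > mean(group2) under random relabelling."""
    pooled = list(group1) + list(group2)
    n1, n2 = len(group1), len(group2)
    observed = mean(group1) - mean(group2)
    total = sum(pooled)
    at_least_as_extreme = 0
    n_splits = 0
    # Enumerate every way of choosing which observations carry the "treatment 1" label.
    for idx in combinations(range(len(pooled)), n1):
        s1 = sum(pooled[i] for i in idx)
        diff = s1 / n1 - (total - s1) / n2
        n_splits += 1
        if diff >= observed - 1e-12:  # small tolerance for floating point noise
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_splits


# Hypothetical fault counts, invented only for illustration.
treatment_1 = [9, 8, 10]
treatment_2 = [5, 7, 6]
observed, p_value = one_sided_permutation_test(treatment_1, treatment_2)
print(observed, p_value)
```

With three subjects per treatment there are only 20 possible relabellings, so the smallest p-value this test can ever produce is 1/20. That is one concrete way to see why the sample size limits how strongly H0 can be rejected.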

Two factors with two treatments

The experiment gets more complex.

Example: to investigate the understandability of the design document when using structured or OO design based on one “good” and one “bad” requirements document.

Requirements | OO | Structured
Good | Subjects: 4, 6 | Subjects: 1, 7
Bad | Subjects: 2, 3 | Subjects: 5, 8

Factor A: design method; treatments: OO, structured
Factor B: requirements document; treatments: “good”, “bad”

We need more subjects …
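
The need for more subjects follows from the number of cells: every combination of treatments must be covered. The following sketch enumerates the cells of the 2x2 design and spreads eight subjects over them at random. The function name `factorial_assignment` and the seed are hypothetical, and the resulting groups will generally differ from the ones in the table above.

```python
import random
from itertools import product


def factorial_assignment(subjects, factors, seed=None):
    """Spread subjects evenly over every combination of factor treatments."""
    cells = list(product(*factors.values()))  # one cell per treatment combination
    subjects = list(subjects)
    if len(subjects) % len(cells) != 0:
        raise ValueError("number of subjects must be a multiple of the number of cells")
    rng = random.Random(seed)  # the seed only makes this illustration repeatable
    rng.shuffle(subjects)
    size = len(subjects) // len(cells)
    return {
        cell: sorted(subjects[i * size:(i + 1) * size])
        for i, cell in enumerate(cells)
    }


factors = {
    "design method": ["OO", "structured"],
    "requirements document": ["good", "bad"],
}
design = factorial_assignment(range(1, 9), factors, seed=1)
for cell, group in design.items():
    print(cell, group)
```

Two factors with two treatments each give four cells, so eight subjects yield only two observations per cell; each additional two-treatment factor doubles the number of cells again.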

Conallen vs. Pure UML: design experiment

Design | WfMS application | Claros application
Conallen | Yellow | Blue
Pure UML | Red | Green

We have divided the students into four groups. We have tried to make sure that ‘High’ and ‘Low’ ability people are uniformly distributed across the groups (first questionnaire).

Each group will undergo 2 different treatments.

This is one factor (the design) with two treatments (Conallen and pure UML).

We have used this strategy to double the number of observations (26 instead of 13), as if we had twice as many subjects.

Used when the number of subjects is small!

Design | WfMS application | Claros application
Conallen | Green | Red
Pure UML | Blue | Yellow
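
The two group tables can be checked mechanically, and the ability balancing can be sketched as well. In the code below, the dictionaries copy the group colours from the two tables; everything else is an assumption made for illustration: the helper names, the round-robin dealing used to balance ability, and the split of 6 ‘High’ and 7 ‘Low’ subjects (the slides do not report the actual split or the procedure used).

```python
import random

# Group colours copied from the two tables: (design, application) -> group.
FIRST_TABLE = {("Conallen", "WfMS"): "Yellow", ("Conallen", "Claros"): "Blue",
               ("Pure UML", "WfMS"): "Red", ("Pure UML", "Claros"): "Green"}
SECOND_TABLE = {("Conallen", "WfMS"): "Green", ("Conallen", "Claros"): "Red",
                ("Pure UML", "WfMS"): "Blue", ("Pure UML", "Claros"): "Yellow"}


def schedule_per_group(*tables):
    """Invert the tables: for each group, list its (design, application) pairs."""
    plan = {}
    for table in tables:
        for cell, group in table.items():
            plan.setdefault(group, []).append(cell)
    return plan


def is_counterbalanced(plan):
    """True if no group repeats a design or an application across its tasks."""
    return all(
        len({design for design, _ in cells}) == len(cells)
        and len({app for _, app in cells}) == len(cells)
        for cells in plan.values()
    )


def blocked_groups(ability, groups, seed=None):
    """Deal subjects round-robin into groups, one ability level at a time."""
    rng = random.Random(seed)
    result = {group: [] for group in groups}
    slot = 0
    for level in sorted(set(ability.values())):
        block = [subject for subject, a in ability.items() if a == level]
        rng.shuffle(block)  # random order inside each ability level
        for subject in block:
            result[groups[slot % len(groups)]].append(subject)
            slot += 1
    return result


plan = schedule_per_group(FIRST_TABLE, SECOND_TABLE)
print(is_counterbalanced(plan))

# Hypothetical ability labels for 13 subjects (the real split is not given).
ability = {f"s{i}": ("High" if i <= 6 else "Low") for i in range(1, 14)}
groups = blocked_groups(ability, ["Yellow", "Blue", "Red", "Green"], seed=3)
for colour, members in groups.items():
    print(colour, members)
```

Each group meets both designs and both applications exactly once, so neither a design nor an application is tied to a particular group.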

Metrics (dependent variables)

How to measure the collected data? Using metrics …

Metrics reflect the data that are collected from experiments; they are decided at “design time” of the experiment and computed after the entire experiment has ended.

Usually, they are derived directly from the research questions (or hypotheses).

Metrics: examples

Question: Does design method A produce software with higher quality than design method B?
Metric: number of faults found during development.

Question: Are OO design documents easier to understand than structured design documents?
Metric: percentage of questions that were answered correctly.

Question: Are Computer Science students more productive (as programmers) than electronic engineers?
Metric: number of lines of code divided by the total development time.

Conallen vs. Pure UML: understanding task

The questionnaire had 12 questions. Each answer was a list of pages/classes/interfaces. Example:

Question: Which controller classes are used to retrieve the users from the page referred to in question 4?
Correct answer: LoginController.java, CLUserController.java

How to evaluate these (real) answers?
1. LoginController.java, CLUserController.java (Correct!)
2. CLUser.java (Wrong!)
3. LoginController.java (?)
4. LoginController.java, CLUserController.java, CLUser.java (?)

How to compute the F-measure in our experiment?

Question: Which controller classes are used to retrieve the users from the page referred to in question 4?

Correct answer: LoginController.java, CLUserController.java

Answer 1: LoginController.java
Precision = 1/1 = 1 ; Recall = 1/2 = 0.5 ; F-measure = 0.67

Answer 2: LoginController.java, CLUserController.java, CLUser.java
Precision = 2/3 = 0.67 ; Recall = 2/2 = 1 ; F-measure = 0.8

Precision = |answers given ∩ correct answers| / |answers given|
Recall = |answers given ∩ correct answers| / |correct answers|
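
The two worked answers can be reproduced with a short function. The precision and recall definitions are the ones given above; the F-measure formula is not spelled out in this transcript, so the sketch assumes the usual harmonic mean, F = 2PR / (P + R), which does reproduce the values 0.67 and 0.8. The function name `precision_recall_f` is a hypothetical one chosen here.

```python
def precision_recall_f(given, correct):
    """Score one answer (a collection of file names) against the correct answer."""
    given, correct = set(given), set(correct)
    hits = len(given & correct)  # |answers given ∩ correct answers|
    precision = hits / len(given) if given else 0.0
    recall = hits / len(correct) if correct else 0.0
    # Assumed formula: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


correct = {"LoginController.java", "CLUserController.java"}
answer_1 = {"LoginController.java"}
answer_2 = {"LoginController.java", "CLUserController.java", "CLUser.java"}

for name, answer in [("Answer 1", answer_1), ("Answer 2", answer_2)]:
    p, r, f = precision_recall_f(answer, correct)
    print(name, round(p, 2), round(r, 2), round(f, 2))
```

An answer with no correct item at all, such as CLUser.java alone, scores 0 on all three measures, while a fully correct answer scores 1; partially correct answers fall in between instead of being forced into “correct” or “wrong”.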

Subject “X”

Question | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
P | 1 | 0.86 | 1 | 1 | 0.5 | 1 | 0 | 1 | 0.78 | 1 | 1 | 0
R | 1 | 1 | 1 | 1 | 1 | 0.83 | 0 | 1 | 1 | 1 | 1 | 0
F | 1 | 0.92 | 1 | 1 | 0.67 | 0.90 | 0 | 1 | 0.87 | 1 | 1 | 0

F-measure average = 0.781. Good work!
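
The average for subject “X” can be recomputed from the P and R rows. The sketch assumes the harmonic-mean definition of the F-measure (F = 2PR / (P + R), with F = 0 when both are 0) and works from the rounded table values, so individual F values may differ from the table in the second decimal.

```python
from statistics import mean

# P and R rows for subject "X", as listed in the table (questions 1 to 12).
precision = [1, 0.86, 1, 1, 0.5, 1, 0, 1, 0.78, 1, 1, 0]
recall = [1, 1, 1, 1, 1, 0.83, 0, 1, 1, 1, 1, 0]


def f_measure(p, r):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


f_scores = [f_measure(p, r) for p, r in zip(precision, recall)]
average_f = mean(f_scores)
print(round(average_f, 3))
```

The result agrees with the 0.781 reported for this subject, which supports reading the three rows as P, R and F in that order.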

Operation

When an experiment has been designed and planned, it must be carried out in order to collect the data.

The experimenter actually meets the subjects.

Treatments are applied to the subjects.

Even if an experiment has been perfectly designed and the data are analyzed with the appropriate methods, everything depends on the operation …

Operation Steps

Experiment operation (diagram): Experiment design → Preparation → Execution → Data validation → Experiment data

Preparation, execution, data validation

Preparation: before the experiment can be executed, all instruments must be ready (forms, tools, software, …). Participants must be trained to carry out the task.

Execution: subjects perform their tasks according to the different treatments, and data is collected.

Data validation: the experimenter must check that the data is reasonable and that it has been collected correctly. Have the participants understood the task? Have they taken part seriously in the experiment?

Analysis and interpretation

After collecting experimental data in the operation phase, we want to be able to draw conclusions based on this data.

Analysis and interpretation (diagram): Experiment data → Descriptive statistics → Data set reduction → Hypothesis testing → Conclusions

Descriptive statistics

Descriptive statistics deal with the presentation and numerical processing of a data set.

Descriptive statistics may be used to describe and graphically present interesting aspects of the data set.

Descriptive statistics may be used before carrying out hypothesis testing, in order to better understand the nature of the data and to identify abnormal data points (outliers).

Descriptive statistics

Measures of central tendency (mean, median, mode, …)

Measures of dispersion (variance, standard deviation, …)

Measures of dependency between variables

Graphical visualization (scatter plots, histograms, pie charts, …)
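
As a small illustration of the first three kinds of measure, the sketch below summarizes the per-question scores of subject “X” from the F-measure example earlier in these slides. Using Python's standard `statistics` module and a hand-written Pearson correlation (the helper `pearson` is a hypothetical name) is a choice made here, not something prescribed by the course.

```python
from math import sqrt
from statistics import mean, median, mode, stdev, variance

# Per-question scores of subject "X", copied from the earlier table.
precision = [1, 0.86, 1, 1, 0.5, 1, 0, 1, 0.78, 1, 1, 0]
recall = [1, 1, 1, 1, 1, 0.83, 0, 1, 1, 1, 1, 0]
f_measure = [1, 0.92, 1, 1, 0.67, 0.90, 0, 1, 0.87, 1, 1, 0]


def pearson(xs, ys):
    """Pearson correlation coefficient, one measure of dependency between variables."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return covariance / spread


summary = {
    "mean": mean(f_measure),          # central tendency
    "median": median(f_measure),
    "mode": mode(f_measure),
    "variance": variance(f_measure),  # dispersion (sample variance)
    "stdev": stdev(f_measure),
    "precision/recall correlation": pearson(precision, recall),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Here the median is higher than the mean because the two questions scored 0 pull the mean down, which is the kind of observation that descriptive statistics are meant to surface before any hypothesis test is run.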

Figures: a histogram (Frequency vs. Years) and a pie chart (Stations, etc. 40%; Train performance 21%; Equipment 15%; Personnel 14%; Schedules, etc. 10%).

Outlier analysis

An outlier is a data point that is much larger or much smaller than one would expect from looking at the other points.

A way to identify outliers is to draw scatter plots.

There are statistical methods to identify outliers.
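
The slides do not say which statistical method is meant. One common choice, used here purely as an example, is the interquartile-range rule: a point is flagged if it lies more than 1.5 times the interquartile range below the first quartile or above the third. The function name `iqr_outliers` is hypothetical, and the data are the F-measure values of subject “X” from the earlier table.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (position, value) pairs lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values, start=1) if v < low or v > high]


# F-measure per question for subject "X" (questions 1 to 12).
f_measure = [1, 0.92, 1, 1, 0.67, 0.90, 0, 1, 0.87, 1, 1, 0]
flagged = iqr_outliers(f_measure)
print(flagged)
```

For these values the rule flags questions 7 and 12, the two answers scored 0. Flagging a point is only the first step; whether it should be excluded is a separate decision.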

Scatter plot (figure): a cloud of points with one point, far from the others, marked as the outlier.

Data set reductions

When outliers have been identified, it is important to decide what to do with them.

If the outlier is due to a strange or rare event that will never happen again, the point can be excluded. Example: the task was not understood.

If the outlier is due to a rare event that may occur again, it is not advisable to exclude the point. There is relevant information in the outlier.

Example: a module that is implemented by inexperienced programmers.

Hypothesis testing

The hypotheses of the experiment are evaluated statistically.

Is it possible to reject a certain null hypothesis?

To answer this question we have to use statistical tests (t-test, ANOVA, chi-square, …)

next lessons!!

Presentation and package

Experiment presentation and package (diagram): Experiment → Write report → Experiment report
