Experimental Design · users.ece.utexas.edu/~perry/education/382c/L08.pdf
382C Empirical Studies in Software Engineering Lecture 8
© 2000-present, Dewayne E Perry
Experimental Design
Dewayne E Perry, ENS 623
Problems in Experimental Design
True Experimental Design
Goal: uncover causal mechanisms
Primary characteristic: random assignment of sampling units to treatments
  If assignment is not random, the design is only quasi-experimental
  Without randomization, some systematic biases cannot be ruled out
Types of designs
  Between-subjects designs: each sampling unit is subjected to one treatment
  Within-subjects designs: each sampling unit receives two or more treatments
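The two design types can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the unit names, the condition labels, and both helper functions are hypothetical, and randomizing the treatment order in the within-subjects case is an added assumption.

```python
import random

def assign_between_subjects(units, treatments, seed=None):
    """Randomly assign each sampling unit to exactly one treatment."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units round-robin so group sizes differ by at most one.
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

def assign_within_subjects(units, treatments, seed=None):
    """Every unit receives all treatments; only the order is randomized."""
    rng = random.Random(seed)
    return {unit: rng.sample(treatments, k=len(treatments)) for unit in units}

developers = [f"dev{i}" for i in range(1, 9)]  # hypothetical sampling units
conditions = ["treatment", "control"]          # hypothetical conditions

between = assign_between_subjects(developers, conditions, seed=1)
within = assign_within_subjects(developers, conditions, seed=1)
```

In `between`, each developer appears in exactly one group; in `within`, each developer maps to an ordering of all conditions.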
The “No Effect” Hypotheses
The test of the hypothesis that the treatment is entirely without effect has a special place
Reason: in a randomized experiment, this test may be performed virtually without assumptions of any kind, ie, relying only on random assignment
The contribution of randomization is clearest when expressed in terms of the test of no effect
  This does not mean that such tests are of greater practical importance
  It sets randomized and non-randomized aspects in sharper contrast
  Inferences in non-randomized experiments, by contrast, require assumptions
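One way to see why the test of no effect needs only random assignment is an exact randomization (permutation) test. The sketch below is an illustration under stated assumptions: the data are invented, the test statistic (absolute difference in group means) is one choice among many, and `randomization_test` is a hypothetical helper.

```python
from itertools import combinations

def randomization_test(treated, control):
    """Exact two-sided randomization test of the 'no effect' hypothesis.

    If the treatment has no effect, each unit's response is the same in
    either group, so every split of the pooled responses into groups of
    these sizes was equally likely under random assignment.
    """
    pooled = list(treated) + list(control)
    n_t = len(treated)
    n_c = len(pooled) - n_t
    total = sum(pooled)

    def mean_diff(t_sum):
        return t_sum / n_t - (total - t_sum) / n_c

    observed = abs(mean_diff(sum(treated)))
    count = 0
    extreme = 0
    # Enumerate every possible assignment of n_t units to treatment.
    for idx in combinations(range(len(pooled)), n_t):
        count += 1
        t_sum = sum(pooled[i] for i in idx)
        if abs(mean_diff(t_sum)) >= observed - 1e-12:
            extreme += 1
    return extreme / count

# Hypothetical data: faults found per inspection under two conditions.
treated = [12, 14, 15, 17]
control = [8, 9, 10, 11]
p_value = randomization_test(treated, control)
```

The p-value is the fraction of possible assignments that give a difference at least as large as the one observed; no distributional assumption enters the calculation.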
The “No Effect” Hypotheses
To say a treatment has no effect is to say that each unit would exhibit the same value of the response whether assigned to treatment or to control
A change in response indicates the treatment has some effect
Tests of the significance and the size of effects are discussed later
Alternative Hypotheses
As opposed to the null hypothesis, experimental (alternative) hypotheses take a stand
  Different treatments behave differently (two-tailed prediction)
  Predict what direction the expected differences take (one-tailed prediction)
Hypotheses and Theory
Theory is a large-scale map; different areas represent general principles
Hypotheses are like small sectional maps that focus on specific areas
Conceptual similarities
  Range from very explicit to very vague
  Fall back on hidden assumptions, regulative principles
  Give direction to our observations
Some hypotheses spring from experimental observations
  Don’t know where they will go
Others from theory
  Conceptual hypotheses
  Rely on previous studies, theories
Creating Hypotheses
Defining terms
  For observations to have value, abstractions have to be made concrete
  Two types of definitions
    Operational: x is defined in terms of test y under conditions z
    Theoretical: abstract constructs used
  At some point, definitions must be operational for the experiment
Hypotheses are predictive statements about the expected outcome
They call for a test and embed a conclusion
Explicit statements are de rigueur
When comparisons are predicted, they have to be spelled out
Stating the Hypothesis
Concomitant variation: X is a direct function of Y
Comparative: other things being equal ….
H0 and H1 (null and alternative) are mutually exclusive and exhaustive
  Usually a specific H0 and a general H1
  Try to reject H0
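As an illustration, a comparative hypothesis about two group means might be written as follows. The symbols $\mu_T$ and $\mu_C$ (mean response under treatment and under control) are assumed here and do not appear in the slides; the first pair is a two-tailed prediction, the second a one-tailed one. In both pairs H0 and H1 cannot both hold and together cover every possibility.

```latex
% Two-tailed: specific H0, general H1
H_0:\ \mu_T = \mu_C \qquad H_1:\ \mu_T \neq \mu_C

% One-tailed: H1 predicts the direction of the difference
H_0:\ \mu_T \le \mu_C \qquad H_1:\ \mu_T > \mu_C
```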
Hypothesis Testing Errors
2 types of errors: I and II
  Type I: rejecting the null hypothesis when it is true
    Psychologically the more important risk
    Think there is a relationship when there is not
    Waste time in blind alleys
  Type II: accepting the null hypothesis when it is false
    Deny a relationship when there is one
    In effect, reject useful results
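The two error types trade off against each other, which a small exact calculation can show. The setting is hypothetical and not from the lecture: 20 independent trials, H0 says the success probability is 0.5, the alternative assumed true is 0.8, and H0 is rejected when at least `threshold` trials succeed.

```python
from math import comb

def binom_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def error_rates(n, threshold, p_null, p_alt):
    """Error rates of the rule 'reject H0 when successes >= threshold'."""
    type_1 = binom_tail(n, p_null, threshold)     # reject H0 although it is true
    type_2 = 1 - binom_tail(n, p_alt, threshold)  # retain H0 although it is false
    return type_1, type_2

# Hypothetical decision rules: a looser and a stricter rejection threshold.
alpha, beta = error_rates(n=20, threshold=15, p_null=0.5, p_alt=0.8)
stricter_alpha, stricter_beta = error_rates(n=20, threshold=17, p_null=0.5, p_alt=0.8)
```

Raising the threshold lowers the Type I rate and raises the Type II rate, so guarding harder against blind alleys costs more missed relationships.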
Variables
Independent and dependent
  Dependent: effect in which the researcher is interested
  Independent: cause of the effect
  Any event or condition can be conceptualized as either an independent or a dependent variable
Concerned about the effects of X on Y
  Ie, the causal effects of one on the other
  Both in the lab and in the field
Independent Variables
No single or standard way of classifying variables
Useful categorization (not mutually exclusive):
  Biological
    Eg, effects of gender in mentoring developers
  Environmental
    Eg, schedule pressure and fault insertion
  Hereditary
    Eg, IQ effects on complexity
  Previous training and experience
    Eg, effects of first programming languages
  Maturity
    Eg, age and elegance of program structures
Independent Variables
Manipulated variables
  In an experiment, one expects manipulation
  Intentional and systematic variation
Naturally occurring variables
  Manipulated by real-life experience
  Eg, desk versus meeting inspections
  Context: normal or exceptional conditions
Static group variables
  Pre-existing groups with identified characteristics:
    Organismic variables: sex, age, weight, etc
    Status variables: education, occupation, marital status
    Attribute variables: diagnoses, personality traits, behaviors
  Cannot be manipulated, but are selected to gain proper contrast groups
Independent Variables
Analogous to experimental treatment
When used as a dependent variable, may make inferences as to how the group acquired its characteristics
  Eg, overweight lowers self-esteem
  Lower self-esteem causes overweight
Risk of causal inferences
  Tempting but risky
  Dependent variable not an accurate descriptor
  At best an association, connection, relationship, correlation
Example of weight/esteem experiments
  Case 1: high-calorie diet -> check esteem
    Ethical problems
  Case 2: overweight + low calorie -> raise esteem
    Doesn’t prove overweight -> low esteem
  Case 3: overweight + success -> lower weight
    High esteem -> low weight doesn’t prove low esteem -> overweight
Must be careful about the logic
Independent Variables
Unidirectional paths
  Eg, height and self-esteem: cannot switch
  Fixed by logic of antecedents and consequences
Multiple variables: income, age -> truancy, discipline
  Need 2x2 analyses
  Question: does income discriminate truancy and discipline problems
One-way, non-causal enabling relationships
  Eg, income – IQ -> income
  But not vice versa
Two-way, sequential causation
  Eg, success-failure and self-confidence
  Eg, baseball players’ slumps, hitting streaks
-> Causation established by experimentation
  Manipulation
  Using static variables is descriptive/relational
  But not experimental
Independent Variables
Establishing levels of independent variables
  First decision: categorical or continuous
    Eg, age is continuous, sex is categorical
  If continuous data: whole range, dichotomous, or graduated
    Risky as information is lost
    May be theoretical reasons for categories
  If hypothesis is stated in categorical terms, then should be consistent
  If a relationship, not appropriate to break into dichotomies or nominal categories if variable is continuous
  For theoretical or rational, not statistical reasons
Examine how the levels of categories are established
  Should be consistent with hypothesis
  Possible groupings: extremes, ranges of categories, median split (as in IQ)
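The information lost by dichotomizing a continuous variable can be seen in a small worked example. The data below are invented for illustration, and `pearson_r` is a hypothetical helper that computes the ordinary Pearson correlation.

```python
from math import sqrt
from statistics import mean, median

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: years of experience (continuous) and a design-quality score.
experience = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
quality = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

# Median split: recode the continuous variable as low (0) or high (1).
cut = median(experience)
split = [1 if x > cut else 0 for x in experience]

r_full = pearson_r(experience, quality)   # uses the whole range
r_split = pearson_r(split, quality)       # uses only low versus high
```

With these numbers the correlation is weaker after the split than over the full range, because differences within the low and high groups are discarded.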
Independent Variables
Continuous full-range distribution
  Sometimes linear correlations, sometimes not
  Eg, learning (perhaps), visual acuity (not)
Theory-driven levels
  Hypothesis stated consistently with current theory
Strength of independent variable (magnitude of effect)
  Extreme groupings tend to magnify effect
  Increasing magnitude may reduce generality
  Can more easily argue weaker to stronger (eg, stress) and have great generality
Levels of independent variable should match hypothesis
Dependent Variables
Many possible flavors, literally thousands
Eg, learning new design techniques
  Direction of observed change
  Amount of change
  The ease with which change is effected
  Persistence of changes over time
2 general classes
  Diffusion: fan out
    Eg, technology insertion and adoption
  Hierarchical variations: changes in ranking
    Eg, changing roles in organizational structures
Practical Application
Distinguish two kinds of control groups
  No treatment
    Ok for physical effects
    Problems where belief may confound
  Placebo
    Rule out belief effects
The practical decision of which to use is not easy
  Question of greatest interest
  Experience or knowledge of the general area
  Easy to make mistakes in a new area
Basic Designs
Design 1
  One-shot case study: X O
  H- M- I(NR) S-
  Deficient in terms of any reasonable controls
    History may be alternative explanation
    Maturation not controlled for
    May be changes in instruments or judges
    Unknown state of participants
  Instrumentation not a factor: no pre-measurement
Design 2
  One-group pretest: O X O
  Slight improvement, but no comparison
Basic Designs
Design 3: Solomon Design
  True experimental, 4 groups
    I    R  O  X  O
    II   R     X  O
    III  R  O     O
    IV   R        O
  H+ M+ I+ S+
  All well controlled for
Basic Designs
Solomon provides elegant illustration of logic of control
  Pretest performance scores in I & III are used to estimate pretest scores in II & IV
    Requires a leap of faith even if the scores are the same
    Even if they differ greatly, II & IV could be equal to the mean of I & III
  Use estimated pre-test scores to enrich factorial analysis of variance of post-test scores
  Tells us if there is any confounding of pre-test and treatment
Basic Designs
Pre/post test effects:           I+  II-  III+  IV-
Experimental treatment effects:  I+  II+  III-  IV-
Pretest & X sensitization:       I+  II-  III-  IV-
Extraneous effects:              I+  II+  III+  IV+
Pretest sensitization = $(Y_{I} - Y_{II}) - (Y_{III} - Y_{IV})$
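The pretest sensitization contrast, (Y_I - Y_II) - (Y_III - Y_IV), can be computed directly from the four post-test means. The sketch below uses invented group means. The two main-effect contrasts are the usual 2x2 factorial comparisons, added here as an assumption and not taken from the slide, and `solomon_contrasts` is a hypothetical helper.

```python
def solomon_contrasts(post):
    """Contrasts on mean post-test scores of the four Solomon groups.

    Groups: I = pretest + treatment, II = treatment only,
            III = pretest only,      IV = neither.
    """
    # Treated groups (I, II) versus untreated groups (III, IV).
    treatment = (post["I"] + post["II"]) / 2 - (post["III"] + post["IV"]) / 2
    # Pretested groups (I, III) versus unpretested groups (II, IV).
    pretest = (post["I"] + post["III"]) / 2 - (post["II"] + post["IV"]) / 2
    # Does the pretest change the effect of the treatment?
    sensitization = (post["I"] - post["II"]) - (post["III"] - post["IV"])
    return {"treatment": treatment, "pretest": pretest, "sensitization": sensitization}

# Hypothetical mean post-test scores for groups I to IV.
post_means = {"I": 80, "II": 70, "III": 62, "IV": 60}
contrasts = solomon_contrasts(post_means)
```

A sensitization contrast near zero suggests that pretesting did not change how units responded to the treatment; a large value suggests pretest and treatment are confounded.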
Basic Designs
External Validity
  All 3 suffer from the possible confounding of selection and treatment
Design 3 (Solomon):
  Controls for confounding of treatment and pre-test sensitization
Design 4:
    I    R  O  X  O
    III  R  O     O
  Internal validity (IV): H+ M+ I+ S+
  Deficient in pre-test sensitization, eg, a problem in attitude change or learning experiments
Design 5:
    II   R     X  O
    IV   R        O
  Internal validity (IV): H+ M+ I+ S+
  Avoids pretest sensitization issues
Basic Designs
Within-subjects designs
  Each subject receives all treatments in turn
  Useful in SWE/CS: repeated measures design
Advantages:
  Same number of subjects used more effectively
  Each sampling unit serves as its own control
  Can examine relationships longitudinally
Difficulties:
  Sensitization problems
    Learning etc
  Order of treatments may produce differences in successive measures
Another threat to internal validity (IV) in longitudinal studies
  Regression towards the mean
    When linear relationship is imperfect
    Eg, overweight people appear to lose weight, low IQs appear to become brighter
    Observed when variables consist of the same measure taken at two points in time and the correlation r < 1
Basic Designs
Solve threat by standard Z score
  A raw score from which the sample mean has been subtracted and the difference then divided by the standard deviation
Regression equation:
  The estimated standard score of Y is the XY correlation r times the standard score of X
  If the correlation is perfect, the Z scores will be equivalent; if r < 1 they will not
  $Z_{Y} = r_{XY} Z_{X}$
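The regression equation, Z_Y = r_XY Z_X, can be worked through numerically to show regression towards the mean. The scores below are invented, and the helpers are hypothetical. Using the population standard deviation (`pstdev`) is a choice made here so that r equals the mean product of Z scores; the slide does not say which standard deviation is meant.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standard scores: (raw score - mean) / standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def pearson_r(xs, ys):
    """Correlation as the mean product of paired Z scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical data: the same measure taken at two points in time.
time1 = [60, 70, 80, 90, 100]
time2 = [70, 65, 85, 80, 100]

r = pearson_r(time1, time2)
z_time1 = z_scores(time1)
# Regression equation: predicted Z score at time 2 is r times the Z score at time 1.
predicted_z_time2 = [r * z for z in z_time1]
```

Because r < 1 for these data, every predicted Z score lies closer to zero than the Z score it was predicted from: extreme scores at time 1 are expected to be less extreme at time 2 even with no treatment at all.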