Introduction to Experimental Design - University of...


Introduction to Experimental Design
Application to Gene Expression Microarrays

Kathleen Kerr
Biostatistics 577 / Statistics 577

University of Washington
Summer 2007

• Principles of statistical experimental design.

• Design and analysis of agricultural experiments.

• Introduction to gene expression microarrays.

• Experimental design for gene expression microarrays.

R.A. Fisher, 1890-1962

Statistician, experimentalist, geneticist

“To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.”

R.A. Fisher, Indian Statistical Congress, Sankhya, ca. 1938

“If I had to replicate my experiments, I could only do half as much.”

- Well-known Stanford molecular biologist, ca. 2000

Introduction to Experimental Design


Agricultural experiment to compare the yields of varieties of a crop.

Multiple blocks of land to use in the study.

Blocks of land vary in fertility, sunlight, rainfall, etc.

Experiments in Blocks

Case 1
[Figure: Block A holds only Variety 1; Block B holds only Variety 2]

Block effects and variety effects are confounded.

Case 2
[Figure: Blocks A and B each hold four plots, two of Variety 1 and two of Variety 2]

Block effects and variety effects are orthogonal (unconfounded) because blocks and varieties are balanced.

Case 3
[Figure: Block A holds Variety 1 twice and Variety 2 once; Block B holds Variety 1 once and Variety 2 twice]

“In between” case: Block effects and variety effects are partially confounded.
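The three cases can be checked numerically. The sketch below is not part of the original slides: it assumes a +1/-1 coding for blocks and varieties, and it assumes the plot layouts as read off the three figures. It computes the correlation between the block column and the variety column, where a magnitude of 1 means fully confounded and 0 means orthogonal.

```python
import numpy as np

def block_variety_correlation(blocks, varieties):
    """Correlation between +/-1 coded block and variety columns."""
    b = np.where(np.asarray(blocks) == "A", 1.0, -1.0)
    v = np.where(np.asarray(varieties) == 1, 1.0, -1.0)
    return float(np.corrcoef(b, v)[0, 1])

# Assumed layouts: one entry per plot (block label, variety number).
case1 = block_variety_correlation(list("AB"), [1, 2])
case2 = block_variety_correlation(list("AAAABBBB"), [1, 2, 1, 2, 1, 2, 2, 1])
case3 = block_variety_correlation(list("AAABBB"), [1, 1, 2, 1, 2, 2])

print("Case 1:", case1)  # fully confounded
print("Case 2:", case2)  # orthogonal
print("Case 3:", case3)  # partially confounded
```

Under these assumed layouts the correlation is 1 for Case 1, 0 for Case 2, and strictly between the two for Case 3.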

The data contain information about the relative variety yields because varieties are grown on the same block of land.

There is a duality that can be overlooked:

There is information about the blocks of land because they have varieties in common.

One can simultaneously estimate the relative yields of the varieties and the relative effects of the blocks of land.

The tool to do this is a simple linear model.

The precision of estimates of variety differences depends on the experimental design.

Let $y_{ik}$ be the yield from variety k on block i.

Model: $y_{ik} = \mu + B_i + V_k + \varepsilon_{ik}$

$\mu$ is the mean yield
$B_i$ is the effect of block i
$V_k$ is the effect of variety k
$\varepsilon_{ik}$ is error
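As an illustration of fitting this model, here is a minimal sketch using ordinary least squares in NumPy. The function name `fit_block_variety_model`, the 0-based integer coding of blocks and varieties, and the noiseless toy data are all assumptions made for the example. The effects are centred afterwards so that they sum to zero.

```python
import numpy as np

def fit_block_variety_model(block, variety, y):
    """Least-squares fit of y = mu + B_i + V_k + error.

    block, variety: 0-based integer labels, one per observation.
    Returns block and variety effects, each centred to sum to zero.
    """
    block = np.asarray(block)
    variety = np.asarray(variety)
    y = np.asarray(y, dtype=float)
    n_blocks, n_varieties = block.max() + 1, variety.max() + 1
    rows = np.arange(len(y))
    # Indicator (dummy) coding: intercept, one column per block, one per variety.
    X = np.zeros((len(y), 1 + n_blocks + n_varieties))
    X[:, 0] = 1.0
    X[rows, 1 + block] = 1.0
    X[rows, 1 + n_blocks + variety] = 1.0
    # X is rank deficient; lstsq returns one least-squares solution.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    b = beta[1:1 + n_blocks]
    v = beta[1 + n_blocks:]
    return b - b.mean(), v - v.mean()

# Toy data (hypothetical): three blocks of size two, three varieties, no noise.
block = [0, 0, 1, 1, 2, 2]
variety = [0, 1, 2, 1, 0, 2]
true_b = np.array([1.0, 0.0, -1.0])
true_v = np.array([2.0, -0.5, -1.5])
y = 10.0 + true_b[block] + true_v[variety]

est_b, est_v = fit_block_variety_model(block, variety, y)
print(est_b, est_v)
```

Because every variety is linked to every other through shared blocks, the centred block and variety effects are recovered simultaneously from the same six observations.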

Example
Goal: Compare three varieties.
Resources: Three blocks of size two.
How should you put the varieties onto the blocks?

Design 1: “Balanced design”
Block 1: 1, 2    Block 2: 3, 2    Block 3: 1, 3

Design 2: “Reference design”
Block 1: R, 1    Block 2: R, 2    Block 3: R, 3

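Balanced design
Block 1: 1, 2    Block 2: 3, 2    Block 3: 1, 3

$y = X\beta + \varepsilon$, with $y_{ik} = \mu + B_i + V_k + \varepsilon_{ik}$

Identifiability constraints: $B_1 + B_2 + B_3 = 0$, $V_1 + V_2 + V_3 = 0$

Columns of X: $\mu$, $B_1$, $B_2$, $V_1$, $V_2$. Rows: (Block 1, Variety 1), (Block 1, Variety 2), (Block 2, Variety 3), (Block 2, Variety 2), (Block 3, Variety 1), (Block 3, Variety 3).

$$
X = \begin{pmatrix}
1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & -1 & -1 \\
1 & 0 & 1 & 0 & 1 \\
1 & -1 & -1 & 1 & 0 \\
1 & -1 & -1 & -1 & -1
\end{pmatrix}
$$

Error df = 1

$\mathrm{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}$

$\mathrm{Var}(V_1 - V_2) = \mathrm{Var}(V_1 - V_3) = \mathrm{Var}(V_2 - V_3) = 4\sigma^2/3$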

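Reference design
Block 1: R, 1    Block 2: R, 2    Block 3: R, 3

Identifiability constraints: $B_1 + B_2 + B_3 = 0$, $V_1 + V_2 + V_3 + 3V_R = 0$

Columns of X: $\mu$, $B_1$, $B_2$, $V_1$, $V_2$, $V_3$. Rows: (Block 1, R), (Block 1, Variety 1), (Block 2, R), (Block 2, Variety 2), (Block 3, R), (Block 3, Variety 3).

$$
X = \begin{pmatrix}
1 & 1 & 0 & -1/3 & -1/3 & -1/3 \\
1 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & -1/3 & -1/3 & -1/3 \\
1 & 0 & 1 & 0 & 1 & 0 \\
1 & -1 & -1 & -1/3 & -1/3 & -1/3 \\
1 & -1 & -1 & 0 & 0 & 1
\end{pmatrix}
$$

Error df = 0

$\mathrm{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}$

$\mathrm{Var}(V_1 - V_2) = \mathrm{Var}(V_1 - V_3) = \mathrm{Var}(V_2 - V_3) = 4\sigma^2$

$4\sigma^2$ compared to $4\sigma^2/3$ for the balanced design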
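The two variances and the error degrees of freedom can be reproduced numerically. The following is a sketch, not the lecture's own code: it uses indicator (dummy) coding rather than the sum-to-zero coding above, and a pseudo-inverse of X'X, which gives the same variance for any estimable contrast such as a variety difference. The helper names and the string labels for varieties are assumptions.

```python
import numpy as np

def design_matrix(blocks):
    """Indicator-coded X for y = mu + B_i + V_k + error.

    blocks: list of tuples, each listing the varieties grown on one block.
    Returns X and the sorted list of variety labels.
    """
    varieties = sorted({v for blk in blocks for v in blk})
    n_b, n_v = len(blocks), len(varieties)
    rows = []
    for i, blk in enumerate(blocks):
        for v in blk:
            row = np.zeros(1 + n_b + n_v)
            row[0] = 1.0                               # mu
            row[1 + i] = 1.0                           # block i
            row[1 + n_b + varieties.index(v)] = 1.0    # variety v
            rows.append(row)
    return np.array(rows), varieties

def variety_difference_variance(blocks, a, b):
    """Var(V_a - V_b) in units of sigma^2."""
    X, varieties = design_matrix(blocks)
    n_b = len(blocks)
    c = np.zeros(X.shape[1])
    c[1 + n_b + varieties.index(a)] = 1.0
    c[1 + n_b + varieties.index(b)] = -1.0
    return float(c @ np.linalg.pinv(X.T @ X) @ c)

def error_df(blocks):
    """Number of observations minus the rank of X."""
    X, _ = design_matrix(blocks)
    return X.shape[0] - np.linalg.matrix_rank(X)

balanced = [("1", "2"), ("3", "2"), ("1", "3")]
reference = [("R", "1"), ("R", "2"), ("R", "3")]

for name, design in [("balanced", balanced), ("reference", reference)]:
    print(name,
          "Var(V1 - V2)/sigma^2 =", variety_difference_variance(design, "1", "2"),
          "error df =", error_df(design))
```

The same functions accept any list of blocks, so other candidate layouts can be compared in the same way.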

Balanced design: Block 1: 1, 2    Block 2: 3, 2    Block 3: 1, 3
Reference design: Block 1: R, 1    Block 2: R, 2    Block 3: R, 3

Why is the balanced design more efficient?

Intuition: In the reference design, varieties 1, 2, 3 are compared only via the reference. In the balanced design, each pair of varieties is compared directly as well as indirectly.
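The intuition can be made concrete with a short calculation. This is a sketch that assumes independent errors with common variance $\sigma^2$ and uses only within-block differences, so that block effects cancel. Here $y_{ik}$ is the yield of variety k on block i, with the block layouts given for the two designs.

```latex
% Balanced design: direct and indirect estimates of V_1 - V_2
\begin{align*}
\text{direct:}\quad   & y_{11} - y_{12},
  & \mathrm{Var} &= 2\sigma^2 \\
\text{indirect:}\quad & (y_{31} - y_{33}) + (y_{23} - y_{22}),
  & \mathrm{Var} &= 4\sigma^2 \\
\text{combined:}\quad & \text{inverse-variance weighted average},
  & \mathrm{Var} &= \left(\frac{1}{2\sigma^2} + \frac{1}{4\sigma^2}\right)^{-1}
                  = \frac{4\sigma^2}{3}
\end{align*}

% Reference design: only the indirect route through R exists
\begin{align*}
(y_{11} - y_{1R}) - (y_{22} - y_{2R}),
  \qquad \mathrm{Var} = 4\sigma^2
\end{align*}
```

The direct comparison alone already has variance $2\sigma^2$, half that of the reference design, and the indirect route through variety 3 reduces it further.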

Introduction to Gene Expression Microarrays


The nucleus of each cell of our body has chromosomes. These contain our genes.

Genes are made up of DNA (deoxyribonucleic acid): linear sequences of four letters or bases: A, G, C, and T.

DNA has the famous double-helix structure discovered by Watson and Crick. The double helix is formed by complementary base-pairing. That is, A and T always pair together, and G and C always pair together.

...ATGATGATCCAGC...

...TACTACTAGGTCG...
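The pairing rule is simple enough to express as a lookup. A minimal sketch follows (the function name `complement` is chosen here for illustration); it maps each base to its partner and reproduces the second strand from the first.

```python
# Base-pairing rule: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand):
    """Return the complementary strand, read in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)

top = "ATGATGATCCAGC"
bottom = complement(top)
print(top)
print(bottom)
```

Applying `complement` twice returns the original strand, which is what makes each strand a template for the other.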

Illustration: Post-genome Informatics, Minoru Kanehisa, Oxford University Press, 2000

The four-base alphabet of DNA is really a recipe to make things.

More specifically, via the genetic code DNA contains the instructions for making proteins.

1. DNA is transcribed into messenger RNA.
2. mRNA is translated into protein.

Illustration: Post-genome Informatics, Minoru Kanehisa, Oxford University Press, 2000

All of the cells in our body have the same DNA.

Why are they different from each other?

Each type of cell expresses a different set of genes. Two cells that express the same gene may express it at different levels.


Some Goals of Gene Expression Studies

• Discover gene function – “guilt by association” Genes with similar patterns of expression may be in similar pathways

• Distinguish cancer cells from normal cells on the molecular level (for further understanding of cancer, or to identify diagnostic or prognostic markers, or to identify drug targets)

• Characterize complex diseases at the molecular level

Gene Expression Microarray

Idea: instead of directly measuring the amount of protein corresponding to each gene, measure the amount of message to make each protein, the mRNA.

Each spot on the array contains single-stranded DNA representing one particular gene.

[Figure: cDNA microarray, showing mRNA, cDNA, and the DNA microarray. Re-created from Brown and Botstein, Nature Genetics Supplement, 1999]


Quantitative problems in microarrays

One of the goals of a microarray study is to estimate the relative expression of the genes in the different RNAs.


• There is variation in the amount of DNA from spot to spot on the arrays.

• The relative “red” and “green” signal from a spot is interesting because the sample that contained more transcript should produce higher signal.

• By taking ratios (log ratios) of “red” and “green” for a spot we are taking “within spot” estimates of expression differences (just like block design!)
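The "within spot" idea can be sketched in a few lines. The example assumes, purely for illustration, that the signal in each channel is the product of the amount of DNA on the spot and the transcript level in the sample. The function names and the numbers are hypothetical.

```python
import math

def spot_signals(dna_amount, treated_level, control_level):
    """Red and green signal for one spot, assuming a multiplicative spot effect."""
    red = dna_amount * treated_level
    green = dna_amount * control_level
    return red, green

def log_ratio(red, green):
    """Base-2 log ratio of the two channels for one spot."""
    return math.log2(red / green)

# Same gene on three spots carrying very different amounts of DNA.
ratios = []
for dna_amount in (0.5, 1.0, 4.0):
    red, green = spot_signals(dna_amount, treated_level=8.0, control_level=2.0)
    ratios.append(log_ratio(red, green))
    print(dna_amount, red, green, ratios[-1])
```

The raw red and green signals change with the amount of DNA on the spot, but the log ratio does not, in the same way that a within-block difference removes the block effect.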

Experimental Factors in a Microarray Study

A   Arrays      i = 1,...,I
D   Dyes        j = 1,2
V   Varieties   k = 1,...,K
G   Genes       g = 1,...,N

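Example: “Dye Swap” Experiment

Two “varieties.”

Experimental Design: “Dye Swap”

Array    1           2
Red      Treatment   Control
Green    Control     Treatment

[Figure: Control and Treatment joined by two arrows in opposite directions, one per array, with ends marked R and G]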

Microarray Main Effects and interpretation

A   Arrays      i = 1,...,I
D   Dyes        j = 1,2
V   Varieties   k = 1,...,K
G   Genes       g = 1,...,N

Effects Involving Gene

• main effects G

• array-by-gene interaction (spots) AG

• dye-by-gene interaction DG

• variety-by-gene interaction VG

$(VG)_{kg} - (VG)_{k'g}$ = change in gene expression

Dye-swap: Pretend we have just two spots (two genes).

[Figure: Array 1 and Array 2, each with a spot for Gene 1 and a spot for Gene 2, showing red signal and green signal]

Then the design matrix for the main effects looks like:

Array   Gene     µ    A    D    V    G
1       1        1    1    1    1    1
1       1        1    1   -1   -1    1
1       2        1    1    1    1   -1
1       2        1    1   -1   -1   -1
2       1        1   -1    1   -1    1
2       1        1   -1   -1    1    1
2       2        1   -1    1   -1   -1
2       2        1   -1   -1    1   -1

Note that every pair of columns is orthogonal.

Every column other than µ has four 1’s and four -1’s.

Add some two-way interactions:

 µ    A    D    V    G   AG   DG   VG
 1    1    1    1    1    1    1    1
 1    1   -1   -1    1    1   -1   -1
 1    1    1    1   -1   -1   -1   -1
 1    1   -1   -1   -1   -1    1    1
 1   -1    1   -1    1   -1    1   -1
 1   -1   -1    1    1   -1   -1    1
 1   -1    1   -1   -1    1   -1    1
 1   -1   -1    1   -1    1    1   -1

Every column remains orthogonal to every other column.

What if we tried to add other interactions?

 AD   DVG
  1    1
 -1    1
  1   -1
 -1   -1
 -1   -1
  1   -1
 -1    1
  1    1

AD looks exactly like V, and DVG looks exactly like AG: the design matrix is no longer full rank.

Confounding Structure of a Dye-Swap Experiment

Each experimental effect is confounded with one other effect, its alias. Non-aliased effects are orthogonal.

AVG ~ DG     DVG ~ AG     ADG ~ VG

ADVG ~ G     AD ~ V     AV ~ D     DV ~ A

ADV ~ µ
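The alias structure can be verified by building the columns directly. The sketch below assumes the two-gene dye swap with +1/-1 coding and takes the variety column to be the product of the array and dye columns, which is one way to encode the swap. The helper `column` and the dictionary names are illustrative.

```python
import itertools
import numpy as np

# One row per (array, dye, gene) combination, each coded +1 / -1.
runs = np.array(list(itertools.product([1, -1], repeat=3)))
A, D, G = runs.T
V = A * D  # dye swap: the variety is determined by the array and the dye

main = {"A": A, "D": D, "V": V, "G": G}

def column(effect):
    """Design column for an effect such as 'ADG' (empty string means mu)."""
    col = np.ones(len(runs), dtype=int)
    for name in effect:
        col = col * main[name]
    return col

# Alias pairs as listed for the dye-swap experiment ('' stands for mu).
aliases = [("AVG", "DG"), ("DVG", "AG"), ("ADG", "VG"), ("ADVG", "G"),
           ("AD", "V"), ("AV", "D"), ("DV", "A"), ("ADV", "")]
alias_ok = {pair: bool(np.array_equal(column(pair[0]), column(pair[1])))
            for pair in aliases}

# The eight non-aliased effects: their columns should be mutually orthogonal.
effects = ["", "A", "D", "V", "G", "AG", "DG", "VG"]
X = np.column_stack([column(e) for e in effects])
gram = X.T @ X

print(alias_ok)
print(gram)
```

With this coding, each aliased pair yields identical columns, while X'X for the eight retained effects is diagonal.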

Structure of the Design Space: Arrays, Dyes, and Varieties

• A and D are orthogonal

• D and V can be confounded, partially confounded, or orthogonal. This depends on the chosen experimental design.

• A and V will be partially confounded if there are more than 2 varieties

Structure of the Design Space: Gene-specific Effects

• G orthogonal to A, D, and V

• AG orthogonal to DG

• AG and VG are partially confounded if there are more than 2 varieties

• DG and VG can be confounded, partially confounded, or orthogonal. This depends on the chosen experimental design.


Dye and Variety

D and V … and so also DG and VG … will be orthogonal if the design is “even.” That is, each RNA is labeled with both dyes and the design uses each labeled sample equally often.

This is important because the variety-by-gene interactions (VG) are the effects we care about.

“Dye Swap” Data

Data are from a drug-treated variety and a control. Mouse liver RNA.

[Figure: dye-swap layout, Arrays 1 and 2 with Control and Treatment swapped between the red (R) and green (G) channels]

• Only 78 genes spotted on each array.

• Each gene spotted 4 times per array.

• (Atypically small experiment)

[Figure: Dye-Swap Data, Array 1 and Array 2]

Analysis of Variance

Source           SS        df     MS
Array            0.11      1      0.11
Dye              3.35      1      3.35
Variety          45.69     1      45.69
Gene             917.53    77     11.92
Spot             145.19    289    0.46
Variety*Gene     66.46     77     0.86
Dye*Gene         21.30     77     0.28
Residual         19.72     820    0.024
Adjusted Total   1219.35   1343

SS = Sum of Squares, df = degrees of freedom, MS = Mean Square = SS/df

Statistical model underlying the analysis of variance:

Let $y_{ijkgr}$ be the signal* from the rth spot for Gene g from Array i, Dye j, and Variety k.

$y_{ijkgr} = \mu + A_i + D_j + V_k + G_g + (VG)_{kg} + (AG)_{igr} + (DG)_{jg} + \varepsilon_{ijkgr}$

We assume that there is independent, random error $\varepsilon_{ijkgr}$ with mean 0.

*On log scale or similar

$(DG)_{1g} - (DG)_{2g}$ for $g = 1, \dots, 78$


Q: What would we have concluded if we had run 1 array instead of 2 arrays?

The “design” of an experiment has many different facets

1. The samples selected for comparison

2. The specification of the units to which the samples will be applied

3. The way the samples are allocated to the experimental units

4. The specifications of the measurements to be made


Representation of Microarray Designs

RNA samples are represented as rectangles or circles (for example, Mouse 1 RNA, Mouse 2 RNA, Mouse 3 RNA, Mouse 4 RNA).

Microarrays are represented by arrows, where one end is the “red” channel and the opposite end is the “green” channel.

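Example: “Dye Swap”

[Figure: dye-swap design drawn as two arrows running in opposite directions between Control and Treatment, one arrow per array]

“Reference” Design

The block layout R 1 | R 2 | R 3 becomes a diagram in which each sample of interest (1, 2, 3) is joined by an array to the reference sample R.

“Loop” Design

The block layout 1 2 | 3 2 | 1 3 becomes a loop in which arrays join samples 1, 2, and 3 in a cycle.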


Reference Design

• Uses v arrays to study v samples

• V and D confounded; VG and DG confounded

• May lack error degrees of freedom

• Inefficient* estimation

Loop Design

• Uses v arrays to study v samples

• V and D orthogonal; VG and DG orthogonal

• Increased error degrees of freedom

• Efficient* estimation

*Efficiency measured according to A-optimality. The loop design is more efficient for 10 or fewer samples.

A-Optimality

For a given number of varieties and a given number of arrays, the A-optimal design minimizes the average variance over all comparisons $(VG)_{kg} - (VG)_{k'g}$.

(Just one possible criterion, may or may not be appropriate.)
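To see the criterion in action, the sketch below compares a loop and a reference design that each use v arrays for v samples. It rests on simplifying assumptions that are not from the slides: a single gene, one log ratio per array with unit variance, and no dye effect. The function names are illustrative. Each array (a, b) contributes one log ratio estimating $V_a - V_b$, and the average variance is taken over all pairs of samples of interest.

```python
import itertools
import numpy as np

def average_pairwise_variance(n_nodes, arrays, samples):
    """Average Var(V_a - V_b) over pairs of `samples`.

    arrays: list of (a, b) node pairs, one log ratio per array.
    Variances are in units of the variance of a single log ratio.
    """
    X = np.zeros((len(arrays), n_nodes))
    for row, (a, b) in enumerate(arrays):
        X[row, a], X[row, b] = 1.0, -1.0
    info_pinv = np.linalg.pinv(X.T @ X)
    variances = []
    for a, b in itertools.combinations(samples, 2):
        c = np.zeros(n_nodes)
        c[a], c[b] = 1.0, -1.0
        variances.append(c @ info_pinv @ c)
    return float(np.mean(variances))

def loop_design(v):
    """v samples joined in a cycle by v arrays."""
    arrays = [(k, (k + 1) % v) for k in range(v)]
    return average_pairwise_variance(v, arrays, range(v))

def reference_design(v):
    """v samples, each hybridized once against a reference (node 0)."""
    arrays = [(k, 0) for k in range(1, v + 1)]
    return average_pairwise_variance(v + 1, arrays, range(1, v + 1))

for v in (3, 6, 10, 11, 12):
    print(v, round(loop_design(v), 4), round(reference_design(v), 4))
```

Under these assumptions the loop has the smaller average variance up to 10 samples, the two designs tie at 11, and the reference design is better beyond that.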

[Figure: Relative Efficiency plotted against K, without spot effects and with spot effects]

Optimal Designs

For v varieties, these designs are A-optimal among all even designs using v+2 arrays.

“Design A” for HW1

Optimal Designs

For v varieties, these designs are A-optimal among all designs using 2*v arrays.

Example: Study of Aging. Mutant mice live 30% longer than wild-type. The investigators plan to take RNA from young and old mice of each genetic strain. Thus there are four varieties but in a 2-by-2 factorial structure. The most interesting comparison is between the two genetic strains but this is not the only interesting comparison.

         Wild-type   Mutant
Young    A           B
Old      C           D

I. Original Design

[Figure: Design I drawn as arrows among samples A, B, C, and D]

Variance of Comparisons*

Comparison        Contrast           Variance
GEN within AGE    A-B or C-D         1
GEN across AGE    ½(A+C)-½(B+D)      ½
AGE within GEN    A-C or B-D         Not possible
AGE across GEN    ½(A+B)-½(C+D)      Not possible

*The numbers are only relative, to be used to compare designs.

Alternative Designs

[Figure: Designs II, III, and IV, each drawn as arrows among the Young and Old, Wild-type and Mutant samples]

Variance Comparisons

Design   GEN within AGE   GEN across AGE   AGE within GEN   AGE across GEN
I.       1                ½                Not possible     Not possible
II.      1.5              1                1.5              1
III.     1.5              ½                2                1
IV.      2                1                1.5              ½

Generally, we want efficient designs for an appropriate criterion, but designing an experiment also involves many practical considerations:

• Robust properties: What happens to the efficiency if we lose an array? If we lose some spots on every array?

• Extendibility: What if we decide to add more samples to the study later?

• Simplicity of execution: Will we be able to keep track of the assays we need to do?

• Useful sub-designs: What if we want to analyze the data on just a subset of samples?


“Replication”

First level of replication: Genes spotted multiple times per array.

Second level of replication: Multiple arrays to study the same samples.

These two levels are repeated measures (subsampling) rather than replication. They capture measurement error.

Contrast with replication in the classical sense: Random sampling of individuals from the populations of interest or randomly assigning individuals to treatment groups. This captures biological variability. Without this kind of replication, inference is limited to the particular RNAs in the study.

[Figure: Treatments A and B; replicates RNA1, RNA2, RNA3, RNA4; dyes R and G; arrays 1, 2, 3, 4]

Example: Comparing 2 “Treatments”

Replication: N mice from each of two groups.
Limited number of arrays, say 2N.

Reference design: N mice in each group, 2N arrays.

$\mathrm{Var}(\text{Tmt} - \text{Ctl}) = 4\sigma^2/N$

vs.

Loop design: N mice in each group, 2N “looped” assays on 2N arrays.

$\mathrm{Var}(\text{Tmt} - \text{Ctl}) = \sigma^2/N$


I have presented the design problems in terms of measuring the difference in means between the treated and control mice.

Is Tmt – Ctl really the quantity of interest?

No.

The quantity of interest is really the difference in population means μtmt – μctl. We are only interested in the sample means Tmt – Ctl for the purpose of making inference about μtmt – μctl.


“Alternating Loop” Design

Let $\sigma^2$ be error variance as before, but now let $\tau^2$ be population variability.

n individuals are sampled from each population. The variance of the estimated difference from the reference design is:

$(4\sigma^2 + 2\tau^2)/n$

For the alternating loop design the variance is:

$(\sigma^2 + 2\tau^2)/n$

Conclusion: The loop strategy is always better, but if $\tau^2 \gg \sigma^2$, not by much.
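The conclusion is easy to tabulate. The sketch below simply evaluates the two variance formulas for a range of values of the population variability relative to the error variance. The function names and the chosen values are illustrative.

```python
def reference_variance(sigma2, tau2, n):
    """Variance of the estimated difference under the reference design."""
    return (4.0 * sigma2 + 2.0 * tau2) / n

def alternating_loop_variance(sigma2, tau2, n):
    """Variance of the estimated difference under the alternating loop design."""
    return (sigma2 + 2.0 * tau2) / n

sigma2, n = 1.0, 5
ratios = {}
for tau2 in (0.0, 1.0, 10.0, 100.0):
    ref = reference_variance(sigma2, tau2, n)
    loop = alternating_loop_variance(sigma2, tau2, n)
    ratios[tau2] = ref / loop
    print(f"tau2/sigma2 = {tau2:6.1f}  reference = {ref:8.3f}  "
          f"loop = {loop:8.3f}  ratio = {ratios[tau2]:.3f}")
```

The ratio starts at 4 when there is no population variability and approaches 1 as the population variability dominates, so extra biological replicates matter more than the choice of layout in that regime.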

Summary: Experimental Design

• Design determines the information content of data, the analyses that are possible, and the quality of the results.

– Are effects of interest confounded with other effects?

– How precisely can one estimate the effects of interest?

– Can one estimate error in order to make inference?

• Choosing an experimental design means balancing multiple, competing objectives: precision vs. cost, practical constraints, considerations of robustness, etc.

Summary: Experimental Design

• Design is the most important part of conducting a good experiment

– A botched analysis can be re-done, but a botched design can mean things are hopeless.
