6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power...

Post on 19-Dec-2015


04/18/23 1

Microarray Data Analysis

04/18/23 2

Copyright notice

• Many of the images in this PowerPoint presentation are the work of other people. The copyright belongs to the original authors. Thanks!

04/18/23 3

Gene Expression Matrix

• After image processing, we obtain a data matrix.
• The final gene expression matrix (on the right) is needed for higher-level analysis and mining.

[Figure: spot/image quantitations (spots × images) on the left; gene expression levels (genes × samples) on the right.]

04/18/23 4

Missing data in microarray

• Randomly missing values
– the fact that the value is missing is independent of its value
– methods are available for dealing with randomly missing data

• Non-randomly missing values
– the fact that the value is missing is dependent on its value (i.e. the value is missing because expression is low, or because expression is high)
– available methods do not adequately deal with non-randomly missing data

04/18/23 5

Missing data in microarray

• Randomly missing data:
– spotting problems
– dust
– fingerprints
– poor hybridization
– inadequate resolution
– fabrication errors (e.g. scratches)
– image corruption
– omission of suspect values*

* could also be non-random

• Non-randomly missing data:
– low expression, e.g. background exceeds signal
– censored data (values above the maximum observable intensity)

[Figure: expression across arrays, with the maximum observable intensity marked.]

04/18/23 6

Dealing with missing data

• The problem:
– many analyses require complete data matrices
  • classification algorithms
  • clustering algorithms
  • dimension-reduction methods

• Solutions:
– remove all genes (rows) and arrays (columns) with missing values
– estimate missing values

04/18/23 7

Imputation methods

• Naive approaches
– missing values = row (gene) average
– missing values = column (array) average

• Smarter approaches have been proposed:
– K-nearest neighbors
– regression-based methods
– singular value decomposition
  • like principal components for matrices with unequal numbers of rows and columns

04/18/23 8

K-Nearest Neighbors (KNN)

[Figure: expression across arrays for one gene, with a randomly missing datum marked "?".]

• choose the k genes that are most similar to the gene with the missing value (MV)
• estimate the MV as the weighted mean of the neighbors
• considerations:
– number of neighbors
– distance metric
– normalization step
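The KNN steps above can be sketched as follows. This is a minimal illustration, not the implementation used by any particular package: the function name `knn_impute`, the use of `None` for missing values, the root-mean-square Euclidean distance over shared arrays, and the inverse-distance weights are all assumptions made for the example.

```python
import math

def knn_impute(matrix, k=10):
    """Fill None entries with a distance-weighted mean of the k nearest genes (rows)."""
    imputed = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        for a, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for h, other in enumerate(matrix):
                # A neighbor must be a different gene with an observed value on array a.
                if h == g or other[a] is None:
                    continue
                shared = [(x, y) for x, y in zip(row, other)
                          if x is not None and y is not None]
                if not shared:
                    continue
                # Euclidean distance, averaged over the arrays both genes have.
                dist = math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))
                candidates.append((dist, other[a]))
            candidates.sort(key=lambda c: c[0])
            neighbors = candidates[:k]
            if not neighbors:
                continue  # no usable neighbor: leave the value missing
            # Assumed weighting: closer genes count more (inverse distance).
            weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
            total = sum(w * v for w, (_, v) in zip(weights, neighbors))
            imputed[g][a] = total / sum(weights)
    return imputed

genes = [[1.0, 2.0, None], [1.0, 2.0, 3.0], [1.1, 2.1, 3.2], [9.0, 9.0, 9.0]]
filled = knn_impute(genes, k=2)
```

With k = 2 the distant fourth gene is ignored and the missing value is estimated almost entirely from the gene whose observed values are identical.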

04/18/23 9

KNN - considerations

• parameter k
– 10 usually works (5-15)

• distance metric
– Euclidean distance
– correlation-based distance

[Figure: expression across arrays, with the missing value marked "?".]

04/18/23 10

Ordinary Least Squares (OLS)

• regression-based approach
• also uses k neighbors
• algorithm:
– choose k neighbors (Euclidean or correlation; normalize or not)
– the gene with the MV is regressed over the neighbor genes (one at a time, i.e. simple regression)
– for each neighbor, the MV is predicted from the regression model
– the MV is imputed as the weighted average of the k predictions

04/18/23 11

Singular Value Decomposition (SVD)

• goal:
– use the strongest patterns of correlation within the data matrix to estimate the missing values
• algorithm:
– set MVs to the row average (need a starting point)
– decompose the expression matrix into orthogonal components, “eigengenes”
– use the proportion, p, of eigengenes corresponding to the largest eigenvalues to reconstruct the MVs from the original matrix (i.e. improve your estimate)
– use an EM approach to iteratively improve the estimates of the MVs until convergence
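A rough sketch of the iterative idea, under strong simplifying assumptions: only the single strongest eigengene is kept (rank 1), it is found by power iteration rather than a full SVD, and a fixed number of rounds stands in for a convergence check. The names `top_singular_triplet` and `svd_impute_rank1` are hypothetical.

```python
def top_singular_triplet(m, iters=100):
    """Leading singular value and vectors of m by power iteration (assumes m is non-zero)."""
    rows, cols = len(m), len(m[0])
    v = [1.0] * cols
    sigma, u = 0.0, [0.0] * rows
    for _ in range(iters):
        u = [sum(m[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        norm_u = sum(x * x for x in u) ** 0.5
        u = [x / norm_u for x in u]
        v = [sum(m[r][c] * u[r] for r in range(rows)) for c in range(cols)]
        sigma = sum(x * x for x in v) ** 0.5
        v = [x / sigma for x in v]
    return sigma, u, v

def svd_impute_rank1(matrix, rounds=200):
    """Start from row averages, then repeatedly refill MVs from the rank-1 reconstruction."""
    missing = [(r, c) for r, row in enumerate(matrix)
               for c, x in enumerate(row) if x is None]
    filled = []
    for row in matrix:
        observed = [x for x in row if x is not None]
        row_mean = sum(observed) / len(observed)
        filled.append([row_mean if x is None else x for x in row])
    for _ in range(rounds):
        sigma, u, v = top_singular_triplet(filled)
        for r, c in missing:
            filled[r][c] = sigma * u[r] * v[c]  # improved estimate of the MV
    return filled

# Rows are multiples of (1, 2, 4); the hidden entry should be recovered as 4.
data = [[1.0, 2.0, None], [2.0, 4.0, 8.0], [3.0, 6.0, 12.0]]
completed = svd_impute_rank1(data)
```

Real data are not exactly low rank, so in practice the number of eigengenes kept and the stopping rule are user-supplied choices.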

04/18/23 12

Other Imputation Methods:

• Local Singular Value Decomposition (LSVD)
– combines KNN and SVD
– algorithm:
  • start with an n genes × m arrays matrix
  • select k neighbor genes (Euclidean or correlation; normalize or not)
  • perform SVD on the k × m array matrix

• Partial Least Squares (PLS) regression
– uses all genes and available data from the target gene

• Factor Analysis (FA) regression

04/18/23 13

Which imputation method to use?

• KNN is the most widely-used; current standard

• many alternative choices: OLS, SVD, LSVD, PLS, (FA)

• algorithms require user-supplied parameters: k, p, distance metric, etc.

• No set of rules for choosing which method to use

04/18/23 14

Characteristics of data that may affect choice of imputation method

• dimensionality

• percentage of values missing

• experimental design (time series, case/control, etc.)

• entropy - patterns of correlation in data

• others?

04/18/23 15

Data Analysis

• Determine differential gene expression
– Identify up- and down-regulated genes
– Gene lists produced using the Factor 2 Rule, t-test based methods

• Co-regulation of genes
– Clustering algorithms

• Identify genes that regulate other genes
– Networks (e.g. Bayesian)

04/18/23 16

Methods to Decide Differential Expression

• Compare treatment to the control
– The fold approach
– The t-test
– Variations of the t-test
  • SAM: significance analysis of microarrays

• Compare several treatments
– ANOVA: analysis of variance
– MAANOVA: http://www.jax.org/staff/churchill/labsite/software/anova/index.html

04/18/23 17

Fold Change

• Measure ratios of gene expression levels.

• Ratio = Ti/Ci: the ratio of measured treatment intensity to control intensity for the ith spot

• The log2 ratio treats up- and down-regulated genes equally
– e.g. when looking for genes with more than 2-fold variation in expression, a 2-fold increase gives +1 and a 2-fold decrease gives -1

04/18/23 18

The Fold Approach

• In northern analysis, a 2-fold change can be seen with bare eyes

• Thus biologists tend to use 2-fold as the threshold of differential expression

• Up-regulated: mean(x1, x2) > 1, where x1 and x2 are replicate log2 ratios

• Down-regulated: mean(x1, x2) < -1

04/18/23 19

Illustration of the benefit of using Log ratios

04/18/23 20

Two-fold up-regulation

• Problems with this approach:
– Only identifies the most changed genes.
– Also identifies noise and highly variable genes.
– The ratio is unstable when the denominator is small.

04/18/23 21

Ratios are unstable

• Initial measurements:

30/60 = 0.5

500/1000 = 0.5

• Add random noise (+15 numerator and -15 denominator):

45/45 = 1.0

515/985 = 0.52
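The arithmetic on this slide, together with the symmetry of log2 ratios, can be checked with a few lines. The helper names `perturbed_ratio` and `log2_ratio` are made up for this sketch; the ±15 noise value is the one used above.

```python
import math

def perturbed_ratio(treatment, control, noise=15):
    """Ratio after adding noise to the numerator and subtracting it from the denominator."""
    return (treatment + noise) / (control - noise)

def log2_ratio(treatment, control):
    """Log2 fold change: +1 is a 2-fold increase, -1 a 2-fold decrease."""
    return math.log2(treatment / control)

low = perturbed_ratio(30, 60)       # weak spot: 0.5 jumps to 1.0
high = perturbed_ratio(500, 1000)   # strong spot: 0.5 barely moves
up = log2_ratio(2.0, 1.0)
down = log2_ratio(1.0, 2.0)
```

The same absolute noise doubles the ratio of the low-intensity spot but changes the high-intensity spot by only about 4%.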

04/18/23 22

Types of tests

• Standard t-test assumes the samples are drawn from normal distributions with equal variance and different means.

• Welch’s t-test allows for different variances between classes.

• Mann-Whitney (Wilcoxon) converts the data to ranks, and does not assume a particular distribution.

• Permutation test computes the t-statistic for many random permutations of the labels.

04/18/23 23

The Student’s t-test

• For sample sizes less than 30 we have to make use of a t-distribution

• We make use of this distribution in the two-sample Student’s t-test.

• This test is used to test whether two samples come from distributions with the same means.

• The samples are assumed to come from Gaussian (normal) distributions.

• The two samples must have similar dispersions

04/18/23 24

The Student’s t distribution

• The Student’s t distribution
– is mound shaped
– is symmetrical about zero
– is more widely dispersed than the standard normal distribution
– its actual shape is dependent on the sample size
• different t distributions are identified by their degrees of freedom (df), where df = n-1

04/18/23 25

The student’s t distribution (cont.)

[Figure: Student’s t distributions for df = 15, df = 30 and df = 120 (≈ z), plotted against standard errors from -4 to 4; not to scale.]

Mean and Median

• The mean is the most common measure of the location of a set of points.

• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.

04/18/23 26

Range and Variance

• Range is the difference between the max and min.
• The variance or standard deviation sx is the most common measure of the spread of a set of points.
• Because of outliers, other measures are often used.

04/18/23 27

04/18/23 28

Statistical Analysis

[Figure: two distributions, labeled control group mean and treatment group mean.]

Is there a difference?

04/18/23 29

What does difference mean?

[Figure: three pairs of distributions with medium, high and low variability.]

The mean difference is the same for all three cases.

04/18/23 30

What does difference mean?

[Figure: three pairs of distributions with medium, high and low variability.]

Which one shows the greatest difference?

04/18/23 31

What does difference mean?

• a statistical difference is a function of the difference between means relative to the variability
• a small difference between means with large variability could be due to chance
• like a signal-to-noise ratio

[Figure: the low-variability case. Which one shows the greatest difference?]

04/18/23 32

So we estimate

$$ \frac{\text{signal}}{\text{noise}} = \frac{\text{difference between group means}}{\text{variability of groups}} = \frac{\bar{X}_T - \bar{X}_C}{SE(\bar{X}_T - \bar{X}_C)} = t\text{-value} $$

04/18/23 33

Probability - p

• With t we obtain a probability p and decide whether or not to reject the null hypothesis
• Reject the null hypothesis if p < 0.05 (or a stricter threshold)
• The difference between means (groups) is more significant the smaller p is

04/18/23 34

Important notes on two sample comparisons

• Type I errors (false positive)
– we accept a difference as real when it is not (at the 95% confidence level we are, of course, wrong 5% of the time)
– we can make the test more stringent (a higher confidence level, i.e. a smaller α) to decrease these errors
• Type II errors (false negative)
– if we make the test more stringent, we risk missing some real differences
• Convention is that we should reduce Type I errors and be conservative
• Both can be reduced by increasing the sample size

04/18/23 35

Paired and unpaired tests

• There are different formulas for the t-test depending on whether we have paired or unpaired data
– Paired: making observations of N individuals in two different situations
  • In this situation we can consider the difference for each individual rather than calculate separate means and SEs for the two effects
– Unpaired: two separate samples drawn from the same parent population
  • Can have different sample sizes

04/18/23 36

Tails

• Two-tailed: Do set A and set B come from different distributions?

• One-tailed: Does set A come from a distribution with larger mean than set B?

• This corresponds to finding differentially regulated genes versus finding up-regulated genes.

04/18/23 37

Selecting genes with a t-test

$$ t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v}{n_1} + \dfrac{v}{n_2}}} $$

where
• μi = mean expression value in class i
• ni = number of examples in class i
• v = pooled variance across both classes

References: http://mathworld.wolfram.com/Studentst-Distribution.html; Zar. Biostatistical Analysis. 1999.

04/18/23 38

Standard T Test: An example

• Observed gene expression values:

Treatment A: 0.45 0.57 1.02 0.97

Treatment B: 1.50 2.07 0.51 1.63

• Compute mean:

mean (A) = 3.01 / 4 = 0.7525

mean (B) = 5.71 / 4 = 1.4275

04/18/23 39

Pooled variance

• The standard t-test assumes samples are drawn from distributions with the same variance.

• Pooled variance

= (SS1 + SS2) / (n1 + n2 - 2)

= (0.243675 + 1.300875) / (4 + 4 - 2)

= 0.2574

SS: sum of squared deviations from the class mean

04/18/23 40

Selecting genes with a t-test

t = (0.7525 - 1.4275) / sqrt(0.2574/4 + 0.2574/4) = -1.8815, so |t| = 1.8815

$$ t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v}{n_1} + \dfrac{v}{n_2}}} $$

04/18/23 41

If the Sample Variances are Unlikely to be Equal

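• Use Welch’s t-test:

$$ t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}}} $$

• degrees of freedom:

$$ \mathrm{df} = \frac{(A + B)^2}{\dfrac{A^2}{n_x - 1} + \dfrac{B^2}{n_y - 1}} $$

• where

$$ A = \frac{s_x^2}{n_x}, \qquad B = \frac{s_y^2}{n_y} $$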

04/18/23 42

Welch’s approximation

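• t-test (pooled variance v):

$$ t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v}{n_1} + \dfrac{v}{n_2}}}, \qquad |t| = 1.8815 $$

• Welch’s (separate variances v1, v2):

$$ t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v_1}{n_1} + \dfrac{v_2}{n_2}}} $$

|t| = |0.7525 - 1.4275| / sqrt(0.08123/4 + 0.43363/4) = 1.8815

Here v1 = 0.243675/3 = 0.08123 and v2 = 1.300875/3 = 0.43363. With equal sample sizes (n1 = n2 = 4) the two statistics coincide; only the degrees of freedom differ.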

04/18/23 43

Degrees of freedom

• For the t-test, dof = n1 + n2 - 2.

• For Welch’s approximation, it is not so simple. Let Ai = vari / ni. Then

$$ \mathrm{dof} = \left\lfloor \frac{(A_1 + A_2)^2}{\dfrac{A_1^2}{n_1 - 1} + \dfrac{A_2^2}{n_2 - 1}} \right\rfloor $$
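The worked example can be reproduced with a short script. This is a sketch using only the formulas stated above; the function names `pooled_t` and `welch_t` are chosen for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sum_sq(xs):
    """Sum of squared deviations from the mean (the SS term)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def pooled_t(a, b):
    """Standard two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    v = (sum_sq(a) + sum_sq(b)) / (n1 + n2 - 2)  # pooled variance
    t = (mean(a) - mean(b)) / math.sqrt(v / n1 + v / n2)
    return t, n1 + n2 - 2

def welch_t(a, b):
    """Welch's t statistic with the floor of the approximate degrees of freedom."""
    n1, n2 = len(a), len(b)
    a1 = sum_sq(a) / (n1 - 1) / n1  # A1 = var1 / n1
    a2 = sum_sq(b) / (n2 - 1) / n2  # A2 = var2 / n2
    t = (mean(a) - mean(b)) / math.sqrt(a1 + a2)
    dof = (a1 + a2) ** 2 / (a1 ** 2 / (n1 - 1) + a2 ** 2 / (n2 - 1))
    return t, math.floor(dof)

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]
t_std, dof_std = pooled_t(treatment_a, treatment_b)
t_welch, dof_welch = welch_t(treatment_a, treatment_b)
```

For this balanced example the statistics agree, but Welch’s approximation leaves fewer degrees of freedom, so its p-value is larger.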

04/18/23 44

Non-parametric p-value

• The t-test assumes the t-distribution
– a parametric method
– compute the test statistic
– use the t pdf to determine the p-value

• A non-parametric method
– data are labeled as X and Y
– compute the test statistic with the true labels
– randomly permute the individual labels 10000 times, and compute the test statistic each time
– find the rank of the true test statistic among the test statistics of the random permutations
– for example, if there are 10 permutations with test statistics larger than the true test statistic, then the p-value is 10/10000 = 0.001
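The label-permutation procedure can be sketched as below. The choices here are assumptions for illustration: the absolute difference of group means is used as the test statistic (any statistic could be substituted), ties with the observed value are counted as at least as extreme, and the seed is fixed only to make the run repeatable.

```python
import random

def permutation_p_value(x, y, n_perm=10000, seed=0):
    """Fraction of random relabelings whose statistic is at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly permute the individual labels
        px, py = pooled[:len(x)], pooled[len(x):]
        stat = abs(sum(px) / len(px) - sum(py) / len(py))
        if stat >= observed - 1e-12:  # small tolerance for floating-point ties
            hits += 1
    return hits / n_perm

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]
p_value = permutation_p_value(treatment_a, treatment_b)
```

With only four values per group there are just 70 distinct relabelings, so the p-value cannot be very small no matter how many random permutations are drawn.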

04/18/23 45

Mann-Whitney u-test

• Mann-Whitney, also known as Wilcoxon, is a non-parametric test.

• Begin by converting to ranks:

Treatment A: 0.45 0.57 1.02 0.97

Treatment B: 1.50 2.07 0.51 1.63

Treatment A: 1 3 5 4

Treatment B: 6 8 2 7

04/18/23 46

Mann-Whitney u statistic

• The u statistic is

$$ U = \max\left( n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1,\; n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2 \right) $$

where Ri is the sum of the ranks in class i.

• U = 16 + 10 - 13 = 13
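A small sketch that reproduces the ranks and the U value for the example data. It assumes there are no tied values (ties would need averaged ranks), and the function names are illustrative.

```python
def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def mann_whitney_u(a, b):
    """U as the maximum of the two class-wise statistics, following the formula above."""
    n1, n2 = len(a), len(b)
    r = ranks(list(a) + list(b))
    r1, r2 = sum(r[:n1]), sum(r[n1:])  # rank sums per class
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) // 2 - r2
    return max(u1, u2)

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]
all_ranks = ranks(treatment_a + treatment_b)
u_stat = mann_whitney_u(treatment_a, treatment_b)
```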

04/18/23 47

Permutation test

04/18/23 48

Cost-benefits analysis

• t-test assumes both samples are drawn from the same normal distribution.

• Welch’s approximation allows the samples to be drawn from different normals.

• Mann-Whitney makes no assumption about the distribution.

• The tests, as listed, yield decreasing power.
• The permutation test gives the most flexibility in choosing a test statistic that reflects prior knowledge, but it can be computationally expensive for small p-values.

04/18/23 49

Multiple testing correction

• On an array of 10,000 spots, a p-value of 0.0001 may not be significant.

• For significance of 0.05 with 10,000 spots, you need a p-value of 0.05 / 10,000 = 5 × 10^-6.

04/18/23 50

Family-wise Error-rate

• FWER
• Chance of any false positives
• Assume 0.01 significance level for one gene
• Multiply by the number of genes
• Many false positives
• Bonferroni correction: divide 0.01 by the number of genes
• Bonferroni is conservative because it assumes that all genes are independent.

04/18/23 52

False discovery rate

• The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.

• False positive rate: percentage of non-differentially expressed genes that are flagged.

• False discovery rate: percentage of flagged genes that are not differentially expressed.

                                Flagged    Not flagged
Not differentially expressed    5 FP       33 TN
Differentially expressed        13 TP      5 FN

FDR = FP / (FP + TP) = 5/18 = 27.8%
FPR = FP / (FP + TN) = 5/38 = 13.2%

04/18/23 53

Bonferroni vs. FDR

• Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive.

• FDR is the proportion of false positives among the genes that are flagged as differentially expressed.

04/18/23 54

Controlling the FDR

• Order the unadjusted p-values p1 ≤ p2 ≤ … ≤ pm.

• To control FDR at level α, find

$$ j^* = \max\left\{ j : p_j \le \frac{j}{m}\,\alpha \right\} $$

• Reject the null hypothesis for j = 1, …, j*.
• This approach is conservative if many genes are differentially expressed.

(Benjamini & Hochberg, 1995)
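The step-up rule can be written in a few lines. This sketch follows the formula for j* directly; the function name and the example p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected when controlling FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    j_star = 0
    for j, idx in enumerate(order, start=1):
        if p_values[idx] <= j * alpha / m:
            j_star = j  # keep the largest j that satisfies the bound
    return sorted(order[:j_star])

# The third-smallest p-value meets its bound (0.036 <= 0.0375),
# so the two smaller ones are rejected as well even though they miss their own bounds.
example = [0.030, 0.035, 0.036, 0.900]
rejected = benjamini_hochberg(example, alpha=0.05)
none_rejected = benjamini_hochberg([0.020, 0.040, 0.045, 0.900], alpha=0.05)
```

The example illustrates why the rule takes the maximum j rather than stopping at the first failure.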

04/18/23 55

q-value

• The p-value for a particular gene G is the probability that a randomly generated expression profile would be as or more extremely differentially expressed.

• The q-value for a particular gene G is the proportion of false positives among all genes that are as or more extremely differentially expressed.

• Equivalently, the q-value is the minimal FDR at which this gene appears significant.

04/18/23 56

Q-value software

http://faculty.washington.edu/~jstorey/qvalue/

04/18/23 57

SAM

Significance analysis of microarrays applied to the ionizing radiation response. Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu. Proc. Natl. Acad. Sci. USA, Vol. 98, Issue 9, 5116-5121, April 24, 2001

04/18/23 58

Abstract

• Method for gene filtering: find genes that change significantly across samples
• Significance Analysis of Microarrays (SAM) assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements.
• For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR).

04/18/23 59

Introduction

• Suitable for oligo, cDNA, protein arrays

• Does not normalize the data!

• Challenge:
– methods based on conventional t tests provide the probability (P) that a difference in gene expression occurred by chance. For an array with 10000 genes, a significance level of alpha = 0.01 would identify 100 genes by chance.
– Experiments are expensive.

04/18/23 60

Introduction

• Solution based on SAM:

– assimilate a set of gene-specific t tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene.

– Instead of more replicates, generate permutations of the data (mix the labels)

• Genes with scores greater than a threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements.

• The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set. To demonstrate its utility, SAM was used to analyze a biologically important problem: the transcriptional response of lymphoblastoid cells to ionizing radiation (IR).

04/18/23 61

Motivating Experiment

Human cell lines: 1, 2
Treatment: Irradiated (I), Unirradiated (U)

One RNA sample for each combination of cell line and treatment.

04/18/23 62

Motivating Experiment

Human cell line    Irradiated (I)    Unirradiated (U)
1                  I1A  I1B          U1A  U1B
2                  I2A  I2B          U2A  U2B

After labeling, each RNA sample was split into two aliquots denoted A and B.

04/18/23 63

Motivating Experiment

Human cell line    Irradiated (I)    Unirradiated (U)
1                  I1A  I1B          U1A  U1B
2                  I2A  I2B          U2A  U2B

8 GeneChips, one for each sample, were used to obtain measures of expression.

04/18/23 64

First glance at the data

Linear Scatter plot of gene expression Cube root scatter plot of gene expression

04/18/23 65

How to find the significant changes? Naïve method

Cube root scatter plot of average gene expression from the four hybridizations with uninduced cells (avg xU) and induced cells 4 h after exposure to 5 Gy of IR (avg xI). Some of the genes that responded to IR are indicated by arrows.

04/18/23 66

Test Statistic for the ith Gene

$$ d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0} $$

• x̄I(i): average of 4 normalized measures from irradiated samples
• x̄U(i): average of 4 normalized measures from unirradiated samples
• s(i): the usual standard deviation in the denominator of a two-sample t-stat
• s0: a constant common to all genes that is added to make variation in d(i) similar across genes of all intensity levels
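A minimal sketch of the statistic for one gene. It assumes s(i) is the pooled standard error from the two-sample t statistic, as the annotation above suggests; the function name `sam_d` and the reuse of the earlier Treatment A/B numbers as stand-in data are illustrative only.

```python
import math

def sam_d(irradiated, unirradiated, s0):
    """Relative difference d(i) for one gene: mean difference over (s(i) + s0)."""
    n1, n2 = len(irradiated), len(unirradiated)
    m1 = sum(irradiated) / n1
    m2 = sum(unirradiated) / n2
    ss = (sum((x - m1) ** 2 for x in irradiated)
          + sum((x - m2) ** 2 for x in unirradiated))
    # Assumed form of s(i): pooled standard error of the mean difference.
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

group_i = [1.50, 2.07, 0.51, 1.63]
group_u = [0.45, 0.57, 1.02, 0.97]
d_plain = sam_d(group_i, group_u, s0=0.0)    # reduces to the ordinary t statistic
d_shrunk = sam_d(group_i, group_u, s0=3.3)   # a large s0 damps the score
```

With s0 = 0 the score equals the standard t statistic; a positive s0 prevents genes with very small s(i) from receiving inflated scores.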

04/18/23 67

Selecting the constant s0

• At low expression levels, variance in d(i) can be high because of small values of s(i).

• To stabilize the variance of d(i) across genes, a small positive constant s0 was used in the denominator of the test statistic.

• “The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s0 was chosen to minimize the coefficient of variation.”

• s0 was chosen to be 3.3 for the ionizing radiation data.

04/18/23 68

More Detail on Selecting s0

• The d(i) are separated into approximately 100 groups. The 1% of the d(i) values with the smallest s(i) values are placed in the first group, the 1% of the d(i) values with the next smallest s(i) are placed in the second group, etc.

• The median absolute deviation (MAD) of the d(i) values is computed separately for each group.

• The coefficient of variation (CV) of these 100 MAD values is computed.

04/18/23 69

More Detail on Selecting s0 (continued)

• This process is repeated for values of s0 equal to the minimum of s(i) over i, the 5th percentile of the s(i) values, the 10th percentile of the s(i) values,..., the 95th percentile of the s(i) values, and the maximum of the s(i) values.

• The value of s0 that minimizes the CV of the 100 MAD values over candidate s0 described above is selected as the constant s0.

04/18/23 70

Balancing the Permutations

• There are differences between the two cell lines.

• Balanced permutations: used to minimize the effects of these differences

A permutation is balanced if each group of four experiments contains two experiments from line 1 and two from line 2. There are 36 balanced permutations.
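The count of 36 can be verified by enumeration: choose two of the four chips from line 1 and two of the four from line 2 for the first group. The sample labels follow the naming used on the experiment slides; the function name is hypothetical.

```python
from itertools import combinations

LINE1 = ["I1A", "I1B", "U1A", "U1B"]
LINE2 = ["I2A", "I2B", "U2A", "U2B"]

def balanced_permutations(line1, line2):
    """All splits into two groups of four with two chips from each cell line per group."""
    everything = set(line1) | set(line2)
    splits = []
    for pick1 in combinations(line1, 2):
        for pick2 in combinations(line2, 2):
            group_a = set(pick1) | set(pick2)
            group_b = everything - group_a
            splits.append((sorted(group_a), sorted(group_b)))
    return splits

splits = balanced_permutations(LINE1, LINE2)
```

The list includes the original labeling and its mirror image, since both are balanced.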

04/18/23 71

Example Permutations

Treatment: Irradiated (I), Unirradiated (U)

Human cell line    Samples
1                  I1A  I1B  U1A  U1B
2                  I2A  I2B  U2A  U2B

04/18/23 72

• Scatter plots of relative difference in gene expression d(i) vs. gene-specific scatter s(i).

04/18/23 73

A Permutation Procedure for Assessing Significance

1. The irradiated and unirradiated GeneChips were shuffled within each cell line.

2. The d(i) statistic was computed for each gene and ordered across genes from smallest to largest to obtain d1(1) < d1(2) < … < d1(g), where g denotes the number of genes.

3. Steps 1 and 2 were repeated for all possible data permutations described in step 1 to obtain dp(1) < dp(2) < … < dp(g) for p = 1, ..., 36, where 36 = C(4,2) × C(4,2).

04/18/23 74

A Permutation Procedure for Assessing Significance (continued)

4. For each i, d1(i), ..., d36(i) were averaged to obtain dE(i), the “expected relative difference.”

5. The original d(i) statistics were also sorted so that d(1) < d(2) < … < d(g).

6. Genes for which | d(i) – dE(i) | > Δ were declared significant, where Δ is a user-specified cutoff for significance.

04/18/23 75

Example

04/18/23 76

Plot of Observed vs. “Expected” Test Statistics

[Figure: observed d(i) plotted against expected dE(i); points for genes with evidence of induction and points for genes with evidence of repression are marked.]

04/18/23 77

Plot of d(i) vs. log10 s(i) for the Ionizing Radiation Data

[Figure: d(i) against log10 s(i); 24 induced genes and 22 repressed genes are marked.]

04/18/23 78

Estimating FDR for a Selected Δ

1. Find the smallest d(i) among those d(i) for which d(i) – dE(i) > Δ and call it dup.

2. Find the largest d(i) among those d(i) for which d(i) – dE(i) < –Δ and call it ddown.

3. For each permuted data set, find the number of genes with d(i) >= dup or d(i) <= ddown and denote these counts by n1, ..., n36.

4. FDR is estimated by n̄ / n, where n̄ is the average of n1, ..., n36 and n is the number of genes identified as significant in the original data.

04/18/23 79

FDR cont’d

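$$ \widehat{\mathrm{FDR}} = \frac{\dfrac{1}{36} \sum_{p=1}^{36} \#\{\, i : d_p(i) \le t_1 \ \text{or}\ d_p(i) \ge t_2 \,\}}{\#\{\, i : d(i) \le t_1 \ \text{or}\ d(i) \ge t_2 \,\}} $$

• Note: Cutoffs are asymmetric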

04/18/23 80

Counts of Genes beyond the Threshold For Each Permutation

Perm  Count    Perm  Count    Perm  Count
 1     45       13     4       25     4
 2      5       14     1       26     2
 3      2       15     3       27     1
 4      3       16     9       28     1
 5      4       17    12       29     5
 6     11       18    31       30     9
 7      8       19    31       31    11
 8      5       20    12       32     4
 9      1       21     9       33     3
10      1       22     3       34     2
11      3       23     1       35     5
12      4       24     4       36    46

04/18/23 81

Mean Count = 305/36 = 8.472

FDR Estimate = 8.472/46 = 18.4%

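The estimate can be recomputed from the per-permutation counts. The counts below are transcribed from the table on the previous slide; the variable names are illustrative.

```python
# Genes beyond the thresholds in each of the 36 balanced permutations.
counts = [
    45, 5, 2, 3, 4, 11, 8, 5, 1, 1, 3, 4,      # permutations 1-12
    4, 1, 3, 9, 12, 31, 31, 12, 9, 3, 1, 4,    # permutations 13-24
    4, 2, 1, 1, 5, 9, 11, 4, 3, 2, 5, 46,      # permutations 25-36
]
significant_in_original = 46  # genes called significant in the real data

mean_count = sum(counts) / len(counts)           # average number called by chance
fdr_estimate = mean_count / significant_in_original
```

Note that the list contains the original labeling itself (46 genes) and its mirror image (45 genes), which pull the average upward.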

04/18/23 82

How to choose Δ?

Omitting s0 caused higher FDR.

04/18/23 83

Plot of Observed vs. “Expected” Test Statistics

[Figure: observed d(i) against expected dE(i), with thresholds at -4.073859 and 4.054688.]

04/18/23 84

Plot of d(i) vs. log10 s(i) for the Ionizing Radiation Data

[Figure: d(i) against log10 s(i), with thresholds at -4.073859 and 4.054688.]

04/18/23 85

Same Plot for One of the Permuted Data Sets

[Figure: d(i) against log10 s(i) for one permuted data set, with thresholds at -4.073859 and 4.054688.]

Only 5 genes lie beyond the thresholds, compared to 46 for the original data.

04/18/23 86

SAM vs. R fold

• R-fold method: gene i is significant if r(i) > R or r(i) < 1/R, where

$$ r(i) = \frac{\bar{x}_I(i)}{\bar{x}_U(i)} $$

FDR 73%-84%: unacceptable.

• Pairwise fold change: at least 12 out of 16 pairings must satisfy the criteria. FDR 60%-71%: unacceptable.

Why doesn’t it work?

04/18/23 87

Fold-change, SAM- Validation

04/18/23 88

04/18/23 89

SAM vs. Multiple t-Tests

• Multiple t-tests try to control the FDR or the FWER (family-wise error rate).

• Why doesn’t it work?
– FWER: too stringent (Bonferroni, Westfall and Young)
– FDR: too granular (Benjamini and Hochberg)
• SAM does not assume a normal distribution of the data
• SAM works effectively even with a small sample size.

04/18/23 90

Conclusion SAM

• SAM is a method for identifying genes on a microarray with statistically significant changes in expression.
• SAM provides an estimate of the FDR for each value of the tuning parameter. The estimated FDR is computed from permutations of the data.
• SAM can be generalized to other types of experiments and outcomes by redefining d(i).
• http://www-stat-class.stanford.edu/SAM/SAMServlet

04/18/23 91

ANOVA

• The t-test and its variants only work when there are two sample pools.

• Analysis of variance (ANOVA) is a general technique for handling multiple variables, with replicates.

• A tutorial is available here: http://cran.at.r-project.org/doc/contrib/Faraway-PRA.pdf

04/18/23 92

A simple experiment

• Measure response to a drug treatment in two different mouse strains.

• Repeat each measurement five times.

• Total experiment = 2 strains * 2 treatments * 5 repetitions = 20 arrays

• If you look for treatment effects using a t-test, then you ignore the strain effects.

04/18/23 93

ANOVA lingo

• Factor: a variable that is under the control of the experimenter (strain, treatment).

• Level: a possible value of a factor (drug, no drug).

• Main effect: an effect that involves only one factor.

• Interaction effect: an effect that involves two or more factors simultaneously.

• Balanced design: an experiment in which each factor and level is measured an equal number of times.

04/18/23 94

Two-factor design

04/18/23 95

Fixed and random effects

• Fixed effect: a factor for which the levels would be repeated exactly if the experiment were repeated.

• Random effect: a term for which the levels would not repeat in a replicated experiment.

• In the simple experiment, treatment and strain are fixed effects, and we include a random effect to account for biological and experimental variability.

04/18/23 96

ANOVA model

$$ E_{ijk} = \mu + T_i + S_j + (TS)_{ij} + \varepsilon_{ijk}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, m, \quad k = 1, \ldots, p $$

• μ is the mean expression level of the gene.
• T and S are main effects (treatment, strain) with n and m levels, respectively.
• TS is an interaction effect.
• p is the number of replicates per group.
• ε represents random error (to be minimized).

04/18/23 97

ANOVA steps

• For each gene on the array
– Fit the parameters T and S, minimizing ε.
– Test T, S and TS for difference from zero, yielding three F statistics.
– Convert the F statistics into p-values.
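For a balanced design, the three F statistics for one gene can be computed from sums of squares, as in the sketch below. The function name, the nested-list input layout and the toy numbers are assumptions for illustration; converting F statistics to p-values needs the F distribution, which is left out here.

```python
def two_way_anova(data):
    """F statistics for a balanced two-factor design.

    data[i][j] is the list of p replicate values for treatment level i and strain level j.
    """
    n, m, p = len(data), len(data[0]), len(data[0][0])
    cell = [[sum(reps) / p for reps in row] for row in data]      # cell means
    t_mean = [sum(cell[i]) / m for i in range(n)]                 # treatment means
    s_mean = [sum(cell[i][j] for i in range(n)) / n for j in range(m)]  # strain means
    grand = sum(t_mean) / n                                       # overall mean (mu)
    ss_t = m * p * sum((t - grand) ** 2 for t in t_mean)
    ss_s = n * p * sum((s - grand) ** 2 for s in s_mean)
    ss_ts = p * sum((cell[i][j] - t_mean[i] - s_mean[j] + grand) ** 2
                    for i in range(n) for j in range(m))
    ss_e = sum((x - cell[i][j]) ** 2
               for i in range(n) for j in range(m) for x in data[i][j])
    ms_e = ss_e / (n * m * (p - 1))  # residual mean square
    return {
        "F_T": (ss_t / (n - 1)) / ms_e,
        "F_S": (ss_s / (m - 1)) / ms_e,
        "F_TS": (ss_ts / ((n - 1) * (m - 1))) / ms_e,
    }

# Toy gene: 2 treatments x 2 strains x 2 replicates.
toy = [[[1.0, 2.0], [3.0, 4.0]],
       [[5.0, 6.0], [9.0, 10.0]]]
f_stats = two_way_anova(toy)
```

In this toy example the treatment effect dominates, the strain effect is smaller, and the interaction is weakest.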

04/18/23 98

ANOVA assumptions

• For a given gene, the random error terms are independent, normally distributed and have uniform variance.

• The main effects and their interactions are linear.

04/18/23 99

Summary

• Individual measurements from microarray experiments are not trustworthy.

• Repetition or independent verification (e.g., RT-PCR) are the best means of verification.

• For simple designs, use Welch’s approximation of the t-test.

• For complex designs, use ANOVA.
• Correct for multiple comparisons using FDR and q-values.

04/18/23 100

Bioconductor

• Bioconductor is an open source project to design and provide high quality software and documentation for bioinformatics.
• Current focus: microarrays and gene (transcript) annotation
• Most of the early developments are in the form of R packages.
• Open to (your?) contributions
• Software and documentation are available from www.bioconductor.org.

04/18/23 101

Bioconductor packages

• General infrastructure
– Biobase
– annotate, AnnBuilder
– tkWidgets

• Pre-processing for Affymetrix data
– affy

• Pre-processing for cDNA data
– marrayClasses, marrayInput, marrayNorm, marrayPlots

• Differential expression
– edd, genefilter, multtest, ROC

• etc.