6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power...


Microarray Data Analysis


Copyright notice

• Many of the images in this PowerPoint presentation come from other people. The copyright belongs to the original authors. Thanks!


Gene Expression Matrix

• After image processing, we obtain a data matrix.
• The final gene expression matrix (on the right) is needed for higher-level analysis and mining.

[Figure: spot/image quantitations (spots × images) on the left are reduced to the matrix of gene expression levels (genes × samples) on the right.]


Missing data in microarray

• Randomly missing values
  – the fact that the value is missing is independent of its value
  – methods are available for dealing with randomly missing data

• Non-randomly missing values
  – the fact that the value is missing is dependent on its value (i.e. the value is missing because expression is low, or because expression is high)
  – available methods do not adequately deal with non-randomly missing data


Missing data in microarray

Randomly missing data:
  – spotting problems
  – dust
  – fingerprints
  – poor hybridization
  – inadequate resolution
  – fabrication errors (e.g. scratches)
  – image corruption
  – omission of suspect values*

* could also be non-random

Non-randomly missing data:
  – low expression, e.g. background exceeds signal
  – censored data

[Figure: expression across arrays, with the max observable intensity marked.]


Dealing with missing data

• The problem:
  – many analyses require complete data matrices
    • classification algorithms
    • clustering algorithms
    • dimension-reduction methods

• Solutions:
  – remove all genes (rows) and arrays (columns) with missing values
  – estimate missing values


Imputation methods

• Naive approaches
  – missing values = row (gene) average
  – missing values = column (array) average

• Smarter approaches have been proposed:
  – K-nearest neighbors
  – regression-based methods
  – singular value decomposition
    • like principal components for matrices with unequal numbers of rows and columns


K-Nearest Neighbors (KNN)

[Figure: expression profiles across arrays, with one randomly missing datum marked "?".]

• choose the k genes that are most similar to the gene with the missing value (MV)

• estimate the MV as the weighted mean of the neighbors

• considerations:
  – number of neighbors
  – distance metric
  – normalization step


KNN - considerations

• parameter k
  – 10 usually works (5-15)

• distance metric
  – euclidean distance
  – correlation-based distance

[Figure: expression profiles across arrays, with the missing value marked "?".]
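The KNN idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used by any particular package: the function name `knn_impute`, the use of `None` for missing values, the root-mean-square euclidean distance over shared arrays, and the inverse-distance weights are all choices made here for the example.

```python
import math

def knn_impute(matrix, k=10):
    """Fill None entries of a genes x arrays matrix (hypothetical helper).

    Neighbors are the k genes closest in euclidean distance (computed over
    the arrays both genes have observed); the missing value is estimated as
    the inverse-distance weighted mean of the neighbors' values.
    """
    result = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        for a, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for h, other in enumerate(matrix):
                # A neighbor must be a different gene with a value in array a.
                if h == g or other[a] is None:
                    continue
                shared = [(x, y) for x, y in zip(row, other)
                          if x is not None and y is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))
                candidates.append((dist, other[a]))
            candidates.sort(key=lambda c: c[0])
            neighbors = candidates[:k]
            if not neighbors:
                continue  # nothing to borrow from; leave the value missing
            weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
            result[g][a] = sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)
    return result

genes = [
    [1.0, 2.0, None],  # gene with a missing value
    [1.0, 2.0, 3.0],   # identical on the shared arrays
    [5.0, 6.0, 7.0],
]
filled = knn_impute(genes, k=1)
```

A correlation-based distance or a normalization step could be swapped in at the point where `dist` is computed.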


Ordinary Least Squares (OLS)

• regression-based approach
• also uses k neighbors
• algorithm:
  – choose k neighbors (euclidean or correlation; normalize or not)
  – the gene with the MV is regressed on the neighbor genes (one at a time, i.e. simple regression)
  – for each neighbor, the MV is predicted from the regression model
  – the MV is imputed as the weighted average of the k predictions


Singular Value Decomposition (SVD)

• goal:
  – use the strongest patterns of correlation within the data matrix to estimate the MVs

• algorithm
  – set MVs to the row average (need a starting point)
  – decompose the expression matrix into orthogonal components, "eigengenes"
  – use the proportion, p, of eigengenes corresponding to the largest eigenvalues to reconstruct the MVs from the original matrix (i.e. improve your estimate)
  – use an EM approach to iteratively improve the estimates of the MVs until convergence
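The SVD algorithm above might look roughly like the following NumPy sketch. It assumes missing values are stored as NaN, and it uses a fixed number of leading components (`rank`) in place of the proportion p; the function name `svd_impute`, the tolerance, and the iteration cap are hypothetical choices for illustration.

```python
import numpy as np

def svd_impute(X, rank=2, max_iter=100, tol=1e-6):
    """Iteratively re-estimate NaN entries from a low-rank SVD reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Starting point: replace each missing value with its row (gene) average.
    row_means = np.nanmean(X, axis=1)
    filled[missing] = row_means[np.where(missing)[0]]
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Reconstruct from the components with the largest singular values.
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        change = np.max(np.abs(approx[missing] - filled[missing])) if missing.any() else 0.0
        # Only the missing entries are updated; observed values stay fixed.
        filled[missing] = approx[missing]
        if change < tol:
            break
    return filled

data = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
data[1, 2] = np.nan
completed = svd_impute(data, rank=1)
```

Observed entries are never overwritten, so the loop only refines the estimates of the missing cells.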


Other Imputation Methods:

• Local Singular Value Decomposition (LSVD)
  – combines KNN and SVD
  – algorithm:
    • start with an n genes × m arrays matrix
    • select k neighbor genes (euclidean or correlation; normalize or not)
    • perform SVD on the k × m arrays matrix

• Partial Least Squares (PLS) regression
  – uses all genes and available data from the target gene

• Factor Analysis (FA) regression


Which imputation method to use?

• KNN is the most widely-used; current standard

• many alternative choices: OLS, SVD, LSVD, PLS, (FA)

• algorithms require user-supplied parameters: k, p, distance metric, etc.

• No set of rules for choosing which method to use


Characteristics of data that may affect choice of imputation method

• dimensionality

• percentage of values missing

• experimental design (time series, case/control, etc.)

• entropy - patterns of correlation in data

• others?


Data Analysis

• Determine differential gene expression
  – Identify up- and down-regulated genes
  – Gene lists produced using the Factor 2 Rule and t-test based methods

• Co-regulation of genes
  – Clustering algorithms

• Identify genes that regulate other genes
  – Networks (e.g. Bayesian)


Methods to Decide Differential Expression

• Compare treatment to the control
  – The fold approach
  – The t-test
  – Variations of the t-test
    • SAM: significance analysis of microarrays

• Compare several treatments
  – ANOVA: analysis of variance
  – MAANOVA: http://www.jax.org/staff/churchill/labsite/software/anova/index.html


Fold Change

• Measure ratios of gene expression levels.

• Ratio = Ti/Ci. Ratio of measured treatment intensity to control intensity for the ith spot

• The log2 ratio treats up- and down-regulated genes equally
  – e.g. when looking for genes with more than 2-fold variation in expression
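A small worked example may make the symmetry concrete. The intensities below are made up for illustration, and `log2_ratio` is a hypothetical helper name: a 2-fold increase and a 2-fold decrease give raw ratios of 2 and 0.5, but log2 ratios of +1 and -1.

```python
import math

def log2_ratio(treatment, control):
    """Log2 of the treatment/control intensity ratio for one spot."""
    return math.log2(treatment / control)

up = log2_ratio(200.0, 100.0)    # 2-fold up-regulation
down = log2_ratio(100.0, 200.0)  # 2-fold down-regulation

# On the log2 scale a 2-fold change in either direction has magnitude 1.
is_two_fold_change = [abs(value) >= 1.0 for value in (up, down)]
```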


The Fold Approach

• In northern analysis, a 2-fold change can be seen with the naked eye

• Thus biologists tend to use 2-fold as the threshold of differential expression

• mean(x1, x2) > 1

• mean(x1, x2) < -1


Illustration of the benefit of using Log ratios


Two-fold up-regulation

• Problems with this approach:
  – Only identifies the most changed genes.
  – Also identifies noise and highly variable genes.
  – The ratio is unstable when the denominator is small.


Ratios are unstable

• Initial measurements:

30/60 = 0.5

500/1000 = 0.5

• Add random noise (+15 numerator and -15 denominator):

45/45 = 1.0

515/985 = 0.52


Types of tests

• The standard t-test assumes the samples are drawn from normal distributions with equal variance, and tests whether their means differ.

• Welch’s t-test allows for different variances between classes.

• Mann-Whitney (Wilcoxon) converts the data to ranks, and does not assume a particular distribution.

• Permutation test computes the t-statistic for many random permutations of the labels.


The Student’s t-test

• For sample sizes less than 30 we have to make use of a t-distribution

• We make use of this distribution in the two-sample Student's t-test.

• This test is used to test whether two samples come from distributions with the same means.

• The samples are assumed to come from Gaussian (normal) distributions.

• The two samples must have similar dispersions


The Student's t distribution

• The Student's t distribution
  – is mound shaped
  – is symmetrical about zero
  – is more widely dispersed than the standard normal distribution
  – its actual shape is dependent on the sample size

• different t distributions are identified by their degrees of freedom (df), where df = n-1


The student’s t distribution (cont.)

[Figure: t distributions for df = 15, df = 30 and df = 120 (= z), plotted against standard errors from -4 to 4; examples not to scale.]


Mean and Median

• The mean is the most common measure of the location of a set of points.

• However, the mean is very sensitive to outliers.

• Thus, the median or a trimmed mean is also commonly used.


Range and Variance

• Range is the difference between the max and min.

• The variance or standard deviation sx is the most common measure of the spread of a set of points.

• Because of outliers, other measures are often used.


Statistical Analysis

[Figure: two distributions, with the control group mean and the treatment group mean marked.]

Is there a difference?


What does difference mean?

[Figure: three pairs of distributions with medium, high and low variability.]

The mean difference is the same for all three cases.


What does difference mean?

[Figure: three pairs of distributions with medium, high and low variability.]

Which one shows the greatest difference?


What does difference mean?

• a statistical difference is a function of the difference between means relative to the variability

• a small difference between means with large variability could be due to chance

• like a signal-to-noise ratio

[Figure: the low-variability case.] Which one shows the greatest difference?


So we estimate

$$\frac{\text{signal}}{\text{noise}} = \frac{\text{difference between group means}}{\text{variability of groups}} = \frac{\bar{X}_T - \bar{X}_C}{SE(\bar{X}_T - \bar{X}_C)} = t\text{-value}$$


Probability - p

• With t we look up the probability p, and then reject or do not reject the null hypothesis

• You reject if p < 0.05 (or a smaller cutoff)

• The difference between the (group) means is more significant the smaller p is


Important notes on two sample comparisons

• Type I errors (false positives)
  – we accept a difference as real when it is not (at the 95% confidence level we are, of course, wrong 5% of the time when there is no real difference)
  – we can use a more stringent significance level (a smaller p-value cutoff) to decrease these errors

• Type II errors (false negatives)
  – if we make the significance level more stringent, we risk missing some real differences

• Convention is that we should reduce Type I errors and be conservative

• Both can be minimised by increasing the sample size


Paired and unpaired tests

• There are different formulas for the t-test depending on whether we have paired or unpaired data
  – Paired: making observations of N individuals in two different situations
    • In this situation we can consider the difference for each individual rather than calculate separate means and SEs for the two effects
  – Unpaired: two separate samples drawn from the same parent population
    • Can have different sample sizes


Tails

• Two-tailed: Do set A and set B come from different distributions?

• One-tailed: Does set A come from a distribution with larger mean than set B?

• This corresponds to finding differentially regulated genes versus finding up-regulated genes.


Selecting genes with a t-test

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v}{n_1} + \dfrac{v}{n_2}}}$$

• μi = mean expression value in class i
• ni = number of examples in class i
• v = pooled variance across both classes

http://mathworld.wolfram.com/Studentst-Distribution.html
Zar. Biostatistical Analysis. 1999.


Standard T Test: An example

• Observed gene expression values:

Treatment A: 0.45 0.57 1.02 0.97

Treatment B: 1.50 2.07 0.51 1.63

• Compute mean:

mean (A) = 3.01 / 4 = 0.7525

mean (B) = 5.71 / 4 = 1.4275


Pooled variance

• The standard t-test assumes samples are drawn from distributions with the same variance.

• Pooled variance

= (SS1 + SS2) / (n1 + n2 - 2)

= (0.243675 + 1.300875) / (4 + 4 - 2)

= 0.2574

SSi: sum of squared deviations from the mean in class i


Selecting genes with a t-test

t = (0.7525 - 1.4275) / sqrt(0.2574/4 + 0.2574/4) = -1.8815, so |t| = 1.8815
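The worked example can be checked with a short script. This is a minimal sketch using only the Python standard library; `pooled_t` is a hypothetical helper name, and the data are the Treatment A and Treatment B values from the example.

```python
import math

def pooled_t(a, b):
    """Return (t, pooled variance) for the equal-variance two-sample t-test."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    # Sums of squared deviations from each class mean (SS1 and SS2).
    ss1 = sum((x - mean1) ** 2 for x in a)
    ss2 = sum((x - mean2) ** 2 for x in b)
    v = (ss1 + ss2) / (n1 + n2 - 2)
    t = (mean1 - mean2) / math.sqrt(v / n1 + v / n2)
    return t, v

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]
t_value, pooled_variance = pooled_t(treatment_a, treatment_b)
```

The sign of t only reflects which class is subtracted from which; the magnitude is what is compared against the t distribution in a two-tailed test.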


If the Sample Variances are Unlikely to be Equal

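• Use Welch's t-test

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}}}$$

• degrees of freedom

$$df = \frac{(A + B)^2}{\dfrac{A^2}{n_x - 1} + \dfrac{B^2}{n_y - 1}}$$

• where

$$A = \frac{s_x^2}{n_x}, \qquad B = \frac{s_y^2}{n_y}$$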


Welch’s approximation

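t-test:

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v}{n_1} + \dfrac{v}{n_2}}}, \qquad |t| = 1.8815$$

Welch's:

$$t = \frac{|\mu_1 - \mu_2|}{\sqrt{\dfrac{v_1}{n_1} + \dfrac{v_2}{n_2}}}$$

= |0.7525 - 1.4275| / sqrt(0.08123/4 + 0.43363/4)

= 1.8815

Here v1 = 0.243675/3 = 0.08123 and v2 = 1.300875/3 = 0.43363 are the per-class sample variances. Because the two classes have equal sample sizes, Welch's statistic equals the pooled t statistic; the two tests differ only in their degrees of freedom.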


Degrees of freedom

• For the t-test, dof = n1 + n2 - 2.

• For Welch's approximation, it is not so simple. Let Ai = vari / ni. Then

$$dof = \left\lfloor \frac{(A_1 + A_2)^2}{\dfrac{A_1^2}{n_1 - 1} + \dfrac{A_2^2}{n_2 - 1}} \right\rfloor$$
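As a rough check of the formula above, the following sketch computes Welch's statistic and the floored degrees of freedom for the Treatment A and Treatment B values used in the earlier example. `welch_t` is a hypothetical helper name, and `statistics.variance` is the sample variance (dividing by n - 1).

```python
import math
import statistics

def welch_t(x, y):
    """Return (t, dof) for Welch's unequal-variance t-test."""
    nx, ny = len(x), len(y)
    a = statistics.variance(x) / nx  # A1 = var1 / n1
    b = statistics.variance(y) / ny  # A2 = var2 / n2
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(a + b)
    dof = math.floor((a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1)))
    return t, dof

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]
welch_value, welch_dof = welch_t(treatment_a, treatment_b)
```

For these data the standard t-test has n1 + n2 - 2 = 6 degrees of freedom, while Welch's approximation gives fewer, which makes the test more conservative.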


Non-parametric p-value

• The t-test assumes the t-distribution
  – a parametric method
  – compute the test statistic
  – use the t pdf to determine the p-value

• A non-parametric method
  – data are labeled as X and Y
  – compute the test statistic with the true labels
  – randomly permute the individual labels 10000 times, and compute the test statistic each time
  – find the rank of the true test statistic among the test statistics of the random permutations
  – for example, if there are 10 permutations with test statistics larger than the true test statistic, then the p-value is 10/10000 = 0.001
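The non-parametric procedure can be sketched as follows. Several details are assumptions made for this illustration: the test statistic is the absolute difference of means (any statistic could be substituted), permuted statistics are counted when they are at least as large as the observed one, and the function name, seed and toy data are hypothetical.

```python
import random

def permutation_p_value(x, y, n_permutations=10000, seed=0):
    """Estimate a two-sided p-value by shuffling the X/Y labels."""
    rng = random.Random(seed)

    def statistic(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = statistic(x, y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of the labels
        if statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            count += 1
    return count / n_permutations

group_x = [1.0, 2.0, 3.0, 4.0]
group_y = [10.0, 11.0, 12.0, 13.0]
p_value = permutation_p_value(group_x, group_y)
```

With only four values per group there are just 70 distinct label assignments, so the smallest achievable p-value is limited by the sample size rather than by the number of permutations.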


Mann-Whitney u-test

• Mann-Whitney, also known as Wilcoxon, is a non-parametric test.

• Begin by converting to ranks:

Treatment A: 0.45 0.57 1.02 0.97

Treatment B: 1.50 2.07 0.51 1.63

Treatment A: 1 3 5 4

Treatment B: 6 8 2 7


Mann-Whitney u statistic

• The u statistic is

$$U = \max\left( n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1,\;\; n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2 \right)$$

where Ri is the sum of the ranks in class i.

• U = 16 + 10 - 13 = 13
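The rank conversion and the U statistic for the Treatment A and Treatment B example can be reproduced with a short sketch. It assumes there are no tied values (ties would need average ranks), and `mann_whitney_u` is a hypothetical helper name.

```python
def mann_whitney_u(a, b):
    """Return the larger of the two U values; assumes no tied observations."""
    pooled = sorted(a + b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # rank 1 = smallest
    r1 = sum(rank[value] for value in a)
    r2 = sum(rank[value] for value in b)
    n1, n2 = len(a), len(b)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return max(u1, u2)

treatment_a = [0.45, 0.57, 1.02, 0.97]  # ranks 1, 3, 5, 4 (sum 13)
treatment_b = [1.50, 2.07, 0.51, 1.63]  # ranks 6, 8, 2, 7 (sum 23)
u_statistic = mann_whitney_u(treatment_a, treatment_b)
```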


Permutation test


Cost-benefits analysis

• t-test assumes both samples are drawn from the same normal distribution.

• Welch’s approximation allows the samples to be drawn from different normals.

• Mann-Whitney makes no assumption about the distribution.

• The tests, as listed, yield decreasing power.

• The permutation test gives the most flexibility in choosing a test statistic that reflects prior knowledge, but it can be computationally expensive for small p-values.


Multiple testing correction

• On an array of 10,000 spots, a p-value of 0.0001 may not be significant.

• For significance of 0.05 with 10,000 spots, you need a p-value of 0.05 / 10,000 = 5 × 10^-6.


Family-wise Error-rate

• FWER
• Chance of any false positives
• Assume 0.01 significance level for one gene
• Multiply by the number of genes
• Many false positives
• Bonferroni correction: divide 0.01 by the number of genes
• Bonferroni is conservative because it assumes that all genes are independent.


False discovery rate

• The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.

• False positive rate: percentage of non-differentially expressed genes that are flagged.

• False discovery rate: percentage of flagged genes that are not differentially expressed.

5 FP     13 TP
33 TN    5 FN

FDR = FP / (FP + TP) = 5/18 = 27.8%
FPR = FP / (FP + TN) = 5/38 = 13.2%


Bonferroni vs. FDR

• Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive.

• FDR is the proportion of false positives among the genes that are flagged as differentially expressed.


Controlling the FDR

• Order the unadjusted p-values p1 ≤ p2 ≤ … ≤ pm.

• To control FDR at level α, let

$$j^* = \max\left\{ j : p_j \le \frac{j}{m}\,\alpha \right\}$$

• Reject the null hypothesis for j = 1, …, j*.

• This approach is conservative if many genes are differentially expressed.

(Benjamini & Hochberg, 1995)
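A minimal sketch of this step-up rule, assuming the p-values are given as a plain list: `benjamini_hochberg` is a hypothetical helper that returns the indices of the rejected hypotheses, and the p-values in the example are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    j_star = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            j_star = rank  # keep the largest rank that satisfies the bound
    return sorted(order[:j_star])

p_values = [0.02, 0.021, 0.03, 0.9]
rejected = benjamini_hochberg(p_values, alpha=0.05)
```

In this example the smallest p-value (0.02) exceeds its own bound of 0.0125, yet it is still rejected because a larger p-value further down the list meets its bound; that is what distinguishes the step-up rule from testing each p-value against its own threshold.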


q-value

• The p-value for a particular gene G is the probability that a randomly generated expression profile would be as or more extremely differentially expressed.

• The q-value for a particular gene G is the proportion of false positives among all genes that are as or more extremely differentially expressed.

• Equivalently, the q-value is the minimal FDR at which this gene appears significant.


Q-value software

http://faculty.washington.edu/~jstorey/qvalue/


SAM

Significance analysis of microarrays applied to the ionizing radiation response
Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu
Proc. Natl. Acad. Sci. USA, Vol. 98, Issue 9, 5116-5121, April 24, 2001


Abstract

• Method for gene filtering: find genes that change significantly across samples

• Significance Analysis of Microarrays (SAM) assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements.

• For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR).


Introduction

• Suitable for oligo, cDNA, protein arrays

• Does not normalize the data!

• Challenge:
  – methods based on conventional t tests provide the probability (P) that a difference in gene expression occurred by chance. For an array with 10000 genes, a significance level of alpha = 0.01 would identify 100 genes by chance.
  – Experiments are expensive.


Introduction

• Solution based on SAM:
  – assimilate a set of gene-specific t tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene.
  – Instead of more replicates, generate permutations of the data (mix the labels)

• Genes with scores greater than a threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements.

• The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set. To demonstrate its utility, SAM was used to analyze a biologically important problem: the transcriptional response of lymphoblastoid cells to ionizing radiation (IR).


Motivating Experiment

[Table: human cell lines (1, 2) in rows; treatment, Irradiated (I) or Unirradiated (U), in columns.]

One RNA sample for each combination of cell line and treatment.


Motivating Experiment

                     Treatment
Human cell line      Irradiated (I)    Unirradiated (U)
1                    I1A  I1B          U1A  U1B
2                    I2A  I2B          U2A  U2B

After labeling, each RNA sample was split into two aliquots denoted A and B.


Motivating Experiment

                     Treatment
Human cell line      Irradiated (I)    Unirradiated (U)
1                    I1A  I1B          U1A  U1B
2                    I2A  I2B          U2A  U2B

8 GeneChips, one for each sample, were used to obtain measures of expression.


First glance at the data

[Figure, two panels: linear scatter plot of gene expression; cube root scatter plot of gene expression.]


How to find the significant changes? Naïve method

Cube root scatter plot of average gene expression from the four hybridizations with uninduced cells (avg xU) and induced cells 4 h after exposure to 5 Gy of IR (avg xI). Some of the genes that responded to IR are indicated by arrows.


Test Statistic for the ith Gene

$$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$$

• x̄I(i): average of 4 normalized measures from irradiated samples
• x̄U(i): average of 4 normalized measures from unirradiated samples
• s(i): the usual standard deviation in the denominator of a two-sample t-stat
• s0: a constant common to all genes that is added to make variation in d(i) similar across genes of all intensity levels
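The statistic can be sketched for a single gene as follows. The sketch assumes that s(i) is the pooled standard error from the equal-variance two-sample t-statistic, as the annotation above suggests; `sam_statistic` is a hypothetical helper name and the expression values are made up (only four values per group, as in the experiment).

```python
import math

def sam_statistic(irradiated, unirradiated, s0):
    """Relative difference d(i) for one gene: mean difference / (s(i) + s0)."""
    n1, n2 = len(irradiated), len(unirradiated)
    mean_i = sum(irradiated) / n1
    mean_u = sum(unirradiated) / n2
    ss = (sum((x - mean_i) ** 2 for x in irradiated)
          + sum((x - mean_u) ** 2 for x in unirradiated))
    # Pooled standard error, as in the usual two-sample t-statistic.
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (mean_i - mean_u) / (s + s0)

x_irradiated = [1.50, 2.07, 0.51, 1.63]
x_unirradiated = [0.45, 0.57, 1.02, 0.97]
d_plain = sam_statistic(x_irradiated, x_unirradiated, s0=0.0)  # ordinary t
d_sam = sam_statistic(x_irradiated, x_unirradiated, s0=3.3)    # damped score
```

With s0 = 0 the score reduces to the ordinary t-statistic; a positive s0 shrinks the score, most strongly for genes whose s(i) is small.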


Selecting the constant s0

• At low expression levels, variance in d(i) can be high because of small values of s(i).

• To stabilize the variance of d(i) across genes, a small positive constant s0 was used in the denominator of the test statistic.

• “The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s0 was chosen to minimize the coefficient of variation.”

• s0 was chosen to be 3.3 for the ionizing radiation data.


More Detail on Selecting s0

• The d(i) are separated into approximately 100 groups. The 1% of the d(i) values with the smallest s(i) values are placed in the first group, the 1% of the d(i) values with the next smallest s(i) are placed in the second group, etc.

• The median absolute deviation (MAD) of the d(i) values is computed separately for each group.

• The coefficient of variation (CV) of these 100 MAD values is computed.


More Detail on Selecting s0 (continued)

• This process is repeated for values of s0 equal to the minimum of s(i) over i, the 5th percentile of the s(i) values, the 10th percentile of the s(i) values,..., the 95th percentile of the s(i) values, and the maximum of the s(i) values.

• The value of s0 that minimizes the CV of the 100 MAD values over candidate s0 described above is selected as the constant s0.


Balancing the Permutations

• There are differences between the two cell lines.

• Balanced permutations minimize the effects of these differences.

A permutation is balanced if each group of four experiments contains two experiments from line 1 and two from line 2.

There are 36 balanced permutations.
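The count of 36 can be confirmed by enumeration: choosing 2 of the 4 chips from line 1 and 2 of the 4 chips from line 2 for the first group gives C(4,2) × C(4,2) = 6 × 6 = 36 splits. The sketch below uses the sample labels from the experiment; `balanced_permutations` is a hypothetical helper name.

```python
from itertools import combinations

line1 = ["U1A", "U1B", "I1A", "I1B"]
line2 = ["U2A", "U2B", "I2A", "I2B"]

def balanced_permutations(line1, line2):
    """List every split into two groups of four with two chips per cell line."""
    splits = []
    for from_line1 in combinations(line1, 2):
        for from_line2 in combinations(line2, 2):
            group_a = set(from_line1) | set(from_line2)
            group_b = set(line1 + line2) - group_a
            splits.append((sorted(group_a), sorted(group_b)))
    return splits

all_splits = balanced_permutations(line1, line2)
number_of_splits = len(all_splits)
```

The true labeling (all I chips in one group, all U chips in the other) is one of these splits.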


Example Permutations

                     Treatment
Human cell line      Irradiated (I)    Unirradiated (U)
1                    I1A  I1B          U1A  U1B
2                    I2A  I2B          U2A  U2B


• Scatter plots of relative difference in gene expression d(i) vs. gene-specific scatter s(i).


A Permutation Procedure for Assessing Significance

1. The irradiated and unirradiated GeneChips were shuffled within each cell line.

2. The d(i) statistic was computed for each gene and ordered across genes from smallest to largest to obtain d1(1) < d1(2) < ... < d1(g), where g denotes the number of genes.

3. Steps 1 and 2 were repeated for all possible data permutations described in step 1 to obtain dp(1) < dp(2) < ... < dp(g) for p = 1,...,36.


A Permutation Procedure for Assessing Significance (continued)

4. For each i, d1(i),...,d36(i) were averaged to obtain dE(i), the “expected relative difference.”

5. The original d(i) statistics were also sorted so that d(1) < d(2) < ... < d(g).

6. Genes for which | d(i) – dE(i) | > Δ were declared significant, where Δ is a user-specified cutoff for significance.
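Steps 2 to 6 can be sketched as follows, assuming the d statistics have already been computed for the original data (`d_obs`) and for each permutation (`d_perm`, one list per permutation). The function names are hypothetical, and the rule implemented is exactly the one stated in step 6.

```python
def expected_relative_difference(d_perm):
    """Average the k-th smallest statistic over permutations to obtain dE."""
    sorted_rows = [sorted(row) for row in d_perm]
    g = len(sorted_rows[0])
    return [sum(row[k] for row in sorted_rows) / len(sorted_rows) for k in range(g)]

def significant_genes(d_obs, d_perm, delta):
    """Return indices of genes whose ordered d(i) differs from dE(i) by more than delta."""
    d_expected = expected_relative_difference(d_perm)
    # order[rank] is the index of the gene holding the rank-th smallest observed d.
    order = sorted(range(len(d_obs)), key=lambda i: d_obs[i])
    return [gene for rank, gene in enumerate(order)
            if abs(d_obs[gene] - d_expected[rank]) > delta]
```

With 36 balanced permutations, `d_perm` would hold 36 lists of g values each.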


Example


Plot of Observed vs. “Expected” Test Statistics

[Figure: observed d(i) plotted against expected dE(i). Labeled regions: points for genes with evidence of induction; points for genes with evidence of repression.]


Plot of d(i) vs. log10 s(i) for the Ionizing Radiation Data

[Figure: d(i) plotted against log10 s(i). Annotations: 24 induced genes; 22 repressed genes.]


Estimating FDR for a Selected Δ

1. Find the smallest d(i) among those d(i) for which d(i) – dE(i) > Δ and call it dup.

2. Find the largest d(i) among those d(i) for which d(i) – dE(i) < –Δ and call it ddown.

3. For each permuted data set, find the number of genes with d(i) >= dup or d(i) <= ddown and denote these counts by n1,...,n36.

4. FDR is estimated by n̄ / n, where n̄ is the average of n1,...,n36 and n is the number of genes identified as significant in the original data.
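The four steps can be sketched as one function. This is a minimal illustration, not the SAM software itself: `estimate_fdr` is a hypothetical name, `d_obs` holds the original statistics and `d_perm` one list of statistics per permuted data set, and n is taken to be the number of original genes with d(i) >= dup or d(i) <= ddown.

```python
def estimate_fdr(d_obs, d_perm, delta):
    """Estimate the FDR for cutoff delta; returns None if no gene is significant."""
    d_sorted = sorted(d_obs)
    perm_sorted = [sorted(row) for row in d_perm]
    g = len(d_sorted)
    # Expected relative difference: average of the ordered permuted statistics.
    d_exp = [sum(row[k] for row in perm_sorted) / len(perm_sorted) for k in range(g)]

    # Steps 1 and 2: asymmetric cutoffs dup and ddown.
    up = [d for d, e in zip(d_sorted, d_exp) if d - e > delta]
    down = [d for d, e in zip(d_sorted, d_exp) if d - e < -delta]
    d_up = min(up) if up else float("inf")
    d_down = max(down) if down else float("-inf")

    def beyond(values):
        return sum(1 for d in values if d >= d_up or d <= d_down)

    n_significant = beyond(d_sorted)
    if n_significant == 0:
        return None
    # Steps 3 and 4: average count over permutations divided by the original count.
    n_bar = sum(beyond(row) for row in d_perm) / len(d_perm)
    return n_bar / n_significant
```

Returning None when nothing is called significant avoids a division by zero; how to report that case is a design choice.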


FDR cont’d

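\[
\mathrm{FDR} = \frac{\dfrac{1}{36}\displaystyle\sum_{p=1}^{36} \#\{\, i \mid d_p(i) \ge t_1 \ \text{or}\ d_p(i) \le t_2 \,\}}{\#\{\, i \mid d(i) \ge t_1 \ \text{or}\ d(i) \le t_2 \,\}}
\]

Here t1 and t2 denote the upper and lower cutoffs (dup and ddown).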

• Note: Cutoffs are asymmetric


Counts of Genes beyond the Threshold For Each Permutation

Perm  Count    Perm  Count    Perm  Count
   1     45      13      4      25      4
   2      5      14      1      26      2
   3      2      15      3      27      1
   4      3      16      9      28      1
   5      4      17     12      29      5
   6     11      18     31      30      9
   7      8      19     31      31     11
   8      5      20     12      32      4
   9      1      21      9      33      3
  10      1      22      3      34      2
  11      3      23      1      35      5
  12      4      24      4      36     46


Mean Count = 8.472 FDR Estimate = 8.472/46 = 18.4%
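The arithmetic can be reproduced from the 36 per-permutation counts. The list below is transcribed from the Perm/Count table, and 46 is the number of genes called significant in the original data.

```python
# Counts of genes beyond the thresholds for permutations 1..36 (from the table).
counts = [45, 5, 2, 3, 4, 11, 8, 5, 1, 1, 3, 4,
          4, 1, 3, 9, 12, 31, 31, 12, 9, 3, 1, 4,
          4, 2, 1, 1, 5, 9, 11, 4, 3, 2, 5, 46]
n_significant = 46

mean_count = sum(counts) / len(counts)   # 305 / 36
fdr = mean_count / n_significant

print(f"Mean count   = {mean_count:.3f}")  # 8.472
print(f"FDR estimate = {fdr:.1%}")         # 18.4%
```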



How to choose Δ?

Omitting s0 caused higher FDR.


Plot of Observed vs. “Expected” Test Statistics

[Figure: observed d(i) plotted against expected dE(i), with thresholds marked at -4.073859 and 4.054688.]


Plot of d(i) vs. log10 s(i) for the Ionizing Radiation Data

[Figure: d(i) plotted against log10 s(i), with thresholds marked at -4.073859 and 4.054688.]


Same Plot for One of the Permuted Data Sets

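[Figure: d(i) plotted against log10 s(i) for one permuted data set, with the same thresholds at -4.073859 and 4.054688.]

Only 5 genes lie beyond the thresholds, compared to 46 for the original data.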


SAM vs. R fold

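• R-fold Method:

• Gene i is significant if r(i) > R or r(i) < 1/R, where

\[
r(i) = \frac{\bar{x}_I(i)}{\bar{x}_U(i)}
\]

is the ratio of the average expression of gene i in the irradiated (I) and unirradiated (U) states. FDR 73%-84%: unacceptable.

• Pairwise fold change: at least 12 out of 16 pairings must satisfy the criteria. FDR 60%-71%: unacceptable.

Why doesn’t it work?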
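For reference, the R-fold rule itself is a one-line test per gene. The sketch below assumes r(i) is the ratio of the mean irradiated to the mean unirradiated expression level, and the function name and the default R = 2 are illustrative choices only.

```python
from statistics import mean

def r_fold(x_irradiated, x_unirradiated, R=2.0):
    """Return (r, significant) for one gene under the R-fold rule."""
    r = mean(x_irradiated) / mean(x_unirradiated)
    return r, (r > R or r < 1.0 / R)
```

Note that the rule uses only the ratio of means: the gene-specific scatter s(i) does not enter it.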


Fold-change, SAM- Validation


SAM vs. Multiple t-Tests

• Multiple t-tests try to control the FDR or the FWER (family-wise error rate).

• Why doesn’t it work?
  – FWER: too stringent (Bonferroni, Westfall and Young)
  – FDR: too granular (Benjamini and Hochberg)

• SAM does not assume a normal distribution of the data.

• SAM works effectively even with small sample sizes.
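To make the two error-rate corrections concrete, here is a minimal sketch of the Bonferroni rule (FWER) and the Benjamini and Hochberg step-up rule (FDR) applied to a list of per-gene p-values. The function names and the default alpha = 0.05 are illustrative.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject gene i if p_i <= alpha / m (controls the FWER)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, k = max rank with p_(rank) <= alpha * rank / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject
```

With thousands of genes, alpha / m becomes very small, which is the sense in which the FWER bound is stringent.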


Conclusion SAM

• SAM is a method for identifying genes on a microarray with statistically significant changes in expression.

• SAM provides an estimate of the FDR for each value of the tuning parameter. The estimated FDR is computed from permutations of the data.

• SAM can be generalized to other types of experiments and outcomes by redefining d(i).

• http://www-stat-class.stanford.edu/SAM/SAMServlet.


ANOVA

• The t-test and its variants only work when there are two sample pools.

• Analysis of variance (ANOVA) is a general technique for handling multiple variables, with replicates.

• A tutorial is available here: http://cran.at.r-project.org/doc/contrib/Faraway-PRA.pdf


A simple experiment

• Measure response to a drug treatment in two different mouse strains.

• Repeat each measurement five times.

• Total experiment = 2 strains * 2 treatments * 5 repetitions = 20 arrays

• If you look for treatment effects using a t-test, then you ignore the strain effects.


ANOVA lingo

• Factor: a variable that is under the control of the experimenter (strain, treatment).

• Level: a possible value of a factor (drug, no drug).

• Main effect: an effect that involves only one factor.

• Interaction effect: an effect that involves two or more factors simultaneously.

• Balanced design: an experiment in which each factor and level is measured an equal number of times.


Two-factor design


Fixed and random effects

• Fixed effect: a factor for which the levels would be repeated exactly if the experiment were repeated.

• Random effect: a term for which the levels would not repeat in a replicated experiment.

• In the simple experiment, treatment and strain are fixed effects, and we include a random effect to account for biological and experimental variability.


ANOVA model

\[
E_{ijk} = \mu + T_i + S_j + (TS)_{ij} + \varepsilon_{ijk}, \qquad i = 1,\dots,n, \quad j = 1,\dots,m, \quad k = 1,\dots,p
\]

• μ is the mean expression level of the gene.

• T and S are main effects (treatment, strain) with n and m levels, respectively.

• TS is an interaction effect.

• p is the number of replicates per group.

• ε represents random error (to be minimized).


ANOVA steps

• For each gene on the array
  – Fit the parameters T and S, minimizing ε.
  – Test T, S and TS for difference from zero, yielding three F statistics.
  – Convert the F statistics into p-values.
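For a balanced two-factor design, the three F statistics for one gene follow from sums of squares. The sketch below is a plain-Python illustration under that assumption: `two_way_anova` is a hypothetical name, and `data[i][j]` holds the p replicate measurements for treatment level i and strain level j. Converting each F statistic to a p-value requires the upper tail of the F distribution with the returned degrees of freedom (for example `scipy.stats.f.sf`), which is left out here to keep the sketch dependency-free.

```python
from statistics import mean

def two_way_anova(data):
    """Return {effect: (F, df_effect, df_error)} for T, S and TS (balanced design)."""
    n, m, p = len(data), len(data[0]), len(data[0][0])
    grand = mean(x for row in data for cell in row for x in cell)
    t_means = [mean(x for cell in data[i] for x in cell) for i in range(n)]
    s_means = [mean(x for i in range(n) for x in data[i][j]) for j in range(m)]
    cell_means = [[mean(data[i][j]) for j in range(m)] for i in range(n)]

    # Sums of squares for the main effects, the interaction and the error term.
    ss_t = m * p * sum((t - grand) ** 2 for t in t_means)
    ss_s = n * p * sum((s - grand) ** 2 for s in s_means)
    ss_ts = p * sum((cell_means[i][j] - t_means[i] - s_means[j] + grand) ** 2
                    for i in range(n) for j in range(m))
    ss_err = sum((x - cell_means[i][j]) ** 2
                 for i in range(n) for j in range(m) for x in data[i][j])

    df_t, df_s = n - 1, m - 1
    df_ts, df_err = (n - 1) * (m - 1), n * m * (p - 1)
    ms_err = ss_err / df_err
    return {
        "T": (ss_t / df_t / ms_err, df_t, df_err),
        "S": (ss_s / df_s / ms_err, df_s, df_err),
        "TS": (ss_ts / df_ts / ms_err, df_ts, df_err),
    }
```

For the simple experiment (2 treatments, 2 strains, 5 replicates), `data` would be a 2 x 2 grid of 5-element lists, giving 1 degree of freedom for each effect and 16 for the error.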


ANOVA assumptions

• For a given gene, the random error terms are independent, normally distributed and have uniform variance.

• The main effects and their interactions are linear.


Summary

• Individual measurements from microarray experiments are not trustworthy.

• Repetition or independent verification (e.g., RT-PCR) are the best means of verification.

• For simple designs, use Welch’s approximation of the t-test.

• For complex designs, use ANOVA.

• Correct for multiple comparisons using FDR and q-values.
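As a reminder of what Welch's approximation computes for a two-group comparison, here is a minimal sketch of the statistic and its approximate degrees of freedom (Welch-Satterthwaite). The function name is illustrative, and the p-value lookup in the t distribution is omitted.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for two samples without assuming equal variances."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```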


Bioconductor

• Bioconductor is an open source project to design and provide high quality software and documentation for bioinformatics.

• Current focus: microarrays and gene (transcript) annotation.

• Most of the early developments are in the form of R packages.

• Open to (your?) contributions.

• Software and documentation are available from www.bioconductor.org.


Bioconductor packages

• General infrastructure
  – Biobase
  – annotate, AnnBuilder
  – tkWidgets

• Pre-processing for Affymetrix data
  – affy

• Pre-processing for cDNA data
  – marrayClasses, marrayInput, marrayNorm, marrayPlots

• Differential expression
  – edd, genefilter, multtest, ROC

• etc.