Statistical Basics for Micro Array Studies

8/3/2019 Statistical Basics for Micro Array Studies

1/81

Statistical Basics forMicroarray Studies

Naomi AltmanPenn State University

Dept. of Statistics

Bioinformatics Consulting Center

[email protected]

Sept. 12/06
mailto:[email protected]:[email protected]


2/81


3/81

Data Used for this Talk

Data:Golub et al (1999) demonstrated that microarray gene expression

could be used to distinguish among human leukemia types:

acute myeloid leukemia (AML) and acute lymphoblasticleukemia (ALL).

There are 72 Affymetrix Hgu6800 microarrays (47 ALL, 25 AML),

with data on the expression of 7129 genes.The data were preprocessed to reduce array to array variability.

The entire dataset is available from www.bioconductor.org.We use a subset that is available in R when you load the affy

package.


4/81

Data Used for this Talk

Data: (e.g. 5 probe-sets, 19 randomly selectedcases from golubMergeSub in R)

ALL ALL ALL ALL ALL AML AML AML AML ALL ALL ALL ALL AML ALL AML AML ALL ALL

L37378_at 176 160 139 203 157 15 245 225 170 88 129 353 185 398 184 295 21 135 1585

L37792_at 7 10 14 23 0 3 34 23 22 33 -27 35 -5 18 36 34 6 -12 29

L37868_s_at 284 135 183 212 185 -61 319 182 185 281 68 -132 201 652 132 362 380 230 -382

L37882_at 251 514 441 571 230 516 342 234 183 134 271 439 351 1123 574 53 141 385 258

L37936_at 31 27 22 34 -23 287 6 8 42 84 20 16 26 50 8 14 1 -2 33

It usually pays to have a quick look at the numericalvalues of your data. e.g. in this data set, we havesome negative expression values (which is not

biologically supportable).


5/81

Statistical Software

Any standard statistical package (and probably Excel)can do the plots and analyses I will show.

I will use R because: Free from www.bioconductor.org

Available: Windows, Mac, Unix, Linux Programmable Great graphics

Bioconductor Project for genomics Lots of resources (R + Splus communities)

The learning curve for R is steep.
http://www.bioconductor.org/http://www.bioconductor.org/


6/81

What is statistics?

Statistical methods are used to:

Describe what was observed.

Make inference to what was not observed.

Determine how to collect the dataappropriately.


7/81

Description versus inference

Statistical methods are used both to:

Describe what was observed:

requires data

Make inference to what was not observed:

requires data collected using statistical

methods such as randomization andreplication.


8/81

Describing the Data

Graphically

Histogram

Density Plot Boxplot

Scatterplot

Numerically

Quantiles Averages

Variance, Standard Deviation

Correlation


9/81

Some R commands

# Load software and data

# get data

# see what is in the data

# extract expression values

# extract variables describing# samples

library(affy)

data(golubMergeSub)golubMergeSub

gExprs=exprs(golubMergeSub)

gPheno=pData(golubMergeSub)

gExprs[1:5,]

gPheno[1:5,]


10/81

Looking at the data:

Here are 4 of the genes,aggregated over the 72 arrays.

hist(gExprs[i,])

The "shapes" of the distributionsvary.

Most of the data for L38500_at arenegative.

We have combined ALL and AMLdata.


11/81

Looking at the data:Here are 2 of the genes, aggregated overthe 72 arrays with histograms organizedby cancer type

hist(gExprs[i,,gPheno$ALL.AML=="ALL"])

There are some obvious features, (e.g.outlier) but it is hard to compare the AMLand ALL samples because the x-axesdiffer and even the bin widths are not the

same.


12/81

Looking at the data:

for (i in c(1,11)) {low=min(golubExprs[i,])-1high=max(golubExprs[i,])+1b=seq(low,high,length.out=20)hist(golubExprs[i,ALL.AML=="ALL"],

main=paste(rownames(golubExprs)[i],"ALL"),xlab="Expression",breaks=b)hist(golubExprs[i,ALL.AML=="AML"],main=paste(rownames(golubExprs)[i],"AML"),xlab="Expression",breaks=b)}

Here we have used the same x-axis forboth cancer types (but varying by gene).

We can see that the AML patients tend tobe higher on L38503, but there is notmuch difference for L37378


13/81

Boxplot

The red line is themedian, the blue are the

the 75th quantiles.

A boxplot is another

summary thatemphasizes the central50% of the data and theoutliers and is extremelyuseful for comparisons.


14/81

Boxplot

The "whiskers" on the plot extend tothe furthest data value that is no more

than 1.5 boxwidths from the edge ofthe box. Any values beyond aredenoted by a circle.

We can see immediately that L38503is differentially expressed.

As well, L37868_s is more variable inthe AML group, which is not useful fordetecting AML, but may have somebiological meaning.

Statisticians love to use


15/81

Statisticians love to uselogarithms

Because:

1. Right skewed data become more symmetric.2. If the variance of the data increases with the size, logs

equalize variance.

3. Log(x/y)=Log(x)-Log(y)

Note that Logb(x) = Logc(b) Logc(x) so we statisticiansdon't really care what base.

In microarray analysis, many people use Log2.

In these units a "2-fold" difference (x/y=2) is Log(x/y)=1.


16/81

Why we use log2(expression)In this particular dataset,many of the genes did nothave skewed distributions,and the spread in both

cancers was about thesame, but usually we useLog for all or none of thegenes.


17/81

Scatterplots

Scatterplots give us the

opportunity to plot 2quantities and colorcoding adds information.

We can clearly see thateven using both genes,

we cannot completelyseparate the AML andALL patients

plot


18/81

Scatterplots

Scatterplots are the most

useful tool for qualitycontrol. I always ploteach array against otherarrays. Even when thearrays come fromdifferent conditions, weshould get a diagonal

"cloud".

Note array X5523 whichwas a dud. (These arefrom a local project.)


19/81

Golub Data


20/81

LO(W)ESS Trend Curves

It is sometimes useful to

add a trend line to ascatterplot (and this isan integral part of the

"normalization"process.

http://www.stat.unc.edu/faculty/marron/marron_movies.html


21/81

A plot is worth 50,000 p-values

Most people are far to eager to get to a "p-value"

Look at the data table Plot everything you can


22/81

Numerical Summaries

Measuring the Center

Mean=average=

Median=50th quantile

Mode=highest point on histogram or density

Geometric mean=antilog(mean(log(data))

Y


23/81

Numerical Summaries

Measuring the Spread

Variance= average squared deviation from the mean

Standard deviation=

Interquartile range=length of boxplot box=75th quantile 25th quantile

var1

)( 2

=

n

YYi


24/81

Numerical SummariesMeasuring Linearity

Correlation: -1 r 1If r=0 there is no linear relationship

If r>0, the relationship is increasing.

If r


25/81

Populations and StatisticalInference

What do we mean when we state:

Gene X expresses (or differentially expresses)?

Sepals are more similar to petals than to leaves?

Populations and Statistical


26/81

Populations and StatisticalInference

What do we mean when we state:

Gene X expresses (or differentially expresses)?

Sepals are more similar to petals than to leaves?

The population is the general class of organismsfor which we want to make an inferentialstatement.



27/81


Inference

If we could observe the entire population with nomeasurement error, we would not need

inferential statistics.

Alternatively, if all the members of the populationare identical and there is no measurementerror, we would not need inferential statistics.



28/81


InferenceIf we could observe the entire population with nomeasurement error, we would not need

inferential statistics.Alternatively, if all the members of the population

are identical and there is no measurementerror, we would not need inferential statistics.

Statistical inference is used to make a statementabout the population in the presence of errorbecause we observe only a sample from a

varying population and there is measurementerror.

Samples and replication


29/81

Samples and replication

A sample is the set of observed items fromthe population.

Generally, for statistical inference we need

information about how variable theseitems are.

If there is also measurement error, we mightwant to make multiple measurements on

the same individual, because averagingthese reduces the measurement error.

Hypotheses


30/81

Hypotheses

Classical statistical tests are based on comparing 2 hypotheses

H0: the null hypothesis (which we hope to disprove)

HA: the alternative hypothesis (which we accept if there isevidence to reject the null hypothesis)

e.g.

H0: the gene does not express

HA: the gene expresses

H0: the gene does not differentially express

HA: the gene differentially expresses


31/81

Classical hypothesis testingHypotheses are statements about the population:

e.g.HA: the gene expresses in leaf guard cells

Means that the average expression level oversome population of leaf guard cells is positive.

Since there is both measurement error and

biological variability from leaf to leaf, plant toplant, etc., we need to:

What goes into a hypothesis


32/81

What goes into a hypothesis

test?1. Determine the population of cells, leaves,

plants for which we want to make theinference.

2. Ensure that the sample represents thepopulation of interest, not just one member of

the population.

This has 2 implications:

Essential data for hypothesis


33/81

Essential data for hypothesis

testingWe need to sample several individuals from our

population.

We need to estimate the biological variability ofour population (using the sample).

Sampling should be done at random.

Tests about the population


34/81


average.The usual tests require that the data areindependent and have the samevariability.

Same variability take logarithms

Independent dependence is induced by

taking multiple measurements of the same

object - e.g. organism, RNA sample,microarray

Tests of one sample mean or


35/81

Tests of one sample mean or

median t-test

Wilcoxon test Permutation and bootstrap tests



36/81

p paverage.

Hypotheses about expression levels are usually treatedas hypotheses about population averages (t-test) or

population medians (Wilcoxon test).

We want to make a guessabout the value of the redline, but we do not get tomeasure the entire

population.



37/81

p paverage.

Hypotheses about expression levels are usually treatedas hypotheses about population averages (t-test) orpopulation medians (Wilcoxon test).

We want to make a guessabout the value of the red line,

but we do not get to measurethe entire population.

We estimate the parameterfrom the data and then see ifit is too far from the

hypothesized value to beplausible


38/81


39/81

Classical testing is basedon the distribution of thetest statistic when the null

hypothesis is true.

This depends on theoriginal population, butmay be the sampleaverage, sample "t" value,etc.

The distribution of the test

statistic is computedthrough statistical theory orsimulation, or by"resampling" from the data(bootstrap and permutation

methods).

2 i l id f t t


40/81

2 simple ideas for tests

If the population mean is , then the sample

mean should be close to .t-test, permutation test, bootstrap test

If the population median is , then about50% of the observations should be above and 50% below.

Sign and Wilcoxon test

Some Questions of Interest


41/81


Does genei express under this condition?

Does the expression level of genei differ under

these 2 (or more) conditions?

Do genei and genej have similar patterns ofexpression under several conditions



42/81


Does genei express under this condition?

H0: i= 0 HA: i> 0

Since we measure intensity, which is always

non-negative, we cannot test this. We use a

small constant (how small?) and do a one-

sided test.

This is often considered to be inaccurate - PCRis often used to check at least some of theconclusions



43/81


Does the expression level of genei differ underthese 2 (or more) conditions?

H0: ij=ik HA: ijik

This is considered the most appropriate setting

for microarray data

t t f


44/81

tests of means:

1. Is the population mean 0? (2-sided)

, (1-sided).

2. Is there a difference in populationmeans? (usually 2-sided)

1.5=2.5 Is the mean difference equal 0?(usually 2-sided)

S ti


45/81

Some necessary assumptions

1. The observations are independent.

2. The observations are a random samplefrom a known population.

3. If there are 2 populations, the populationhave the same shape.

t test


46/81

t-test

RNA was spiked into the sample sothat the expression level for the oligo

should have been 24. Was this levelachieved?

0: =0

: 0

nS

Yt

Y/

* 0=

We obtain the p-valuefrom a table or our

software (d.f.=n-1)

Classical Inference: P-values


47/81

We assume that the NULL distribution is true.The test statistic then has a distribution that is

either known or can be simulated under thisassumption.

The p-value is the probability of observing a valueof the test statistic more extreme than theobserved value when Ho is true.

The p-value can be thought of as a measurement

of "surprise". If small, we reject Ho.



48/81




The test is 2-sided if the null hypothesis proposesequality of 2 values. Then we consider theabsolute value of the test statistic.



49/81




The test is 1-sided if the null hypothesis proposes

or

t test


50/81

t-test

When the data come from a distribution that

is approximately symmetricaland are independent

and the null hypothesis is true

nS

Yt

Y /* 0

=

has a sampling distribution that iscalled Student's t distribution on n-

1 degrees of freedom.The degrees of freedom are ameasure of how often outlying

values are observed

t distribution


51/81

t distribution

The t-test is the usualexample.

We compute a value oft from the data (formulato follow.)

If the null hypothesis is

true, the distribution isgiven by one of thecurves on the plot.

The vertical lines aredrawn so the 2.5% ofthe distribution is moreextreme than the line.

observed t

t distribution


52/81

t distribution

P-valuefor

0

observed t

Fail to reject: this is atypical value if 0


53/81

t distribution


54/81

t distribution

P-valuefor

= 0

Reject: this is not atypical value if 0

observed t

Confidence Interval


55/81

Confidence Interval

A 1- confidence interval for the mean is all

the hypothesized values that would not berejected by doing the 2-sided test at level.

Looking at the test statistic, we can readilysee that this is

nstYn

/1,2/1

t-test


56/81

t testExample:

Ho: = 4Sample average=3.6 n=49Sample variance=0.041

One Sample t-testdata: R.gene1.array1

t = -13.5428, df = 48, p-value < 2.2e-16alternative hypothesis: true mean is not equal to 4

What is required for a one


57/81

sample t-test?The data should be independent. (This isessential.)

The data histogram should be symmetric andunimodal (i.e. the data should be "bell-shaped,

although normality is not essential).

Testing the difference in means


58/81

Testing the difference in means

If X1 Xn and Y1 Ym are independent

samples, another form of the t-test can be usedto test if there is a difference in the mean. (e.g.to test differential expression)

0: X = Y

2-sample t-test


59/81

We now test:

H0: difference in populations averages is 0

HA: difference in population averages is not 0

m

S

n

S

YXt

YX

22*

+

=

Again we use the t-table, with d.f. n+m-2

2-sample t-test


60/81

e.g. Do genes L37868 and L38503 differentially express betweenALL and AML patients (a test for each gene).

Note that these were independent arrays for each patient.

t.test(log(gExprs[1,],2)~gPheno$ALL.AML)

L37868t = -0.4401, df = 64.69, p-value = 0.6613alternative hypothesis: true difference in means is not equal to 095 percent confidence interval: -0.9250874 0.5910186sample estimates:mean in group ALL 7.283753 mean in group AML 7.450787

L38503t = -4.9775, df = 64.913, p-value = 5.021e-06alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval: -1.8834056 -0.8047853sample estimates:mean in group ALL 10.03113 mean in group AML 11.37522

Some notes:


61/81

Some notes:

Notes:

1. Statistical significance is not the same aspractical significance.

2. There is an alternative form of the

denominator of the test if we assume that 2populations have the same spread.

3. We need to be careful about theindependence of the samples.

Another example:


62/81

A muscle tissue sample was taken from healthy volunteers. Theywere then infused with insulin and another sample was taken.

The 2 samples from each volunteer were hybridized to a 2-channel microarray

H0: gene expression is the same before and after infusion

HA: the gene differentially expresses

This looks like the two-sample t-test, but some of the samples are

not independent becausea) they are from the same subject

b) they are hybridized to the same array

Pairs of Dependent Samples


63/81

There is a simple solution:Combine the dependent samples.

In this case, the relevant quantity is thelog(ratio)=M.

For the M-values, we have hypotheses:

H0: =0 HA: 0

We do a 1-sample t-test using M. This issometimes called the aired t-test.

Here are the results for 4 of the genes:

t = 5 1619 df = 5 p value = 0 003579


64/81

t = -5.1619, df = 5, p-value = 0.003579

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval: -1.5292 -0.5124

sample estimates: mean of x -1.020839

t = 10.095, df = 5, p-value = 0.0001634


95 percent confidence interval: 0.7928 1.3345sample estimates: mean of x 1.063659

Here are the results for 4 of the genes:

t = 4 9325 df = 5 p value = 0 00435


65/81

t = -4.9325, df = 5, p-value = 0.00435


95 percent confidence interval: -2.124 -0.6686

sample estimates: mean of x -1.396339

t = -1.0535, df = 5, p-value = 0.3403


95 percent confidence interval: -0.752 0.315sample estimates: mean of x -0.2188503

Alternatives to the t-test


66/81

The t-test is not the only choice for testing

hypotheses about average expression levels.

These methods require equal variance andindependence, but not bell-shaped

Permutation test

Bootstrap test

Wilcoxon test

Permutation test (2-sample)


67/81

Compute t* as usual, but do not use t table.

Instead: Make one big sample X1XnY1Ym

For all possible choices of selection n items

(or a sample, if this is too many)

Select n observations at random to be the Xs;the remainder are the Ys.

Compute t* for each selection.

Permutation test (2-sample)


68/81

( p )

The p-value is based onwhere t* is in the

histogram of T*

Multiply the tailprobability by 2 for a 2-sided test

Bootstrap test (2-sample)


69/81

( )

Compute t* as usual, but do not use t-table.

Instead: Make one big sample X1XnY1Ym

For all possible choices of selection n items withreplacement to be the Xs and m items withreplacement to be the Ys.

Proceed as for the permutation test.

The Wilcoxon test (2-sample)


70/81

aka Mann-Whitney testMake one big sample X1XnY1Ym

Rank the observations from lowest (1)

to highest (n+m)

W*= rank(Xs) rank(Ys)

The p-value is computed by the permutation test,

(and these are available in tables based on nand m).

Note:


71/81

There are 1-sample and paired sample versionsof these tests.

The Wilcoxon test is often used instead of the t-test. The Wilcoxon p-value is usually very close

to the t-test p-value when the t-test isappropriate, and is more accurate when the t-

test is not appropriate. On the other hand, theWilcoxon test does not readily generalize to themore powerful analyses based on Bayesian

methods.

Power, False Discovery,


72/81

NondiscoveryAccept Reject total

H0 true U V m0H0 false T S m-m0

total m-R R m

Correct

Mistake

Power probability of rejecting when H0 false

FDR =V/R

FNR = T/(m-R)



73/81

Nondiscovery

nS

Yt

Y /

* 0

=The p-value is the probability ofrejecting the null hypothesis when it is

true.Under the alternative:

We can see that t* will be larger when the difference between the means is

larger, the variance is smaller or n islarger.

The only item under our control is thesample size.



74/81

Nondiscovery

nS

Yt

Y /

* 0

=For fixed p-value at which we declarestatistical significance, increasing the

sample size:Increases power

Reduces FDRReduces FNR



75/81

Nondiscovery

nS

Yt

Y /

* 0

=For fixed FDR at which we declarestatistical significance, increasing the

sample size:Reduces FNR

The sample size in most microarray studies is too

small.This leads to large FNR which needs to beconsidered when known genes are not found to bestatistically significant.

Improving Power


76/81

Besides increasing the sample size we can:

1. Pick the best tests. (Generally t-test andWilcoxon are more powerful thanresampling when appropriate.)

2. Use (Empirical) Bayes tests (SAM,

LIMMA, )

Bayes Methods


77/81

We have lots of genes.

Gene i has mean i and varianceBayesian methods assume that the means and

variances come from known distributions (thepriors).

Empirical Bayes methods assume that the means

and variances have distributions that are estimatedfrom the data.

2

i

Empirical Bayes Methods


78/81

Most focus is on the distribution of thevariances.

SAM add a constant based on a quantileof the distribution of the S2 over all thegenes. (SAM also uses permutationtests.)

LIMMA more formal eBayes analysis

- results in replacing gene variancesby a weighted average of the genevariance and the mean variance of all

genes.

m

S

n

S

YXt

YX

22*

+

=

True Bayes


79/81

Combine the prior information with the data.

Instead of p-values, we obtain the posteriorodds ratio of the hypothesis being true.

Increasing power


80/81

Classical methods are known to be lesspowerful than eBayes and Bayes methods.

We will read some papers about this.

Multiple comparisons problemAccept Reject total


81/81

The problem with p-valuesis that:

If we declare significancewhen P

Statistical Basics for Micro Array Studies

Documents

Transcript of Statistical Basics for Micro Array Studies