Statistical Basics for Micro Array Studies

download Statistical Basics for Micro Array Studies

of 81

Transcript of Statistical Basics for Micro Array Studies

  • 8/3/2019 Statistical Basics for Micro Array Studies

    1/81

    Statistical Basics forMicroarray Studies

    Naomi AltmanPenn State University

    Dept. of Statistics

    Bioinformatics Consulting Center

    [email protected]

    Sept. 12/06

    mailto:[email protected]:[email protected]
  • 8/3/2019 Statistical Basics for Micro Array Studies

    2/81

  • 8/3/2019 Statistical Basics for Micro Array Studies

    3/81

    Data Used for this Talk

    Data:Golub et al (1999) demonstrated that microarray gene expression

    could be used to distinguish among human leukemia types:

    acute myeloid leukemia (AML) and acute lymphoblasticleukemia (ALL).

    There are 72 Affymetrix Hgu6800 microarrays (47 ALL, 25 AML),

    with data on the expression of 7129 genes.The data were preprocessed to reduce array to array variability.

    The entire dataset is available from www.bioconductor.org.We use a subset that is available in R when you load the affy

    package.

  • 8/3/2019 Statistical Basics for Micro Array Studies

    4/81

    Data Used for this Talk

    Data: (e.g. 5 probe-sets, 19 randomly selectedcases from golubMergeSub in R)

    ALL ALL ALL ALL ALL AML AML AML AML ALL ALL ALL ALL AML ALL AML AML ALL ALL

    L37378_at 176 160 139 203 157 15 245 225 170 88 129 353 185 398 184 295 21 135 1585

    L37792_at 7 10 14 23 0 3 34 23 22 33 -27 35 -5 18 36 34 6 -12 29

    L37868_s_at 284 135 183 212 185 -61 319 182 185 281 68 -132 201 652 132 362 380 230 -382

    L37882_at 251 514 441 571 230 516 342 234 183 134 271 439 351 1123 574 53 141 385 258

    L37936_at 31 27 22 34 -23 287 6 8 42 84 20 16 26 50 8 14 1 -2 33

    It usually pays to have a quick look at the numericalvalues of your data. e.g. in this data set, we havesome negative expression values (which is not

    biologically supportable).

  • 8/3/2019 Statistical Basics for Micro Array Studies

    5/81

    Statistical Software

    Any standard statistical package (and probably Excel)can do the plots and analyses I will show.

    I will use R because: Free from www.bioconductor.org

    Available: Windows, Mac, Unix, Linux Programmable Great graphics

    Bioconductor Project for genomics Lots of resources (R + Splus communities)

    The learning curve for R is steep.

    http://www.bioconductor.org/http://www.bioconductor.org/
  • 8/3/2019 Statistical Basics for Micro Array Studies

    6/81

    What is statistics?

    Statistical methods are used to:

    Describe what was observed.

    Make inference to what was not observed.

    Determine how to collect the dataappropriately.

  • 8/3/2019 Statistical Basics for Micro Array Studies

    7/81

    Description versus inference

    Statistical methods are used both to:

    Describe what was observed:

    requires data

    Make inference to what was not observed:

    requires data collected using statistical

    methods such as randomization andreplication.

  • 8/3/2019 Statistical Basics for Micro Array Studies

    8/81

    Describing the Data

    Graphically

    Histogram

    Density Plot Boxplot

    Scatterplot

    Numerically

    Quantiles Averages

    Variance, Standard Deviation

    Correlation

  • 8/3/2019 Statistical Basics for Micro Array Studies

    9/81

    Some R commands

    # Load software and data

    # get data

    # see what is in the data

    # extract expression values

    # extract variables describing# samples

    library(affy)

    data(golubMergeSub)golubMergeSub

    gExprs=exprs(golubMergeSub)

    gPheno=pData(golubMergeSub)

    gExprs[1:5,]

    gPheno[1:5,]

  • 8/3/2019 Statistical Basics for Micro Array Studies

    10/81

    Looking at the data:

    Here are 4 of the genes,aggregated over the 72 arrays.

    hist(gExprs[i,])

    The "shapes" of the distributionsvary.

    Most of the data for L38500_at arenegative.

    We have combined ALL and AMLdata.

  • 8/3/2019 Statistical Basics for Micro Array Studies

    11/81

    Looking at the data:Here are 2 of the genes, aggregated overthe 72 arrays with histograms organizedby cancer type

    hist(gExprs[i,,gPheno$ALL.AML=="ALL"])

    There are some obvious features, (e.g.outlier) but it is hard to compare the AMLand ALL samples because the x-axesdiffer and even the bin widths are not the

    same.

  • 8/3/2019 Statistical Basics for Micro Array Studies

    12/81

    Looking at the data:

    for (i in c(1,11)) {low=min(golubExprs[i,])-1high=max(golubExprs[i,])+1b=seq(low,high,length.out=20)hist(golubExprs[i,ALL.AML=="ALL"],

    main=paste(rownames(golubExprs)[i],"ALL"),xlab="Expression",breaks=b)hist(golubExprs[i,ALL.AML=="AML"],main=paste(rownames(golubExprs)[i],"AML"),xlab="Expression",breaks=b)}

    Here we have used the same x-axis forboth cancer types (but varying by gene).

    We can see that the AML patients tend tobe higher on L38503, but there is notmuch difference for L37378

  • 8/3/2019 Statistical Basics for Micro Array Studies

    13/81

    Boxplot

    The red line is themedian, the blue are the

    the 75th quantiles.

    A boxplot is another

    summary thatemphasizes the central50% of the data and theoutliers and is extremelyuseful for comparisons.

  • 8/3/2019 Statistical Basics for Micro Array Studies

    14/81

    Boxplot

    The "whiskers" on the plot extend tothe furthest data value that is no more

    than 1.5 boxwidths from the edge ofthe box. Any values beyond aredenoted by a circle.

    We can see immediately that L38503is differentially expressed.

    As well, L37868_s is more variable inthe AML group, which is not useful fordetecting AML, but may have somebiological meaning.

    Statisticians love to use

  • 8/3/2019 Statistical Basics for Micro Array Studies

    15/81

    Statisticians love to uselogarithms

    Because:

    1. Right skewed data become more symmetric.2. If the variance of the data increases with the size, logs

    equalize variance.

    3. Log(x/y)=Log(x)-Log(y)

    Note that Logb(x) = Logc(b) Logc(x) so we statisticiansdon't really care what base.

    In microarray analysis, many people use Log2.

    In these units a "2-fold" difference (x/y=2) is Log(x/y)=1.

  • 8/3/2019 Statistical Basics for Micro Array Studies

    16/81

    Why we use log2(expression)In this particular dataset,many of the genes did nothave skewed distributions,and the spread in both

    cancers was about thesame, but usually we useLog for all or none of thegenes.

  • 8/3/2019 Statistical Basics for Micro Array Studies

    17/81

    Scatterplots

    Scatterplots give us the

    opportunity to plot 2quantities and colorcoding adds information.

    We can clearly see thateven using both genes,

    we cannot completelyseparate the AML andALL patients

    plot

  • 8/3/2019 Statistical Basics for Micro Array Studies

    18/81

    Scatterplots

    Scatterplots are the most

    useful tool for qualitycontrol. I always ploteach array against otherarrays. Even when thearrays come fromdifferent conditions, weshould get a diagonal

    "cloud".

    Note array X5523 whichwas a dud. (These arefrom a local project.)

  • 8/3/2019 Statistical Basics for Micro Array Studies

    19/81

    Golub Data

  • 8/3/2019 Statistical Basics for Micro Array Studies

    20/81

    LO(W)ESS Trend Curves

    It is sometimes useful to

    add a trend line to ascatterplot (and this isan integral part of the

    "normalization"process.

    http://www.stat.unc.edu/faculty/marron/marron_movies.html

  • 8/3/2019 Statistical Basics for Micro Array Studies

    21/81

    A plot is worth 50,000 p-values

    Most people are far to eager to get to a "p-value"

    Look at the data table Plot everything you can

  • 8/3/2019 Statistical Basics for Micro Array Studies

    22/81

    Numerical Summaries

    Measuring the Center

    Mean=average=

    Median=50th quantile

    Mode=highest point on histogram or density

    Geometric mean=antilog(mean(log(data))

    Y

  • 8/3/2019 Statistical Basics for Micro Array Studies

    23/81

    Numerical Summaries

    Measuring the Spread

    Variance= average squared deviation from the mean

    Standard deviation=

    Interquartile range=length of boxplot box=75th quantile 25th quantile

    var1

    )( 2

    =

    n

    YYi

  • 8/3/2019 Statistical Basics for Micro Array Studies

    24/81

    Numerical SummariesMeasuring Linearity

    Correlation: -1 r 1If r=0 there is no linear relationship

    If r>0, the relationship is increasing.

    If r

  • 8/3/2019 Statistical Basics for Micro Array Studies

    25/81

    Populations and StatisticalInference

    What do we mean when we state:

    Gene X expresses (or differentially expresses)?

    Sepals are more similar to petals than to leaves?

    Populations and Statistical

  • 8/3/2019 Statistical Basics for Micro Array Studies

    26/81

    Populations and StatisticalInference

    What do we mean when we state:

    Gene X expresses (or differentially expresses)?

    Sepals are more similar to petals than to leaves?

    The population is the general class of organismsfor which we want to make an inferentialstatement.

    Populations and Statistical

  • 8/3/2019 Statistical Basics for Micro Array Studies

    27/81

    Populations and Statistical

    Inference

    If we could observe the entire population with nomeasurement error, we would not need

    inferential statistics.

    Alternatively, if all the members of the populationare identical and there is no measurementerror, we would not need inferential statistics.

    Populations and Statistical

  • 8/3/2019 Statistical Basics for Micro Array Studies

    28/81

    Populations and Statistical

    InferenceIf we could observe the entire population with nomeasurement error, we would not need

    inferential statistics.Alternatively, if all the members of the population

    are identical and there is no measurementerror, we would not need inferential statistics.

    Statistical inference is used to make a statementabout the population in the presence of errorbecause we observe only a sample from a

    varying population and there is measurementerror.

    Samples and replication

  • 8/3/2019 Statistical Basics for Micro Array Studies

    29/81

    Samples and replication

    A sample is the set of observed items fromthe population.

    Generally, for statistical inference we need

    information about how variable theseitems are.

    If there is also measurement error, we mightwant to make multiple measurements on

    the same individual, because averagingthese reduces the measurement error.

    Hypotheses

  • 8/3/2019 Statistical Basics for Micro Array Studies

    30/81

    Hypotheses

    Classical statistical tests are based on comparing 2 hypotheses

    H0: the null hypothesis (which we hope to disprove)

    HA: the alternative hypothesis (which we accept if there isevidence to reject the null hypothesis)

    e.g.

    H0: the gene does not express

    HA: the gene expresses

    H0: the gene does not differentially express

    HA: the gene differentially expresses

  • 8/3/2019 Statistical Basics for Micro Array Studies

    31/81

    Classical hypothesis testingHypotheses are statements about the population:

    e.g.HA: the gene expresses in leaf guard cells

    Means that the average expression level oversome population of leaf guard cells is positive.

    Since there is both measurement error and

    biological variability from leaf to leaf, plant toplant, etc., we need to:

    What goes into a hypothesis

  • 8/3/2019 Statistical Basics for Micro Array Studies

    32/81

    What goes into a hypothesis

    test?1. Determine the population of cells, leaves,

    plants for which we want to make theinference.

    2. Ensure that the sample represents thepopulation of interest, not just one member of

    the population.

    This has 2 implications:

    Essential data for hypothesis

  • 8/3/2019 Statistical Basics for Micro Array Studies

    33/81

    Essential data for hypothesis

    testingWe need to sample several individuals from our

    population.

    We need to estimate the biological variability ofour population (using the sample).

    Sampling should be done at random.

    Tests about the population

  • 8/3/2019 Statistical Basics for Micro Array Studies

    34/81

    Tests about the population

    average.The usual tests require that the data areindependent and have the samevariability.

    Same variability take logarithms

    Independent dependence is induced by

    taking multiple measurements of the same

    object - e.g. organism, RNA sample,microarray

    Tests of one sample mean or

  • 8/3/2019 Statistical Basics for Micro Array Studies

    35/81

    Tests of one sample mean or

    median t-test

    Wilcoxon test Permutation and bootstrap tests

    Tests about the population

  • 8/3/2019 Statistical Basics for Micro Array Studies

    36/81

    p paverage.

    Hypotheses about expression levels are usually treatedas hypotheses about population averages (t-test) or

    population medians (Wilcoxon test).

    We want to make a guessabout the value of the redline, but we do not get tomeasure the entire

    population.

    Tests about the population

  • 8/3/2019 Statistical Basics for Micro Array Studies

    37/81

    p paverage.

    Hypotheses about expression levels are usually treatedas hypotheses about population averages (t-test) orpopulation medians (Wilcoxon test).

    We want to make a guessabout the value of the red line,

    but we do not get to measurethe entire population.

    We estimate the parameterfrom the data and then see ifit is too far from the

    hypothesized value to beplausible

  • 8/3/2019 Statistical Basics for Micro Array Studies

    38/81

  • 8/3/2019 Statistical Basics for Micro Array Studies

    39/81

    Classical testing is basedon the distribution of thetest statistic when the null

    hypothesis is true.

    This depends on theoriginal population, butmay be the sampleaverage, sample "t" value,etc.

    The distribution of the test

    statistic is computedthrough statistical theory orsimulation, or by"resampling" from the data(bootstrap and permutation

    methods).

    2 i l id f t t

  • 8/3/2019 Statistical Basics for Micro Array Studies

    40/81

    2 simple ideas for tests

    If the population mean is , then the sample

    mean should be close to .t-test, permutation test, bootstrap test

    If the population median is , then about50% of the observations should be above and 50% below.

    Sign and Wilcoxon test

    Some Questions of Interest

  • 8/3/2019 Statistical Basics for Micro Array Studies

    41/81

    Some Questions of Interest

    Does genei express under this condition?

    Does the expression level of genei differ under

    these 2 (or more) conditions?

    Do genei and genej have similar patterns ofexpression under several conditions

    Some Questions of Interest

  • 8/3/2019 Statistical Basics for Micro Array Studies

    42/81

    Some Questions of Interest

    Does genei express under this condition?

    H0: i= 0 HA: i> 0

    Since we measure intensity, which is always

    non-negative, we cannot test this. We use a

    small constant (how small?) and do a one-

    sided test.

    This is often considered to be inaccurate - PCRis often used to check at least some of theconclusions

    Some Questions of Interest

  • 8/3/2019 Statistical Basics for Micro Array Studies

    43/81

    Some Questions of Interest

    Does the expression level of genei differ underthese 2 (or more) conditions?

    H0: ij=ik HA: ijik

    This is considered the most appropriate setting

    for microarray data

    t t f

  • 8/3/2019 Statistical Basics for Micro Array Studies

    44/81

    tests of means:

    1. Is the population mean 0? (2-sided)

    , (1-sided).

    2. Is there a difference in populationmeans? (usually 2-sided)

    1.5=2.5 Is the mean difference equal 0?(usually 2-sided)

    S ti

  • 8/3/2019 Statistical Basics for Micro Array Studies

    45/81

    Some necessary assumptions

    1. The observations are independent.

    2. The observations are a random samplefrom a known population.

    3. If there are 2 populations, the populationhave the same shape.

    t test

  • 8/3/2019 Statistical Basics for Micro Array Studies

    46/81

    t-test

    RNA was spiked into the sample sothat the expression level for the oligo

    should have been 24. Was this levelachieved?

    0: =0

    : 0

    nS

    Yt

    Y/

    * 0=

    We obtain the p-valuefrom a table or our

    software (d.f.=n-1)

    Classical Inference: P-values

  • 8/3/2019 Statistical Basics for Micro Array Studies

    47/81

    We assume that the NULL distribution is true.The test statistic then has a distribution that is

    either known or can be simulated under thisassumption.

    The p-value is the probability of observing a valueof the test statistic more extreme than theobserved value when Ho is true.

    The p-value can be thought of as a measurement

    of "surprise". If small, we reject Ho.

    Classical Inference: P-values

  • 8/3/2019 Statistical Basics for Micro Array Studies

    48/81

    We assume that the NULL distribution is true.The test statistic then has a distribution that is

    either known or can be simulated under thisassumption.

    The p-value is the probability of observing a valueof the test statistic more extreme than theobserved value when Ho is true.

    The test is 2-sided if the null hypothesis proposesequality of 2 values. Then we consider theabsolute value of the test statistic.

    Classical Inference: P-values

  • 8/3/2019 Statistical Basics for Micro Array Studies

    49/81

    We assume that the NULL distribution is true.The test statistic then has a distribution that is

    either known or can be simulated under thisassumption.

    The p-value is the probability of observing a valueof the test statistic more extreme than theobserved value when Ho is true.

    The test is 1-sided if the null hypothesis proposes

    or

    t test

  • 8/3/2019 Statistical Basics for Micro Array Studies

    50/81

    t-test

    When the data come from a distribution that

    is approximately symmetricaland are independent

    and the null hypothesis is true

    nS

    Yt

    Y /* 0

    =

    has a sampling distribution that iscalled Student's t distribution on n-

    1 degrees of freedom.The degrees of freedom are ameasure of how often outlying

    values are observed

    t distribution

  • 8/3/2019 Statistical Basics for Micro Array Studies

    51/81

    t distribution

    The t-test is the usualexample.

    We compute a value oft from the data (formulato follow.)

    If the null hypothesis is

    true, the distribution isgiven by one of thecurves on the plot.

    The vertical lines aredrawn so the 2.5% ofthe distribution is moreextreme than the line.

    observed t

    t distribution

  • 8/3/2019 Statistical Basics for Micro Array Studies

    52/81

    t distribution

    P-valuefor

    0

    observed t

    Fail to reject: this is atypical value if 0

  • 8/3/2019 Statistical Basics for Micro Array Studies

    53/81

    t distribution

  • 8/3/2019 Statistical Basics for Micro Array Studies

    54/81

    t distribution

    P-valuefor

    = 0

    Reject: this is not atypical value if 0

    observed t

    Confidence Interval

  • 8/3/2019 Statistical Basics for Micro Array Studies

    55/81

    Confidence Interval

    A 1- confidence interval for the mean is all

    the hypothesized values that would not berejected by doing the 2-sided test at level.

    Looking at the test statistic, we can readilysee that this is

    nstYn

    /1,2/1

    t-test

  • 8/3/2019 Statistical Basics for Micro Array Studies

    56/81

    t testExample:

    Ho: = 4Sample average=3.6 n=49Sample variance=0.041

    One Sample t-testdata: R.gene1.array1

    t = -13.5428, df = 48, p-value < 2.2e-16alternative hypothesis: true mean is not equal to 4

    What is required for a one

  • 8/3/2019 Statistical Basics for Micro Array Studies

    57/81

    sample t-test?The data should be independent. (This isessential.)

    The data histogram should be symmetric andunimodal (i.e. the data should be "bell-shaped,

    although normality is not essential).

    Testing the difference in means

  • 8/3/2019 Statistical Basics for Micro Array Studies

    58/81

    Testing the difference in means

    If X1 Xn and Y1 Ym are independent

    samples, another form of the t-test can be usedto test if there is a difference in the mean. (e.g.to test differential expression)

    0: X = Y

    2-sample t-test

  • 8/3/2019 Statistical Basics for Micro Array Studies

    59/81

    We now test:

    H0: difference in populations averages is 0

    HA: difference in population averages is not 0

    m

    S

    n

    S

    YXt

    YX

    22*

    +

    =

    Again we use the t-table, with d.f. n+m-2

    2-sample t-test

  • 8/3/2019 Statistical Basics for Micro Array Studies

    60/81

    e.g. Do genes L37868 and L38503 differentially express betweenALL and AML patients (a test for each gene).

    Note that these were independent arrays for each patient.

    t.test(log(gExprs[1,],2)~gPheno$ALL.AML)

    L37868t = -0.4401, df = 64.69, p-value = 0.6613alternative hypothesis: true difference in means is not equal to 095 percent confidence interval: -0.9250874 0.5910186sample estimates:mean in group ALL 7.283753 mean in group AML 7.450787

    L38503t = -4.9775, df = 64.913, p-value = 5.021e-06alternative hypothesis: true difference in means is not equal to 0

    95 percent confidence interval: -1.8834056 -0.8047853sample estimates:mean in group ALL 10.03113 mean in group AML 11.37522

    Some notes:

  • 8/3/2019 Statistical Basics for Micro Array Studies

    61/81

    Some notes:

    Notes:

    1. Statistical significance is not the same aspractical significance.

    2. There is an alternative form of the

    denominator of the test if we assume that 2populations have the same spread.

    3. We need to be careful about theindependence of the samples.

    Another example:

  • 8/3/2019 Statistical Basics for Micro Array Studies

    62/81

    A muscle tissue sample was taken from healthy volunteers. Theywere then infused with insulin and another sample was taken.

    The 2 samples from each volunteer were hybridized to a 2-channel microarray

    H0: gene expression is the same before and after infusion

    HA: the gene differentially expresses

    This looks like the two-sample t-test, but some of the samples are

    not independent becausea) they are from the same subject

    b) they are hybridized to the same array

    Pairs of Dependent Samples

  • 8/3/2019 Statistical Basics for Micro Array Studies

    63/81

    There is a simple solution:Combine the dependent samples.

    In this case, the relevant quantity is thelog(ratio)=M.

    For the M-values, we have hypotheses:

    H0: =0 HA: 0

    We do a 1-sample t-test using M. This issometimes called the aired t-test.

    Here are the results for 4 of the genes:

    t = 5 1619 df = 5 p value = 0 003579

  • 8/3/2019 Statistical Basics for Micro Array Studies

    64/81

    t = -5.1619, df = 5, p-value = 0.003579

    alternative hypothesis: true mean is not equal to 0

    95 percent confidence interval: -1.5292 -0.5124

    sample estimates: mean of x -1.020839

    t = 10.095, df = 5, p-value = 0.0001634

    alternative hypothesis: true mean is not equal to 0

    95 percent confidence interval: 0.7928 1.3345sample estimates: mean of x 1.063659

    Here are the results for 4 of the genes:

    t = 4 9325 df = 5 p value = 0 00435

  • 8/3/2019 Statistical Basics for Micro Array Studies

    65/81

    t = -4.9325, df = 5, p-value = 0.00435

    alternative hypothesis: true mean is not equal to 0

    95 percent confidence interval: -2.124 -0.6686

    sample estimates: mean of x -1.396339

    t = -1.0535, df = 5, p-value = 0.3403

    alternative hypothesis: true mean is not equal to 0

    95 percent confidence interval: -0.752 0.315sample estimates: mean of x -0.2188503

    Alternatives to the t-test

  • 8/3/2019 Statistical Basics for Micro Array Studies

    66/81

    The t-test is not the only choice for testing

    hypotheses about average expression levels.

    These methods require equal variance andindependence, but not bell-shaped

    Permutation test

    Bootstrap test

    Wilcoxon test

    Permutation test (2-sample)

  • 8/3/2019 Statistical Basics for Micro Array Studies

    67/81

    Compute t* as usual, but do not use t table.

    Instead: Make one big sample X1XnY1Ym

    For all possible choices of selection n items

    (or a sample, if this is too many)

    Select n observations at random to be the Xs;the remainder are the Ys.

    Compute t* for each selection.

    Permutation test (2-sample)

  • 8/3/2019 Statistical Basics for Micro Array Studies

    68/81

    ( p )

    The p-value is based onwhere t* is in the

    histogram of T*

    Multiply the tailprobability by 2 for a 2-sided test

    Bootstrap test (2-sample)

  • 8/3/2019 Statistical Basics for Micro Array Studies

    69/81

    ( )

    Compute t* as usual, but do not use t-table.

    Instead: Make one big sample X1XnY1Ym

    For all possible choices of selection n items withreplacement to be the Xs and m items withreplacement to be the Ys.

    Proceed as for the permutation test.

    The Wilcoxon test (2-sample)

  • 8/3/2019 Statistical Basics for Micro Array Studies

    70/81

    aka Mann-Whitney testMake one big sample X1XnY1Ym

    Rank the observations from lowest (1)

    to highest (n+m)

    W*= rank(Xs) rank(Ys)

    The p-value is computed by the permutation test,

    (and these are available in tables based on nand m).

    Note:

  • 8/3/2019 Statistical Basics for Micro Array Studies

    71/81

    There are 1-sample and paired sample versionsof these tests.

    The Wilcoxon test is often used instead of the t-test. The Wilcoxon p-value is usually very close

    to the t-test p-value when the t-test isappropriate, and is more accurate when the t-

    test is not appropriate. On the other hand, theWilcoxon test does not readily generalize to themore powerful analyses based on Bayesian

    methods.

    Power, False Discovery,

  • 8/3/2019 Statistical Basics for Micro Array Studies

    72/81

    NondiscoveryAccept Reject total

    H0 true U V m0H0 false T S m-m0

    total m-R R m

    Correct

    Mistake

    Power probability of rejecting when H0 false

    FDR =V/R

    FNR = T/(m-R)

    Power, False Discovery,

  • 8/3/2019 Statistical Basics for Micro Array Studies

    73/81

    Nondiscovery

    nS

    Yt

    Y /

    * 0

    =The p-value is the probability ofrejecting the null hypothesis when it is

    true.Under the alternative:

    We can see that t* will be larger when the difference between the means is

    larger, the variance is smaller or n islarger.

    The only item under our control is thesample size.

    Power, False Discovery,

  • 8/3/2019 Statistical Basics for Micro Array Studies

    74/81

    Nondiscovery

    nS

    Yt

    Y /

    * 0

    =For fixed p-value at which we declarestatistical significance, increasing the

    sample size:Increases power

    Reduces FDRReduces FNR

    Power, False Discovery,

  • 8/3/2019 Statistical Basics for Micro Array Studies

    75/81

    Nondiscovery

    nS

    Yt

    Y /

    * 0

    =For fixed FDR at which we declarestatistical significance, increasing the

    sample size:Reduces FNR

    The sample size in most microarray studies is too

    small.This leads to large FNR which needs to beconsidered when known genes are not found to bestatistically significant.

    Improving Power

  • 8/3/2019 Statistical Basics for Micro Array Studies

    76/81

    Besides increasing the sample size we can:

    1. Pick the best tests. (Generally t-test andWilcoxon are more powerful thanresampling when appropriate.)

    2. Use (Empirical) Bayes tests (SAM,

    LIMMA, )

    Bayes Methods

  • 8/3/2019 Statistical Basics for Micro Array Studies

    77/81

    We have lots of genes.

    Gene i has mean i and varianceBayesian methods assume that the means and

    variances come from known distributions (thepriors).

    Empirical Bayes methods assume that the means

    and variances have distributions that are estimatedfrom the data.

    2

    i

    Empirical Bayes Methods

  • 8/3/2019 Statistical Basics for Micro Array Studies

    78/81

    Most focus is on the distribution of thevariances.

    SAM add a constant based on a quantileof the distribution of the S2 over all thegenes. (SAM also uses permutationtests.)

    LIMMA more formal eBayes analysis

    - results in replacing gene variancesby a weighted average of the genevariance and the mean variance of all

    genes.

    m

    S

    n

    S

    YXt

    YX

    22*

    +

    =

    True Bayes

  • 8/3/2019 Statistical Basics for Micro Array Studies

    79/81

    Combine the prior information with the data.

    Instead of p-values, we obtain the posteriorodds ratio of the hypothesis being true.

    Increasing power

  • 8/3/2019 Statistical Basics for Micro Array Studies

    80/81

    Classical methods are known to be lesspowerful than eBayes and Bayes methods.

    We will read some papers about this.

    Multiple comparisons problemAccept Reject total

  • 8/3/2019 Statistical Basics for Micro Array Studies

    81/81

    The problem with p-valuesis that:

    If we declare significancewhen P