From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion


mean(nucs) = 113.5 bp
median(nucs) = 110 bp
sd(nucs) = 17.3 bp

http://www.stat.wisc.edu/courses/language

Link to Jean-Yves Sgro’s R & Bioconductor Manual: http://akka.genetics.wisc.edu/groups/genetics875/wiki/7a793/M_Oct_1_.html

R2 = 0.9897

R2 = 0.686

R2 = 0.804

Why did the correlation go up?


Expression difference

Gene X expression under condition 1
Gene X expression under condition 2

Select differentially expressed genes to focus on

Methods of gene selection:

-- arbitrary fold-expression-change cutoff
   example: genes that change >3X in expression between samples

-- statistically significant change in expression
   requires replicates
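The fold-change cutoff above can be sketched in a few lines of Python. The gene names and expression values are made up for illustration, and the helper `select_by_fold_change` is a hypothetical name, not something from the course materials. It assumes strictly positive expression values.

```python
# Hypothetical expression values (arbitrary units), for illustration only.
# Each entry is (expression under condition 1, expression under condition 2).
expression = {
    "geneA": (100.0, 350.0),
    "geneB": (200.0, 220.0),
    "geneC": (90.0, 25.0),
    "geneD": (50.0, 140.0),
}


def select_by_fold_change(data, cutoff=3.0):
    """Return genes whose expression changes more than `cutoff`-fold, up or down."""
    selected = []
    for gene, (cond1, cond2) in data.items():
        # Larger over smaller value, so induction and repression are treated alike.
        fold = max(cond1, cond2) / min(cond1, cond2)
        if fold > cutoff:
            selected.append(gene)
    return selected


selected_genes = select_by_fold_change(expression)
print(selected_genes)
```

Note that this cutoff uses no replicates, so it says nothing about whether a change is statistically significant.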


Test if the means of 2 (or more) groups are the same or statistically different

The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis

Choosing the right test:

parametric test if your data are normally distributed with equal variance

nonparametric test if either of the above is not true

Why do the data need to be normally distributed?


Test if the means of 2 groups are the same or statistically different


$T = \dfrac{\bar{X}_1 - \bar{X}_2}{\mathrm{SED}}$ = (difference in the means) / (standard error of the difference in the means, SED)

If your two samples are normally distributed with equal variance, use the t-test

If T > Tc, where Tc is the critical value for the degrees of freedom & confidence level, then reject H0

Notice that if the data aren’t normally distributed, the mean and standard deviation are not meaningful summaries of the data.
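A minimal sketch of the two-sample t statistic, assuming equal variances so that SED is built from a pooled variance. The measurements are invented for illustration, `two_sample_t` is a hypothetical helper, and the critical value 2.447 is the standard two-sided table value for 6 degrees of freedom at the 0.05 level.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x1, x2):
    """Return (T, degrees of freedom) for an equal-variance two-sample t-test."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / df
    # SED: standard error of the difference in the means.
    sed = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / sed, df


# Hypothetical replicate measurements, for illustration only.
group1 = [5.1, 4.9, 5.6, 5.3]
group2 = [4.2, 4.0, 4.5, 4.1]

t_stat, df = two_sample_t(group1, group2)
T_CRITICAL = 2.447  # two-sided, alpha = 0.05, df = 6 (from a t table)
reject_h0 = abs(t_stat) > T_CRITICAL
print(f"T = {t_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```

The absolute value is used because a two-sided test rejects H0 for large differences in either direction.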


$T = \dfrac{\bar{D}}{\mathrm{SEM}}$ = (average difference in expression) / (standard error of the mean difference, SEM)

If your two samples are normally distributed with equal variance AND your data were paired before collection, use the paired t-test

The paired t-test for gene expression ratios

Example: Tumor sample before and after treatment
Gene expression differences expressed as ratios

e.g., mutant vs. wt log2[ratio]: 5.0, 4.3, 6.7

If T > Tc, where Tc is the critical value for the degrees of freedom (n-1) & confidence level, then reject H0
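The paired t statistic for the three log2 ratios in the example above can be worked through as follows. This is a sketch: `paired_t` is a hypothetical helper, H0 is taken to be a mean log2 ratio of 0 (no change), and 4.303 is the standard two-sided table value for 2 degrees of freedom at the 0.05 level.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(differences):
    """Return (T, degrees of freedom) for paired differences tested against 0."""
    n = len(differences)
    d_bar = mean(differences)            # average difference in expression
    sem = stdev(differences) / sqrt(n)   # standard error of the mean difference
    return d_bar / sem, n - 1


log2_ratios = [5.0, 4.3, 6.7]  # mutant vs. wt, from the example above
t_stat, df = paired_t(log2_ratios)

T_CRITICAL = 4.303  # two-sided, alpha = 0.05, df = 2 (from a t table)
reject_h0 = abs(t_stat) > T_CRITICAL
print(f"T = {t_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```

With only three replicates the critical value is large, so the mean ratio has to be several standard errors away from 0 before H0 is rejected.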


Test if the means of 2 (or more) groups are the same or statistically different


ANOVA (ANalysis Of Variance): for comparing 2 or more means

$F = \dfrac{\text{variation between samples}}{\text{variation within samples}}$

ANOVA only tells you that at least one of your samples is different … may need to identify which is different for >2 sample comparisons

If F > Fc, where Fc is the critical value for the degrees of freedom (between samples: k-1; within samples: N-k, for k samples and N total observations) & confidence level, then reject H0
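A small sketch of the one-way ANOVA F ratio, written out from the between-sample and within-sample sums of squares. The three groups are invented numbers, `one_way_anova_f` is a hypothetical helper, and 5.14 is the standard table value for F with (2, 6) degrees of freedom at the 0.05 level.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Variation between samples: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within samples: observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical measurements for three samples, for illustration only.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(groups)

F_CRITICAL = 5.14  # alpha = 0.05, df = (2, 6) (from an F table)
reject_h0 = f_stat > F_CRITICAL
print(f"F = {f_stat:.2f}, df = ({df_between}, {df_within}), reject H0: {reject_h0}")
```

A significant F here says only that at least one group mean differs. Finding which one takes a follow-up comparison.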


Assessing & minimizing error in calls

Type I error = false positives
FDR = False Discovery Rate

Type II error = false negatives

Balance between minimizing false positives vs. false negatives

Assessing false positives vs. false negatives: sensitivity vs. specificity

Sensitivity (how well did you find what you want): # of true positives / # of total actual positives (= # true positives + # false negatives)

Specificity (how well did you discriminate): # of true negatives / # of total actual negatives (= # true negatives + # false positives)
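As a quick illustration of the two ratios above, here is a sketch with made-up counts. The function names and the numbers are assumptions for the example only.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that were called significant."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that were correctly not called."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, for illustration only.
tp, fn = 80, 20     # 100 genes that truly changed
tn, fp = 900, 100   # 1,000 genes that truly did not change

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Loosening the significance threshold usually raises sensitivity and lowers specificity, which is the balance the slide refers to.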

Assessing accuracy based on known truth: Receiver Operating Characteristic (ROC) curves

Plot the true positive rate (TPR) vs the false positive rate (FPR) at each significance threshold

Known truth is either a set of positive controls … or can come from simulated data
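The ROC construction can be sketched as below: step through the significance thresholds, and at each one compute TPR and FPR against the known truth. The p-values and truth labels are invented, and `roc_points` is a hypothetical helper.

```python
def roc_points(p_values, truth):
    """Return (FPR, TPR) pairs, one per significance threshold, starting at (0, 0)."""
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(p_values)):
        # Everything with p <= threshold is called significant at this threshold.
        tp = sum(1 for p, t in zip(p_values, truth) if p <= threshold and t)
        fp = sum(1 for p, t in zip(p_values, truth) if p <= threshold and not t)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical p-values with known truth (True = real positive control).
p_values = [0.001, 0.01, 0.03, 0.2, 0.5, 0.8]
truth = [True, True, False, True, False, False]

points = roc_points(p_values, truth)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

A method that separates positives from negatives well produces points that rise toward TPR = 1 while FPR is still near 0.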


When working with many genes must correct for multiple testing …

p < 0.01 means that, if H0 is true, there is less than a 1 in 100 chance of an observation at least this extreme

But if you have 30,000 genes, with a 0.01 chance that each conclusion is wrong, then you expect about 300 false positives!

Adjust the p-value cutoff such that there is a 1 in 100 chance of any false identification across all genes:

p = 0.01 / 30,000 ‘trials’ ≈ 3 x 10^-7, so p < 3 x 10^-7 is significant

(this is also known as Bonferroni correction)
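The arithmetic behind the multiple-testing problem and the Bonferroni cutoff can be checked directly. This is a sketch: `bonferroni_significant` is a hypothetical helper that takes the number of tests to be the number of p-values passed in.

```python
n_genes = 30_000
alpha = 0.01

# Expected false positives if every gene is tested at p < 0.01 and all are null.
expected_false_positives = n_genes * alpha

# Bonferroni: divide the desired overall error rate by the number of tests.
bonferroni_cutoff = alpha / n_genes

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"Bonferroni cutoff: {bonferroni_cutoff:.1e}")


def bonferroni_significant(p_values, alpha=0.01):
    """Flag each p-value that passes the Bonferroni-corrected cutoff."""
    cutoff = alpha / len(p_values)
    return [p < cutoff for p in p_values]
```

For example, `bonferroni_significant([1e-8, 1e-5, 0.5])` uses a cutoff of 0.01 / 3 and flags only the first two values. With tens of thousands of genes the cutoff becomes very strict, so many real changes are missed.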

Newer, better way of dealing with this is FDR correction

FDR: false discovery rate
How many of the called positives are false?
5% FDR means 5% of calls are false positives

This is different from the false positive rate:
The rate at which true negatives are called significant
5% false positives means 5% of true negatives are incorrectly called significant
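To make the difference concrete, the sketch below computes both rates from the same set of calls. All counts are invented for illustration. The only point is that the two rates have different denominators.

```python
# Hypothetical experiment, for illustration only.
n_genes = 30_000
n_truly_changed = 1_000
n_true_negatives_total = n_genes - n_truly_changed  # 29,000 genes with no real change

# Hypothetical outcome of calling genes significant at some cutoff.
true_positives = 800
false_positives = 200
total_calls = true_positives + false_positives

# FDR: what fraction of the genes we CALLED are false?
fdr = false_positives / total_calls

# False positive rate: what fraction of the TRUE NEGATIVES did we call?
fpr = false_positives / n_true_negatives_total

print(f"FDR = {fdr:.3f}, false positive rate = {fpr:.4f}")
```

In this made-up case fewer than 1% of true negatives are called, yet 20% of the calls are wrong, because true negatives far outnumber true positives.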

“The p-value cutoff [and false positive rate] says little about the content of the features actually called significant” (Storey and Tibshirani 2003)

Storey and Tibshirani 2003: q-value to represent FDR

FDR = expected ratio of false positives vs all positives (Expected [F/S])

q value: for a given region of data space, what fraction of genes in that region are false?

e.g., Gene X has a q = 0.04 … this means that if you call Gene X significant, then all the genes with better statistics must also be called significant, and an expected 4% of all of these genes are false positives

“The q-value for a particular feature is the expected proportion of false positives incurred when calling that feature significant.”

FDR = expected ratio of false positives vs all positives:
Expected[F/S] ~ Expected[F] / Expected[S]

-- can initially estimate S based on a simple p-value cutoff

We need to estimate π0 = m0 / m = fraction of all features that are truly negative

Genes with p > 0.5 show a relatively flat density … because we expect that p-values of null genes are randomly (uniformly) distributed, we assume that most of these genes are true nulls … (The tuning parameter is the p cutoff above which nulls are assumed)

The density for genes with p > 0.5 allows us to estimate the # of true negatives and thus π0

in this case
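A simplified sketch of the estimate described above: count the p-values above the tuning cutoff (0.5 here), scale up to estimate the fraction of true nulls, and then turn each p-value into a q-value. This uses a single fixed cutoff rather than the smoothing over many cutoffs that Storey and Tibshirani describe, so treat it as an approximation of their procedure. The function names and the p-values are assumptions for illustration.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the fraction of truly null features from the flat part of the p-value density."""
    m = len(p_values)
    n_above = sum(1 for p in p_values if p > lam)
    # Nulls are uniform, so a fraction (1 - lam) of them should land above lam.
    return min(1.0, n_above / (m * (1.0 - lam)))


def q_values(p_values, lam=0.5):
    """Return a q-value per feature: the smallest estimated FDR at which it is called."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        # Expected[F] ~ pi0 * m * p and S = rank (number called at this cutoff).
        estimated_fdr = pi0 * m * p_values[i] / rank
        running_min = min(running_min, estimated_fdr)
        q[i] = running_min
    return q


# Hypothetical p-values, for illustration only.
p_values = [0.0001, 0.001, 0.01, 0.04, 0.2, 0.45, 0.6, 0.7, 0.8, 0.95]
pi0 = estimate_pi0(p_values)
q = q_values(p_values)
print(f"pi0 = {pi0:.2f}")
for p, qv in zip(p_values, q):
    print(f"p = {p:<7} q = {qv:.4f}")
```

Calling every feature with q below 0.05 significant would then correspond to an estimated FDR of about 5% among those calls.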
