1-9-2006 1
Identifying Differentially Expressed Genes
• First approach: repeat a simple analysis for each gene separately, 30k times
• Assume we have two experimental conditions (j = 1, 2)
• We measure the expression of all genes n times under both experimental conditions (n two-channel microarrays)
• For a specific gene (focusing on a single gene), $x_{ij}$ = ith measurement under condition j
• Statistical models for expression measurements under the two different conditions:
$$x_{i1} \sim N(\mu_1, \sigma^2), \qquad x_{i2} \sim N(\mu_2, \sigma^2)$$
• $\mu_1$, $\mu_2$ and $\sigma$ are unknown model parameters: $\mu_j$ represents the average expression measurement over a large number of replicated experiments, and $\sigma$ represents the variability of the measurements
• The question of whether the gene is differentially expressed corresponds to assessing whether $\mu_1 \neq \mu_2$
• Strength of evidence in the observed data that this is the case is expressed in terms of a p-value
P-value
• Estimate the model parameters based on the data:
$$\hat{\mu}_j = \bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ij}, \qquad s_j^2 = \frac{\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}{n_j-1}, \qquad \hat{\sigma}^2 = s^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$$
• Calculate the t-statistic, which summarizes the information about our hypothesis of interest ($\mu_1 \neq \mu_2$):
$$t^* = \frac{\bar{x}_2-\bar{x}_1}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
• Establish the null-distribution of the t-statistic (the distribution assuming the “null-hypothesis” that $\mu_1 = \mu_2$)
• The “null-distribution” in this case turns out to be the t-distribution with $n_1+n_2-2$ degrees of freedom
• The p-value is the probability, under the “null-distribution”, of observing a value as extreme as or more extreme than the one calculated from the data (t*)
t-distribution
• The number of experimental replicates affects the precision at two levels:
1. Everything else being equal, an increase in sample size increases t*
2. Everything else being equal, an increase in sample size “shrinks” the “null-distribution”
• Suppose that t* = 3. How do the p-values differ depending on the sample size alone?
[Figure: t-distribution densities plotted over x from -4 to 4 for df = 1, 2, 10 and 100. For t* = 3 the corresponding p-values are 0.2 (df = 1), 0.1 (df = 2), 0.01 (df = 10) and 0.003 (df = 100).]
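The p-values quoted for t* = 3 can be reproduced with `pt()`, the same t-distribution function used later in these slides. This is a minimal sketch; the variable names are illustrative and not part of the course scripts.

```r
# Two-sided p-value for the same observed statistic t* = 3
# under t-distributions with different degrees of freedom.
tstar <- 3
df <- c(1, 2, 10, 100)

# pt(..., lower.tail = FALSE) gives the upper-tail area; doubling it
# accounts for values as extreme in either direction.
pvals <- 2 * pt(tstar, df, lower.tail = FALSE)
names(pvals) <- paste("df =", df)

print(signif(pvals, 2))
```

The printed values shrink steadily as the degrees of freedom grow, which is the “shrinking” of the null-distribution described above.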
Performing t-test
> load(url("http://eh3.uc.edu/teaching/cfg/2006/data/SimpleData.RData"))
> ls()
[1] "SimpleData"
> SimpleData[1:5,]
Name Ctl Nic Nic.1 Nic.2 Ctl.1 Ctl.2
1 D49382 11.365781 11.852662 9.534654 11.492123 10.649501 10.003857
2 X58426 8.270075 9.543917 8.191639 8.622752 8.682251 8.515828
3 M59821 6.896622 7.391191 7.706090 7.069613 7.501968 7.188065
4 U59761 10.017569 10.378232 9.981623 9.333508 9.631872 10.939635
5 X84037 7.962413 8.512166 8.393332 8.105295 8.075670 9.103248
>
> Nic<-grep("Nic",dimnames(SimpleData)[[2]])
> Ctl<-grep("Ctl",dimnames(SimpleData)[[2]])
> Nic
[1] 3 4 5
> Ctl
[1] 2 6 7
> SimpleData[1,Nic]
Nic Nic.1 Nic.2
1 11.85266 9.534654 11.49212
> SimpleData[1,Ctl]
Ctl Ctl.1 Ctl.2
1 11.36578 10.6495 10.00386
Performing t-test
> MNic<-mean(unlist(SimpleData[1,Nic]))
> MNic
[1] 10.95981
> MCtl<-mean(unlist(SimpleData[1,Ctl]))
> MCtl
[1] 10.67305
> VNic<-var(unlist(SimpleData[1,Nic]))
> VNic
[1] 1.555805
> VCtl<-var(unlist(SimpleData[1,Ctl]))
> VCtl
[1] 0.464125
> NNic<-sum(!is.na(SimpleData[1,Nic]))
> NNic
[1] 3
> NCtl<-sum(!is.na(SimpleData[1,Ctl]))
> NCtl
[1] 3
> VNicCtl<-(((NNic-1)*VNic)+((NCtl-1)*VCtl))/(NNic+NCtl-2)
> VNicCtl
[1] 1.009965
> DF<-NNic+NCtl-2
> DF
[1] 4
> TStat<-abs(MNic-MCtl)/((VNicCtl*((1/NNic)+(1/NCtl)))^0.5)
> TStat
[1] 0.3494791
> TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
> TPvalue
[1] 0.744353
>
> t.test(SimpleData[1,Nic],SimpleData[1,Ctl],var.equal=TRUE)
Two Sample t-test
data: LSimpleData[1, W] and LSimpleData[1, C]
t = 0.7974, df = 10, p-value = 0.4437
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3653337 0.7725582
sample estimates:
mean of x mean of y
6.597047 6.393434
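As a cross-check of the hand calculation for the first gene (t = 0.3494791 on 4 degrees of freedom, p = 0.744353), the same six values can be passed to `t.test()` as plain numeric vectors. This is a sketch: the values are typed in from the SimpleData listing rather than loaded from the course URL, and the variable names are illustrative.

```r
# Expression values for gene D49382 (row 1 of SimpleData), copied from the listing.
nic <- c(11.852662, 9.534654, 11.492123)
ctl <- c(11.365781, 10.649501, 10.003857)

# Built-in pooled-variance (Student) two-sample t-test.
res <- t.test(nic, ctl, var.equal = TRUE)

# Manual calculation: pooled variance, t-statistic, two-sided p-value.
n1 <- length(nic)
n2 <- length(ctl)
s2 <- ((n1 - 1) * var(nic) + (n2 - 1) * var(ctl)) / (n1 + n2 - 2)
tstat <- abs(mean(nic) - mean(ctl)) / sqrt(s2 * (1 / n1 + 1 / n2))
pval <- 2 * pt(tstat, n1 + n2 - 2, lower.tail = FALSE)

# Both routes should report the same statistic and p-value.
print(c(t = unname(res$statistic), df = unname(res$parameter), p = res$p.value))
print(c(t = tstat, df = n1 + n2 - 2, p = pval))
```

Passing numeric vectors (for example via `unlist()` on a data-frame row) avoids depending on how `t.test()` treats one-row data frames.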
source("http://eh3.uc.edu/teaching/cfg/2006/R/RSimpleTTest.R",verbose=T)
source("http://eh3.uc.edu/teaching/cfg/2006/R/MySimpleTTest.R",verbose=T)
Statistical Inference and Statistical Significance – P-value
• Statistical inference consists of drawing conclusions about the measured phenomenon (e.g. gene expression) in terms of probabilistic statements based on observed data. The p-value is one way of doing this.
• The p-value is NOT the probability of the null hypothesis being true.
• Rigorous interpretation of the p-value is tricky.
• It was introduced to measure the level of evidence against the “null-hypothesis” or, better to say, in favor of a “positive experimental finding”.
• In this context a p-value of 0.0001 could be interpreted as stronger evidence than a p-value of 0.01.
• Establishing statistical significance (is a difference in expression level statistically significant or not) requires that we establish “cut-off” points for our “measure of significance” (the p-value).
• For various historic reasons the cut-off 0.05 is generally used to establish “statistical significance”.
• It’s a rather arbitrary cut-off, but it is taken as a gold standard.
• Originally the p-value was introduced as a descriptive measure to be used in conjunction with other criteria to judge the strength of evidence one way or another.
Statistical Inference and Statistical Significance – Hypothesis Testing
• The 5% cut-off point comes from the hypothesis testing world.
• In this world the exact magnitude of the p-value does not matter. It only matters whether it is smaller than the pre-specified statistical significance cut-off ($\alpha$).
• The null hypothesis is rejected in favor of the alternative hypothesis at a significance level of $\alpha$ = 0.05 if p-value < 0.05.
• A Type I error is committed when the null hypothesis is falsely rejected.
• A Type II error is committed when the null hypothesis is not rejected but it is false.
• By following this “decision making scheme” you will on average falsely reject 5% of true null hypotheses.
• If such a “decision making scheme” is adopted to identify differentially expressed genes on a microarray, 5% of non-differentially expressed genes will be falsely implicated as differentially expressed.
• A family-wise Type I error is committed if any of a set of null hypotheses is falsely rejected.
• Establishing statistical significance is a necessary but not sufficient step in assuring the “reproducibility” of a scientific finding. This is an important point that will be discussed further when we start talking about issues in experimental design.
• The other essential ingredient is a “representative sample” from the “population of interest”.
• This is still a murky point in molecular biology experimentation.
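The claim that about 5% of non-differentially expressed genes are falsely implicated at a 0.05 cut-off can be illustrated with a small simulation. The sketch below makes several assumptions that are not from the slides: 10,000 simulated genes, 3 replicates per condition, standard normal measurements, and an arbitrary random seed.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
n_genes <- 10000     # hypothetical number of genes
n <- 3               # hypothetical number of replicates per condition

# Every gene is simulated under the null hypothesis:
# both conditions share the same mean and variance.
x1 <- matrix(rnorm(n_genes * n), nrow = n_genes)
x2 <- matrix(rnorm(n_genes * n), nrow = n_genes)

# Pooled-variance t-statistic and two-sided p-value for every row.
v1 <- apply(x1, 1, var)
v2 <- apply(x2, 1, var)
s2 <- ((n - 1) * v1 + (n - 1) * v2) / (2 * n - 2)
tstat <- abs(rowMeans(x1) - rowMeans(x2)) / sqrt(s2 * 2 / n)
pval <- 2 * pt(tstat, 2 * n - 2, lower.tail = FALSE)

# Fraction and count of null genes declared "significant" at 0.05.
false_rate <- mean(pval < 0.05)
print(false_rate)
print(sum(pval < 0.05))
```

With no truly differentially expressed genes at all, the scheme still flags roughly one gene in twenty, so on a 30,000-gene array the number of false positives would be in the range of 1,500.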
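Is a Specific Gene Differentially Expressed
• For a specific gene, $x_{ij}$ = ith measurement under condition j, i = 1,…,6; j = 1, 2
• Statistical model of observed data:
$$x_{i1} \sim N(\mu_1, \sigma^2), \qquad x_{i2} \sim N(\mu_2, \sigma^2)$$
• Differential expression: $\mu_1 \neq \mu_2$
• Estimate the model parameters based on the data:
$$\hat{\mu}_j = \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad s_j^2 = \frac{\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^2}{n-1}, \qquad \hat{\sigma}^2 = s^2 = \frac{(n-1)s_1^2+(n-1)s_2^2}{2n-2}$$
• Calculating t-statistic:
$$t^* = \frac{\bar{x}_2-\bar{x}_1}{s\sqrt{\frac{2}{n}}}$$
• Calculating p-value based on the “null distribution” of the t-statistic assuming $\mu_1 = \mu_2$
[Figure: two plots of the null-distribution density of the t-statistic (x-axis: t-statistics, from -4 to 4; y-axis from 0.0 to 0.4), one with the observed values -t* and t* marked.]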
Genome-wide analysis
• How do we perform the t-test for 30,000 genes at once?
• How do we handle results, and present data and results?
• What is significant?
• How do we compare different approaches to normalization of the data and the statistical analysis of results?
• Ideally, we would like to maximize our ability to identify truly differentially expressed genes and minimize the number of falsely implicated genes.
• Doing it by hand (in R) first
• Using Bioconductor
Calculating t-test for 30,000 genes at a time
Calculating t-tests : source("http://eh3.uc.edu/teaching/cfg/2006/R/MultipleTTests.R",verbose=T)
> MNic<-apply(SimpleData[,Nic],1,mean,na.rm=TRUE)
> VNic<-apply(SimpleData[,Nic],1,var,na.rm=TRUE)
> MCtl<-apply(SimpleData[,Ctl],1,mean,na.rm=TRUE)
> VCtl<-apply(SimpleData[,Ctl],1,var,na.rm=TRUE)
> NNic<-apply(!is.na(SimpleData[,Nic]),1,sum,na.rm=TRUE)
> NCtl<-apply(!is.na(SimpleData[,Ctl]),1,sum,na.rm=TRUE)
>
> VNicCtl<-(((NNic-1)*VNic)+((NCtl-1)*VCtl))/(NCtl+NNic-2)
>
> DF<-NNic+NCtl-2
>
> TStat<-abs(MNic-MCtl)/((VNicCtl*((1/NNic)+(1/NCtl)))^0.5)
> TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
> TStat[1]
1
0.3494791
> TPvalue[1]
1
0.744353
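One way to gain confidence in the vectorized calculation is to compare it with one `t.test()` call per gene on a small example. The sketch below uses a simulated matrix as a hypothetical stand-in for SimpleData (20 genes, the same column names, one missing value); none of these numbers come from the course data.

```r
set.seed(2)  # arbitrary seed, only for reproducibility
# Hypothetical stand-in for SimpleData: 20 genes, 3 control and 3 treated columns.
dat <- matrix(rnorm(20 * 6, mean = 10), nrow = 20,
              dimnames = list(NULL, c("Ctl", "Nic", "Nic.1", "Nic.2", "Ctl.1", "Ctl.2")))
dat[5, "Nic.1"] <- NA  # one missing measurement
Nic <- grep("Nic", colnames(dat))
Ctl <- grep("Ctl", colnames(dat))

# Vectorized calculation, following the same steps as the slide.
MNic <- apply(dat[, Nic], 1, mean, na.rm = TRUE)
VNic <- apply(dat[, Nic], 1, var, na.rm = TRUE)
MCtl <- apply(dat[, Ctl], 1, mean, na.rm = TRUE)
VCtl <- apply(dat[, Ctl], 1, var, na.rm = TRUE)
NNic <- apply(!is.na(dat[, Nic]), 1, sum)
NCtl <- apply(!is.na(dat[, Ctl]), 1, sum)
VNicCtl <- ((NNic - 1) * VNic + (NCtl - 1) * VCtl) / (NNic + NCtl - 2)
DF <- NNic + NCtl - 2
TStat <- abs(MNic - MCtl) / sqrt(VNicCtl * (1 / NNic + 1 / NCtl))
TPvalue <- 2 * pt(TStat, DF, lower.tail = FALSE)

# Reference: one pooled-variance t.test() per gene (missing values are dropped).
RefPvalue <- sapply(seq_len(nrow(dat)), function(i)
  t.test(dat[i, Nic], dat[i, Ctl], var.equal = TRUE)$p.value)

print(all.equal(unname(TPvalue), RefPvalue))
```

The per-gene loop is easier to read, while the vectorized form avoids building a full test object for each gene, which matters when there are tens of thousands of rows.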
Calculating t-test for 30,000 genes at a time
Calculating t-tests : source("http://eh3.uc.edu/teaching/cfg/2006/R/TTestScatterPlots.R",verbose=T)
> par(mfrow=c(2,2))
>
> plot((MNic-MCtl),-log(TPvalue,base=10),type="p",main="Volcano Plot",xlab="Mean Difference",ylab="-log10(p-value)")
> grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted",lwd = NULL, equilogs = TRUE)
>
> plot(VNicCtl^0.5,-log(TPvalue,base=10),type="p",main="Significance vs Variability",xlab="Standard Deviation",ylab="-log10(p-value)")
> grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted",lwd = NULL, equilogs = TRUE)
>
> plot((MNic+MCtl)/2,-log(TPvalue,base=10),type="p",main="p-values vs Average Expression",xlab="Average Expression",ylab="-log10(p-value)")
> grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted",lwd = NULL, equilogs = TRUE)
>
> plot((MNic+MCtl)/2,(MNic-MCtl),type="p",main="Differences vs Average Expression",xlab="Average Expression",ylab="Mean Difference")
> grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted",lwd = NULL, equilogs = TRUE)
>
Displaying results – Scatter Plots
source("http://eh3.uc.edu/TTestScatterPlots.R")
[Figure: four scatter plots. "Volcano Plot": -log10(p-value) vs Mean Difference. "Significance vs Variability": -log10(p-value) vs Standard Deviation. "p-values vs Average Expression": -log10(p-value) vs Average Expression. "Differences vs Average Expression": Mean Difference vs Average Expression.]
Annotating Significant Genes
Calculating t-tests : source("http://eh3.uc.edu/teaching/cfg/2006/R/SimpleGeneAnnotation.R",verbose=T)
> SigGenes<-(TPvalue<0.001)
> sum(SigGenes)
[1] 7
> SimpleData[SigGenes,]
Name Ctl Nic Nic.1 Nic.2 Ctl.1 Ctl.2
34 M77497 14.889944 10.320421 9.611866 9.605977 14.201846 15.510924
440 AK014133 8.707496 10.497572 10.149103 10.712493 8.337171 8.575321
596 AF192382 9.244788 8.805788 8.679325 8.793788 9.339985 9.226626
2797 NM008000 12.566866 11.891405 12.026945 11.827393 12.512149 12.614613
4466 NM008181 9.150932 10.654799 10.715937 10.553323 9.259762 8.887743
4512 AF186373 8.288511 9.544167 9.837916 9.556097 7.988661 8.222104
7651 AF057156 8.869441 10.953028 11.638788 10.882626 8.691189 8.822723
>
http://www.ncbi.nlm.nih.gov/
Annotating Significant Genes
Calculating t-tests : source("http://eh3.uc.edu/teaching/cfg/2006/R/SimpleGeneAnnotation.R",verbose=T)
> library(annotate)
> library(mouseLLMappings)
>
> locuslinkByID("13107")
[1] "http://www.ncbi.nih.gov/LocusLink/LocRpt.cgi?l=13107"
>
> ACC2LL <- as.list(mouseLLMappingsACCNUM2LL)
> ACC2LL["M77497"]
$M77497
[1] 13107
> SigGenesLL<-ACC2LL[as.character(SimpleData[SigGenes,"Name"])]
http://www.ncbi.nlm.nih.gov/
Annotating Significant Genes
Calculating t-tests : source("http://eh3.uc.edu/teaching/cfg/2006/R/SimpleGeneAnnotation.R",verbose=T)
> SigGenesLL<-ACC2LL[as.character(SimpleData[SigGenes,"Name"])]
> SigGenesLL
$M77497
[1] 13107
$AK014133
[1] 15572
$"NA"
NULL
$"NA"
NULL
$"NA"
NULL
$AF186373
[1] 21816
$"NA"
NULL
> locuslinkByID(unlist(SigGenesLL))
[1] "http://www.ncbi.nih.gov/LocusLink/list.cgi?ID=13107&ID=15572&ID=21816"
>
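Four of the seven significant accession numbers have no LocusLink ID, which is why only three IDs reach `locuslinkByID()`: `unlist()` silently drops the NULL entries. The base-R sketch below mimics this with a hand-made three-entry list standing in for the real `ACC2LL` mapping (an assumption; the actual mapping comes from the mouseLLMappings package) and also reports which accessions were left unmapped.

```r
# Hypothetical stand-in for the ACC2LL mapping, holding only the three
# accession-to-ID pairs shown in the output above.
ACC2LL <- list(M77497 = 13107, AK014133 = 15572, AF186373 = 21816)

# Accession numbers of the seven significant genes, in table order.
acc <- c("M77497", "AK014133", "AF192382", "NM008000",
         "NM008181", "AF186373", "AF057156")

# Indexing a list by a name it does not contain yields a NULL element.
SigGenesLL <- ACC2LL[acc]
mapped <- !vapply(SigGenesLL, is.null, logical(1))

# unlist() keeps only the non-NULL entries.
ids <- unlist(SigGenesLL)
print(ids)

# Accessions that still need to be looked up by other means.
print(acc[!mapped])
```

Keeping track of the unmapped accessions makes it explicit that the resulting list of IDs covers only part of the significant genes.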
http://www.ncbi.nlm.nih.gov/