Bioinformatics
Expression profiling and functional genomics
Part II: Differential expression
Ad 27/11/2006
ANOVA based
Filtering
Linearisation
Bootstrapping
Log transformation
Array by array approach
Filtering
normalization
Ratio
Test statistic (T-test)
Log transformation
Preprocessing
Background correction
Overview further analysis
Raw data
Preprocessed data
Differentially expressed genes
Clusters of coexpressed genes
Preprocessing
Clustering
Test statistic
Comparison of 2 experiments:
• Fold test
• T-test
• SAM
• …
A plethora of different methods is available
Which one performs best?
Different underlying statistical assumptions
Implication on the final result
Difficult to define the best method
Test Statistic
Preprocessing: test statistic
Type1: Comparison of 2 samples
Statistical testing
Control sample
Induced sample
Retrieve statistically over or under expressed genes
Diff Expr Genes: test statistic
black/white experiment description (array V mice genes)
• Condition 1 : pygmy mouse 10 days old (test)
• Condition 2 : normal mouse 10 days old (ref)
detect differentially expressed genes
Experiment design (Latin Square)
Array 1:
Condition 1, dye 1, replicas L and R
Condition 2, dye 2, replicas L and R
Array 2:
Condition 2, dye 1, replicas L and R
Condition 1, dye 2, replicas L and R
Per gene, per condition 4 measurements available
Diff Expr Genes : test statistic
Fold change (ratio test)
4 measurements per gene, condition
Calculate average
Sort averages
Sample/control ratio exceeds a threshold (usually a 2-fold change, i.e. |log2(sample/control)| > 1)
• Arbitrary threshold
• Discards all information obtained from replicates
• Implicitly assumes constant variance but variance depends on expression value
$$M_i = \frac{I_{\mathrm{test},i}}{I_{\mathrm{ref},i}}$$
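As a minimal sketch of the fold test described on this slide, the following Python averages the replicate measurements per condition and flags a gene when the absolute log2 ratio exceeds a threshold. The function name `fold_test`, the example intensities, and the choice of log base 2 with a threshold of 1 (a 2-fold change) are assumptions made for illustration.

```python
import math
from statistics import mean

def fold_test(test, ref, log2_threshold=1.0):
    """Return (log2 ratio, flagged) for one gene.

    test, ref: replicate intensities for the two conditions.
    log2_threshold=1.0 corresponds to a 2-fold change (assumed default).
    """
    m = math.log2(mean(test) / mean(ref))
    return m, abs(m) > log2_threshold

# Hypothetical intensities: 4 measurements per gene and condition.
genes = {
    "gene_a": ([1220, 1190, 1260, 1205], [400, 390, 420, 410]),  # about 3-fold up
    "gene_b": ([510, 495, 505, 520], [500, 515, 490, 505]),      # unchanged
}
for name, (test, ref) in genes.items():
    m, flagged = fold_test(test, ref)
    print(f"{name}: log2 ratio = {m:+.2f}, flagged = {flagged}")
```

Note that only the averages enter the decision: the spread of the four replicates is never used, which is the weakness listed in the bullets above.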
Diff Expr Genes : test statistic
Why does fold change fail:
• Majority of genes expressed at low levels where signal/noise is low
=> not sufficiently conservative
– 2 fold change occurs at random for a large number of genes
– High number of false positives
• At higher levels of expression, smaller changes in gene expression may be real
=> too conservative
– High number of false negatives
Improvement:
– T-test
– pairwise fold change: genes are significantly differentially expressed if an R-fold change is observed consistently between paired samples
– SAM http://www-stat-class.stanford/SAM/SAMServlet
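The failure of the fold test at low expression levels can be illustrated with a small simulation. This is only a sketch under an assumed noise model (multiplicative log-normal noise plus additive background noise); the noise parameters, expression levels, and function names are hypothetical and not taken from the slides. All simulated genes are unchanged between conditions, so every 2-fold call is a false positive.

```python
import math
import random
from statistics import mean

random.seed(0)

def measure(true_level, n=4, additive_sd=30.0, mult_sd=0.1):
    """Simulate n replicate intensities under an ASSUMED noise model:
    multiplicative log-normal noise plus additive background noise.
    Values are clipped at 1 so that the log ratio stays defined."""
    return [max(1.0, true_level * math.exp(random.gauss(0, mult_sd))
                + random.gauss(0, additive_sd)) for _ in range(n)]

def false_positive_rate(true_level, n_genes=2000):
    """Fraction of unchanged genes that still show a >2-fold change."""
    hits = 0
    for _ in range(n_genes):
        test, ref = measure(true_level), measure(true_level)
        if abs(math.log2(mean(test) / mean(ref))) > 1.0:
            hits += 1
    return hits / n_genes

low = false_positive_rate(true_level=30)     # weakly expressed genes
high = false_positive_rate(true_level=5000)  # strongly expressed genes
print(f"false 2-fold calls at low expression:  {low:.3f}")
print(f"false 2-fold calls at high expression: {high:.3f}")
```

Under these assumptions a fixed 2-fold threshold is too permissive for weakly expressed genes while being far stricter than necessary for strongly expressed ones.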
Diff Expr Genes : test statistic
• Possible if replicates of reference and test are available
• Significance of the difference between the reference and test data (level of expression) relative to the observed level of within class variation (consistency)
• Assumptions
• Normal distribution of variables
• Population mean and variance estimated from data => (Student t distribution for H0 hypothesis)
• Not all genes need to have the same variance
• Under null hypothesis sample means should be equal (rescaling obligatory)
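$$t_{\mathrm{unpaired}} = \frac{m_{\mathrm{ref}} - m_{\mathrm{test}}}{\sqrt{\dfrac{s_{\mathrm{ref}}^2}{n_{\mathrm{ref}}} + \dfrac{s_{\mathrm{test}}^2}{n_{\mathrm{test}}}}}$$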
T-test: hypothesis test
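A minimal sketch of the unpaired t-value for a single gene, using separate variance estimates for the reference and test groups. The function name and the example log-intensities are made up for illustration.

```python
import math
from statistics import mean, variance

def unpaired_t(ref, test):
    """t-value for two independent groups with separate variances:
    (m_ref - m_test) / sqrt(s_ref^2 / n_ref + s_test^2 / n_test)."""
    se = math.sqrt(variance(ref) / len(ref) + variance(test) / len(test))
    return (mean(ref) - mean(test)) / se

# Hypothetical log-intensities for one gene, 4 replicates per class.
ref = [8.1, 7.9, 8.3, 8.0]
test = [9.2, 9.0, 9.5, 9.1]
t = unpaired_t(ref, test)
print(f"t = {t:.2f}")
```

The difference between the class means is judged relative to the within-class variation, so a gene with noisy replicates needs a larger difference to reach the same t-value.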
Diff Expr Genes : test statistic
• Consider paired data as new variable
• Calculate average ratio
• Calculate standard deviation of the 4 ratio measurements
Determine t-value
df, student t distribution, t-value
$$T = \frac{\bar{D} - 0}{S_D / \sqrt{n}}$$
p-value (the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis is true)
Paired t-test (microarray data are paired)
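The paired test can be sketched in a few lines: the paired log ratios are treated as one new variable, and their mean is compared with zero relative to their standard error. The function name and the four example log ratios are hypothetical; the critical value 3.182 is the two-sided 5% point of the Student t distribution with 3 degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t(log_ratios, mu0=0.0):
    """Return (t-value, degrees of freedom) for paired log ratios."""
    n = len(log_ratios)
    d_bar = mean(log_ratios)   # average ratio
    s_d = stdev(log_ratios)    # standard deviation of the n ratios
    t = (d_bar - mu0) / (s_d / math.sqrt(n))
    return t, n - 1

# Hypothetical log ratios (test/ref) from the 4 paired measurements of one gene.
log_ratios = [1.1, 0.9, 1.3, 1.0]
t, df = paired_t(log_ratios)
T_CRIT_DF3 = 3.182  # two-sided 5% critical value, Student t, df = 3
print(f"t = {t:.2f}, df = {df}, reject H0 at 5%: {abs(t) > T_CRIT_DF3}")
```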
Diff Expr Genes : test statistic
• Classical hypothesis tests (t-test, Wilcoxon rank-sum test, ...):
– a test statistic is calculated (t-value)
– the probability or p-value is calculated that an equally good or better test statistic is generated if a certain null hypothesis is true
– The null hypothesis: gene has no difference in mean expression levels between 2 conditions
– Low p-value (below rejection level α): null hypothesis is not likely: reject null hypothesis: there is a difference in (mean) expression between the two classes
t-test
H0 H1
H0: D=0
H1: D ≠ 0
Gene x
Type I
Type II
Diff Expr Genes : test statistic
Comparison of fold test with paired t-test
• Gene expression levels measured under two different conditions
• Rejection level α
– pj < α : null hypothesis rejected (result Positive)
– pj > α : null hypothesis not rejected (result Negative)
• But: Multiple testing: Type I and Type II error
= False positives and negatives
Diff Expr Genes : test statistic
• Each gene is assigned a score on the basis of its change in expression relative to the standard deviation of repeated measurements for that gene
• H0 (expected relative difference) is estimated by permutation analysis
– Permute the samples
– Calculate d(i) values for both the experimental samples and the permuted control samples
– Rank genes by magnitude of their d(i) values for both the experimental and the permuted control samples
SAM
Diff Expr Genes : test statistic
Observed values
• Calculate d(i) value for each gene
• Rank genes according to their d(i) value
Simulated values
• Permute dataset
• Calculate d(i) value for each gene in each permuted dataset
• Calculate average d(i) value for each gene
• Rank d(i) values
• Make scatterplot
SAM
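The observed-versus-expected procedure can be sketched as follows. This is a simplified illustration, not the SAM software itself: the fudge factor `s0` is a fixed assumed constant (SAM chooses it from the data), genes are called whenever their observed d(i) deviates from the expected value at the same rank by more than `delta`, and the simulated data set (200 genes, the first 10 truly changed) is hypothetical.

```python
import math
import random
from statistics import mean

def d_values(data, labels, s0=0.1):
    """Relative difference d(i) per gene: mean difference / (s_i + s0)."""
    out = []
    for row in data:
        a = [x for x, g in zip(row, labels) if g == 1]  # test samples
        b = [x for x, g in zip(row, labels) if g == 0]  # reference samples
        ma, mb = mean(a), mean(b)
        ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
        s_i = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
        out.append((ma - mb) / (s_i + s0))
    return out

def sam(data, labels, delta=1.2, n_perm=100, seed=0):
    """Return (gene index, observed d, expected d) for the called genes."""
    rng = random.Random(seed)
    observed = d_values(data, labels)
    order = sorted(range(len(data)), key=lambda i: observed[i])
    expected = [0.0] * len(data)
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)  # permute the sample labels
        # average the ranked d values over all permutations
        for rank, d in enumerate(sorted(d_values(data, perm))):
            expected[rank] += d / n_perm
    return [(i, observed[i], expected[rank]) for rank, i in enumerate(order)
            if abs(observed[i] - expected[rank]) > delta]

# Hypothetical data: 4 test and 4 reference samples, genes 0-9 shifted.
rng = random.Random(1)
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = [[rng.gauss(4.0 * lab if g < 10 else 0.0, 0.5) for lab in labels]
        for g in range(200)]
called = sam(data, labels)
print(len(called), "genes called significant")
```

Plotting the observed against the expected d values of all genes gives the scatterplot mentioned above; genes far from the diagonal are the ones called significant.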
Diff Expr Genes : test statistic
SAM Plot
[Figure: SAM plot of observed versus expected d(i) values. Significant: 37; median # false significant: 2.0; delta: 1.2; threshold: 0.0]
SAM
Diff Expr Genes : test statistic
T-test
Paired t-test
SAM
Parametrized : Student t-distribution
Errors normally distributed
Restricted number of repeat measurements
Impossible to evaluate assumption
No explicit assumption
Order statistics
Test statistic Assumptions Distribution H0
Errors equal variance (iid)
Less stringent assumption
Diff Expr Genes : test statistic
Multiple testing: problem
• P value: measure of significance in terms of the false positive rate
• The rate that truly null features are called significant
• Significance is 5%: on average 5% of the truly null features will be called significant (type-I error)
• Type I error: Null hypothesis rejected when it is true – ‘accidental’ low p-value – falsely declared differentially expressed = false positive
• Multiple testing: Example: 10000 genes with random expression profiles, α = 5%: one would find on average 500 genes with a p-value lower than 5% = false positives
• Type II error: Null hypothesis not rejected when it is not true (false negatives). Gene that is actually differentially expressed is not declared differentially expressed.
Adapted from De Smet et al
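The 10000-gene example can be checked with a short simulation. Under the null hypothesis a p-value is uniformly distributed between 0 and 1, so the sketch below simply draws uniform random numbers in place of running a test on random expression profiles; the seed and variable names are arbitrary.

```python
import random

random.seed(42)
n_genes, alpha = 10_000, 0.05

# Under H0 the p-value is uniform on [0, 1], so null genes can be
# simulated by drawing uniform random numbers.
p_values = [random.random() for _ in range(n_genes)]
false_positives = sum(p < alpha for p in p_values)
print(f"{false_positives} of {n_genes} null genes have p < {alpha}")
```

The count fluctuates around 500 from run to run, even though no gene is differentially expressed.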
Diff Expr Genes: test statistic
Multiple testing: solutions
• Control of the familywise error rate (FWE): P(FP ≥ 1), protection against type I errors
• Bonferroni correction: reject null hypothesis at rejection level α/N, which guarantees that FWE = P(FP ≥ 1) ≤ α
• Is OK when very few genes are expected to be actually differentially expressed (i.e., affected by the difference in conditions / for which the null hypothesis is false): every false positive is ‘costly’
• Rejection rate becomes very conservative
• But in microarray data, usually a considerable number of genes is actually differentially expressed: control of the FWE results in a severe loss of statistical power (FN or type II error is large)
• In practice we do not have to protect against every possible FP
Better solution FDR: false discovery rate
Adapted from De Smet et al
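A minimal sketch of the Bonferroni correction: each of the N p-values is compared with the rejection level divided by N. The function name and the example p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for a gene only if its p-value is below alpha / N."""
    n = len(p_values)
    return [p < alpha / n for p in p_values]

# Hypothetical p-values: 5 small ones among 10000 genes.
p_values = [1e-7, 2e-6, 4e-5, 0.003, 0.04] + [0.5] * 9995
rejected = bonferroni_reject(p_values)
unadjusted = sum(p < 0.05 for p in p_values)
print(f"unadjusted: {unadjusted} genes, Bonferroni: {sum(rejected)} genes")
```

With N = 10000 the per-gene rejection level drops to 0.000005, which shows how conservative the correction becomes on microarray-sized data.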
Diff Expr Genes: test statistic
• We need a sensible balance between the number of true positives and the number of false positives
• Therefore it is better to control the ‘False Discovery Rate’ (FDR) instead of the FWE:
• The false positive rate: The rate that truly null features are called significant
• The FDR: = % of false positives among all the genes that are declared positive = % of true null hypotheses erroneously rejected among all the null hypotheses rejected
Adapted from De Smet et al
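$$\mathrm{FDR} = E\left(\frac{F}{F+T}\right) = E\left(\frac{F}{S}\right)$$
where F is the number of false positives, T the number of true positives and S = F + T the number of features called significant.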
FDR
Diff Expr Genes: test statistic
Difference p-value and FDR
• 5% FDR: 5% false positives among the features called significant
• 5% p value cutoff: 5% false positives among all the null features in the dataset, says little about the content of the features actually called significant
Diff Expr Genes: test statistic
• An estimate of E[S(t)] is the observed S(t): i = the number of observed p-values ≤ p_i
• E[F(t)] = N0 · p_i
• Estimate N0
$$\mathrm{FDR} = E\left(\frac{F}{F+T}\right) = E\left(\frac{F}{S}\right)$$
[Figure: p-value distributions. No real differential expression (randomised data set): uniform distribution. Non-accidental differential expression: superposition of two distributions, with the rejection level separating TP and FP from FN and TN.]
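$$\hat{\pi}_0(\lambda) = \frac{\text{number of genes with } p\text{-value} > \lambda}{N(1-\lambda)}$$
so that N0 = \hat{\pi}_0(\lambda) · N, where λ is a p-value threshold above which genes are assumed to be null.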
Adapted from De Smet et al
$$\mathrm{FDR}(p_i) = \frac{\hat{\pi}_0 \, N \, p_i}{i}$$
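The FDR estimate on this slide can be sketched in Python: the proportion of null genes is estimated from the p-values above a threshold, and the FDR at each p-value is the expected number of false positives divided by the number of genes called. The threshold `lam` (0.5 here), the function names, and the simulated p-values are assumptions for illustration; the sketch also assumes there are no tied p-values, so that the rank of p_i equals the number of p-values ≤ p_i.

```python
import random

def estimate_pi0(p_values, lam=0.5):
    """Estimated fraction of null genes: #(p > lam) / (N * (1 - lam))."""
    n = len(p_values)
    return min(1.0, sum(p > lam for p in p_values) / (n * (1 - lam)))

def fdr_estimates(p_values, lam=0.5):
    """Return (p_i, estimated FDR) pairs, sorted by p-value."""
    n = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    ranked = sorted(p_values)
    # i = rank of p_i = number of observed p-values <= p_i (no ties assumed)
    return [(p, min(1.0, pi0 * n * p / i)) for i, p in enumerate(ranked, start=1)]

# Hypothetical data: 900 null genes (uniform p-values) and 100 genes
# with real differential expression (very small p-values).
random.seed(3)
p_values = ([random.random() for _ in range(900)]
            + [random.random() * 0.001 for _ in range(100)])
pi0 = estimate_pi0(p_values)
results = fdr_estimates(p_values)
p_100, fdr_100 = results[99]
print(f"estimated pi0 = {pi0:.2f}")
print(f"calling the 100 smallest p-values: FDR estimate = {fdr_100:.4f}")
```

Unlike a p-value cutoff, the estimate describes the genes actually called significant: it is the expected fraction of false positives within that list.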
Overview
MICROARRAY PREPROCESSING
• Gene expression
• Omics era
• Transcript profiling
• Experiment design
• Preprocessing
• Slide by slide normalisation
• ANOVA
• Exercises