Lecture 8 2014
Quality control of high throughput biological data
and Statistical testing for large biological data
Anja Bråthen Kristoffersen
Introduction
• There are many sources of variability and bias in high-throughput
biological experiments.
• These make it difficult to distinguish between biological differences
and experimental noise.
• Raw data can be very misleading.
• We will look at design, transformation and normalization
methods to reduce technical noise.
Statistical bioinformatics 3
Randomization of experiments
• Ensure that you will not have any systematic biases:
– Distribute the biological groups in a balanced way
– Divide into batches of the same size, limited by the
capacity on each step
• Randomize and balance according to the biology
that you are interested in
– Make random numbers by using the function sample() in R
– E.g. draw 10 numbers between 1 and 10 without
replacement:
> sample(10, 10, replace = FALSE)
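The randomization and balancing described above can be sketched with sample(). This is only an illustration under assumed numbers: 12 samples, three biological groups (A, B, C) and two batches of six are hypothetical, not taken from the lecture.

```r
# Hypothetical design: 12 samples from 3 biological groups, 2 batches of 6.
set.seed(1)  # only to make the sketch reproducible
group <- rep(c("A", "B", "C"), each = 4)

# Within each group, randomly put half of the samples in each batch,
# so that every batch contains the same number of samples per group.
batch <- integer(length(group))
for (g in unique(group)) {
  idx <- which(group == g)
  batch[idx] <- sample(rep(1:2, length.out = length(idx)))
}

# Random processing order for the whole experiment.
run_order <- sample(length(group))

plan <- data.frame(sample = seq_along(group), group, batch, run_order)
print(table(plan$group, plan$batch))
```

The table should show two samples of every group in each batch, which is the balance the slide asks for.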
Experimental plan: an example
11 January 2014 Statistical bioinformatics 5
Samples color coded according to biology
Samples color coded according to labeling date
Precautions
• Experimental methods should be standardized
across the same experiment
– ideally across all experiments
• Multiple biological replicates make it possible to
account for individual variability.
• If possible, use multiple technical replicates
– Partition the same sample into multiple runs or even
multiple machines
• In the end, the data should be precise, accurate,
and directly comparable to other data.
Statistical goals in quality control analysis
• Examine the distributional properties of the data and
assess their quality
– Goal 1: to examine whether the data are appropriate
for any subsequent analysis → outlier detection
– Goal 2: to investigate the variability and relationships
among different samples and replicates
The goal
• Most statistical analyses rely on well-behaved
distributions.
– Skewed distribution → data transformation
– Heterogeneous variance → variance-stabilizing
transformation
• e.g. power transformations
– Outliers and noise → robust statistics, e.g.
• the median is robust while the mean is not
– Data from different experiments should be comparable
• Data normalization
Example: effect of outliers
[Figure: a random set of 10 points, and the same set with one more point added]
Even one outlier can change your whole idea about the data, if you are
not careful!
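A small sketch of this effect, using simulated numbers rather than the points from the slide: one extreme point (placed at (10, 10) purely for illustration) is added to 10 random points, and the correlation, mean and median are compared before and after.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
x <- rnorm(10)
y <- rnorm(10)

# Add one extreme point far away from the rest of the data.
x_out <- c(x, 10)
y_out <- c(y, 10)

cat("correlation, 10 points:       ", cor(x, y), "\n")
cat("correlation, with one outlier:", cor(x_out, y_out), "\n")

# The median barely moves, while the mean is pulled towards the outlier.
cat("mean:  ", mean(x), "->", mean(x_out), "\n")
cat("median:", median(x), "->", median(x_out), "\n")
```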
Motivation for Normalization
• Assume you do an experiment and find a negative correlation. Then
you combine this with the results from your colleague, who used a
different reference:
[Figure: two scatter plots, "Your results" and "Combined results"]
This is called Simpson's Paradox, and it can ruin your whole day!
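The situation can be reproduced with simulated data. In this sketch the slope of -0.8, the noise level and the offset of 6 between the two labs are all assumed values chosen for illustration; both datasets have a negative correlation on their own, yet pooling them without normalization gives a positive one.

```r
set.seed(7)  # arbitrary seed
n <- 50

# Your experiment: negative relationship between x and y.
x_you <- rnorm(n)
y_you <- -0.8 * x_you + rnorm(n, sd = 0.5)

# Colleague: same relationship, but a different reference shifts everything.
offset <- 6  # hypothetical reference difference
x_col <- rnorm(n) + offset
y_col <- -0.8 * (x_col - offset) + offset + rnorm(n, sd = 0.5)

cor_you      <- cor(x_you, y_you)
cor_col      <- cor(x_col, y_col)
cor_combined <- cor(c(x_you, x_col), c(y_you, y_col))

# Centering each dataset before pooling removes the artefact.
cor_centered <- cor(c(x_you - mean(x_you), x_col - mean(x_col)),
                    c(y_you - mean(y_you), y_col - mean(y_col)))

print(c(you = cor_you, colleague = cor_col,
        combined = cor_combined, centered = cor_centered))
```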
Descriptive Statistics - Box-plot
[Figure: box-plot of x. The box spans the 25% quantile to the 75% quantile, with the median marked inside. The whiskers reach at most 1.5 x IQR from the box; everything above or below is considered an outlier.]
IQR = 75% quantile - 25% quantile = interquartile range
x <- rnorm(100, mean=0, sd=1)
boxplot(x)
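The outlier rule in the box-plot can also be computed by hand. This sketch adds one artificial extreme value (5) to the simulated data so that there is something to flag; note that boxplot() itself uses hinges, which can differ slightly from the quantiles used here.

```r
set.seed(1)  # arbitrary seed
x <- c(rnorm(100, mean = 0, sd = 1), 5)  # 5 is an artificial extreme value

q   <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]              # same value as IQR(x)

lower <- q[1] - 1.5 * iqr       # lower fence
upper <- q[2] + 1.5 * iqr       # upper fence

outliers <- x[x < lower | x > upper]
cat("fences:", lower, upper, "\n")
cat("flagged as outliers:", outliers, "\n")
```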
Transformations (log)
$y = e^{x}, \quad x = \ln(y)$
$y = 10^{x}, \quad x = \log_{10}(y)$
$y = 2^{x}, \quad x = \log_{2}(y)$
• cannot handle zero or negative values
• reduces the impact of extreme values
• log2 transformation helps you easily identify
doublings or halvings in ratios.
Biexponential transformation
• Arcsinh
$y = \frac{e^{x} - e^{-x}}{2}, \quad x = \ln\left(y + \sqrt{y^{2} + 1}\right)$
• Logicle
$y = a e^{b(x - w)} - c e^{-d(x - w)} + f$
• Logicle transform is similar to arcsinh but with
more parameters in the transformation.
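A short sketch of the arcsinh transformation using base R's asinh(). The input values are made up; the point is that arcsinh is defined for zero and negative values, where the logarithm is not, and behaves like a logarithm for large values. The logicle transform is not in base R and is not shown.

```r
y <- c(-100, -10, -1, 0, 1, 10, 100)  # hypothetical values, including negatives

x_asinh  <- asinh(y)                   # built-in inverse hyperbolic sine
x_manual <- log(y + sqrt(y^2 + 1))     # the formula from the slide

# log() gives NaN for negative values and -Inf for zero.
x_log <- suppressWarnings(log(y))

print(data.frame(y, x_asinh, x_manual, x_log))

# For large y, asinh(y) is close to log(2 * y).
cat("asinh(100) - log(200) =", asinh(100) - log(200), "\n")
```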
Example: histogram and boxplot
Log transformation
• Data is highly skewed (positively)
– Lots of small values with a few very large values.
• Need to transform this into a well-behaved
distribution.
– Ideally something like a Gaussian.
• Log transformation is generally used for positively
skewed data.
– Use log2(X)
– More intuitive: each step of 1 on the log2 scale is a twofold change
(+1 → × 2)
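As an illustration, the sketch below simulates positively skewed data (log-normal, with arbitrary parameters) and compares a simple moment-based skewness estimate before and after log2 transformation. The skew() helper is defined here for the example only.

```r
set.seed(3)  # arbitrary seed
raw <- rlnorm(1000, meanlog = 5, sdlog = 1.5)  # many small values, a few very large
logged <- log2(raw)

# Simple moment-based skewness; roughly 0 for a symmetric distribution.
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

cat("skewness of raw data: ", skew(raw), "\n")
cat("skewness after log2:  ", skew(logged), "\n")

# Doubling a value adds exactly 1 on the log2 scale.
cat("log2(2 * 100) - log2(100) =", log2(200) - log2(100), "\n")
```

Calling hist(raw) and hist(logged) shows the same change graphically.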
Example: histogram and boxplot, after log transformation
QQ-plot
• The QQ-plot shows the theoretical quantiles
versus the empirical quantiles. If the assumed
(theoretical) distribution is indeed the correct
one, we should observe a straight line with
gradient equal to 1 (when both sets of quantiles are on the same scale).
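The construction of a QQ-plot can be sketched by hand. The data here are simulated (normal with mean 10 and sd 2, and an exponential sample for contrast), and the theoretical quantiles are taken from a normal distribution with the sample's own mean and standard deviation, so both axes are on the same scale.

```r
set.seed(5)  # arbitrary seed
x <- rnorm(200, mean = 10, sd = 2)

empirical   <- sort(x)
theoretical <- qnorm(ppoints(length(x)), mean = mean(x), sd = sd(x))

# For normal data the points lie close to the line y = x, i.e. slope near 1.
slope <- unname(coef(lm(empirical ~ theoretical))[2])
cat("slope for normal data:", slope, "\n")

# A skewed sample deviates from the straight line.
x_skewed    <- sort(rexp(200))
theo_skewed <- qnorm(ppoints(200), mean = mean(x_skewed), sd = sd(x_skewed))

r_normal <- cor(theoretical, empirical)
r_skewed <- cor(theo_skewed, x_skewed)
cat("QQ correlation, normal vs skewed:", r_normal, r_skewed, "\n")
```

plot(theoretical, empirical) followed by abline(0, 1) draws the plot itself.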
QQ-plot
DLBCLpatientDataNEW.txt
• http://llmpp.nih.gov/DLBCL/
• Is already normalized
summary(dat[,7:12])
boxplot(dat[,7:12])
x <- c(dat[,7], dat[,8], dat[,9], dat[,10], dat[,11], dat[,12])
hist(x)
qqnorm(x)
qqline(x)
Data Normalization
• Normalization allows us to handle several datasets of
different origin and use them together.
– Remember Simpson's Paradox!
• There are several standard methods:
– Shifting: Add a constant to all data points, shifting the mean.
• Called centering if the constant added is −µ
– Scaling: Multiply data points with a scaling factor based on some
reference mean, $x_{ref}$:
$x'_{ij} = x_{ij} \, \frac{x_{ref}}{\bar{x}_j}$
– Quantile normalization: Match quantiles of two distributions
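Shifting and scaling can be sketched with sweep() on a matrix with one column per dataset. The three simulated samples and the choice of the overall mean as reference are assumptions made for this example.

```r
set.seed(2)  # arbitrary seed
# Hypothetical data: 100 measurements in each of 3 samples with different means.
m <- cbind(s1 = rnorm(100, mean = 8),
           s2 = rnorm(100, mean = 10),
           s3 = rnorm(100, mean = 9))

# Shifting (centering): subtract each column's mean.
centered <- sweep(m, 2, colMeans(m), "-")

# Scaling: multiply each column by x_ref / (column mean).
x_ref  <- mean(m)  # assumed reference: the overall mean
scaled <- sweep(m, 2, x_ref / colMeans(m), "*")

print(round(colMeans(centered), 10))
print(colMeans(scaled))
```

After centering all column means are 0, and after scaling they all equal the reference mean.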
Quantile normalization
If you have a reference distribution:
• Sort your data.
• For any value in your data, find its rank among all other
data points (rank 1 for the largest value), and calculate the
probability that X < x:
$P(X < x) = 1 - \frac{\mathrm{rank}(x)}{n}$
• Look up the value for that probability in the reference
cumulative distribution function (CDF).
• Replace your value with the reference value at the
same quantile.
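The steps above can be sketched as follows. Both the data and the reference sample are simulated. The sketch uses ascending ranks, so the probability is estimated as (rank - 0.5) / n rather than with the descending-rank formula; the 0.5 offset is a choice made here to avoid probabilities of exactly 0 or 1.

```r
set.seed(11)  # arbitrary seed
x   <- rexp(500, rate = 0.2)   # hypothetical data to normalize
ref <- rnorm(2000)             # hypothetical reference distribution (as a sample)

# Empirical probability for each value, based on its ascending rank.
p <- (rank(x) - 0.5) / length(x)

# Look up the reference value at the same quantile and use it as replacement.
x_norm <- quantile(ref, probs = p, names = FALSE)

cat("median before:", median(x), " after:", median(x_norm),
    " reference:", median(ref), "\n")
```

The ordering of the values is preserved; only their distribution changes to that of the reference.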
Mean average plot (MA plot)
• An XY scatter plot often leads to seeing biased error patterns
• Mathematical bias when a regression-based normalization is
used
• MA transformation: A = (X1 + X2)/2 and M = X1 − X2
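A minimal sketch of the MA transformation on simulated log-scale intensities. The systematic difference of 0.5 between the two samples is an assumed bias added for illustration.

```r
set.seed(9)  # arbitrary seed
true_level <- rnorm(1000, mean = 8, sd = 2)          # hypothetical log intensities
X1 <- true_level + rnorm(1000, sd = 0.2)
X2 <- true_level + 0.5 + rnorm(1000, sd = 0.2)       # sample 2 is 0.5 higher throughout

A <- (X1 + X2) / 2   # average intensity
M <- X1 - X2         # difference between the two samples

cat("mean of M:", mean(M), "\n")  # a value away from 0 indicates bias
```

plot(A, M) followed by abline(h = 0) gives the MA plot.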
MA plot
• The MA plot in the example shows bias.
• Typically, you want a distribution centered on M = 0.
MA plot, baseline correction
• The distribution can be corrected by finding and removing
the baseline of the MA plot.
– Locally weighted scatterplot smoothing (LOESS)
– Problem: intensity values are nonlinearly transformed by the
normalization, so linear relationships such as fold changes are not
completely conserved.
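Baseline removal with LOESS can be sketched using base R's loess(). The curved, intensity-dependent bias below is simulated, and span = 0.5 is an arbitrary smoothing choice, not a recommendation from the lecture.

```r
set.seed(10)  # arbitrary seed
A <- runif(500, min = 4, max = 14)
# Hypothetical intensity-dependent bias plus noise.
M <- 0.05 * (A - 9)^2 - 0.3 + rnorm(500, sd = 0.2)

# Estimate the baseline of the MA plot and subtract it.
fit <- loess(M ~ A, span = 0.5)
M_corrected <- M - fitted(fit)

cat("mean M before:", mean(M), " after:", mean(M_corrected), "\n")
cat("sd   M before:", sd(M), " after:", sd(M_corrected), "\n")
```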
Statistical testing and large datasets
Sensitivity, specificity, FPR, FNR and FDR
                Test result
Disease         Negative (testedN)                    Positive (testedP)
Negative (N)    Correct                               False positive (FP) (type I error)
Positive (P)    False negative (FN) (type II error)   Correct

$\text{False positive rate} = FPR = \frac{FP}{N} = 1 - \text{specificity}$
$\text{False negative rate} = FNR = \frac{FN}{P} = 1 - \text{sensitivity}$
$\text{False discovery rate} = FDR = \frac{FP}{testedP}$
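A worked example of these rates, using made-up counts for the four cells of the table:

```r
# Hypothetical counts for the four cells of the table.
TN <- 900  # truly negative, tested negative
FP <- 50   # truly negative, tested positive
FN <- 20   # truly positive, tested negative
TP <- 30   # truly positive, tested positive

N       <- TN + FP   # all truly negative
P       <- FN + TP   # all truly positive
testedP <- FP + TP   # all tested positive

FPR <- FP / N         # 1 - specificity
FNR <- FN / P         # 1 - sensitivity
FDR <- FP / testedP

print(c(FPR = FPR, FNR = FNR, FDR = FDR,
        specificity = TN / N, sensitivity = TP / P))
```

With these counts the false positive rate is low while the false discovery rate is high, because true positives are rare.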
Plot of FPR vs. 1-FNR of a statistical test
• Need to know the true positives and true
negatives. Easily done in R using the package
ROCR
install.packages("ROCR")
library(ROCR)
# pvalue and truePN are vectors of the same length: truePN holds the true
# positive or negative label, pvalue the p-value calculated for each data point.
# ROCR treats higher scores as more positive; if small p-values indicate
# positives, pass 1 - pvalue instead of pvalue.
pred <- prediction(pvalue, truePN)
# "tpr" (= 1 - FNR) plotted against "fpr" gives the ROC curve
perf <- performance(pred, "tpr", "fpr")
plot(perf)
Receiver Operating Characteristic (ROC) curve
Multiple hypothesis testing
• A test is designed to have a given expected
proportion of incorrectly rejected null hypotheses;
most often this level is 5%.
• When many tests are done, the probability of
falsely rejecting at least one null hypothesis increases,
hence we correct the p-values according
to how many tests are done.
Example 10000 genes
• Q: is gene g, g = 1, …, 10 000, differentially
expressed?
• Gives 10 000 null hypotheses: $H_0^{1}, H_0^{2}, \ldots, H_0^{10000}$
– $H_0^{1}$: gene 1 not differentially expressed
– …
• Assume: no genes differentially expressed
– $H_0^{g}$ true for all g
• Significance level α = 0.01
– The probability of incorrectly concluding that any one gene is
differentially expressed is 0.01, i.e. 0.01 * 10000 = 100
expected wrong rejections of $H_0^{g}$
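The expected number of wrong rejections can be checked by simulation. The sketch relies on the fact that p-values from a continuous test are uniformly distributed when the null hypothesis is true, so 10 000 uniform random numbers stand in for the p-values of 10 000 genes with no differential expression.

```r
set.seed(123)  # arbitrary seed
n_genes <- 10000
alpha   <- 0.01

# All null hypotheses are true, so the p-values are uniform on [0, 1].
p <- runif(n_genes)

expected_false <- alpha * n_genes
observed_false <- sum(p < alpha)

cat("expected wrong rejections:", expected_false, "\n")
cat("observed in this simulation:", observed_false, "\n")
```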
Need to control the risk of false positives (Type I errors)
• Corrected p-value:
– The original p-values do not tell the full story.
– Instead of using the original p-values for decision
making, we should use corrected ones.
Different correction methods
• Bonferroni (1935)
– Just multiply all the p-values by the number of tests
– Too conservative
• need a very small p-value to reject $H_0$
• gives very little power
• Methods that control the family-wise error rate
(FWER).
• Methods that control the false discovery rate
(FDR).
Family-Wise Error Rate (FWER)
• Control type I errors at a level α: Pr(FP ≥ 1) ≤ α
– Control the probability of making any false positive call at
the desired significance level
– Conservative methods such as Bonferroni correction
• Multiply each p-value by the number of tests done (e.g. genes),
or equivalently divide α by the number of tests
– Other less conservative but similar methods are:
• Sidak
• Bonferroni-Holm
• Westfall & Young
• Use one of these if you are most afraid of getting
stuff on your significant list that should not have
been there
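A small sketch of FWER corrections with p.adjust() from base R, applied to five made-up p-values. The Bonferroni adjustment is also computed by hand for comparison.

```r
p <- c(0.0001, 0.004, 0.019, 0.03, 0.2)  # hypothetical raw p-values
m <- length(p)

# Bonferroni by hand: multiply by the number of tests, cap at 1.
bonf_manual <- pmin(1, p * m)

bonf <- p.adjust(p, method = "bonferroni")
holm <- p.adjust(p, method = "holm")

print(data.frame(p, bonf_manual, bonf, holm))
```

The Holm-adjusted p-values are never larger than the Bonferroni ones, which is why the method is less conservative.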
False Discovery Rate (FDR)
• Calculate the expected proportion of type I errors
among the rejected hypotheses:
– FDR = E(#false positive predictions / #total positive predictions)
• Control the proportion of false positive calls among all
positive calls at the desired level
• Technique that applies to a set of p-values
– Benjamini & Hochberg
– Different newer variants of Benjamini & Hochberg
• Use one of these if you are most afraid of
missing out on interesting stuff
help(p.adjust)
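A sketch comparing Benjamini & Hochberg with Bonferroni through p.adjust(). The data are simulated: 9000 genes with uniform (null) p-values and 1000 genes whose p-values are drawn from a Beta(0.1, 5) distribution, an arbitrary choice that simply produces many small p-values.

```r
set.seed(2014)  # arbitrary seed
p_null <- runif(9000)                 # genes with no real difference
p_alt  <- rbeta(1000, 0.1, 5)         # hypothetical truly changed genes
p      <- c(p_null, p_alt)
is_alt <- rep(c(FALSE, TRUE), times = c(9000, 1000))

bh   <- p.adjust(p, method = "BH")
bonf <- p.adjust(p, method = "bonferroni")

sig_bh   <- bh < 0.05
sig_bonf <- bonf < 0.05

# Observed proportion of false positives among the BH discoveries.
fdp_bh <- sum(sig_bh & !is_alt) / sum(sig_bh)

cat("significant with Bonferroni:", sum(sig_bonf), "\n")
cat("significant with BH:        ", sum(sig_bh), "\n")
cat("false discovery proportion (BH):", fdp_bh, "\n")
```

BH keeps more genes on the significant list than Bonferroni while holding the proportion of false discoveries near the chosen level.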
False discovery rate (FDR)
2014.03.05 42