MSCL Analyst’s Toolbox, Part 2
January 2008 1
MSCL Analyst’s Toolbox, Part 2
Instructors:
Jennifer Barb, Zoila G. Rangel, Peter Munson
March 2007
Mathematical and Statistical Computing Laboratory
Division of Computational Biosciences
January 2008 2
Statistical topics
• Quality Control Charts
• False Discovery Rate
• Principal Components Analysis explained
• PCA Heatmap
• Data normalization, transformation
• Affymetrix probesets and “Probe-level” analysis
• MAS5, RMA, S10 compared
January 2008 3
Gene Expression Microarrays
• Started in mid-1990s, exponential growth in popularity
• High-throughput -- measures 10,000s of genes at once
• Very noisy -- systematic and random errors
– Chip manufacturing, printing artifacts
– RNA sample quality issues
– Sample preparation, amplification, labeling reaction problems
– Hybridization reaction variability
– Linearity of response, saturation, background
• Affymetrix has controlled chip quality well.
• REPLICATION IS STILL REQUIRED!
• Statistical methods are critical in analysis!
• Quality Control is Essential!
January 2008 4
Quality Control Plots for Parameters RawQ, ScaleFactor
January 2008 5
New Scanner Installed
Scanner “burn-in”?
Quality Control Plots for Parameters RawQ, ScaleFactor
January 2008 6
Quality Control
Figure 2
Figure 3
Levey-Jennings (QC) plots for the parameters RawQ, which reflects image background noise, and SF, the scale factor, for over 700 Affymetrix U133A and U133 Plus 2 arrays processed in the NIH/CC and NHLBI microarray laboratories. Average and Upper Control Limit values are set from historical data extending over 5 years. These and other parameters are tracked regularly and used as the basis for accepting new array data.
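The control-limit idea in the caption above can be sketched in a few lines. This is only an illustration: the slides do not say how the Upper Control Limit is computed, so the "mean plus 3 standard deviations" rule, the function names, and the RawQ values below are all assumptions.

```python
import numpy as np

def control_limits(historical, n_sd=3.0):
    """Return (average, upper control limit) from historical QC values.

    Assumption: UCL = mean + n_sd * sample standard deviation.
    """
    h = np.asarray(historical, dtype=float)
    center = h.mean()
    ucl = center + n_sd * h.std(ddof=1)
    return center, ucl

def flag_out_of_control(new_values, ucl):
    """True for each new array whose QC parameter exceeds the UCL."""
    return [bool(v > ucl) for v in new_values]

# Hypothetical RawQ values from previously accepted arrays
historical_rawq = [2.0, 2.2, 1.8, 2.1, 1.9]
center, ucl = control_limits(historical_rawq)

# Two hypothetical new arrays: the second one would be flagged
flags = flag_out_of_control([2.1, 3.0], ucl)
print(f"average={center:.2f}, UCL={ucl:.2f}, flags={flags}")
```

The same check would be run separately for each tracked parameter (RawQ, ScaleFactor, and so on).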
January 2008 7
Experimental Designs for Gene Expression
• Cross-sectional clinical studies from 2 or more patient groups or tissues; identify markers, prognostic indicators.
• Animal model: samples compared between treatments, groups, or over time; identify genes involved in disease process.
• Intervention Trial: collect blood samples pre/post treatment or over time, identify (and rationalize) genes involved.
• Cell culture: Treat cells in culture, identify genes and patterns of response. Complex study designs possible.
• Genetic Knock-out: Perturb genotype, give treatment, investigate expression response, in animal or cells.
January 2008 8
Gene Expression Analysis Strategies
• Clinical Studies:
– Exploratory analysis, Hierarchical Cluster, Heat maps
– Sample size often insufficient
– Two-sample tests, Discriminant Analysis, “machine learning” approaches to find prognostic factors
• Designed studies: Analysis plan should follow design
– T-tests, one-way ANOVA to select significantly changing genes
– Blocking to account for experimental batch
– Two-way ANOVA for complete two-factor experiments
– Regression (etc.) for time-course experiments
• Corrections for multiple-comparison (20,000 genes tested)
– False Discovery Rate
• Interpretation of gene lists (open-ended problem!)
January 2008 9
P-values should be uniformly distributed (if no genes are differentially expressed)
• Note excess of small p-values in 45,000 probe sets
• Indicates presence of significant, differentially expressed genes
Cut at p<.05
False discoveries
True discoveries
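The reasoning behind this histogram can be reproduced with simulated data. The sketch below is illustrative only: the number of changed genes, the effect size, the group size, and the use of a simple z-test with known variance are all assumptions, not part of the original analysis.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical experiment: 45,000 probe sets, 2,000 of them truly changed
n_genes, n_changed, n_per_group = 45_000, 2_000, 4
shift = np.zeros(n_genes)
shift[:n_changed] = 2.0  # assumed effect size, in units of the noise SD

control = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
treated = rng.normal(shift[:, None], 1.0, size=(n_genes, n_per_group))

# Two-sided z-test per gene (noise SD assumed known and equal to 1)
z = (treated.mean(axis=1) - control.mean(axis=1)) / math.sqrt(2.0 / n_per_group)
p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

# Histogram with 20 bins, so the first bin is "p < .05"
counts, _ = np.histogram(p, bins=20, range=(0.0, 1.0))

# If every gene were unchanged, each bin would hold about n_genes / 20
expected_false = 0.05 * n_genes
excess = counts[0] - expected_false
print(f"p<.05: {counts[0]} found, about {expected_false:.0f} expected by chance")
```

The flat part of the histogram estimates the false discoveries in the first bin; the excess above it estimates the true discoveries. Using all tests for the chance level is slightly conservative, since some genes really are changed.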
January 2008 10
False Discovery Rate calculation
(simplified version)
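\[
\mathrm{FDR}^{*} = \frac{\text{Expected Number of False Discoveries}}{\text{Number Discovered}}
= \frac{(\text{Number of tests}) \times (\text{p-value cutoff})}{\text{Number Discovered at this p-value}}
\]
Example: 48 genes detected at p<.001 in chip with 12,000 genes.
\[
\mathrm{FDR} = \frac{12{,}000 \times 0.001}{48} = \frac{12}{48} = 25\%
\]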
*Benjamini, Y., Hochberg, Y. (1995) JRSS-B, 57, 289-300.
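The simplified calculation is plain arithmetic and can be checked directly. The function name below is hypothetical, and returning 0 when nothing is discovered is a convention chosen here, not something the slide specifies.

```python
def simple_fdr(n_tests, p_cutoff, n_discovered):
    """Simplified FDR: expected false discoveries / number discovered."""
    if n_discovered == 0:
        # Assumed convention: no discoveries means no false discoveries
        return 0.0
    expected_false = n_tests * p_cutoff
    return expected_false / n_discovered

# The slide's example: 48 genes at p < .001 on a chip with 12,000 genes
fdr = simple_fdr(n_tests=12_000, p_cutoff=0.001, n_discovered=48)
print(f"FDR = {fdr:.0%}")
```

Note that this simplified ratio is not capped: with few discoveries and a loose cutoff it can exceed 1.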
January 2008 11
False Discovery Rate calculation
(full version)
\[
\mathrm{FDR}(p) = \min_{p^{*} \ge p} \frac{\text{NumberOfTests} \times p^{*}}{\text{NumberDetected}(p^{*})}
\]
Now we have a guarantee that
\[
\mathrm{FDR}(p) \le 1
\]
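A minimal sketch of the full calculation, applied to a whole list of p-values at once. It assumes that NumberDetected(p*) is the count of tests with p-value at or below p*, and that the minimum runs over all p* at or above p; the function name is hypothetical.

```python
import numpy as np

def fdr_full(pvalues):
    """FDR(p) = min over p* >= p of NumberOfTests * p* / NumberDetected(p*)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # After sorting, the k-th smallest p-value has k tests detected at or below it
    n_detected = np.arange(1, n + 1)
    ratio = n * sorted_p / n_detected
    # Running minimum taken from the largest p-value downward
    fdr_sorted = np.minimum.accumulate(ratio[::-1])[::-1]
    # Put the results back in the original order
    fdr = np.empty(n)
    fdr[order] = fdr_sorted
    return fdr

example = fdr_full([0.001, 0.04, 0.03, 0.9])
print(np.round(example, 4))
```

The largest p-value always has ratio n * p / n = p, which is at most 1, so the running minimum can never exceed 1. It also makes FDR(p) non-decreasing in p.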
January 2008 12
January 2008 13
Gene Expression Data Matrix, X
(transpose of “Final File” format)
[Figure: Expression Matrix, X. Rows are Samples (1 to n), columns are Genes (1 to 12,625). Information about each Sample accompanies the rows; Annotations for each Gene accompany the columns.]
January 2008 14
Analyzing the Data Matrix
• "Pre-condition" the Expression Data Matrix
• Select "significant" Genes (False Discovery Rate)
• Select relevant Samples (Outlier rejection, QC)
• Re-order, partition the Genes ("clustering")
• Re-order the Samples
• Visualize the matrix ("heat-map", PCA scatterplot), encode Gene and Sample annotations
• Visualize by Sample (rows of X, scatterplots, line plots)
• Visualize by Gene (cols of X)
• Visualize the Annotations (how?)
• Browse the display for new hypotheses!
January 2008 15
Principal Component Analysis
Each Principal Component is an orthogonal, linear combination of the expression levels. For the ith gene chip:
\[
\begin{aligned}
PC_1(i) &= a_{1,1}\,x_1(i) + a_{1,2}\,x_2(i) + \dots + a_{1,12625}\,x_{12625}(i) \\
PC_2(i) &= a_{2,1}\,x_1(i) + a_{2,2}\,x_2(i) + \dots + a_{2,12625}\,x_{12625}(i)
\end{aligned}
\]
In matrix notation:
\[
PC = X \cdot A
\]
where PC is the Principal Components Matrix, X is the Expression Data Matrix, and A is the Patterns Matrix.
January 2008 16
Data can be Reconstructed from PCs!
\[
PC = X \cdot A
\]
\[
PC \cdot A^{T} = X \cdot (A \cdot A^{T}) = X
\]
A was chosen so that \(A A^{T}\) is the Identity matrix. Or:
\[
X = PC \cdot A^{T}
\]
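The reconstruction can be checked numerically. The sketch below uses a singular value decomposition to obtain A, which is one common way to compute principal components; the slides do not say how A was computed, and the centering step, matrix sizes, and random data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small data matrix: 6 chips (rows) by 50 genes (columns)
n_chips, n_genes = 6, 50
X = rng.normal(size=(n_chips, n_genes))
X = X - X.mean(axis=0)  # assumption: center each gene before PCA

# SVD gives X = U * diag(s) * Vt; the rows of Vt are the expression patterns
U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt.T        # Patterns Matrix (genes by components)
PC = X @ A      # Principal Components Matrix (chips by components)
EP = A.T        # Expression Patterns, EP = A^T

# Rebuild the data from the components
X_rebuilt = PC @ EP

# Fraction of total variation explained by each component
explained = s**2 / np.sum(s**2)
print(np.allclose(X_rebuilt, X), np.round(explained[:2], 3))
```

With fewer chips than genes, A has more rows than columns, so it is A^T A (not A A^T) that equals the identity here. The reconstruction still holds because every row of X lies in the space spanned by the patterns.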
January 2008 17
Data Matrix (X) equals Principal Components (PC) times Expression Patterns (EP = A^T)
[Figure: X (Experiments 1 to n by Genes 1 to 12,625) = PC (Experiments 1 to n by Components 1 to n) * EP (Components 1 to n by Genes 1 to 12,625)]
Plot PC(i,1) vs PC(i,2) for each experiment
• EP row 1 contains most important “expression pattern”
• PC col 1 defines how that pattern is manifest in each experiment
• Similarly for EP row 2, PC col 2, etc.
• Only a few patterns needed to reconstruct data matrix X
January 2008 18
Principal Components Analysis
[Figure: two scatterplots of Pattern 2 (PC 2, 12%) vs. Pattern 1 (PC 1, 38%). Each point is one sample, labeled 4C, 4Dex, 4IF, 4IFDex through 8C, 8Dex, 8IF, 8IFDex. One panel is colored by Probe Array Lot (1005086, 1008115).]
January 2008 19
GLOBAL DATABASE (HG U95A) PCA BI-PLOT
[Figure: scatterplot of Pattern 2 vs. Pattern 1. Each spot is one chip, N=469.]
January 2008 20
GLOBAL DATABASE PCA BI-PLOT (PC2 vs PC3)
[Figure: scatterplot of Pattern 3 vs. Pattern 2.]
January 2008 21
PCA HEATMAP
Data (X) equals Components (PC) times Expression Patterns (EP)
[Figure: X (Experiments 1 to n by Genes 1 to 12,625) = PC (Experiments 1 to n by Components 1 to n) * EP (Components 1 to n by Genes 1 to 12,625)]
Visualize coefficients of a first few “Patterns”, Re-order Experiments
January 2008 22
Conclusion: Sample Type and Project determine clusters
U95A Database PCA Heatmap colored by Sample Type (12)
January 2008 23
PCA Heatmap of Entire Database
469 Chips, 468 Components
5,933,750 values!
January 2008 24
January 2008 25
Data Normalization and Transformation
January 2008 26
Chip-to-chip normalization, Data transformation
• Signal intensity varies chip-to-chip for a variety of technical reasons.
– Scale adjustments can be made in a variety of ways.
– Median adjustment (divide by col median) is commonly used
– Other quantiles (e.g. 75th percentile) may work better
• Log-transform
– spreads data more evenly
– makes variance more uniform
• “Lmed” is the median-normalized log transform
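A minimal sketch of the Lmed transform as described above: divide each chip by its median, then take the logarithm. Two assumptions are made here: the input has genes as rows and chips as columns (matching "divide by col median"), and the logarithm is base 10. The function name and example values are hypothetical.

```python
import numpy as np

def lmed(signal):
    """Median-normalized log transform: log10(x / column median)."""
    s = np.asarray(signal, dtype=float)
    chip_medians = np.median(s, axis=0)  # one median per chip (column)
    return np.log10(s / chip_medians)

# Hypothetical signals: the second chip is 10x brighter overall
signal = np.array([[10.0, 100.0],
                   [100.0, 1000.0],
                   [1000.0, 10000.0]])
print(lmed(signal))
```

After the transform both chips read -1, 0, 1, so the overall brightness difference is gone and every chip has median 0. Signals that are zero or negative produce infinite or undefined values under this transform.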
January 2008 27
Chip-to-chip normalization, Data transformation (2)
• Quantile normalization (“ranking” the data): every percentile becomes identical across chips
• Quantile normalization may remove technical artifacts (e.g. curvature)
• Variance should be homogeneous across measurement scale
• Variance may be “homogenized” with appropriate transform (e.g. logarithm, square-root, arcsinh)
• “S10” transform -- optimal variance stabilizing, quantile normalizing transform, calibrated to match Log10 over central part of measurement scale
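Quantile normalization can be sketched as follows: rank the values within each chip, then replace each value with the reference value at the same rank. The layout (genes as rows, chips as columns), the function name, and the choice of reference are assumptions. By default the reference is the mean of the sorted chips, a common choice; a single chip can be used as the reference instead. This is not the S10 transform, which involves further steps not specified in these slides.

```python
import numpy as np

def quantile_normalize(x, reference_chip=None):
    """Give every chip (column) the same distribution of values.

    reference_chip=None: use the mean of the sorted columns as reference.
    reference_chip=j: use the sorted values of column j as reference.
    Ties are broken by position, which is a simplification.
    """
    x = np.asarray(x, dtype=float)
    sorted_x = np.sort(x, axis=0)
    if reference_chip is None:
        reference = sorted_x.mean(axis=1)
    else:
        reference = sorted_x[:, reference_chip]
    # Rank of each value within its own column (0 = smallest)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    return reference[ranks]

# Hypothetical data: 3 genes on 2 chips
x = np.array([[5.0, 4.0],
              [2.0, 1.0],
              [3.0, 6.0]])
print(quantile_normalize(x))
print(quantile_normalize(x, reference_chip=0))
```

After normalization, sorting any two columns gives identical values, which is the sense in which every percentile becomes identical across chips.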
January 2008 28
Data Transformation and Normalization
January 2008 29
Log(x/median x) transform (“Lmed”)
January 2008 30
Comparison of two chips - MAS5 signal
January 2008 31
Comparison of two chips - Log10(Signal)
January 2008 32
Comparison of two chips - Lmed(SG)
Note deviation from line of identity
January 2008 33
Comparison of two chips - 2 x limits
•Note deviation from line of identity
•Note nonuniform variance
January 2008 34
Median-normalized Log-transform “Lmed”
• Adequate in most cases
BUT...
• Some nonlinearity may remain, requiring further normalization
• Variance is not truly constant, expands at low intensities
• Cannot treat zero or negative values
• Logarithm may not be best transformation
• Median normalization may not always be adequate
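To illustrate the point about zero and negative values: the arcsinh, listed earlier among candidate variance-stabilizing transforms, is defined everywhere and is symmetric about zero, yet tracks the logarithm at high intensities. The sketch below is only an illustration of that property. It is not the S10 transform, and the scaling to base-10 units and the function name are assumptions.

```python
import numpy as np

def asinh10(x):
    """arcsinh rescaled to base-10 log units: arcsinh(x) / ln(10)."""
    return np.arcsinh(np.asarray(x, dtype=float)) / np.log(10.0)

values = np.array([-5.0, 0.0, 5.0, 1e4])
print(asinh10(values))

# For large x, arcsinh(x) is close to ln(2x), so the rescaled value
# is close to log10(x) plus a constant offset of log10(2)
print(asinh10(1e4), np.log10(1e4) + np.log10(2.0))
```

Near zero the function is roughly linear, so values around the baseline are compressed less drastically than under a logarithm.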
January 2008 35
Variance Stabilizing Transform (3)
Symmetric Adaptive Transform (S10):
• We start with quantile normalization to a convenient distribution
• We further transform to make the variance constant, independent of the mean
• We adapt the transform to an empirical variance model (requires an experiment with at least 5 to 10 chips)
• We scale the transform to match log10 units midrange
• We require symmetry around origin
January 2008 36
Comparison of two chips - Lmed(Signal)
Model the nonlinear relationship
Red line is plot of quantile of chip 1 vs quantile of chip 2
January 2008 37
Comparison of two chips - Quantile normalization
• Second chip is quantile-normalized to first chip
• Curvature is cured!
• Now, can we remove the variable spread?
• Nonuniform variance?
January 2008 38
Comparison of two chips - Symmetric Adaptive Transform, base 10 “S10”
• Uses Quantile normalization
• Gives better fit to line of identity
• Adapts scale to give homogeneous variance
• Uniform scatter about line
• Calibrated to match Log10 in middle of scale
*Munson, P.J. A consistency test for determining the significance of gene expression changes on replicate samples and two convenient variance-stabilizing transformations. In: GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data. 2001. Bethesda, MD.
January 2008 39
Symmetric Adaptive Transform (“S10”)
January 2008 40
Symmetric Adaptive Transform (“S10”)
Lmed
S10
January 2008 41
PCA on Lmed transformed data
• 12 Chips
• 3 Groups
• Two apparent outliers
• Groups not well separated
• 1st PC explains 15.3% of variation
January 2008 42
PCA on S10 transformed data
• Outliers no longer obvious
• Groups well-separated
• 1st PC explains 30.8% of variation
January 2008 43
Fold Change due to Drug - Log10 scale
Log Fold Change-Drug vs. Control - Repl. 1
LFC - Repl. 2
January 2008 44
Fold Change due to Drug - S10 scale
SFC-Drug vs. Control - Repl. 1
SFC - Repl. 2
January 2008 45
Variance Stabilizing Transforms (1)
January 2008 46
Variance Stabilizing Transforms (2)
January 2008 47
Log of “Signal”, Variance Model
[Figure axes: Signal Value, Lmed Transform Value; Mean Lmed Value, Std Dev Lmed]
January 2008 48
S10(“Signal”), Variance Model
[Figure axes: Lmed Transform Value, S10 Transform Value; Mean S10 Value, Std Dev S10]
January 2008 49
“Probe Level” analysis
Comparison of Signal, RMA, S10
January 2008 50
Affymetrix Technology
January 2008 51
Affymetrix uses multiple probes per gene
January 2008 52
Data Summarizing Algorithms
To go from 11 probe pairs to a single number:
• Affymetrix MAS 4.0 (Average difference)
• Affymetrix MAS 5.0 (Signal)
• dChip (Li and Wong, 2001)
• RMA (Irizarry, 2003)
• PLIER (Hubbell, 2004, Affymetrix)
• Transformations of above statistics (Log, Glog, S10, etc.)
January 2008 53
Which Algorithm is Best?
Latin Square Data Answers Question
• Spike-in (or Latin Square) study on Affy U133A chip
• 13 concentrations plus “control” spiked into complex HeLa background
• 42 oligos, 0, 0.125 - 512 pM
• Concentration doubles at each step
• Three chips run for each concentration
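The concentration series described above is easy to verify: starting at 0.125 pM and doubling at each step gives exactly 13 concentrations ending at 512 pM. The variable names are hypothetical.

```python
# Doubling series from 0.125 pM, 13 steps
concentrations_pM = [0.125 * 2**k for k in range(13)]

# Adding the 0 pM "control" gives 14 levels in total
levels_pM = [0.0] + concentrations_pM

print(concentrations_pM)
print(len(concentrations_pM), concentrations_pM[-1])
```

Because each step is a known two-fold change, any pair of adjacent levels provides a test case for detecting 2x changes.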
Source: www.affymetrix.com, “Latin Square Data for Expression Algorithm Assessment”
[Figure: Mean Intensity for Probeset vs. Concentration Number]
January 2008 54
Detect 2x changes for each Spike-in
Using Volcano Plot
Move selector box to detect more Red, fewer Blue points
RED - spike-in genes
BLUE - background
January 2008 55
ROC curve for Lmed of Signal
TP=Red points inside detection box
FP=Blue points inside detection box
Number of False Positives
Number of True Positives
Lmed(Signal)
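The counting behind this ROC curve can be sketched as follows. The detection box is modeled here as "absolute log fold change at least some minimum, and p-value at most a cutoff"; moving the p-value cutoff traces out the curve. This box definition, the function name, and the toy data are assumptions made for illustration.

```python
import numpy as np

def roc_counts(log_fold_change, pvalues, is_spike, p_cutoffs, min_abs_lfc):
    """Return a list of (false positives, true positives), one per cutoff."""
    lfc = np.asarray(log_fold_change, dtype=float)
    p = np.asarray(pvalues, dtype=float)
    spike = np.asarray(is_spike, dtype=bool)
    points = []
    for cut in p_cutoffs:
        # Points inside the detection box on the volcano plot
        inside = (np.abs(lfc) >= min_abs_lfc) & (p <= cut)
        tp = int(np.sum(inside & spike))    # red points inside the box
        fp = int(np.sum(inside & ~spike))   # blue points inside the box
        points.append((fp, tp))
    return points

# Hypothetical toy data: 2 spike-in genes and 4 background genes
lfc = [0.40, 0.35, 0.05, 0.50, 0.32, 0.01]
pvals = [0.001, 0.02, 0.001, 0.2, 0.04, 0.5]
spike = [True, True, False, False, False, False]

# A 2x change corresponds to log10(2), about 0.301, on a Log10 scale
points = roc_counts(lfc, pvals, spike, [0.01, 0.05, 0.5], np.log10(2.0))
print(points)
```

Plotting the number of true positives against the number of false positives for each cutoff gives a curve of the kind shown on this slide; a better algorithm reaches more true positives for the same number of false positives.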
January 2008 56
ROC for S10(Signal)
Number of False Positives
Number of True Positives
S10(Signal)
Lmed(Signal)
RMA
January 2008 57
Lmed(Signal) Details
January 2008 58
S10(Signal) Details
January 2008 59
RMA Details
January 2008 60
Comparison of Algorithms
• RMA
– gives overall best ROC curve
– requires probes on multiple chips be summarized together
– Implemented in Affy EC, R, Bioconductor or ArrayAssistLite
• Signal (MAS5)
– is convenient, available in Affy GCOS software
– summarizes each chip separately
– has expanded variance near baseline
– Lmed(MAS5) gives worst ROC curve
• S10 transform
– cures variance problem for Signal,
– improves detection efficiency (ROC curve),
– is simple to compute!