Brixen ChIP-chip, 2008
users.unimi.it/marray/2008/material/lectures/day2/Brixen, ChIP-chip...
• Introduction
  • Protein-DNA interaction
  • Chromatin immunoprecipitation & tiling arrays
• Models
  • Target abundance
  • Fluorescence signal
• Analysis methods
  • Overview
  • Comparisons
  • Normalization
• Proteins interact with DNA to
  • Carry out transcription of “activated” genes.
  • Carry out DNA replication.
  • Repair damaged DNA.
  • Mediate recombination in meiosis.
  • Modify or “remodel” the chromatin.
  • Enhance or suppress gene transcription.
  • Etc.
• Transcription factor proteins regulate gene expression and recognize short, degenerate motifs in the DNA.
• ChIP-chip permits in vivo, genome-wide localization of transcription factor binding sites.
• Other applications:
  • Localization of transcriptional machinery.
  • Histone-modifying or chromatin-remodeling proteins, or the modified (e.g., methylated) forms themselves.
  • Origin recognition complexes.
• In vitro?
  • Oligo-selection or gel-shift assays are often poor predictors of in vivo binding.
• With expression arrays?
  • Change in expression may be through intermediaries.
  • If required co-factors aren’t present, genes which are direct targets may not exhibit differential expression.
• In silico?
  • Consensus sites appear far too often.
  • Motifs are degenerate.
• Cross-link all proteins to genomic DNA in vivo.
• Extract chromatin and fragment by sonication.
• ChIP: preferentially filter TF-associated fragments.
• Purify DNA and amplify.
• Prepare control DNA by
  • Omitting the immunoprecipitation step, or
  • Using a non-specific antibody for IP.
• Hybridize treatment and control DNA to separate tiling microarrays.
• Wash, stain, scan.
• Affymetrix D. melanogaster tiling arrays
  • 6 million 25-mer oligo probes (PM/MM pairs).
  • Median distance between probe starts: 36 bp.
  • Probes targeting repetitive sequence, or with expected hybridization or synthesis problems, are omitted.
• BAC spike-in
  • Genomic input control arrays (2x)
  • Artificially enriched treatment arrays (2x): regions from chr2 and chr3 (~150 kb in length) added at known relative concentrations.
• Anti-Pol II
  • Genomic input control arrays (2x)
  • Mock-IP control arrays (IgG, 2x)
  • ChIP arrays (2x)
• Anti-Zeste
  • Genomic input control arrays (3x)
  • Mock-IP control arrays (IgG, 2x biological, 3x technical)
  • ChIP arrays (2x biological, 3x technical)
• All fragments with no TF binding site pass the IP with low probability.
• Under the model, the expected fragment length after sonication is $1/\lambda$.
• Unobservable target abundance ($A_i$) will form peaks around binding sites.
• For an average fragment length of 500 bases, the correlation
  $$\mathrm{Corr}(A_i, A_j) \approx (1 - \lambda)^{d(i,j)}$$
  is appreciable over a large number of probes in the tiling. Here $d(i,j)$ is the distance in bases between probes $i$ and $j$.
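To get a feel for how slowly this correlation decays, here is a minimal sketch. It assumes the correlation has the form (1 - λ)^d with λ = 1 / (mean fragment length) and d measured in bases, and it uses the 36 bp median probe spacing of the Affymetrix arrays. The function name and defaults are illustrative only.

```python
def same_fragment_corr(lag_probes, probe_spacing_bp=36, mean_fragment_bp=500):
    """Model correlation between target abundances at two probes.

    Assumes Corr = (1 - lam) ** d, with lam = 1 / mean fragment length
    and d = distance in bases (lag in probes times probe spacing).
    """
    lam = 1.0 / mean_fragment_bp
    d = lag_probes * probe_spacing_bp
    return (1.0 - lam) ** d


if __name__ == "__main__":
    for lag in (1, 5, 10, 15, 30):
        print(f"lag {lag:2d} probes: corr = {same_fragment_corr(lag):.3f}")
```

Under these assumptions the correlation is still roughly one third at a lag of 15 probes, which is the sense in which it is "appreciable over a large number of probes".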
• Target abundance ($A_{ij}$)
  • The unobservable number of DNA fragments in sample j which contain sequence complementary to the probes in feature i.
• Fluorescence intensity ($I_{ij}$)
  • The observable, scanned intensity reading for feature i, sample j.
• Abundance and intensity are related, but not in a simple way…
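• For probe i of sample j, assume that
  $$I_{ij} = \alpha_i \, A_{ij} \, \varepsilon_{ij},$$
  where $A_{ij}$ is the target abundance, $\alpha_i$ the probe affinity, and $\varepsilon_{ij} > 0$ a multiplicative error.
• When control data are available, we can eliminate the probe affinity effects with a ratio of intensities:
  $$LR_i = \log A_i^T - \log A_i^C + e_i,$$
  where T and C denote treatment and control, and $e_i$ is the error on the log scale.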
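The cancellation of probe affinity in the log-ratio can be checked with a small simulation. This is only a sketch under the multiplicative model (intensity = affinity x abundance x error, no additive background); the log-normal choices for affinity and error, and all names, are assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_intensity(affinity, abundance, rng, sigma=0.1):
    """Intensity under the multiplicative model (no additive background)."""
    error = math.exp(rng.gauss(0.0, sigma))  # multiplicative error, always > 0
    return affinity * abundance * error


rng = random.Random(0)
true_log_ratio = math.log(4.0)  # treatment abundance 4x control
raw_logs, log_ratios = [], []
for _ in range(1000):
    affinity = math.exp(rng.gauss(0.0, 1.0))  # probe affinity varies widely
    i_treat = simulate_intensity(affinity, 4.0, rng)
    i_ctrl = simulate_intensity(affinity, 1.0, rng)
    raw_logs.append(math.log(i_treat))
    log_ratios.append(math.log(i_treat) - math.log(i_ctrl))

mean_lr = statistics.mean(log_ratios)
print("spread of raw log intensity:", round(statistics.stdev(raw_logs), 2))
print("spread of log ratio:        ", round(statistics.stdev(log_ratios), 2))
print("mean log ratio vs. truth:   ", round(mean_lr, 2), round(true_log_ratio, 2))
```

The raw log intensities are dominated by probe-to-probe affinity differences, while the log ratios scatter tightly around the true value.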
• Assume additive background is removed during pre-processing.
• The expected log-ratio signal also exhibits peaks.
• Peak amplitude and width depend on the efficiency ratio of the IP step (pass probability with a binding site relative to the low pass probability without one).
• Note: this efficiency is binding-site specific.
• D. melanogaster chromosome 2L
  • Log ratios (unsmoothed) from 3 vs. 3 comparisons, two different IP/PCR/hybridization groups.
• Under the model, we expect some spatial correlation in the log-ratios, even in null regions…
• For simplicity, ignore irregularity of probe spacing.
• Compute auto-correlation at various lags.
• For both data sets, there is statistically significant auto-correlation up to a lag of ~15 positions.
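The lag-wise auto-correlation described above can be sketched as follows. Probes are treated as equally spaced, as on the slide. The data are a synthetic AR(1) series standing in for log-ratios along the chromosome, not the Drosophila data, and the ±1.96/√n band is the usual rough white-noise reference, used here as an assumption.

```python
import math
import random


def autocorrelation(x, lag):
    """Sample auto-correlation of sequence x at a non-negative integer lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        raise ValueError("auto-correlation is undefined for a constant series")
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var


# Synthetic stand-in for log-ratios ordered by probe position: AR(1), phi = 0.8.
rng = random.Random(1)
series = [0.0]
for _ in range(4999):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

band = 1.96 / math.sqrt(len(series))  # rough white-noise reference band
for lag in (1, 5, 15, 30):
    r = autocorrelation(series, lag)
    print(f"lag {lag:2d}: r = {r:+.3f}  outside band: {abs(r) > band}")
```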
Consider…
• A constant, non-zero background (B = 1).
• Fixed enrichment ratio.
• No noise (ε = 1).
• Varying probe response (α).
A better model:
$$I_{ij} = \alpha_i \, A_{ij} \, \varepsilon_{ij} + B_{ij},$$
where $B_{ij}$ is an additive background.
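The scenario above (constant background, fixed enrichment, no noise, varying probe response) is easy to work through numerically. This sketch assumes the additive-background model with B = 1 and a true enrichment of 4-fold; the function name and the particular numbers are illustrative.

```python
import math


def observed_log2_ratio(affinity, control_abundance=1.0, enrichment=4.0, background=1.0):
    """Noise-free log2 ratio when intensity = affinity * abundance + background."""
    i_treat = affinity * control_abundance * enrichment + background
    i_ctrl = affinity * control_abundance + background
    return math.log2(i_treat / i_ctrl)


if __name__ == "__main__":
    print("true log2 ratio:", math.log2(4.0))
    for affinity in (0.1, 1.0, 10.0, 100.0):
        print(f"affinity {affinity:6.1f}: observed log2 ratio = "
              f"{observed_log2_ratio(affinity):.3f}")
```

With background present, the ratio no longer cancels probe affinity: weakly responding probes report a log ratio compressed toward zero, and only strongly responding probes approach the true value.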
• Varying probe affinity and additive background are important issues.
• The model predicts peak-like signal response near binding sites.
• The model predicts spatial correlation in both target abundance and log-ratio. Target sequence for neighboring probes tends to end up on the same fragment. IP and amplification take place at the fragment level.
• Actual binding site signal spans multiple positions.
• Single probes are…
  • Prone to gross error.
  • Frequently either lazy or promiscuous hybridizers.
• Statistical approaches:
  • Two-state hidden Markov models.
    Li, Meyer and Liu, Bioinformatics, 2005; TileMap, Ji and Wong, Bioinformatics, 2005.
  • Smoothed or windowed probe-level statistics.
    Cawley et al., Cell, 2004; Keles et al., 2004; MAT, Johnson et al., PNAS, 2006; Buck, Nobel and Lieb, Genome Biology, 2005; Toedling et al., BMC Bioinformatics, 2008.
  • Ad hoc post-processing of probe-level calls.
  • Peak fitting.
    Kim et al., Nature, 2005; Keles, Biometrics, 2007; Zheng et al., Biometrics, 2007.
• Quantile normalize all slides together.
• Compute a difference of average log-intensities (equivalent to a logged ratio of the geometric mean intensities).
• Smooth by mean or trimmed mean over a moving window (typically 675 to 1000 bp).
• Compute a non-parametric p-value for the smoothed, window-level scores.
• Adjust for multiple testing to control FDR, by Storey q-value method.
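The scoring and smoothing steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: intensities are assumed to be already normalized, positions are assumed sorted, and the function names, the 10% trim, and the 700 bp window are choices made here for the example. The p-value and q-value steps are not shown.

```python
import math


def probe_scores(treatment, control):
    """Per-probe difference of mean log2 intensities.

    treatment, control: lists of replicates, each a list of intensities.
    """
    n_probes = len(treatment[0])
    scores = []
    for i in range(n_probes):
        mean_t = sum(math.log2(rep[i]) for rep in treatment) / len(treatment)
        mean_c = sum(math.log2(rep[i]) for rep in control) / len(control)
        scores.append(mean_t - mean_c)
    return scores


def trimmed_mean(values, trim=0.1):
    """Mean after dropping a fraction `trim` of values from each end."""
    v = sorted(values)
    k = int(len(v) * trim)
    kept = v[k:len(v) - k] if len(v) > 2 * k else v
    return sum(kept) / len(kept)


def window_smooth(positions, scores, half_width=350, trim=0.1):
    """Trimmed mean of scores within +/- half_width bp of each probe.

    positions must be sorted in increasing order.
    """
    smoothed = []
    lo = hi = 0
    n = len(positions)
    for p in positions:
        while positions[lo] < p - half_width:
            lo += 1
        while hi < n and positions[hi] <= p + half_width:
            hi += 1
        smoothed.append(trimmed_mean(scores[lo:hi], trim))
    return smoothed


positions = [36 * i for i in range(20)]
spike = [0.0] * 20
spike[10] = 8.0                      # a single aberrant probe
peak = [2.0 if 5 <= i < 15 else 0.0 for i in range(20)]  # a broad signal
print(window_smooth(positions, spike)[10], window_smooth(positions, peak)[10])
```

The trimmed mean removes the isolated spike entirely while a signal spanning many probes survives the smoothing, which is the motivation for windowed statistics over single-probe calls.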
Applications in Drosophila melanogaster:
• Polycomb targets. YB Schwartz et al., Nature Genetics, 2006.
• Myb-MuvB/dREAM complex. D Georlette et al., Genes Dev., 2007.
• Maternal and gap factors. X Li and S MacArthur, et al., PLoS Biology, 2008.
• Dosage compensation complex. J Kind and JM Vaquerizas, et al., Cell, 2008.
• Assuming…
  • Enrichment occurs at only a small fraction of genomic positions.
  • At positions of no enrichment, the log-ratios are symmetrically distributed.
• Note! This last assumption can fail badly if data are not properly normalized.
• Gibbons et al., Genome Biology, 2005; and Toedling et al., BMC Bioinformatics, 2008 are similar. Also see Efron, JASA, 2004.
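One way to turn the two assumptions above into p-values is to estimate the null distribution by reflecting the scores below the center about that center. The sketch below does this with the median as the center, which is an assumption of this example. For the multiple-testing step it substitutes the Benjamini-Hochberg adjustment for the Storey q-value method named earlier, purely to keep the example short; all function names are illustrative.

```python
import bisect
import random
import statistics


def symmetric_null_pvalues(scores, center=None):
    """Upper-tail p-values against a null built by reflecting the lower half."""
    if center is None:
        center = statistics.median(scores)  # reasonable only if enrichment is rare
    lower = [s for s in scores if s <= center]
    null = sorted(lower + [2 * center - s for s in lower])
    n = len(null)
    # (1 + number of null values >= s) / (1 + n), so p is never exactly zero
    return [(1 + n - bisect.bisect_left(null, s)) / (1 + n) for s in scores]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (stand-in for Storey q-values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted


# Synthetic window scores: 2000 null positions plus 50 clearly enriched ones.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [6.0] * 50
pvals = symmetric_null_pvalues(scores)
adjusted = benjamini_hochberg(pvals)
print("enriched positions called at FDR 0.05:",
      sum(a < 0.05 for a in adjusted[-50:]))
```

If the null log-ratios are skewed, for example after poor normalization, the reflected lower half misrepresents the upper tail and these p-values are no longer trustworthy.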
ENCODE data: Pol2, 00hr, B1 vs. B1,4,5 pooled.
• Sequence-based variability in probe response
  • When presented with the same target concentrations, probes respond very differently. How do we deal with this?
• Additive background
  • Is it appreciable? Do we gain by correcting for it?
• Estimation of probe-level variance
  • Variability across replicates is different for different probes. Is it possible/beneficial to address this?
• All methods…
  • compute test statistics, then
  • select a threshold for making positive enrichment calls.
• To avoid confounding the two issues, we focus on test statistics only and use ROC (or pseudo-ROC) performance metrics: all possible thresholds are considered simultaneously.
Method                      Background correction
TiMAT                  ✓    None
MAT                    ✓    None
TileMap                ✓    None
Li ’05 HMM                  Mismatch subtraction
Keleş ’06              ✓    None
TAS (Affymetrix)       ✓    Mismatch subtraction, adjust non-positive values to 1
HGMM                        None
Chipper/vsn            ✓    Affine adjustment, plus variance-stabilizing transform
GC-RMA “affinities”    ✓    Sequence-based background estimation
GC-RMA “full model”    ✓    Smooth between MM subtraction and sequence-based estimate
H0 = {regions with no enrichment}
H1 = {regions with enrichment}
S0 = {regions less likely to have enrichment}
S1 = {regions more likely to have enrichment}
• Pseudo-positives: 125 bp intervals upstream from annotated transcription start sites (~14K).
• Pseudo-negatives: 125 bp intergenic intervals, matched to pseudo-positives for (i) GC content and (ii) probe density (~13K).
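A pseudo-ROC curve is computed exactly like an ordinary ROC curve, with the pseudo-positive and pseudo-negative sets standing in for the unknown truth. The sketch below sweeps every threshold over a method's test statistics; the function names and the toy scores are hypothetical. Because the labels are only "more likely" and "less likely" enriched, the resulting curve is useful for ranking methods rather than as an absolute measure of accuracy.

```python
def pseudo_roc(pos_scores, neg_scores):
    """Points (false positive rate, true positive rate) over all thresholds."""
    labeled = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    labeled.sort(key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(labeled):
        j = i
        # Process tied scores together so each threshold yields one point.
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            tp += labeled[j][1]
            fp += 1 - labeled[j][1]
            j += 1
        points.append((fp / len(neg_scores), tp / len(pos_scores)))
        i = j
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


pseudo_pos = [3.0, 2.0, 1.5]   # e.g. window scores near transcription start sites
pseudo_neg = [1.0, 0.5, 2.5]   # e.g. window scores in matched intergenic intervals
curve = pseudo_roc(pseudo_pos, pseudo_neg)
print(curve)
print("pseudo-AUC:", area_under_curve(curve))
```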
Method                      Probe response correction
TiMAT                  ✓    Log ratio
MAT                    ✓    Sequence-based estimate, followed by log ratio
TileMap                     Log ratio
Li ’05 HMM                  Empirically estimated from putative null experiments
Keleş ’06                   Log ratio
TAS (Affymetrix)            None: probes are treated as interchangeable
HGMM                        Hierarchical model with probe-specific distributions
Chipper/vsn                 NA
GC-RMA “affinities”         NA
GC-RMA “full model”         NA
Method                      Variance estimation
TiMAT                  ✓    Single global estimate
MAT                         Binned estimates, for probes with similar affinities
TileMap                ✓    Empirical Bayes smoothed estimate
Li ’05 HMM                  Empirically estimated from putative null experiments
Keleş ’06              ✓    Probe-specific (standard two-sample t statistic)
TAS (Affymetrix)            NA
HGMM                        Probe-specific (CV assumed constant)
Chipper/vsn                 NA
GC-RMA “affinities”         NA
GC-RMA “full model”         NA
Genome Research 18:393-403 (2008)
• 100 cloned human fragments, average size ~500 bp, from ENCODE regions.
• Enrichment (relative to genomic DNA) from 1.25 to ~200 fold.
• Direct hybridization and diluted mixtures (~25:1).
• Three different amplification protocols:
  • Ligation-mediated PCR
  • Random-priming PCR
  • Whole-genome amplification
• NimbleGen (50-mer), Affymetrix (25-mer), and Agilent (44-mer to 60-mer isothermal) arrays.
• 13 analysis algorithms.
• Best results on the three platforms, for unamplified DNA, were comparable.
• “Variance between experiments within the same platform is similar to, if not greater than, the variance observed between the different platforms.”
• “The NimbleGen platform (4 replicates) is the most sensitive at lower levels of enrichment (< 3 fold), followed closely by Agilent (2 replicates).”*
• “The WGA method was used only on NimbleGen, but produced results with very little reduction in AUC.”
• GC content was not correlated with errors, but simple tandem repeats and segmental duplications (not caught by RepeatMasker) were responsible for a large fraction of false positives and/or false negatives.
• Common ChIP-chip normalization scheme:
  • Quantile within treatment condition.
  • Median scaling between treatment and control.
• Here, distributional differences are too strong for median scaling.
ENCODE Pol2: B1 (2x), B2 (2x), B3 (2x).
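The common scheme described above can be sketched as two small functions working on log intensities. This is a minimal illustration with hypothetical names. Ties are broken arbitrarily in the quantile step, and median scaling on the log scale is implemented as an additive shift.

```python
import statistics


def quantile_normalize(arrays):
    """Force equal-length arrays to share one empirical distribution.

    Each value is replaced by the mean, across arrays, of the values
    holding the same rank.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(a[k] for a in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def median_shift(treatment_logs, control_logs):
    """Shift treatment log intensities so pooled medians agree with control."""
    med_t = statistics.median(v for a in treatment_logs for v in a)
    med_c = statistics.median(v for a in control_logs for v in a)
    return [[v + (med_c - med_t) for v in a] for a in treatment_logs]


control = quantile_normalize([[1.0, 3.0, 2.0], [1.2, 2.8, 2.2]])
treatment = quantile_normalize([[2.0, 9.0, 4.0], [3.0, 8.0, 5.0]])
treatment = median_shift(treatment, control)
print("control:  ", control)
print("treatment:", treatment)
```

Median scaling aligns only one point of the two distributions. If treatment and control differ in spread or shape, as in the ENCODE Pol2 example, the log-ratios at non-enriched positions remain skewed after this step.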
• Derived statistics may have unexpected properties:
[Figure: H3K4me3, brain, array 1. Left: MA plot (M vs. A). Right: histogram of M (frequency).]
• Background
  • Additive background is clearly present at the probe level.
  • Correction using MM probes degraded detection performance. Other approaches had little detectable effect.
• Probe response
  • Correcting for probe response by taking ratios is effective.
  • Sequence-based corrections alone are insufficient, and don’t seem to add anything to ratio-based corrections.
• Variance estimation
  • Small n (e.g., 2 vs. 2): standard t-statistics perform worse.
  • Small or moderate n: moderated t-statistics don’t hurt or help.
• U.C. Berkeley
  Terry Speed
• LBNL
  Mike Eisen, Mark Biggin, Xiaoyong Li, Stewart MacArthur
• Affymetrix
  Simon Cawley, Tom Gingeras, Antonio Piccolboni, Stefan Bekiranov, Srinka Ghosh, David Nix
• EBI
  Wolfgang Huber