Brixen ChIP-chip, 2008users.unimi.it/marray/2008/material/lectures/day2/Brixen, ChIP-chip... · Log...

•! Introduction

•! Protein-DNA interaction

•! Chromatin immunoprecipitation & tiling arrays

•! Models

•! Target abundance

•! Fluorescence signal

•! Analysis methods

•! Overview

•! Comparisons

•! Normalization

•! Proteins interact with DNA to

•! Carry out transcription of “activated” genes.

•! Carry out DNA replication.

•! Repair damaged DNA.

•! Mediate recombination in meiosis.

•! Modify or “remodel” the chromatin.

•! Enhance or suppress gene transcription.

•! Etc.

•! Transcription factor proteins regulate gene expression, and recognize short, degenerate motifs in the DNA.

•! ChIP-chip permits in vivo, genome-wide localization of transcription factor binding sites.

•! Other applications:

•! Localization of transcriptional machinery.

•! Histone modifying or chromatin remodeling proteins, or the modified (e.g., methylated) forms themselves.

•! Origin recognition complexes.

•! In vitro?

•! Oligo-selection or gel-shift assays are often poor predictors of in vivo binding.

•! With expression arrays?

•! Change in expression may be through intermediaries.

•! If required co-factors aren’t present, genes which are direct targets may not exhibit differential expression.

•! In silico?

•! Consensus sites appear far too often.

•! Motifs are degenerate.

•! Cross-link all proteins to genomic DNA in vivo.

•! Extract chromatin and fragment by sonication.

•! ChIP: preferentially filter TF-associated fragments.

•! Purify DNA and amplify.

•! Prepare control DNA by

•! Omitting the immunoprecipitation step, or

•! Using a non-specific antibody for IP.

•! Hybridize treatment and control DNA to separate tiling microarrays.

•! Wash, stain, scan.

•! Affymetrix D. melanogaster tiling arrays

•! 6 million 25-mer oligo probes (PM/MM pairs).

•! Median distance between probe starts: 36 bp.

•! Probes targeting repetitive sequence, or with expected hybridization or synthesis problems, are omitted.

•! BAC spike-in

•! Genomic input control arrays (2x)

•! Artificially enriched treatment arrays (2x): regions from chr2 and chr3 (~150 kb in length) added at known relative concentrations.

•! Anti-Pol II


•! Mock-IP control arrays (IgG, 2x)

•! ChIP arrays (2x)

•! Anti-Zeste


•! Mock-IP control arrays (IgG, 2x biological, 3x technical)

•! ChIP arrays (2x biological, 3x technical)

•! All fragments with no TF binding site pass the IP with low probability.

•! Under the model, the expected fragment length after sonication is 1/λ, where λ is the per-base break probability.

•! Unobservable target abundance (A_i) will form peaks around binding sites.

•! For an average fragment length of 500 bases, the correlation between target abundances is appreciable over a large number of probes in the tiling:

$$\mathrm{Corr}(A_i, A_j) = (1 - \lambda)^{d(i,j)}$$

where d(i, j) is the distance in bases between probes i and j.
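To get a feel for how slowly this correlation decays, here is a minimal numeric sketch. It assumes that breaks occur independently at each base with probability λ = 1/500, so that two positions d bases apart remain on the same fragment with probability (1 − λ)^d. The 36 bp spacing is the median probe spacing quoted for the Affymetrix arrays; the function name is hypothetical.

```python
# Assumed model: independent breaks at each base with probability lam.
mean_fragment_length = 500.0        # bases, as on the slide
lam = 1.0 / mean_fragment_length    # per-base break probability (assumption)
probe_spacing = 36                  # median distance between probe starts (bp)

def same_fragment_prob(distance_bp, lam=lam):
    """Probability that no break falls between two positions distance_bp apart."""
    return (1.0 - lam) ** distance_bp

# Tabulate the decay for probes that are k tiling positions apart.
for k in (1, 5, 10, 20):
    d = k * probe_spacing
    print(f"{k:2d} probes apart ({d:4d} bp): {same_fragment_prob(d):.3f}")
```

Under these assumptions neighboring probes are almost always on the same fragment, and the value is still well above zero twenty probes away.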

•! Target abundance (Aij)

•! The unobservable number of DNA fragments in sample j which contain sequence complementary to the probes in feature i.

•! Fluorescence intensity (Iij)

•! The observable, scanned intensity reading for feature i, sample j.

•! Abundance and intensity are related, but not in a simple way…

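•! For probe i of sample j, assume that

$$I_{ij} = \alpha_i \, A_{ij} \, \varepsilon_{ij},$$

where A_ij is the target abundance, α_i is the probe affinity, and ε_ij > 0 is a multiplicative error.

•! When control data are available, we can eliminate the probe affinity effects with a ratio of intensities:

$$LR_i = \log A_i^T - \log A_i^C + \varepsilon_i,$$

where ε_i now denotes the error on the log scale.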
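The cancellation of probe affinity can be written out in one step. This is a sketch under the multiplicative model, with intensity equal to probe affinity times target abundance times a positive error, and assuming the same probe affinity α_i applies to the treatment (T) and control (C) hybridizations; the superscripted error terms are notation introduced here for illustration.

```latex
\begin{align*}
\log \frac{I_i^T}{I_i^C}
  &= \log \frac{\alpha_i \, A_i^T \, \varepsilon_i^T}{\alpha_i \, A_i^C \, \varepsilon_i^C} \\
  &= \log A_i^T - \log A_i^C
     + \left( \log \varepsilon_i^T - \log \varepsilon_i^C \right)
\end{align*}
```

The affinity α_i drops out, and the bracketed term plays the role of the log-scale error in the log-ratio.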

•! Assume additive background is removed during pre-processing.

•! The expected log-ratio signal also exhibits peaks.

•! Peak amplitude and width depend on the efficiency ratio of the IP step, which is binding-site specific.

•! D. melanogaster chromosome 2L

•! Log ratios (unsmoothed) from 3 vs. 3 comparisons, two different IP/PCR/hybridization groups.

•! Under the model, we expect some spatial correlation in the log-ratios, even in null regions…

•! For simplicity, ignore irregularity of probe spacing.

•! Compute auto-correlation at various lags.

•! For both data sets, there is statistically significant auto-correlation up to a lag of ≈15 positions.
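A minimal sketch of the lag-by-lag auto-correlation computation, treating probes as equally spaced as the slide suggests. The input here is synthetic (smoothed white noise standing in for null-region log-ratios), and the ±1.96/√n band is the usual large-sample approximation for white noise, not necessarily the test used in the talk.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample auto-correlation of a 1-D series at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# Synthetic stand-in for log-ratios: white noise smoothed by a 10-point moving average.
rng = np.random.default_rng(0)
noise = rng.normal(size=5000)
log_ratios = np.convolve(noise, np.ones(10) / 10, mode="valid")

acf = autocorrelation(log_ratios, max_lag=20)

# Approximate 95% band for the auto-correlation of white noise.
band = 1.96 / np.sqrt(len(log_ratios))
significant_lags = np.nonzero(np.abs(acf[1:]) > band)[0] + 1
print("lags outside the white-noise band:", significant_lags)
```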

A better model:

$$I_{ij} = \alpha_i \, A_{ij} \, \varepsilon_{ij} + B_{ij},$$

where B_ij is the additive background.

Consider…

•! A constant, non-zero background (B = 1).

•! Fixed enrichment ratio.

•! No noise (ε = 1).

•! Varying probe response (α).
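The scenario listed above (constant background B = 1, fixed enrichment, no noise, varying probe response) can be simulated in a few lines. This is a sketch only: the enrichment ratio, control abundance and affinity values are hypothetical, chosen to show how an additive background makes the observed log-ratio depend on probe affinity.

```python
import numpy as np

B = 1.0                  # constant additive background, as on the slide
enrichment = 4.0         # fixed treatment/control abundance ratio (hypothetical)
A_control = 1.0          # control target abundance (hypothetical)
alpha = np.array([0.1, 0.5, 1.0, 5.0, 25.0])   # varying probe affinities (hypothetical)

# No noise: the multiplicative error is fixed at 1.
I_control = alpha * A_control + B
I_treatment = alpha * enrichment * A_control + B

observed_log_ratio = np.log2(I_treatment / I_control)
true_log_ratio = np.log2(enrichment)

for a, lr in zip(alpha, observed_log_ratio):
    print(f"alpha = {a:5.1f}   observed log2 ratio = {lr:.2f}   (true = {true_log_ratio:.2f})")
```

In this toy setting every observed log-ratio is pulled toward zero, and the attenuation is strongest for low-affinity probes, so the ratio no longer removes the probe effect completely.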

•! Varying probe affinity and additive background are important issues.

•! The model predicts peak-like signal response near binding sites.

•! The model predicts spatial correlation in both target abundance and log-ratio.

Target sequence for neighboring probes tends to end up on the same fragment. IP and amplification take place at the fragment level.

•! Actual binding site signal spans multiple positions.

•! Single probes are…

•! Prone to gross error.

•! Frequently either lazy or promiscuous hybridizers.

•! Statistical approaches:

•! Two-state hidden Markov models.

Li, Meyer and Liu, Bioinformatics, 2005; TileMap, Ji and Wong, Bioinformatics, 2005.

•! Smoothed or windowed probe-level statistics

Cawley et al., Cell, 2004; Keles et al., 2004; MAT, Johnson et al., PNAS, 2006; Buck, Nobel and Lieb, Genome Biology, 2005; Toedling et al., BMC Bioinformatics, 2008.

•! Ad hoc post-processing of probe-level calls

•! Peak fitting

Kim et al., Nature, 2005; Keles, Biometrics, 2007; Zheng et al., Biometrics, 2007.

•! Quantile normalize all slides together.

•! Compute a difference of average log-intensities (equivalent to a logged ratio of geometric mean intensities).

•! Smooth by mean or trimmed mean over a moving window (typically 675 to 1000 bp).

•! Compute a non-parametric p-value for the smoothed, window-level scores.

•! Adjust for multiple testing to control FDR, using Storey's q-value method.
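The first three steps of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the function names, the 350 bp half-width, the 10% trimming fraction and the synthetic data are all hypothetical, and the p-value and FDR steps are left out.

```python
import numpy as np

def quantile_normalize(X):
    """Give every column (array) of X the same empirical distribution."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    v = np.sort(values)
    k = int(np.floor(trim * len(v)))
    return v[k: len(v) - k].mean()

def window_scores(positions, treatment, control, half_width=350, trim=0.1):
    """Windowed trimmed-mean log-ratio score centred on each probe.

    positions: sorted probe coordinates (bp), shape (n_probes,)
    treatment, control: positive intensities, shape (n_probes, n_arrays)
    """
    # Step 1: quantile normalize all arrays together.
    logged = np.log2(quantile_normalize(np.hstack([treatment, control])))
    n_t = treatment.shape[1]
    # Step 2: difference of average log-intensities.
    log_ratio = logged[:, :n_t].mean(axis=1) - logged[:, n_t:].mean(axis=1)
    # Step 3: trimmed mean over a moving window defined in base pairs.
    lo = np.searchsorted(positions, positions - half_width, side="left")
    hi = np.searchsorted(positions, positions + half_width, side="right")
    return np.array([trimmed_mean(log_ratio[a:b], trim) for a, b in zip(lo, hi)])

# Synthetic example (all values hypothetical): 2000 probes, 3 vs. 3 arrays,
# with 4-fold enrichment over probes 1000..1019.
rng = np.random.default_rng(1)
n = 2000
positions = np.arange(n) * 36
affinity = rng.lognormal(0.0, 1.0, size=(n, 1))
abundance_t = np.ones((n, 1))
abundance_t[1000:1020] = 4.0
treatment = 100 * affinity * abundance_t * rng.lognormal(0.0, 0.3, size=(n, 3))
control = 100 * affinity * rng.lognormal(0.0, 0.3, size=(n, 3))

scores = window_scores(positions, treatment, control)
print("score at the centre of the enriched region:", round(scores[1010], 2))
```

Defining the window in base pairs rather than in probe counts keeps the smoothing scale fixed when probe spacing is irregular.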

Applications in Drosophila melanogaster:

•! Polycomb targets. YB Schwartz et al., Nature Genetics, 2006.

•! Myb-MuvB/dREAM complex. D Georlette et al., Genes Dev., 2007.

•! Maternal and gap factors. X Li and S MacArthur, et al., PLoS Biology, 2008.

•! Dosage compensation complex. J Kind and JM Vaquerizas, et al., Cell, 2008.

•! Assuming…

•! Enrichment occurs at only a small fraction of genomic positions.

•! At positions of no enrichment, the log-ratios are symmetrically distributed.

•! Note! This last assumption can fail badly if data are not properly normalized.

•! Gibbons et al., Genome Biology, 2005; and Toedling et al., BMC Bioinformatics, 2008 are similar. Also see Efron, JASA, 2004.
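One way to use the two assumptions above is to build a null distribution by reflecting the lower half of the scores about their centre. The sketch below illustrates that idea only; it is not the exact procedure of any of the cited papers. Using the median as the null centre is a crude choice made here for simplicity, and the function name is hypothetical.

```python
import numpy as np

def symmetric_null_pvalues(scores, center=None):
    """One-sided p-values from a null built by reflecting the lower half of the scores."""
    scores = np.asarray(scores, dtype=float)
    if center is None:
        center = np.median(scores)   # crude estimate of the null centre (assumption)
    left = scores[scores <= center]
    # Reflect the left half about the centre to obtain a symmetric null sample.
    null = np.sort(np.concatenate([left, 2 * center - left]))
    # Fraction of the null sample at or above each observed score.
    n_ge = len(null) - np.searchsorted(null, scores, side="left")
    return (n_ge + 1) / (len(null) + 1)

# Synthetic example: mostly null scores plus a small enriched fraction.
rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(0.0, 1.0, size=10000),
                         rng.normal(5.0, 1.0, size=200)])
pvals = symmetric_null_pvalues(scores)
print("median p-value among enriched scores:", np.median(pvals[-200:]))
```

Because only the left half of the data is used, a shift or skew in the null log-ratios (for example from poor normalization) feeds directly into the p-values, which is the failure mode the note above warns about.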

ENCODE data: Pol2, 00hr, B1 vs. B1,4,5 pooled.

•! Sequence-based variability in probe response

•! When presented with the same target concentrations, probes respond very differently. How do we deal with this?

•! Additive background

•! Is it appreciable? Do we gain by correcting for it?

•! Estimation of probe-level variance

Variability across replicates is different for different probes. Is it possible/beneficial to address this?

•! All methods…

•! compute test statistics, then

•! select a threshold for making positive enrichment calls

•! To avoid confounding the two issues, we focus on test statistics only and use ROC (or pseudo-ROC) performance metrics: all possible thresholds are considered simultaneously.

Method                      Background correction
TiMAT                  ✓    None
MAT                    ✓    None
TileMap                ✓    None
Li ’05 HMM                  Mismatch subtraction
Keleş ’06              ✓    None
TAS (Affymetrix)       ✓    Mismatch subtraction, adjust non-positive values to 1
HGMM                        None
Chipper/vsn            ✓    Affine adjustment, plus variance-stabilizing transform
GC-RMA “affinities”    ✓    Sequence-based background estimation
GC-RMA “full model”    ✓    Smooth between MM subtraction and sequence-based estimation

H0 = {regions with no enrichment}

H1 = {regions with enrichment}

S0 = {regions less likely to have enrichment}

S1 = {regions more likely to have enrichment}

•! Pseudo-positives: 125 bp intervals upstream from annotated transcription start sites (~14K).

•! Pseudo-negatives: 125 bp intergenic intervals, matched to pseudo-positives for (i) GC content and (ii) probe density (~13K).
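Given one test statistic per pseudo-positive and pseudo-negative interval, a pseudo-ROC curve sweeps over every possible threshold at once. The sketch below is a generic implementation with a hypothetical function name. It assumes higher scores indicate enrichment and that scores are continuous, so tied scores are not given special handling.

```python
import numpy as np

def pseudo_roc(pos_scores, neg_scores):
    """Return (fpr, tpr, auc) computed over all thresholds."""
    scores = np.concatenate([pos_scores, neg_scores])
    labels = np.concatenate([np.ones(len(pos_scores)), np.zeros(len(neg_scores))])
    # Sort from highest to lowest score; each prefix corresponds to one threshold.
    labels = labels[np.argsort(-scores, kind="stable")]
    tpr = np.concatenate([[0.0], np.cumsum(labels) / len(pos_scores)])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / len(neg_scores)])
    # Area under the curve by the trapezoid rule.
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
    return fpr, tpr, auc

# Synthetic example: pseudo-positives score somewhat higher on average.
rng = np.random.default_rng(3)
pos = rng.normal(1.0, 1.0, size=1400)
neg = rng.normal(0.0, 1.0, size=1300)
fpr, tpr, auc = pseudo_roc(pos, neg)
print(f"pseudo-AUC = {auc:.3f}")
```

Since pseudo-positives are only more likely, not certain, to be enriched, the resulting curve is best read as a relative comparison between methods rather than as an absolute error rate.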


Method                      Probe response correction
TiMAT                  ✓    Log ratio
MAT                    ✓    Sequence-based estimate, followed by log ratio
TileMap                     Log ratio
Li ’05 HMM                  Empirically estimated from putative null experiments
Keleş ’06                   Log ratio
TAS (Affymetrix)            None: probes are treated as interchangeable
HGMM                        Hierarchical model with probe-specific distributions
Chipper/vsn                 NA
GC-RMA “affinities”         NA
GC-RMA “full model”         NA


Method                      Variance estimation
TiMAT                  ✓    Single global estimate
MAT                         Binned estimates, for probes with similar affinities
TileMap                ✓    Empirical Bayes smoothed estimate
Li ’05 HMM                  Empirically estimated from putative null experiments
Keleş ’06              ✓    Probe-specific (standard two-sample t statistic)
TAS (Affymetrix)            NA
HGMM                        Probe-specific (CV assumed constant)
Chipper/vsn                 NA
GC-RMA “affinities”         NA
GC-RMA “full model”         NA

Genome Research 18:393-403 (2008)

•! 100 cloned human fragments, average size ~500 bp, from ENCODE regions.

•! Enrichment (relative to genomic DNA) from 1.25-fold to ~200-fold.

•! Direct hybridization and diluted mixtures (~25:1).

•! Three different amplification protocols:

•! Ligation-mediated PCR

•! Random-priming PCR

•! Whole-genome amplification

•! NimbleGen (50-mer), Affymetrix (25-mer), and Agilent (44-mer to 60-mer isothermal) arrays.

•! 13 analysis algorithms.

•! Best results on the three platforms, for unamplified DNA, were comparable.

•! “Variance between experiments within the same platform is similar to, if not greater than, the variance observed between the different platforms.”

•! “The NimbleGen platform (4 replicates) is the most sensitive at lower levels of enrichment (< 3 fold), followed closely by Agilent (2 replicates).”*

•! “The WGA method was used only on NimbleGen, but produced results with very little reduction in AUC.”

•! GC content was not correlated with errors, but simple tandem repeats and segmental duplications (not caught by RepeatMasker) were responsible for a large fraction of false positives and/or false negatives.

•! Common ChIP-chip normalization scheme:

•! Quantile within treatment condition.

•! Median scaling between treatment and control.

•! Here, distributional differences are too strong for median scaling.

ENCODE Pol2: B1 (2x), B2 (2x), B3 (2x).

•! Derived statistics may have unexpected properties:

[Figure: H3K4me3, brain, array 1. Panels: MA plot (M versus A) and histogram of M (frequency).]

•! Background

•! Additive background is clearly present at the probe level.

•! Correction using MM probes degraded detection performance. Other approaches had little detectable effect.

•! Probe response

•! Correcting for probe response by taking ratios is effective.

•! Sequence-based corrections alone are insufficient, and don’t seem to add anything to ratio-based corrections.

•! Variance estimation

•! Small n (e.g., 2 vs. 2): standard t-statistics perform worse.

•! Small or moderate n: moderated t-statistics don’t hurt or help.
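To make the contrast between standard and moderated t-statistics concrete, here is a minimal sketch. It is not the estimator used by any of the methods compared: the prior variance is simply the mean pooled variance and the prior degrees of freedom are fixed by hand, whereas real moderated statistics estimate both from the data. The function name and parameters are hypothetical.

```python
import numpy as np

def two_sample_t(treatment, control, prior_var=None, prior_df=0.0):
    """Per-probe two-sample t-like statistic on log intensities.

    treatment, control: arrays of shape (n_probes, n_replicates).
    With prior_df = 0 this is the standard pooled-variance t statistic;
    with prior_df > 0 each probe's variance is shrunk toward prior_var.
    """
    n_t, n_c = treatment.shape[1], control.shape[1]
    df = n_t + n_c - 2
    pooled = ((n_t - 1) * treatment.var(axis=1, ddof=1)
              + (n_c - 1) * control.var(axis=1, ddof=1)) / df
    if prior_var is None:
        prior_var = pooled.mean()   # crude global prior (assumption, not a fitted prior)
    shrunk = (prior_df * prior_var + df * pooled) / (prior_df + df)
    diff = treatment.mean(axis=1) - control.mean(axis=1)
    return diff / np.sqrt(shrunk * (1.0 / n_t + 1.0 / n_c))

# Synthetic null data, 2 vs. 2 replicates: no probe is truly enriched.
rng = np.random.default_rng(4)
trt = rng.normal(size=(5000, 2))
ctl = rng.normal(size=(5000, 2))
standard = two_sample_t(trt, ctl)
moderated = two_sample_t(trt, ctl, prior_df=4.0)
print("largest |t|, standard :", np.abs(standard).max())
print("largest |t|, moderated:", np.abs(moderated).max())
```

With only two replicates per group, a few probes have tiny variance estimates by chance, which inflates the standard statistic; shrinking the variance damps those extremes.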

•! U.C. Berkeley

Terry Speed

•! LBNL

Mike Eisen, Mark Biggin, Xiaoyong Li, Stewart MacArthur

•! Affymetrix

Simon Cawley, Tom Gingeras, Antonio Piccolboni,

Stefan Bekiranov, Srinka Ghosh, David Nix

•! EBI

Wolfgang Huber
