
Post on 06-Jul-2020


NGS Sequence Analysis for Regulation and Epigenomics

Timothy Bailey
Winter School in Mathematical and Computational Biology
July 2, 2013

NGS Analysis and Transcriptional Regulation

•  RNA-seq
   – Measuring transcription levels (gene expression)
   – Detecting RNA regulators (e.g., miRNA)
•  ChIP-seq
   – Chromatin modifications
   – Binding of transcription factor proteins

Talk Overview
I.  Transcriptional Regulation 101
II.  ChIP-seq 101
III.  Analyzing ChIP-seq data
IV.  Combining ChIP-seq and RNA-seq

Part I: Basic Transcriptional Regulation

Source:  Steven  Chu  

Transcription Factors
•  Mammalian transcription is controlled (in part) by about 1400 DNA-binding transcription factor (TF) proteins.
•  These proteins control transcription in two main ways:
   – Directly, by promoting (or preventing) the assembly of the pre-initiation complex.
   – Indirectly, by modifying chromatin.

BASAL TRANSCRIPTION:
• The pre-initiation complex assembles at the core promoter.
• This results in only low levels of transcription because the interaction is unstable.

[Figure: DNA showing the core promoter with the TATA and INR elements.]

PROXIMAL PROMOTER:
• The proximal promoter extends upstream of the core promoter.
• It contains binding sites for repressor and activator transcription factors.

[Figure: DNA showing the proximal promoter and the TATA and INR elements.]

ACTIVATORS:
• Some transcription factors (“activators”) stabilize the transcriptional machinery when they bind to sites in the proximal promoter.
• This increases transcription.

[Figure: DNA showing the proximal promoter and the TATA and INR elements.]

REPRESSORS:
• Some factors do not stabilize the transcriptional machinery.
• Their binding can block binding by co-factors and activators.
• This reduces transcription.

[Figure: DNA showing the proximal promoter and the TATA and INR elements.]

ENHANCER REGIONS:
• Groups of binding sites located upstream or downstream of a promoter.
• Often very distant: 1000s of base pairs.
• Activator and repressor transcription factors compete to occupy enhancer regions.
• DNA looping brings factors into contact with transcriptional machinery.
• Bound activators increase transcription.

[Figure: DNA showing an enhancer region (1-100 Kb), the proximal promoter, and the TATA and INR elements.]

[Figure: ENHANCER REGIONS (continued). DNA showing the enhancer region, the proximal promoter, and the TATA and INR elements.]

Chromatin modification by TFs:
• Example: Histone Acetyltransferases (HATs) acetylate histones.
• Tissue-specific transcription factors can bind to HATs, causing chromatin to open.
• This can increase transcription.

[Figure: DNA showing the enhancer region, the proximal promoter, and the TATA and INR elements, with a HAT and factors labelled “Specific” and “General”.]

Part II: ChIP-seq Overview

Source:  Steven  Chu  

ChIP-seq
•  Chromatin ImmunoPrecipitation followed by high-throughput sequencing.
•  TF binding sites (“punctate peaks”)
•  Chromatin mods (“broad peaks”)

Steps in ChIP-seq
•  Cross-link proteins to DNA
•  Fragment chromatin
•  Immunoprecipitate with antibody to protein
•  Size-select and ligate
•  Amplify
•  Sequence

What can I learn from ChIP-seq?
•  What chromatin regions are marked as active promoters or enhancers?
•  Where is my TF bound?
•  What is its DNA-binding motif?
•  What genes might it regulate?

Part III: Analyzing ChIP-seq Data

Source:  Steven  Chu  

Analyzing TF ChIP-seq Data
•  Key messages of this talk:
   – Use controls!
   – Validate your data at each step.
   – But this is Science! What could possibly go wrong…?

Things that can go wrong in ChIP-seq…
1.  Low affinity antibody
2.  Non-specific antibody
3.  Contamination
4.  Poor choice of peak calling algorithm (or parameters)
… etc.

Steps in ChIP-seq Data Analysis

1.  Mapping: where do the sequence “tags” map to the genome?

2.  Peak Calling: where are the regions of significant tag concentration?

3.  Motif Discovery: what is the binding motif?

4.  Location Analysis: where are the peaks with respect to genes, promoters, introns, etc.?

1) Mapping ChIP-seq Tags
•  Tags: ChIP-seq produces a pool of “tags” (~100bp)
•  Tag Count: measure of enrichment of region
•  Negative Control: “input DNA” tag count
Tallack et al., Genome Res., 2010

Do the mapped tags make sense?
•  Each ~100 bp tag is the 5’ end of a DNA fragment.
•  But DNA is double-stranded so there are tags from both strands.
•  We expect pairs of clusters of tags on opposite strands, separated by the fragment length.
Wilbanks and Facciotti, PLoS One, 2010

Strand Cross Correlation Analysis (SCCA)

•  If we shift the anti-sense tags left by the (average) fragment length, we should see maximum correlation between the reads on the two strands.

Kharchenko  et  al.,  Nature  Biotechnology,  2009  
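The shift-and-correlate idea can be sketched in a few lines of Python. This is a toy illustration, not the method of any particular tool: it assumes tags have already been reduced to 5' start positions per strand on a single chromosome, and the names `pearson` and `strand_cross_correlation` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def strand_cross_correlation(plus_starts, minus_starts, chrom_len, max_shift):
    """Return {shift: correlation} between per-base plus-strand tag counts
    and minus-strand tag counts shifted left by `shift` bases."""
    plus = [0] * chrom_len
    minus = [0] * chrom_len
    for p in plus_starts:
        plus[p] += 1
    for p in minus_starts:
        minus[p] += 1
    profile = {}
    for shift in range(max_shift + 1):
        # Shifting the minus strand left by `shift` pairs plus[i] with minus[i + shift].
        profile[shift] = pearson(plus[:chrom_len - shift], minus[shift:])
    return profile

# Toy data (made up): fragments are 150 bp long, so each minus-strand tag
# sits 150 bp downstream of its matching plus-strand tag.
plus_tags = [200, 201, 202, 800, 801, 1500]
minus_tags = [p + 150 for p in plus_tags]
cc = strand_cross_correlation(plus_tags, minus_tags, chrom_len=2000, max_shift=300)
best_shift = max(cc, key=cc.get)
print(best_shift)
```

On real data the tags on the two strands do not line up exactly, so the maximum is a broad peak around the average fragment length rather than a single sharp value.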

SCCA often shows two maxima
•  Fragment-length peak at average fragment length (as we expected)
•  Read-length peak at average read length (due to variable and dispersed mappability of genomic positions)
[Figure: cross-correlation plot with the read-length peak and the fragment-length peak labelled.]
Landt S G et al. Genome Res. 2012;22:1813-1831

Quality control 1: SCCA identifies failed ChIP-seq

Landt  S  G  et  al.  Genome  Res.  2012;22:1813-­‐1831  

ENCODE Guidelines:
•  Normalized Strand Correlation, NSC > 1.05
•  Relative Strand Correlation, RSC > 0.8
•  https://code.google.com/p/phantompeakqualtools
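As a rough sketch of how the two scores relate to a cross-correlation profile, the snippet below uses the definitions as commonly stated for the ENCODE guidelines: NSC is the fragment-length correlation divided by the minimum correlation, and RSC is the fragment-length correlation minus the minimum, divided by the read-length correlation minus the minimum. The function name `nsc_rsc` and the profile values are hypothetical; this is not the phantompeakqualtools code.

```python
def nsc_rsc(profile, fragment_shift, read_length_shift):
    """profile maps strand shift (bp) to cross-correlation.
    Assumes the minimum is positive and differs from the read-length value."""
    cc_min = min(profile.values())
    cc_frag = profile[fragment_shift]
    cc_read = profile[read_length_shift]
    nsc = cc_frag / cc_min
    rsc = (cc_frag - cc_min) / (cc_read - cc_min)
    return nsc, rsc

# Made-up cross-correlation values at a few shifts.
profile = {0: 0.20, 36: 0.25, 100: 0.28, 150: 0.30, 300: 0.20}
nsc, rsc = nsc_rsc(profile, fragment_shift=150, read_length_shift=36)
passes = nsc > 1.05 and rsc > 0.8
print(round(nsc, 3), round(rsc, 3), passes)
```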

2) ChIP-seq Peak Calling
•  Peak callers combine overlapping tags to get the “peak height”.
•  Often, strand information and shifting is used to combine tags on opposite strands.
•  Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak.
Wilbanks and Facciotti, PLoS One, 2010
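The fold-enrichment criterion can be illustrated with a deliberately naive caller. This sketch assumes tags have already been counted in fixed-width bins, that ChIP and control libraries are of comparable depth, and that a pseudocount of 1 is acceptable; the function `call_peaks`, its threshold and the counts are all hypothetical. Real peak callers add statistical models on top of this idea.

```python
def call_peaks(chip_counts, control_counts, min_fold=5.0, pseudocount=1.0):
    """Return (start_bin, end_bin, max_fold) for each run of consecutive bins
    whose fold enrichment is at least min_fold. end_bin is exclusive.
    Both count lists are assumed to have the same length."""
    peaks = []
    start = None
    best = 0.0
    for i, (chip, ctrl) in enumerate(zip(chip_counts, control_counts)):
        fold = (chip + pseudocount) / (ctrl + pseudocount)
        if fold >= min_fold:
            if start is None:
                start, best = i, fold      # open a new peak
            else:
                best = max(best, fold)     # extend the current peak
        elif start is not None:
            peaks.append((start, i, best)) # close the current peak
            start = None
    if start is not None:
        peaks.append((start, len(chip_counts), best))
    return peaks

# Made-up tag counts per bin.
chip = [2, 3, 40, 55, 4, 1, 30, 2]
ctrl = [2, 2, 3, 4, 3, 1, 2, 2]
peaks = call_peaks(chip, ctrl)
print(peaks)
```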

Some ChIP-seq peak callers use SCCA
[Figure: ChIP-seq peak callers, two of them marked “Uses SCCA”.]
Bailey et al., PLoS Comp Bio, in press.

Sanity checks: Are your peaks reasonable?
•  Width: TF ChIP-seq peaks should be relatively short (< 300bp) compared to histone modification peaks.
   –  Are your peaks too wide?
•  Number: Is the number of TF ChIP-seq peaks reasonable?
   –  Some key TFs bind ~30,000 sites, but your TF probably binds far fewer (~1000?).
•  Location: Do your peaks co-occur with histone marks and genes that your TF regulates?
   –  Examine some peaks using the UCSC genome browser and ENCODE histone tracks.

Quality control 2: Fraction of Reads in Peaks (FRiP)

•  Only a fraction of reads typically fall within ChIP-seq peaks.

•  ENCODE guideline: FRiP > 1%

•  Caveat: A lower FRiP threshold may be appropriate if there are very few peaks.

Landt  S  G  et  al.  Genome  Res.  2012;22:1813-­‐1831  
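FRiP itself is a simple ratio, sketched below. The sketch assumes each read is summarized by a single position on one chromosome and that peaks are non-overlapping half-open intervals; the function name `frip` and the example data are hypothetical.

```python
from bisect import bisect_right

def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.
    peaks: non-overlapping (start, end) intervals, end exclusive."""
    peaks = sorted(peaks)
    starts = [s for s, _ in peaks]
    in_peaks = 0
    for pos in read_positions:
        i = bisect_right(starts, pos) - 1   # last peak starting at or before pos
        if i >= 0 and pos < peaks[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)

# Made-up read positions and peaks.
reads = [5, 120, 130, 250, 400, 410, 999, 1500, 2000, 2500]
peaks = [(100, 200), (390, 420)]
value = frip(reads, peaks)
print(value, value > 0.01)
```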

How many of my peaks are “real”?
•  Irreproducible Discovery Rate (IDR) compares the ranks of peaks from two biological replicates.
   – Rank peaks by significance (p-value or q-value)
   – Reproducible discoveries (peaks) should have similar ranks between replicates.
•  ENCODE: reports peaks at 1% IDR
•  https://sites.google.com/site/anshulkundaje/projects/idr
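The intuition that reproducible peaks keep similar ranks across replicates can be illustrated with a plain rank correlation. To be clear, this is not the IDR procedure, which fits a statistical model to the paired ranks; it is only a sketch of the rank-comparison step, assuming peaks have already been matched between replicates and scored by -log10 p-value. All names and scores are hypothetical.

```python
def ranks(scores):
    """Rank 1 = most significant (largest score). Ties are not handled."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score lists without ties."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up -log10 p-values for six matched peaks in two replicates.
rep1 = [50.0, 42.0, 30.0, 8.0, 5.0, 3.0]
rep2 = [47.0, 45.0, 28.0, 4.0, 6.0, 2.0]
rho = spearman(rep1, rep2)
print(round(rho, 3))
```

In this toy example the strong peaks keep their ranks and only two weak peaks swap places, which is the pattern expected when the top of the list is reproducible.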

Quality control 3: IDR identifies failed ChIP-seq

[Figure: IDR results for replicate pairs, labelled “High Reproducibility” and “Low Reproducibility”.]
Landt S G et al. Genome Res. 2012;22:1813-1831

3) Motif Discovery & Enrichment Analysis

•  If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif.

•  The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.

Caveats in ChIP-seq Motif Analysis

•  Peak regions may contain other TF motifs due to looping.

•  The binding of the ChIP-ed factor “X” may be indirect.

•  ChIP-ed motif might be weak due to assisted binding.

Farnham, Nature Reviews Genetics, 2009

TF Binding Motif Discovery
•  ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor.
•  In principle, discovering the motif is simple.
•  ChIP-seq peaks tend to be within +/- 50bp of the bound factor.
•  So we just examine the peak regions for enriched patterns.
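“Examine the peak regions for enriched patterns” can be sketched as a k-mer count comparison between peak sequences and background sequences. This is far simpler than the motif discovery tools named in this talk and is only meant to show the idea; it assumes fixed-length words, ignores the reverse strand, and uses a pseudocount of 1. The function names and sequences are made up.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def enriched_kmers(peak_seqs, background_seqs, k=6, pseudocount=1.0):
    """Return (enrichment_ratio, kmer) pairs, most enriched first."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    scored = []
    for kmer, n in fg.items():
        fg_frac = (n + pseudocount) / (len(peak_seqs) + pseudocount)
        bg_frac = (bg[kmer] + pseudocount) / (len(background_seqs) + pseudocount)
        scored.append((fg_frac / bg_frac, kmer))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored

# Made-up peak and background sequences.
peaks = ["ACGTTGGCAAAT", "GGTTGGCATCGA", "CATTGGCACCTA", "TTGGCAGTACGT"]
background = ["ACGTACGTACGT", "GATTACAGATTA", "CCCGGGAAATTT", "TGCATGCATGCA"]
top_ratio, top_kmer = enriched_kmers(peaks, background)[0]
print(top_kmer, top_ratio)
```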

MEME Suite tools for ChIP-seq motif discovery and enrichment

•  The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis.

–  Discovery & Enrichment: MEME-ChIP

–  Discovery: MEME, DREME, GLAM2

–  Enrichment: CentriMo, AME

Example: Motif discovery in NFIC ChIP-seq data

•  Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data.

•  They do not report using motif discovery on these peaks.
•  We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions.
Machanick & Bailey, Bioinformatics, 2011

Motif discovery fails in the (original) NFIC dataset

•  An NFIC motif is known from in vitro data, based on only 16 sites.

•  MEME and DREME fail to find this motif in the NFIC data.

•  But so do the other algorithms we tried: Amadeus, peak-motifs, Trawler and Weeder.

The problem: poor peak calling!
•  We applied a different ChIP-seq peak calling algorithm (ChIP-peak) which predicts only 700 peaks (rather than 40,000).
•  MEME discovers the NFI-family binding motif in this new set of peaks.

[Figure: “site-probability” curves, probability versus position of best site in sequence (-250 to 250), with the MA0119.1 motif logo. Legend: ATGGCG p=9.9e-081, w=89, n=3495; CRSAGC p=9.2e-040, w=207, n=15406; CHGSAGC p=2.3e-038, w=138, n=11787; MTGCGCA p=1.4e-033, w=250, n=2184; CDCCKCC p=2.6e-028, w=266, n=11544.]

Central Motif Enrichment Analysis: CentriMo

•  CentriMo searches for known motifs whose sites are most centrally enriched in the ChIP-seq regions.

•  Use 500bp regions centered on each ChIP-seq peak.

[Figure: 500-bp ChIP-seq regions with W=120 and L=500; S = number of “successes” = 4, T = number of “trials” = 5.]
Bailey et al., NAR, 2012
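The numbers in the figure suggest a binomial view of central enrichment: a “success” is a region whose best site falls in the central window of width W, out of T regions of length L. The sketch below assumes, as a simplification, that under the null the best site is uniformly located, so the success probability is about W / L (ignoring motif width and the search over window widths that a real tool would need to handle). The function name `binomial_tail` is hypothetical.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))

W, L = 120, 500      # central window width and region length from the figure
S, T = 4, 5          # successes and trials from the figure
p_null = W / L       # assumed chance of a best site landing in the window
p_value = binomial_tail(S, T, p_null)
print(round(p_value, 6))
```

With only five regions the p-value is modest, which is why thousands of ChIP-seq regions give such small p-values when a motif really is centrally enriched.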

[Figure: site-probability curves, probability versus position of best site in sequence (-250 to 250). Legend: MA0119.1 p=2.4e-031, w=295, n=5409; MA0244.1 p=4.6e-015, w=381, n=39398; MA0161.1 p=7.3e-015, w=329, n=39356; MA0099.1 p=5.5e-014, w=343, n=34267; MA0406.1 p=8.1e-012, w=323, n=31383.]

Central Motif Enrichment confirms the known NFIC motif—even in the original peaks

•  NFIC motif is most centrally enriched of 862 JASPAR and UniPROBE motifs (p = 10^-31).
[Motif logo: MA0119.1 (NFIC).]
•  However, standard motif enrichment algorithms do not show the NFIC motif as the most enriched motif.

Quality control 4: CMEA identifies failed ChIP-seq

[Figure, panel 1: Successful KLF1 ChIP-seq (Tallack et al., Genome Res, 2010). Site-probability curves, probability versus position of best site in sequence (-250 to 250), with the MA0039.2 (KLF4) motif logo. Legend: MA0039.2 p=4.4e-066, w=111, n=712; Klf7_primary p=6.9e-056, w=103, n=676; MA0140.1 p=1.5e-048, w=177, n=693; MA0035.2 p=2.4e-040, w=194, n=756.]
[Figure, panel 2: Failed KLF1 ChIP-seq (Pilon et al., Blood, 2011). Site-probability curve, probability versus position of best site in sequence (-250 to 250), with the MA0039.2 (KLF4) motif logo. Legend: MA0039.2 p=7.2e-001, w=365, n=11404 (p = 0.7).]

New motif databases

•  In vitro motifs are especially useful for verifying that your ChIP-seq worked.

•  They are independent of the motifs found by motif discovery in your ChIP-seq data.
   – UniPROBE: 386 mouse TF motifs from protein-binding microarrays.
   – Jolma et al., Cell, 2013: 738 human and mouse TF motifs from SELEX

4) Location Analysis
•  Counts how often TF binding sites are in, say, promoters, intergenic or intragenic regions.
Farnham, Nature Reviews Genetics, 2009

Example: Predicting Target Genes
•  TF binding sites in promoters probably are regulatory.
•  “Nearest TSS” rule is often used to assign binding sites to target genes.
•  But distal sites may regulate some other gene via chromatin looping.
Farnham, Nature Reviews Genetics, 2009
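The “nearest TSS” rule is easy to sketch. The snippet assumes a single chromosome, a TSS list sorted by position, and ignores strand; the function `nearest_tss`, the gene names and the coordinates are all hypothetical.

```python
from bisect import bisect_left

def nearest_tss(peak_pos, tss_list):
    """tss_list: list of (position, gene) sorted by position.
    Returns (gene, signed distance), where distance = peak_pos - TSS position."""
    positions = [p for p, _ in tss_list]
    i = bisect_left(positions, peak_pos)
    # Only the TSS just before and just after the peak can be the nearest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    pos, gene = min(candidates, key=lambda t: abs(peak_pos - t[0]))
    return gene, peak_pos - pos

# Made-up TSS annotations and peak summits.
tss = [(1000, "geneA"), (5000, "geneB"), (20000, "geneC")]
assignments = {peak: nearest_tss(peak, tss) for peak in (1200, 4000, 30000)}
print(assignments)
```

The last peak is assigned to geneC even though it lies 10 kb away, which illustrates the caveat above: the nearest gene is not necessarily the regulated gene.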

Klf1 binding near TSSs
•  Histogram of distances from Klf1 ChIP-seq peak to the nearest TSS.
•  KLF1 has a population of binding sites in promoters (small hump on left), but most are distal.
Tallack et al., Genome Res, 2010

Motif Spacing Analysis finds co-factor motifs and TF complexes

Part IV: Combining ChIP-seq and RNA-seq

Source:  Steven  Chu  

Identification of KLF1 target genes using RNA-seq

3 x Klf1-/- libraries
3 x Klf1+/+ libraries
CuffDiff
RefSeq.gtf (gene definition set)
690 KLF1 “Activated” genes
118 KLF1 “Repressed” genes
At Bonferroni-corrected p-value < 0.05 and > 1.5 fold change (KO vs WT)
[Figure: mRNA-seq expression (FPKM, axis 0 to 1000) of E2f2 and E2f4, marked “**”, with qRT-PCR validation.]
Tallack et al., Genome Res, 2012
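The thresholds on this slide (Bonferroni-corrected p-value < 0.05 and > 1.5 fold change, KO vs WT) can be applied with a small filter. The sketch assumes that an “activated” gene is one whose expression drops in the knockout and a “repressed” gene is one whose expression rises, that raw p-values and non-zero mean FPKM values are already available, and that Bonferroni correction means multiplying each raw p-value by the number of genes tested. The function `classify_genes`, the gene names and the numbers are hypothetical and are not CuffDiff output.

```python
def classify_genes(results, n_tests, alpha=0.05, min_fold=1.5):
    """results: list of (gene, raw_p, wt_fpkm, ko_fpkm) with non-zero FPKMs.
    Returns (activated, repressed) gene lists."""
    activated, repressed = [], []
    for gene, raw_p, wt, ko in results:
        if min(1.0, raw_p * n_tests) >= alpha:   # Bonferroni correction
            continue
        if wt / ko > min_fold:       # expression drops in the knockout
            activated.append(gene)
        elif ko / wt > min_fold:     # expression rises in the knockout
            repressed.append(gene)
    return activated, repressed

# Made-up differential expression results.
results = [
    ("gene1", 1e-8, 400.0, 100.0),   # significant, lower in KO
    ("gene2", 1e-8, 50.0, 200.0),    # significant, higher in KO
    ("gene3", 1e-8, 100.0, 120.0),   # significant, but only 1.2-fold
    ("gene4", 0.01, 400.0, 100.0),   # 4-fold, but fails Bonferroni
]
activated, repressed = classify_genes(results, n_tests=1000)
print(activated, repressed)
```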

The KLF1 Transcriptome

Tallack  et  al,  Genome  Res,  2012  

KLF1 is a (direct) Activator
The distance from KLF1 ChIP-seq peaks to the nearest TSS (putative target gene) is less for “Activated” genes than for “Repressed” genes.
Tallack et al., Genome Res, 2012

Final reminders
•  Check your data at each step!
   – Read mapping
      •  Strand Cross Correlation Analysis (SCCA)
   – Peak calling
      •  Fraction of Reads in Peaks (FRiP)
      •  Irreproducible Discovery Rate (IDR) analysis
   – Motif discovery / enrichment analysis
      •  De novo motif found?
      •  In vitro motif centrally enriched?

   

Acknowledgements

The MEME Suite
•  Tom Whitington
•  Philip Machanick
•  James Johnson
•  Martin Frith
•  William Noble
•  Charles Grant
•  Shobhit Gupta

KLF Project
•  Michael Tallack
•  Tom Whitington
•  Andrew Perkins
•  Sean Grimmond
•  Brooke Gardiner
•  Ehsan Nourbakhsh
•  Nicole Cloonan
•  Elanor Wainwright
•  Janelle Keys
•  Wai Shan Yuen