Modern Epigenomics Histone Code - remc.wustl.eduremc.wustl.edu/dragonStar/DS2012_Lecture4.pdf ·...

Modern Epigenomics

Histone Code

Ting Wang Department of Genetics

Center for Genome Sciences and Systems Biology Washington University

Dragon Star 2012 Changchun, China

July 2, 2012

DNA methylation +

Histone modification

Chromatin

- 2 each of histones: H2A, H2B, H3 and H4

Chromatin: DNA plus protein in cells with nuclei

146 bp of DNA

Nucleosome

The Nucleosome core particle

Nucleosome

H3

H4

http://www.nature.com/nsmb/journal/v14/n11/images/nsmb1337-F1.gif

Post-translational Histone Modifications

H3 tail Modifications:

Active

HDACs HATs

=Acetylation

=Methylation

KMTases Repressive

Li et al. (2007) Cell 128, 707

Histone Modifications in Relation to Gene Transcription

DNA methylation mediated repression

Repression independent of DNA methylation

H3K9 methylation

“condensed” chromatin

H3K27 methylation mediated repression

1.  H3K27 methylation

2.  DNA methylation

Mechanisms of Epigenetic Crosstalk

“Epigenetic cancer therapy”

DNA-methylation and HDAC inhibitors in clinical trials

Summary

•  Dnmt1, Dnmt3a, Dnmt3b - the mammalian DNMTs

•  Chromatin structure is influenced by covalent modification of histone tails

•  Multiple chromatin modification pathways involved in silencing of genes which may show “crosstalk” with DNA methylation

Technologies for Interrogating Chromatin States

ChIP-chip

Antibody specific to one type of histone modification

Histone Modifications

ChIP-seq

Deep sequencing

Chromatin-IP Sequencing

K4me1, K4me2, K4me3: “active”

K27me3: “repressive”

K4me3: Transcribed gene

K27me3: Silent developmental gene

K9me3, K20me3: Constitutive heterochromatin

K4me3 + K27me3: ‘Poised’ developmental gene

Genes shown: FoxP1, Olig1

Histone methylation and transcriptional state
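The relationship between methylation marks and transcriptional state on this slide can be written as a small lookup. This is only an illustrative sketch: the function name and the rule order are assumptions, and real chromatin-state calls come from quantitative ChIP-seq signal rather than presence/absence flags.

```python
def classify_histone_state(marks):
    """Map a set of histone methylation marks to the state named on the slide.

    `marks` is any iterable of mark names such as "K4me3" or "K27me3".
    The rule order (bivalent first) is an assumption made for this sketch.
    """
    marks = set(marks)
    if {"K4me3", "K27me3"} <= marks:
        # Both an active and a repressive mark: bivalent, i.e. poised.
        return "poised developmental gene"
    if "K4me3" in marks:
        return "transcribed gene"
    if "K27me3" in marks:
        return "silent developmental gene"
    if {"K9me3", "K20me3"} & marks:
        return "constitutive heterochromatin"
    return "unclassified"


if __name__ == "__main__":
    for example in (["K4me3"], ["K27me3"], ["K4me3", "K27me3"], ["K9me3", "K20me3"]):
        print(example, "->", classify_histone_state(example))
```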

Predicting non-coding RNA?

•  From sequence?
   –  Not clear which properties can be exploited
   –  Sequence features such as promoters are too weak
•  Histone modifications + conservation worked

Nucleosome Positioning from Histone ChIP-seq

•  Barski et al, Cell 2007
   –  Nucleosome resolution ChIP-seq of 21 histone marks in CD4+ T-cells
   –  Total 185.7 M 25 nt tags sequenced
   –  Analysis not at nucleosome resolution to map nucleosomes at specific regions

Antibody for

MNase digest

Combine Tags From All ChIP-Seq

Extend Tags 3’ to 150 nt

Check Tag Count Across Genome

Take the middle 75 nt
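The tag-processing steps above (extend each tag 3’ to 150 nt, keep the middle 75 nt, count tags across the genome) can be sketched as follows. The function name, the `(five_prime, strand)` tag representation, and the choice to count per base are assumptions for illustration; the slide does not specify them.

```python
from collections import Counter


def nucleosome_profile(tags, extend_to=150, keep_middle=75):
    """Build a per-base tag count from ChIP-seq tags.

    tags: iterable of (five_prime, strand) pairs, where five_prime is the
    genomic coordinate of the tag's 5' end and strand is "+" or "-".
    Each tag is extended 3' to `extend_to` nt, then only the central
    `keep_middle` nt of that fragment are counted.
    """
    profile = Counter()
    trim = (extend_to - keep_middle) // 2  # bases dropped on the left side
    for five_prime, strand in tags:
        if strand == "+":
            frag_start = five_prime
        else:
            # A minus-strand tag extends toward smaller coordinates.
            frag_start = five_prime - extend_to + 1
        mid_start = frag_start + trim
        for pos in range(mid_start, mid_start + keep_middle):
            profile[pos] += 1
    return profile


if __name__ == "__main__":
    # Two tags from opposite strands of the same 150 nt fragment.
    profile = nucleosome_profile([(100, "+"), (249, "-")])
    print(min(profile), max(profile), profile[170])
```

Keeping only the middle of each extended fragment sharpens the signal around the presumed nucleosome center, which is the apparent reason for the 75 nt step.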

Inaccessible Inaccessible

Accessible

Precise delineation of the accessible regulatory DNA compartment

Digital DNaseI profiling

Digital DNaseI profiling: direct access to regulatory sequences

ChromHMM

Transcription Start Site Enhancer DNA

Observed chromatin marks. Called based on a Poisson distribution

Most likely Hidden State

Transcribed Region

[Figure: example assignment of hidden states 1-6 along the DNA, with the high-probability chromatin marks in each state (K4me3, K4me1, K27ac, K36me3; probabilities 0.7-0.9)]

200bp intervals

All probabilities are learned from the data
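The slide notes that observed chromatin marks are called based on a Poisson distribution over 200bp intervals. A minimal sketch of such a call is given below; the genome-wide mean as background rate and the 1e-4 threshold are assumptions for illustration, not values taken from the lecture.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
              for i in range(k))
    return max(0.0, 1.0 - cdf)


def binarize_mark(bin_counts, p_threshold=1e-4):
    """Call a mark present (1) or absent (0) in each 200bp bin.

    bin_counts: read counts per bin for one mark. The background rate is
    taken as the mean count per bin (an assumption of this sketch).
    """
    lam = sum(bin_counts) / len(bin_counts)
    if lam == 0:
        return [0] * len(bin_counts)
    return [1 if poisson_sf(c, lam) < p_threshold else 0 for c in bin_counts]


if __name__ == "__main__":
    counts = [1, 0, 2, 1, 30, 1, 0, 1, 2, 1]
    print(binarize_mark(counts))
```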

ChromHMM

[Figure: chromatin states grouped as Promoter, Transcribed, Active intergenic, Repetitive, Repressed]

Chromatin Marks from (Barski et al, Cell 2007; Wang et al, Nature Genetics 2008); DNaseI hypersensitivity from (Boyle et al, Cell 2008); Expression Data from (Su et al, PNAS 2004); Lamina data from (Guelen et al, Nature 2008)

Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski’07, Wang’08)
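Finding the "most likely hidden state" for each interval is a standard hidden Markov model decoding problem. The sketch below uses the Viterbi algorithm with independent present/absent emissions per mark. The two states, their probabilities, and all names are toy assumptions; a real model learns every probability from the data and uses many more states and marks.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state path for binarized mark observations.

    observations: list of tuples of 0/1 flags, one flag per mark.
    emit_p[state][m]: P(mark m present | state); must lie strictly in (0, 1).
    """
    def log_emit(state, obs):
        return sum(math.log(p if present else 1.0 - p)
                   for p, present in zip(emit_p[state], obs))

    score = {s: math.log(start_p[s]) + log_emit(s, observations[0]) for s in states}
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = prev[best] + math.log(trans_p[best][s]) + log_emit(s, obs)
            pointers[s] = best
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    states = ["promoter", "transcribed"]
    start_p = {"promoter": 0.5, "transcribed": 0.5}
    trans_p = {"promoter": {"promoter": 0.8, "transcribed": 0.2},
               "transcribed": {"promoter": 0.2, "transcribed": 0.8}}
    # Emission order: (K4me3, K36me3); toy values only.
    emit_p = {"promoter": (0.9, 0.1), "transcribed": (0.1, 0.9)}
    bins = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]
    print(viterbi(bins, states, start_p, trans_p, emit_p))
```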

Next-gen Sequencing Technology

Forward Genetics

Phenotype

Genotype

Hypothesis

Test Hypothesis By Genetic Manipulation

Forward Genetics

Phenotype

Genotype

Hypothesis


Two groups: 1. Develop Colorectal cancer At Young Age

2. Do not

Mutation in APC Gene

APC is a Tumor Suppressor Gene

Delete APC in Mouse Control: Isogenic APC+

The Cycle of Forward Genetics

Phenotype

Genotype

Hypothesis


Observation

?Sequencing?

Thinking

Gene Deletion/Replacement Recombinant Technology

In 2005 $9 million/genome Not feasible

The Problem with Forward Genetics

Phenotype

Genotype

Hypothesis


Sequencing

Thinking

Gene Deletion/Replacement Recombinant Technology

Currently $40,000* /genome Cost is rapidly dropping

Sequencing

Past and Current Sequencing Technologies

0 and 1st generation sequencing

Pre-1992, “old fashioned way”: S35 ddNTPs; gels; manual loading; manual base calling

1992-1999, ABI 373/377: fluorescent ddNTPs*; gels; manual loading; automated base calling*

1999, ABI 3700: fluorescent ddNTPs; capillaries*; robotic loading*; automated base calling; breaks down frequently

2003, ABI 3730XL: fluorescent ddNTPs; capillaries; robotic loading; automated base calling; reliable*

Next (2nd) generation sequencing technology

454/Roche GS-20/FLX (Oct 2005)

Illumina/Solexa 1G Genetic Analyser (Feb 2007)

ABI SOLiD (Oct 2007)

Comparison of Next Generation Sequencing Technologies

Technology             Reads/run                          Ave read length   bp per run    Data output
3730XL (ABI)           96                                 900-1200 bp       ~100,000      1-2MB
454 (Roche)            400,000                            250-310 bp        70 million    20GB
Illumina 1G (Solexa)   40 million                         36 bp             1 billion     1.5TB
SOLiD (ABI)            88-132 million (44-66 per slide)   35 bp             1 billion     1.5-3.0TB

Is Sanger sequencing dead? Future of sequencing centers

They can be applied to different areas:

ABI 3730XL
•  Routine sequencing
•  Verify SNPs from next gen
•  1X scaffold for novel genomes

Next Gen long read instrument (454): “When length matters”
•  Novel genomes
•  Metagenomics

Next Gen short read instrument (Solexa): “When quantity matters but length doesn’t”
•  Expression tags
•  ChIP-Seq
•  Re-sequencing

Illumina Genome Analyzer: Introduction to the Technology

Illumina Sequencing Pipeline

1. Sample Prep (1-5 days): ligate adapters

2. Cluster generation on flow cell (1.5 day): clonal single-molecule array

3. Sequencing and imaging (2-3 days)

4. Data Analysis (days-months)

Cluster Generation

8 channels (lanes)

Attach DNA to flow cell

Bridge amplification

Can we amplify epigenetic marks?

Clonal single-molecule array

Random array of clusters

100 µm

~1000 molecules per ~1 µm cluster

~20,000-30,000 clusters per tile

~40 M clusters per flowcell

Sequencing By Synthesis (SBS)

Cycle 1: Add sequencing reagents

First base incorporated

Remove unincorporated bases

Detect signal

Cycle 2-n: Add sequencing reagents and repeat

[Figure: template strands drawn 5’ to 3’, with labeled G, T, C and A nucleotides incorporated one base per cycle]

Base Calling From Images

[Figure: images of the same clusters at cycles 1 2 3 4 5 6 7 8 9]

T T T T T T T G T …

T G C T A C G A T …

The identity of each base of a cluster is read off from sequential images

Reversible terminator chemistry solves homopolymer problem
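Reading a base off sequential images amounts to picking, at each cycle, the channel with the strongest signal for a cluster. The sketch below shows only that core idea. The function name and the intensity tuples are hypothetical, and production base callers also correct for effects such as channel crosstalk and phasing, which are omitted here.

```python
BASES = "ACGT"


def call_bases(cycle_intensities):
    """Call one base per cycle for a single cluster.

    cycle_intensities: list of 4-tuples (A, C, G, T), one tuple per cycle,
    holding the cluster's intensity in each of the four images.
    """
    return "".join(
        BASES[max(range(4), key=lambda i: cycle[i])]
        for cycle in cycle_intensities
    )


if __name__ == "__main__":
    cluster = [
        (0.1, 0.2, 0.1, 0.9),  # cycle 1: T channel brightest
        (0.0, 0.1, 0.8, 0.2),  # cycle 2: G channel brightest
        (0.1, 0.7, 0.1, 0.1),  # cycle 3: C channel brightest
    ]
    print(call_bases(cluster))
```

Because reversible terminators add exactly one base per cycle, a homopolymer run such as TTTTTT appears as the same channel winning in consecutive cycles rather than as one brighter signal.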

IGA without cover

Flow cell imaging

A flow cell contains eight lanes (Lane 1, Lane 2, ..., Lane 8)

Each lane/channel contains three columns of tiles (Column 1, Column 2, Column 3)

Each column contains 100 tiles

Tile: 350 × 350 µm, 20K-30K clusters

Each tile is imaged four times per cycle, one image per base.

345,600 images for a 36-cycle run
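The 345,600 image count follows directly from the flow cell layout described here. A short check, using only numbers from the slide (the function and parameter names are illustrative):

```python
def images_per_run(lanes=8, columns_per_lane=3, tiles_per_column=100,
                   images_per_tile_per_cycle=4, cycles=36):
    """Count the images produced by one run from the flow cell layout."""
    tiles = lanes * columns_per_lane * tiles_per_column
    return tiles * images_per_tile_per_cycle * cycles


if __name__ == "__main__":
    # 8 lanes x 3 columns x 100 tiles = 2,400 tiles
    # 2,400 tiles x 4 images x 36 cycles = 345,600 images
    print(images_per_run())
```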

Data Analysis Pipeline

tiff image files (345,600) -> Firecrest -> intensity files -> Bustard -> sequence files -> Eland -> alignment to genome -> additional data analysis

Applications of the Technology

Gene Expression

Whole Genome Re-sequencing

Targeted Re-sequencing

ChIP Sequencing

MicroRNA discovery

Other Applications

Read Length is Not As Important For Resequencing

Applications
•  Genomes
•  Re-sequencing Human Exons (Microarray capture/amplification)
•  Small (including miRNA) and long RNA profiling (including splicing)
•  ChIP-Seq:
   –  Transcription Factors
   –  Histone Modifications
   –  Effector Proteins
•  DNA Methylation
•  Polysomal RNA
•  Origins of Replication/Replicating DNA
•  Whole Genome Association (rare, high impact SNPs)
•  Copy Number/Structural Variation in DNA
•  ChIA-PET: Transcription Factor Looping Interactions
•  ???

Functional Genomics Data Analysis
•  Map reads to the genome
   –  Available Tools: MAQ, SOAP, MOSAIK, BWA, BOWTIE
   –  Determine the target genome sequence (i.e., repeat classes)
   –  Mapping options
      –  Number of allowed mismatches (as function of position)
      –  Number of mapped loci (e.g., 1 = unique read sequence)
•  Generate Consensus Sequence and identify SNPs
•  Generate Read Enrichment Profile (e.g., Wald Lab tool)
•  Develop Null Model and Calculate Significantly Enriched Sites
•  High level analysis: compare to annotations, other data sets, etc.
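The two mapping options listed above (allowed mismatches and number of mapped loci) can be illustrated with a deliberately naive mapper. This is a toy sketch, not how MAQ, BWA or Bowtie work internally: those tools use indexes, and this version scans every position and ignores the reverse strand. All names are hypothetical.

```python
def map_read(read, genome, max_mismatches=2):
    """Return every start position where `read` matches `genome`
    with at most `max_mismatches` substitutions (forward strand only)."""
    hits = []
    for start in range(len(genome) - len(read) + 1):
        mismatches = 0
        for a, b in zip(read, genome[start:start + len(read)]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Loop finished without exceeding the mismatch budget.
            hits.append(start)
    return hits


def is_unique(hits):
    """A read is uniquely mapped when it has exactly one locus."""
    return len(hits) == 1


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCA"
    for read, mm in (("TTGAC", 0), ("ACGT", 0), ("TTGAG", 1)):
        hits = map_read(read, genome, max_mismatches=mm)
        print(read, hits, "unique" if is_unique(hits) else "not unique")
```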

Limitations of short read technology
•  Need a genome
   –  De-novo assembly difficult
•  Can’t sequence through repeats
   –  80% of the human genome is “sequenceable”
•  Need high coverage 15-20X to detect polymorphisms
   –  Missed SNPs are likely due to low coverage
   –  300X for 1 in 20 event (1 heterozygous in 10 samples)
•  Error rate increases past the first 30~50 bases
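The 300X figure can be checked with a little arithmetic. The sketch assumes that the "1 in 20 event" means one variant allele among the 20 chromosomes of 10 pooled diploid samples, which is an interpretation rather than something the slide states. Function names are illustrative.

```python
def expected_variant_reads(total_coverage, samples, ploidy=2, variant_alleles=1):
    """Expected number of reads carrying the variant at one position
    when `samples` individuals are pooled and sequenced together."""
    return total_coverage * variant_alleles / (samples * ploidy)


def required_pooled_coverage(per_allele_coverage, samples, ploidy=2):
    """Total coverage needed so that each allele in the pool receives
    `per_allele_coverage` reads on average."""
    return per_allele_coverage * samples * ploidy


if __name__ == "__main__":
    # One heterozygous individual among 10 diploid samples: 1 allele in 20.
    print(expected_variant_reads(300, samples=10))
    print(required_pooled_coverage(15, samples=10))
```

Under that reading, 300X of pooled coverage leaves about 15 reads on the rare allele, the low end of the 15-20X needed to detect a polymorphism.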

Paired End Reads are Important!

Repetitive DNA Unique DNA

Single read maps to multiple positions

Paired read maps uniquely

Read 1 Read 2

Known Distance
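The way a paired read rescues a repeat-derived read can be sketched as a distance filter: keep only the combination of candidate positions that agrees with the known distance between the mates. The function name, the tolerance parameter, and the coordinates are assumptions for illustration.

```python
def resolve_with_mate(read1_hits, read2_hits, expected_distance, tolerance):
    """Pick the (read1, read2) placement consistent with the known distance.

    read1_hits / read2_hits: candidate mapping positions for each mate.
    Returns the single consistent pair, or None when zero or several
    placements fit (the pair is then still ambiguous).
    """
    pairs = [
        (p1, p2)
        for p1 in read1_hits
        for p2 in read2_hits
        if abs((p2 - p1) - expected_distance) <= tolerance
    ]
    return pairs[0] if len(pairs) == 1 else None


if __name__ == "__main__":
    repeat_hits = [1000, 50000, 90000]   # read 1 falls in repetitive DNA
    unique_hits = [50300]                # read 2 falls in unique DNA
    print(resolve_with_mate(repeat_hits, unique_hits, 300, 50))
```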

Paired Ends are Important Part 2

Shendure et al 2005

Deletion Insertion Inversion

Paired-end mapping reveals structural variations

High-throughput paired-end mapping (PEM)

which is composed of two linking signatures where the linked regions are close to each other (Fig. 1e). Unlike the basic insertion, the linked insertion signature can be used to identify the region that has been inserted. However, if the size of the insertion is large, then the confidence that the two linking signatures are associated with the same insertion decreases, and thus this signature becomes weak for very large insertions.

Another type of linking signature is created by a region of the reference that has been tandemly duplicated in the donor. Cooper et al. [7] first observed that a mate pair that has an end in each of the two copies will have an ‘everted’ mapping: the order of the mates is reversed while the orientation stays the same (Fig. 1f). We call this an ‘everted duplication’ signature. This signature can only be used to detect a novel tandem duplication; for example, it cannot detect a tandemly repeated region whose copy count changes from two to three.

All of the methods outlined above, although able to identify approximate locations of breakpoints, cannot indicate the exact locations. The methods below describe signatures that address this shortcoming.

Breakpoint identification: split mapping and hanging insertion. A read sampled across a deletion breakpoint will leave a ‘split mapping’ signature in the reference, with a prefix and suffix of the read mapping to different locations. Whereas this signature is detectable with longer reads [5,35], there are too many such spurious mappings of short read halves, and hence too many spurious signatures, with short read data. Nevertheless, Ye et al. [36] showed that if one uses the fact that the mate of a split read must map nearby, then the search space for the split mapping of the hanging read can be much reduced. Thus we have the ‘anchored split mapping’ signature, in which one of the mates maps to the reference and the other has a split mapping with one of its parts about 1 insert size away (Fig. 1g). A similar situation occurs when there is an insertion of a few base pairs. This will leave behind a similar signature, except that the split read will have a prefix and suffix mapping to adjacent locations, and there will be a middle part of the read (the bases inserted) that will not be part of either the prefix or suffix mapping (Fig. 1h).

The anchored split mapping signature has the advantage that it can pinpoint the breakpoint of the event with base-pair precision. However, if the deletion is too large, then there will be too many spurious hits for the farther part of the split mapping. Similarly, the size of the insertion detectable with this signature is only a few base pairs, as every inserted base reduces the fraction of the read that matches the genome.

To identify insertions that contain a novel genomic segment, it is possible to use mate pairs spanning either of the breakpoints, where one read maps while the other one does not. Such pairs form a ‘hanging insertion’ signature [5] (Fig. 1i). De novo assembly of such hanging reads can be used to reconstruct a small inserted segment, although if it is substantially larger than the insert size, hanging reads will not cover the entire insertion.

Signatures based on depth of coverage. The high coverage of NGS makes it possible to identify a completely different type of signature, based on the depth of coverage (DOC). Assuming the sequencing process is uniform, the number of reads mapping to a region follows a Poisson distribution and is expected to be proportional to the number of times the region appears in the donor. Thus, a region that has been deleted (duplicated) will have less (more) reads mapping to it. Although earlier work used DOC to identify recent segmental duplications in the human genome [37] and compare segmental duplications between human and chimp [38], Campbell et al. [34] were the first to use these ‘gain/loss’ signatures to detect CNVs between tumor and healthy samples of the same individuals (Fig. 2). Unlike the PEM insertion signatures, the gain signature does not indicate where an insertion occurred, but rather

[Figure 1 panels, each drawn as donor vs. reference (ref): (a) basic insertion, (b) basic deletion, (c) basic inversion, (d) linking, (e) linked insertion, (f) everted duplication, (g) anchored split mapping (deletion), (h) anchored split mapping (insertion), (i) hanging insertion]

Figure 1 | Illustrations of PEM signatures. Mate pairs are sampled from the donor, where they are ordered with opposite orientation (the blue mate follows the orange), and are mapped to the reference (ref). Basic signatures include (a) insertions and (b) deletions, where the mapped distance is different from the insert size, as well as (c) inversions, where the order of the two mates is preserved but one of them changes orientation. (d) The linking signature has several discordant mate pairs with similar mapped distances identifying adjacency in the donor (dashed orange arrows) of two distal segments of the reference. The orientation and order of the mapped mate pairs depends on the orientation and order of the two segments in the reference; here, these are unchanged. (e) A linked insertion signature is composed of two linking signatures and arises when the inserted sequence (green) is copied from another location in the genome. (f) A tandem duplication will create an everted duplication linking signature, with mates mapping out of order but with proper orientations. These mate pairs link the end of the duplicated region to its beginning. (g,h) In the anchored split mapping signature, one mate has a good mapping, whereas the other has a split mapping. For a deletion (g) the prefix and suffix surround the deletion, whereas for an insertion (h) the split read has the prefix and suffix mapped to adjacent locations, while a middle part does not map. (i) When a novel genomic segment is inserted, a hanging insertion signature is created, in which only one of the mates has a good mapping.
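The basic and everted signatures in the caption can be expressed as a small classifier over one mapped mate pair. This is a simplified sketch under stated assumptions: the first mate of a concordant pair maps upstream on the "+" strand and the second downstream on the "-" strand, and a single pair is classified in isolation, whereas real callers cluster many discordant pairs before calling an event. All names are hypothetical.

```python
def classify_mate_pair(first_pos, first_strand, second_pos, second_strand,
                       insert_size, tolerance):
    """Classify one mapped mate pair by its PEM signature."""
    if first_strand == second_strand:
        # Order preserved but one mate changed orientation.
        return "inversion"
    if second_pos < first_pos:
        # Mates map out of order but with proper orientations.
        return "everted duplication"
    distance = second_pos - first_pos
    if distance > insert_size + tolerance:
        # Reference has sequence the donor lacks between the mates.
        return "deletion"
    if distance < insert_size - tolerance:
        # Donor has extra sequence between the mates.
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    examples = [
        (1000, "+", 1300, "-"),   # concordant
        (1000, "+", 2500, "-"),   # deletion
        (1000, "+", 1100, "-"),   # insertion
        (1000, "+", 1300, "+"),   # inversion
        (1300, "+", 1000, "-"),   # everted duplication
    ]
    for pair in examples:
        print(pair, classify_mate_pair(*pair, insert_size=300, tolerance=50))
```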

NATURE METHODS SUPPLEMENT | VOL.6 NO.11s | NOVEMBER 2009 | S15

REVIEW

Medvedev et al. Nature Methods 2009
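The depth-of-coverage idea in the excerpt (read counts in a region follow a Poisson distribution when sequencing is uniform) can be sketched as a simple gain/loss test. The significance level, the function names, and the use of one expected count per region are assumptions of this sketch; real data usually needs corrections for mappability and GC content.

```python
import math


def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam), with lam > 0."""
    return sum(math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
               for i in range(k + 1))


def doc_signature(observed_reads, expected_reads, alpha=0.001):
    """Label a region as 'loss', 'gain' or 'neutral' from its read count.

    expected_reads: the count expected for a normal-copy region of the
    same size (an input this sketch takes as given).
    """
    if poisson_cdf(observed_reads, expected_reads) < alpha:
        return "loss"
    if 1.0 - poisson_cdf(observed_reads - 1, expected_reads) < alpha:
        return "gain"
    return "neutral"


if __name__ == "__main__":
    for observed in (50, 100, 160):
        print(observed, doc_signature(observed, expected_reads=100))
```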

We need more genomes!

•  Complete genomics ($5000)

•  ABI ($10,000)

•  Illumina ($10,000)

•  Intelligent Biosystems (<$1000)

“3rd generation” sequencing

•  Ion torrent

•  Pac Bio

•  Nanopore

JM Rothberg et al. Nature 475, 348-352 (2011) doi:10.1038/nature10242

Sensor, well and chip architecture.

Ion Torrent

Wafer, die and chip packaging.

Pros and Cons

•  Fast (4 hour sequencing)

•  Cheap per run, but not per base*

•  Homopolymers?

* Yet

Single-molecule, real-time (SMRT) sequencing PacBio

Nanopore sequencing
