Modern Epigenomics Histone Code - remc.wustl.eduremc.wustl.edu/dragonStar/DS2012_Lecture4.pdf ·...
Modern Epigenomics
Histone Code
Ting Wang Department of Genetics
Center for Genome Sciences and Systems Biology Washington University
Dragon Star 2012 Changchun, China
July 2, 2012
DNA methylation +
Histone modification
Chromatin
- 2 each of histones: H2A, H2B, H3 and H4
Chromatin: DNA plus protein in cells with nuclei
146 bp of DNA
Nucleosome
The Nucleosome core particle
Nucleosome
H3
H4
http://www.nature.com/nsmb/journal/v14/n11/images/nsmb1337-F1.gif
Post-translational Histone Modifications
H3 tail Modifications:
Active
HDACs HATs
=Acetylation
=Methylation
KMTases Repressive
Li et al. (2007) Cell 128, 707
Histone Modifications in Relation to Gene Transcription
DNA methylation mediated repression
Repression independent of DNA methylation
H3K9 methylation
“condensed” chromatin
H3K27 methylation mediated repression
1. H3K27 methylation
2. DNA methylation
Mechanisms of Epigenetic Crosstalk
“Epigenetic cancer therapy”
DNA-methylation and HDAC inhibitors in clinical trials
Summary
• Dnmt1, Dnmt3a, Dnmt3b - the mammalian DNMTs
• Chromatin structure is influenced by covalent modification of histone tails
• Multiple chromatin modification pathways are involved in gene silencing and may show “crosstalk” with DNA methylation
Technologies for Interrogating Chromatin States
ChIP-chip
Antibody specific to one type of histone modification
Histone Modifications
ChIP-seq
Deep sequencing
Chromatin-IP Sequencing
K4me1, K4me2, K4me3: “active”
K27me3: “repressive”
Figure: ChIP-seq tracks at FoxP1 and Olig1, with panels labeled:
Transcribed gene: K4me3
Silent developmental gene: K27me3
‘Poised’ developmental gene: K4me3 and K27me3
Constitutive heterochromatin: K9me3, K20me3
Histone methylation and transcriptional state
Predicting non-coding RNA?
• From sequence?
 – Not clear which properties can be exploited
 – Sequence features such as promoters are too weak
• Histone modifications + conservation worked
Nucleosome Positioning from Histone ChIP-seq
• Barski et al., Cell 2007
 – Nucleosome-resolution ChIP-seq of 21 histone marks in CD4+ T-cells
 – Total 185.7 M 25 nt tags sequenced
 – Analysis not at nucleosome resolution
to map nucleosomes at specific regions
Antibody for
MNase digest
Combine Tags From All ChIP-Seq
Extend Tags 3’ to 150 nt
Take the middle 75 nt
Check Tag Count Across Genome
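The tag-processing steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tags are given as hypothetical `(start, strand)` pairs with a 0-based 5' coordinate, and the function names `core_interval` and `nucleosome_profile` are made up for this example, not taken from the original analysis.

```python
from collections import Counter

FRAGMENT_LEN = 150  # tags are extended 3' to 150 nt (from the slide)
CORE_LEN = 75       # only the middle 75 nt is kept (from the slide)


def core_interval(start, strand):
    """Return [begin, end) of the middle 75 nt of a tag extended to 150 nt.

    `start` is the 0-based 5' coordinate of the tag (assumed input format).
    """
    if strand == "+":
        frag_begin = start
    else:
        # A minus-strand tag extends toward lower genomic coordinates.
        frag_begin = start - FRAGMENT_LEN + 1
    offset = (FRAGMENT_LEN - CORE_LEN) // 2
    begin = frag_begin + offset
    return begin, begin + CORE_LEN


def nucleosome_profile(tag_sets):
    """Pool tags from several ChIP-seq data sets and count core coverage per base."""
    coverage = Counter()
    for tags in tag_sets:
        for start, strand in tags:
            begin, end = core_interval(start, strand)
            for pos in range(max(begin, 0), end):
                coverage[pos] += 1
    return coverage
```

A plus-strand tag at position 100 and a minus-strand tag at position 249 describe the same 150 nt fragment, so both contribute to the same 75 nt core; peaks in the pooled count then suggest positioned nucleosomes.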
Inaccessible Inaccessible
Accessible
Precise delineation of the accessible regulatory DNA compartment
Digital DNaseI profiling
Digital DNaseI profiling: direct access to regulatory sequences
ChromHMM
Figure: ChromHMM model
Transcription Start Site, Enhancer, Transcribed Region (DNA)
Observed chromatin marks (K4me1, K4me3, K27ac, K36me3), called based on a Poisson distribution
Most likely hidden state (states 1-6) for each 200bp interval
High-probability chromatin marks in each state (emission probabilities 0.7-0.9)
All probabilities are learned from the data
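The slide notes that observed chromatin marks are called in 200bp intervals based on a Poisson distribution. A minimal sketch of that idea follows; the background rate (the mean count over all bins), the threshold `p_threshold=1e-4`, and the function names are assumptions for illustration, not the settings used in the lecture.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def call_marks(bin_counts, p_threshold=1e-4):
    """Binarize read counts per 200bp bin: 1 if enriched over background, else 0.

    The background rate is taken as the mean count across bins (an assumption).
    """
    lam = sum(bin_counts) / len(bin_counts)
    return [1 if poisson_sf(c, lam) < p_threshold else 0 for c in bin_counts]
```

The resulting 0/1 vector per mark is the kind of observation sequence a hidden Markov model can then segment into states.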
ChromHMM
State groups: Promoter, Transcribed, Active intergenic, Repetitive, Repressed
Chromatin Marks from (Barski et al, Cell 2007; Wang et al, Nature Genetics 2008); DNaseI hypersensitivity from (Boyle et al, Cell 2008); Expression Data from (Su et al, PNAS 2004); Lamina data from (Guelen et al, Nature 2008)
Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski’07, Wang’08)
Next-gen Sequencing Technology
Forward Genetics
Phenotype
Genotype
Hypothesis
Test Hypothesis By Genetic Manipulation
Two groups: 1. Develop Colorectal cancer At Young Age
2. Do not
Mutation in APC Gene
APC is a Tumor Suppressor Gene
Delete APC in Mouse Control: Isogenic APC+
The Cycle of Forward Genetics
Phenotype
Genotype
Hypothesis
Test Hypothesis By Genetic Manipulation
Observation
?Sequencing?
Thinking
Gene Deletion/Replacement Recombinant Technology
In 2005 $9 million/genome Not feasible
The Problem with Forward Genetics
Phenotype
Genotype
Hypothesis
Test Hypothesis By Genetic Manipulation
Sequencing
Thinking
Gene Deletion/Replacement Recombinant Technology
Currently $40,000* /genome Cost is rapidly dropping
Sequencing
Past and Current Sequencing Technologies
Pre-1992 (“old fashioned way”): S35 ddNTPs; gels; manual loading; manual base calling
1992-1999 (ABI 373/377): fluorescent ddNTPs*; gels; manual loading; automated base calling*
1999 (ABI 3700): fluorescent ddNTPs; capillaries*; robotic loading*; automated base calling; breaks down frequently
2003 (ABI 3730XL): fluorescent ddNTPs; capillaries; robotic loading; automated base calling; reliable*
0 and 1st generation sequencing
Next or 2nd-generation sequencing technology
454/Roche GS-20/FLX (Oct 2005)
Illumina/Solexa 1G Genetic Analyzer (Feb 2007)
ABI SOLiD (Oct 2007)
Comparison of Next Generation Sequencing Technologies
Technology | Reads/run | Ave read length | bp per run | Data output
3730XL (ABI) | 96 | 900-1200 bp | ~100,000 | 1-2MB
454 (Roche) | 400,000 | 250-310 bp | 70 million | 20GB
Illumina 1G (Solexa) | 40 million | 36 bp | 1 billion | 1.5TB
SOLiD (ABI) | 88-132 million (44-66 per slide) | 35 bp | 1 billion | 1.5-3.0TB
They can be applied to different areas
Is Sanger sequencing dead? Future of sequencing centers
ABI 3730XL
• Routine sequencing
• Verify SNPs from next gen
• 1X scaffold for novel genomes
Next Gen long read instrument (454): “When length matters”
• Novel genomes
• Metagenomics
Next Gen short read instrument (Solexa): “When quantity matters but length doesn’t”
• Expression tags
• ChIP-Seq
• Re-sequencing
Illumina Genome Analyzer
Introduction to the Technology
Illumina Sequencing Pipeline
1. Sample Prep (1-5 days): ligate adapters
2. Cluster generation on flow cell (1.5 day): clonal single molecule array
3. Sequencing and imaging (2-3 days)
4. Data Analysis (days-months)
Cluster Generation
8 channels (lanes)
Attach DNA to flow cell
Bridge amplification
Can we amplify epigenetic marks?
Clonal single molecule array
Random array of clusters
100 um
~1000 molecules per ~1 um cluster
~20-30,000 clusters per tile
~40 M clusters per flowcell
Sequencing By Synthesis (SBS)
Cycle 1: Add sequencing reagents
First base incorporated
Remove unincorporated bases
Detect signal
Cycle 2-n: Add sequencing reagents and repeat
Base Calling From Images
1 2 3 4 5 6 7 8 9
T T T T T T T G T …
T G C T A C G A T …
The identity of each base of a cluster is read off from sequential images
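As a rough illustration of reading bases off sequential images, the sketch below picks, for each cycle, the channel with the strongest signal for one cluster. The input layout (one `(A, C, G, T)` intensity tuple per cycle) and the name `call_bases` are assumptions for this example; a production base caller also corrects for effects such as channel cross-talk and phasing, which are ignored here.

```python
BASES = "ACGT"


def call_bases(intensities):
    """Call one base per cycle for a single cluster.

    `intensities` is a list with one (A, C, G, T) intensity tuple per cycle
    (a hypothetical layout chosen for this sketch).
    """
    read = []
    for cycle in intensities:
        # Index of the brightest of the four per-base images for this cycle.
        best = max(range(4), key=lambda channel: cycle[channel])
        read.append(BASES[best])
    return "".join(read)


example = [
    (0.1, 0.2, 0.1, 0.9),  # cycle 1: T brightest
    (0.1, 0.1, 0.8, 0.2),  # cycle 2: G brightest
    (0.1, 0.9, 0.0, 0.1),  # cycle 3: C brightest
]
print(call_bases(example))
```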
Reversible terminator chemistry solves homopolymer problem
IGA without cover
Flow cell imaging
A flow cell contains eight lanes (Lane 1, Lane 2, ... Lane 8)
Each lane/channel contains three columns of tiles (Column 1, Column 2, Column 3)
Each column contains 100 tiles
Tile: 350 X 350 µm, 20K-30K clusters
Each tile is imaged four times per cycle, one image per base.
345,600 images for a 36-cycle run
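The image count quoted for a 36-cycle run follows directly from the flow cell layout given on this slide. A small worked check, with constant names chosen only for this example:

```python
LANES = 8                      # lanes per flow cell
COLUMNS_PER_LANE = 3           # columns of tiles per lane
TILES_PER_COLUMN = 100         # tiles per column
IMAGES_PER_TILE_PER_CYCLE = 4  # one image per base


def images_per_run(cycles):
    """Total number of tile images produced by a run of `cycles` cycles."""
    tiles = LANES * COLUMNS_PER_LANE * TILES_PER_COLUMN
    return tiles * IMAGES_PER_TILE_PER_CYCLE * cycles


print(images_per_run(36))  # 345600
```

That is 2,400 tiles, 9,600 images per cycle, and 345,600 images over 36 cycles.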
Data Analysis Pipeline
tiff image files (345,600) → Firecrest → intensity files → Bustard → Sequence files → Eland → Alignment to Genome → Additional Data Analysis
Applications of the Technology
Gene Expression
Whole Genome Re-sequencing
Targeted Re-sequencing
ChIP Sequencing
MicroRNA discovery
Other Applications
Read Length is Not As Important For Resequencing
Applications
• Genomes
• Re-sequencing Human Exons (Microarray capture/amplification)
• Small (including miRNA) and long RNA profiling (including splicing)
• ChIP-Seq:
  • Transcription Factors
  • Histone Modifications
  • Effector Proteins
• DNA Methylation
• Polysomal RNA
• Origins of Replication/Replicating DNA
• Whole Genome Association (rare, high impact SNPs)
• Copy Number/Structural Variation in DNA
• ChIA-PET: Transcription Factor Looping Interactions
• ???
Functional Genomics Data Analysis
• Map reads to the genome
  • Available Tools
    • MAQ
    • SOAP
    • MOSAIK
    • BWA
    • BOWTIE
  • Determine the target genome sequence (i.e., repeat classes)
  • Mapping options
    • Number of allowed mismatches (as function of position)
    • Number of mapped loci (e.g., 1 = unique read sequence)
• Generate Consensus Sequence and identify SNPs
• Generate Read Enrichment Profile (e.g., Wald Lab tool)
• Develop Null Model and Calculate Significantly Enriched Sites
• High level analysis: compare to annotations, other data sets, etc.
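The two mapping options listed above (allowed mismatches and number of mapped loci) can be illustrated with a deliberately naive scan. This is a toy sketch, not how MAQ, BWA or Bowtie work internally: the functions `map_read` and `is_unique` are hypothetical, the mismatch limit is a single number rather than a function of position, and only the forward strand is searched.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, genome, max_mismatches=2):
    """Return every 0-based genome position where the read aligns
    with at most `max_mismatches` mismatches (forward strand only)."""
    hits = []
    for pos in range(len(genome) - len(read) + 1):
        if hamming(read, genome[pos:pos + len(read)]) <= max_mismatches:
            hits.append(pos)
    return hits


def is_unique(hits):
    """A read is 'unique' when it maps to exactly one locus."""
    return len(hits) == 1


genome = "ACGTACGTTTGACCA"
print(map_read("TTGAC", genome, max_mismatches=0))  # one locus: unique
print(map_read("ACGT", genome, max_mismatches=0))   # two loci: not unique
```

Raising `max_mismatches` recovers more reads but also increases the number of loci each read hits, which is the trade-off the mapping options control.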
Limitations of short read technology
• Need a genome
• De-novo assembly difficult
• Can’t sequence through repeats
• 80% of the human genome is “sequenceable”
• Need high coverage (15-20X) to detect polymorphisms
• Missed SNPs are likely due to low coverage
• 300X for 1 in 20 event (1 heterozygous in 10 samples)
• Error rate increases past the first 30~50 bases
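One way to read the coverage figures above: ten pooled diploid samples carry 20 chromosome copies, so a single heterozygous carrier contributes the variant on 1 in 20 copies. The sketch below models the number of variant-carrying reads as binomial, which is an assumption of this example (it ignores sequencing errors and mapping bias); the function names are made up for illustration.

```python
from math import comb


def expected_variant_reads(depth, allele_fraction):
    """Expected number of reads carrying the variant allele."""
    return depth * allele_fraction


def prob_at_least(k, depth, allele_fraction):
    """P(at least k variant reads) under a binomial sampling model."""
    p_fewer = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# 1 heterozygous sample among 10 pooled diploid samples: 1 in 20 alleles.
print(expected_variant_reads(300, 1 / 20))
# A single heterozygous sample at 15X: half the reads carry the variant.
print(expected_variant_reads(15, 0.5))
```

Under this model, 300X pooled coverage yields about 15 variant reads, the same order as the variant read count at 15-20X in a single sample.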
Paired End Reads are Important!
Repetitive DNA Unique DNA
Single read maps to multiple positions
Paired read maps uniquely
Read 1 Read 2
Known Distance
Paired Ends are Important Part 2
Shendure et al 2005
Deletion Insertion Inversion
Paired-end mapping reveals structural variations
High-throughput paired-end mapping (PEM)
… which is composed of two linking signatures where the linked regions are close to each other (Fig. 1e). Unlike the basic insertion, the linked insertion signature can be used to identify the region that has been inserted. However, if the size of the insertion is large, then the confidence that the two linking signatures are associated with the same insertion decreases, and thus this signature becomes weak for very large insertions.
Another type of linking signature is created by a region of the reference that has been tandemly duplicated in the donor. Cooper et al. [7] first observed that a mate pair that has an end in each of the two copies will have an ‘everted’ mapping: the order of the mates is reversed while the orientation stays the same (Fig. 1f). We call this an ‘everted duplication’ signature. This signature can only be used to detect a novel tandem duplication; for example, it cannot detect a tandemly repeated region whose copy count changes from two to three.
All of the methods outlined above, although able to identify approximate locations of breakpoints, cannot indicate the exact locations. The methods below describe signatures that address this shortcoming.
Breakpoint identification: split mapping and hanging insertion. A read sampled across a deletion breakpoint will leave a ‘split mapping’ signature in the reference, with a prefix and suffix of the read mapping to different locations. Whereas this signature is detectable with longer reads [5,35], there are too many such spurious mappings of short read halves, and hence too many spurious signatures, with short read data. Nevertheless, Ye et al. [36] showed that if one uses the fact that the mate of a split read must map nearby, then the search space for the split mapping of the hanging read can be much reduced. Thus we have the ‘anchored split mapping’ signature, in which one of the mates maps to the reference and the other has a split mapping with one of its parts about 1 insert size away (Fig. 1g). A similar situation occurs when there is an insertion of a few base pairs. This will leave behind a similar signature, except that the split read will have a prefix and suffix mapping to adjacent locations, and there will be a middle part of the read (the bases inserted) that will not be part of either the prefix or suffix mapping (Fig. 1h).
The anchored split mapping signature has the advantage that it can pinpoint the breakpoint of the event with base-pair precision. However, if the deletion is too large, then there will be too many spurious hits for the farther part of the split mapping. Similarly, the size of the insertion detectable with this signature is only a few base pairs, as every inserted base reduces the fraction of the read that matches the genome.
To identify insertions that contain a novel genomic segment, it is possible to use mate pairs spanning either of the breakpoints, where one read maps while the other one does not. Such pairs form a ‘hanging insertion’ signature [5] (Fig. 1i). De novo assembly of such hanging reads can be used to reconstruct a small inserted segment, although if it is substantially larger than the insert size, hanging reads will not cover the entire insertion.
Signatures based on depth of coverage
The high coverage of NGS makes it possible to identify a completely different type of signature, based on the depth of coverage (DOC). Assuming the sequencing process is uniform, the number of reads mapping to a region follows a Poisson distribution and is expected to be proportional to the number of times the region appears in the donor. Thus, a region that has been deleted (duplicated) will have less (more) reads mapping to it. Although earlier work used DOC to identify recent segmental duplications in the human genome [37] and compare segmental duplications between human and chimp [38], Campbell et al. [34] were the first to use these ‘gain/loss’ signatures to detect CNVs between tumor and healthy samples of the same individuals (Fig. 2). Unlike the PEM insertion signatures, the gain signature does not indicate where an insertion occurred, but rather …
Figure 1 panels (each showing donor vs. ref): (a) basic insertion, (b) basic deletion, (c) basic inversion, (d) linking, (e) linked insertion, (f) everted duplication, (g) anchored split mapping (deletion), (h) anchored split mapping (insertion), (i) hanging insertion
Figure 1 | Illustrations of PEM signatures. Mate pairs are sampled from the donor, where they are ordered with opposite orientation (the blue mate follows the orange), and are mapped to the reference (ref). Basic signatures include (a) insertions and (b) deletions, where the mapped distance is different from the insert size, as well as (c) inversions, where the order of the two mates is preserved but one of them changes orientation. (d) The linking signature has several discordant mate pairs with similar mapped distances identifying adjacency in the donor (dashed orange arrows) of two distal segments of the reference. The orientation and order of the mapped mate pairs depends on the orientation and order of the two segments in the reference; here, these are unchanged. (e) A linked insertion signature is composed of two linking signatures and arises when the inserted sequence (green) is copied from another location in the genome. (f) A tandem duplication will create an everted duplication linking signature, with mates mapping out of order but with proper orientations. These mate pairs link the end of the duplicated region to its beginning. (g,h) In the anchored split mapping signature, one mate has a good mapping, whereas the other has a split mapping. For a deletion (g) the prefix and suffix surround the deletion, whereas for an insertion (h) the split read has the prefix and suffix mapped to adjacent locations, while a middle part does not map. (i) When a novel genomic segment is inserted, a hanging insertion signature is created, in which only one of the mates has a good mapping.
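The basic PEM signatures in the caption above can be turned into a small classifier for a single mapped mate pair. This sketch assumes a concordant pair maps as a '+' mate to the left of a '-' mate at roughly the insert size; the function name, argument layout and `tolerance` parameter are hypothetical, and real PEM methods cluster many discordant pairs before calling an event. Linking, split-mapping and hanging signatures are not covered.

```python
def classify_mate_pair(left_pos, left_strand, right_pos, right_strand,
                       insert_size, tolerance):
    """Classify one mate pair by its mapping on the reference.

    `left_*` describes the mate with the smaller reference coordinate.
    """
    if left_pos > right_pos:
        raise ValueError("left_pos must not exceed right_pos")
    if left_strand == right_strand:
        # Order preserved, but one mate changed orientation.
        return "inversion"
    if left_strand == "-" and right_strand == "+":
        # Proper orientations, but mates map out of order.
        return "everted duplication"
    distance = right_pos - left_pos
    if distance > insert_size + tolerance:
        # Mates map farther apart than sampled: sequence missing in the donor.
        return "deletion"
    if distance < insert_size - tolerance:
        # Mates map closer than sampled: extra sequence in the donor.
        return "insertion"
    return "concordant"


print(classify_mate_pair(1000, "+", 5000, "-", insert_size=300, tolerance=50))
```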
NATURE METHODS SUPPLEMENT | VOL.6 NO.11s | NOVEMBER 2009 | S15
Medvedev et al. Nature Methods 2009
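The depth-of-coverage idea from the Medvedev et al. excerpt (read count in a region is proportional to its copy number) can be sketched as a ratio between a sample and a matched control. The normalization by total mapped reads, the diploid control assumption, and the `margin` cutoff are choices made for this example only.

```python
def doc_copy_number(sample_reads, control_reads,
                    sample_total, control_total, control_copies=2):
    """Estimate copy number of a region from depth of coverage.

    Region read counts are normalized by each library's total mapped reads,
    and the control is assumed to carry `control_copies` copies.
    """
    ratio = (sample_reads / sample_total) / (control_reads / control_total)
    return control_copies * ratio


def call_gain_loss(copy_number, control_copies=2, margin=0.5):
    """Label a region as gain, loss or neutral relative to the control."""
    if copy_number >= control_copies + margin:
        return "gain"
    if copy_number <= control_copies - margin:
        return "loss"
    return "neutral"


estimate = doc_copy_number(150, 100, 1_000_000, 1_000_000)
print(estimate, call_gain_loss(estimate))
```

As the excerpt notes, a gain called this way says that extra copies exist, not where they were inserted.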
We need more genomes!
• Complete Genomics ($5000)
• ABI ($10,000)
• Illumina ($10,000)
• Intelligent Biosystems (<$1000)
“3rd generation” sequencing
• Ion Torrent
• Pac Bio
• Nanopore
JM Rothberg et al. Nature 475, 348-352 (2011) doi:10.1038/nature10242
Sensor, well and chip architecture.
Ion Torrent
Wafer, die and chip packaging.
Pros and Cons
• Fast (4 hour sequencing)
• Cheap per run, but not per base*
• Homopolymers?
* Yet
Single-molecule, real-time (SMRT) sequencing PacBio
Nanopore sequencing