Introduction to genome biology
Lisa Stubbs
We’ve found most genes; but what about the rest of the genome?
Most notably:
• Coding gene number is relatively constant in metazoans, BUT
• Number of alternative transcripts per gene and gene density are not
  – Each gene gives rise to many more isoforms: protein sequence diversity
  – Much more non-coding DNA, including gene regulatory DNA
Genome size*     12 Mb    95 Mb     170 Mb    1500 Mb   2700 Mb   3200 Mb
# coding genes   ~7000    ~20000    ~14000    ~26000    ~23000    ~21000
# transcripts    ~7000    ~50000    ~29000    ~53000    ~93000    ~200000
bp/gene          1714     4750      12143     57692     117381    152381
* Data taken from the ENSEMBL genome browser, www.ensembl.org
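The per-gene figures in the table above follow from simple division. A minimal sketch that recomputes them from the rounded table values (the variable names are illustrative, and because the inputs are approximate the results can differ slightly from the printed row):

```python
# Approximate values copied from the table above (six genomes, smallest to largest).
genome_size_bp = [12e6, 95e6, 170e6, 1500e6, 2700e6, 3200e6]
coding_genes = [7000, 20000, 14000, 26000, 23000, 21000]
transcripts = [7000, 50000, 29000, 53000, 93000, 200000]


def per_gene(values, genes):
    """Divide each value by the matching coding-gene count."""
    return [v / g for v, g in zip(values, genes)]


bp_per_gene = per_gene(genome_size_bp, coding_genes)
transcripts_per_gene = per_gene(transcripts, coding_genes)

for size, bp, tx in zip(genome_size_bp, bp_per_gene, transcripts_per_gene):
    print(f"{size / 1e6:>6.0f} Mb  {bp:>10,.0f} bp/gene  {tx:4.1f} transcripts/gene")
```

Gene number changes only about three-fold across the table, while bp per gene and transcripts per gene both rise steeply with genome size.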
Most traditional studies have focused on promoters and nearby (proximal) enhancers
• Promoter regions are most likely to be involved in recruiting RNA polymerase and related proteins
  – TATA-binding protein (TBP) and TBP-associated factors (TAFs)
  – General transcription factors (GTFs)
  – Mediator complexes
• Some transcription factors (TFs) are also more likely to be found at promoter sites
  – SP1 and the E2F family are classical examples
• BUT, most other metazoan TFs are found preferentially at distant sites
  – Introns, intergenic regions
  – Some may be 100s or 1000s of bp from the target promoter, or even embedded within neighboring genes
Transcription factors and their binding sites
• Most known TFs have short and variable binding sites, e.g.
[Figure: example binding motifs for YY1, SP1 and Mzf1]
• BUT the probability of finding a string such as the Yy1 “core” (even as a simple string, rather than a matrix) at any given position is (1/4)^4 = 1/256, i.e. about once every 256 bp by chance!
  – Most TFBS are not much more specific than this!
• So, how to raise the probability that the site you find is functional?
  1. Interspecies conservation: sites that are found in similar locations in diverse species are more likely to be functional
  2. Site clustering: many TFs bind as homo- or heterodimers, which significantly stabilizes binding and influences function
  3. Location within regions that are known to be in an “open” state in the cell type and conditions of interest
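To make the 1/256 figure concrete, here is a minimal sketch that compares the expected number of chance matches for a 4-bp string with the count observed in random sequence. It assumes equal base frequencies and independent positions, which real genomes do not have, and the 4-mer "CCAT" is used only as an illustrative core string.

```python
import random


def expected_hits(k, sequence_length):
    """Expected chance matches of one k-mer on one strand, uniform base composition."""
    return (0.25 ** k) * (sequence_length - k + 1)


def count_hits(sequence, motif):
    """Count (possibly overlapping) exact matches of motif on the given strand."""
    k = len(motif)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif)


rng = random.Random(1)                      # fixed seed so the run is repeatable
length = 200_000
sequence = "".join(rng.choices("ACGT", k=length))

motif = "CCAT"                              # illustrative 4-mer, an assumption
expected = expected_hits(len(motif), length)
observed = count_hits(sequence, motif)
print(f"expected ~{expected:.0f} chance matches, observed {observed}")
```

At this rate a 3200 Mb genome would contain on the order of ten million chance matches per strand, which is why conservation, clustering and chromatin state are needed as additional filters.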
How to find the regulatory needles in the haystack?
• Vertebrate genomes are mostly non-coding
  – ~2% coding; ~5% noncoding and evolutionarily conserved (at the DNA sequence alignment level)
• Websites to view pre-aligned sequence conservation levels abound; e.g. the ECR browser http://ecrbrowser.dcode.org/
• zPicture and Mulan provide “do it yourself” tools for pairwise or multi-sequence alignments of up to 1Mb; http://zpicture.dcode.org/, http://mulan.dcode.org/
• All three tools allow detection of conserved TFBS from Transfac, Jaspar, and other databases
Conserved motifs are more likely to be functional…
• As long as the biology you are interested in is also conserved – Important to consider the appropriate species for comparisons
[Figure: spatial display of conserved TFBS]
Focusing on accessible chromatin
• Even well conserved motifs cannot be accessed in closed regions of chromatin
[Figure: accessible chromatin (e.g. H3K27ac) versus not accessible chromatin (e.g. H3K9me3, H3K27me3)]
How to find active elements? Chromatin immunoprecipitation with TF and histone-modification antibodies
• Chromatin and attendant proteins are chemically crosslinked (lightly) using formaldehyde
– Crosslinking will also attach proteins to each other, so that detection of secondary chromatin interactions is inevitable
• Cross-linked chromatin is randomly sheared by sonication (average fragment size 200-500bp)
• Sonicated fragments in solution are exposed to a protein-specific antibody
• Antibody is retrieved with DNA still attached
• DNA is released with salt and heat (reverses the crosslinks)
  – Library is created for sequencing: ligation of “tags” and light PCR amplification
  – Sequenced directly, e.g. Illumina sequencing
[Figure: example sequence read, ATGGCCTTAACGA…]
Sequence-based ChIP approaches…
• Harness ChIP, DNase sensitivity, and other assays to Illumina sequencing
  – ChIP-enriched DNA is ligated to Illumina linkers and sequenced directly
  – If your experiment works, you’ve enriched a very small fraction of the genome
  – Requires a lot of input chromatin! Traditional methods need ~10^7 cells per experiment!!
  – Critical step is an efficient, selective antibody (and very few exist)
ChIP computational issues
• Sequence is read from the randomly positioned ends of multiple, overlapping, randomly sheared fragments
  – Reads will be scattered over a distance of ~2X the shear fragment length
  – ChIP-seq reads surround but may not contain the DNA binding site
• Computational tools (like MACS) need to join adjacent sets of read peaks and define a “shift” distance between read peaks to determine a summit
[Figure: binding site, ChIP fragments, and sequence reads]
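The “shift” idea can be illustrated with a toy simulation. This is a sketch of the general principle only, not the model MACS actually fits: it assumes every fragment covers the binding site, that both ends of each fragment are read, and that fragment lengths are uniform between two hypothetical bounds.

```python
import random
from statistics import mean


def simulate_read_ends(site, n_fragments=2000, min_len=150, max_len=250, seed=0):
    """Return 5' read positions on the + and - strands for fragments covering `site`."""
    rng = random.Random(seed)
    plus_ends, minus_ends = [], []
    for _ in range(n_fragments):
        length = rng.randint(min_len, max_len)
        start = site - rng.randint(0, length)   # fragment overlaps the site
        plus_ends.append(start)                 # + strand read starts at the left end
        minus_ends.append(start + length)       # - strand read starts at the right end
    return plus_ends, minus_ends


def estimate_summit(plus_ends, minus_ends):
    """Estimate the strand offset d, shift each read by d/2, and average."""
    d = mean(minus_ends) - mean(plus_ends)
    shifted = [p + d / 2 for p in plus_ends] + [m - d / 2 for m in minus_ends]
    return d, mean(shifted)


true_site = 10_000
plus_ends, minus_ends = simulate_read_ends(true_site)
d, summit = estimate_summit(plus_ends, minus_ends)
print(f"estimated d = {d:.0f} bp, summit = {summit:.0f} (true site {true_site})")
```

The + strand reads pile up to the left of the site and the - strand reads to the right; shifting each group by half the estimated offset brings both onto the binding site.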
Analytical considerations
• Genomic neighborhoods
  – Shear efficiency is not really “random”
    • Some genomic regions are fragile and sensitive; some are protected
    • Chromatin-matched, co-sheared controls are essential
    • Most peak finders strongly favor comparing control and experimental samples with similar numbers of reads
• Repeatability is key
  – Biological, or at least technical, replicates are also essential
  – Artifactual peaks are very easy to generate!
  – Other ways to validate:
    • Known targets
    • Known motifs
    • Similar targets in different cell types or tissues
• Peak width
  – Transcription factors typically yield sharp peaks; chromatin marks are sometimes broader and more diffuse
• User-friendly tools
  – MACS:
    • “Model-based” peak detection; sensitive to peak enrichment and background
    • Zhang et al, Genome Biology 2008; Feng et al, Nat Protocols 2012, PMID: 22936215 (Xiaole Liu lab)
    • MACS1 is best for sharp peaks (TFs); will break diffuse peaks into smaller regions
    • MACS2 is designed to allow broad- or sharp-peak detection
  – HOMER (http://homer.salk.edu/homer)
    • Can be easily tweaked for more sensitive peak detection
    • Comes packaged with a rich set of peak annotation tools
    • Tools for DNase-seq, Hi-C, differential ChIP analysis and many more
  – Both tools permit generation of “wiggle files” or similar that can be viewed in the UCSC browser
• Looking at your data is a very important step! Peak finders can miss peaks that you can easily see by eye!
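MACS and HOMER write browser tracks themselves, but it helps to know what such a file contains. The sketch below bins read start positions and writes them in bedGraph form, a close relative of the wiggle format that the UCSC browser also accepts. The function names, the bin size and the track name are illustrative assumptions.

```python
from collections import Counter


def coverage_bins(read_starts, bin_size=25):
    """Count reads per fixed-width bin; returns sorted (bin_index, count) pairs."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return sorted(counts.items())


def to_bedgraph(chrom, read_starts, bin_size=25, name="my_chip"):
    """Format binned counts as bedGraph text (0-based, half-open intervals)."""
    lines = [f'track type=bedGraph name="{name}"']
    for bin_index, count in coverage_bins(read_starts, bin_size):
        start = bin_index * bin_size
        lines.append(f"{chrom}\t{start}\t{start + bin_size}\t{count}")
    return "\n".join(lines)


# Made-up read start positions, for illustration only.
example_reads = [100, 110, 120, 130, 500]
bedgraph_text = to_bedgraph("chr15", example_reads)
print(bedgraph_text)
```

A file like this can be uploaded as a custom track, so the raw signal can be inspected next to the called peaks.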
Differential ChIP and connection to differential expression
• Just like differential sequence analysis
– comparison requires rigorous normalization
• Normalization is complicated for ChIP
– peak height? Peak shape? Summit position? Read density? Local neighborhoods?
– Not as simple as an intensity score or a yes/no count
• Chromatin dynamics and expression dynamics
– *might* or *might not* be temporally coordinated
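As one simple illustration of why normalization matters, the sketch below scales read counts in a peak by library size (reads per million) before comparing conditions. This is only the most basic option and ignores peak shape, summit position and local background; the pseudocount and all numbers are made up for the example.

```python
import math


def reads_per_million(count, library_size):
    """Scale a raw read count to reads per million mapped reads."""
    return count * 1e6 / library_size


def log2_fold_change(exp_count, exp_library, ctl_count, ctl_library, pseudocount=1.0):
    """log2 ratio of library-size-normalized signal, experimental over control."""
    exp = reads_per_million(exp_count, exp_library) + pseudocount
    ctl = reads_per_million(ctl_count, ctl_library) + pseudocount
    return math.log2(exp / ctl)


# Same peak, but the experimental library was sequenced twice as deeply.
raw_ratio = 300 / 150
lfc_same = log2_fold_change(300, 20_000_000, 150, 10_000_000)
lfc_up = log2_fold_change(600, 20_000_000, 150, 10_000_000)
print(f"raw ratio {raw_ratio:.1f}, normalized log2FC {lfc_same:.2f}; true increase log2FC {lfc_up:.2f}")
```

The first comparison looks like a two-fold change in raw counts but disappears after scaling, while the second remains a genuine increase.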
[Figure: UCSC genome browser view, mouse mm9, chr15:76,304,000-76,313,000 (Bop1, Hsf1). Tracks: frontal cortex H3K4me3, H3K27ac, H3K4me1 and H3K27me3 ChIP for control and experimental samples (30 min or 120 min), with ENCODE/LICR cortex 8w H3K4me3, H3K4me1 and H3K27ac histone-modification tracks, mouse mRNAs and spliced ESTs]
Data from ChIP with TFs, modified histones, and other proteins are available for human (and to some degree, mouse and flies) as tables in the UCSC genome browser (www.genome.ucsc.edu)
From Hoffman et al, Nucl Acid Res 41:827, 2013
Yet another example of why you should “look at your data”
[Figure: UCSC genome browser view, mouse mm9, chr17:35,095,000-35,105,000 (Hspa1b, Hspa1a). Tracks: frontal cortex H3K4me3, H3K27ac, H3K4me1 and H3K27me3 ChIP for control and experimental samples, with mouse mRNAs and spliced ESTs]
Transposon-based alternatives
• These tools address an important issue:
  – Library preps fail unless you start with significant ChIP input
  – How to work with samples for which millions of cells are not available?
• Solution
  – Library prep without linker ligation
  – A transposon brings in the essential Illumina (or other) primers
  – Library prep is completed simply with PCR
  – The need for substantial input DNA is removed
[Figure: Tn5 transposase insertion (e.g. Illumina library oligos), tagmentation, continued reaction, PCR, ready to sequence]
ChIP tagmentation
• Regular ChIP prep
• Treat with transposase and tag oligos while chromatin is still on the beads
• Release after tagmentation, PCR, size-select and sequence (no library prep!)
Issues related to tagmentation
• Ratio of DNA:transposase
  – Has to be adjusted for each cell type and chromatin prep
• Need even fragmentation to avoid bias, and small enough fragments, in general, for Illumina
• Need to avoid making fragments too small
• Bias observed in DNA: controls are complicated
• Solution in “ChIPmentation”
  – Tagmentation while DNA is still protected by the antibody and cross-linked chromatin, still on the bead
• Protects from over-tagmentation, thus allowing a full digestion without fear of losing the DNA
• Allows the protocol to work over a 25X range of DNA:transposase and lessens worries about time
Illumina-owned kit is expensive but…
Genome Res 24:2033–2040
Genome Biology Topic overview
• Lectures
  – Ross Hardison
    • Basics of gene regulation, epigenetics and ENCODE results
  – David Hawkins
    • Chromatin states, biological applications
  – James Taylor
    • Higher dimension chromatin structure
  – Lisa Stubbs
    • Integrating data for biological inference: Basics of expression correlation methods
• Workshops
  – Bowtie and MACS on Galaxy
  – Peaks to features in Galaxy
  – Bowtie and MACS / Tophat->Cuffdiff on the command line
  – Monday: student’s choice
    • “How to” for ECR browser and zPicture (sequence alignments and conserved motifs)
    • Simple methods for expression correlation: Cluster and Cytoscape
    • ChIP peaks to MEME-ChIP (online connection to the MEME suite for large peak sets)
    • DAVID functional clustering analysis (GO and pathway analysis tools online)