Monica C. Sleumer ( 苏漠 ) 2012-09-19
Human Genome
• 3,101,804,739 base pairs
• 22 chromosomes plus X and Y
• 21,224 protein-coding genes
• 15,952 ncRNA genes
• 3–8% of bases are under selection
  – From comparative genomic studies
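To put the selection estimate in absolute terms, the following minimal sketch converts the 3–8% range into base-pair counts using the genome size quoted above. The function and variable names are illustrative only.

```python
GENOME_BP = 3_101_804_739  # genome size quoted on the slide


def bases_under_selection(fraction: float, genome_bp: int = GENOME_BP) -> int:
    """Return the number of bases that a given fraction of the genome represents."""
    return round(genome_bp * fraction)


low = bases_under_selection(0.03)   # lower bound of the 3-8% estimate
high = bases_under_selection(0.08)  # upper bound of the 3-8% estimate
print(f"{low:,} to {high:,} bp under selection")
```

That is roughly 93 to 248 million bases, which gives a sense of scale for the question on the next line.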
• Question: What is the genome doing?
Objectives
• Find all functional elements
  – Bound by specific proteins
  – Transcribed
  – Histone modifications
  – DNA methylation
• Use this information to annotate functional regions
  – Genes (coding and non-coding)
  – Promoters
  – Enhancers
  – Specific transcription factor binding sites
  – Silencers
  – Insulators
  – Chromatin states
• Cross-reference data from other studies
  – Comparative genomics
  – 1000 Genomes Project
  – Genome-wide association studies (GWAS)
ENCODE projects
• ENCODE pilot project: 1% of the genome, 2003-2007
• modENCODE: Drosophila and C. elegans
• ENCODE main project, 2007-2012
  – 1649 dataset-generating experiments
  – 147 cell types
  – 235 antibodies and assay protocols
  – 450 authors
  – 32 institutes
• 31 publications, 2012-09-06
  – 6 in Nature
  – 18 in Genome Research
  – 6 in Genome Biology
  – 1 in BMC Genetics
www.nature.com/encode/category/research-papers
Materials
• 147 types of human cell lines, 3 priority levels
• Tier 1 cell lines: top priority for all experiments
• Tier 2 cell lines: to be done after Tier 1 (next slide)
• Tier 3: any other cell lines

Tier 1 Cell Lines
Name | Description | Lineage | Tissue | Karyotype
GM12878 | B-lymphocyte, lymphoblastoid, Epstein-Barr Virus, 1000 Genomes Project | mesoderm | blood | normal
H1-hESC | embryonic stem cells | inner cell mass | embryonic stem cell | normal
K562 | leukemia, 53-year-old female with chronic myelogenous leukemia | mesoderm | blood | cancer
Tier 2 Cell Lines
Name | Description | Lineage | Tissue | Karyotype
A549 | lung carcinoma epithelium, 58-year-old caucasian male | endoderm | epithelium | cancer
CD20+ | donor B cells: RO01778 and RO01794 | mesoderm | blood | normal
CD20+_RO01778 | B cells, caucasian | mesoderm | blood | normal
CD20+_RO01794 | B cells, African American | mesoderm | blood | normal
H1-neurons | neurons derived from H1 embryonic stem cells | ectoderm | neurons | normal
HeLa-S3 | cervical carcinoma | ectoderm | cervix | cancer
HepG2 | hepatocellular carcinoma | endoderm | liver | cancer
HUVEC | umbilical vein endothelial cells | mesoderm | blood vessel | normal
IMR90 | fetal lung fibroblasts | endoderm | lung | normal
LHCN-M2 | skeletal myoblasts from pectoralis major muscle, 41-year-old caucasian | mesoderm | skeletal muscle myoblast |
MCF-7 | mammary gland, adenocarcinoma | ectoderm | breast | cancer
Monocytes-CD14+ | Monocytes-CD14+, leukapheresis from RO 01746 and RO 01826 | mesoderm | monocytes | normal
SK-N-SH | neuroblastoma, 4-year-old | ectoderm | brain | cancer
http://encodeproject.org/ENCODE/cellTypes.html
Methods
RNA-Seq | Different fractions of RNA -> sequencing
CAGE | 5′ capped RNA sequencing
RNA-PET | Sequencing of 5′ cap plus poly-A tail
ChIP-seq | Chromatin immunoprecipitation of a DNA-binding protein -> sequencing
DNase-seq | Cut exposed DNA with DNase I -> sequencing
FAIRE-seq | Nucleosome-depleted DNA -> sequencing
RRBS | Bisulphite treatment: unmethylated C -> U -> sequencing
3C, 5C, ChIA-PET | Chromatin interactions -> sequencing
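The bisulphite step in the RRBS row can be illustrated with a toy simulation. This is only a sketch of the principle (unmethylated C is converted to U and read as T after amplification, while methylated C stays C); it handles a single strand, and the function names and the example sequence are hypothetical, not part of any ENCODE pipeline.

```python
def bisulfite_convert(seq: str, methylated_positions: set) -> str:
    """Simulate a bisulphite-treated read for one strand.

    Unmethylated C is deaminated to U, which sequencing reports as T;
    methylated C is protected and is still read as C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference: str, converted: str) -> list:
    """Return reference C positions that still read as C (inferred methylated)."""
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference.upper(), converted))
        if ref_base == "C" and read_base == "C"
    ]


ref = "ACGTCCGA"                                   # toy reference sequence
read = bisulfite_convert(ref, methylated_positions={1})
print(read)                                        # ACGTTTGA
print(call_methylation(ref, read))                 # [1]
```

Comparing the converted read back to the reference is what lets methylated positions be inferred from ordinary sequencing output.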
Results: RNA Sequencing
• 62% of the genome is transcribed into sequences >200 bp long
  – 5.5% of this is exon
  – 31% is intergenic (no annotated gene)
  – Remaining: intronic
• CAGE-seq: 62,403 TSS
  – 44% within 100 bp of the 5′ end of a GENCODE gene
  – Others: exons and 3′ UTRs, significance unknown
• Lots of short ncRNAs: tRNA, miRNA, snRNA etc.
• Further description: Wu Dingming, 9:30
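The "remaining: intronic" share is not stated on the slide but follows by subtraction. The sketch below works it out, assuming that the 5.5% and 31% figures are both shares of the transcribed 62% rather than of the whole genome; that reading is an assumption, as are the variable names.

```python
TRANSCRIBED = 0.62  # fraction of the genome transcribed (from the slide)

# Shares of the transcribed fraction (assumed interpretation of "of this")
SHARES = {"exon": 0.055, "intergenic": 0.31}
SHARES["intronic"] = 1.0 - sum(SHARES.values())  # the remainder

# Express each category as a fraction of the whole genome
genome_fraction = {name: TRANSCRIBED * share for name, share in SHARES.items()}

for name, frac in genome_fraction.items():
    print(f"{name}: {SHARES[name]:.1%} of transcribed, {frac:.2%} of genome")
```

Under that assumption the intronic share is about 63.5% of transcribed sequence, or roughly 39% of the genome.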
Results: Transcribed and protein-coding regions
• GENCODE reference gene set
  – 20,687 protein-coding genes
    • 6.3 alternatively spliced transcripts on average
    • 3.9 protein isoforms on average
    • Protein-coding exons: 1.22% of the genome
    • Still more to come: unidentified peptides in mass-spec
  – 18,441 ncRNA genes
    • 8,801 short ncRNA
    • 9,640 long ncRNA
  – 11,224 pseudogenes
    • 863 transcribed
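A quick consistency check on the GENCODE numbers above, as a minimal sketch. It reuses the genome size from the "Human Genome" slide; the derived transcript count is only an approximation from the rounded average, and the variable names are illustrative.

```python
GENOME_BP = 3_101_804_739            # genome size from the "Human Genome" slide

short_ncrna, long_ncrna = 8_801, 9_640
ncrna_total = short_ncrna + long_ncrna        # should match the 18,441 on the slide

coding_exon_bp = round(GENOME_BP * 0.0122)    # 1.22% of the genome, in base pairs
approx_transcripts = round(20_687 * 6.3)      # genes x average transcripts per gene

print(f"ncRNA genes: {ncrna_total:,}")
print(f"protein-coding exon sequence: about {coding_exon_bp / 1e6:.1f} Mb")
print(f"protein-coding transcripts: about {approx_transcripts:,}")
```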
ChIP-Seq
Acronym | Description | Factors analysed
ChromRem | ATP-dependent chromatin complexes | 5
DNARep | DNA repair | 3
HISase | Histone acetylation, deacetylation or methylation complexes | 8
Other | Cyclin kinase associated with transcription | 1
Pol2 | Pol II subunit | 1 (2 forms)
Pol3 | Pol III-associated | 6
TFNS | General Pol II-associated factor, not site-specific | 8
TFSS | Pol II transcription factor with sequence-specific DNA binding | 87
Total | | 119
www.illumina.com/technology/chip_seq_assay.ilmn
ChIP-Seq: Histone modifications
Histone modification or variant | Signal characteristics | Association
H2A.Z | Peak | dynamic chromatin
H3K4me1 | Peak/region | enhancers and other distal elements, also downstream of transcription starts
H3K4me2 | Peak | promoters and enhancers
H3K4me3 | Peak | promoters/transcription starts
H3K9ac | Peak | promoters
H3K9me1 | Region | 5′ end of genes
H3K9me3 | Peak/region | Gene repression, constitutive heterochromatin and repetitive elements
H3K27ac | Peak | Gene expression, active enhancers and promoters
H3K27me3 | Region | polycomb complex, repressive domains and silent developmental genes
H3K36me3 | Region | Elongation, transcribed portions of genes, 3′ regions after intron 1
H3K79me2 | Region | Transcription, 5′ end of genes
H4K20me1 | Region | 5′ end of genes
Results: ChIP-Seq
• 636,336 binding regions
• 8.1% of the genome
• Sequence-specific TF ChIP-seq:
  – 86% of the DNA segments occupied by sequence-specific transcription factors contained a strong DNA-binding motif
  – 55% of cases contained the expected motif
• Further description: Qin Zhiyi & Ma Xiaopeng, 13:30
DNase I hypersensitivity
• 2,890,000 unique hypersensitive sites (DHSs)
• 4,800,000 sites across 25 cell types
• Tier 1 and tier 2 cell types: 205,109 DHSs per cell type
• 98.5% of ChIP-seq TFBS within DHSs
• Further description: Guo Weilong 12:30, He Chao 14:30
https://www.nationaldiagnostics.com/electrophoresis/article/dnase-i-footprinting
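A statistic such as "98.5% of ChIP-seq TFBS within DHSs" comes down to an interval-containment count. The sketch below shows one simple way to compute such a fraction on a single chromosome; it is not the ENCODE analysis itself, the coordinates are made up, and it assumes the DHS regions do not overlap one another.

```python
from bisect import bisect_right


def fraction_within(sites, regions):
    """Fraction of sites lying entirely inside at least one region.

    Intervals are half-open (start, end) tuples on one chromosome.
    Regions are assumed not to overlap each other.
    """
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    inside = 0
    for start, end in sites:
        i = bisect_right(starts, start) - 1  # last region starting at or before the site
        if i >= 0 and end <= regions[i][1]:
            inside += 1
    return inside / len(sites) if sites else 0.0


# Hypothetical toy coordinates, not real data
dhs = [(100, 400), (1000, 1300), (5000, 5200)]
tfbs = [(150, 170), (390, 410), (1100, 1120), (3000, 3020)]
print(fraction_within(tfbs, dhs))  # 0.5 (2 of the 4 toy sites fall inside a DHS)
```

Sorting once and using binary search keeps each lookup logarithmic, which matters when there are millions of sites.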
FAIRE-seq
• Like the opposite of ChIP-seq
• Cross-link the nucleosomes to the DNA
  – But not the sequence-specific TFs
• Shear the DNA into small pieces
• Remove the protein-bound DNA
• Sequence the non-bound DNA
Gaulton KJ et al, Nature Genetics 42, 255–259 (2010) doi:10.1038/ng.530
DNA methylation
• CpG methylation: regulates gene expression
  – In promoters: gene repression
  – In genes: gene transcription
• 1,200,000 methylated CpGs in 82 cell lines and tissues
  – 96% differentially methylated, especially those in genes
• Unmethylated genic CpG islands associated with binding of P300, an enhancer-related histone acetyltransferase
• Allele-specific methylation: genomic imprinting
• Aberrant methylation in cancer cell lines
• Reproducible methylation outside CpG dinucleotides
http://www.diagenode.com/en/applications/bisulfite-conversion.php
Chromosome conformation capture
Montavon and Duboule, Trends in Cell Biology (2012) 22:7, 347–354
Results: Chromosome interactions
• Chromosome conformation capture (3C):
  – 5C: 3C-carbon copy
  – ChIA-PET
• Identified 127,417 promoter-centred chromatin interactions using ChIA-PET
  – 98% intra-chromosomal
• 2,324 promoters involved in ‘single-gene’ enhancer–promoter interactions
• 19,813 promoters involved in ‘multi-gene’ interaction complexes spanning up to several megabases
• 50–60% of long-range interactions occurred in only one of the four cell lines
• Further discussion: Li Yanjian, 10:40
Primary Findings
• 80.4% of the human genome is doing at least one of the following:
  – Bound by a transcription factor
  – Transcribed
  – Modified histone
• 99% is within 1.7 kb of at least one of the biochemical events
• 95% is within 8 kb of a DNA–protein interaction or DNase I footprint
• 7 chromatin states:
  – 399,124 enhancer-like regions
  – 70,292 promoter-like regions
• Correlation between transcription, chromatin marks, and TF binding
• Functional regions contain lots of SNPs
  – Disease-associated SNPs in non-coding regions tend to be in functional elements
End of Introduction
Summary of ENCODE elements
• 80.4% of the human genome is covered by at least one ENCODE-identified element
• 62% of the genome is transcribed
• 56% of the genome is associated with histone modifications
• Excluding RNA elements and broad histone elements, 44.2% of the genome is covered
  – Open chromatin (15.2%)
  – Transcription factor binding (8.1%)
  – 19.4% DHS or transcription factor ChIP-seq peaks across all cell lines
• 8.5% of bases are covered by either a transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%)
  – 4.5x the amount of protein-coding exons (1.2%)
  – 2x the amount of conserved sequence between mammals
• Estimate: 50% of DHS remain to be found
  – Based on saturation curves
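The motif and footprint percentages above do not add up to the 8.5% union because the two sets share bases. A minimal sketch of the inclusion-exclusion step, using only the three figures from the slide (the function name is illustrative):

```python
def overlap_pct(a_pct: float, b_pct: float, union_pct: float) -> float:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|, in percent of the genome."""
    return a_pct + b_pct - union_pct


motif, footprint, either = 4.6, 5.7, 8.5  # percentages from the slide
both = overlap_pct(motif, footprint, either)
print(f"Bases covered by both a TFBS motif and a DHS footprint: {both:.1f}% of the genome")
```

So about 1.8% of the genome is covered by both a binding-site motif and a DHS footprint.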
Diversity vs Conservation: Interactive Figure
[Figure axes: Conservation, Diversity]
A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011)
Conservation in Bound Motifs vs Unbound Motifs
[Figure axes: Conservation, Diversity]
http://www.nature.com/encode/interactive-figures/nature11247_F1
Model of gene expression – histone marks
Model of gene expression – TF binding
Transcription factor co-associations
Seven major classes of genome states
Label | Description | Details
CTCF | CTCF-enriched element | CTCF signal, no histone modifications, open chromatin, may have insulator function, enriched for cohesin components RAD21 and SMC3
E | Predicted enhancer | Open chromatin, H3K4me1, other enhancer-associated marks, enriched for EP300, FOS, FOSL1, GATA2, HDAC8, JUNB, JUND, NFE2, SMARCA4, SMARCB1, SIRT6 and TAL1 sites, nuclear and whole-cell RNA poly(A) signal
PF | Predicted promoter flanking | Regions that generally surround TSS segments
R | Predicted repressed | H3K27me3 polycomb-enriched regions, REST, BRF2, CEBPB, MAFK, TRIM28, ZNF274 and SETDB1 sites or no signal at all
TSS | Predicted promoter including TSS | H3K4me3, open chromatin, Pol II, Pol III, short RNAs, close to TSS sites
T | Predicted transcribed | H3K36me3 transcriptional elongation signal, overlap with gene bodies, phosphorylated Pol II, cytoplasmic poly(A)+ RNA
WE | Weak enhancer | Similar to the E state, but weaker signals and weaker enrichments
Data integration and genome segmentation
[Figure labels: Transcribed, Repressed, TSS, Enhancer]
Association between genome states and annotations
[Figure: two panels, Transcription factors vs. genome segment and RNA expression vs. genome segment]
Enhancer validation in mouse and fish
Enhancer from K562 cells (leukemia) drives a basal promoter with a reporter gene in embryonic mouse blood cells and medaka fish
Genome segment clustering
6 cell types
Genome cluster function
Genome state is related to gene function
Allele-specific expression
[Figure labels: Pol II, Txn, Rpn]
Correlation of allele-specific signal
[Figure panels: by gene; by genomic segment]
Genome-wide association studies
[Figure labels: Annotated disease-causing SNPs; Control SNPs; Selected TFBS tracks; Diseases; Significant overlap]
No genes, but several TFBS near the disease-causing SNPs
Conclusions
• 80% of the human genome annotated with at least one association
  – Protein-binding
  – Histone modification
  – Transcription
• ENCODE data combination
  – Model gene expression
  – Genome segmented into 7 types
    • Different in each cell line
• ENCODE data combined with other data
  – 1000 Genomes: see influence of parental DNA
  – Genome-wide association studies
Discussion
• 147 types of cells, and the human body has a few thousand
• 80% functional: controversial
  – 80% of the genome is being transcribed and/or has a protein bound to it some of the time
  – Heterochromatin: tightly packed repeat sequences
  – Most of that activity isn’t particularly specific or interesting and may not have impact
  – Important not to overstate the findings
  – Ewan Birney: “cumulative occupation of 8% of the genome by TFs”
• Reproducibility
  – In exactly the same cell lines, same conditions, different time or place
  – Same cell lines, different conditions
  – Same cell type, different people
• Cell lines vs tissue
• Cancer vs normal
http://blogs.nature.com/news/2012/09/fighting-about-encode-and-junk.html
http://blogs.discovermagazine.com/notrocketscience/2012/09/05/encode-the-rough-guide-to-the-human-genome/
Applications
• Visible as genome tracks in UCSC
• Mutation from
  – Cancer sequencing
  – GWAS
  – Find out what that part of the genome is doing
• Compare with your cancer data (RNA-seq)
• Comparative genome analysis
• Gene or pathway of interest
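"Find out what that part of the genome is doing" amounts to looking up which annotated segment a mutation falls in. The sketch below does this with a binary search over a toy segmentation. The coordinates are hypothetical, the state labels are borrowed from the seven-state table earlier in the talk, and real segmentation tracks would be loaded from files rather than hard-coded.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical segmentation of one chromosome: (start, end, state),
# half-open intervals, sorted and non-overlapping.
SEGMENTS = [
    (0, 1_000, "R"),
    (1_000, 1_400, "PF"),
    (1_400, 1_800, "TSS"),
    (1_800, 6_000, "T"),
    (6_000, 6_500, "E"),
]
STARTS = [start for start, _, _ in SEGMENTS]


def state_at(pos: int) -> Optional[str]:
    """Return the state label of the segment containing pos, or None if unannotated."""
    i = bisect_right(STARTS, pos) - 1  # last segment starting at or before pos
    if i >= 0 and pos < SEGMENTS[i][1]:
        return SEGMENTS[i][2]
    return None


for snp_pos in (1_500, 6_200, 7_000):  # toy mutation positions
    print(snp_pos, state_at(snp_pos))
```

A mutation landing in an "E" or "TSS" segment would be a candidate regulatory variant, which is the kind of follow-up the bullet list above suggests.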
Online Resources
• Interactive graphics in online version of paper
• Interactive app on Nature ENCODE main page
www.nature.com/encode/