Monica C. Sleumer ( 苏漠 ) 2012-09-19


Human Genome

• 3,101,804,739 base pairs
• 22 chromosomes plus X and Y
• 21,224 protein-coding genes
• 15,952 ncRNA genes
• 3–8% of bases are under selection
  – From comparative genomic studies

• Question: What is the genome doing?
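To put the "3–8% of bases are under selection" figure in absolute terms, the range can be converted to base pairs. This is a minimal arithmetic sketch using only the genome size quoted on this slide; the function name is illustrative.

```python
# Genome size as quoted on the slide (base pairs).
GENOME_SIZE_BP = 3_101_804_739


def bases_for_percent(genome_size_bp: int, percent: float) -> int:
    """Return the number of bases that a given percentage of the genome represents."""
    return round(genome_size_bp * percent / 100)


low_bp = bases_for_percent(GENOME_SIZE_BP, 3)
high_bp = bases_for_percent(GENOME_SIZE_BP, 8)
print(f"Under selection: roughly {low_bp:,} to {high_bp:,} bp")
```

So the constrained fraction corresponds to somewhere between about 93 and 248 million bases.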

Objectives
• Find all functional elements
  – Bound by specific proteins
  – Transcribed
  – Histone modifications
  – DNA methylation
• Use this information to annotate functional regions
  – Genes (coding and non-coding)
  – Promoters
  – Enhancers
  – Specific transcription factor binding sites
  – Silencers
  – Insulators
  – Chromatin states
• Cross-reference data from other studies
  – Comparative genomics
  – 1000 Genomes Project
  – Genome-wide association studies (GWAS)

ENCODE projects
• ENCODE pilot project: 1% of the genome, 2003–2007
• modENCODE: Drosophila and C. elegans
• ENCODE main project, 2007–2012
  – 1649 dataset-generating experiments
  – 147 cell types
  – 235 antibodies and assay protocols
  – 450 authors
  – 32 institutes
• 31 publications on 2012-09-06
  – 6 in Nature
  – 18 in Genome Research
  – 6 in Genome Biology
  – 1 in BMC Genetics
www.nature.com/encode/category/research-papers

Materials
• 147 types of human cell lines, 3 priority levels
• Tier 1 cell lines: top priority for all experiments
• Tier 2 cell lines: to be done after Tier 1 (next slide)
• Tier 3: any other cell lines

Name | Description | Lineage | Tissue | Karyotype
GM12878 | B-lymphocyte, lymphoblastoid, Epstein-Barr Virus, 1000 Genomes Project | mesoderm | blood | normal
H1-hESC | embryonic stem cells | inner cell mass | embryonic stem cell | normal
K562 | leukemia, 53-year-old female with chronic myelogenous leukemia | mesoderm | blood | cancer

Tier 2 Cell Lines

Name | Description | Lineage | Tissue | Karyotype
A549 | lung carcinoma epithelium, 58-year-old Caucasian male | endoderm | epithelium | cancer
CD20+ | donor B cells: RO01778 and RO01794 | mesoderm | blood | normal
CD20+_RO01778 | B cells, Caucasian | mesoderm | blood | normal
CD20+_RO01794 | B cells, African American | mesoderm | blood | normal
H1-neurons | neurons derived from H1 embryonic stem cells | ectoderm | neurons | normal
HeLa-S3 | cervical carcinoma | ectoderm | cervix | cancer
HepG2 | hepatocellular carcinoma | endoderm | liver | cancer
HUVEC | umbilical vein endothelial cells | mesoderm | blood vessel | normal
IMR90 | fetal lung fibroblasts | endoderm | lung | normal
LHCN-M2 | skeletal myoblasts from pectoralis major muscle, 41-year-old Caucasian | mesoderm | skeletal muscle myoblast |
MCF-7 | mammary gland, adenocarcinoma | ectoderm | breast | cancer
Monocytes-CD14+ | Monocytes-CD14+, leukapheresis from RO 01746 and RO 01826 | mesoderm | monocytes | normal
SK-N-SH | neuroblastoma, 4-year-old | ectoderm | brain | cancer

http://encodeproject.org/ENCODE/cellTypes.html

Methods

Method | Description
RNA-Seq | Different fractions of RNA -> sequencing
CAGE | 5′ capped RNA sequencing
RNA-PET | Sequencing 5′ cap plus poly-A tail
ChIP-seq | Chromatin immunoprecipitation of a DNA binding protein -> sequencing
DNase-seq | Cut exposed DNA with DNase I -> sequencing
FAIRE-seq | Nucleosome-depleted DNA -> sequencing
RRBS | Bisulphite treatment: unmethylated C -> U -> sequencing
3C, 5C, ChIA-PET | Chromatin interactions -> sequencing
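The bisulphite step in RRBS can be illustrated with a toy simulation. This is a sketch only, not the ENCODE pipeline: it assumes a single-stranded read, assumes that converted U is read as T after amplification, and uses made-up sequence and methylation positions.

```python
def bisulfite_convert(seq: str, methylated_positions: set[int]) -> str:
    """Simulate bisulphite treatment plus sequencing on one strand.

    Unmethylated C is converted (C -> U, read as T); methylated C is protected.
    """
    converted = []
    for index, base in enumerate(seq):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylated(reference: str, converted: str) -> list[int]:
    """Positions where the reference has C and the converted read still shows C."""
    return [
        index
        for index, (ref_base, read_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGA"   # hypothetical reference fragment
methylated = {1}         # hypothetical: only the C at index 1 is methylated
read = bisulfite_convert(reference, methylated)
print(read, call_methylated(reference, read))
```

Comparing the converted read with the reference recovers the methylated positions, which is the principle behind the method.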

Results: RNA Sequencing

• 62% of the genome is transcribed into sequences >200 bp long
  – 5.5% of this is exon
  – 31% is intergenic (no annotated gene)
  – Remaining: intronic
• CAGE-seq: 62,403 TSS
  – 44% within 100 bp of the 5′ end of a GENCODE gene
  – Others: exons and 3′ UTRs, significance unknown
• Lots of short ncRNAs: tRNA, miRNA, snRNA, etc.
• Further description: Wu Dingming, 9:30

Results: Transcribed and protein-coding regions

• GENCODE reference gene set
  – 20,687 protein-coding genes
    • 6.3 alternatively spliced transcripts on average
    • 3.9 protein isoforms on average
    • Protein-coding exons: 1.22% of the genome
    • Still more to come: unidentified peptides in mass-spec
  – 18,441 ncRNA genes
    • 8,801 short ncRNA
    • 9,640 long ncRNA
  – 11,224 pseudogenes
    • 863 transcribed
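As a quick consistency check on the GENCODE counts above, the ncRNA subtotals can be summed and the share of transcribed pseudogenes worked out. This sketch uses only the numbers on this slide; the variable names are illustrative.

```python
# Counts as quoted on the slide.
short_ncrna = 8_801
long_ncrna = 9_640
ncrna_total_reported = 18_441
pseudogenes = 11_224
transcribed_pseudogenes = 863

# The two ncRNA classes should add up to the reported total.
ncrna_total_computed = short_ncrna + long_ncrna

# Share of pseudogenes with evidence of transcription, in percent.
transcribed_pct = round(100 * transcribed_pseudogenes / pseudogenes, 1)

print(ncrna_total_computed == ncrna_total_reported, transcribed_pct)
```

The subtotals agree with the reported total, and a little under 8% of pseudogenes are transcribed.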

ChIP-Seq

Acronym | Description | Factors analysed
ChromRem | ATP-dependent chromatin complexes | 5
DNARep | DNA repair | 3
HISase | Histone acetylation, deacetylation or methylation complexes | 8
Other | Cyclin kinase associated with transcription | 1
Pol2 | Pol II subunit | 1 (2 forms)
Pol3 | Pol III-associated | 6
TFNS | General Pol II-associated factor, not site-specific | 8
TFSS | Pol II transcription factor with sequence-specific DNA binding | 87
Total | | 119

www.illumina.com/technology/chip_seq_assay.ilmn

ChIP-Seq: Histone modifications

Histone modification or variant | Signal characteristics | Association
H2A.Z | Peak | dynamic chromatin
H3K4me1 | Peak/region | enhancers and other distal elements, also downstream of transcription starts
H3K4me2 | Peak | promoters and enhancers
H3K4me3 | Peak | promoters/transcription starts
H3K9ac | Peak | promoters
H3K9me1 | Region | 5′ end of genes
H3K9me3 | Peak/region | Gene repression, constitutive heterochromatin and repetitive elements
H3K27ac | Peak | Gene expression, active enhancers and promoters
H3K27me3 | Region | polycomb complex, repressive domains and silent developmental genes
H3K36me3 | Region | Elongation, transcribed portions of genes, 3′ regions after intron 1
H3K79me2 | Region | Transcription, 5′ end of genes
H4K20me1 | Region | 5′ end of genes

Results: ChIP-Seq

• 636,336 binding regions
• 8.1% of the genome
• Sequence-specific TF ChIP-seq:
  – 86% of the DNA segments occupied by sequence-specific transcription factors contained a strong DNA-binding motif
  – 55% of cases contained the expected motif
• Further description: Qin Zhiyi & Ma Xiaopeng, 13:30

DNase I hypersensitivity

• 2,890,000 unique hypersensitive sites (DHSs)
• 4,800,000 sites across 25 cell types
• Tier 1 and tier 2 cell types: 205,109 DHSs per cell type
• 98.5% of ChIP-seq TFBS within DHSs
• Further description: Guo Weilong 12:30, He Chao 14:30

https://www.nationaldiagnostics.com/electrophoresis/article/dnase-i-footprinting
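A statistic such as "98.5% of ChIP-seq TFBS within DHSs" comes down to an interval-containment count. The following is a minimal sketch of that idea, not the ENCODE analysis code: it assumes half-open (chromosome, start, end) intervals, uses made-up coordinates, and uses a simple all-pairs scan rather than an indexed interval structure.

```python
Interval = tuple[str, int, int]  # (chromosome, start, end), half-open


def is_within(site: Interval, region: Interval) -> bool:
    """True if the site lies entirely inside the region on the same chromosome."""
    return site[0] == region[0] and region[1] <= site[1] and site[2] <= region[2]


def fraction_within(sites: list[Interval], regions: list[Interval]) -> float:
    """Fraction of sites fully contained in at least one region."""
    if not sites:
        return 0.0
    hits = sum(1 for site in sites if any(is_within(site, r) for r in regions))
    return hits / len(sites)


# Hypothetical toy data, for illustration only.
dhs_regions = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250)]
tf_sites = [
    ("chr1", 150, 170),    # inside the first DHS
    ("chr1", 1190, 1210),  # overhangs the second DHS, so not counted
    ("chr2", 60, 80),      # inside the chr2 DHS
    ("chr3", 10, 30),      # no DHS on this chromosome
]
tfbs_in_dhs = fraction_within(tf_sites, dhs_regions)
print(f"{tfbs_in_dhs:.0%} of toy TF sites fall within a DHS")
```

Whether a partially overlapping site counts as "within" is a design choice; here full containment is required.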

FAIRE-seq
• Like the opposite of ChIP-seq
• Cross-link the nucleosomes to the DNA
  – But not the sequence-specific TFs
• Shear the DNA into small pieces
• Remove the protein-bound DNA
• Sequence the non-bound DNA

Gaulton KJ et al, Nature Genetics 42, 255–259 (2010) doi:10.1038/ng.530

http://www.nature.com/ng/journal/v42/n3/full/ng.530.html

DNA methylation
• CpG methylation: regulates gene expression
  – In promoters: gene repression
  – In genes: gene transcription
• 1,200,000 methylated CpGs in 82 cell lines and tissues
  – 96% differentially methylated, especially those in genes
• Unmethylated genic CpG islands associated with P300 binding, an enhancer-related histone acetyltransferase
• Allele-specific methylation: genomic imprinting
• Aberrant methylation in cancer cell lines
• Reproducible methylation outside CpG dinucleotides

http://www.diagenode.com/en/applications/bisulfite-conversion.php

Chromosome conformation capture

Montavon and Duboule, Trends in Cell Biology (2012) 22:7, 347–354

http://www.sciencedirect.com/science/article/pii/S0962892412000578


http://www.sciencedirect.com/science/journal/09628924/22/7

Results: Chromosome interactions
• Chromosome conformation capture (3C):
  – 5C: 3C-carbon copy
  – ChIA-PET
• Identified 127,417 promoter-centred chromatin interactions using ChIA-PET
  – 98% intra-chromosomal
• 2,324 promoters involved in ‘single-gene’ enhancer–promoter interactions
• 19,813 promoters involved in ‘multi-gene’ interaction complexes spanning up to several megabases
• 50–60% of long-range interactions occurred in only one of the four cell lines
• Further discussion: Li Yanjian, 10:40
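The "98% intra-chromosomal" figure is a simple classification of interaction pairs by whether both anchors sit on the same chromosome. A minimal sketch of that classification follows; the anchor coordinates are invented toy data and the representation (a pair of (chromosome, position) anchors) is an assumption, not the ChIA-PET file format.

```python
from collections import Counter

Anchor = tuple[str, int]             # (chromosome, position)
Interaction = tuple[Anchor, Anchor]  # two anchors joined by a chromatin contact


def classify(interaction: Interaction) -> str:
    """Label an interaction as intra- or inter-chromosomal."""
    (chrom_a, _), (chrom_b, _) = interaction
    return "intra" if chrom_a == chrom_b else "inter"


# Hypothetical toy interactions, for illustration only.
interactions = [
    (("chr1", 1_000_000), ("chr1", 1_250_000)),
    (("chr1", 5_000_000), ("chr1", 8_000_000)),
    (("chr2", 300_000), ("chr7", 900_000)),
    (("chrX", 40_000), ("chrX", 95_000)),
]

counts = Counter(classify(i) for i in interactions)
intra_fraction = counts["intra"] / len(interactions)
print(f"intra: {counts['intra']}, inter: {counts['inter']}, fraction intra: {intra_fraction:.2f}")
```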

Primary Findings
• 80.4% of the human genome is doing at least one of the following:
  – Bound by a transcription factor
  – Transcribed
  – Modified histone
• 99% is within 1.7 kb of at least one of the biochemical events
• 95% is within 8 kb of a DNA–protein interaction or DNase I footprint
• 7 chromatin states:
  – 399,124 enhancer-like regions
  – 70,292 promoter-like regions
• Correlation between transcription, chromatin marks, and TF binding
• Functional regions contain lots of SNPs
  – Disease-associated SNPs in non-coding regions tend to be in functional elements

End of Introduction

Summary of ENCODE elements
• 80.4% of the human genome is covered by at least one ENCODE-identified element
• 62% of the genome is transcribed
• 56% of the genome is associated with histone modifications
• Excluding RNA elements and broad histone elements, 44.2% of the genome is covered
  – open chromatin (15.2%)
  – transcription factor binding (8.1%)
  – 19.4% DHS or transcription factor ChIP-seq peaks across all cell lines
• 8.5% of bases are covered by either a transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%)
  – 4.5x the amount of protein-coding exons (1.2%)
  – 2x the amount of conserved sequence between mammals
• Estimate: 50% of DHS remain to be found
  – Based on saturation curves
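The coverage percentages above do not simply add up because the element classes overlap. For the motif and footprint figures, the overlap implied by the slide's numbers follows from inclusion-exclusion: |A ∩ B| = |A| + |B| − |A ∪ B|. This sketch assumes the 8.5% figure is the union of the two sets, as the wording "covered by either" suggests.

```python
def implied_overlap(pct_a: float, pct_b: float, pct_union: float) -> float:
    """Overlap implied by inclusion-exclusion, in percent of the genome."""
    return round(pct_a + pct_b - pct_union, 1)


motif_pct = 4.6      # bases in a TF-binding-site motif
footprint_pct = 5.7  # bases in a DHS footprint
either_pct = 8.5     # bases in a motif or a footprint

both_pct = implied_overlap(motif_pct, footprint_pct, either_pct)
print(f"Implied overlap of motifs and footprints: {both_pct}% of the genome")
```

On these figures, about 1.8% of the genome lies in both a motif and a footprint.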

Diversity vs Conservation: Interactive Figure

[Figure: axes labelled Conservation and Diversity]

A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011)

Conservation in Bound Motifs vs Unbound Motifs

[Figure: axes labelled Conservation and Diversity]

http://www.nature.com/encode/interactive-figures/nature11247_F1

Model of gene expression – histone marks

Model of gene expression – TF binding

Transcription factor co-associations

Seven major classes of genome states

Label | Description | Details
CTCF | CTCF-enriched element | CTCF signal, no histone modifications, open chromatin, may have insulator function, enriched for cohesin components RAD21 and SMC3
E | Predicted enhancer | Open chromatin, H3K4me1, other enhancer-associated marks, enriched for EP300, FOS, FOSL1, GATA2, HDAC8, JUNB, JUND, NFE2, SMARCA4, SMARCB1, SIRT6 and TAL1 sites, nuclear and whole-cell RNA poly(A) signal
PF | Predicted promoter flanking | Regions that generally surround TSS segments
R | Predicted repressed | H3K27me3 polycomb-enriched regions, REST, BRF2, CEBPB, MAFK, TRIM28, ZNF274 and SETDB1 sites or no signal at all
TSS | Predicted promoter including TSS | H3K4me3, open chromatin, Pol II, Pol III, short RNAs, close to TSS sites
T | Predicted transcribed | H3K36me3 transcriptional elongation signal, overlap with gene bodies, phosphorylated Pol II, cytoplasmic poly(A)+ RNA
WE | Weak enhancer | Similar to the E state, but weaker signals and weaker enrichments

Data integration and genome segmentation

[Figure: panels labelled Transcribed, Repressed, TSS, Enhancer]

Association between genome states and annotations

[Figure: panels for transcription factors and RNA expression, each plotted by genome segment]

Enhancer validation in mouse and fish

An enhancer from K562 cells (leukemia) drives a basal promoter with a reporter gene in embryonic mouse blood cells and in medaka fish

Genome segment clustering

6 cell types

Genome cluster function

Genome state is related to gene function

Allele-specific expression

[Figure: tracks labelled Pol II, Txn, Rpn]

Correlation of allele-specific signal

[Figure: panels by gene and by genomic segment]

Genome-wide association studies

[Figure: annotated disease-causing SNPs and control SNPs compared against selected TFBS tracks, by disease; significant overlap marked]

No genes, but several TFBS near the disease-causing SNPs

Conclusions
• 80% of human genome annotated with at least one association
  – Protein-binding
  – Histone modification
  – Transcription
• ENCODE data combination
  – Model gene expression
  – Genome segmented into 7 types
    • Different in each cell line
• ENCODE data combined with other data
  – 1000 Genomes: see influence of parental DNA
  – Genome-wide association studies

Discussion
• 147 types of cells, and the human body has a few thousand
• 80% functional: controversial
  – 80% of the genome is being transcribed and/or has a protein bound to it some of the time
  – Heterochromatin: tightly packed repeat sequences
  – Most of that activity isn’t particularly specific or interesting and may not have impact
  – Important not to overstate the findings
  – Ewan Birney: “cumulative occupation of 8% of the genome by TFs”
• Reproducibility
  – In exactly the same cell lines, same conditions, different time or place
  – Same cell lines, different conditions
  – Same cell type, different people
• Cell lines vs tissue
• Cancer vs normal

http://blogs.nature.com/news/2012/09/fighting-about-encode-and-junk.html

http://blogs.discovermagazine.com/notrocketscience/2012/09/05/encode-the-rough-guide-to-the-human-genome/


Applications

• Visible as genome tracks in the UCSC Genome Browser
• Mutation from
  – Cancer sequencing
  – GWAS
  – Find out what that part of the genome is doing
• Compare with your cancer data (RNA-seq)
• Comparative genome analysis
• Gene or pathway of interest

Online Resources
• Interactive graphics in online version of paper
• Interactive app on Nature ENCODE main page

www.nature.com/encode/
