New insights into the human genome by ENCODE project


It has been ten years since scientists sequenced the human genome. But what do all these letters mean? Among its 3 billion letters, researchers could identify many of the regions that code for proteins, but those make up little more than 1% of the genome, contained in around 21,000 genes: a few familiar objects in an otherwise stark and unrecognizable landscape. Many biologists suspected that the information responsible for the wondrous complexity of humans lay somewhere in the ‘deserts’ between the genes (The ENCODE Project Consortium, 2012). Interpreting the human genome sequence is one of the leading challenges of 21st century biology (Collins et al., 2003). In 2003, the National Human Genome Research Institute (NHGRI) embarked on an ambitious project, the Encyclopedia of DNA Elements (ENCODE), aiming to delineate all of the functional elements encoded in the human genome sequence (The ENCODE Project Consortium, 2004). To further this goal, NHGRI organized the ENCODE Consortium, an international group of investigators with diverse backgrounds and expertise in the production and analysis of high-throughput functional genomic data. In a pilot project phase spanning 2003–2007, the Consortium applied and compared a variety of experimental and computational methods to annotate functional elements in a defined 1% of the human genome (The ENCODE Project Consortium, 2007).

Transcript of New insights into the human genome by ENCODE project

Page 1: New insights into the human genome by ENCODE project

New insights into the human genome by ENCODE

Page 2: New insights into the human genome by ENCODE project

What is a gene?

1860s–1900s: Gene as a discrete unit of heredity

1910s: Gene as a distinct locus

1940s: Gene as a blueprint for a protein

1950s: Gene as a physical molecule

1960s: Gene as transcribed code

1970s–1980s: Gene as open reading frame (ORF) sequence pattern

1990s–2000s: Annotated genomic entity, enumerated in the databanks (current view, pre-ENCODE)

ENCODE

• Union of genomic sequences encoding a coherent set of potentially overlapping functional products.

(Gerstein et al., 2007)

Page 3: New insights into the human genome by ENCODE project

It's been ten years since scientists sequenced the human genome

But what do all these letters mean?

Page 4: New insights into the human genome by ENCODE project

21,000 genes

Page 5: New insights into the human genome by ENCODE project

ENCODE, the Encyclopedia of DNA Elements, has answers

Aiming to delineate all of the functional elements encoded in the human genome sequence

Page 6: New insights into the human genome by ENCODE project

ENCODE Consortium

(The ENCODE Project Consortium, 2011)

Page 7: New insights into the human genome by ENCODE project

Pilot phase: 2003-2007

Technology development phase

Production phase: 2007-2012, 30 papers

Page 8: New insights into the human genome by ENCODE project
Page 9: New insights into the human genome by ENCODE project

ENCODE

Major methods

Data production and initial analysis

Accessing ENCODE data

Working with ENCODE data

Data analysis

Limitations

Threads – Nature explorer

Page 10: New insights into the human genome by ENCODE project

Major Methods

(The ENCODE Project Consortium, 2004)

Page 11: New insights into the human genome by ENCODE project

Overall data flow

(The ENCODE Project Consortium, 2011)

Page 12: New insights into the human genome by ENCODE project

(The ENCODE Project Consortium, 2011)

Page 13: New insights into the human genome by ENCODE project

RNA-seq – Isolation of RNA sequences followed by high-throughput sequencing

CAGE – Capture of the methylated cap at the 5' end of RNA, followed by high-throughput sequencing

RNA-PET – Simultaneous capture of RNAs with both a 5' methyl cap and a poly(A) tail

ChIP-seq – Chromatin immunoprecipitation followed by sequencing

FAIRE-seq – Formaldehyde-assisted isolation of regulatory elements: crosslinking, phenol extraction, and sequencing of the DNA fragments in the aqueous phase

Page 14: New insights into the human genome by ENCODE project

(The ENCODE Project Consortium, 2011)

Page 15: New insights into the human genome by ENCODE project

ENCODE cell types

(The ENCODE Project Consortium, 2011)

Page 16: New insights into the human genome by ENCODE project

ENCODE data production and initial analyses

• Since 2007, ENCODE has developed methods and performed a large number of sequence-based studies to map functional elements across the human genome.

• The elements mapped (and approaches used) include:

RNA-transcribed regions (RNA-seq, CAGE, RNA-PET and manual annotation),

Protein-coding regions (mass spectrometry),

Transcription-factor-binding sites (ChIP-seq and DNase-seq),

Chromatin structure (DNase-seq, FAIRE-seq, histone ChIP-seq),

DNA methylation sites (RRBS assay)

(The ENCODE Project Consortium, 2012)

Page 17: New insights into the human genome by ENCODE project

Transcribed and protein-coding regions

• In total, GENCODE-annotated exons of protein-coding genes cover 2.94% of the genome, or 1.22% for protein-coding exons.

• Protein-coding genes span 33.45% of the genome from the outermost start to stop codons, or 39.54% from promoter to poly(A) site.

• Additional protein-coding genes remain to be found.

• In addition, GENCODE annotated 8,801 automatically derived small RNAs and 9,640 manually curated long non-coding RNA (lncRNA) loci.

• GENCODE also annotated 11,224 pseudogenes.

(The ENCODE Project Consortium, 2012)

Page 18: New insights into the human genome by ENCODE project

Process flow of experimental evaluation of pseudogene transcription

Experimental validation results showing the transcription of pseudogenes in different tissues

(Pei et al., 2012)

Page 19: New insights into the human genome by ENCODE project

ENCODE gene and transcript annotations.

(The ENCODE Project Consortium, 2011)

Page 20: New insights into the human genome by ENCODE project

RNA

• They sequenced RNA from different cell lines and multiple subcellular fractions to develop an extensive RNA expression catalogue.

• They used CAGE-seq (5' cap-targeted RNA isolation and sequencing) to identify 62,403 transcription start sites (TSSs) in tier 1 and 2 cell types.

(The ENCODE Project Consortium, 2012)

Page 21: New insights into the human genome by ENCODE project

A large majority of GENCODE elements are detected by RNA-seq data

(Djebali et al., 2012)

Page 22: New insights into the human genome by ENCODE project

Protein bound regions

• ChIP-seq was used to map the binding locations of 119 different DNA-binding proteins and a number of RNA polymerase components in 72 cell types.

• Overall, 636,336 binding regions covering 231 megabases (8.1%) of the genome are enriched for regions bound by DNA-binding proteins across all cell types.

(The ENCODE Project Consortium, 2012)
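As a quick sanity check on the figures above, the short Python sketch below works out what they imply. Only the three input numbers come from the slide; the implied genome length and the mean region width are derived values, and the mean assumes that the 231 megabases is the merged footprint of all 636,336 regions.

```python
# Figures quoted on the slide (The ENCODE Project Consortium, 2012).
binding_regions = 636_336
covered_bases = 231_000_000   # 231 megabases
covered_fraction = 0.081      # 8.1% of the genome

# Genome length implied by "231 Mb is 8.1%" (derived, not stated on the slide).
implied_genome_bases = covered_bases / covered_fraction

# Mean width of one bound region, assuming 231 Mb is the merged footprint.
mean_region_bp = covered_bases / binding_regions

print(f"implied genome length: {implied_genome_bases / 1e9:.2f} Gb")
print(f"mean region width: {mean_region_bp:.0f} bp")
```

Under these assumptions the implied genome length is about 2.85 Gb and the mean bound region is a few hundred base pairs wide.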

Page 23: New insights into the human genome by ENCODE project

Occupancy of transcription factors and RNA polymerase 2 on human chromosome 6p as determined by ChIP-seq

Page 24: New insights into the human genome by ENCODE project

(The ENCODE Project Consortium, 2011)

Page 25: New insights into the human genome by ENCODE project

DNase I hypersensitive sites and footprinting

• Chromatin accessibility characterized by DNase I hypersensitivity is the hallmark of regulatory DNA regions.

• 2.89 million unique, non-overlapping DNase I hypersensitive sites (DHSs) were mapped by DNase-seq in 125 cell types; most lie distal to TSSs.

• In tier 1 and tier 2 cell types, an average of 205,109 DHSs were identified per cell type, encompassing an average of 1.0% of the genomic sequence in each cell type, and 3.9% in aggregate.

(The ENCODE Project Consortium, 2012)

Page 26: New insights into the human genome by ENCODE project

Density of DNase I cleavage sites for selected cell types

(Thurman et al., 2012)

Page 27: New insights into the human genome by ENCODE project

• On average, 98.5% of the occupancy sites of transcription factors mapped by ENCODE ChIP-seq lie within accessible chromatin defined by DNase I hotspots.

• Using genomic DNase I footprinting on 41 cell types, they identified 8.4 million distinct DNase I footprints.

(The ENCODE Project Consortium, 2012)

Page 28: New insights into the human genome by ENCODE project

Regions of histone modification

• They assayed chromosomal locations for up to 12 histone modifications and variants in 46 cell types, across tier 1 and 2.

(The ENCODE Project Consortium, 2012) (http://www.factorbook.org)

Page 29: New insights into the human genome by ENCODE project

DNA methylation

• They used reduced representation bisulphite sequencing (RRBS) to profile DNA methylation quantitatively for an average of 1.2 million CpGs in each of 82 cell lines and tissues (8.6% of non-repetitive genomic CpGs), including CpGs in intergenic regions, proximal promoters and intragenic regions.

(The ENCODE Project Consortium, 2012)

Page 30: New insights into the human genome by ENCODE project

Proteomics

To assess putative protein products generated from novel RNA transcripts and isoforms, proteins are sequenced and quantified by mass spectrometry and mapped back to their encoding transcripts.

Protein studies have begun in the K562 and GM12878 cell lines.

(The ENCODE Project Consortium, 2011)

Page 31: New insights into the human genome by ENCODE project

ENCODE chromatin annotations in the HLA locus

(The ENCODE Project Consortium, 2011)

Page 32: New insights into the human genome by ENCODE project

Accessing ENCODE Data

ENCODE Data Release and Use Policy

• The ENCODE Data Release and Use Policy is described at http://www.encodeproject.org/ENCODE/terms.html.

• ENCODE data are released for viewing in a publicly accessible browser (initially at http://genome-preview.ucsc.edu/ENCODE and, after additional quality checks, at http://encodeproject.org).

Public Repositories

• UCSC Genome Browser database (http://genome.ucsc.edu).

(The ENCODE Project Consortium, 2011)

Page 33: New insights into the human genome by ENCODE project

UCSC Portal

Page 34: New insights into the human genome by ENCODE project

Working with ENCODE Data

Using ENCODE Data in the UCSC Browser

• Many users will want to view and interpret the ENCODE data for particular genes of interest. At the online ENCODE portal (http://encodeproject.org), users should follow a "Genome Browser" link to visualize the data in the context of other genome annotations.

(The ENCODE Project Consortium, 2011)

Page 35: New insights into the human genome by ENCODE project

ENCODE Data Analysis

• Development and implementation of algorithms and pipelines for processing and analyzing data is a major activity of the ENCODE Project.

1st phase: Aligning short sequences to the reference genome

2nd phase: Identifying the enriched regions

3rd phase: Integrating the identified regions of enriched signal with each other and with other data types

(The ENCODE Project Consortium, 2011)
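The three phases listed above can be illustrated with a toy Python sketch. It is not the consortium's pipeline: real analyses use dedicated aligners and statistical peak callers. Here phase 1 is assumed to be done already (the input is a list of aligned read start positions on one chromosome), phase 2 is a plain read-count threshold on fixed-width bins, and phase 3 is a simple interval overlap between two assays. The function names, bin size, threshold and coordinates are all hypothetical.

```python
from collections import Counter

def call_enriched_bins(read_starts, bin_size=100, min_reads=3):
    """Phase 2 (toy): flag fixed-width bins holding at least min_reads reads."""
    counts = Counter(pos // bin_size for pos in read_starts)
    hits = sorted(b for b, n in counts.items() if n >= min_reads)
    # Merge adjacent enriched bins into half-open (start, end) regions.
    regions = []
    for b in hits:
        start, end = b * bin_size, (b + 1) * bin_size
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions

def overlaps(regions_a, regions_b):
    """Phase 3 (toy): pairs of regions from two assays sharing at least one base."""
    return [(a, b) for a in regions_a for b in regions_b
            if a[0] < b[1] and b[0] < a[1]]

# Phase 1 output is assumed: made-up read start coordinates from two assays.
chip_reads = [105, 120, 150, 210, 230, 260, 900, 1510, 1520, 1530]
dnase_reads = [220, 240, 280, 1505, 1540, 1590, 3000]

chip_regions = call_enriched_bins(chip_reads)
dnase_regions = call_enriched_bins(dnase_reads)
print(chip_regions)    # [(100, 300), (1500, 1600)]
print(dnase_regions)   # [(200, 300), (1500, 1600)]
print(overlaps(chip_regions, dnase_regions))
```

The pairwise overlap is quadratic in the number of regions, which is fine for a demonstration but would need a sorted sweep or an interval index at genome scale.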

Page 36: New insights into the human genome by ENCODE project

Analysis tools applied by the ENCODE consortium

(The ENCODE Project Consortium, 2011)

Page 37: New insights into the human genome by ENCODE project

Integrating ENCODE with other projects and the Scientific Community

1. defining promoter and enhancer regions by combining transcript mapping and biochemical marks,

2. delineating distinct classes of regions within the genomic landscape by their specific combinations of biochemical and functional characteristics, and

3. defining transcription factor co-associations and regulatory networks.

(The ENCODE Project Consortium, 2011)

Page 38: New insights into the human genome by ENCODE project

• The ENCODE Project aids the interpretation of human genome variation that is associated with disease or quantitative phenotypes

• Integration with the 1000 Genomes Project shows how SNPs and structural variation may affect transcript, regulatory and DNA methylation data

• ENCODE data support GWAS and other sequence-variation-driven studies of human phenotypes

ENCODE is a major contributor not only of data but also of novel technologies for deciphering the human genome

(The ENCODE Project Consortium, 2011)
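One simple way to picture the integration described above is to ask which variants fall inside annotated regulatory regions. The Python sketch below does this with a binary search over sorted, non-overlapping regions. The function name and all coordinates are made up for illustration; it is a sketch of the general idea, not a method taken from the ENCODE or 1000 Genomes publications.

```python
from bisect import bisect_right

def snps_in_regions(snp_positions, regions):
    """Return SNP positions inside sorted, non-overlapping half-open (start, end) regions."""
    starts = [start for start, _ in regions]
    hits = []
    for pos in snp_positions:
        i = bisect_right(starts, pos) - 1   # last region starting at or before pos
        if i >= 0 and pos < regions[i][1]:
            hits.append(pos)
    return hits

regulatory_regions = [(1_000, 1_400), (5_200, 5_350), (9_000, 9_800)]  # made-up coordinates
snps = [950, 1_399, 1_400, 5_300, 7_000, 9_000]                        # made-up positions
print(snps_in_regions(snps, regulatory_regions))   # [1399, 5300, 9000]
```

Because the regions are half-open, position 1,400 is outside the first region while position 9,000 is inside the third.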

Page 39: New insights into the human genome by ENCODE project

Limitations of ENCODE Annotations

• Cell types are physiologically and genetically inhomogeneous.

• Local micro-environments in culture may also vary.

• The use of DNA sequencing to annotate functional genomic features is also constrained.

• There is considerable quantitative variation in signal strength along the genome.

(The ENCODE Project Consortium, 2011)

Page 40: New insights into the human genome by ENCODE project

Challenges

• The adult human body contains several hundred distinct cell types, each of which expresses a unique subset of the 1,800 TFs encoded in the human genome

• The brain alone contains thousands of types of neurons that are likely to express not only different sets of TFs but also a larger variety of non-coding RNAs

• A truly comprehensive atlas of human functional elements is not practical with current technologies

(The ENCODE Project Consortium, 2011)

Page 41: New insights into the human genome by ENCODE project

Outcome

• Understanding of the human genome

• The broad coverage of ENCODE annotations enhances our understanding of common diseases with a genetic component and of rare genetic diseases

• Only 119 of 1,800 known transcription factors and 13 of more than 60 currently known histone or DNA modifications have been sampled, across 147 cell types

• Overall, these data reflect a minor fraction of the potential functional information encoded in the human genome

(The ENCODE Project Consortium, 2012)

Page 42: New insights into the human genome by ENCODE project

http://www.nature.com/encode/#/threads

Page 43: New insights into the human genome by ENCODE project

13 Threads

1. Transcription factor motifs

2. Chromatin patterns at transcription factor binding sites

3. Characterization of intergenic regions and gene definition

4. RNA and chromatin modification patterns around promoters

5. Epigenetic regulation of RNA processing

6. Non-coding RNA characterization

7. DNA methylation

8. Enhancer discovery and characterization

9. Three-dimensional connections across the genome

10. Characterization of network topology

11. Machine learning approaches to genomics

12. Impact of functional information on understanding variation

13. Impact of evolutionary selection on functional regions

Page 44: New insights into the human genome by ENCODE project

Schematic overview of the functional SNP approach

(Schaub et al., 2012)

Page 45: New insights into the human genome by ENCODE project

Comparison of GWAS identified loci with ENCODE data

Page 46: New insights into the human genome by ENCODE project
Page 47: New insights into the human genome by ENCODE project

(Boyle et al., 2012)

Page 48: New insights into the human genome by ENCODE project

Future goal

• Understand the mechanistic processes that generate these elements, and how and where they function

• Enlarge the data set to additional factors, modifications and cell types, complementing the other related projects

• Constitute foundational resources for human genomics, allowing a deeper interpretation of the organization of gene and regulatory information and the mechanisms of regulation, and thereby provide important insights into human health and disease

(The ENCODE Project Consortium, 2012)

Page 49: New insights into the human genome by ENCODE project

Conclusion

The project is still far from complete

For updates: https://www.facebook.com/ENCODEProject

Page 50: New insights into the human genome by ENCODE project

Encode – assign word to letter

Page 51: New insights into the human genome by ENCODE project

Thank you:)