The Encyclopedia of DNA Elements (ENCODE) Project

Elise A. Feingold, Ph.D. National Human Genome Research Institute

National Institutes of Health

AgENCODE Workshop January 10, 2014

How can we “read” the human genome sequence?

• Genetic code, but no genomic code
• Evolutionary conservation helps to identify functionally important regions
  – ~5% conserved / ~1.5% protein coding
  – What is function of non-coding conserved sequences?
  – What is function of non-conserved sequences?
• Moderately good at identifying protein-coding regions, but fine structures difficult to predict from sequence
• Regulatory regions can be very far away from genes
• Need unbiased experimental investigation

ENCODE: Encyclopedia of DNA Elements

Compile a comprehensive encyclopedia of all sequence features in the human genome and in the genomes of selected model organisms

Approach: Apply lessons learned from the success of the Human Genome Project
• Start with well-defined pilot project
• Develop and test high-throughput technologies

Community Resources
• Use by research community to enhance understanding of:
  – regulation of gene expression on a spatial, temporal and quantitative level
  – genetic basis of disease
• Rapid pre-publication data release
• Consortia publications
• Analysis requires development of:
  – Common data reporting formats
  – Data standards
  – Analytical tools

ENCODE Timeline

ENCODE Products

“Marker” Papers

PLoS Biol (2011) 9:e1001046

modENCODE Publications

19 companion papers in Nature, Genome Research, Genome Biology and Database

ENCODE 2 Publications

September 2012

modENCODE and ENCODE 2 Final Efforts

modENCODE

• Cross-species Analyses
  – Transcription
  – Chromatin
  – Regulation

• Transfer of data and analyses to ENCODE 3 DCC

ENCODE “2”

• Mouse ENCODE Cross-species Analyses

• Transfer of data and analyses to ENCODE 3 DCC

ENCODE Data

Modified from PLoS Biol 9:e1001046, 2011

ENCODE 2 Data

Human Data: >2,800 Datasets
• >200 Cell types
• >250 RNA-seq
• 150 DNase
• 1,100 Transcription factor binding
• >200 Histone modification
• 90 DNAme
• GENCODE mRNA
• Functional Characterization

Mouse Data: >600 Datasets
• 100 Cell types
• 100 RNA-seq
• 50 DNase
• 170 Transcription factor binding
• 170 Histone modification

ENCODE Dimensions

Cells: 182 cell lines/tissues

Methods/Factors: 164 assays (114 different ChIP)

3,010 experiments
5 terabases
1,716x of the human genome

Ewan Birney
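
As a rough sanity check on the figures above, the sketch below works through two pieces of arithmetic: the genome size implied by 5 terabases at 1,716x, and the share of cell-by-assay combinations that 3,010 experiments could fill. Treating each experiment as one distinct cell/assay combination is an assumption made here for illustration; the slide does not say how experiments map onto the matrix.

```python
# Figures quoted on the "ENCODE Dimensions" slide.
TOTAL_BASES = 5e12      # "5 TeraBases"
FOLD_COVERAGE = 1716    # "1716x of the Human Genome"
CELL_TYPES = 182        # cell lines/tissues
ASSAYS = 164            # assays
EXPERIMENTS = 3010      # experiments


def implied_genome_size(total_bases: float, fold: float) -> float:
    """Genome size (in bases) implied by total sequence and fold coverage."""
    return total_bases / fold


def matrix_fill(experiments: int, cells: int, assays: int) -> float:
    """Upper bound on the filled fraction of the cell-by-assay matrix.

    Assumes (hypothetically) that every experiment covers a distinct
    cell/assay combination.
    """
    return experiments / (cells * assays)


if __name__ == "__main__":
    size = implied_genome_size(TOTAL_BASES, FOLD_COVERAGE)
    fill = matrix_fill(EXPERIMENTS, CELL_TYPES, ASSAYS)
    print(f"Implied genome size: {size / 1e9:.2f} Gb")
    print(f"Cell-by-assay matrix filled: at most {fill:.1%}")
```

Under these assumptions the implied genome size is about 2.9 Gb, and no more than roughly a tenth of the possible cell/assay combinations had been assayed.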

More than 30 papers in
• Nature
• Genome Research
• Genome Biology
• Science
• Cell

Publishing innovations
• Threads of themes
• Virtual machines
• iPad app

ENCODE increased our understanding of non-coding DNA and human disease

ENCODE 2 Publications

From www.nature.com/encode

High-Level Findings
• Very large fraction of the genome is biochemically active
  – 80% of the genome has an ENCODE annotation in at least one cell type
  – Fraction that is functional: TBD
• GWAS SNPs are enriched within non-coding functional elements
  – >50% of non-coding GWAS SNPs are near ENCODE-defined regions
  – In many cases, disease phenotypes can be associated with a specific cell type or transcription factor
• Segmenting the genome into 7 chromatin states predicts ~400,000 enhancers and ~70,000 promoters as well as 1000s of quiescent states
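
A statement such as ">50% of non-coding GWAS SNPs are near ENCODE-defined regions" comes down to an interval-overlap count. The following is a minimal sketch of that count, not the Consortium's actual analysis: the function name, the `window` parameter used to define "near", and all coordinates are hypothetical toy values.

```python
from collections import defaultdict


def fraction_near_regions(snps, regions, window=0):
    """Return the fraction of SNPs within `window` bases of any region.

    snps:    iterable of (chrom, position)
    regions: iterable of (chrom, start, end), half-open [start, end)
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    snps = list(snps)
    if not snps:
        return 0.0

    # A simple linear scan per SNP; fine for a toy example, too slow genome-wide.
    hits = sum(
        any(start - window <= pos < end + window
            for start, end in by_chrom.get(chrom, ()))
        for chrom, pos in snps
    )
    return hits / len(snps)


# Toy data (hypothetical coordinates, not real annotations).
TOY_REGIONS = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
TOY_SNPS = [("chr1", 150), ("chr1", 250), ("chr1", 490), ("chr2", 60), ("chr3", 10)]

if __name__ == "__main__":
    print(fraction_near_regions(TOY_SNPS, TOY_REGIONS))             # overlap only
    print(fraction_near_regions(TOY_SNPS, TOY_REGIONS, window=50))  # within 50 bases
```

A real enrichment claim would also need a background expectation (for example, matched control SNPs), which this sketch leaves out.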

Non-coding DNA Is Important For Disease And Evolution

• Non-coding DNA variants are known to cause human diseases

• Non-coding variants are known to cause changes in drug metabolism

• About 90% of GWAS findings lie outside of protein-coding regions

• More than 80% of recent adaptation signatures in three recent studies are not associated with protein-coding mutations

Stamatoyannopoulos, Science 337:1190, 2012; Kingsley, Nature 484:55, 2012; Sabeti, Cell 152:703, 2013; Fraser, Genome Research, doi:10.1101/gr.152710.112, 2013

Data Access

www.encodeproject.org

UCSC Genome Browser

Ensembl

www.modENCODE.org

NCBI

FlyBase

WormBase

ENCODE Portal http://encodeproject.org

Displaying ENCODE data from ENCODE portal

ENCODE Experiment Matrix

ENCODE Data Standards

ENCODE Software Tools

Publications

Figure: Cumulative ENCODE Publications Over Time
Y-axis: Number of Publications (0 to 600)
Series: Papers from Non-ENCODE Authors; Papers from ENCODE 2 Production Groups

Figure: Cumulative Publications Using ENCODE Data by Non-ENCODE Authors
Y-axis: Number of Publications (0 to 160)
Series: Basic Biology; Methods Development; Human Disease

Use of ENCODE Data in Linking Genotype to Phenotype

• ENCODE data can be used in hypothesis generation and refinement
  – What is the causal variant?
  – What is the target gene?
  – What is the target cell type?
  – How does the variant alter the phenotype?

Social Media

Facebook: ENCODE (ENCyclopedia Of DNA Elements)

Twitter: @ENCODE_NIH

ENCODE Tutorial Pages

http://www.genome.gov/27553900

ENCODE 3

Catalog is incomplete

• Only a small fraction of transcription factors studied

• Deeper analysis across many additional cell types (more primary cells) needed

• Additional data types need to be studied, e.g., RNA-binding proteins, lncRNAs

ENCODE 3 Solicitation

• Comprehensive catalogs of functional elements
• Existing capacity for high-throughput, efficient production
• Centralized production, management & coordination
• 7 high priority scientific areas
• More integrated data coordination and analysis
• Primary focus on human, secondary focus on mouse
• Fly/worm allowed if demonstrate need for:
  – highly centralized effort for specific data type
  – Work to be undertaken as part of highly interactive consortium

Priority areas

• Maps of all classes of functional RNA molecules
• Fine structural genome annotation (of the human and mouse genomes only) by improving gene models
• Maps of sites of open chromatin
• Maps of selected histone marks and other relevant chromatin proteins
• Maps of sites of DNA methylation
• Maps of all functional sequence elements within RNA molecules
• Maps of the binding sites for more transcription factors, using a minimum of two cell types for each previously unstudied factor, and additional, well justified cell types as resources permit
  – For transcription factors for which binding site maps already exist, development of maps in additional cell types will be considered, but will be of lower priority and expansion of this data set must be strongly justified

ENCODE 3 Structure

Diagram labels: Data Production Groups; Gene Models; RNA; TF Binding; RBP Binding; Histone Mods; DNase; DNAme; Element ID; Chromatin States; Data Coordination Center; Data Analysis Center; Analysis Working Group; Computational Analysis Groups; Technology Development Groups

Project Management

• Monthly teleconference calls
• Working groups to address specific issues
• Data Analysis Working Groups
• Annual meetings
• Project oversight by external advisors

Individual Project Management

• Yearly quantitative milestones
• Quarterly progress reports
  – Track status of experiments and data submission to identify bottlenecks
  – Track costs
  – Additional narrative section to track non-quantitative milestones, e.g., technology development, and to discuss bottlenecks

Participants

• Groups funded by ENCODE solicitations
• Open to additional data production or data analysis groups agreeing to criteria for participation
  – Genome-wide analysis
  – Full participation in Consortium activities
  – Abide by data release policy
  – Demonstrated funding source
• Encourage inter-consortia collaborations
• Encourage other collaborations/coordination

ENCODE Consortium Activities

Diagram labels: Operational; Policies/Logistics; Human Subjects; Human Resources; Mouse Resources; Data Release and Publications; Outreach; Functional Characterization and Validation; Peak Calling; ChIP/CLIP/RIP-seq; Data Coordination, Analysis, and Interpretation; Analysis Working Group; Datatype Specific Coordination; DNase; RNA; Binding; DCC; DAC; EDCAC; Consortium; Production PI

ENCODE Wiki

Nature 489:49, 2012

Lessons Learned

• Plan data collection
  – Develop focused project goals and target end users in advance
  – Employ high-throughput, robust methods
    • Keep production and technology development pipelines separate
  – Centralize data collection to the extent possible to maximize economies of scale and consistent data quality
  – Generate data on common samples to the extent possible
    • Consider centralized sample collection/distribution
    • Very powerful to have multiple data types on same samples
  – Develop metadata useful for people outside of project
  – Develop experimental standards, data quality metrics and uniform data processing
    • Especially needed if multiple groups are generating data using same experimental assays
    • Ensure high (known) data quality
    • Perform data quality evaluation on ongoing basis
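
One way to make "metadata useful for people outside of project" and ongoing quality evaluation concrete is an automated completeness check run on every submission. The sketch below is illustrative only: the field names in `REQUIRED_FIELDS` and both function names are hypothetical and are not taken from any ENCODE data standard.

```python
# Hypothetical required fields; a real project would define its own list.
REQUIRED_FIELDS = ("assay", "biosample", "organism", "lab", "replicate")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in `record`."""
    gaps = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            gaps.append(field)
    return gaps


def quality_report(records: list) -> dict:
    """Map the index of each incomplete record to its missing fields."""
    report = {}
    for index, record in enumerate(records):
        gaps = missing_fields(record)
        if gaps:
            report[index] = gaps
    return report


# Toy submissions (made-up values).
SUBMISSIONS = [
    {"assay": "RNA-seq", "biosample": "liver", "organism": "mouse",
     "lab": "Lab A", "replicate": 1},
    {"assay": "DNase-seq", "biosample": "liver", "organism": " ",
     "replicate": 1},
]

if __name__ == "__main__":
    print(quality_report(SUBMISSIONS))
```

Running the same check for every group submitting data is one simple way to keep records comparable when several labs use the same assay.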

Lessons Learned

• Devote sufficient resources to bioinformatics (data storage, processing and analysis)

  – Don’t assume that organism-specific community will come together on its own for analysis without dedicated support
• Be realistic about data analysis and publication timeline
  – Overestimate by at least 2X
• Create centralized mode of sharing information
  – e.g., wiki sites, Google Docs

Lessons Learned

• Need for significant, centralized management
  – Explicit, written guidelines, standards and rules
    • e.g., policies for data release, publications
• Balance needs of individual investigators with those of Consortium
  – Retain ability to publish independently
  – Focus on global data production and analysis
  – Beware of focus on individual research agendas and “interesting biology”
• Foster collegial interactions
  – Encourage diversity of opinions
  – Keep consortium open and bring in needed expertise
  – Avoid “group think”
  – Have explicit process for decision making

Summary

• Set clear goals, articulate to community
• Maximize utility of data to the community
  – Rapid pre-publication data release
  – High (knowable) data quality
  – Data standards
  – Interoperability with other projects, especially metadata
• Take advantage of high-throughput production capabilities to maximize economies of scale
• Open consortium
• Set and monitor production milestones
• Facilitate communication between data production groups and computational analysis
• Devote sufficient resources (data production, analysis and infrastructure)

AgENCODE Considerations

• Focused goals
• Number of species
• Quality of genome sequence
• Number of individuals per species
• Number of phenotypes
• Number of tissues/cell types

ENCODE Production Centers

Bradley Bernstein (John Rinn, Manolis Kellis)

Thomas Gingeras (Carrie Davis, Roderic Guigo)

Brenton Graveley (Christopher Burge, Xiang-Dong Fu, Eugene Yeo)

Richard Myers (Devin Absher, Gregory Cooper, Shawn Levy, Florencia Pauli Behn, Ross Hardison, Ali Mortazavi, Timothy Reddy, Barbara Wold)

Bing Ren (Joseph Ecker, Len Pennacchio, Axel Visel, Wei Wang)

Michael Snyder (Kevin White, Sherman Weissman, Peggy Farnham)

John Stamatoyannopoulos (Ralph Hansen, Rajinder Kaul, Patrick Navas, George Stamatoyannopoulos, Piper Treuting, Michael Bender, Job Dekker, Mark Groudine)

ENCODE Data Coordination Center

Mike Cherry (Jim Kent)

ENCODE Data Analysis Center

Zhiping Weng (Mark Gerstein, Manolis Kellis, Roderic Guigo, Rafael Irizarry, Xiaole Shirley Liu, William Stafford Noble)

Additional ENCODE Participants

Timothy Hubbard (Mark Gerstein, Roderic Guigo, Jen Harrow, Rachel Harte, David Haussler, Manolis Kellis, Alexandre Reymond, Stephen Searle, Alfonso Valencia)

David Gilbert (Tamer Kahveci)

ENCODE Computational Analysis Groups

Peter Bickel (Haiyan Huang, Leonard Lipovich, Bin Yu)

David Gifford (Tommi Jaakkola)

Sunduz Keles (Emery Bresnick, Colin Dewey)

Robert Klein (Christina Leslie, Souma Raychaudhuri, Ross Levine, Kenneth Offit)

Jonathan Pritchard (Yoav Gilad)

Xinshu Xiao

ENCODE Technology Development Groups

Christopher Burge (Wendy Gilbert, Brenton Graveley, Robert Horvitz)

Barak Cohen and Joseph Corbo

Peggy Farnham (Victor Jin, David Jay Segal)

R. David Hawkins

Christina Leslie (Christopher Mason)

Jason Lieb (Karen Mohlke, Eran Segal)

Mats Ljungman (Thomas Wilson)

Tarjei Mikkelsen

Jay Shendure and Nadav Ahituv (Michael McManus)

Alexey Wolfson

Guo-Cheng Yuan (Stuart Orkin)

… and many senior scientists, postdocs, students, technicians, computer scientists, statisticians and administrators in these groups

Current ENCODE participants: http://www.genome.gov/26525220

The ENCODE 3 Consortium

NHGRI Staff

Program Directors: Elise Feingold, Peter Good, Michael Pazin

Deputy Director: Mark Guyer

Division Director: Jeff Schloss

Program Analysts: Sherry Zhou, Preetha Nandi