Post on 13-Dec-2014
The ENCODE metadata standard to integrate diverse experimental data sets
Eurie L. Hong, Ph.D. (@elhong) Project Manager, ENCODE DCC
Department of Genetics • Stanford University School of Medicine
Intro to the DCC
Metadata definition
Using ontologies
Accessing metadata
Not pictured: Tim Dreszer, Jorge Garcia, Donna Karolchik, Katrina Learned, Forrest Tanaka, Marcus Ho
ENCODE DCC
Galt Barber, Morgan Maddren, Nikhil Podduturi, Greg Roe, Kate Rosenbloom, Laurence Rowe
Esther Chan, Venkat Malladi, Cricket Sloan, Seth Strattan
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang
@encodedcc encode-help@lists.stanford.edu
Data Wranglers
Software engineers
QA, sysadmins, admin
https://github.com/ENCODE-DCC/encoded
Production labs Analysis groups
Role: Data generation; Data organization; Data access
Tasks (Data generation): Perform assays; Perform analyses; Validate data; Submit data files; Submit metadata
Tasks (Data organization): Data processing & validation; Data file storage; Metadata curation
Tasks (Data access): Web-based searches; Data downloads
Genome Browser
ENCODE portal (DCC)
Role of the Data Coordination Center
Data files
Metadata DCC DCC Integrative websites
Scientific community
Challenge: How do you define a metadata standard for diverse assays in multiple species?
Modified from PLoS Biol 9-e1001046, 2011 (M. Pazin)
Principles driving metadata definition
• Provide transparency about how experiments were performed
• Capture data provenance during analyses
• Communicate key experimental variables of an experiment
• Communicate quality metrics about the data
• Help analyze and interpret the data
• Help organize and find the data
Capture the experimental design
Biological replicate 1
Technical replicate 1
Technical replicate 2
Biological replicate 2
Technical replicate 1
Technical replicate 2
Control 1
Control 2
Data file
Technical replicate 1
Data file
Results file Experiment
Experiment
Identify reusable experimental variables
Biosamples
• Type (e.g. tissue, cell line)
• Ontology term name
• Source, product id, lot id
• Treatments
• Knockdown
• Fusion construct information
• Donor or strain information
• Dates (e.g. growth, harvest, procurement)
• Passage number
• Starting amount
• Lab assigned IDs
Antibodies
• Source, product id, lot id
• Isotype
• Antigen
• Host
• Purification method
• Validation status
• NHGRI approval status
• Target
• Species
• Dbxrefs
Libraries
• Library preparation protocol
• Strand specificity
• Size selection method
• Validation document
• Lysis method
• Sonication method
• Extraction method
• Nucleic acid type
• Nucleic acid size range
+
Files
Peak calls
• Reference genome version
• Alignment software
• Software parameters
• Software version
• Quality metrics (e.g. NRF, FRiP)
Alignment
(selected subset of all metadata)
Experiment with replicates
Accession them
Experiment with replicates (ENCSR000DRY)
ENCBS095DKV (biosample) ENCDO826IFN (donors) ENCAB964IAU ENCLB239KAN ENCFF254TDA
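The accessions above follow a visible pattern: a short prefix naming the object type, followed by an identifier. A minimal Python sketch of sorting them by prefix is given here. Only the biosample and donor prefixes are labelled on the slide; the remaining entries in the table are assumptions based on the object names used in this talk, and `classify_accession` is a hypothetical helper, not part of the DCC software.

```python
# Map accession prefixes to object types. Only ENCBS and ENCDO are labelled
# on the slide; the other entries are assumptions based on the object names.
PREFIX_TO_TYPE = {
    "ENCSR": "experiment",
    "ENCBS": "biosample",
    "ENCDO": "donor",
    "ENCAB": "antibody",
    "ENCLB": "library",
    "ENCFF": "file",
}


def classify_accession(accession: str) -> str:
    """Return the object type for an accession, or 'unknown' if unrecognised."""
    return PREFIX_TO_TYPE.get(accession[:5].upper(), "unknown")


accessions = [
    "ENCSR000DRY", "ENCBS095DKV", "ENCDO826IFN",
    "ENCAB964IAU", "ENCLB239KAN", "ENCFF254TDA",
]
types_by_accession = {acc: classify_accession(acc) for acc in accessions}
```

Keeping the prefix table in one dictionary means a new object type only needs one new entry.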
Define their relationship to each other
[Diagram: Experiment, Replicate, Libraries, Antibodies, Files, Biosample, and Donor, connected by "has" relationships]
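One way to picture the "has" relationships between these objects is as a set of linked records. The Python sketch below is an illustration only: the direction of each link (experiment has replicates and files, replicate has a library and an antibody, library has a biosample, biosample has a donor) is an assumption read off the slide labels, the class and field names are hypothetical, and the accessions are reused from the earlier slide as placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Donor:
    accession: str


@dataclass
class Biosample:
    accession: str
    donor: Optional[Donor] = None  # biosample "has" donor (assumed direction)


@dataclass
class Library:
    accession: str
    biosample: Optional[Biosample] = None  # library "has" biosample (assumed)


@dataclass
class Antibody:
    accession: str


@dataclass
class DataFile:
    accession: str


@dataclass
class Replicate:
    biological_replicate: int
    technical_replicate: int
    library: Optional[Library] = None
    antibody: Optional[Antibody] = None


@dataclass
class Experiment:
    accession: str
    replicates: List[Replicate] = field(default_factory=list)
    files: List[DataFile] = field(default_factory=list)

    def biosamples(self) -> List[str]:
        """Collect biosample accessions reachable through replicates, no duplicates."""
        seen: List[str] = []
        for rep in self.replicates:
            lib = rep.library
            if lib and lib.biosample and lib.biosample.accession not in seen:
                seen.append(lib.biosample.accession)
        return seen


# Placeholder objects; the sketch does not assert how these are linked
# in the real database.
donor = Donor("ENCDO826IFN")
biosample = Biosample("ENCBS095DKV", donor=donor)
library = Library("ENCLB239KAN", biosample=biosample)
antibody = Antibody("ENCAB964IAU")
experiment = Experiment(
    "ENCSR000DRY",
    replicates=[Replicate(1, 1, library=library, antibody=antibody)],
    files=[DataFile("ENCFF254TDA")],
)
```

Because each biosample, antibody, and library is its own record, the same object can be referenced by several replicates or experiments instead of being described again each time.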
Challenge: Find common biosamples from data generated by two consortia
356 terms http://encodeproject.org/ENCODE/cellTypes.html
Projects are internally consistent…
314 terms GEO characteristics: common_name, tissue_type, cell_type, lines
360 terms Cell type
… but only 3 biosample names match exactly between projects
314 terms GEO
IMR90 PBMC Th17
Challenge: Find all heart-related tissues?
Heart_OC HCF HCFaa HCM Others?
Fetal Heart Heart Right Atrium Right Ventricle Others?
Project integration using ontologies
DCC
1. Uber Anatomy ontology (UBERON; http://uberon.org/)
2. Cell Ontology (CL; http://cellontology.org/)
3. Experimental Factor Ontology (EFO; http://www.ebi.ac.uk/efo/)
4. Ontology for Biomedical Investigations (OBI; http://obi-ontology.org/page/Main_Page)
OBI (for assays): http://obi-ontology.org
EFO (for cell lines): http://www.ebi.ac.uk/efo/
UBERON (for tissues): http://uberon.org/
CL (for primary cells): http://cellontology.org/
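The heart-related tissues challenge shows why ontology annotation helps: once each sample name is mapped to an ontology term, a search for "heart" can follow the ontology's parent links instead of matching names. The Python sketch below uses a toy ontology and a toy name-to-term mapping, both hypothetical; real UBERON and CL terms, identifiers, and relationships differ, and the function names are illustrative.

```python
# Toy ontology: child term -> parent terms (is_a / part_of collapsed together).
# Hypothetical; real UBERON and CL relationships and identifiers differ.
PARENTS = {
    "heart right ventricle": ["heart"],
    "right cardiac atrium": ["heart"],
    "cardiac fibroblast": ["heart"],
    "heart": ["organ"],
    "lung": ["organ"],
}

# Hypothetical mapping from project-specific sample names to ontology terms.
SAMPLE_TO_TERM = {
    "HCF": "cardiac fibroblast",
    "Fetal Heart": "heart",
    "Right Atrium": "right cardiac atrium",
    "Right Ventricle": "heart right ventricle",
    "IMR90": "lung",
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def samples_under(term, sample_to_term=SAMPLE_TO_TERM):
    """Return sample names annotated with `term` or with any of its descendants."""
    return sorted(
        name
        for name, annotated in sample_to_term.items()
        if annotated == term or term in ancestors(annotated)
    )


heart_samples = samples_under("heart")
```

In this toy setup the query returns names from both naming schemes, even though none of them match each other as strings.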
ENCODE portal (DCC)
Other projects
Ontology-driven searches
http://www.encodedcc.org/
Metadata database Metadata in JSON-LD
Metadata viewed as web page
Scripts
Query using REST API commands: GET, PATCH, POST
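A script-side GET query might look like the Python sketch below. The base URL comes from the slides; the `/experiments/<accession>/` path layout, the helper names, and the sample payload are assumptions for illustration and are not confirmed by the source. The request is built separately from being sent so the example can be inspected without network access.

```python
import json
import urllib.request

# Base URL taken from the slides; the "/experiments/<accession>/" path layout
# is an assumption for illustration, not confirmed by the source.
BASE_URL = "http://www.encodedcc.org"


def build_get_request(accession, base_url=BASE_URL):
    """Build (but do not send) a GET request asking for a JSON representation."""
    url = f"{base_url}/experiments/{accession}/"
    return urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )


def fetch_metadata(accession):
    """Send the GET request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(build_get_request(accession)) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(document):
    """Pull the JSON-LD identifier and type out of a decoded metadata document."""
    return document.get("@id"), document.get("@type")


request = build_get_request("ENCSR000DRY")

# Hypothetical payload shaped like a JSON-LD document, for offline illustration.
sample = json.loads('{"@id": "/experiments/ENCSR000DRY/", "@type": ["experiment"]}')
identifier, types = summarize(sample)
```

PATCH and POST requests could be built the same way by passing a different `method` and a JSON-encoded `data` body to `urllib.request.Request`; submission would also need credentials, which this sketch leaves out.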
DCC
Challenge: Provide user-friendly *AND* programmatic access to the data
Genome Browser
Integration with other resources
http://www.encodedcc.org/
Future directions
• Metadata definition: Finalize software and file provenance
• Ontology-based searches: Implement searches for ChIP-seq targets using GO annotations
• Programmatic access: Implement additional validations upon data submission
Intro to the DCC
Metadata definition
Using ontologies
Accessing metadata
We developed a single data model that reflects the experimental process to store the 30+ assays done by the ENCODE production labs
Using ontologies to annotate metadata provides instant interoperability with other datasets & search functionality
Application built on a REST API & JSON-LD supports programmatic querying across other scientific resources
Conclusions
Acknowledgements
Brian Lee, Nikhil Podduturi, Greg Roe, Laurence Rowe
Esther Chan, Venkat Malladi, Cricket Sloan, Seth Strattan
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
@encodedcc encode-help@lists.stanford.edu