ENCODE Portal and Uniform Processing Pipelines: Open Web and
Programmatic Access to ENCODE Data, Metadata, and Software Pipelines
ENCODE Data Coordination Center Stanford University, Department of Genetics
Asia Pacific Bioinformatics Conference
January 10, 2016
Schedule for Workshop
• Welcome & who are you?
• Introduction to ENCODE
• DCC, its role in ENCODE
• ENCODE Portal (~45 min)
• Data Access & Availability
• Data Processing via Uniform Pipelines (~45 min)
ENCODE DCC
What would you like to learn?
How many of you:
1. …work in a lab that performs omics methodology?
2. …work in a computational lab that analyses omics data?
3. …have downloaded ENCODE data and intersected it with other data?
4. …know where to go for a comprehensive catalog of all assays done by ENCODE?
5. …could repeat an ENCODE analysis (from FASTQ files) to generate IDR-thresholded sets of peaks?
6. …want to repeat one of the ENCODE analysis pipelines on your data?
7. …need to access ENCODE data but found it difficult or don't know where to begin?
What is ENCODE?
• 13-year-old NIH project, three phases
• Standards
– Experimental methods and quality metrics
– Antibody standards
– Metadata for experiment, biosample & pipeline
– Transparent access
– Full data sharing
• Data
• Tools & Pipelines
• Results & Publications
Image courtesy of Mike Pazin. Modified from PLoS Biol 9:e1001046, 2011
Science 306:636, 2004
ENCODE mapping features of the genome
NIH ENCODE and Roadmap Epigenomics projects have produced >150TB of data
ENCODE: 5403 experiments (+1316), 3659 biosamples (+3230)
Roadmap Epigenomics: 3137 experiments, 985 biosamples
REMC Data Coordinating Center: Aleks Milosavljevic, Baylor College of Medicine
Nature feature: Challenges in irreproducible research
http://www.nature.com/news/reproducibility-1.17552
QC & measures of confidence
antibodies
standard & unique IDs
built in replication
data sharing
Metadata integration using ontologies
ENCODE Portal
• Central source for ENCODE data: experimental and analysis data
• Hub for project information: data standards & publications
• High-quality metadata: data provenance & transparency
The ENCODE Portal at a glance www.encodeproject.org
Sloan et al., 2016, NAR
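The portal can also be queried programmatically, in addition to browsing it on the web. The following is a minimal sketch, assuming the portal exposes a search endpoint at /search/ that accepts URL query parameters and returns JSON when format=json is supplied. The helper names (build_search_url, fetch_json) and the example filters are illustrative and are not taken from the slides.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

PORTAL = "https://www.encodeproject.org"


def build_search_url(**filters):
    """Build a portal search URL; format=json asks for JSON instead of HTML."""
    params = dict(filters)
    params["format"] = "json"
    return f"{PORTAL}/search/?{urlencode(params)}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (requires network access)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


# Assumed filters: object type "Experiment", at most 5 results.
url = build_search_url(type="Experiment", limit=5)
print(url)
# To actually run the query: results = fetch_json(url)
```

Building the URL separately from fetching it keeps the query easy to inspect, or to paste into a browser, before any network request is made.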
Rich experimental metadata is collected and presented for clarity and context
Platform
• Instrument
• Read length
• Single or paired end
• Lane number
• Sequencing depth
Treatment & genetic modifications
• Agent (chemical, biological)
• Concentration
• Duration
• Construct type
• Tag
• Tag location
• Insert sequence
• Target
• Transfection type
• Protocol
Donor & biosample
• Species
• Age
• Sex
• Health status
• Ethnicity
• Strain
• Type (e.g. tissue, cell line)
• Source
• Product id
• Lot id
• Dates (e.g. growth, harvest, procurement)
• Passage number
• Starting amount
• Lab assigned IDs
Library preparation
• Lysis method
• Sonication method
• Extraction method
• Nucleic acid type
• Nucleic acid size range
• Library preparation protocol
• Strand specificity
• Size selection method
• Validation document
For example:
Metadata integration using ontologies
EFO (for cell lines): http://www.ebi.ac.uk/efo/
UBERON (for tissues): http://uberon.org/
CL (for primary cells): http://cellontology.org/
OBI (for assays): http://obi-ontology.org
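The division of labour among the four ontologies can be illustrated with a small lookup. This is a sketch only: it assumes term identifiers take the common "PREFIX:NUMBER" form, and the function name and example IDs are illustrative and do not come from the slides.

```python
# Roles as listed on the slide; the "PREFIX:NUMBER" ID shape is an assumption.
ONTOLOGY_ROLE = {
    "EFO": "cell lines",
    "UBERON": "tissues",
    "CL": "primary cells",
    "OBI": "assays",
}


def ontology_role(term_id):
    """Return (ontology, role) for a term ID, or raise ValueError if unknown."""
    prefix, sep, local_id = term_id.partition(":")
    if not sep or not local_id or prefix not in ONTOLOGY_ROLE:
        raise ValueError(f"unrecognized ontology term: {term_id!r}")
    return prefix, ONTOLOGY_ROLE[prefix]


# Illustrative term IDs only.
for term in ["EFO:0002067", "UBERON:0002107", "CL:0000236", "OBI:0000716"]:
    print(term, "->", ontology_role(term))
```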
[Diagram labels: DCC; ENCODE portal (DCC); Other projects]
Main components of an experimental analysis are uniquely accessioned
Experiments: ENCSR###XXX
Biosamples: ENCBS###XXX
Donors/strains: ENCDO###XXX
Libraries: ENCLB###XXX
Antibody lots: ENCAB###XXX
Files: ENCFF###XXX
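Because each object type carries its own accession prefix, the type of an object can be read from its identifier. A minimal sketch, pairing the six prefixes with the six object types in the order listed, and assuming the "###XXX" placeholder stands for three digits followed by three uppercase letters. The function name and the sample accessions are made up for illustration.

```python
import re

# Prefix-to-object mapping from the slide; reading "###XXX" as three digits
# plus three uppercase letters is an assumption about the placeholder.
ACCESSION_TYPES = {
    "ENCSR": "experiment",
    "ENCBS": "biosample",
    "ENCDO": "donor/strain",
    "ENCLB": "library",
    "ENCAB": "antibody lot",
    "ENCFF": "file",
}
ACCESSION_RE = re.compile(r"(ENC[A-Z]{2})\d{3}[A-Z]{3}")


def accession_type(accession):
    """Return the object type for a well-formed accession, else None."""
    match = ACCESSION_RE.fullmatch(accession)
    if match is None:
        return None
    return ACCESSION_TYPES.get(match.group(1))


# Made-up accessions, used only to exercise the pattern.
for candidate in ["ENCSR000AAA", "ENCFF123ABC", "GSM12345"]:
    print(candidate, "->", accession_type(candidate))
```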
Each antibody lot is characterized & accessioned
Replication and transparency of methods
ENCODE experiments are designed to have at least two replicates.
Data provenance & process transparency
[Figure: assay data (DNase I, RNA-seq, DNase-seq, ChIP-seq, Bisulfite-seq) pass through a "? pipeline" box to give processed data for further analyses or visualization]
Avoid the pipeline blackbox
The ENCODE DCC
Mike Cherry (PI), Ben Hitz, Cricket Sloan
Tim Dreszer, Marissa Melen, Laurence Rowe, Forrest Tanaka, Stuart Miyasato, Matt Simison, Zhenhua Wang
Esther Chan, Jean Davidson, Idan Gabdank, Seth Strattan, Marcus Ho, Aditi Narayanan, Jason Hilton, Kathrina Onate
@encodedcc [email protected] https://github.com/ENCODE-DCC/