ENCODE-DCC-metadata-standard-Biocurator 2014

The ENCODE metadata standard to integrate diverse experimental data sets. Eurie L. Hong, Ph.D. (@elhong), Project Manager, ENCODE DCC, Department of Genetics • Stanford University School of Medicine. Intro to the DCC • Metadata definition • Using ontologies • Accessing metadata

description

Overview of three areas where the ENCODE DCC is facilitating the integration of diverse datasets: (1) defining a metadata standard, (2) using ontologies for annotation, and (3) creating a RESTful interface for data access.

Transcript of ENCODE-DCC-metadata-standard-Biocurator 2014

Page 1: ENCODE-DCC-metadata-standard-Biocurator 2014

The ENCODE metadata standard to integrate diverse experimental data sets

Eurie L. Hong, Ph.D. (@elhong) Project Manager, ENCODE DCC

Department of Genetics • Stanford University School of Medicine

Intro to the DCC

Metadata definition

Using ontologies

Accessing metadata

Page 2: ENCODE-DCC-metadata-standard-Biocurator 2014


Not pictured: Tim Dreszer, Jorge Garcia, Donna Karolchik, Katrina Learned, Forrest Tanaka, Marcus Ho

ENCODE DCC

Galt Barber, Morgan Maddren, Nikhil Podduturi, Greg Roe, Kate Rosenbloom, Laurence Rowe

Esther Chan, Venkat Malladi, Cricket Sloan, Seth Strattan

Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz

Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang

@encodedcc   encode-[email protected]

Data Wranglers

Software engineers

QA, sysadmins, admin

https://github.com/ENCODE-DCC/encoded

Page 3: ENCODE-DCC-metadata-standard-Biocurator 2014

Role of the Data Coordination Center

Production labs, Analysis groups

Role:   Data generation    | Data organization            | Data access
Tasks:  Perform assays     | Data processing & validation | Web-based searches
        Perform analyses   | Data file storage            | Data downloads
        Validate data      | Metadata curation            |
        Submit data files  |                              |
        Submit metadata    |                              |

[Diagram labels: Data files; Metadata; DCC; Genome Browser; ENCODE portal (DCC); Integrative websites; Scientific community]

Page 4: ENCODE-DCC-metadata-standard-Biocurator 2014

Challenge: How do you define a metadata standard for diverse assays in multiple species?

Modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)

Page 5: ENCODE-DCC-metadata-standard-Biocurator 2014

Principles driving metadata definition

• Provide transparency about how experiments were performed
• Capture data provenance during analyses
• Communicate key experimental variables of an experiment
• Communicate quality metrics about the data
• Help analyze and interpret the data
• Help organize and find the data

Page 6: ENCODE-DCC-metadata-standard-Biocurator 2014

Capture the experimental design

[Diagram labels: Experiment; Biological replicate 1 and Biological replicate 2, each with Technical replicate 1 and Technical replicate 2; Control 1; Control 2; Data file; Results file]
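The replicate structure on this slide can be sketched as nested records. This is only an illustration, not the DCC's actual schema: the class names, field names, and file names are hypothetical, and treating each control as its own experiment is an assumption read off the diagram labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TechnicalReplicate:
    number: int
    data_files: List[str] = field(default_factory=list)


@dataclass
class BiologicalReplicate:
    number: int
    technical_replicates: List[TechnicalReplicate] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    biological_replicates: List[BiologicalReplicate] = field(default_factory=list)
    # Assumption: a control is modeled as another experiment.
    controls: List["Experiment"] = field(default_factory=list)
    results_files: List[str] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        """Collect data files from every technical replicate, in order."""
        return [
            data_file
            for bio in self.biological_replicates
            for tech in bio.technical_replicates
            for data_file in tech.data_files
        ]


def build_example() -> Experiment:
    """Build two biological replicates, each with two technical replicates."""
    control = Experiment(
        name="Control 1",
        biological_replicates=[
            BiologicalReplicate(1, [TechnicalReplicate(1, ["control1.fastq"])])
        ],
    )
    experiment = Experiment(
        name="Experiment", controls=[control], results_files=["results.bed"]
    )
    for bio_number in (1, 2):
        bio = BiologicalReplicate(bio_number)
        for tech_number in (1, 2):
            file_name = f"rep{bio_number}_{tech_number}.fastq"  # hypothetical name
            bio.technical_replicates.append(
                TechnicalReplicate(tech_number, [file_name])
            )
        experiment.biological_replicates.append(bio)
    return experiment


example = build_example()
```

Keeping the data files on the technical replicate, rather than on the experiment, is what lets a results file be traced back to the replicates it was derived from.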

Page 7: ENCODE-DCC-metadata-standard-Biocurator 2014

Identify reusable experimental variables

Biosamples

• Type (e.g. tissue, cell line)
• Ontology term name
• Source, product id, lot id
• Treatments
• Knockdown
• Fusion construct information
• Donor or strain information
• Dates (e.g. growth, harvest, procurement)
• Passage number
• Starting amount
• Lab assigned IDs

Antibodies

• Source, product id, lot id
• Isotype
• Antigen
• Host
• Purification method
• Validation status
• NHGRI approval status
• Target
• Species
• Dbxrefs

Libraries

• Library preparation protocol
• Strand specificity
• Size selection method
• Validation document
• Lysis method
• Sonication method
• Extraction method
• Nucleic acid type
• Nucleic acid size range

+

Files

Peak calls

• Reference genome version
• Alignment software
• Software parameters
• Software version
• Quality metrics (e.g. NRF, FRiP)

Alignment

(selected subset of all metadata)

Experiment with replicates
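The two quality metrics named on this slide are commonly defined as ratios over mapped reads: NRF (non-redundant fraction) is the number of distinct mapped positions divided by the total number of reads, and FRiP (fraction of reads in peaks) is the share of reads that fall inside called peaks. The sketch below illustrates those ratios on toy data. It is not the DCC pipeline: the tuple layouts are hypothetical, and a read is counted as "in a peak" by its start position only, which is a simplification.

```python
from typing import List, Tuple

Read = Tuple[str, int, str]  # (chromosome, start position, strand); hypothetical layout
Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open interval


def non_redundant_fraction(reads: List[Read]) -> float:
    """Distinct (chromosome, position, strand) combinations over total reads."""
    if not reads:
        raise ValueError("no reads given")
    return len(set(reads)) / len(reads)


def fraction_of_reads_in_peaks(reads: List[Read], peaks: List[Peak]) -> float:
    """Share of reads whose start position lies inside any peak."""
    if not reads:
        raise ValueError("no reads given")
    in_peaks = sum(
        1
        for chrom, pos, _strand in reads
        if any(chrom == p_chrom and start <= pos < end for p_chrom, start, end in peaks)
    )
    return in_peaks / len(reads)


# Toy data: four reads, two of them exact duplicates, and one peak on chr1.
toy_reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 150, "-"), ("chr2", 500, "+")]
toy_peaks = [("chr1", 90, 120)]

nrf = non_redundant_fraction(toy_reads)
frip = fraction_of_reads_in_peaks(toy_reads, toy_peaks)
```

On the toy data, three of the four reads are distinct and two of the four start inside the peak.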

Page 8: ENCODE-DCC-metadata-standard-Biocurator 2014

Accession them

Biosamples

• Type (e.g. tissue, cell line)
• Ontology term name
• Source, product id, lot id
• Treatments
• Knockdown
• Fusion construct information
• Donor or strain information
• Dates (e.g. growth, harvest, procurement)
• Passage number
• Starting amount
• Lab assigned IDs

Antibodies

• Source, product id, lot id
• Isotype
• Antigen
• Host
• Purification method
• Validation status
• NHGRI approval status
• Target
• Species
• Dbxrefs

Libraries

• Library preparation protocol
• Strand specificity
• Size selection method
• Validation document
• Lysis method
• Sonication method
• Extraction method
• Nucleic acid type
• Nucleic acid size range

+

Files

Peak calls

• Reference genome version
• Alignment software
• Software parameters
• Software version
• Quality metrics (e.g. NRF, FRiP)

Alignment

(selected subset of all metadata)

Experiment with replicates (ENCSR000DRY)

ENCBS095DKV (biosample), ENCDO826IFN (donors), ENCAB964IAU, ENCLB239KAN, ENCFF254TDA
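The accessions on this slide share a visible shape: the prefix ENC, a two-letter object code, three digits, and three letters. The sketch below uses that shape to tell object types apart. Both the pattern and the code-to-type table are assumptions inferred from the six examples shown here; only "biosample", "donors", and the experiment accession are labeled explicitly on the slide, and the function name is hypothetical.

```python
import re
from typing import Optional

# Assumed shape, inferred from the slide's examples: ENC + 2 letters + 3 digits + 3 letters.
ACCESSION_PATTERN = re.compile(r"^ENC([A-Z]{2})(\d{3})([A-Z]{3})$")

# Assumed mapping from the two-letter code to the kind of object.
OBJECT_TYPES = {
    "SR": "experiment",
    "BS": "biosample",
    "DO": "donor",
    "AB": "antibody",
    "LB": "library",
    "FF": "file",
}


def object_type(accession: str) -> Optional[str]:
    """Return the object type for an accession, or None if it does not fit."""
    match = ACCESSION_PATTERN.match(accession)
    if match is None:
        return None
    return OBJECT_TYPES.get(match.group(1))


slide_accessions = [
    "ENCSR000DRY",
    "ENCBS095DKV",
    "ENCDO826IFN",
    "ENCAB964IAU",
    "ENCLB239KAN",
    "ENCFF254TDA",
]
types_on_slide = {acc: object_type(acc) for acc in slide_accessions}
```

Because the type is readable from the identifier itself, a script can route an accession to the right kind of record without a lookup.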

Page 9: ENCODE-DCC-metadata-standard-Biocurator 2014

Define their relationship to each other

[Diagram labels: Biosample; Antibodies; Libraries; Files; Donor; Biosample; Replicate; Experiment; the objects are linked by "has" relationships]

Page 10: ENCODE-DCC-metadata-standard-Biocurator 2014

Challenge: Find common biosamples from data generated by two consortia

356 terms (http://encodeproject.org/ENCODE/cellTypes.html)

Projects are internally consistent…

314 terms (GEO characteristics: common_name, tissue_type, cell_type, lines)

Page 11: ENCODE-DCC-metadata-standard-Biocurator 2014

360 terms: Cell type

… but only 3 biosample names match exactly between projects

314 terms: GEO

IMR90, PBMC, Th17
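The comparison behind this slide is an exact string match between two term lists, which can be sketched as a set intersection. The two short lists below are illustrative stand-ins, not the real 360-term and 314-term vocabularies, and the assignment of names to each project is an assumption.

```python
from typing import Iterable, List


def exact_matches(names_a: Iterable[str], names_b: Iterable[str]) -> List[str]:
    """Return the names that appear, character for character, in both lists."""
    return sorted(set(names_a) & set(names_b))


# Illustrative subsets only; which project uses which name is assumed.
project_a_names = ["IMR90", "PBMC", "Th17", "Heart_OC", "HCF", "HCM"]
project_b_names = ["IMR90", "PBMC", "Th17", "Fetal Heart", "Right Atrium"]

shared = exact_matches(project_a_names, project_b_names)
```

Names such as "Heart_OC" and "Fetal Heart" describe related samples but never match as strings, which is the gap that ontology annotation is meant to close.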

Page 12: ENCODE-DCC-metadata-standard-Biocurator 2014

Challenge: Find all heart-related tissues?

Heart_OC, HCF, HCFaa, HCM, Others?

Fetal Heart, Heart, Right Atrium, Right Ventricle, Others?

Page 13: ENCODE-DCC-metadata-standard-Biocurator 2014

Project integration using ontologies

DCC

1. Uber Anatomy ontology (UBERON; http://uberon.org/)
2. Cell Ontology (CL; http://cellontology.org/)
3. Experimental Factor Ontology (EFO; http://www.ebi.ac.uk/efo/)
4. Ontology for Biomedical Investigations (OBI; http://obi-ontology.org/page/Main_Page)

OBI (for assays): http://obi-ontology.org
EFO (for cell lines): http://www.ebi.ac.uk/efo/
UBERON (for tissues): http://uberon.org/
CL (for primary cells): http://cellontology.org/

ENCODE portal (DCC)

Other projects

Page 14: ENCODE-DCC-metadata-standard-Biocurator 2014

Ontology-driven searches

http://www.encodedcc.org/
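An ontology-driven search answers the earlier "find all heart-related tissues" question by following term relationships instead of comparing names. The sketch below shows the idea on a toy hierarchy. Everything in it is an assumption for illustration: the term names, the parent links, and the annotation of each biosample name are invented here and are not taken from UBERON, CL, or EFO, nor from the portal's implementation.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology: each term maps to its direct parent terms.
PARENTS: Dict[str, List[str]] = {
    "organ": [],
    "heart": ["organ"],
    "lung": ["organ"],
    "right atrium": ["heart"],
    "right ventricle": ["heart"],
    "cardiac fibroblast": ["heart"],
    "cardiac muscle cell": ["heart"],
}

# Hypothetical annotations: biosample name -> ontology term.
ANNOTATIONS: Dict[str, str] = {
    "Heart_OC": "heart",
    "HCF": "cardiac fibroblast",
    "HCM": "cardiac muscle cell",
    "Fetal Heart": "heart",
    "Right Atrium": "right atrium",
    "Right Ventricle": "right ventricle",
    "IMR90": "lung",
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def search(query: str, annotations: Dict[str, str], parents: Dict[str, List[str]]) -> List[str]:
    """Return biosamples annotated with the query term or any term beneath it."""
    return sorted(
        name
        for name, term in annotations.items()
        if term == query or query in ancestors(term, parents)
    )


heart_related = search("heart", ANNOTATIONS, PARENTS)
```

A query for "heart" returns samples from both naming schemes, because the match is made on the annotated term and its ancestors rather than on the sample name.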

Page 15: ENCODE-DCC-metadata-standard-Biocurator 2014

Challenge: Provide user-friendly *AND* programmatic access to the data

Metadata database

Metadata in JSON-LD

Metadata viewed as web page

Scripts

Query using REST API commands: GET, PATCH, POST

DCC

Genome Browser
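A script-side view of this access path might look like the sketch below, which builds GET and PATCH requests that ask for JSON. It is a sketch under stated assumptions, not the portal's documented client: the base URL is the address shown on these slides, the resource paths and the `description` field are hypothetical, and real write requests would also need credentials, which are omitted here.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://www.encodedcc.org"  # address from the slides; may differ today


def build_request(
    path: str, method: str = "GET", payload: Optional[dict] = None
) -> urllib.request.Request:
    """Build a request that asks the server for JSON instead of a web page."""
    headers = {"Accept": "application/json"}
    data = None
    if payload is not None:
        # Send the body as JSON for PATCH and POST.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def fetch_json(path: str) -> dict:
    """Send a GET request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(build_request(path)) as response:
        return json.load(response)


# Hypothetical paths built from the accessions shown earlier in the deck.
get_request = build_request("/experiments/ENCSR000DRY/")
patch_request = build_request(
    "/biosamples/ENCBS095DKV/", method="PATCH", payload={"description": "example"}
)
```

The same URL can serve a person and a script; the `Accept` header is what selects the JSON representation over the rendered page.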

Page 16: ENCODE-DCC-metadata-standard-Biocurator 2014

Integration with other resources

http://www.encodedcc.org/

Page 17: ENCODE-DCC-metadata-standard-Biocurator 2014

Future directions

• Metadata definition: Finalize software and file provenance
• Ontology-based searches: Implement searches for ChIP-seq targets using GO annotations
• Programmatic access: Implement additional validations upon data submission

Page 18: ENCODE-DCC-metadata-standard-Biocurator 2014

Conclusions

Intro to the DCC

Metadata definition: We developed a single data model that reflects the experimental process to store the 30+ assays done by the ENCODE production labs

Using ontologies: Using ontologies to annotate metadata provides instant interoperability with other datasets & search functionality

Accessing metadata: Application built on a REST API & JSON-LD supports programmatic querying across other scientific resources

Page 19: ENCODE-DCC-metadata-standard-Biocurator 2014


Acknowledgements

Brian Lee, Nikhil Podduturi, Greg Roe, Laurence Rowe

Esther Chan, Venkat Malladi, Cricket Sloan, Seth Strattan

Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz

@encodedcc   encode-[email protected]