BCU 2013

The Investigation/Study/Assay (ISA) metadata framework for reproducible and reusable bioscience research

Alejandra González-Beltrán, PhD, on behalf of the ISATeam
Oxford e-Research Centre, University of Oxford

Faculty of Technology, Environment and Engineering, Birmingham City University
12th March 2013



Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009) doi:10.1038/ng.295

http://www.nature.com/news/2011/110111/full/469139a.html

http://www.economist.com/node/21528593

http://www.nytimes.com/2011/07/08/health/research/08genes.html

Contextual information (metadata):
• Sample characteristics
• Technology and measurement types
• Instrument parameters
• …

Need for a generic representation, applied to:
• microarray-based experiments (MAGE)
• sequencing-based experiments (SRA)
• flow cytometry-based experiments (FuGE-Flow Cyt)
• mass spectrometry and NMR spectroscopy experiments (MetaboLights and PRIDE)

Roadmap

Reproducible & Reusable Bioscience Research

reasoning, analysis, exchange, integration, visualization, browsing, retrieval

Well-annotated & Structured Data

Community Standards, Software Tools

User community

Source of the figure: EBI website

• Interdisciplinary and integrative in character
• Need to deal with new and existing datasets
• Need to deal with a variety of data types

Bioscience is multi-domain…

tox/pharma, env, health, agro

Multiple communities, multiple norms and standards, e.g.:
• report the same core, essential information
• use the same term to refer to the same 'thing'
• allow data to flow from one system to another

Challenges: lack of interaction and coordination, duplication of effort, fragmentation and uneven coverage…hinders interoperability

130+
Estimated 150+
Source: MIBBI, EQUATOR
303+
Source: BioPortal
Databases, annotation, curation tools

MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT

MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML…, GELML, ISA-Tab, CML, MITAB

AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO, GIATE

Growing number of bioscience reporting standards

But… what do we know about them and how they are related?


• Which ones are mature enough for me to use or recommend?
• I work on plants; are these standards just for biomedical applications?
• What are the criteria to evaluate their status and value?
• How can I get involved to propose extensions or modifications?
• Which tools and databases implement which standards?
• I use high-throughput sequencing technologies; which ones are relevant to me?
• Which formats support specific minimum information guidelines?

A coherent, curated and searchable catalogue of data sharing resources

• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards

ISA infrastructure

• Assist in the annotation and management of experimental metadata at source, supporting data provenance tracking
• Deal with high-throughput studies using one or a combination of omics and other technologies
• Empower users to take up community-defined checklists and ontologies
• Facilitate data sharing, re-use, comparison and reproducibility of experiments, and submission to international public repositories

ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Rocca-Serra et al., 2010, Bioinformatics

faahKO dataset
• Available in Bioconductor
• Subset of the original data on global metabolite profiling
• LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice

Saghatelian et al., Biochemistry, 2004

faahKO investigation
- Define key entities (e.g. factors, protocols, parameters)
- Grouping of studies
- Relate studies and assays

faahKO study
- Subjects studied: source(s), sampling methodology, characteristics
- Treatments/manipulations performed to prepare the specimens

NEWT UniProt Taxonomy Database; Mouse Genome Informatics; Mouse Adult Gross Anatomy

faahKO assay
- Measurement type, e.g. metabolite profiling
- Technology, e.g. mass spectrometry

Create template(s) to fit the type of experiments to be described  

Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. configuring the value(s) allowed for each field to be:
• text (with/without regular expression testing),
• ontology terms,
• numbers, etc.
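As a rough illustration of what "configuring the allowed values for each field" could mean, here is a minimal sketch in Python. The `FieldRule` class, the `is_valid` function and the example template are hypothetical and are not the ISA configuration format or any ISA tool API; they only mirror the three value kinds listed above (text with an optional regular expression, ontology terms, numbers).

```python
import re
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class FieldRule:
    """Hypothetical rule for one template field (not the ISA configuration format)."""
    kind: str                                  # "text", "ontology" or "number"
    pattern: Optional[str] = None              # optional regular expression for text fields
    allowed_terms: Optional[Set[str]] = None   # accepted labels for ontology fields


def is_valid(rule: FieldRule, value: str) -> bool:
    """Return True when the value satisfies the rule configured for its field."""
    if rule.kind == "text":
        return rule.pattern is None or re.fullmatch(rule.pattern, value) is not None
    if rule.kind == "ontology":
        return rule.allowed_terms is not None and value in rule.allowed_terms
    if rule.kind == "number":
        try:
            float(value)
        except ValueError:
            return False
        return True
    raise ValueError(f"unknown field kind: {rule.kind}")


# Illustrative template: field names follow ISA-Tab column headers, the rules are made up.
template = {
    "Sample Name": FieldRule("text", pattern=r"[A-Za-z0-9_]+"),
    "Characteristics[organism]": FieldRule(
        "ontology", allowed_terms={"Mus musculus", "Homo sapiens"}
    ),
    "Factor Value[dose]": FieldRule("number"),
}

row = {
    "Sample Name": "ko15",
    "Characteristics[organism]": "Mus musculus",
    "Factor Value[dose]": "high",
}

# Collect the names of the fields whose values break their rule.
errors = [name for name, value in row.items() if not is_valid(template[name], value)]
print(errors)
```

In this sketch the free-text dose "high" is rejected because the field was configured as a number, which is the kind of check a template makes possible before curation.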

     

Describe, curate your experiment using a desktop-based tool  

Report and edit the description using this tool (also customized using the templates), which has a spreadsheet-like look and feel and is packed with functionality such as:
• ontology search (access via )
• term-tagging features
• import from spreadsheets, etc.

• Ontology search and automated tagging (relying on NCBO BioPortal services) in Google Spreadsheets
• Collaborative annotation; support for distributed users
• Version control & history

OntoMaton: a BioPortal powered ontology widget for Google Spreadsheets. Maguire et al., 2013, Bioinformatics

• R package available in Bioconductor 2.11: http://bioconductor.org/packages/release/bioc/html/Risa.html
• ISAtab class
• Read ISAtab files into ISAtab objects and write ISAtab files back to disk
• Increment metadata with the definition of factors/treatments/groups
• Build xcmsSet (xcms package) objects from mass spectrometry assays
• Augment the ISAtab dataset after analysis
• Source & issues tracking: https://github.com/ISA-tools/Risa

• faahKO package v. 2.12 contains ISAtab files describing the experiment

    library(Risa)
    # read the ISAtab files shipped with the faahKO package
    faahkoISA <- readISAtab(find.package("faahKO"))
    assay.filename <- faahkoISA["assay.filenames"][[1]]
    # build an xcmsSet object from the mass spectrometry assay
    xset <- processAssayXcmsSet(faahkoISA, assay.filename)
    …
    # record the derived data file in the assay metadata
    updateAssayMetadata(faahkoISA, assay.filename, "Derived Spectral Data File", "faahkoDSDF.txt")

• MTBLS2 processing and analysis using the Risa, xcms and CAMERA Bioconductor packages

Metabolights – an open access general-purpose repository for metabolomics studies and associated meta-data Haug et al, 2012 Nucleic Acids Research

The implicit semantics of the ISA-Tab syntax

Material Node: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]
Protocol Process: Date (day effect), Performer (operator effect), Parameter Value[…]
Data File Node: Raw Data File, Derived Data File

Sample Name | Material Type | Hybridization Assay Name | Array Design REF | Array Data File | Protocol REF | Derived Array Data File
sample1 | genomic DNA | assay1 | A-AFFY-107 | assay1.cel | data normalization | assay1.txt
sample2 | genomic DNA | assay2 | A-AFFY-107 | assay2.cel | data normalization | assay2.txt
sample3 | genomic DNA | assay3 | A-AFFY-107 | assay3.cel | data normalization | assay3.txt

Material transformations...
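The table above can be read left to right as a chain from material to data. The following Python sketch, offered only as an illustration, rebuilds the three example rows as tab-delimited text and derives the implied links. The choice of which columns count as nodes (`NODE_COLUMNS`) is an assumption of this sketch, and the header is spelled "Array Design REF" as in the ISA-Tab specification.

```python
import csv
import io

HEADER = ["Sample Name", "Material Type", "Hybridization Assay Name", "Array Design REF",
          "Array Data File", "Protocol REF", "Derived Array Data File"]

# Rebuild the three example rows as tab-delimited text, the way ISA-Tab files store them.
lines = ["\t".join(HEADER)]
for i in (1, 2, 3):
    lines.append("\t".join([f"sample{i}", "genomic DNA", f"assay{i}", "A-AFFY-107",
                            f"assay{i}.cel", "data normalization", f"assay{i}.txt"]))
ASSAY_TABLE = "\n".join(lines)

# Columns treated as graph nodes; the other columns annotate them (assumption of this sketch).
NODE_COLUMNS = ["Sample Name", "Hybridization Assay Name",
                "Array Data File", "Derived Array Data File"]


def row_to_edges(row):
    """Link consecutive node columns left to right, as the column order implies."""
    nodes = [row[column] for column in NODE_COLUMNS]
    return list(zip(nodes, nodes[1:]))


rows = list(csv.DictReader(io.StringIO(ASSAY_TABLE), delimiter="\t"))
edges = [edge for row in rows for edge in row_to_edges(row)]
for source, target in edges:
    print(f"{source} -> {target}")
```

Each row yields three links (sample to assay, assay to raw data file, raw data file to derived data file), which is the implicit graph that the column order encodes.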

Tagging: from free text to ontology-based

• single intervention representation, free text annotation

Source Name | Characteristics[organism] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration]
individual1 | human | aspirin | high dose | 12 weeks

• single intervention, ontology-based annotation

Source Name | Characteristics[organism (OBI:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354

Factor Value[dose (OBI_0000984)] | Term Source REF | Term Accession Number | Factor Value[time (PATO_0000165)] | Unit | Term Source REF | Term Accession Number
low dose | LNC | LP30872-3 | 12 | week | UO | 0000034
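To make the move from free text to ontology-based annotation concrete, here is a small Python sketch. It expands each free-text cell into the value plus "Term Source REF" and "Term Accession Number" columns. The `TERM_LOOKUP` table and the `tag_row` function are hypothetical; the source and accession values are copied from the example on the slide and have not been checked against the ontologies, and a real tool would query an ontology service instead.

```python
# Lookup copied from the slide's example values (unverified); a real tool would
# query an ontology service rather than use a hard-coded table.
TERM_LOOKUP = {
    "human": ("Homo sapiens", "NCBITax", "9606"),
    "aspirin": ("aspirin", "CHEBI", "1231354"),
}


def tag_row(free_text_row):
    """Expand each free-text cell into value, Term Source REF and Term Accession Number.

    The row is a list of (header, value) pairs because the two term columns
    repeat after every annotated column.
    """
    tagged = []
    for header, value in free_text_row:
        # Unknown values are kept as free text with empty source and accession.
        label, source, accession = TERM_LOOKUP.get(value, (value, "", ""))
        tagged.extend([
            (header, label),
            ("Term Source REF", source),
            ("Term Accession Number", accession),
        ])
    return tagged


free_text_row = [
    ("Characteristics[organism]", "human"),
    ("Factor Value[perturbation agent]", "aspirin"),
    ("Factor Value[duration]", "12 weeks"),
]

tagged = tag_row(free_text_row)
for header, value in tagged:
    print(f"{header}\t{value}")
```

Values without a match, such as "12 weeks" here, stay as free text; on the slide the duration is instead split into a number and a unit term, a step this sketch does not attempt.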

Kohonen  et  al.  The  ToxBank  Data  Warehouse:  a  research  cluster  of  7    EU  FP7  Health  systems  toxicology  and  toxicogenomics  projects.    

Health  Care  &  Life  Sciences    Interest  Group    

ToxBank  effort    developed  by  Nina  Jeliazkova    

• Make the semantics of ISAtab explicit, including materials & data entities & processes & their relationships
• Provide incentives for provision of ontology-based annotations in ISA-TAB datasets; exploit those annotations
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate data integration & knowledge discovery/reasoning

architecture

ISA-TAB parser, isa2owl mapping parser, graph analysis
Configuration file

Implementation:
- Java-based
- Using the OWL API

Experimental domain, Biomolecular domain, Chemical domain and Information domain vocabularies

Source Name | Characteristics[organism (OBI:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354

Open Biological and Biomedical Ontologies (OBO) Foundry: OBI, GO, ChEBI, IAO, BFO

ISA-OBI mapping
ISA-SIO mapping

faahKO dataset
• Available in Bioconductor (with ISA-TAB metadata)
• Global metabolite profiling
• Data subset: LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice

• support different conversion modes (different levels of granularity)
• querying for ISA-TAB datasets, across multiple experiment types
• reasoning exploiting ontology annotations
  – semantic validation of ISA-TAB datasets
• augmented annotation over native ISA syntax
  – identification of gaps in ontological representations
  – feedback of findings to community ontologies

 

Increasing level of structure for experimental metadata

Notes in lab books
Spreadsheets & tables (ISAtab metadata)
Facts as RDF statements
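The last step, from table rows to facts as RDF statements, can be illustrated with a toy Python sketch that writes one N-Triples line per annotated column. This is not how isa2owl performs its conversion. The `http://example.org/isa/` namespace and the `uri` and `row_to_triples` helpers are hypothetical, and values are emitted as plain literals rather than mapped to ontology classes.

```python
from urllib.parse import quote

BASE = "http://example.org/isa/"  # hypothetical namespace, not an ISA or isa2owl URI


def uri(kind, name):
    """Mint an illustrative URI under BASE, percent-encoding the name."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"


def row_to_triples(row):
    """Turn one table row into N-Triples lines, one statement per annotated column."""
    subject = uri("source", row["Source Name"])
    triples = []
    for header, value in row.items():
        if header == "Source Name":
            continue  # the source itself is the subject of every statement
        # Escape backslashes and double quotes so the literal stays well formed.
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {uri("property", header)} "{literal}" .')
    return triples


row = {
    "Source Name": "individual1",
    "Characteristics[organism]": "Homo sapiens",
    "Factor Value[dose]": "low dose",
}

triples = row_to_triples(row)
print("\n".join(triples))
```

Once metadata is in this form, each cell becomes a statement that can be queried and combined with statements from other datasets, which is the point of the progression from lab-book notes to tables to RDF.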

A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains.

Towards interoperable bioscience data. Sansone et al., 2012, Nature Genetics

Implementation at Harvard
http://discovery.hsci.harvard.edu/

Implementation at the European Bioinformatics Institute
http://www.ebi.ac.uk/metabolights

Reproducible & Reusable Bioscience Research

reasoning, analysis, exchange, integration, visualization, browsing, retrieval

@isatools @biosharing
isa-tools.org
isacommons.org
biosharing.org