Alejandra González-Beltrán, Ph.D
University of Oxford e-Research Centre, UK
The ISA infrastructure: supporting bio-scientists from experimental design to data publication
4to. Congreso Argentino de Bioinformática y Biología Computacional (4CAB2C) & 4ta. Conferencia Internacional de la Sociedad Iberoamericana de Bioinformática (SoIBio)
29-31 October 2013, Rosario, Argentina
http://www.nature.com/news/2011/110111/full/469139a.html
http://www.economist.com/node/215285937
http://www.nytimes.com/2011/07/08/health/research/08genes.html
Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009) doi:10.1038/ng.295
[Figure: Experimental workflow. A data scientist either uses existing data or performs a new experiment, cycling through Planning, Data Collection, Data Management, Analysis, Visualization and Publication.]
[Figure builds: data + metadata support Science Reproducibility; retrospective metadata is contrasted with prospective metadata recorded at multiple stages and supported by a metadata tracking infrastructure; Data Reusability links one experimental workflow to the next.]
Traceability, Evidence, Provenance, Assessment, Accountability, Retrieval, Mining
Science Reproducibility
Motivation: Data Collection
heterogeneous experimental data
Motivation: Publication
formats and database fragmentation
Roadmap
• Importance of data+metadata availability
• Experimental workflow
• Multi-omic experiments, heterogeneous data & formats
• The Investigation/Study/Assay (ISA) infrastructure
• Experimental workflow revisited
Different communities
report the same core, essential information
use the same term to refer to the same 'thing'
allow data to flow from one system to another
Challenges: lack of interaction & coordination, duplication of effort, fragmentation & uneven coverage... all of which hamper interoperability
Planning, Data Collection
The ISA infrastructure
• generic format for experimental description and data exchange
• open source software tools
• community engagement
[Figure labels: structure, semantics]
Experimental workflow - graph representation
[Figure: two sources, H1 (H. Sapiens, 35 Years) and H2 (H. Sapiens, 33 Years), yield samples H1.sample1, H1.sample2 and H2.sample1; Labeling produces H1.sample1.labeled and H2.sample1.labeled; Scanning produces h1-s1.cel, h1-s2.cel and h2-s1.cel.]
Spreadsheets for end-users
vocabulary for the description of the experimental workflow
syntactic interoperability across biological experiments of different types
[Table: the same graph flattened into one row per sample]
H. Sapiens  H1  35  Years  H1.sample1  Labeling  H1.sample1.labeled  Scanning  h1-s1.cel
H. Sapiens  H1  35  Years  H1.sample2  ...       ...                 Scanning  h1-s2.cel
H. Sapiens  H2  33  Years  H2.sample1  Labeling  H2.sample1.labeled  Scanning  h2-s1.cel
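The move from a workflow graph to spreadsheet rows can be sketched in Python. This is only an illustration under stated assumptions, not ISA tooling: the `EDGES` dictionary, the `paths` and `to_rows` helpers, and the direct link from H1.sample2 to its data file (the slide elides the intermediate step) are all hypothetical. Node names are taken from the slide.

```python
# Hypothetical adjacency list for the slide's example graph.
# Each key points to the nodes derived from it.
EDGES = {
    "H1": ["H1.sample1", "H1.sample2"],
    "H2": ["H2.sample1"],
    "H1.sample1": ["H1.sample1.labeled"],
    "H2.sample1": ["H2.sample1.labeled"],
    "H1.sample1.labeled": ["h1-s1.cel"],
    "H1.sample2": ["h1-s2.cel"],  # assumption: intermediate steps are elided on the slide
    "H2.sample1.labeled": ["h2-s1.cel"],
}


def paths(node, edges):
    """Yield every path from `node` down to a leaf (a node with no children)."""
    children = edges.get(node, [])
    if not children:
        yield [node]
        return
    for child in children:
        for tail in paths(child, edges):
            yield [node] + tail


def to_rows(sources, edges):
    """Flatten the graph: one spreadsheet row per source-to-file path."""
    return [path for source in sources for path in paths(source, edges)]


rows = to_rows(["H1", "H2"], EDGES)
for row in rows:
    print("\t".join(row))
```

A source that branches (H1 here) is repeated on each row it leads to, which is how a flat spreadsheet can encode a branching graph.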
Material Node: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]
Protocol / Process: Date (day effect), Performer (operator effect), Parameter Value[…]
Data File Node: Raw Data File, Derived Data File

Sample Name  Material Type  Hybridization Assay Name  Array Design REF  Array Data File  Protocol REF        Derived Array Data File
sample1      genomic DNA    assay1                    A-AFFY-107        assay1.cel       data normalization  assay1.txt
sample2      genomic DNA    assay2                    A-AFFY-107        assay2.cel       data normalization  assay2.txt
sample3      genomic DNA    assay3                    A-AFFY-107        assay3.cel       data normalization  assay3.txt

Material transformations... (from Material to DATA)
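A table like the one on this slide is plain tab-delimited text, so it can be read with Python's standard library. The sketch below is an assumption-laden illustration rather than an official ISA parser: the `ASSAY_TABLE` string and the helpers `read_assay_table` and `files_by_sample` are hypothetical, and real ISA-Tab files may repeat column headers (for example several Protocol REF columns), which this simple dictionary-based reader would not handle.

```python
import csv
import io

# Tab-delimited text mirroring the assay table on the slide (illustrative only).
ASSAY_TABLE = (
    "Sample Name\tMaterial Type\tHybridization Assay Name\tArray Design REF\t"
    "Array Data File\tProtocol REF\tDerived Array Data File\n"
    "sample1\tgenomic DNA\tassay1\tA-AFFY-107\tassay1.cel\tdata normalization\tassay1.txt\n"
    "sample2\tgenomic DNA\tassay2\tA-AFFY-107\tassay2.cel\tdata normalization\tassay2.txt\n"
    "sample3\tgenomic DNA\tassay3\tA-AFFY-107\tassay3.cel\tdata normalization\tassay3.txt\n"
)


def read_assay_table(text):
    """Parse tab-delimited text into a list of dicts keyed by column header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def files_by_sample(rows):
    """Map each sample name to its (raw, derived) data file names."""
    return {
        row["Sample Name"]: (row["Array Data File"], row["Derived Array Data File"])
        for row in rows
    }


rows = read_assay_table(ASSAY_TABLE)
for sample, (raw, derived) in files_by_sample(rows).items():
    print(sample, raw, derived)
```

Keeping the metadata as rows keyed by header is what lets downstream code find the data files that belong to each sample without guessing from file names.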
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or format) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• also by communities working to build a library of cellular signatures
Implementation at Harvard
http://discovery.hsci.harvard.edu/
Implementation at the European Bioinformatics Institute
http://www.ebi.ac.uk/metabolights
Create template(s) to fit the type of experiments to be described
Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. configuring the value(s) allowed for each field to be
• text (with/without regular expression testing),
• ontology terms,
• numbers etc.
We now have GSC-compliant configurations for submission to ENA.
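The idea of constraining each field to text, ontology terms or numbers can be sketched as follows. This is a hypothetical illustration, not the configuration format used by the ISA tools: the `TEMPLATE` structure, the field names, the allowed-term set and the `check_value` and `validate_row` helpers are all assumptions made for the example.

```python
import re

# Hypothetical template: field name -> rule. Not the real ISA configuration format.
TEMPLATE = {
    "Sample Name": {"type": "text", "pattern": r"sample\d+"},
    "Material Type": {"type": "ontology", "allowed": {"genomic DNA", "total RNA"}},
    "Age": {"type": "number"},
}


def check_value(rule, value):
    """Return True if `value` satisfies the rule for its field."""
    kind = rule["type"]
    if kind == "text":
        pattern = rule.get("pattern")  # regular expression testing is optional
        return pattern is None or re.fullmatch(pattern, value) is not None
    if kind == "ontology":
        return value in rule["allowed"]
    if kind == "number":
        try:
            float(value)
        except ValueError:
            return False
        return True
    raise ValueError(f"unknown field type: {kind}")


def validate_row(template, row):
    """Return the names of fields whose values break the template rules."""
    return [
        name
        for name, rule in template.items()
        if not check_value(rule, row.get(name, ""))
    ]


good = {"Sample Name": "sample1", "Material Type": "genomic DNA", "Age": "35"}
bad = {"Sample Name": "s-1", "Material Type": "DNA?", "Age": "thirty"}
print(validate_row(TEMPLATE, good))
print(validate_row(TEMPLATE, bad))
```

Returning the list of failing field names, rather than a single pass/fail flag, makes it easier to point a curator at the exact cells that need attention.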
Or describe and curate your experiment using a desktop-based tool
Report and edit the description using this tool (also customized using the templates), which has a spreadsheet-like look and feel and is packed with functionality such as
• ontology search (access via )
• term-tagging features
• import from spreadsheets etc.
Developed to be a user-friendly way to enter standards-compliant metadata: it has lots of features...
Planning, Data Collection
+ Design Wizard
OntoMaton: a Bioportal powered Ontology widget for Google Spreadsheets. Maguire et al., 2013, Bioinformatics
Data Collection
• Ontology search and automated tagging (relying on NCBO Bioportal services) on Google Spreadsheets
• Collaborative annotation; support for distributed users
• Version control & history
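Automated tagging of spreadsheet cells with ontology terms can be illustrated offline. The sketch below does not call NCBO BioPortal and is not how OntoMaton is implemented: the small `VOCABULARY` and `SYNONYMS` dictionaries and the `tag` helper are hypothetical stand-ins for a real ontology lookup service.

```python
# Hypothetical local vocabulary: normalised label -> (preferred label, term identifier).
# A real widget would query an ontology service instead of this dictionary.
VOCABULARY = {
    "homo sapiens": ("Homo sapiens", "NCBITaxon:9606"),
    "arabidopsis thaliana": ("Arabidopsis thaliana", "NCBITaxon:3702"),
}

# Hypothetical synonyms that should resolve to the same term.
SYNONYMS = {
    "h. sapiens": "homo sapiens",
    "human": "homo sapiens",
}


def tag(value):
    """Return (preferred label, term id) for a cell value, or (value, None) if unknown."""
    key = " ".join(value.lower().split())  # normalise case and whitespace
    key = SYNONYMS.get(key, key)
    return VOCABULARY.get(key, (value, None))


cells = ["H. Sapiens", "human", "Arabidopsis thaliana", "unknown organism"]
tagged = [tag(cell) for cell in cells]
for cell, (label, term_id) in zip(cells, tagged):
    print(cell, "->", label, term_id)
```

Resolving synonyms to one identifier is a small-scale version of using the same term to refer to the same thing across datasets.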
Data Management
Shifting towards a new system
Analysis
The interesting bit... doing something with our data and metadata...
Analysis of ISA-Tab data in the R language. Brings together the context and data to enable more meaningful analysis.
Also suggests packages to use for analysis based on the data types in the ISA-Tab file.
Analysis of ISA-Tab data in the Galaxy environment.
Creates Galaxy Library objects from ISA-Tab files.
Analysis of ISA-Tab data in the GenomeSpace environment.
Load and edit files stored on distributed servers.
Created by Brad Chapman at the Harvard School of Public Health
[Figure: example experiment in numbered steps, from Experiment Design (Arabidopsis thaliana; treatment groups 70%, 90%, 100%) through Collect Samples and Run Assays (SAMPLE 1 to SAMPLE 11, producing FILE 1 onward) to Analysis.]
Parses ISA-Tab datasets into R objects, which can be updated and saved after analysis.
Bridges the ISA-Tab metadata to analysis pipelines for specific assay types by building objects for use in other R packages downstream: currently mass spectrometry (xcms package, xcmsSet) and DNA microarray (Biobase package, ExpressionSet).
Suggests Bioconductor packages that might be relevant for an assay type, according to their biocViews annotations.
González-Beltrán et al. The Risa R/Bioconductor package: integrative data analysis from experimental metadata and back again. In press
Publication
data submission
Getting your work out there...
• Publish, along with your research articles
• & specialised community repositories
• Share, link and reason over experiments with linked data
http://gigasciencejournal.com
http://www.gigasciencejournal.com/content/1/1/3#B19
http://gigadb.org/dataset/100035
• New open-access, online-only publication for descriptions of scientifically valuable datasets
• Only content type: Data Descriptor, narrative + structured parts
• Initially focused on the life, environmental and biomedical sciences
• Data Descriptor will be complementary to traditional research journals and data repositories
• Designed to foster data sharing and reuse, and ultimately to accelerate scientific discovery
www.nature.com/scientificdata
Data Descriptors served by Scientific Data
Narrative Section
A brief article-like document with:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Usage Notes
• Figures & Tables
• References
Structured Section
Detailed descriptions of the experimental procedures used to produce the data
• Following community-defined minimum information requirements, for a level of detail sufficient to reproduce the experiments
• Using ontologies & controlled vocabularies, to maximise consistency of the descriptions
[Figure: experimental workflow revisited. The data scientist moves through Planning, Data Collection, Data Management, Analysis, Visualization and Publication, supporting Science Reproducibility.]
Funders
Core ISA team
Questions?
You can email [email protected]
View our blog: http://isatools.wordpress.com
Follow us on Twitter: @isatools
View our website: http://www.isa-tools.org
View our Git repo & contribute: http://github.com/ISA-tools
Thanks for your attention!