Reproducible, Open Data Science in the Life Sciences
Digital Research 2013, 9th-10th September 2013
Eamonn Maguire, Lead Software Engineer
University of Oxford e-Research Centre
What is “data science”...
these days, all hard sciences are “data sciences”
“data science” - the storage, management and analysis of data sets
or...
"data science" - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and
analyzing it to either confirm or falsify the hypothesis.
we shift the focus of science from performing physical experiments where data is the by-product used to test a hypothesis, to working
directly with the data
both definitions have different levels of validity in terms of the etymology of the word “science”, but in this presentation, both go very much hand in hand.
Why reproducible and open?
open: experiments are expensive... and often publicly funded.
the use of data from experiments may extend well beyond what was originally intended...
one without the other is really no use at all.
reproducible: findings need to be robust... and testable by the wider scientific community...
provided with: data, metadata, analysis methods, algorithms
enabled by: data, metadata, analysis methods, algorithms
The workflow of a “data scientist”...
[Figure: the data scientist's workflow: Planning, Data Collection (use existing data or perform a new experiment), Data Management, Analysis, Visualization, Publication]
Planning
What is your hypothesis? What do you need to prove/disprove this?
You need an experimental design, preferably with balanced groups and with enough samples for the results to be statistically valid.
If you need to generate the data...there’s an app for that.
+Design Wizard
Plan your experiment by answering questions about what you want to measure and how you want to measure it...then let the tool create the design
Creates the ISA-Tab stub, leaving you to fill in which files match which biological samples.
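As a rough illustration of the idea (not the Design Wizard's actual output), the sketch below builds a balanced design and writes a tab-delimited stub with the data file column left empty. The factor names, levels and replicate count are hypothetical, and real ISA-Tab spreads this information over separate study and assay files; a single table is used here only to keep the example short.

```python
import csv
import io
from itertools import product


def build_design_stub(factors, replicates):
    """Return one row per sample; every factor combination gets the same
    number of replicates, so the groups are balanced."""
    names = list(factors)
    rows = []
    sample_no = 1
    for levels in product(*(factors[name] for name in names)):
        for _ in range(replicates):
            row = {"Sample Name": f"SAMPLE {sample_no}"}
            for name, level in zip(names, levels):
                row[f"Factor Value[{name}]"] = level
            # Left blank on purpose: the experimenter fills in which
            # file belongs to which sample.
            row["Raw Data File"] = ""
            rows.append(row)
            sample_no += 1
    return rows


def to_tsv(rows):
    """Serialise the rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical factors, chosen only for illustration.
factors = {"dose": ["low", "high"], "time": ["1h", "24h"]}
stub = build_design_stub(factors, replicates=3)
print(to_tsv(stub))
```

Two factors with two levels each and three replicates give twelve samples, three per group.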
Planning
You also need a data management strategy.
Which ontologies, minimal information checklists and exchange formats can be used for my domain?
What are the requirements of my funder for data deposition?
Which databases support my data?
Data Collection
Collect the data and metadata from an experiment
Or use existing data and metadata to test a hypothesis...
[Figure: samples SAMPLE 1 to SAMPLE 11 and data files FILE 1 to FILE 8 (further file labels truncated)]
Data Collection: new experiment data
Excel
Create templates to fit the type of experiments to be described
Curate your experiment using a desktop-based, platform-independent tool.
Describe & curate your experiment with geographically distributed collaborators
Check out http://isa-tools.org to download.
Enables the creation of meaningful experiment data in a simple, extensible format
Data Collection: use existing data
Data Management
Shifting towards a new system
Analysis
The interesting bit...doing something with our data and metadata...
Analysis of ISA-Tab data in the R language. Brings together the context and data to enable more meaningful analysis.
Also suggests packages to use for analysis based on the data types in the ISA-Tab file.
Publication coming soon...
Analysis of ISA-Tab data in the Galaxy Environment.
Creates Galaxy Library objects from ISA-Tab files.
Analysis of ISA-Tab data in the GenomeSpace Environment.
Load and edit files stored on distributed servers.
Created by Brad Chapman at the Harvard School of Public Health
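To make "bringing together the context and data" concrete, here is a minimal Python sketch, not the API of any of the tools above. It reads a small tab-delimited table written in the style of an ISA-Tab assay file and groups the data files by an experimental factor. The table contents and the `dose` factor are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical assay table: sample names, one factor and the data files.
ASSAY_TSV = (
    "Sample Name\tFactor Value[dose]\tRaw Data File\n"
    "SAMPLE 1\tlow\tFILE 1\n"
    "SAMPLE 2\tlow\tFILE 2\n"
    "SAMPLE 3\thigh\tFILE 3\n"
    "SAMPLE 4\thigh\tFILE 4\n"
)


def files_by_factor(tsv_text, factor):
    """Group data file names by the value of one experimental factor."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    column = f"Factor Value[{factor}]"
    groups = defaultdict(list)
    for row in reader:
        groups[row[column]].append(row["Raw Data File"])
    return dict(groups)


groups = files_by_factor(ASSAY_TSV, "dose")
print(groups)
```

With the files grouped by factor value, an analysis step can compare groups without the metadata having to be re-entered by hand.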
Visualization
Check out your experiment, visualize experimental design
Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs
Maguire et al, 2013, IEEE TVCG
Taxonomy-based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments
Maguire et al, 2012, IEEE TVCG
Publication
Publish alongside your research articles and in specialised community repositories
Share, link and reason over experiments with linked data
Getting your work out there...
Publication - current work
See http://www.slideshare.net/GigaScience/scott-edmunds-ismb-talk-on-big-data-publishing for a use case showing how we achieve this.
Analysis results
Data files
Publications
Metadata: encapsulates all the information about the experiment, providing links to the data files, publications and analysis protocols
Workflows: analysis workflows in the Galaxy Environment
Presentations
Logs
Box it all up
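One way to picture the "box" is a small manifest in which the metadata file acts as the hub and points at everything else. The Python sketch below is an assumption made for illustration only: the manifest layout, the `build_bundle_manifest` helper and all file names are hypothetical and do not describe the format used in the work referenced above.

```python
import json


def build_bundle_manifest(metadata_file, **components):
    """Describe a bundle: one metadata file plus the items it links to,
    grouped by kind (data files, publications, workflows, ...)."""
    manifest = {"metadata": metadata_file, "links": {}}
    for kind, paths in sorted(components.items()):
        manifest["links"][kind] = sorted(paths)
    return manifest


# All names below are placeholders.
manifest = build_bundle_manifest(
    "i_investigation.txt",
    data_files=["FILE 2", "FILE 1"],
    publications=["article.pdf"],
    workflows=["analysis_workflow.ga"],
    analysis_results=["results.tsv"],
    presentations=["slides.pdf"],
    logs=["run.log"],
)
print(json.dumps(manifest, indent=2))
```

Sorting the entries keeps the manifest stable between runs, which makes two bundles easier to compare.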
The role of a data scientist (or, in the life sciences, a bioinformatician) is manifold
We’ve presented a suite of tools to help the data scientist in the management of their data and support the creation of open, meaningful life science data
[Figure: the data scientist's workflow: (a) Planning, (b) Data Collection, (c) Data Management, (d) Analysis, (e) Visualization, (f) Publication]
“data science” - the storage, management and analysis of data sets
"data science" - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis.
With the systems we have in place for data discovery, paired with data already created with the ISA suite of tools, we make data integration possible
Making science with data possible...
Recent Publications
Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs
Maguire et al, 2013
IEEE TVCG
Taxonomy-based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments
Maguire et al, 2012
IEEE TVCG
ISA software suite: Overview of ISA-Tab and first set of tools
Rocca-Serra et al, 2010
Bioinformatics
Towards interoperable bioscience data: Presenting the ISA Commons, authored by more than 50 collaborators at over 30 scientific organizations around the globe.
Sansone et al, 2012
Nature Genetics
OntoMaton: a Bioportal powered ontology widget for Google Spreadsheets.
Maguire et al, 2013
Bioinformatics
The Harvard Stem Cell Discovery Engine: an integrated repository and analysis system for cancer stem cell, powered by ISA tools.
Ho Sui et al, 2012
Nucleic Acids Research
Taxonomy-based Glyph Design
Maguire et al, 2012
IEEE TVCG
Visualizing (ISA based) workflows of biological experiments
Standardizing data: ISA-Tab-Nano: ISA-Tab extension for nanotechnology applications authored by over 20 organizations incl. government agencies, academia and industry.
Baker et al, 2013
Nature Nanotechnology
MetaboLights: an open-access repository for metabolomics at EBI powered by ISA.
Haug et al, 2013
Nucleic Acids Research
The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects develops the ISAtoRDF module
Kohonen et al, 2013
Molecular Informatics
Thanks to
ISA team
Susanna-Assunta Sansone, Philippe Rocca-Serra, Eamonn Maguire, Alejandra Gonzalez-Beltran
Contributors
Marco Brandizi, Natalija Sklyar, Brad Chapman, Bob MacCallum, Kenneth Haug, Pablo Conesa, Audrey Kauffman
Funders
& Our Many Collaborators!
Stem Cell Commons
Nanotechnology Informatics Working Group
Questions?
http://isa-tools.org
http://isacommons.org
http://biosharing.org