ISMB Workshop 2014

Post on 26-Jan-2015


description

This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.

Transcript of ISMB Workshop 2014

What was the plan? A role for data standards, models and computational workflows in scholarly data publishing

Alejandra González-Beltrán, PhD
Philippe Rocca-Serra, PhD
Oxford e-Research Centre, University of Oxford

{alejandra.gonzalezbeltran,philippe.rocca-serra}@oerc.ox.ac.uk

ISMB Workshop: What Bioinformaticians need to know about digital publishing beyond the PDF2

July 15th, 2014, Boston, USA

The experimental workflow

[Figure: experimental workflow diagram. Labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment. Successive builds of the slide add the annotations: metadata; Data Interoperability; Reproducibility; Data Review; Data Reusability.]

The experimental plan - life sciences case

• experimental design
• sample characteristic(s)
• experimental variable(s)
• technology(s)
• measurement(s)
• protocol(s)
• data file(s)
• …

InnoMed PredTox Project

• 2-week systemic rat study using male Wistar rats (N=15 per dose group)
• 14 proprietary drug candidates from participating companies and 2 reference toxic compounds

The experimental plan - computational case

• open peer-review
• availability of data, analysis scripts and documentation

Evaluation of the SOAPdenovo2 tool for de novo assembly of genomes from short DNA reads produced by next-generation sequencing. SOAPdenovo2 implements improvements over the SOAPdenovo1 assembler.

Predictor Variables (Factor Name, Factor Type)

• genome assembly algorithm
• genome size

Factor levels

• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALLPATHS-LG
• genome size: bacterial genome (S. aureus, R. sphaeroides), insect genome (B. impatiens), human genome (Chinese Han genome, or YH genome)

3x3 factorial design, 9 study groups

The experimental plan - computational case

Response Variables

• genome coverage
• computation run time
• memory consumption
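The 3x3 factorial design described above can be enumerated mechanically. The following is a minimal Python sketch, assuming the three assembly algorithms and the three genome size levels named in the slides. The function name `build_study_groups` and the dictionary layout are illustrative choices, not part of the talk or of any ISA tool.

```python
from itertools import product

# Factor levels as named in the slides.
ALGORITHMS = ["SOAPdenovo2", "SOAPdenovo1", "ALLPATHS-LG"]
GENOME_SIZES = ["bacterial genome", "insect genome", "human genome"]

# Response variables measured for every study group.
RESPONSE_VARIABLES = ["genome coverage", "computation run time", "memory consumption"]


def build_study_groups(algorithms, genome_sizes):
    """Return one study group per combination of factor levels (full factorial)."""
    return [
        {"genome assembly algorithm": algorithm, "genome size": genome_size}
        for algorithm, genome_size in product(algorithms, genome_sizes)
    ]


study_groups = build_study_groups(ALGORITHMS, GENOME_SIZES)

for number, group in enumerate(study_groups, start=1):
    # Each group is later described by the same set of response variables.
    print(number, group["genome assembly algorithm"], "x", group["genome size"])

print("study groups:", len(study_groups))
print("measured per group:", ", ".join(RESPONSE_VARIABLES))
```

Declaring the factors and their levels up front, and deriving the groups from them, makes it easy to check that a reported dataset actually covers every planned combination.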

http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png


A growing ecosystem of over 30 public and internal resources uses the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:

• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics

It is also used by communities working to build a library of cellular signatures.

General-purpose, configurable format designed to support:

• description of the experimental metadata, making the annotation explicit and discoverable
• provenance tracking
• use of community standards, such as minimal reporting guidelines and terminologies
• conversion to a growing number of other metadata formats, e.g. those used by the European Bioinformatics Institute (EBI) repositories

[Figure: example experimental metadata, shown in two layouts: one repeating shared values for every sample, one listing each shared value once. Sources: H1 (H. Sapiens, 35 Years) and H2 (H. Sapiens, 33 Years). Samples: H1.sample1, H1.sample2, H2.sample1. Labeling produces H1.sample1.labeled and H2.sample1.labeled. Scanning produces the data files h1-s1.cel, h1-s2.cel and h2-s1.cel. Further steps are elided (...).]
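The source and sample values in the example above can be written out as a tab-delimited table. The sketch below is a minimal Python illustration using only the standard library. The column headers are loosely modelled on ISA-Tab naming conventions and are simplified for illustration, so this should not be read as a complete or validated ISA-Tab file; the function names are hypothetical.

```python
import csv
import io

# Assumption: simplified headers in the style of ISA-Tab, for illustration only.
HEADER = [
    "Source Name",
    "Characteristics[organism]",
    "Characteristics[age]",
    "Unit",
    "Sample Name",
]

# One row per sample, using the labels from the example.
ROWS = [
    ["H1", "H. Sapiens", "35", "Years", "H1.sample1"],
    ["H1", "H. Sapiens", "35", "Years", "H1.sample2"],
    ["H2", "H. Sapiens", "33", "Years", "H2.sample1"],
]


def to_tab_delimited(header, rows):
    """Serialise the header and rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def samples_by_source(table_text):
    """Read the table back and group sample names by their source."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    grouped = {}
    for row in reader:
        grouped.setdefault(row["Source Name"], []).append(row["Sample Name"])
    return grouped


table_text = to_tab_delimited(HEADER, ROWS)
print(table_text)
print(samples_by_source(table_text))
```

Repeating the source values on every row keeps the table flat and spreadsheet-friendly, while grouping by source recovers the merged view in which each source appears once.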

obi:material entity

obi:material sample

obi:material processing

obi:processed material

obi:planned process

isa:raw data file

bfo:derives from
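One way to read the `bfo:derives from` relation in this example is as a chain linking each data file back to its source material. The sketch below is a minimal Python illustration under that assumption. The dictionary `DERIVES_FROM` and the function `trace_to_source` are hypothetical names, and the links are inferred from the labels in the example rather than taken from an actual ISA dataset. The file h1-s2.cel is left out because its labeled intermediate is not listed.

```python
# Each key is a material or data file; each value is what it was derived from.
# Assumption: links are inferred from the naming of the labels in the example.
DERIVES_FROM = {
    "H1.sample1": "H1",
    "H1.sample2": "H1",
    "H2.sample1": "H2",
    "H1.sample1.labeled": "H1.sample1",
    "H2.sample1.labeled": "H2.sample1",
    "h1-s1.cel": "H1.sample1.labeled",
    "h2-s1.cel": "H2.sample1.labeled",
}


def trace_to_source(node, derives_from):
    """Follow 'derives from' links from a node back to the originating source."""
    chain = [node]
    visited = {node}
    while chain[-1] in derives_from:
        parent = derives_from[chain[-1]]
        if parent in visited:
            # Guard against malformed input that would loop forever.
            raise ValueError("cycle detected at " + parent)
        chain.append(parent)
        visited.add(parent)
    return chain


for data_file in ("h1-s1.cel", "h2-s1.cel"):
    print(" <- ".join(trace_to_source(data_file, DERIVES_FROM)))
```

Keeping these links explicit is what allows a reviewer or a reuser to answer the question "which sample, from which source, produced this file?" without consulting the authors.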

http://gigasciencejournal.com

http://gigadb.org/dataset/100035


Experimental metadata or structured component (in-house curated, machine-readable formats)

Article or narrative component (PDF and HTML)

A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these.

Credit for sharing your data

Focused on reuse and reproducibility

Peer reviewed, curated

Promoting Community Data Repositories

Open Access

SOAPdenovo2

http://isa-tools.github.io/soapdenovo2

• Galaxy workflows to re-enact the data analysis
• Nanopub: represents structured data along with its provenance in a single publishable and citable entity
• ResearchObject: enables the aggregation of the digital resources contributing to findings of computational research, including results, data and software, as citable compound digital objects

Reproducing SOAPdenovo2 results: Galaxy workflows

S. aureus pipeline

“genome coverage increased over the human data when comparing SOAPdenovo2 against SOAPdenovo1”

Response Variables

• genome coverage
• computation run time
• memory consumption
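A finding such as the one quoted above could be packaged as a nanopublication-style record. Nanopublications are generally described as having three parts: an assertion, its provenance, and publication information. The following is a minimal Python sketch of that shape only. The class name `NanopubSketch`, the predicate strings and the author placeholder are all hypothetical, and no real nanopublication library or RDF vocabulary is used.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A statement as (subject, predicate, object); plain strings for illustration.
Triple = Tuple[str, str, str]


@dataclass
class NanopubSketch:
    """Hypothetical container mirroring the three parts of a nanopublication."""

    assertion: List[Triple]
    provenance: List[Triple]
    publication_info: List[Triple]

    def is_well_formed(self) -> bool:
        # In this sketch, a record is usable only if all three parts are present.
        return bool(self.assertion) and bool(self.provenance) and bool(self.publication_info)


finding = NanopubSketch(
    assertion=[
        # Hypothetical predicate paraphrasing the quoted finding.
        (
            "SOAPdenovo2 assembly of human data",
            "hasHigherGenomeCoverageThan",
            "SOAPdenovo1 assembly of human data",
        )
    ],
    provenance=[
        # Hypothetical link from the assertion to the material it came from.
        ("assertion", "wasDerivedFrom", "http://isa-tools.github.io/soapdenovo2")
    ],
    publication_info=[
        # Placeholder author; not a real attribution.
        ("nanopub", "createdBy", "example-author (hypothetical)")
    ],
)

print(finding.is_well_formed())
```

Separating the claim from its provenance means the claim can be cited on its own while still pointing back to the data and workflow that support it.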

OntoMaton: a Bioportal powered ontology widget for Google Spreadsheets. Maguire et al. (2013), Bioinformatics

• widget for ontology annotation and tagging on Google Spreadsheets
• relying on BioPortal and Linked Open Vocabularies services

NanoMaton https://github.com/ISA-tools/NanoMaton

Ontology for Biomedical Investigations

Semanticscience Integrated Ontology

[Figure: experimental workflow diagram (Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment), annotated: FAIR data (Findable, Accessible, Interoperable, Reusable).]

Contributing to MetaboLights and ISA

• BBSRC UK-China Award & BGI funded Hackathon
• Venue: BGI Hong Kong
• Participants: MetaboLights/BGI/ISA/Birmingham/Hong Kong University
• Outcome:
  • ISA-Tab web viewer code
  • Functional Specifications & Code for DoE Wizard API
  • 4 datasets coded in ISA format
  • Conversion of MetaboLights datasets to RDF

funders

acknowledgements

Scott Edmunds, GigaScience

Peter Li, GigaScience

Jun Zhao, Lancaster University

María Susana Avila García, Oxford University

Marco Roos, Leiden University

Mark Thompson, Leiden University

Ruibang Luo, University of Hong Kong

Tin-Lap Lee, Chinese University of Hong Kong

Tak-wah Lam, University of Hong Kong

Questions? You can email us...

isatools@googlegroups.com

View our blog http://isatools.wordpress.com

Follow us on Twitter @isatools

View our websites

View our Git repo & contribute http://github.com/ISA-tools

Thanks for your attention!