4A2B2C-2013

Alejandra González-Beltrán, Ph.D, University of Oxford e-Research Centre, UK. The ISA infrastructure: supporting bio-scientists from experimental design to data publication. [email protected] 4to. Congreso Argentino de Bioinformática y Biología Computacional (4CAB2C) & 4ta. Conferencia Internacional de la Sociedad Iberoamericana de Bioinformática (SoIBio), 29-31 October 2013, Rosario, Argentina.


Transcript of 4A2B2C-2013

Page 1: 4A2B2C-2013

Alejandra González-Beltrán, Ph.D

University of Oxford e-Research Centre, UK

The ISA infrastructure: supporting bio-scientists from experimental design to data publication

[email protected]

4to. Congreso Argentino de Bioinformática y Biología Computacional (4CAB2C) & 4ta. Conferencia Internacional de la Sociedad Iberoamericana de Bioinformática (SoIBio)

29-31 October 2013, Rosario, Argentina

Page 2: 4A2B2C-2013

http://www.nature.com/news/2011/110111/full/469139a.html

Page 3: 4A2B2C-2013

http://www.nature.com/news/2011/110111/full/469139a.html

http://www.economist.com/node/215285937

Page 4: 4A2B2C-2013

http://www.nature.com/news/2011/110111/full/469139a.html

http://www.economist.com/node/215285937

http://www.nytimes.com/2011/07/08/health/research/08genes.html

Page 5: 4A2B2C-2013

Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009). doi:10.1038/ng.295

Page 7: 4A2B2C-2013

Experimental workflow

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication. Entry points: use existing data, perform new experiment.]

Page 8: 4A2B2C-2013

Experimental workflow

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication. Entry points: use existing data, perform new experiment. Additional label: data + metadata.]

Page 9: 4A2B2C-2013

Experimental workflow

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication. Entry points: use existing data, perform new experiment. Additional labels: data + metadata; Science Reproducibility.]

Page 12: 4A2B2C-2013

Experimental workflow

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication. Entry points: use existing data, perform new experiment.]

Page 13: 4A2B2C-2013

Experimental workflow

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication. Entry points: use existing data, perform new experiment. Additional label: retrospective metadata.]

Page 14: 4A2B2C-2013

Experimental workflow

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication. Entry points: use existing data, perform new experiment. Additional label: metadata.]

Page 15: 4A2B2C-2013

Experimental workflow

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication. Entry points: use existing data, perform new experiment. Additional labels: prospective metadata, with "metadata" repeated across the workflow stages.]

Page 16: 4A2B2C-2013

Experimental workflow

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication. Entry points: use existing data, perform new experiment. Additional labels: prospective metadata, with "metadata" repeated across the workflow stages; metadata tracking infrastructure.]

Page 18: 4A2B2C-2013

Experimental workflow

[Diagram: two copies of the workflow, each labelled Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication, with entry points "use existing data" and "perform new experiment". Additional label: Data Reusability.]

Page 19: 4A2B2C-2013

Experimental workflow

[Diagram: two copies of the workflow, each labelled Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication, with entry points "use existing data" and "perform new experiment". Additional labels: Data Reusability; Traceability; Evidence; Provenance; Assessment; Accountability; Retrieval; Mining; Science Reproducibility.]

Page 20: 4A2B2C-2013

Motivation

Data Collection: heterogeneous experimental data

Page 21: 4A2B2C-2013

Motivation

Publication: formats and database fragmentation

Page 22: 4A2B2C-2013

Roadmap

• Importance of data+metadata availability

• Experimental workflow

• Multi-omic experiments, heterogeneous data & formats

• The Investigation/Study/Assay (ISA) infrastructure

• Experimental workflow revisited

Page 24: 4A2B2C-2013

Different communities

Page 25: 4A2B2C-2013

Different communities

• report the same core, essential information
• use the same term to refer to the same ‘thing’
• allow data to flow from one system to another

Page 26: 4A2B2C-2013

Different communities

• report the same core, essential information
• use the same term to refer to the same ‘thing’
• allow data to flow from one system to another

Challenges: lack of interaction & coordination, duplication of effort, fragmentation & uneven coverage... hampers interoperability

Page 27: 4A2B2C-2013

Planning

Data Collection

Page 31: 4A2B2C-2013

The ISA infrastructure

• generic format for experimental description and data exchange
• open source software tools
• community engagement

Page 32: 4A2B2C-2013

Page 33: 4A2B2C-2013

semantics

structure

Page 35: 4A2B2C-2013
Page 36: 4A2B2C-2013

Experimental workflow - graph representation

[Graph: two sources, H1 and H2 (both H. sapiens, aged 35 and 33 years), give the samples H1.sample1, H1.sample2 and H2.sample1. Samples pass through Labeling (H1.sample1.labeled, H2.sample1.labeled, ...) and Scanning, producing the data files h1-s1.cel, h1-s2.cel and h2-s1.cel.]

Page 37: 4A2B2C-2013

Spreadsheets for end-users

vocabulary for the description of the experimental workflow

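Experimental workflow - graph representation

[Graph: two sources, H1 and H2 (both H. sapiens, aged 35 and 33 years), give the samples H1.sample1, H1.sample2 and H2.sample1. Samples pass through Labeling (H1.sample1.labeled, H2.sample1.labeled, ...) and Scanning, producing the data files h1-s1.cel, h1-s2.cel and h2-s1.cel.]

[Spreadsheet: the same workflow with one row per sample; column headers are not legible in the transcript.]

H. Sapiens | H1 | 35 | Years | H1.sample1 | Labeling | H1.sample1.labeled | Scanning | h1-s1.cel
H. Sapiens | H1 | 35 | Years | H1.sample2 | ...      | ...                | Scanning | h1-s2.cel
H. Sapiens | H2 | 33 | Years | H2.sample1 | Labeling | H2.sample1.labeled | Scanning | h2-s1.cel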

Page 38: 4A2B2C-2013

Spreadsheets for end-users

vocabulary for the description of the experimental workflow

syntactic interoperability across biological experiments of different types

Experimental workflow - graph representation

[Graph: two sources, H1 and H2 (both H. sapiens, aged 35 and 33 years), give the samples H1.sample1, H1.sample2 and H2.sample1. Samples pass through Labeling (H1.sample1.labeled, H2.sample1.labeled, ...) and Scanning, producing the data files h1-s1.cel, h1-s2.cel and h2-s1.cel.]

[Spreadsheet: the same workflow with one row per sample; column headers are not legible in the transcript.]

H. Sapiens | H1 | 35 | Years | H1.sample1 | Labeling | H1.sample1.labeled | Scanning | h1-s1.cel
H. Sapiens | H1 | 35 | Years | H1.sample2 | ...      | ...                | Scanning | h1-s2.cel
H. Sapiens | H2 | 33 | Years | H2.sample1 | Labeling | H2.sample1.labeled | Scanning | h2-s1.cel
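
The relationship between the graph and the spreadsheet on this slide can be sketched in a few lines of Python: each path from a source to a data file becomes one spreadsheet row. This is only an illustrative sketch. The node names come from the slide, but the adjacency-list layout, the `"..."` placeholder for the step elided on the slide, and the function name `paths_to_rows` are assumptions, not part of the ISA tools.

```python
from typing import Dict, List

# Workflow graph as an adjacency list (node -> downstream nodes).
# Node names are copied from the slide; "..." stands for the labeled
# extract of H1.sample2, which the slide elides.
GRAPH: Dict[str, List[str]] = {
    "H1": ["H1.sample1", "H1.sample2"],
    "H2": ["H2.sample1"],
    "H1.sample1": ["H1.sample1.labeled"],
    "H1.sample2": ["..."],
    "H2.sample1": ["H2.sample1.labeled"],
    "H1.sample1.labeled": ["h1-s1.cel"],
    "...": ["h1-s2.cel"],
    "H2.sample1.labeled": ["h2-s1.cel"],
}


def paths_to_rows(graph: Dict[str, List[str]], sources: List[str]) -> List[List[str]]:
    """Return one row (list of node names) per path from a source to a leaf."""
    rows: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]
        children = graph.get(node, [])
        if not children:
            # A node without children is a data file: the path is a full row.
            rows.append(path)
            return
        for child in children:
            walk(child, path)

    for source in sources:
        walk(source, [])
    return rows


if __name__ == "__main__":
    for row in paths_to_rows(GRAPH, ["H1", "H2"]):
        print("\t".join(row))
```

Shared upstream nodes (here the source H1) are repeated on every row that descends from them, which is why the spreadsheet view is redundant but easy for end-users to edit.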

Page 39: 4A2B2C-2013

Protocol / Process: Parameter Value[…], Date (day effect), Performer (operator effect)

Material Node: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]

Data File Node: Raw Data File, Derived Data File

Material transformations... (Material → DATA)

Sample Name | Material Type | Hybridization Assay Name | Assay Design REF | Array Data File | Protocol REF       | Derived Array Data File
sample1     | genomic DNA   | assay1                   | A-AFFY-107       | assay1.cel      | data normalization | assay1.txt
sample2     | genomic DNA   | assay2                   | A-AFFY-107       | assay2.cel      | data normalization | assay2.txt
sample3     | genomic DNA   | assay3                   | A-AFFY-107       | assay3.cel      | data normalization | assay3.txt
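
Because ISA-Tab tables are tab-delimited text, the assay table on this slide can be read with standard tooling. The following is a minimal sketch using only Python's standard library. The column names and values are copied from the slide; the function names `read_assay_table` and `raw_to_derived` are hypothetical and are not part of any ISA tool.

```python
import csv
import io
from typing import Dict, List

# Tab-delimited assay table; headers and values copied from the slide.
ASSAY_TSV = (
    "Sample Name\tMaterial Type\tHybridization Assay Name\tAssay Design REF\t"
    "Array Data File\tProtocol REF\tDerived Array Data File\n"
    "sample1\tgenomic DNA\tassay1\tA-AFFY-107\tassay1.cel\tdata normalization\tassay1.txt\n"
    "sample2\tgenomic DNA\tassay2\tA-AFFY-107\tassay2.cel\tdata normalization\tassay2.txt\n"
    "sample3\tgenomic DNA\tassay3\tA-AFFY-107\tassay3.cel\tdata normalization\tassay3.txt\n"
)


def read_assay_table(text: str) -> List[Dict[str, str]]:
    """Parse a tab-delimited table into one dict per row, keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def raw_to_derived(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Map each raw array data file to the derived file produced from it."""
    return {row["Array Data File"]: row["Derived Array Data File"] for row in rows}


if __name__ == "__main__":
    table = read_assay_table(ASSAY_TSV)
    for raw, derived in raw_to_derived(table).items():
        print(f"{raw} -> {derived}")
```

One caveat on the design: real ISA-Tab files may repeat a header such as Protocol REF several times in one table, and a dict-per-row reader keeps only the last value for a repeated key, so a position-based reader would be needed there.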

Page 40: 4A2B2C-2013

A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or format) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:

• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• also by communities working to build a library of cellular signatures
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics

Page 41: 4A2B2C-2013

Implementation at Harvard

ISA

http://discovery.hsci.harvard.edu/

Page 42: 4A2B2C-2013

Implementation at the European Bioinformatics Institute

http://www.ebi.ac.uk/metabolights

Page 43: 4A2B2C-2013
Page 44: 4A2B2C-2013
Page 45: 4A2B2C-2013

Page 46: 4A2B2C-2013

Create template(s) to fit the type of experiments to be described

Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. configuring the value(s) allowed for each field to be
•  text (with/without regular expression testing),
•  ontology terms,
•  numbers etc.

We now have GSC compliant configurations for submission to ENA.
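
The idea of a template that restricts each field to text (optionally checked against a regular expression), ontology terms or numbers can be sketched as follows. This is a hypothetical illustration in Python, not the ISA configuration format: the `FieldRule` class, the field names, the regular expression and the small term list are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class FieldRule:
    """Allowed values for one template field (hypothetical structure)."""
    kind: str                                # "text", "ontology" or "number"
    pattern: Optional[str] = None            # optional regex for text fields
    allowed_terms: Optional[Set[str]] = None  # term list for ontology fields


def is_valid(rule: FieldRule, value: str) -> bool:
    """Check a single value against its field rule."""
    if rule.kind == "text":
        return rule.pattern is None or re.fullmatch(rule.pattern, value) is not None
    if rule.kind == "ontology":
        return value in (rule.allowed_terms or set())
    if rule.kind == "number":
        try:
            float(value)
        except ValueError:
            return False
        return True
    raise ValueError(f"unknown field kind: {rule.kind}")


# Example template; field names and allowed terms are illustrative only.
TEMPLATE: Dict[str, FieldRule] = {
    "Sample Name": FieldRule("text", pattern=r"[A-Za-z0-9_.-]+"),
    "Material Type": FieldRule("ontology", allowed_terms={"genomic DNA", "total RNA"}),
    "Age": FieldRule("number"),
}


def validate_row(template: Dict[str, FieldRule], row: Dict[str, str]) -> List[str]:
    """Return the names of the fields whose values break their rule."""
    return [
        name for name, rule in template.items()
        if not is_valid(rule, row.get(name, ""))
    ]


if __name__ == "__main__":
    good = {"Sample Name": "sample1", "Material Type": "genomic DNA", "Age": "35"}
    bad = {"Sample Name": "bad name!", "Material Type": "plasma", "Age": "old"}
    print(validate_row(TEMPLATE, good))
    print(validate_row(TEMPLATE, bad))
```

In a real deployment the ontology check would query an ontology service rather than a fixed set; the fixed set keeps the sketch self-contained.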

Page 47: 4A2B2C-2013

Or describe and curate your experiment using a desktop-based tool

Report and edit the description using this tool (also customized using the templates), with a spreadsheet-like look and feel, packed with functionalities such as
•  ontology search (access via )
•  term-tagging features
•  import from spreadsheets etc.

Page 48: 4A2B2C-2013

Developed to be a user friendly way to enter standards-compliant metadata: it has lots of features...

Data Collection

Page 49: 4A2B2C-2013

Design Wizard

Planning

Page 50: 4A2B2C-2013

OntoMaton: a Bioportal powered Ontology widget for Google Spreadsheets. Maguire et al., 2013, Bioinformatics

Data Collection

•  Ontology search and automated tagging (relying on NCBO Bioportal services) on Google Spreadsheets
•  Collaborative annotation; support for distributed users
•  Version control & history

Page 51: 4A2B2C-2013
Page 52: 4A2B2C-2013

Data Management

Page 54: 4A2B2C-2013

Data Management

Shifting towards a new system

Page 57: 4A2B2C-2013
Page 58: 4A2B2C-2013

Analysis

The interesting bit... doing something with our data and metadata...

Analysis of ISA-Tab data in the R language. Brings together the context and data to enable more meaningful analysis.

Also suggests packages to use for analysis based on the data types in the ISA-Tab file.

Analysis of ISA-Tab data in the Galaxy environment.

Creates Galaxy Library objects from ISA-Tab files.

Analysis of ISA-Tab data in the GenomeSpace environment.

Load and edit files stored on distributed servers.

Created by Brad Chapman at the Harvard School of Public Health

Page 59: 4A2B2C-2013

[Figure: example experiment on Arabidopsis thaliana, with steps numbered 1 to 6, from Experiment Design to Analysis. Labels: Treatment groups (70%, 90%, 100%); Collect Samples; Run Assays; SAMPLE 1 to SAMPLE 11; FILE 1 to FILE 8 (remaining file labels truncated).]

Parses ISA-Tab datasets into R objects, allowing them to be updated and saved after analysis.

Bridges the ISA-Tab metadata to analysis pipelines of specific assay types, by building objects for use in other R packages downstream: currently considering mass spectrometry (xcms package, xcmsSet) and DNA microarray (Biobase package, ExpressionSet).

Suggests packages in Bioconductor that might be relevant for an assay type, according to the biocViews annotations.

Gonzalez-Beltran et al. The Risa R/Bioconductor package: integrative data analysis from experimental metadata and back again. In press
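
The step of bringing metadata and data together before analysis can be illustrated with a small Python analogue. This is not Risa's API (Risa is an R package). The table below is invented for illustration: the column name `Factor Value[treatment]`, the assignment of samples to treatment groups and the function name `files_by_factor` are all assumptions.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Hypothetical tab-delimited table linking samples, a treatment factor and
# data files. Sample-to-group assignments are invented for this example.
STUDY_TSV = (
    "Sample Name\tFactor Value[treatment]\tRaw Data File\n"
    "SAMPLE1\t70%\tFILE 1\n"
    "SAMPLE2\t90%\tFILE 2\n"
    "SAMPLE3\t100%\tFILE 3\n"
    "SAMPLE4\t70%\tFILE 4\n"
)


def files_by_factor(text: str, factor_column: str) -> Dict[str, List[str]]:
    """Group data files by the value of one experimental factor."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        groups[row[factor_column]].append(row["Raw Data File"])
    return dict(groups)


if __name__ == "__main__":
    for level, files in files_by_factor(STUDY_TSV, "Factor Value[treatment]").items():
        print(level, files)
```

Grouping files by factor value in this way is what lets a downstream analysis compare treatment groups without the analyst re-entering the experimental design by hand.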

Page 60: 4A2B2C-2013
Page 61: 4A2B2C-2013

Publication: data submission

Page 64: 4A2B2C-2013
Page 65: 4A2B2C-2013

Publication

Getting your work out there...

Publish, along with your research articles & specialised community repositories

Share, link and reason over experiments with linked data

Page 68: 4A2B2C-2013

• New open-access, online-only publication for descriptions of scientifically valuable datasets
• Only content type: Data Descriptor, narrative + structured parts
• Initially focused on the life, environmental and biomedical sciences
• Data Descriptor will be complementary to traditional research journals and data repositories
• Designed to foster data sharing and reuse, and ultimately to accelerate scientific discovery

www.nature.com/scientificdata

Page 69: 4A2B2C-2013

Narrative Section

A brief article-like document with:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Usage Notes
• Figures & Tables
• References

Structured Section

Detailed descriptions of the experimental procedures used to produce the data
• Following community-defined minimum information requirements
• for a level of detail sufficient to reproduce the experiments
• Using ontologies & controlled vocabularies
• To maximise consistency of the descriptions

www.nature.com/scientificdata

Data Descriptors served by Scientific Data

Page 71: 4A2B2C-2013
Page 72: 4A2B2C-2013

[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Science Reproducibility.]

Page 73: 4A2B2C-2013

funders

core ISA team

Page 74: 4A2B2C-2013

Questions?

You can email [email protected]

View our blog: http://isatools.wordpress.com

Follow us on Twitter: @isatools

View our website: http://www.isa-tools.org

View our Git repo & contribute: http://github.com/ISA-tools

Thanks for your attention!