Data Driven Innovation - Interoperable Genebanks (Tech Track Session)

Data Driven Innovation Interoperability Tech Track (#agridata) 18 & 19 March 2015, Wageningen (@rfinkers)

Outline

Introduction “Interoperable Genetic Diversity”

Concept “Bring Your Own Data” party

Aim BYOD Green Genetics?

Outcome BYOD Green Genetics

Hands on


Climate change & Social disruption

Photograph: AFP/Getty Images, http://www.theguardian.com/commentisfree/2015/mar/08/guardian-view-climate-change-social-disruption#img-1

Select a genetically diverse collection


Legacy databases (e.g. Uniprot)

Genome Sequence & Genome Annotation

Genome Variation Data (re-sequencing collections) & SNP annotation

Accession Passport Information

Accession Phenotype Information

Web based aggregation of Information


Interoperable Genetic Diversity

Genebanks should utilize genomics data

● But should not store them!

Genomics studies should make variant data available

● But need access to passport and characterization & evaluation data.

Breeders need tools to access diversity

Finkers, van Hintum et al. 2014 DOI: 10.1017/S1479262114000689

Genebank(s)

Genomics provider(s)

Intermezzo: Linked Open Data

Standardization makes the information interoperable

● Controlled vocabularies

● Machine readable

● Can all be queried by a single question vs. visiting many websites

Interoperable Genetic Diversity (2)

Implications:

● Data can be stored at many different locations, but can be found by computers

● Newly published information (in the correct format) will be included automatically.

● Tools can be written to answer dedicated questions, such as assessing allelic variation, or to support collection management
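
A minimal Python sketch of these implications, using plain tuples rather than any particular linked data technology. All identifiers, predicate names, and values below are invented placeholders, not data from the BYOD. Two sources publish statements about the same accession identifier, and one question is answered over their union.

```python
# Hypothetical (subject, predicate, object) statements; every name is made up.
genebank_triples = {
    ("acc:EX001", "hasCountryOfOrigin", "Peru"),
    ("acc:EX002", "hasCountryOfOrigin", "Ecuador"),
}
genomics_triples = {
    ("acc:EX001", "hasVariantIn", "gene:G1"),
    ("acc:EX002", "hasVariantIn", "gene:G2"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def origins_of_accessions_with_variant(triples, gene):
    """Single question over all sources: origin of accessions with a variant in `gene`."""
    result = {}
    for acc, _, _ in match(triples, p="hasVariantIn", o=gene):
        for _, _, country in match(triples, s=acc, p="hasCountryOfOrigin"):
            result[acc] = country
    return result


# The union stands in for data that stays at its source but shares identifiers.
combined = genebank_triples | genomics_triples
answer = origins_of_accessions_with_variant(combined, "gene:G1")
```

Because the question is asked against the combined statements, a source that later publishes new statements with the same identifiers is picked up without changing the question.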


Interdisciplinary Approach Needed


Need for Data Scientists & Domain Experts


Format: Bring your own Data Workshop

1. Users define the question(s)

2. Users and Linked data experts define concepts and ontologies

3. Experts help to create linked data and formulate query

Bring Your Own Data Workshop

More Info: http://www.dtls.nl/fair-data/byod/


Data owners

Domain Experts

Trainers

Linked Data Experts

Example: Solanaceae Trait Ontology

BYOD in action


Example Query


Outcome: Query Graph


FAIRport* in VLPB?

*More on FAIRport in the presentation of Luiz Bonino, Thursday 10:30

Summary

Blueprint “Interoperable Genetic Diversity” shown

BYOD resulted in interoperable data which could be queried

● Request your own BYOD?

Public <-> Private integration possible


Working Prototype

[Screenshot of the working prototype]

Questions?

Acknowledgements:

BYOD team

Theo van Hintum & Frank Menting (CGN)

Denis Guryunov & Martijn van Kaauwen (prototype)

et al.

HaploSmasher Hands On Session

HaploSmasher Prototype:

● Genomic regions as input: SL2.40ch03:10000..10200

● Solyc gene identifiers: Solyc10g085020

● Filter SNPs on impact type (SnpEff): HIGH, MODERATE, LOW, MODIFIER

● No input validation yet: use correct notation and existing Solyc gene IDs
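
Since the prototype does not validate input, a user could pre-check a query string with something like the Python sketch below. The patterns are assumptions inferred only from the two examples above (a region written as name:start..end, and a gene ID written as Solyc, two digits, g, six digits); the function name `parse_query` is hypothetical and not part of HaploSmasher. A format check like this cannot tell whether a gene ID actually exists.

```python
import re

# Assumed formats, inferred from the slide examples only.
REGION_RE = re.compile(r"^(?P<seq>[A-Za-z0-9_.]+):(?P<start>\d+)\.\.(?P<end>\d+)$")
SOLYC_RE = re.compile(r"^Solyc\d{2}g\d{6}$")


def parse_query(text):
    """Classify a query as a genomic region or a Solyc gene ID.

    Raises ValueError when the string matches neither assumed format.
    """
    text = text.strip()
    m = REGION_RE.match(text)
    if m:
        start, end = int(m.group("start")), int(m.group("end"))
        if start > end:
            raise ValueError(f"region start {start} is after end {end}")
        return {"kind": "region", "seq": m.group("seq"), "start": start, "end": end}
    if SOLYC_RE.match(text):
        return {"kind": "gene", "id": text}
    raise ValueError(f"unrecognised query: {text!r}")


region = parse_query("SL2.40ch03:10000..10200")
gene = parse_query("Solyc10g085020")
```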

HaploSmasher


Query CGN FAIRdata graph

● The prototype currently generates links to CGN passport data only

● Graph data of three CGN accessions is available in our test set

HaploSmasher examples:

Haplotype Output

Example queries

http://www.plantbreeding.wur.nl/hs/

Also, explore variation data & Linked resources

● http://www.tomatogenome.net

Examples:

● Beta-tubulin: Solyc10g085020 (HIGH & MODERATE vs. ALL effects)

● Glutamate dehydrogenase: Solyc05g052100

● Uridine kinase: Solyc02g067880

● Magnesium chelatase: Solyc04g015750
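
The HIGH & MODERATE vs. ALL comparison comes down to filtering SNP records on their impact category. The Python sketch below illustrates that step under stated assumptions: the record layout (a dict with `pos` and `impact`) and the example records are invented, and this is not the prototype's code. Only the four category names come from the slides.

```python
# The four impact categories named on the slide.
IMPACTS = ("HIGH", "MODERATE", "LOW", "MODIFIER")


def filter_by_impact(snps, allowed=IMPACTS):
    """Keep SNP records whose impact category is in `allowed`."""
    allowed = set(allowed)
    unknown = allowed - set(IMPACTS)
    if unknown:
        raise ValueError(f"unknown impact categories: {sorted(unknown)}")
    return [snp for snp in snps if snp["impact"] in allowed]


# Made-up records for illustration only.
snps = [
    {"pos": 101, "impact": "HIGH"},
    {"pos": 150, "impact": "MODIFIER"},
    {"pos": 163, "impact": "MODERATE"},
    {"pos": 188, "impact": "LOW"},
]

strict = filter_by_impact(snps, ("HIGH", "MODERATE"))  # likely protein-changing only
everything = filter_by_impact(snps)                    # all impact types
```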


HaploSmasher examples:

Conserved housekeeping genes:

● Beta-tubulin Solyc10g085020 439 AA

● 1 SNP (HIGH & MODERATE effect), two haplotypes
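
Why one SNP gives two haplotypes can be shown with a small Python sketch. The slides do not describe how HaploSmasher computes haplotypes, so this is a simplification that assumes one allele call per accession per SNP and groups accessions with identical calls. The accession names and alleles are made up.

```python
from collections import defaultdict


def group_haplotypes(genotypes):
    """Group accessions that share the same alleles at every selected SNP."""
    groups = defaultdict(list)
    for accession, alleles in genotypes.items():
        groups[tuple(alleles)].append(accession)
    return {haplotype: sorted(accs) for haplotype, accs in groups.items()}


# Hypothetical calls at a single biallelic SNP.
genotypes = {"acc_A": ["G"], "acc_B": ["G"], "acc_C": ["T"]}
haplotypes = group_haplotypes(genotypes)
```

With more SNPs selected (for example all impact types instead of HIGH and MODERATE only), each accession carries a longer allele tuple, so the same collection can split into more groups.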

HaploSmasher examples:

● Beta-tubulin Solyc10g085020 439 AA

● 136 SNPs (all SnpEff impact types)

● Part of haplotype groups:

HaploSmasher examples:

● Glutamate dehydrogenase Solyc05g052100

● 13 SNPs (HIGH, MODERATE)

HaploSmasher examples:

● Uridine kinase Solyc02g067880

● 23 SNPs (HIGH, MODERATE)

● Example haplotype groups:

HaploSmasher examples:

● Magnesium chelatase Solyc04g015750

● 21 SNPs (HIGH, MODERATE)

● Example haplotype groups: