Towards an intelligent framework to quickly find data from distributed heterogeneous biomedical resources. Despoina Antonakaki, Dasha Zhernakova, Erik Roos, K Joeri van der Velde, Mark Kiestra, Tomasz Adamusiak, Niran Abeygunawardena, Helen Parkinson, Rolf Sijmons, Morris A. Swertz


Page 1

Towards an intelligent framework to quickly find data from distributed heterogeneous biomedical resources.

Despoina Antonakaki, Dasha Zhernakova, Erik Roos, K Joeri van der Velde, Mark Kiestra, Tomasz Adamusiak, Niran Abeygunawardena, Helen Parkinson, Rolf Sijmons, Morris A. Swertz

Page 2

Biologists' challenges: a web of data

① Find data
– Many different resources
• local, structured (ArrayExpress), free text (PubMed)
– Type in many search boxes
• Google, NCBI/Entrez, EBI/EB-eye, KEGG/DBGET
② Merge and pool data
– Big Excel file (trying to make headers fit)
③ Size of data
– Working for weeks (map and match)

Major problem: “Using Microsoft Word as sequence annotation tool”

Page 3

Informatics challenges: too many silos…

① Differences in terminology
– Need to reach “hidden”, structured data: DB-encapsulated, legacy
– Different conceptualization of information
② Differences in formats and structure
– Too many formats for specifying & describing biomedical entities
• no standard representation model
③ Automatic matching and merging
– Difficult to merge into a single query
• Working for weeks (map & match)
④ Query across silos

[Diagram: DB1, DB2, DB3 in Format 1, Format 2, Format 3]

Page 4

Local? National? EU? Global?

Connecting different ‘biobanks’?

Wanted: ‘meta’ search infrastructure to
• Find me cases
• Find me cohorts/partners

[Diagram: one query sent to LifeLines, GenerationR, TweelingReg, PSI, CeliacDisease]

Page 5

Outline

• Three challenges for biologists and the corresponding informatics challenges:
1. Merge and pool data: differences in formats and structure
2. Find data: differences in terminology
3. Size of data: automatic matching and merging
4. Across data sets: all of the above + distribution
• Approaches
1. Integrate data into one ‘pheno’ model (MOLGENIS)
2. Use ontologies (OntoCAT)
3. Indexing (Lucene)
4. Query expansion (Lucene + OntoCAT)
• Discussion
1. Federated data queries (MOLGENIS & RDF)

Page 6

① Data warehouse: put it all in one place?

[Diagram: loading … into Pheno-OM]

Page 7

① Pheno-OM data model

Flexible: any feature, value, and target combination

[Diagram: object model with the entities ObservationTarget (Individual; Panel/cohort/Biobanks), ObservableFeature, ObservedValue, Protocol, ProtocolApplication, ObservedRelation and InferredValue, linked by associations with * multiplicities and time attributes. Example: target Ind1, feature Height, value 179cm.]

http://wwwdev.ebi.ac.uk/microarray-srv/pheno/doc/objectmodel.html
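The feature/value/target combination on this slide can be sketched as three small record types. This is only an illustration under stated assumptions: the class and field names below are Python stand-ins chosen to mirror the diagram labels, not the actual Pheno-OM or MOLGENIS classes, and splitting "179cm" into a value and a unit is a choice made here for readability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObservationTarget:
    """Anything that can be observed, e.g. an individual or a panel."""
    name: str


@dataclass(frozen=True)
class ObservableFeature:
    """What is measured; the unit field is an assumption of this sketch."""
    name: str
    unit: Optional[str] = None


@dataclass(frozen=True)
class ObservedValue:
    """One value of one feature for one target, optionally time-stamped."""
    target: ObservationTarget
    feature: ObservableFeature
    value: str
    time: Optional[str] = None


def values_for(observations, feature_name):
    """Return (target name, value) pairs recorded for the named feature."""
    return [(o.target.name, o.value)
            for o in observations
            if o.feature.name == feature_name]


# The example from the slide: individual Ind1 has a height of 179 cm.
ind1 = ObservationTarget("Ind1")
height = ObservableFeature("Height", unit="cm")
observations = [ObservedValue(ind1, height, "179")]
print(values_for(observations, "Height"))
```

Because every measurement is a row of the same shape, a new feature needs a new ObservableFeature record rather than a new table column, which is what makes the model flexible.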

Page 8

An example of Excel data

• Or BBMRI-NL

Page 9

② Use ontologies

To overcome different terminologies, two approaches:
1. Use ontologies to annotate the source
• Of course, this depends on other parties
2. Use ontologies for query expansion (synonyms, part of, subclasses)

[Diagram: query “Deformed ears?”; data labelled “Abnormally shaped ears” in Pheno-DB; ontologies with mappings; index]

HPO: Abnormally shaped ears, Auricular malformation, Deformed auricles, Deformed ears, Malformed auricles, Malformed ears, Malformed external ears

MP: Abnormally shaped ears, Auricular malformation, Deformed auricles, Deformed ears, Malformed auricles, Malformed ears, Malformed external ears
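A minimal sketch of the second approach, synonym-based query expansion, assuming the synonym groups are already available in memory. The table below is toy data copied from the slide; the function name and the dictionary layout are invented for illustration and are not part of OntoCAT.

```python
# Toy synonym groups copied from the slide; real ontologies are far larger.
EAR_TERMS = [
    "Abnormally shaped ears", "Auricular malformation", "Deformed auricles",
    "Deformed ears", "Malformed auricles", "Malformed ears",
    "Malformed external ears",
]
SYNONYM_GROUPS = {"HPO": EAR_TERMS, "MP": EAR_TERMS}


def expand(term, groups=SYNONYM_GROUPS):
    """Return the term followed by every synonym from a group containing it."""
    cleaned = term.strip()
    seen = {cleaned.lower()}
    expanded = [cleaned]
    for group in groups.values():
        if cleaned.lower() not in (s.lower() for s in group):
            continue  # this ontology does not know the term
        for synonym in group:
            if synonym.lower() not in seen:
                seen.add(synonym.lower())
                expanded.append(synonym)
    return expanded


print(expand("Deformed ears"))
```

A search for "Deformed ears" would then also match records annotated as "Abnormally shaped ears", which is the mismatch the slide illustrates.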


Page 11

Complexity in ontologies

• ...sometimes they change unpredictably...
• ...or sometimes they suddenly become unavailable...

Searching across different ontologies requires expert knowledge

Page 12

Some facts…

• NCBO BioPortal:
– 204 ontologies, 29 REST signatures…
– BUT: REST signatures change/break without notice
• OLS: 79 OBO ontologies, 16 web service signatures; stable, open, local
– BUT: not as rich, rudimentary documentation
• Individual users create their own ontologies
• Integration is hard…

[Diagram labels: Ontology Browser, EFO Bioportal Import, OntoAPI, OWL API]

Page 13

OntoCAT hides the complexity (ontocat.org)

searchOntology()
getChildren()
getParents()
getSynonyms()
getDefinitions()
...
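To make the idea of one uniform interface concrete, here is a small in-memory sketch in Python. It borrows the operation names from the slide (in snake_case) but is not OntoCAT's Java API: the Term fields, the class name, and the EX: accessions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    accession: str
    label: str
    synonyms: List[str] = field(default_factory=list)
    definition: str = ""
    parents: List[str] = field(default_factory=list)  # parent accessions


class InMemoryOntologyService:
    """Hypothetical ontology service backed by a dict instead of a web API."""

    def __init__(self, terms):
        self._terms = {t.accession: t for t in terms}

    def search_ontology(self, text):
        """Find terms whose label or synonyms contain the text."""
        needle = text.lower()
        return [t for t in self._terms.values()
                if needle in t.label.lower()
                or any(needle in s.lower() for s in t.synonyms)]

    def get_parents(self, accession):
        return [self._terms[p] for p in self._terms[accession].parents]

    def get_children(self, accession):
        return [t for t in self._terms.values() if accession in t.parents]

    def get_synonyms(self, accession):
        return list(self._terms[accession].synonyms)

    def get_definitions(self, accession):
        definition = self._terms[accession].definition
        return [definition] if definition else []


# Invented example terms; the accessions do not exist in any real ontology.
service = InMemoryOntologyService([
    Term("EX:0001", "Ear abnormality"),
    Term("EX:0002", "Abnormally shaped ears",
         synonyms=["Deformed ears", "Malformed ears"], parents=["EX:0001"]),
])
print([t.label for t in service.search_ontology("deformed")])
```

Client code written against these five methods does not need to know whether the terms come from a file, BioPortal or OLS, which is the point of hiding the complexity.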

Page 14

② Generic Ontology Service interface

Implemented in Java 6, open source (LGPL v3). A simple and easy-to-use API for BioPortal, OLS web services and the OWL API (BioportalOntologyService, OlsOntologyService and FileOntologyService).

[Diagram labels: BBMRI ontology, OWL API, HPO, NCBO BioPortal, OLS (EMBL-EBI), OBO files]

Page 15

② Use case diagram of OntoCAT

Use case of a simplified user interaction with existing ontology resources through OntoCAT. Web applications can connect using REST or SOAP services; R connects through the OntoCAT Bioconductor package.

Page 16

② Common workflow to integrate ontology resources

Page 17

② OntoCAT example: find the term “membrane” in multiple ontologies

Page 18

② More examples available

Page 19

② OntoCAT & Zooma use cases

1. Updating ontology properties:
– EFO involves construction of mappings to multiple domain-specific ontologies (Disease, Cell Type)
– Multithreading the OntoCAT requests makes it possible to process & import extra information
• from over 20,000 external ontology terms in less than 10 minutes
2. Annotate user experimental values with ontology terms
– ArrayExpress Archive & Gene Expression Atlas: >1 million unique experiment annotations, annotated from EBI's version of EFO
• Terms that do not exist in EFO have to be checked against publicly available ontologies
– Previously a manual process, now done with Zooma (local EFO, OWL, local DBs)

[Diagram: ArrayExpress Archive and Gene Expression Atlas (> 1 million unique experiment annotations) are annotated with ontology terms from the EBI pre-release version of the application ontology EFO. Not available in EFO? → ?]
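The multithreaded import mentioned in use case 1 can be sketched with a thread pool. This is an assumption-laden illustration: fetch_annotations is a hypothetical placeholder that returns canned data, where a real implementation would call an ontology web service, and the EX: accessions are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_annotations(accession):
    """Hypothetical lookup; a real one would query an ontology service."""
    return {"accession": accession, "label": "label for " + accession}


def fetch_all(accessions, max_workers=8):
    """Run the lookups concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_annotations, accessions))


accessions = ["EX:%04d" % i for i in range(1, 21)]
results = fetch_all(accessions)
print(len(results), results[0])
```

Threads suit this workload because each request spends most of its time waiting on the network, so many lookups can overlap.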

Page 20

② OntoCAT & Zooma use cases

3. Local ontology management
– eXtensible Genotype And Phenotype data platform (XGAP, MOLGENIS): search widget for interactive annotation of data with ontology terms
• Allows searching publicly available ontologies & downloading terms for unambiguous annotation of QTL or GWAS data
4. Data analysis & annotation
– New Bioconductor package, ready to read & query OWL/OBO in R
• Built-in offline support for EFO & BioPortal ontology queries

Page 21

② OntoCAT characteristics & tools

• Provides synonym & definition lookup across the two major implemented ontology services
• Supports interoperability using RDF
• A class that combines multiple ontology resources, including different repositories, behind a single entry point (CompositeOntologyService)
• Cache
• Ranking
• Prioritization
• Fallback mechanism if an ontology resource is unavailable
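The single entry point with cache, prioritization and fallback could look roughly like the sketch below. CompositeService is a hypothetical Python stand-in and not OntoCAT's Java CompositeOntologyService; the two backends are fakes, and treating ConnectionError as the "resource unavailable" signal is an assumption of this example.

```python
class CompositeService:
    """Tries backends in priority order and caches the first useful answer."""

    def __init__(self, backends):
        # (name, lookup function) pairs, highest priority first
        self._backends = list(backends)
        self._cache = {}

    def get_synonyms(self, term):
        key = term.lower()
        if key in self._cache:
            return self._cache[key]
        for name, lookup in self._backends:
            try:
                synonyms = lookup(term)
            except ConnectionError:
                continue  # backend unavailable: fall back to the next one
            if synonyms:
                self._cache[key] = (name, synonyms)
                return self._cache[key]
        return (None, [])


def unavailable_backend(term):
    """Fake remote service that is always down."""
    raise ConnectionError("service is down")


def local_backend(term):
    """Fake local ontology file with a single entry."""
    table = {"deformed ears": ["Malformed ears", "Deformed auricles"]}
    return table.get(term.lower(), [])


service = CompositeService([("remote", unavailable_backend),
                            ("local", local_backend)])
print(service.get_synonyms("Deformed ears"))
```

The caller sees one method and never learns that the preferred backend failed, which matches the fallback behaviour the slide lists.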

Page 22

② Demo on Google App Engine framework

• http://ontocat-web.appspot.com

Page 23

② OntoCAT browser retrieving OLS

http://gbic.target.rug.nl:8080/ontocatbrowser/molgenis.do?__target=main&select=OntocatBrowser


Page 26

③ Indexing: general features

• A data structure that overcomes barriers in large DBs
– Created by using DB tables as the basis for search
– Efficient access to ordered records & rapid random lookup
– Less disk space for storage (key fields)
• Open source Java library (Lucene, known from internet search engines)
– Full-text indexing & searching capability
– Format independent (documents & fields)
• Query expansion:
– Adds related terms (synonyms & children), appended with the OR operator and assigned a lower weight
– Changes the ranking order of the retrieved documents
– Even if query expansion doesn’t improve the search, the query becomes more precise
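The expansion rule described here (related terms appended with OR and given a lower weight) can be sketched as a function that emits a query string in Lucene's classic query syntax, where `^` sets a boost. The function name, the 0.5 boost and the example synonyms are assumptions made for illustration, and the sketch does not escape special characters inside terms.

```python
def build_expanded_query(term, related, boost=0.5):
    """Quote each phrase, join with OR, and down-weight the added terms."""
    clauses = [f'"{term}"']
    for extra in related:
        clauses.append(f'"{extra}"^{boost}')
    return " OR ".join(clauses)


# The related terms below are invented for the example.
query = build_expanded_query("lung disease",
                             ["pulmonary disease", "lung disorder"])
print(query)
```

Keeping the original term at full weight means documents that match the user's own wording still rank above those matched only through a synonym.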

Page 27

③ Indexing: the approach

• Overcome the barriers of searching in large data sets
– Optimize the in-memory representation, e.g. as a tree
– Steps:
1. Create a new index and add documents (fields from the DB, ontology terms from OntoCAT)
2. Analyzer: extracts tokens from the text to be indexed and eliminates the rest
3. Parser: select fields (term/value)
» Tokenized? Indexed? Case sensitive?
4. Collect results

Example of OBO text to be indexed:

def: "Paired, cup-shaped cartilage that are dorsal to the septomaxillae and anterior to the oblique cartilage. The anterior, convex face of each alary cartilage is synchondrotically fused to the superior prenasal cartilage and the ventral edge is fused to the superior margin of the crista intermedia." [AAO:LAP]
related_synonym: "alinasal cartilage" []
related_synonym: "cartilago alaris" []
related_synonym: "cartilago alaris nasi" []
related_synonym: "cartilago cupullaris" []
[Term]
id: AAO:0000289
name: Meckel's_cartilage
def: "Paired, rod-shaped elements that extend the length of the mandible and lie between the dentaries and the angulosplenials." [AAO:LAP]
relationship: part_of AAO:0000274 ! lower_jaw_skeleton
[Term]
id: CHEBI:24431
name: molecular structure
def: "A description of the molecular entity or part thereof based on its composition and/or the connectivity between its constituent atoms." []

[Diagram: the user enters a search term; 1. Analyze query, 2. Parse index, 3. Collect results; output results. Candidate index entries: “Oblique cartilage.” (tokenized?), “cartilago cupullaris” (tokenized?), “Septomaxillae”, “angulosplenials”]
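The steps on this slide (create an index, analyze text into tokens, look terms up, collect results) can be illustrated with a tiny inverted index in plain Python. This is a teaching sketch, not Lucene: the stop-word list, the function names and the AND-only search are simplifications chosen here.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "and", "are", "of", "or", "that", "the", "to"}


def analyze(text):
    """Lower-case, split on non-letters and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query token."""
    tokens = analyze(query)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


# Definitions shortened from the OBO excerpt on the slide.
documents = {
    "AAO:0000289": "Paired, rod-shaped elements that extend the length "
                   "of the mandible",
    "CHEBI:24431": "A description of the molecular entity or part thereof",
}
index = build_index(documents)
print(search(index, "the Mandible"))
```

The questions on the slide (tokenized? case sensitive?) correspond to choices made inside analyze: here the text is split into single words and case is ignored, so "the Mandible" still finds the lower-case token.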

Page 28

③ Indexing DB: implementation


Page 30

④ Query expansion

[Diagram: the query “Deformed ears?” passes through query expansion before reaching the Pheno Warehouse, where the data is labelled “Abnormally shaped ears”.]

Expansion terms:
HPO: Abnormally shaped ears, Auricular malformation, Deformed auricles
MP: Malformed auricles, Malformed ears, Malformed external ears
etc.

Ontology sources: local ontologies (OWL or OBO), CWA, BioPortal, OLS

OntoCAT – Ontology common API tasks: http://www.ontocat.org and http://precedings.nature.com/documents/4666

Page 31

④ Query expansion details & ontology selection

[Screenshot: ontologies used]

Page 32

④ The expanded query & the results

Page 33

Query: lung disease, searching WITHOUT query expansion:

Page 34

④ Indexing: implementation (OntoCAT)

Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.
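As a rough illustration of combining the two models, the sketch below first keeps only documents that share a term with the query (the Boolean step) and then orders them by TF-IDF cosine similarity (the vector space step). It is a simplified textbook version written for this transcript; Lucene's own scoring formula contains further factors that are not reproduced here, and the three documents are invented.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return one {term: weight} vector per document plus the idf table."""
    counts = [Counter(tokenize(d)) for d in documents]
    n = len(documents)
    df = Counter(term for c in counts for term in c)
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{t: tf * idf[t] for t, tf in c.items()} for c in counts]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Boolean step: keep matching documents. VSM step: sort by cosine."""
    vectors, idf = tf_idf_vectors(documents)
    q = {t: tf * idf.get(t, 0.0)
         for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)
              if any(t in v for t in q)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


documents = ["lung disease cohort", "heart disease cohort",
             "lung lung function"]
print(rank("lung disease", documents))
```

The document containing both query words ranks first; documents matching only one word follow, ordered by how much weight that word carries in each of them.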

Page 35

Query: lung disease, searching WITH query expansion:


Page 37

Distributed querying in BBMRI

[Diagram: the query “Deformed ears?” is sent to Twin Registry, Generation R, LifeLines and BBMRI-SE. RDF + OWL?]

OntoCAT – Ontology common API tasks: http://www.ontocat.org and http://precedings.nature.com/documents/4666

Page 38

Federated data queries (MOLGENIS & RDF)

• How to make MOLGENIS data distributed via RDF/SPARQL?

[Diagram: the query “Deformed ears?” and the data label “Abnormally shaped ears”; three DBs, each with ontologies with mappings; RDF and SPARQL between the query and the DBs, with one connection marked “?”]

HPO: Abnormally shaped ears, Auricular malformation, Deformed auricles, Deformed ears, Malformed auricles, Malformed ears, Malformed external ears

MP: Abnormally shaped ears, Auricular malformation, Deformed auricles, Deformed ears, Malformed auricles, Malformed ears, Malformed external ears

Page 39

Discussion & next steps: distributed querying?

• How to map a database to RDF such that it helps querying?
– Diversity: all data in MOLGENIS' pheno model (+ quick, working offline; - have to update all the time)
– Map to all distributed sources “on the fly” (RDF & SPARQL)
– Agree on distributed query mechanisms (+ always up to date; - slow, breaks if sources go offline)
• Investigate other projects like Open Data
– Can MOLGENIS be part of open data?
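One possible answer to the mapping question is to emit each observed value as a handful of RDF statements and query them with SPARQL. The sketch below only builds the N-Triples and the query text as strings. Everything in it is an assumption: the example.org namespace and the property names are invented, no MOLGENIS or RDF library API is used, and values are inserted without escaping.

```python
BASE = "http://example.org/pheno/"  # hypothetical namespace


def to_ntriples(target, feature, value):
    """Serialise one observed value as three N-Triples statements."""
    obs = f"<{BASE}observation/{target}-{feature}>"
    return [
        f"{obs} <{BASE}target> <{BASE}target/{target}> .",
        f"{obs} <{BASE}feature> <{BASE}feature/{feature}> .",
        f'{obs} <{BASE}value> "{value}" .',
    ]


def sparql_for_feature(feature):
    """Build a SPARQL query selecting targets and values for one feature."""
    return (
        "SELECT ?target ?value WHERE { "
        f"?obs <{BASE}feature> <{BASE}feature/{feature}> ; "
        f"<{BASE}target> ?target ; "
        f"<{BASE}value> ?value . "
        "}"
    )


triples = to_ntriples("Ind1", "Height", "179")
print("\n".join(triples))
print(sparql_for_feature("Height"))
```

If every source exposed its data in this shape, the same query text could be sent to each endpoint, at the cost noted on the slide: answers are only as fast and as available as the slowest source.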


Page 41

Thank you for your attention. Questions?

Page 42

• OntoCAT: http://www.ontocat.org/
– http://precedings.nature.com/documents/4666/version/1
– http://www.biomedcentral.com/imedia/1627447285460829_article.pdf
– Guide/examples: http://www.ontocat.org/wiki/OntocatGuide
– Available from: http://gbic.target.rug.nl:8080/ontocatbrowser/molgenis.do?__target=main&select=OntocatBrowser
– OntoCAT demo on Google App Engine framework: http://ontocat-web.appspot.com
• MOLGENIS Lucene index & query expansion app:
– http://www.molgenis.org/svn/molgenis_projects/molgenis4phenotype/handwritten/java/plugins/LuceneIndex/
• Pheno-OM data model: http://wwwdev.ebi.ac.uk/microarray-srv/pheno/doc/objectmodel.html
• XGAP: http://www.xgap.org