Towards a Mashup to Build Bioinformatics Knowledge System
François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, Jean Morissette
Département d'informatique et de génie logiciel, Université Laval
Banff, May 8, 2007. CHUL Research Center, Laval University
Presentation Plan
● Knowledge integration vision
● Bio2RDF architecture
● RDFization of knowledge
● Normalization of URI
● Parkinson example
● Demo
● Conclusion
From the RDF inventor: "Wouldn't it be great if you were able to organize all this information based on your own terms, instead of based on the application you use to access the information?" (1999)
Ramanathan V. Guha
From Wikipedia: Mashup (web application hybrid)
"A mashup is a website or application that combines content from more than one source into an integrated experience." (2007)
Sir Berners-Lee's vision of semantic web
Tim Berners-Lee
« The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. » Scientific American, 2001
http://www.w3.org/2006/Talks/0404-mit-tbl/
Bio2RDF starting vision at ISMB 2005
Thanks to Christopher Baker, Eric Neumann, Kei Cheung and Joanne Luciano for their ideas.
● Too many knowledge sources available for life science scientists
● Too many formats (text, HTML, XML)
● New source each day with specialized tool or web interface
● Integration problem recognized by global community
The knowledge integration problem in bioinformatics
From Carol Goble at ISWC (2004)
From the BioPAX group (2005)
Integration methods in bioinformatics
1) Davidson 1995: “Transform data to the federated database on demand”
2) Köhler 2003: “In different databases the same things can be given different names”
3) Stein 2003: “link integration, view integration and data warehousing”
Data warehouse approaches
http://www.ncbi.nlm.nih.gov/Database/
http://www.genome.jp/dbget/dbget.links.html
Bio2RDF's approach to knowledge integration:
“Solve the problem of knowledge integration in biology by applying a semantic web approach.”
Other semantic web projects
Bio2RDF's design rules
2. Convert document to RDF format;
3. Use of a triplestore technology (Sesame, Virtuoso, Oracle);
4. Normalize URIs;
5. Build a mashup as needed to answer specific question (Elmo);
6. Query the mashup with SeRQL or SPARQL.
Bio2RDF's architecture
(Architecture diagram; components labeled #1 to #6.)
Bio2RDF's knowledge sources
RDF conversion statistics
Data source  LSID example              Number of RDF documents  Size of data converted
go           go:0000001                                 22 961             507 963 321
kegg         path:aae00010                              35 257           1 038 593 137
kegg         cpd:c00001                                 14 292               8 902 205
mgi          mgi:96103                                 438 724             210 458 897
ncbi         omim:100050                                17 359             573 639 380
ncbi         geneid:1                                2 744 786          67 225 535 082
obo          obo's 59 namespaces                       279 720             216 007 267
pdb          pdb:100d                                   34 421          16 309 651 935
uniprot      uniprot:A0A000                          4 177 176          29 453 203 064
uniprot      enzyme:1.-.-.-                              5 020               2 844 058
uniprot      pubmed:100133                             191 664             364 728 083
uniprot      taxonomy:10                               337 564             125 630 659
uniprot      uniref:UniRef100_A0A000                 7 990 452          14 865 490 144
…            …                                               …                       …
OpenRDF's software
http://www.openrdf.org/
RDF of geneid:15275
• rdf:about
• rdfs:label
• dc:identifier, title, created
• bio2rdf:lsid
• bio2rdf:url
• bio2rdf:synonym
• bio2rdf:xRef
RDFizer
efetch rdfizer
To rdfize: to convert an existing document into RDF format.
How to rdfize
• From HTML pages (prosite:ps00101)
• From XML documents using XSLT (path:mmu00010)
• From XML documents using XPath and JSTL (geneid:15275)
• From direct SQL access (ensembl:ensmusg00000025875)
• From RDF document (uniprot:p26838)
• From text files (cpd:c00001)
1) prosite:ps00101 from HTML using a regex
2) Kegg's path:mmu00010 from XML using XSL
3) ensembl:ensmusg00000025875 from SQL
4) uniprot:p26838 from RDF using SeRQL
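The first case above (an HTML page rdfized with a regex) can be sketched as follows. This is only an illustration, not the project's rdfizer: the HTML fragment, its title layout, and the helper names `literal` and `rdfize_html` are assumptions made up for the example. It emits N-Triples lines using the normalized `http://bio2rdf.org/namespace:id` form for the subject.

```python
import re

BASE = "http://bio2rdf.org/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DC_IDENTIFIER = "http://purl.org/dc/elements/1.1/identifier"

# Hypothetical HTML fragment standing in for a PROSITE entry page.
# The entry name and page layout are invented for this sketch.
SAMPLE_HTML = """
<html><head><title>PS00101 EXAMPLE_ENTRY</title></head>
<body><p>Made-up page body.</p></body></html>
"""


def literal(text):
    """Quote a string as an N-Triples literal (backslashes and quotes escaped)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rdfize_html(html, namespace):
    """Extract accession and label from the <title> and return N-Triples lines."""
    title = re.search(r"<title>\s*(\S+)\s+(.*?)\s*</title>", html, re.S)
    if title is None:
        raise ValueError("no usable <title> element found")
    accession = title.group(1).lower()  # URIs are written in lowercase
    label = title.group(2)
    subject = f"<{BASE}{namespace}:{accession}>"
    return [
        f"{subject} <{DC_IDENTIFIER}> {literal(namespace + ':' + accession)} .",
        f"{subject} <{RDFS_LABEL}> {literal(label)} .",
    ]


if __name__ == "__main__":
    for line in rdfize_html(SAMPLE_HTML, "prosite"):
        print(line)
```

A regex is brittle against layout changes in the source page, which is one reason the other cases rely on XSLT, XPath or SQL when a structured source is available.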
One reality, many names
● Different namespace identifier: pubmed:11992264 vs pmid:11992264
● Uppercase and lowercase: uniprot:p26838 vs uniprot:P26838
● Version number: genbank:ac008393 vs genbank:ac008393.7
● Total id length: go:0032283 vs go:32283
RDFizing documents is not enough; we also need normalized URIs.
http://bio2rdf.org/namespace:id
http://bio2rdf.org/pubmed:11992264
http://bio2rdf.org/uniprot:p26838
http://bio2rdf.org/genbank:ac008393
http://bio2rdf.org/go:0032283
URI normalization rules
● Different namespace identifier: we resolve namespace synonymy with a urlrewrite rule, for example pubmed and pmid.
● Uppercase and lowercase: we write every URI in lowercase.
● Version number: an owl:sameAs predicate is used to link the different versions of a document.
● Total id length: a fixed length is determined for each id.
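The four rules above can be sketched as a small normalization function. This is a minimal illustration, not the project's code: the synonym table and the 7-digit width for `go` are assumptions inferred from the slide examples (`pmid` vs `pubmed`, `go:0032283`), and the names `normalize` and `version_link` are hypothetical.

```python
BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Assumed lookup tables, filled only with the cases shown on the slides.
NAMESPACE_SYNONYMS = {"pmid": "pubmed"}
ID_LENGTHS = {"go": 7}  # go:0032283 has seven digits


def normalize(namespace, identifier):
    """Return the normalized URI http://bio2rdf.org/namespace:id."""
    ns = namespace.lower()
    ns = NAMESPACE_SYNONYMS.get(ns, ns)      # rule 1: namespace synonymy
    ident = identifier.lower()               # rule 2: lowercase everything
    width = ID_LENGTHS.get(ns)
    if width is not None and ident.isdigit():
        ident = ident.zfill(width)           # rule 4: fixed id length
    return f"{BASE}{ns}:{ident}"


def version_link(namespace, versioned_id):
    """Rule 3: link a versioned id to its unversioned form with owl:sameAs.

    Returns one N-Triples line, or None when the id carries no numeric
    version suffix after a dot.
    """
    base_id, dot, version = versioned_id.rpartition(".")
    if not dot or not version.isdigit():
        return None
    versioned = normalize(namespace, versioned_id)
    unversioned = normalize(namespace, base_id)
    return f"<{versioned}> <{OWL_SAME_AS}> <{unversioned}> ."


if __name__ == "__main__":
    print(normalize("PMID", "11992264"))
    print(normalize("uniprot", "P26838"))
    print(normalize("go", "32283"))
    print(version_link("genbank", "ac008393.7"))
```

Keeping version handling as a separate link, rather than stripping the suffix, matches the slide's choice of owl:sameAs: both forms stay addressable.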
Url Rewrite Filter
http://tuckey.org/urlrewrite/
<rule>
  <from>^/search:(.*?)@pubmed</from>
  <to>/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&amp;query=$1</to>
</rule>
<rule>
  <from>^/pubmed:(.*)</from>
  <to>/rdfizer/ncbi-pubmed2rdf.jsp?id=$1</to>
</rule>
<rule>
  <from>^/pmid:(.*)</from>
  <to>/rdfizer/lsid-sameas2rdf.jsp?from=pmid:$1&amp;to=pubmed:$1</to>
</rule>
<rule>
  <from>^/(.*):(.*)</from>
  <to type="redirect">http://bio2rdf.org/$1:$2</to>
</rule>
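To see what these rules do to an incoming path, the sketch below replays them with Python's `re` module. It is not a reimplementation of Url Rewrite Filter: the patterns and targets are transcribed as read from the slide, the function name `rewrite` is hypothetical, and stopping at the first matching rule is a simplifying assumption.

```python
import re

# (pattern, target template, action) as read from the rules above.
# "\g<1>" is Python's spelling of the "$1" backreference used in the XML.
RULES = [
    (r"^/search:(.*?)@pubmed",
     r"/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&query=\g<1>", "forward"),
    (r"^/pubmed:(.*)",
     r"/rdfizer/ncbi-pubmed2rdf.jsp?id=\g<1>", "forward"),
    (r"^/pmid:(.*)",
     r"/rdfizer/lsid-sameas2rdf.jsp?from=pmid:\g<1>&to=pubmed:\g<1>", "forward"),
    (r"^/(.*):(.*)",
     r"http://bio2rdf.org/\g<1>:\g<2>", "redirect"),
]


def rewrite(path):
    """Return (action, target) for the first rule whose pattern matches."""
    for pattern, template, action in RULES:
        match = re.match(pattern, path)
        if match:
            return action, match.expand(template)
    return "pass", path  # no rule applies; the path is left untouched


if __name__ == "__main__":
    for path in ("/search:parkinson@pubmed", "/pubmed:11992264",
                 "/pmid:11992264", "/uniprot:p26838"):
        print(path, "->", rewrite(path))
```

The `pmid` rule shows how namespace synonymy is handled: instead of serving the document under a second name, the request is routed to a page that states the equivalence between the `pmid` and `pubmed` identifiers.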
URL vs LSID
http://bio2rdf.org/uniprot:p26838
owl:sameAs
urn:lsid:uniprot.org:uniprot:p26838
http://bio2rdf.org/uniprot:p26838
http://bio2rdf.org/urn:lsid:uniprot.org:uniprot:p26838
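The relation between the two identifier forms can be written down in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the authority string (`uniprot.org`) is taken from the slide example rather than looked up anywhere.

```python
BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def bio2rdf_url(namespace, identifier):
    """Normalized URL form: http://bio2rdf.org/namespace:id."""
    return f"{BASE}{namespace}:{identifier}"


def lsid(authority, namespace, identifier):
    """LSID form: urn:lsid:authority:namespace:id."""
    return f"urn:lsid:{authority}:{namespace}:{identifier}"


def lsid_links(authority, namespace, identifier):
    """Return the owl:sameAs triple and the resolvable URL for the LSID."""
    url = bio2rdf_url(namespace, identifier)
    urn = lsid(authority, namespace, identifier)
    return {
        "same_as": f"<{url}> <{OWL_SAME_AS}> <{urn}> .",
        "resolvable": BASE + urn,  # the LSID served under bio2rdf.org
    }


if __name__ == "__main__":
    links = lsid_links("uniprot.org", "uniprot", "p26838")
    print(links["same_as"])
    print(links["resolvable"])
```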
Our method to answer question
To answer a very specialized question, we build a specific knowledge base (the mashup stored in an RDF triplestore) and then query it with SeRQL.
Parkinson examples
1. What is the semantic network of OMIM records describing Parkinson’s disease?
2. Which MeSH terms are most cited in Parkinson’s disease publications?
3. What genes related to Parkinson’s disease are involved in pathways according to Kegg?
Time for demo!
The big everything about Parkinson
http://localhost:8080/bio2rdf/search:parkinson@omim
http://localhost:8080/bio2rdf/search:parkinson@geneid
http://localhost:8080/bio2rdf/search:parkinson@uniprot
http://localhost:8080/bio2rdf/search:parkinson@kegg
http://localhost:8080/bio2rdf/load:pubmed
http://localhost:8080/bio2rdf/sameas:hsa-geneid
http://localhost:8080/bio2rdf/learn:geneid
http://localhost:8080/bio2rdf/load:cpd
http://localhost:8080/bio2rdf/load:reactome
http://localhost:8080/bio2rdf/load:biopax-xref
http://localhost:8080/bio2rdf/load:chebi
http://localhost:8080/bio2rdf/load:obo-xref
http://localhost:8080/bio2rdf/sameas:keggcompound-cpd
1,700 K triples
97 Mbytes in Turtle format
in 90 minutes
Third example: SeRQL query
What genes related to Parkinson’s disease are involved in pathways according to Kegg?
SELECT
  GeneticDisorder-label, Gene-label, pathway-label
FROM
  {GeneticDisorder} rdf:type {<http://bio2rdf.org/omim#GeneticDisorder>},
  {GeneticDisorder} rdfs:label {GeneticDisorder-label},
  {GeneticDisorder} <http://www.w3.org/2002/07/owl#sameAs> {sameAs},
  {Gene} <http://bio2rdf.org/bio2rdf#xRef> {sameAs},
  {Gene} rdfs:label {Gene-label},
  {Gene2} <http://www.w3.org/2000/01/rdf-schema#seeAlso> {Gene},
  {xobject} <http://bio2rdf.org/kegg#xobject> {Gene2},
  {xentry1} <http://bio2rdf.org/kegg#xentry1> {xobject},
  {pathway} <http://bio2rdf.org/kegg#xrelation> {xentry1},
  {pathway} rdfs:label {pathway-label}
WHERE
  GeneticDisorder-label like "*PARKINSON*"
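The query above is a chain of triple patterns joined on shared variables. The sketch below shows that join idea on a tiny in-memory list of triples. It is not how a triplestore evaluates SeRQL: the data is made up, the prefixed names are plain strings, only the first part of the path (disorder, sameAs, xRef, gene) is covered, and `solve` and `parkinson_genes` are hypothetical names.

```python
RDF_TYPE = "rdf:type"
RDFS_LABEL = "rdfs:label"
OWL_SAME_AS = "owl:sameAs"
XREF = "bio2rdf:xRef"
GENETIC_DISORDER = "omim:GeneticDisorder"

# Made-up triples for illustration only; none of these are real records.
TRIPLES = [
    ("ex:disorder1", RDF_TYPE, GENETIC_DISORDER),
    ("ex:disorder1", RDFS_LABEL, "PARKINSON DISEASE (made-up label)"),
    ("ex:disorder1", OWL_SAME_AS, "ex:record1"),
    ("ex:gene1", XREF, "ex:record1"),
    ("ex:gene1", RDFS_LABEL, "made-up gene A"),
    ("ex:disorder2", RDF_TYPE, GENETIC_DISORDER),
    ("ex:disorder2", RDFS_LABEL, "SOME OTHER DISORDER (made-up label)"),
    ("ex:disorder2", OWL_SAME_AS, "ex:record2"),
    ("ex:gene2", XREF, "ex:record2"),
    ("ex:gene2", RDFS_LABEL, "made-up gene B"),
]

# Terms starting with "?" are variables; everything else must match exactly.
PATTERNS = [
    ("?disorder", RDF_TYPE, GENETIC_DISORDER),
    ("?disorder", RDFS_LABEL, "?disorder_label"),
    ("?disorder", OWL_SAME_AS, "?same"),
    ("?gene", XREF, "?same"),
    ("?gene", RDFS_LABEL, "?gene_label"),
]


def solve(patterns, triples, binding=None):
    """Yield every variable binding satisfying all patterns (a naive join)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if already bound differently.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from solve(rest, triples, candidate)


def parkinson_genes(triples):
    """Mimic the WHERE clause: keep rows whose disorder label has PARKINSON."""
    return [
        (b["?disorder_label"], b["?gene_label"])
        for b in solve(PATTERNS, triples)
        if "PARKINSON" in b["?disorder_label"]
    ]


if __name__ == "__main__":
    for row in parkinson_genes(TRIPLES):
        print(row)
```

The join on `?same` is where normalized URIs matter: the disorder record and the gene record only connect if both sources name the shared cross-reference with exactly the same URI.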
Query result
Conclusion
Before Bio2RDF integration
Our main results
● RDF is a framework that enables a very simple thing: scalability of the knowledge base complexity.
● The Bio2RDF project proposes to keep complexity in the bioinformatics knowledge space under control by applying this proven semantic web approach.
Now with Bio2RDF semantic integration
Bio2RDF's vision of knowledge map
Bio2RDF's map of distributed bioinformatics knowledge
http://bio2rdf.org/bio2rdf-2007-02.owl
Map of semantic resource
Montreal's subway map
Bio2RDF's actual knowledge map
Achievement
● Public data + open source software + RDF technology + rdfizer + normalized URIs = Bio2RDF knowledge integration;
● A bioinformatic integration ontology won't exist if it is not adopted by the community; bio2rdf.owl is just a proposed starting point;
● 46 million RDF documents are now available at http://bio2rdf.org.
The Bio2RDF project provides an open source RDFizer to the community.
So much still needs to be rdfized; if you are interested in contributing, join us!
Now let's build the big knowledge map of bioinformatics…
Final words
Please tell Sir Tim Berners-Lee that he was right: semantic web in bioinformatics is a ‘killer app’ to illustrate all the potential of the semantic web.
And also tell Mark Wilkinson that semantic web in bioinformatics won’t be full of ‘creeps’ if we organize it like we did…
Thanks
● Jean Morissette, Nicole Tourigny, Philippe Rigault
● Bioinformatics lab's team at CHUL Research Center
● Many open source communities (OpenRDF, Simile's project, Tomcat, JSTL and many more)
● W3C Bio-RDF Group
● Génome Québec, Génome Canada
Visit http://bio2rdf.org
Download http://sourceforge.net/projects/bio2rdf/
Discover http://bio2rdf.org/bio2rdf-2007-02.owl
Contact us at bio2rdf@gmail.com