Semantic Web Anno 1999 - Keio...
Transcript of Semantic Web Anno 1999 - Keio...
3/3/2011
1
How data sharing leads to knowledge
M. Scott Marshall, Ph.D.W3C HCLS IG co-chair
Leiden University Medical CenterUniversity of Amsterdam
http://staff.science.uva.nl/~marshall
Semantic Web Anno 1999
DIY (Do It Yourself)
- 別3 -
3/3/2011
2
Reasoning across a Federation(Semantic Web 2012?)
The Semantic Web is the New Global
Web of Knowledge
It is about standards for publishing, sharing and querying
knowledge drawn from diverse sources
It makes possible the answering
sophisticated questions using
background knowledge
Source: Susie Stephens
- 別4 -
3/3/2011
3
Motivation
Science is based on knowledge:
knowledge capture, knowledge sharing, i.e. communication of findings.
Semantic Web provides a basis for knowledge sharing through machine-readable and reason-able annotation of resources.
Biology in a nutshell: Bigger isn’t better
• DNA Dogma
– Transcription = DNA -> mRNA -> Protein
• Molecular pathways allow biologists to ‘connect’ one molecule to another.
• Huntington’s mutation mapped in 1993 yet there is still no understanding of the mechanism that causes the neurodegeneration.
• Semantic models are necessary to create a ‘systems view’ of biology.
- 別5 -
3/3/2011
4
Where is biomedical knowledge?
Can be extracted from:
• People
• Literature
• Diagrams
• Clinical reports
• Databases
• Excel sheets
• …
Most of these sources of biomedical knowledge are
not machine-readable
Examples
1. “find all images with a big lesion in the right frontal lobe » ?
2. Provide the neurosurgeon with landmarksHmm, but what’s
nearby?
Source: Christine Golbreich
- 別6 -
3/3/2011
5
Many tasks are still a challenge!
With existing Web and Health IT:
• Find, share and integrate information
– “Although a plethora of resources (tools, databases, materials) for neuroscientists is now available on the web, finding these resources among the billions of possible web pages continues to be a challenge.” [M. Martone, NCBO Seminar
Series, 4 Nov 2009]
• Make multiple inferences based on background knowledge
– to obtain more complete answers
– to discover knowledge
Source: Christine Golbreich
Examples– in a medical record system
“find all patients which radiology exhibits a fracture of femur”
– in genomic data
“find all genes annotated with a molecular function or any of its descendants and which is associated with anyform of a given disease (see genes associated to muscular dystrophy [Sahoo et al. 2007])
– find, share, annotate images
Source: Christine Golbreich
- 別7 -
3/3/2011
6
How ?
• Adding semantic annotations from ontologies
“ … The vocabulary for such markup will be defined by ontologies, sets of statements which describe the structure of the domain of interest and give meaning to the terms therein. ”
[I. Horrocks, An Ontology Language for the Semantic Web, IEEE Intelligent Systems, 2002]
Source: Christine Golbreich
Biological and medical ontologies• Medical domain is *very* lucky
a large number of terminologies and reference ontologies, E.g., FMA, NCI, GO, SNOMED-CT, etc.
• Web Portals
– Bioportal library contains ~200 ontologies in different languages: OBO, Protégé Frames, RDF, OWL http://bioportal.bioontology.org/
– Bioportal now provides SPARQL access to ontologies: http://sparql.bioontology.org
– Open Biomedical Ontologies (OBO) Foundry, http://obofoundry.org/
• Nearly every day a new ontology is created.
• BUT …
– Does a term mean the same thing for everyone ?
– Different ontologies use different terms for the same
– How to connect ontologies? How to import terms?
– Is the vocabulary consistent ?
Source: Christine Golbreich
- 別8 -
3/3/2011
7
Advantages of OWL 2?
• Exploits results of 15+ years of research in Knowledge Representation (KR)
• OWL 2 based on description logics
• A logical language provides
– Semantics (meaning)
– Reasoning support and tools
Source: Christine Golbreich
Source: Christine Golbreich
Large ontologies in OWL and applications
• Many biomedical ontologies being moved to OWL 2
– e.g., FMA
– e.g., OBO to OWL [Golbreich et al. ISWC 2007]
– e.g., SNOMED-CT in OWL 2
• Peter Hendler's slides at HCLS W3C F2F
• Application example: Semantic Annotations of Brain MRI
– An hybrid system for semantic annotation of brainanatomy in MRI images based on an OWL ontologyof sulci-gyri anatomy extended by SWRL rules[Mechouche & al. 2009]
- 別9 -
3/3/2011
8
What is metadata (in this talk)?
• Metadata: data about data
• Metadata can be syntactic such as a data type, e.g. Integer.
• Metadata can be semantic such as chromosome number.
• Note: not always ontology, but metadata can be stored in OWL
What is knowledge ?
“data”, “information”, “facts”, “knowledge”
Knowledge is a statement
that can be tested for truth.
(by a machine)
Otherwise, computing can’t add much
- 別10 -
3/3/2011
9
RDF : a web format for knowledge
RDF is a W3C language to express
statements.
RDF Triple:
Subject Predicate Object
Graph of Knowledge:
Node Edge Node
Round trip knowledge engineering (Vision)
In a linked open data environment:
– How will I find the resources?
– How will I integrate them into an application?
– How will I leverage existing knowledge in my analysis?
– How will I publish the new knowledge resulting from my analysis?
– And link to the evidence (data) from the new knowledge?
- 別11 -
3/3/2011
10
Historical perspective: ARPANET
From ARPANET to Semantic Web
• ARPANET = TCP/IP, C language, Unix
• Internet = e-mail, gopher, ftp
• Web = http, html + Mosaic
• ProgrammableWeb = XML, SOAP, REST, JSON
• Semantic Web = OWL/RDF, URI’s, SPARQL,
- 別12 -
3/3/2011
11
Programmable Web:A few thousand APIs
http://www.programmableweb.com/
SPARQL Interoperability:A single API
• a query language for RDF that matches graph patterns
• SPARQL endpoint is comparable to a database connector for triples
• SPARQL endpoint enables data layer interoperability
- 別13 -
3/3/2011
12
Background of the HCLS IG
• Originally chartered in 2005– Chairs: Eric Neumann and Tonya Hongsermeier
• Re-chartered in 2008– Chairs: Scott Marshall and Susie Stephens
– Team contact: Eric Prud’hommeaux
• Broad industry participation– Over 100 members
– Mailing list of over 600
• Background Information– http://www.w3.org/blog/hcls
– http://esw.w3.org/topic/HCLSIG
Mission of HCLS IG
•The mission of HCLS is to develop, advocate for, and support the use of Semantic Web technologies for
– Biological science
– Translational medicine
– Health care
•These domains stand to gain tremendous benefit by adoption of Semantic Web technologies, as they depend on the interoperability of information from many domains and processes for efficient decision support
- 別14 -
3/3/2011
13
How does the HCLS IG work?
• Task forces
• Regular teleconferences using teleconference bridge (Zakim), IRC, minutes
• Face2Face (F2F) twice a year
• Procedures for publishing W3C notes
Current Task Forces
• BioRDF – federating (neuroscience) knowledge bases– M. Scott Marshall (Leiden University Medical Center / University of Amsterdam)
• Clinical Observations Interoperability – patient recruitment in trials– Vipul Kashyap (Cigna Healthcare)
• Linking Open Drug Data – aggregation of Web-based drug data – Susie Stephens (Johnson & Johnson)
• Translational Medicine Ontology – high level patient-centric ontology– Michel Dumontier (Carleton University)
• Scientific Discourse – building communities through networking– Tim Clark (Harvard University)
• Terminology – Semantic Web representation of existing resources– John Madden (Duke University)
- 別15 -
3/3/2011
14
COI: Translating across domains
Microarray
MRI PubMed
AlzForum
EHR
COI: Use Case
Pharmaceutical companies pay a lot to test drugs
Pharmaceutical companies express protocol in CDISC
-- precipitous gap –
Hospitals exchange information in HL7/RIM
Hospitals have relational databases
Source: Eric Prud’hommeaux
- 別16 -
3/3/2011
15
• Type 2 diabetes on diet and exercise therapy or
• monotherapy with metformin, insulin
• secretagogue, or alpha-glucosidase inhibitors, or
• a low-dose combination of these at 50%
• maximal dose. Dosing is stable for 8 weeks prior
• to randomization.
• …
• ?patient takes metformin .
Inclusion Criteria
Source: Holger Stenzhorn
Exclusion Criteria
Use of warfarin (Coumadin), clopidogrel
(Plavix) or other anticoagulants.
…
?patient doesNotTake anticoagulant .
Source: Holger Stenzhorn
- 別17 -
3/3/2011
16
?medication1 sdtm:subject ?patient ;spl:activeIngredient ?ingredient1 .
?ingredient1 spl:classCode 6809 . #metformin
OPTIONAL {
?medication2 sdtm:subject ?patient ;spl:activeIngredient ?ingredient2 .
?ingredient2 spl:classCode 11289 .#anticoagulant
} FILTER (!BOUND(?medication2))
Criteria in SPARQL
Source: Holger Stenzhorn
TMO: Translating across domains
Microarray
MRI PubMed
AlzForum
EHR
- 別18 -
3/3/2011
17
Questions & ProblemsThe Drug Development Pipeline
• The road is long, and costly.
• How do we contain costs and develop better drugs?
“A virtual space odyssey”, Cath O'Driscoll (2004)
http://www.nature.com/horizon/chemicalspace/background/odyssey.html
Source: Elgar Pichler
Translational Medicine OntologyMission
• Focuses on the development of a high level patient-centric ontology for the pharmaceutical industry. The ontology should enable data integration across discovery research, hypothesis management, experimental studies, compounds, formulation, drug development, market size, competitive data, population data, etc. This would enable scientists to answer new questions, and to answer existing scientific questions more quickly.
• This will help pharmaceutical companies to model patient-centric information, which is essential for the tailoring of drugs, and for early detection of compounds that may have sub-optimal safety profiles. The ontology should link to existing publicly available domain ontologies.
- 別19 -
3/3/2011
18
TMO Structure
Source: Susie Stephens
Translational Medicine KB
Source: Susie Stephens
- 別20 -
3/3/2011
19
TMO Query
How many patients experienced side effects while taking Donepezil?
Source: Susie Stephens
TMO DevelopmentConcept Identification via Use Cases
• Example
• (see http://esw.w3.org/topic/HCLSIG/PharmaOntology/UseCases):
• 1. Patient [OBI:0000093, patient role] (and family members [NCI:Patient_Family_Member_or_Friend]) report symptoms [IDO:0000048, Symptom] to physician/clinician [NCIt:Physician]. Physician/clinician enters reported symptoms into eHR.
• 2. Physician [NCIt: Physician] makes a list of differential diagnoses, with a working diagnosis [OBI:0000075] of Alzheimer Disease [DOID:10652]. (Data Source: Physician's head).
• 3. Physician [NCIt:Physician] arranges for patient [OBI:0000093, patient role] to have a basic biochemical/haematological, and SNP [SO:0000694, SNP] profile undertaken. Biochemistry, Haematology, and SNP requests are input by respective departments directly into patient’s eHR [HL7:EHR, UMLS:C1555708, HID:20081] from laboratory (Data Source: eRecord). Preliminary SNP and genetic data will be submitted directly to the NIH Pharmacogenetics Research Network (PGRN).
• [...]
- 別21 -
3/3/2011
20
Terminology: Translating across domains
Microarray
MRI PubMed
AlzForum
EHR
Terminology
• Representation of medical documents (OWL and SKOS)
• SKOS enables ‘fuzzier’ statements, with ‘broader’ and ‘narrower’
• Mammogram: Represent both radiology and pathology report to discover discrepancies
• Use Translational Medicine Ontology, RadLex, SNOMED in the RDF
- 別22 -
3/3/2011
21
SciDisc: Translating across domains
Microarray
MRI PubMed
AlzForum
EHR
SWAN / CiTO alignment
Source: Paolo Ciccarese
- 別23 -
3/3/2011
22
BioRDF: Translating across domains
Microarray
MRI PubMed
AlzForum
EHR
BioRDF: Looking for Targets for Alzheimer’s
• Signal transduction pathways are considered to be rich in “druggable” targets
• CA1 Pyramidal Neurons are known to be particularly damaged in Alzheimer’s disease
• Casting a wide net, can we find candidate genes known to be involved in signal transduction and active in Pyramidal Neurons?
Source: Alan Ruttenberg
- 別24 -
3/3/2011
23
NeuronDB
BAMS
Literature
Homologene
SWAN
Entrez
Gene
Gene
Ontology
Mammalian
Phenotype
PDSPki
BrainPharm
AlzGene
Antibodies
PubChem
MESH
Reactome
Allen Brain
Atlas
BioRDF: Integrating Heterogeneous Data
Source: Susie Stephens
“find me genes involved in signal transduction that are related to pyramidal neurons”
- 別25 -
3/3/2011
24
BioRDF: Results: Genes, Processes
•DRD1, 1812 adenylate cyclase activation•ADRB2, 154 adenylate cyclase activation•ADRB2, 154 arrestin mediated desensitization of G-protein coupled receptor protein signaling pathway•DRD1IP, 50632 dopamine receptor signaling pathway•DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway•DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway•GRM7, 2917 G-protein coupled receptor protein signaling pathway•GNG3, 2785 G-protein coupled receptor protein signaling pathway•GNG12, 55970 G-protein coupled receptor protein signaling pathway•DRD2, 1813 G-protein coupled receptor protein signaling pathway•ADRB2, 154 G-protein coupled receptor protein signaling pathway•CALM3, 808 G-protein coupled receptor protein signaling pathway•HTR2A, 3356 G-protein coupled receptor protein signaling pathway•DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second messenger•SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second messenger•MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide second messenger•CNR2, 1269 G-protein signaling, coupled to cyclic nucleotide second messenger•HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second messenger•GRIK2, 2898 glutamate signaling pathway•GRIN1, 2902 glutamate signaling pathway•GRIN2A, 2903 glutamate signaling pathway•GRIN2B, 2904 glutamate signaling pathway•ADAM10, 102 integrin-mediated signaling pathway•GRM7, 2917 negative regulation of adenylate cyclase activity•LRP1, 4035 negative regulation of Wnt receptor signaling pathway•ADAM10, 102 Notch receptor processing•ASCL1, 429 Notch signaling pathway•HTR2A, 3356 serotonin receptor signaling pathway•ADRB2, 154 transmembrane receptor protein tyrosine kinase activation (dimerization)•PTPRG, 5793 ransmembrane receptor protein tyrosine kinase signaling pathway•EPHA4, 2043 transmembrane receptor protein tyrosine kinase signaling pathway•NRTN, 4902 transmembrane receptor protein tyrosine kinase signaling pathway•CTNND1, 1500 Wnt receptor signaling pathway
Many of the genes
are related to AD
through gamma
secretase
(presenilin) activity
Source: Alan Ruttenberg
Improving HCLS KB
• Shared Identifiers – Use the same names
• Federate – Where will we find answers to our scientific questions?
• Provenance – What will we do with what we find?
- 別26 -
3/3/2011
25
Automatic query decomposition
• Decompose query into components
• Locate sources of information
• Distribute queries across information sources
• Aggregate results
A SPARQL query for processes involved in pyramidal neurons
prefix go: <http://purl.org/obo/owl/GO#>prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>prefix owl: <http://www.w3.org/2002/07/owl#>prefix mesh: <http://purl.org/commons/record/mesh/>prefix sc: <http://purl.org/science/owl/sciencecommons/>prefix ro: <http://www.obofoundry.org/ro/ro.owl#>
select ?genename ?processnamewhere{ graph <http://purl.org/commons/hcls/pubmesh>
{ ?paper ?p mesh:D017966 .?article sc:identified_by_pmid ?paper.?gene sc:describes_gene_or_gene_product_mentioned_by ?article.
}graph <http://purl.org/commons/hcls/goa>{ ?protein rdfs:subClassOf ?res.?res owl:onProperty ro:has_function.?res owl:someValuesFrom ?res2.?res2 owl:onProperty ro:realized_as.?res2 owl:someValuesFrom ?process.
graph <http://purl.org/commons/hcls/20070416/classrelations>{{?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166} union
{?process rdfs:subClassOf go:GO_0007166 }} ?protein rdfs:subClassOf ?parent.?parent owl:equivalentClass ?res3.?res3 owl:hasValue ?gene.
}graph <http://purl.org/commons/hcls/gene>{ ?gene rdfs:label ?genename }
graph <http://purl.org/commons/hcls/20070416>{ ?process rdfs:label ?processname}
}
MeSH: Pyramidal Neurons
Pubmed: Journal Articles
Entrez Gene: Genes
GO: Signal Transduction
Inference required
- 別27 -
3/3/2011
26
Recipe for a Semantic Web
• Follow Linked Open Data principles
• Attempt to use Shared Names (same URI’s)
• Query rewriting to map from:
– SPARQL -> (query language)
– SPARQL (term1) -> SPARQL (term2)
• Add federated query support to SPARQL engine implementations
The Classic Web
B C
HTML HTMLHTML
Web
Browsers
Search
Engines
hyper-
links
• Single information space
• HTML describes presentation
• Built on URIs– globally unique IDs
– retrieval mechanism
• Built on Hyperlinks– are the glue that holds
everything together
A
hyper-
links
Source: Chris Bizer
- 別28 -
3/3/2011
27
Linked Data
B C
Thing
typed
links
A D E
typed
links
typed
links
typed
links
Thing
Thing
Thing
Thing
Thing Thing
Thing
Thing
Thing
Search
Engines
Linked Data
Mashups
Linked Data
Browsers
Use Semantic Web technologies to publish structured data on the Web and set
links between data from one data source and data from another data sources
Source: Chris Bizer
Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful RDF information.
4. Include RDF statements that link to other URIs so that they can discover related things.
• Tim Berners-Lee 2007
• http://www.w3.org/DesignIssues/LinkedData.html
- 別29 -
3/3/2011
28
The Linked Data Cloud
Source: Chris Bizer
Interlinking in LODD
http://esw.w3.org/HCLSIG/LODD/Interlinking
- 別30 -
3/3/2011
29
LOD cloud – a few challenges
• It’s a (nice) data warehouse in RDF
– http://lod.openlinksw.com/
• Ad hoc identifiers
– Prefer authoritative, persistent, and shared
• Redundancy
• Lacks provenance – Who created a given RDF rendering? Using what method? Version?
So, what’s missing?
• Data stewardship – serve (annotated) data back to community
• Plug-n-play SPARQL access to relational stores
• Best practice for data federation of SPARQL endpoints
• Best practice for important data sources: microarray data, variety of image data
• Shared Identifiers
- 別31 -
3/3/2011
30
Provenance
• Represent knowledge so that others can discover where a fact (or triple) came from and evaluate how to use it – link facts to data as evidence
• Named graphs – talk about what’s in them
• Example use: “Find the latest RDF version of DrugBank” from SPARQL endpoints accessible to me.
Semantic Web Foundation
Source: Tim Berners-Lee, Director W3C
- 別32 -
3/3/2011
31
Provenance for Federationor “What’s in a graph?”
• Minimum information about a graph (MIAG)
• Is it an ontology or RDF data?
• Version, License, ..
• Who converted it to RDF?
• Rendering: “I’m looking for SNOMED in SKOS (not OWL)”
• What is the most populated class?
• Graph statistics
Translating across domains
Microarray
MRI PubMed
AlzForum
EHR
- 別33 -
3/3/2011
32
Provenance types are perspectives on the data
Source: Helena Deus
A Bottom-up Approach
Provenance
modelsWorkflow,
experimental designDomain ontologies
(DO, GO…)Community
models
Raw Data
Results
Questions
Which genes are markers for neurodegenerative
diseases?
Was gene ALG2 differentially expressed in
multiple experiments?
Provenance of
Microarray
experiment
What software was used to analyse the data?
How can the experiment be replicated?
Source: Helena Deus
- 別34 -
3/3/2011
33
BioRDF: finding latest versionFILTER (?date_issued2 > ?date_issued)
}
FILTER (!BOUND(?srvc2))
}
#Get associated diseases from most recently updated Diseasomeserver.
SERVICE ?srvc2 {
?diseasomeGene rdfs:label ?geneLabel .
?disease diseasome:associatedGene ?diseasomeGene.
?disease rdfs:label ?diseaseName .
}
Myth busting
• Creating a Semantic Web application does not necessarily require converting and storing all data in RDF
• “Leave the data where it lives”
• Create views of the data accessible via RDF export or (BETTER) a SPARQL endpoint
- 別35 -
3/3/2011
34
SWObjects
• Creator: Eric Prud’hommeaux
• SWObjects creates a SPARQL endpoint
• Query rewriting to SQL and SPARQL
• Uses SPARQL Constructs as a mapping language
• JNI interface
• http://en.wikipedia.org/wiki/SWObjects
• http://precedings.nature.com/documents/5538/
Semantic Web is a framework
that provides common languages for defining vocabularies and querying across data that uses them.
However, common identifiers must be established if we are to link data without a million mapping files.
- 別36 -
3/3/2011
35
Shared Identifiers
• Must use common URI’s in order to link data
• Provenance related identifiers still needed:
– Identifiers for people (researchers)
– Identifiers for diseases
– Identifiers for terms (Terminology servers)
– Identifiers for programs, processes, workflows
– Identifiers for chemical compounds
• Shared Names http://sharednames.org
- 別37 -
3/3/2011
36
Emerging Best Practices for RDF
1. Convert unstructured source data into a structured data format, such as relational tables or XML
2. Identify persistent URLs (PURLS) for concepts in existing ontologies and create a map from the structured data into an RDF view
3. Customize the map manually if necessary
4. Map concepts in the new RDF mapping to concepts in other RDF data sources relying as much as possible on ontologies
5. Publish the RDF data through a SPARQL endpoint or as Linked Data
6. Alternatively, if data is in a relational format, apply a Semantic Web toolkit such as SWObjects [Prud’hommeaux et al. 2010] that enables SPARQL queries over the relational schema
7. Create Semantic Web applications using the published data
Create RDF views that a stranger can use
• Use a mapping language to create an RDF view of the data when possible, rather than customized code.
• When possible, use vocabularies that are openly available from an authoritative server. For the life sciences, we have OBO and NCBO.
• Use rdfs:label generously to provide information to user interfaces
- 別38 -
3/3/2011
37
Publish RDF so that it can be discovered
• Publish open access data whenever possible, as well as any associated software.
• Publish a URL to the software and mappings that you used to create the RDF.
• Register your data at CKAN. If it’s biomedical, register it in BioCatalogue.
• Assign a graph URI to the RDF graph and add provenance and metadata about the graph URI to the graph itself. This practice makes it possible for visitors and crawlers to find out what is in the graph.
Acknowledgements• W3C Health Care and Life Science Interest Group,
http://www.w3.org/blog/hcls
• National Center for Biomedical Ontologies
• Concept Web Alliance
• Authors of all contributed slides
- 別39 -
3/3/2011
38
The End
“Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house.”
– Henri Poincaré,
Science and Hypothesis, 1905
- 別40 -