Scientific Data Integration with Model-Based Mediation:
Databases Meets* Knowledge Representation
Bertram Ludäscher
[email protected] (address at SDSC.EDU)
Knowledge-Based Integration Lab
Data and Knowledge Systems
San Diego Supercomputer Center
U.C. San Diego
* or rather rediscovers

Integration Example from the Database Community
User: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Figure: the question goes to Information Integration, here the addall.com Mediator (“One-World” Mediation), which draws on the sources amazon.com, A1books.com, half.com, and barnes&noble.com]

Another Well-Known Data Integration Example
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood
with below-average crime rate and diverse population?
[Figure: the question goes to Information Integration (“Multiple-Worlds” Mediation), which draws on the sources Realtor, Demographics, School Rankings, and Crime Stats]
National Partnership of Advanced Computational Infrastructure San Diego Supercomputer Center
Information Integration from a DB Perspective
• Information Integration Challenge
– Given: data sources S_1, ..., S_k (DBMS, web sites, ...) and user questions Q_1, ..., Q_n that can be answered using the S_i
– Find: the answers to Q_1, ..., Q_n
• The Database Perspective: source = “database”
– S_i has a schema (relational, XML, OO, ...)
– S_i can be queried
– define virtual (or materialized) integrated views V over S_1, ..., S_k using database query languages
– questions become queries Q_i against V(S_1, ..., S_k)
• Why a Database Perspective?
– scalability, efficiency, reusability (declarative queries), ...
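The database perspective can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the mediator from the talk: the two in-memory sources, their field names, and the prices are invented. Each source keeps its own schema, `integrated_view` plays the role of V(S_1, S_2), and `cheapest` is a query Q posed against the view rather than against the sources.

```python
# Toy sources S_1, S_2: each has its own schema (field names are hypothetical).
S1 = [{"title": "Tractatus", "price": 12.0, "shipping": 4.0}]
S2 = [{"name": "Tractatus", "cost": 10.0, "ship_fee": 7.5},
      {"name": "Philosophical Investigations", "cost": 20.0, "ship_fee": 3.0}]


def integrated_view(s1, s2):
    """Virtual view V(S_1, S_2): map both schemas to (seller, title, total)."""
    for row in s1:
        yield {"seller": "S1", "title": row["title"],
               "total": row["price"] + row["shipping"]}
    for row in s2:
        yield {"seller": "S2", "title": row["name"],
               "total": row["cost"] + row["ship_fee"]}


def cheapest(title, view):
    """Query Q against V: the offer with the lowest total cost for a title."""
    offers = [o for o in view if o["title"] == title]
    return min(offers, key=lambda o: o["total"]) if offers else None


best = cheapest("Tractatus", integrated_view(S1, S2))
print(best)
```

Because the view is a generator, nothing is materialized: the query pulls rows from the sources on demand, which corresponds to the "virtual view" case.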

Abstract (XML-Based) Mediator Architecture
[Figure: the USER/Client exchanges XML queries and results with the MEDIATOR. The mediator holds the integrated XML view V, given by the Integrated View Definition IVD(S_1,...,S_k). Each source S_1, S_2, ..., S_k sits behind a Wrapper that exports an XML View. A user query Q is answered as the composition Q o V(S_1,...,S_k).]

XMAS: XML Matching And Structuring language
Integrated View Definition: “Find publications from amazon.com and DBLP, join on author, group by authors and title”

    CONSTRUCT <books>
                <book>
                  $a1
                  $t
                  <pubs> $p { $p } </pubs>
                </book> { $a1, $t }
              </books>
    WHERE <books.book>
            $a1 : <author />
            $t  : <title />
          </> IN WRAP("amazon.com")
      AND <authors.author>
            $a2 : <author />
            <pubs> $p : <pub/> </>
          </> IN WRAP("www...DBLP…")
      AND value( $a1 ) = value( $a2 )

[Figure labels: XMAS, XMAS Algebra]
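For readers unfamiliar with XMAS, the same join-and-group logic can be approximated in Python with the standard library's `xml.etree.ElementTree`. This is only a rough analogue, not an XMAS implementation: the two XML documents stand in for the wrapped sources, and all author, title, and publication values are placeholders.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Stand-ins for the two wrapped sources (contents are placeholders).
AMAZON = ET.fromstring(
    "<books>"
    "<book><author>Author A</author><title>Book One</title></book>"
    "<book><author>Author B</author><title>Book Two</title></book>"
    "</books>")
DBLP = ET.fromstring(
    "<authors>"
    "<author><author>Author B</author>"
    "<pubs><pub>Paper 1</pub><pub>Paper 2</pub></pubs>"
    "</author>"
    "</authors>")


def integrated_view(amazon, dblp):
    """Join on author value, then group publications by (author, title)."""
    groups = defaultdict(list)
    for book in amazon.findall("book"):
        a1, t = book.findtext("author"), book.findtext("title")
        for entry in dblp.findall("author"):
            if entry.findtext("author") == a1:      # value($a1) = value($a2)
                groups[(a1, t)].extend(p.text for p in entry.findall("pubs/pub"))
    books = ET.Element("books")
    for (a1, t), pubs in groups.items():
        book = ET.SubElement(books, "book")
        ET.SubElement(book, "author").text = a1
        ET.SubElement(book, "title").text = t
        pubs_el = ET.SubElement(book, "pubs")
        for p in pubs:
            ET.SubElement(pubs_el, "pub").text = p
    return books


view = integrated_view(AMAZON, DBLP)
print(ET.tostring(view, encoding="unicode"))
```

The nested loops make the join explicit but quadratic. A declarative language leaves the choice of join strategy to the mediator, which is part of the argument for the database perspective.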

Information Integration & Mediation for Scientific Data
... a different set of problems (reality) came our way ...

A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity?
How about other rodents?
[Figure: the question goes to Information Integration (“Complex Multiple-Worlds” Mediation), which draws on the sources protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), and morphometry (SYNAPSE)]

A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry?
How does it relate to host rock structures?
[Figure: the question goes to Information Integration (“Complex Multiple-Worlds” Mediation), which draws on the sources Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), and Foliation Map (structure DB)]

Information Integration Landscape
[Figure: systems plotted on two axes, conceptual distance (one-world to multiple-worlds) and conceptual complexity/depth (low to high). Regions: DB mediation techniques; Ontologies, KR formalisms; Model-Based Mediation. Systems shown: addall, book-buyer, home-buyer, 24x7 consumer, BLAST, EcoCyc, Cyc, WordNet, GO, NCBI, UMLS, MIA, Entrez, RiboWeb, Tambis. Domains: Bioinformatics, Geoinformatics.]

What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
– DTDs talk about element nesting
– XML Schema schemas give you data types
– need anything else? => write comments!
• Domain Semantics is complex:
– implicit assumptions, hidden semantics
– sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
– employ richer OO models (UML, EER, ...)
– make domain semantics and “glue knowledge” explicit
– use ontologies to fix terminology and conceptualization
– avoid ambiguities by using formal semantics

XML-Based vs. Model-Based Mediation
[Figure: two stacks built over the same Raw Data.
XML-based: Raw Data, then XML Elements, then XML Models. Structural Constraints (DTDs) only, e.g. A = (B*|C),D and B = ...; Parent, Child, Sibling, ...; No Domain Constraints. Integrated-DTD := XML-QL(Src1-DTD,...).
Model-based: Raw Data, then (XML) Objects, then Conceptual Models with Classes, Relations, is-a, has-a, ... (classes C1, C2, C3 and relation R); Logical Domain Constraints (IF ... THEN ...); Glue Maps (DMs, PMs). Integrated-CM := CM-QL(Src1-CM,...).]
CM ~ {Descr. Logic, ER, UML, RDF/XML(-Schema), …}
CM-QL ~ {F-Logic, DAML+OIL, …}

What’s the Glue? What’s in a Link?
• Syntactic Joins
    (X,Y) := X.SSN = Y.SSN                                   (equality)
    (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins
    (X,Y,Score) := BLAST(X,Y,Score)                          (similarity)
• Semantic/Rule-Based Joins
    (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8         (homology, lub)
    (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.    (rule-based)
    e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
[Figure: X and Y connected by a link]
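The three kinds of link can be contrasted in a short Python sketch. Everything concrete here is an assumption made for illustration: the records, the class names, and the `SIMILARITY` table, which stands in for a call to a similarity service such as BLAST. The semantic join also simplifies "lub" to plain equality of the two classes instead of computing a least upper bound in a concept hierarchy.

```python
from itertools import product

# Toy records; identifiers, classes and scores are invented for illustration.
XS = [{"id": "x1", "umls_id": "C01", "isa": "calcium_binding_protein"},
      {"id": "x2", "umls_id": "C02", "isa": "kinase"}]
YS = [{"id": "y1", "uid": "C01", "isa": "calcium_binding_protein"},
      {"id": "y2", "uid": "C09", "isa": "kinase"}]

# Stand-in for a similarity service such as BLAST (scores are made up).
SIMILARITY = {("x1", "y1"): 0.93, ("x2", "y2"): 0.41}


def similarity(x, y):
    """Look up the similarity score of a pair, 0.0 if none is recorded."""
    return SIMILARITY.get((x["id"], y["id"]), 0.0)


def syntactic_join(xs, ys):
    """(X,Y) := X.UMLS-ID = Y.UID"""
    return [(x["id"], y["id"]) for x, y in product(xs, ys)
            if x["umls_id"] == y["uid"]]


def speciality_join(xs, ys):
    """(X,Y,Score) := BLAST(X,Y,Score), keeping only pairs that have a score."""
    return [(x["id"], y["id"], similarity(x, y)) for x, y in product(xs, ys)
            if similarity(x, y) > 0.0]


def semantic_join(xs, ys, threshold=0.8):
    """(X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S > threshold"""
    return [(x["id"], y["id"], x["isa"]) for x, y in product(xs, ys)
            if x["isa"] == y["isa"] and similarity(x, y) > threshold]


print(syntactic_join(XS, YS))
print(speciality_join(XS, YS))
print(semantic_join(XS, YS))
```

The pair (x2, y2) shows the difference: it shares a class and has a similarity score, so the speciality join returns it, but the score is below the threshold, so the semantic join does not.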

Model-Based Mediation Methodology ...
• Lift Sources to export Conceptual Models (CMs): CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
– complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
– explicit representation of (“hidden”) source semantics
– logic rules over OM(S)
• Contextualization CON(S):
– situate OM(S) data using “glue maps” (GMs):
    – domain maps DMs (ontology) = terminological knowledge: concepts + roles
    – process maps PMs = “procedural knowledge”: states + transitions

... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
– declarative (logic) rules with object-oriented features
– defined over CM(S), domain maps, process maps
– needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
– mediator composes the user query Q with the IVD
– ... rewrites (Q o IVD), sends subqueries to sources
– ... post-processes returned results (e.g., situate in context)
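The runtime steps can be sketched as follows. This is a deliberately small illustration under assumptions, not the mediator's actual rewriting algorithm: the two source functions, their data, and the choice of which condition is pushed to a source are all hypothetical. The organism condition of the query is sent to the source as a subquery, while the region condition is checked at the mediator after the join.

```python
# Hypothetical sources; the first one can answer a subquery restricted by organism.
def source_prolab(organism):
    data = [("ncs1", "rat", "purkinje_cell"), ("ncs1", "mouse", "granule_cell")]
    return [(p, o, a) for (p, o, a) in data if o == organism]


def source_anatom():
    return {"purkinje_cell": "cerebellum", "granule_cell": "cerebellum"}


def ivd(organism):
    """Integrated view: join protein localizations with their anatomical context."""
    regions = source_anatom()
    for protein, org, anatom in source_prolab(organism):   # selection pushed to source
        yield {"protein": protein, "organism": org,
               "anatom": anatom, "region": regions.get(anatom)}


def answer(query_organism, query_region):
    """Q o IVD: compose the query with the view, then filter at the mediator."""
    return [row for row in ivd(query_organism) if row["region"] == query_region]


result = answer("rat", "cerebellum")
print(result)
```

Attaching `region` to each row is a minimal form of the post-processing step: the result is returned together with the context it was situated in.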

Model-Based Mediator Architecture
[Figure: sources S1, S2, S3, each behind an (XML-Wrapper) and a CM-Wrapper that exports CM(S) = OM(S) + KB(S) + CON(S) as GCM CM S1, GCM CM S2, GCM CM S3. CM Queries & Results (exchanged in XML) flow between the wrappers, the Mediator Engine, and the USER/Client. The Mediator Engine (FL rule proc., LP rule proc., Graph proc., DDB engine) evaluates the Integrated View Definition IVD to provide the CM (Integrated View). “Glue” Maps GMs, i.e. Domain Maps DMs and Process Maps PMs, supply the semantic context CON(S).]
First results: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] [EDBT02], ...
BIRN-CC, ...

Domain Maps & Ontologies as “Glue Knowledge Sources”
• Domain Map (Ontology)
– conceptualization of relevant entities and relationships
– formal representation of terminological knowledge
• Use in Model-Based Mediation
– (derived) concepts as “drop points”, “anchor points”, “context” for source classes
– compile-time use: view definition, subsumption, classification, ...
– runtime use: querying/deduction, path queries, ...
• KR Formalisms:
– Semantic nets, Thesauri, Frame-Logic, Description Logics, ...

Domain Experts’ “Glue Knowledge”
[Figure: the NCMIR ANATOM Domain Map (concepts, relations, logic rules) connecting Source 1, Source 2, and Source 3. Concepts shown: Cerebellum, Cerebellar Cortex, Granule Cell Layer, Purkinje Cell Layer, Molecular Layer, Purkinje Neuron, Purkinje Cell Dendrite, Dendritic spines, Dendritic shaft, Endoplasmic reticulum. Edges are labeled “has a”.]

Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figure: the resulting Domain Map (DM) as a graph, and the same DM in Description Logic]
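Read as a labeled graph, the expert statements translate into (concept, role, concept) edges, and a path query becomes a graph traversal. The sketch below is one possible encoding in Python, not the F-logic formalization used in the talk; the concept and role identifiers are shortened forms chosen for the example.

```python
from collections import deque

# Labeled edges (concept, role, concept), transcribed from the expert statements.
DOMAIN_MAP = [
    ("purkinje_cell", "has", "dendrite"),
    ("pyramidal_cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "contains", "spine"),
    ("spine", "isa", "ion_regulating_component"),
    ("spine", "has", "ion_binding_protein"),
    ("ion_binding_protein", "controls", "ion_activity"),
    ("ion_regulating_component", "affects", "ion_activity"),
    ("neurotransmission", "involves", "ion_activity"),
]


def reachable(start, roles):
    """Concepts reachable from `start` along edges whose role is in `roles`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for src, role, dst in DOMAIN_MAP:
            if src == node and role in roles and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


parts = reachable("purkinje_cell", {"has", "contains"})
print(sorted(parts))
```

Restricting the traversal to a set of roles is what turns plain reachability into a path query, for example "everything a Purkinje cell has or contains, directly or indirectly".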

Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
– sources can register new concepts at the mediator ...

Query Processing “Demo”
[Figure: query results in context; contextualization CON(Result) wrt. ANATOM]

Integrated View Definition

    DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
    IF
      I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
          anatomical_structures ->> {AS:anatomical_structure[name->Anatom]} ],   % from PROLAB
      NAE:neuro_anatomic_entity[ name->Anatom;                                    % from ANATOM
          located_in->>{Brain_region} ],
      AS..segments..features[ name->Feature_name; value->Value ].

• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)
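To make the rule easier to read for those who do not know F-logic, here is a procedural paraphrase in Python. It is an approximation under assumptions: nested dictionaries stand in for the PROLAB and ANATOM objects, all data values are invented, and the explicit loops replace what a deductive engine would do by unification. The shared variable `Anatom` becomes the equality test that joins the two sources.

```python
# Toy object bases for the two sources named in the rule (contents invented).
PROLAB_IMAGES = [{
    "proteins": ["ncs1"],
    "organism": "rat",
    "anatomical_structures": [{
        "name": "purkinje_cell",
        "segments": [{"features": [{"name": "density", "value": 0.7}]}],
    }],
}]
ANATOM_ENTITIES = [{"name": "purkinje_cell", "located_in": ["cerebellum"]}]


def protein_distribution(images, entities):
    """Yield (Protein, Organism, Brain_region, Feature_name, Anatom, Value) tuples."""
    for image in images:                                   # I:protein_label_image
        for structure in image["anatomical_structures"]:   # AS:anatomical_structure
            for entity in entities:                        # NAE:neuro_anatomic_entity
                if entity["name"] != structure["name"]:    # shared variable Anatom
                    continue
                for segment in structure["segments"]:      # AS..segments..features
                    for feature in segment["features"]:
                        for protein in image["proteins"]:
                            for region in entity["located_in"]:
                                yield (protein, image["organism"], region,
                                       feature["name"], structure["name"],
                                       feature["value"])


rows = list(protein_distribution(PROLAB_IMAGES, ANATOM_ENTITIES))
print(rows)
```

The seven nested loops for one rule body show why a declarative formulation is attractive here: the F-logic version states only the conditions and leaves the evaluation order to the engine.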

Process Maps with Abstractions and Elaborations: => From Terminological to “Procedural Glue”
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges:
how about these?

What’s in an Answer? (What’s in a Link? revisited)
• Semantic/Rule-Based Joins
    (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.    (rule-based)
    e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• What is the Erdős number of person P?
– 3
• Really? Why?
– authority based: <VIP> said so
– faith based: don’t know but believe firmly
– query statement Q = ... derived it from DB
– query Q = ... derived it from DB and KB using derivation D
    – logic-based systems often “come with explanations”
    – ultimate goal: “computations as proofs”, “explanation-based computing”
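The difference between an answer and an answer with a derivation can be shown on the Erdős-number question. The following Python sketch assumes a tiny, made-up co-authorship graph with placeholder names. A breadth-first search returns the number together with the chain of co-authors that justifies it, which serves as a simple form of explanation.

```python
from collections import deque

# Toy co-authorship facts (names are placeholders, not real bibliographic data).
COAUTHOR = {
    "erdos": {"a"},
    "a": {"erdos", "b"},
    "b": {"a", "p"},
    "p": {"b"},
}


def erdos_number(person, graph, root="erdos"):
    """Return (number, derivation): the shortest co-author chain from root to person."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node == person:
            chain = []
            while node is not None:
                chain.append(node)
                node = parent[node]
            chain.reverse()
            return len(chain) - 1, chain
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None, []


number, derivation = erdos_number("p", COAUTHOR)
print(number, derivation)
```

Here the chain plays the role of the derivation D: anyone who doubts the number can check each co-authorship link in it against the database.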

Summary: Mediation Scenarios & Techniques
Federated Databases     | XML-Based Mediation     | Model-Based Mediation
One-World               | One-/Multiple-Worlds    | Complex Multiple-Worlds
Common Schema           | Mediated Schema         | Common Glue Maps
SQL, rules              | XML query languages     | DOOD query languages
Schema Transformations  | Syntax-Aware Mappings   | Semantics-Aware Mappings
Syntactic Joins         | Syntactic Joins         | “Semantic” Joins via Glue Maps
DB expert               | DB expert               | KRDB + domain expert

Technical Issues and Challenges
• Integration Method and Architecture
– federated DBs, warehouse/wrapper-mediator approach, GAV/LAV, Grid infrastructure, ...
• Suitable KRDB Formalisms and Frameworks
– XML, DTDs, XML Schema, XPath, XQuery, ...
– RDF(S), Ontologies, Description Logics, DAML+OIL, ...
– querying, deduction, subsumption, classification, ...
• Algorithms and Implementation
– query composition, rewriting, reasoning, source capabilities, ...
• Information Integration Scenario and Scope
– simple/complex, single/multiple worlds, ...

The Larger Infrastructure / Interoperability Picture
• The Grids
– Data-Grid (SRB, ...), Computational-Grid (Globus, ...), “Knowledge-Grid”, ...
• The Webs
– W3C: HTML, XML, Semantic Web (RDF(S), DAML+OIL, ...)
• Service & Protocol-Oriented Architectures
– WSDL, SOAP, CORBA, EJB, ...
• The Application Level
– applications (computations + KRDB mediation) are chained together to form ...
=> analytical “Knowledge” Pipelines:
• NIH BIRN: LONI, NSF GriPhyN, DOE SciDAC, PDB, ASC, AVIRIS, ...
=> Data => Computations => Analysis => Knowledge =>

Thank You!
Questions? Queries?

Models and Formal Approaches: Relating Theory to the World
©2000 by John F. Sowa, http://www.jfsowa.com/krbook/, Knowledge Representation: Logical, Philosophical, and Computational Foundations, Brooks/Cole, Pacific Grove, CA.
All models are wrong, but some are useful!

Ontologies
• So what is an Ontology?
– definition of things that are relevant to your application
– representation of terminological knowledge (“TBox”)
– explicit specification of a conceptualization
– concept hierarchy (“is-a”)
– further semantic relationships between concepts
– abstractions of relational schemas, (E)ER, UML classes, XML Schemas
• Examples:
– NCMIR ANATOM
– GO (Gene Ontology)
– UMLS (Unified Medical Language System)
– CYC

Description Logics
• Terminological Knowledge (TBox)
– Concept Definition (naming of concepts):
– Axiom (constraining of concepts):
=> a mediator’s “glue knowledge source”
• Assertional Knowledge (ABox)
– the marked neuron in image 27
=> the concrete instances/individuals of the concepts/classes that your sources export
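The TBox/ABox split can be mimicked in a few lines of Python. This is a simplified sketch, not a description logic reasoner: the TBox holds only explicitly stated is-a links between atomic concepts, the concept names are illustrative, and the individual `marked_neuron_img27` is a made-up identifier for "the marked neuron in image 27".

```python
# TBox: terminological knowledge as direct is-a links (concept names are illustrative).
TBOX = {
    "purkinje_neuron": {"neuron"},
    "neuron": {"cell"},
    "cell": set(),
}

# ABox: assertions about individuals, e.g. "the marked neuron in image 27".
ABOX = {"marked_neuron_img27": "purkinje_neuron"}


def subsumers(concept):
    """All concepts that subsume `concept`, including the concept itself."""
    found, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(TBOX.get(current, ()))
    return found


def instance_of(individual, concept):
    """Instance check: does the ABox individual belong to `concept` given the TBox?"""
    return concept in subsumers(ABOX[individual])


print(instance_of("marked_neuron_img27", "cell"))
```

A source only has to assert the most specific class of its data. Membership in the more general classes follows from the mediator's terminological knowledge.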

Description Logic
• DL definition of “Happy Father” (Example from Ian Horrocks, U Manchester, UK)
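The formula for this example is not present in the text. A commonly cited form of it, which may differ in notation or detail from what the slide showed, is:

```latex
\mathit{HappyFather} \doteq \mathit{Man}
  \sqcap \exists \mathit{hasChild}.\mathit{Female}
  \sqcap \exists \mathit{hasChild}.\mathit{Male}
  \sqcap \forall \mathit{hasChild}.(\mathit{Rich} \sqcup \mathit{Happy})
```

Read aloud: a happy father is a man who has at least one daughter and at least one son, and all of whose children are rich or happy.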

Some Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
– FaCT description logic reasoner for DMs?
– or reconciliation of DMs via argumentation frameworks (“games”) using well-founded and stable models of logic programs [ICDT97, PODS97, TCS00]
• Modeling “Process Knowledge” => Process Maps
– formal semantics? (dynamic/temporal/Kripke models?)
– executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
– expressible in F-logic [InfSystem98]
– scalability? (UMLS Domain Map has millions of entries)
• ...