Typing Semistructured Data
By,
Keshava Reddy Kottapally
Goutham Chinnapolamada
Source:
Serge Abiteboul, Dan Suciu, Peter Buneman, Data on the web: From relations to semistructured data and XML, Morgan Kaufmann Series, ISBN 1-55860-622-X, 1999
Typing Semistructured Data
• Introduction: Schema for semistructured data
• Motivation for typing semistructured data
• Schema formalisms:
– First-order logic
– Datalog
– Graph simulations
• Extracting schemas from data
• Inferring schemas from queries
• Path constraints
What is semistructured data?
• Semistructured data has some structure, but is difficult to describe with a predefined, rigid schema
– Irregularity
– Continual evolution
– Structure that is implicit or unknown to the user
What is typing?
• Typing is about finding the structure of semistructured data
• The idea of structuring semistructured data is still an area of much research activity
• Typing involves finding methods to provide schemas for semistructured data
• Typing for semistructured data (SSD) differs from typing for relational or object-oriented data and hence needs separate methods
Uses of typing SSD
• To optimize query evaluation
Example:
Original query:
select X.title
from biblio._ X
where X.*.zip = "12345"
Optimized form:
select X.title
from biblio.book X
where X.address.zip = "12345"
[Figure: schema graph for the biblio database, with edges labeled biblio, book, paper, title, author, first name, last name, address, street, city, zip, journal, and year, leading to string leaves.]
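The rewrite of the wildcard query can be illustrated with a small sketch. It assumes a hypothetical schema graph for the biblio database (the node names Biblio, Book, Paper, Author, Address and the exact edge layout are illustrative assumptions, not taken from the figure) and enumerates every label path that ends in zip. If only one such path exists, the wildcards in the query can be replaced by that path.

```python
# Hypothetical schema graph: node -> {edge label: child node}.
# Node names and layout are assumptions made for this sketch.
SCHEMA = {
    "Biblio": {"book": "Book", "paper": "Paper"},
    "Book": {"title": "String", "author": "Author", "address": "Address"},
    "Paper": {"title": "String", "author": "Author",
              "journal": "String", "year": "String"},
    "Author": {"firstname": "String", "lastname": "String"},
    "Address": {"street": "String", "city": "String", "zip": "String"},
    "String": {},
}


def paths_ending_in(schema, node, label, prefix=(), seen=frozenset()):
    """Return every label path from `node` whose last edge is `label`."""
    if node in seen:  # guard against cyclic schema graphs
        return []
    found = []
    for edge, child in schema[node].items():
        path = prefix + (edge,)
        if edge == label:
            found.append(path)
        found.extend(
            paths_ending_in(schema, child, label, path, seen | {node})
        )
    return found


# All ways to reach a zip edge below biblio.
zip_paths = paths_ending_in(SCHEMA, "Biblio", "zip")
print(zip_paths)
```

Under this assumed schema the only path is book.address.zip, which is why `biblio._ X` with `X.*.zip` can be narrowed to `biblio.book X` with `X.address.zip`.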
Uses of typing continued...
• To facilitate the task of integrating several data sources
• To improve storage
– Better clustering may reduce the number of page fetches, thus improving query performance
• To construct indexes
• To describe the database content to users and facilitate query formulation
• To proscribe certain updates
Two ways of typing..
• Schema extraction
– Given one particular data instance, finding the most specific schema for it
– With semistructured data we may specify the type after the database is populated
– A data instance may have more than one type
• Schema inference
– Finding the most specific schema by analyzing the query
– This process is similar to type inference in programming languages
The problem
• Given a database and a type,
– does the database conform to this type?
• Classification of objects
– Which objects belong to each class?
• Typing involves describing the structure of each class and its relationships with other classes
Difference between typing SSD and Object Databases
• Classes are defined less precisely. As a consequence, objects may belong to several classes
• Some objects may not belong to any class or may have properties that do not pertain to any class
• The typing may be approximate. For example, we may accept in a class an object that does not quite conform to the specification of that class.
Schema formalisms
First-order logic
Datalog
Simulation
First-order logic
• Example: Consider three kinds of objects in the database
– Root object(s) have
• Outgoing edges labeled company to company objects and person to person objects
– Person objects have
• Outgoing edges labeled name and position to string objects
• Outgoing edges labeled worksfor to company objects
• Incoming edges labeled manager and employee from company objects
– Company objects have
• Outgoing edges labeled name and address to string objects
• Outgoing edges labeled manager and employee to person objects
• Incoming edges labeled worksfor from person objects
• If:
– If an object has a-edges to strings and b-edges from c' objects, then it is a c-object:
$(\exists Y, Z\,(ref(X,a,Y) \land string(Y) \land c'(Z) \land ref(Z,b,X))) \to c(X)$
• Only-if:
– Any c-object has some a-edges to strings and some b-edges from c' objects:
$c(X) \to (\exists Y, Z\,(ref(X,a,Y) \land string(Y) \land c'(Z) \land ref(Z,b,X)))$
• If and only if:
$c(X) \leftrightarrow (\exists Y, Z\,(ref(X,a,Y) \land string(Y) \land c'(Z) \land ref(Z,b,X)))$
• Consequence:
– $c(X) \land ref(Z,b,X) \to c'(Z)$
– $c(X) \land ref(X,a,Y) \to string(Y)$
– $c(X) \land ref(X,L,Y) \land L \neq a \land L \neq b \to false$
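As a rough illustration of the if-and-only-if reading, the sketch below evaluates the right-hand side of the rule over a tiny, hypothetical set of facts and collects the objects that qualify as c-objects. The object names (x1, x2, z1, s1, s2) and the variable names are assumptions made for this example only.

```python
# Hypothetical extensional facts: ref(source, label, target), string, c'.
ref = {("z1", "b", "x1"), ("x1", "a", "s1"), ("x2", "a", "s2")}
string = {"s1", "s2"}
c_prime = {"z1"}


def satisfies_body(x):
    """True if some Y, Z exist with ref(x,a,Y), string(Y), c'(Z), ref(Z,b,x).

    Y and Z are quantified independently, so the two halves of the
    conjunction can be checked separately.
    """
    a_edge_to_string = any(
        s == x and label == "a" and t in string for s, label, t in ref
    )
    b_edge_from_c_prime = any(
        t == x and label == "b" and s in c_prime for s, label, t in ref
    )
    return a_edge_to_string and b_edge_from_c_prime


# Every object mentioned in some edge.
objects = {n for s, _, t in ref for n in (s, t)}

# Under the "if and only if" rule, c is exactly the set satisfying the body.
c = {x for x in objects if satisfies_body(x)}
print(c)
```

Here x2 has an a-edge to a string but no b-edge from a c' object, so only x1 is classified as a c-object.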
Problem definition with first-order logic
• The previous questions on typing can be restated in terms of first-order logic
– Does D satisfy T, noted D |= T, that is, is there a model of T that coincides with D over the extensional predicates?
– If D |= T, what is the classification that is induced?
• First-order logic leads to very general typings, probably too general for what is needed in semistructured data
• It could also lead to undecidability or intractability
Datalog: A rule-based language
• Datalog allows us to state that if a conjunction of facts holds, then some new fact can be derived
• Datalog rules allow us to define classes by specifying what incoming and outgoing edges are required
• Example:
– r(X) :- ref(X, person, Y), p(Y), ref(X, company, Z), c(Z)
– p(X) :- c(Y), ref(Y, manager, X), c(Z), ref(Z, employee, X), ref(X, worksfor, U), c(U), ref(X, name, N), string(N), ref(X, position, P), string(P)
– c(X) :- p(Z), ref(Z, worksfor, X), ref(X, manager, M), p(M), ref(X, employee, E), p(E), ref(X, name, N), string(N), ref(X, address, A), string(A)
Fixpoint semantics
• Least fixpoint semantics
– Starting from an empty set of typing facts, no rule can fire, because every rule body requires at least one typing fact (r, p, or c). Hence the least fixpoint of this program contains no typing facts and classifies nothing
• Greatest fixpoint semantics
– Types the largest set of objects
• The goal is to find the greatest fixpoint for a given data graph D: the desired model is the greatest fixpoint containing D.
Consider the following data graph D:
&o1 { company: &o2 { name: &o5 “o2”,
                     address: &o6 “Versailles”,
                     manager: &o3,
                     employee: &o3, employee: &o4 },
      person: &o3 { name: &o7 “Francois”,
                    position: &o8 “CEO”,
                    worksfor: &o2 },
      person: &o4 { name: &o9 “Lucien”,
                    position: &o10 “programmer”,
                    worksfor: &o2 }
}
• ref(&o1, company, &o2), ref(&o2, name, &o5), etc.
• string(&o5), string(&o6), etc.
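The translation of D into ref and string facts can be sketched as follows. The dictionary encoding (an oid maps either to a list of labeled edges or to an atomic string value) is an assumption chosen for this illustration, not a representation prescribed by the source.

```python
# Data graph D, encoded as oid -> list of (label, target oid) for complex
# objects, or oid -> str for atomic objects. The encoding is an assumption.
D = {
    "&o1": [("company", "&o2"), ("person", "&o3"), ("person", "&o4")],
    "&o2": [("name", "&o5"), ("address", "&o6"), ("manager", "&o3"),
            ("employee", "&o3"), ("employee", "&o4")],
    "&o3": [("name", "&o7"), ("position", "&o8"), ("worksfor", "&o2")],
    "&o4": [("name", "&o9"), ("position", "&o10"), ("worksfor", "&o2")],
    "&o5": "o2", "&o6": "Versailles",
    "&o7": "Francois", "&o8": "CEO",
    "&o9": "Lucien", "&o10": "programmer",
}

# One ref(source, label, target) fact per edge of a complex object.
ref = {
    (src, label, dst)
    for src, value in D.items() if isinstance(value, list)
    for label, dst in value
}

# One string(oid) fact per atomic object holding a string.
string = {oid for oid, value in D.items() if isinstance(value, str)}

print(sorted(ref))
print(sorted(string))
```

Sets are used so that each fact appears once, matching the reading of ref and string as extensional predicates.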
Deriving the greatest fixpoint
• The desired model M can be derived by starting from a model containing D and all possible typing facts. Let
J0 = D ∪ { r(&o1), r(&o2), r(&o3), r(&o4), p(&o1), p(&o2), p(&o3), p(&o4), c(&o1), c(&o2), c(&o3), c(&o4) }
• Deriving from J0 until a fixpoint is reached yields the desired model
M = J2 = J1 = D ∪ { r(&o1), c(&o2), p(&o3), p(&o4) }
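The downward iteration can be sketched in a few lines. To keep the trace short, this example uses a reduced, hypothetical variant of the data graph (root &o1, one company &o2, and one person &o3 who is both its manager and an employee); the helper names `out`, `inc`, `step`, and `fixpoint` are likewise assumptions of the sketch. Starting from all typing facts gives the greatest fixpoint, and starting from none gives the least.

```python
# Reduced, hypothetical data graph: one company whose only person is both
# its manager and an employee.
ref = {
    ("&o1", "company", "&o2"), ("&o1", "person", "&o3"),
    ("&o2", "name", "&o5"), ("&o2", "address", "&o6"),
    ("&o2", "manager", "&o3"), ("&o2", "employee", "&o3"),
    ("&o3", "name", "&o7"), ("&o3", "position", "&o8"),
    ("&o3", "worksfor", "&o2"),
}
string = {"&o5", "&o6", "&o7", "&o8"}
complex_objects = {"&o1", "&o2", "&o3"}


def out(x, label, cls):
    """True if x has an outgoing `label` edge to some object in `cls`."""
    return any(s == x and l == label and t in cls for s, l, t in ref)


def inc(x, label, cls):
    """True if x has an incoming `label` edge from some object in `cls`."""
    return any(t == x and l == label and s in cls for s, l, t in ref)


def step(r, p, c):
    """Apply the three typing rules once to the current class extents."""
    new_r = {x for x in complex_objects
             if out(x, "person", p) and out(x, "company", c)}
    new_p = {x for x in complex_objects
             if inc(x, "manager", c) and inc(x, "employee", c)
             and out(x, "worksfor", c)
             and out(x, "name", string) and out(x, "position", string)}
    new_c = {x for x in complex_objects
             if inc(x, "worksfor", p)
             and out(x, "manager", p) and out(x, "employee", p)
             and out(x, "name", string) and out(x, "address", string)}
    return new_r, new_p, new_c


def fixpoint(r, p, c):
    """Iterate `step` until the class extents stop changing."""
    while True:
        nxt = step(r, p, c)
        if nxt == (r, p, c):
            return r, p, c
        r, p, c = nxt


everything = set(complex_objects)
greatest = fixpoint(set(everything), set(everything), set(everything))
least = fixpoint(set(), set(), set())
print(greatest)
print(least)
```

Because the rules are monotone, iterating from the full set of typing facts can only remove facts, so the first stable state is the greatest fixpoint. On this reduced graph it classifies &o1 as r, &o3 as p, and &o2 as c, while the least fixpoint stays empty.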
Simulation
• The aim is to produce a schema graph for a data graph, whose semantics amount to a listing of all permitted labels.
• A schema graph is similar to a data graph, with the following changes
– Labels can be alternations (like address | name | url) or underscore
– Atomic values are type names, like string, int, float, etc.
– Oids of complex objects are called classes, like Person, Company, etc.
[Figure: example data graph. Root &r1; person objects &p1, &p2, &p3; company objects &c1, &c2; atomic nodes &s0 to &s10 (values include “Smith”, “Mgr”, “Widget”, “Trent”, “Joe”); other nodes &a1 to &a7. Edge labels: person, company, manager (mgr), employee (emp), worksfor, name, position, addr, phone, url, description, procurement, salesrep, contact, task, performance, 1997, 1998.]
Schema graph
[Figure: schema graph with nodes Root, Person, Company, String, and Any, and edges labeled company, person, employee, manager, worksfor, name|address|url, name|phone|position, and description.]
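• Simulation is defined as follows: given graphs G1 = (V1, E1) and G2 = (V2, E2), a relation R between V1 and V2 is a simulation if it satisfies
$\forall l \in L\ \forall x_1, y_1 \in V_1\ \forall x_2 \in V_2\ \big(x_1[l]y_1 \land x_1 R x_2 \Rightarrow \exists y_2 \in V_2\ (y_1 R y_2 \land x_2[l]y_2)\big)$
where x[l]y denotes an edge labeled l from x to y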
• The rule says that every edge in G1 must have a “corresponding” edge in G2 under the simulation
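[Figure: the simulation condition. An edge from x1 to y1 labeled l in G1, with x1 R x2, is matched by an edge from x2 to y2 labeled l in G2, with y1 R y2.]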
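The simulation condition can be checked directly for a given relation. The sketch below assumes graphs are given as sets of labeled edges (source, label, target) and R as a set of pairs; the function name `is_simulation` and the small person/name example are hypothetical.

```python
def is_simulation(edges1, edges2, R):
    """Check the simulation condition for R, a set of (G1 node, G2 node) pairs.

    For every edge x1 -[l]-> y1 in G1 and every x2 with x1 R x2, there must
    be an edge x2 -[l]-> y2 in G2 such that y1 R y2.
    """
    for x1, label, y1 in edges1:
        for a, x2 in R:
            if a != x1:
                continue
            matched = any(
                s == x2 and lab == label and (y1, y2) in R
                for s, lab, y2 in edges2
            )
            if not matched:
                return False
    return True


# Hypothetical data graph G1 and schema graph G2.
G1 = {("r", "person", "p1"), ("r", "person", "p2"),
      ("p1", "name", "s1"), ("p2", "name", "s2")}
G2 = {("Root", "person", "Person"), ("Person", "name", "String")}

R = {("r", "Root"), ("p1", "Person"), ("p2", "Person"),
     ("s1", "String"), ("s2", "String")}

print(is_simulation(G1, G2, R))                        # every edge is matched
print(is_simulation(G1, G2, R - {("p2", "Person")}))   # edge r -> p2 unmatched
```

Dropping the pair (p2, Person) breaks the condition, because the edge from r to p2 then has no corresponding edge out of Root whose target is related to p2.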
• To define a simulation between a semistructured data instance and a schema graph, we add the following additional requirements:
– The roots must be in the simulation: r R r’
– Whenever x R y, if y is an atomic type (like string, int), then x must be an atomic node too and have a value of that type. We say the simulation is typed
Data node                 Schema node
&r1                       Root
&c1, &c2                  Company
&p1, &p2, &p3             Person
&s0, &s1, &s2, &s3, …     string
&a1, &a2, &a3, &a4, …     Any
• The relation R defined by the example data graph and the given schema graph is a simulation
Back to the typing problem….
• When does a data graph D conform to a schema graph S?
– When there exists a rooted, typed simulation between the data and the schema
• Which objects belong to each class?
– The principle is that oid o belongs to class c if o R c. In this way, a rooted simulation R will always classify all objects.
– However, the classification need not be unique, which leads to finding the maximal simulation
[Figure: a data graph D (root &o with objects &b1 and &b2) and a schema graph S, built from book edges whose targets have title, author, publisher, and year edges to string nodes.]
Maximal simulation
• G1 <=R G2 : R is a simulation from G1 to G2
• Fact:
– If G1 <=R1 G2 and G1 <=R2 G2, then G1 <=(R1 ∪ R2) G2
– For any data graph D conforming to some schema graph S, there is always a maximal simulation from D to S.
• Back to the problem: which objects belong to each class?
– An object o belongs to class c if o R c, where R is the maximal simulation between the OEM data and the schema graph
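One way to compute the maximal simulation is by refinement: start from every candidate pair allowed by the typing requirement and repeatedly discard pairs that violate the simulation condition. The sketch below follows that idea on a hypothetical book example; the graph contents, the function names, and the simplification that all atomic values are strings are assumptions of this illustration.

```python
def maximal_simulation(nodes1, edges1, atomic1, nodes2, edges2, atomic2):
    """Largest typed simulation from graph 1 (data) to graph 2 (schema).

    Pairs (x1, x2) where x2 is an atomic type but x1 is not an atomic node
    are excluded up front; value types are not checked in this sketch.
    """
    R = {(x1, x2) for x1 in nodes1 for x2 in nodes2
         if x2 not in atomic2 or x1 in atomic1}
    changed = True
    while changed:
        changed = False
        for x1, x2 in list(R):
            for s, label, y1 in edges1:
                if s != x1:
                    continue
                matched = any(
                    s2 == x2 and l2 == label and (y1, y2) in R
                    for s2, l2, y2 in edges2
                )
                if not matched:
                    R.discard((x1, x2))
                    changed = True
                    break
    return R


def classes_of(R, obj):
    """Schema nodes that `obj` is related to under R."""
    return {x2 for x1, x2 in R if x1 == obj}


# Hypothetical data graph: two books, the second one also has a year.
data_edges = {("o", "book", "b1"), ("o", "book", "b2"),
              ("b1", "title", "t1"), ("b1", "author", "a1"),
              ("b2", "title", "t2"), ("b2", "author", "a2"),
              ("b2", "year", "y2")}
data_atomic = {"t1", "a1", "t2", "a2", "y2"}
data_nodes = {"o", "b1", "b2"} | data_atomic

# Hypothetical schema graph with two book classes.
schema_edges = {("Root", "book", "B1"), ("Root", "book", "B2"),
                ("B1", "title", "String"), ("B1", "author", "String"),
                ("B2", "title", "String"), ("B2", "author", "String"),
                ("B2", "year", "String")}
schema_nodes = {"Root", "B1", "B2", "String"}

R = maximal_simulation(data_nodes, data_edges, data_atomic,
                       schema_nodes, schema_edges, {"String"})
print(classes_of(R, "b1"))
print(classes_of(R, "b2"))
print(("o", "Root") in R)   # rooted: the data conforms to the schema
```

In this example b1 belongs to both B1 and B2, while b2 belongs only to B2 because B1 has no year edge, which illustrates that an object may fall into several classes under the maximal simulation.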