1
Capturing both Types and Constraints
in Data Integration
Michael Benedikt Bell Laboratories
Chee-Yong Chan National Univ. of Singapore
Wenfei Fan Bell Laboratories
Juliana Freire Oregon Health & Science Univ. (OGI)
Rajeev Rastogi Bell Laboratories
2
Schema-directed integration
Integration:
– extract relevant data from multiple, distributed databases
– construct an XML view
Schema-directed: conformance to a predefined schema (D, Σ)
– D: a DTD (type constraints)
– Σ: a set of XML integrity constraints (keys, foreign keys)
[Figure: a middleware integrates DB1, DB2 and DB3 into an XML view that conforms to (D, Σ) and is used for data exchange]
3
Example: hospital and insurance company
DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date)
DB2: billing (trId, price)
DB3: cover (policy, trId)
DB4: treatment (trId, name), procedure (trId1, trId2)
Daily XML report: given a date, for each patient of that day, report
– SSN, name (DB1)
– treatments (hierarchy) covered by insurance (DB1, DB3, DB4)
– cost of all and only those treatments received (DB2)
4
Predefined schema (D, Σ): DTD
report → patient*
patient → SSN, name, treatments, bill
treatments → treatment*
treatment → trId, tname, treatments
bill → item*
item → trId, price
[Figure: tree of a report for a given date: report has patient children; each patient has SSN, name, treatments and bill; treatments nests treatment (trId, tname, treatments, ...); bill has item children (trId, price)]
5
Predefined schema (D, Σ): XML constraints
Constraints are relative to each patient:
– key: each treatment is charged only once
  patient (item.trId → item)
– foreign key: every treatment has a billing record
  patient (treatment.trId ⊆ item.trId)
[Figure: the report tree, with trId under each treatment and under each item]
6
Challenge: nondeterministic structure
Certain structure cannot be decided at compile time:
– recursion: the expansion, and hence the depth of the XML tree, is unbounded
  treatments → treatment*
  treatment → trId, tname, treatments   -- recursive
– disjunction: the choice of children, e.g. A → B + C
[Figure: the report tree, in which the nesting of treatments is unbounded]
7
Challenge: context dependency
bill subtree: contains all and only the trIds in the treatments subtree
controlled derivation: the bill subtree cannot be started before the treatments subtree is completed
information passing: downward, upward, sideways
[Figure: under patient, the set of trIds (trIdS) collected from the unbounded treatments subtree is passed sideways to bill]
8
Challenges
– DTD-conformance: recursive, nondeterministic
– integrity constraints: validation during document generation
– multi-source queries: a single query may involve several databases
– context-dependency: not strictly top-down or bottom-up
Previous work:
– XML publishing: single data source, no constraints, top-down
– XML integration: little support for XML schema-conformance
– XML query languages + type checking: provide no guidance for how to ensure schema-conformance; optimization is hard
Schema-directed XML integration is nontrivial!
9
Our middleware for schema-directed integration
A lightweight language: Attribute Integration Grammar (AIG)
Cost-based optimization in light of context-dependency
[Figure: the middleware sits between the sources (DB1, DB2, DB3) and the schema (D, Σ); it provides a view definition language and optimization techniques]
AIG = DTD + semantic attributes + semantic rules + constraints
10
Attribute Integration Grammar (AIG)
DTD: element type definitions e ::= PCDATA | ε | e1, …, en | e1 + … + en | e*
Attributes: associated with each element type e
– Inh(e): inherited from parent/siblings (top-down/sideways)
– Syn(e): synthesized from children (bottom-up)
Syn(e), Inh(e): tuple or set/bag-valued
Rules: associated with each production e → α, for e’ in α
– Inh(child): Inh(e’) = Q(Inh(parent), Syn(sibling))
– Syn(parent): Syn(e) = ∪ (Syn(children)) -- union
Q: multi-source SQL query with parameters
Dependency: e2 must be evaluated before e1 if
Inh(e1) = Q( … Syn(e2) … ) -- acyclic graph (DAG)
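The dependency ordering can be illustrated with a small sketch. It assumes the attribute dependencies of the patient production from the running example, written as plain strings; the dictionary and the function name `evaluation_order` are illustrative, not part of AIG.

```python
from graphlib import TopologicalSorter

# Assumed dependencies for the children of `patient`:
# each key depends on every attribute in its value set.
deps = {
    "Inh(SSN)": {"Inh(patient)"},
    "Inh(name)": {"Inh(patient)"},
    "Inh(treatments)": {"Inh(patient)"},
    "Syn(treatments)": {"Inh(treatments)"},
    "Inh(bill)": {"Syn(treatments)"},  # sibling dependency
}

def evaluation_order(dependencies):
    """Return one order in which each attribute follows its prerequisites.

    TopologicalSorter raises CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())

order = evaluation_order(deps)
```

Any order returned this way places Syn(treatments) before Inh(bill), which is the constraint the dependency rule states.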
11
AIG semantics: conceptual evaluation
following the dependency ordering; starting from the root
report → patient*
Inh(patient) ← Q1(Inh(report))
select Inh(report) as date, p.SSN, p.name, p.policy
from DB1: patient p, DB1: visitInfo v
where p.SSN = v.SSN and v.date = Inh(report)
Recall DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date)
Parameter in a query: Inh(report) is treated as a constant
Data driven: the number of patient children depends on the result of Q1
[Figure: report passes its inherited attribute (the date) down to its patient children]
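A minimal sketch of Q1 as a parameterized query, assuming both DB1 relations live in one in-memory SQLite database. The sample rows, the date format and the function name `q1` are made up for illustration; DISTINCT is an added assumption so that a patient with several visits on the same day yields a single patient child.

```python
import sqlite3

def q1(conn, report_date):
    """Evaluate Q1 with Inh(report) bound to the constant report_date."""
    return conn.execute(
        """
        SELECT DISTINCT ? AS date, p.SSN, p.name, p.policy
        FROM patient p JOIN visitInfo v ON p.SSN = v.SSN
        WHERE v.date = ?
        ORDER BY p.SSN
        """,
        (report_date, report_date),
    ).fetchall()

# Hypothetical stand-in for DB1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (SSN TEXT, name TEXT, policy TEXT);
CREATE TABLE visitInfo (SSN TEXT, trId TEXT, date TEXT);
INSERT INTO patient VALUES ('111', 'Ann', 'P1'), ('222', 'Bob', 'P2');
INSERT INTO visitInfo VALUES
  ('111', 't1', '2003-06-09'),
  ('111', 't2', '2003-06-09'),
  ('222', 't1', '2003-06-10');
""")

# One tuple per patient child of report.
rows = q1(conn, "2003-06-09")
```

Each returned tuple becomes the inherited attribute of one patient element, so an empty result yields a report with no patient children.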
12
Multi-source query
patient → SSN, name, treatments, bill
Inh(SSN) = Inh(patient).SSN, . . . ,
Inh(treatments) = Q2(v) -- v = Inh(patient)
select t.trId, t.tname
from DB1: visitInfo i, DB3: cover c, DB4: treatment t
where i.SSN = v.SSN and i.date = v.date and t.trId = i.trId
and c.trId = i.trId and c.policy = v.policy
a single query uses DB1, DB3 and DB4
tuple- and set-valued attributes (Inh(SSN), Inh(treatments))
Recall
DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date)
DB3: cover (policy, trId)
DB4: treatment (trId, name), procedure (trId1, trId2)
13
Initial top-down pass: context-dependent
patient → SSN, name, treatments, bill
Inh(SSN) = Inh(patient).SSN, Inh(name) = Inh(patient).name,
Inh(treatments) = Q2(Inh(patient))
Inh(bill) = Syn(treatments) <-- halt
DTD-directed: generate children following the production
Inh(bill): defined with a sibling attribute, Syn(treatments)
dependency ordering: evaluate bill after treatments
[Figure: patient passes Inh to SSN, name and treatments; bill halts until Syn(treatments) is available]
14
Initial top-down pass: recursion
treatments → treatment*
Inh(treatment) ← Inh(treatments) -- set of (trId, tname)
Data driven: the expansion of treatments depends on Inh(treatments)
– empty: expansion terminates, and Syn(treatments) is empty
– nonempty: treatments expands
[Figure: under patient, treatments expands into one treatment child per tuple of Inh(treatments)]
15
Leaf step
treatment → trId, tname, treatments
Inh(trId) = Inh(treatment).trId, . . .,
trId → PCDATA
Syn(trId) = Inh(trId)
[Figure: each treatment passes Inh (trId, tname) to its trId and tname leaves; Syn flows back up from trId]
16
Bottom-up step: synthesize attributes
treatment → trId, tname, treatments
Syn(treatment) = Syn(trId) ∪ Syn(treatments)
treatments → treatment*
Syn(treatments) = ∪ Syn(treatment)
Processing of an element e: Inh(e) → subtree(e) → Syn(e)
[Figure: Syn values flow from trId and the nested treatments up through each treatment to the enclosing treatments]
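The top-down expansion and the bottom-up synthesis can be sketched together as one recursive function. This is an illustration only: the dictionaries stand in for DB4, the sample rows are invented, and deriving the nested Inh(treatments) from the procedure relation is an assumption, since the slides elide that rule. The sketch also assumes the procedure hierarchy is acyclic.

```python
# Hypothetical in-memory stand-ins for DB4 (not data from the study).
TREATMENT = {"t1": "surgery", "t2": "anesthesia", "t3": "x-ray"}
PROCEDURE = {"t1": ["t2", "t3"]}  # trId1 -> list of trId2

def expand_treatments(inh):
    """Expand `treatments` from Inh(treatments), a list of (trId, tname).

    Returns (subtree, syn), where syn is the set of trIds in the subtree.
    An empty `inh` ends the recursion with an empty Syn(treatments).
    """
    children, syn = [], set()
    for tr_id, tname in inh:
        # Assumed rule: the nested treatments come from procedure.
        sub_inh = [(c, TREATMENT[c]) for c in PROCEDURE.get(tr_id, [])]
        sub_tree, sub_syn = expand_treatments(sub_inh)
        children.append({"trId": tr_id, "tname": tname, "treatments": sub_tree})
        # Syn(treatment) = Syn(trId) ∪ Syn(treatments); the loop takes the union.
        syn |= {tr_id} | sub_syn
    return children, syn

tree, syn = expand_treatments([("t1", "surgery")])
```

Each element is handled in the order Inh(e), subtree(e), Syn(e): the function receives the inherited value, builds the subtree, and returns the synthesized set.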
17
Sideways step: controlled derivation
patient → SSN, name, treatments, bill
Inh(bill) = Syn(treatments)
bill → item*
Inh(item) ← Q’(Inh(bill))
select trId, price
from DB2: billing
where trId in Inh(bill) -- set membership test
DTD-directed: each step of construction follows a production
[Figure: under patient, Syn(treatments), the set of trIds, becomes Inh(bill); bill then expands into item children, each with a trId]
Recall DB2: billing (trId, price)
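A minimal sketch of Q’ with the set-valued parameter Inh(bill), assuming DB2 is an in-memory SQLite database. The function name `q_bill` and the sample prices are invented for illustration; the set membership test is expressed by expanding the set into one placeholder per element.

```python
import sqlite3

def q_bill(conn, inh_bill):
    """Return (trId, price) rows from billing whose trId is in Inh(bill)."""
    ids = sorted(inh_bill)
    if not ids:
        return []  # empty Inh(bill): bill has no item children
    marks = ",".join("?" for _ in ids)
    sql = f"SELECT trId, price FROM billing WHERE trId IN ({marks}) ORDER BY trId"
    return conn.execute(sql, ids).fetchall()

# Hypothetical stand-in for DB2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE billing (trId TEXT, price REAL);
INSERT INTO billing VALUES ('t1', 100.0), ('t2', 40.0), ('t9', 5.0);
""")

# Inh(bill) = Syn(treatments), here a set of two trIds.
items = q_bill(conn, {"t2", "t1"})
```

Because the parameter is the synthesized set from the treatments subtree, the query cannot be issued until that subtree is complete.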
18
Constraint compilation
Captured with rules on synthesized attributes of patient:
– trIdB: bag-valued, collecting trId’s under item
– trIdS1: set-valued, collecting trId’s under treatment
– trIdS2: set-valued, collecting trId’s under item
key: patient (item.trId → item)
unique (Syn(patient).trIdB) -- no duplicates in the bag
foreign key: patient (treatment.trId ⊆ item.trId)
subset (Syn(patient).trIdS1, Syn(patient).trIdS2)
compilation: semantic rules and attributes for constraints are automatically generated and evaluated
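The two compiled checks can be sketched as plain functions over the synthesized attributes of one patient. The names `unique`, `subset`, trIdB, trIdS1 and trIdS2 follow the slide; `check_patient` and its arguments are hypothetical helpers for illustration.

```python
from collections import Counter

def unique(bag):
    """Key check: True if no value occurs more than once in the bag."""
    return all(count == 1 for count in Counter(bag).values())

def subset(s1, s2):
    """Foreign-key check: True if every element of s1 is also in s2."""
    return set(s1) <= set(s2)

def check_patient(treatment_ids, item_ids):
    """Evaluate both constraints for one patient subtree."""
    trIdB = list(item_ids)        # bag of trIds under item
    trIdS1 = set(treatment_ids)   # set of trIds under treatment
    trIdS2 = set(item_ids)        # set of trIds under item
    return unique(trIdB) and subset(trIdS1, trIdS2)

ok = check_patient(["t1", "t2"], ["t1", "t2"])
```

A treatment billed twice fails `unique`, and a treatment without a billing record fails `subset`.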
19
Advantages of AIG
DTD-directed view definition: automatically ensures conformance to DTD -- recursive, nondeterministic
Constraint compilation: automatically captures integrity constraints in a uniform framework
– performance: avoid post-materialization checking
– optimization: jointly with query evaluation
– exception handling: actions when constraints are violated
Controlled derivation: supports context-dependent generation
Information passing: top-down, bottom-up, sideways
Multi-source queries: optimizer-based decomposition
One sweep: each node is visited at most twice – evaluates its inherited attribute, subtree, then its synthesized attribute
20
Middleware: evaluation of AIGs
[Figure: evaluation pipeline from AIG to XML: pre-processing, optimizer (merging, scheduling, using cost statistics), query plan execution against DB1, DB2 and DB3 (queries out, data back), tagging]
pre-processing:
– constraint compilation
– multi-source query decomposition → single-source queries
optimizer: query plan generation using cost statistics
execution: SQL queries → data sources, results → mediator
tagging: relational tables (paths from root) → XML view, via merge-sorting
21
Optimization
Goal: reduce response time
costs: query execution, data transfer, storage (caching)
constraints: query dependency graph (DAG)
– nodes: queries computing inherited/synthesized attributes
– edges: dependency relation (producer-consumer relation)
recursion: iterative unfolding to a certain depth
Optimization techniques:
– query merging
– query scheduling
[Figure: query dependency graph over (Q1, DB1), (Q2, DB1), (Q3, DB2), (Q4, DB2), (Q5, DB2), (Q6, DB1), (Q7, DB3), (Q8, DB3), (Q9, DB3)]
22
Query scheduling
Goal: reduce the total response time by increasing parallelism
Ordering execution of queries on the same site, e.g., <Q4, Q5, Q3> vs. <Q3, Q5, Q4> on DB2
Constraints:
– costs: execution, communication, caching overheads
– dependency relation
Finding an optimal schedule: NP-hard
[Figure: the query dependency graph and a schedule that runs Q1, Q2, Q6 on DB1; Q3, Q4, Q5 on DB2; Q7, Q8, Q9 on DB3]
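Why the order on one site matters can be seen in a small simulation. It is a sketch under simplifying assumptions: each site runs its queries serially in the given order, a query starts once its site is free and all its producers are done, and transfer and caching costs are ignored. The costs, the dependency and the function name `response_time` are invented for illustration.

```python
def response_time(schedule, cost, producers):
    """Return the overall completion time of a per-site schedule.

    schedule: dict site -> ordered list of queries on that site.
    cost: dict query -> evaluation cost.
    producers: dict query -> queries whose results it consumes.
    """
    comp = {}                                  # query -> completion time
    site_free = {s: 0.0 for s in schedule}     # when each site is idle again
    pos = {s: 0 for s in schedule}             # next query index per site
    remaining = sum(len(qs) for qs in schedule.values())
    while remaining:
        progressed = False
        for site, queries in schedule.items():
            if pos[site] == len(queries):
                continue
            q = queries[pos[site]]
            needs = producers.get(q, ())
            if all(p in comp for p in needs):
                ready = max((comp[p] for p in needs), default=0.0)
                comp[q] = max(site_free[site], ready) + cost[q]
                site_free[site] = comp[q]
                pos[site] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise ValueError("schedule conflicts with the dependency graph")
    return max(comp.values())

# Hypothetical instance: Q3 consumes the result of Q1; Q4 is independent.
cost = {"Q1": 5.0, "Q3": 2.0, "Q4": 4.0}
producers = {"Q3": ["Q1"]}
slow = response_time({"DB1": ["Q1"], "DB2": ["Q3", "Q4"]}, cost, producers)
fast = response_time({"DB1": ["Q1"], "DB2": ["Q4", "Q3"]}, cost, producers)
```

In this instance, running the independent Q4 first lets DB2 work while Q1 is still executing on DB1, which is the parallelism the scheduling goal refers to.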
23
Query merging
Goal: reduce DB visits, leverage the DB optimizer
Composition of queries on the same site: outer-join/union
Tradeoffs:
– large result tables with nulls
– communication/caching cost
– impact on scheduling -- changes the query dependency graph
[Figure: the query dependency graph before and after merging Q3, Q4 and Q5 into ((Q3, Q4, Q5), DB2); the result is labeled "bad merging"]
Finding an optimal strategy for merging and scheduling: NP-hard
24
Cost-based heuristic: scheduling
cost estimate: given a fixed schedule, estimate the completion time of a query Q, comp_time(Q)
– statistics: eval_cost(Q), size(Q), trans_cost(S, S’, size)
– dependency: Q can’t start before comp_time(Q’) if Q’ -> Q
scheduling: given a fixed query dependency graph, a heuristic based on dynamic programming to
– find “the most costly” trailing path of each query
– sort queries to favor “critical” paths
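The "most costly trailing path" idea can be sketched with a memoized recursion over the dependency graph. This is a simplification, not the heuristic from the talk: it uses only assumed evaluation costs, ignores size(Q) and trans_cost, and the graph, the costs and the function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical costs and producer -> consumer edges.
EVAL_COST = {"Q1": 2.0, "Q2": 3.0, "Q3": 1.0, "Q4": 4.0}
CONSUMERS = {"Q1": ["Q2", "Q3"], "Q2": ["Q4"], "Q3": ["Q4"], "Q4": []}

@lru_cache(maxsize=None)
def trailing_cost(q):
    """Cost of the most expensive path that starts at q and ends at a sink."""
    tail = max((trailing_cost(c) for c in CONSUMERS[q]), default=0.0)
    return EVAL_COST[q] + tail

def priority_order(queries):
    """Sort queries so those heading the costliest trailing paths come first."""
    return sorted(queries, key=lambda q: (-trailing_cost(q), q))

order = priority_order(EVAL_COST)
```

Queries on a critical path get priority because any delay to them delays everything that follows on that path.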
25
Cost-based heuristic: merging
greedy algorithm: repeat until no further improvement
– “merge” each pair of queries on the same source
– modify the query dependency graph accordingly
– invoke scheduling w.r.t. the modified dependency graph
– estimate the cost
– pick the pair with “the biggest improvement” to merge
[Figure: the query dependency graph and the schedule after merging (Q2, Q6) on DB1 and (Q7, Q9) on DB3]
Interaction between query scheduling and merging
26
Preliminary experimental study
Data set: running example (insurance/hospital)
– recursion unfolding (depth) of treatments: 3, 4, 5
– data size: small, medium, large, generated via ToXgene
System:
– DB2 v8.1 for Linux
– simulation of data transfer: bandwidth = 1 Mbps
Number of tuples per relation:
DB    relation    small   medium   large
DB1   patient      2500     3300    5000
DB1   visitInfo   11371    14887   22496
DB2   cover        2224     3762    8996
DB3   billing       175      250     350
DB4   treatment     175      250     350
DB4   procedure     441      718     923
Card(3-way self join) = 4055
Card(4-way self join) = 6837
27
Experimental results – ratio of improvement
[Figure: bar chart of Improvement (y-axis, 0 to 2.5) by Dataset (Small, Medium, Large), with series No Opt, 3-unfold, 4-unfold, 5-unfold]
438/19361/34
28
Summary
AIG: a novel specification language
– ensures DTD-conformance (recursion/nondeterminism)
– captures integrity constraints in a uniform framework
– supports complex transformations: controlled derivation (context-dependency), multi-source queries, . . .
Optimization techniques: nontrivial optimization problems
– constraint compilation, multi-source query decomposition
– query scheduling w.r.t. query dependency graph
– query merging and its interaction with scheduling
A systematic framework for schema-directed XML integration