Capturing both types and integrity constraints in data ...

1

Capturing both Types and Constraints in Data Integration

Michael Benedikt, Bell Laboratories
Chee-Yong Chan, National Univ. of Singapore
Wenfei Fan, Bell Laboratories
Juliana Freire, Oregon Health & Science Univ. (OGI)
Rajeev Rastogi, Bell Laboratories

2

Schema-directed integration

Integration:
– extract relevant data from multiple, distributed databases
– construct an XML view

Schema-directed: conformance to a predefined schema (D, Σ)
– D: a DTD (type constraints)
– Σ: a set of XML integrity constraints (keys, foreign keys)

[Figure: middleware integrates data from DB1, DB2 and DB3 into an XML view conforming to (D, Σ), for data exchange]

3

Example: hospital and insurance company

DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date)
DB2: billing (trId, price)
DB3: cover (policy, trId)
DB4: treatment (trId, name), procedure (trId1, trId2)

Daily XML report: given a date, for each patient of that day, report
– SSN, name (DB1)
– treatments (hierarchy) covered by insurance (DB1, DB3, DB4)
– cost of all and only those treatments received (DB2)

4

Predefined schema (D, Σ): DTD

report → patient*
patient → SSN, name, treatments, bill
treatments → treatment*
treatment → trId, tname, treatments
bill → item*
item → trId, price
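As a rough illustration of what conformance to this DTD means, the sketch below encodes each production as a regular expression over the sequence of child tags and checks a small tree against it. The tree encoding as `(tag, content)` pairs, the names `DTD`, `PCDATA` and `conforms`, and the sample values are all assumptions made for this example; they are not part of the slides.

```python
import re

# Content models of the six productions, as regexes over space-terminated child tags.
DTD = {
    "report": r"(?:patient )*",
    "patient": r"SSN name treatments bill ",
    "treatments": r"(?:treatment )*",
    "treatment": r"trId tname treatments ",
    "bill": r"(?:item )*",
    "item": r"trId price ",
}
# Assumption: the remaining element types are text-only (PCDATA) leaves.
PCDATA = {"SSN", "name", "trId", "tname", "price"}

def conforms(node):
    """Check a (tag, content) tree; content is a str or a list of child nodes."""
    tag, content = node
    if tag in PCDATA:
        return isinstance(content, str)
    if tag not in DTD or isinstance(content, str):
        return False
    child_tags = "".join(child[0] + " " for child in content)
    return (re.fullmatch(DTD[tag], child_tags) is not None
            and all(conforms(child) for child in content))

# Made-up sample document with one patient, one treatment and one bill item.
treatment = ("treatment", [("trId", "t1"), ("tname", "x-ray"), ("treatments", [])])
patient = ("patient", [("SSN", "111"), ("name", "Ann"),
                       ("treatments", [treatment]),
                       ("bill", [("item", [("trId", "t1"), ("price", "100")])])])
report = ("report", [patient])
ok = conforms(report)
```

The recursive production for treatment is handled by the recursion in `conforms`, so the check works for any nesting depth.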

[Figure: XML tree of the report for a given date: report has patient children; each patient has SSN, name, treatments and bill; each treatment has trId, tname and nested treatments; each item under bill has trId and price]

5

Predefined schema (D, Σ): XML constraints

Constraints, relative to each patient:
– key: each treatment is charged only once
  patient (item.trId → item)
– foreign key: every treatment has a billing record
  patient (treatment.trId ⊆ item.trId)

[Figure: XML report tree (report, patient, SSN, name, treatments, treatment, trId, tname, bill, item, price)]

6

Challenge: nondeterministic structure

Certain structure cannot be decided at compile time:
– recursion expansion: the depth of the XML tree
  treatments → treatment*
  treatment → trId, tname, treatments   -- recursive
– choice of children in a disjunction, e.g. A → B + C

[Figure: report tree in which the nesting of treatments under treatment is unbounded]

7

Challenge: context dependency

– bill subtree: contains all and only the trIds in the treatments subtree
– controlled derivation: the bill subtree cannot be started before the treatments subtree is completed
– information passing: downward, upward, sideways

[Figure: report tree for a date; under each patient, the trIds of the unbounded treatments subtree (trIdS) determine the bill subtree]

8

Challenges

– DTD-conformance: recursive, nondeterministic
– integrity constraints: validation during document generation
– multi-source queries: a single query may involve several databases
– context-dependency: not strictly top-down or bottom-up

Previous work:
– XML publishing: single data source, no constraints, top-down
– XML integration: little support for XML schema-conformance
– XML query languages + type checking: provide no guidance for how to ensure schema-conformance; optimization is hard

Schema-directed XML integration is nontrivial!

9

Our middleware for schema-directed integration

A lightweight language: Attribute Integration Grammar (AIG)

Cost-based optimization in light of context-dependency

[Figure: the middleware sits between the sources DB1, DB2, DB3 and the schema (D, Σ), combining a view definition language with optimization techniques; AIG = DTD + constraints, with semantic attributes and semantic rules]

10

Attribute Integration Grammar (AIG)

DTD: element type definitions e ::= PCDATA | ε | e1, …, en | e1 + … + en | e*

Attributes: associated with each element type e
– Inh(e): inherited from parent/siblings (top-down/sideways)
– Syn(e): synthesized from children (bottom-up)
– Syn(e), Inh(e): tuple- or set/bag-valued

Rules: associated with each production e → α, for each e’ in α
– Inh(child): Inh(e’) = Q(Inh(parent), Syn(sibling))
– Syn(parent): Syn(e) = ∪ (Syn(children)) -- union
– Q: multi-source SQL query with parameters

Dependency: e2 must be evaluated before e1 if Inh(e1) = Q( … Syn(e2) … ); the dependency relation must form an acyclic graph (DAG)
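The dependency ordering among the children of one production can be sketched as a topological sort. The minimal Python sketch below assumes the patient production of the running example, where the bill subtree needs the trIds synthesized by the treatments subtree; the dictionary encoding and the function name `evaluation_order` are illustrative choices, not part of AIG.

```python
from graphlib import TopologicalSorter, CycleError

def evaluation_order(deps):
    """Return an order in which sibling elements can be evaluated.

    deps maps each element to the set of siblings whose Syn attribute
    its Inh rule reads (hypothetical encoding, for illustration only).
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        raise ValueError("dependency relation must be acyclic") from exc

# Children of patient: bill reads Syn(treatments), so it must wait for treatments.
patient_deps = {
    "SSN": set(),
    "name": set(),
    "treatments": set(),
    "bill": {"treatments"},
}
order = evaluation_order(patient_deps)
```

Any order returned this way places treatments before bill; a cyclic set of rules is rejected, which mirrors the requirement that the dependency graph be a DAG.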

11

AIG semantics: conceptual evaluation

following the dependency ordering; starting from the root

report → patient*

Inh(patient) ← Q1(Inh(report))

  select Inh(report) as date, p.SSN, p.name, p.policy
  from DB1: patient p, DB1: visitInfo v
  where p.SSN = v.SSN and v.date = Inh(report)

Recall DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date)

– Parameter in a query: Inh(report) is treated as a constant
– Data driven: the number of patient children depends on the result of Q1
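To make the role of the parameter concrete, here is a small sketch of Q1 run against an in-memory SQLite database standing in for DB1. The sample rows, the function name `q1`, and the added DISTINCT (so that a patient with several treatments on one day yields a single tuple) are assumptions of this example rather than details taken from the slides.

```python
import sqlite3

def q1(conn, date):
    """Q1: patients who visited on `date`; `date` plays the role of Inh(report)."""
    rows = conn.execute(
        "SELECT DISTINCT ? AS date, p.SSN, p.name, p.policy "
        "FROM patient p JOIN visitInfo v ON p.SSN = v.SSN "
        "WHERE v.date = ? ORDER BY p.SSN",
        (date, date),  # the inherited attribute is bound as a constant
    )
    return rows.fetchall()

# Toy stand-in for DB1; the rows are made up for illustration.
db1 = sqlite3.connect(":memory:")
db1.executescript("""
    CREATE TABLE patient (SSN TEXT, name TEXT, policy TEXT);
    CREATE TABLE visitInfo (SSN TEXT, trId TEXT, date TEXT);
    INSERT INTO patient VALUES ('111', 'Ann', 'P1'), ('222', 'Bob', 'P2');
    INSERT INTO visitInfo VALUES ('111', 't1', '2003-06-09'),
                                 ('111', 't2', '2003-06-09'),
                                 ('222', 't3', '2003-06-10');
""")
patients = q1(db1, "2003-06-09")  # one patient element per returned tuple
```

Each returned tuple would become the inherited attribute of one patient child, so the width of the report node is decided by the data rather than by the grammar.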

[Figure: the report root with its inherited attribute Inh holding the date]

12

Multi-source query

patient → SSN, name, treatments, bill

Inh(SSN) = Inh(patient).SSN, . . . ,
Inh(treatments) = Q2(v) -- v = Inh(patient)

  select t.trId, t.tname
  from DB1: visitInfo i, DB3: cover c, DB4: treatment t
  where i.SSN = v.SSN and i.date = v.date and t.trId = i.trId
    and c.trId = i.trId and c.policy = v.policy

– a single query uses DB1, DB3 and DB4
– tuple- and set-valued attributes (Inh(SSN), Inh(treatments))

Recall
DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date)
DB3: cover (policy, trId)
DB4: treatment (trId, name), procedure (trId1, trId2)
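One way to picture how a multi-source query such as Q2 can be answered is to split it into one sub-query per source and join the partial results at the mediator. The sketch below does this over plain Python lists that stand in for the three relations; the function name `q2`, the sample rows and this particular decomposition are assumptions for illustration, not the plan the middleware's optimizer would necessarily choose.

```python
def q2(v, db1_visit_info, db3_cover, db4_treatment):
    """Evaluate Q2 for one patient tuple v = Inh(patient).

    Each argument after v is a list of dicts standing in for one relation
    at one source; each comprehension plays the role of a single-source
    sub-query, and the results are combined at the mediator.
    """
    # DB1: treatments the patient received on the given date.
    visited = {i["trId"] for i in db1_visit_info
               if i["SSN"] == v["SSN"] and i["date"] == v["date"]}
    # DB3: treatments covered by the patient's policy.
    covered = {c["trId"] for c in db3_cover if c["policy"] == v["policy"]}
    # DB4: names of the treatments that are both received and covered.
    wanted = visited & covered
    return sorted((t["trId"], t["tname"]) for t in db4_treatment
                  if t["trId"] in wanted)

# Made-up sample data.
v = {"date": "2003-06-09", "SSN": "111", "name": "Ann", "policy": "P1"}
visit_info = [{"SSN": "111", "trId": "t1", "date": "2003-06-09"},
              {"SSN": "111", "trId": "t2", "date": "2003-06-09"}]
cover = [{"policy": "P1", "trId": "t1"}]
treatment = [{"trId": "t1", "tname": "x-ray"}, {"trId": "t2", "tname": "mri"}]
inh_treatments = q2(v, visit_info, cover, treatment)
```

Here treatment t2 is received but not covered by policy P1, so only t1 ends up in the set-valued attribute.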

13

Initial top-down pass: context-dependent


patient → SSN, name, treatments, bill

Inh(SSN) = Inh(patient).SSN, Inh(name) = Inh(patient).name,
Inh(treatments) = Q2(Inh(patient))
Inh(bill) = Syn(treatments) <-- halt

– DTD-directed: generate children following the production
– Inh(bill): defined with a sibling attribute, Syn(treatments)
– dependency ordering: evaluate bill after treatments

[Figure: report with patient children; under one patient, SSN, name and treatments receive their Inh, while bill halts until Syn(treatments) is available]

14

Initial top-down pass: recursion

treatments → treatment*

Inh(treatment) ← Inh(treatments) -- set of (trId, tname)

Data driven: the expansion of treatments depends on Inh(treatments)
– empty: expansion terminates, Syn(treatments) is empty
– nonempty: treatments expands

[Figure: under patient, the treatments node expands into treatment children according to Inh(treatments)]

15

Leaf step

treatment → trId, tname, treatments

Inh(trId) = Inh(treatment).trId, . . .,

trId → PCDATA

Syn(trId) = Inh(trId)

[Figure: each treatment under treatments receives Inh: (trId, tname); its trId leaf turns Inh into Syn]

16

Bottom-up step: synthesize attributes

treatment → trId, tname, treatments

Syn(treatment) = Syn(trId) ∪ Syn(treatments)

treatments → treatment*

Syn(treatments) = ∪ Syn(treatment)

Processing of an element e: Inh(e) → subtree(e) → Syn(e)

[Figure: Syn attributes propagate upward from trId and nested treatments through each treatment to the enclosing treatments node]
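The top-down expansion and the bottom-up synthesis of the treatments subtree can be sketched together as one recursive function. The slides do not show the query that computes Inh(treatments) beneath a treatment, so the sketch assumes that procedure (trId1, trId2) lists the sub-treatments trId2 of trId1; that rule, the function name `expand_treatments` and the sample data are all hypothetical.

```python
def expand_treatments(inh, procedure, names):
    """Build a treatments subtree from Inh(treatments); return (children, Syn).

    inh: set of (trId, tname) tuples, i.e. Inh(treatments).
    procedure: set of (trId1, trId2) pairs; ASSUMED to give the
        sub-treatments trId2 of trId1 (this query is not on the slides).
    names: dict trId -> tname, standing in for the treatment relation.
    Assumes the procedure hierarchy is acyclic, otherwise recursion never ends.
    """
    children, syn = [], set()
    for tr_id, tname in sorted(inh):
        # Top-down: inherited attribute of the nested treatments element.
        sub_inh = {(t2, names[t2]) for (t1, t2) in procedure if t1 == tr_id}
        sub_tree, sub_syn = expand_treatments(sub_inh, procedure, names)
        children.append({"trId": tr_id, "tname": tname, "treatments": sub_tree})
        # Bottom-up: Syn(treatment) = Syn(trId) ∪ Syn(treatments).
        syn |= {tr_id} | sub_syn
    # Empty inh: no children, the recursion stops, Syn(treatments) is empty.
    return children, syn

names = {"t1": "surgery", "t2": "anesthesia", "t3": "x-ray"}
procedure = {("t1", "t2")}
tree, syn_treatments = expand_treatments({("t1", "surgery"), ("t3", "x-ray")},
                                         procedure, names)
```

The depth of the resulting tree is decided by the data in `procedure`, and the returned set is what a sibling such as bill would later receive.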

17

Sideways step: controlled derivation


Inh(bill) = Syn(treatments)

bill → item*

Inh(item) ← Q’(Inh(bill))

  select trId, price
  from DB2: billing
  where trId in Inh(bill) -- set membership test

DTD-directed: each step of construction follows a production

[Figure: under patient, the trIds in Syn(treatments) pass sideways into Inh(bill), which generates the item children]

Recall DB2: billing (trId, price)
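The set membership test in Q’ can be sketched by binding the set-valued attribute as a list of query parameters. The example uses an in-memory SQLite database as a stand-in for DB2; the function name `q_bill` and the sample rows are assumptions of this sketch.

```python
import sqlite3

def q_bill(conn, inh_bill):
    """Q': billing rows whose trId is in the set Inh(bill) = Syn(treatments)."""
    ids = sorted(inh_bill)
    if not ids:  # empty set: no item children are generated
        return []
    marks = ", ".join("?" for _ in ids)  # one placeholder per trId
    sql = f"SELECT trId, price FROM billing WHERE trId IN ({marks}) ORDER BY trId"
    return conn.execute(sql, ids).fetchall()

# Toy stand-in for DB2; the rows are made up for illustration.
db2 = sqlite3.connect(":memory:")
db2.executescript("""
    CREATE TABLE billing (trId TEXT, price REAL);
    INSERT INTO billing VALUES ('t1', 100.0), ('t2', 40.0), ('t9', 5.0);
""")
items = q_bill(db2, {"t1", "t2"})
```

Because the query can only run once the set is known, the bill subtree is necessarily generated after the treatments subtree.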

18

Constraint compilation

Captured with rules on synthesized attributes of patient:

– trIdB: bag-valued, collecting trId’s under item

– trIdS1: set-valued, collecting trId’s under treatment

– trIdS2: set-valued, collecting trId’s under item

key: patient (item.trId → item)

  unique (Syn(patient).trIdB) -- no duplicates in the bag

foreign key: patient (treatment.trId ⊆ item.trId)

  subset (Syn(patient).trIdS1, Syn(patient).trIdS2)

compilation: semantic rules and attributes for constraints are automatically generated and evaluated
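The two checks above reduce to a duplicate test on a bag and a subset test on two sets. A minimal Python sketch follows, in which the function name `check_patient` and the shape of the returned violation list are illustrative assumptions.

```python
from collections import Counter

def check_patient(trid_bag, trid_s1, trid_s2):
    """Return the constraints violated by one patient element.

    trid_bag: list of trIds under item (a bag, duplicates kept).
    trid_s1:  set of trIds under treatment.
    trid_s2:  set of trIds under item.
    """
    violations = []
    # Key: unique(trIdB), i.e. no trId occurs twice among the items.
    duplicates = sorted(t for t, n in Counter(trid_bag).items() if n > 1)
    if duplicates:
        violations.append(("key", duplicates))
    # Foreign key: subset(trIdS1, trIdS2), i.e. every treatment is billed.
    missing = sorted(set(trid_s1) - set(trid_s2))
    if missing:
        violations.append(("foreign key", missing))
    return violations

ok = check_patient(["t1", "t2"], {"t1", "t2"}, {"t1", "t2"})
bad = check_patient(["t1", "t1"], {"t1", "t2"}, {"t1"})
```

Since the three attributes are synthesized while the patient subtree is built, such a check can run as soon as the subtree is complete instead of after the whole document is materialized.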

19

Advantages of AIG

– DTD-directed view definition: automatically ensures conformance to a DTD, even if recursive or nondeterministic
– Constraint compilation: automatically captures integrity constraints in a uniform framework
  – performance: avoids post-materialization checking
  – optimization: jointly with query evaluation
  – exception handling: actions when constraints are violated
– Controlled derivation: supports context-dependent generation
– Information passing: top-down, bottom-up, sideways
– Multi-source queries: optimizer-based decomposition
– One sweep: each node is visited at most twice; the evaluation computes its inherited attribute, its subtree, then its synthesized attribute

20

Middleware: evaluation of AIGs

[Figure: middleware pipeline from AIG to XML: pre-processing, merging and scheduling (optimizer, using cost statistics), query plan, execution against DB1, DB2 and DB3 (queries out, data back), tagging]

– pre-processing:
  – constraint compilation
  – multi-source query decomposition → single-source queries
– optimizer: query plan generation using cost statistics
– execution: SQL queries → data sources, results → mediator
– tagging: relational tables (paths from root) → XML view, via merge-sorting

21

Optimization

Goal: reduce response time
– costs: query execution, data transfer, storage (caching)
– constraints: query dependency graph (DAG)
  – nodes: queries computing inherited/synthesized attributes
  – edges: dependency relation (producer-consumer relation)
– recursion: iterative unfolding to a certain depth

Optimization techniques:
– query merging
– query scheduling

[Figure: query dependency graph over (Q1, DB1), (Q2, DB1), (Q3, DB2), (Q4, DB2), (Q5, DB2), (Q6, DB1), (Q7, DB3), (Q8, DB3), (Q9, DB3)]

22

Query scheduling

Goal: reduce the total response time by increasing parallelism

Ordering the execution of queries on the same site, e.g., <Q4, Q5, Q3> vs. <Q3, Q5, Q4> on DB2

Constraints:
– costs: execution, communication, caching overheads
– dependency relation

Finding an optimal schedule: NP-hard

[Figure: the query dependency graph over (Q1, DB1) through (Q9, DB3), and a schedule with Q1, Q2, Q6 on DB1, Q3, Q4, Q5 on DB2 and Q7, Q8, Q9 on DB3]

23

Query merging

Goal: reduce DB visits, leverage the DB optimizer

Composition of queries on the same site: outer-join/union

Tradeoffs:
– large result tables with nulls
– communication/caching cost
– impact on scheduling -- changes the query dependency graph

[Figure: the query dependency graph over (Q1, DB1) through (Q9, DB3), and a merged graph containing ((Q3, Q4, Q5), DB2), labeled "bad merging"]

Finding an optimal strategy for merging and scheduling: NP-hard

24

Cost-based heuristic: scheduling

cost estimate: given a fixed schedule, estimate the completion time of a query Q, comp_time(Q)

– statistics: eval_cost(Q), size(Q), trans_cost(S, S’, size)

– dependency: Q can’t start before comp_time(Q’) if Q’ -> Q

scheduling: given a fixed query dependency graph, a heuristic based on dynamic programming to

– find “the most costly” trailing path of each query

– sort queries to favor “critical” paths
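The two steps of the scheduling heuristic can be sketched as a memoized longest-path computation followed by a sort. The sketch below ignores transfer costs and per-site queues to stay short, and the costs, the edges and the function name `critical_path_order` are hypothetical; they do not reproduce the dependency graph or the cost model of the slides.

```python
from functools import lru_cache

def critical_path_order(eval_cost, succ):
    """Order queries by the cost of their most costly trailing path.

    eval_cost: dict query -> estimated evaluation cost (positive numbers).
    succ: dict query -> list of queries that consume its output (a DAG).
    """
    @lru_cache(maxsize=None)
    def trailing(q):
        # Cost of q plus the heaviest path through any consumer of q.
        return eval_cost[q] + max((trailing(s) for s in succ.get(q, ())), default=0)

    # Most critical first; ties are broken by name to keep the order stable.
    order = sorted(eval_cost, key=lambda q: (-trailing(q), q))
    return order, {q: trailing(q) for q in eval_cost}

# Hypothetical costs and producer-consumer edges.
cost = {"Q1": 2, "Q2": 3, "Q3": 1, "Q4": 4, "Q5": 1}
succ = {"Q1": ["Q2", "Q3"], "Q2": ["Q4"], "Q3": ["Q5"]}
order, trailing_cost = critical_path_order(cost, succ)
```

With positive costs a producer always has a larger trailing cost than its consumers, so the sorted order also respects the dependency relation.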

25

Cost-based heuristic: merging

greedy algorithm: repeat until no further improvement
– “merge” each pair of queries on the same source
– modify the query dependency graph accordingly
– invoke scheduling w.r.t. the modified dependency graph
– estimate the cost
– pick the pair with “the biggest improvement” to merge
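The greedy loop can be sketched as follows. The cost model here is a deliberately crude stand-in: each source runs its queries one at a time in name order, a merged query costs the sum of its parts minus one fixed per-visit `overhead`, and a merge that makes the dependency graph cyclic is given infinite cost. The names `makespan`, `merge` and `greedy_merge` and all numbers are hypothetical.

```python
from itertools import combinations

def makespan(queries, deps):
    """Toy cost estimate: finish time when each source runs queries serially.

    queries: dict name -> (source, cost); deps: set of (producer, consumer).
    """
    finish, free_at, pending = {}, {}, dict(queries)
    while pending:
        ready = sorted(q for q in pending
                       if all(p in finish for p, c in deps if c == q))
        if not ready:
            return float("inf")  # the graph has a cycle
        q = ready[0]
        src, cost = pending.pop(q)
        start = max([free_at.get(src, 0)] + [finish[p] for p, c in deps if c == q])
        finish[q] = free_at[src] = start + cost
    return max(finish.values(), default=0)

def merge(queries, deps, a, b, overhead):
    """Merge a and b (same source) into one query, saving one visit's overhead."""
    name = f"({a},{b})"
    merged = {q: v for q, v in queries.items() if q not in (a, b)}
    merged[name] = (queries[a][0], queries[a][1] + queries[b][1] - overhead)
    rename = lambda q: name if q in (a, b) else q
    new_deps = {(rename(p), rename(c)) for p, c in deps}
    new_deps.discard((name, name))  # an edge between a and b disappears
    return merged, new_deps

def greedy_merge(queries, deps, overhead=1):
    """Repeatedly apply the single merge that lowers the estimated cost most."""
    best = makespan(queries, deps)
    while True:
        candidates = []
        for a, b in combinations(sorted(queries), 2):
            if queries[a][0] == queries[b][0]:  # same source only
                q2, d2 = merge(queries, deps, a, b, overhead)
                candidates.append((makespan(q2, d2), q2, d2))
        if not candidates:
            return queries, deps, best
        cost, q2, d2 = min(candidates, key=lambda c: c[0])
        if cost >= best:  # no further improvement
            return queries, deps, best
        queries, deps, best = q2, d2, cost

queries = {"Q1": ("DB1", 3), "Q2": ("DB1", 3), "Q3": ("DB2", 4)}
deps = {("Q1", "Q3"), ("Q2", "Q3")}
plan, plan_deps, best = greedy_merge(queries, deps)
```

In this toy instance merging Q1 and Q2 pays off because Q3 has to wait for both anyway; if Q3 depended on Q1 only, the same merge would delay Q3 and be rejected, which illustrates how merging interacts with scheduling.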

[Figure: the query dependency graph over (Q1, DB1) through (Q9, DB3), and a schedule after merging: Q1 and (Q2, Q6) on DB1, Q3, Q4 and Q5 on DB2, Q8 and (Q7, Q9) on DB3]

Interaction between query scheduling and merging

26

Preliminary experimental study

Data set: running example (insurance/hospital)
– recursion unfolding (depth): 3, 4, 5 -- treatments
– data size: small, medium, large; generated via ToXgene

System:
– DB2 v8.1 for Linux
– simulation of data transfer: bandwidth = 1 Mbps

DB    relation    small   medium   large
DB1   patient      2500     3300    5000
      visitInfo   11371    14887   22496
DB2   cover        2224     3762    8996
DB3   billing       175      250     350
DB4   treatment     175      250     350
      procedure     441      718     923

Card(3-way self join) = 4055
Card(4-way self join) = 6837

27

Experimental results – ratio of improvement

[Figure: bar chart of improvement (y-axis, 0 to 2.5) by dataset (Small, Medium, Large), with series No Opt, 3-unfold, 4-unfold and 5-unfold]

28

Summary

AIG: a novel specification language
– ensures DTD-conformance (recursion/nondeterminism)
– captures integrity constraints in a uniform framework
– supports complex transformations: controlled derivation (context-dependency), multi-source queries, . . .

Optimization techniques: nontrivial optimization problems
– constraint compilation, multi-source query decomposition
– query scheduling w.r.t. query dependency graph
– query merging and its interaction with scheduling

A systematic framework for schema-directed XML integration
