Capturing both types and integrity constraints in data ...


Page 1: Capturing both types and integrity constraints in data ...

Capturing both Types and Constraints in Data Integration

Michael Benedikt, Bell Laboratories
Chee-Yong Chan, National Univ. of Singapore
Wenfei Fan, Bell Laboratories
Juliana Freire, Oregon Health & Science Univ. (OGI)
Rajeev Rastogi, Bell Laboratories

Page 2: Capturing both types and integrity constraints in data ...

Schema-directed integration

Integration:
– extract relevant data from multiple, distributed databases
– construct an XML view

Schema-directed: conformance to a predefined schema (D, Σ)
– D: a DTD (type constraints)
– Σ: a set of XML integrity constraints (keys, foreign keys)

[Figure: the middleware extracts data from DB1, DB2 and DB3 and produces an XML view conforming to (D, Σ), used for data exchange.]

Page 3: Capturing both types and integrity constraints in data ...

Example: hospital and insurance company

Sources:
– DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date)
– DB2: billing (trId, price)
– DB3: cover (policy, trId)
– DB4: treatment (trId, name), procedure (trId1, trId2)

Daily XML report: given a date, for each patient of that day, report
– SSN, name (DB1)
– treatments (hierarchy) covered by insurance (DB1, DB3, DB4)
– cost of all and only those treatments received (DB2)

Page 4: Capturing both types and integrity constraints in data ...

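Predefined schema (D, Σ): DTD

report → patient*
patient → SSN, name, treatments, bill
treatments → treatment*
treatment → trId, tname, treatments
bill → item*
item → trId, price

[Figure: XML tree for the DTD. The root report (labeled with date) has patient children; each patient has SSN, name, treatments and bill; treatments has treatment children, each with trId, tname and a nested treatments subtree (. . .); bill has item children, each with trId and price.]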
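A minimal sketch of what conformance to this DTD means, assuming a hypothetical encoding of the productions as a Python dictionary. The names DTD and conforms, and the sample values in the document, are illustrative only and are not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Productions of the DTD D from the slide. ("seq", [...]) is a concatenation,
# ("star", x) is x*, and "PCDATA" marks a text leaf.
DTD = {
    "report": ("star", "patient"),
    "patient": ("seq", ["SSN", "name", "treatments", "bill"]),
    "treatments": ("star", "treatment"),
    "treatment": ("seq", ["trId", "tname", "treatments"]),
    "bill": ("star", "item"),
    "item": ("seq", ["trId", "price"]),
    "SSN": "PCDATA", "name": "PCDATA", "trId": "PCDATA",
    "tname": "PCDATA", "price": "PCDATA",
}

def conforms(elem):
    """Return True if the subtree rooted at elem follows its production."""
    rule = DTD.get(elem.tag)
    children = [child.tag for child in elem]
    if rule is None:
        return False          # element type not declared in the DTD
    if rule == "PCDATA":
        return not children   # text leaves have no element children
    kind, body = rule
    if kind == "star":
        ok = all(tag == body for tag in children)
    else:                     # "seq": children must match the list exactly
        ok = children == body
    return ok and all(conforms(child) for child in elem)

# Hypothetical one-patient report with a single, non-nested treatment.
doc = ET.fromstring(
    "<report><patient><SSN>1</SSN><name>Ann</name>"
    "<treatments><treatment><trId>t1</trId><tname>x-ray</tname>"
    "<treatments/></treatment></treatments>"
    "<bill><item><trId>t1</trId><price>50</price></item></bill>"
    "</patient></report>"
)
print(conforms(doc))
```

Note that the recursion in conforms mirrors the recursion in the DTD: the nested treatments element is checked by the same rule as the outer one.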

Page 5: Capturing both types and integrity constraints in data ...

Predefined schema (D, Σ): XML constraints

Σ: constraints relative to each patient
– key: each treatment is charged only once
  patient ( item.trId → item )
– foreign key: every treatment has a billing record
  patient ( treatment.trId ⊆ item.trId )

[Figure: the XML tree of the report, as shown for the DTD.]
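To make the semantics of the two constraints concrete, here is a small sketch that checks them directly on one patient element of an already built document. The function name check_patient and the sample data are assumptions for illustration; the slides do not prescribe this check.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def check_patient(patient):
    """Return (key_ok, fk_ok) for one patient element."""
    # trIds of the items in this patient's bill (a bag: duplicates kept).
    billed = [e.text for e in patient.findall("./bill/item/trId")]
    # trIds of all treatments at any depth below this patient.
    treated = {e.text for e in patient.findall(".//treatment/trId")}
    key_ok = all(n == 1 for n in Counter(billed).values())  # item.trId -> item
    fk_ok = treated <= set(billed)          # treatment.trId within item.trId
    return key_ok, fk_ok

# Hypothetical patient: treatment t1 with a nested treatment t2, both billed once.
good = ET.fromstring(
    "<patient><SSN>1</SSN><name>Ann</name>"
    "<treatments><treatment><trId>t1</trId><tname>a</tname>"
    "<treatments><treatment><trId>t2</trId><tname>b</tname><treatments/>"
    "</treatment></treatments></treatment></treatments>"
    "<bill><item><trId>t1</trId><price>5</price></item>"
    "<item><trId>t2</trId><price>7</price></item></bill></patient>"
)
print(check_patient(good))
```

Both checks are scoped to a single patient, which is what "relative to each patient" means: the same trId may be billed under two different patients without violating the key.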

Page 6: Capturing both types and integrity constraints in data ...

Challenge: nondeterministic structure

Certain structure cannot be decided at compile time:
– recursion expansion: the depth of the XML tree is unbounded
  treatments → treatment*
  treatment → trId, tname, treatments    -- recursive
– choice of children in a disjunction, e.g. A → B + C

[Figure: the report tree, with the nested treatments subtree marked as unbounded.]

Page 7: Capturing both types and integrity constraints in data ...

Challenge: context dependency

– bill subtree: all and only the trIds in the treatments subtree
– controlled derivation: the bill subtree cannot be started before the treatments subtree is completed
– information passing: downward, upward, sideways

[Figure: under report (date), a patient with children SSN, name, treatments and bill; the set of trIds (trIdS) collected from the unbounded treatments subtree is passed sideways to bill.]

Page 8: Capturing both types and integrity constraints in data ...

Challenges

– DTD-conformance: recursive, nondeterministic
– integrity constraints: validation during document generation
– multi-source queries: a single query may involve several databases
– context-dependency: not strictly top-down or bottom-up

Previous work:
– XML publishing: single data source, no constraints, top-down
– XML integration: little support for XML schema-conformance
– XML query languages + type checking: provide no guidance for how to ensure schema-conformance; optimization is hard

Schema-directed XML integration is nontrivial!

Page 9: Capturing both types and integrity constraints in data ...

Our middleware for schema-directed integration

– A lightweight view definition language: Attribute Integration Grammar (AIG)
– Cost-based optimization techniques in light of context-dependency

DTD + constraints = AIG (semantic attributes, semantic rules)

[Figure: the middleware (view definition language, optimization techniques) sits between the sources DB1, DB2, DB3 and the XML view conforming to (D, Σ).]

Page 10: Capturing both types and integrity constraints in data ...

Attribute Integration Grammar (AIG)

DTD: element type definitions e → α, where
  α ::= PCDATA | ε | e1, …, en | e1 + … + en | e*

Attributes: associated with each element type e
– Inh(e): inherited from parent/siblings (top-down/sideways)
– Syn(e): synthesized from children (bottom-up)
– Syn(e), Inh(e): tuple or set/bag-valued

Rules: associated with each production e → α, for e’ in α
– Inh(child): Inh(e’) = Q(Inh(parent), Syn(sibling))
– Syn(parent): Syn(e) = U (Syn(children))    -- union
– Q: multi-source SQL query with parameters

Dependency: e2 must be evaluated before e1 if
  Inh(e1) = Q( … Syn(e2) … )    -- acyclic graph (DAG)
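The dependency ordering can be illustrated with a topological sort. The sketch below uses Python's standard graphlib module on the production patient → SSN, name, treatments, bill, where Inh(bill) reads Syn(treatments). The dictionary encoding and the function name evaluation_order are assumptions made for this example.

```python
from graphlib import TopologicalSorter, CycleError

def evaluation_order(deps):
    """Order the children of one production for evaluation.

    deps maps each child to the set of siblings whose Syn its Inh rule reads.
    A cyclic dependency makes the rules unusable, so it is reported as an error.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        raise ValueError(f"cyclic dependency: {exc.args[1]}") from exc

# patient -> SSN, name, treatments, bill, with Inh(bill) = Syn(treatments).
deps = {"SSN": set(), "name": set(), "treatments": set(), "bill": {"treatments"}}
order = evaluation_order(deps)
print(order)  # some order in which treatments comes before bill
```

Any order that places treatments before bill is acceptable; the acyclicity requirement is what guarantees that such an order exists.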

Page 11: Capturing both types and integrity constraints in data ...

AIG semantics: conceptual evaluation

following the dependency ordering; starting from the root

report → patient*
Inh(patient) ← Q1 (Inh(report))

  select Inh(report) as date, p.SSN, p.name, p.policy
  from DB1: patient p, DB1: visitInfo v
  where p.SSN = v.SSN and v.date = Inh(report)

Recall DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date)

– Parameter in a query: Inh(report) is treated as a constant
– Data driven: the number of patient children depends on the result of Q1

[Figure: Inh(report), the date, is passed down from report to its patient children.]
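A runnable sketch of Q1 as a parameterized query, using an in-memory SQLite database in place of DB1. The sample rows are invented for illustration. The sketch adds distinct, which is an assumption and not in the slide: visitInfo holds one row per treatment, so without it a patient with two treatments on the same day would appear twice.

```python
import sqlite3

# In-memory stand-in for DB1 with hypothetical rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table patient (SSN text, name text, policy text);
    create table visitInfo (SSN text, trId text, date text);
    insert into patient values ('111', 'Ann', 'p1'), ('222', 'Bob', 'p2');
    insert into visitInfo values
        ('111', 't1', '2003-06-10'), ('111', 't2', '2003-06-10'),
        ('222', 't1', '2003-06-11');
""")

def q1(inh_report):
    """Q1: one tuple per patient who visited on the date Inh(report)."""
    return con.execute(
        """select distinct ? as date, p.SSN, p.name, p.policy
           from patient p, visitInfo v
           where p.SSN = v.SSN and v.date = ?""",
        (inh_report, inh_report),   # Inh(report) is bound as a constant
    ).fetchall()

# Each returned tuple becomes Inh(patient) of one patient child of report.
inh_patients = q1("2003-06-10")
print(inh_patients)
```

The number of tuples returned decides how many patient children are created, which is the "data driven" behavior described on the slide.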

Page 12: Capturing both types and integrity constraints in data ...

Multi-source query

patient → SSN, name, treatments, bill
Inh(SSN) = Inh(patient).SSN, . . . ,
Inh(treatments) = Q2(v)    -- v = Inh(patient)

  select t.trId, t.name as tname
  from DB1: visitInfo i, DB3: cover c, DB4: treatment t
  where i.SSN = v.SSN and i.date = v.date and t.trId = i.trId
    and c.trId = i.trId and c.policy = v.policy

– a single query uses DB1, DB3 and DB4
– tuple- and set-valued attributes (Inh(SSN), Inh(treatments))

Recall DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date); DB3: cover (policy, trId); DB4: treatment (trId, name), procedure (trId1, trId2)
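As an illustration of a single query spanning several sources, the sketch below emulates DB1, DB3 and DB4 as three separate in-memory SQLite databases attached to one connection. This is only a stand-in: the middleware described in the slides decomposes such a query into single-source queries instead. All rows, and the schema names db3 and db4, are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")                  # plays the role of DB1
con.execute("attach database ':memory:' as db3")   # plays the role of DB3
con.execute("attach database ':memory:' as db4")   # plays the role of DB4
con.executescript("""
    create table visitInfo (SSN text, trId text, date text);
    create table db3.cover (policy text, trId text);
    create table db4.treatment (trId text, name text);
    insert into visitInfo values
        ('111', 't1', '2003-06-10'), ('111', 't2', '2003-06-10');
    insert into db3.cover values ('p1', 't1');
    insert into db4.treatment values ('t1', 'x-ray'), ('t2', 'massage');
""")

def q2(v):
    """Q2: treatments the patient v received that day and that v's policy covers."""
    return con.execute(
        """select t.trId, t.name as tname
           from visitInfo i, db3.cover c, db4.treatment t
           where i.SSN = :SSN and i.date = :date and t.trId = i.trId
             and c.trId = i.trId and c.policy = :policy""",
        v,                                         # v = Inh(patient)
    ).fetchall()

# Inh(treatments) is set-valued: a set of (trId, tname) tuples.
inh_treatments = q2({"SSN": "111", "date": "2003-06-10", "policy": "p1"})
print(inh_treatments)
```

Treatment t2 is received but not covered by policy p1, so only t1 appears in Inh(treatments).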

Page 13: Capturing both types and integrity constraints in data ...

Initial top-down pass: context-dependent

patient → SSN, name, treatments, bill
Inh(SSN) = Inh(patient).SSN, Inh(name) = Inh(patient).name,
Inh(treatments) = Q2(Inh(patient))
Inh(bill) = Syn(treatments)    <-- halt

– DTD-directed: generate children following the production
– Inh(bill): defined with a sibling's attribute, Syn(treatments)
– dependency ordering: evaluate bill after treatments

[Figure: under report, each patient passes Inh to SSN, name and treatments; bill halts until Syn(treatments) is available.]

Page 14: Capturing both types and integrity constraints in data ...

Initial top-down pass: recursion

treatments → treatment*
Inh(treatment) ← Inh(treatments)    -- set of (trId, tname)

Data driven: treatments expansion depends on Inh(treatments)
– empty: expansion terminates, Syn(treatments) is empty;
– nonempty: treatments expands.

[Figure: Inh(treatments) is passed from patient to treatments, which expands into treatment children.]

Page 15: Capturing both types and integrity constraints in data ...

Leaf step

treatment → trId, tname, treatments
Inh(trId) = Inh(treatment).trId, . . .,

trId → PCDATA
Syn(trId) = Inh(trId)

[Figure: each treatment receives Inh: (trId, tname) from treatments and passes it on to trId, tname and the nested treatments; the leaf trId returns Syn.]

Page 16: Capturing both types and integrity constraints in data ...

Bottom-up step: synthesize attributes

treatment → trId, tname, treatments
Syn(treatment) = Syn(trId) ∪ Syn(treatments)

treatments → treatment*
Syn(treatments) = U Syn(treatment)

Processing of an element e: Inh(e) → subtree(e) → Syn(e)

[Figure: Syn flows upward from trId and the nested treatments to treatment, and from the treatment children to treatments.]
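The top-down expansion, the leaf step and the bottom-up synthesis can be sketched together as one recursive function. The slides elide the rule for the nested Inh(treatments); this sketch assumes it is looked up in procedure (trId1, trId2), read as "trId2 is a sub-procedure of trId1", and it assumes that relation is acyclic so the recursion terminates. The data and all names are hypothetical.

```python
# Hypothetical contents of DB4: procedure (trId1, trId2) and treatment names.
PROCEDURE = {"t1": ["t2", "t3"], "t2": ["t4"]}
TNAME = {"t1": "surgery", "t2": "anesthesia", "t3": "suture", "t4": "monitoring"}

def build_treatments(inh):
    """Expand treatments -> treatment*; inh is a list of (trId, tname) tuples.

    Returns (subtree, syn), where syn is the set of trIds in the subtree.
    An empty inh ends the recursion with an empty Syn(treatments).
    """
    children, syn = [], set()
    for tr_id, tname in inh:
        # Assumed rule for the nested treatments: its sub-procedures.
        sub_inh = [(s, TNAME[s]) for s in PROCEDURE.get(tr_id, [])]
        sub_tree, sub_syn = build_treatments(sub_inh)
        # Leaf step: trId and tname simply carry the inherited values.
        children.append(
            ("treatment", [("trId", tr_id), ("tname", tname), sub_tree])
        )
        # Syn(treatment) = Syn(trId) U Syn(treatments), accumulated into
        # Syn(treatments) = U Syn(treatment).
        syn |= {tr_id} | sub_syn
    return ("treatments", children), syn

tree, syn_treatments = build_treatments([("t1", "surgery")])
print(sorted(syn_treatments))
```

Each node is processed in the order Inh(e), subtree(e), Syn(e): the function receives inh, builds the subtree, and only then returns syn to its caller.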

Page 17: Capturing both types and integrity constraints in data ...

Sideways step: controlled derivation

patient → SSN, name, treatments, bill
Inh(bill) = Syn(treatments)

bill → item*
Inh(item) ← Q’(Inh(bill))

  select trId, price
  from DB2: billing
  where trId in Inh(bill)    -- set membership test

DTD-directed: each step of construction follows a production

[Figure: under patient, Syn(treatments), the set of trIds, is passed sideways as Inh(bill); bill expands into item children, each with a trId.]

Recall DB2: billing (trId, price)
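A sketch of the sideways step, with an in-memory SQLite table standing in for DB2. The set membership test on Inh(bill) is expressed here with one placeholder per trId; the function name q_bill and the rows are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for DB2 with hypothetical rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table billing (trId text, price real);
    insert into billing values ('t1', 50.0), ('t2', 20.0), ('t9', 99.0);
""")

def q_bill(inh_bill):
    """Q': billing tuples whose trId is in the set Inh(bill)."""
    ids = sorted(inh_bill)
    if not ids:
        return []                        # empty set: bill has no item children
    marks = ",".join("?" * len(ids))     # one placeholder per trId
    return con.execute(
        f"select trId, price from billing where trId in ({marks}) order by trId",
        ids,
    ).fetchall()

# Inh(bill) = Syn(treatments): available only after treatments is complete.
inh_bill = {"t1", "t2"}
print(q_bill(inh_bill))
```

Because Inh(bill) is exactly the set synthesized from the treatments subtree, the bill contains all and only the treatments the patient received.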

Page 18: Capturing both types and integrity constraints in data ...

Constraint compilation

Captured with rules on synthesized attributes of patient:
– trIdB: bag-valued, collecting trId’s under item
– trIdS1: set-valued, collecting trId’s under treatment
– trIdS2: set-valued, collecting trId’s under item

key: patient ( item.trId → item )
  unique (Syn(patient).trIdB)    -- no duplicates in the bag

foreign key: patient ( treatment.trId ⊆ item.trId )
  subset (Syn(patient).trIdS1, Syn(patient).trIdS2)

compilation: semantic rules and attributes for constraints are automatically generated and evaluated
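The two compiled checks can be sketched as plain functions over the synthesized attributes. The dictionary representation of Syn(patient) and the function check_patient are assumptions for this example; the attribute names trIdB, trIdS1 and trIdS2 come from the slide.

```python
from collections import Counter

def unique(bag):
    """True if no value occurs more than once in the bag."""
    return all(count == 1 for count in Counter(bag).values())

def subset(s1, s2):
    """True if every element of s1 is also in s2."""
    return set(s1) <= set(s2)

def check_patient(syn_patient):
    """Return the list of constraints violated by one patient."""
    violations = []
    if not unique(syn_patient["trIdB"]):
        violations.append("key")
    if not subset(syn_patient["trIdS1"], syn_patient["trIdS2"]):
        violations.append("foreign key")
    return violations

# Hypothetical Syn(patient): t1 is billed twice and t3 is never billed.
syn_patient = {
    "trIdB": ["t1", "t1", "t2"],     # bag of trIds under item
    "trIdS1": {"t1", "t2", "t3"},    # set of trIds under treatment
    "trIdS2": {"t1", "t2"},          # set of trIds under item
}
print(check_patient(syn_patient))
```

The bag is needed for the key because a set would silently drop the duplicate t1; the foreign key only needs sets.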

Page 19: Capturing both types and integrity constraints in data ...

Advantages of AIG

– DTD-directed view definition: automatically ensures conformance to the DTD (recursive, nondeterministic)
– Constraint compilation: automatically captures integrity constraints in a uniform framework
  – performance: avoids post-materialization checking
  – optimization: jointly with query evaluation
  – exception handling: actions when constraints are violated
– Controlled derivation: supports context-dependent generation. Information passing: top-down, bottom-up, sideways
– Multi-source queries: optimizer-based decomposition
– One sweep: each node is visited at most twice; the evaluation computes its inherited attribute, its subtree, then its synthesized attribute

Page 20: Capturing both types and integrity constraints in data ...

Middleware: evaluation of AIGs

[Figure: pipeline from AIG to XML: pre-processing, optimizer (merging, scheduling, using cost statistics), query plan, execution (queries sent to DB1, DB2, DB3; data returned), tagging.]

– pre-processing: constraint compilation; multi-source query decomposition → single-source queries
– optimizer: query plan generation using cost statistics
– execution: SQL queries → data sources, results → mediator
– tagging: relational tables (paths from root) → XML view, via merge-sorting

Page 21: Capturing both types and integrity constraints in data ...

Optimization

Goal: reduce response time
– costs: query execution, data transfer, storage (caching)
– constraints: query dependency graph (DAG)
  – nodes: queries computing inherited/synthesized attributes
  – edges: dependency relation (producer-consumer relation)
– recursion: iterative unfolding by a certain depth

Optimization techniques:
– query merging
– query scheduling

[Figure: query dependency graph with nodes (Q1, DB1), (Q2, DB1), (Q3, DB2), (Q4, DB2), (Q5, DB2), (Q6, DB1), (Q7, DB3), (Q8, DB3), (Q9, DB3).]

Page 22: Capturing both types and integrity constraints in data ...

Query scheduling

Goal: reduce the total response time by increasing parallelism

Ordering execution of queries on the same site, e.g., <Q4, Q5, Q3> vs. <Q3, Q5, Q4> on DB2

Constraints:
– costs: execution, communication, caching overheads
– dependency relation

Finding an optimal schedule: NP-hard

[Figure: the query dependency graph over Q1 to Q9, and a schedule placing the queries on timelines for DB1, DB2 and DB3.]

Page 23: Capturing both types and integrity constraints in data ...

Query merging

Goal: reduce DB visits, leverage DB optimizer

Composition of queries on the same site: outer-join/union

Tradeoffs:
– large result tables with nulls
– communication/caching cost
– impact on scheduling: changes the query dependency graph

[Figure: the query dependency graph before and after merging Q3, Q4 and Q5 into ((Q3, Q4, Q5), DB2), labeled "bad merging".]

Finding an optimal strategy for merging and scheduling: NP-hard

Page 24: Capturing both types and integrity constraints in data ...


Cost-based heuristic: scheduling

cost estimate: given a fixed schedule, estimate the completion time of a query Q, comp_time(Q)

– statistics: eval_cost(Q), size(Q), trans_cost(S, S’, size)

– dependency: Q can’t start before comp_time(Q’) if Q’ -> Q

scheduling: given a fixed query dependency graph, a heuristic based on dynamic programming to

– find “the most costly” trailing path of each query

– sort queries to favor “critical” paths
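A rough sketch of the two ingredients named above, under simplifying assumptions that are not in the slides: each site runs one query at a time, transfer between different sites costs a fixed trans_cost, and queries are ordered by descending trailing cost. The small dependency graph reuses query names from the figures, but its edges and costs are invented. This is one possible reading of the heuristic, not the authors' algorithm.

```python
def completion_times(schedule, site, eval_cost, deps, trans_cost):
    """Estimate comp_time(Q) for every query under a fixed schedule."""
    comp, site_free = {}, {}
    for q in schedule:
        # Q cannot start before every producer Q' has finished and shipped.
        ready = max(
            (comp[p] + (trans_cost if site[p] != site[q] else 0)
             for p in deps.get(q, ())),
            default=0,
        )
        start = max(ready, site_free.get(site[q], 0))
        comp[q] = start + eval_cost[q]
        site_free[site[q]] = comp[q]
    return comp

def trailing_costs(eval_cost, deps):
    """Cost of the most expensive dependency path starting at each query."""
    succ = {q: [] for q in eval_cost}
    for q, preds in deps.items():
        for p in preds:
            succ[p].append(q)
    memo = {}
    def cost(q):
        if q not in memo:
            memo[q] = eval_cost[q] + max((cost(s) for s in succ[q]), default=0)
        return memo[q]
    return {q: cost(q) for q in eval_cost}

# Hypothetical graph: Q1 -> Q4 and Q3 -> Q9; Q3 and Q4 share site DB2.
site = {"Q1": "DB1", "Q3": "DB2", "Q4": "DB2", "Q9": "DB3"}
eval_cost = {"Q1": 2, "Q3": 3, "Q4": 1, "Q9": 4}
deps = {"Q4": {"Q1"}, "Q9": {"Q3"}}

trail = trailing_costs(eval_cost, deps)
# With positive costs a producer always has a larger trailing cost than its
# consumers, so sorting by descending trailing cost respects the dependencies.
schedule = sorted(eval_cost, key=lambda q: -trail[q])
print(schedule, max(completion_times(schedule, site, eval_cost, deps, 1).values()))
```

In this example the critical path is Q3 → Q9, so running Q3 before Q4 on DB2 gives an estimated response time of 8, whereas the order Q1, Q4, Q3, Q9 gives 12.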

Page 25: Capturing both types and integrity constraints in data ...

Cost-based heuristic: merging

greedy algorithm: repeat until no further improvement
– “merge” each pair of queries on the same source
– modify the query dependency graph accordingly
– invoke scheduling w.r.t. the modified dependency graph
– estimate the cost
– pick the pair with “the biggest improvement” to merge

[Figure: the query dependency graph and the resulting schedule on DB1, DB2 and DB3 after merging (Q2, Q6) and (Q7, Q9).]

Interaction between query scheduling and merging
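The greedy loop can be sketched as follows, under a deliberately crude cost model that is an assumption of this example: a merged query costs the sum of its parts minus a fixed per-visit overhead, each site runs one query at a time, and transfer costs are ignored. The function names makespan, merge and greedy_merge are hypothetical, and the scheduling step is replaced by a plain topological order.

```python
from graphlib import TopologicalSorter, CycleError
from itertools import combinations

def makespan(site, cost, deps):
    """Estimated response time; each site runs one query at a time."""
    comp, free = {}, {}
    for q in TopologicalSorter(deps).static_order():  # raises CycleError on a cycle
        ready = max((comp[p] for p in deps[q]), default=0)
        comp[q] = max(ready, free.get(site[q], 0)) + cost[q]
        free[site[q]] = comp[q]
    return max(comp.values())

def merge(site, cost, deps, a, b, overhead):
    """Merge queries a and b (same site) and rewire the dependency graph."""
    m = f"{a}+{b}"
    rename = lambda q: m if q in (a, b) else q
    new_site = {rename(q): s for q, s in site.items()}
    new_cost = {q: c for q, c in cost.items() if q not in (a, b)}
    new_cost[m] = cost[a] + cost[b] - overhead    # assumed saving per merge
    new_deps = {}
    for q, preds in deps.items():
        new_deps.setdefault(rename(q), set()).update(rename(p) for p in preds)
    new_deps[m].discard(m)                        # drop a direct a-b dependency
    return new_site, new_cost, new_deps

def greedy_merge(site, cost, deps, overhead):
    """Repeat the best same-site merge until the estimate stops improving."""
    best = makespan(site, cost, deps)
    while True:
        trials = []
        for a, b in combinations(sorted(site), 2):
            if site[a] == site[b]:
                trial = merge(site, cost, deps, a, b, overhead)
                try:
                    trials.append((makespan(*trial), trial))
                except CycleError:
                    pass                          # this merge would create a cycle
        if not trials:
            return site, cost, deps, best
        t, trial = min(trials, key=lambda x: x[0])
        if t >= best:
            return site, cost, deps, best
        (site, cost, deps), best = trial, t

# Two independent queries on DB1; deps must list every query as a key.
plan = greedy_merge(
    {"Q1": "DB1", "Q2": "DB1"}, {"Q1": 3, "Q2": 3},
    {"Q1": set(), "Q2": set()}, overhead=1,
)
print(plan[1], plan[3])
```

Because every candidate merge is re-estimated on the modified graph, a merge that serializes work which could otherwise run in parallel shows up as a worse estimate and is not chosen.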

Page 26: Capturing both types and integrity constraints in data ...

Preliminary experimental study

Data set: running example (insurance/hospital)
– recursion unfolding (depth) of treatments: 3, 4, 5
– data size: small, medium, large; generated via ToXgene

System:
– DB2 v8.1 for Linux
– simulation of data transfer: bandwidth = 1 Mbps

Relation sizes (number of tuples):

DB    relation    small    medium    large
DB1   patient      2500      3300     5000
DB1   visitInfo   11371     14887    22496
DB2   cover        2224      3762     8996
DB3   billing       175       250      350
DB4   treatment     175       250      350
DB4   procedure     441       718      923

Card(3-way self join) = 4055
Card(4-way self join) = 6837

Page 27: Capturing both types and integrity constraints in data ...

Experimental results – ratio of improvement

[Figure: bar chart of Improvement (y-axis, 0 to 2.5) by Dataset (x-axis: Small, Medium, Large), with series No Opt, 3-unfold, 4-unfold and 5-unfold; bars are annotated 438/193 and 61/34.]

Page 28: Capturing both types and integrity constraints in data ...

Summary

AIG: a novel specification language
– ensures DTD-conformance (recursion/nondeterminism)
– captures integrity constraints in a uniform framework
– supports complex transformations: controlled derivation (context-dependency), multi-source queries, . . .

Optimization techniques: nontrivial optimization problems
– constraint compilation, multi-source query decomposition
– query scheduling w.r.t. query dependency graph
– query merging and its interaction with scheduling

A systematic framework for schema-directed XML integration