Implementing Mapping Composition
Todd J. Green* University of Pennsylvania
with Philip A. Bernstein (Microsoft Research),
Sergey Melnik (Microsoft Research),
Alan Nash (UC San Diego)
VLDB 2006, Seoul, Korea
*Work partially supported by NSF grants IIS0513778 and IIS0415810
2
Schema mappings
Mapping: a correspondence between instances of different schemas
S1: Students(Name, Address)
S2: Names(SID, Name), Addresses(SID, Address)
m: Students = π_{Name,Address}(Names ⋈ Addresses)
3
Applications of mappings
Schema evolution
[Diagram: S1, S2, S3, ... linked by mappings m12, m23]
S1: Students(Name, Address, Country)
S2: Names(SID, Name), Addresses(SID, Address, Country)
S3: Names(SID, Name), Local(SID, Address), Foreign(SID, Address, Country)
m12:
Students = π_{Name,Address,Country}(Names ⋈ Addresses)
m23:
Names = Names
σ_{Country = KR}(Addresses) = π_{SID,Address}(Local) × {KR}
σ_{Country ≠ KR}(Addresses) = Foreign
4
Applications of mappings
Data integration, data exchange
[Diagram: schemas S1, ..., Sn−1, Sn linked by mappings m1, ..., mn]
Students(Name, Address, Country)
Names(SID, Name), Addresses(SID, Address, Country)
Names(SID, Name), Local(SID, Address), Foreign(SID, Address, Country)
Students = π_{Name,Address}(Names ⋈ Addresses)
Names = Names
Local = π_{SID,Address}(σ_{Country = KR}(Addresses))
Foreign = σ_{Country ≠ KR}(Addresses)
5
Requirements for constraints
“First attribute in R is a key for R”
π_{2,4}(R ⋈_{1=3} R) ⊆ π_{2,2}(R)
“View V equals R joined with S”
V ⊆ R ⋈ S, V ⊇ R ⋈ S
“Second attribute of R is a foreign key in S”
π_2(R) ⊆ π_1(S)
π_{2,4}(S ⋈_{1=3} S) ⊆ π_{2,2}(S)
Data integration, data exchange – GLAV
R ⋈ S ⊆ T ⋈ U
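A minimal Python sketch of how the key constraint on this slide could be checked on a concrete instance. Relations are modeled as sets of tuples and attribute positions are 1-based, as on the slide. The helper names (join_eq, project, first_attr_is_key) are illustrative assumptions and are not taken from the presented tool.

```python
from itertools import product


def join_eq(r, s, i, j):
    """Concatenate rows of r and s, keeping rows whose columns i and j (1-based) agree."""
    return {a + b for a, b in product(r, s) if (a + b)[i - 1] == (a + b)[j - 1]}


def project(r, cols):
    """Project onto the given 1-based column positions (repeats allowed)."""
    return {tuple(row[c - 1] for c in cols) for row in r}


def first_attr_is_key(r):
    """Check pi_{2,4}(R join_{1=3} R) <= pi_{2,2}(R) for a binary relation R."""
    lhs = project(join_eq(r, r, 1, 3), (2, 4))
    rhs = project(r, (2, 2))
    return lhs <= rhs


good = {(1, "a"), (2, "b")}   # each first value determines the second
bad = {(1, "a"), (1, "b")}    # first value 1 maps to two different second values
```

Here first_attr_is_key(good) holds, while first_attr_is_key(bad) fails because the pair ("a", "b") appears on the left-hand side but not in π_{2,2}(R).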
6
Mapping composition
S1: Students(Name, Address, Country)
S2: Names(SID, Name), Addresses(SID, Address, Country)
S3: Names(SID, Name), Local(SID, Address), Foreign(SID, Address, Country)
m12:
Students = π_{Name,Address,Country}(Names ⋈ Addresses)
m23:
Names = Names
σ_{Country = KR}(Addresses) = π_{SID,Address}(Local) × {KR}
σ_{Country ≠ KR}(Addresses) = Foreign
Composition of m12 and m23 (S1 to S3):
Students = π_{Name,Address,Country}(Names ⋈ (π_{SID,Address}(Local) × {KR} ∪ Foreign))
7
Composition is hard
Hard part: write composition in the same language as the input mappings.
Depending on language:
Not always possible
Not even decidable whether possible
Strategy 1: use powerful (second-order) mapping language closed under composition [FKPT04]
Not supported by DBMS today
Expensive to check
Source-target restriction
Strategy 2: settle for partial solutions [NBM05]
Containment mappings, easier integration with DBMS
The strategy we adopt in this work
8
Our contributions
New algorithm for composition problem
Incorporates view unfolding and left-composition (new technique)
Makes best effort in failure cases
Algebraic rather than logic-based mappings
Use of monotonicity to handle more operators
Modular and extensible factoring of algorithm
First implementation of composition
Experimental evaluation
9
Formal definition of composition
Mapping: set of pairs of instances of db schemas
The composition m12 ∘ m23 is the mapping
{⟨A,C⟩ : (∃B)(⟨A,B⟩ ∈ m12 and ⟨B,C⟩ ∈ m23)}
where A, B, C are instances of S1, S2, S3
Composition problem: find constraints in the same language as the input mappings giving the composition of the input mappings
Example:
S1 = {R}, S2 = {S,T}, S3 = {U,V,W}
R(∙,∙,∙), S(∙,∙), T(∙,∙), U(∙,∙,∙), V(∙,∙), W(∙,∙)
m12: R ⊆ S ⋈ T
m23: S ⊆ π(U), T = V – W
⇒ R ⊆ π(U) ⋈ (V – W)
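The set-of-pairs definition of composition on this slide can be written down almost literally. The Python sketch below is only an illustration: the labels "A1", "B1", and so on are hypothetical stand-ins for database instances of S1, S2, and S3, and real mappings are usually infinite sets described by constraints rather than enumerated pairs.

```python
def compose(m12, m23):
    """Return {(a, c) : there is some b with (a, b) in m12 and (b, c) in m23}."""
    return {(a, c) for (a, b) in m12 for (b2, c) in m23 if b == b2}


# Opaque labels stand in for instances of S1 (A*), S2 (B*), and S3 (C*).
m12 = {("A1", "B1"), ("A1", "B2"), ("A2", "B3")}
m23 = {("B1", "C1"), ("B3", "C2")}

m13 = compose(m12, m23)
```

The intermediate instance B2 has no partner in m23, so it contributes nothing; the intermediate schema S2 disappears from the result, which is the point of composition.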
10
Best-effort composition problem
Composition not always possible
“Best-effort” composition problem: compute set of constraints equivalent to input constraints, but with as many symbols from S2 eliminated as possible
R ⊆ U, R ⊆ V,
π_{1,4}(σ_{2=3}(U × U)) ⊆ U, π_{1,4}(σ_{2=3}(V × V)) ⊆ V,
U ⊆ T, V ⊆ T
Can eliminate U (cross out left column) or V (right column), but not both [NBM05]
11
Composition algorithm overview
For each relation R in S2
Try to eliminate R via (1) view unfolding
Replace = by pairs of ⊆, ⊇
For each relation R in S2 not yet eliminated
Try to eliminate R via (2) left compose
Else, try to eliminate R via (3) right compose
Output:
New constraints and list of relations successfully eliminated
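The control flow on this slide can be sketched as a small driver with pluggable elimination steps. Everything below is an assumption made for illustration: constraints are (lhs, op, rhs) triples, each step takes (constraints, rel) and returns new constraints or None on failure, and chain_step is a toy eliminator that only handles atomic relation names. It is not the algorithm of the presented tool.

```python
def split_equalities(constraints):
    """Replace each (lhs, "=", rhs) by (lhs, "<=", rhs) and (rhs, "<=", lhs)."""
    out = []
    for lhs, op, rhs in constraints:
        if op == "=":
            out += [(lhs, "<=", rhs), (rhs, "<=", lhs)]
        else:
            out.append((lhs, op, rhs))
    return out


def eliminate_all(constraints, s2_relations, unfold, left_compose, right_compose):
    """Best-effort driver: returns (new constraints, relations eliminated)."""
    eliminated = []
    for rel in s2_relations:                      # (1) view unfolding
        result = unfold(constraints, rel)
        if result is not None:
            constraints = result
            eliminated.append(rel)
    constraints = split_equalities(constraints)
    for rel in s2_relations:
        if rel in eliminated:
            continue
        for step in (left_compose, right_compose):  # (2), else (3)
            result = step(constraints, rel)
            if result is not None:
                constraints = result
                eliminated.append(rel)
                break
    return constraints, eliminated


def never(constraints, rel):
    """Stub step that always fails."""
    return None


def chain_step(constraints, rel):
    """Toy step for atomic containments: A <= rel and rel <= B give A <= B."""
    lowers = [l for l, _, r in constraints if r == rel and l != rel]
    uppers = [r for l, _, r in constraints if l == rel and r != rel]
    others = [c for c in constraints if rel not in (c[0], c[2])]
    return others + [(l, "<=", u) for l in lowers for u in uppers]


result = eliminate_all(
    [("R", "<=", "S"), ("S", "=", "T"), ("T", "<=", "U")],
    ["S", "T"], never, chain_step, never)
```

With these stubs, both S and T are eliminated and the single remaining constraint is R <= U.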
12
(1) View unfolding
Idea: exploit equality constraints (if we have any)
Standard technique: substitute view definition for occurrences of view relation in mappings
T = V – W, R ⊆ S ⋈ T, T X ⊆ π(U)
R ⊆ S ⋈ (V – W), (V – W) X ⊆ π(U)
Body must not mention view relation itself
Doesn’t matter what else is in body
Can substitute everywhere
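A minimal sketch of view unfolding over expression trees, assuming expressions are either relation names (strings) or tuples of the form (operator, argument, ...). The function names are illustrative. The operator between T and X did not survive in the slide text, so the "times" node in the example is an assumption used only to show substitution on a left-hand side.

```python
def mentions(expr, rel):
    """True if relation name rel occurs anywhere in expr."""
    if isinstance(expr, str):
        return expr == rel
    return any(mentions(arg, rel) for arg in expr[1:])


def substitute(expr, rel, body):
    """Replace every occurrence of rel in expr by body."""
    if isinstance(expr, str):
        return body if expr == rel else expr
    return (expr[0],) + tuple(substitute(arg, rel, body) for arg in expr[1:])


def unfold_view(constraints, rel):
    """Eliminate rel using a constraint rel = body, or return None if none applies."""
    for idx, (lhs, op, rhs) in enumerate(constraints):
        # The body must not mention the view relation itself.
        if op == "=" and lhs == rel and not mentions(rhs, rel):
            rest = constraints[:idx] + constraints[idx + 1:]
            return [(substitute(l, rel, rhs), o, substitute(r, rel, rhs))
                    for l, o, r in rest]
    return None


constraints = [
    ("T", "=", ("diff", "V", "W")),
    ("R", "<=", ("join", "S", "T")),
    (("times", "T", "X"), "<=", ("project", "U")),  # "times" is an assumed operator
]
unfolded = unfold_view(constraints, "T")
```

Because the defining constraint is an equality, the substitution is applied on both sides of every other constraint without any monotonicity check.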
13
(2) Left compose
“View unfolding” for containment constraints
π(V) ⊆ R – U, R ⊆ S ⋈ T
π(V) ⊆ (S ⋈ T) – U
Needs monotonicity of expressions in R.
E1 ⊆ E2(R), R ⊆ E3 ≡ E1 ⊆ E2(E3)
if E2(R) is monotone in R (and R not in E3)
Partial check for monotonicity
“Is S – (T – R) monotone in R?”
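A sketch of a syntactic, and therefore partial, monotonicity check together with the left compose rule from this slide. Expressions are strings (relation names) or tuples (operator, argument, ...). The operator names and the four-valued polarity are assumptions made for illustration; the check answers "unknown" whenever the syntax alone does not decide the question.

```python
MONO, ANTI, INDEP, UNKNOWN = "monotone", "antimonotone", "independent", "unknown"
FLIP = {MONO: ANTI, ANTI: MONO, INDEP: INDEP, UNKNOWN: UNKNOWN}
MONOTONE_OPS = {"join", "union", "intersect", "times", "project", "select"}


def combine(p, q):
    """Polarity of an operator result given the polarities of two inputs."""
    if UNKNOWN in (p, q):
        return UNKNOWN
    if p == INDEP:
        return q
    if q == INDEP or p == q:
        return p
    return UNKNOWN  # mixed monotone and antimonotone: cannot decide syntactically


def polarity(expr, rel):
    """How expr depends on rel, judged from syntax only."""
    if isinstance(expr, str):
        return MONO if expr == rel else INDEP
    op, *args = expr
    if op != "diff" and op not in MONOTONE_OPS:
        return UNKNOWN
    pols = [polarity(arg, rel) for arg in args]
    if op == "diff":
        pols[1] = FLIP[pols[1]]  # E1 - E2 is antimonotone in E2
    result = INDEP
    for p in pols:
        result = combine(result, p)
    return result


def substitute(expr, rel, body):
    if isinstance(expr, str):
        return body if expr == rel else expr
    return (expr[0],) + tuple(substitute(arg, rel, body) for arg in expr[1:])


def left_compose(constraint, rel, upper):
    """From E1 <= E2(rel) and rel <= upper, derive E1 <= E2(upper), else None."""
    lhs, rhs = constraint
    if polarity(rhs, rel) != MONO:
        return None
    if polarity(lhs, rel) != INDEP or polarity(upper, rel) != INDEP:
        return None  # rel would survive the substitution
    return (lhs, substitute(rhs, rel, upper))


slide_question = polarity(("diff", "S", ("diff", "T", "R")), "R")
composed = left_compose((("project", "V"), ("diff", "R", "U")), "R", ("join", "S", "T"))
```

For the question on the slide, the check reports that S – (T – R) is monotone in R. For R – R it reports "unknown" even though the expression is always empty, which is what makes the check partial.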
14
Normalization for left compose
Need one constraint of form R ⊆ E1
Use identities to normalize, e.g.:
R ⊆ E1 and R ⊆ E2 iff R ⊆ E1 ∩ E2
E1 ∪ E2 ⊆ E3 iff E1 ⊆ E3 and E2 ⊆ E3
π(E1) ⊆ E2 iff E1 ⊆ E2 × D^r
More identities in paper
After left compose, try to eliminate D
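The first two normalization identities, read with intersection and union respectively, can be sanity-checked exhaustively on small sets. The sketch below treats R, E1, E2, and E3 as arbitrary subsets of a three-element universe; it is a finite check meant to build intuition, not a proof, and the names are illustrative.

```python
from itertools import chain, combinations

UNIVERSE = (0, 1, 2)


def subsets(items):
    """All subsets of items, as frozensets."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))]


ALL = subsets(UNIVERSE)


def intersection_identity_holds():
    """R <= E1 and R <= E2  iff  R <= E1 & E2."""
    return all(((r <= e1) and (r <= e2)) == (r <= (e1 & e2))
               for r in ALL for e1 in ALL for e2 in ALL)


def union_identity_holds():
    """E1 | E2 <= E3  iff  E1 <= E3 and E2 <= E3."""
    return all(((e1 | e2) <= e3) == ((e1 <= e3) and (e2 <= e3))
               for e1 in ALL for e2 in ALL for e3 in ALL)
```

Both functions return True, which matches the role of the identities: several upper bounds on R collapse into one constraint of the form R ⊆ E1, and a union on the left splits into separate constraints.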
15
(3) Right compose
Dual to left compose, from [NBM05]
Example:
S ⋈ T ⊆ R, R – U ⊆ π(V)
(S ⋈ T) – U ⊆ π(V)
Monotonicity check needed here too
Normalization may introduce Skolem functions
E1 ⊆ π(E2) iff f(E1) ⊆ E2
Must eliminate Skolem functions after composition
Lots of effort coding this step!
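Why right compose is safe on the example can be seen with a brute-force check over small sets. In the sketch below, j stands for the value of S ⋈ T, u for U, and p for π(V), all treated as arbitrary subsets of a three-element universe. This abstraction and the function names are assumptions for illustration only.

```python
from itertools import chain, combinations

UNIVERSE = (0, 1, 2)
ALL = [frozenset(c) for c in chain.from_iterable(
    combinations(UNIVERSE, k) for k in range(len(UNIVERSE) + 1))]


def some_r_satisfies_inputs(j, u, p):
    """Is there an R with  j <= R  and  R - u <= p ?"""
    return any(j <= r and (r - u) <= p for r in ALL)


def composed_constraint(j, u, p):
    """The constraint after eliminating R:  j - u <= p."""
    return (j - u) <= p


def right_compose_matches():
    """Compare both sides for every choice of j, u, p."""
    return all(some_r_satisfies_inputs(j, u, p) == composed_constraint(j, u, p)
               for j in ALL for u in ALL for p in ALL)
```

The check succeeds because R – U is monotone in R: any suitable R contains j, so j – u is contained in R – u, and conversely R = j is always a witness.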
16
User-defined operators
User specifies:
Monotonicity of operator in its arguments
“If E1 is monotone in R and E2 is antimonotone in or independent of R, then E1 * E2 is monotone in R”
“If E1 is monotone in or independent of R and E2 is antimonotone in R, then E1 * E2 is monotone in R”
Identities for normalization
“E1 * E2 E3 iff E1 E2 E3 ”
User-defined operators and standard relational operators treated uniformly
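One way to treat user-defined and standard operators uniformly is a table that records, per operator, whether it is monotone or antimonotone in each argument position. The Python sketch below assumes such a table; the names SIGNATURES, register_operator, and the operator name "star" (standing in for the slide's *) are hypothetical and do not describe the tool's actual interface.

```python
MONO, ANTI, INDEP, UNKNOWN = "monotone", "antimonotone", "independent", "unknown"
FLIP = {MONO: ANTI, ANTI: MONO, INDEP: INDEP, UNKNOWN: UNKNOWN}

# Polarity of each operator in each of its argument positions.
SIGNATURES = {
    "join": (MONO, MONO),
    "union": (MONO, MONO),
    "diff": (MONO, ANTI),
}


def register_operator(name, signature):
    """Declare a user-defined operator by its per-argument monotonicity."""
    SIGNATURES[name] = tuple(signature)


def combine(p, q):
    if UNKNOWN in (p, q):
        return UNKNOWN
    if p == INDEP:
        return q
    if q == INDEP or p == q:
        return p
    return UNKNOWN


def polarity(expr, rel):
    """Syntactic polarity of expr in rel, driven entirely by SIGNATURES."""
    if isinstance(expr, str):
        return MONO if expr == rel else INDEP
    op, *args = expr
    signature = SIGNATURES.get(op)
    if signature is None or len(signature) != len(args):
        return UNKNOWN  # unregistered operator or wrong arity
    result = INDEP
    for arg_polarity, arg in zip(signature, args):
        p = polarity(arg, rel)
        if arg_polarity == ANTI:
            p = FLIP[p]
        result = combine(result, p)
    return result


# A user-defined operator that behaves like the slide's *:
# monotone in its first argument, antimonotone in its second.
register_operator("star", (MONO, ANTI))
```

With this registration, ("star", "R", "S") is judged monotone in R and ("star", "S", ("star", "T", "R")) is also judged monotone in R, matching the two quoted rules. An unregistered operator yields "unknown", so the caller simply cannot use that expression for composition.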
17
Implementation
12K lines of C# code, command-line tool
# Test case 13: PODS05 example 2
SCHEMA R(2), S(2), T(2)
CONSTRAINTS R <= S, P_{0,2} J_{0,1:1,2} (S S) <= R, S <= T
ELIMINATE S;
Output:
P_{0,2} J_{0,1:1,2}(R R) <= R,
R <= T
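The output of this test case can be cross-checked by brute force on a tiny domain. The Python sketch below assumes that P_{0,2} J_{0,1:1,2}(X X) denotes joining X with itself on the shared middle position and projecting onto the two outer positions, that is, the pairs (a, c) with (a, b) and (b, c) in X. That reading of the tool's syntax is an assumption, as are all function names.

```python
from itertools import chain, combinations, product

DOMAIN = (0, 1)
PAIRS = list(product(DOMAIN, repeat=2))


def all_relations():
    """Every binary relation over DOMAIN, as a frozenset of pairs."""
    picks = chain.from_iterable(
        combinations(PAIRS, k) for k in range(len(PAIRS) + 1))
    return [frozenset(p) for p in picks]


def self_join(rel):
    """Assumed meaning of P_{0,2} J_{0,1:1,2}(rel rel)."""
    return {(a, c) for (a, b) in rel for (b2, c) in rel if b == b2}


def input_holds(r, s, t):
    """R <= S, P J (S S) <= R, S <= T."""
    return r <= s and self_join(s) <= r and s <= t


def output_holds(r, t):
    """P J (R R) <= R, R <= T."""
    return self_join(r) <= r and r <= t


def output_matches_input():
    """For all R, T: some S satisfies the input iff the output holds."""
    rels = all_relations()
    return all(any(input_holds(r, s, t) for s in rels) == output_holds(r, t)
               for r in rels for t in rels)
```

Under that reading the check succeeds on the two-element domain: if the output constraints hold then S = R is a witness, and any witness S forces the output constraints.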
18
Experimental evaluation
First attempt at a composition benchmark
Schema editing and schema reconciliation scenarios
“Add a column to R to produce S”: (R) = S
Measure:
% of symbols eliminated
Running time
As a function of:
Editing primitives allowed, length of edit sequence, presence/absence of keys, starting schema size, …
Synthetic data
19
Summary of results
Algorithm often effective in eliminating most or even all relation symbols from S2
Running time in subsecond range even for large problems containing hundreds of constraints
Certain schema editing primitives problematic
Key constraints did not reduce effectiveness, although they did increase running time (and output size)
20
Schema editing
[Plot: execution time (sec), axis 0 to 3.5, versus run number, axis 0 to 100]
Random starting schema (30 relations of 2-10 attributes)
100 random edits
100 different runs, sorted by execution time
21
Schema reconciliation (1)
[Plot: fraction of symbols eliminated and execution time (sec), axis 0 to 1, versus number of edits, axis 10 to 210]
Random schema (30 relations of 2-10 attributes), random edits
Point represents median time of reconciliation step of 500 runs
22
Schema reconciliation (2)
[Plot: fraction of symbols eliminated, axis 0 to 1, versus schema size, axis 10 to 100; series: complete, no view unfolding, no right compose]
Random schema (variable # relations of 2-10 attributes)
100 random edits
100 different runs, sorted by execution time
23
Related work
[MH03] J. Madhavan, A. Y. Halevy. Composing mappings among data sources. VLDB, 2003.
[FKPT04] R. Fagin, Ph. G. Kolaitis, L. Popa, W.C. Tan. Composing schema mappings: second-order dependencies to the rescue. PODS, 2004.
[NBM05] A. Nash, P. A. Bernstein, S. Melnik. Composition of mappings given by embedded dependencies. PODS, 2005.
24
Conclusion and future work
We motivated and described the mapping composition problem
We presented an implementation of a practical new algorithm for the composition problem
We also presented an experimental evaluation
To do: theoretical analysis of impact of user-defined operators
To do: output constraints from algorithm can be a mess! How to clean up?