Schema & Schema Integration
Carsten Karl
Dennis Schade
Thorsten Dollmann
Outline
XTRACT System for inferring DTDs from a set of XML documents
Incremental validation of XML Documents
Schema & XML Databases
Databases need a schema
DTDs serve the role of the schema of the document
Efficient storage of XML data
Optimization of XML queries
DTDs are not mandatory!
XTRACT
Goal: Infer DTDs from a set of XML documents
Problem Simplification and Abstraction
Infer a DTD for each tag separately
Separate example sequences for each <e>
Infer a “good” DTD for each <e>
Resulting document DTD is a composition of all inferred “tag”-DTDs
Example
<book>
  <title> </title>
  <author>
    <name> </name>
    <age> </age>
  </author>
  <author>
    <name> </name>
  </author>
  <editor>
    <name> </name>
  </editor>
</book>
Tree: book has children title, author, author, editor; the authors have children (name, age) and (name); the editor has child name
Tag: Example sequence set
book: { <title><author><author><editor> }
author: { <name><age>, <name> }
editor: { <name> }
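The per-tag separation above can be sketched in a few lines of Python. This is only an illustration, not the XTRACT implementation: the function name `example_sequences` and the choice to skip leaf elements (as the slide does for title, name and age) are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The example document from the slide.
BOOK = """
<book>
  <title> </title>
  <author><name> </name><age> </age></author>
  <author><name> </name></author>
  <editor><name> </name></editor>
</book>
"""

def example_sequences(xml_text):
    """Map each tag to the set of child-tag sequences observed under it."""
    sequences = defaultdict(set)
    root = ET.fromstring(xml_text)
    for element in root.iter():              # root and all descendants
        children = tuple(child.tag for child in element)
        if children:                         # assumption: leaves are skipped
            sequences[element.tag].add(children)
    return dict(sequences)

if __name__ == "__main__":
    for tag, seqs in example_sequences(BOOK).items():
        print(tag, sorted(seqs))
```

Each set of sequences would then be handed to the inference step for that tag independently of the others.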
What is a “good” DTD ?
Given the example sequence set I={ ab, abab, ababab }
Possible DTDs:
(ab)*
Precise / Concise / Candidate DTD
(a|b)*
(ab|abab|ababab)
ab|ab(ab|abab)
Yes No
No
No
Yes
Yes
Yes Somewhat
What is a “good” DTD? (ctd.)
A good DTD D must satisfy two restrictions:
R1: D should be concise
R2: D should be precise
Minimum Description Length (MDL) quantifies and resolves the tradeoff between R1 and R2
The MDL Principle
MDL principle states: The best theory to infer from a given set of data is the one which minimizes the sum of
1. The length of the theory, in bits
2. The length of the data, in bits, when encoded with the help of the theory
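Written as a formula, the principle reads as follows. The symbols are assumptions introduced here for illustration: $L(D)$ for the length of a candidate DTD $D$ (the theory), $L(s \mid D)$ for the length of example sequence $s$ when encoded with the help of $D$, and $I$ for the set of example sequences.

```latex
D^{*} \;=\; \arg\min_{D} \Bigl( L(D) \;+\; \sum_{s \in I} L(s \mid D) \Bigr)
```

A very general DTD keeps the first term small but makes every sequence expensive to encode; a DTD that simply enumerates the examples does the opposite.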
Overview of XTRACT System
Input sequences: I = { ab, abab, ac, ad, bc, bd, bbd, bbbe }
Generalization: Sg = I ∪ { (ab)*, (a|b)*, b*d, b*e }
Factoring: Sf = Sg ∪ { (a|b)(c|d), b*(d|e) }
MDL module: inferred DTD (ab)* | (a|b)(c|d) | b*(d|e)
MDL Subsystem
In order to use the MDL principle, we need to
Define theory description length
Define data description length
Solve the resulting minimization problem
MDL Coding scheme
Description length of a DTD: number of characters of the DTD
Cost of encoding the example sequences:
encoding of b in terms of DTD a | b | c is 1 (position of b in the DTD), cost 1
encoding of bbb in terms of DTD b* is 3 (number of repetitions of b), cost 1
encoding of b in terms of DTD b is ε (the empty encoding), cost 0
MDL Subsystem Minimization
Input sequences: ab, abb, abbb, abbbb, abbbbb
Candidate DTDs: ab, (a|b)*, ab*, abb
(a|b)*: DTD length 6; encoding ab costs 1 (*) + 1 (a) + 1 (b) = 3, and the longer sequences cost 4, 5, 6 and 7; total MDL cost given on the slide: 30
ab*: DTD length 3; each of the five sequences costs 1; total MDL cost 3 + 5 = 8
abb: DTD length 3; encodes abb at cost 0; total MDL cost 3
Generalization Subsystem
Goal: Infer regular expressions from example sequences
Produce candidate DTDs such as a*bc, (abc)*, (a|b|c)*, ((ab)*c)*
Generate more general DTDs
Two heuristics:
DiscoverSeqPattern(s,r): s=abbbbc => ab*c
DiscoverOrPattern(s,d): s=abacbc => (a|b|c)*
Candidate DTDs are generated by calling the above functions for appropriate values of r and d
DiscoverSeqPattern Example
The pattern must occur at least two times: r=2
s = a b a b a b c a b c a b a b a b c
=> (ab)* c a b c (ab)* c
=> ((ab)* c)*
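A minimal sketch of the repetition-folding idea, assuming a simplified reading of DiscoverSeqPattern: repeatedly pick the subsequence with the most contiguous repetitions (at least r) and replace every run of at least r repetitions by a starred auxiliary symbol. The function names and the tie-breaking rule (shortest pattern, leftmost start) are assumptions, and this sketch does not fold a single occurrence such as the lone "ab" in the slide's example, so for that input it stops at the intermediate form.

```python
def _count_reps(tokens, pos, pattern):
    """Number of back-to-back copies of `pattern` starting at `pos`."""
    reps, size = 0, len(pattern)
    while tokens[pos:pos + size] == pattern:
        reps += 1
        pos += size
    return reps

def _best_pattern(tokens, r):
    """Pattern with the most contiguous repetitions (>= r), or None."""
    best, best_reps = None, 0
    n = len(tokens)
    for size in range(1, n // 2 + 1):
        for start in range(n - 2 * size + 1):
            pattern = tokens[start:start + size]
            reps = _count_reps(tokens, start, pattern)
            if reps >= r and reps > best_reps:
                best, best_reps = pattern, reps
    return best

def discover_seq_pattern(tokens, r):
    """Fold runs of >= r repetitions into starred symbols until none remain."""
    if r < 2:
        raise ValueError("r must be at least 2")
    tokens = list(tokens)
    while (pattern := _best_pattern(tokens, r)) is not None:
        symbol = "(" + "".join(pattern) + ")*"   # auxiliary symbol
        folded, pos = [], 0
        while pos < len(tokens):
            reps = _count_reps(tokens, pos, pattern)
            if reps >= r:
                folded.append(symbol)
                pos += reps * len(pattern)
            else:
                folded.append(tokens[pos])
                pos += 1
        tokens = folded
    return tokens

if __name__ == "__main__":
    print("".join(discover_seq_pattern("abababcababc", 2)))
```

Note that the sketch writes a single starred symbol as (b)* rather than b*, so abbbbc becomes a(b)*c.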
DiscoverOrPattern Example
Given:
• the example sequence s=axcxac
• distance parameter d=2
Step 1: Partition: a | x c x | a | c
Step 2: replace pattern a1…an by (a1|..|an)*: a (x|c)* a c
x is an auxiliary symbol introduced by DiscoverSeqPattern
x = ((de)*e)*
Result: a (((de)*e)*|c)* a c
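The partition step can be sketched as follows. This is an illustrative reading, not the XTRACT code: it assumes single-character symbols, and it assumes that two occurrences of the same symbol belong to the same block whenever they are at most d positions apart. The helper name `reach` is hypothetical.

```python
def discover_or_pattern(s, d):
    """Partition s into blocks and rewrite each multi-symbol block as (a1|..|an)*."""
    n = len(s)

    def reach(i):
        """Rightmost position within distance d holding the same symbol as s[i]."""
        last = i
        for j in range(i + 1, min(n, i + d + 1)):
            if s[j] == s[i]:
                last = j
        return last

    parts = []
    i = 0
    while i < n:
        # Grow the block until no symbol inside it recurs within distance d
        # outside of it.
        end = reach(i)
        k = i + 1
        while k <= end:
            end = max(end, reach(k))
            k += 1
        block = s[i:end + 1]
        if len(block) == 1:
            parts.append(block)                  # single symbols stay as they are
        else:
            symbols = list(dict.fromkeys(block))  # distinct, in order of appearance
            parts.append("(" + "|".join(symbols) + ")*")
        i = end + 1
    return "".join(parts)

if __name__ == "__main__":
    print(discover_or_pattern("axcxac", 2))
```

Under these assumptions the slide's sequence axcxac with d=2 yields a(x|c)*ac, and abacbc collapses to (a|b|c)* once d is at least 3.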
Factoring Subsystem
Goal: Combine different candidates to derive more compact, factored DTDs
Example candidate set Sg = { ac, ad, bc, bd }
ac | ad | bc | bd => a(c|d) | b(c|d) => (a|b)(c|d)
Reduces MDL description length of the candidate DTDs
Adapts factoring algorithms for Boolean expressions
Uses a heuristic algorithm for selecting subsets of candidate DTDs that give a good factored form
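The two rewriting steps in the example can be reproduced with a small sketch. It is deliberately much simpler than a real factoring algorithm: it assumes each candidate is a plain string of at least two single-character symbols, factors out the first symbol only, and then merges prefixes that share the same set of suffixes. The function name `factor` is an assumption.

```python
def factor(candidates):
    """Factor plain-string candidates by first symbol, then by shared suffix sets."""
    # Step 1: group by common prefix, e.g. ac, ad -> a(c|d).
    by_prefix = {}
    for candidate in candidates:
        by_prefix.setdefault(candidate[0], []).append(candidate[1:])

    # Step 2: prefixes with identical suffix sets are merged,
    # e.g. a(c|d) | b(c|d) -> (a|b)(c|d).
    by_suffixes = {}
    for prefix, suffixes in by_prefix.items():
        by_suffixes.setdefault(tuple(sorted(suffixes)), []).append(prefix)

    def alternatives(items):
        return items[0] if len(items) == 1 else "(" + "|".join(items) + ")"

    return " | ".join(
        alternatives(prefixes) + alternatives(list(suffixes))
        for suffixes, prefixes in by_suffixes.items()
    )

if __name__ == "__main__":
    print(factor(["ac", "ad", "bc", "bd"]))
```

When the suffix sets differ, as for ac, ad, bd, only the first step applies and the result stays a(c|d) | bd.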
Factoring Subsystem Heuristics
Choose subsets S of candidate DTDs from SG such that
DTDs in S have a common prefix p or suffix s
number of DTDs with this common prefix in SG is high
Factoring Prefixes
Candidate DTDs: abcddd, abceee, abcfff, abcggg, abcd*, abce*, abcf*, abcg*
Factored form: abc(d*|e*|f*|g*)
longer prefixes result in MDL cost reduction
factored DTD covers all input sequences
Factoring Subsystem Heuristics
Choose subsets S of candidate DTDs from SG such that
DTDs in S have a common prefix p or suffix s
number of DTDs with this common prefix in SG is high
The overlap between every pair of DTDs D, D’ in S should be minimal
Factoring Subsystem Overlap
Input sequences: eab, eabb, eabbb, eababab
Candidate DTDs: e(a|b)*, eab*
Factored form: e((a|b)*|ab*)
New factored form has much higher MDL cost!
Does not cover more input sequences than e(a|b)*
Experimental Validation
Comparison of XTRACT with IBM DDbE (Data Descriptors by Example)
Synthetic Documents
Randomly generated example sequences for synthetic DTDs
Real Life Documents
Example documents from different sources, e.g. Newspaper Association of America
Synthetic Documents
1 abcde|efgh|ij|klm
2 (a|b|c|d|f)*gh
3 (a|b|c)d*e*(fgh)*
4 (abcd)*|(e|f|g)*|h|(ijklm)*
5 a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*
XTRACT recovers every one of them
DDbE shows serious weaknesses
Recovers only the first one correctly
Deduced DTDs are over-generalizations
Does not even cover all example sequences
Level of factoring is limited
Real Life Documents
No   Simplified DTD   DTD obtained by XTRACT   DTD obtained by DDbE
1   a|b|c|d|e   a|b|c|d|e   a|b|c|d|e
2   (a|b|c|d|e)*   (a|b|c|d|e)*   (a|b|c|d|e)*
3   ab*c*   ab*c*   (ab+c*)|(ac*)
4   a*b?c?d?   a*b?c?d?   (a+b(c|(c?d))?)|((b|a+)?cd)|((a+|b)?d)|((a+|b)?c)|(a+|b)
5   (a(bc)+d)*   (a(bc)*d)*   (a|b|c|d)+
6   (ab?c*d?)*   -   (a|b|c|d)+
Conclusion
MDL principle used to control the tradeoff between model simplicity and model generalisation
General purpose tool to extract regular expressions from example documents
Experimental results provide strong support
Future work:
Generalization subsystem should detect patterns containing ? nested within Kleene stars, e.g. (a(bc)?)*
Enhance the system to detect even more complex DTDs
Incremental Validation of XML Documents
Abstraction of XML and DTDs
XML docs abstracted as labeled ordered trees (LOT)
• element content and attribute values are ignored
DTD as extended CFG
• start symbol (root)
• productions: associate to each label a regular expression that specifies the acceptable labels of the list of children of a node with the given label
LOT satisfies a DTD <-> tree is a derivation of the grammar
DTDs: Abstraction & Example
root: cars
cars → used new
used → car*
new → car*
car → (year|ε) model
Example tree: cars has children used and new; the car nodes below them have children (year, model), (year, model), (model) and (year, model)
Values shown: 95 Tigra, 94 Astra, Mini, 03 Boxster
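Validation from scratch against such a grammar can be sketched by checking, at every node, the word formed by its child labels against the regular expression of its own label. The encoding below is an assumption made for illustration: trees are `(label, children)` tuples, each child label is written with a trailing space, and the DTD is a dictionary of Python regular expressions. The placement of the four cars under `used` and `new` in `TREE` is also illustrative.

```python
import re

# Hypothetical encoding of the cars DTD; every child label is followed by a space.
DTD = {
    "cars": r"used new ",
    "used": r"(car )*",
    "new": r"(car )*",
    "car": r"(year )?model ",
    "year": r"",
    "model": r"",
}

def car(*labels):
    """Build a car node whose children are leaves with the given labels."""
    return ("car", [(label, []) for label in labels])

TREE = ("cars", [
    ("used", [car("year", "model"), car("year", "model")]),
    ("new", [car("model"), car("year", "model")]),
])

def is_valid(node, dtd):
    """True if every node's child-label word matches the rule of its label."""
    label, children = node
    word = "".join(child[0] + " " for child in children)
    if re.fullmatch(dtd[label], word) is None:
        return False
    return all(is_valid(child, dtd) for child in children)

if __name__ == "__main__":
    print(is_valid(TREE, DTD))
```

This revisits the whole tree on every call, which is exactly the cost that incremental validation tries to avoid.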
Tree Satisfying DTD, General Case
[Figure: a node whose rule is r, with k children and subtrees s1 … sk; the word formed by the child labels must be in L(r)]
Incremental Validation Problem Statement
For each valid tree T : given a series of update commands,
• efficiently decide if the updated tree T’ is valid
• efficiently update auxiliary structure A(T) and T
Updates (1): Node Renaming u(vi,)
[Figure: node vi, the i-th of k children (subtrees s1 … sk) of a node with rule r, is renamed]
Incremental Validation of Strings
Renaming u(i,b) in string σ1 … σn with respect to regular language specified by NFA N(Σ,Q,q0,F,δ)
validating updated string from scratch: O(n |Q|^2 log |Q|)
maintain auxiliary information:
Pre(i) = δ(q0, σ1 … σi-1)
Post(i) = { s | δ(s, σi+1 … σn) ∩ F ≠ ∅ }
σ1 … σi-1 b σi+1 … σn valid <-> exists s1 ∈ Pre(i), s2 ∈ Post(i) such that s2 ∈ δ(s1, b)
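The Pre/Post check can be sketched directly. The representation is an assumption: the NFA transition function is a dictionary from `(state, symbol)` to a set of states, positions are 0-based, and the sample automaton `AB_STAR` (accepting (ab)*) is hypothetical.

```python
def step(delta, states, symbol):
    """States reachable from any state in `states` by reading `symbol`."""
    result = set()
    for q in states:
        result |= delta.get((q, symbol), set())
    return result

def precompute(delta, states, q0, finals, word):
    """pre[i]: states reachable after word[:i]; post[i]: states accepting word[i+1:]."""
    n = len(word)
    pre, post = [None] * n, [None] * n
    current = {q0}
    for i in range(n):
        pre[i] = current
        current = step(delta, current, word[i])
    accepting = set(finals)
    for i in range(n - 1, -1, -1):
        post[i] = accepting
        accepting = {s for s in states if step(delta, {s}, word[i]) & accepting}
    return pre, post

def renaming_is_valid(delta, pre, post, i, b):
    """Is the word still accepted after replacing position i by symbol b?"""
    return bool(step(delta, pre[i], b) & post[i])

# Hypothetical NFA for (ab)*: state 0 is both initial and final.
AB_STAR = {(0, "a"): {1}, (1, "b"): {0}}

if __name__ == "__main__":
    pre, post = precompute(AB_STAR, {0, 1}, 0, {0}, "abab")
    print(renaming_is_valid(AB_STAR, pre, post, 1, "a"))
```

The check itself touches only Pre(i) and Post(i); keeping those sets up to date after an accepted update is the expensive part.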
Validating a Renaming u(ai, )
Validation of one update in O(1) given precomputed Pre and Post
But u(i, ) requires recomputation of Pre(i+1), Pre(i+2), … and of Post(i-1), Post(i-2), …
[Figure: run of N from q0 over positions 1 … i-1 (Pre(i)) and from positions i+1 … n into qF (Post(i))]
Transition Relation Definition
Ti,j = { (q, q’) | q’ ∈ δ(q, σi … σj) }
Ti,j = Ti,m ° Tm+1,j for i ≤ m < j
Divide-and-conquer approach
Transition-Relation-Tree Τn (n=2^k)
root: T1,2^k
node Ti,j has children Ti,m and Tm+1,j (m the midpoint of i … j)
leaves Ti,i, 1≤i≤n
number of nodes: n + (n/2) + … + 2 + 1 = 2n-1
balanced → Τn has depth log n
Transition Relation Trees
T1,8
T1,4 T5,8
T1,2 T3,4 T5,6 T7,8
T1,1 T2,2 T3,3 T4,4 T5,5 T6,6 T7,7 T8,8
Leaves correspond to string positions 1 2 3 4 5 6 7 8
Updating Tn
affected nodes lie on the path from a leaf to the root
bottom-up recomputing of the Ti,j: each Ti,j with children Ti,m and Tm+1,j for which at least one child has been recomputed is replaced by Ti,m ° Tm+1,j
→ O(log n) recomputations
updated string valid if (q0, f) ∈ T1,n for some f ∈ F
Maintenance of the Structure and Validation in O(log n)
Update u(6, ): recompute T6,6, T5,6, T5,8, T1,8 (the path from leaf 6 to the root)
If (q0, qF) ∈ T1,8 then valid
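A minimal sketch of the transition relation tree for renamings, assuming the word length is a power of two as on the slides. The class name, the heap-style array layout, 0-based positions, and the sample automaton `AB_STAR` for (ab)* are all assumptions made for illustration.

```python
def relation(delta, states, symbol):
    """Leaf relation: all (q, q') with q' reachable from q by reading `symbol`."""
    return frozenset((q, p) for q in states for p in delta.get((q, symbol), ()))

def compose(left, right):
    """Relational composition: first follow `left`, then `right`."""
    return frozenset((q, r) for (q, p) in left for (p2, r) in right if p == p2)

class TransitionTree:
    """Balanced binary tree of transition relations stored as a heap array."""

    def __init__(self, delta, states, word):
        n = len(word)
        if n == 0 or n & (n - 1):
            raise ValueError("word length must be a power of two")
        self.delta, self.states, self.n = delta, states, n
        self.nodes = [frozenset()] * (2 * n)
        for i, symbol in enumerate(word):          # leaves T(i,i)
            self.nodes[n + i] = relation(delta, states, symbol)
        for v in range(n - 1, 0, -1):              # inner nodes, bottom-up
            self.nodes[v] = compose(self.nodes[2 * v], self.nodes[2 * v + 1])

    def rename(self, i, symbol):
        """Relabel position i and recompute only the leaf-to-root path."""
        v = self.n + i
        self.nodes[v] = relation(self.delta, self.states, symbol)
        v //= 2
        while v >= 1:
            self.nodes[v] = compose(self.nodes[2 * v], self.nodes[2 * v + 1])
            v //= 2

    def is_valid(self, q0, finals):
        """The word is accepted if the root relates q0 to some final state."""
        return any((q0, f) in self.nodes[1] for f in finals)

# Hypothetical NFA for (ab)*: state 0 is both initial and final.
AB_STAR = {(0, "a"): {1}, (1, "b"): {0}}

if __name__ == "__main__":
    tree = TransitionTree(AB_STAR, {0, 1}, "abababab")
    tree.rename(5, "a")                            # sixth symbol, as in u(6, )
    print(tree.is_valid(0, {0}))
```

Each `rename` recomputes one relation per level, which is where the O(log n) bound for a single renaming comes from (treating the size of the automaton as a constant).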
Insertions and Deletions
positions of nodes in the string can change
length n of string is dynamic
→ Recomputing of the entire tree Tn necessary
New approach based on B-Trees:
tree structure can be incrementally maintained
tree is still balanced and has depth O(log n)
Transition B-Trees (2-3 Trees)
Leaves: positions 1, 2, 3, 5, 6, 7, 9 with relations T1, T2, T3, T5, T6, T7, T9
Inner nodes: Ta, Tb, Tc, with Ta = T1 ° T2
If (q0, qF) ∈ Ta ° Tb ° Tc then valid
Transition B-Trees (2-3 Trees) for O(log n) Insertions and Deletions
Inserting 8: the leaves below Tc become T7, T8, T9
Inserting 4: the leaves T3, T4, T5, T6 overflow their node, which is split into Td and Te
The root with children Ta, Td, Te, Tc then overflows and is split into Tf and Tg
Auxiliary Structures for Incremental DTD Validation
[Figure: tree with rule r over children 1 … k (subtrees s1 … sk); node vi is updated by u(vi, )]
XML Schema Validation
XML Schema provides a mechanism to decouple element names from their types and thus allows context-dependent definitions of their structure
Update to a single node may have global repercussions for the typing of the tree
Need more theory:
Specialized DTDs, binary tree encoding, non-deterministic tree automata…
details are left to the interested reader…
Review
Given m updates on a tree of size n:
incrementally validate DTD in O(m log n)
validate XML Schema in O(m log^2 n)
Weakness
Only updates that affect one node at a time are considered
Summary
XTRACT as a tool to infer DTDs from a set of example XML documents
An approach to incrementally validate an XML document after an update
Questions?