
1

XML Constraints

Wenfei Fan

University of Edinburgh

and

Bell Laboratories

2

Outline of Part IV

XML Specifications: types and integrity constraints

Specification of XML constraints:

– keys, foreign keys, FDs

– absolute vs. relative constraints

Analysis of XML constraints

– Consistency analysis

– Implication analysis

Applications of XML constraints, and research issues

– Relational storage of XML data via constraint propagation

– Schema-directed XML integration

– Normal forms, query optimization, updates, data cleaning . . .

3

Introduction to XML specification

XML Specification:

– types

– integrity constraints

– the need for XML constraints

4

XML data - an example

Rooted, node-labeled tree:

– elements: db, province, capital, city; each element roots a subtree (sub-document)

– elements/subelements, e.g., the capital child of province

– @attributes: @name, @inProvince, carrying text

– text nodes, with text but no label, e.g., “Hasselt”

[Figure: XML tree rooted at db, whose children are province and capital elements. A province has a @name attribute (e.g., “Limburg”) and city and capital children; a capital has an @inProvince attribute (e.g., “Limburg”) and text such as “Hasselt”.]

5

XML specification: DTD (type)

Production: constrains the subelement list of each element

<!ELEMENT db (province+, capital+)>

<!ELEMENT province (city*, capital)>

Attributes: uniquely identified by name for each element, unordered

province: @name, capital: @inProvince

[Figure: the example XML tree rooted at db, as on slide 4.]

6

XML specification: integrity constraints

Keys and foreign keys (vs. relational constraints):

key: the value of a @name uniquely identifies a province

province.@name → province

capital.@inProvince → capital

FK: @inProvince of a capital references @name of a province

capital.@inProvince ⊆ province.@name

[Figure: the example XML tree rooted at db, as on slide 4.]

7

XML specification

A type (DTD) D

A set Σ of integrity constraints

Example:

DTD D: structure of the document (vs. types in a programming language)

<!ELEMENT db (province+, capital+)>

<!ELEMENT province (city*, capital)>

Attributes: province.@name, capital.@inProvince

Constraints Σ: defined in terms of data values across elements

province.@name → province

capital.@inProvince → capital

capital.@inProvince ⊆ province.@name

8

Why XML constraints?

Supported by W3C XML standard, XML Schema

In databases (supported by the SQL standard), constraints are:

– an essential part of the semantics of data,

– fundamental to conceptual design,

– useful for choosing efficient storage and access methods,

– central to update anomaly prevention, data cleaning, …

In the XML setting, constraints have proved useful in:

– database storage of XML data (via constraint propagation),

– schema-directed database publishing/integration in XML,

– XML query optimization and formulation,

– design theory for XML specifications: normal forms,

– data cleaning, …

9

Data exchange on the Web: XML publishing

All members of a community (or industry) agree on a schema and exchange data w.r.t. the schema: e-commerce, health-care, ...

Schema-Directed XML Publishing/Integration: mapping data from traditional database to XML satisfying the predefined DTD and constraints

[Figure: relational sources DB1 and DB2 are published to the Web through an XML view Q that conforms to the predefined DTD and XML constraints.]

10

Data exchange on the Web: XML shredding

XML shredding: mapping XML data to relations

– relational design: normalization via constraint propagation from XML to relations

– optimal relational storage of XML data

– semantic connection: query/update optimization

[Figure: XML data from the Web is shredded into relational databases DB1 and DB2; XML keys are propagated to relational FDs.]

11

XML constraints

Specification of XML constraints:

– keys, foreign keys, FDs

– absolute vs. relative constraints

12

The limitations of the XML standard (DTD)

<!ATTLIST country name ID #REQUIRED>

<!ATTLIST province capital ID #REQUIRED>

<!ATTLIST capital inProvince IDREF #REQUIRED>

Scoping:

– ID unique within the entire document (like oids), while a key needs only to uniquely identify a tuple within a relation

– IDREF untyped: one has no control over what it points to -- you point to something, but you don’t know what it is!

<student id="01" name="Saddam" taking="qsx"/>

<student id="02" name="Bush" taking="qsx 01"/>

<course id="qsx"/>

13

The limitations of the XML standard (DTD)

keys may need to be multi-attribute (composite), while IDs must be single-valued (unary), e.g., enroll(sid: string, cid: string, grade: string)

a relation may have multiple keys, while an element can have at most one ID (primary)

ID/IDREF can only be defined in a DTD, while XML data may not come with a DTD/schema

ID/IDREF, even relational keys/foreign keys, fail to capture the semantics of hierarchical data – will be seen shortly

A mixture of relational keys and object identities (oids)

Mild extensions of relational constraints do not work for XML!

14

Absolute constraints

Absolute keys and foreign keys are to hold on the entire document.

province.@name → province

capital.@inProvince → capital

capital.@inProvince ⊆ province.@name

Extensions of relational counterparts

[Figure: the example XML tree rooted at db, as on slide 4.]

15

Absolute keys and foreign keys [PODS’00, 01, JACM]

key: τ[X] → τ. An XML document satisfies the key iff

∀x, y ∈ ext(τ) ( ∧_{l ∈ X} (x.l = y.l) → x = y )

foreign key (FK): a combination of an inclusion constraint τ1[X] ⊆ τ2[Y] and a key τ2[Y] → τ2.

A document satisfies the FK iff it satisfies the key and

∀x ∈ ext(τ1) ∃y ∈ ext(τ2) (x[X] = y[Y])

– τ, τ1, τ2: element types; X, Y: sets (lists) of attributes;

– ext(τ): the set of τ elements in an XML document.

Equality issue:

– (string) value equality: when comparing attributes

– node identity: when comparing XML elements

Unary keys and foreign keys: defined in terms of a single attribute.
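
The two definitions above can be checked directly on a small document. The following is a minimal sketch, not the method of the cited papers: the helper names (ext, satisfies_key, satisfies_foreign_key) and the sample document are illustrative assumptions, and attributes are assumed to be present on every element, as a DTD with required attributes would guarantee.

```python
import xml.etree.ElementTree as ET

def ext(root, tag):
    """ext(tau): all elements of type `tag` anywhere in the document."""
    return list(root.iter(tag))

def satisfies_key(root, tag, attrs):
    """tau[X] -> tau: no two distinct tau elements agree on all attributes in X."""
    seen = set()
    for elem in ext(root, tag):
        value = tuple(elem.get(a) for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True

def satisfies_foreign_key(root, tag1, attrs1, tag2, attrs2):
    """tau1[X] included in tau2[Y], together with the key tau2[Y] -> tau2."""
    if not satisfies_key(root, tag2, attrs2):
        return False
    targets = {tuple(e.get(a) for a in attrs2) for e in ext(root, tag2)}
    return all(tuple(e.get(a) for a in attrs1) in targets for e in ext(root, tag1))

# Hypothetical document conforming to the DTD of the running example.
DOC = ET.fromstring(
    '<db>'
    '<province name="Limburg"><city/><capital inProvince="Limburg"/></province>'
    '<capital inProvince="Limburg"/>'
    '</db>'
)

print(satisfies_key(DOC, "province", ["name"]))
print(satisfies_key(DOC, "capital", ["inProvince"]))
print(satisfies_foreign_key(DOC, "capital", ["inProvince"], "province", ["name"]))
```

In this sample the province key and the foreign key hold, while capital.@inProvince → capital fails because two capital elements carry the same @inProvince value.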

16

Relative constraints [WWW’01, PODS’02,SICOMP]

An XML tree specifies countries, provinces, province capitals.

What is a key for a province?

What does @inProvince of a capital reference?

[Figure: XML tree rooted at db with two country subtrees, @name “Belgium” and @name “Holland”. Each country has province elements with @name and capital elements with @inProvince; both countries contain a province named “Limburg”, with capitals “Hasselt” and “Maastricht”.]

17

Examples of relative constraints

Relative constraints: on a subdocument rooted at a country:

key: country (province.@name → province)

country (capital.@inProvince → capital)

FK: country (capital.@inProvince ⊆ province.@name)

Absolute: on the entire document: country.@name → country

[Figure: the XML tree of countries, provinces and capitals from slide 16.]

18

Relative keys and foreign keys

key: τ(τ1[X] → τ1). A document satisfies the key iff

∀c ∈ ext(τ) ∀y, z ∈ ext(τ1) ( (y ≺ c) ∧ (z ≺ c) ∧ ∧_{l ∈ X} (y.l = z.l) → y = z )

foreign key (FK): τ(τ1[X] ⊆ τ2[Y]) and a key τ(τ2[Y] → τ2).

A document satisfies the FK iff it satisfies the key and

∀c ∈ ext(τ) ∀y ∈ ext(τ1) ( (y ≺ c) → ∃z ∈ ext(τ2) ((z ≺ c) ∧ y[X] = z[Y]) )

where

– (y ≺ c): y is a descendant of c (y is in the subtree rooted at c);

– τ: context type;

– ext(τ): the set of τ elements in an XML document.
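
A relative key restricts the uniqueness test to the subtree under each context node. The sketch below illustrates that scoping under stated assumptions: the function name satisfies_relative_key and the two-country document are hypothetical, and attributes are assumed to be present wherever they are read.

```python
import xml.etree.ElementTree as ET

def satisfies_relative_key(root, context, tag, attrs):
    """context(tag[X] -> tag): within each context element, no two distinct
    descendants of type `tag` agree on all attributes in X."""
    for c in root.iter(context):
        seen = set()
        for elem in c.iter(tag):
            if elem is c:
                continue  # only proper descendants of the context node count
            value = tuple(elem.get(a) for a in attrs)
            if value in seen:
                return False
            seen.add(value)
    return True

# Hypothetical document: two countries that each have a province named Limburg.
DOC = ET.fromstring(
    '<db>'
    '<country name="Belgium"><province name="Limburg">'
    '<capital inProvince="Limburg"/></province></country>'
    '<country name="Holland"><province name="Limburg">'
    '<capital inProvince="Limburg"/></province></country>'
    '</db>'
)

# Relative to each country the province name is a key ...
print(satisfies_relative_key(DOC, "country", "province", ["name"]))
# ... but with the root as context (the absolute reading) it is not.
print(satisfies_relative_key(DOC, "db", "province", ["name"]))
```

Using the root type as the context gives the absolute reading of the same constraint, which fails here because the province name “Limburg” occurs in both countries.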

19

Relative vs. Absolute

Absolute constraints are a special case of relative ones:

country.@name → country is equivalent to db (country.@name → country)

absolute: a fixed context type -- the root type r

Absolute constraints are scoped within the entire document; whereas relative ones within the context of a subdocument.

country (province.@name → province)

country (capital.@inProvince → capital)

country (capital.@inProvince ⊆ province.@name)

country.@name → country

Together they specify constraints on the entire document

Beyond relational constraints; important for hierarchically structured data: XML, scientific databases, biomedical data, ...

20

Define keys with path expressions

XML data is hierarchically structured!

“name” as a key for employees of companies only: target set is identified with a path expression: //company//employee

XML data is semistructured: it may not have a DTD/schema!

– key paths may be missing or have multiple occurrences

key specification should be independent of types

[Figure: XML tree rooted at db with company, government and university subtrees; employee elements occur at different depths (e.g., under dept), with @id attributes and name children, some names having firstName and lastName subelements.]

21

Path expressions

Path expression: navigating XML trees

A simple yet powerful path language:

q ::= ε | l | q/q | //

– ε: empty path

– l: tag

– q/q: concatenation

– //: descendants and self, recursively descending downward

22

Absolute path constraints [WWW’01]

Absolute key: (Q, {P1, . . ., Pk} )

Path expressions Q, Pi: XPath, regular path expressions, …

target path Q: to identify a target set [[Q]] of nodes on which the key is defined (vs. relation)

a set of key paths {P1, . . ., Pk}: to provide an identification for nodes in [[Q]] (vs. key attributes)

semantics: for any two nodes in [[Q]], if they have all the key paths and agree on them by value equality (existential), then they must be the same node (value equality and node identity)

Examples:

(//company//employees, {name, phone}) -- composite key

( //company//employees, {//@id}) -- multiple keys

(//., {@id}) -- capturing ID attributes in DTDs

23

Value equality on trees

Two nodes are value equal iff

– they are text nodes (PCDATA) with the same value; or

– they are attributes with the same tag and the same value; or

– they are elements having the same tag and their children are pairwise value equal

E.g.: two value-equal names

[Figure: XML tree rooted at db with four person elements carrying @phone attributes (“123-4567”, “234-5678”) and name children; two of the name elements have firstName “George” and lastName “Bush” and are value equal.]
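
The recursive definition of value equality can be sketched on ElementTree nodes as follows. This is an illustration under assumptions: the function name value_equal is hypothetical, ElementTree stores text in .text rather than as separate child nodes, and mixed-content tails are ignored.

```python
import xml.etree.ElementTree as ET

def value_equal(a, b):
    """Value equality on element nodes: same tag, same attributes,
    same text, and pairwise value-equal children in order."""
    if a.tag != b.tag:
        return False
    if a.attrib != b.attrib:      # attributes are unordered, so compare as dicts
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if len(a) != len(b):
        return False
    return all(value_equal(x, y) for x, y in zip(a, b))

n1 = ET.fromstring('<name><firstName>George</firstName><lastName>Bush</lastName></name>')
n2 = ET.fromstring('<name><firstName>George</firstName><lastName>Bush</lastName></name>')
n3 = ET.fromstring('<name>JohnDoe</name>')

print(value_equal(n1, n2), n1 is n2)   # value equal, yet distinct nodes
print(value_equal(n1, n3))
```

The first comparison shows the distinction the slide draws: n1 and n2 are value equal but are not the same node.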

24

Capturing the semistructured nature

– independent of types

– no structural requirement: tolerating missing/multiple paths

(person, {name})

(person, {name, @phone})

[Figure: XML tree rooted at db with four person elements; some carry @phone (“123-4567”, “234-5678”), some have a structured name (firstName “George”, lastName “Bush”), and one name is the plain text “JohnDoe”.]
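
The existential semantics of path keys, including the tolerance for missing paths, can be sketched as below. Everything here is an assumption made for illustration: the names canon, values and satisfies_path_key, the restriction of key paths to child tags and @attributes, and the sample document, which only loosely follows the figure.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

def canon(elem):
    """Hashable stand-in for value equality (mixed-content tails ignored)."""
    return (elem.tag, tuple(sorted(elem.attrib.items())),
            (elem.text or "").strip(), tuple(canon(child) for child in elem))

def values(node, path):
    """Set of canonical values reached from `node` via one key path."""
    if path.startswith("@"):
        v = node.get(path[1:])
        return set() if v is None else {(path, v)}
    return {canon(e) for e in node.findall(path)}

def satisfies_path_key(root, target, key_paths):
    """(Q, {P1..Pk}): two distinct target nodes may not share a value on every key path."""
    for n1, n2 in combinations(root.findall(target), 2):
        if all(values(n1, p) & values(n2, p) for p in key_paths):
            return False
    return True

DOC = ET.fromstring(
    '<db>'
    '<person phone="123-4567"><name><firstName>George</firstName>'
    '<lastName>Bush</lastName></name></person>'
    '<person phone="234-5678"><name><firstName>George</firstName>'
    '<lastName>Bush</lastName></name></person>'
    '<person><name>JohnDoe</name></person>'
    '<person/>'
    '</db>'
)

print(satisfies_path_key(DOC, ".//person", ["name"]))
print(satisfies_path_key(DOC, ".//person", ["name", "@phone"]))
```

A person with no name or no @phone never triggers a violation, since a missing path yields an empty value set; this is the sense in which the key imposes no structural requirement.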

25

Relative path constraints [WWW’01]

Relative key: (Q, K)

– path Q, called the context path, identifies a set [[Q]] of nodes;

– K = (Q’, {P1, . . ., Pk}) is a key on sub-documents rooted at nodes in [[Q]] (relative to Q).

Example. (//country, (province, {@capital}))

(//country, {@name}) -- absolute key

Absolute keys are a special case of relative keys: (Q, K) when Q is the empty path

Similarly for foreign keys

Specification of XML constraints is more involved than its relational counterparts

26

Keys and foreign keys in XML Schema

key: (Q, {P1, . . ., Pk} )

Path expressions Q, Pi: fragments of XPath

Uniqueness and existence: for each node x in [[Q]] and each i in [1, k], there exists a unique node yi reached via Pi, and yi is either a text node or an attribute

Foreign keys: (Q, {P1, . . ., Pk}) ⊆ (S, {S1, . . ., Sk})

– (S, {S1, . . ., Sk}) is a key

– Uniqueness and existence: both Pi and Si

The uniqueness and existence condition complicates the consistency and implication analyses

Absolute constraint

27

Other constraints for XML

Functional dependencies: {P1, . . ., Pk} → {S1, . . ., Sk}

– Generalizations of relational FDs, for deriving an extension of relational-schema normal forms

– Absolute constraints [Arenas and Libkin, PODS’02]

XICs: ∀x1 … xn ( B(x1, …, xn) → ∨_{i ∈ [1, l]} ∃y1 … yk Ci(x1, …, xn, y1, …, yk) )

– Generalization of relational embedded constraints

– B, Ci: conjunctions of simple XPath expressions

– Subsuming relative keys and foreign keys (Deutsch and Tannen, [KRDB’01])

28

Constraint analysis

Analysis of XML constraints

– Consistency analysis

– Implication analysis

– Absolute, relative, path-expression constraints

29

Consistency of XML specifications

Given

– D: a DTD

– Σ: a set of integrity constraints over D

Consistency: Is there an XML document that both conforms to D and satisfies Σ?

One wants to know whether XML specifications make sense!

Run-time check: attempts to validate documents with (D, Σ). This would not tell us whether repeated failures are due to a bad specification or to problems with the documents.

Static analysis is desirable.

30

An inconsistent specification

The specification with D and Σ is inconsistent!

DTD D:

<!ELEMENT db (province+, capital+)>

<!ELEMENT province (city*, capital)>

Attributes: province.@name, capital.@inProvince

Constraints Σ:

province.@name → province

capital.@inProvince → capital

capital.@inProvince ⊆ province.@name

In contrast, one can specify keys and foreign keys in SQL without worrying about their consistency with the schema.

31

Cardinality constraints by keys, foreign keys

Constraints Σ:

province.@name → province

capital.@inProvince → capital

capital.@inProvince ⊆ province.@name

Notation:

– ext(τ): the set of τ elements in an XML document

– ext(τ.l): the set of l attribute values of all τ elements

|ext(province.@name)| = |ext(province)|

|ext(capital.@inProvince)| = |ext(capital)|

|ext(capital.@inProvince)| ≤ |ext(province.@name)|

|ext(capital)| ≤ |ext(province)|

32

Cardinality constraints imposed by DTDs

DTD D:

<!ELEMENT db (province+, capital+)>

<!ELEMENT province (city*, capital)>

Variables:

– Xprovince: the number of province elements under the root

– Xcapital: the number of capital subelements of the root

– Ycapital: the number of capital subelements of province elements

Xprovince ≥ 1, Xcapital ≥ 1

|ext(province)| = Xprovince, Xprovince = Ycapital

|ext(capital)| = Xcapital + Ycapital

Hence |ext(capital)| > |ext(province)|

33

The interaction

Contradiction:

From the constraints Σ: |ext(capital)| ≤ |ext(province)|

From the DTD D: |ext(capital)| > |ext(province)|

Thus there exists NO XML document that both conforms to D and satisfies Σ.

[Figure: the example XML tree rooted at db, as on slide 4.]
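
The contradiction can also be seen by searching for counts that meet both sets of cardinality constraints. The sketch below is only a sanity check over small values, not a proof and not the decision procedure of the cited papers; the function name consistent_counts and the search bound are assumptions.

```python
from itertools import product

def consistent_counts(bound):
    """Search small values of Xprovince and Xcapital for a solution of the
    cardinality constraints imposed by the DTD and by the constraints."""
    solutions = []
    for x_province, x_capital in product(range(1, bound + 1), repeat=2):
        y_capital = x_province              # DTD: each province has exactly one capital child
        n_province = x_province             # |ext(province)|
        n_capital = x_capital + y_capital   # |ext(capital)|
        if n_capital <= n_province:         # keys and foreign key: |ext(capital)| <= |ext(province)|
            solutions.append((x_province, x_capital))
    return solutions

print(consistent_counts(20))
```

The search finds nothing because Xcapital ≥ 1 forces |ext(capital)| = Xprovince + Xcapital to exceed |ext(province)| = Xprovince for every choice of values.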

34

Consistency analysis [PODS’01, 02, JACM, SICOMP]

Trivial for relational databases: given any schema and keys,

foreign keys, one can always find a nonempty instance of the

schema satisfying the constraints.

Hard for XML: XML specifications may not be consistent!

– Both DTDs and constraints impose cardinality constraints

– The interaction between these two classes of cardinality

constraints is rather complicated.

35

Consistency analysis of XML constraints

Theorem: The consistency problem is

– undecidable for multi-attribute absolute keys and foreign keys;

– NP-complete for unary absolute keys and foreign keys, even for primary keys (primary: at most one key for each element type);

– in NEXPTIME for primary multi-attribute absolute keys and unary foreign keys;

– in 2NEXPTIME and PSPACE-hard for unary absolute regular keys and foreign keys (target path: β/τ, where β is a regular path expression and τ an element type; key paths: attributes);

– undecidable for relative keys and foreign keys, even when all the constraints are unary and primary.

As opposed to the trivial analysis of the relational counterpart.

36

Proof ideas

Multi-attribute constraints: reduction from the implication problem for functional and inclusion dependencies in RDBs.

Unary keys and foreign keys:

– a nontrivial encoding of DTDs and unary constraints in terms of linear integer constraints (O(n^2 log n)-time);

– polynomially equivalent to LIP, linear integer programming

Multi-attribute primary keys and unary foreign keys:

– polynomially equivalent to Prequadratic Diophantine Problem (PDE): satisfiability of linear integer constraints and prequadratic constraints of the form x ≤ y · z;

– the precise complexity of PDE, a restriction of Hilbert’s 10th problem, is open -- nontrivial.

37

Proof idea for relative constraints

Theorem: The consistency problem is undecidable for relative keys and foreign keys, even when all the constraints are unary and are under the primary key restriction.

As opposed to the NP complexity of its absolute counterpart.

Proof idea: reduction from Hilbert’s 10th problem.

Diophantine equation problem:

P1(x1, …, xk) = Q1(x1, …, xk) + c1

. . .

Pn(x1, …, xk) = Qn(x1, …, xk) + cn

38

More on regular-expression constraints

XML data is hierarchically structured:

– define @eid as a key of employees of companies and schools;

– define @taughtBy as a foreign key of students referencing @eid of school employees.

[Figure: XML tree rooted at db with university, government and company subtrees; employee elements (with @eid) occur at various depths, e.g., under dept, and student elements under university carry @taughtBy.]

39

Examples of regular constraints

Key: (university._* + company._*).employee.@eid → (university._* + company._*).employee

FK: _*.student.@taughtBy ⊆ university._*.employee.@eid

_: wildcard that matches any label

_*: the Kleene closure of _

[Figure: the XML tree of universities, government and companies from slide 38.]

40

Regular path expression

Vertical regular expressions:

β ::= ε | τ | _ | β.β | β + β | β*

– ε: empty word; τ: element type; _: wildcard;

– “., +, *”: concatenation, disjunction, Kleene star

Example: (university._* + company._*).employee

university._*.employee

nodes(β.τ): the set of τ elements in an XML document that are reachable from the root by following β.τ

41

Regular expression constraints

key: β.τ[X] → β.τ. A document satisfies the key iff

∀x, y ∈ nodes(β.τ) ( ∧_{l ∈ X} (x.l = y.l) → x = y )

foreign key: β1.τ1[X] ⊆ β2.τ2[Y], and a key β2.τ2[Y] → β2.τ2

A document satisfies the FK iff it satisfies the key and

∀x ∈ nodes(β1.τ1) ∃y ∈ nodes(β2.τ2) (x[X] = y[Y])

where nodes(β.τ): the set of τ elements reachable from the root by following β.τ
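
One way to evaluate nodes(·) is to translate the vertical regular expression into an ordinary regular expression over root-to-node label paths. The sketch below assumes that approach; the names to_regex and nodes, the 'label/' path encoding, and the sample document are illustrative choices rather than anything prescribed by the slides.

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"\s*([A-Za-z][\w-]*|_|[().+*])")

def to_regex(beta):
    """Translate a vertical regular expression into a Python regex over 'label/' sequences."""
    out, pos = [], 0
    beta = beta.strip()
    while pos < len(beta):
        m = TOKEN.match(beta, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        tok, pos = m.group(1), m.end()
        if tok == "_":
            out.append("(?:[^/]+/)")        # wildcard: any single label
        elif tok == ".":
            continue                        # concatenation is implicit in a regex
        elif tok == "+":
            out.append("|")                 # disjunction
        elif tok == "(":
            out.append("(?:")
        elif tok in ")*":
            out.append(tok)
        else:
            out.append("(?:" + re.escape(tok) + "/)")
    return re.compile("(?:" + "".join(out) + ")")

def nodes(root, beta):
    """Elements reachable from the root by following a label path matched by beta."""
    pattern = to_regex(beta)
    found = []
    def walk(elem, path):
        for child in elem:
            child_path = path + child.tag + "/"
            if pattern.fullmatch(child_path):
                found.append(child)
            walk(child, child_path)
    walk(root, "")
    return found

# Hypothetical document in the spirit of the university/company example.
DOC = ET.fromstring(
    '<db>'
    '<university><dept><employee eid="e1"/><student taughtBy="e1"/></dept></university>'
    '<government><employee eid="e1"/></government>'
    '<company><employee eid="e2"/></company>'
    '</db>'
)

targets = nodes(DOC, "(university._* + company._*).employee")
print([e.get("eid") for e in targets])
print(len(nodes(DOC, "_*.employee")))
```

The government employee shares @eid “e1” with a university employee, yet the regular key is not violated because that element lies outside the target set.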

42

Regular: an extension of absolute constraints

Example:

Key: (university._* + company._*).employee.@eid → (university._* + company._*).employee

FK: _*.student.@taughtBy ⊆ university._*.employee.@eid

Observation: nodes(_*.τ) = ext(τ)

Recall absolute constraints:

key: τ[X] → τ is expressed as _*.τ[X] → _*.τ

foreign key: τ1[X] ⊆ τ2[Y], τ2[Y] → τ2 is expressed as _*.τ1[X] ⊆ _*.τ2[Y], _*.τ2[Y] → _*.τ2

43

Consistency analysis of regular constraints

Corollary: The consistency problem is undecidable for multi-attribute regular keys and foreign keys.

Theorem: It is decidable in 2NEXPTIME and is PSPACE-hard for

unary regular constraints.

2NEXPTIME: an involved encoding in terms of LIP

– regular expressions in a DTD interact with (vertical) regular path expressions: reduce DTD to a simple normal form

– regular path expressions interact with each other: introduce exponentially many variables for all boolean combinations

– encoding “reachability” (nodes(β.τ)) of a path expression: tag variables with states of finite state automata

44

Some tractable cases

Restrictions on constraints.

Theorem: For multi-attribute relative keys only, the consistency problem is in linear time for arbitrary DTDs.

Recall relative keys: country (province.@name → province)

In contrast, due to the existence and uniqueness condition:

Theorem: It is intractable for unary keys alone in XML Schema.

Restrictions on DTDs:

Theorem: When DTD is fixed, the consistency problem is in PTIME for absolute unary keys and foreign keys.

In practice, DTD is designed at one time, but constraints are written in stages: constraints are incrementally added.

45

Implication analysis [PODS’00, 01, 02, DBPL’01]

Given

– D: a DTD

– Σ: a set of constraints expressed in C

– φ: a property (a constraint of C)

Implication (C): Is it the case that for any XML document, if it conforms to D and satisfies Σ, then it must satisfy φ?

C: a constraint language

The need for studying implication:

– data integration: constraint checking at virtual views

– optimization of XML queries and XML relational storage

– design theory for XML specifications: normalization

46

Some complexity results for implication analysis

Theorem: The implication problem is

– undecidable for multi-attribute absolute keys and foreign keys, and for unary relative keys and foreign keys;

– PSPACE-hard for unary regular absolute keys and foreign keys;

– coNP-complete for unary absolute keys and foreign keys;

– coNP-hard for XML-Schema unary keys;

– in linear time for absolute multi-attribute keys;

– in PTIME for arbitrary absolute keys and foreign keys when the DTD is fixed; and

– in PTIME for relative path keys in the absence of DTDs

The analysis of XML constraints is far more intricate than its relational counterpart

47

Applications

Application of XML constraints, and open problems

– Constraint propagation

– Schema-directed XML integration

– Normal form

– Query rewriting/optimization

– Update processing

– Data cleaning

– . . .

48

XML shredding: relational storage of XML data

XML shredding: mapping XML data to relations

– relational design: normalization

– optimal relational storage of XML data

– semantic connection: query/update optimization

[Figure: XML data from the Web is shredded into relational databases DB1 and DB2; XML keys are propagated to relational FDs.]

49

Example: XML constraints

(//book, {isbn}) -- isbn is an (absolute) key of book

(//book, (chapter, {number})) -- number is a key of chapter relative to book

(//book, (title, { })) -- each book has a unique title

[Figure: XML tree rooted at db with book elements; books have isbn, title and chapter children, and chapters have number, title and section children. Two books share the title “XML”, and each has a chapter numbered “1”.]

50

Mapping from XML to a predefined relation

Predefined RDB: chapter(bookTitle, chapterNum, chapterTitle)

Mapping: for each book, extract its title, and the numbers and titles of all its chapters

Predefined relational key: (bookTitle, chapterNum)

Can the XML data be mapped to the RDB without violating the key?

[Figure: the book/chapter XML tree from slide 49.]

51

A safe mapping

Now change the relational schema to

RDB: chapter(isbn, chapterNum, chapterTitle)

The relation can be populated without any violation. Why?

The relational key (isbn, chapterNum) for chapter is implied (entailed) by the keys on the original XML data:

(//book, {isbn}), (//book, (chapter, {number})), (//book, (title, { }))

[Figure: the book/chapter XML tree from slide 49.]
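
The difference between the two relational schemas can be illustrated by shredding a small document both ways and testing the relational key on the result. This is a sketch under assumptions: the function names shred and key_holds, the choice to model isbn, title and number as child elements, and the isbn values are all hypothetical.

```python
import xml.etree.ElementTree as ET

def shred(root, book_field):
    """Map each chapter to a tuple (book_field value, chapterNum, chapterTitle)."""
    rows = []
    for book in root.iter("book"):
        book_value = book.findtext(book_field)
        for chapter in book.findall("chapter"):
            rows.append((book_value, chapter.findtext("number"), chapter.findtext("title")))
    return rows

def key_holds(rows, key_columns):
    """True if no two distinct rows agree on all key columns."""
    seen = {}
    for row in rows:
        k = tuple(row[i] for i in key_columns)
        if k in seen and seen[k] != row:
            return False
        seen[k] = row
    return True

# Two books with the same title, each with a chapter numbered 1 (isbn values are made up).
DOC = ET.fromstring(
    '<db>'
    '<book><isbn>111</isbn><title>XML</title>'
    '<chapter><number>1</number><title>DTD</title></chapter></book>'
    '<book><isbn>222</isbn><title>XML</title>'
    '<chapter><number>1</number><title>XPath</title></chapter></book>'
    '</db>'
)

print(key_holds(shred(DOC, "title"), (0, 1)))   # key (bookTitle, chapterNum)
print(key_holds(shred(DOC, "isbn"), (0, 1)))    # key (isbn, chapterNum)
```

The sample document satisfies the three XML keys, yet only the mapping keyed on isbn yields a relation that respects its key; a single document cannot establish propagation in general, it only exhibits the counterexample for bookTitle.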

52

Constraint Propagation [ICDE’03, JCSS]

Input:

– a set K of XML keys (context and target paths: a fragment of XPath; key paths: attributes)

– a predefined relational schema S

– a mapping f from XML to S (XPath, projection, join, union)

– a relational functional dependency FD over S

Output: is the FD propagated from K via f? I.e., does FD hold over the DB f(T) for any XML document T that satisfies K?

Theorem: The constraint propagation problem is in PTIME.

Checking the consistency of a predefined relational schema for

storing XML data

XML schema/DTD is not required – K is the only semantics

53

Deriving relational schema for storing XML

One wants to find a “good” relational schema to store:

chapter(isbn, bookTitle, author, chapterNum, chapterTitle)

What is a good schema?

– In normal form: BCNF, 3NF, …

– Prevents update anomalies (the relational theory)

– Efficient storage, query optimization, …

But how to find a normalized design?

[Figure: the book/chapter XML tree from slide 49.]

54

Constraint propagation and normalization

From the given XML keys:

(//book, {isbn}), (//book, (chapter, {number})), (//book, (title, { }))

one can derive functional dependencies:

isbn → bookTitle; isbn, chapterNum → chapterTitle

Normalize the relation by using these functional dependencies:

chapter(isbn, bookTitle, author, chapterNum, chapterTitle)

is decomposed into

book(isbn, bookTitle),

chapter(isbn, chapterNum, chapterTitle),

author(isbn, author)

The new schema is in BCNF!
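
The BCNF claim can be checked with the standard attribute-closure test. The sketch below is a simplified check that only inspects the two listed FDs against each schema (a full test would project the closure of the FDs onto each schema); the function names closure and is_bcnf are assumptions.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under a list of (lhs, rhs) functional dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(schema, fds):
    """Simplified BCNF test: every listed FD that applies nontrivially
    to the schema must have a superkey of the schema on its left side."""
    schema = set(schema)
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs)
        if lhs <= schema and (rhs & schema) - lhs:
            if not schema <= closure(lhs, fds):
                return False
    return True

# The two FDs derived from the XML keys.
FDS = [
    (("isbn",), ("bookTitle",)),
    (("isbn", "chapterNum"), ("chapterTitle",)),
]

UNIVERSAL = ("isbn", "bookTitle", "author", "chapterNum", "chapterTitle")
DECOMPOSED = [
    ("isbn", "bookTitle"),
    ("isbn", "chapterNum", "chapterTitle"),
    ("isbn", "author"),
]

print(is_bcnf(UNIVERSAL, FDS))
print([is_bcnf(schema, FDS) for schema in DECOMPOSED])
```

The universal relation fails because isbn → bookTitle holds while isbn alone is not a superkey; each of the three decomposed schemas passes.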

55

Computing minimum cover of propagated FDs

Input: a set K of XML keys, and a mapping f from XML to a universal schema U

Output: a minimum cover F of all the functional dependencies (FDs) propagated from the XML keys K via f

– F is a cover (a set of FDs): any FD propagated from K via f is implied by F

– F is minimum: F contains no redundant FDs, i.e., any FD in F is not entailed by other FDs in F.

Theorem: There is a PTIME algorithm for computing a minimum cover of propagated FDs.

Normalize relational schema for storing/querying XML data!
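
The "no redundant FDs" condition can be illustrated with the textbook closure-based pruning step. This is not the PTIME propagation algorithm of the theorem: it assumes a set of already-propagated FDs is given, and the function names closure and drop_redundant and the sample input are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under a list of (lhs, rhs) functional dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def drop_redundant(fds):
    """Remove every FD that is entailed by the remaining ones."""
    kept = list(fds)
    i = 0
    while i < len(kept):
        lhs, rhs = kept[i]
        others = kept[:i] + kept[i + 1:]
        if set(rhs) <= closure(lhs, others):
            kept = others        # entailed by the rest, so drop it
        else:
            i += 1
    return kept

# Hypothetical input: the third FD follows from the first.
FDS = [
    (("isbn",), ("bookTitle",)),
    (("isbn", "chapterNum"), ("chapterTitle",)),
    (("isbn", "chapterNum"), ("bookTitle",)),
]

print(drop_redundant(FDS))
```

The result keeps only isbn → bookTitle and isbn, chapterNum → chapterTitle, so no remaining FD is entailed by the others.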

56

Research issues

For general constraints/mapping languages: undecidable

– if the mapping language is relationally complete (selection, projection, join, union, difference), even for XML keys alone

– if both XML keys and foreign keys are considered, even for the identity “transformation”

Open: To identify (a) practical mapping languages and (b) practical XML constraints that allow efficient constraint propagation

Constraint propagation from relations to XML

– Information preserving (lossless) data exchange

– Query/update rewriting/optimization

– Overcoming incompleteness of source data (foreign keys)

57

XML publishing/integration

All members of a community (or industry) agree on a schema and exchange data w.r.t. the schema: e-commerce, health-care, ...

Schema-directed XML Publishing/Integration: mapping data from traditional database to XML satisfying the predefined DTD and constraints

[Figure: relational sources DB1 and DB2 are published to the Web through an XML view Q that conforms to the predefined DTD and XML constraints.]

58

Schema-directed integration [SIGMOD’03]

[Figure: multiple, distributed source databases (DB) are integrated into a single XML view that conforms to a DTD and constraints.]

Schema-directed: XML view conforming to a schema (D, Σ)

– D: a DTD

– Σ: a set of XML constraints (relative keys, foreign keys)

Attribute Integration Grammar (AIG)

– DTD-directed view definition: recursive, nondeterministic

– Inherited and synthesized attributes

Constraint compilation: automatically captures integrity constraints and DTD in a uniform framework

59

XML normal forms

3NF, BCNF?

Extensions of (nested) relational normal forms, via XML FDs

– M. Arenas and L. Libkin. A Normal Form for XML Documents. [PODS 02]. XNFs, decomposition algorithms, complexity, …

– M. Vincent, J. Liu and C. Liu. Strong functional dependencies and their application to normal forms in XML. [TODS 29(3), 2004]

– X. Wu, T.W. Ling, S. Lee, M. Lee, G. Dobbie. NF-SS: A Normal Form for Semistructured Schema. [ER (Workshops) 2001]

60

Research issues for XML normal forms

Implication analysis: more intriguing than relational FDs

Relative functional dependencies: hierarchical nature of XML

“Right” normal form for XML: to prevent update anomalies?

– XML data is often “static”: update anomalies?

– XML data is typically stored in RDBMS

– When XML data is updated, it is done through RDBMS

– Redundancy often helps, e.g., performance and reliability

– Normal form: a right class of constraints to assure “lossless” shredding into relations of certain normal form

Unfortunately, no previous work has studied this

61

Run-time analysis: incremental constraint checking

Input: XML tree T, constraints Σ, update ∆T, where T satisfies Σ

Question: does (T + ∆T) satisfy Σ?

∆X. Code generator: incremental checking. Lucent applications

M. Benedikt, G. Brun, J. Gibson, R. Kuss and A. Ng. Automated update management for XML integrity constraints. [PLANX’02]

Application of incremental techniques for attribute grammars

M. Abrao, B. Bouchou, M. Alves, D. Laurent, M. Musicante. Incremental Constraint Checking for XML Documents [XSym’04]

Research issues:

– Complexity of incremental constraint checking

– XML editors: broken link detection and repair

– Incremental checking techniques for XML data stored in RDBMS
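
The idea of checking only the update ∆T rather than revalidating T + ∆T can be sketched for a single absolute key with a hash index. This is a generic illustration, not the technique of the systems cited above; the class name KeyIndex, its methods and the node identifiers are assumptions.

```python
class KeyIndex:
    """Hash index on key values for tau[X] -> tau, maintained across updates."""

    def __init__(self, attrs):
        self.attrs = tuple(attrs)
        self.index = {}   # key value -> id of the node holding it

    def _value(self, attributes):
        return tuple(attributes.get(a) for a in self.attrs)

    def insert(self, node_id, attributes):
        """Apply an insertion; return False and change nothing if the key would be violated."""
        v = self._value(attributes)
        if v in self.index and self.index[v] != node_id:
            return False
        self.index[v] = node_id
        return True

    def delete(self, node_id, attributes):
        """Apply a deletion; deletions can never violate a key."""
        v = self._value(attributes)
        if self.index.get(v) == node_id:
            del self.index[v]

# Hypothetical updates against the key province.@name -> province.
idx = KeyIndex(["name"])
first = idx.insert(1, {"name": "Limburg"})
second = idx.insert(2, {"name": "Limburg"})   # rejected: duplicate key value
idx.delete(1, {"name": "Limburg"})
third = idx.insert(2, {"name": "Limburg"})    # accepted after the deletion
print(first, second, third)
```

Each update costs one dictionary lookup, independent of the size of T, which is the kind of saving incremental checking aims for; foreign keys would additionally need reference counts on the target values.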

62

Query rewriting and optimization

Query translation from XQuery to SQL: XML data stored in RDBMS

– encode XIGs and XQuery in relational queries and constraints

– extensions of chase and backchase

A. Deutsch and V. Tannen

– Reformulation of XML Queries and Constraints [ICDT’03]

– MARS: A System for Publishing XML from Mixed and Redundant Storage [VLDB’03]

R. Krishnamurthy, R. Kaushik, J. Naughton. Efficient XML-to-SQL Query Translation: Where to Add the Intelligence? [VLDB 2004]

Research issues:

– Rewriting queries over (recursive security) views of XML data

– Query optimization for (compressed) XML data in native store

63

Data cleaning

Input: XML tree T, constraints Σ, DTD D

Question: if T does not satisfy D + Σ, find a repair T’ such that (a) T’ satisfies D + Σ, and (b) the distance between T and T’ is minimal (update operations: insert, delete, modify)

G. Flesca, F. Furfaro, S. Greco, E. Zumpano. Repairs and Consistent Answers for XML Data with Functional Dependencies [XSym’03]

Research issues:

– Effective techniques for repairing integrated XML data: conflicts and inconsistencies may emerge as violations of constraints (various constraint languages, XML Schema)

– Automated tools for repairing Web pages: broken links

64

Summary

Specification of XML constraints:

– absolute vs. relative, path constraints: XML data is hierarchical and semi-structured

– mild extensions of relational constraints are not sufficient

Consistency and implication analysis of XML constraints

– DTDs interact with XML constraints

– far more intricate than their relational counterparts

Applications of XML constraints

– XML storage, query, update, integration, cleaning, …

– many practical issues remain to be explored

65

References

In addition to the papers mentioned earlier:

P. Buneman, S. Davidson, W. Fan, C. Hara, W. Tan. Keys for XML. Computer Networks, 39(5), August 2002, pp. 473-487.

Wenfei Fan and Leonid Libkin. On XML Integrity Constraints in the Presence of DTDs. Journal of the ACM (JACM), 49(3), pp. 368-406, May 2002.

Marcelo Arenas, Wenfei Fan and Leonid Libkin. On Verifying Consistency of XML Specifications. PODS 2002.

Marcelo Arenas, Wenfei Fan and Leonid Libkin. What's Hard about XML Schema Constraints? DEXA 2002.

66

References

Susan Davidson, Wenfei Fan, and Carmem Hara. Propagating XML Constraints to Relations. JCSS, 73(3):316-361, May 2007.

M. Benedikt, C. Chan, W. Fan, J. Freire, and R. Rastogi. Capturing both Types and Constraints in Data Integration. SIGMOD, 2003.

Wenfei Fan. XML Constraints: Specification, Analysis, and Applications. LAAIC, 2005.

Alin Deutsch, Val Tannen. Containment and Integrity Constraints for XPath. KRDB 2001.