Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer , Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]

Adding probabilities to an XML Schema

• XML schemas are useful for describing the structures of XML documents.

– E.g., DTD or XSD

• Schemas may be very general (e.g., xhtml, RSS)

• We want to add probabilities that reflect the likelihood of different parts of the schema

– We will use the probabilities to turn the schema into a probabilistic generative model for XML documents

– In particular, we want them to maximize the likelihood of a given XML document or document collection

One Application: XML Auto-Completion [SIGMOD 2012]

• Based on previous document versions / corpus of example documents

• Suggest nodes / sub-trees / node values to the user

• For example:

• Challenges:

– Allow editing every part of the document

– What kind of completion to suggest?

– Finding the top-k best completions

<MyPapers>
  <Paper>
    <title>XML for Beginners</title>
    <author>M. Jones</author>
    <author>H. Q. David</author>
    <author>L. Martin</author>
    <author>S. Smith</author>
  </Paper>
  <Paper>
    <title>Advanced XML</title>
    <author>M. Jones</author>
    <author>J. E. Peterson</author>
    <author>G. L. Williams</author>
  </Paper>
  <Paper>
    <title> </title>
    <author> </author>
    <author> </author>
    <author> </author>
  </Paper>
</MyPapers>

Many Other Usages for a Probabilistic Schema

...

• Testing – e.g., generating many XML messages to simulate network load and test system performance.

• Explaining – e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled for BibTeX, etc.

• Schema Evaluation – how well a given schema describes a given corpus.

Our solution - An Outline

Preliminaries – Tree Automata

Generators for Schemas without Constraints

Restart Generators

Continuation-Test Generators

Leaf Values

Adding Constraints

Schema as a Deterministic Tree Automaton

An XML document is modeled as an ordered tree.

Schema validation: the children of an a-labeled node must be accepted by the DFA Aa.

Validation is performed for the children of every inner node.

Automaton Ar, with L(Ar) = a*bc*$:

[Figure: DFA Ar with states q0, q1, q2 and transitions q0 -a-> q0, q0 -b-> q1, q1 -c-> q1, q1 -$-> q2]

Document d0:

[Figure: tree with root r and children a, b, c, terminated by $; leaf values shown as abcd, abcd, 532]
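The validation procedure described on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the dictionary encoding of Ar, and the rule that labels without a DFA must be leaves are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


# Assumed encoding of Ar: (state, symbol) -> next state, for L(Ar) = a*bc*$.
A_R = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}

# One DFA per label: (transitions, initial state, accepting states).
SCHEMA = {"r": (A_R, "q0", {"q2"})}


def accepts(dfa, word):
    """Run the DFA on the word; reject as soon as a transition is missing."""
    transitions, state, accepting = dfa
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


def validate(node, schema):
    """Check the children of every inner node against the DFA of its label."""
    dfa = schema.get(node.label)
    if dfa is None:
        # Assumption: a label without a DFA may only appear on leaves.
        return not node.children
    word = [child.label for child in node.children] + ["$"]
    return accepts(dfa, word) and all(validate(c, schema) for c in node.children)


d0 = Node("r", [Node("a"), Node("b"), Node("c")])
is_d0_valid = validate(d0, SCHEMA)
```

Here the end marker `$` is appended to the sequence of child labels before running the automaton, mirroring the `$` transition of Ar.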

Using the Schema as a Generator

• Recall that we want to turn the schema from an acceptor into a probabilistic generative model.

• Straightforward nondeterministic generator: repeatedly choose an accepting run for a node's automaton, and generate children accordingly.

• Adding probabilities: we consider two problem settings

1. Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus.

2. Additionally, imposing integrity constraints on the documents (e.g., key constraints)

Probabilistic Generator

• Each transition is assigned a probability

• We assume independent choices (a Markovian process); thus the document probability is the product of the probabilities of the transitions taken.

• In this case, Pr(d) = pa ∙ pa ∙ pb ∙ p$

• The schema and generator ignore leaf values (for now!)

[Figure: automaton Ar (states q0, q1, q2) with transition probabilities pa for a, pb for b, pc for c and p$ for $; example document d with root r and children a, a, b]
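As a small worked example of the product formula, the sketch below computes the probability of the children word of one node. The numeric probabilities are made-up values chosen for illustration; the slides only name them pa, pb, pc and p$. For a whole document, the same product would be taken over every inner node.

```python
from math import prod

# Assumed encoding: (state, symbol) -> (next state, probability).
# The numbers are illustrative only; each state's probabilities sum to 1.
TRANSITIONS = {
    ("q0", "a"): ("q0", 0.5),    # pa
    ("q0", "b"): ("q1", 0.5),    # pb
    ("q1", "c"): ("q1", 0.25),   # pc
    ("q1", "$"): ("q2", 0.75),   # p$
}


def word_probability(word, transitions, start="q0"):
    """Multiply the probabilities of the transitions used to read the word."""
    state, factors = start, []
    for symbol in word:
        state, p = transitions[(state, symbol)]
        factors.append(p)
    return prod(factors)


# Children of r in the example document d: a, a, b, then the end marker $.
pr_d = word_probability(["a", "a", "b", "$"], TRANSITIONS)  # pa * pa * pb * p$
```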

Formal Problem Definition

• Given a corpus D of documents,

• and a deterministic schema S that accepts every document in D,

• we want to find an optimal generator based on S:

– Find probabilities for the transitions of S that maximize the probability of generating D,

– i.e., the maximum likelihood estimator (MLE).

A Learning Algorithm

The frequency of using each transition during the corpus verification process is recorded.

[Figure: validating the children a, b, c, $ of root r against Ar (states q0, q1, q2)]

Transition  Count
(q0, a)     1
(q0, b)     1
(q1, c)     1
(q1, $)     1

An Algorithm for Probabilities Learning (Cont.)

This is repeated for every node in every corpus document. We set the probability of each transition to be its relative frequency among the transitions taken from the same state.

Transition  Probability
(q0, a)     1/2
(q0, b)     1/2
(q1, c)     1/2
(q1, $)     1/2

Theorem: This efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator.
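The counting scheme can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: the corpus is reduced to the children words of the validated nodes, and the dictionary encoding of Ar and the function name `learn_probabilities` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Assumed encoding of Ar: (state, symbol) -> next state.
A_R = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def learn_probabilities(words, transitions, start="q0"):
    """Count transition uses while validating, then normalize per source state."""
    counts = Counter()
    for word in words:
        state = start
        for symbol in word:
            counts[(state, symbol)] += 1
            state = transitions[(state, symbol)]
    totals = Counter()
    for (state, _symbol), n in counts.items():
        totals[state] += n
    # Transitions never used in the corpus get no entry (probability 0).
    return {t: Fraction(n, totals[t[0]]) for t, n in counts.items()}


# Corpus consisting of d0 only: the children of r are a, b, c, then $.
probs = learn_probabilities([["a", "b", "c", "$"]], A_R)
```

On this one-document corpus every transition gets probability 1/2, matching the table above.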

Termination

• Theorem: generation terminates with probability 1.

– Guaranteed only because of the choice of probabilities according to the corpus.
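To make the generation process concrete, here is a minimal sampling sketch for the children of a single r node, using the probabilities learned from d0 (1/2 per transition). The table layout and the function name are assumptions. In this non-recursive example each loop is left with probability 1/2 at every step, so termination with probability 1 is immediate; the theorem above concerns the general, possibly recursive case.

```python
import random

# Assumed encoding: state -> list of (symbol, next state, probability).
PROBS = {
    "q0": [("a", "q0", 0.5), ("b", "q1", 0.5)],
    "q1": [("c", "q1", 0.5), ("$", "q2", 0.5)],
}


def generate_children(probs, rng, start="q0", final="q2"):
    """Walk the automaton, choosing each transition independently at random."""
    state, word = start, []
    while state != final:
        options = probs[state]
        symbol, state = rng.choices(
            [(s, t) for s, t, _p in options],
            weights=[p for _s, _t, p in options],
        )[0]
        word.append(symbol)
    return word


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
sample = generate_children(PROBS, rng)
```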

Integrity Constraints

• We want to support integrity constraints, which are used in XML schema languages.

• Key Constraint: a-labeled leaves have unique values (unary key)

• Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves

• Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain

New Problem

• We want to find optimal generators for XML schemas with constraints.

• Valid generator output: an XML document that

1. is accepted by the schema, and

2. admits a valid leaf value assignment, i.e., one that does not violate the constraints

– Example: a, b, c are unique and contain each other (so a valid document must have equally many a-, b- and c-labeled leaves)

[Figure: two example document trees rooted at r with a-, b- and c-labeled children]

Restart Generators

• A simple idea:

– Use a probabilistic generator to generate a document

– Check if it has a value assignment valid w.r.t. the constraints

– If not, 'restart' and try again until a valid document is generated

• Proposition: Given a document with no values, checking for the existence of a valid value assignment is in PTIME

– Proof: By translating the constraints to bounds on the number of unique values for each leaf label

• Bad news: the number of restarts can be unboundedly large in an optimal generator
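The restart loop itself is simple, as the following sketch suggests. Everything concrete in it is an assumption for illustration: the sampler and its probabilities, and in particular `toy_is_valid`, which stands in for the PTIME check by assuming the constraints "c is unique" and "c is included in a" (so the number of c-leaves may not exceed the number of a-leaves).

```python
import random

# Assumed generator for the children of r (language a*bc*$), 1/2 per transition.
PROBS = {
    "q0": [("a", "q0", 0.5), ("b", "q1", 0.5)],
    "q1": [("c", "q1", 0.5), ("$", "q2", 0.5)],
}


def sample_word(rng, start="q0", final="q2"):
    """Generate one children word without looking at the constraints."""
    state, word = start, []
    while state != final:
        options = PROBS[state]
        symbol, state = rng.choices(
            [(s, t) for s, t, _p in options],
            weights=[p for _s, _t, p in options],
        )[0]
        word.append(symbol)
    return word


def toy_is_valid(word):
    """Stand-in for the PTIME check: a bound on the number of c-leaves."""
    return word.count("c") <= word.count("a")


def restart_generate(generate, is_valid):
    """Generate, test, and restart until a valid document is produced."""
    restarts = 0
    while True:
        candidate = generate()
        if is_valid(candidate):
            return candidate, restarts
        restarts += 1


rng = random.Random(7)  # fixed seed only to make the sketch reproducible
word, restarts = restart_generate(lambda: sample_word(rng), toy_is_valid)
```

The returned restart count makes the "bad news" observable: nothing in the loop bounds it.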

Continuation-test Generators

• Never make choices that lead to a 'dead end', thus always generate a valid document.

• We use a binary test to check if a choice has a continuation.

• Example: add to the schema of d0 the constraints:

– c is included in a

– c is unique

Together these imply |c| ≤ |a|.

• The generation process: perform a continuation-test before taking each transition.

[Figure: automaton Ar (states q0, q1, q2) with transition probabilities pa, pb, pc, p$, generating document d0 with root r and children a, b, c]

Pr(d) = pa ∙ pb ∙ pc ∙ 1, where the final factor is 1 because after one c the only transition with a continuation is $.

Learning Algorithm for Continuation-test Generators

• The probabilities are again relative frequencies, but only in cases where there was an alternative choice.

• The learned generator will generate as many c-s as a-s

Transition  Relative frequency
(q0, a)     1/2
(q0, b)     1/2
(q1, c)     1/1
(q1, $)     0/1

(q1, $) was chosen only when (q1, c) was not available.
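The modified counting can be sketched as below. This is an illustration under assumptions, not the paper's algorithm verbatim: `has_continuation` is a toy test that is only correct for the automaton Ar with the constraints "c is unique" and "c is included in a" (no a can be produced once a c has appeared, so a prefix can be completed exactly when it has at most as many c-s as a-s), and the encoding of Ar is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Assumed encoding of Ar: (state, symbol) -> next state.
A_R = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def has_continuation(prefix):
    """Toy continuation test for Ar under the constraint |c| <= |a|."""
    return prefix.count("c") <= prefix.count("a")


def learn_with_continuation_test(words, transitions, test, start="q0"):
    """Relative frequencies, counted only where an alternative was available."""
    chosen, offered = Counter(), Counter()
    for word in words:
        state, prefix = start, []
        for symbol in word:
            available = [
                s for (q, s) in transitions
                if q == state and test(prefix + [s])
            ]
            if len(available) > 1:  # a forced choice is not counted
                for s in available:
                    offered[(state, s)] += 1
                chosen[(state, symbol)] += 1
            prefix.append(symbol)
            state = transitions[(state, symbol)]
    return {t: Fraction(chosen[t], n) for t, n in offered.items()}


# Corpus consisting of d0 only: the children of r are a, b, c, then $.
probs = learn_with_continuation_test([["a", "b", "c", "$"]], A_R, has_continuation)
```

On d0 the final `$` is a forced choice (a second c would violate |c| ≤ |a|), so it is not counted, which reproduces the frequencies 1/2, 1/2, 1/1 and 0/1 from the table.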

Results for Continuation-test Generators

• Theorem: The algorithm learns an optimal continuation-test generator, for automata with binary choices.

– Extensions to non-binary choices are discussed in the paper

• Theorem: Continuation-test is NP-complete

– But only in the size of the schema; it is polynomial in the document size

– Both generation and finding the optimal generator are polynomial when using a continuation-test oracle.

– Based on schema satisfiability test [David et al. 2011]

• Theorem: The probability of termination for a continuation-test generator may be arbitrarily small!

– Proof: by construction of a simple, non-recursive schema

– Can be handled by adding a constraint on the document size.

– Sub-classes of schemas that guarantee termination?

Adding Values to the Structure

• So far our generators were used only for the document structure

• Leaf values may also have a distribution according to which they can be generated

– The distribution may be learned from the same document collection

• We will focus on the interesting case – generating leaf values for a schema with constraints

Suggested Algorithm

• We start with a valid document skeleton

• Order labels by inclusion constraints (e.g., c, b, a)

• Choose a leaf from the 'smallest' (most included) label, and including leaves

• Draw a value (from the domain) according to a given distribution.

• Use the PTIME test to verify validity; if it fails, revert the step

• Improvements presented in the paper

[Figure: document skeleton with root r and children a, b, c, with leaf values abcd, abcd, efg]

Related Work

• Schema Satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]

• Probabilistic XML and Probabilistic Schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]

• Probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]

• Schema Inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]

• AXML [Abiteboul, Benjelloun & Milo 2008]

• PCFGs [e.g., Chi & Geman 1998]

Conclusion

• A model for probabilistic XML generators

• Unconstrained case

– Generation and learning optimal generators can be done efficiently

– Termination is guaranteed

• Constrained case

– Restart generator

    • # of restarts is unbounded

– Continuation-test generators

    • Generation and learning optimal generators are expensive

    • Termination is not guaranteed

• Leaf Value generation

• In the talk labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled.

• Future work

– More efficient combinations of restart and continuation-test generators

Thank You!

Q&A