Optimal Probabilistic Generators for XML Collections
Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart
[ICDT 2012]
Adding probabilities to an XML Schema
• XML schemas are useful for describing the structures of XML documents.
– E.g., DTD or XSD
• Schemas may be very general (e.g., xhtml, RSS)
• We want to add probabilities that reflect the likelihood of different parts of the schema
– We will use the probabilities to turn the schema into a probabilistic generative model for XML documents
– In particular, we want them to maximize the likelihood of a given XML document or document collection
Motivation
One Application: XML Auto-Completion [SIGMOD 2012]
• Based on previous document versions / corpus of example documents
• Suggest nodes / sub-trees / node values to the user
• For example:
• Challenges:
– Allow editing every part of the document
– What kind of completion to suggest?
– Finding the top-k best completions
<MyPapers>
  <Paper>
    <title>XML for Beginners</title>
    <author>M. Jones</author>
    <author>H. Q. David</author>
    <author>L. Martin</author>
    <author>S. Smith</author>
  </Paper>
  <Paper>
    <title>Advanced XML</title>
    <author>M. Jones</author>
    <author>J. E. Peterson</author>
    <author>G. L. Williams</author>
  </Paper>
  <Paper>
    <title> </title>
    <author> </author>
    <author> </author>
    <author> </author>
  </Paper>
</MyPapers>
Many Other Usages for a Probabilistic Schema
...
• Testing – e.g., generating many XML messages to simulate network load and test system performance.
• Explaining – e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled for BibTeX, etc.
• Schema Evaluation – how well a given schema describes a given corpus.
Our solution - An Outline
Preliminaries – Tree Automata
Generators for Schemas without Constraints
Restart Generators
Continuation-Test Generators
Leaf Values
Adding Constraints
Schema as a Deterministic Tree Automaton
Preliminaries
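• An XML document is modeled as an ordered tree.
• Schema validation: the children of an a-labeled node are accepted by DFA Aa
• Validation is performed for the children of every inner node.
• Automaton Ar: (L(Ar) = a*bc*$)
[Figure: DFA Ar with states q0, q1, q2 and transitions labeled a, b, c, $]
• Document d0:
[Figure: tree with root r and children a, b, c, followed by the end marker $; leaf values shown: abcd, abcd, 532]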
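As a rough illustration of the validation step, the sketch below encodes Ar as a transition table and runs it over the children of a node. The state names q0, q1, q2 come from the slide, but the exact transitions are inferred from L(Ar) = a*bc*$ because the original figure is not available; the names DELTA, ACCEPTING and accepts are invented for this example.

```python
# Transition table for A_r, inferred from L(A_r) = a*bc*$ (an assumption).
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}
ACCEPTING = {"q2"}


def accepts(children, delta=DELTA, start="q0", accepting=ACCEPTING):
    """Return True if the child labels, followed by the end marker '$', are accepted."""
    state = start
    for symbol in list(children) + ["$"]:
        state = delta.get((state, symbol))
        if state is None:  # no transition defined: the word is rejected
            return False
    return state in accepting


print(accepts(["a", "b", "c"]))  # children of r in d0
print(accepts(["a", "c"]))       # the mandatory b is missing
```

In the full model this check would be repeated for every inner node, using the automaton that belongs to that node's label.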
Using the Schema as a Generator
• Recall that we want to turn the schema from an acceptor into a probabilistic generative model.
• Straightforward nondeterministic generator: repeatedly choose an accepting run for a node's automaton, and generate children accordingly.
• Adding probabilities: we consider two problem settings
1. Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus.
2. Additionally, imposing integrity constraints on the documents (e.g., key constraints)
Probabilistic Generator
• Each transition is assigned a probability
• We assume independent choices (a Markovian process), so the probability of a document is the product of the probabilities of the transitions used to generate it.
• In this case, Pr(d) = pa · pa · pb · p$
• The schema and generator ignore leaf values (for now!)
Without Constraints
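[Figure: automaton Ar (states q0, q1, q2) with transition probabilities pa, pb, pc, p$; document d with root r and children a, a, b, followed by the end marker $]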
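The product rule can be sketched in a few lines. The numeric probabilities below are hypothetical, chosen only so that the outgoing probabilities of each state sum to 1, and the transition table is inferred from L(Ar) = a*bc*$; PROBS, NEXT and children_probability are names made up for this illustration.

```python
from math import prod

# Hypothetical probabilities (not from the talk); each state's outgoing
# probabilities sum to 1.
PROBS = {
    ("q0", "a"): 0.5,
    ("q0", "b"): 0.5,
    ("q1", "c"): 0.25,
    ("q1", "$"): 0.75,
}
# Target states, inferred from L(A_r) = a*bc*$ (an assumption).
NEXT = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def children_probability(children, probs=PROBS, nxt=NEXT, start="q0"):
    """Multiply the probabilities of the transitions used to emit `children` and '$'."""
    state, used = start, []
    for symbol in list(children) + ["$"]:
        used.append(probs[(state, symbol)])
        state = nxt[(state, symbol)]
    return prod(used)


# Children a, a, b of the root: pa * pa * pb * p$
print(children_probability(["a", "a", "b"]))
```

This covers the children of a single node; under the independence assumption, the probability of a whole document would be the product of such terms over all of its inner nodes.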
Formal Problem Definition
• Given a corpus D of documents,
• and a deterministic schema S that accepts every document in D,
• we want to find an optimal generator based on S:
– Find probabilities for the transitions of S that maximize the probability of generating D,
– i.e., the maximum likelihood estimator (MLE).
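One way to write this objective down, as a sketch only: the notation p(q, σ) for the probability of the transition leaving state q on symbol σ is introduced here for illustration and is not taken from the slides.

```latex
\hat{p} \;=\; \arg\max_{p}\; \prod_{d \in D} \Pr\nolimits_{p}(d)
\qquad \text{subject to} \qquad
\sum_{\sigma} p(q,\sigma) = 1 \quad \text{for every state } q \text{ of } S .
```

Here \Pr_p(d) is the product of the probabilities of the transitions used when generating d, in line with the independence assumption stated earlier.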
A Learning Algorithm
The frequency of using each transition during the corpus verification process is recorded.
[Figure: automaton Ar (states q0, q1, q2) validating the children a, b, c of root r]
Transition   Count
(q0, a)      1
(q0, b)      1
(q1, c)      1
(q1, $)      1
An Algorithm for Learning Probabilities (Cont.)
• This is repeated for every node in every corpus document.
• We set the probability of each transition to be its relative frequency.
Transition   Count   Probability
(q0, a)      1       1/2
(q0, b)      1       1/2
(q1, c)      1       1/2
(q1, $)      1       1/2
Theorem: This efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator.
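The counting procedure can be sketched as follows for a single automaton. The transition table is inferred from L(Ar) = a*bc*$, and learn_probabilities is a name invented for this example; a fuller implementation would keep one table per label and walk every inner node of every document.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Transition table for A_r, inferred from L(A_r) = a*bc*$ (an assumption).
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def learn_probabilities(child_sequences, delta=DELTA, start="q0"):
    """Set each transition's probability to its relative frequency in the corpus."""
    counts = Counter()
    for children in child_sequences:
        state = start
        for symbol in list(children) + ["$"]:
            counts[(state, symbol)] += 1  # record the transition used
            state = delta[(state, symbol)]
    totals = defaultdict(int)  # number of transitions taken out of each state
    for (state, _), n in counts.items():
        totals[state] += n
    return {t: Fraction(n, totals[t[0]]) for t, n in counts.items()}


# Corpus consisting of the root of d0, whose children are a, b, c.
for transition, p in learn_probabilities([["a", "b", "c"]]).items():
    print(transition, p)
```

Exact fractions are used so that the learned values can be compared directly with the 1/2 entries in the table above.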
Termination
• Theorem: generation terminates with probability 1.
– Guaranteed only because of the choice of probabilities according to the corpus.
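To make the generation step concrete, here is a minimal sampling sketch for the children of one node. It assumes the 1/2 probabilities learned from d0 and a transition table inferred from L(Ar) = a*bc*$; generate_children is a hypothetical helper, and recursion into the generated children is left out.

```python
import random

# Transition table inferred from L(A_r) = a*bc*$ (an assumption).
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}
# Probabilities as learned from the single document d0.
PROBS = {t: 0.5 for t in DELTA}


def generate_children(probs=PROBS, delta=DELTA, start="q0", rng=random):
    """Sample a child sequence by following transitions until '$' is drawn."""
    state, children = start, []
    while True:
        options = [(sym, p) for (q, sym), p in probs.items() if q == state]
        symbols, weights = zip(*options)
        symbol = rng.choices(symbols, weights=weights, k=1)[0]
        if symbol == "$":  # end marker: the child list is complete
            return children
        children.append(symbol)
        state = delta[(state, symbol)]


print(generate_children(rng=random.Random(0)))
```

With these particular probabilities each loop is left with probability 1/2 at every step, so the sampling loop ends with probability 1; this illustrates the claim for one automaton and is not a proof of the theorem.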
Integrity Constraints
• We want to support integrity constraints, which are used in XML schema languages.
• Key Constraint: a-labeled leaves have unique values (unary key)
• Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves
• Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain
Adding Constraints
New Problem
• We want to find optimal generators for XML schemas with constraints.
• Valid generator output: an XML document
1. that is accepted by the schema, and
2. for which there exists a valid leaf value assignment, i.e., one that does not violate the constraints
– Example: a, b, c are unique and contain each other
[Figure: example document trees rooted at r with a-, b- and c-labeled leaves, one of them with children a, a, b, c]
Restart Generators
• A simple idea:
– Use a probabilistic generator to generate a document
– Check if it has a value assignment valid w.r.t. the constraints
– If not, 'restart' and try again until a valid document is generated
• Proposition: Given a document with no values, checking for the existence of a valid value assignment is in PTIME
– Proof: By translating the constraints to bounds on the number of unique values for each leaf label
• Bad news: the number of restarts can be unboundedly large in an optimal generator
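The restart loop and a simplified validity check might look like the sketch below. The check is not the paper's algorithm: it only follows the stated proof idea (bounds on the number of unique values per label), and it assumes that the only constraints are unary keys and inclusions and that value domains are unbounded. The names has_valid_assignment and restart_generate, and the max_restarts budget, are invented for this example.

```python
from collections import Counter


def has_valid_assignment(leaf_labels, keys, inclusions):
    """Simplified check: keys and inclusions only, unbounded domains (assumptions).

    `inclusions` holds pairs (sub, sup): values of sub-leaves must occur among sup-leaves.
    """
    counts = Counter(leaf_labels)
    labels = set(counts) | set(keys) | {l for pair in inclusions for l in pair}
    # Lower bound on the number of distinct values each label must carry.
    low = {l: counts[l] if l in keys else min(counts[l], 1) for l in labels}
    changed = True
    while changed:  # propagate lower bounds along inclusions until stable
        changed = False
        for sub, sup in inclusions:
            if low[sup] < low[sub]:
                low[sup] = low[sub]
                changed = True
    # A label cannot carry more distinct values than it has leaves.
    return all(low[l] <= counts[l] for l in labels)


def restart_generate(generate, is_valid, max_restarts=1000):
    """Call `generate` until `is_valid` accepts; return the document and restart count."""
    for restarts in range(max_restarts + 1):
        doc = generate()
        if is_valid(doc):
            return doc, restarts
    raise RuntimeError("no valid document within the restart budget")


# Slide example: a, b, c are unique and contain each other.
KEYS = {"a", "b", "c"}
INCLUSIONS = [("a", "b"), ("b", "c"), ("c", "a")]
print(has_valid_assignment(["a", "a", "b", "c"], KEYS, INCLUSIONS))
print(has_valid_assignment(["a", "b", "c"], KEYS, INCLUSIONS))
```

The restart budget is a practical guard added here because, as noted above, the number of restarts has no upper bound in general.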
Continuation-test Generators
• Never make choices that lead to a 'dead end', thus always generate a valid document.
• We use a binary test to check if a choice has a continuation.
• Example: add to the schema of d0 the constraints:
– c is included in a
– c is unique
• The generation process:
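[Figure: automaton Ar (states q0, q1, q2) with transition probabilities pa, pb, pc, p$; document d0 with root r and children a, b, c]
– The constraints imply |c| ≤ |a|
– Perform a continuation-test before taking the transition
– Pr(d) = pa · pb · pc · 1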
Learning Algorithm for Continuation-test Generators
• The probabilities are again relative frequencies, but only in cases where there was an alternative choice.
• The learned generator will generate as many c-s as a-s
Transition   Count   Probability
(q0, a)      1       1/2
(q0, b)      1       1/2
(q1, c)      1       1/1
(q1, $)      0       0/1
(q1, $) was chosen only when (q1, c) was not available.
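The counting rule can be sketched for this specific example. The continuation test below is a toy stand-in hard-coded for the two constraints "c is included in a" and "c is unique" (which imply |c| ≤ |a|); it is not the general test, which is NP-complete in the schema size. The transition table is inferred from L(Ar) = a*bc*$, and all function names are invented for this illustration.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Transition table inferred from L(A_r) = a*bc*$ (an assumption).
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def has_continuation(prefix, symbol):
    """Toy test: another c is allowed only while it keeps |c| <= |a|."""
    if symbol == "c":
        return prefix.count("c") < prefix.count("a")
    return True


def learn_continuation_test(child_sequences, delta=DELTA, start="q0"):
    """Relative frequencies, counted only at steps where an alternative was available."""
    chosen = Counter()               # transition taken while an alternative existed
    free_choices = defaultdict(int)  # per state: steps with more than one option
    for children in child_sequences:
        state, prefix = start, []
        for symbol in list(children) + ["$"]:
            available = [s for (q, s) in delta
                         if q == state and has_continuation(prefix, s)]
            if len(available) > 1:
                chosen[(state, symbol)] += 1
                free_choices[state] += 1
            prefix.append(symbol)
            state = delta[(state, symbol)]
    return {(q, s): Fraction(chosen[(q, s)], free_choices[q])
            for (q, s) in delta if free_choices[q] > 0}


for transition, p in learn_continuation_test([["a", "b", "c"]]).items():
    print(transition, p)
```

On d0 the final $ is taken when a second c is no longer allowed, so that step is not counted, which reproduces the 1/1 and 0/1 entries in the table.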
Results for Continuation-test Generators
• Theorem: The algorithm learns an optimal continuation-test generator, for automata with binary choices.
– Extensions to non-binary are discussed in the paper
• Theorem: Continuation-test is NP-Complete
– But only in the size of the schema; it is polynomial in the document size
– Both generation and finding the optimal generator are polynomial when using a continuation-test oracle.
– Based on schema satisfiability test [David et al. 2011]
• Theorem: probability of termination for a continuation-test generator may be arbitrarily small!
– Proof: by construction of a simple, non-recursive schema
– Can be handled by adding a constraint on the document size.
– Sub-classes of schemas that guarantee termination?
Adding Values to the Structure
• So far our generators were used only for the document structure
• Leaf values may also have a distribution according to which they can be generated
– The distribution may be learned from the same document collection
• We will focus on the interesting case – generating leaf values for a schema with constraints
Leaf Values
Suggested Algorithm
• We start with a valid document skeleton
• Order labels by inclusion constraints (e.g., c, b, a)
• Choose a leaf with the 'smallest' (most included) label, together with leaves of the labels that include it
• Draw a value (from the domain) according to a given distribution.
• Use the PTIME test to verify validity; if the document is not valid, revert the step
• Improvements are presented in the paper
[Figure: document with root r and children a, b, c, followed by the end marker $; leaf values shown: abcd, abcd, efg]
Related Work
• Schema Satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]
• Probabilistic XML and Probabilistic Schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]
• Probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]
• Schema Inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]
• AXML [Abiteboul, Benjelloun & Milo 2008]
• PCFGs [e.g., Chi & Geman 1998]
Summary
Conclusion
• A model for probabilistic XML generators
• Unconstrained case
– Generation and learning optimal generators can be done efficiently
– Termination is guaranteed
• Constrained case
– Restart generators
  • # of restarts is unbounded
– Continuation-test generators
  • Generation and learning optimal generators are expensive
  • Termination is not guaranteed
• Leaf value generation
• In the talk labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled.
• Future work
– More efficient combinations of restart and continuation-test generators
Thank You!
Q&A