Early Profile Pruning on XML-aware Publish-Subscribe Systems
04/22/23 1
Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras
University of California, Riverside
Overview
  Motivation
  Bottom-up Filtering FSM (BUFF)
  Bounding-based XML Filtering (BoXFilter)
    Core Modules
    Filtering algorithms
  Experimental results
Motivation
Publish-subscribe systems: message transmission is determined by the message content.
Examples: notification websites such as hotwire.com or ticketmaster.com.
[Figure: four subscribers, each of which submits, updates, and deletes a profile and receives results; four publishers, each of which sends documents; a matching algorithm sits between subscribers and publishers.]
Publish-subscribe systems
The data is exchanged in XML format, and each document is viewed as a tree.
Nodes correspond to elements, attributes, or text values.
Edges represent immediate element-subelement or element-value relationships.
<Bib>
  <article vol="7" no="11">
    <title>t1</title>
    <author>
      <last>DeWitt</last><mi>J</mi><first>David</first>
    </author>
    <journal>TPDS</journal>
    <year>1996</year>
  </article>
  <article>
    <title>t2</title>
    <author>
      <last>Florescu</last><first>Daniela</first>
    </author>
    <proceedings>SIGMOD</proceedings>
    <year>2006</year>
  </article>
</Bib>
[Figure: (a) Document (the XML listing); (b) Tree representation: Bib is the root with two article subtrees; the elements, the attributes vol and no, and text values such as DeWitt, TPDS, and 1996 appear as nodes.]
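As a rough illustration of the node and edge model, the sketch below parses a shortened copy of the first article and lists the tree edges. The function name tree_edges and the "@" prefix for attribute nodes are choices made for this example only; they do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the first article from the slide's document.
DOC = """<Bib><article vol="7" no="11">
<title>t1</title><author><last>DeWitt</last><first>David</first></author>
<year>1996</year></article></Bib>"""

def tree_edges(elem):
    """Yield (parent, child) label pairs for one element subtree.

    Attributes are shown as '@name' (a naming choice for this sketch);
    attribute values and element text become leaf nodes, which gives
    the element-value edges.
    """
    for name, value in elem.attrib.items():
        yield (elem.tag, "@" + name)
        yield ("@" + name, value)
    text = (elem.text or "").strip()
    if text:
        yield (elem.tag, text)
    for child in elem:
        yield (elem.tag, child.tag)      # element-subelement edge
        yield from tree_edges(child)

edges = list(tree_edges(ET.fromstring(DOC)))
for parent, child in edges:
    print(parent, "->", child)
```

Labels alone do not identify nodes uniquely (two article nodes share one label), so a real filter would number the nodes as well.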
Publish-subscribe systems (cont.)
The user profiles are expressed in an XML query language (XPath, XQuery).
An XML query contains:
  structural constraints
  value-based constraints
Example: //article[/author[@last="Smith"]]//procs[@conf="VLDB"]
[Figure: tree pattern rooted at article, with the branches author/last and proceedings/conf.]
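A minimal sketch of what such a profile asks for, assuming the reading "some article has a child author with last="Smith" and a descendant procs with conf="VLDB"". The function matches_profile and the two test documents are hypothetical. ElementTree supports only a subset of XPath, so the nested predicate is checked in two steps.

```python
import xml.etree.ElementTree as ET

def matches_profile(doc_xml):
    """True if some article has a child author with last="Smith"
    and a descendant procs with conf="VLDB" (assumed reading of the profile)."""
    root = ET.fromstring(doc_xml)
    for article in root.iter("article"):
        # Structural constraint (child) plus value-based constraint (@last).
        has_author = article.find("author[@last='Smith']") is not None
        # Structural constraint (descendant) plus value-based constraint (@conf).
        has_procs = article.find(".//procs[@conf='VLDB']") is not None
        if has_author and has_procs:
            return True
    return False

# Hypothetical documents, written to fit the attribute-based profile.
HIT = ('<Bib><article><author last="Smith"/>'
       '<venue><procs conf="VLDB"/></venue></article></Bib>')
MISS = ('<Bib><article><author last="Smith"/>'
        '<procs conf="SIGMOD"/></article></Bib>')

print(matches_profile(HIT), matches_profile(MISS))
```

Evaluating every profile this way costs one pass per profile per document, which is the cost the filtering techniques in the following slides aim to avoid.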
Related Work/Our Contribution
Current work:
  Construction of the overlay network
  Dissemination/indexing of profiles (queries)
  Processing of the stream of messages
We focus on the matching process that takes place within a broker:
  Improve the performance of a regular FSM by using a bottom-up evaluation of the document
  Develop an index-based filtering technique that performs early pruning of the query profiles
Bottom-up vs. Top-down filtering
State machines are among the most common methods for the XML matching process.
Top-down approach (document order, i.e., a depth-first traversal): the state machine advances for each XML element (or attribute) read. It does not consider any form of early pruning.
Bottom-up approach: takes into consideration the (usual) fact that an XML document has its more selective elements located at its leaves.
Example
[Figure: (a) Document; (b) Queries; (c) Top-down state machine; (d) Bottom-up state machine. Query labels as they appear in the transcript: Q1: a, b, c, d; Q2: a, c, d; Q3: a, e, f; Q4: a, e, f, h; Q5: e, f, h; Q6: e, g, h. Both machines use states 0 to 14.]
The top-down approach groups the queries according to their common prefixes.
The bottom-up approach groups them according to their common suffixes.
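The two groupings can be sketched with one trie builder applied to the query paths as written (prefix sharing) and reversed (suffix sharing). The sketch assumes that Q1 to Q6 are simple root-to-leaf label paths, which is one reading of the figure; build_machine and count_states are hypothetical helper names.

```python
def new_state():
    return {"next": {}, "accept": []}

def build_machine(paths):
    """Build a trie over label paths; queries sharing a prefix share states."""
    root = new_state()
    for name, path in paths.items():
        state = root
        for label in path:
            state = state["next"].setdefault(label, new_state())
        state["accept"].append(name)
    return root

def count_states(state):
    return 1 + sum(count_states(s) for s in state["next"].values())

# Assumed reading of the slide's queries as simple label paths.
QUERIES = {"Q1": "abcd", "Q2": "acd", "Q3": "aef",
           "Q4": "aefh", "Q5": "efh", "Q6": "egh"}

top_down = build_machine(QUERIES)                                    # common prefixes
bottom_up = build_machine({q: p[::-1] for q, p in QUERIES.items()})  # common suffixes

print(sorted(top_down["next"]), count_states(top_down))
print(sorted(bottom_up["next"]), count_states(bottom_up))
```

In the bottom-up machine, the first transition is taken on a leaf label, so a leaf that no query ends with is rejected after a single lookup.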
BUFF
FSM-based bottom-up approach for XML filtering.
BUFF avoids translating documents and queries to Prüfer sequences (as the other algorithms do) and employs a more direct evaluation algorithm.
The document is parsed through a SAX parser, which triggers events for specific marks (tags) in the XML document.
The machine keeps a runtime stack that stores the current document path being processed.
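The combination of SAX events, a runtime stack, and a suffix-grouped machine can be sketched as follows. This is a simplified illustration, not the BUFF algorithm itself: it assumes queries are simple paths such as a/b/c, read as //a/b/c, and it evaluates the stack on every end tag. The names BottomUpFilter and filter_document are hypothetical.

```python
import xml.sax

class BottomUpFilter(xml.sax.ContentHandler):
    """Match simple paths (a/b/c read as //a/b/c) while parsing with SAX."""

    def __init__(self, queries):
        super().__init__()
        self.stack = []       # runtime stack: tags on the current document path
        self.matched = set()
        # Suffix-grouped machine: each query path is stored reversed.
        self.root = {"next": {}, "accept": []}
        for name, path in queries.items():
            state = self.root
            for label in reversed(path.split("/")):
                state = state["next"].setdefault(
                    label, {"next": {}, "accept": []})
            state["accept"].append(name)

    def startElement(self, name, attrs):
        self.stack.append(name)

    def endElement(self, name):
        # Walk from the closing element up toward the root.
        state = self.root
        for label in reversed(self.stack):
            state = state["next"].get(label)
            if state is None:
                break  # no query has this suffix: stop early
            self.matched.update(state["accept"])
        self.stack.pop()

def filter_document(xml_text, queries):
    handler = BottomUpFilter(queries)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matched

QUERIES = {"Q1": "a/b/c", "Q2": "e/f", "Q3": "b/d"}
print(sorted(filter_document("<a><b><c/></b><e><f/></e></a>", QUERIES)))
```

Branching queries and value predicates would need extra bookkeeping on the stack, which this sketch leaves out.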
BUFF Example
[Figure: (a) Document and BUFF machine for two queries, Q1 and Q2; (b) to (g) the runtime stack after successive SAX events (the transcript preserves </d>, </f>, </e>, and </c>).]
Bounding-based XML Filtering
Two major processes work asynchronously:
  Profile Management
  Profile Matching
[Figure: system components: Profile Manager, Prüfer Sequence, Profile Index (profiles P1, P2, P3), Matching Module, Matching Algorithm. Inputs: profiles (queries) and input documents. Output: matched documents.]
Prüfer Sequence
A unique sequential encoding of a labeled tree.
Algorithm:
  Iteratively remove nodes from the tree until all nodes but the last two have been removed.
  At each iteration, find and remove the leaf with the smallest label and append the label of that leaf's parent to the Prüfer sequence.
Theorem: If a query tree Q is a subgraph of a document tree D, then the Prüfer sequence of Q is a subsequence of the Prüfer sequence of D.
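The construction above can be sketched directly. The sketch assumes nodes are numbered with distinct integers and given as a child-to-parent map; the slide does not state which numbering scheme the theorem relies on, so the example tree and the helper names prufer_sequence and is_subsequence are illustrative only.

```python
import heapq

def prufer_sequence(parent):
    """parent maps each node number to its parent's number (None for the root).

    Repeatedly removes the smallest-numbered leaf and records its parent,
    stopping when two nodes remain, as described on the slide.
    """
    child_count = {node: 0 for node in parent}
    for node, par in parent.items():
        if par is not None:
            child_count[par] += 1
    leaves = [n for n, c in child_count.items() if c == 0]
    heapq.heapify(leaves)
    sequence = []
    remaining = len(parent)
    while remaining > 2:
        leaf = heapq.heappop(leaves)
        par = parent[leaf]
        sequence.append(par)
        remaining -= 1
        child_count[par] -= 1
        if child_count[par] == 0:   # the parent has become a leaf
            heapq.heappush(leaves, par)
    return sequence

def is_subsequence(short, long):
    """True if short can be obtained from long by deleting symbols."""
    it = iter(long)
    return all(symbol in it for symbol in short)

# Hypothetical tree: node 5 is the root, 3 and 4 are its children,
# and 1 and 2 are children of 3.
PARENT = {1: 3, 2: 3, 3: 5, 4: 5, 5: None}
print(prufer_sequence(PARENT))
```

By the theorem, a profile whose sequence fails the subsequence test against the document's sequence cannot match, so the test can serve as a cheap first filter.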
Sequence Envelope
Assume a set of k Prüfer sequences S1, ..., Sk representing user profiles.
We can derive two new sequences:
  Upper bound U: for each position, take the largest element
  Lower bound L: for each position, take the smallest element
L and U form the smallest possible bounding envelope that encloses all members of the set of sequences from above and below.
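A minimal sketch of the envelope computation, assuming equal-length sequences of single-character symbols that are ordered alphabetically (the slide does not state the ordering). The three input sequences are made up for the example.

```python
def envelope(sequences):
    """Return (lower, upper), taken position by position."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    columns = list(zip(*sequences))
    lower = "".join(min(column) for column in columns)  # L: smallest per position
    upper = "".join(max(column) for column in columns)  # U: largest per position
    return lower, upper

# Hypothetical profile sequences.
lower, upper = envelope(["abca", "cdcd", "dbab"])
print(lower, upper)
```

Every input sequence S then satisfies L[i] <= S[i] <= U[i] at each position i, which is what lets the pair stand in for the whole set.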
Example
Assume 3 sequences with 11 symbols each:
abcabababcdcdcdecdcdecdedededebab
Sequence Envelope (cont.)
A sequence envelope can be used as an aggregate of its underlying set of sequences.
BoXFilter Tree
Sequence envelopes can be nested, forming a BoXFilter tree.
Filtering algorithms
The profiles in the system are organized in a BoXFilter tree. Documents are passed through the tree.
There are two variations of the filtering algorithm:
  Sequential: documents are processed one by one
  Batch processing: documents are organized in a tree like the queries, and both trees are joined
After the traversal of the BoXFilter tree, there is a verification step.
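The sequential variant can be sketched as a filter-then-verify traversal. Several parts are assumptions made for illustration and are not taken from the slides: the tree is built by sorting the sequences and grouping them with a fixed fan-out, the pruning test (may_contain_match) is a simple necessary condition chosen for this sketch, and verification is a plain subsequence test. Profiles are assumed to be non-empty, equal-length strings.

```python
def is_subsequence(short, long):
    it = iter(long)
    return all(symbol in it for symbol in short)

def envelope(sequences):
    lower = "".join(min(col) for col in zip(*sequences))
    upper = "".join(max(col) for col in zip(*sequences))
    return lower, upper

def build_tree(profiles, fanout=2):
    """Group profile sequences under nested envelopes (hypothetical layout)."""
    level = [{"env": (seq, seq), "profile": name, "seq": seq, "kids": []}
             for name, seq in sorted(profiles.items(), key=lambda kv: kv[1])]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            kids = level[i:i + fanout]
            lows = [k["env"][0] for k in kids]
            highs = [k["env"][1] for k in kids]
            # The parent envelope encloses every child envelope.
            env = (envelope(lows)[0], envelope(highs)[1])
            parents.append({"env": env, "profile": None, "seq": None, "kids": kids})
        level = parents
    return level[0]

def may_contain_match(env, doc_seq):
    """Illustrative pruning test: each position needs some document
    symbol inside [lower, upper]; otherwise no enclosed sequence can match."""
    lower, upper = env
    return all(any(lo <= s <= hi for s in doc_seq)
               for lo, hi in zip(lower, upper))

def filter_doc(node, doc_seq):
    if not may_contain_match(node["env"], doc_seq):
        return []  # prune the whole subtree
    if node["profile"] is not None:
        # Verification step on a surviving candidate.
        return [node["profile"]] if is_subsequence(node["seq"], doc_seq) else []
    return [p for kid in node["kids"] for p in filter_doc(kid, doc_seq)]

# Hypothetical profile sequences.
TREE = build_tree({"P1": "abc", "P2": "abd", "P3": "xyz", "P4": "xzz"})
print(filter_doc(TREE, "aqbqc"))
```

Because a parent envelope encloses its children, a failed test at an inner node discards every profile below it without looking at them individually.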
Experimental Results
We generated datasets with 1,000, 10,000, and 100,000 small documents (up to 8 KB each).
We generated up to 100,000 queries with selectivity fixed to 50%.
[Figure: time (sec) versus number of queries (100 to 100,000) for NFA, BUFF, and BoXFilter, in three panels: (a) 1,000 documents, (b) 10,000 documents, (c) 100,000 documents.]
Experimental Results (cont.)
In this set of experiments, we vary the percentage of documents that match any of the profile queries (a selectivity of 1% means that one percent of the documents satisfy any of the queries).
[Figure: Varying Selectivity. Time (sec) versus selectivity (1% to 100%) for NFA, BUFF, and BoXFilter.]