Early Profile Pruning on XML-aware Publish-Subscribe Systems

04/22/23 1


Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras

University of California, Riverside


Overview
Motivation
Bottom-up Filtering FSM (BUFF)
Bounding-based XML Filtering (BoXFilter)
  Core modules
  Filtering algorithms
Experimental results


Motivation
Publish-subscribe systems: message transmission is determined by the message content.
Examples: notification websites such as hotwire.com or ticketmaster.com.

[Figure: subscribers submit, update, and delete profiles and receive results; publishers supply documents; a matching algorithm connects the two.]


Publish-subscribe systems
The data is exchanged in XML format.
Nodes correspond to elements, attributes, or text values.
Edges represent immediate element-subelement or element-value relationships.

<Bib>
  <article vol="7" no="11">
    <title>t1</title>
    <author>
      <last>DeWitt</last><mi>J</mi><first>David</first>
    </author>
    <journal>TPDS</journal>
    <year>1996</year>
  </article>
  <article>
    <title>t2</title>
    <author>
      <last>Florescu</last><first>Daniela</first>
    </author>
    <proceedings>SIGMOD</proceedings>
    <year>2006</year>
  </article>
</Bib>

[Figure: (a) Document; (b) Tree representation, with Bib as the root, two article nodes below it, and element, attribute (vol, no), and text-value nodes underneath.]
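The node and edge model described above can be sketched in a few lines. This is only an illustration, assuming Python's standard `xml.etree.ElementTree` parser and a shortened copy of the sample document; the helper name `tree_edges` is hypothetical, and labeling an attribute node by its attribute name is a choice made here for readability.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document (first article, without the author).
DOC = """<Bib><article vol="7" no="11">
<title>t1</title><journal>TPDS</journal><year>1996</year>
</article></Bib>"""

def tree_edges(elem):
    """Return (parent, child) label pairs for elements, attributes, and text values."""
    edges = []
    for name, value in elem.attrib.items():
        edges.append((elem.tag, name))       # element -> attribute node
        edges.append((name, value))          # attribute -> its value
    text = (elem.text or "").strip()
    if text:
        edges.append((elem.tag, text))       # element -> text value
    for child in elem:
        edges.append((elem.tag, child.tag))  # element -> subelement
        edges.extend(tree_edges(child))
    return edges

EDGES = tree_edges(ET.fromstring(DOC))
```

Each pair in `EDGES` is one edge of the tree: for instance `("Bib", "article")` is an element-subelement edge and `("journal", "TPDS")` is an element-value edge.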


Publish-subscribe systems (cont.)
The user profiles are expressed in an XML query language (XPath, XQuery).
An XML query contains:
  structural constraints
  value-based constraints

Example query: //article[/author[@last="Smith"]]//procs[@conf="VLDB"]
[Figure: tree pattern of the query, with nodes article, author, last, proceedings, and conf.]
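To make the two kinds of constraint concrete, here is a minimal sketch that tests a profile against a document. It assumes Python's `xml.etree.ElementTree`, which supports only a small XPath subset, so the profiles below are simplified stand-ins written against the element names of the earlier sample document rather than the query on the slide. The helper name `matches` is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<Bib>
  <article><title>t1</title><journal>TPDS</journal><year>1996</year></article>
  <article><title>t2</title><proceedings>SIGMOD</proceedings><year>2006</year></article>
</Bib>"""

def matches(document, path):
    """True if at least one node of the document satisfies the path expression."""
    root = ET.fromstring(document)
    return root.find(path) is not None

# Structural constraint: an article with a journal child.
# Value-based constraint: the text of that child equals "TPDS".
HIT = matches(DOC, ".//article[journal='TPDS']")
MISS = matches(DOC, ".//article[proceedings='VLDB']")
```

A publish-subscribe broker has to answer this yes/no question for every stored profile and every incoming document, which is why the matching step dominates the cost.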


Related Work/Our Contribution
Current work:
  Construction of overlay networks
  Dissemination/indexing of profiles (queries)
  Processing of streams of messages
We focus on the matching process that takes place within a broker:
  We improve the performance of the regular FSM by using a bottom-up evaluation of the document.
  We develop an index-based filtering technique that performs early pruning of the query profiles.


Bottom-up vs. Top-down filtering
State machines are among the most common methods for the XML matching process.
Top-down approach: the document is read in document order (a depth-first traversal), and the state machine advances for each XML element (or attribute) read. It does not consider any form of early pruning.
Bottom-up approach: takes into account the (usual) fact that an XML document has its more selective elements located at its leaves.


Example
[Figure: (a) Document; (b) Queries; (c) Top-down state machine, states 0 to 14; (d) Bottom-up state machine, states 0 to 14.]
Queries in panel (b), node labels only:
  Q1: a b c d
  Q2: a c d
  Q3: a e f
  Q4: a e f h
  Q5: e f h
  Q6: e g h

The top-down approach groups the queries according to their common prefixes.
The bottom-up approach groups them according to their common suffixes.
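The difference between prefix and suffix grouping can be checked on the six example queries. The sketch below assumes the queries read as plain label paths (Q1 = a b c d, Q2 = a c d, Q3 = a e f, Q4 = a e f h, Q5 = e f h, Q6 = e g h) and ignores the axes between steps, which are not recoverable from the transcript. The function name `group_by_first_step` is hypothetical.

```python
from collections import defaultdict

# Example queries as label paths; axes between steps are ignored (assumption).
QUERIES = {
    "Q1": "abcd", "Q2": "acd", "Q3": "aef",
    "Q4": "aefh", "Q5": "efh", "Q6": "egh",
}

def group_by_first_step(queries, bottom_up=False):
    """Group queries by the first label a state machine would consume.

    Top-down machines consume the first label of each path (shared prefixes);
    bottom-up machines consume the last label (shared suffixes).
    """
    groups = defaultdict(list)
    for name, path in queries.items():
        steps = path[::-1] if bottom_up else path
        groups[steps[0]].append(name)
    return dict(groups)

TOP_DOWN = group_by_first_step(QUERIES)
BOTTOM_UP = group_by_first_step(QUERIES, bottom_up=True)
```

Under these assumptions the top-down grouping splits the queries into those starting with `a` and those starting with `e`, while the bottom-up grouping splits them by their leaf label (`d`, `f`, or `h`), so a leaf element seen in the document immediately narrows the candidate set.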


BUFF
An FSM-based bottom-up approach for XML filtering.
BUFF avoids translating documents and queries to Prüfer sequences (as the other algorithms do) and employs a more direct evaluation algorithm.
The document is parsed by a SAX parser, which triggers events for specific marks (tags) in the XML document.
The machine keeps a runtime stack that stores the current document path being processed.
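The SAX events and the runtime stack can be illustrated with Python's standard `xml.sax` module. This is a simplified sketch, not the BUFF state machine itself: it assumes each query is a list of labels connected by child steps that may start anywhere in the document, and it checks queries against the tail of the stack whenever an element closes. The class name `PathStackHandler` and the function `match` are hypothetical.

```python
import xml.sax

class PathStackHandler(xml.sax.ContentHandler):
    """Keeps the current root-to-node path on a stack while SAX events arrive."""

    def __init__(self, queries):
        super().__init__()
        self.queries = queries   # query name -> list of element labels
        self.stack = []          # labels of the currently open elements
        self.matched = set()

    def startElement(self, name, attrs):
        self.stack.append(name)  # open tag: extend the current path

    def endElement(self, name):
        # Close tag: a query matches if its labels equal the tail of the path.
        for qname, labels in self.queries.items():
            n = len(labels)
            if len(self.stack) >= n and self.stack[-n:] == labels:
                self.matched.add(qname)
        self.stack.pop()         # leave the element

def match(document, queries):
    handler = PathStackHandler(queries)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.matched

DOC = "<a><b><c><d/></c></b><e><f/></e></a>"
QUERIES = {"Q1": ["b", "c", "d"], "Q2": ["a", "c", "d"], "Q5": ["e", "f"]}
MATCHED = match(DOC, QUERIES)
```

Comparing every query at every close tag is the naive part; a shared bottom-up machine would replace that loop with transitions over common suffixes.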


BUFF Example
[Figure: (a) Document and BUFF state machine for queries Q1 and Q2; (b) to (g) contents of the runtime stack at successive SAX events.]


Bounding-based XML Filtering
Two major processes work asynchronously:
  Profile management
  Profile matching
[Figure: system architecture with the components Profiles (queries), Profile Manager, Prüfer Sequence, Profile Index (profiles P1, P2, P3), Input Documents, Matching Module, Matching Algorithm, and Matched Documents.]


Prüfer Sequence
A unique sequential encoding of a labeled tree.
Algorithm:
  Iteratively remove nodes from the tree until all nodes but the last two have been removed.
  At each iteration, find and remove the leaf with the smallest label, and append the label of that leaf's parent to the Prüfer sequence.
Theorem: If a query tree Q is a subgraph of a document tree D, then the Prüfer sequence of Q is a subsequence of the Prüfer sequence of D.
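The construction above can be sketched as follows. The sketch assumes a rooted tree given as a child-to-parent map in which every node carries a unique, comparable number; the slides do not say how nodes are numbered, so that numbering is an assumption, and the names `prufer_sequence` and `is_subsequence` are hypothetical. The second helper is the plain subsequence test that the theorem turns into a necessary condition for a match.

```python
import heapq

def prufer_sequence(parent):
    """Prüfer sequence of a rooted tree given as {child_label: parent_label}."""
    nodes = set(parent) | set(parent.values())
    child_count = {n: 0 for n in nodes}
    for p in parent.values():
        child_count[p] += 1
    # Leaves are nodes without children; a min-heap yields the smallest first.
    leaves = [n for n in nodes if child_count[n] == 0]
    heapq.heapify(leaves)
    sequence = []
    remaining = len(nodes)
    while remaining > 2:                 # stop when two nodes are left
        leaf = heapq.heappop(leaves)
        p = parent[leaf]
        sequence.append(p)               # record the parent of the removed leaf
        child_count[p] -= 1
        remaining -= 1
        if child_count[p] == 0 and p in parent:
            heapq.heappush(leaves, p)    # the parent has become a leaf
    return sequence

def is_subsequence(query, document):
    """True if query can be obtained from document by deleting elements."""
    it = iter(document)
    return all(symbol in it for symbol in query)

# Root 6 with children 3 and 5; node 3 has children 1 and 2; node 5 has child 4.
TREE = {1: 3, 2: 3, 3: 6, 4: 5, 5: 6}
SEQ = prufer_sequence(TREE)
```

For the six-node tree in the example the sequence has four entries, one per removed node. Because the subsequence test is only a necessary condition, a candidate that passes it still needs to be verified against the document.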


Sequence Envelope
Assume a set of k Prüfer sequences S1, ..., Sk representing user profiles.
We can derive two new sequences:
  Upper bound U: for each position, take the largest element.
  Lower bound L: for each position, take the smallest element.
L and U form the smallest possible bounding envelope that encloses all members of the set of sequences from above and below.
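The position-wise definition of L and U translates directly into code. The sketch below assumes that all sequences have the same length and that symbols are compared in alphabetical order; the sample sequences are made up for illustration and are not the ones on the slides. The names `envelope` and `inside` are hypothetical.

```python
def envelope(sequences):
    """Return (L, U): position-wise minimum and maximum of the sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    columns = list(zip(*sequences))          # one tuple of symbols per position
    lower = "".join(min(col) for col in columns)
    upper = "".join(max(col) for col in columns)
    return lower, upper

def inside(sequence, lower, upper):
    """True if every symbol lies between the bounds at its position."""
    return len(sequence) == len(lower) and all(
        lo <= s <= hi for lo, s, hi in zip(lower, sequence, upper)
    )

SEQS = ["abca", "cdcd", "dedb"]              # illustrative profile sequences
L, U = envelope(SEQS)
```

Every member of the set lies inside its envelope, but the converse does not hold: a sequence such as `"bccb"` is inside the envelope of `SEQS` without being one of the profiles. The envelope is therefore a summary that can rule candidates out, not one that confirms a match.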


Example
Assume 3 sequences with 11 symbols each.
[Figure: the three sequences and their envelope; extracted symbols: abcabababcdcdcdecdcdecdedededebab]


Sequence Envelope (Cont.)
A key property of the sequence envelope is that it can be used as an aggregate of its underlying set of sequences.


BoXFilter Tree
Sequence envelopes can be nested, forming a BoXFilter tree.
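One plausible way to nest envelopes is to let a parent envelope bound the envelopes of its children position by position. The slides do not spell out how the tree is built, so the sketch below is an assumption made for illustration; the function name `merge_envelopes` is hypothetical, and envelopes are assumed to be `(lower, upper)` pairs of equal-length strings.

```python
def merge_envelopes(envelopes):
    """Return the (lower, upper) envelope that encloses all given envelopes."""
    lowers, uppers = zip(*envelopes)
    lower = "".join(min(col) for col in zip(*lowers))  # smallest lower bound
    upper = "".join(max(col) for col in zip(*uppers))  # largest upper bound
    return lower, upper

# Two child envelopes, each given as (lower, upper).
CHILDREN = [("ab", "cd"), ("aa", "ce")]
PARENT = merge_envelopes(CHILDREN)
```

Applied level by level, this gives a hierarchy in which each node's envelope contains every profile stored below it, the same containment idea used by bounding-box indexes.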


Filtering algorithms
The profiles in the system are organized in a BoXFilter tree. Documents are routed through the tree.
There are two variations of the filtering algorithm:
  Sequential: documents are processed one by one.
  Batch processing: documents are organized in a tree like the queries, and both trees are joined.
After the traversal of the BoXFilter tree, there is a verification step.
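The sequential variant, with pruning followed by verification, can be sketched as a recursive traversal. Everything specific here is an assumption: the `Node` layout, the pruning test `may_match`, and the use of a subsequence test as the verification step. The pruning test is a deliberately weak but safe condition chosen for illustration (every position of the envelope must be satisfiable by some document symbol); it is not claimed to be the test BoXFilter uses.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lower: str                                      # lower bound L of the envelope
    upper: str                                      # upper bound U of the envelope
    profiles: dict = field(default_factory=dict)    # leaf: profile name -> sequence
    children: list = field(default_factory=list)

def may_match(doc, lower, upper):
    """Illustrative pruning test: each position needs some document symbol in range."""
    return all(any(lo <= s <= hi for s in doc) for lo, hi in zip(lower, upper))

def is_subsequence(query, document):
    it = iter(document)
    return all(symbol in it for symbol in query)

def filter_document(doc, node):
    """Return the names of the profiles under node that the document satisfies."""
    if not may_match(doc, node.lower, node.upper):
        return []                                   # prune the whole subtree
    hits = [name for name, seq in node.profiles.items()
            if is_subsequence(seq, doc)]            # verification step
    for child in node.children:
        hits.extend(filter_document(doc, child))
    return hits

LEAF1 = Node("ab", "bc", profiles={"P1": "ab", "P2": "bc"})
LEAF2 = Node("xy", "yz", profiles={"P3": "xy", "P4": "yz"})
ROOT = Node("ab", "yz", children=[LEAF1, LEAF2])
HITS = filter_document("acb", ROOT)
```

In this toy run the second leaf is discarded without inspecting its profiles, which is the early pruning effect; profiles under the first leaf still go through verification because passing the envelope test does not guarantee a match.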


Experimental Results
We generated datasets with 1,000, 10,000, and 100,000 small documents (up to 8 KB each).
We generated up to 100,000 queries with selectivity fixed at 50%.

[Figure: running time in seconds versus number of queries (100 to 100,000) for NFA, BUFF, and BoXFilter; panels (a) to (c) show 1,000, 10,000, and 100,000 documents.]


Experimental Results (cont.)
In this set of experiments, we vary the number of documents that match any of the profile queries. (Selectivity 1% means that one percent of the documents satisfy any of the queries.)

[Figure: Varying Selectivity. Running time in seconds versus selectivity (1% to 100%) for NFA, BUFF, and BoXFilter.]
