Selectivity Estimation of Twig Queries on Cyclic Graphs Department of Computer Science Hong Kong Baptist University Speaker: Byron Choi Joint work with *Yun Peng and Jianliang Xu (to appear ICDE 2011) March 21 2011 @ COMP630Q

Selectivity Estimation of Twig Queries on Cyclic Graphs

Department of Computer Science

Hong Kong Baptist University

Speaker: Byron Choi

Joint work with *Yun Peng and Jianliang Xu

(to appear ICDE 2011)

March 21 2011 @ COMP630Q


Agenda

• Background
• Problem Statement
• Overview of our Framework
• Matrix and its Transformations
• Histograms and Estimation
• Experimental Evaluation
• Future Works


Graph Data is Ubiquitous


Navigational Queries

• SELECT a set of nodes via a user-specified path
  ◦ //person[//open_auction//person]
  ◦ //: ancestor-descendant axes, i.e., CONNECT in logic (reachability tests)
• There are evidently many other query formalisms


Selectivity Estimation

• A classical problem: given a query, estimate the count of the results efficiently
• Requirements
  ◦ accurate
  ◦ efficient estimation time
  ◦ small overhead in terms of size


XMark, used in this Presentation


Selectivity Estimation (cont'd)

• Query optimizers rely on the counts to evaluate the costs of query plans
• Example:
  ◦ XMark 1.0 (> 180,000 nodes)
  ◦ Query: //person[//open_auction//person]
  ◦ 25,500 person nodes
  ◦ 12,000 open_auction nodes
  ◦ 13,192 open_auction//person matches
  ◦ //open_auction → //person → ↑↑person


Problem Statement

• Data: A rooted directed labeled graph (i.e., possibly cyclic)
• Query: Twig queries (i.e., parent-child and ancestor-descendant axes and branches)
• Problem statement: given a cyclic graph G and a twig Q, estimate the result count of Q on G.

[Figure: example graph G with vertices labeled department, faculty, name, RA, and TA; twig query Q over department, faculty, RA, and TA]


Our Position Relative to the Current State-of-the-Art

[Figure: prior work positioned along two axes, graph complexity (Tree (XML) vs. Cyclic Graph) and query complexity (Path Query vs. Twig Query): DataGuide ’99, XPathLearner ’02, CST ’04, TreeSketch ’06, XSeed ’06, XSketch ’06, and Our Work]


Related Work (Graph-based approaches)

• DataGuide: automata theories
  ◦ J. McHugh and J. Widom. Query optimization for XML. In VLDB, pages 315–326, 1999.
• TreeSketch and XSketch: bisimulation
  ◦ N. Polyzotis and M. Garofalakis. XSketch synopses for XML data graphs. ACM Trans. Database Syst., 31(3):1014–1063, 2006.
  ◦ N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Approximate XML query answers. In SIGMOD, pages 263–274, 2004.
• Correlated Subpath Tree (CST)
  ◦ Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. Counting twig matches in a tree. In ICDE, pages 595–604, 2001.
• Straight-Line Grammar
  ◦ D. K. Fisher and S. Maneth. Structural selectivity estimation for XML documents. In ICDE, pages 626–635, 2007.


Related Work (Relational approaches)

• 2-dimensional histograms on TREEs
  ◦ Y. Wu, J. M. Patel, and H. V. Jagadish. Using histograms to estimate answer sizes for XML queries. Inf. Syst., 28(1-2):33–59, 2003.
• Hidden Markov Model
  ◦ A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the selectivity of XML path expressions for internet scale applications. In VLDB, pages 591–600, 2001.
• A novel bloom filter: two 1-dimensional histograms
  ◦ W. Wang, H. Jiang, H. Lu, and J. X. Yu. Bloom histogram: Path selectivity estimation for XML data with updates. In VLDB, pages 240–251, 2004.


Technical Challenges

Interactions between cyclic graphs and recursions (i.e. //) in twig queries

Branches of twigs


A Typical Framework

[Figure: Graph → Graph Representation → Summarization of Graph’s Representation → Selectivity, with boxes for the query, the summarization technique, and the selectivity estimation technique]

• Previous approaches differ from one another in one or more of these steps

• We also follow this general framework


Framework – with Our Solution Now

Graph Representation

Summarization of Graph Rep.

Selectivity Estimation


Summary of Contributions

1. Cyclic graph representation
  ◦ Prime labeling (vs. other representations)
  ◦ Matrix representation of prime labeling
  ◦ Matrix transformation to C1P matrix
2. Summarization of graph’s representation
  ◦ 2-dimensional histogram for cyclic graph
3. Algorithms for selectivity estimation


Characteristics of our Contributions

• Matrix representation of cyclic graphs
  ◦ Reuse some research from matrices
• Histogram-based selectivity estimation
  ◦ No uniform distribution assumption
• One data node/vertex: one 2-dimensional point
• One query step (child or descendant): multiple 2-dimensional points


Agenda

• Background
• Problem Statement
• Overview of our Framework
• Matrix and its Transformations
• Histograms and Estimation
• Experimental Evaluation
• Future Works


Alternative Representations for Cyclic Graphs

• Adjacency matrix/list
  ◦ Easy to construct
  ◦ Inefficient in determining ancestors/descendants
• Transitive closure
  ◦ Efficient in determining ancestors/descendants
  ◦ Inefficient in terms of space
• Prime labeling
  ◦ Smaller than transitive closure but larger than adjacency matrix
  ◦ Query efficiency better than adjacency matrix but worse than transitive closure
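
The trade-off between the first two representations can be made concrete with a minimal sketch. It assumes a small hypothetical cyclic graph (the vertices a to d and all function names are illustrative, not from the talk). The adjacency list answers a reachability question by traversing at query time, while the transitive closure answers by lookup but stores every reachable pair.

```python
def reachable_dfs(adj, src, dst):
    """Answer 'can src reach dst?' by traversing the adjacency list."""
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v == dst:
                return True
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def transitive_closure(adj):
    """Precompute, for every vertex, the set of vertices it can reach."""
    vertices = set(adj) | {v for targets in adj.values() for v in targets}
    return {u: {v for v in vertices if reachable_dfs(adj, u, v)} for u in vertices}


# Hypothetical cyclic graph: a -> b -> c -> a, plus c -> d.
adj = {"a": ["b"], "b": ["c"], "c": ["a", "d"]}
closure = transitive_closure(adj)

# Space cost of each representation, counted in stored entries.
edge_count = sum(len(targets) for targets in adj.values())
closure_pairs = sum(len(targets) for targets in closure.values())
print("edges stored:", edge_count, "closure pairs stored:", closure_pairs)
print("a reaches d:", reachable_dfs(adj, "a", "d"), "d" in closure["a"])
```

Even on this four-vertex graph the closure stores three times as many entries as the adjacency list, which is the space cost the slide refers to.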


Prime Labeling

• Originally proposed for tree data [X. Wu, ICDE’04]
  ◦ To address update-friendly XML index for reachability tests
• Later extended to DAGs [G. Wu, DASFAA’06]
  ◦ Each vertex is assigned a prime number
• Our extension to cyclic graphs
  ◦ Applied to cyclic graphs
  ◦ Reduced labeling size further: not every vertex is labeled with a unique prime number → smaller than G. Wu et al.


Prime Labeling (cont'd)

Large prime numbers near the root of the graph

• assign each leaf vertex a prime number

• assign an intermediate vertex the product of the labels of its children

• label the root


Prime Labeling (our Def.)

G. Wu et al

Yun Peng


Querying with Prime Labeling

• Reachability ≡ Divisibility
  ◦ c → d: 7 * 11 * 3 / 3 = 7 * 11 (divisible, so d is reachable from c)
  ◦ c → e: 7 * 11 * 3 / 5 = 46.2 (not divisible, so e is not reachable from c)
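
The divisibility test can be sketched in a few lines. The labels for c, d, and e are the ones implied by the example above (c = 7 * 11 * 3, d = 3, e = 5); the function name `reaches` is hypothetical.

```python
# Prime-product labels taken from the slide's example.
labels = {"c": 7 * 11 * 3, "d": 3, "e": 5}


def reaches(u, v):
    """u reaches v exactly when v's label divides u's label."""
    return labels[u] % labels[v] == 0


print("c -> d:", reaches("c", "d"))  # 231 is divisible by 3
print("c -> e:", reaches("c", "e"))  # 231 is not divisible by 5
```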


Matrix Representation

       3   5   7   11
a      1   1   1   1
b      1   1   0   0
c      1   0   1   1
d      1   0   0   0
e      0   1   0   0
f      0   0   1   0
g      1   0   0   1

Columns: Prime numbers

Rows: Vertices

Reachability: Divisibility ≡ Logic ops

Experiments: Often just a constant factor smaller than the adjacency matrix!
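
Because each label is a product of distinct primes, a matrix row can be held as a bitmask, and the divisibility test becomes a subset test done with a bitwise AND. The following is a minimal sketch using the rows of the matrix above; the helper names (`to_mask`, `to_label`, `reaches`) are illustrative.

```python
PRIMES = [3, 5, 7, 11]  # column order of the matrix

# One 0/1 row per vertex, copied from the matrix on the slide.
ROWS = {
    "a": [1, 1, 1, 1],
    "b": [1, 1, 0, 0],
    "c": [1, 0, 1, 1],
    "d": [1, 0, 0, 0],
    "e": [0, 1, 0, 0],
    "f": [0, 0, 1, 0],
    "g": [1, 0, 0, 1],
}


def to_mask(row):
    """Pack a 0/1 row into an integer bitmask (column i becomes bit i)."""
    return sum(bit << i for i, bit in enumerate(row))


def to_label(row):
    """Recover the prime-product label that the row encodes."""
    label = 1
    for prime, bit in zip(PRIMES, row):
        if bit:
            label *= prime
    return label


def reaches(u, v):
    """v's primes are a subset of u's primes, i.e. label(v) divides label(u)."""
    mu, mv = to_mask(ROWS[u]), to_mask(ROWS[v])
    return (mu & mv) == mv


print("label of c:", to_label(ROWS["c"]))
print("c -> g:", reaches("c", "g"), " c -> e:", reaches("c", "e"))
```

The bitwise form avoids the very large integers that products of primes produce near the root.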


Where are we?

• Experiments from XMark
  ◦ It is just a constant factor smaller than adjacency matrix
  ◦ How on earth would this be summarized?


Consecutive Ones Property (C1P)

A Consecutive Ones Matrix (C1P matrix) is a 0/1 matrix, in which 1s of each row are consecutive.

Since 1s are consecutive, each row of a C1P matrix can be summarized by an interval: [start column id of 1s, end column id of 1s]

One row → One vertex → One interval

Non-consecutive ones matrix:

r1     1 1 1 0 1 1
r2     0 0 1 1 1 1

Consecutive ones matrix (column ids start at 0):

r1     0 1 1 1 1 1   → [1,5]
r2     0 0 1 1 1 0   → [2,4]
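
The row-to-interval summary can be sketched as follows, using the example rows above. The function name is hypothetical, and column ids are assumed to start at 0, which is what the intervals [1,5] and [2,4] imply.

```python
def row_to_interval(row):
    """Return [start, end] column ids of the 1s.

    Returns None if the row has no 1s or if its 1s are not consecutive.
    """
    ones = [i for i, bit in enumerate(row) if bit]
    if not ones or ones[-1] - ones[0] + 1 != len(ones):
        return None
    return [ones[0], ones[-1]]


print(row_to_interval([0, 1, 1, 1, 1, 1]))  # r1 of the consecutive ones matrix
print(row_to_interval([0, 0, 1, 1, 1, 0]))  # r2 of the consecutive ones matrix
print(row_to_interval([1, 1, 1, 0, 1, 1]))  # r1 of the non-consecutive matrix
```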


What do we get from C1P?


What do we get from C1P? (cont'd)

• Adopting a property of intervals
  ◦ Vertex w is reachable from vertex v if w is located within the bottom-right region of v on the plane
  ◦ For example, dot (2,4) is in the bottom-right region of dot (1,5), so r2 is reachable from r1

[Figure: rows r1 = 0 1 1 1 1 1 → [1,5] and r2 = 0 0 1 1 1 0 → [2,4], plotted as dots r1(1,5) and r2(2,4)]
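
Read as a test on intervals, "bottom-right" means a start that is no smaller and an end that is no larger, i.e. interval containment. A minimal sketch with the two dots from the example; the function name is hypothetical, and treating a dot as reachable from itself is an assumption of this sketch.

```python
def reachable_from(v, w):
    """True if dot w = (start, end) lies in the bottom-right region of dot v."""
    v_start, v_end = v
    w_start, w_end = w
    # Right of v: start is no smaller. Below v: end is no larger.
    return w_start >= v_start and w_end <= v_end


r1, r2 = (1, 5), (2, 4)
print("r2 reachable from r1:", reachable_from(r1, r2))
print("r1 reachable from r2:", reachable_from(r2, r1))
```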


Complexities related to C1P

• C1P matrix detection
  ◦ Linear time solvable [Hsu, Algorithms’02]
• Transform a non-C1P matrix to a C1P matrix
  ◦ NP-hard [Tan, Algorithmica’07]
  ◦ No polynomial time approximation [Tan, Algorithmica’07]


Our Heuristic Algorithm

Main Idea: given any m × n matrix with r 1s, extract C1P submatrices (by the C1P matrix detection algorithm) and then concatenate them one by one

Time complexity: $O(m^2(m+n+r))$
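
The main idea can be illustrated with a toy sketch. Two things here are assumptions rather than the authors' method: the linear-time C1P detection is swapped for a brute-force search over column permutations (only feasible for tiny matrices), and "concatenate" is read as laying the blocks side by side, each with its own column order. All function names are hypothetical.

```python
from itertools import permutations


def has_consecutive_ones(row):
    """True if the 1s of a 0/1 row form one contiguous run (or there are none)."""
    ones = [i for i, bit in enumerate(row) if bit]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def find_c1p_order(rows, n_cols):
    """Brute-force stand-in for C1P detection: a column order or None."""
    for order in permutations(range(n_cols)):
        if all(has_consecutive_ones([row[c] for c in order]) for row in rows):
            return order
    return None


def greedy_c1p_blocks(matrix):
    """Grow one block at a time, adding a row only if the block stays C1P."""
    n_cols = len(matrix[0])
    remaining = list(range(len(matrix)))
    blocks = []
    while remaining:
        block, order, rest = [], None, []
        for r in remaining:
            candidate = find_c1p_order([matrix[i] for i in block + [r]], n_cols)
            if candidate is None:
                rest.append(r)  # try this row again in a later block
            else:
                block.append(r)
                order = candidate
        blocks.append((block, order))
        remaining = rest
    return blocks


def concatenate(matrix, blocks):
    """Lay blocks side by side; each row keeps 1s only inside its own block."""
    width = len(matrix[0])
    result = {}
    for k, (block, order) in enumerate(blocks):
        for r in block:
            row = [0] * (width * len(blocks))
            for pos, c in enumerate(order):
                row[k * width + pos] = matrix[r][c]
            result[r] = row
    return result


# No single column order makes all three rows consecutive.
matrix = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
blocks = greedy_c1p_blocks(matrix)
wide = concatenate(matrix, blocks)
for r in sorted(wide):
    print(r, wide[r])
```

Under this reading, a column of the original matrix appears once per block in the concatenated matrix.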


Pseudo-code of the Matrix Transformation

Extract a submatrix for this iteration

Adding one row at a time

C1P detection – linear time

Transform to C1P – linear time


Optimizations for C1P Trans.

• Horizontal matrix decomposition prior to C1P heuristics
  ◦ $O(m^2(m+n+r))$
  ◦ Use the 3 sigmas rule on the number of 1’s in rows
• Common pattern extraction
  ◦ Done by an intersection of the rows
• Compressed (extensible) hash mappings
  ◦ One column in the original matrix may be mapped to multiple positions in a C1P matrix
  ◦ Support mapping ops in the compressed domain
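
The first two optimizations can be sketched under stated assumptions. The slide does not say how the 3 sigmas rule partitions the rows, so this sketch assumes that rows whose number of 1s exceeds the mean by more than three standard deviations are split off from the rest. The common pattern is taken to be the columns where every row has a 1. Function names are hypothetical.

```python
from statistics import mean, pstdev


def split_by_three_sigma(rows):
    """Separate rows with unusually many 1s (assumed reading of the 3 sigmas rule)."""
    counts = [sum(row) for row in rows]
    threshold = mean(counts) + 3 * pstdev(counts)
    regular = [row for row, c in zip(rows, counts) if c <= threshold]
    dense = [row for row, c in zip(rows, counts) if c > threshold]
    return regular, dense


def common_pattern(rows):
    """Intersect the rows: keep a 1 only where every row has a 1."""
    return [int(all(bits)) for bits in zip(*rows)]


# 19 sparse rows and one very dense row over 20 columns (illustrative data).
rows = [[1] + [0] * 19 for _ in range(19)] + [[1] * 20]
regular, dense = split_by_three_sigma(rows)
print(len(regular), "regular rows,", len(dense), "dense row(s)")
print(common_pattern([[1, 1, 0], [1, 0, 0], [1, 1, 1]]))
```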


What do we get from C1P? (Recall)


Agenda

• Background
• Problem Statement
• Overview of our Framework
• Matrix and its Transformations
• Histograms and Estimation
• Experimental Evaluation
• Future Works


2-Dimensional Histogram

• Recall we summarize rows of a C1P matrix by intervals and then plot them as dots on the 2-d plane
• The plane is divided into cells. For each cell, we record the number of dots located within it.
• Given a vertex v, the set of vertices reachable from v must be located in the bottom-right region of v
• Sum up the counts of the cells located in the bottom-right region of v, e.g., 1 + 2 = 3
• We build a 2-dimensional histogram for each kind of node

[Figure: example grid with cell counts 1, 2, 1, 1]
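
A minimal sketch of the grid histogram follows, assuming integer interval endpoints and square cells of a fixed size. It is a simplification, not the estimation rules of the talk: cells that lie entirely in the bottom-right region are summed, and cells that only partially overlap it are reported separately as an upper bound instead of being estimated. The dots and function names are illustrative.

```python
from collections import Counter


def build_histogram(dots, cell):
    """Count the dots (start, end) that fall into each square cell of side `cell`."""
    return Counter((start // cell, end // cell) for start, end in dots)


def bound_reachable(hist, cell, v):
    """Lower and upper bounds on the number of dots in the bottom-right region of v."""
    v_start, v_end = v
    lower = upper = 0
    for (cx, cy), count in hist.items():
        x_lo, x_hi = cx * cell, cx * cell + cell - 1
        y_lo, y_hi = cy * cell, cy * cell + cell - 1
        if x_lo >= v_start and y_hi <= v_end:
            lower += count  # cell lies entirely inside the region
            upper += count
        elif x_hi >= v_start and y_lo <= v_end:
            upper += count  # cell only partially overlaps the region
    return lower, upper


dots = [(1, 5), (2, 4), (3, 3), (0, 7), (6, 7)]
hist = build_histogram(dots, cell=2)
bounds = bound_reachable(hist, cell=2, v=(1, 5))
exact = sum(1 for s, e in dots if s >= 1 and e <= 5)
print("bounds:", bounds, "exact:", exact)
```

The gap between the two bounds comes from the partially overlapping cells, which is where cell-specific estimation rules are needed.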


2-Dimensional Histogram -- Observations

• Data dots are always on or above the diagonal line
• Data dots are often skewed towards the diagonal line
  ◦ This is consistent with an observation from XML research
• There are different types of cells w.r.t. a query → there should be different estimation rules


Our 2-Dimensional Histograms

• More histogram/structure in a cell
• Different estimation rules for different classes of cells


Schematics of Our Estimation


Estimation Details that have been Skipped in this Talk

• A top-down recursive estimation algorithm based on the (syntactic) structure of twigs
• Details on handling branches
  ◦ A bottom-up recursive algorithm
• estimate_intermediate: generating next query dots
• estimate_count: generating count from a query dot


top_down (very briefly)


A rule in estimate_intermediate


Illustration of estimation rules in estimate_intermediate


A rule in estimate_count


Query-dot generation

• Compress f and f^-1
• Generate query dots in the compressed domain in one scan

1. They can be large, sometimes
2. Many query dots have 0 count


Agenda

• Background
• Problem Statement
• Overview of our Framework
• Matrix and its Transformations
• Histograms and Estimation
• Experimental Evaluation
• Future Works


Experiments

• Datasets
  ◦ XMark; DBLP; Treebank.05
• Queries
  ◦ Skewed queries based on the tags’ popularities
• Optimizations
  ◦ Used all optimizations unless specified otherwise


Error Metrics

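◦ Relative error: from XSketch/TreeSketch
  $\frac{1}{n}\sum \frac{|\text{est.} - \text{real}|}{\text{real}}$

◦ Root Mean Square Error (RMSE): from XSeed
  $\sqrt{\frac{\sum (\text{est.} - \text{real})^2}{n}}$

◦ Normalized RMSE: from XSeed
  $\frac{\text{RMSE}}{\sum \text{real} / n}$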
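
The three metrics can be computed as in the sketch below. It assumes the usual readings: relative error averaged over the n queries, RMSE as the root of the mean squared difference, and NRMSE as RMSE divided by the mean real count. The sample estimates and real counts are made up for illustration.

```python
from math import sqrt


def relative_error(est, real):
    """Mean of |est - real| / real over all queries."""
    return sum(abs(e - r) / r for e, r in zip(est, real)) / len(real)


def rmse(est, real):
    """Root of the mean squared difference between estimates and real counts."""
    return sqrt(sum((e - r) ** 2 for e, r in zip(est, real)) / len(real))


def nrmse(est, real):
    """RMSE divided by the mean real count."""
    return rmse(est, real) / (sum(real) / len(real))


est = [10, 20]   # hypothetical estimated counts
real = [8, 25]   # hypothetical real counts
print(relative_error(est, real), rmse(est, real), nrmse(est, real))
```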


Our Est. Error (relative error)


Our Est. Error (RMSE & NRMSE)


Ours vs. XSeed

               RMSE               NRMSE
XMark          7.1 times better   6.9 times better
Treebank.05    6.8 times better   6.8 times better


Ours vs. XSketch/TreeSketch (indirect)

XSketch focuses on path queries on cyclic graphs and keeps the error under 10%

TreeSketch focuses on twig queries on trees and keeps the error under 5%


Our Est. Time on XMark Graph


Performance of C1P Matrix Transformation Optimization


Query Dot Gen. Optimization


Conclusions

• This is the first work on selectivity estimation of twig queries on cyclic graphs
• We propose a new graph representation technique
  ◦ Extend prime labeling to cyclic graphs
  ◦ Transform prime labeling to a C1P matrix for summarization
• We extend the 2-dimensional histogram selectivity estimation technique to cyclic graphs
• Experimental results show that we outperform previous works
  ◦ Our ~1.3% error vs. XSketch/TreeSketch’s 5% error
  ◦ Errors are at least 6.8 times smaller than XSeed


Future Works

• Incorporating this technique with estimation on
  ◦ Data values
  ◦ Queries with negations
• External implementation
  ◦ For quick implementation, we put almost all data structures in main memory
• Estimation performance guarantees
