XCluster Synopses for Structured XML Content
Alkis Polyzotis (UC Santa Cruz)
Minos Garofalakis (Intel Research, Berkeley)
XML Summarization
Synopses are essential for XML data management
  Statistics for XML query optimization
  Approximate query answering
Active research topic in the field of XML databases
  Markov Tables, XSketch, XPathLearner, CSTs, TreeSketch,...
[Figure: count(Q) on the XML Data gives the selectivity of Q; count(Q) on the Synopsis gives the estimated selectivity of Q]
Content Heterogeneity
Data:
<paper><year>2003</year><title>The history of histograms (abridged)</title><author>Yannis Ioannidis</author><abstract>The history of histograms is long and rich, full of detailed information in every step. It...</abstract></paper>
Queries:
//paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]
Value types: Numerical, String, Text
Predicate types: Range, Substring, Term Containment
Synopses and Heterogeneity
Mixed predicates => Unified summarization model
  Path structure
  Values of different types
  Correlations between and across
Summarization for textual values
//paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]
[Figure: XML Data -> Synopsis]
XCluster Synopses
Data synopses for heterogeneous XML content
  Unified summarization for path structure and numerical, string, and textual content
  Support for twig queries with mixed predicates
XCluster model <=> Element clustering
  Tight cluster <=> Similar structure and values
  Extensibility to other value types
Principled compression framework
Experimental results: high accuracy with low storage requirements
Outline
Preliminaries
XCluster Model
XCluster Compression
Construction Algorithm
Experimental Study
Data and Query Model
Tree data with heterogeneous value content
Tree-pattern queries with XPath expressions
  Result: set of binding tuples
for $q0 in /, $q1 in $q0/p[y>1999], $q2 in $q1/t[contains(XML)], $q3 in $q1/ab[ftcontains(synopsis,data)]
[Figure: Data tree with Numerical, String, and Text values; Query tree over q0, q1, q2, q3 with Range, Substring, and Term Containment predicates]
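As a rough illustration of the query model, the sketch below counts binding tuples for the example twig query on a tiny hand-made document. The tag names p, y, t, and ab follow the query on this slide; the document content and the helper `count_binding_tuples` are assumptions made for illustration, and the three predicates are approximated with plain Python comparisons rather than a real XPath full-text engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; only the tag names come from the slide's query.
DOC = """
<root>
  <p><y>2003</y><t>XML synopses</t><ab>a data synopsis for xml</ab><ab>no match here</ab></p>
  <p><y>1998</y><t>XML trees</t><ab>synopsis of data</ab></p>
  <p><y>2005</y><t>Relational joins</t><ab>data synopsis</ab></p>
</root>
"""

def count_binding_tuples(root):
    """Count tuples (q0, q1, q2, q3) that satisfy the twig query."""
    count = 0
    for q1 in root.findall("p"):
        # Range predicate on the numerical child: y > 1999
        if not any(int(y.text) > 1999 for y in q1.findall("y")):
            continue
        # Substring predicate on the string child: contains(XML)
        q2s = [t for t in q1.findall("t") if "XML" in t.text]
        # Term-containment predicate on the text child: ftcontains(synopsis,data)
        q3s = [ab for ab in q1.findall("ab")
               if {"synopsis", "data"} <= set(ab.text.lower().split())]
        # Each combination of matching children yields one binding tuple.
        count += len(q2s) * len(q3s)
    return count

print(count_binding_tuples(ET.fromstring(DOC)))
```

Only the first p element contributes here: the second fails the range predicate and the third has no title containing "XML".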
Problem Definition
Problem: build a data synopsis that can estimate the selectivity of any query
Challenges:
  Heterogeneity of content
  Data correlations
XCluster Model
Structural Summarization
Node <=> Elements of same tag
Statistical information: node- and edge-counts
  Node-count: number of elements in cluster
  Edge-count: average number of children
[Figure: XML data and the corresponding XCluster]
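The node- and edge-counts can be sketched with a few lines of Python. This example assumes the coarsest possible clustering, one node per tag, which is only one of many clusterings a synopsis could use; the function name `tag_summary` and the sample document are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def tag_summary(root):
    """Return (node_count, edge_count) for a one-cluster-per-tag synopsis."""
    node_count = defaultdict(int)
    child_total = defaultdict(int)   # (parent tag, child tag) -> total children
    for elem in root.iter():
        node_count[elem.tag] += 1
        for child in elem:
            child_total[(elem.tag, child.tag)] += 1
    # Edge-count: average number of children per element of the parent cluster.
    edge_count = {edge: total / node_count[edge[0]]
                  for edge, total in child_total.items()}
    return dict(node_count), edge_count

root = ET.fromstring("<a><b><c/><c/><c/></b><b><c/></b></a>")
nodes, edges = tag_summary(root)
print(nodes)   # node-count per tag
print(edges)   # average number of children per parent element
```

The two b elements have 3 and 1 children in c, so the edge-count b -> c is their average, 2.0, which is exact for neither element. That gap is what tighter clusters reduce.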
Value Summarization
Value summary => Fractional value distribution
  Single-dimensional
  Approximation method depends on value type
[Figure: XML data and the corresponding XCluster]
Types of Value Summaries
Numerical Content => Histograms
String Content => Pruned Suffix Tries
Text Content => End-biased Term Histograms

Text: “The history of histograms is long and rich, full of detailed information in every step. It...”

Term Matrix:
Term              Freq
0 (history)       2
1 (histogram)     7
2 (data)          6
3 (database)      5
4 (information)   3
5 (value)         2

Term Histogram:
Bucket   Freq
010000   7
001000   6
000100   5
100011   7/3
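The term histogram on this slide can be reproduced with a short sketch: the most frequent terms keep exact frequencies, and the remaining terms share one uniform bucket that stores their average. The function names and the `num_singletons` parameter are assumptions for illustration; the frequencies are the ones from the term matrix.

```python
from fractions import Fraction

def end_biased_term_histogram(freqs, num_singletons):
    """Keep the most frequent terms exactly; average the rest in one bucket."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    singletons = {t: Fraction(freqs[t]) for t in ranked[:num_singletons]}
    rest = ranked[num_singletons:]
    uniform = Fraction(sum(freqs[t] for t in rest), len(rest)) if rest else Fraction(0)
    return singletons, set(rest), uniform

def estimate(term, singletons, uniform_terms, uniform_freq):
    """Estimated frequency of a term under the histogram."""
    if term in singletons:
        return singletons[term]
    return uniform_freq if term in uniform_terms else Fraction(0)

freqs = {"history": 2, "histogram": 7, "data": 6,
         "database": 5, "information": 3, "value": 2}
hist = end_biased_term_histogram(freqs, num_singletons=3)
print(hist[2])                      # average frequency of the uniform bucket
print(estimate("history", *hist))   # estimate for a term in the uniform bucket
```

With three singletons, the uniform bucket holds history, information, and value, whose frequencies 2, 3, and 2 average to 7/3, matching bucket 100011.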
XCluster Model
A node aggregates information about its elements
Correspondence to clustering: node <=> cluster <=> centroid element
Basic assumptions: independence and uniformity
  Tight clusters => Valid assumptions
Each element in A has:
- 2 children in B
- 3 children in C
- value x with prob. 70%
- value y with prob. 30%
Estimation Example
[Figure: XCluster, Query, and one Embedding of the query, annotated step by step: 1 element, 2 children, 1*st children, 1/2*sk children]
sel(Q)=(1)*(2)*(1*st)*(1/2*sk)
Two-step estimation algorithm:
  Identify embeddings
  Estimate selectivity of each embedding
Accuracy depends on “tightness” of centroids
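The product in the example can be sketched as follows, under two assumptions that the slide does not spell out: st and sk are read as the estimated fractions of elements that satisfy the value predicates on the last two steps, and the estimates of all embeddings are summed to obtain the query selectivity. The numeric values chosen for st and sk are hypothetical.

```python
from math import prod

def embedding_selectivity(steps):
    """Multiply count * predicate selectivity along one embedding.

    Each step is (count, selectivity): the node-count for the first step,
    an edge-count for every later step, and the estimated fraction of
    elements satisfying the step's value predicate (1.0 if there is none).
    """
    return prod(count * sel for count, sel in steps)

def query_selectivity(embeddings):
    """Assumed combination rule: add up the estimates of all embeddings."""
    return sum(embedding_selectivity(e) for e in embeddings)

s_t, s_k = 0.5, 0.2   # hypothetical predicate selectivities
embedding = [(1, 1.0), (2, 1.0), (1, s_t), (0.5, s_k)]
print(query_selectivity([embedding]))
```

The multiplication relies on the independence and uniformity assumptions of the model, so the result is only as good as the clusters are tight.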
XCluster Compression
Structural Compression
Merge two nodes of same tag
New node acquires aggregate characteristics
  Node- and edge-counts are aggregated
  Value summaries are “fused”
Conceptually equivalent to cluster merging
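A minimal sketch of such a merge, assuming that node-counts are summed and that each edge-count is re-averaged over the elements of both nodes (the slides only say the counts are "aggregated"). The `Node` class is hypothetical and the fusion of value summaries is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    count: int                                  # node-count
    edges: dict = field(default_factory=dict)   # child id -> avg children per element

def merge(u, v):
    """Merge two same-tag nodes into one node with aggregate statistics."""
    if u.tag != v.tag:
        raise ValueError("only nodes of the same tag can be merged")
    total = u.count + v.count
    edges = {}
    for child in u.edges.keys() | v.edges.keys():
        # Total children toward `child`, re-averaged over all merged elements.
        children = u.count * u.edges.get(child, 0) + v.count * v.edges.get(child, 0)
        edges[child] = children / total
    return Node(u.tag, total, edges)

u = Node("A", 10, {"B": 2.0, "C": 3.0})
v = Node("A", 30, {"B": 4.0})
w = merge(u, v)
print(w)
```

The merged node keeps the total number of children unchanged, but per-element detail is lost: elements of u now appear to have 3.5 children in B instead of 2.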
Value-Based Compression
Reduce the storage of a single value summary
Specifics depend on type of summary
  Histogram: merge k buckets
  Pruned Suffix Trie: prune k nodes
    Remove leaf nodes based on statistical independence
  Term Histogram: move k terms to the uniform bucket
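For the term-histogram case, one compression step might look like the sketch below. It assumes the uniform bucket stores a single average frequency for all of its terms, and that the terms moved are the k least frequent singletons; the function name and the sample values are hypothetical.

```python
from fractions import Fraction

def move_to_uniform(singletons, uniform_terms, uniform_freq, k):
    """Move the k least frequent singleton terms into the uniform bucket."""
    ranked = sorted(singletons, key=singletons.get)   # ascending frequency
    moved = ranked[:k]
    kept = {t: singletons[t] for t in ranked[k:]}
    # The uniform bucket keeps one average frequency for all of its terms.
    total = uniform_freq * len(uniform_terms) + sum(singletons[t] for t in moved)
    terms = set(uniform_terms) | set(moved)
    avg = Fraction(total) / len(terms) if terms else Fraction(0)
    return kept, terms, avg

singletons = {"histogram": 7, "data": 6, "database": 5}
uniform_terms = {"history", "information", "value"}
kept, terms, avg = move_to_uniform(singletons, uniform_terms, Fraction(7, 3), k=1)
print(kept, sorted(terms), avg)
```

Each moved term saves one stored frequency, at the cost of a coarser average for every term in the uniform bucket.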
Compression vs. Accuracy
Δ(S,S’): difference in accuracy between S and S’
Key idea: apply operations with low Δ(S,S’)
Absolute vs. Relative metric
[Figure: Original XCluster S and Compressed XCluster S’; panels: Absolute (S, S’) and Relative (S, S’, R)]
Distance Metric Δ(S,S’)
μ-query => basic query involving structure+values
u[s]/c: the number of children in c per element in u that satisfies value predicate s
Intuition: capture centroid information pertaining to c and s
Δ(S,S’): difference of estimates for μ-queries
$$\Delta(S,S') = \sum_{s,c} |u| \left( u[s]/c - w[s]/c \right)^2 + \sum_{s,c} |v| \left( v[s]/c - w[s]/c \right)^2$$
[Figure: S and S’]
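The metric can be sketched in Python under a few stated assumptions: u and v are the two nodes of S that a merge replaces with node w in S’, the weights in front of the squared terms are the element counts of u and v, and each μ-query estimate x[s]/c is available as a dictionary entry keyed by (s, c). The function name and sample numbers are hypothetical.

```python
def merge_distance(u_size, u_stats, v_size, v_stats, w_stats):
    """Weighted squared difference of mu-query estimates before and after a merge.

    Each *_stats dict maps a (predicate s, child c) pair to the estimate
    x[s]/c; pairs missing from a dict are treated as 0.
    """
    keys = u_stats.keys() | v_stats.keys() | w_stats.keys()
    delta = 0.0
    for key in keys:
        w = w_stats.get(key, 0.0)
        delta += u_size * (u_stats.get(key, 0.0) - w) ** 2
        delta += v_size * (v_stats.get(key, 0.0) - w) ** 2
    return delta

key = ("y>2000", "B")   # hypothetical predicate s and child node c
d = merge_distance(10, {key: 2.0}, 30, {key: 4.0}, {key: 3.5})
print(d)
```

A merge of two nodes with identical estimates has distance 0, which is why operations with low Δ(S,S’) are preferred.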
XCluster Construction
Step 1: Build reference synopsis
  Count stability + Detailed value summaries
Step 2: Compress structural information
Step 3: Compress value-based information
[Figure: XML Data -> (Step 1) Reference Summary -> (Step 2) XCluster with detailed value distributions -> (Step 3) XCluster]
Structural Compression
Algorithm sketch:
1. Generate pool of candidate merge operations
2. Apply operations in increasing order of Δ(S,S’)
3. Repeat until size < budget
A-priori generation of candidates
  Merges at level l trigger merges at level l-1
Adaptive, leaf-to-root merging of nodes
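The greedy loop above can be sketched in a much simplified form. In this sketch a node is a hypothetical (tag, count, avg_children) tuple, synopsis size is the number of nodes, `delta` is a stand-in for Δ(S,S’) based on a single edge-count, and the candidate pool is rebuilt on every iteration instead of being generated a priori and level by level as the slide describes.

```python
from itertools import combinations

def merged(a, b):
    """Aggregate two (tag, count, avg_children) nodes of the same tag."""
    total = a[1] + b[1]
    return (a[0], total, (a[1] * a[2] + b[1] * b[2]) / total)

def delta(a, b):
    """Hypothetical stand-in for the distance: count-weighted squared change."""
    w = merged(a, b)
    return a[1] * (a[2] - w[2]) ** 2 + b[1] * (b[2] - w[2]) ** 2

def compress(nodes, budget):
    """Greedily merge same-tag nodes until the size drops below the budget."""
    nodes = list(nodes)
    while len(nodes) >= budget:
        # Candidate pool: every pair of nodes with the same tag.
        pool = [(a, b) for a, b in combinations(nodes, 2) if a[0] == b[0]]
        if not pool:
            break   # nothing left to merge
        a, b = min(pool, key=lambda pair: delta(*pair))
        nodes.remove(a)
        nodes.remove(b)
        nodes.append(merged(a, b))
    return nodes

nodes = [("A", 10, 2.0), ("A", 10, 2.5), ("A", 5, 9.0), ("B", 4, 1.0)]
result = compress(nodes, budget=4)
print(result)
```

The two A nodes with similar edge-counts are merged first, while the outlier with 9 children per element is left alone for as long as the budget allows.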
Value-Based Compression
Algorithm sketch:
1. Generate one operation for each value summary
2. Apply value compression with least Δ(S,S’)
3. Repeat until size < budget
Generate operations of “least effect”:
  Histograms: merge buckets with least difference
  PSTs: prune leaves with max independence
  Term Histograms: remove singletons of least freq.
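For the histogram case, a "least effect" operation might be sketched as below. The sketch reads "least difference" as the smallest gap in per-value density between two adjacent buckets, which is an assumption, and it uses a hypothetical (low, high, count) bucket layout over half-open ranges.

```python
def merge_closest_buckets(buckets):
    """Merge the adjacent pair of buckets whose densities differ least.

    Each bucket is (low, high, count) over the half-open range [low, high).
    """
    if len(buckets) < 2:
        return list(buckets)

    def density(b):
        return b[2] / (b[1] - b[0])

    i = min(range(len(buckets) - 1),
            key=lambda j: abs(density(buckets[j]) - density(buckets[j + 1])))
    left, right = buckets[i], buckets[i + 1]
    fused = (left[0], right[1], left[2] + right[2])
    return list(buckets[:i]) + [fused] + list(buckets[i + 2:])

hist = [(1990, 1995, 10), (1995, 2000, 12), (2000, 2005, 40)]
smaller = merge_closest_buckets(hist)
print(smaller)
```

The first two buckets have densities 2.0 and 2.4, so merging them changes range estimates far less than merging either with the dense third bucket.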
Experimental Study
Methodology
Data sets:
         #Elements   #Value Paths   Ref. Size (KB)
XMark    206130      9              869
IMDB     236822      7              462
Workloads: random twig queries
  Structure only and with predicates
  Biased toward high selectivities
Metrics:
  Absolute relative error: |true-estim|/max(true,s)
  Absolute error: |true-estim|
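The two metrics translate directly into code. The slides do not say how s is chosen; the sketch below simply treats it as a lower bound on the denominator, so that queries with very small true counts do not dominate the relative error. The sample numbers are hypothetical.

```python
def absolute_error(true, estim):
    """|true - estim|"""
    return abs(true - estim)

def absolute_relative_error(true, estim, s):
    """|true - estim| / max(true, s)"""
    return abs(true - estim) / max(true, s)

print(absolute_relative_error(200, 150, s=10))   # large true count: divide by true
print(absolute_relative_error(2, 5, s=10))       # small true count: divide by s
```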
Accuracy of XClusters
[Figure: IMDB. Estimation Error (%) vs. Synopsis Size (KB), 150 to 200 KB; series: Overall, Struct, Numeric, String, Text]
XCluster vs. TreeSketch
[Figure: XMark. Estimation Error (%) vs. Synopsis Size (KB), 0 to 50 KB; series: XCluster, TreeSketch]
Conclusions
XML synopses are essential for XML query optimization
Our contribution: XCluster Synopses
  XML summaries for heterogeneous content
  Support for twig queries with numerical, string, and textual predicates
  XCluster model: generalized element clustering
  Principled construction algorithm
  Experimental results: high accuracy with low storage requirements