The History of Histograms
Yannis Ioannidis
University of Athens, Hellas
Outline
• Prehistory
• Definitions and Framework
• The Early Past
• 10 Years Ago
• The Recent Past
• Industry
• Competitors
• The Future
Prehistory
• Word 'histogram' is of Greek origin
  ◦ 'histo-s' = 'mast'
  ◦ 'gram-ma' = 'something written'
• Not used originally in the Greek language!
• Introduced by Karl Pearson in 1892 for a “common form of graphical representation”
Prehistory
• 1662: Concept exists at least since then in mortality tables of J. Graunt
• 1786: Bar charts introduced by W. Playfair to capture Scottish imports/exports
• 1833: Histograms introduced by A. M. Guerry as discrete approximations to distribution functions
• 1859: Florence Nightingale used them to compare mortality of soldiers and civilians
Prehistory
[Figure: Playfair’s bar chart]
Definitions: Data Distributions
• One-dimensional data distribution = set of (attribute value, frequency) pairs
• Large and non-uniform ⇒ need compression and approximation
• Concentrate on numeric attributes
Definitions: Data Distributions
[Figure: frequency vs. value plot, annotated with spread and area]
Definitions: Data Distributions
• Combinations of multiple attribute values
• Joint frequency
• Multidimensional data distribution = set of (value combination, joint frequency) pairs
Definitions: Multidimensional Data Distributions
[Figure: two-dimensional data distribution over Value1 and Value2, with joint frequencies 5, 45, 17, and 2]
Motivation
• Selectivity estimation
• Approximate query answering
within
• Query optimization
• Query profiling for user feedback
• Load balancing for parallel join execution
• Partition-based temporal join execution
Definitions: Histograms
• Partition data distribution into β disjoint buckets
• Approximate values (value combinations) and frequencies within each bucket
Definitions: Histograms
[Figure: frequency vs. value plot of a data distribution, partitioned into bucket 1 and bucket 2]
Framework: Histogram Parameters
• Partition rule: 4 orthogonal parameters
  ◦ Partition class
  ◦ Sort parameter
  ◦ Partition constraint
  ◦ Source parameter
• Construction algorithm
Framework: Histogram Parameters
• Value approximation within bucket
• Frequency approximation within bucket
• Error guarantees
Framework: Partition Class
• Indicates restrictions on partitioning
  ◦ Serial: non-overlapping ranges of sort parameter values
  ◦ End-biased: at most one non-singleton bucket
Framework: Sort Parameter
• Derivative of data distribution element (its value and/or frequency)
  ◦ Attribute values (V)
  ◦ Frequencies (F)
  ◦ Areas (A) = spread × frequency
• Serial: buckets must contain contiguous sort parameter values
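As a worked illustration of the candidate sort parameters, the following Python sketch derives spreads and areas from a small (value, frequency) distribution. The function name is hypothetical, and the convention that the largest value gets a spread of 1 is an assumption made for this example.

```python
def spreads_and_areas(distribution):
    """Return (value, frequency, spread, area) rows for (value, frequency) pairs."""
    pairs = sorted(distribution)
    rows = []
    for i, (value, freq) in enumerate(pairs):
        # Spread = distance to the next larger value; assumed to be 1 for the last value.
        spread = pairs[i + 1][0] - value if i + 1 < len(pairs) else 1
        rows.append((value, freq, spread, spread * freq))
    return rows


data = [(1, 90), (5, 36), (7, 40), (15, 30), (20, 24), (32, 16), (40, 20), (50, 10)]
rows = spreads_and_areas(data)
for value, freq, spread, area in rows:
    print(value, freq, spread, area)
```

Sorting `rows` by the second, third, or fourth field then gives the ordering induced by F, S, or A as the sort parameter.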
Framework: Partition Class and Sort Parameter
FREQUENCY  VALUE
10         50
20         40
40         7
24         20
90         1
16         32
36         5
30         15
Framework: Partition Class and Sort Parameter
SORT PAR  FREQUENCY  VALUE
90        90         1
40        40         7
36        36         5
30        30         15
24        24         20
20        20         40
16        16         32
10        10         50
[Figure: the sorted table partitioned into buckets B1, B2, B3, B4; three successive slides show different bucket placements]
Framework: Source Parameter
• Derivative of data distribution element (its value and/or frequency)
  ◦ Spreads (S)
  ◦ Frequencies (F)
  ◦ Cumulative frequencies (C)
  ◦ Areas (A)
• Partition constraint applied on source parameter
Framework: Partition Constraint
• Mathematical constraint on the source parameter that partitioning must satisfy
• General direction: avoid grouping vastly different source parameter values
Framework: Partition Constraint
• Equi-sum: equalize sums
• V-optimal: minimize variance
• Maxdiff: minimize maximum difference of adjacent source values
• Compressed: preserve high source values and equalize sums of the rest
• Spline-based: minimize square root of error
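Framework: Partition Constraint and Source Parameter
SOURCE PAR  SORT PAR  FREQ  VALUE
360         90        90    1
320         40        40    7
72          36        36    5
150         30        30    15
288         24        24    20
200         20        20    40
128         16        16    32
10          10        10    50
Differences of adjacent source values marked on the slide: 248, 138, 118
Buckets: B1, B2, B3, B4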
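The numbers in this example are consistent with a maxdiff partitioning in which frequency is the sort parameter and area (spread × frequency) is the source parameter: 248, 138, and 118 are the three largest differences between adjacent source values. The Python sketch below reproduces that arithmetic. The function name is hypothetical, and the resulting bucket contents are what the rule yields here, not something read off the slide.

```python
def maxdiff_buckets(source_values, num_buckets):
    """Split a sequence (already ordered by the sort parameter) at the
    num_buckets - 1 largest differences between adjacent source values."""
    diffs = [abs(a - b) for a, b in zip(source_values, source_values[1:])]
    largest = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    cut_after = sorted(largest[:num_buckets - 1])
    buckets, start = [], 0
    for i in cut_after:
        buckets.append(source_values[start:i + 1])
        start = i + 1
    buckets.append(source_values[start:])
    return buckets


# Areas listed in decreasing order of frequency (the sort parameter).
areas = [360, 320, 72, 150, 288, 200, 128, 10]
buckets = maxdiff_buckets(areas, 4)
print(buckets)  # [[360, 320], [72, 150], [288, 200, 128], [10]]
```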
Framework: Histogram Parameters
• Notation
  class : constraint (sort, source)
• Special notation for serial partition class
  constraint (sort, source)
Framework: Histogram Parameters
• Same parameters for multidimensional histograms
• Partition rule more intricate: not always analyzable into 4 orthogonal parameters
  ◦ Often no sort parameter
The Early Past: Dark Ages
• Essentially, use of 1-bucket histograms
• Large errors
The Early Past: First Appearance
• Kooi’s PhD thesis
• equi-width histograms
  ◦ equi-width = equi-sum (V, S)
• Adopted by INGRES
The Early Past: First Appearance
[Figure: frequency vs. value plot, shown on two successive slides]
The Early Past: First Alternative
• Don’t equalize ranges of values but number of tuples in bucket
• equi-depth histograms
  ◦ equi-depth = equi-sum (V, F)
• Source is only difference
• Adopted by several commercial systems
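To make the contrast between the two equi-sum variants concrete, here is a minimal Python sketch of both. The function names, the sample distribution, and the greedy rule for closing equi-depth buckets are assumptions for illustration, not a description of any system's implementation.

```python
def equi_width(distribution, num_buckets):
    """equi-sum (V, S): every bucket covers an equal range of values."""
    pairs = sorted(distribution)
    lo, hi = pairs[0][0], pairs[-1][0]
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for value, freq in pairs:
        idx = min(int((value - lo) / width), num_buckets - 1)
        buckets[idx].append((value, freq))
    return buckets


def equi_depth(distribution, num_buckets):
    """equi-sum (V, F): every bucket holds roughly the same number of tuples."""
    pairs = sorted(distribution)
    target = sum(f for _, f in pairs) / num_buckets
    buckets, current, running = [], [], 0
    for value, freq in pairs:
        current.append((value, freq))
        running += freq
        # Close the bucket once the running count passes the next multiple of target.
        if running >= target * (len(buckets) + 1) and len(buckets) < num_buckets - 1:
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return buckets


data = [(1, 90), (5, 36), (7, 40), (15, 30), (20, 24), (32, 16), (40, 20), (50, 10)]
width_sums = [sum(f for _, f in b) for b in equi_width(data, 2)]
depth_sums = [sum(f for _, f in b) for b in equi_depth(data, 2)]
print(width_sums)  # [220, 46]
print(depth_sums)  # [166, 100]
```

On this skewed sample the equi-width buckets hold very different tuple counts, while the equi-depth buckets are closer to balanced.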
The Early Past: First Alternative
[Figure: frequency vs. value plot]
The Early Past: Optimal Sort Parameter
• Theorem: For single join queries and accurate knowledge of values, serial histograms with frequency as sort parameter are optimal.
• Generalization of practice to keep high-frequency values accurately.
The Early Past: Optimal Sort Parameters
[Figure: frequency vs. value plot]
10 Years Ago
• Theorem: For single join queries and accurate knowledge of values, serial histograms with frequency as sort parameter are optimal.
The Recent Past
• Optimal partition constraints and source parameters?
• Optimality when values are not known accurately?
• Optimal values of other histogram characteristics?
The Recent Past: Optimal Constraint and Source
• Theorem: For the average join query and accurate knowledge of values, v-optimal histograms with frequency as source parameter are optimal.
  v-optimal (F, F)
• v-optimal: minimize variance of source values
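One way to compute a partitioning that minimizes the variance of source values is dynamic programming over a sequence already ordered by the sort parameter. The Python sketch below is a minimal version under that assumption. It minimizes the total squared deviation from each bucket's mean, and the function name and sample frequencies are illustrative.

```python
def v_optimal(source_values, num_buckets):
    """Return (total squared error, buckets) for the best contiguous partition.

    source_values must already be ordered by the sort parameter."""
    n = len(source_values)
    # Prefix sums give the squared error of any contiguous run in O(1).
    ps, pss = [0.0], [0.0]
    for x in source_values:
        ps.append(ps[-1] + x)
        pss.append(pss[-1] + x * x)

    def sse(i, j):
        """Squared deviation from the mean for source_values[i:j]."""
        s = ps[j] - ps[i]
        return (pss[j] - pss[i]) - s * s / (j - i)

    inf = float("inf")
    # best[k][j]: minimal error when the first j values are split into k buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cand = best[k - 1][i] + sse(i, j)
                if cand < best[k][j]:
                    best[k][j], cut[k][j] = cand, i
    # Walk the cut table backwards to recover the bucket boundaries.
    buckets, j = [], n
    for k in range(num_buckets, 0, -1):
        i = cut[k][j]
        buckets.append(source_values[i:j])
        j = i
    return best[num_buckets][n], buckets[::-1]


freqs = [90, 40, 36, 30, 24, 20, 16, 10]  # frequencies in decreasing order
error, vopt_buckets = v_optimal(freqs, 3)
print(vopt_buckets)  # [[90], [40, 36, 30], [24, 20, 16, 10]]
```

The outlying frequency 90 ends up in its own bucket, which matches the practice of keeping high-frequency values accurately.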
The Recent Past: Optimal Constraint and Source
[Figure: frequency vs. value plot]
The Recent Past
• If values are not known accurately, no optimality result on any histogram characteristic
• Several experimental results identify key choices
The Recent Past: New Partition Constraints
• All try to group similar source values
• max-diff: bucket borders at highest differences of adjacent source values
• compressed: preserve high values of source and equalize sums of the rest
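The compressed constraint can be sketched in Python as follows: the highest-frequency values become singleton buckets and the remaining values are partitioned equi-sum on frequency. Passing a fixed number of singletons as a parameter is a simplifying assumption of this sketch, as are the function name and sample data.

```python
def compressed(distribution, num_singletons, num_rest_buckets):
    """compressed (V, F) sketch: singleton buckets for the highest frequencies,
    equi-sum partitioning (on frequency) of the remaining values."""
    by_freq = sorted(distribution, key=lambda p: p[1], reverse=True)
    singletons = sorted(by_freq[:num_singletons])
    rest = sorted(by_freq[num_singletons:])
    target = sum(f for _, f in rest) / num_rest_buckets
    buckets, current, running = [], [], 0
    for value, freq in rest:
        current.append((value, freq))
        running += freq
        # Close the bucket once the running count passes the next multiple of target.
        if running >= target * (len(buckets) + 1) and len(buckets) < num_rest_buckets - 1:
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return singletons, buckets


data = [(1, 90), (5, 36), (7, 40), (15, 30), (20, 24), (32, 16), (40, 20), (50, 10)]
singletons, rest_buckets = compressed(data, 2, 2)
print(singletons)    # [(1, 90), (7, 40)]
print(rest_buckets)  # [[(5, 36), (15, 30), (20, 24)], [(32, 16), (40, 20), (50, 10)]]
```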
The Recent Past: maxdiff
[Figure: frequency vs. value plot]
The Recent Past: compressed
[Figure: frequency vs. value plot]
The Recent Past: Alternative Partition Constraints
• Variations on the optimal knot placement problem
  ◦ Linear splines only
  ◦ Discontinuous across bucket boundaries
The Recent Past: New Sort and Source Parameters
• Choices
  ◦ Attribute values (V)
  ◦ Spreads (S)
  ◦ Frequencies (F)
  ◦ Areas (A)
  ◦ Cumulative frequencies (C)
• Value is the best sort parameter overall
• Area and frequency are the best source parameters overall
The Recent Past: Multidimensional Partition Rules
• Multidimensional value domain cannot be sorted to serve as sort parameter
• Many alternatives to partition the space of values into buckets
• Although possible, frequency has not been used as sort parameter
The Recent Past: Multidimensional Partition Class
• A la Grid File
• A la K-D-B-Tree (MHIST)
• GENHIST
• STHoles
The Recent Past: Multidimensional Data Distributions
[Figure: two-dimensional data distribution over Value1 and Value2, with joint frequencies 5, 45, 17, and 2]
The Recent Past: M-D Partition Class: Grid File
[Figure: partitioning of the Value1/Value2 space]
The Recent Past: M-D Partition Class: MHIST
[Figure: partitioning of the Value1/Value2 space]
The Recent Past: M-D Partition Class: GENHIST
[Figure: partitioning of the Value1/Value2 space, shown on three successive slides]
The Recent Past: M-D Partition Class: STHoles
[Figure: partitioning of the Value1/Value2 space]
The Recent Past: Histogram Framework
• Partition rule
  ◦ Partition class
  ◦ Sort parameter
  ◦ Partition constraint
  ◦ Source parameter
• Construction algorithm
• Value and frequency approximation
• Error guarantees
The Recent Past: Value Approximation
• Continuous value assumption: (min and) max value
• Uniform spread assumption: above + number of unique values
• Popularity-based spread: above with “fake” number of unique values
• Kernel estimation
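The first two assumptions can be illustrated by reconstructing the approximate value set of a bucket from what is stored for it. This Python sketch uses hypothetical function names, and the integer domain used for the continuous case is an assumption.

```python
def continuous_values(lo, hi):
    """Continuous value assumption (integer domain assumed here):
    every value between the stored min and max is taken to occur."""
    return list(range(lo, hi + 1))


def uniform_spread_values(lo, hi, num_distinct):
    """Uniform spread assumption: the stored number of distinct values
    is placed at equal distances between min and max."""
    if num_distinct == 1:
        return [lo]
    step = (hi - lo) / (num_distinct - 1)
    return [lo + i * step for i in range(num_distinct)]


print(continuous_values(5, 8))           # [5, 6, 7, 8]
print(uniform_spread_values(5, 20, 4))   # [5.0, 10.0, 15.0, 20.0]
```

The uniform spread variant needs one more stored number per bucket, which illustrates the tradeoff between bucket count and per-bucket information.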
The Recent Past: Value Approximation
[Figure: frequency vs. value plots of a bucket between min and max, shown on two successive slides labelled 7 and 24]
The Recent Past: Value Approximation
• All generalized to multidimensional case
• Tradeoff between number of buckets and information kept within each bucket
The Recent Past: Frequency Approximation
• Uniform distribution assumption: average frequency
• Linear spline approximation: above + spline’s angle
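Under the uniform distribution assumption, every value in a bucket is given the bucket's average frequency. The Python sketch below combines it with the uniform spread assumption to estimate how many tuples in one bucket satisfy `value <= upper`. The bucket layout `(lo, hi, num_distinct, total_freq)` and the function name are hypothetical.

```python
def estimate_range(bucket, upper):
    """Estimated number of tuples in the bucket with value <= upper."""
    lo, hi, num_distinct, total_freq = bucket
    # Uniform distribution assumption: each distinct value has the average frequency.
    avg_freq = total_freq / num_distinct
    # Uniform spread assumption: distinct values are equally spaced in [lo, hi].
    step = (hi - lo) / (num_distinct - 1) if num_distinct > 1 else 0
    values = [lo + i * step for i in range(num_distinct)]
    return avg_freq * sum(1 for v in values if v <= upper)


bucket = (5, 20, 4, 120)  # values assumed at 5, 10, 15, 20 with frequency 30 each
estimate = estimate_range(bucket, 12)
print(estimate)  # 60.0
```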
The Recent Past: Frequency Approximation
[Figure: frequency vs. value plot]
Industrial Presence
• Only 1-dimensional histograms
• 1970’s: trivial histograms (1 bucket)
• 1980’s: equi-width histograms
• 1990’s: equi-depth histograms
• 2000’s: a
Industrial Presence: DB2
compressed (V, F)
• Default of 10 singleton and 20 non-singleton buckets
• Store cumulative frequencies
• Construction based on reservoir sample
• Indices used to quantify dependencies
• LEO learning is key
Industrial Presence: ORACLE
equi-depth = equi-sum (V, F)
• Indices used to quantify dependencies
• On-the-fly dependence estimation
• Past selectivities stored for future use
Industrial Presence: SQL Server
max-diff (V, F)
• Up to 199 buckets
• Store cumulative frequencies
• Store frequency of max accurately
• Construction based on sample
• Indices used to quantify dependencies
Histogram Competitors
• Wavelets
• Sampling (usually complementary)
• Specialized techniques
The Future
• Histograms and clustering
• Bucket recognition and representation
• Histograms and tree indices
• Value approximation
• Comprehensive technique comparison
• Other data types
The Future: Histograms and Clustering
• Clustering is “identical” problem!
  ◦ Grouping of similar elements into buckets (bucket = cluster = pattern)
  ◦ Small approximation within bucket
• Multidimensional elements are
  ◦ attribute value combinations
  ◦ above + frequency
The Future: Histograms and Clustering
[Figure: frequency vs. value plot of an example distribution (values from 3 to 43, frequency axis up to 80), shown on three successive slides]
The Future: Histograms and Clustering
• Very different techniques
• Apply on one problem techniques developed for the other
  ◦ Partition rules
  ◦ Construction algorithms
  ◦ Approximate representations within bucket
The Future: Bucket Recognition and Representation
• Essence of histograms or clustering
  ◦ Identify groups of similar elements
  ◦ Similarity on few characteristics (source)
  ◦ Store approximation of these characteristics
• Which are the similar characteristics? [Pattern Recognition]
The Future: Bucket Recognition and Representation
• Maybe not original element dimensions
• Maybe not the same for all groups
The Future: Bucket Recognition and Representation
[Figure: frequency vs. value plots of an example distribution (frequency axis up to 80), shown on four successive slides]
The Future: Bucket Recognition and Representation
• Not clustering in the value-frequency space, but the spread-frequency space
• Why the difference in treatment?
• Is this always better?
• How can we recognize the winner?
The Future: Bucket Recognition and Representation
[Figure: frequency vs. value plot, shown on four successive slides]
The Future: Histograms and Tree Indices
• Root of the B+ tree partitions space of values into non-overlapping buckets
• Each bucket further subdivided into smaller buckets
• Appropriate info next to each bucket turns each node into a histogram
• Entire B+ tree becomes Hierarchical Histogram
The Future: Histograms and Tree Indices
[Figure: tree index example]
The Future: Histograms and Tree Indices
- Index fanout decreases
+ Indexing and estimation in one
+ Incremental estimation with increasing estimate accuracy
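The idea of incremental estimation can be sketched with a toy tree in which every node stores the value range it covers and the number of tuples below it. Descending more levels refines the estimate. The `Node` layout, the function name, and the uniform interpolation inside a node are assumptions of this Python sketch, not a design taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: float      # smallest value covered by this node
    hi: float      # largest value covered by this node
    count: int     # number of tuples stored below this node
    children: list = field(default_factory=list)


def estimate_le(node, x, depth):
    """Estimate the number of tuples with value <= x, descending at most depth levels."""
    if x >= node.hi:
        return float(node.count)
    if x < node.lo:
        return 0.0
    if depth == 0 or not node.children:
        # Treat the node as one bucket whose tuples are spread uniformly over [lo, hi].
        return node.count * (x - node.lo) / (node.hi - node.lo)
    return sum(estimate_le(child, x, depth - 1) for child in node.children)


root = Node(0, 20, 50, [Node(0, 10, 40), Node(10, 20, 10)])
coarse = estimate_le(root, 5, 0)   # root only
refined = estimate_le(root, 5, 1)  # one level deeper
print(coarse, refined)  # 12.5 20.0
```

Stopping at the root gives a quick, coarse answer. Each extra level visited uses finer buckets, at the cost of reading more nodes.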
The Future: Histograms and Tree Indices
• B+ tree node “is” equi-depth histogram
• What kind of trees with other constraints?
  ◦ V-optimal
  ◦ Max-diff
  ◦ Compressed
• Unbalanced trees: exact search slower
• Unbalanced trees: approximate answers more accurate
The Future: Histograms and Tree Indices
• Take into account query frequency
• Represent popular values more accurately, higher in the tree
• New hierarchical histograms/indices may be faster than traditional ones
Conclusions
• Histograms very successful in databases
• Possibly best tradeoff between
  ◦ Simplicity
  ◦ Efficiency
  ◦ Effectiveness
  ◦ Applicability
The Future
• New approaches to some characteristics
• Untouched foundational problems
The next 10 years: even more exciting!