The History of Histograms - University of Cretehy460/pdf/histogramsvldb03.pdf · The History of...

The History of Histograms

Yannis Ioannidis
University of Athens, Hellas

Outline

• Prehistory
• Definitions and Framework
• The Early Past
• 10 Years Ago
• The Recent Past
• Industry
• Competitors
• The Future

Prehistory

• Word 'histogram' of Greek origin
  ◦ 'histo-s' = 'mast'
  ◦ 'gram-ma' = 'something written'
• Not used originally in the Greek language!
• Introduced by Karl Pearson in 1892 for a "common form of graphical representation"

Prehistory

• 1662: Concept exists at least since then in mortality tables of J. Graunt
• 1786: Bar charts introduced by W. Playfair to capture Scottish imports/exports
• 1833: Histograms introduced by A. M. Guerry as discrete approximations to distribution functions
• 1859: Florence Nightingale used them to compare mortality of soldiers and civilians

Prehistory

• Playfair's bar chart

Definitions: Data Distributions

• One-dimensional data distribution = Set of (attribute value, frequency) pairs
• Large and non-uniform ⇒ need compression and approximation
• Concentrate on numeric attributes


[Figure: frequency vs. value plot of a data distribution, with spread and area marked]
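A minimal Python sketch of these definitions, not taken from the slides: it builds the (value, frequency) pairs of a column and attaches a spread and an area to each pair. The function names are hypothetical. The spread definition used here (gap to the next larger value, with the largest value given a spread of 1) is an assumption; area = spread x frequency follows the definition the deck gives under Sort Parameter.

```python
from collections import Counter


def data_distribution(column):
    """Return the sorted (value, frequency) pairs of a numeric column."""
    return sorted(Counter(column).items())


def spreads_and_areas(dist, last_spread=1):
    """Attach (spread, area) to each (value, frequency) pair.

    Assumption: the spread of a value is the gap to the next larger
    value; the largest value gets `last_spread`.
    """
    rows = []
    for i, (value, freq) in enumerate(dist):
        if i + 1 < len(dist):
            spread = dist[i + 1][0] - value
        else:
            spread = last_spread
        rows.append((value, freq, spread, spread * freq))
    return rows


column = [1, 1, 1, 5, 5, 7, 15]
dist = data_distribution(column)
rows = spreads_and_areas(dist)
```

Each row is a (value, frequency, spread, area) tuple, so any of the four can later serve as a sort or source parameter.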


Definitions: Multidimensional Data Distributions

• Combinations of multiple attribute values
• Joint frequency
• Multidimensional data distributions = Set of (value combination, joint frequency) pairs

[Figure: joint frequencies over Value1 and Value2 axes; labeled frequencies 5, 45, 17, 2]

Motivation

• Selectivity estimation
• Approximate query answering

within

• Query optimization
• Query profiling for user feedback
• Load balancing for parallel join execution
• Partition-based temporal join execution

Definitions: Histograms

• Partition data distribution into β disjoint buckets
• Approximate values (value combinations) and frequencies within each bucket


[Figure: two frequency vs. value plots; the data distribution is partitioned into bucket 1 and bucket 2]

Framework: Histogram Parameters

• Partition rule: 4 orthogonal parameters
  ◦ Partition class
  ◦ Sort parameter
  ◦ Partition constraint
  ◦ Source parameter
• Construction algorithm
• Value approximation within bucket
• Frequency approximation within bucket
• Error guarantees

Framework: Partition Class

• Indicates restrictions on partitioning
  ◦ Serial: non-overlapping ranges of sort parameter values
  ◦ End-biased: at most one non-singleton bucket

Framework: Sort Parameter

• Derivative of data distribution element (its value and/or frequency)
  ◦ Attribute values (V)
  ◦ Frequencies (F)
  ◦ Areas (A) = spread x frequency
• Serial: buckets must contain contiguous sort parameter values

Framework: Partition Class and Sort Parameter

FREQUENCY  VALUE
10         50
20         40
40         7
24         20
90         1
16         32
36         5
30         15

Ordered by the sort parameter (here the frequency):

SORT PAR  FREQUENCY  VALUE
90        90         1
40        40         7
36        36         5
30        30         15
24        24         20
20        20         40
16        16         32
10        10         50

[Figure: the ordered table shown three times, each with a different grouping of its rows into buckets B1 to B4]

Framework: Source Parameter

• Derivative of data distribution element (its value and/or frequency)
  ◦ Spreads (S)
  ◦ Frequencies (F)
  ◦ Cumulative frequencies (C)
  ◦ Areas (A)
• Partition constraint applied on source parameter

Framework: Partition Constraint

• Mathematical constraint on the source parameter that partitioning must satisfy
• General direction: Avoid grouping vastly different source parameter values

Framework: Partition Constraint

• Equi-sum: equalize sums
• V-optimal: minimize variance
• Maxdiff: minimize maximum difference of adjacent source values
• Compressed: preserve high source values and equalize sums of the rest
• Spline-based: minimize square root of error

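Framework: Partition Constraint and Source Parameter

SOURCE PAR  SORT PAR  FREQ  VALUE
360         90        90    1
320         40        40    7
72          36        36    5
150         30        30    15
288         24        24    20
200         20        20    40
128         16        16    32
10          10        10    50

Rows are grouped into buckets B1 to B4. The source parameter values equal spread x frequency (the area), with a spread of 1 for the largest value. Marked differences between adjacent source parameter values: 248 (320 to 72), 138 (150 to 288), 118 (128 to 10).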
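The example can be replayed in a few lines of Python. This is a sketch under stated assumptions rather than the construction algorithm of any particular system: the helper names are hypothetical, the source parameter is taken to be the area (spread x frequency, with spread 1 for the largest value), and maxdiff is read as "place the bucket borders at the largest differences between adjacent source values", as the deck describes it under New Partition Constraints.

```python
def areas_in_row_order(rows, last_spread=1):
    """rows: (frequency, value) pairs. Return spread * frequency per row.

    Assumption: spread = gap to the next larger value, and the largest
    value gets `last_spread`.
    """
    values = sorted(v for _, v in rows)
    spread = {v: nxt - v for v, nxt in zip(values, values[1:])}
    spread[values[-1]] = last_spread
    return [spread[v] * f for f, v in rows]


def maxdiff_borders(source, buckets):
    """Indices i such that a border falls between source[i] and source[i + 1].

    `source` must already be ordered by the sort parameter.
    """
    diffs = [(abs(source[i + 1] - source[i]), i) for i in range(len(source) - 1)]
    largest = sorted(diffs, reverse=True)[:buckets - 1]
    return sorted(i for _, i in largest)


def split(rows, borders):
    """Cut `rows` into contiguous groups after each border index."""
    groups, start = [], 0
    for b in borders:
        groups.append(rows[start:b + 1])
        start = b + 1
    groups.append(rows[start:])
    return groups


# (frequency, value) pairs, already ordered by frequency (the sort parameter).
rows = [(90, 1), (40, 7), (36, 5), (30, 15), (24, 20), (20, 40), (16, 32), (10, 50)]
source = areas_in_row_order(rows)
borders = maxdiff_borders(source, buckets=4)
groups = split(rows, borders)
```

With four buckets the three largest adjacent differences are 248, 138 and 118, which gives groups of sizes 2, 2, 3 and 1.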


• Notation: class : constraint (sort, source)
• Special notation for serial partition class: constraint (sort, source)

• Same parameters for multidimensional histograms
• Partition rule more intricate: not always analyzable into 4 orthogonal parameters
  ◦ Often no sort parameter

The Early Past: Dark Ages

• Essentially, use of 1-bucket histograms
• Large errors

The Early Past: First Appearance

• Kooi's PhD Thesis
• equi-width histograms
  ◦ equi-width = equi-sum (V, S)
• Adopted by INGRES

[Figure: two frequency vs. value plots]
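A short Python sketch of an equi-width histogram, assuming the input is a sorted list of (value, frequency) pairs. The function name and the (low, high, count) bucket layout are hypothetical choices for illustration, not INGRES internals.

```python
def equi_width(dist, buckets):
    """Build an equi-width histogram from sorted (value, frequency) pairs.

    Returns (low, high, count) triples whose value ranges all have the
    same width.
    """
    lo, hi = dist[0][0], dist[-1][0]
    if hi == lo:
        # A single distinct value: one bucket holds everything.
        return [(lo, hi, sum(f for _, f in dist))]
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for value, freq in dist:
        # The maximum value would index one past the end, so clamp it.
        index = min(int((value - lo) / width), buckets - 1)
        counts[index] += freq
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(buckets)]


dist = [(0, 5), (1, 3), (4, 2), (9, 10), (10, 1)]
histogram = equi_width(dist, 2)
```

Only the value range is equalized here. The tuple counts per bucket can differ a lot when the data is skewed.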

The Early Past: First Alternative

• Don't equalize ranges of values but number of tuples in bucket
• equi-depth histograms
  ◦ equi-depth = equi-sum (V, F)
• Source parameter is the only difference from equi-width
• Adopted by several commercial systems

[Figure: frequency vs. value plot]
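For comparison, a minimal equi-depth sketch in Python, again over sorted (value, frequency) pairs. The function name and the greedy closing rule (close a bucket once the running tuple count reaches the next multiple of total / buckets) are assumptions made for illustration.

```python
def equi_depth(dist, buckets):
    """Group sorted (value, frequency) pairs so that bucket sums are roughly equal.

    A distinct value is never split across buckets, so a very frequent
    value can leave fewer than `buckets` groups.
    """
    total = sum(freq for _, freq in dist)
    groups, current, seen = [], [], 0
    for value, freq in dist:
        current.append((value, freq))
        seen += freq
        # Close the bucket once the running count reaches the next target.
        if len(groups) < buckets - 1 and seen >= total * (len(groups) + 1) / buckets:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


dist = [(1, 10), (2, 10), (3, 10), (4, 10), (5, 10), (6, 10)]
groups = equi_depth(dist, 3)
depths = [sum(freq for _, freq in g) for g in groups]
```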

The Early Past: Optimal Sort Parameter

• Theorem: For single join queries and accurate knowledge of values, serial histograms with frequency as sort parameter are optimal.
• Generalization of practice to keep high-frequency values accurately.

The Early Past: Optimal Sort Parameters

[Figure: frequency vs. value plot]

10 Years Ago

• Theorem: For single join queries and accurate knowledge of values, serial histograms with frequency as sort parameter are optimal.

The Recent Past

• Optimal partition constraints and source parameters?
• Optimality when values are not known accurately?
• Optimal values of other histogram characteristics?

The Recent Past: Optimal Constraint and Source

• Theorem: For the average join query and accurate knowledge of values, v-optimal histograms with frequency as source parameter are optimal: v-optimal (F, F)
• v-optimal: minimize variance of source values

The Recent Past: Optimal Constraint and Source

[Figure: frequency vs. value plot]
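One way to make "minimize variance of source values" concrete is a dynamic program over contiguous groups. The Python sketch below is an illustration under assumptions, not the construction algorithm from the talk: the function name is hypothetical, the input is assumed to be already ordered by the sort parameter, and "variance" is taken as the total squared deviation of each source value from its bucket mean.

```python
def v_optimal(source, buckets):
    """Split `source` into at most `buckets` contiguous groups that
    minimize the summed squared deviation from each group's mean.

    Returns (groups, error). Runs in O(buckets * n^2) time.
    """
    n = len(source)
    prefix, prefix_sq = [0.0], [0.0]
    for x in source:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def sse(i, j):
        """Squared error of source[i:j] around its mean."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # best[k][j]: minimum error for the first j items using exactly k groups.
    best = [[inf] * (n + 1) for _ in range(buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, buckets + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + sse(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    cut[k][j] = i

    # Walk the recorded cut points backwards to recover the groups.
    k = min(buckets, n)
    error = best[k][n]
    bounds, j = [], n
    while k > 0:
        i = cut[k][j]
        bounds.append((i, j))
        j, k = i, k - 1
    bounds.reverse()
    return [source[i:j] for i, j in bounds], error


groups, error = v_optimal([1, 1, 1, 10, 10, 10], 2)
skewed_groups, skewed_error = v_optimal([90, 40, 36, 30, 24, 20, 16, 10], 2)
```

On the skewed frequency list the single high frequency ends up alone in its bucket, which matches the intuition of keeping high-frequency values accurate.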

The Recent Past

• If values are not known accurately, no optimality result on any histogram characteristic
• Several experimental results identify key choices

The Recent Past: New Partition Constraints

• All try to group similar source values
• max-diff: bucket borders at highest differences of adjacent source values
• compressed: Preserve high values of source and equalize sums of the rest

The Recent Past: maxdiff

[Figure: frequency vs. value plot]

The Recent Past: compressed

[Figure: frequency vs. value plot]
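A hedged Python sketch of the compressed constraint with frequency as the source parameter: the highest frequencies go into singleton buckets and the remaining values are grouped by an equi-sum pass. The function names, the parameter split into `singletons` and `buckets`, and the greedy equi-sum helper are assumptions made for illustration.

```python
def equi_sum(dist, buckets):
    """Greedy equi-sum grouping of sorted (value, frequency) pairs."""
    total = sum(freq for _, freq in dist)
    groups, current, seen = [], [], 0
    for value, freq in dist:
        current.append((value, freq))
        seen += freq
        # Close the bucket once the running sum reaches the next target.
        if len(groups) < buckets - 1 and seen >= total * (len(groups) + 1) / buckets:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def compressed(dist, singletons, buckets):
    """Keep the `singletons` highest frequencies exactly; equi-sum the rest.

    Returns (singleton_pairs, groups_of_remaining_pairs).
    """
    by_freq = sorted(dist, key=lambda pair: pair[1], reverse=True)
    kept = sorted(by_freq[:singletons])
    rest = sorted(by_freq[singletons:])
    return kept, equi_sum(rest, buckets)


dist = [(1, 100), (2, 5), (3, 5), (4, 90), (5, 5), (6, 5)]
kept, groups = compressed(dist, singletons=2, buckets=2)
```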

The Recent Past: Alternative Partition Constraints

• Variations on the optimal knot placement problem
  ◦ Linear splines only
  ◦ Discontinuous across bucket boundaries

The Recent Past: New Sort and Source Parameters

• Choices
  ◦ Attribute values (V)
  ◦ Spreads (S)
  ◦ Frequencies (F)
  ◦ Areas (A)
  ◦ Cumulative frequencies (C)
• value is best sort parameter overall
• area and frequency are best source parameters overall

The Recent Past: Multidimensional Partition Rules

• Multidimensional value domain cannot be sorted to serve as sort parameter
• Many alternatives to partition the space of values into buckets
• Although possible, frequency has not been used as sort parameter

The Recent Past: Multidimensional Partition Class

• A la Grid File
• A la K-D-B-Tree (MHIST)
• GENHIST
• STHoles

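The Recent Past: Multidimensional Data Distributions

[Figure: joint frequencies over Value1 and Value2 axes; labeled frequencies 5, 45, 17, 2]

The Recent Past: M-D Partition Class: Grid File

[Figure: partitioning of the Value1 x Value2 space]

The Recent Past: M-D Partition Class: MHIST

[Figure: partitioning of the Value1 x Value2 space]

The Recent Past: M-D Partition Class: GENHIST

[Figure: three panels, each over Value1 and Value2 axes]

The Recent Past: M-D Partition Class: STHoles

[Figure: partitioning of the Value1 x Value2 space]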

The Recent Past: Histogram Framework

• Partition rule
  ◦ Partition class
  ◦ Sort parameter
  ◦ Partition constraint
  ◦ Source parameter
• Construction algorithm
• Value and frequency approximation
• Error guarantees

The Recent Past: Value Approximation

• Continuous value assumption: (min and) max value
• Uniform spread assumption: above + number of unique values
• Popularity-based spread: above with "fake" number of unique values
• Kernel estimation
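The first two assumptions can be contrasted in a small Python sketch. The function names are hypothetical, the continuous value assumption is interpreted here for an integer domain (every integer between min and max is taken to be present), and the equality estimate additionally assumes that the bucket's tuples are spread evenly over the approximated values.

```python
def continuous_values(lo, hi):
    """Continuous value assumption on an integer domain:
    every integer in [lo, hi] is assumed to occur."""
    return list(range(lo, hi + 1))


def uniform_spread_values(lo, hi, distinct):
    """Uniform spread assumption: `distinct` values placed at equal
    distances between lo and hi."""
    if distinct == 1:
        return [lo]
    step = (hi - lo) / (distinct - 1)
    return [lo + i * step for i in range(distinct)]


def equality_estimate(count, approx_values, v):
    """Estimated frequency of v: the average frequency if v is one of the
    approximated values, otherwise zero."""
    return count / len(approx_values) if v in approx_values else 0.0


# A bucket with min 10, max 30, 5 distinct values and 100 tuples.
spread_values = uniform_spread_values(10, 30, 5)
dense_values = continuous_values(10, 30)
estimate_spread = equality_estimate(100, spread_values, 15)
estimate_dense = equality_estimate(100, dense_values, 15)
```

Storing the number of distinct values costs one extra number per bucket but changes the estimate for `value = 15` from 100/21 to 20 tuples in this example.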


[Figure: two frequency vs. value plots of a bucket between min and max, labeled 7 and 24]


• All generalized to multidimensional case
• Tradeoff between number of buckets and information kept within each bucket

The Recent Past: Frequency Approximation

• Uniform distribution assumption: average frequency
• Linear spline approximation: above + spline's angle
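A small Python sketch of the two approximations for one bucket of (value, frequency) pairs. The function names are hypothetical, and fitting the line by ordinary least squares is an assumption; the slides only say that the spline's angle is stored in addition to the average frequency.

```python
def uniform_fit(bucket):
    """Uniform distribution assumption: every value gets the average frequency."""
    average = sum(freq for _, freq in bucket) / len(bucket)
    return lambda value: average


def linear_fit(bucket):
    """Linear approximation: average frequency plus a slope.

    Assumption: the slope comes from a least-squares fit, and the line
    passes through (mean value, mean frequency).
    """
    n = len(bucket)
    mean_v = sum(v for v, _ in bucket) / n
    mean_f = sum(f for _, f in bucket) / n
    sxx = sum((v - mean_v) ** 2 for v, _ in bucket)
    sxy = sum((v - mean_v) * (f - mean_f) for v, f in bucket)
    slope = sxy / sxx if sxx else 0.0
    return lambda value: mean_f + slope * (value - mean_v)


def squared_error(bucket, model):
    """Sum of squared differences between actual and approximated frequencies."""
    return sum((freq - model(value)) ** 2 for value, freq in bucket)


bucket = [(1, 10), (2, 20), (3, 30)]
flat = uniform_fit(bucket)
line = linear_fit(bucket)
```

For this bucket the flat model has a squared error of 200 while the line reproduces the frequencies exactly, at the cost of one more stored number.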

The Recent Past: Frequency Approximation

[Figure: frequency vs. value plot]

Industrial Presence

• Only 1-dimensional histograms

• 1970’s: trivial histograms (1 bucket)
• 1980’s: equi-width histograms
• 1990’s: equi-depth histograms
• 2000’s: a

Industrial Presence: DB2

compressed (V, F)
• Default of 10 singleton and 20 non-singleton buckets
• Store cumulative frequencies
• Construction based on reservoir sample
• Indices used to quantify dependencies
• LEO learning is key

Industrial Presence: ORACLE

equi-depth = equi-sum (V, F)
• Indices used to quantify dependencies
• On-the-fly dependence estimation
• Past selectivities stored for future use

Industrial Presence: SQL Server

max-diff (V, F)
• Up to 199 buckets
• Store cumulative frequencies
• Store frequency of max accurately
• Construction based on sample
• Indices used to quantify dependencies

Histogram Competitors

• Wavelets
• Sampling (usually complementary)
• Specialized techniques

The Future

• Histograms and clustering
• Bucket recognition and representation
• Histograms and tree indices
• Value approximation
• Comprehensive technique comparison
• Other data types

The Future: Histograms and Clustering

• Clustering is "identical" problem!
  ◦ Grouping of similar elements into buckets (bucket = cluster = pattern)
  ◦ Small approximation within bucket
• Multidimensional elements are
  ◦ attribute value combinations
  ◦ above + frequency


[Figure, shown three times: frequency vs. value chart over values 3, 4, 6, 9, 23, 27; frequency axis ticks 20 to 80; frequency labels 43, 30, 32, 15, 80, 50]


• Very different techniques
• Apply on one problem techniques developed for the other
  ◦ Partition rules
  ◦ Construction algorithms
  ◦ Approximate representations within bucket

The Future: Bucket Recognition and Representation

• Essence of histograms or clustering
  ◦ Identify groups of similar elements
  ◦ Similarity on few characteristics (source)
  ◦ Store approximation of these characteristics
• Which are the similar characteristics? [Pattern Recognition]

• Maybe not original element dimensions
• Maybe not the same for all groups


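[Figure: frequency vs. value chart over values 3, 4, 6, 9, 23, 27; frequency axis ticks 20 to 80; frequency labels 43, 30, 32, 15, 80, 50]

[Figure: frequency vs. value chart over values 3, 5, 7, 9, 23, 27; frequency axis ticks 20 to 80; frequency labels 30 and 65]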


• Not clustering in the value-frequency space, but in the spread-frequency space
• Why the difference in treatment?
• Is this always better?
• How can we recognize the winner?


[Figure: four frequency vs. value plots]

The Future: Histograms and Tree Indices

• Root of the B+ tree partitions space of values into non-overlapping buckets
• Each bucket further subdivided into smaller buckets
• Appropriate info next to each bucket turns each node into a histogram
• Entire B+ tree becomes Hierarchical Histogram


[Figure: B+ tree example]


- Index fanout decreases

+ Indexing and estimation in one
+ Incremental estimation with increasing estimate accuracy
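The idea of incremental estimation can be sketched in Python with a toy tree in which every node stores a value range and a tuple count. This is an illustration under assumptions, not a B+ tree implementation: the `Node` class, the half-open ranges and the uniform interpolation inside a partially covered node are all hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: float                      # inclusive lower bound of the bucket
    hi: float                      # exclusive upper bound of the bucket
    count: int                     # number of tuples below this node
    children: list = field(default_factory=list)


def estimate(node, a, b, depth):
    """Estimate the number of tuples with a <= value < b,
    descending at most `depth` levels below `node`."""
    if b <= node.lo or a >= node.hi:
        return 0.0                 # no overlap with this bucket
    if a <= node.lo and node.hi <= b:
        return float(node.count)   # bucket fully covered: count is exact
    if depth == 0 or not node.children:
        # Partial overlap: interpolate uniformly inside the bucket.
        overlap = min(b, node.hi) - max(a, node.lo)
        return node.count * overlap / (node.hi - node.lo)
    return sum(estimate(child, a, b, depth - 1) for child in node.children)


root = Node(0, 100, 100, [Node(0, 50, 90), Node(50, 100, 10)])
coarse = estimate(root, 0, 25, depth=0)
finer = estimate(root, 0, 25, depth=1)
```

Reading one more level moves the estimate for the range [0, 25) from 25 to 45 tuples, because the child counts reveal that most tuples sit in the lower half.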


• B+ tree node "is" equi-depth histogram
• What kind of trees with other constraints?
  ◦ V-optimal
  ◦ Max-diff
  ◦ Compressed
• Unbalanced trees: exact search slower
• Unbalanced trees: approximate answers more accurate


• Take into account query frequency
• Represent popular values more accurately: higher in the tree
• New hierarchical histograms/indices may be faster than traditional ones

Conclusions

• Histograms very successful in databases
• Possibly best tradeoff between
  ◦ Simplicity
  ◦ Efficiency
  ◦ Effectiveness
  ◦ Applicability

The Future

• New approaches to some characteristics
• Untouched foundational problems

The next 10 years: even more exciting!
