Histograms in MariaDB, MySQL and PostgreSQL

Sergei Petrunia, MariaDB

Santa Clara, California | April 24th – 27th, 2017

What this talk is about

● Data statistics histograms in

– MariaDB

– MySQL (status so far)

– PostgreSQL

● This is not a competitive comparison

– Rather, a survey

Histograms and query optimizers

Query optimizer needs data statistics

● Which query plan enumerates fewer rows

– orders->customers or customers->orders?

● It depends on row counts and condition selectivities

● Condition selectivity has a big impact on query speed

select *
from customers join orders on customers.cust_id=orders.customer_id
where customers.balance<1000 and orders.total>10000

Data statistics have a big impact on the optimizer

● A paper "How good are query optimizers, really?"

– Leis et al, VLDB 2015

● Conclusions section:

– "In contrast to cardinality estimation, the contribution of the cost model to the overall query performance is limited."

● This matches our experience

Data statistics usage

● Need a *cheap* way to answer questions about

– Number of rows in the table

– Condition selectivity

– Column widths

– Number of distinct values

– …

● Condition selectivity is the most challenging

Histogram as a compact data summary

● Partition the value space into buckets

● Keep an array of (bucket_bounds, n_values)

– Takes O(#buckets) space

Histogram and condition selectivity

col BETWEEN 'a' AND 'b'

● Sum row counts in the covered buckets

● Partially covered bucket?

– Assume a fraction of rows match

– This is a source of inaccuracy

● More buckets – more accurate estimates
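The estimation procedure on this slide can be sketched roughly as follows. This is an illustrative sketch, not any server's actual code: it assumes numeric bucket bounds (the slide uses strings), and the function name `range_selectivity` and the sample data are made up for the example. Partially covered buckets contribute a proportional share of their rows, which is exactly the uniformity assumption that causes inaccuracy.

```python
def range_selectivity(bounds, counts, lo, hi):
    """Estimate the fraction of rows with lo <= col <= hi.

    bounds: sorted bucket boundaries, len(counts) + 1 numbers
    counts: rows per bucket; bucket i covers bounds[i] .. bounds[i + 1]
    """
    total = sum(counts)
    matched = 0.0
    for i, n_rows in enumerate(counts):
        b_lo, b_hi = bounds[i], bounds[i + 1]
        # Length of the overlap between this bucket and the query range
        overlap = min(hi, b_hi) - max(lo, b_lo)
        if overlap <= 0:
            continue  # bucket is outside the range
        # A fully covered bucket contributes all its rows; a partially
        # covered one contributes a proportional share (uniformity assumption).
        matched += n_rows * overlap / (b_hi - b_lo)
    return matched / total


# Hypothetical histogram: four buckets of width 10
bounds = [0, 10, 20, 30, 40]
counts = [100, 300, 500, 100]
print(range_selectivity(bounds, counts, 15, 30))  # 0.65
```

Here the bucket 10..20 is half covered (150 of its 300 rows are assumed to match) and the bucket 20..30 is fully covered.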

Histogram types

● Different strategies for choosing buckets

– Equi-width

– Equi-height

– Most Common Values

– ...

Equi-width histogram

● Bucket bounds pre-defined

– Equal, log-scale, etc

● Easy to understand, easy to collect.

● Not very efficient

– Densely and sparsely-populated regions have the same #buckets

– What if densely-populated regions had more buckets?

Equi-height histogram

● Pick the bucket bounds such that each bucket has the same #rows

– Densely populated areas get more buckets

– Sparsely populated areas get fewer buckets

● Estimation error is limited by bucket size

– Which is now bounded.

Most Common Values histogram

● Suitable for enum-type domains

● All possible values fit in the histogram

● Just a list of values and frequencies

value1 count1

value2 count2

value3 count3

... ...
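A Most Common Values list is simple enough to sketch in a few lines. The following is only an illustration with made-up names (`mcv_list`, `equality_selectivity`) and made-up data. It stores frequencies as fractions of the row count rather than raw counts, which is a choice made for this example.

```python
from collections import Counter


def mcv_list(values, max_entries=10):
    """Return [(value, frequency)] for the most common values."""
    counts = Counter(values)
    total = len(values)
    return [(v, c / total) for v, c in counts.most_common(max_entries)]


def equality_selectivity(mcv, value, default=0.0):
    """Selectivity of `col = value`; `default` covers values not in the list."""
    return dict(mcv).get(value, default)


# Hypothetical enum-like column with three possible values
status = ["new"] * 6 + ["paid"] * 3 + ["void"] * 1
mcv = mcv_list(status)
print(mcv)
print(equality_selectivity(mcv, "paid"))
```

When every possible value fits in the list, as with an enum-type column, the equality estimate is exact.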

Histogram collection algorithms

● Equi-width

– Find (or guess) min and max value

– For each value

  ● Find which histogram bin it falls into

  ● Increment bin’s counter

● Equi-height

– Sort the values

– First value starts bin #0

– Value at n_values * (1/n_bins) starts bin #1

– Value at n_values * (2/n_bins) starts bin #2

– ...
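Both collection algorithms fit in a few lines. The sketch below is illustrative only: the function names and data are invented, it assumes numeric values with max > min, and it scans the whole dataset rather than sampling.

```python
def equi_width(values, n_bins):
    """Count values per bin; every bin spans the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just past the last bin; clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def equi_height(values, n_bins):
    """Return the value that starts each bin; bins hold about the same #rows."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i starts at the value found at position n * (i / n_bins)
    return [ordered[i * n // n_bins] for i in range(n_bins)]


data = [1, 2, 2, 3, 3, 3, 4, 50, 60, 100]
print(equi_width(data, 4))   # [7, 1, 1, 1]
print(equi_height(data, 5))  # [1, 2, 3, 4, 60]
```

On this skewed data the equi-width histogram spends three of its four bins on three rows, while the equi-height one puts four of its five bin starts in the dense region between 1 and 4.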

Histogram collection strategies

● Scan the whole dataset

– Used by MariaDB

– Produces a “perfect” histogram

– May be expensive

● Do random sampling

– Used by PostgreSQL (is MySQL going to do it, too?)

– Produces imprecise histograms

– Non-deterministic results

● Incremental updates

– Hard to do, not used

Summary so far

● Query optimizers need condition selectivities

● These are provided by histograms

● Histograms are compact data summaries

● Histogram types

– Width-balanced

– Height-balanced (better)

– Most-Common-Values

● Histogram collection methods

– Scan the whole dataset

– Do random sampling.

Histograms in MariaDB

Histograms in MariaDB

● Available in MariaDB 10.0

– (Stable since March, 2014)

● Used in the real world

● Good for common use cases

– Has some limitations

● Sometimes called “Engine-Independent Table Statistics”

– Although being engine-independent is not the primary point.

Histogram storage in MariaDB

● Are stored in mysql.column_stats table

CREATE TABLE mysql.column_stats (
  db_name varchar(64) NOT NULL,
  table_name varchar(64) NOT NULL,
  column_name varchar(64) NOT NULL,
  min_value varbinary(255) DEFAULT NULL,
  max_value varbinary(255) DEFAULT NULL,
  nulls_ratio decimal(12,4) DEFAULT NULL,
  avg_length decimal(12,4) DEFAULT NULL,
  avg_frequency decimal(12,4) DEFAULT NULL,
  hist_size tinyint unsigned,
  hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
  histogram varbinary(255),
  PRIMARY KEY (db_name,table_name,column_name)
);

● Very compact: max 255 bytes (per column)

Collecting a histogram

set histogram_size=255;
set histogram_type='DOUBLE_PREC_HB';

analyze table tbl persistent for all;
analyze table tbl persistent for columns (col1, col2) indexes ();
+----------+---------+----------+-----------------------------------------+
| Table    | Op      | Msg_type | Msg_text                                |
+----------+---------+----------+-----------------------------------------+
| test.tbl | analyze | status   | Engine-independent statistics collected |
| test.tbl | analyze | status   | OK                                      |
+----------+---------+----------+-----------------------------------------+

● Manual collection only

set use_stat_tables='preferably';
set optimizer_use_condition_selectivity=4;
<query>;

● Make the optimizer use it

Examining a histogram

select * from mysql.column_stats
where table_name='pop1980_cp' and column_name='firstname'
*************************** 1. row ***************************
      db_name: babynames
   table_name: pop1980_cp
  column_name: firstname
    min_value: Aaliyah
    max_value: Zvi
  nulls_ratio: 0.0000
   avg_length: 6.0551
avg_frequency: 194.4642
    hist_size: 32
    hist_type: DOUBLE_PREC_HB
    histogram: (binary data)

select decode_histogram(hist_type,histogram)
from mysql.column_stats
where table_name='pop1980_cp' and column_name='firstname'
*************************** 1. row ***************************
decode_histogram(hist_type,histogram): 0.00201,0.04048,0.03833,0.03877,0.04158,0.11852,0.07912,0.00218,0.00093,0.03940,0.07710,0.00124,0.08035,0.11992,0.03877,0.03989,0.24140
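One way to read the decode_histogram() output is sketched below. It assumes that each number is the width of one bucket, expressed as a fraction of the [min_value, max_value] range. That reading fits this output, since the numbers add up to about 1, but treat it as an assumption. Under it, a running sum gives the relative position of each bucket's upper bound, and narrow buckets mark densely populated parts of the value range.

```python
from itertools import accumulate

# Output of decode_histogram() copied from the slide
decoded = ("0.00201,0.04048,0.03833,0.03877,0.04158,0.11852,0.07912,"
           "0.00218,0.00093,0.03940,0.07710,0.00124,0.08035,0.11992,"
           "0.03877,0.03989,0.24140")

# Assumption: each value is a bucket width relative to [min_value, max_value]
widths = [float(x) for x in decoded.split(",")]

# Relative position of each bucket's upper bound within the value range
positions = list(accumulate(widths))

narrowest = widths.index(min(widths))
print(len(widths), round(positions[-1], 5))
print("narrowest bucket:", narrowest, widths[narrowest])
```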

Histograms in MariaDB - summary

● Available since MariaDB 10.0

● Special ANALYZE command to collect stats

– Does a full table scan

– May require a lot of space for big VARCHARs: MDEV-6529 “EITS ANALYZE uses disk space inefficiently for VARCHAR columns”

● Not used by the optimizer by default

– Special settings to get optimizer to use them.

Histograms in PostgreSQL

Histograms in PostgreSQL

● Data statistics

– Fraction of NULL-values

– Most common value (MCV) list

– Height-balanced histogram (excludes MCV values)

– A few other parameters

  ● avg_length

  ● n_distinct_values

  ● ...

● Collection algorithm

– One-pass random sampling

Collecting histograms in PostgreSQL

-- Global parameter specifying number of buckets
-- the default is 100
set default_statistics_target=N;

-- Can also override for specific columns
alter table tbl alter column_name set statistics N;

-- Collect the statistics
analyze tablename;

# number of inserted/updated/deleted tuples to trigger an ANALYZE
autovacuum_analyze_threshold = N

# fraction of the table size to add to autovacuum_analyze_threshold
# when deciding whether to trigger ANALYZE
autovacuum_analyze_scale_factor=N.N

postgresql.conf, or per-table
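How the two autovacuum settings combine can be sketched as a one-line check. The function name `analyze_due` is made up for this illustration. The default values 50 and 0.1 are the stock PostgreSQL defaults for autovacuum_analyze_threshold and autovacuum_analyze_scale_factor.

```python
def analyze_due(changed_tuples, table_rows, threshold=50, scale_factor=0.1):
    """True when enough tuples changed since the last ANALYZE.

    The trigger point is threshold + scale_factor * table_rows.
    """
    return changed_tuples > threshold + scale_factor * table_rows


# With the defaults, a 1M-row table is re-analyzed after ~100K changes
print(analyze_due(100_000, 1_000_000))
print(analyze_due(100_051, 1_000_000))
```

Because the trigger point grows with table size, big tables are re-analyzed less often relative to their change rate, which is one reason to lower the scale factor per table.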

Examining the histogram

select * from pg_stats where tablename='pop1980';

tablename         | pop1980
attname           | firstname
inherited         | f
null_frac         | 0
avg_width         | 7
n_distinct        | 9320
most_common_vals  | {Michael,Jennifer,Christopher,Jason,David,James,Matthew,John,Joshua,Amanda}
most_common_freqs | {0.0201067,0.0172667,0.0149067,0.0139,0.0124533,0.01164,0.0109667,0.0107133,0.0106067,0.01028}
histogram_bounds  | {Aaliyah,Belinda,Christine,Elsie,Jaron,Kamia,Lindsay,Natasha,Robin,Steven,Zuriel}
correlation       | 0.0066454
most_common_elems |

Histograms are collected by doing sampling

● src/backend/commands/analyze.c, std_typanalyze() refers to

● “Random Sampling for Histogram Construction: How much is enough?” – Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, ACM SIGMOD, 1998.

[Formula figure. Labels: histogram size; rows in table (=10^6); max relative error in bin (=0.5); error probability (=0.01); random sample size]

● 100 buckets = 30,000 rows sample
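The sample-size arithmetic can be reproduced with the bound quoted in the comment in analyze.c: r = 4 * k * ln(2n / gamma) / f^2, where k is the histogram size, n the number of rows, f the maximum relative error in a bin and gamma the error probability. The function name below is made up, and the defaults are the parameter values shown on the slide.

```python
import math


def min_sample_size(k, n=10**6, f=0.5, gamma=0.01):
    """Minimum random sample size for a k-bucket histogram."""
    return 4 * k * math.log(2 * n / gamma) / f**2


per_bucket = min_sample_size(1)
print(round(per_bucket, 2))  # about 305.82 rows per bucket

# PostgreSQL rounds the factor down to 300 rows per bucket
print(300 * 100)  # 30000 rows for the default 100 buckets
```

Note that n only appears inside the logarithm, so the sample size barely grows with the table size.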

Histogram sampling in PostgreSQL

● 30K rows are sampled from random locations in the table

– Does a skip scan forward

– “Randomly chosen rows in randomly chosen blocks”

● Choice of Most Common Values

– Sample values that are 25% more common than average

– Values that would take more than one histogram bucket.

– All seen values are MCVs? No histogram is built.

Beyond single-column histograms

● Conditions can be correlated

select ... from order_items
where shipdate='2015-12-15' AND item_name='christmas light'
(compare with item_name='swimsuit')

● Correlation can have a big effect on the combined selectivity, where 1/n and 1/m are the selectivities of the two conditions

– MIN(1/n, 1/m)

– (1/n) * (1/m)

– 0

● Multi-column “histograms” are hard

● “Possible PostgreSQL 10.0 feature: multivariate statistics”
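The three selectivity formulas on this slide correspond to three assumptions about how the two conditions relate. The sketch below uses made-up names and numbers. It takes n and m to be the numbers of distinct values in the two columns, so that each equality condition alone has selectivity 1/n or 1/m.

```python
def combined_selectivity(n, m):
    """Selectivity of `a = x AND b = y` under three correlation assumptions."""
    return {
        # One condition implies the other (christmas lights ship in December)
        "fully correlated": min(1 / n, 1 / m),
        # Conditions are unrelated; the usual optimizer default
        "independent": (1 / n) * (1 / m),
        # Conditions never hold together (swimsuits in mid-December)
        "mutually exclusive": 0.0,
    }


# Hypothetical: 365 ship dates, 1000 item names
for name, sel in combined_selectivity(365, 1000).items():
    print(f"{name}: {sel:.8f}")
```

With these numbers the fully correlated and independent estimates differ by a factor of 365, which is why single-column histograms alone can mislead the optimizer.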

PostgreSQL: Conclusions

● Collects and uses both

– Height-balanced histogram

– Most Common Values list

● Uses sampling for collection

● Can run ANALYZE yourself

– Or autovacuum will do it automatically

● Multivariate stats are in the plans

Histogram test - PostgreSQL

● Real world data, people born in 1980

Jennifer 58,381
Allison   4,868
Jennice       7

test=# explain analyze select count(*) from pop1980 where firstname='Jennifer';
                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=68456.71..68456.71 rows=1 width=8) (actual time=372.593..372.593 rows=1 loops=1)
   ->  Seq Scan on pop1980  (cost=0.00..68312.62 rows=57632 width=0) (actual time=0.288..366.058 rows=58591 loops=1)
         Filter: ((firstname)::text = 'Jennifer'::text)
         Rows Removed by Filter: 3385539
 Planning time: 0.098 ms
 Execution time: 372.625 ms

test=# explain analyze select count(*) from pop1980 where firstname='Allison';
                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=68313.66..68313.67 rows=1 width=8) (actual time=372.415..372.415 rows=1 loops=1)
   ->  Seq Scan on pop1980  (cost=0.00..68312.62 rows=413 width=0) (actual time=119.238..372.023 rows=4896 loops=1)
         Filter: ((firstname)::text = 'Allison'::text)
         Rows Removed by Filter: 3439234
 Planning time: 0.086 ms
 Execution time: 372.447 ms

test=# explain analyze select count(*) from pop1980 where firstname='Jennice';
                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=68313.66..68313.67 rows=1 width=8) (actual time=345.966..345.966 rows=1 loops=1)
   ->  Seq Scan on pop1980  (cost=0.00..68312.62 rows=413 width=0) (actual time=190.896..345.961 rows=7 loops=1)
         Filter: ((firstname)::text = 'Jennice'::text)
         Rows Removed by Filter: 3444123
 Planning time: 0.388 ms
 Execution time: 346.010 ms

Estimated vs. actual row counts (slide callouts, in query order): 0.9x, 0.08x, 103x

Histograms in MySQL

Histograms in MySQL

● Not available for use in MySQL 8.0.1

● There are pieces of histogram code, still

– This gives some clues

● Another feature that uses histograms: P_S statement latencies

– P_S.events_statements_histogram_global

– P_S.events_statements_histogram_by_digest

– These are a totally different kind of histogram

  ● Buckets are log-scale equi-width.

Sampling

● Currently has only a default implementation

– Which does a full table scan and “rolls the dice” for each row

● Assume there will be an InnoDB implementation

enum class enum_sampling_method { SYSTEM };

class handler {
  ...
  int ha_sample_init(double sampling_percentage, int sampling_seed,
                     enum_sampling_method sampling_method);
  int ha_sample_next(uchar *buf);
  int ha_sample_end();
  ...
};

● New methods for storage engine API
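The “rolls the dice for each row” behaviour of the default implementation can be sketched as below. This is a Python stand-in, not MySQL code: the function name `sample_rows` is invented, and only the parameter names are borrowed from ha_sample_init(). Every row is still read, so the scan costs as much as a full table scan; only the number of rows passed on for histogram building shrinks.

```python
import random


def sample_rows(rows, sampling_percentage, sampling_seed):
    """Scan all rows and keep each one with the given probability."""
    rng = random.Random(sampling_seed)  # seeded, so results are repeatable
    for row in rows:
        # Roll the dice: rng.random() is uniform in [0, 1)
        if rng.random() * 100.0 < sampling_percentage:
            yield row


table = list(range(1000))  # hypothetical table of 1000 rows
sample = list(sample_rows(table, 10.0, 42))
print(len(sample))
```

A fixed seed makes the sample reproducible, which sidesteps the non-determinism that sampling otherwise brings.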

Histogram storage

● Will be stored in mysql.column_stats table

CREATE TABLE mysql.column_stats (
  database_name varchar(64) COLLATE utf8_bin NOT NULL,
  table_name varchar(64) COLLATE utf8_bin NOT NULL,
  column_name varchar(64) COLLATE utf8_bin NOT NULL,
  histogram json NOT NULL,
  PRIMARY KEY (database_name,table_name,column_name)
);

● Will be stored as JSON

– No limits on size?

“Singleton” histograms

● This is what PostgreSQL calls “Most Common Values”

{
  "last-updated": "2015-11-04 15:19:51.000000",
  "histogram-type": "singleton",
  "null-values": 0.1,          // Fraction of NULL values
  "buckets": [
    [
      42,                      // Value, data type depends on the source column.
      0.001978728666831561     // "Cumulative" frequency
    ],
    …
  ]
}
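Since the stored frequencies are cumulative, the frequency of a single value has to be recovered by subtraction. The sketch below shows one way to do that; the function name and the bucket data are hypothetical, and it assumes the cumulative frequencies are fractions of all rows, NULLs included.

```python
def singleton_selectivity(buckets, value):
    """Selectivity of `col = value` from [value, cumulative_frequency] pairs."""
    prev = 0.0
    for bucket_value, cumulative in buckets:
        if bucket_value == value:
            # Own frequency = this cumulative value minus the previous one
            return cumulative - prev
        prev = cumulative
    return 0.0  # value does not appear in the histogram


# Hypothetical buckets; the remaining 0.1 would be the NULL fraction
buckets = [[10, 0.2], [42, 0.5], [99, 0.9]]
print(singleton_selectivity(buckets, 42))
```

Cumulative storage makes range conditions cheap (one lookup per end of the range) at the price of this extra subtraction for equality conditions.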

Height-balanced histograms

{
  "last-updated": "2015-11-04 15:19:51.000000",
  "histogram-type": "equi-height",
  "null-values": 0.1,          // Fraction of NULL values
  "buckets": [
    [
      "bar",                   // Lower inclusive value
      "foo",                   // Upper inclusive value
      0.001978728666831561,    // Cumulative frequency
      10                       // Number of distinct values in this bucket
    ],
    ...
  ]
}

Height-balanced histograms

  ...
  "buckets": [
    [
      "bar",                   // Lower inclusive value
      "foo",                   // Upper inclusive value
      0.001978728666831561,    // Cumulative frequency
      10                       // Number of distinct values in this bucket
    ],
    ...
  ]
}

● Why “upper inclusive value”? To support holes? At cost of 2x histogram size?

● Why frequency in each bucket? It’s equi-height, so frequencies should be the same?

● Per-bucket #distinct is interesting but doesn’t seem high-demand.
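One possible reading of the per-bucket frequency, offered here as an assumption rather than something the MySQL code confirms, is that buckets need not be exactly equal in height, for example when a very frequent value cannot be split across buckets. The sketch below recovers per-bucket frequencies from the cumulative ones; the function name and bucket data are hypothetical.

```python
def bucket_frequencies(buckets):
    """Turn cumulative frequencies into per-bucket frequencies.

    Each bucket is [lower, upper, cumulative_frequency, n_distinct].
    """
    freqs = []
    prev = 0.0
    for lower, upper, cumulative, _n_distinct in buckets:
        freqs.append((lower, upper, cumulative - prev))
        prev = cumulative
    return freqs


# Hypothetical histogram whose buckets are not perfectly equal in height
buckets = [["a", "f", 0.25, 10], ["g", "m", 0.55, 3], ["n", "z", 1.0, 12]]
for lower, upper, freq in bucket_frequencies(buckets):
    print(lower, upper, round(freq, 2))
```

If all buckets really had the same height, the stored frequencies would be redundant, which is the point of the question on the slide.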

Histograms

● “Singleton”

● Height-balanced

● Fraction of NULLs is stored

– In both kinds of histograms, so you can’t have both at the same time?

● Height-balanced allow for “gaps”

● Each bucket has #distinct (non-optional?)

MySQL histograms summary

● Seem to be coming in MySQL 8.0

● Support two types

– “Singleton”

– “Height-balanced”

● Both kinds store null-values, so they are not used together?

● “Height-balanced”

– May have “holes”?

– Stores “frequency” for each bin (?)

● Collection will probably use sampling

– Which has only full scan implementation ATM

Conclusions

Conclusions

● Histograms are compact data summaries for use by the optimizer

● PostgreSQL

– Has a mature implementation

– Uses sampling and auto-collection

● MariaDB

– Supports histograms since MariaDB 10.0

  ● Compact

  ● Height-balanced only

– Need to run ANALYZE manually and set the optimizer to use them

● MySQL

– Doesn’t have histograms, still.

– Preparing to have them in 8.0

– Will support two kinds

  ● Most common values

  ● Height-balanced “with gaps” (?)

Thanks!
