Histograms in MariaDB, MySQL and PostgreSQL
Sergei Petrunia, MariaDB
Santa Clara, California | April 24th – 27th, 2017
What this talk is about
● Data statistics histograms in
– MariaDB
– MySQL (status so far)
– PostgreSQL
● This is not a competitive comparison
– Rather, a survey
Histograms and query optimizers
Query optimizer needs data statistics
● Which query plan enumerates fewer rows
– orders->customers or customers->orders?
● It depends on row counts and condition selectivities
● Condition selectivity has a big impact on query speed
select * from customers join orders on customers.cust_id=orders.customer_id
where customers.balance<1000 and orders.total>10000
Data statistics have a big impact on the optimizer
● A paper "How good are query optimizers, really?"
– Leis et al., VLDB 2015
● Conclusions section:
– "In contrast to cardinality estimation, the contribution of the cost
model to the overall query performance is limited."
● This matches our experience
Data statistics usage
● Need a *cheap* way to answer questions about
– Number of rows in the table
– Condition selectivity
– Column widths
– Number of distinct values
– …
● Condition selectivity is the most challenging
Histogram as a compact data summary
● Partition the value space into buckets
● Keep an array of (bucket_bounds, n_values)
– Takes O(#buckets) space
Histogram and condition selectivity
col BETWEEN ‘a’ AND ‘b’
● Sum row counts in the covered buckets
● Partially covered bucket?
– Assume a fraction of rows match
– This is a source of inaccuracy
● More buckets – more accurate estimates
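The estimate described on this slide can be sketched in Python. This is a minimal illustration, assuming numeric bucket bounds and rows spread uniformly inside each bucket; `range_selectivity` and the sample buckets are made-up names and data, not taken from any of the three databases.

```python
def range_selectivity(buckets, lo, hi):
    """Estimate the fraction of rows with lo <= col <= hi.

    buckets: list of (lower_bound, upper_bound, n_rows), sorted and
    non-overlapping, each with upper_bound > lower_bound (assumption).
    """
    total = sum(n for _, _, n in buckets)
    matched = 0.0
    for b_lo, b_hi, n in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo)
        if overlap <= 0:
            continue  # bucket lies entirely outside the range
        # Fully covered buckets contribute all rows; partially covered
        # ones contribute a proportional share (the source of inaccuracy).
        matched += n * (overlap / (b_hi - b_lo))
    return matched / total


buckets = [(0, 10, 100), (10, 20, 300), (20, 40, 100)]
print(range_selectivity(buckets, 5, 25))  # 0.75
```

Only the first and last bucket are interpolated here, so narrower buckets shrink the part of the estimate that relies on the uniformity assumption.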
Histogram types
● Different strategies for choosing buckets
– Equi-width
– Equi-height
– Most Common Values
– ...
Equi-width histogram
● Bucket bounds pre-defined
– Equal, log-scale, etc.
● Easy to understand, easy to collect.
● Not very efficient
– Densely and sparsely-populated regions have the same #buckets
– What if densely-populated regions had more buckets?
Equi-height histogram
● Pick the bucket bounds such that each bucket has the same #rows
– Densely populated areas get more buckets
– Sparsely populated areas get fewer buckets
● Estimation error is limited by bucket size
– Which is now limited.
Most Common Values histogram
● Suitable for enum-type domains
● All possible values fit in the histogram
● Just a list of values and frequencies
value1 count1
value2 count2
value3 count3
... ...
Histogram collection algorithms
● Equi-width
– Find (or guess) min and max value
– For each value
  ● Find which histogram bin it falls into
  ● Increment bin’s counter
● Equi-height
– Sort the values
– First value starts bin #0
– Value at n_values * (1/n_bins) starts bin #1
– Value at n_values * (2/n_bins) starts bin #2
– ...
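Both collection algorithms fit in a few lines of Python. This is a sketch under simplifying assumptions (numeric values, all data in memory, max greater than min); the function names are illustrative only.

```python
def equi_width(values, n_bins):
    """Count values per bin, with bins of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def equi_height_bounds(values, n_bins):
    """Return the value that starts each bin: bin i starts at n * i / n_bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(n_bins)]


values = [1, 2, 2, 3, 3, 3, 4, 10, 20, 100]
print(equi_width(values, 4))          # [9, 0, 0, 1]
print(equi_height_bounds(values, 5))  # [1, 2, 3, 4, 20]
```

On this skewed sample the equi-width histogram puts nine of ten values into one bin, while the equi-height bounds cluster where the data is dense.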
Histogram collection strategies
● Scan the whole dataset
– Used by MariaDB
– Produces a “perfect” histogram
– May be expensive
● Do random sampling
– Used by PostgreSQL (MySQL going to do it, too?)
– Produces imprecise histograms
– Non-deterministic results
● Incremental updates
– Hard to do, not used
Summary so far
● Query optimizers need condition selectivities
● These are provided by histograms
● Histograms are compact data summaries
● Histogram types
– Width-balanced
– Height-balanced (better)
– Most-Common-Values
● Histogram collection methods
– Scan the whole dataset
– Do random sampling
Histograms in MariaDB
● Available in MariaDB 10.0
– (Stable since March, 2014)
● Used in the real world
● Good for common use cases
– Has some limitations
● Sometimes called “Engine-Independent Table Statistics”
– Although being engine-independent is not the primary point.
Histogram storage in MariaDB
● Stored in the mysql.column_stats table
CREATE TABLE mysql.column_stats (
  db_name varchar(64) NOT NULL,
  table_name varchar(64) NOT NULL,
  column_name varchar(64) NOT NULL,
  min_value varbinary(255) DEFAULT NULL,
  max_value varbinary(255) DEFAULT NULL,
  nulls_ratio decimal(12,4) DEFAULT NULL,
  avg_length decimal(12,4) DEFAULT NULL,
  avg_frequency decimal(12,4) DEFAULT NULL,
  hist_size tinyint unsigned,
  hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
  histogram varbinary(255),
  PRIMARY KEY (db_name,table_name,column_name)
);
● Very compact: max 255 bytes (per column)
Collecting a histogram
● Manual collection only
set histogram_size=255;
set histogram_type='DOUBLE_PREC_HB';
analyze table tbl persistent for all;
analyze table tbl persistent for columns (col1, col2) indexes ();
+----------+---------+----------+-----------------------------------------+
| Table    | Op      | Msg_type | Msg_text                                |
+----------+---------+----------+-----------------------------------------+
| test.tbl | analyze | status   | Engine-independent statistics collected |
| test.tbl | analyze | status   | OK                                      |
+----------+---------+----------+-----------------------------------------+
● Make the optimizer use it
set use_stat_tables='preferably';
set optimizer_use_condition_selectivity=4;
<query>;
Examining a histogram
select * from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname'
*************************** 1. row ***************************
      db_name: babynames
   table_name: pop1980_cp
  column_name: firstname
    min_value: Aaliyah
    max_value: Zvi
  nulls_ratio: 0.0000
   avg_length: 6.0551
avg_frequency: 194.4642
    hist_size: 32
    hist_type: DOUBLE_PREC_HB
    histogram: (raw binary data, not printable)
select decode_histogram(hist_type,histogram) from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname'
*************************** 1. row ***************************
decode_histogram(hist_type,histogram): 0.00201,0.04048,0.03833,0.03877,0.04158,0.11852,0.07912,0.00218,0.00093,0.03940,0.07710,0.00124,0.08035,0.11992,0.03877,0.03989,0.24140
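One way to read the decoded output is sketched below in Python. It rests on an assumption the slide does not spell out: each decoded number is the width of one bucket, expressed as a fraction of the min_value..max_value range, and every bucket holds an equal share of the rows. `fraction_below` is an illustrative helper, not a MariaDB function, and how a string value maps to a position in that range is left out.

```python
# Decoded values copied from the slide.
WIDTHS = [0.00201, 0.04048, 0.03833, 0.03877, 0.04158, 0.11852, 0.07912,
          0.00218, 0.00093, 0.03940, 0.07710, 0.00124, 0.08035, 0.11992,
          0.03877, 0.03989, 0.24140]


def fraction_below(widths, pos):
    """Estimated fraction of rows below relative position pos (0..1).

    Assumes equal row counts per bucket and a uniform spread of rows
    inside the bucket that contains pos.
    """
    per_bucket = 1.0 / len(widths)
    start = 0.0
    result = 0.0
    for w in widths:
        end = start + w
        if pos >= end:
            result += per_bucket  # bucket lies entirely below pos
        elif pos > start:
            result += per_bucket * (pos - start) / w  # partial bucket
            break
        else:
            break
        start = end
    return result


print(sum(WIDTHS))                  # close to 1.0: the widths cover the range
print(fraction_below(WIDTHS, 0.5))  # share of rows in the lower half of the range
```

Under this reading, narrow buckets (such as the 0.00093 one) mark densely populated stretches of the value range.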
Histograms in MariaDB - summary
● Available since MariaDB 10.0
● Special ANALYZE command to collect stats
– Does a full table scan
– May require a lot of space for big VARCHARs: MDEV-6529 “EITS ANALYZE uses disk space inefficiently for VARCHAR columns”
● Not used by the optimizer by default
– Special settings to get optimizer to use them.
Histograms in PostgreSQL
● Data statistics
– Fraction of NULL-values
– Most common value (MCV) list
– Height-balanced histogram (excludes MCV values)
– A few other parameters
  ● avg_length
  ● n_distinct_values
  ● ...
● Collection algorithm
– One-pass random sampling
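The idea of one-pass random sampling can be illustrated with classic reservoir sampling (Algorithm R). The Python below is a simplified stand-in, not PostgreSQL's actual code, which samples blocks first and then rows within them; `reservoir_sample` and its parameters are illustrative names.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random in a single pass.

    The total number of rows does not need to be known in advance.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)  # fill the reservoir first
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


print(reservoir_sample(range(1_000_000), 5, seed=42))
```

Different seeds give different samples, which is why histograms built this way are non-deterministic.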
Collecting histograms in PostgreSQL
-- Global parameter specifying number of buckets
-- the default is 100
set default_statistics_target=N;
-- Can also override for specific columns
alter table tbl alter column_name set statistics N;
-- Collect the statistics
analyze tablename;
postgresql.conf, or per-table:
# number of inserted/updated/deleted tuples to trigger an ANALYZE
autovacuum_analyze_threshold = N
# fraction of the table size to add to autovacuum_analyze_threshold
# when deciding whether to trigger ANALYZE
autovacuum_analyze_scale_factor=N.N
Examining the histogram
select * from pg_stats where tablename='pop1980';
tablename         | pop1980
attname           | firstname
inherited         | f
null_frac         | 0
avg_width         | 7
n_distinct        | 9320
most_common_vals  | {Michael,Jennifer,Christopher,Jason,David,James,Matthew,John,Joshua,Amanda}
most_common_freqs | {0.0201067,0.0172667,0.0149067,0.0139,0.0124533,0.01164,0.0109667,0.0107133,0.0106067,0.01028}
histogram_bounds  | {Aaliyah,Belinda,Christine,Elsie,Jaron,Kamia,Lindsay,Natasha,Robin,Steven,Zuriel}
correlation       | 0.0066454
most_common_elems |
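These columns are enough for a rough equality estimate. The Python sketch below follows the usual approach in simplified form: a value on the MCV list gets its recorded frequency, and any other value gets an equal share of whatever remains. The real planner handles more cases, and `eq_selectivity` is an illustrative name. The numbers are the ones shown in the pg_stats row; results need not match plan estimates obtained with different statistics.

```python
MCV_VALS = ["Michael", "Jennifer", "Christopher", "Jason", "David",
            "James", "Matthew", "John", "Joshua", "Amanda"]
MCV_FREQS = [0.0201067, 0.0172667, 0.0149067, 0.0139, 0.0124533,
             0.01164, 0.0109667, 0.0107133, 0.0106067, 0.01028]
NULL_FRAC = 0.0
N_DISTINCT = 9320


def eq_selectivity(value):
    """Estimated fraction of rows where firstname = value."""
    if value in MCV_VALS:
        return MCV_FREQS[MCV_VALS.index(value)]
    # Rows that are neither NULL nor covered by the MCV list ...
    remaining = 1.0 - NULL_FRAC - sum(MCV_FREQS)
    # ... are assumed to be spread evenly over the other distinct values.
    other_distinct = N_DISTINCT - len(MCV_VALS)
    if other_distinct <= 0:
        return 0.0
    return remaining / other_distinct


print(eq_selectivity("Jennifer"))  # taken directly from most_common_freqs
print(eq_selectivity("Jennice"))   # every non-MCV value gets the same estimate
```

The second case shows why a moderately common name and a very rare one can receive identical row estimates when neither is on the MCV list.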
Histograms are collected by doing sampling
● src/backend/commands/analyze.c, std_typanalyze() refers to
● “Random Sampling for Histogram Construction: How much is enough?” – Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, ACM SIGMOD, 1998.
[Diagram: histogram size, rows in table (=10^6), max relative error in bin (=0.5) and error probability (=0.01) determine the random sample size]
● 100 buckets = 30,000 rows sample
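The arithmetic behind the 30,000-row figure can be checked in Python. The formula is the bound quoted in the comment of std_typanalyze() from the cited paper, r = 4 * k * ln(2n / gamma) / f^2, with the parameter values from the slide; `min_sample_size` is an illustrative name.

```python
import math


def min_sample_size(k, n=10**6, f=0.5, gamma=0.01):
    """Minimum random sample size for a k-bucket histogram.

    n: rows in table, f: max relative error in a bin,
    gamma: error probability.
    """
    return 4 * k * math.log(2 * n / gamma) / f ** 2


print(min_sample_size(1))    # roughly 305.8 rows per bucket
print(min_sample_size(100))  # roughly 30,600 rows for 100 buckets
```

Rounding the per-bucket factor down to 300 gives the 300 * statistics target rule, which is where 100 buckets = 30,000 rows comes from.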
Histogram sampling in PostgreSQL
● 30K rows are sampled from random locations in the table
– Does a skip scan forward
– “Randomly chosen rows in randomly chosen blocks”
● Choice of Most Common Values
– Sample values that are 25% more common than average
– Values that would take more than one histogram bucket.
– All seen values are MCVs? No histogram is built.
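Beyond single-column histograms
● Conditions can be correlated
select ... from order_items where shipdate='2015-12-15' AND item_name='christmas light'
(alternative value: item_name='swimsuit')
● Correlation can have a big effect: with single-condition selectivities 1/n and 1/m, the combined selectivity can be
– MIN(1/n, 1/m) (fully correlated)
– (1/n) * (1/m) (independent)
– 0 (mutually exclusive)
● Multi-column “histograms” are hard
● “Possible PostgreSQL 10.0 feature: multivariate statistics”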
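The spread between these three cases is easy to see numerically. A small Python sketch, with made-up selectivities (one ship date out of 365, one item out of 1000); `combined_selectivity_cases` is an illustrative name.

```python
def combined_selectivity_cases(s1, s2):
    """Selectivity of (cond1 AND cond2) under three correlation assumptions."""
    return {
        "fully_correlated": min(s1, s2),  # one condition implies the other
        "independent": s1 * s2,           # the usual optimizer default
        "mutually_exclusive": 0.0,        # the conditions never hold together
    }


cases = combined_selectivity_cases(1 / 365, 1 / 1000)
for name, sel in cases.items():
    print(f"{name}: {sel:.8f}")
```

With these inputs the fully correlated estimate is 365 times larger than the independent one, and single-column histograms alone cannot tell which case applies.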
PostgreSQL: Conclusions
● Collects and uses both
– Height-balanced histogram
– Most Common Values list
● Uses sampling for collection
● Can run ANALYZE yourself
– Or autovacuum will do it automatically
● Multivariate stats are planned
Histogram test - MariaDB
● Real world data, people born in 1980
Actual row counts: Jennifer 58,381; Allison 4,868; Jennice 7
MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Jennifer';
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows    | r_rows     | filtered | r_filtered | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
|    1 | SIMPLE      | pop1980 | ALL  | NULL          | NULL | NULL    | NULL | 3444156 | 3444156.00 |     4.69 |       1.70 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
Slide annotation (filtered / r_filtered): 2.75x
MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Allison';
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows    | r_rows     | filtered | r_filtered | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
|    1 | SIMPLE      | pop1980 | ALL  | NULL          | NULL | NULL    | NULL | 3444156 | 3444156.00 |     2.89 |       0.14 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
Slide annotation (filtered / r_filtered): 20x
MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Jennice';
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows    | r_rows     | filtered | r_filtered | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
|    1 | SIMPLE      | pop1980 | ALL  | NULL          | NULL | NULL    | NULL | 3444156 | 3444156.00 |     4.69 |       0.00 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
Slide annotation (filtered / r_filtered): ?x
Histogram test - PostgreSQL
● Real world data, people born in 1980
Actual row counts: Jennifer 58,381; Allison 4,868; Jennice 7
test=# explain analyze select count(*) from pop1980 where firstname='Jennifer';
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=68456.71..68456.71 rows=1 width=8) (actual time=372.593..372.593 rows=1 loops=1)
   ->  Seq Scan on pop1980  (cost=0.00..68312.62 rows=57632 width=0) (actual time=0.288..366.058 rows=58591 loops=1)
         Filter: ((firstname)::text = 'Jennifer'::text)
         Rows Removed by Filter: 3385539
 Planning time: 0.098 ms
 Execution time: 372.625 ms
Slide annotation: 0.9x
test=# explain analyze select count(*) from pop1980 where firstname='Allison';
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=68313.66..68313.67 rows=1 width=8) (actual time=372.415..372.415 rows=1 loops=1)
   ->  Seq Scan on pop1980  (cost=0.00..68312.62 rows=413 width=0) (actual time=119.238..372.023 rows=4896 loops=1)
         Filter: ((firstname)::text = 'Allison'::text)
         Rows Removed by Filter: 3439234
 Planning time: 0.086 ms
 Execution time: 372.447 ms
Slide annotation: 0.08x
test=# explain analyze select count(*) from pop1980 where firstname='Jennice';
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=68313.66..68313.67 rows=1 width=8) (actual time=345.966..345.966 rows=1 loops=1)
   ->  Seq Scan on pop1980  (cost=0.00..68312.62 rows=413 width=0) (actual time=190.896..345.961 rows=7 loops=1)
         Filter: ((firstname)::text = 'Jennice'::text)
         Rows Removed by Filter: 3444123
 Planning time: 0.388 ms
 Execution time: 346.010 ms
Slide annotation: 103x
Histograms in MySQL
● Not available for use in MySQL 8.0.1
● There are pieces of histogram code, still
– This gives some clues
● Another feature that uses histograms: P_S statement latencies
– P_S.events_statements_histogram_global
– P_S.events_statements_histogram_by_digest
– These are a totally different kind of histogram
  ● Buckets are log-scale equi-width.
Sampling
● Currently has only a default implementation
– Which does a full table scan and “rolls the dice” for each row
● Assume there will be an InnoDB implementation
enum class enum_sampling_method { SYSTEM };
class handler {
  ...
  int ha_sample_init(double sampling_percentage, int sampling_seed,
                     enum_sampling_method sampling_method);
  int ha_sample_next(uchar *buf);
  int ha_sample_end();
  ...
};
● New methods for storage engine API
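The “rolls the dice for each row” behaviour amounts to Bernoulli sampling over a full scan. A Python sketch of the idea follows; the parameter names mirror ha_sample_init() for readability only, and this is not the MySQL implementation.

```python
import random


def sample_rows(rows, sampling_percentage, sampling_seed):
    """Scan every row and keep each one with the given probability (in %)."""
    rng = random.Random(sampling_seed)
    for row in rows:
        # rng.random() is in [0, 1), so 100% keeps every row and 0% none.
        if rng.random() * 100.0 < sampling_percentage:
            yield row


picked = list(sample_rows(range(10_000), 1.0, sampling_seed=123))
print(len(picked))  # around 100; the exact count depends on the seed
```

Every row is still read, so this saves histogram-building work but not I/O, which is presumably why an engine-specific implementation would be wanted.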
Histogram storage
● Will be stored in mysql.column_stats table
CREATE TABLE mysql.column_stats (
  database_name varchar(64) COLLATE utf8_bin NOT NULL,
  table_name varchar(64) COLLATE utf8_bin NOT NULL,
  column_name varchar(64) COLLATE utf8_bin NOT NULL,
  histogram json NOT NULL,
  PRIMARY KEY (database_name,table_name,column_name)
);
● Will be stored as JSON
– No limits on size?
“Singleton” histograms
● This is what PostgreSQL calls “Most Common Values”
{
  "last-updated": "2015-11-04 15:19:51.000000",
  "histogram-type": "singleton",
  "null-values": 0.1, // Fraction of NULL values
  "buckets": [
    [
      42, // Value, data type depends on the source column.
      0.001978728666831561 // "Cumulative" frequency
    ],
    …
  ]
}
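Because the stored frequencies are cumulative, the frequency of a single value is the difference between neighbouring buckets. A short Python sketch with hypothetical bucket data (the values 42, 43 and 50 and their frequencies are invented for illustration; `singleton_frequencies` is not a MySQL function):

```python
def singleton_frequencies(buckets):
    """Convert [value, cumulative_frequency] pairs to per-value frequencies."""
    freqs = {}
    previous = 0.0
    for value, cumulative in buckets:
        freqs[value] = cumulative - previous
        previous = cumulative
    return freqs


# Hypothetical histogram for a column with "null-values": 0.1,
# so the cumulative frequency tops out at 0.9.
buckets = [[42, 0.2], [43, 0.5], [50, 0.9]]
print(singleton_frequencies(buckets))
```

Storing cumulative values makes range conditions cheap (one lookup per endpoint) at the cost of a subtraction for equality conditions.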
Height-balanced histograms
{
  "last-updated": "2015-11-04 15:19:51.000000",
  "histogram-type": "equi-height",
  "null-values": 0.1, // Fraction of NULL values
  "buckets": [
    [
      "bar", // Lower inclusive value
      "foo", // Upper inclusive value
      0.001978728666831561, // Cumulative frequency
      10 // Number of distinct values in this bucket
    ],
    ...
  ]
}
Height-balanced histograms
  ...
  "buckets": [
    [
      "bar", // Lower inclusive value
      "foo", // Upper inclusive value
      0.001978728666831561, // Cumulative frequency
      10 // Number of distinct values in this bucket
    ],
    ...
  ]
}
● Why “upper inclusive value”? To support holes? At cost of 2x histogram size?
● Why frequency in each bucket? It’s equi-height, so frequencies should be the same?
● Per-bucket #distinct is interesting but doesn’t seem high-demand.
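One conceivable use of these fields is an equality estimate: take the bucket's own frequency and divide it by the number of distinct values in the bucket. The Python sketch below is speculation about how such data could be consumed, not a description of what MySQL does; the bucket contents are invented and `eq_estimate` is an illustrative name.

```python
def eq_estimate(buckets, value):
    """Estimate the fraction of rows equal to value.

    buckets: [lower, upper, cumulative_frequency, n_distinct] entries,
    sorted by lower bound, with both bounds inclusive.
    """
    previous = 0.0
    for lower, upper, cumulative, n_distinct in buckets:
        bucket_freq = cumulative - previous
        if lower <= value <= upper:
            # Assume the bucket's rows are spread evenly over its values.
            return bucket_freq / n_distinct
        previous = cumulative
    return 0.0  # value falls into a gap between buckets


# Hypothetical histogram with a gap between "f" and "h".
buckets = [["a", "f", 0.25, 5], ["h", "m", 0.5, 10], ["n", "z", 0.9, 4]]
print(eq_estimate(buckets, "j"))
print(eq_estimate(buckets, "g"))
```

With explicit upper bounds, a value that falls between two buckets can be estimated as matching no rows, which may be what the “holes” are for.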
Histograms
● “Singleton”
● Height-balanced
● Fraction of NULLs is stored in both kinds of histogram
– So you can’t have both at the same time?
● Height-balanced allows for “gaps”
● Each bucket has #distinct (non-optional?)
MySQL histograms summary
● Seem to be coming in MySQL 8.0
● Support two types
– “Singleton”
– “Height-balanced”
● Both kinds store null-values so they are not used together?
● “Height-balanced”
– May have “holes”?
– Stores “frequency” for each bin (?)
● Collection will probably use sampling
– Which has only a full scan implementation ATM
Conclusions
● Histograms are compact data summaries for use by the optimizer
● PostgreSQL
– Has a mature implementation
– Uses sampling and auto-collection
● MariaDB
– Supports histograms since MariaDB 10.0
  ● Compact
  ● Height-balanced only
– Need to run ANALYZE manually and set the optimizer to use them
● MySQL
– Doesn’t have histograms, still.
– Preparing to have them in 8.0
– Will support two kinds
  ● Most common values
  ● Height-balanced “with gaps” (?)
Thanks!