Post on 15-Feb-2016
Compression Aware Physical Database Design
Hideaki Kimura* (Brown University), hkimura@cs.brown.edu
Vivek Narasayya, Manoj Syamala (Microsoft Research), {viveknar,manojsy}@microsoft.com
(*) Graduates soon. On Job Market.
2/28
Background: Compression in DB
Every Major DBMS Supports
Saves Storage Consumption
Saves I/O Bandwidth
[Figure: Tables and Indexes are stored as Compressed Data. The Query Process Engine decompresses on SELECT and compresses on INSERT.]
DBMS A: 4x! DBMS B: 10x! DBMS C: 12x!
3/28
Compression Schemes in DB
Dictionary Encoding
◦ City: Seattle, San Jose, Seattle, .. → 1, 2, 1, .. + Dict. (1: Seattle, 2: San Jose)
◦ Local dict. (Oracle, SQL Server)
◦ Global dict. (DB2)
NULL Suppression
◦ Price: 000321, 000054, 000015, .. → @321, @54, @15, ..
Prefix Suppression, LZO, RLE…
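The two schemes on this slide can be sketched in a few lines of Python. This is only an illustration of the idea using the slide's City and Price values. The function names and the "@" marker handling are assumptions made for the example, not how any particular DBMS implements these schemes.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code and return the code -> value dictionary."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1  # codes start at 1, as on the slide
        encoded.append(codes[value])
    dictionary = {code: value for value, code in codes.items()}
    return encoded, dictionary


def null_suppress(value, marker="@"):
    """Drop leading zeros and prefix a marker (hypothetical stand-in for the length header)."""
    return marker + value.lstrip("0")


cities = ["Seattle", "San Jose", "Seattle"]
encoded, dictionary = dictionary_encode(cities)

prices = ["000321", "000054", "000015"]
suppressed = [null_suppress(p) for p in prices]
```

With the slide's data, `encoded` holds the codes 1, 2, 1 and `suppressed` holds @321, @54, @15.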
4/28
Two Types of Compression in DB
Order Independent
◦ NULL-Supp.
◦ Global dict.
◦ …
Order Dependent
◦ Run Length Enc.
◦ Local dict.
◦ …
[Figure: a two-column table (A, B), with A holding values such as 000AA and 00BBB, stored as index IAB and as index IBA. Order independent: the compressed pages of IAB and IBA are the same size (=). Order dependent: the sizes differ (≠) because equal values are fragmented in one of the two orders.]
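A small sketch of why key order matters for one class of schemes and not the other. It builds the same rows as IAB and IBA, then measures a NULL-suppression-style size (order independent) and a run count as a proxy for run-length encoding (order dependent). The data, the one-byte marker, and the run-count proxy are assumptions chosen for illustration.

```python
from itertools import groupby


def null_suppressed_bytes(index_rows):
    """Per-value size after stripping leading zeros, plus one marker byte per value."""
    return sum(len(value.lstrip("0")) + 1 for row in index_rows for value in row)


def run_count(index_rows):
    """Total number of runs of equal adjacent values, summed over all columns."""
    n_cols = len(index_rows[0])
    return sum(
        sum(1 for _ in groupby(row[col] for row in index_rows))
        for col in range(n_cols)
    )


# Hypothetical table: column A has 2 distinct values, column B has 4.
rows = [(a, b) for a in ("000AA", "00BBB") for b in ("W", "X", "Y", "Z")]

index_ab = sorted(rows, key=lambda r: (r[0], r[1]))  # ordered by A, then B
index_ba = sorted(rows, key=lambda r: (r[1], r[0]))  # ordered by B, then A

same_size = null_suppressed_bytes(index_ab) == null_suppressed_bytes(index_ba)
runs_ab = run_count(index_ab)
runs_ba = run_count(index_ba)
```

NULL suppression works value by value, so `same_size` is true for any ordering. The run counts differ between the two orders, which is why the compressed size of an order-dependent scheme cannot be read off from the column set alone.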
5/28
Benefits and Overheads
Saves Storage Space, I/O
CPU Overhead to Compress & Decompress
Different Compression Scheme = Different Saving ↔ Overhead
DBA: How Do We Use It?
6/28
Issue 1: To Compress or Not
Depends on Data
◦ High Compression Ratio: 10GB → 1GB (-90%)
◦ Low Compression Ratio: 10GB → 9GB (-10%)
Depends on Workload
◦ SELECTs/INSERTs Frequency
◦ CPU bottleneck? I/O bottleneck?
7/28
Issue 2: What Index to Create
Physical DB Design Tool
◦ Syntactically Relevant Indexes (I1 to I5, from queries Q1, Q2)
◦ Select Candidate Configurations
◦ Enumerate
◦ Best Configuration (I1, I5)
DBMS
◦ Query Optimizer
◦ Hypothetical Indexes
◦ What-if Analysis: Estimate Runtime, Prune
8/28
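Naïve Solution: Staged Design
Run Design Tool to Select Indexes, Compress them, then Repeat.
[Figure: given a Workload and a 100 MB Budget, Stage 1 selects indexes and MVs totaling 100 MB; compressing them shrinks the design to 50 MB; Stage 2 fills the design back up to 100 MB.]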
9/28
Problem in Tight Space Budget
Misses an index that makes sense only with compression.
Sales
Shipdate  State  Price  Discount
Feb 21    CA     $123   10%
Jan 9     RI     $222   0%
Jul 5     TX     $213   5%
SELECT SUM(Price*Discount) FROM Sales WHERE State='CA' and Jul 01 < Shipdate < Sep 01
I1 (State, Shipdate): 95 MB → 50 MB
I2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB
Choice for 100 MB?
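The I1/I2 example can be replayed in a short sketch that contrasts the staged approach with one that considers compressed variants up front. The index sizes are the slide's. The benefit scores, the greedy selection rule, and the function names are hypothetical, and CPU overhead is ignored here.

```python
# Sizes in MB come from the slide; the benefit scores are hypothetical.
INDEXES = {
    "I1": {"size": 95, "compressed_size": 50, "benefit": 40},
    "I2": {"size": 170, "compressed_size": 90, "benefit": 70},
}


def staged_design(indexes, budget):
    """Select uncompressed indexes that fit, compress them, then repeat with the freed room."""
    chosen, used = [], 0
    while True:
        room = budget - used
        stage = []
        remaining = sorted(
            (name for name in indexes if name not in chosen),
            key=lambda name: indexes[name]["benefit"],
            reverse=True,
        )
        for name in remaining:
            if indexes[name]["size"] <= room:  # only uncompressed sizes are visible here
                stage.append(name)
                room -= indexes[name]["size"]
        if not stage:
            return chosen
        used += sum(indexes[name]["compressed_size"] for name in stage)
        chosen.extend(stage)


def integrated_design(indexes, budget):
    """Consider the compressed variant of each index during selection itself."""
    chosen, room = [], budget
    for name in sorted(indexes, key=lambda n: indexes[n]["benefit"], reverse=True):
        for compressed in (False, True):  # prefer uncompressed when it fits
            size = indexes[name]["compressed_size" if compressed else "size"]
            if size <= room:
                chosen.append((name, compressed))
                room -= size
                break
    return chosen


staged = staged_design(INDEXES, budget=100)
integrated = integrated_design(INDEXES, budget=100)
```

Under these assumptions the staged run ends with I1 only, because I2 never fits at its uncompressed 170 MB, while the integrated run picks compressed I2 at 90 MB.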
10/28
Example: Tight Space Budget
Good design: 175MB
CREATE COMPRESSED INDEX (L_PARTKEY,L_ORDERKEY,L_SUPPKEY) INCLUDE (L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT)
Staged: 155MB
CREATE INDEX (L_ORDERKEY) INCLUDE (L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE)
[Chart: TPC-H, 2ndary Index Only. X axis: Space Budget [MB], 0 to 1000. Y axis: Improvement [%], 0 to 70. Series: Good design, Staged.]
11/28
Problem in Plenty Space Budget
Results in too high CPU overheads for compression/decompression.
I1 (State, Shipdate): 95 MB → 50 MB
I2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB
Choice for 200 MB?
UPDATE Sales SET Price=..
INSERT INTO Sales …
CPU Overheads
12/28
Example: Plenty Space Budget
Worse with More Budget!
[Chart: UPDATE Intensive TPC-H, 2ndary Index Only. X axis: Space Budget [MB], 0 to 3000. Y axis: Improvement [%], 0 to 60. Series: Good design, Staged.]
13/28
Integrated Solution Needed!
How to estimate index size after compression?
How to evaluate benefits/overheads of compression?
How does compression affect candidate selection/enumeration?
14/28
Size Estimation
Essential Metric of Indexes
◦ To Fit Space Budget
◦ To Estimate I/O cost
Need Compression Fraction (CF)
Table Stats: #tuple=1M, Col-A Width=8, Col-B Width=4, Col-C Width=10, Clust. Key Width=4
Size (IABC) = (8 + 4 + 10 + 4) * 1M = 26 MB
Comp. Size (IABC) = 26 MB * CF (IABC)
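The arithmetic on this slide, written out as a sketch. The widths and tuple count are the slide's. The function names and the CF value of 0.5 are assumptions for illustration, with CF read as compressed size divided by uncompressed size.

```python
def uncompressed_index_size(column_widths, clustered_key_width, num_tuples):
    """Bytes per index entry (key columns plus clustered key) times the number of tuples."""
    return (sum(column_widths) + clustered_key_width) * num_tuples


def compressed_index_size(uncompressed_bytes, compression_fraction):
    """Scale the uncompressed size by the compression fraction (CF)."""
    return uncompressed_bytes * compression_fraction


# Col-A = 8, Col-B = 4, Col-C = 10, clustered key = 4, 1M tuples (from the slide).
size_abc = uncompressed_index_size([8, 4, 10], clustered_key_width=4, num_tuples=1_000_000)

# CF = 0.5 is a made-up value; estimating CF is the hard part.
compressed_abc = compressed_index_size(size_abc, compression_fraction=0.5)
```

`size_abc` comes out at 26,000,000 bytes, matching the 26 MB on the slide. The uncompressed size follows from table statistics alone, whereas CF depends on the data and on the compression scheme.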
15/28
Prior work
SampleCF [Idreos et al. ICDE'10]
[Figure: 1GB Table → 10MB Sample → CREATE COMPRESSED INDEX]
Sample Size: Cost ↔ Accuracy
Still Expensive for 1,000s of indexes
[Chart: Design Tool Runtime [min], 0 to 20. Labels: Naïve I..., SampleCF Overheads.]
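A rough sketch of the sampling idea shown in the figure: take a sample of the table, build the index on the sample, compress it, and use the measured ratio as the CF estimate. Here `zlib` is only a stand-in for the DBMS's own index compression, and the row layout, serialization, and parameter names are assumptions, so this illustrates the workflow rather than the cited method.

```python
import random
import zlib


def sample_cf(rows, key_columns, sample_fraction=0.01, seed=0):
    """Estimate the compression fraction of an index from a random sample of rows."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * sample_fraction))
    sample = rng.sample(rows, sample_size)

    # "Build the index": project the key columns and sort by them.
    index_rows = sorted(tuple(row[col] for col in key_columns) for row in sample)

    # Serialize and compress; zlib stands in for the real compression scheme.
    raw = "\n".join(",".join(str(v) for v in entry) for entry in index_rows).encode()
    compressed = zlib.compress(raw)
    return len(compressed) / len(raw)


# Hypothetical table of (state, shipdate) rows.
states = ["CA", "RI", "TX"]
table = [
    (states[i % 3], "2016-%02d-%02d" % (i % 12 + 1, i % 28 + 1))
    for i in range(10_000)
]

estimated_cf = sample_cf(table, key_columns=(0, 1), sample_fraction=0.01)
```

A larger `sample_fraction` costs more time per index but should track the true CF more closely, which is the cost ↔ accuracy trade-off on the slide. Repeating this for thousands of candidate indexes is what makes the approach expensive.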
16/28
Solution Overview
[Figure: architecture. The Database Engine Tuning Advisor (DTA) takes a Workload and a Storage bound, runs Candidate Selection, Merging, and Enumeration, and outputs a Physical design recommendation. It relies on Size Estimation (SampleCF over Samples in Temp DB) and on What-if analysis against the Microsoft SQL Server Query Optimizer (Compression Aware Cost Model).]
17/28
Index Size Deduction
SampleCF
Col-Ext Deduction: Ia, Ib → Ia,b
Col-Set Deduction: Ia,b → Ib,a
NULL supp. (ORD-IND)
Local dict. (ORD-DEP)
Estimate From Run-Length
Sum-up Savings
[Figure: the values 000AA and 00BBB laid out in IAB and in IBA, with per-order annotations (4, 1 for AB; 2, 2 for BA).]
More Details in paper
18/28
Optimize Accuracy-Cost Trade-off
Size-Estimation Strategy
◦ Sample Size?
◦ Deduction Path?
◦ Expected Errors?
Formulate as Graph Problem
Greedy algorithm to solve (details in the paper)
19/28
Issues in Design Tool
Query Cost model to consider (De)Compression CPU cost
Candidate Selection/Enumeration
Key Challenge: Space-Performance Trade-off
20/28
Candidate Selection: Space-Performance Trade-off
[Figure: for queries Q1 and Q2, the candidates IA, IB, IC, ID are extended with their Compressed Versions (Add Compressed Indexes); Select Fastest keeps IA and IC.]
Compressed Indexes are often Slower-but-Smaller
Most of them are Ignored! (exception: very high compression ratio)
21/28
Skyline Candidate Selection
Construct Skyline of Configurations
Pick Both Fast-Indexes and Small-Indexes
[Chart: Query Cost vs. Configuration Size, with the skyline running from Slow-small to Fast-large.]
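A skyline over (size, cost) pairs is the set of configurations that no other configuration beats on both axes. The sketch below computes it with a sort and a single pass. The configuration names and numbers are hypothetical and only show how a slower-but-smaller compressed index survives where a select-fastest rule would drop it.

```python
def skyline(configs):
    """Return configurations not dominated in both size and query cost.

    configs: list of (name, size_mb, query_cost) tuples.
    """
    result = []
    best_cost = float("inf")
    # Walk from smallest to largest; keep a config only if it is faster
    # than every config that is no larger than it.
    for name, size, cost in sorted(configs, key=lambda c: (c[1], c[2])):
        if cost < best_cost:
            result.append((name, size, cost))
            best_cost = cost
    return result


# Hypothetical candidates for one query: (name, size in MB, estimated cost).
candidates = [
    ("IA", 100, 10),
    ("IA compressed", 50, 14),
    ("IB", 120, 12),
    ("IB compressed", 60, 15),
]

kept = skyline(candidates)
fastest_only = min(candidates, key=lambda c: c[2])
```

In this example `fastest_only` is IA, while `kept` contains both compressed IA (slow-small) and IA (fast-large). The other two are dominated: each has a smaller and faster alternative.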
22/28
Enumeration: Problem
Greedy picks un-compressed indexes too early
[Figure: enumeration from a Seed with 15MB Room. Candidates: IB (10MB), ICB (5MB, Comp.), IC (10MB). Configurations shown: IA IB, IA ICB, IA IB ICB, IA IB IC. Optimal Design: IA IC ICB.]
23/28
Local Backtrack in Enumeration
Recover oversized configurations
Compress indexes in the config.
[Figure: IA → IA IB → IA IB IC; Recover If Oversized → IA IC ICB, IA IB ICC, …]
24/28
Experimental Results
Implemented on SQL Server 2008
◦ Modified Database Tuning Advisor (DTA): "DTAc"
◦ Modified Query Cost Model
TPC-H Scale-1 (more results in paper)
◦ SELECT-intensive/UPDATE-intensive
◦ Compared Estimated Runtime
25/28
Candidate Selection/Enumeration
Both Skyline & Backtrack are required, esp. for tight budget
[Charts: Clustered/2ndary Indexes. Panels: Select Intensive (Improvement [%], 0 to 80) and Update Intensive (Improvement [%], 0 to 70). X axis: Budget [MB], 50, 300, 700, 1500. Series: DTAc (Both), Skyline, Backtrack, DTAc (None), DTA.]
26/28
DTAc vs. DTA
Especially better in tight budget
Chooses lightly compressed designs in UPDATE-intensive
[Charts: Clustered/2ndary/MV Indexes. Panels: Select Intensive (Improvement [%], 0 to 80) and Update Intensive (Improvement [%], 0 to 60). X axis: Budget [MB], 0 to 1000. Series: DTAc, DTA.]
27/28
Overhead in DTA
Reduces Size Estimation Overheads by a factor of 3
Mostly <10% Estimation Error
[Chart: Design Tool Runtime [min], 0 to 20, for DTAc w/o Optimization and DTAc. Components: MV-Estimate, MV-Sample, Partial-Estimate, Partial-Sample, Table-Estimate, Table-Sample, Other.]
28/28
Conclusion
Opportunities and Challenges
Integrated Approach to exploit compression in physical design
◦ Space-Performance Tradeoff
◦ Size Estimation
Open Issues
◦ Column-Store