Download - Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS- ARNO JACOBSEN, PATRICK.

Transcript

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query PerformanceTILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

ICPE 2013, PRAGUE, 24/04/2013

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 2

Real Life Data is Distributed Uniformly…

◦ Customers zip codes typically clustered around metropolitan areas◦ Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific

periods◦ US retail sales:

◦ peak during Holiday Season◦ December sales are 2x of

January sales

Source: US Census Data

Well, Not Really

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 3

Student Seminar Signup Distribution

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 4

How Can Skew Effect Database Systems?

Data placement◦ Partitioning◦ Indexing

Data structures◦ Tree balance◦ Bucket fill ratio◦ Histograms

Optimizer finding the optimal query plan◦ Index vs. non-index driven plans◦ Hash join vs. merge join◦ Hash group by vs. sort group by

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 5

Agenda

Data Skew in Current Benchmarks

Star Schema Benchmark (SSB)

Parallel Data Generation Framework (PDGF)

Introducing Skew in SSB

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 6

Data Skew in Benchmarks

TPC-D (1994-1999): only uniform data◦ SIGMOD 1997 - “Successor of TPC-D

should include data skew”◦ No effect until …

TPC-DS (released 2012)◦ Contains comparability zones ◦ Not fully utilized

TPC-D/H variations◦ Chaudhuri and Narayasa: Zipfian distribution on all columns◦ Crolotte and Ghazal: comparability zones

Still lots of open potential

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 7

Star Schema Benchmark I

Star schema version of TPC-H◦ Merged Order and Lineitem◦ Date dimension ◦ Dropped Partsupp◦ Selectivity hierarchies

◦ C_City C_Nation C_Region◦ …

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 8

Star Schema Benchmark II

Completely new set of queries

4 flights of 3-4 queries◦ Designed for functional coverage and selectivity coverage◦ Drill down in dimension hierarchies◦ Predefined selectivity

select sum(lo_extendedprice*lo_discount) as revenue from lineorder, date where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;

select sum(lo_extendedprice*lo_discount) as revenue from lineorder, date where lo_orderdate = d_datekey and d_yearmonthnum = 199301 and lo_discount between 1 and 3 and lo_quantity between 26 and 35;

DrilldownQ1.1

Q1.2

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 9

Parallel Data Generation Framework

Generic data generation framework

Relational model◦ Schema specified in configuration file◦ Post-processing stage for alternative representations

Repeatable computation◦ Based on XORSHIFT random number generators◦ Hierarchical seeding strategy

Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12.Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11.Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11.Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 10

Configuring PDGF Schema configuration

Relational model◦ Tables, fields

Properties◦ Table size, characters, …

Generators◦ Simple generators◦ Metagenerators

Update definition◦ Insert, update, delete◦ Generated as change data capture

<table name="SUPPLIER"> <size>${S}</size> <field name="S_SUPPKEY" size="" type="NUMERIC“ primary="true" unique="true"> <gen_IdGenerator /> </field> <field name="S_NAME" size="25" type="VARCHAR"> <gen_PrePostfixGenerator> <gen_PaddingGenerator> <gen_OtherFieldValueGenerator> <reference field="S_SUPPKEY" /> </gen_OtherFieldValueGenerator > <character>0</character> <padToLeft>true</padToLeft> <size>9</size> </gen_PaddingGenerator > <prefix>Supplier </prefix> </gen_PrePostfixGenerator> </field>[..]

PDGFXML DB

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 11

Opportunities to Inject Data Skew in

Foreign key relations◦ E.g., L_PARTKEY

One fact table measures◦ E.g., L_Quantity

Single dimension hierarchy◦ E.g., P_Brand → P_Category → P_Mfgr

Multiple dimension hierarchies◦ E.g., City → Nation in Supplier and Customer

Experimental methodology◦ One experiment series for each of the above◦ Comparison to original SSB◦ Comparison of index-forced, non-index, and automatic optimizer mode◦ SSB scale factor 100 (100 GB), x86 server

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 12

Skew in Foreign Key Relations

Very realistic Easy to implement in PDGF

◦ Just add a distribution to the reference

But! Dimension attributes uniformly distributed Dimension keys uncorrelated to dimension attributes Very limited effect on selectivity Focus on attributes in selectivity predicates

<distribution name="Exponential“ lambda="0.26235" />

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 13

Lo_Quantity distribution◦ Values range between 0 and 50◦ Originally uniform distribution with:

◦ P(X=x)=0.02◦ Coefficient of variation of 0.00000557

◦ Proposed skewed distribution with:◦

Query 1.1◦ lo_quantity < x, x ∈ [2, 51]

Results◦ Switches too early to non-index plan◦ Switches too late to non-index plan◦ Optimizer agnostic to distribution

Skew in Fact Table Measure – Lo_Quantity

xxXP

3.1

3.0)(

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 14

Skew in Single Dimension Hierarchy - Part

P_Category distribution◦ Uniform P(X=x)=0.04◦ Skewed P(X=x)= 0.01 - 48.36◦ Probabilities explicitly defined

Query 2.1 ◦ Restrictions on two dimensions

Results uniform case◦ Index driven superior◦ Optimizer chooses non-index driven

Results skewed case◦ Switches too early to non-index plan

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 15

Skewed S_City & C_City◦ Probabilites exponentially

distributed

Query 3.3◦ Restrictions on 3 dimensions◦ Variation on Supplier and Customer

city

Results uniform and skewed cases◦ Automatic plan performs best◦ Cross over between automatic

uniform and skewed too late

Skew in Multiple Dimension Hierarchies – S_City & C_City

Join Cardinality Elapsed Time

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 16

Conclusion & Future Work

PDGF implementation of SSB

Introduction of skew in SSB

Extensive performance analysis◦ Several interesting optimizer effects◦ Performance impact of skew

Future Work Further analysis on impact of skew

Skew in query generation

Complete suite for testing skew effects

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 17

Thanks

Questions?

Download and try PDGF:

http://www.paralleldatageneration.org

(scripts used in the study available on website above)

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 18

Back-up Slides

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 19

Configuring PDGF Generation

Generation configuration

Defines the output◦ Scheduling◦ Data format◦ Sorting◦ File name and location

Post processing◦ Filtering of values◦ Merging of tables◦ Splitting of tables◦ Templates (e.g. XML / queries)

<table name="QUERY_PARAMETERS" exclude="false" > <output name="CompiledTemplateOutput" > [..] <template ><!-- int y = (fields [0]. getPlainValue ()).intValue (); int d = (fields [1]. getPlainValue ()).intValue (); int q = (fields [2]. getPlainValue ()).intValue (); String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR; buffer.append("-- Q1.1" + n); buffer.append("select sum(lo_extendedprice *"); buffer.append(" lo_discount) as revenue" + n); buffer.append(“ from lineorder , date" + n); buffer.append(“ where lo_orderdate = d_datekey" + n); buffer.append(“ and d_year = " + y + n); buffer.append(“ and lo_disc between " + (d - 1)); buffer.append(“ and " + (d + 1) + n); buffer.append(“ and lo_quantity < " + q + ";" + n); --></template > </output ></table >