Optimising Column Stores with Statistical Analysis


A presentation about column stores: how they work, and how to optimise compression with them.


How do Column Stores Work?

Turning Rows into Columns

Sales

Product   Customer    Date         Sale
Beer      Thomas      2011-11-25    2 GBP
Beer      Thomas      2011-11-25    2 GBP
Vodka     Thomas      2011-11-25   10 GBP
Whiskey   Christian   2011-11-25    5 GBP
Whiskey   Christian   2011-11-25    5 GBP
Vodka     Alexei      2011-11-25   10 GBP
Vodka     Alexei      2011-11-25   10 GBP

Product

ID   Value
1    Beer
2    Beer
3    Vodka
4    Whiskey
5    Whiskey
6    Vodka
7    Vodka

Customer

ID   Customer
1    Thomas
2    Thomas
3    Thomas
4    Christian
5    Christian
6    Alexei
7    Alexei

And so on… until…

And we get the Product and Customer columns above, plus:

Date

ID   Date
1    2011-11-25
2    2011-11-25
3    2011-11-25
4    2011-11-25
5    2011-11-25
6    2011-11-25
7    2011-11-25

Sale

ID   Sale
1     2 GBP
2     2 GBP
3    10 GBP
4     5 GBP
5     5 GBP
6    10 GBP
7    10 GBP
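To make the decomposition concrete, here is a minimal Python sketch (the data and names are just the example from the slides): each column becomes its own array, and the row ID is simply the array index.

```python
# Rows from the Sales table above.
rows = [
    ("Beer",    "Thomas",    "2011-11-25",  2),
    ("Beer",    "Thomas",    "2011-11-25",  2),
    ("Vodka",   "Thomas",    "2011-11-25", 10),
    ("Whiskey", "Christian", "2011-11-25",  5),
    ("Whiskey", "Christian", "2011-11-25",  5),
    ("Vodka",   "Alexei",    "2011-11-25", 10),
    ("Vodka",   "Alexei",    "2011-11-25", 10),
]

# Turn rows into columns: the implicit row ID is the array index.
product, customer, date, sale = (list(col) for col in zip(*rows))

print(product)  # ['Beer', 'Beer', 'Vodka', 'Whiskey', 'Whiskey', 'Vodka', 'Vodka']
print(sale)     # [2, 2, 10, 5, 5, 10, 10]
```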

And what now?

Product

ID   Value
1    Beer
2    Beer
3    Vodka
4    Whiskey
5    Whiskey
6    Vodka
7    Vodka

Run-length encode:

Product’

ID    Value
1-2   Beer
3     Vodka
4-5   Whiskey
6-7   Vodka

Applying Compression

Applying the same encoding to the remaining columns:

Customer’

ID    Customer
1-3   Thomas
4-5   Christian
6-7   Alexei

Date’

ID    Date
1-7   2011-11-25

Sale’

ID    Sale
1-2    2 GBP
3     10 GBP
4-5    5 GBP
6-7   10 GBP
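A minimal sketch of the run-length encoder used above (1-based row IDs to match the slides; itertools.groupby does the run detection):

```python
from itertools import groupby

def rle(column):
    """Collapse consecutive equal values into (first_id, last_id, value) runs."""
    runs, row_id = [], 1  # row IDs are 1-based, as on the slides
    for value, group in groupby(column):
        n = sum(1 for _ in group)
        runs.append((row_id, row_id + n - 1, value))
        row_id += n
    return runs

product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
print(rle(product))
# [(1, 2, 'Beer'), (3, 3, 'Vodka'), (4, 5, 'Whiskey'), (6, 7, 'Vodka')]
```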

Insights
• With a dictionary, every value can be assumed to fit in a machine word (64 bits)
• Compression is proportional to the total number of run lengths (RLs) across all columns
• The number of RLs depends on the ordering of the rows

Product’

ID    Value
1-2   Beer     ← one RL
3     Vodka
4-5   Whiskey
6-7   Vodka

Ordering Example

Product   Customer
Beer      Thomas
Beer      Thomas
Vodka     Thomas
Whiskey   Christian
Whiskey   Christian
Vodka     Alexei
Vodka     Alexei

VS.

Product   Customer
Beer      Thomas
Whiskey   Christian
Vodka     Thomas
Whiskey   Christian
Beer      Thomas
Vodka     Alexei
Vodka     Alexei

The sorted ordering run-length encodes to four Product runs and three Customer runs; the shuffled ordering yields six runs in each column. A comparison of both orderings is sketched in code below.
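A small sketch that counts the total number of runs a row ordering produces (the two orderings are the tables above; fewer runs means better RLE compression):

```python
from itertools import groupby

def total_runs(rows):
    """Total number of RLE runs across all columns for a given row ordering."""
    return sum(sum(1 for _ in groupby(col)) for col in zip(*rows))

sorted_rows = [
    ("Beer", "Thomas"), ("Beer", "Thomas"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Whiskey", "Christian"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]
shuffled_rows = [
    ("Beer", "Thomas"), ("Whiskey", "Christian"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Beer", "Thomas"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]

print(total_runs(sorted_rows))    # 7  = 4 Product runs + 3 Customer runs
print(total_runs(shuffled_rows))  # 12 = 6 Product runs + 6 Customer runs
```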

There is some overhead…

                    Cluster on ID   Heap
Data Size           327 MB          327 MB
Column Index Size   59 MB           142 MB

Manipulating the Rules

Rule of Thumb?

“Sort by lowest cardinality column first”

Rationale: low cardinality columns have the potential for long RLs.

(C1, C2): 68 MB   (C2, C1): 61 MB   → Lowest first is worse!

OK, so what about highest first? On data with loose correlation:

(C1, C2): 64 MB   (C2, C1): 68 MB   → Highest first is worse!

What are we looking for?
1) Values that are skewed or have low cardinality
2) Columns that correlate/cluster with other columns

Just Read the Magic Code?
• Values with low cardinality are easy to find (COUNT DISTINCT)
• Is there a more general way to classify the notion of “predictable content of a column”?
• Yes, Entropy:

H(X) = -Σ p(x) log2 p(x)
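A minimal sketch of computing a column’s entropy in bits (plain Python; the two test columns are illustrative stand-ins for a heavily skewed column and a unique ID column):

```python
import math
import random
from collections import Counter

def entropy(column):
    """Shannon entropy in bits: H(X) = -sum(p(x) * log2(p(x)))."""
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in Counter(column).values())

random.seed(0)
skew = [0] * 990_000 + random.choices(range(1, 10_001), k=10_000)  # one value dominates
ident = list(range(1_000_000))                                     # unique row IDs

print(f"{entropy(skew):.2f}")   # low: seeing a value is rarely a surprise
print(f"{entropy(ident):.2f}")  # ~19.93, i.e. log2(1,000,000) ~ 20 bits
```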

Coming to Terms with Entropy
• Intuition: a single number expressing the amount of “surprise” at seeing a value in a column
• Consider an example:

                   SKEW      SPLAT     ID
COUNT DISTINCT     10001     10001     1000000
DISTINCT / COUNT   0.01      0.01      1

[Histograms of the three columns’ value distributions]

Calculate and Evaluate

          SKEW     SPLAT   ID
Entropy   ≈ 0.21   ≈ 13    ≈ 20

New theory: Lower Entropy First (sketched in code below)

[Results chart, annotated: “You will NEVER win” / “Take best of these”]
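A sketch of the “lower entropy first” heuristic: measure each column’s entropy, then sort the rows using the columns as sort keys from lowest to highest entropy (self-contained; entropy() as defined above):

```python
import math
from collections import Counter

def entropy(column):
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in Counter(column).values())

def lower_entropy_first(rows):
    """Sort rows on columns ordered from lowest to highest entropy."""
    columns = list(zip(*rows))
    key_order = sorted(range(len(columns)), key=lambda i: entropy(columns[i]))
    return sorted(rows, key=lambda row: tuple(row[i] for i in key_order))

rows = [("Beer", "Thomas"), ("Whiskey", "Christian"), ("Vodka", "Thomas"),
        ("Whiskey", "Christian"), ("Beer", "Thomas"),
        ("Vodka", "Alexei"), ("Vodka", "Alexei")]
for row in lower_entropy_first(rows):
    print(row)
```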

Columns that “cluster” with other columns?
• Is there a way to calculate this?
• Yes indeed, information theory to the rescue again
• Mutual information: I(X;Y) = H(X) + H(Y) - H(X,Y)
• “How much knowing X tells me about Y”

Mutual WHAT?

[Venn diagram of H(X|Y), H(Y|X) and I(X;Y)]

From I(X;Y) we can find the distance between columns.

“Find the minimal distance that visits all columns in the information plane”

[Diagram: columns C1, C2, C3 as points in the information plane, with edges d(c1,c2) and d(c2,c3)]
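The slides don’t spell out which distance is used; one standard choice derived from mutual information is the variation of information, d(X,Y) = H(X,Y) - I(X;Y), which is zero when two columns fully determine each other. Here is a sketch, including a greedy nearest-neighbour walk through the “information plane” (the greedy walk is an illustrative heuristic, not necessarily the talk’s exact algorithm):

```python
import math
from collections import Counter

def entropy(column):
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in Counter(column).values())

def information_distance(x, y):
    """Variation of information: d(X,Y) = H(X,Y) - I(X;Y) = 2*H(X,Y) - H(X) - H(Y)."""
    joint = entropy(list(zip(x, y)))
    return 2 * joint - entropy(x) - entropy(y)

def greedy_column_order(columns):
    """Start at the lowest-entropy column, then repeatedly hop to the closest
    unvisited column. Greedy, so it can get stuck in a local optimum."""
    path = [min(columns, key=lambda c: entropy(columns[c]))]
    while len(path) < len(columns):
        here = path[-1]
        rest = [c for c in columns if c not in path]
        path.append(min(rest, key=lambda c: information_distance(columns[here], columns[c])))
    return path

columns = {
    "Product":  ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"],
    "Customer": ["Thomas", "Thomas", "Thomas", "Christian", "Christian", "Alexei", "Alexei"],
    "Date":     ["2011-11-25"] * 7,
}
print(greedy_column_order(columns))  # e.g. ['Date', 'Product', 'Customer']
```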

So, how does THAT work?

Better… not impressive… but more consistent.

[Results chart, annotated “Take best of these”; the medicine is compared against a placebo (presumably a random ordering)]

What else does d(X,Y) tell us?
• Consider this fact table: d(A, B) is zero!
• What is our expected estimate of rows? An optimizer that assumes the columns are independent multiplies their distinct counts; with d(A, B) = 0 the columns fully determine each other, so the real number of combinations is just one column’s distinct count. See the sketch below.
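A tiny illustration of how badly the independence assumption can miss when d(A, B) = 0 (columns A and B are made up for the example):

```python
a = [1, 2, 3, 4] * 25
b = [x * 10 for x in a]  # B is a pure function of A, so d(A, B) = 0

print(len(set(a)) * len(set(b)))  # independence estimate: 16 combinations
print(len(set(zip(a, b))))        # actual distinct (A, B) pairs: 4
```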

Dodge this!

Fix:

Why is this so Hard?

Reflecting on Information Distance

“Find the shortest path that visits all cities on a map”

Picture credits: RUC.dk

How many routes are there?

n! = n * (n-1) * (n-2) * … * 1

Travelling Salesman Problem (TSP)

There are MORE than n! routes
• What if lexicographic ordering of the columns isn’t best?
• Daniel Lemire et al.: “Reordering Rows for Better Compression: Beyond the Lexicographic Order” (http://arxiv.org/pdf/1207.2189.pdf)
• Some routes may be ruled out immediately (e.g. don’t go to Skagen from Copenhagen and then to Roskilde)
• The issue of local optima exists

Heuristics are your Best Bet
• “Find minimum RLE” can be shown to be NP-complete; there is no known fast algorithm that finds the optimum
• I have shown you one heuristic: a moderate gain for a small effort
• Shown that 2x gains are possible
• Any ordering is (typically) better than random, often by a lot
• I wrote a tool to help analyse: TableStat.exe
• Interested? Come up and talk after
• I need more real-life datasets to test on

P = NP? We just don’t know.