1 C-Store: A Column-oriented DBMS By New England Database Group.

32
1 C-Store: A Column-oriented DBMS By New England Database Group

Transcript of 1 C-Store: A Column-oriented DBMS By New England Database Group.

1

C-Store: A Column-oriented DBMS

By

New England Database Group

2

M.I.T

Current DBMS Gold StandardCurrent DBMS Gold Standard Current DBMS Gold StandardCurrent DBMS Gold Standard

Store fields in one record contiguously on diskUse B-tree indexingUse small (e.g. 4K) disk blocksAlign fields on byte or word boundariesConventional (row-oriented) query optimizer

and executor (technology from 1979)Aries-style transactions

Store fields in one record contiguously on diskUse B-tree indexingUse small (e.g. 4K) disk blocksAlign fields on byte or word boundariesConventional (row-oriented) query optimizer

and executor (technology from 1979)Aries-style transactions

3

M.I.T

Terminology -- “Row Store”

Record 2

Record 4

Record 1

Record 3

E.g. DB2, Oracle, Sybase, SQLServer, …

4

M.I.T

Row Stores are Row Stores are Write OptimizedWrite Optimized Row Stores are Row Stores are Write OptimizedWrite Optimized

Can insert and delete a record in one physical

writeGood for on-line transaction processing (OLTP)But not for read mostly applications

Data warehousesCRM

Can insert and delete a record in one physical

writeGood for on-line transaction processing (OLTP)But not for read mostly applications

Data warehousesCRM

5

M.I.T

Elephants Have Extended Row StoresElephants Have Extended Row StoresElephants Have Extended Row StoresElephants Have Extended Row Stores

With Bitmap indicesBetter sequential readIntegration of “datacube” productsMaterialized views

With Bitmap indicesBetter sequential readIntegration of “datacube” productsMaterialized views

But there may be a better idea…….

6

M.I.T

Column StoresColumn Stores

7

M.I.T

At 100K Feet…. At 100K Feet….

Ad-hoc queries read 2 columns out of 20In a very large warehouse, Fact table is

rarely clustered correctlyColumn store reads 10% of what a row

store reads

Ad-hoc queries read 2 columns out of 20In a very large warehouse, Fact table is

rarely clustered correctlyColumn store reads 10% of what a row

store reads

8

M.I.T

C-Store (Column Store) ProjectC-Store (Column Store) Project

Brandeis/Brown/MIT/UMass-Boston

projectUsual suspects participatingEnough coded to get performance

numbers for some queriesComplete status later

Brandeis/Brown/MIT/UMass-Boston

projectUsual suspects participatingEnough coded to get performance

numbers for some queriesComplete status later

9

M.I.T

We Build on Previous Pioneering We Build on Previous Pioneering Work….Work….

Sybase IQ (early ’90s)Monet (see CIDR ’05 for the most recent

description)

Sybase IQ (early ’90s)Monet (see CIDR ’05 for the most recent

description)

10

M.I.T

C-Store Technical IdeasC-Store Technical Ideas

Code the columns to save spaceNo alignmentBig disk blocksOnly materialized views (perhaps many)Focus on Sorting not indexingAutomatic physical DBMS design

Code the columns to save spaceNo alignmentBig disk blocksOnly materialized views (perhaps many)Focus on Sorting not indexingAutomatic physical DBMS design

11

M.I.T

C-store (Column Store) Technical C-store (Column Store) Technical IdeasIdeas

Optimize for grid computingInnovative redundancyXacts – but no need for MohanData ordered on anything, Not just timeColumn optimizer and executor

Optimize for grid computingInnovative redundancyXacts – but no need for MohanData ordered on anything, Not just timeColumn optimizer and executor

12

M.I.T

How to Evaluate This Paper….How to Evaluate This Paper….

None of the ideas in isolation merit

publication Judge the complete system by its

(hopefully intelligent) choice ofSmall collection of inter-related

powerful ideasThat together put performance in a

new sandbox

None of the ideas in isolation merit

publication Judge the complete system by its

(hopefully intelligent) choice ofSmall collection of inter-related

powerful ideasThat together put performance in a

new sandbox

13

M.I.T

Code the ColumnsCode the Columns

Work hard to shrink spaceUse extra space for multiple orders

Fundamentally easier than in a row storeE.g. RLE works well

Work hard to shrink spaceUse extra space for multiple orders

Fundamentally easier than in a row storeE.g. RLE works well

14

M.I.T

No AlignmentNo Alignment

Densepack columnsE.g. a 5 bit field takes 5 bits

Current CPU speed going up faster than

disk bandwidthFaster to shift data in CPU than to

waste disk bandwidth

Densepack columnsE.g. a 5 bit field takes 5 bits

Current CPU speed going up faster than

disk bandwidthFaster to shift data in CPU than to

waste disk bandwidth

15

M.I.T

Big Disk BlocksBig Disk Blocks

TunableBig (minimum size is 64K)

TunableBig (minimum size is 64K)

16

M.I.T

Only Materialized ViewsOnly Materialized Views

Projection (materialized view) is some

number of columns from a fact tablePlus columns in a dimension table – with

a 1-n join between Fact and Dimension

tableStored in order of a storage key(s)Several may be stored!!!!!With a permutation, if necessary, to map

between them

Projection (materialized view) is some

number of columns from a fact tablePlus columns in a dimension table – with

a 1-n join between Fact and Dimension

tableStored in order of a storage key(s)Several may be stored!!!!!With a permutation, if necessary, to map

between them

17

M.I.T

Only Materialized ViewsOnly Materialized Views

Table (as the user specified it and sees

it) is not stored!No secondary indexes (they are a one

column sorted MV plus a permutation, if

you really want one)

Table (as the user specified it and sees

it) is not stored!No secondary indexes (they are a one

column sorted MV plus a permutation, if

you really want one)

18

M.I.T

Example

User view:

EMP (name, age, salary, dept)

Dept (dname, floor)

Possible set of MVs:MV-1 (name, dept, floor) in floor order

MV-2 (salary, age) in age order

MV-3 (dname, salary, name) in salary order

19

M.I.T

Different Indexing

Few values Many values

Sequential

RLE encoded

Conventional B-tree at

the value level

Delta encoded

Conventional B-tree at

the block level

Non sequential Bitmap per value

Conventional Gzip

Conventional B-tree at

the block level

20

M.I.T

Automatic Physical DBMS DesignAutomatic Physical DBMS Design

Not enough 4-star wizards to go aroundAccept a “training set” of queries and a

space budgetChoose the MVs auto-magicallyRe-optimize periodically based on a log

of the interactions

Not enough 4-star wizards to go aroundAccept a “training set” of queries and a

space budgetChoose the MVs auto-magicallyRe-optimize periodically based on a log

of the interactions

21

M.I.T

Optimize for Grid ComputingOptimize for Grid Computing

I.e. shared-nothingDewitt (Gamma) was right

Horizontal partitioning and intra-query

parallelism as in Gamma

I.e. shared-nothingDewitt (Gamma) was right

Horizontal partitioning and intra-query

parallelism as in Gamma

22

M.I.T

Innovative RedundancyInnovative Redundancy

Hardly any warehouse is recovered by a

redo from the logTakes too long!

Store enough MVs at enough places to

ensure K-safetyRebuild dead objects from elsewhere in

the networkK-safety is a DBMS-design problem!

Hardly any warehouse is recovered by a

redo from the logTakes too long!

Store enough MVs at enough places to

ensure K-safetyRebuild dead objects from elsewhere in

the networkK-safety is a DBMS-design problem!

23

M.I.T

XACTS – No MohanXACTS – No Mohan

Undo from a log (that does not need to

be persistent)Redo by rebuild from elsewhere in the

network

Undo from a log (that does not need to

be persistent)Redo by rebuild from elsewhere in the

network

24

M.I.T

XACTS – No MohanXACTS – No Mohan

Snapshot isolation (run queries as of a

tunable time in the recent past)To solve read-write conflicts

Distributed XactsWithout a prepare message (no 2

phase commit)

Snapshot isolation (run queries as of a

tunable time in the recent past)To solve read-write conflicts

Distributed XactsWithout a prepare message (no 2

phase commit)

25

M.I.T

Storage (sort) Key(s) is not Storage (sort) Key(s) is not Necessarily TimeNecessarily Time

That would be too limitingSo how to do fast updates to densepack

column storage that is not in entry

sequence?

That would be too limitingSo how to do fast updates to densepack

column storage that is not in entry

sequence?

26

M.I.T

Solution – a Hybrid StoreSolution – a Hybrid Store

Read-optimized

Column store

Write-optimized

Column store

Tuple mover

(Much like Monet)

(What we have been

talking about so far)

(Batch rebuilder)

27

M.I.T

Column ExecutorColumn Executor

Column operations – not row operationsColumns remain coded – if possibleLate materialization of columns

Column operations – not row operationsColumns remain coded – if possibleLate materialization of columns

28

M.I.T

Column OptimizerColumn Optimizer

Chooses MVs on which to run the queryMost important task

Build in snowflake schemas Which are simple to optimize without

exhaustive searchLooking at extensions

Chooses MVs on which to run the queryMost important task

Build in snowflake schemas Which are simple to optimize without

exhaustive searchLooking at extensions

29

M.I.T

Current PerformanceCurrent Performance

100X popular row store in 40% of the

space10X popular column store in 70% of the

space7X popular row store in 1/6th of the spaceCode available with BSD license

100X popular row store in 40% of the

space10X popular column store in 70% of the

space7X popular row store in 1/6th of the spaceCode available with BSD license

30

M.I.T

Structure Going ForwardStructure Going Forward

Vertica Very well financed start-up to

commercialize C-storeDoing the heavy lifting

University ResearchFunded by Vertica

Vertica Very well financed start-up to

commercialize C-storeDoing the heavy lifting

University ResearchFunded by Vertica

31

M.I.T

VerticaVertica

Complete alpha system in December ‘05Everything, including DBMS designerWith current performance!Looking for early customers to work with

(see me if you are interested)

Complete alpha system in December ‘05Everything, including DBMS designerWith current performance!Looking for early customers to work with

(see me if you are interested)

32

M.I.T

University ResearchUniversity Research

Extension of algorithms to non-snowflake

schemasStudy of L2 cache performanceStudy of coding strategiesStudy of executor optionsStudy of recovery tacticsNon-cursor interfaceStudy of optimizer primitives

Extension of algorithms to non-snowflake

schemasStudy of L2 cache performanceStudy of coding strategiesStudy of executor optionsStudy of recovery tacticsNon-cursor interfaceStudy of optimizer primitives