
C-Store: A Column-oriented DBMS

New England Database Group

(Stonebraker, et al. Brandeis/Brown/MIT/UMass-Boston)

Extended for Big Data Reading Group Presentation by Shimin Chen


M.I.T

Relational Database

(Figure: a table whose rows are Record 1, Record 2, Record 3 and whose columns are Attribute1, Attribute2, Attribute3)

e.g.
– Customer(cid, name, address, discount)
– Product(pid, name, manufacturer, price, quantity)
– Order(oid, cid, pid, quantity)


Current DBMS -- “Row Store”

(Figure: whole records laid out one after another: Record 2, Record 4, Record 1, Record 3)

E.g. DB2, Oracle, Sybase, SQLServer, …


Row Stores are Write Optimized

(use white board)

Store fields in one record contiguously on disk
Use small (e.g. 4K) disk blocks
Use B-tree indexing
Align fields on byte or word boundaries
– Assume shifting data values is costly
Transactions: write-ahead logging


Row Stores are Write Optimized

Can insert and delete a record in one physical write
Good for on-line transaction processing (OLTP)
But not for read-mostly applications
– Data warehouses
– Customer Relationship Management (CRM)
– Electronic library card catalogs
– …


Column Stores


At 100K Feet…

Read-optimized:
– Periodically a bulk load of new data
– Long period of ad-hoc queries
Benefit:
– Ad-hoc queries read 2 columns out of 20
– Column store reads 10% of what a row store reads
Previous pioneering work:
– Sybase IQ (early ’90s)
– Monet (see CIDR ’05 for the most recent description)
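The 10% figure is plain arithmetic when every column has the same width. A minimal sketch, assuming 8-byte fixed-width values, no compression, and a full scan (all three are assumptions made only for illustration; `bytes_scanned` is a hypothetical helper):

```python
def bytes_scanned(num_rows, num_columns_read, value_width=8):
    """Bytes a full scan touches when it reads `num_columns_read` columns per row."""
    return num_rows * num_columns_read * value_width

NUM_ROWS = 1_000_000   # illustrative table size, not from the slides
TOTAL_COLUMNS = 20     # table width from the slide
QUERY_COLUMNS = 2      # columns the ad-hoc query touches

# A row store fetches whole records, so every column is read.
row_store = bytes_scanned(NUM_ROWS, TOTAL_COLUMNS)
# A column store fetches only the columns the query names.
column_store = bytes_scanned(NUM_ROWS, QUERY_COLUMNS)

fraction = column_store / row_store
print(f"column store reads {fraction:.0%} of what the row store reads")
```

Compression and differing column widths would change the ratio; the sketch covers only the equal-width case the slide implies.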


C-Store Technical Ideas

Data storage: Only materialized views (perhaps many)

Compress the columns to save space

No alignment

Big disk blocks

Innovative redundancy

Optimize for grid (cluster) computing

Focus on Sorting not indexing

Automatic physical DBMS design

Column optimizer and executor


How to Evaluate This Paper…

None of the ideas in isolation merit publication
Judge the complete system by its (hopefully intelligent) choice of
– Small collection of inter-related powerful ideas
– That together put performance in a new sandbox


Outline

Overview
Read-optimized column store
Query execution and optimization
Handling transactional updates
Performance
Summary


Data Model

Projection (materialized view): some number of columns from a fact table, plus columns in a dimension table
– with a 1-n join between Fact and Dimension table
– (conceptually) no duplicate elimination
Stored in order of a storage key(s)
Note: base table is not stored anywhere


Example

Logical base tables:
– EMP (name, age, salary, dept)
– DEPT (dname, floor)

Example projections
– EMP1 (name, age | age)
– EMP2 (dept, age, DEPT.floor | DEPT.floor)
– EMP3 (name, salary | salary)
– DEPT1 (dname, floor | floor)
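In the projection notation, the columns are listed before the `|` and the sort key after it. A minimal sketch of how two of these projections could be derived from the logical tables; the sample rows and the `project` helper are hypothetical, since the slides give only the schemas:

```python
# Hypothetical sample rows for the logical base tables.
EMP = [
    {"name": "Bob", "age": 25, "salary": 10_000, "dept": "Math"},
    {"name": "Bill", "age": 27, "salary": 50_000, "dept": "EECS"},
    {"name": "Jill", "age": 24, "salary": 80_000, "dept": "Biology"},
]
DEPT = [
    {"dname": "Math", "floor": 2},
    {"dname": "EECS", "floor": 1},
    {"dname": "Biology", "floor": 3},
]

def project(rows, columns, sort_key):
    """Return a column-wise projection: one list per column, all in sort_key order."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    return {c: [r[c] for r in ordered] for c in columns}

# EMP1 (name, age | age)
emp1 = project(EMP, ["name", "age"], "age")

# EMP2 (dept, age, DEPT.floor | DEPT.floor): each EMP row joins to exactly one
# DEPT row, so the join adds a column without adding or removing rows.
floor_of = {d["dname"]: d["floor"] for d in DEPT}
joined = [{**e, "DEPT.floor": floor_of[e["dept"]]} for e in EMP]
emp2 = project(joined, ["dept", "age", "DEPT.floor"], "DEPT.floor")
```

Each projection keeps all of its columns in the same order, so position i in one column and position i in another belong to the same logical record.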


Optimize for Grid Computing

I.e. shared-nothing
Horizontal partitioning and intra-query parallelism as in Gamma
Paper talks about “Grid computers … may have tens to hundreds of nodes …”


Projection Detail #1

Each projection is horizontally partitioned into “segments”
– Segment identifier
– Unit of distribution and parallelism
– Value-based partitioning, key range of sort key(s)
Column-wise store inside segment
Storage key: ordinal record number in segment
– calculated as needed
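Value-based partitioning and ordinal storage keys can be sketched as follows. This is an illustration under stated assumptions, not C-Store's implementation: the boundary values, the 1-based numbering, and both helper names are hypothetical.

```python
from bisect import bisect_left

def partition_into_segments(sorted_values, boundaries):
    """Split a column that is already in sort-key order into key-range segments.

    `boundaries` holds the inclusive upper bound of every segment except the last.
    """
    segments = [[] for _ in range(len(boundaries) + 1)]
    for value in sorted_values:
        # Index of the first boundary >= value, i.e. the segment whose range holds it.
        segments[bisect_left(boundaries, value)].append(value)
    return segments

def locate(segments, value):
    """Return (segment ID, storage key) of the first record holding `value`."""
    for sid, segment in enumerate(segments, start=1):
        if value in segment:
            # The storage key is just the position inside the segment; it is
            # computed here rather than stored.
            return sid, segment.index(value) + 1
    return None

ages = [24, 25, 27, 31, 40]                      # sort-key column, already sorted
segments = partition_into_segments(ages, [26, 35])
```

Here `locate(segments, 31)` yields segment 2, storage key 2: the position is derived from the layout, which is what "calculated as needed" suggests.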


Projection Detail #2

Different encoding schemes for different columns
Depends on ordering and value distribution
– Self-order, few distinct values: (value, position, num_entries)
– Foreign-order, few distinct values: (value, bitmap), bitmap is run-length encoded
– Self-order, many distinct values: block-oriented, delta value encoding
– Foreign-order, many distinct values: gzip
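The two "few distinct values" encodings can be sketched as below. The function names are hypothetical, positions are assumed to be 1-based, and the bitmaps are kept as plain 0/1 lists rather than run-length encoded, to keep the example short:

```python
def encode_self_order_few(column):
    """Encode a sorted column as (value, position, num_entries) triples."""
    triples = []
    for position, value in enumerate(column, start=1):
        if triples and triples[-1][0] == value:
            # Same value as the current run: extend the run by one entry.
            run_value, first_position, num_entries = triples[-1]
            triples[-1] = (run_value, first_position, num_entries + 1)
        else:
            # New value: start a run at this position.
            triples.append((value, position, 1))
    return triples

def encode_foreign_order_few(column):
    """Encode an unsorted column as (value, bitmap) pairs, one bitmap per distinct value."""
    bitmaps = {}
    for index, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[index] = 1
    return sorted(bitmaps.items())

runs = encode_self_order_few([4, 4, 4, 7, 7, 9])
pairs = encode_foreign_order_few([0, 0, 1, 1, 2, 1, 0, 2, 1])
```

For the sorted column the result is three triples, `(4, 1, 3)`, `(7, 4, 2)`, `(9, 6, 1)`, regardless of how long each run is, which is why this scheme suits sorted columns with few distinct values.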


Different Indexing

                                 Few values                 Many values
Sequential (self-order)          RLE encoded;               Delta encoded;
                                 conventional B-tree at     … the block level
                                 the value level
Non sequential (foreign-order)   Bitmap per value           Conventional Gzip;
                                                            … the block level
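Delta encoding for a sorted column with many distinct values can be sketched as follows: a block stores its first value in full, then only the difference from the previous value. The function names are hypothetical and the block here is a plain Python list, not an on-disk block:

```python
from itertools import accumulate

def delta_encode(block):
    """Keep the first value, then store each value as a difference from its predecessor."""
    if not block:
        return []
    return [block[0]] + [current - previous for previous, current in zip(block, block[1:])]

def delta_decode(encoded):
    """Rebuild the original values with a running sum over the deltas."""
    return list(accumulate(encoded))

encoded = delta_encode([1, 4, 7, 7, 8, 12])
decoded = delta_decode(encoded)
```

Because the column is sorted, the deltas are small non-negative numbers that fit in fewer bits than the original values.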


Big Disk Blocks

Tunable
Big (minimum size is 64K)


Reconstructing Base Table from Projections

Join Index:
– Projection T1 has M segments, projection T2 has N segments
– T1 and T2 are on same base table
– Join index consists of M tables, one per T1 segment
– Entry: segment ID and storage key of corresponding record in T2
In general, needs multiple join indices for reconstructing a base table
Join index is costly to store and maintain
– Each column expected to be in multiple projections
– Reduce # of join indices
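A join index from T1 to T2 can be sketched as below. For illustration each segment is a list of record identifiers (employee names); that representation, the sample orderings, and the function name are assumptions, not the stored format:

```python
def build_join_index(t1_segments, t2_segments):
    """Build one table per T1 segment; entry i locates T1's i-th record inside T2."""
    # Where each logical record lives in T2: (segment ID, storage key), both 1-based.
    where_in_t2 = {
        record: (sid, storage_key)
        for sid, segment in enumerate(t2_segments, start=1)
        for storage_key, record in enumerate(segment, start=1)
    }
    return [[where_in_t2[record] for record in segment] for segment in t1_segments]

# Hypothetical orderings: T1 is sorted by salary, T2 by age, one segment each.
t1 = [["Bob", "Bill", "Jill"]]
t2 = [["Jill", "Bob", "Bill"]]
join_index = build_join_index(t1, t2)

# Follow the index to fetch the T2 row that matches T1's second record.
sid, storage_key = join_index[0][1]
matching_record = t2[sid - 1][storage_key - 1]
```

The index has one entry per record in T1, which is why storing and maintaining many of them is costly.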


Innovative Redundancy

Hardly any warehouse is recovered by redo from log
– Takes too long!
Store enough projections to ensure K-safety
– Column can be in K different projections
Rebuild dead objects from elsewhere in the network


Automatic Physical DBMS Design

Accept a “training set” of queries and a space budget
Choose the projections and join indices auto-magically
Re-optimize periodically based on a log of the interactions


Operators

Decompress
Select: generate bitstring
Mask: bitstring + projection → selected rows
Project: choose a subset of columns
Concat: combine multiple projections that are sorted in the same order
Sort
Permute: according to a join index
Join
Aggregation operators
Bitstring operators
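How Select, Mask, and a bitstring operator fit together can be sketched as follows. The function names and the sample projection are hypothetical, and bitstrings are plain 0/1 lists here:

```python
def select(column, predicate):
    """Select: produce a bitstring with a 1 for every position that satisfies the predicate."""
    return [1 if predicate(value) else 0 for value in column]

def bit_and(left, right):
    """Bitstring operator: combine two predicates position by position."""
    return [a & b for a, b in zip(left, right)]

def mask(bitstring, projection):
    """Mask: keep only the rows of the projection whose bit is set."""
    return {
        name: [value for value, bit in zip(values, bitstring) if bit]
        for name, values in projection.items()
    }

# Hypothetical projection with two columns stored in the same order.
emp = {"name": ["Jill", "Bob", "Bill"], "age": [24, 25, 27]}

older = select(emp["age"], lambda age: age > 24)
not_bill = select(emp["name"], lambda name: name != "Bill")
result = mask(bit_and(older, not_bill), emp)
```

Each predicate touches only its own column; the rows are materialized once, after the bitstrings have been combined.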


Execution

Query plan: a tree of operators (data flow)
– Leaf: accessing the data storage
– Internal: calls “get_next”
Operators return 64KB blocks
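The pull-based data flow can be sketched with two operators. The class names are hypothetical, and the block size is counted in values rather than set to 64KB, purely to keep the example small:

```python
class Scan:
    """Leaf operator: reads a stored column one block at a time."""

    def __init__(self, column, block_size):
        self.column = column
        self.block_size = block_size
        self.position = 0

    def get_next(self):
        if self.position >= len(self.column):
            return None                      # no more blocks
        block = self.column[self.position:self.position + self.block_size]
        self.position += len(block)
        return block

class Filter:
    """Internal operator: pulls a block from its child and filters it."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def get_next(self):
        block = self.child.get_next()
        if block is None:
            return None
        return [value for value in block if self.predicate(value)]

def drain(operator):
    """Pull blocks from the root until the plan is exhausted."""
    blocks = []
    while (block := operator.get_next()) is not None:
        blocks.append(block)
    return blocks

plan = Filter(Scan([5, 12, 7, 30, 18], block_size=2), lambda value: value > 10)
output = drain(plan)
```

Only the root is driven by the caller; every internal operator asks its child for the next block when it needs one.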


Column Optimizer (discussion)

Cost-based estimation for query plan construction
Chooses projections on which to run the query
Cost model includes compression types
When to perform “mask” operator
Build in snowflake schemas
– Which are simple to optimize without exhaustive search
Looking at extensions


Online Updates Are Necessary

Transactional updates are necessary even in read-mostly environment
Online updates for error corrections
Real-time data warehouses
– Reduce the delay between OLTP system and warehouse towards zero


Solution – a Hybrid Store

Write-optimized column store
Tuple mover (batch rebuilder)
Read-optimized column store (what we have been talking about so far)


Write Store

Column store
Horizontally partitioned as the read store
– 1:1 mapping between RS segments and WS segments
Storage keys are explicitly stored
– B-tree: sort key → storage key
No compression (the data size is small)
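A write-store segment with explicit storage keys can be sketched as below. This is an illustration only: a sorted Python list stands in for the B-tree, and the class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort

class WriteStoreSegment:
    """Uncompressed column store in which every value carries an explicit storage key."""

    def __init__(self, columns, sort_column):
        self.columns = {name: {} for name in columns}   # column -> {storage key: value}
        self.sort_column = sort_column
        self.sort_index = []                            # sorted (sort value, storage key) pairs
        self.next_storage_key = 1

    def insert(self, record):
        storage_key = self.next_storage_key
        self.next_storage_key += 1
        for name, values in self.columns.items():
            values[storage_key] = record[name]
        # Stand-in for the B-tree that maps sort key to storage key.
        insort(self.sort_index, (record[self.sort_column], storage_key))
        return storage_key

    def storage_keys_in_range(self, low, high):
        """Storage keys of records whose sort value lies in [low, high], in sort order."""
        start = bisect_left(self.sort_index, (low,))
        stop = bisect_right(self.sort_index, (high, float("inf")))
        return [storage_key for _, storage_key in self.sort_index[start:stop]]

ws = WriteStoreSegment(["name", "age"], sort_column="age")
for name, age in [("Ann", 30), ("Jill", 24), ("Bill", 27)]:
    ws.insert({"name": name, "age": age})

keys = ws.storage_keys_in_range(24, 27)
names = [ws.columns["name"][key] for key in keys]
```

Records arrive in insertion order, so the storage key cannot be derived from position as in the read store; the index recovers sort order without rewriting the columns.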


Handling Updates

Optimize read-only query: do not hold locks
– Snapshot isolation
– The query is run on a snapshot of the data
– Ensure transactions related to this snapshot have already committed
Each WS site: insertion vector (with timestamps), deletion vector (updates become insertions and deletions)
Maintain a high water mark and a low water mark of WS sites:
– HWM: all transactions before HWM have committed
– LWM: no records in read store are inserted after LWM
Queries can specify a time before HWM
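The visibility test that the insertion and deletion vectors make possible can be sketched as follows. The record layout, the use of `None` for "never deleted", and the decision to allow a snapshot time equal to the HWM are assumptions of this sketch:

```python
def visible(inserted_at, deleted_at, effective_time):
    """True if a record belongs to the snapshot taken at `effective_time`."""
    if inserted_at > effective_time:
        return False                          # inserted after the snapshot
    return deleted_at is None or deleted_at > effective_time

def snapshot(records, effective_time, hwm):
    """Read the values visible at `effective_time` without taking any locks."""
    if effective_time > hwm:
        # Transactions after the HWM may still be running.
        raise ValueError("snapshot time must not be later than the high water mark")
    return [
        record["value"]
        for record in records
        if visible(record["inserted_at"], record["deleted_at"], effective_time)
    ]

records = [
    {"value": "a", "inserted_at": 1, "deleted_at": None},
    {"value": "b", "inserted_at": 2, "deleted_at": 4},      # old version of an update
    {"value": "b2", "inserted_at": 4, "deleted_at": None},  # new version of the update
    {"value": "c", "inserted_at": 6, "deleted_at": None},   # not yet covered by the HWM
]

early = snapshot(records, effective_time=3, hwm=5)
late = snapshot(records, effective_time=5, hwm=5)
```

The update of `b` shows up as a deletion plus an insertion, so a query at time 3 sees the old version and a query at time 5 sees the new one.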


HWM and epochs

TA: time authority updates the coarse timer (epochs)


Transactions

Undo from a log (that does not need to be persistent)
Redo by rebuild from elsewhere in the network


Tuple-Mover

Read RS segment
Combine WS segment into a new version of the RS segment, do not update in place
Record last move time for this segment in WS
– Tlast_move ≤ LWM
Time authority will periodically send out a new LWM epoch number
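One tuple-mover pass over a segment pair can be sketched as below. The record layout and function name are hypothetical; the sketch assumes that records inserted at or before the LWM are eligible to move and that records deleted at or before the LWM can be dropped:

```python
def move_tuples(rs_segment, ws_segment, lwm):
    """Build a new RS segment from the old one plus eligible WS records.

    Nothing is updated in place: the inputs are left untouched and new lists are returned.
    """
    def deleted_by_lwm(record):
        return record["deleted_at"] is not None and record["deleted_at"] <= lwm

    movable = [r for r in ws_segment if r["inserted_at"] <= lwm]
    remaining_ws = [r for r in ws_segment if r["inserted_at"] > lwm]

    # Drop anything already deleted as of the LWM, then restore sort-key order.
    survivors = [r for r in rs_segment + movable if not deleted_by_lwm(r)]
    new_rs = sorted(survivors, key=lambda r: r["key"])

    # Latest insertion time now held in the read store; never exceeds the LWM.
    t_last_move = max((r["inserted_at"] for r in new_rs), default=None)
    return new_rs, remaining_ws, t_last_move

rs = [
    {"key": 10, "inserted_at": 1, "deleted_at": None},
    {"key": 30, "inserted_at": 1, "deleted_at": 3},
]
ws = [
    {"key": 20, "inserted_at": 4, "deleted_at": None},
    {"key": 40, "inserted_at": 7, "deleted_at": None},
    {"key": 5, "inserted_at": 2, "deleted_at": 4},
]
new_rs, remaining_ws, t_last_move = move_tuples(rs, ws, lwm=5)
```

With an LWM of 5, keys 10 and 20 end up in the new RS segment, key 40 stays in the write store, and the recorded last move time is 4.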


Current Performance

Varying storage:

100X popular row store in 40% of the space

10X popular column store in 70% of the space

7X popular row store in 1/6th of the space

Code available with BSD license


Summary

Column store is optimized for read queries
Cluster parallelism
Interesting data organization
Handling write queries