C-Store: A Column-oriented DBMS

New England Database Group

(Stonebraker, et al. Brandeis/Brown/MIT/UMass-Boston)

Extended for Big Data Reading Group Presentation by Shimin Chen

M.I.T

Relational Database

[Figure: a relation drawn as a grid, with Record 1, Record 2, Record 3 as rows and Attribute1, Attribute2, Attribute3 as columns]

e.g.
Customer(cid, name, address, discount)
Product(pid, name, manufacturer, price, quantity)
Order(oid, cid, pid, quantity)

Current DBMS -- “Row Store”

[Figure: whole records stored one after another: Record 2, Record 4, Record 1, Record 3]

E.g. DB2, Oracle, Sybase, SQLServer, …

Row Stores are Write Optimized

(use white board)

Store fields in one record contiguously on disk
Use small (e.g. 4K) disk blocks
Use B-tree indexing
Align fields on byte or word boundaries
Assume shifting data values is costly
Transactions: write-ahead logging

Row Stores are Write Optimized

Can insert and delete a record in one physical write
Good for on-line transaction processing (OLTP)
But not for read-mostly applications
– Data warehouses
– Customer Relationship Management (CRM)
– Electronic library card catalogs

Column Stores

At 100K Feet…

Read-optimized:
– Periodically a bulk load of new data
– Long period of ad-hoc queries
Benefit:
– Ad-hoc queries read 2 columns out of 20
– Column store reads 10% of what a row store reads
Previous pioneering work:
– Sybase IQ (early ’90s)
– Monet (see CIDR ’05 for the most recent description)
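
The 2-out-of-20 benefit above can be checked with a small Python sketch. The table shape (1000 rows, 20 equal-width uncompressed columns) and all names are illustrative assumptions, not C-Store code; the sketch counts values touched by a full scan rather than real disk I/O.

```python
# Hypothetical table: 1000 rows, 20 columns of equal width.
NUM_ROWS, NUM_COLS = 1000, 20

# Row store: each record keeps all of its fields together.
rows = [tuple(r * NUM_COLS + c for c in range(NUM_COLS)) for r in range(NUM_ROWS)]

# Column store: each column is kept as its own sequence.
columns = {c: [row[c] for row in rows] for c in range(NUM_COLS)}

# An ad-hoc query that needs only 2 of the 20 columns (arbitrary choice).
needed = [3, 7]

# A row store scan touches every field of every record.
row_store_values_read = sum(len(row) for row in rows)

# A column store scan touches only the needed columns.
column_store_values_read = sum(len(columns[c]) for c in needed)

ratio = column_store_values_read / row_store_values_read
print(f"column store reads {ratio:.0%} of what the row store reads")
```

With compression, the column store would likely read even less, but that is outside this sketch.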

C-Store Technical Ideas

Data storage: Only materialized views (perhaps many)
Compress the columns to save space
No alignment
Big disk blocks
Innovative redundancy
Optimize for grid (cluster) computing
Focus on sorting, not indexing
Automatic physical DBMS design
Column optimizer and executor

How to Evaluate This Paper…

None of the ideas in isolation merit publication
Judge the complete system by its (hopefully intelligent) choice of
– Small collection of inter-related powerful ideas
– That together put performance in a new sandbox

Outline

Overview
Read-optimized column store
Query execution and optimization
Handling transactional updates
Performance
Summary

Data Model

Projection (materialized view): some number of columns from a fact table
– plus columns in a dimension table, with a 1-n join between Fact and Dimension table
– (conceptually) no duplicate elimination
– Stored in order of a storage key(s)
Note: base table is not stored anywhere

Example

Logical base tables:
– EMP (name, age, salary, dept)
– DEPT (dname, floor)

Example projections:
– EMP1 (name, age | age)
– EMP2 (dept, age, DEPT.floor | DEPT.floor)
– EMP3 (name, salary | salary)
– DEPT1 (dname, floor | floor)
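
The four example projections can be materialized from toy data as a rough Python sketch. The sample rows and the helper `make_projection` are made up for illustration; the columns before `|` become the stored columns and the column after `|` is used as the sort order.

```python
# Hypothetical sample data for the logical base tables.
EMP = [
    {"name": "Bob", "age": 25, "salary": 10000, "dept": "Math"},
    {"name": "Bill", "age": 27, "salary": 50000, "dept": "EECS"},
    {"name": "Jill", "age": 24, "salary": 80000, "dept": "Biology"},
]
DEPT = [
    {"dname": "Math", "floor": 2},
    {"dname": "EECS", "floor": 1},
    {"dname": "Biology", "floor": 3},
]

def make_projection(rows, columns, sort_key):
    """Keep only `columns`, each as its own list, all in `sort_key` order."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    return {col: [r[col] for r in ordered] for col in columns}

# EMP2 needs DEPT.floor, so follow the EMP.dept -> DEPT.dname join first.
floor_of = {d["dname"]: d["floor"] for d in DEPT}
emp_with_floor = [{**e, "DEPT.floor": floor_of[e["dept"]]} for e in EMP]

EMP1 = make_projection(EMP, ["name", "age"], "age")
EMP2 = make_projection(emp_with_floor, ["dept", "age", "DEPT.floor"], "DEPT.floor")
EMP3 = make_projection(EMP, ["name", "salary"], "salary")
DEPT1 = make_projection(DEPT, ["dname", "floor"], "floor")

print(EMP1)
```

Note that the same logical row ends up at different positions in EMP1, EMP2 and EMP3, which is why reconstructing a base table needs extra information.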

Optimize for Grid Computing

I.e. shared-nothing
Horizontal partitioning and intra-query parallelism as in Gamma
Paper talks about “Grid computers … may have tens to hundreds of nodes …”

Projection Detail #1

Each projection is horizontally partitioned into “segments”
– Segment identifier
– Unit of distribution and parallelism
– Value-based partitioning, key range of sort key(s)
Column-wise store inside segment
Storage key: ordinal record number in segment
– calculated as needed
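
A minimal Python sketch of value-based partitioning with computed storage keys. The boundary values, the sample ages and the function name are assumptions for illustration, and storage keys are numbered from 1 here as a convention of the sketch.

```python
from bisect import bisect_right

def partition_into_segments(sorted_keys, boundaries):
    """Split an already sorted sort-key column into key-range segments.

    boundaries[i] is the exclusive upper bound of segment i;
    the last segment takes every key at or above the final boundary.
    """
    segments = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted_keys:
        segments[bisect_right(boundaries, key)].append(key)
    return segments

ages = [21, 24, 25, 27, 33, 41, 58]            # sort-key column, already sorted
segments = partition_into_segments(ages, boundaries=[25, 40])

# The storage key is never stored: it is just the ordinal position in the segment.
addresses = [
    (segment_id, storage_key, key)
    for segment_id, segment in enumerate(segments)
    for storage_key, key in enumerate(segment, start=1)
]
print(segments)
print(addresses)
```

Each segment could then be placed on a different node and scanned in parallel.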

Projection Detail #2

Different encoding schemes for different columns
Depends on ordering and value distribution
– Self-order, few distinct values: (value, position, num_entries)
– Foreign-order, few distinct values: (value, bitmap), bitmap is run-length encoded
– Self-order, many distinct values: block-oriented, delta value encoding
– Foreign-order, many distinct values: gzip
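
The first three encodings can be sketched in a few lines of Python. The function names are hypothetical, positions are 1-based as a convention of the sketch, and the bitmaps are left uncompressed here (the slide says they are further run-length encoded).

```python
from itertools import groupby

def encode_self_order_few_values(column):
    """Sorted column with few distinct values -> (value, position, num_entries)."""
    triples, position = [], 1
    for value, run in groupby(column):
        num_entries = len(list(run))
        triples.append((value, position, num_entries))
        position += num_entries
    return triples

def encode_foreign_order_few_values(column):
    """Unsorted column with few distinct values -> one bitmap per value."""
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[i] = 1
    return bitmaps

def encode_delta(block):
    """Sorted block with many distinct values -> first value, then differences."""
    return [block[0]] + [b - a for a, b in zip(block, block[1:])]

print(encode_self_order_few_values([4, 4, 4, 7, 7, 9]))
print(encode_foreign_order_few_values([0, 0, 1, 1, 2, 1, 0, 2, 1]))
print(encode_delta([1, 4, 7, 7, 8, 12]))
```

The common theme is that sorted columns compress into runs or small deltas, while unsorted ones need a representation that records positions.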

Different Indexing

                                 Few values              Many values

Sequential (self-order)          RLE encoded;            Delta encoded;
                                 conventional B-tree     conventional B-tree
                                 at the value level      at the block level

Non sequential (foreign-order)   Bitmap per value        Conventional gzip;
                                                         conventional B-tree
                                                         at the block level

Big Disk Blocks

Tunable
Big (minimum size is 64K)

Reconstructing Base Table from Projections

Join Index:
– Projection T1 has M segments, projection T2 has N segments
– T1 and T2 are on same base table
– Join index consists of M tables, one per T1 segment
– Entry: segment ID and storage key of corresponding record in T2
In general, needs multiple join indices for reconstructing a base table
Join index is costly to store and maintain
– Each column expected to be in multiple projections
– Reduce # of join indices
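
A rough Python sketch of a join index from T1 to T2. The sample rows, the `rid` field (a made-up row identity used only to find corresponding records) and the fixed-size segmenting are simplifying assumptions; storage keys are 1-based in this sketch.

```python
# Hypothetical base-table rows; "rid" exists only so the sketch can match records.
BASE = [
    {"rid": 0, "name": "Bob", "age": 25, "salary": 10000},
    {"rid": 1, "name": "Bill", "age": 27, "salary": 50000},
    {"rid": 2, "name": "Jill", "age": 24, "salary": 80000},
]

def segments_sorted_by(rows, sort_key, segment_size):
    """Sort by `sort_key`, then cut into segments (fixed size for simplicity)."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    return [ordered[i:i + segment_size] for i in range(0, len(ordered), segment_size)]

T1 = segments_sorted_by(BASE, "age", segment_size=2)      # M = 2 segments
T2 = segments_sorted_by(BASE, "salary", segment_size=3)   # N = 1 segment

def build_join_index(t1_segments, t2_segments):
    """One table per T1 segment; each entry is (T2 segment ID, T2 storage key)."""
    t2_location = {
        row["rid"]: (segment_id, storage_key)
        for segment_id, segment in enumerate(t2_segments)
        for storage_key, row in enumerate(segment, start=1)
    }
    return [[t2_location[row["rid"]] for row in segment] for segment in t1_segments]

JOIN_INDEX = build_join_index(T1, T2)

def salaries_in_t1_order(t1_segments, t2_segments, join_index):
    """Follow the join index to fetch each T1 record's salary from T2."""
    result = []
    for segment, table in zip(t1_segments, join_index):
        for row, (segment_id, storage_key) in zip(segment, table):
            result.append((row["name"], t2_segments[segment_id][storage_key - 1]["salary"]))
    return result

print(JOIN_INDEX)
print(salaries_in_t1_order(T1, T2, JOIN_INDEX))
```

Every insert into either projection would shift storage keys and invalidate entries, which hints at why the index is costly to maintain.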

Innovative Redundancy

Hardly any warehouse is recovered by redo from log
– Takes too long!
Store enough projections to ensure K-safety
– Column can be in K different projections
Rebuild dead objects from elsewhere in the network
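
One way to read K-safety is that every column must still be available after any K sites fail. The Python sketch below checks that for a hypothetical placement of projections on sites; it is a simplification that looks only at column coverage and ignores segments and join indices.

```python
from itertools import combinations

def is_k_safe(columns_by_site, all_columns, k):
    """True if every column survives the loss of any k sites.

    columns_by_site: site name -> set of columns stored there (assumed layout).
    """
    for failed in combinations(list(columns_by_site), k):
        surviving = set().union(
            *(cols for site, cols in columns_by_site.items() if site not in failed)
        )
        if not all_columns <= surviving:
            return False
    return True

# Hypothetical layout: each column appears in projections on two different sites.
placement = {
    "site1": {"name", "age"},
    "site2": {"name", "salary"},
    "site3": {"age", "salary"},
}
schema = {"name", "age", "salary"}

print(is_k_safe(placement, schema, k=1))
print(is_k_safe(placement, schema, k=2))
```

Because the redundant copies can be sorted differently, they serve queries as well as recovery.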

Automatic Physical DBMS Design

Accept a “training set” of queries and a space budget
Choose the projections and join indices auto-magically
Re-optimize periodically based on a log of the interactions

Outline

Overview
Read-optimized column store
Query execution and optimization
Handling transactional updates
Performance
Summary

Operators

Decompress
Select: generate bitstring
Mask: bitstring + projection → selected rows
Project: choose a subset of columns
Concat: combine multiple projections that are sorted in the same order
Sort
Permute: according to a join index
Join
Aggregation operators
Bitstring operators
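
A small Python sketch of how Select, a bitstring AND, and Mask might fit together. The function names, the sample projection and the predicates are illustrative assumptions, and plain lists of 0/1 stand in for real bitstrings.

```python
def select(column, predicate):
    """Select: evaluate a predicate over one column and return a bitstring."""
    return [1 if predicate(value) else 0 for value in column]

def bit_and(left, right):
    """Bitstring operator: combine two selections position by position."""
    return [a & b for a, b in zip(left, right)]

def mask(bitstring, projection):
    """Mask: keep only the positions of each column whose bit is set."""
    return {
        name: [value for value, bit in zip(values, bitstring) if bit]
        for name, values in projection.items()
    }

# Hypothetical projection, stored column by column in one common order.
projection = {
    "name": ["Jill", "Bob", "Bill"],
    "age": [24, 25, 27],
    "salary": [80000, 10000, 50000],
}

older = select(projection["age"], lambda age: age > 24)
well_paid = select(projection["salary"], lambda salary: salary > 20000)
both = bit_and(older, well_paid)
print(mask(both, projection))
```

Predicates touch only the columns they need; the other columns are read once, at Mask time.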

Execution

Query plan: a tree of operators (data flow)
– Leaf: accessing the data storage
– Internal: calls “get_next”
Operators return 64KB blocks
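
The pull-based plan above can be sketched in Python as follows. The class names are hypothetical, and a block is a short list of values here instead of a 64KB buffer.

```python
class Scan:
    """Leaf operator: reads a stored column one block at a time."""

    def __init__(self, column, block_size):
        self.column = column
        self.block_size = block_size
        self.offset = 0

    def get_next(self):
        if self.offset >= len(self.column):
            return None  # no more data
        block = self.column[self.offset:self.offset + self.block_size]
        self.offset += self.block_size
        return block

class Filter:
    """Internal operator: pulls a block from its child and filters it."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def get_next(self):
        block = self.child.get_next()
        if block is None:
            return None
        return [value for value in block if self.predicate(value)]

def drain(root):
    """Drive the plan by calling get_next on the root until it is exhausted."""
    output = []
    for block in iter(root.get_next, None):
        output.extend(block)
    return output

plan = Filter(Scan(list(range(10)), block_size=4), lambda v: v % 2 == 0)
print(drain(plan))
```

Passing whole blocks instead of single tuples amortizes the cost of each get_next call.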

Column Optimizer (discussion)

Cost-based estimation for query plan construction
Chooses projections on which to run the query
Cost model includes compression types
When to perform “mask” operator
Build in snowflake schemas
– Which are simple to optimize without exhaustive search
Looking at extensions

Outline

Overview
Read-optimized column store
Query execution and optimization
Handling transactional updates
Performance
Summary

Online Updates Are Necessary

Transactional updates are necessary even in a read-mostly environment
Online updates for error corrections
Real-time data warehouses
– Reduce the delay between OLTP system and warehouse towards zero

Solution – a Hybrid Store

[Figure: three components connected in sequence]
Write-optimized column store
→ Tuple mover (batch rebuilder)
→ Read-optimized column store (what we have been talking about so far)

Write Store

Column store
Horizontally partitioned as the read store
– 1:1 mapping between RS segments and WS segments
Storage keys are explicitly stored
– B-tree: sort key → storage key
No compression (the data size is small)

Handling Updates

Optimize read-only query: do not hold locks
– Snapshot isolation
– The query is run on a snapshot of the data
– Ensure transactions related to this snapshot have already committed
Each WS site: insertion vector (with timestamps), deletion vector (updates become insertions and deletions)
Maintain a high water mark and a low water mark of WS sites:
– HWM: all transactions before HWM have committed
– LWM: no records in read store are inserted after LWM
Queries can specify a time before HWM
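
A minimal Python sketch of the visibility check a read-only query might perform at a chosen time. The record layout, the use of integer epochs, and the "at or before" comparisons are assumptions of this sketch.

```python
def visible(insert_epoch, delete_epoch, query_epoch):
    """A record is visible if it was inserted by query_epoch and not yet deleted.

    delete_epoch is None for records that have not been deleted.
    """
    inserted = insert_epoch <= query_epoch
    deleted = delete_epoch is not None and delete_epoch <= query_epoch
    return inserted and not deleted

def snapshot(records, query_epoch, lwm, hwm):
    """Run a lock-free read at query_epoch, which must lie between LWM and HWM."""
    if not lwm <= query_epoch <= hwm:
        raise ValueError("query time must be between LWM and HWM")
    return [
        r["value"]
        for r in records
        if visible(r["inserted"], r["deleted"], query_epoch)
    ]

# Hypothetical records with insertion and deletion epochs.
records = [
    {"value": "a", "inserted": 1, "deleted": None},
    {"value": "b", "inserted": 2, "deleted": 4},   # an update = delete + insert
    {"value": "c", "inserted": 5, "deleted": None},
]

print(snapshot(records, query_epoch=3, lwm=1, hwm=4))
print(snapshot(records, query_epoch=4, lwm=1, hwm=4))
```

Since everything at or before HWM has committed, the reader never has to wait on a writer's lock.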

HWM and epochs

TA: time authority updates the coarse timer (epochs)

Transactions

Undo from a log (that does not need to be persistent)
Redo by rebuild from elsewhere in the network

Tuple-Mover

Read RS segment
Combine WS segment into a new version of the RS segment, do not update in place
Record last move time for this segment in WS: Tlast_move ≤ LWM
Time authority will periodically send out a new LWM epoch number
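
The merge-out step can be sketched in Python as below. The tuple layout `(key, inserted, deleted)`, the sample values, and the rule that only records inserted at or before LWM are moved are assumptions of this sketch; handling of deletions recorded against records already in RS is left out.

```python
def move_tuples(rs_segment, ws_segment, lwm):
    """Build a new RS segment from the old one plus movable WS records.

    rs_segment: sorted list of sort-key values already in the read store.
    ws_segment: list of (key, inserted_epoch, deleted_epoch or None).
    Returns (new_rs_segment, remaining_ws_segment); the inputs are not modified.
    """
    movable = [r for r in ws_segment if r[1] <= lwm]
    remaining = [r for r in ws_segment if r[1] > lwm]
    # Records already deleted at or before LWM are dropped instead of moved.
    survivors = [key for key, _, deleted in movable if deleted is None or deleted > lwm]
    new_rs_segment = sorted(rs_segment + survivors)
    return new_rs_segment, remaining

rs = [21, 30, 45]
ws = [(25, 2, None), (40, 3, 4), (33, 6, None), (50, 4, 7)]

new_rs, new_ws = move_tuples(rs, ws, lwm=5)
print(new_rs)
print(new_ws)
```

Queries keep reading the old RS segment until the new version is complete, which is what makes "do not update in place" safe.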

Current Performance

Varying storage:
– 100X popular row store in 40% of the space
– 10X popular column store in 70% of the space
– 7X popular row store in 1/6th of the space
Code available with BSD license

Summary

Column store is optimized for read queries
Cluster parallelism
Interesting data organization
Handling write queries