C-Store: A Column-oriented DBMS
New England Database Group
(Stonebraker, et al. Brandeis/Brown/MIT/UMass-Boston)
Extended for Big Data Reading Group Presentation by Shimin Chen
M.I.T
Relational Database
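Figure: a relation drawn as a table, with one row per record (Record 1, Record 2, Record 3) and one column per attribute (Attribute1, Attribute2, Attribute3)
E.g.
– Customer(cid, name, address, discount)
– Product(pid, name, manufacturer, price, quantity)
– Order(oid, cid, pid, quantity)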
Current DBMS -- “Row Store”
Figure: whole records laid out on disk (Record 2, Record 4, Record 1, Record 3)
E.g. DB2, Oracle, Sybase, SQLServer, …
Row Stores are Write Optimized
(use white board)
Store fields in one record contiguously on disk
Use small (e.g. 4K) disk blocks
Use B-tree indexing
Align fields on byte or word boundaries
Assume shifting data values is costly
Transactions: write-ahead logging
Row Stores are Write Optimized
Can insert and delete a record in one physical write
Good for on-line transaction processing (OLTP)
But not for read-mostly applications
– Data warehouses
– Customer Relationship Management (CRM)
– Electronic library card catalogs
– …
Column Stores
At 100K Feet…
Read-optimized:
– Periodically a bulk load of new data
– Long period of ad-hoc queries
Benefit:
– Ad-hoc queries read 2 columns out of 20
– Column store reads 10% of what a row store reads
Previous pioneering work:
– Sybase IQ (early ’90s)
– Monet (see CIDR ’05 for the most recent description)
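The 10% figure is simple arithmetic, and a small sketch makes the assumption behind it explicit: every column has the same width and the query scans all rows. The table size, the 4-byte width, and the function name `bytes_read` are illustrative assumptions, not values from the paper.

```python
def bytes_read(num_rows, column_widths, needed):
    """Return (row_store_bytes, column_store_bytes) for one full scan.

    A row store must read every column of every row; a column store
    reads only the columns the query touches.
    """
    row_store = num_rows * sum(column_widths)
    column_store = num_rows * sum(column_widths[i] for i in needed)
    return row_store, column_store


# Hypothetical table: 20 columns of 4 bytes each, 1,000,000 rows.
widths = [4] * 20
row_bytes, col_bytes = bytes_read(1_000_000, widths, needed=[0, 1])
fraction = col_bytes / row_bytes
print(f"column store reads {fraction:.0%} of what the row store reads")
```

With unequal column widths the ratio changes accordingly, and compression (discussed later) would lower the column store's share further.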
C-Store Technical Ideas
Data storage: only materialized views (perhaps many)
Compress the columns to save space
No alignment
Big disk blocks
Innovative redundancy
Optimize for grid (cluster) computing
Focus on sorting, not indexing
Automatic physical DBMS design
Column optimizer and executor
How to Evaluate This Paper…
None of the ideas in isolation merit publication
Judge the complete system by its (hopefully intelligent) choice of
– Small collection of inter-related powerful ideas
– That together put performance in a new sandbox
Outline
Overview
Read-optimized column store
Query execution and optimization
Handling transactional updates
Performance
Summary
Data Model
Projection (materialized view): some number of columns from a fact table
– Plus columns in a dimension table, with a 1-n join between fact and dimension table
– (Conceptually) no duplicate elimination
Stored in order of its sort key(s)
Note: base table is not stored anywhere
Example
Logical base tables:
– EMP (name, age, salary, dept)
– DEPT (dname, floor)
Example projections:
– EMP1 (name, age | age)
– EMP2 (dept, age, DEPT.floor | DEPT.floor)
– EMP3 (name, salary | salary)
– DEPT1 (dname, floor | floor)
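In the notation above, the columns before the bar are the stored columns and the column after it is the sort key. A minimal sketch of how EMP1 and EMP2 could be derived from the logical tables follows; the sample rows, the helper `project`, and the dictionary-of-lists layout are assumptions made for illustration, not C-Store's actual storage format.

```python
# Hypothetical sample data for the logical base tables.
EMP = [
    {"name": "Bob", "age": 25, "salary": 10_000, "dept": "Math"},
    {"name": "Bill", "age": 27, "salary": 50_000, "dept": "EECS"},
    {"name": "Jill", "age": 24, "salary": 80_000, "dept": "Biology"},
]
DEPT = [
    {"dname": "Math", "floor": 3},
    {"dname": "EECS", "floor": 2},
    {"dname": "Biology", "floor": 1},
]


def project(rows, columns, sort_key):
    """Build a projection: one list per column, all in sort_key order.

    Duplicates are kept, matching "no duplicate elimination".
    """
    ordered = sorted(rows, key=lambda r: r[sort_key])
    return {c: [r[c] for r in ordered] for c in columns}


# EMP1 (name, age | age)
emp1 = project(EMP, ["name", "age"], "age")

# EMP2 (dept, age, DEPT.floor | DEPT.floor): follow the 1-n join to DEPT.
floor_of = {d["dname"]: d["floor"] for d in DEPT}
joined = [{**e, "DEPT.floor": floor_of[e["dept"]]} for e in EMP]
emp2 = project(joined, ["dept", "age", "DEPT.floor"], "DEPT.floor")
```

Note that `emp1` and `emp2` hold the same employees in different orders, which is why a join index is needed to stitch them back together.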
Optimize for Grid Computing
I.e. shared-nothing
Horizontal partitioning and intra-query parallelism as in Gamma
Paper talks about “Grid computers … may have tens to hundreds of nodes …”
Projection Detail #1
Each projection is horizontally partitioned into “segments”
– Segment identifier
– Unit of distribution and parallelism
– Value-based partitioning, key range of sort key(s)
Column-wise store inside segment
Storage key: ordinal record number in segment
– calculated as needed
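Value-based partitioning and implicit storage keys can be sketched in a few lines. The boundary values, the 0-based storage keys, and the function names are assumptions chosen for the example; the slide only states that segments cover key ranges and that the storage key is the record's ordinal position.

```python
from bisect import bisect_right


def partition_by_key_range(sorted_keys, boundaries):
    """Split an already-sorted key column into segments.

    boundaries[i] is the exclusive upper bound of segment i; the last
    segment takes every key at or above the final boundary.
    """
    segments = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted_keys:
        segments[bisect_right(boundaries, key)].append(key)
    return segments


def lookup(segments, segment_id, storage_key):
    """The storage key is just the position in the segment (0-based here),
    so it is computed from the layout rather than stored."""
    return segments[segment_id][storage_key]


ages = [24, 25, 27, 31, 40, 52]
segments = partition_by_key_range(ages, boundaries=[30, 50])
```

Each segment could then be placed on a different node, which is what makes it the unit of distribution and parallelism.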
Projection Detail #2
Different encoding schemes for different columns
Depends on ordering and value distribution
– Self-order, few distinct values: (value, position, num_entries)
– Foreign-order, few distinct values: (value, bitmap), bitmap is run-length encoded
– Self-order, many distinct values: block-oriented, delta value encoding
– Foreign-order, many distinct values: gzip
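The first scheme, for a column sorted on its own values with few distinct values, can be sketched as run-length encoding into (value, position, num_entries) triples. Positions are 0-based here, which is an assumption of this sketch, as are the function names.

```python
def rle_encode(column):
    """Encode a sorted column as (value, first_position, num_entries) triples."""
    triples = []
    for pos, value in enumerate(column):
        if triples and triples[-1][0] == value:
            v, start, n = triples[-1]
            triples[-1] = (v, start, n + 1)  # extend the current run
        else:
            triples.append((value, pos, 1))  # start a new run
    return triples


def rle_decode(triples):
    """Expand the triples back into the original column."""
    out = []
    for value, _start, n in triples:
        out.extend([value] * n)
    return out


ages = [24, 24, 24, 25, 27, 27]
encoded = rle_encode(ages)
```

Because the column is sorted, each distinct value produces exactly one triple, so the encoded size depends on the number of distinct values rather than the number of rows.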
Different Indexing
Sequential (self-order), few values: RLE encoded; conventional B-tree at the value level
Sequential (self-order), many values: delta encoded; conventional B-tree at the block level
Non-sequential (foreign-order), few values: bitmap per value
Non-sequential (foreign-order), many values: conventional Gzip; conventional B-tree at the block level
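Two of the remaining encodings are easy to illustrate: one bitmap per distinct value, and delta encoding within a block. This is a simplified sketch with assumed function names; it stores bitmaps as plain lists and leaves out the run-length compression of the bitmaps and the B-tree layers mentioned above.

```python
def bitmap_encode(column):
    """Map each distinct value to a bitstring with 1s at its positions."""
    bitmaps = {}
    for pos, value in enumerate(column):
        bits = bitmaps.setdefault(value, [0] * len(column))
        bits[pos] = 1
    return bitmaps


def delta_encode(block):
    """Keep the first value of a non-empty block, then each difference
    from the previous value."""
    return [block[0]] + [b - a for a, b in zip(block, block[1:])]


def delta_decode(deltas):
    """Rebuild the block by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out


depts = bitmap_encode(["Math", "EECS", "Math"])
deltas = delta_encode([1, 4, 7, 7, 8, 12])
```

Delta encoding pays off on sorted columns because consecutive differences are small and need fewer bits than the raw values.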
Big Disk Blocks
Tunable
Big (minimum size is 64K)
Reconstructing Base Table from Projections
Join Index:
– Projection T1 has M segments, projection T2 has N segments
– T1 and T2 are on same base table
– Join index consists of M tables, one per T1 segment
– Entry: segment ID and storage key of corresponding record in T2
In general, needs multiple join indices for reconstructing a base table
Join index is costly to store and maintain
– Each column expected to be in multiple projections
– Reduce # of join indices
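A join index from T1 to T2 can be sketched as one table per T1 segment whose i-th entry says where the matching record lives in T2. To build it, this sketch tags each record with a logical row id; that id is purely a device for the example (the base table is not stored), and the function name and 0-based numbering are likewise assumptions.

```python
def build_join_index(t1_segments, t2_segments):
    """Build a join index from projection T1 to projection T2.

    Each t*_segments argument is a list of segments; each segment lists
    logical row ids in that projection's sort order. The result has one
    table per T1 segment, and entry i of a table holds the
    (T2 segment id, T2 storage key) of the record at T1 storage key i.
    """
    where_in_t2 = {
        row_id: (sid, skey)
        for sid, segment in enumerate(t2_segments)
        for skey, row_id in enumerate(segment)
    }
    return [[where_in_t2[row_id] for row_id in segment] for segment in t1_segments]


# Hypothetical layout: T1 has M = 2 segments, T2 has N = 2 segments.
t1 = [[2, 0], [1]]
t2 = [[1], [0, 2]]
join_index = build_join_index(t1, t2)
```

Every insert or move in either projection invalidates entries like these, which is why the slide calls join indices costly to maintain.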
Innovative Redundancy
Hardly any warehouse is recovered by redo from log
– Takes too long!
Store enough projections to ensure K-safety
– Column can be in K different projections
Rebuild dead objects from elsewhere in the network
Automatic Physical DBMS Design
Accept a “training set” of queries and a space budget
Choose the projections and join indices auto-magically
Re-optimize periodically based on a log of the interactions
Outline
Overview
Read-optimized column store
Query execution and optimization
Handling transactional updates
Performance
Summary
Operators
Decompress
Select: generate bitstring
Mask: bitstring + projection → selected rows
Project: choose a subset of columns
Concat: combine multiple projections that are sorted in the same order
Sort
Permute: according to a join index
Join
Aggregation operators
Bitstring operators
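The interplay of Select, the bitstring operators, and Mask can be shown on uncompressed columns. This is a minimal sketch with assumed function names and list-based bitstrings; the real operators can also work on compressed data.

```python
def select(column, predicate):
    """Select: produce a bitstring with 1 where the predicate holds."""
    return [1 if predicate(v) else 0 for v in column]


def bit_and(a, b):
    """Bitstring operator: combine two selections conjunctively."""
    return [x & y for x, y in zip(a, b)]


def mask(bitstring, projection):
    """Mask: keep only the rows of the projection whose bit is set."""
    return {
        name: [v for bit, v in zip(bitstring, values) if bit]
        for name, values in projection.items()
    }


emp1 = {"name": ["Jill", "Bob", "Bill"], "age": [24, 25, 27]}
older = select(emp1["age"], lambda a: a > 24)
not_bob = select(emp1["name"], lambda n: n != "Bob")
rows = mask(bit_and(older, not_bob), emp1)
```

Deferring Mask until the bitstrings have been combined means each predicate touches only its own column.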
Execution
Query plan: a tree of operators (data flow)
– Leaf: accessing the data storage
– Internal: calls “get_next”
Operators return 64KB blocks
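The pull-based execution model can be sketched with two operator classes. The class names `Scan` and `Filter` and the driver `run` are hypothetical, and blocks are sized in values rather than 64KB to keep the example readable.

```python
class Scan:
    """Leaf operator: reads a stored column one block at a time."""

    def __init__(self, column, block_size):
        self.column, self.block_size, self.pos = column, block_size, 0

    def get_next(self):
        if self.pos >= len(self.column):
            return None  # end of data
        block = self.column[self.pos:self.pos + self.block_size]
        self.pos += self.block_size
        return block


class Filter:
    """Internal operator: pulls blocks from its child via get_next."""

    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def get_next(self):
        block = self.child.get_next()
        if block is None:
            return None
        return [v for v in block if self.predicate(v)]


def run(root):
    """Drive the plan from the root until the data is exhausted."""
    out = []
    while (block := root.get_next()) is not None:
        out.extend(block)
    return out


plan = Filter(Scan([24, 25, 27, 31, 40], block_size=2), lambda a: a >= 27)
result = run(plan)
```

Passing whole blocks between operators amortizes the per-call overhead that a tuple-at-a-time `get_next` would incur.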
Column Optimizer (discussion)
Cost-based estimation for query plan construction
Chooses projections on which to run the query
Cost model includes compression types
When to perform “mask” operator
Build in snowflake schemas
– Which are simple to optimize without exhaustive search
Looking at extensions
Outline
Overview
Read-optimized column store
Query execution and optimization
Handling transactional updates
Performance
Summary
Online Updates Are Necessary
Transactional updates are necessary even in read-mostly environment
Online updates for error corrections
Real-time data warehouses
– Reduce the delay between OLTP system and warehouse towards zero
Solution – a Hybrid Store
Figure: a write-optimized column store and a read-optimized column store (what we have been talking about so far), connected by a tuple mover (batch rebuilder)
Write Store
Column store
Horizontally partitioned as the read store
– 1:1 mapping between RS segments and WS segments
Storage keys are explicitly stored
– B-tree: sort key → storage key
No compression (the data size is small)
Handling Updates
Optimize read-only query: do not hold locks
– Snapshot isolation
– The query is run on a snapshot of the data
– Ensure transactions related to this snapshot have already committed
Each WS site: insertion vector (with timestamps), deletion vector (updates become insertions and deletions)
Maintain a high water mark and a low water mark of WS sites:
– HWM: all transactions before HWM have committed
– LWM: no records in read store are inserted before LWM
Queries can specify a time before HWM
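Snapshot reads over the insertion and deletion vectors can be sketched as a visibility test per record. The function names, the use of `None` for "never deleted", and the rule that a query epoch may equal but not exceed the HWM are assumptions of this sketch.

```python
def visible_at(insert_epoch, delete_epoch, query_epoch):
    """A record is visible if it was inserted at or before the query epoch
    and not deleted at or before it (delete_epoch None = never deleted)."""
    inserted = insert_epoch <= query_epoch
    deleted = delete_epoch is not None and delete_epoch <= query_epoch
    return inserted and not deleted


def snapshot(records, insertion_vector, deletion_vector, query_epoch, hwm):
    """Return the records a lock-free read-only query sees at query_epoch."""
    if query_epoch > hwm:
        raise ValueError("transactions after the HWM may not have committed")
    return [
        r
        for r, ins, dele in zip(records, insertion_vector, deletion_vector)
        if visible_at(ins, dele, query_epoch)
    ]


# An update at epoch 3 is stored as a deletion of b_old plus an insertion of b_new.
records = ["a", "b_old", "b_new"]
inserted = [1, 1, 3]
deleted = [None, 3, None]
before_update = snapshot(records, inserted, deleted, query_epoch=2, hwm=4)
after_update = snapshot(records, inserted, deleted, query_epoch=3, hwm=4)
```

Because the query only compares epochs, it never needs to take a lock on the records it reads.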
HWM and epochs
TA: time authority updates the coarse timer (epochs)
Transactions
Undo from a log (that does not need to be persistent)
Redo by rebuild from elsewhere in the network
Tuple-Mover
Read RS segment
Combine WS segment into a new version of the RS segment, do not update in place
Record last move time for this segment in WS
– Tlast_move ≤ LWM
Time authority will periodically send out a new LWM epoch number
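The steps above can be sketched as a merge that produces a fresh read-store segment and leaves the old one untouched. This is a deliberately simplified sketch: it treats a segment as a list of sort keys, ignores deletions and the LWM filter, and the function name and return shape are assumptions.

```python
import heapq


def tuple_move(rs_segment, ws_segment, now_epoch):
    """Merge a WS segment into a new version of its RS segment.

    rs_segment is already sorted; ws_segment is small and unsorted.
    Neither input is modified, mirroring "do not update in place".
    Returns (new RS segment, emptied WS segment, last move time).
    """
    new_rs = list(heapq.merge(rs_segment, sorted(ws_segment)))
    return new_rs, [], now_epoch


rs = [24, 27, 40]
ws = [31, 25]
new_rs, new_ws, t_last_move = tuple_move(rs, ws, now_epoch=7)
```

Writing a new segment instead of patching the old one lets running queries keep reading the old version until the system switches over.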
Current Performance
Varying storage:
– 100X popular row store in 40% of the space
– 10X popular column store in 70% of the space
– 7X popular row store in 1/6th of the space
Code available with BSD license
Summary
Column store is optimized for read queries
Cluster parallelism
Interesting data organization
Handling write queries