Lecture 6
Indexing Part 2: Column Stores
Indexes Recap

Operation    Heap File   Bitmap      Hash File   B+Tree
Insert       O(1)        O(1)        O(1)        O(log_B n)
Delete       O(P)        O(1)        O(1)        O(log_B n)
RangeScan    O(P)        -- / O(P)   -- / O(P)   O(log_B n + R)
Lookup       O(P)        O(C)        O(1)        O(log_B n)

n: number of tuples; P: number of pages in file; B: branching factor of B-tree (keys/node); R: number of pages in range; C: cardinality (#) of unique values on key
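As a rough sanity check on the table above, here is a toy cost model in Python; all default sizes (page count, branching factor, cardinality) are made up for illustration:

```python
def lookup_cost(structure, n=1_000_000, pages=10_000, branching=100, cardinality=50):
    """Rough page-I/O cost of one key lookup (defaults are illustrative)."""
    if structure == "heap":
        return pages                    # O(P): scan every page of the file
    if structure == "bitmap":
        return cardinality              # O(C): one bitmap per distinct value
    if structure == "hash":
        return 1                        # O(1): hash straight to the bucket page
    if structure == "btree":
        height, capacity = 0, 1         # O(log_B n): root-to-leaf path length
        while capacity < n:
            capacity *= branching
            height += 1
        return height
    raise ValueError(structure)

for s in ("heap", "bitmap", "hash", "btree"):
    print(s, lookup_cost(s))
```

With these defaults the B+ tree lookup touches only 3 pages while the heap-file scan touches all 10,000, which is the whole argument for indexing.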
B+ Tree Indexes

• Balanced wide tree
• Fast value lookup and range scans
• Each node is a disk page (except the root)
• Leaves point to tuple pages
Secondary Indices Example
• Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
• Secondary indices have to be dense
Secondary index on balance field of account
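The bucket structure above can be sketched in a few lines of Python; the account records and ids here are hypothetical, standing in for the balance example on the slide:

```python
# Minimal sketch of a dense secondary index on account.balance.
# Each index entry points to a "bucket" holding pointers (here, record ids)
# to every record with that search-key value.
from collections import defaultdict

accounts = {                     # record_id -> (branch_name, balance)
    1: ("Downtown", 500),
    2: ("Mianus", 700),
    3: ("Perryridge", 500),
    4: ("Brighton", 900),
}

balance_index = defaultdict(list)        # search key -> bucket of record ids
for rid, (_, balance) in accounts.items():
    balance_index[balance].append(rid)   # dense: every record gets an entry

print(balance_index[500])                # bucket for balance = 500 -> [1, 3]
```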
B+ Tree Insertion
• Locate leaf where for new key and pointer• Insert into leaf node• If overfull, split node• Recursively update parents to keep tree
balanced and (non-root) nodes >= half full
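The leaf-level step of this algorithm can be sketched as follows; this is a simplified model (the recursive parent update is elided, and the node capacity is an assumption), using the "Clearview" key from the example slide:

```python
# Sketch of B+ tree leaf insertion with a split. Parent updates are
# elided -- the real algorithm pushes the separator key upward.
import bisect

ORDER = 4  # assumed max keys per leaf

def insert_into_leaf(keys, pointers, key, ptr):
    """Insert (key, ptr) into a sorted leaf; split if overfull.

    Returns (left_leaf, right_sibling_or_None, separator_key_or_None)."""
    i = bisect.bisect_left(keys, key)
    keys.insert(i, key)
    pointers.insert(i, ptr)
    if len(keys) <= ORDER:
        return (keys, pointers), None, None
    mid = len(keys) // 2                      # overfull: right half moves out
    right = (keys[mid:], pointers[mid:])
    left = (keys[:mid], pointers[:mid])
    return left, right, right[0][0]           # separator = first key of right

leaf_keys = ["Brighton", "Downtown", "Mianus", "Perryridge"]
leaf_ptrs = [10, 11, 12, 13]                  # hypothetical tuple-page ids
left, right, sep = insert_into_leaf(leaf_keys, leaf_ptrs, "Clearview", 14)
print(left, right, sep)
```

Inserting "Clearview" overfills the four-key leaf, so it splits and "Downtown" becomes the separator key to push into the parent.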
B+ Tree Insertion Example

• Insert “Clearview”

B+-tree before and after insertion of “Clearview”
B+ Tree Deletion
• Find leaf key and pointer• Delete from leaf• If leaf underfull (> ½ entries used), rebalance
with neighbors• Recursively update parents to keep balance
and reflect new leaf contents– May delete root with one entry
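The rebalancing decision after a delete (borrow from a sibling if it can spare a key, otherwise merge) can be sketched like this; the minimum occupancy and the parent-key bookkeeping are simplified assumptions:

```python
# Sketch of the leaf-underflow decision after a B+ tree delete.
# Real code also fixes the separator keys in the parent.
MIN_KEYS = 2   # assumed: half of a max leaf capacity of 4

def rebalance(leaf, sibling):
    """Borrow from the (right) sibling if it can spare a key, else merge."""
    if len(leaf) >= MIN_KEYS:
        return "ok", leaf, sibling
    if len(sibling) > MIN_KEYS:                 # redistribute one key over
        leaf.append(sibling.pop(0))
        return "redistributed", leaf, sibling
    return "merged", leaf + sibling, []         # merge; parent loses an entry

print(rebalance(["Downtown"], ["Mianus", "Perryridge", "Redwood"]))
```

A merge removes an entry from the parent, which is what makes deletion recurse upward and, in the extreme, delete the root.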
B+ Tree Deletion Example
• Deleting “Downtown” causes merging of under-full leaves
  – a leaf node can become empty only for n=3!
Before and after deleting “Downtown”
Study Break: B+ Tree

• See tree on board
• Insert 9 into the tree
• Insert 3 into the original tree
• Delete 8 from the start tree with left-leaf redistribution
• Delete 8 with right redistribution
Column Store Performance
• How much do these optimizations matter?
• Wanted to compare against the best you could do with a commercial system
Emulating a Column Store
• Two approaches:1. Vertical partitioning: for n column table,
store n two-column tables, with ith table containing a tuple-id, and attribute i• Sort on tuple-id• Merge joins for query results
2. Index-only plans• Create a secondary index on each column• Never follow pointers to base table
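The vertical-partitioning approach can be sketched concretely; the column names and values here are made up, and each "table" is a list of (tuple_id, value) pairs sorted on tuple_id so that a merge join can stitch tuples back together:

```python
# Sketch of vertical partitioning: each attribute stored as its own
# (tuple_id, value) table, sorted on tuple_id.
store_id = [(1, 17), (2, 42), (3, 17)]      # (tuple_id, store_id)
revenue  = [(1, 9.0), (2, 3.5), (3, 1.2)]   # (tuple_id, revenue)

def merge_join(left, right):
    """Join two tuple-id-sorted column tables on tuple_id."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        (tid_l, v_l), (tid_r, v_r) = left[i], right[j]
        if tid_l == tid_r:
            out.append((tid_l, v_l, v_r))
            i += 1
            j += 1
        elif tid_l < tid_r:
            i += 1
        else:
            j += 1
    return out

print(merge_join(store_id, revenue))
```

Because both inputs are sorted on tuple_id, the join is a single linear pass; the problem described below is that a row store doesn't know the tables are sorted and sorts them again.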
Bottom Line

Configuration                 Time (s)
C-Store, Compression             4
C-Store, No Compression         15
C-Store, Early Materialize      41
Rows                            26
Rows, Vert. Part.               80
Rows, All Indexes              221
SSBM (Star Schema Benchmark, O’Neil et al., ICDE 2008): a data warehousing benchmark based on TPC-H. Scale 100 (60M-row table), 17 columns. Times are the average across 12 queries. The row store is a commercial DB tuned by a professional DBA, versus C-Store.
Commercial System Does Not Benefit From Vertical Partitioning
Problems with Vertical Partitioning

① Tuple headers
   – The total table is 4 GB, but each column table is ~1.0 GB: a factor-of-4 overhead from tuple headers and tuple-ids
② Merge joins
   – Answering queries requires joins, and the row store doesn’t know that the column tables are sorted, so it sorts them — which hurts performance

Would need to fix these, plus add direct operation on compressed data, to approach C-Store performance
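The factor-of-4 figure follows directly from the slide's numbers (4 GB base table, ~1 GB per column table, 17 columns from the SSBM table above):

```python
# Back-of-envelope check of the tuple-header/tuple-id overhead,
# using the sizes quoted on the slide.
true_table_gb = 4.0          # original row-store table
per_column_table_gb = 1.0    # each two-column (tuple-id, attr) table
n_columns = 17               # SSBM fact table width

vertical_total = per_column_table_gb * n_columns   # total on-disk footprint
print(vertical_total, vertical_total / true_table_gb)
```

Seventeen 1 GB column tables occupy ~17 GB versus 4 GB for the base table — roughly the factor-of-4 blow-up the slide attributes to per-tuple headers and the replicated tuple-ids.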
Problems with Index-Only Plans

Consider the query:

SELECT store_name, SUM(revenue)
FROM facts, stores
WHERE facts.store_id = stores.store_id
  AND stores.country = 'Canada'
GROUP BY store_name

• The two WHERE clauses result in a list of tuple IDs that pass all predicates
• We then need to go pick up values from the store_name and revenue columns
• But indexes map from value → tuple ID!
• Column stores can efficiently go from tuple ID → value in each column
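The direction mismatch can be made concrete with a toy column and index (values here are hypothetical):

```python
# A secondary index maps value -> tuple ids, but after evaluating the
# predicates we hold tuple ids and need values -- the wrong direction.
revenue_col = [9.0, 3.5, 1.2, 7.7]          # column store: position = tuple id

revenue_index = {}                           # index: value -> list of tuple ids
for tid, v in enumerate(revenue_col):
    revenue_index.setdefault(v, []).append(tid)

qualifying_tids = [0, 2]                     # tuple ids that passed the WHERE

# Column store: O(1) positional fetch per qualifying tuple id.
print([revenue_col[t] for t in qualifying_tids])

# Index-only plan: to answer tid -> value it must scan the whole index
# (inverting value -> tid), or give up and hit the base table.
```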
Recommendations for Row-Store Designers

• It might be possible to get C-Store-like performance:
① Store tuple headers elsewhere (don’t require that they be read from disk with the tuples)
② Provide an efficient merge join implementation that understands sorted columns
③ Support direct operation on compressed data
• Requires a “late materialization” design
Study Break: Column Stores

• Given the schema:
  grades (a_cid int, student_id int, grade char(2), grade_num int)
• Estimate how much data we would read for select avg(grade_num) over 1M records in a column store
  – What about a row store?
• If we have 5K students, how much data do we need to access to count the number of students who have earned an A where a_cid = 339? Do the same exercise with a row store.
Column Stores Solution

• Column store: avg(grade_num) reads 8 bytes * 1M tuples = 8 MB
  – Row store: (3 * 8 + 2) bytes * 1M = 26 MB
• Count: read two columns, (8 bytes (a_cid) + 2 bytes (grade)) * 1M = 10 MB
  – Row store: 26 MB again
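The arithmetic above can be checked directly; it assumes, as the slide does, 8-byte ints and a 2-byte char(2):

```python
# Verify the study-break sizes (assumed widths: 8-byte int, 2-byte char(2)).
n = 1_000_000
int_bytes, char_bytes = 8, 2

col_avg = int_bytes * n                        # column store: grade_num only
row_avg = (3 * int_bytes + char_bytes) * n     # row store: whole 26-byte tuple

col_count = (int_bytes + char_bytes) * n       # column store: a_cid + grade
row_count = (3 * int_bytes + char_bytes) * n   # row store: whole tuple again

print(col_avg, row_avg, col_count, row_count)  # bytes read in each case
```

Note the 5K-student figure doesn't change the answer: both queries still scan the 1M grade records; only the number of columns touched differs.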