Lecture 6 Indexing Part 2 Column Stores

Lecture 6

Indexing Part 2, Column Stores

Indexes Recap

             Heap File   Bitmap      Hash File   B+Tree
Insert       O(1)        O(1)        O(1)        O(log_B n)
Delete       O(P)        O(1)        O(1)        O(log_B n)
Range Scan   O(P)        -- / O(P)   -- / O(P)   O(log_B n + R)
Lookup       O(P)        O(C)        O(1)        O(log_B n)

n: number of tuples
P: number of pages in file
B: branching factor of B-Tree (keys / node)
R: number of pages in range
C: cardinality (#) of unique values on key
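
To make the lookup row of the recap concrete, here is a rough Python sketch that plugs numbers into the asymptotic costs. The sizes (1M tuples, 10k pages, 100 keys per node, 50 distinct values) and the function names are hypothetical, and constant factors are ignored, so treat the output as orders of magnitude only.

```python
def btree_height(n, B):
    """Levels needed so that B**height >= n, i.e. ceil(log_B n) without float error."""
    height, capacity = 1, B
    while capacity < n:
        capacity *= B
        height += 1
    return height


def lookup_cost(n, P, B, C):
    """Approximate page accesses for one key lookup, following the recap table."""
    return {
        "heap file": P,                 # O(P): scan every page
        "bitmap": C,                    # O(C): one bitmap per distinct value
        "hash file": 1,                 # O(1): hash straight to the bucket
        "b+tree": btree_height(n, B),   # O(log_B n): one page per level
    }


# Hypothetical sizes, chosen only for illustration.
costs = lookup_cost(n=1_000_000, P=10_000, B=100, C=50)
for structure, pages in costs.items():
    print(f"{structure:10s} ~{pages} page accesses")
```

With a branching factor of 100, a million keys fit in a tree only three levels deep, which is why the B+Tree column stays small even for large n.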

B+ Tree Indexes

• Balanced wide tree
• Fast value lookup and range scans
• Each node is a disk page (except root)
• Leaves point to tuple pages

Secondary Indices Example

• Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.

• Secondary indices have to be dense

Secondary index on balance field of account
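
A minimal Python sketch of the bucket idea, assuming a small made-up account table stored in branch order. The rows, the record-ID scheme (list position), and the function name are hypothetical; a real index would store disk pointers rather than list positions.

```python
from collections import defaultdict

# Hypothetical rows, stored sorted on branch_name (not on balance):
# (account_number, branch_name, balance)
accounts = [
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
    ("A-102", "Perryridge", 400),
    ("A-201", "Perryridge", 900),
    ("A-218", "Perryridge", 700),
]


def build_secondary_index(rows, column):
    """Map each search-key value to a bucket of record IDs holding that value."""
    index = defaultdict(list)
    for record_id, row in enumerate(rows):
        index[row[column]].append(record_id)   # dense: every record gets an entry
    return dict(index)


balance_index = build_secondary_index(accounts, column=2)
print(balance_index[700])   # bucket of record IDs with balance 700
```

Because the file is not ordered on balance, records with the same balance are scattered, so the index cannot skip entries the way a sparse index on the sort key can.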

B+ Tree Insertion

• Locate the leaf for the new key and pointer
• Insert into the leaf node
• If overfull, split the node
• Recursively update parents to keep the tree balanced and (non-root) nodes >= half full
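
The steps above can be sketched in Python as follows. This is an illustrative in-memory version, not a disk-based implementation: the class names, the `max_keys` parameter, and the string "pointers" are hypothetical, keys are assumed unique, and leaf sibling links for range scans are left out.

```python
import bisect


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []   # child nodes (internal) or record pointers (leaf)


class BPlusTree:
    def __init__(self, max_keys=3):
        self.max_keys = max_keys
        self.root = Node(leaf=True)

    def insert(self, key, pointer):
        split = self._insert(self.root, key, pointer)
        if split is not None:            # root was split: tree grows one level
            separator, right = split
            new_root = Node(leaf=False)
            new_root.keys = [separator]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key, pointer):
        if node.leaf:
            i = bisect.bisect_left(node.keys, key)
            node.keys.insert(i, key)
            node.children.insert(i, pointer)
        else:
            i = bisect.bisect_right(node.keys, key)   # child covering this key
            split = self._insert(node.children[i], key, pointer)
            if split is None:
                return None
            separator, right = split                  # child split: add entry here
            node.keys.insert(i, separator)
            node.children.insert(i + 1, right)
        if len(node.keys) <= self.max_keys:
            return None
        return self._split(node)                      # overfull: split and report up

    def _split(self, node):
        mid = len(node.keys) // 2
        right = Node(leaf=node.leaf)
        if node.leaf:
            # Leaf split: the separator is copied up and also stays in the right leaf.
            right.keys, right.children = node.keys[mid:], node.children[mid:]
            node.keys, node.children = node.keys[:mid], node.children[:mid]
            return right.keys[0], right
        # Internal split: the middle key moves up and is kept in neither half.
        separator = node.keys[mid]
        right.keys, right.children = node.keys[mid + 1:], node.children[mid + 1:]
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        return separator, right

    def search(self, key):
        node = self.root
        while not node.leaf:
            node = node.children[bisect.bisect_right(node.keys, key)]
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.children[i]
        return None


tree = BPlusTree(max_keys=3)
for k in [5, 1, 9, 3, 7, 2, 8, 6, 4, 10]:
    tree.insert(k, f"rec{k}")
print(tree.root.keys, tree.search(7))
```

Splits are reported to the caller as a (separator, new right node) pair, so the recursion itself carries the "update parents" step back toward the root.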

B+ Tree Insertion

Insert Clearview

B+-Tree before and after insertion of “Clearview”

B+ Tree Deletion

• Find the leaf containing the key and pointer
• Delete from the leaf
• If the leaf is underfull (< ½ entries used), rebalance with neighbors
• Recursively update parents to keep balance and reflect new leaf contents
  – May delete a root with one entry
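
The rebalancing step can be sketched at the leaf level in Python. This is a simplification under stated assumptions: leaves are plain sorted lists, `min_keys` is the half-full threshold, the function name is hypothetical, and redistribution evens out the two leaves rather than moving exactly one key. The caller is assumed to apply the returned separator (or the loss of one) to the parent.

```python
def rebalance_leaves(left, right, min_keys):
    """Repair an underfull leaf using its adjacent sibling.

    Returns (left, right, separator). After a merge, right and separator
    are None, meaning the parent must drop one key and one child pointer.
    """
    if len(left) >= min_keys and len(right) >= min_keys:
        return left, right, right[0]          # nothing to fix
    combined = left + right                   # still sorted: all of left < all of right
    if len(combined) >= 2 * min_keys:
        # Redistribute: enough keys for both leaves to be at least half full.
        mid = len(combined) // 2
        left, right = combined[:mid], combined[mid:]
        return left, right, right[0]          # parent separator must be updated
    # Merge: everything fits in one leaf; the parent loses an entry and may
    # itself become underfull, which is where the recursion comes from.
    return combined, None, None


print(rebalance_leaves([1, 2, 3], [8], min_keys=2))   # redistribution
print(rebalance_leaves([1, 2], [8], min_keys=2))      # merge
```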

B+ Tree Deletion Example

• Deleting “Downtown” causes merging of under-full leaves
  – A leaf node can become empty only for n=3! (Here n is the maximum number of pointers per node, not the number of tuples.)

Before and after deleting “Downtown”

Study Break: B+ Tree

• See tree on board
• Insert 9 into the tree
• Insert 3 into the original tree
• Delete 8 from start tree w/left leaf redistribution
• Delete 8 with right redistribution

Column Store Performance

• How much do these optimizations matter?

• Wanted to compare against best you could do with a commercial system

Emulating a Column Store

• Two approaches:
1. Vertical partitioning: for an n-column table, store n two-column tables, with the ith table containing a tuple-id and attribute i
   • Sort on tuple-id
   • Merge joins for query results
2. Index-only plans
   • Create a secondary index on each column
   • Never follow pointers to base table
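
The first approach can be sketched in Python: split a row table into (tuple-id, value) tables, then stitch columns back together with a merge join on tuple-id. The table contents, column names, and function names are hypothetical, and the join relies on tuple-ids being unique and sorted in each input.

```python
def vertical_partition(rows, columns):
    """Split a row table into one (tuple_id, value) table per column."""
    return {
        name: [(tid, row[i]) for tid, row in enumerate(rows)]
        for i, name in enumerate(columns)
    }


def merge_join(left, right):
    """Join two column tables sorted on tuple_id into (tuple_id, lval, rval) rows."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        ltid, lval = left[i]
        rtid, rval = right[j]
        if ltid == rtid:
            out.append((ltid, lval, rval))
            i += 1
            j += 1
        elif ltid < rtid:
            i += 1      # left tuple has no partner (e.g. it was filtered out on the right)
        else:
            j += 1
    return out


# Hypothetical three-column table.
rows = [("North", "Canada", 100), ("East", "USA", 250), ("West", "Canada", 75)]
parts = vertical_partition(rows, ["store_name", "country", "revenue"])

# Filter one column table, then join the survivors to another column table.
canada = [(tid, v) for tid, v in parts["country"] if v == "Canada"]
canada_revenue = merge_join(canada, parts["revenue"])
print(canada_revenue)
```

Each column a query touches costs one more join, and every stored value carries its own tuple-id alongside it.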

Bottom Line

Configuration                 Time (s)
C-Store, Compression            4
C-Store, No Compression        15
C-Store, Early Materialize     41
Rows                           26
Rows, Vert. Part.              80
Rows, All Indexes             221

SSBM (Star Schema Benchmark, O’Neil et al., ICDE 08)
• Data warehousing benchmark based on TPC-H
• Scale 100 (60 M row table), 17 columns
• Average across 12 queries
• Row store is a commercial DB, tuned by a professional DBA, vs. C-Store

Commercial System Does Not Benefit From Vertical Partitioning

Problems with Vertical Partitioning

① Tuple headers
   • Total table is 4 GB
   • Each column table is ~1.0 GB
   • Factor of 4 overhead from tuple headers and tuple-ids

② Merge joins
   • Answering queries requires joins
   • Row-store doesn’t know that column-tables are sorted
   • Sort hurts performance

Would need to fix these, plus add direct operation on compressed data, to approach C-Store performance
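
The "factor of 4" can be checked with a little arithmetic. The sketch below assumes the table in question is the 17-column, 60 M row benchmark fact table described earlier in the lecture and uses decimal gigabytes; both are assumptions, since this slide states only the two sizes.

```python
GB = 10**9                       # decimal gigabytes (assumption)
rows = 60_000_000                # assumed: the 60 M row benchmark table
columns = 17                     # assumed: the same table's 17 columns
row_table_bytes = 4 * GB         # "Total table is 4 GB"
column_table_bytes = 1.0 * GB    # "Each column table is ~1.0 GB"

partitioned_bytes = columns * column_table_bytes
overhead_factor = partitioned_bytes / row_table_bytes
bytes_per_entry = column_table_bytes / rows

print(f"all vertical partitions: {partitioned_bytes / GB:.0f} GB")
print(f"overhead factor: {overhead_factor:.2f}x")
print(f"bytes per (tuple-id, value) entry: {bytes_per_entry:.1f}")
```

Under these assumptions each two-column entry costs roughly 17 bytes on disk, far more than a single attribute value needs, which is the tuple header and tuple-id overhead the slide refers to.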

Problems with Index-Only Plans

Consider the query:

SELECT store_name, SUM(revenue)
FROM Facts, Stores
WHERE Facts.store_id = Stores.store_id
  AND Stores.country = 'Canada'
GROUP BY store_name

• Two WHERE clauses result in a list of tuple IDs that pass all predicates

• Need to go pick up values from store_name and revenue columns

• But indexes map from value → tuple ID!
• Column stores can efficiently go from tuple ID → value in each column
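
A small Python sketch of the mismatch, using a made-up revenue column. The data and function names are hypothetical; the point is only the direction of each lookup, not how a real system would execute the query.

```python
# Column store: the position in the array is the tuple ID.
revenue_column = [120, 75, 300, 75]

# Secondary index on the same column: value -> list of tuple IDs.
revenue_index = {75: [1, 3], 120: [0], 300: [2]}


def value_from_column(column, tuple_id):
    """Tuple ID -> value is a direct positional read."""
    return column[tuple_id]


def value_from_index(index, tuple_id):
    """Tuple ID -> value has no direct path: scan entries until the ID turns up."""
    for value, tuple_ids in index.items():
        if tuple_id in tuple_ids:
            return value
    raise KeyError(tuple_id)


qualifying = [1, 2]   # tuple IDs that passed the predicates
print([value_from_column(revenue_column, t) for t in qualifying])
print([value_from_index(revenue_index, t) for t in qualifying])
```

Both calls return the same values, but the index version does work proportional to the size of the index for each tuple ID.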

Recommendations for Row-Store Designers

• Might be possible to get C-Store like performance
  ① Need to store tuple headers elsewhere (not require that they be read from disk w/ tuples)
  ② Need to provide efficient merge join implementation that understands sorted columns
  ③ Need to support direct operation on compressed data
• Requires “late materialization” design
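
As a rough illustration of what "late materialization" means, the Python sketch below filters on column positions first and builds output tuples only for the positions that survive. The columns, predicate, and function name are hypothetical, and this is a toy in-memory version rather than how C-Store implements it.

```python
def late_materialize(columns, predicates, output_cols):
    """Narrow a position list column by column, then fetch output values last."""
    n = len(next(iter(columns.values())))
    positions = list(range(n))
    for col, pred in predicates.items():
        values = columns[col]                       # touch only the filtered column
        positions = [p for p in positions if pred(values[p])]
    # Tuples are constructed only now, and only for qualifying positions.
    return [tuple(columns[c][p] for c in output_cols) for p in positions]


# Hypothetical column data.
columns = {
    "country": ["Canada", "USA", "Canada", "Mexico"],
    "store_name": ["North", "East", "West", "South"],
    "revenue": [100, 250, 75, 40],
}
result = late_materialize(
    columns,
    predicates={"country": lambda c: c == "Canada"},
    output_cols=["store_name", "revenue"],
)
print(result)
```

An early-materializing plan would instead build full tuples up front and filter them afterwards, reading values that are then thrown away.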

Study Break: Column Stores

• Given the schema:
  grades (a_cid int, student_id int, grade char(2), grade_num int)

• Estimate how much data we would read if we select avg(grade_num) from 1M records in a column store.
  – What about a row store?

• If we have 5k students, how much data do we need to access to count the number of students who have earned an A where a_cid=339? Do the same exercise with a row store.

Column Stores Solution

• Column store: avg(grade_num) reads 8 bytes * 1M tuples = 8 MB (assuming 8-byte ints)
  – Row store: (3*8 + 2) bytes * 1M = 26 MB
• Count # of tuples from two cols: (8 bytes (a_cid) + 2 bytes (grade)) * 1M = 10 MB
  – Row store: 26 MB again
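
The same arithmetic as a short Python sketch. The 8-byte int and 2-byte char(2) widths are taken from the numbers in the solution; the dictionary and function names are hypothetical, and headers, compression, and page overhead are ignored.

```python
N = 1_000_000   # records
MB = 10**6      # decimal megabytes, matching the solution's rounding

# Column widths in bytes, as implied by the solution's arithmetic.
schema = {"a_cid": 8, "student_id": 8, "grade": 2, "grade_num": 8}


def column_store_bytes(cols, n=N):
    """A column store reads only the columns the query touches."""
    return sum(schema[c] for c in cols) * n


def row_store_bytes(n=N):
    """A row store reads whole tuples regardless of which columns are needed."""
    return sum(schema.values()) * n


print(column_store_bytes(["grade_num"]) / MB, "MB for avg(grade_num)")
print(column_store_bytes(["a_cid", "grade"]) / MB, "MB for the count query")
print(row_store_bytes() / MB, "MB for either query in a row store")
```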