15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS... · 2010-01-11 · C....
Transcript of 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS... · 2010-01-11 · C....
C. Faloutsos 15-826
1
CMU SCS
15-826: Multimedia Databases and Data Mining
Lecture#2: Primary key indexing – B-trees Christos Faloutsos - CMU
www.cs.cmu.edu/~christos
CMU SCS
Reading Material
[Ramakrishnan & Gehrke, 3rd ed, ch. 10]
15-826 Copyright: C. Faloutsos (2010) 2
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 3
Problem
Given a large collection of (multimedia) records, find similar/interesting things, ie:
• Allow fast, approximate queries, and • Find rules/patterns
C. Faloutsos 15-826
2
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 4
Outline
Goal: ‘Find similar / interesting things’ • Intro to DB • Indexing - similarity search • Data Mining
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 5
Indexing - Detailed outline
• primary key indexing – B-trees and variants – (static) hashing – extendible hashing
• secondary key indexing • spatial access methods • text • ...
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 6
In even more detail:
• B – trees
• B+ - trees, B*-trees
• hashing
C. Faloutsos 15-826
3
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 7
Primary key indexing
• find employee with ssn=123
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 8
B-trees
• the most successful family of index schemes (B-trees, B+-trees, B*-trees)
• Can be used for primary/secondary, clustering/non-clustering index.
• balanced “n-way” search trees
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 9
Citation • Rudolf Bayer and Edward M.
McCreight, Organization and Maintenance of Large Ordered Indices, Acta Informatica, 1:173-189, 1972.
• Received the 2001 SIGMOD innovations award • among the most cited db publications
• www.informatik.uni-trier.de/~ley/db/about/top.html
C. Faloutsos 15-826
4
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 10
B-trees
Eg., B-tree of order 3:
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 11
B - tree properties:
• each node, in a B-tree of order n: – Key order – at most n pointers – at least n/2 pointers (except root) – all leaves at the same level – if number of pointers is k, then node has exactly k-1
keys – (leaves are empty)
v1 v2 … vn-1 p1 pn
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 12
Properties
• “block aware” nodes: each node -> disk page
• O(log (N)) for everything! (ins/del/search)
• typically, if n = 50 - 100, then 2 - 3 levels
• utilization >= 50%, guaranteed; on average 69%
C. Faloutsos 15-826
5
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 13
Queries
• Algo for exact match query? (eg., ssn=8?)
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 14
Queries
• Algo for exact match query? (eg., ssn=8?)
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 15
Queries
• Algo for exact match query? (eg., ssn=8?)
1 3
6
7
9
13
<6
>6 <9 >9
C. Faloutsos 15-826
6
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 16
Queries
• Algo for exact match query? (eg., ssn=8?)
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 17
Queries
• Algo for exact match query? (eg., ssn=8?)
1 3
6
7
9
13
<6
>6 <9 >9 H steps (= disk accesses)
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 18
Queries
• what about range queries? (eg., 5<salary<8) • Proximity/ nearest neighbor searches? (eg.,
salary ~ 8 )
C. Faloutsos 15-826
7
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 19
Queries • what about range queries? (eg., 5<salary<8) • Proximity/ nearest neighbor searches? (eg.,
salary ~ 8 )
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 20
Queries • what about range queries? (eg., 5<salary<8) • Proximity/ nearest neighbor searches? (eg.,
salary ~ 8 )
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 21
B-trees: Insertion
• Insert in leaf; on overflow, push middle up (recursively)
• split: preserves B - tree properties
C. Faloutsos 15-826
8
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 22
B-trees
Easy case: Tree T0; insert ‘8’
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 23
B-trees
Tree T0; insert ‘8’
1 3
6
7
9
13
<6
>6 <9 >9
8
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 24
B-trees
Hardest case: Tree T0; insert ‘2’
1 3
6
7
9
13
<6
>6 <9 >9
2
C. Faloutsos 15-826
9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 25
B-trees
Hardest case: Tree T0; insert ‘2’
1 2
6
7
9
13 3
push middle up
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 26
B-trees
Hardest case: Tree T0; insert ‘2’
6
7
9
13 1 3
2 2 Ovf; push middle
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 27
B-trees
Hardest case: Tree T0; insert ‘2’
7
9
13 1 3
2
6 Final state
C. Faloutsos 15-826
10
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 28
B-trees: Insertion
• Q: What if there are two middles? (eg, order 4)
• A: either one is fine
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 29
B-trees: Insertion
• Insert in leaf; on overflow, push middle up (recursively – ‘propagate split’)
• split: preserves all B - tree properties (!!) • notice how it grows: height increases when
root overflows & splits • Automatic, incremental re-organization
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 30
Overview
• B – trees – Dfn, Search, insertion, deletion
• B+ - trees
• hashing
C. Faloutsos 15-826
11
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 31
Deletion
Rough outline of algo: • Delete key; • on underflow, may need to merge
In practice, some implementors just allow underflows to happen…
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 32
B-trees – Deletion
Easiest case: Tree T0; delete ‘3’
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 33
B-trees – Deletion
Easiest case: Tree T0; delete ‘3’
1
6
7
9
13
<6
>6 <9 >9
C. Faloutsos 15-826
12
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 34
B-trees – Deletion
Easiest case: Tree T0; delete ‘3’
1
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 35
B-trees – Deletion
• Case1: delete a key at a leaf – no underflow • Case2: delete non-leaf key – no underflow • Case3: delete leaf-key; underflow, and ‘rich
sibling’ • Case4: delete leaf-key; underflow, and ‘poor
sibling’
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 36
B-trees – Deletion
• Case2: delete a key at a non-leaf – no underflow (eg., delete 6 from T0)
1 3
6
7
9
13
<6
>6 <9 >9
Delete & promote, ie:
C. Faloutsos 15-826
13
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 37
B-trees – Deletion
• Case2: delete a key at a non-leaf – no underflow (eg., delete 6 from T0)
1 3 7
9
13
<6
>6 <9 >9
Delete & promote, ie:
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 38
B-trees – Deletion
• Case2: delete a key at a non-leaf – no underflow (eg., delete 6 from T0)
1 7
9
13
<6
>6 <9 >9
Delete & promote, ie: 3
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 39
B-trees – Deletion
• Case2: delete a key at a non-leaf – no underflow (eg., delete 6 from T0)
1 7
9
13
<3
>3 <9 >9 3
FINAL TREE
C. Faloutsos 15-826
14
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 40
B-trees – Deletion
• Case2: delete a key at a non-leaf – no underflow (eg., delete 6 from T0)
• Q: How to promote? • A: pick the largest key from the left sub-tree
(or the smallest from the right sub-tree) • Observation: every deletion eventually
becomes a deletion of a leaf key
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 41
B-trees – Deletion • Case1: delete a key at a leaf – no underflow • Case2: delete non-leaf key – no underflow • Case3: delete leaf-key; underflow, and ‘rich
sibling’ • Case4: delete leaf-key; underflow, and ‘poor
sibling’
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 42
B-trees – Deletion
• Case3: underflow & ‘rich sibling’ (eg., delete 7 from T0)
1 3
6
7
9
13
<6
>6 <9 >9
Delete & borrow, ie:
C. Faloutsos 15-826
15
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 43
B-trees – Deletion
• Case3: underflow & ‘rich sibling’ (eg., delete 7 from T0)
1 3
6 9
13
<6
>6 <9 >9
Delete & borrow, ie:
Rich sibling
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 44
B-trees – Deletion
• Case3: underflow & ‘rich sibling’
• ‘rich’ = can give a key, without underflowing
• ‘borrowing’ a key: THROUGH the PARENT!
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 45
B-trees – Deletion
• Case3: underflow & ‘rich sibling’ (eg., delete 7 from T0)
1 3
6 9
13
<6
>6 <9 >9
Delete & borrow, ie:
Rich sibling
NO!!
C. Faloutsos 15-826
16
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 46
B-trees – Deletion
• Case3: underflow & ‘rich sibling’ (eg., delete 7 from T0)
1 3
6 9
13
<6
>6 <9 >9
Delete & borrow, ie:
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 47
B-trees – Deletion
• Case3: underflow & ‘rich sibling’ (eg., delete 7 from T0)
1 3
9
13
<6
>6 <9 >9
Delete & borrow, ie:
6
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 48
B-trees – Deletion
• Case3: underflow & ‘rich sibling’ (eg., delete 7 from T0)
1
3 9
13
<6
>6 <9 >9
Delete & borrow, ie:
6
C. Faloutsos 15-826
17
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 49
B-trees – Deletion
• Case3: underflow & ‘rich sibling’ (eg., delete 7 from T0)
1
3 9
13
<3
>3 <9 >9
Delete & borrow, through the parent
6
FINAL TREE
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 50
B-trees – Deletion • Case1: delete a key at a leaf – no underflow • Case2: delete non-leaf key – no underflow • Case3: delete leaf-key; underflow, and ‘rich
sibling’ • Case4: delete leaf-key; underflow, and ‘poor
sibling’
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 51
B-trees – Deletion
• Case4: underflow & ‘poor sibling’ (eg., delete 13 from T0)
1 3
6
7
9
13
<6
>6 <9 >9
C. Faloutsos 15-826
18
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 52
B-trees – Deletion
• Case4: underflow & ‘poor sibling’ (eg., delete 13 from T0)
1 3
6
7
9 <6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 53
B-trees – Deletion
• Case4: underflow & ‘poor sibling’ (eg., delete 13 from T0)
1 3
6
7
9 <6
>6 <9 >9
A: merge w/ ‘poor’ sibling
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 54
B-trees – Deletion
• Case4: underflow & ‘poor sibling’ (eg., delete 13 from T0)
• Merge, by pulling a key from the parent • exact reversal from insertion: ‘split and push
up’, vs. ‘merge and pull down’ • Ie.:
C. Faloutsos 15-826
19
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 55
B-trees – Deletion
• Case4: underflow & ‘poor sibling’ (eg., delete 13 from T0)
1 3
6
7
<6
>6
A: merge w/ ‘poor’ sibling
9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 56
B-trees – Deletion
• Case4: underflow & ‘poor sibling’ (eg., delete 13 from T0)
1 3
6
7
<6
>6 9
FINAL TREE
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 57
B-trees – Deletion
• Case4: underflow & ‘poor sibling’ • -> ‘pull key from parent, and merge’ • Q: What if the parent underflows?
C. Faloutsos 15-826
20
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 58
B-trees – Deletion
• Case4: underflow & ‘poor sibling’ • -> ‘pull key from parent, and merge’ • Q: What if the parent underflows? • A: repeat recursively
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 59
Overview
• B – trees
• B+ - trees, B*-trees
• hashing
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 60
B+ trees - Motivation
1st reason - B-tree – print keys in sorted order:
1 3
6
7
9
13
<6
>6 <9 >9
C. Faloutsos 15-826
21
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 61
B+ trees - Motivation
B-tree needs back-tracking – how to avoid it?
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 62
B+ trees - Motivation 2nd reason: if we want to store the whole
record with the key –> problems (what?)
1 3
6
7
9
13
<6
>6 <9 >9
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 63
Solution: B+ - trees
• facilitate sequential ops
• They string all leaf nodes together
• AND
• replicate keys from non-leaf nodes, to make sure every key appears at the leaf level
C. Faloutsos 15-826
22
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 64
B+ trees
1 3
6
6
9
9
<6
>=6 <9 >=9
7 13
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 65
B+ trees - insertion
1 3
6
6
9
9
<6
>=6 <9 >=9
7 13
Eg., insert ‘8’
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 66
Overview
• B – trees
• B+ - trees, B*-trees
• hashing
C. Faloutsos 15-826
23
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 67
B*-trees
• splits drop util. to 50%, and maybe increase height
• How to avoid them?
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 68
B*-trees: deferred split! • Instead of splitting, LEND keys to sibling! (through PARENT, of course!)
1 3
6
7
9
13
<6
>6 <9 >9
2
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 69
B*-trees: deferred split! • Instead of splitting, LEND keys to sibling! (through PARENT, of course!)
1 2
3
6
9
13
<3
>3 <9 >9
2
7
FINAL TREE
C. Faloutsos 15-826
24
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 70
B*-trees: deferred split! • Notice: shorter, more packed, faster tree • It’s a rare case, where space utilization and
speed improve together • BUT: What if the sibling has no room for
our ‘lending’?
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 71
B*-trees: deferred split! • BUT: What if the sibling has no room for
our ‘lending’? • A: 2-to-3 split: get the keys from the
sibling, pool them with ours (and a key from the parent), and split in 3.
• Details: too messy (and even worse for deletion)
CMU SCS
15-826 Copyright: C. Faloutsos (2010) 72
Conclusions • Main ideas: recursive; block-aware; on
overflow -> split; defer splits
• All B-tree variants have excellent, O(logN) worst-case performance for ins/del/search
• B+ tree is the prevailing indexing method
• More details: [Knuth vol 3.] or [Ramakrishnan & Gehrke, 3rd ed, ch. 10]