CS5614:(Big)Data#...
Transcript of CS5614:(Big)Data#...
![Page 1: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/1.jpg)
CS 5614: (Big) Data Management Systems
B. Aditya Prakash Lecture #5: Hashing and Sor5ng
![Page 2: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/2.jpg)
(Sta7c) Hashing
§ Problem: “find EMP record with ssn=123” § What if disk space was free, and 5me was at premium?
Prakash 2014 VT CS 5614 2
![Page 3: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/3.jpg)
Hashing
§ A: Brilliant idea: key-‐to-‐address transforma5on:
Prakash 2014 VT CS 5614 3
#0 page
#123 page
#999,999,999
123; Smith; Main str
![Page 4: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/4.jpg)
Hashing
§ Since space is NOT free: § use M, instead of 999,999,999 slots § hash func5on: h(key) = slot-‐id
Prakash 2014 VT CS 5614 4
#0 page
#123 page
#999,999,999
123; Smith; Main str
![Page 5: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/5.jpg)
Hashing
§ Typically: each hash bucket is a page, holding many records:
Prakash 2014 VT CS 5614 5
#0 page
#h(123)
M
123; Smith; Main str
![Page 6: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/6.jpg)
Hashing
§ No5ce: could have clustering, or non-‐clustering versions:
Prakash 2014 VT CS 5614 6
#0 page
#h(123)
M
123; Smith; Main str.
![Page 7: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/7.jpg)
Hashing
§ No5ce: could have clustering, or non-‐clustering versions:
Prakash 2014 VT CS 5614 7
123 ...
#0 page
#h(123)
M
... EMP file
123; Smith; Main str.
...
234; Johnson; Forbes ave
345; Tompson; Fi]h ave
...
![Page 8: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/8.jpg)
Design decisions
§ 1) formula h() for hashing func5on § 2) size of hash table M § 3) collision resolu5on method
Prakash 2014 VT CS 5614 8
![Page 9: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/9.jpg)
Design decisions -‐ func7ons
§ Goal: uniform spread of keys over hash buckets
§ Popular choices: – Division hashing – Mul5plica5on hashing
Prakash 2014 VT CS 5614 9
![Page 10: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/10.jpg)
Division hashing
§ h(x) = (a*x+b) mod M § eg., h(ssn) = (ssn) mod 1,000
– gives the last three digits of ssn § M: size of hash table -‐ choose a prime number, defensively (why?)
Prakash 2014 VT CS 5614 10
![Page 11: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/11.jpg)
Division hashing
Prakash 2014 VT CS 5614 11
§ eg., M=2; hash on driver-‐license number (dln), where last digit is ‘gender’ (0/1 = M/F)
§ in an army unit with predominantly male soldiers
§ Thus: avoid cases where M and keys have common divisors -‐ prime M guards against that!
![Page 12: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/12.jpg)
Mul7plica7on hashing
Prakash 2014 VT CS 5614 12
h(x) = [ frac-onal-‐part-‐of ( x * φ ) ] * M
§ φ: golden ra5o ( 0.618... = ( sqrt(5)-‐1)/2 )
§ in general, we need an irra5onal number
§ advantage: M need not be a prime number
§ but φ must be irra5onal
![Page 13: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/13.jpg)
Other hashing func7ons
Prakash 2014 VT CS 5614 13
§ quadra5c hashing (bad) § ...
![Page 14: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/14.jpg)
Other hashing func7ons
Prakash 2014 VT CS 5614 14
§ quadra5c hashing (bad) § ...
§ conclusion: use division hashing
![Page 15: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/15.jpg)
Size of hash table
Prakash 2014 VT CS 5614 15
§ eg., 50,000 employees, 10 employee-‐records / page
§ Q: M=?? pages/buckets/slots
![Page 16: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/16.jpg)
Size of hash table
Prakash 2014 VT CS 5614 16
§ eg., 50,000 employees, 10 employees/page
§ Q: M=?? pages/buckets/slots
§ A: u5liza5on ~ 90% and – M: prime number
Eg., in our case: M= closest prime to 50,000/10 / 0.9 = 5,555
![Page 17: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/17.jpg)
Collision resolu7on
§ Q: what is a ‘collision’? § A: ??
Prakash 2014 VT CS 5614 17
![Page 18: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/18.jpg)
Collision resolu7on
Prakash 2014 VT CS 5614 18
#0 page
#h(123)
M
123; Smith; Main str.
![Page 19: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/19.jpg)
Collision resolu7on
§ Q: what is a ‘collision’? § A: ?? § Q: why worry about collisions/overflows? (recall that buckets are ~90% full)
§ A: ‘birthday paradox’
Prakash 2014 VT CS 5614 19
![Page 20: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/20.jpg)
Collision resolu7on
§ open addressing – linear probing (ie., put to next slot/bucket) – re-‐hashing
§ separate chaining (ie., put links to overflow pages)
Prakash 2014 VT CS 5614 20
![Page 21: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/21.jpg)
Collision resolu7on
Prakash 2014 VT CS 5614 21
#0 page
#h(123)
M
123; Smith; Main str.
linear probing:
![Page 22: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/22.jpg)
Collision resolu7on
Prakash 2014 VT CS 5614 22
#0 page
#h(123)
M
123; Smith; Main str.
re-‐hashing
h1()
h2()
![Page 23: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/23.jpg)
Collision resolu7on
Prakash 2014 VT CS 5614 23
123; Smith; Main str.
separate chaining
![Page 24: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/24.jpg)
Design decisions -‐ conclusions
§ func5on: division hashing – h(x) = ( a*x+b ) mod M
§ size M: ~90% u5l.; prime number. § collision resolu5on: separate chaining
– easier to implement (dele5ons!); – no danger of becoming full
Prakash 2014 VT CS 5614 24
![Page 25: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/25.jpg)
Problem with sta7c hashing
§ problem: overflow? § problem: underflow? (underu5liza5on)
Prakash 2014 VT CS 5614 25
![Page 26: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/26.jpg)
Solu7on: Dynamic/extendible hashing
§ idea: shrink / expand hash table on demand.. § ..dynamic hashing § Details: how to grow gracefully, on overflow? § Many solu5ons -‐ One of them: ‘extendible hashing’ [Fagin et al]
Prakash 2014 VT CS 5614 26
![Page 27: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/27.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 27
#0 page
#h(123)
M
123; Smith; Main str.
![Page 28: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/28.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 28
#0 page
#h(123)
M
123; Smith; Main str.
solu7on:
split the bucket in two
![Page 29: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/29.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 29
in detail: § keep a directory, with ptrs to hash-‐buckets § Q: how to divide contents of bucket in two? § A: hash each key into a very long bit string; keep only as many bits as needed
Eventually:
![Page 30: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/30.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 30
directory
00... 01...
10...
11...
10101...
10110...
1101...
10011...
0111... 0001...
101001...
![Page 31: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/31.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 31
directory
00... 01...
10...
11...
10101...
10110...
1101...
10011...
0111... 0001...
101001...
![Page 32: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/32.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 32
directory
00... 01...
10...
11...
10101...
10110...
1101...
10011...
0111... 0001...
101001...
split on 3-‐rd bit
![Page 33: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/33.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 33
directory
00... 01...
10...
11...
1101...
10011...
0111... 0001...
101001... 10101...
10110...
new page / bucket
![Page 34: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/34.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 34
directory (doubled)
1101...
10011...
0111... 0001...
101001... 10101...
10110...
new page / bucket
000... 001...
010...
011...
100... 101...
110...
111...
![Page 35: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/35.jpg)
Extendible hashing
Prakash 2014 VT CS 5614 35
00... 01...
10...
11...
10101...
10110...
1101...
10011...
0111... 0001...
101001...
000... 001...
010...
011...
100... 101...
110...
111...
1101...
10011...
0111... 0001...
101001... 10101...
10110...
BEFORE AFTER
![Page 36: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/36.jpg)
Extendible hashing
§ Summary: directory doubles on demand § or halves, on shrinking files § needs ‘local’ and ‘global’ depth
Prakash 2014 VT CS 5614 36
![Page 37: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/37.jpg)
Linear hashing -‐ overview
§ Mo5va5on § main idea § search algo § inser5on/split algo § dele5on
Prakash 2014 VT CS 5614 37
![Page 38: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/38.jpg)
Linear hashing
§ Mo5va5on: ext. hashing needs directory etc etc; which doubles (ouch!)
§ Q: can we do something simpler, with smoother growth?
Prakash 2014 VT CS 5614 38
![Page 39: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/39.jpg)
Linear hashing
§ Mo5va5on: ext. hashing needs directory etc etc; which doubles (ouch!)
§ Q: can we do something simpler, with smoother growth?
§ A: split buckets from le] to right, regardless of which one overflowed (‘crazy’, but it works well!) -‐ Eg.:
Prakash 2014 VT CS 5614 39
![Page 40: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/40.jpg)
Linear hashing
Prakash 2014 VT CS 5614 40
Ini5ally: h(x) = x mod N (N=4 here)
Assume capacity: 3 records / bucket
Insert key ‘17’
0 1 2 3 bucket-‐ id
4 8 5 9 13
6 7 11
![Page 41: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/41.jpg)
Linear hashing
Prakash 2014 VT CS 5614 41
Ini5ally: h(x) = x mod N (N=4 here)
0 1 2 3 bucket-‐ id
4 8 5 9 13
6 7 11
17 overflow of bucket#1
![Page 42: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/42.jpg)
Linear hashing
Prakash 2014 VT CS 5614 42
Ini5ally: h(x) = x mod N (N=4 here)
0 1 2 3 bucket-‐ id
4 8 5 9 13
6 7 11
17 overflow of bucket#1
Split #0, anyway!!!
![Page 43: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/43.jpg)
Linear hashing
Prakash 2014 VT CS 5614 43
Ini5ally: h(x) = x mod N (N=4 here)
0 1 2 3 bucket-‐ id
4 8 5 9 13
6 7 11
17 Split #0, anyway!!!
Q: But, how?
![Page 44: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/44.jpg)
Linear hashing
Prakash 2014 VT CS 5614 44
A: use two h.f.: h0(x) = x mod N
h1(x) = x mod (2*N)
0 1 2 3 bucket-‐ id
4 8 5 9 13
6 7 11
17
![Page 45: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/45.jpg)
Linear hashing -‐ a]er split:
Prakash 2014 VT CS 5614 45
A: use two h.f.: h0(x) = x mod N
h1(x) = x mod (2*N)
0 1 2 3 bucket-‐ id
8 5 9 13
6 7 11
17
4
4
![Page 46: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/46.jpg)
Linear hashing -‐ a]er split:
Prakash 2014 VT CS 5614 46
A: use two h.f.: h0(x) = x mod N
h1(x) = x mod (2*N)
0 1 2 3 bucket-‐ id
8 5 9 13
6 7 11
17
4
overflow
4
![Page 47: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/47.jpg)
Linear hashing -‐ a]er split:
Prakash 2014 VT CS 5614 47
A: use two h.f.: h0(x) = x mod N
h1(x) = x mod (2*N)
0 1 2 3 bucket-‐ id
8 5 9 13
6 7 11
17
4
overflow
4
split ptr
![Page 48: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/48.jpg)
Linear hashing -‐ searching?
Prakash 2014 VT CS 5614 48
h0(x) = x mod N (for the un-‐split buckets) h1(x) = x mod (2*N) (for the spliEed ones)
0 1 2 3 bucket-‐ id
8 5 9 13
6 7 11
17
4
overflow
4
split ptr
![Page 49: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/49.jpg)
Linear hashing -‐ searching?
Prakash 2014 VT CS 5614 49
Q1: find key ‘6’? Q2: find key ‘4’?
Q3: key ‘8’?
0 1 2 3 bucket-‐ id
8 5 9 13
6 7 11
17
4
overflow
4
split ptr
![Page 50: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/50.jpg)
Linear hashing -‐ searching?
Prakash 2014 VT CS 5614 50
Algo to find key ‘k’:
• compute b= h0(k);
• if b<split-‐ptr, compute b=h1(k)
• search bucket b
![Page 51: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/51.jpg)
Linear hashing -‐ inser7on?
Prakash 2014 VT CS 5614 51
Algo: insert key ‘k’
• compute appropriate bucket ‘b’
• if the overflow criterion is true • split the bucket of ‘split-‐ptr’ • split-‐ptr ++ (*)
![Page 52: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/52.jpg)
Linear hashing -‐ inser7on?
Prakash 2014 VT CS 5614 52
§ no5ce: overflow criterion is up to us!! § Q: sugges5ons?
![Page 53: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/53.jpg)
Linear hashing -‐ inser7on?
Prakash 2014 VT CS 5614 53
§ no5ce: overflow criterion is up to us!! § Q: sugges5ons? § A1: space u5liza5on >= u-‐max
![Page 54: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/54.jpg)
Linear hashing -‐ inser7on?
Prakash 2014 VT CS 5614 54
§ no5ce: overflow criterion is up to us!! § Q: sugges5ons? § A1: space u5liza5on > u-‐max § A2: avg length of ovf chains > max-‐len § A3: ....
![Page 55: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/55.jpg)
Linear hashing -‐ inser7on?
Prakash 2014 VT CS 5614 55
Algo: insert key ‘k’
• compute appropriate bucket ‘b’
• if the overflow criterion is true • split the bucket of ‘split-‐ptr’ • split-‐ptr ++ (*)
what if we reach the right edge??
![Page 56: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/56.jpg)
Linear hashing -‐ split now?
Prakash 2014 VT CS 5614 56
h0(x) = x mod N (for the un-‐split buckets) h1(x) = x mod (2*N) for the spliEed ones)
split ptr
0 1 2 3 4 5 6
![Page 57: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/57.jpg)
Linear hashing -‐ split now?
Prakash 2014 VT CS 5614 57
h0(x) = x mod N (for the un-‐split buckets) h1(x) = x mod (2*N) (for the spliEed ones)
split ptr
0 1 2 3 4 5 6 7
![Page 58: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/58.jpg)
Linear hashing -‐ split now?
Prakash 2014 VT CS 5614 58
h0(x) = x mod N (for the un-‐split buckets) h1(x) = x mod (2*N) (for the spliEed ones)
split ptr
0 1 2 3 4 5 6 7
![Page 59: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/59.jpg)
Linear hashing -‐ split now?
Prakash 2014 VT CS 5614 59
h0(x) = x mod N (for the un-‐split buckets) h1(x) = x mod (2*N) (for the spliEed ones)
split ptr
0 1 2 3 4 5 6 7
![Page 60: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/60.jpg)
Linear hashing -‐ split now?
Prakash 2014 VT CS 5614 60
split ptr
0 1 2 3 4 5 6 7
this state is called ‘full expansion’
![Page 61: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/61.jpg)
Linear hashing -‐ observa7ons
Prakash 2014 VT CS 5614 61
In general, at any point of 5me, we have at most two h.f. ac5ve, of the form:
• hn(x) = x mod (N * 2n)
• hn+1(x) = x mod (N * 2n+1)
(aHer a full expansion, we have only one h.f.)
![Page 62: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/62.jpg)
Linear hashing -‐ dele7on?
§ reverse of inser5on:
Prakash 2014 VT CS 5614 62
![Page 63: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/63.jpg)
Linear hashing -‐ dele7on?
Prakash 2014 VT CS 5614 63
§ reverse of inser5on: § if the underflow criterion is met
– contract!
![Page 64: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/64.jpg)
Linear hashing -‐ how to contract?
Prakash 2014 VT CS 5614 64
h0(x) = mod N (for the un-‐split buckets) h1(x) = mod (2*N) (for the spliEed ones)
split ptr
0 1 2 3 4 5 6
![Page 65: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/65.jpg)
Linear hashing -‐ how to contract?
Prakash 2014 VT CS 5614 65
h0(x) = mod N (for the un-‐split buckets) h1(x) = mod (2*N) (for the spliEed ones)
split ptr
0 1 2 3 4 5
![Page 66: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/66.jpg)
Hashing -‐ pros?
Prakash 2014 VT CS 5614 66
![Page 67: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/67.jpg)
Hashing -‐ pros?
§ Speed, – on exact match queries – on the average
Prakash 2014 VT CS 5614 67
![Page 68: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/68.jpg)
B(+)-‐trees -‐ pros?
Prakash 2014 VT CS 5614 68
![Page 69: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/69.jpg)
B(+)-‐trees -‐ pros?
§ Speed on search: – exact match queries, worst case – range queries – nearest-‐neighbor queries
§ Speed on inser5on + dele5on § smooth growing and shrinking (no re-‐org)
Prakash 2014 VT CS 5614 69
![Page 70: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/70.jpg)
Conclusions
§ B-‐trees and variants: in all DBMSs § hash indices: in some
– (but hashing in useful for joins…)
Prakash 2014 VT CS 5614 70
![Page 71: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/71.jpg)
SORTING
Prakash 2014 VT CS 5614 71
![Page 72: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/72.jpg)
Prakash 2014 VT CS 5614 72
Why Sort?
![Page 73: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/73.jpg)
Prakash 2014 VT CS 5614 73
Why Sort?
§ select ... order by – e.g., find students in increasing gpa order
§ bulk loading B+ tree index. § duplicate eliminaLon (select distinct) § select ... group by § Sort-‐merge join algorithm involves sor5ng.
![Page 74: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/74.jpg)
Prakash 2014 VT CS 5614 74
Outline
§ two-‐way merge sort § external merge sort § fine-‐tunings § B+ trees for sor5ng
![Page 75: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/75.jpg)
Prakash 2014 VT CS 5614 75
2-‐Way Sort: Requires 3 Buffers § Pass 0: Read a page, sort it, write it.
– only one buffer page is used
§ Pass 1, 2, 3, …, etc.: requires 3 buffer pages – merge pairs of runs into runs twice as long – three buffer pages used.
Main memory buffers
INPUT 1
INPUT 2
OUTPUT
Disk Disk
![Page 76: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/76.jpg)
Prakash 2014 VT CS 5614 76
Two-‐Way External Merge Sort
§ Each pass we read + write each page in file.
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS 0
PASS 1
PASS 2
PASS 3
9
3,4 6,2 9,4 8,7 5,6 3,1 2
3,4 5,6 2,6 4,9 7,8 1,3 2
2,3 4,6
4,7 8,9
1,3 5,6 2
2,3 4,4 6,7 8,9
1,2 3,5 6
1,2 2,3 3,4 4,5 6,6 7,8
![Page 77: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/77.jpg)
Prakash 2014 VT CS 5614 77
Two-‐Way External Merge Sort
§ Each pass we read + write each page in file.
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS 0
PASS 1
PASS 2
PASS 3
9
3,4 6,2 9,4 8,7 5,6 3,1 2
3,4 5,6 2,6 4,9 7,8 1,3 2
2,3 4,6
4,7 8,9
1,3 5,6 2
2,3 4,4 6,7 8,9
1,2 3,5 6
1,2 2,3 3,4 4,5 6,6 7,8
![Page 78: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/78.jpg)
Prakash 2014 VT CS 5614 78
Two-‐Way External Merge Sort
§ Each pass we read + write each page in file.
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS 0
PASS 1
PASS 2
PASS 3
9
3,4 6,2 9,4 8,7 5,6 3,1 2
3,4 5,6 2,6 4,9 7,8 1,3 2
2,3 4,6
4,7 8,9
1,3 5,6 2
2,3 4,4 6,7 8,9
1,2 3,5 6
1,2 2,3 3,4 4,5 6,6 7,8
![Page 79: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/79.jpg)
Prakash 2014 VT CS 5614 79
Two-‐Way External Merge Sort
§ Each pass we read + write each page in file.
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS 0
PASS 1
PASS 2
PASS 3
9
3,4 6,2 9,4 8,7 5,6 3,1 2
3,4 5,6 2,6 4,9 7,8 1,3 2
2,3 4,6
4,7 8,9
1,3 5,6 2
2,3 4,4 6,7 8,9
1,2 3,5 6
1,2 2,3 3,4 4,5 6,6 7,8
![Page 80: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/80.jpg)
Prakash 2014 VT CS 5614 80
Two-‐Way External Merge Sort
§ Each pass we read + write each page in file.
§ N pages in the file => § So total cost is: § Idea: Divide and conquer:
sort subfiles and merge
⎡ ⎤= +log2 1N
⎡ ⎤( )2 12N Nlog +
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS 0
PASS 1
PASS 2
PASS 3
9
3,4 6,2 9,4 8,7 5,6 3,1 2
3,4 5,6 2,6 4,9 7,8 1,3 2
2,3 4,6
4,7 8,9
1,3 5,6 2
2,3 4,4 6,7 8,9
1,2 3,5 6
1,2 2,3 3,4 4,5 6,6 7,8
![Page 81: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/81.jpg)
Prakash 2014 VT CS 5614 81
External merge sort
B > 3 buffers § Q1: how to sort? § Q2: cost?
![Page 82: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/82.jpg)
Prakash 2014 VT CS 5614 82
General External Merge Sort
B Main memory buffers Disk Disk
. . . . . .
B>3 buffer pages. How to sort a file with N pages?
. . .
![Page 83: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/83.jpg)
Prakash 2014 VT CS 5614 83
General External Merge Sort
– Pass 0: use B buffer pages. Produce sorted runs of B pages each.
– Pass 1, 2, …, etc.: merge B-‐1 runs.
⎡ ⎤N B/
B Main memory buffers
INPUT 1
INPUT B-1
OUTPUT
Disk Disk
INPUT 2
. . . . . . . . .
![Page 84: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/84.jpg)
Prakash 2014 VT CS 5614 84
Sor7ng
– create sorted runs of size B (how many?) – merge them (how?)
B
... ...
![Page 85: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/85.jpg)
Prakash 2014 VT CS 5614 85
Sor7ng
– create sorted runs of size B – merge first B-‐1 runs into a sorted run of (B-‐1) *B, ...
B
... ... …..
![Page 86: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/86.jpg)
Prakash 2014 VT CS 5614 86
Sor7ng
– How many steps we need to do? ‘i’, where B*(B-‐1)^i > N – How many reads/writes per step? N+N
B
... ... …..
![Page 87: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/87.jpg)
Prakash 2014 VT CS 5614 87
Cost of External Merge Sort
§ Number of passes: § Cost = 2N * (# of passes)
⎡ ⎤⎡ ⎤1 1+ −log /B N B
![Page 88: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/88.jpg)
Prakash 2014 VT CS 5614 88
Cost of External Merge Sort
§ E.g., with 5 buffer pages, to sort 108 page file: – Pass 0: = 22 sorted runs of 5 pages each (last run is only 3 pages)
– Pass 1: = 6 sorted runs of 20 pages each (last run is only 8 pages)
– Pass 2: 2 sorted runs, 80 pages and 28 pages – Pass 3: Sorted file of 108 pages
⎡ ⎤108 5/
⎡ ⎤22 4/
Formula check: ┌log4 22┐= 3 … + 1 à 4 passes √
![Page 89: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/89.jpg)
Prakash 2014 VT CS 5614 89
Number of Passes of External Sort
N B=3 B=5 B=9 B=17 B=129 B=257 100 7 4 3 2 1 1 1,000 10 5 4 3 2 2 10,000 13 7 5 4 2 2 100,000 17 9 6 5 3 3 1,000,000 20 10 7 5 3 3 10,000,000 23 12 8 6 4 3 100,000,000 26 14 9 7 4 4 1,000,000,000 30 15 10 8 5 4
( I/O cost is 2N 5mes number of passes)
![Page 90: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/90.jpg)
Prakash 2014 VT CS 5614 90
§ Quicksort is a fast way to sort in memory.
Internal Sort Algorithm
![Page 91: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/91.jpg)
Prakash 2014 VT CS 5614 91
Blocked I/O & double-‐buffering
§ So far, we assumed random disk access § Cost changes, if we consider that runs are wriyen (and read) sequen5ally
§ What could we do to exploit it?
![Page 92: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/92.jpg)
Prakash 2014 VT CS 5614 92
Blocked I/O & double-‐buffering
§ So far, we assumed random disk access § Cost changes, if we consider that runs are wriyen (and read) sequen5ally
§ What could we do to exploit it? § A1: Blocked I/O (exchange a few r.d.a for several sequen5al ones)
§ A2: double-‐buffering
![Page 93: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/93.jpg)
Double Buffering § To reduce wait 5me for I/O request to complete, can prefetch into `shadow block’. – Poten5ally, more passes; in prac5ce, most files s5ll sorted in 2-‐3 passes.
OUTPUT
OUTPUT'
Disk Disk
INPUT 1
INPUT k
INPUT 2
INPUT 1'
INPUT 2'
INPUT k'
block size b
Prakash 2014 VT CS 5614 93
![Page 94: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/94.jpg)
Prakash 2014 VT CS 5614 94
Using B+ Trees for Sor7ng
§ Scenario: Table to be sorted has B+ tree index on sor5ng column(s).
§ Idea: Can retrieve records in order by traversing leaf pages.
§ Is this a good idea? § Cases to consider:
– B+ tree is clustered – B+ tree is not clustered
![Page 95: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/95.jpg)
Prakash 2014 VT CS 5614 95
Using B+ Trees for Sor7ng
§ Scenario: Table to be sorted has B+ tree index on sor5ng column(s).
§ Idea: Can retrieve records in order by traversing leaf pages.
§ Is this a good idea? § Cases to consider:
– B+ tree is clustered Good idea! – B+ tree is not clustered Could be a very bad idea!
![Page 96: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/96.jpg)
Prakash 2014 VT CS 5614 96
Clustered B+ Tree Used for Sor7ng
§ Cost: root to the le]-‐most leaf, then retrieve all leaf pages (Alterna5ve 1)
Always beyer than external sor5ng!
(Directs search)
Data Records
Index
Data Entries ("Sequence set")
![Page 97: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/97.jpg)
Prakash 2014 VT CS 5614 97
Unclustered B+ Tree Used for Sor7ng
§ Alterna5ve (2) for data entries; each data entry contains rid of a data record. In general, one I/O per data record!
(Directs search)
Data Records
Index
Data Entries ("Sequence set")
![Page 98: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/98.jpg)
Prakash 2014 VT CS 5614 98
External Sor7ng vs. Unclustered Index
p: # of records per page B=1,000 and block size=32 for sor-ng p=100 is the more realis-c value.
N Sorting p=1 p=10 p=100 100 200 100 1,000 10,000 1,000 2,000 1,000 10,000 100,000 10,000 40,000 10,000 100,000 1,000,000 100,000 600,000 100,000 1,000,000 10,000,000 1,000,000 8,000,000 1,000,000 10,000,000 100,000,000 10,000,000 80,000,000 10,000,000 100,000,000 1,000,000,000
![Page 99: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’](https://reader034.fdocuments.us/reader034/viewer/2022042104/5e81eec9426cfc64c05e3fb5/html5/thumbnails/99.jpg)
Prakash 2014 VT CS 5614 99
Summary
§ External sor5ng is important § External merge sort minimizes disk I/O cost:
– Pass 0: Produces sorted runs of size B (# buffer pages).
– Later passes: merge runs.
§ Clustered B+ tree is good for sor5ng; unclustered tree is usually very bad.