Adapting Hash Table Design for Real-Life Dataset
Budapest University of Technology and Economics, Hungary
Department of Automation and Applied Informatics
Sándor Juhász, Ákos Dudás
IADIS Multi Conference on Computer Science and Information Systems 2009
Contents
1. Data transformation
2. Hash tables:
- types and variations of hash tables
- refined definitions
3. Inputs and hash functions
4. Performance of open hash tables
5. Performance of bucket tables
6. Summary
Data transformation – reason for hash tables
• data transformation: converts data from a source data format into a destination format
• hash tables are beneficial
– allow data transformation in nearly 1 step;
– fast, and still compact in memory;
– many existing implementations;
– easy to implement, easy to customize
or not? We will see…
Data transformation – reason for hash tables
• in the particular case:
– web log processing, recorded activity of Internet users for hundreds of web portals
– unique ID for the users on the website
– the ID is long, 40 hexadecimal digits (20 bytes)
– ID used very frequently
• transform IDs to 4 bytes to save memory and storage space
– important constraint: the transformation must retain uniqueness of the values
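The transformation described on this slide can be sketched as a lookup table that hands out consecutive 4-byte integers. This is only an illustration: Python's built-in dict stands in for the hash table, and the names `IdCompressor` and `compress` are hypothetical, not taken from the talk.

```python
class IdCompressor:
    """Maps 20-byte IDs (40 hex digits) to consecutive 4-byte integers."""

    MAX_IDS = 2 ** 32  # a 4-byte value can label at most this many IDs

    def __init__(self):
        self._table = {}  # assumption: dict plays the role of the hash table

    def compress(self, hex_id: str) -> int:
        key = bytes.fromhex(hex_id)  # 40 hex digits -> 20 raw bytes
        if len(key) != 20:
            raise ValueError("expected 40 hexadecimal digits")
        short_id = self._table.get(key)
        if short_id is None:
            if len(self._table) >= self.MAX_IDS:
                raise OverflowError("ran out of 4-byte identifiers")
            # The next unused integer keeps the mapping one-to-one.
            short_id = len(self._table)
            self._table[key] = short_id
        return short_id
```

Because every new ID receives the next unused integer, two different long IDs can never share a short one, which is the uniqueness constraint named above. A plain 32-bit hash of the ID would not give that guarantee, since hash values can collide.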
Hash tables in general – refined definitions
• open hashing: “… all items are stored within the hash table” (NIST)
• proposed definition: “… only one item can be assigned to any slot of the hash table”
– more permissive
[Figure: an item table whose slots hold an item or are empty, next to a pointer table whose slots hold pointers to items stored outside the table]
Hash tables in general – refined definitions
• chaining: “… linked lists handle collision in a hash table” (NIST)
• proposed definition: “… allowing more items to be assigned to any slot of the hash table”
– more permissive
– may also use array, not just linked list
– rather call this bucket hashing
[Figure: a table of bucket pointers leading to linked items, next to a table of bucket pointers leading to arrays of items]
Types and variations of hash tables
[Figure: taxonomy of hash tables]
• open hash
– linear probing: item table, pointer table
– linear double hashing: item table, pointer table
– quadratic quotient: item table, pointer table
• bucket hash
– linked lists: item table, pointer table
– arrays
• reasons for the difference:
– length of search path
– number of indirections
– memory alignment
Types and variations of hash tables – item table with linear probing
[Figure: structure of the hash table: slots that hold a key | value pair or are empty]
[Figure: alignment of the hash table in memory: items 1, 2, 3, 4, … stored consecutively across cache lines, with access steps numbered 1, 2, 3]
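An item table with linear probing can be sketched as follows. This is a minimal illustration rather than the implementation measured in the talk: the class name, the `hash_fn` parameter and the returned step count are assumptions added for demonstration, and a Python list only imitates the contiguous layout a native array would have.

```python
class LinearProbingTable:
    """Open hash table: each slot holds at most one (key, value) item."""

    def __init__(self, capacity, hash_fn=hash):
        self.slots = [None] * capacity  # None marks an empty slot
        self.hash_fn = hash_fn
        self.count = 0

    def _find(self, key):
        # Start at the home slot and walk to the next slot until the key
        # or an empty slot is found. One slot is always kept empty, so
        # the loop terminates.
        i = self.hash_fn(key) % len(self.slots)
        steps = 1
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
            steps += 1
        return i, steps

    def insert(self, key, value):
        i, _ = self._find(key)
        if self.slots[i] is None:
            if self.count + 1 >= len(self.slots):
                raise RuntimeError("table full")
            self.count += 1
        self.slots[i] = (key, value)

    def search(self, key):
        """Returns (value or None, number of slots inspected)."""
        i, steps = self._find(key)
        entry = self.slots[i]
        return (entry[1] if entry is not None else None), steps
```

With `hash_fn=lambda k: k` and a capacity of 8, the keys 1, 9 and 17 all start at slot 1 and end up in slots 1, 2 and 3, so a search for 17 inspects three neighbouring slots.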
Types and variations of hash tables – item table with linear double hashing/quadratic quotient
[Figure: structure of the hash table: slots that hold a key | value pair or are empty]
[Figure: alignment of the hash table in memory: items 1 to 6, … stored consecutively across cache lines, with access steps 1, 2, 3 landing on non-adjacent items]
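The difference in memory access pattern can be sketched by comparing probe sequences. The double hashing below is the textbook form with a step derived from the quotient of the hash value; it is an assumption and may differ from the exact linear double hashing and quadratic quotient variants used in the talk. The 24-byte slot (20-byte key plus 4-byte value, no padding) and the 64-byte cache line are also assumed values.

```python
def linear_probes(h, m, count):
    """First `count` slots visited by linear probing for hash value h."""
    start = h % m
    return [(start + j) % m for j in range(count)]


def double_hash_probes(h, m, count):
    """First `count` slots visited by textbook double hashing.

    m should be prime so that every step size in 1..m-1 visits all slots.
    """
    start = h % m
    step = 1 + (h // m) % (m - 1)  # second hash taken from the quotient
    return [(start + j * step) % m for j in range(count)]


def cache_lines_touched(slots, slot_bytes=24, line_bytes=64):
    """Set of cache-line indexes covered by the given slots."""
    lines = set()
    for s in slots:
        first = s * slot_bytes
        last = first + slot_bytes - 1
        lines.update(range(first // line_bytes, last // line_bytes + 1))
    return lines
```

For a table of 1009 slots and the hash value 5000, three linear probes visit slots 964, 965, 966 and stay within two cache lines, while three double hashing probes visit slots 964, 969, 974 and touch three different lines. Both searches take the same number of steps, yet they load different amounts of memory.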
Types and variations of hash tables – pointer table with array
[Figure: structure of the hash table: a table of pointers, each leading to a bucket array that stores its length followed by key | value items]
[Figure: alignment of the hash table in memory: the pointer table, then buckets 1 to n, each laid out contiguously as length, 1. item, 2. item, 3. item across cache lines, with access steps numbered 1 to 5]
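A pointer table with array buckets can be sketched as below. This is a hedged illustration only: the class name and the `hash_fn` parameter are hypothetical, and a Python list plays the role of the length-prefixed item array shown in the figure.

```python
class ArrayBucketTable:
    """Bucket hash table: each slot points to an array of (key, value) items."""

    def __init__(self, bucket_count, hash_fn=hash):
        self.buckets = [None] * bucket_count  # None means no bucket allocated yet
        self.hash_fn = hash_fn

    def insert(self, key, value):
        i = self.hash_fn(key) % len(self.buckets)
        bucket = self.buckets[i]
        if bucket is None:
            self.buckets[i] = [(key, value)]  # allocate the bucket on first use
            return
        for pos, (k, _) in enumerate(bucket):
            if k == key:
                bucket[pos] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # colliding items sit side by side

    def search(self, key):
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        if bucket is None:
            return None
        for k, v in bucket:  # sequential scan over one contiguous array
            if k == key:
                return v
        return None
```

A search follows one pointer from the table to the bucket and then scans items stored next to each other, whereas a linked-list bucket follows one more pointer per item.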
Types and variations of hash tables – pointer table with list
[Figure: structure of the hash table: a table of pointers, each leading to a linked list of key | value | ptr nodes]
[Figure: alignment of the hash table in memory: the pointer table, then list nodes of different buckets (1. bucket, m. bucket, n. bucket, k. bucket, …) interleaved across cache lines, with access steps numbered 1 to 4]
Types and variations of hash tables – item table with list
[Figure: structure of the hash table: a table whose slots hold key | value | ptr items directly, each ptr leading to further key | value | ptr nodes]
[Figure: alignment of the hash table in memory: the item table, then list nodes of different buckets (1. bucket, m. bucket, n. bucket, k. bucket, …) interleaved across cache lines, with access steps numbered 1 to 3]
Inputs and hash functions
• First point of optimization: hash function
• General purpose hash functions
– Custom hash function
– FNV (Fowler/Noll/Vo), widespread use
– Jenkins hash function
• The distributions of the output of the general purpose hash functions are unknown on real-life input.
[Figure: three input distributions: uniform, “bumpy”, real-life]
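For reference, two of the general purpose functions named above can be sketched in a few lines. The talk does not say which FNV or Jenkins variant was used, so the choice of 32-bit FNV-1a and of Jenkins's one-at-a-time function here is an assumption; the custom hash function is not described in the slides and is not reproduced.

```python
MASK32 = 0xFFFFFFFF  # keep intermediate results within 32 bits


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & MASK32  # FNV prime
    return h


def jenkins_one_at_a_time(data: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash, 32-bit."""
    h = 0
    for byte in data:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h
```

A slot index is then obtained as, for example, `fnv1a_32(key) % table_size`. How evenly those indexes spread depends on the keys, which is why the functions are compared on several input distributions.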
Performance of open hash tables – uniform and “bumpy” inputs
“bumpy” input distribution: similar
best: linear probing, Jenkins or the custom hash function
[Figure: uniform distribution; search time [sec] versus reserved memory [MB]; series: item-table-linear-probing, item-table-linear-double-hashing and item-table-quadratic-quotient, each with the simple, FNV and Jen hash functions]
Performance of open hash tables – real-life input
with inferior hash function: not able to operate
best: linear probing, Jenkins hash function
[Figure: search time [sec] versus reserved memory [MB]; series: item-table-linear-probing, item-table-linear-double-hashing and item-table-quadratic-quotient, each with the simple, FNV and Jen hash functions]
Performance of open hash tables – real-life input
The step count of the algorithms does not confirm the observed difference in performance, but the number of L2 cache misses does.
[Figure: average step count per search versus reserved memory [MB]]
[Figure: L2 cache misses [billion pcs] versus reserved memory [MB]]
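The step count plotted in the first chart can be measured along the following lines. This is a sketch under stated assumptions: the function name `average_steps` is hypothetical, the table uses linear probing, the keys are distinct, and no results from the talk are reproduced. Cache misses cannot be observed this way; they require hardware performance counters.

```python
def average_steps(keys, capacity, hash_fn):
    """Average number of slots inspected per successful search.

    Builds a linear probing table from distinct keys, then looks every
    key up again and counts the slots visited.
    """
    if not keys or len(keys) >= capacity:
        raise ValueError("need at least one key and one free slot")
    slots = [None] * capacity
    for key in keys:
        i = hash_fn(key) % capacity
        while slots[i] is not None and slots[i] != key:
            i = (i + 1) % capacity  # collision: try the next slot
        slots[i] = key
    total = 0
    for key in keys:
        i = hash_fn(key) % capacity
        steps = 1
        while slots[i] != key:
            i = (i + 1) % capacity
            steps += 1
        total += steps
    return total / len(keys)
```

With a hash function that spreads the keys perfectly, the average is 1.0. With a degenerate function that sends four keys to the same slot, the searches take 1, 2, 3 and 4 steps, so the average is 2.5.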
Performance of bucket hash tables – uniform and “bumpy” inputs
“bumpy” input distribution: similar
best: item-table-with-list, Jenkins or the custom hash function
[Figure: uniform distribution; search time [sec] versus reserved memory [MB]; series: item-table-with-list, pointer-table-with-array and pointer-table-with-list, each with the simple, FNV and Jen hash functions]
Performance of bucket hash tables – real-life input
best: pointer-table-with-array, because it is not that sensitive to the hash function
[Figure: search time [sec] versus reserved memory [MB]; series: item-table-with-list, pointer-table-with-array and pointer-table-with-list, each with the simple, FNV and Jen hash functions]
Performance of bucket hash tables – real-life input
The step count of the algorithms does not confirm the observed difference in performance, but the number of L2 cache misses does.
[Figure: average step count per search versus reserved memory [MB]]
[Figure: L2 cache misses [billion pcs] versus reserved memory [MB]]
Summary
• Significance of hashing in fast data transformation
• New definitions for hash table types
• Introduction of additional hash structures with various memory layouts
• Multiple inputs and hash functions
• Robustness criterion
• Open hash tables: linear probing is the fastest; all variants are unable to handle real-life input with an inferior hash function
• Bucket hash tables: arrays are favorable because they are robust and not sensitive to the hash function
• Verified using real-life input
Questions?
Adapting Hash Table Design for Real-Life Dataset
Sándor Juhász