Adapting Hash Table Design for Real-Life Dataset Budapest University of Technology and Economics, Hungary Department of Automation and Applied Informatics Sándor Juhász, Ákos Dudás IADIS Multi Conference on Computer Science and Information Systems 2009


Adapting Hash Table Design for Real-Life Dataset

Budapest University of Technology and Economics, Hungary
Department of Automation and Applied Informatics

Sándor Juhász, Ákos Dudás

IADIS Multi Conference on Computer Science and Information Systems 2009


Contents

1. Data transformation
2. Hash tables:
   - types and variations of hash tables
   - refined definitions
3. Inputs and hash functions
4. Performance of open hash tables
5. Performance of bucket tables
6. Summary


Data transformation – reason for hash tables

• data transformation: converts data from a source format into a destination format

• hash tables are beneficial:
  – allow data transformation in nearly one step;
  – fast, and still compact in memory;
  – many existing implementations;
  – easy to implement, easy to customize

Or are they? We will see…


Data transformation – reason for hash tables

• in the particular case:
  – web log processing, recorded activity of Internet users for hundreds of web portals
  – unique ID for the users on the website
  – the ID is long, 40 hexadecimal digits (20 bytes)
  – ID used very frequently

• transform IDs to 4 bytes to save memory and storage space
  – important constraint: the transformation must retain uniqueness of the values
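The constraint above (shrink each 20-byte ID to 4 bytes without ever mapping two IDs to the same value) can be illustrated with a minimal sketch. It assumes the compact values are simply assigned as consecutive integers in order of first appearance; the slides do not say how the 4-byte values are chosen, and the class name `IdCompactor` and the sample IDs are hypothetical. Python's built-in `dict` stands in for the hash table.

```python
class IdCompactor:
    """Maps long user IDs to compact 32-bit integers, preserving uniqueness."""

    MAX_ID = 2**32 - 1  # largest value that fits in 4 bytes

    def __init__(self):
        self._table = {}  # Python's dict is itself a hash table

    def compact(self, hex_id: str) -> int:
        key = bytes.fromhex(hex_id)  # 40 hex digits -> 20 bytes
        if len(key) != 20:
            raise ValueError("expected a 40-digit hexadecimal ID")
        short = self._table.get(key)
        if short is None:
            # Assumption: the next unused integer becomes the compact ID.
            short = len(self._table)
            if short > self.MAX_ID:
                raise OverflowError("more than 2**32 distinct IDs")
            self._table[key] = short
        return short


compactor = IdCompactor()
first = compactor.compact("00112233445566778899aabbccddeeff00112233")
second = compactor.compact("ffeeddccbbaa99887766554433221100ffeeddcc")
again = compactor.compact("00112233445566778899aabbccddeeff00112233")
```

Each lookup-or-insert is a single hash table operation, which is what makes the transformation run in "nearly one step" per record.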


Hash tables in general – refined definitions

• open hashing: “… all items are stored within the hash table” (NIST)

• proposed definition: “…only one item can be assigned to any slot of the hash table”
  – more permissive

[Figure: two open hash table layouts: a table whose slots hold items directly (some slots empty), and a table whose slots hold pointers to items]
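As an illustration of the proposed definition (at most one item per slot), here is a minimal sketch of an open hash table that stores the items in the table itself and resolves collisions by linear probing. The class name `LinearProbingTable` is hypothetical, Python's built-in `hash` stands in for the hash function, and deletion and resizing are left out.

```python
class LinearProbingTable:
    """Open hash table: each slot holds at most one (key, value) item."""

    def __init__(self, capacity: int):
        self._slots = [None] * capacity  # None marks an empty slot

    def _probe(self, key):
        """Yield slot indices from the home slot onward, one step at a time."""
        capacity = len(self._slots)
        home = hash(key) % capacity
        for step in range(capacity):
            yield (home + step) % capacity

    def put(self, key, value):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None or slot[0] == key:
                self._slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None:       # an empty slot ends the search path
                return default
            if slot[0] == key:
                return slot[1]
        return default

    def slot_keys(self):
        return [None if s is None else s[0] for s in self._slots]


table = LinearProbingTable(8)
for k in (1, 9, 17):               # all three keys share home slot 1
    table.put(k, k * 10)
```

The three colliding keys end up in the neighboring slots 1, 2 and 3, so a search for the last one inspects three slots.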


Hash tables in general – refined definitions

• chaining: “… linked lists handle collision in a hash table” (NIST)

• proposed definition: “… allowing more items to be assigned to any slot of the hash table”
  – more permissive
  – may also use array, not just linked list
  – rather call this bucket hashing

[Figure: two bucket hash table layouts: a table of pointers to buckets built from linked items, and a table of pointers to buckets stored as arrays of items]
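A minimal sketch of bucket hashing with array buckets, following the proposed definition (several items may be assigned to one slot). The class name `ArrayBucketTable` is hypothetical and Python's built-in `hash` stands in for the hash function; a Python list plays the role of the array.

```python
class ArrayBucketTable:
    """Bucket hash table: every slot refers to a growable array of items."""

    def __init__(self, slot_count: int):
        self._buckets = [[] for _ in range(slot_count)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))        # otherwise extend the array

    def get(self, key, default=None):
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        return default

    def bucket_lengths(self):
        return [len(b) for b in self._buckets]


buckets = ArrayBucketTable(4)
for k in (1, 5, 9, 2):                     # 1, 5 and 9 share slot 1
    buckets.put(k, str(k))
```

Colliding items stay together in one contiguous array, whereas a linked list would place each of them in a separately allocated node.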


Types and variations of hash tables

hash tables
  open hash
    linear probing: item table, pointer table
    linear double hashing: item table, pointer table
    quadratic quotient: item table, pointer table
  bucket hash
    linked lists: item table, pointer table
    arrays

• reasons for the difference:
  – length of search path
  – number of indirections
  – memory alignment


Types and variations of hash tables – item table with linear probing

[Figure: structure of the hash table (slots holding key | value items, some empty) and alignment of the hash table in memory (items 1, 2, 3, 4, … stored consecutively across cache lines, with numbered markers 1 to 3)]


Types and variations of hash tables – item table with linear double hashing/quadratic quotient

[Figure: structure of the hash table (slots holding key | value items, some empty) and alignment of the hash table in memory (items 1 to 6, … across cache lines, with numbered markers 1 to 3)]
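The effect of memory alignment on the two item-table variants can be sketched with a little arithmetic. The numbers below are assumptions, not values from the slides: 24-byte items (a 20-byte key plus a 4-byte value), 64-byte cache lines, a table starting on a cache line boundary, and a fixed stride of 97 standing in for the value of a second hash function.

```python
ITEM_SIZE = 24   # assumed: 20-byte key plus 4-byte value
LINE_SIZE = 64   # assumed cache line size in bytes


def cache_lines_touched(slots, item_size=ITEM_SIZE, line_size=LINE_SIZE):
    """Count the distinct cache lines covered by the items at the given slots.

    Assumes the table starts at a cache-line-aligned address.
    """
    lines = set()
    for slot in slots:
        first = (slot * item_size) // line_size
        last = (slot * item_size + item_size - 1) // line_size
        lines.update(range(first, last + 1))
    return len(lines)


def linear_probe(home, probes, capacity):
    """Slots visited by linear probing: home, home + 1, home + 2, ..."""
    return [(home + i) % capacity for i in range(probes)]


def double_hash_probe(home, stride, probes, capacity):
    """Slots visited by double hashing with a key-dependent stride."""
    return [(home + i * stride) % capacity for i in range(probes)]


linear = cache_lines_touched(linear_probe(10, 4, 1024))
scattered = cache_lines_touched(double_hash_probe(10, 97, 4, 1024))
```

With these assumed sizes, four linear probes starting at slot 10 stay within 3 cache lines, while four probes with stride 97 touch 6, although both searches take the same number of steps.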


Types and variations of hash tables – pointer table with array

[Figure: structure of the hash table (a table of pointers, each referring to a bucket stored as a length field followed by key | value items) and alignment of the hash table in memory (the pointer table and buckets 1 to n, each a length field followed by its items, across cache lines, with numbered markers 1 to 5)]


Types and variations of hash tables – pointer table with list

[Figure: structure of the hash table (a table of pointers, each referring to a linked list of key | value | ptr nodes) and alignment of the hash table in memory (list items belonging to different buckets spread across cache lines, with numbered markers 1 to 4)]


Types and variations of hash tables – item table with list

[Figure: structure of the hash table (a table of key | value | ptr items, each heading a linked list of further key | value | ptr nodes) and alignment of the hash table in memory (items belonging to different buckets spread across cache lines, with numbered markers 1 to 3)]


Inputs and hash functions

• First point of optimization: hash function
• General purpose hash functions
  – Custom hash function
  – FNV (Fowler/Noll/Vo), widespread use
  – Jenkins hash function
• The distribution of the output of the general purpose hash functions is unknown on real-life input.

[Figure: three input distributions: uniform, “bumpy”, real-life]
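To make the role of the hash function concrete, here is a minimal sketch of a 32-bit FNV hash together with a helper that counts how many keys land in each slot. The slides do not say which FNV variant was used; FNV-1a is assumed here. The `first_byte_hash` function is a hypothetical example of an inferior hash, not the custom function from the talk, and the sample keys are made up.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def first_byte_hash(data: bytes) -> int:
    """Hypothetical inferior hash: ignores everything after the first byte."""
    return data[0]


def slot_loads(keys, slot_count, hash_fn=fnv1a_32):
    """Return how many keys map to each slot of a table with slot_count slots."""
    loads = [0] * slot_count
    for key in keys:
        loads[hash_fn(key) % slot_count] += 1
    return loads


# Made-up keys that all share the same leading byte.
keys = [bytes([7]) + i.to_bytes(3, "big") for i in range(100)]
skewed = slot_loads(keys, 16, first_byte_hash)
spread = slot_loads(keys, 16)
```

With the hypothetical first-byte hash, all 100 keys collapse into a single slot. Comparing `spread` with `skewed` on real keys is one way to check how a hash function behaves on a given input.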


Performance of open hash tables – uniform and “bumpy” inputs

• “bumpy” input distribution: similar results
• best: linear probing, Jenkins or the custom hash function

[Figure: uniform distribution; search time [sec] versus reserved memory [MB] for item-table-linear-probing, item-table-linear-double-hashing and item-table-quadratic-quotient, each with the simple, FNV and Jen (Jenkins) hash functions]


Performance of open hash tables – real-life input

• with an inferior hash function: not able to operate
• best: linear probing, Jenkins hash function

[Figure: search time [sec] versus reserved memory [MB] for item-table-linear-probing, item-table-linear-double-hashing and item-table-quadratic-quotient, each with the simple, FNV and Jen (Jenkins) hash functions]


Performance of open hash tables – real-life input

The step counts of the algorithms do not explain the observed difference in performance, but the number of L2 cache misses does.

[Figure, two panels: average step count per search versus reserved memory [MB]; L2 cache misses [billion pcs] versus reserved memory [MB]]
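The "average step count per search" metric can be sketched for linear probing as follows. This is an assumed definition (the number of slots inspected per successful search, averaged over all stored keys); the slides do not spell out how steps were counted. Cache misses cannot be measured this way and would need hardware performance counters, which this sketch does not cover.

```python
def average_steps(keys, capacity):
    """Mean number of slots inspected per successful search (linear probing)."""
    if len(set(keys)) > capacity:
        raise ValueError("more distinct keys than slots")

    # Build the table by inserting every key with linear probing.
    slots = [None] * capacity
    for key in keys:
        i = hash(key) % capacity
        while slots[i] is not None and slots[i] != key:
            i = (i + 1) % capacity
        slots[i] = key

    # Search for every key again and count the slots inspected.
    total = 0
    for key in keys:
        i = hash(key) % capacity
        steps = 1
        while slots[i] != key:
            i = (i + 1) % capacity
            steps += 1
        total += steps
    return total / len(keys)


no_collisions = average_steps([1, 2, 3], 8)     # every key sits in its home slot
one_cluster = average_steps([1, 9, 17], 8)      # all keys share home slot 1
```

A step count like this treats every inspected slot as equally expensive, which is why it can miss differences caused by where the inspected slots sit in memory.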


Performance of bucket hash tables – uniform and “bumpy” inputs

• “bumpy” input distribution: similar results
• best: item-table-with-list, Jenkins or the custom hash function

[Figure: uniform distribution; search time [sec] versus reserved memory [MB] for item-table-with-list, pointer-table-with-array and pointer-table-with-list, each with the simple, FNV and Jen (Jenkins) hash functions]


Performance of bucket hash tables – real-life input

• best: pointer-table-with-array, because it is less sensitive to the hash function

[Figure: search time [sec] versus reserved memory [MB] for item-table-with-list, pointer-table-with-array and pointer-table-with-list, each with the simple, FNV and Jen (Jenkins) hash functions]


Performance of bucket hash tables – real-life input

The step counts of the algorithms do not explain the observed difference in performance, but the number of L2 cache misses does.

[Figure, two panels: average step count per search versus reserved memory [MB]; L2 cache misses [billion pcs] versus reserved memory [MB]]


Summary

• Significance of hashing in fast data transformation
• New definitions for hash table types
• Introduction of additional hash structures with various memory layouts
• Multiple inputs and hash functions
• Robustness criterion
• Open hash tables: linear probing is the fastest; all variants are unable to handle real-life input with an inferior hash function
• Bucket hash tables: arrays are favorable because they are robust and not sensitive to the hash function
• Verified using real-life input


Questions?

Adapting Hash Table Design for Real-Life Dataset

Sándor Juhász