COMP 103 Hashing 2013-T2 Lecture 28 Thomas Kuehne School of Engineering and Computer Science, Victoria University of Wellington Marcus Frean, Lindsay Groves, Peter Andreae and Thomas Kuehne, VUW




RECAP-TODAY

RECAP: Linked structures, including trees and heaps
achieved perfect O(log n) insert/find performance

TODAY: Mind-blowingly fast sorting
O(1) insert/find performance!


Linear Time Sorting Algorithm

Constant time per entry to sort

private int[] hashSort(int[] numbers) {
    // present[v] counts how often value v occurs; values must lie in 0..6
    int[] present = new int[7];
    for (int i = 0; i < numbers.length; i++)
        present[numbers[i]]++;
    return present;
}

Limitations:
    elements must be integers
    element value range must be limited
    frequency data structure may be sparsely populated

[Figure: the input array numbers = 5, 3, 5, 2, 6, 1 and the array present, indexed 0 to 6, holding the count of each value]

cf. BucketSort
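The slide's code only builds the frequency table. A minimal sketch of the complete sort, assuming every value lies in 0..maxValue, reads the table back in index order. The class name `CountingSortSketch`, the method name `countingSort` and the `maxValue` parameter are illustrative and not from the lecture.

```java
import java.util.Arrays;

class CountingSortSketch {
    // Sorts values known to lie in 0..maxValue (an assumption of this sketch).
    static int[] countingSort(int[] numbers, int maxValue) {
        int[] present = new int[maxValue + 1];
        for (int n : numbers)
            present[n]++;                       // constant time per entry

        // Walk the frequency table in index order and emit each value
        // as many times as it was counted.
        int[] sorted = new int[numbers.length];
        int next = 0;
        for (int value = 0; value < present.length; value++)
            for (int c = 0; c < present[value]; c++)
                sorted[next++] = value;
        return sorted;
    }

    public static void main(String[] args) {
        int[] sorted = countingSort(new int[] {5, 3, 5, 2, 6, 1}, 6);
        System.out.println(Arrays.toString(sorted)); // prints [1, 2, 3, 5, 5, 6]
    }
}
```

The read-back loop visits every cell of the table, so the total cost grows with the size of the value range as well as with the number of entries. That is why the limited value range matters.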


Hashing

Fixing the limitations:
    convert the element into an integer
    use a hash function to assign an integer to an element

Potential: Sets, Bags and Maps with constant time insert / find!

Challenges:
    how to compute the hash code?
    how to deal with collisions?


O(1) Sets with big values?

We need a way to compute an index for an object: add(“2001 – A Space Odyssey”)

“Hashing”: compute the “hash code” of an object

[Figure: a boolean array indexed 0 … N; the hash function maps “2001 – A Space Odyssey” to index 581]


O(1) Sets with big values?

But there are too many possible film titles!

Suppose the hash function always produces a number between 0 and 1000
⇒ some film titles must end up with the same number!
⇒ “Collision”

[Figure: “Gravity” and “2001 – A Space Odyssey” are both passed through HASH and land in the same cell of the array 0 … N]


Detecting collisions

Store the item in the array, instead of a boolean

Questions
1. How to choose a hash function that minimises collisions?
2. How to manage collisions when they occur?

[Figure: the array 0 … N now stores items; “Gravity” and “2001 – A Space Odyssey” are both hashed into it]
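A minimal sketch of why storing the item helps: with a boolean per cell, an occupied cell cannot tell "this item is already here" apart from "a different item hashed here". The class `ItemTable` and its methods are hypothetical names for illustration, and the sketch simply refuses to store a colliding item rather than resolving the collision.

```java
class ItemTable {
    private final String[] cells;

    ItemTable(int capacity) {
        cells = new String[capacity];
    }

    // Maps the item's hash code to a valid cell index (floorMod avoids negatives).
    private int indexFor(String item) {
        return Math.floorMod(item.hashCode(), cells.length);
    }

    // True if the item's cell already holds a different item.
    boolean collides(String item) {
        String stored = cells[indexFor(item)];
        return stored != null && !stored.equals(item);
    }

    // Stores the item unless its cell is taken by a different item.
    boolean add(String item) {
        if (collides(item)) return false;
        cells[indexFor(item)] = item;
        return true;
    }

    boolean contains(String item) {
        return item.equals(cells[indexFor(item)]);
    }
}
```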


Computing Hash Codes

Wish list: summary for a hashCode function
    Should produce an integer
    Should distribute the hash codes evenly through the range
        minimises collisions
    Should be fast to compute
    Should take account of all components of the object
    Must be consistent with equals()
        two items that are equal must have the same hash value

Can we avoid clashes altogether? That would be perfect!
    perfect hash function


A Simple Hash Function for Strings

We could add up the codes of all the characters:

private int hash(String value) {
    int hashCode = 0;
    for (int i = 0; i < value.length(); i++)
        hashCode += value.charAt(i);
    return hashCode;
}

Why is this not very good?


Example: Hashing course codes

418 ← DEAF101

419 ← DEAF102 DEAF201

⋮

429 ← BBSC201 MDIA101

430 ← ECHI410 MDIA102 MDIA201

431 ← ECHI303 JAPA111 JAPA201 MDIA202 MDIA220 MDIA301

432 ← ARCH101 ASIA101 BBSC231 BBSC303 BBSC321 CHEM201 ECHI403 ECHI412 JAPA112 JAPA211 JAPA301 MDIA203 MDIA302 MDIA320

⋮

450 ← ANTH412 ARCH389 ARTH111 BIOL228 BIOL327 BIOL372 CHEM489 COML304 COML403 COML421 COMP102 COMP201 CRIM313 CRIM421 DESN215 DESN233 ECON328 ECON409 ECON418 ECON508 EDUC449 EDUC458 EDUC548 EDUC557 ENGL228 ENGL408 ENGL426 ENGL435 ENGL444 ENGL453 FREN124 FREN331 FREN403 FREN412 GEOL362 GEOL407 GERM214 GERM403 GERM412 INFO213 INFO312 INFO402 ITAL206 ITAL215 LALS501 LATI404 LING224 LING323 LING404 MAOR102 MARK304 MARK403 MATH206 MATH314 MATH323 MATH431 MOFI403 PHIL104 PHIL203 PHIL302 PHIL320 PHIL401 PHIL410 RELI321 RELI411 SAMO101

a lot of collisions!
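The collisions in the table can be reproduced with the additive hash: because addition ignores order, any two codes built from the same characters get the same sum. A small sketch (the class name `SumHashDemo` is illustrative):

```java
class SumHashDemo {
    // Additive hash: the sum of the character codes.
    static int hash(String value) {
        int hashCode = 0;
        for (int i = 0; i < value.length(); i++)
            hashCode += value.charAt(i);
        return hashCode;
    }

    public static void main(String[] args) {
        System.out.println(hash("DEAF102")); // prints 419
        System.out.println(hash("DEAF201")); // prints 419
        System.out.println(hash("COMP102")); // prints 450
        System.out.println(hash("MATH206")); // prints 450
    }
}
```

The sums also cluster in a narrow range, since every code is four letters followed by three digits.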


Better Hash Functions

Make the contribution of each character depend on its position:

private int hash(String course) {
    int k = 257;
    int hashCode = 0;
    for (int i = 0; i < course.length(); i++)
        hashCode = hashCode * k + course.charAt(i);
    return hashCode;
}

$\mathrm{hashCode}(s) = k^6 s_0 + k^5 s_1 + k^4 s_2 + k^3 s_3 + k^2 s_4 + k^1 s_5 + s_6$

(it is best to use a prime number for the constant k)
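A short sketch of the positional hash in use. Two details are worth noting: for 7-character codes the `int` arithmetic overflows and wraps around, so the result can be negative, and a hash code still has to be reduced to a valid array index. The helper `indexFor` and the table size of 1000 are assumptions for illustration, not part of the lecture.

```java
class PositionalHashDemo {
    // Polynomial hash with k = 257; int overflow wraps silently in Java.
    static int hash(String course) {
        int k = 257;
        int hashCode = 0;
        for (int i = 0; i < course.length(); i++)
            hashCode = hashCode * k + course.charAt(i);
        return hashCode;
    }

    // Hypothetical helper: maps a hash code to a cell of a table of the given size.
    // Math.floorMod always returns a value in 0..tableSize-1, even for negative hashes.
    static int indexFor(String course, int tableSize) {
        return Math.floorMod(hash(course), tableSize);
    }

    public static void main(String[] args) {
        // The two codes that collided under the additive hash now differ.
        System.out.println(hash("DEAF102") == hash("DEAF201")); // prints false
        System.out.println(indexFor("DEAF102", 1000));
    }
}
```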


Perfect Hash Functions

A perfect hash function gives no collisions for a given data set

Example - for VUW courses

private int hash(String course) {
    int hash = 0;
    for (int i = 0; i < course.length(); i++)
        hash = (hash * 51 + course.charAt(i)) % 72201;
    return hash;
}

Building a perfect hash function is
    very difficult
    very specific to a particular set of possible values
    only useful in very specialised circumstances
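Whether a hash function is perfect can only be judged against a concrete key set. A minimal sketch of such a check follows; the helper `isCollisionFree` and the four-course sample are hypothetical, and the full VUW course list that the slide's function was built for is not reproduced here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

class PerfectHashCheck {
    // The hash function from the slide.
    static int hash(String course) {
        int hash = 0;
        for (int i = 0; i < course.length(); i++)
            hash = (hash * 51 + course.charAt(i)) % 72201;
        return hash;
    }

    // Hypothetical helper: true if no two distinct keys share a hash value.
    static boolean isCollisionFree(List<String> keys, ToIntFunction<String> hashFunction) {
        Set<Integer> seen = new HashSet<>();
        for (String key : new HashSet<>(keys))        // duplicates of a key are not collisions
            if (!seen.add(hashFunction.applyAsInt(key)))
                return false;                          // a second key produced this value
        return true;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("COMP102", "COMP103", "DEAF102", "DEAF201");
        System.out.println(isCollisionFree(sample, PerfectHashCheck::hash));
    }
}
```

Adding one new key to the set can break the property, which is why a perfect hash function is tied so closely to a fixed set of values.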


Dealing with Collisions

Two approaches:
    Use a collection at each place (“buckets” or “chaining”)
    Look for an empty place in the hashtable (“probing” or “open addressing”)

[Figure: “2001 – A Space Odyssey” and “Gravity” are both passed through HASH into the array 0 … N]


Collisions: chaining / buckets

Store a Set in each cell:
    hash value → which set

Performance?
    if the array is of size k, each subset will be about 1/k of size()
    cost ≈ cost(hashCode) + cost(subset)

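[Figure: a hash table whose cells each hold a small set of three-letter words: ant, fox, hen, dog, bee, kea, cow, elk, owl, pig, sow, tui, ape, bat, bug, cat, eel, gnu, jay, nit, ray, yak, cod, roe]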

This is what Java's HashMap does.

If the sets get too big:
    Rehash: double the array size and reassign elements
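A minimal sketch of chaining with rehashing. It illustrates the idea only and is not how `java.util.HashMap` is actually written; the class `ChainedHashSet`, the initial size of 8 buckets and the "more than two items per bucket on average" threshold are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ChainedHashSet {
    private List<List<String>> buckets = newBuckets(8);
    private int size = 0;

    private static List<List<String>> newBuckets(int n) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < n; i++)
            result.add(new ArrayList<>());
        return result;
    }

    // hash value -> which bucket
    private List<String> bucketFor(String item) {
        return buckets.get(Math.floorMod(item.hashCode(), buckets.size()));
    }

    boolean contains(String item) {
        return bucketFor(item).contains(item);   // searches one small bucket only
    }

    boolean add(String item) {
        if (contains(item)) return false;
        if (size + 1 > 2 * buckets.size())       // buckets getting too big (arbitrary threshold)
            rehash();
        bucketFor(item).add(item);
        size++;
        return true;
    }

    // Double the number of buckets and reassign every element.
    private void rehash() {
        List<List<String>> old = buckets;
        buckets = newBuckets(2 * old.size());
        for (List<String> bucket : old)
            for (String item : bucket)
                bucketFor(item).add(item);
    }

    int size() {
        return size;
    }
}
```

Rehashing touches every element, but because the table doubles each time, it happens rarely enough that the average cost per insert stays constant.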


Java and hashCode

All objects have a hashCode method and an equals method, so:
    you can call equals on any object and
    you can put any object into a HashSet, HashMap, …

Many predefined objects (e.g. String) have good equals and hashCode methods defined

The default equals method:
    compares references, i.e., equals is ==
    if this is not what you want, define your own equals method

The default hashCode
    returns an integer based on the reference (pointer value)

If you redefine equals, you should redefine hashCode too!
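A minimal sketch of redefining both methods together. The `Course` class and its two fields are hypothetical; the point is that `equals` and `hashCode` are computed from the same fields, so equal objects always have equal hash codes.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class Course {
    private final String code;
    private final int year;

    Course(String code, int year) {
        this.code = code;
        this.year = year;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Course)) return false;
        Course that = (Course) other;
        return code.equals(that.code) && year == that.year;
    }

    // Uses exactly the fields that equals compares.
    @Override
    public int hashCode() {
        return Objects.hash(code, year);
    }

    public static void main(String[] args) {
        Set<Course> courses = new HashSet<>();
        courses.add(new Course("COMP103", 2013));
        // A different object with the same contents is found in the set.
        System.out.println(courses.contains(new Course("COMP103", 2013))); // prints true
    }
}
```

With only `equals` redefined, the two `Course` objects would usually get different default hash codes, the lookup would search the wrong bucket, and `contains` would report false.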