Data Structures Hash Tables Phil Tayco Slide version 1.0 May 4, 2015.

Hash Tables

Storage space revisited

• A common argument in recent computing is the lower cost of acquiring large amounts of disk space

• Designs can then be adjusted to treat the use of large amounts of storage as less critical

• This implies the use of arrays for managing data sets

Hash Tables

Sorted data

• If we are okay with using arrays, then certain situations using them could be identified

• Sorted data leads to O(log n) search performance

• Sorting the data is at best O(n log n) using quicksort, and each insert or delete is O(n) if we keep the order while performing maintenance

• Performance is strong if the data is sorted but maintaining it can be costly
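The trade-off can be sketched with Python's standard `bisect` module. This is a minimal illustration, not part of the original slides; the `contains` helper and the sample ids are hypothetical.

```python
from bisect import bisect_left, insort

def contains(sorted_ids, target):
    """Binary search on an already sorted list: O(log n) comparisons."""
    i = bisect_left(sorted_ids, target)
    return i < len(sorted_ids) and sorted_ids[i] == target

employee_ids = [3, 8, 15, 23, 42]  # hypothetical ids, already sorted

# insort finds the slot in O(log n) but must shift later elements,
# so keeping the list ordered costs O(n) per insert.
insort(employee_ids, 19)
```

The search is fast, but every insert that keeps the order pays for shifting elements.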

Hash Tables

Unless we don’t need to sort

• Sorted data helps when presenting parts or all of the data (such as a web page report)

• If there isn’t a need to show sorted data (such as an employee management system where records are maintained one at a time), then the need to sort the data is removed

• Searching unsorted data, however, is O(n), so we are now looking for a structure that gives O(log n) maintenance performance (or better) without needing the sorting (and we are okay with using arrays)

Hash Tables

Array index as key

• To get there, we take advantage of the fact that arrays allow direct access to their elements

• Direct access is achieved by using the array index number

• The question is how to maximize use of the array index when performing the maintenance functions?

Hash Tables

An ideal example

• Consider a company of 1,000 employees that is very unlikely to ever exceed 100,000

• Storage is not an issue and memory capacity can easily accommodate 100,000 records

• The program to maintain these records does not have functionality that requires showing the employee records in any sorted way

• This is all great because an array can be used with a large amount of space that can handle the worst case 100,000 records

Hash Tables

Index representation

• To take full advantage of the array, we treat the array index as a key value that identifies an employee

• Sequential employee id numbers make the perfect key (Employee 15 is employees[14])

• On a larger scale, employee SSN can be used in the same way (assuming you can hold up to 999,999,999 records!)

• Each employee id is a unique index value so there would never be overlap (unless you reused employee ids after they left the company)

Hash Tables

Ideal efficiency

• Just how fast is this approach?

– Search: you know the id number, so you know the array index and you have direct access

– Insert: maintaining the last known employee id number makes adding new employees easy

– Update/Delete: is a search followed by an appropriate change

• Each one of these ends up at O(1)!
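A minimal sketch of this ideal case, assuming sequential ids starting at 1 and the 100,000 capacity from the example. The class and method names are hypothetical, chosen only for illustration.

```python
class EmployeeTable:
    """Direct-address table: employee id n is stored at index n - 1."""

    def __init__(self, capacity=100_000):
        self.records = [None] * capacity  # None marks an unused slot
        self.last_id = 0                  # last sequential id handed out

    def insert(self, record):
        """Assign the next sequential id and store the record: O(1)."""
        if self.last_id >= len(self.records):
            raise OverflowError("employee table is full")
        self.last_id += 1
        self.records[self.last_id - 1] = record
        return self.last_id

    def search(self, emp_id):
        """Direct access by index: O(1)."""
        if 1 <= emp_id <= self.last_id:
            return self.records[emp_id - 1]
        return None

    def update(self, emp_id, record):
        """A search followed by a change: O(1)."""
        if self.search(emp_id) is None:
            return False
        self.records[emp_id - 1] = record
        return True

    def delete(self, emp_id):
        """A search followed by clearing the slot; the id is not reused."""
        if self.search(emp_id) is None:
            return False
        self.records[emp_id - 1] = None
        return True

table = EmployeeTable()
ada_id = table.insert("Ada")
bob_id = table.insert("Bob")
```

No loop appears in any of the four operations, which is where the O(1) comes from.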

Hash Tables

Reality

• Such ideal situations are in fact that: ideal

• Some situations tend to lose out on some factor:

– Not quite enough storage space, requiring a smaller array size

– ID values may not be a unique number


• Can we reduce the array size and find a way to line up a unique record ID with an array index?

Hash Tables

Hashing

• Hashing involves deriving an index value through some logical calculation

• The derivation is applied to a field or combination of fields of the record to calculate an index

• Typical example: Adding all ASCII values of some field like first and last name and using mod to calculate the index

Hash Tables

Calculations

• Example: “Phil Tayco” as the name of the record

– Add all ASCII character values

• 80 + 104 + 105 + 108 = 397 for “Phil”

• 84 + 97 + 121 + 99 + 111 = 512 for “Tayco”

• Total = 909

– Say we only allow for 500 array elements. We then mod this value by the array size

• 909 % 500 = array index 409

• Utilizing this approach means we have a consistent formula to derive an index value
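The calculation can be sketched as a small function. This assumes standard ASCII codes ('P' is 80 and 'T' is 84) and, as in the example, does not count the space between the names. The name `ascii_sum_hash` is hypothetical.

```python
def ascii_sum_hash(first_name, last_name, table_size=500):
    """Add the character codes of both names, then mod by the array size."""
    total = sum(ord(ch) for ch in first_name + last_name)
    return total % table_size

phil_sum = sum(ord(ch) for ch in "Phil")    # 80 + 104 + 105 + 108
tayco_sum = sum(ord(ch) for ch in "Tayco")  # 84 + 97 + 121 + 99 + 111
index = ascii_sum_hash("Phil", "Tayco")     # (phil_sum + tayco_sum) % 500
```

The same name always produces the same index, which is what makes the formula usable for both storing and finding a record.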

Hash Tables

Limitations

• Challenges immediately come to mind when looking at this example:

– Eventually, an index value calculation for 2 different records will derive the same value (called a “collision”)

– A calculation that guarantees a unique value often leads to a large amount of space required with heavy underutilization

• We need to keep the capacity of the array reasonable while handling the inevitable collisions
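A short sketch of why collisions are inevitable with a character-sum hash. The sample names and the `ascii_sum_hash` helper are hypothetical.

```python
def ascii_sum_hash(key, table_size=500):
    """Character-sum hash: add all character codes, then mod by the size."""
    return sum(ord(ch) for ch in key) % table_size

# "Amy" and "May" add up to the same total (295), so they collide.
amy_index = ascii_sum_hash("Amy")
may_index = ascii_sum_hash("May")

# With more keys than array elements, at least two keys must share an index.
keys = [f"emp{i}" for i in range(501)]
distinct_indexes = {ascii_sum_hash(k) for k in keys}
```

Here `amy_index` equals `may_index`, and `distinct_indexes` can never hold more than 500 values for the 501 keys.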

Hash Tables

Collisions

• Multiple approaches exist for handling collisions when hashing

• Open addressing finds another open element in the array by following a search-like algorithm

• The assumption is that there will be enough space for all entries (i.e. the estimated maximum capacity of the hash array is adequate)

Hash Tables

Linear Probing

• Linear probing is the basic open addressing algorithm

– If a collision occurs, look in the next immediate spot in the array

– If it is open, place the new item there

– If it is not, continue looking in the next array index (wrapping to index 0 if needed) until an open spot is found

• This is an issue only if the capacity is reached (making the initial estimate important)

Hash Tables

Linear Probe Search

• If the hash array utilizes this form of collision handling on insert, the other functions must follow suit

– Search uses the hash function to find if a given record is at the hash location

– If it is “empty” at that location, the search is over

– If it is there, then the record is found

– Otherwise, the search continues with the next array element

• “Empty”, however, must be defined as a predetermined record value. Why…?

Hash Tables

Linear Probe Delete

• Because a delete cannot simply perform the search and, if the record is found, remove it from the array

• This would leave an empty spot in the array that may be interpreted as a record not found during a search

• Instead, the array element is changed to another pre-determined value of “deleted”

– Search does not treat this as an empty spot
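Insert, search and delete with both marker values can be sketched together. This is a minimal illustration, assuming keys are unique and are never inserted twice. The class name, the sentinel objects, and the choice to let insert reuse “deleted” spots are assumptions, not something the slides specify.

```python
EMPTY = object()    # element that has never held a record
DELETED = object()  # element whose record was removed

class LinearProbeTable:
    def __init__(self, size, hash_fn):
        self.slots = [EMPTY] * size
        self.hash_fn = hash_fn

    def _probe(self, key):
        """Yield the hash index, then each following index with wraparound."""
        size = len(self.slots)
        start = self.hash_fn(key) % size
        for step in range(size):
            yield (start + step) % size

    def insert(self, key):
        for i in self._probe(key):
            # Assumed design choice: a "deleted" spot may be reused.
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key
                return i
        raise OverflowError("hash array is full")

    def search(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return None        # a truly empty spot ends the search
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return i
        return None                # probed every element without a match

    def delete(self, key):
        i = self.search(key)
        if i is None:
            return False
        self.slots[i] = DELETED    # mark the spot, do not empty it
        return True
```

Because search only stops at `EMPTY`, records that were probed past a now-deleted record stay reachable.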

Hash Tables

Example: Records “T”, “Y” and “R” have been hashed into the array

T Y R

Hash Tables

New record “D” comes in and the hash function calculates its index as index [3]

T Y R

D

Hash Tables

Record “D” collides with record “T”. Linear probe means try the next index

T Y R

D

Hash Tables

However, record “Y” is already there, so we try the next one. It is open, so that’s where “D” goes

T Y RD

Hash Tables

Later on, record “Y” is called for deletion. When “Y” is hashed, its index value is [4]. “Y” is there, so the deletion is performed

T Y RD

Hash Tables

However, if we remove it, that creates an empty space…

T RD

Hash Tables

If we left it this way, when a search for record “D” begins, its original hash value is still [3]

T RD

Hash Tables

Since index [3] is not “empty”, search goes to index [4] which is empty and then incorrectly returns “not found”

T RD

Hash Tables

The solution is, instead of removing the record, to put in a designated “deleted” value (such as -1)

T -1 RD

Hash Tables

Now when search for record “D” is performed, the linear probe will treat the “-1” as not empty and continue the search correctly

T -1 RD
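The walkthrough above can be replayed in a few lines. The hash values for “T”, “Y” and “D” come from the example. The array size of 8 and the index used for “R” are assumptions, since the slides do not state them.

```python
EMPTY, DELETED = None, -1  # -1 is the "deleted" value suggested above

# Hash values from the example; R's index is assumed for illustration.
HASH = {"T": 3, "Y": 4, "D": 3, "R": 6}

def search(table, key):
    """Linear probe search: stop at an empty element, skip anything else."""
    size = len(table)
    for step in range(size):
        i = (HASH[key] + step) % size
        if table[i] is EMPTY:
            return None
        if table[i] == key:
            return i
    return None

def build():
    """Array state after T, Y, R and D have all been inserted."""
    table = [EMPTY] * 8
    table[3], table[4], table[6] = "T", "Y", "R"
    table[5] = "D"  # D hashed to [3], then probed past T and Y
    return table

naive = build()
naive[4] = EMPTY        # Y simply removed
tombstone = build()
tombstone[4] = DELETED  # Y replaced with the "deleted" value

naive_result = search(naive, "D")      # stops at the empty [4]
fixed_result = search(tombstone, "D")  # continues past [4]
```

With plain removal the search for “D” returns `None`; with the “deleted” value it continues and finds “D” at index 5.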

Hash Tables

Linear Probe Efficiency

• As records start to fill up the array, you can infer that the efficiency of the algorithm degrades to O(n)

• The degradation depends on how well the hash function spreads out locations and on the nature of the data (do the selected fields of data result in spaced out hash values?)

• Other methods of probing exist

– Quadratic probing

– Double hashing
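The three probing methods differ only in which indexes they visit after a collision. The sketch below uses the common textbook forms (offset i for linear, i squared for quadratic, i times a second hash value for double hashing); the function names and the sample numbers are assumptions.

```python
def linear_probes(start, size):
    """start, start + 1, start + 2, ... with wraparound."""
    return [(start + i) % size for i in range(size)]

def quadratic_probes(start, size):
    """start, start + 1, start + 4, start + 9, ... with wraparound."""
    return [(start + i * i) % size for i in range(size)]

def double_hash_probes(start, step, size):
    """start, start + step, start + 2 * step, ... with wraparound.

    step comes from a second hash function and must not be 0.
    """
    return [(start + i * step) % size for i in range(size)]

# Hypothetical case: array of 11 elements, first hash 3, second hash 4.
linear = linear_probes(3, 11)[:5]
quadratic = quadratic_probes(3, 11)[:5]
double = double_hash_probes(3, 4, 11)[:5]
```

For these inputs the first five indexes are 3, 4, 5, 6, 7 for linear, 3, 4, 7, 1, 8 for quadratic, and 3, 7, 0, 4, 8 for double hashing. Quadratic probing does not necessarily reach every element, and double hashing reaches every element only when the step and the array size share no common factor.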

Hash Tables

The bottom line

• Whatever the hash function and open addressing probe approach you take, the logic and strategy are the same:

– Determine an appropriate field(s) for hash use

– Develop a hash function that generates reasonably spaced index values

– Design a collision handling approach that takes advantage of the hash strategy

• Best and worst case will always range from O(1) to O(n)

• Open addressing means trying to reduce the likelihood of O(n)

Hash Tables

A more dynamic approach

• What if you’re not quite sure of your capacity estimate? Or perhaps the maximum size is so large that most of the array would sit unused

• A second collision handling approach allows for keeping a reasonably sized array and dynamically addressing the collisions

• “Dynamic” memory management implies a second structure…

Hash Tables

A hash array of linked lists

• This method, known as “Separate Chaining”, makes each element of the array a “head” node of a linked list

• When insert is performed, the hash index is found and the new element is inserted into the linked list there

• If a collision occurs, it’s okay because the linked list insert handles it

• When search or delete is performed, the initial hash takes place followed by a standard linked list search or delete

Hash Tables

Same example as before. 3 records as heads of lists in the hash array

T Y R

Hash Tables

Record “D” is hashed to index [3] and is inserted into the linked list (note that T is now the 2nd node in the linked list there)

D Y R

T

Hash Tables

Delete of record “Y” is simply hashing to index [4] and performing a linked list delete

D R

T

Hash Tables

Search for “D” hashes to index [3] as normal and a linked list search is performed (which happens to be the head node!)

D R

T
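Separate chaining and the example above can be sketched as follows. New records are inserted at the head of the list, matching the example where “T” becomes the 2nd node. The class names, the array size of 8, and the index used for “R” are assumptions.

```python
class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node

class ChainedHashTable:
    def __init__(self, size, hash_fn):
        self.heads = [None] * size  # each element is the head of a list
        self.hash_fn = hash_fn

    def _index(self, key):
        return self.hash_fn(key) % len(self.heads)

    def insert(self, key):
        i = self._index(key)
        self.heads[i] = Node(key, self.heads[i])  # new node becomes the head

    def search(self, key):
        node = self.heads[self._index(key)]
        while node is not None:
            if node.key == key:
                return node
            node = node.next
        return None

    def delete(self, key):
        i = self._index(key)
        prev, node = None, self.heads[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[i] = node.next  # removing the head node
                else:
                    prev.next = node.next      # unlink a later node
                return True
            prev, node = node, node.next
        return False

    def chain(self, index):
        """Keys stored at one array index, head first."""
        keys, node = [], self.heads[index]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys

# Hash values from the example; R's index is assumed for illustration.
HASH = {"T": 3, "Y": 4, "D": 3, "R": 6}
table = ChainedHashTable(8, HASH.__getitem__)
for key in ["T", "Y", "R", "D"]:
    table.insert(key)
chain_at_3 = table.chain(3)   # D first, then T
deleted_y = table.delete("Y")
found_d = table.search("D")
```

No “deleted” marker is needed here: removing “Y” is an ordinary linked list delete, and the search for “D” finds it at the head of the list at index 3.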

Hash Tables

Separate Chaining pros and cons

• The overhead with using a linked list does impact performance but not necessarily the coding since the functions can be modularized

• In theory, the performance is the same as open addressing since it still depends on the hash function developed

• The size of the hash array is not a critical dependency since the linked lists handle the need for additional space

• The right combination of a hash function that yields wide ranging index values with the use of linked lists is generally preferred

Hash Tables

Summary

• Hash tables offer a strong benefit in situations where single record search and maintenance are primary, because of their near O(1) performance

• Obtaining records in ordered groups and data sets is challenging to do and not conducive to hash tables

• Collisions can be handled using open addressing or separate chaining, the latter of which is generally considered more flexible for performance and memory usage

• The key is the hash function itself – many formulas and theories exist on what fields and calculations to use to derive index values
