09 Buckets
8/4/2019 09 Buckets
1/40
Comp 122, Spring 2004
Keys into Buckets:
Lower bounds, Linear-time sort, & Hashing
-
linsort - Lin / Devi, Comp 122
Comparison-based Sorting
Comparison sort
Only comparisons of pairs of elements may be used to gain order information about a sequence.
Hence, a lower bound on the number of comparisons will be a lower bound on the complexity of any comparison-based sorting algorithm.
All our sorts so far have been comparison sorts. The best worst-case complexity so far is Θ(n lg n) (merge sort and heapsort).
We prove a lower bound of Ω(n lg n) for any comparison sort: merge sort and heapsort are asymptotically optimal.
The idea is simple: there are n! possible outcomes, so a decision tree needs n! leaves, and therefore height at least lg(n!) = Ω(n lg n).
-
Decision Tree
For insertion sort operating on three elements.

                    1:2
              ≤ /       \ >
            2:3           1:3
          ≤ /  \ >      ≤ /  \ >
    ⟨1,2,3⟩    1:3   ⟨2,1,3⟩   2:3
            ≤ / \ >         ≤ / \ >
      ⟨1,3,2⟩  ⟨3,1,2⟩  ⟨2,3,1⟩  ⟨3,2,1⟩

Contains 3! = 6 leaves.
Simply unroll all loops for all possible inputs.
Node i:j means compare A[i] to A[j].
Leaves show outputs; no two paths go to the same leaf!
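The tree above can be checked mechanically. The sketch below (an illustration, not part of the slides) runs insertion sort on all 3! = 6 inputs, counting one comparison per internal node, and confirms the worst-case root-to-leaf path has length 3.

```python
from itertools import permutations

def insertion_sort_count(A):
    """Insertion sort that counts key comparisons (= decision-tree nodes visited)."""
    A = list(A)
    comps = 0
    for j in range(1, len(A)):
        key = A[j]
        i = j - 1
        while i >= 0:
            comps += 1              # one node i:j of the decision tree
            if A[i] <= key:
                break
            A[i + 1] = A[i]         # shift larger element right
            i -= 1
        A[i + 1] = key
    return A, comps

# One leaf per input permutation; the longest path uses 3 comparisons.
worst = max(insertion_sort_count(p)[1] for p in permutations([1, 2, 3]))
print(worst)  # 3
```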
-
8/4/2019 09 Buckets
4/40
linsort - 4 Lin / DeviComp 122
Decision Tree (Contd.)
Execution of the sorting algorithm corresponds to tracing a path from root to leaf. The tree models all possible execution traces.
At each internal node, a comparison a_i ≤ a_j is made. If a_i ≤ a_j, follow the left subtree, else follow the right subtree. View the tree as if the algorithm splits in two at each node, based on the information it has determined up to that point.
When we come to a leaf, the ordering a_π(1) ≤ a_π(2) ≤ … ≤ a_π(n) is established.
A correct sorting algorithm must be able to produce any permutation of its input.
Hence, each of the n! permutations must appear at one or more of the leaves of the decision tree.
-
A Lower Bound for Worst Case
The worst-case number of comparisons for a sorting algorithm is the length of the longest path from the root to any of the leaves in the decision tree for the algorithm, i.e., the height of its decision tree.
A lower bound on the running time of any comparison sort is therefore given by a lower bound on the heights of all decision trees in which each permutation appears as a reachable leaf.
-
Optimal sorting for three elements
Any sort of three elements has a decision tree with 3! = 6 leaves, and hence 5 internal nodes.
(The same 6-leaf decision tree as on the previous slide.)
There must be a worst-case path of length ⌈lg 6⌉ = 3.
-
A Lower Bound for Worst Case
Theorem 8.1:
Any comparison sort algorithm requires Ω(n lg n) comparisons in the worst case.
Proof:
It suffices to determine the height of the decision tree.
The number of leaves is at least n! (one per possible output); the number of internal nodes is at least n! − 1.
A binary tree of height h has at most 2^h leaves, so 2^h ≥ n!.
The height is therefore at least lg(n!) = Ω(n lg n). QED
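The final step uses lg(n!) = Ω(n lg n), which can be seen directly, without Stirling's approximation: drop the small half of the factors and bound the large half from below.

$$
\lg(n!) \;=\; \sum_{i=1}^{n} \lg i \;\ge\; \sum_{i=\lceil n/2\rceil}^{n} \lg i \;\ge\; \frac{n}{2}\,\lg\frac{n}{2} \;=\; \Omega(n\lg n).
$$

Trivially lg(n!) ≤ n lg n as well, so in fact lg(n!) = Θ(n lg n).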
-
Beating the lower bound
We can beat the lower bound if we don't base our sort on comparisons:
Counting sort for keys in [0..k], k = O(n).
Radix sort for keys with a fixed number of digits.
Bucket sort for random keys (uniformly distributed).
-
Counting Sort
Assumption: we sort integers in {0, 1, 2, …, k}.
Input: A[1..n], with each A[j] ∈ {0, 1, 2, …, k}. Array A and values n and k are given.
Output: B[1..n], sorted. Assume B is already allocated and given as a parameter.
Auxiliary storage: C[0..k] of counts.
Runs in linear time if k = O(n).
-
Counting-Sort (A, B, k)
CountingSort(A, B, k)
1. for i ← 0 to k
2.     do C[i] ← 0                      ▷ O(k): init counts
3. for j ← 1 to length[A]
4.     do C[A[j]] ← C[A[j]] + 1         ▷ O(n): count
5. for i ← 1 to k
6.     do C[i] ← C[i] + C[i−1]          ▷ O(k): prefix sum
7. for j ← length[A] downto 1
8.     do B[C[A[j]]] ← A[j]
9.        C[A[j]] ← C[A[j]] − 1         ▷ O(n): reorder
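As a sanity check, here is the same procedure as a runnable Python sketch (0-based lists; B is allocated inside the function rather than passed in, a small departure from the slide's interface):

```python
def counting_sort(A, k):
    """Stable counting sort for integers in {0, 1, ..., k}."""
    n = len(A)
    C = [0] * (k + 1)                    # O(k): init counts
    for key in A:                        # O(n): count occurrences
        C[key] += 1
    for i in range(1, k + 1):            # O(k): prefix sums give final positions
        C[i] += C[i - 1]
    B = [0] * n
    for j in range(n - 1, -1, -1):       # O(n): reorder; right-to-left => stable
        C[A[j]] -= 1
        B[C[A[j]]] = A[j]
    return B

print(counting_sort([2, 5, 3, 0, 2, 3, 0, 3], 5))  # [0, 0, 2, 2, 3, 3, 3, 5]
```

Scanning A right to left in the final loop is what makes the sort stable, which radix sort will rely on below.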
-
Radix Sort
Used to sort on card-sorters: do a stable sort on each column, one column at a time. The human operator is part of the algorithm!
Key idea: sort on the least significant digit first, then on the remaining digits in order of increasing significance. The sorting method used on each digit must be stable.
If we start with the most significant digit, we'll need extra storage.
-
An Example
Input   After sorting   After sorting      After sorting
        on LSD          on middle digit    on MSD
392     631             928                356
356     392             631                392
446     532             532                446
928     495             446                495
631     356             356                532
532     446             392                631
495     928             495                928
-
Radix-Sort(A, d)
RadixSort(A, d)
1. for i ← 1 to d
2.     do use a stable sort to sort array A on digit i

Correctness of Radix Sort
By induction on the number of digits sorted.
Assume that radix sort works for d−1 digits; show that it works for d digits.
Radix sort of d digits ≡ radix sort of the low-order d−1 digits followed by a stable sort on digit d.
-
Algorithm Analysis
Each pass over n d-digit numbers takes time Θ(n+k), assuming counting sort is used for each pass.
There are d passes, so the total time for radix sort is Θ(d(n+k)).
When d is a constant and k = O(n), radix sort runs in linear time.
Radix sort, if it uses counting sort as the intermediate stable sort, does not sort in place. If primary memory storage is an issue, quicksort or other in-place sorting methods may be preferable.
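A runnable sketch of radix sort on the slide's 3-digit example. For brevity, each stable pass here distributes into ten per-digit buckets (equivalent in effect to a counting-sort pass with k = 9); that is a substitution for illustration, not the slides' literal pseudocode.

```python
def radix_sort(A, d):
    """Sort d-digit decimal integers, least significant digit first.

    Each pass is stable because elements are appended to their digit's
    bucket in input order; d passes give Theta(d(n + 10)) total time.
    """
    for i in range(d):                              # digit 0 = least significant
        buckets = [[] for _ in range(10)]
        for x in A:
            buckets[(x // 10 ** i) % 10].append(x)  # stable: preserves order
        A = [x for b in buckets for x in b]         # concatenate buckets 0..9
    return A

print(radix_sort([392, 356, 446, 928, 631, 532, 495], 3))
# [356, 392, 446, 495, 532, 631, 928]
```

After the first pass the list matches the slide example's "After sorting on LSD" column, and so on for each subsequent pass.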
-
Bucket Sort
Assumes input is generated by a random process
that distributes the elements uniformly over [0, 1).
Idea:
Divide [0, 1) into n equal-sized buckets.
Distribute the n input values into the buckets.
Sort each bucket.
Then go through the buckets in order, listing elements
in each one.
-
An Example
-
Bucket-Sort (A)
BucketSort(A)
1. n ← length[A]
2. for i ← 1 to n
3.     do insert A[i] into list B[⌊n·A[i]⌋]
4. for i ← 0 to n−1
5.     do sort list B[i] with insertion sort
6. concatenate the lists B[0], B[1], …, B[n−1] together in order
7. return the concatenated lists

Input: A[1..n], where 0 ≤ A[i] < 1 for all i.
Auxiliary array: B[0..n−1] of linked lists, each list initially empty.
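A Python sketch of BucketSort. Python's built-in `sorted()` stands in for the insertion sort of line 5; with constant-size buckets either choice is O(1) expected time per bucket. The sample values are illustrative.

```python
import math

def bucket_sort(A):
    """Sort values in [0, 1), assumed (roughly) uniformly distributed."""
    n = len(A)
    B = [[] for _ in range(n)]           # n empty buckets for [0, 1)
    for x in A:
        B[math.floor(n * x)].append(x)   # line 3: bucket floor(n * A[i])
    out = []
    for bucket in B:                     # lines 4-7: sort each bucket,
        out.extend(sorted(bucket))       # then concatenate in order
    return out

print(bucket_sort([0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12, 0.23, 0.68]))
# [0.12, 0.17, 0.21, 0.23, 0.26, 0.39, 0.68, 0.72, 0.78, 0.94]
```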
-
Analysis
Relies on no bucket getting too many values.
All lines except the insertion sorting in line 5 take O(n) altogether.
Intuitively, if each bucket gets a constant number of elements, it takes O(1) time to sort each bucket, giving O(n) total sort time for all buckets.
We expect each bucket to have few elements, since the average is 1 element per bucket. But we need to do a careful analysis.
-
Analysis (Contd.)
Random variable n_i = number of elements placed in bucket B[i].
Insertion sort runs in quadratic time. Hence, the time for bucket sort is:

$$
T(n) = \Theta(n) + \sum_{i=0}^{n-1} O(n_i^2) \qquad (8.1)
$$

Taking expectations of both sides and using linearity of expectation, we have:

$$
\begin{aligned}
E[T(n)] &= E\Big[\Theta(n) + \sum_{i=0}^{n-1} O(n_i^2)\Big] \\
&= \Theta(n) + \sum_{i=0}^{n-1} E[O(n_i^2)] && \text{(by linearity of expectation)} \\
&= \Theta(n) + \sum_{i=0}^{n-1} O(E[n_i^2]) && \text{(since } E[aX] = aE[X]\text{)}
\end{aligned}
$$
-
Analysis (Contd.)
Claim: E[n_i²] = 2 − 1/n.
Proof:
Define indicator random variables X_ij = I{A[j] falls in bucket i}.
Pr{A[j] falls in bucket i} = 1/n.

$$
n_i = \sum_{j=1}^{n} X_{ij} \qquad (8.2)
$$
-
Analysis (Contd.)

$$
\begin{aligned}
E[n_i^2] &= E\Big[\Big(\sum_{j=1}^{n} X_{ij}\Big)^2\Big] \\
&= E\Big[\sum_{j=1}^{n}\sum_{k=1}^{n} X_{ij}X_{ik}\Big] \\
&= E\Big[\sum_{j=1}^{n} X_{ij}^2 + \sum_{j=1}^{n}\sum_{\substack{1\le k\le n \\ k\ne j}} X_{ij}X_{ik}\Big] \\
&= \sum_{j=1}^{n} E[X_{ij}^2] + \sum_{j=1}^{n}\sum_{\substack{1\le k\le n \\ k\ne j}} E[X_{ij}X_{ik}] \qquad (8.3)
\end{aligned}
$$

(by linearity of expectation)
-
Analysis (Contd.)

$$
\begin{aligned}
E[X_{ij}^2] &= 0^2 \cdot \Pr\{A[j] \text{ doesn't fall in bucket } i\} + 1^2 \cdot \Pr\{A[j] \text{ falls in bucket } i\} \\
&= 0 \cdot \Big(1 - \frac{1}{n}\Big) + 1 \cdot \frac{1}{n} = \frac{1}{n}
\end{aligned}
$$

For k ≠ j: since X_ij and X_ik are independent random variables,

$$
E[X_{ij}X_{ik}] = E[X_{ij}]\,E[X_{ik}] = \frac{1}{n} \cdot \frac{1}{n} = \frac{1}{n^2}
$$
-
Analysis (Contd.)
(8.3) hence is:

$$
\begin{aligned}
E[n_i^2] &= \sum_{j=1}^{n} \frac{1}{n} + \sum_{j=1}^{n}\sum_{\substack{1\le k\le n \\ k\ne j}} \frac{1}{n^2} \\
&= n \cdot \frac{1}{n} + n(n-1) \cdot \frac{1}{n^2} \\
&= 1 + \frac{n-1}{n} = 2 - \frac{1}{n}
\end{aligned}
$$

Substituting this claim in (8.1), we have:

$$
E[T(n)] = \Theta(n) + \sum_{i=0}^{n-1} O(2 - 1/n) = \Theta(n) + O(n) = \Theta(n)
$$
-
Comp 122, Spring 2004
Hash Tables
-
Dictionary
Dictionary:
Dynamic-set data structure for storing items indexed using keys.
Supports the operations Insert, Search, and Delete.
Applications:
Symbol table of a compiler.
Memory-management tables in operating systems.
Large-scale distributed systems.
Hash tables: an effective way of implementing dictionaries.
Generalization of ordinary arrays.
-
Direct-address Tables
Direct-address tables are ordinary arrays.
Facilitate direct addressing: the element whose key is k is obtained by indexing into the kth position of the array.
Applicable when we can afford to allocate an array with one position for every possible key,
i.e., when the universe of keys U is small.
Dictionary operations can be implemented to take O(1) time.
Details in Sec. 11.1.
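A minimal sketch of a direct-address table (class and method names are my own, not from the slides): each dictionary operation is a single array access, hence O(1).

```python
class DirectAddressTable:
    """Direct addressing for keys drawn from the universe {0, ..., m-1}."""

    def __init__(self, m):
        self.T = [None] * m      # one slot per possible key

    def insert(self, key, value):
        self.T[key] = value      # O(1)

    def search(self, key):
        return self.T[key]       # O(1); None means "absent"

    def delete(self, key):
        self.T[key] = None       # O(1)

t = DirectAddressTable(100)
t.insert(42, "answer")
print(t.search(42))  # answer
```

The cost is the Θ(m) storage: one slot for every key in U, used or not, which is exactly why this only works when U is small.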
-
Hash Tables
Notation:
U - universe of all possible keys.
K - set of keys actually stored in the dictionary; |K| = n.
When U is very large, arrays are not practical.
Typically |K| << |U|.
-
Hashing
Hash function h: a mapping from U to the slots of a hash table T[0..m−1].
h : U → {0, 1, …, m−1}
With arrays, key k maps to slot A[k].
With hash tables, key k maps, or hashes, to slot T[h(k)].
h(k) is the hash value of key k.
-
Hashing
(Figure: the universe U of keys, with the actual keys K = {k1, …, k5} inside it; h maps each key to a slot of the table T[0..m−1]. Here h(k2) = h(k5): a collision.)
-
Issues with Hashing
Multiple keys can hash to the same slot: collisions are possible.
Design hash functions so that collisions are minimized.
But avoiding collisions altogether is impossible, so we also design collision-resolution techniques.
Search will cost Θ(n) time in the worst case.
However, all operations can be made to have an expected complexity of Θ(1).
-
Methods of Resolution
Chaining:
Store all elements that hash to the same slot in a linked list.
Store a pointer to the head of the linked list in the hash table slot.
Open addressing:
All elements are stored in the hash table itself.
When collisions occur, use a systematic (consistent) procedure to store elements in free slots of the table.
(Figure: a table T[0..m−1] with chains, e.g. k1→k4, k5→k6, k7→k3, and k8 in its own slot.)
-
Collision Resolution by Chaining
(Figure: keys k1, …, k8 drawn from U hashing into T[0..m−1]; h(k1) = h(k4), h(k2) = h(k5) = h(k6), h(k3) = h(k7), and h(k8) alone. The crossed-out arrows mark the collisions.)
-
Collision Resolution by Chaining
(Figure: the same keys stored as linked lists in the table slots: k1→k4, k5→k6→k2, k7→k3, and k8.)
-
Hashing with Chaining
Dictionary operations:
Chained-Hash-Insert(T, x)
Insert x at the head of list T[h(key[x])].
Worst-case complexity: O(1).
Chained-Hash-Delete(T, x)
Delete x from the list T[h(key[x])].
Worst-case complexity: proportional to the length of the list with singly linked lists; O(1) with doubly linked lists.
Chained-Hash-Search(T, k)
Search for an element with key k in list T[h(k)].
Worst-case complexity: proportional to the length of the list.
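The three operations can be sketched as follows. The modular hash function h(k) = k mod m and the class name are illustrative choices, not prescribed by the slides, and Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    """Hashing with chaining: each slot holds a chain of (key, value) pairs."""

    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]

    def h(self, k):
        return k % self.m                    # illustrative hash function

    def insert(self, k, v):                  # Chained-Hash-Insert
        self.T[self.h(k)].insert(0, (k, v))  # at the head of the chain: O(1)

    def search(self, k):                     # Chained-Hash-Search
        for key, v in self.T[self.h(k)]:     # O(length of chain)
            if key == k:
                return v
        return None

    def delete(self, k):                     # Chained-Hash-Delete
        chain = self.T[self.h(k)]            # O(chain) here: lists are not
        for i, (key, _) in enumerate(chain): # doubly linked, so we must scan
            if key == k:
                del chain[i]
                return

t = ChainedHashTable(7)
t.insert(10, "a"); t.insert(17, "b")  # 10 mod 7 = 17 mod 7 = 3: a collision
print(t.search(10), t.search(17))     # a b
t.delete(10)
print(t.search(10))                   # None
```

Note that delete here scans the chain, matching the slides' singly-linked-list bound; the O(1) deletion requires doubly linked nodes and a handle to x itself.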
-
Analysis on Chained-Hash-Search
Load factor α = n/m = average number of keys per slot, where
m = number of slots and n = number of elements stored in the hash table.
Worst-case complexity: Θ(n) + time to compute h(k).
The average case depends on how well h distributes the keys among the m slots. Assume:
Simple uniform hashing: any key is equally likely to hash into any of the m slots, independent of where any other key hashes to.
O(1) time to compute h(k).
Then the time to search for an element with key k is Θ(|T[h(k)]|).
Expected length of a linked list = load factor = α = n/m.
-
Expected Cost of an Unsuccessful Search
Theorem:
An unsuccessful search takes expected time Θ(1+α).
Proof:
Any key not already in the table is equally likely to hash to any of the m slots.
To search unsuccessfully for any key k, we need to search to the end of the list T[h(k)], whose expected length is α.
Adding the time to compute the hash function, the total time required is Θ(1+α).
-
Expected Cost of a Successful Search
Theorem:
A successful search takes expected time Θ(1+α).
Proof:
The probability that a list is searched is proportional to the number of elements it contains.
Assume that the element being searched for is equally likely to be any of the n elements in the table.
The number of elements examined during a successful search for an element x is 1 more than the number of elements that appear before x in x's list.
These are the elements inserted after x was inserted, since new elements go at the head of the list.
Goal: find the average, over the n elements x in the table, of how many elements were inserted into x's list after x was inserted.
-
Expected Cost of a Successful Search
Proof (contd.):
Let x_i be the ith element inserted into the table, and let k_i = key[x_i].
Define indicator random variables X_ij = I{h(k_i) = h(k_j)}, for all i, j.
Simple uniform hashing ⇒ Pr{h(k_i) = h(k_j)} = 1/m ⇒ E[X_ij] = 1/m.
The expected number of elements examined in a successful search is:

$$
E\Big[\frac{1}{n}\sum_{i=1}^{n}\Big(1 + \sum_{j=i+1}^{n} X_{ij}\Big)\Big]
$$

(The inner sum counts the elements inserted after x_i into the same slot as x_i.)
-
Proof (Contd.)

$$
\begin{aligned}
E\Big[\frac{1}{n}\sum_{i=1}^{n}\Big(1 + \sum_{j=i+1}^{n} X_{ij}\Big)\Big]
&= \frac{1}{n}\sum_{i=1}^{n}\Big(1 + \sum_{j=i+1}^{n} E[X_{ij}]\Big) && \text{(linearity of expectation)} \\
&= \frac{1}{n}\sum_{i=1}^{n}\Big(1 + \sum_{j=i+1}^{n} \frac{1}{m}\Big) \\
&= 1 + \frac{1}{nm}\sum_{i=1}^{n}(n-i) \\
&= 1 + \frac{1}{nm}\Big(n^2 - \frac{n(n+1)}{2}\Big) \\
&= 1 + \frac{n-1}{2m} \\
&= 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}
\end{aligned}
$$

Expected total time for a successful search
= time to compute hash function + time to search
= O(2 + α/2 − α/(2n)) = O(1 + α).
-
Expected Cost: Interpretation
If n = O(m), then α = n/m = O(m)/m = O(1).
Searching takes constant time on average.
Insertion is O(1) in the worst case.
Deletion takes O(1) worst-case time when lists are doubly linked.
Hence, all dictionary operations take O(1) time on average with hash tables with chaining.