Running Time of Kruskal’s Algorithm
Huffman Codes
Monday, July 14th
Outline For Today
1. Runtime of Kruskal’s Algorithm (Union-Find Data
Structure)
2. Data Encodings & Finding An Optimal Prefix-free
Encoding
3. Prefix-free Encodings <-> Binary Trees
4. Huffman Codes
Recap: Kruskal’s Algorithm Simulation

[Figure sequence: a graph on vertices A-H whose edges are considered in increasing weight order: 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9. Each edge is added unless it creates a cycle with the edges picked so far; edges that create a cycle are skipped. The final tree consists of the edges of weights 1, 2, 2.5, 3, 4, 7, and 7.5.]

Final Tree!
Same as Tprim
Recap: Kruskal’s Algorithm Pseudocode

procedure kruskal(G(V, E)):
    sort E in order of increasing weights
    rename E so w(e1) < w(e2) < … < w(em)
    T = {}  // final tree edges
    for i = 1 to m:
        if T ∪ ei=(u,v) doesn’t create a cycle:
            add ei to T
    return T
Recap: For Correctness We Proved 2 Things
1. Kruskal outputs a spanning tree Tkrsk
2. Tkrsk is a minimum spanning tree
1: Kruskal Outputs a Spanning Tree
Need to prove Tkrsk is spanning AND acyclic.
Acyclicity holds by definition of the algorithm.
Why is Tkrsk spanning (i.e., connected)?
Recall the Empty Cut Lemma:
A graph is not connected iff ∃ a cut (X, Y) with no crossing edges.
If every cut has a crossing edge => the graph is connected!
2: Kruskal is Optimal (by the Cut Property)
Let (u, v) be any edge added by Kruskal’s algorithm.
u and v are in different components (because Kruskal checks for cycles).
[Figure: the cut separating u’s component (containing u, x, y) from the rest of the graph (containing v, t, z, w).]
Claim: (u, v) is the minimum-weight edge crossing this cut!
Kruskal’s Runtime

procedure kruskal(G(V, E)):
    sort E in order of increasing weights            // O(mlog(n))
    rename E so w(e1) < w(e2) < … < w(em)
    T = {}  // final tree edges
    for i = 1 to m:                                  // m iterations
        if T ∪ ei=(u,v) doesn’t create a cycle:      // ?
            add ei to T
    return T

Can we speed up cycle checking?
Option 1: check if a u ⤳ v path exists!
Run a BFS/DFS from u or v => O(|T| + n) = O(n)

***BFS/DFS Total Runtime: O(mn)***
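A direct sketch of this naive variant in Python (the function name and edge representation are illustrative, not from the slides): edges are sorted once, and each candidate edge (u, v) is tested with a BFS over the tree edges chosen so far.

```python
from collections import defaultdict, deque

def kruskal_naive(vertices, edges):
    """edges: list of (weight, u, v) tuples; returns the MST edge list."""
    tree = []
    adj = defaultdict(list)            # adjacency list of T only
    for w, u, v in sorted(edges):      # increasing weight order
        # BFS from u restricted to T: O(|T| + n) = O(n) per edge
        seen = {u}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        if v not in seen:              # no u ~> v path in T => no cycle
            tree.append((w, u, v))
            adj[u].append(v)
            adj[v].append(u)
    return tree
```

With m edges this does an O(n) search per edge, matching the O(mn) total above.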
Speeding Up Kruskal’s Algorithm
Goal: Check for cycles in log(n) time.
Observation: (u, v) creates a cycle iff u and v are in the same connected component.
Option 2: check if u’s component = v’s component.
More Specific Goal: check the component of each vertex in log(n) time.
Union-Find Data Structure
Operation 1 (Union): Maintain the component structure of T as we add new edges to it.
Operation 2 (Find): Query the component of each vertex v.
Kruskal’s With Union-Find (Conceptually)

[Figure sequence: the same graph on A-H, with edges considered in weight order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9. Each vertex is labeled with its current leader, and the Find/Union calls proceed as follows:]

Find(A) = A, Find(D) = D => Union(A, D)
Find(D) = A, Find(E) = E => Union(A, E)
Find(C) = C, Find(F) = F => Union(C, F)
Find(E) = A, Find(F) = C => Union(A, C)
Find(A) = A, Find(B) = B => Union(A, B)
Find(D) = A, Find(C) = A => Skip (D, C)
Find(A) = A, Find(C) = A => Skip (A, C)
Find(C) = A, Find(H) = H => Union(A, H)
Find(F) = A, Find(G) = G => Union(A, G)
Find(B) = A, Find(C) = A => Skip (B, C)
Find(H) = A, Find(G) = A => Skip (H, G)
Union-Find Implementation Simulation

[Figure sequence: each vertex A-H starts as its own component of size 1 (A1, B1, …, H1). As the unions above are performed, the smaller component’s leader is pointed at the larger one’s, and A’s recorded size grows: 2 after absorbing D, 3 after E, 5 after absorbing C’s component {C, F} (itself of size 2), 6 after B, 7 after H, and 8 after G.]
Linked Structure Per Connected Component

[Figure: a component drawn as a linked structure; vertices C, A, W, Z, Y, T point (directly or through other vertices) toward the leader X, which stores the component size 7.]
Union Operation

Union: **Make the leader of the small component point to the leader of the large component.**

[Figure: the component with leader X (size 7) absorbs the component with leader E (size 3); E now points to X, whose size becomes 10.]

Cost: O(1) (1 pointer update, 1 increment)
Find Operation

Find: “pointer chase” until the leader.
Cost: # pointers to the leader.

[Figure: the same component with leader X (size 10); Find follows pointers from a vertex up to X.]
Cost of Find Operation

Claim: For any v, #-pointers to leader(v) ≤ log2(|component(v)|) ≤ log2(n)

Proof: Each time v’s path to its leader increases by 1, the size of its component at least doubles!
|component(v)| starts at 1 and can grow to at most n, therefore it can double at most log2(n) times!
Summary of Union-Find
Initialization: Each v is a component of size 1 and points to itself.
When we union two components, we make the leader of the smaller one point to the leader of the larger one (break ties arbitrarily).
Find(v): pointer chasing to the leader. Cost: O(log2(|component|)) = O(log2(n))
Union(u, v): 1 pointer update, 1 increment => O(1)
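The summary above can be sketched as a small Python class (names illustrative, not from the slides): each vertex stores a parent pointer, leaders point to themselves, and only leaders carry a meaningful size.

```python
class UnionFind:
    def __init__(self, elements):
        self.parent = {v: v for v in elements}  # each v starts as its own leader
        self.size = {v: 1 for v in elements}    # component sizes (valid at leaders)

    def find(self, v):
        # Pointer-chase to the leader: O(log n) by the doubling argument above.
        while self.parent[v] != v:
            v = self.parent[v]
        return v

    def union(self, u, v):
        # Make the smaller component's leader point to the larger's.
        # (The slides' O(1) Union is the two lines at the end; this method
        # also does the two Finds for convenience.)
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
```

Union by size is what guarantees the doubling: a vertex's leader path only lengthens when its component merges into one at least as large.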
Kruskal’s Runtime With Union-Find

procedure kruskal(G(V, E)):
    sort E in order of increasing weights            // O(mlog(n))
    rename E so w(e1) < w(e2) < … < w(em)
    init Union-Find                                  // O(n)
    T = {}  // final tree edges
    for i = 1 to m:                                  // m iterations
        ei = (u, v)
        if find(u) != find(v):                       // log(n)
            add ei to T
            Union(find(u), find(v))                  // O(1)
    return T

***Total Runtime: O(mlog(n))*** Same as Prim’s with heaps.
Data Encodings and Compression
All data in the digital world gets represented as 0s and 1s:
010010100010010100011110110010010101010110100001110100010011000010010101011010100010
100001110100010011000010010101011010100010010100011010100010010100010010110110010101
111100111010001001100001100101101011010100010011000110101000100101001010110110010101
Goal of Data Compression: Make the binary blob as small as possible while satisfying the encoding-decoding protocol.
Encoding-Decoding Protocol

[Figure: a document goes into an encoder, which outputs a binary blob; a decoder maps the blob back to the document.]

Alphabet A = {a, b, c, …, z}, assume |A| = 32

Option 1: Fixed Length Codes
Each letter is mapped to exactly 5 bits:
a b … z => 00000 00001 … 11111
Example: ASCII encoding

Example: Fixed Length Codes
encoder: cat => 000110000010100
decoder: 000110000010100 => cat
Output Size of Fixed Length Codes
Input: Alphabet A, text document of length n
Each letter is mapped to log2(|A|) bits
Output Size: nlog2(|A|)
Optimal if letters appear with the same frequencies in the text!
In practice, letters appear with different frequencies.
Ex: In English, the letters a, t, e are much more frequent than q, z, x.
Question: Can we do better?
Option 2: Variable Length Binary Codes
Goal is to assign:
Frequently appearing letters short bit strings
Infrequently appearing ones long bit strings
Hope: On average use ≤ nlog2(|A|) encoded bits for documents of size n (or ≤ log2(|A|) bits per letter)
Example 1: Morse Code (not binary)
Two symbols: dots (●) and dashes (−), or light and dark.
But the end of a letter is indicated with a pause (effectively a third symbol, written P below).
Frequent letters: e => ●, t => −, a => ●−
Infrequent letters: c => −●−●, j => ●−−−
encoder: cat => −●−●P ●−P −P
decoder: −●−●P●−P−P => cat
Can We Have a Morse Code with 2 Symbols?
Goal: Same idea as the Morse code but with only 2 symbols.
Frequent letters: e => 0, t => 1, a => 01
Infrequent letters: c => 1010, j => 0111
encoder: cat => 1010 01 1 => 1010011
decoder: 1010011 => taeett? teteat? cat?
**Decoding is Ambiguous**
Why Was There Ambiguity?
The encoding of one letter was a prefix of another letter’s.
Ex: e => 0 is a prefix of a => 01
Goal: Use a “prefix-free” encoding, i.e., no letter’s encoding is a prefix of another’s!
Note: The fixed-length encoding was naturally “prefix-free”.
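The prefix-free property is easy to check mechanically. A small Python sketch (function name and dict representation are illustrative): after sorting the codewords lexicographically, any prefix pair must end up adjacent, so one pass suffices.

```python
def is_prefix_free(code):
    """code: dict mapping letter -> bit string."""
    words = sorted(code.values())
    # If a is a prefix of c, every word between them lexicographically
    # also starts with a, so checking adjacent pairs is enough.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))
```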
Ex: Variable Length Prefix-free Encoding
Ex: A = {a, b, c, d}
a => 0, b => 10, c => 110, d => 111
decode 110010, step by step: c, ca, cab
decode 11101101100, step by step: d, da, dac, dacc, dacca
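The step-by-step decoding above can be sketched in Python (names illustrative): walk the bits, extending the current chunk until it matches a codeword of the example code a => 0, b => 10, c => 110, d => 111. Prefix-freeness guarantees the first match is the only possible one.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def decode(bits, code=CODE):
    inverse = {w: letter for letter, w in code.items()}
    out, chunk = [], ""
    for bit in bits:
        chunk += bit
        if chunk in inverse:       # first codeword match wins (prefix-free)
            out.append(inverse[chunk])
            chunk = ""
    return "".join(out)
```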
Benefits of Variable Length Codes
Ex: A = {a, b, c, d}, Frequencies: a: 45%, b: 40%, c: 10%, d: 5%

Variable Length Code: a => 0, b => 10, c => 110, d => 111
Fixed Length Code: a => 00, b => 01, c => 10, d => 11

A document of length 100K:
Fixed Length Code: 200K bits (2 bits/letter)
Variable Length Code: a: 45K, b: 80K, c: 30K, d: 15K. Total: 170K bits (1.7 bits/letter)
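The comparison above is just a weighted sum of codeword lengths; a quick Python sketch (names illustrative, frequencies as integer percents to keep the arithmetic exact):

```python
freqs = {"a": 45, "b": 40, "c": 10, "d": 5}              # percent
fixed = {"a": "00", "b": "01", "c": "10", "d": "11"}
var = {"a": "0", "b": "10", "c": "110", "d": "111"}

def total_bits(code, n=100_000):
    # sum over letters of (frequency share) * (codeword length) * n
    return sum(freqs[x] * len(code[x]) for x in freqs) * n // 100
```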
Formal Problem Statement
Input: An alphabet A, and frequencies 𝓕 of letters in A
Output: a prefix-free encoding Ɣ, i.e. a mapping A ->
{0,1}* that minimizes the average bits per letter
Prefix-free Encodings <-> Binary Trees

We can represent each prefix-free code Ɣ as a binary tree T as follows:

Code 1: a => 0, b => 10, c => 110, d => 111
[Figure: the corresponding tree; a hangs off the root’s 0-branch, b off the path 10, and c, d off the paths 110 and 111.]
Encoding of letter x = path from the root to the leaf labeled x

Code 2: a => 00, b => 01, c => 10, d => 11
[Figure: the complete binary tree of depth 2 with leaves a, b, c, d.]
Reverse is Also True
Each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves in T.
[Figure: a tree with five leaves, giving a => 01, b => 10, c => 000, d => 001, e => 11.]
Why is this code prefix-free?
Reverse is Also True
Claim: Each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves in T.
Proof: Take the path P ∈ {0,1}* from the root to leaf x as x’s encoding.
Since each letter x is at a leaf, the path from the root to x is a dead end and cannot be a prefix of the path to another letter y.
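The claim can be sketched constructively in Python (names and the nested-pair tree representation are illustrative): read the codewords off root-to-leaf paths, 0 for left and 1 for right.

```python
def tree_to_code(tree, path=""):
    """tree: a leaf (letter string) or a pair (left, right).
    Returns dict letter -> codeword."""
    if isinstance(tree, str):          # leaf: the path so far is the codeword
        return {tree: path}
    left, right = tree
    code = tree_to_code(left, path + "0")
    code.update(tree_to_code(right, path + "1"))
    return code
```

Applied to the Code 1 tree, this recovers the code from the earlier slides.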
Number of Bits for Letter x?
Let A be an alphabet, and T a binary tree where the letters of A are the leaves of T.
Question: What’s the number of bits for each letter x in the encoding corresponding to T?
Answer: depthT(x)
[Figure: the Code 1 tree with leaves a, b, c, d.]
Formal Problem Statement Restated
Input: An alphabet A, and frequencies 𝓕 of letters in A
Output: A binary tree T, where letters of A are the leaves of T, that has the minimum average bit length (ABL):
ABL(T) = Σx∈A f(x) · depthT(x)
Observation 1 About Optimal T

Claim: The optimal binary tree T is full, i.e., each non-leaf vertex u has exactly 2 children.

Why?

[Figure: a tree T in which some internal vertex has only one child, next to the tree T` obtained by contracting that vertex.]

Exchange Argument: We can replace u with its only child and decrease the depths of some leaves, giving a better tree T`.

[Figure: a second example; contracting the unary vertex above the leaves a, b moves those leaves one level up.]
First Algorithm: Shannon-Fano Codes
From 1948
Top-down, divide-and-conquer type approach:
1. Divide the alphabet into A0 and A1 s.t. the total frequencies of the letters in A0 and in A1 are each roughly 50%
2. Find an encoding Ɣ0 for A0, and Ɣ1 for A1
3. Prepend 0 to the encodings of Ɣ0 and 1 to those of Ɣ1
First Algorithm: Shannon-Fano Codes
Ex: A = {a, b, c, d}, Frequencies: a: 45%, b: 40%, c: 10%, d: 5%
A0 = {a, d}, A1 = {b, c}
[Figure: the resulting tree; a => 00, d => 01, b => 10, c => 11.]
This is a fixed-length encoding, which we saw was suboptimal!
Observation 2 About Optimal T
Claim: In any optimal tree T, if leaf x has depth i and leaf y has depth j, s.t. i < j => f(x) ≥ f(y)
Why?
Exchange Argument: Swap x and y and get a better tree T`.
Observation 2 About Optimal T
Ex: A = {a, b, c, d}, Frequencies: a: 45%, b: 40%, c: 10%, d: 5%
[Figure: T puts frequent letters deep (c at depth 1, b at depth 2, a and d at depth 3); T` swaps them so a is at depth 1, b at depth 2, and c, d at depth 3.]
T => 2.4 bits/letter
T` => 1.7 bits/letter
Corollary
In any optimal tree T, the two lowest-frequency letters are both in the lowest level of the tree!
Huffman’s Key Insight
Observation 1 => optimal Ts are full => each leaf has a sibling.
Corollary => the 2 lowest-freq. letters x, y are at the same (lowest) level.
Swapping letters across the same level does not change the cost of T.
Therefore: There is an optimal tree T in which the two lowest-frequency letters are siblings (in the lowest level of the tree).
Possible Greedy Algorithm
Possible greedy algorithm:
1. Since x, y can be assumed to be siblings, treat them as a single meta-letter xy
2. Find an optimal tree T* for A - {x, y} + {xy}
3. Expand xy back into x and y in T*
Possible Greedy Algorithm (Example)
Ex: A = {x, y, z, t}, and let x, y be the two lowest-freq. letters.
Let A` = {xy, z, t}
[Figure: T* is a tree on the leaves xy, z, t; T is obtained from T* by replacing the leaf xy with an internal node whose children are x and y.]
The Weight of the Meta-letter?
Q: What weight should be attached to the meta-letter xy?
A: f(x) + f(y)
Huffman’s Algorithm (1951)

procedure Huffman(A, 𝓕):
    if |A| = 2:
        return T where branches 0, 1 point to A[0] and A[1], respectively
    let x, y be the two lowest-frequency letters
    let A` = A - {x, y} + {xy}
    let 𝓕` = 𝓕 - {x, y} + {xy: f(x) + f(y)}
    T* = Huffman(A`, 𝓕`)
    expand x, y in T* to get T
    return T
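A direct Python rendering of the recursive pseudocode above (names and the nested-pair tree encoding are illustrative). Representing the meta-letter xy by the pair (x, y) itself makes the "expand" step implicit: when (x, y) appears as a leaf of the recursive result, it already is the subtree with x and y as siblings. This naive version re-sorts on every call, so it is not the O(|A|log|A|) variant.

```python
def huffman(freq):
    """freq: dict mapping letters (or meta-letter pairs) -> frequency.
    Returns the code tree as nested pairs with letter strings at leaves."""
    if len(freq) == 2:
        a, b = freq                    # base case: branches 0 and 1
        return (a, b)
    x, y = sorted(freq, key=freq.get)[:2]       # two lowest frequencies
    merged = {k: v for k, v in freq.items() if k not in (x, y)}
    merged[(x, y)] = freq[x] + freq[y]          # meta-letter xy, weight f(x)+f(y)
    return huffman(merged)             # expansion of xy is implicit (see above)

def code_lengths(tree, depth=0):
    """Depth of each letter = number of bits it gets."""
    if isinstance(tree, str):
        return {tree: depth}
    lengths = {}
    for child in tree:
        lengths.update(code_lengths(child, depth + 1))
    return lengths
```

On the running example (a: 45, b: 40, c: 10, d: 5) this reproduces the 1.7 bits/letter code shape.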
Huffman’s Algorithm Correctness (1)
By induction on |A|.
Base case: |A| = 2 => return the simple full tree with 2 leaves.
IH: Assume correctness for all alphabets of size k-1.
For |A| = k: Huffman gets an optimal tree Tk-1opt containing the meta-letter xy, and expands xy.
Huffman’s Algorithm Correctness (2)
[Figure: Tk-1opt has xy as a leaf; T replaces that leaf with an internal node whose children x, y sit one level deeper.]
In Tk-1opt, the leaf xy contributes f(xy)·depth(xy) = (f(x) + f(y))·depth(xy).
In T, the leaves x and y together contribute (f(x) + f(y))·(depth(xy) + 1).
Total difference: f(x) + f(y)
Huffman’s Algorithm Correctness (3)
Take any optimal tree Z; we’ll argue ABL(T) ≤ ABL(Z).
By the corollary, we can assume that in Z, x and y are also siblings at the lowest level.
Consider Z` obtained by merging them => Z` is a valid prefix-free code for A`, which has size k-1.
ABL(Z) = ABL(Z`) + f(x) + f(y)
ABL(T) = ABL(T*) + f(x) + f(y), where T* is the tree from the recursive call.
By IH: ABL(T*) ≤ ABL(Z`) => ABL(T) ≤ ABL(Z)
Q.E.D.
Huffman’s Algorithm Runtime
Exercise: Make Huffman run in O(|A|log(|A|))?
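One possible answer sketch (an assumption; the slides leave this as an exercise): keep the (meta-)letters in a binary heap, so each round of "extract the two smallest, insert their merge" costs O(log|A|), and there are |A| - 1 rounds after an O(|A|) heapify.

```python
import heapq
from itertools import count

def huffman_heap(freq):
    """freq: dict letter -> frequency; returns dict letter -> codeword."""
    tiebreak = count()                 # avoids comparing tree nodes on frequency ties
    heap = [(f, next(tiebreak), x) for x, f in freq.items()]
    heapq.heapify(heap)                # O(|A|)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)     # two lowest-frequency nodes...
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))  # ...merged
    _, _, tree = heap[0]
    # Read codewords off root-to-leaf paths (0 = left, 1 = right).
    code, stack = {}, [(tree, "")]
    while stack:
        node, path = stack.pop()
        if isinstance(node, str):
            code[node] = path
        else:
            left, right = node
            stack.append((left, path + "0"))
            stack.append((right, path + "1"))
    return code
```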