Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th.


Outline For Today

1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)

2. Data Encodings & Finding An Optimal Prefix-free Encoding

3. Prefix-free Encodings ↔ Binary Trees

4. Huffman Codes


Recap: Kruskal’s Algorithm Simulation

[Figure: weighted graph on vertices A through H. Edge weights in sorted order: 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9. Kruskal’s algorithm considers the edges in this order and skips the edges of weight 5, 6, 8, and 9, each of which creates a cycle.]

Final Tree! The edges of weight 1, 2, 2.5, 3, 4, 7, and 7.5.

Same as Tprim

Recap: Kruskal’s Algorithm Pseudocode

procedure kruskal(G(V, E)):
    sort E in order of increasing weights
    rename E so w(e1) < w(e2) < … < w(em)
    T = {} // final tree edges
    for i = 1 to m:
        if T ∪ ei=(u,v) doesn’t create cycle
            add ei to T
    return T

Recap: For Correctness We Proved 2 Things

1. Outputs a Spanning Tree Tkrsk

2. Tkrsk is a minimum spanning tree

1: Kruskal Outputs a Spanning Tree (1)

Need to prove Tkrsk is spanning AND is acyclic.

Acyclic is by definition of the algorithm.

Why is Tkrsk spanning (i.e., connected)?

Recall Empty Cut Lemma: A graph is not connected iff ∃ cut (X, Y) with no crossing edges.

If all cuts have a crossing edge -> graph is connected!

For any cut (X, Y) of a connected G, Kruskal adds the lightest crossing edge when it reaches it: no crossing edge is in T yet, so the edge cannot create a cycle. Hence every cut has a crossing edge in Tkrsk.

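2: Kruskal is Optimal (by Cut Property)

Let (u, v) be any edge added by Kruskal’s Algorithm.

u and v are in different components (b/c Kruskal checks for cycles).

[Figure: a cut with u on one side and v on the other; other vertices shown: x, y, t, z, w.]

Claim: (u, v) is min-edge crossing this cut!

Why: take the cut (X, V - X), where X is u’s current component. Any lighter edge crossing it was considered earlier and would have been added, since its endpoints were in different components; that would have put both endpoints in X.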

Kruskal’s Runtime

procedure kruskal(G(V, E)):
    sort E in order of increasing weights   // O(mlog(n))
    rename E so w(e1) < w(e2) < … < w(em)
    T = {} // final tree edges
    for i = 1 to m:   // m iterations
        if T ∪ ei=(u,v) doesn’t create cycle   // cost: ?
            add ei to T
    return T

Option 1: check if a u ⤳ v path exists in T!

Run a BFS/DFS from u or v => O(|T| + n) = O(n)

Can we speed up cycle checking?

***BFS/DFS Total Runtime: O(mn)***
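The Option 1 baseline can be sketched in Python as follows. This is a minimal illustration, assuming edges are given as (weight, u, v) tuples; the names `has_path`, `kruskal_naive`, and the small example graph are hypothetical and not from the lecture.

```python
from collections import defaultdict


def has_path(adj, src, dst):
    """Iterative DFS over the tree edges chosen so far: is dst reachable from src?"""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def kruskal_naive(edges):
    """Kruskal with an O(n) DFS cycle check per edge, so O(mn) after sorting."""
    adj = defaultdict(list)  # adjacency lists of the partial tree T
    tree = []
    for w, u, v in sorted(edges):
        if not has_path(adj, u, v):  # adding (u, v) would not close a cycle
            adj[u].append(v)
            adj[v].append(u)
            tree.append((w, u, v))
    return tree


# Hypothetical 4-vertex example (not the lecture's graph).
EDGES = [(1, "A", "B"), (2, "B", "C"), (3, "A", "C"), (4, "C", "D")]
print(kruskal_naive(EDGES))
```

Each DFS only walks edges already in T, which has at most n - 1 edges, which is why a single check costs O(n).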

Speeding Up Kruskal’s Algorithm

Goal: Check for cycles in log(n) time.

Observation: (u, v) creates a cycle iff u and v are in the same connected component

Option 2: check if u’s component = v’s component

More Specific Goal: check the component of each vertex in log(n) time

Union-Find Data Structure

Operation 1 (Union): Maintain the component structure of T as we add new edges to it.

Operation 2 (Find): Query the component of each vertex v

Kruskal’s With Union-Find (Conceptually)

[Figure: the same graph on vertices A through H; each vertex is labeled with the leader of its component, initially itself. Edge weights in sorted order: 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9.]

Weight 1, edge (A, D): Find(A) = A, Find(D) = D => Union(A, D)
Weight 2, edge (D, E): Find(D) = A, Find(E) = E => Union(A, E)
Weight 2.5, edge (C, F): Find(C) = C, Find(F) = F => Union(C, F)
Weight 3, edge (E, F): Find(E) = A, Find(F) = C => Union(A, C)
Weight 4, edge (A, B): Find(A) = A, Find(B) = B => Union(A, B)
Weight 5, edge (D, C): Find(D) = A, Find(C) = A => Skip (D, C)
Weight 6, edge (A, C): Find(A) = A, Find(C) = A => Skip (A, C)
Weight 7, edge (C, H): Find(C) = A, Find(H) = H => Union(A, H)
Weight 7.5, edge (F, G): Find(F) = A, Find(G) = G => Union(A, G)
Weight 8, edge (B, C): Find(B) = A, Find(C) = A => Skip (B, C)
Weight 9, edge (H, G): Find(H) = A, Find(G) = A => Skip (H, G)

Union-Find Implementation Simulation

[Figure: each component is drawn as a tree of pointers toward its leader; the number next to a leader is the size of its component.]

Start: A1 B1 C1 D1 E1 F1 G1 H1
Union(A, D): D points to A => A2
Union(A, E): E points to A => A3
Union(C, F): F points to C => C2
Union(A, C): C points to A => A5
Union(A, B): B points to A => A6
Union(A, H): H points to A => A7
Union(A, G): G points to A => A8

Linked Structure Per Connected Component

[Figure: a component of size 7 with leader X; the other members C, A, W, Z, Y, T point toward X.]

Union Operation

[Figure: the component with leader X (size 7) and a component with leader E (size 3, members E, F, G). After the union, E points to X and X’s size becomes 10.]

Union: **Make Leader of Small Component Point to the leader of Large Component**

Cost: O(1) (1 pointer update, 1 increment)

Find Operation

Find: “pointer chase” until the leader

Cost: # pointers to leader = ?

Cost of Find Operation

Claim: For any v, #-pointers to leader(v) ≤ log2(|component(v)|) ≤ log2(n)

Proof: Each time v’s path to leader increases by 1, the size of its component at least doubles (v’s leader was made to point to the leader of a component at least as large)!

|component(v)| starts at 1 and increases to at most n, therefore it can double at most log2(n) times!

Summary of Union-Find

Initialization: Each v is a component of size 1 and points to itself.

When we union two components, we make the leader of the smaller one point to the leader of the larger one (break ties arbitrarily).

Find(v): Pointer chasing to the leader. Cost: O(log2(|component|)) = O(log2(n))

Union(u, v), for leaders u and v: 1 pointer update, 1 increment => O(1)
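The structure summarized here can be sketched in Python as follows. This is a minimal illustration of union by size without path compression; the class name `UnionFind` and the dictionary-based representation are assumptions made for the example, not the lecture’s implementation.

```python
class UnionFind:
    """Union by size without path compression."""

    def __init__(self, items):
        self.parent = {x: x for x in items}  # every item starts as its own leader
        self.size = {x: 1 for x in items}    # sizes are only meaningful at leaders

    def find(self, x):
        """Chase parent pointers until reaching the leader: O(log n) hops."""
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the components led by a and b; both arguments must be leaders."""
        if a == b:
            return a
        if self.size[a] < self.size[b]:
            a, b = b, a               # make a the leader of the larger component
        self.parent[b] = a            # 1 pointer update
        self.size[a] += self.size[b]  # 1 size update
        return a


# The first four unions of the lecture's walk-through.
uf = UnionFind("ABCDEFGH")
uf.union(uf.find("A"), uf.find("D"))
uf.union(uf.find("D"), uf.find("E"))
uf.union(uf.find("C"), uf.find("F"))
uf.union(uf.find("E"), uf.find("F"))
print(uf.find("F"), uf.size["A"])
```

After these four unions, F reaches leader A in two hops (F to C, C to A), and A’s component has size 5.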

Kruskal’s Runtime With Union-Find

procedure kruskal(G(V, E)):
    sort E in order of increasing weights   // O(mlog(n))
    rename E so w(e1) < w(e2) < … < w(em)
    init Union-Find   // O(n)
    T = {} // final tree edges
    for i = 1 to m:   // m iterations
        ei=(u,v)
        if find(u) != find(v)   // log(n)
            add ei to T
            Union(find(u), find(v))   // O(1)
    return T

***Total Runtime: O(mlog(n))***

Same as Prim’s with heaps
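A minimal Python sketch of Kruskal’s algorithm with union by size follows. The edge list is an assumption: the endpoints are read off the Find/Union/Skip walk-through earlier in the lecture, paired with the sorted weights in order. The function name `kruskal` and the inlined union-find are illustrative choices.

```python
def kruskal(vertices, edges):
    """Kruskal's algorithm with union by size; edges are (weight, u, v) tuples."""
    parent = {v: v for v in vertices}
    size = {v: 1 for v in vertices}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):   # O(m log n) sort
        ru, rv = find(u), find(v)   # O(log n) each
        if ru != rv:                # different components: no cycle
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru         # O(1) union of the two leaders
            size[ru] += size[rv]
            tree.append((w, u, v))
    return tree


# Assumed reconstruction of the lecture's graph from the walk-through.
EDGES = [
    (1, "A", "D"), (2, "D", "E"), (2.5, "C", "F"), (3, "E", "F"),
    (4, "A", "B"), (5, "D", "C"), (6, "A", "C"), (7, "C", "H"),
    (7.5, "F", "G"), (8, "B", "C"), (9, "H", "G"),
]
mst = kruskal("ABCDEFGH", EDGES)
print([w for w, _, _ in mst])
```

On this input the selected weights are 1, 2, 2.5, 3, 4, 7, 7.5, matching the final tree of the simulation.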


Data Encodings and Compression

All data in the digital world gets represented as 0s and 1s.

010010100010010100011110110010010101010110100001110100010011000010010101011010100010

Encoding-Decoding Protocol

[Figure: an encoder turns the input into a binary blob, and a decoder turns the blob back into the input.]

Goal of Data Compression: Make the binary blob as small as possible, satisfying the protocol.

Option 1: Fixed Length Codes

Alphabet A = {a, b, c, …, z}, assume |A| = 32

Each letter mapped to exactly 5 bits: a => 00000, b => 00001, …, z => 11111

Example: ASCII encoding

Example: Fixed Length Codes

A = {a, b, c, …, z}

cat => encoder => 000110000010100 => decoder => cat

Output Size of Fixed Length Codes

Input: Alphabet A, text document of length n

Each letter is mapped to log2(|A|) bits

Output Size: nlog2(|A|)

Optimal if letters appear with same frequencies in text!

In practice, letters appear with different frequencies

Ex: In English, letters a, t, e are much more frequent than q, z, x

Question: Can we do better?

Option 2: Variable Length Binary Codes

Goal is to assign:

Frequently appearing letters => short bit strings

Infrequently appearing ones => long bit strings

Hope: On average have ≤ nlog2(|A|) encoded bits for documents of size n (or ≤ log2(|A|) bits per letter)

Example 1: Morse Code (not binary)

Two Symbols: Dot (●) and Dash (−), or light and dark

But end of a letter is indicated with a pause P (effectively a third symbol)

Frequents: e => ●, t => −, a => ●−

Infrequents: c => −●−●, j => ●−−−

cat => encoder => −●−●P●−P−P => decoder => cat (−●−●P = c, ●−P = a, −P = t)

Can We Have a Morse Code with 2 Symbols?

Goal: Same idea as the Morse code but with only 2 symbols.

Frequents: e => 0, t => 1, a => 01

Infrequents: c => 1010, j => 0111

cat => encoder => 1010011 (1010 = c, 01 = a, 1 = t)

1010011 => decoder => taeett? teteat? cat?

**Decoding is Ambiguous**
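The ambiguity can be checked mechanically. The Python sketch below enumerates every way to split a bit string into codewords of the two-symbol code from this slide; the function name `all_decodings` is an illustrative choice.

```python
CODE = {"e": "0", "t": "1", "a": "01", "c": "1010", "j": "0111"}


def all_decodings(bits, code=CODE):
    """Return every way to split bits into codewords of the given code."""
    if not bits:
        return [""]
    results = []
    for letter, word in code.items():
        if bits.startswith(word):
            for rest in all_decodings(bits[len(word):], code):
                results.append(letter + rest)
    return results


decodings = all_decodings("1010011")
print(sorted(decodings))
```

The three readings named on the slide (cat, taeett, teteat) all appear in the result, so the encoded string alone does not determine the original text.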

Why Was There Ambiguity?

The encoding of one letter was a prefix of another letter’s encoding.

Ex: e => 0 is a prefix of a => 01

Goal: Use a “prefix-free” encoding, i.e. no letter’s encoding is a prefix of another’s!

Note: Fixed-length encoding was naturally “prefix-free”.

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d}

a => 0, b => 10, c => 110, d => 111

110010 => decode => cab (110 = c, 0 = a, 10 = b)

11101101100 => decode => dacca (111 = d, 0 = a, 110 = c, 110 = c, 0 = a)
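The left-to-right decoding shown in this example can be sketched in Python as follows, using the code a => 0, b => 10, c => 110, d => 111 from the slides. The function name `decode` is illustrative.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def decode(bits, code=CODE):
    """Greedy left-to-right decoding; unambiguous because the code is prefix-free."""
    by_word = {word: letter for letter, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_word:  # a complete codeword: emit it and start over
            out.append(by_word[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


print(decode("110010"))
print(decode("11101101100"))
```

Because no codeword is a prefix of another, the first codeword that matches is the only one that can match, so no backtracking is needed.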

Benefits of Variable Length Codes

Ex: A = {a, b, c, d}, Frequencies: a: 45% b: 40% c: 10% d: 5%

Variable Length Code: a => 0, b => 10, c => 110, d => 111

Fixed Length Code: a => 00, b => 01, c => 10, d => 11

A document of length 100K:

Fixed Length Code: 200K bits (2 bits/letter)

Variable Length Code: a: 45K, b: 80K, c: 30K, d: 15K. Total: 170K bits (1.7 bits/letter)
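The arithmetic behind these totals can be reproduced with a short Python sketch. The helper names `bits_per_letter` and `total_bits` are illustrative; the frequencies and both codes are the ones given on the slide.

```python
FREQ = {"a": 0.45, "b": 0.40, "c": 0.10, "d": 0.05}
VARIABLE = {"a": "0", "b": "10", "c": "110", "d": "111"}
FIXED = {"a": "00", "b": "01", "c": "10", "d": "11"}


def bits_per_letter(code, freq=FREQ):
    """Expected number of encoded bits per letter under the given frequencies."""
    return sum(freq[x] * len(code[x]) for x in freq)


def total_bits(code, n_letters, freq=FREQ):
    """Expected encoded size of a document with n_letters letters."""
    return n_letters * bits_per_letter(code, freq)


print(bits_per_letter(FIXED), bits_per_letter(VARIABLE))
print(total_bits(VARIABLE, 100_000))
```

Up to floating-point rounding, this gives 2 and 1.7 bits per letter, and 170K bits for the 100K-letter document.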

Formal Problem Statement

Input: An alphabet A, and frequencies 𝓕 of letters in A

Output: a prefix-free encoding Ɣ, i.e. a mapping A -> {0,1}*, that minimizes the average bits per letter


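Prefix-free Encodings ↔ Binary Trees

We can represent each prefix-free code Ɣ as a binary tree T as follows:

Encoding of letter x = path from the root to the leaf with x

Code 1: a => 0, b => 10, c => 110, d => 111

[Figure: tree for Code 1. From the root, branch 0 leads to leaf a and branch 1 to an internal node; from that node, branch 0 leads to leaf b and branch 1 to an internal node with leaves c (branch 0) and d (branch 1).]

Code 2: a => 00, b => 01, c => 10, d => 11

[Figure: tree for Code 2. The root has two internal children; the one on branch 0 has leaves a (branch 0) and b (branch 1), the one on branch 1 has leaves c (branch 0) and d (branch 1).]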

Reverse is Also True

Claim: Each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves in T

[Figure: a tree with leaves a, b, c, d, e giving the code a => 01, b => 10, c => 000, d => 001, e => 11]

Why is this code prefix-free?

Proof: Take the path P ∈ {0,1}* to leaf x as x’s encoding.

Since each letter x is at a leaf, the path from the root to x is a dead-end and cannot be part of a path to another letter y.
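The tree-to-code direction can be sketched in Python as follows. The nested-tuple representation of a tree (a leaf is a letter, an internal node is a pair whose first entry is reached by bit 0) and the names `codes_from_tree` and `is_prefix_free` are assumptions made for this example; the tree is the five-letter one from the slide.

```python
def codes_from_tree(tree, prefix=""):
    """Map each leaf letter to the 0/1 path from the root.

    A tree is either a letter (leaf) or a pair (left, right) of subtrees,
    where left is reached by bit 0 and right by bit 1.
    """
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    codes = codes_from_tree(left, prefix + "0")
    codes.update(codes_from_tree(right, prefix + "1"))
    return codes


def is_prefix_free(codes):
    """Check that no codeword is a prefix of a different letter's codeword."""
    words = list(codes.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


TREE = ((("c", "d"), "a"), ("b", "e"))
print(codes_from_tree(TREE))
```

Every codeword ends at a leaf, so extending it never reaches another letter; `is_prefix_free` confirms this for the example tree and rejects the earlier ambiguous pair e => 0, a => 01.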

Number of Bits for Letter x?

Let A be an alphabet, and T be a binary tree where letters of A are the leaves of T

Question: What’s the number of bits for each letter x in the encoding corresponding to T?

Answer: depthT(x)

[Figure: the tree for the code a => 0, b => 10, c => 110, d => 111, where a has depth 1, b has depth 2, and c and d have depth 3.]

Formal Problem Statement Restated

Input: An alphabet A, and frequencies 𝓕 of letters in A

Output: A binary tree T, where letters of A are the leaves of T, that has the minimum average bit length (ABL):

ABL(T) = Σ_{x ∈ A} f(x) * depthT(x)


Observation 1 About Optimal T

Claim: The optimal binary tree T is full, i.e., each non-leaf vertex u has exactly 2 children

Why?

[Figure: two examples (leaves a, b, c, e and leaves a, b, c) of a tree T in which some internal vertex u has only one child, each next to the tree T` obtained by replacing u with that child.]

Exchange Argument: Can replace u with its only child and decrease the depths of some leaves, giving a better tree T`.

First Algorithm: Shannon-Fano Codes

From 1948

Top-down Divide-and-Conquer type approach

1. Divide the alphabet into A0 and A1 s.t. the total frequency of the letters in each of A0 and A1 is roughly 50%

2. Find an encoding Ɣ0 for A0, and Ɣ1 for A1

3. Prepend 0 to the encodings of Ɣ0 and 1 to Ɣ1

Ex: A = {a, b, c, d}, Frequencies: a: 45% b: 40% c: 10% d: 5%

A0 = {a, d}, A1 = {b, c}

[Figure: the resulting tree has all four leaves at depth 2, with a and d under one child of the root and b and c under the other.]

Fixed-length encoding, which we saw was suboptimal!

Observation 2 About Optimal T

Claim: In any optimal tree T, if leaf x has depth i and leaf y has depth j s.t. i < j, then f(x) ≥ f(y)

Why?

Exchange Argument: If f(x) < f(y), swap x and y and get a better tree T`.

Ex: A = {a, b, c, d}, Frequencies: a: 45% b: 40% c: 10% d: 5%

T: c at depth 1, b at depth 2, a and d at depth 3 => 2.4 bits/letter

T`: a at depth 1, b at depth 2, c and d at depth 3 => 1.7 bits/letter
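The exchange argument can be checked numerically with a short Python sketch. The depth assignment for T is an assumption read off the slide’s figure (c at depth 1, b at depth 2, a and d at depth 3), chosen because it reproduces the stated 2.4 bits/letter; the names `abl` and `swap` are illustrative.

```python
FREQ = {"a": 0.45, "b": 0.40, "c": 0.10, "d": 0.05}


def abl(depth, freq=FREQ):
    """Average bit length: sum of f(x) * depth(x) over all letters x."""
    return sum(freq[x] * depth[x] for x in freq)


def swap(depth, x, y):
    """Return a copy of the depth assignment with leaves x and y exchanged."""
    swapped = dict(depth)
    swapped[x], swapped[y] = depth[y], depth[x]
    return swapped


T = {"c": 1, "b": 2, "a": 3, "d": 3}  # assumed depths: frequent letter a sits deep
T_PRIME = swap(T, "a", "c")           # move a up to depth 1, c down to depth 3
print(abl(T), abl(T_PRIME))
```

Swapping a and c changes the cost by (0.45 - 0.10) * (1 - 3) = -0.7, which accounts for the drop from 2.4 to 1.7 bits/letter.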

Corollary

In any optimal tree T the two lowest frequency letters are both in the lowest level of the tree!

Huffman’s Key Insight

Observation 1 => optimal Ts are full => each leaf has a sibling

Corollary => 2 lowest freq. letters x, y are at the same level

Changing letters across the same level does not change the cost of T

=> There is an optimal tree T, in which the two lowest frequency letters are siblings (in the lowest level of the tree).

[Figure: a tree with leaves a, b, c, d in which c and d are siblings at the lowest level.]

Possible Greedy Algorithm

1. Let x, y be the two lowest frequency letters; since we can take them to be siblings, treat them as a single meta-letter xy

2. Find an optimal tree T* with A-{x, y} + {xy}

3. Expand xy back into x and y in T*

Possible Greedy Algorithm (Example)

Ex: A = {x, y, z, t}, and let x, y be the two lowest freq. letters

Let A` = {xy, z, t}

[Figure: T* is a tree for A` with leaves xy, z, t; T is obtained from T* by replacing leaf xy with an internal node whose children are x (branch 0) and y (branch 1).]

The weight of meta-letter?

Q: What weight should be attached to the meta-letter xy?

A: f(x) + f(y)

Huffman’s Algorithm (1951)

procedure Huffman(A, 𝓕):
    if (|A|=2):
        return T where branch 0, 1 point to A[0] and A[1], respectively
    let x, y be lowest two frequency letters
    let A` = A-{x,y}+{xy}
    let 𝓕` = 𝓕 - {x, y} + {xy: f(x) + f(y)}
    T* = Huffman(A`, 𝓕`)
    expand x, y in T* to get T
    return T
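The recursive procedure can be sketched in Python as follows. This is one possible rendering under stated assumptions: the function returns a letter-to-codeword dictionary instead of an explicit tree, meta-letters are represented as tuples (x, y), and the name `huffman` and the integer frequencies are illustrative.

```python
def huffman(freq):
    """Recursive Huffman following the pseudocode; returns {letter: codeword}.

    freq maps each letter to its frequency and must have at least 2 entries.
    Meta-letters are represented as tuples (x, y), an illustrative choice.
    """
    letters = sorted(freq, key=freq.get)
    if len(letters) == 2:
        return {letters[0]: "0", letters[1]: "1"}
    x, y = letters[0], letters[1]        # two lowest-frequency letters
    merged = {k: f for k, f in freq.items() if k not in (x, y)}
    merged[(x, y)] = freq[x] + freq[y]   # meta-letter xy with weight f(x) + f(y)
    codes = huffman(merged)              # code for the smaller alphabet A`
    prefix = codes.pop((x, y))           # expand xy back into x and y
    codes[x] = prefix + "0"
    codes[y] = prefix + "1"
    return codes


FREQ = {"a": 45, "b": 40, "c": 10, "d": 5}
CODES = huffman(FREQ)
print({x: len(w) for x, w in sorted(CODES.items())})
```

On the running example the codeword lengths come out as 1, 2, 3, 3 for a, b, c, d, i.e. 1.7 bits per letter. Re-sorting at every level makes this version quadratic in |A|.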

Huffman’s Algorithm Correctness (1)

By induction on |A|

Base case: |A| = 2 => return simple full tree with 2 leaves

IH: Assume true for all alphabets of size k-1

For |A| = k, Huffman will get an optimal tree T_{k-1}^{opt} for A` (of size k-1, with meta-letter xy) and expand xy

Huffman’s Algorithm Correctness (2)

[Figure: T_{k-1}^{opt} with leaves xy, z, t, next to T, in which leaf xy is replaced by an internal node with children x and y.]

In T_{k-1}^{opt}, xy contributes f(xy)*depth(xy) = (f(x) + f(y))*depth(xy)

In T, x and y contribute (f(x) + f(y))*(depth(xy) + 1)

Total diff = f(x) + f(y)

Huffman’s Algorithm Correctness (3)

Take any optimal Z, we’ll argue ABL(T) ≤ ABL(Z)

By the corollary and Huffman’s key insight we can assume that in Z, x and y are also siblings at the lowest level.

Consider Z` obtained by merging them => Z` is a valid prefix-free code for A` of size k-1

ABL(Z) = ABL(Z`) + f(x) + f(y)

ABL(T) = ABL(T`) + f(x) + f(y), where T` = T_{k-1}^{opt}

By IH: ABL(T`) ≤ ABL(Z`) => ABL(T) ≤ ABL(Z)

Q.E.D.

Huffman’s Algorithm Runtime

Exercise: Make Huffman run in O(|A|log(|A|))?
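One possible approach to the exercise, sketched in Python under stated assumptions: keep the (meta-)letters in a binary min-heap keyed by frequency, so that each of the |A| - 1 merges costs O(log |A|). The name `huffman_heap`, the tuple representation of internal nodes, and the tie-breaking counter are illustrative choices, and letters are assumed not to be tuples themselves.

```python
import heapq
from itertools import count


def huffman_heap(freq):
    """Build a Huffman tree with a min-heap and return {letter: codeword}.

    freq must have at least 2 letters, none of which is a tuple.
    """
    tiebreak = count()  # unique second field, so nodes are never compared directly
    heap = [(f, next(tiebreak), x) for x, f in freq.items()]
    heapq.heapify(heap)                   # O(|A|)
    while len(heap) > 1:                  # |A| - 1 merges, O(log |A|) each
        f1, _, left = heapq.heappop(heap)   # lowest frequency
        f2, _, right = heapq.heappop(heap)  # second lowest frequency
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    root = heap[0][2]

    codes = {}
    stack = [(root, "")]
    while stack:                          # read codewords off the finished tree
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix
    return codes


FREQ = {"a": 45, "b": 40, "c": 10, "d": 5}
HEAP_CODES = huffman_heap(FREQ)
print({x: len(w) for x, w in sorted(HEAP_CODES.items())})
```

The tree itself is built in O(|A| log |A|) time; writing out the codewords afterwards takes time proportional to their total length.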