Data Structures
Huffman Codes
Huffman Codes
• Optimal static technique for data compression using binary codes
– Savings of 20% to 90% are typical
– Depending on the characteristics of the data
• Binary codes
– Fixed-length
– Variable-length
Example
• Suppose we have a 100,000-character data file over the alphabet {a, b, c, d, e, f}
• Frequencies (in thousands) and codewords:
– a: 45, fixed-length 000, variable-length 0
– b: 13, fixed-length 001, variable-length 101
– c: 12, fixed-length 010, variable-length 100
– d: 16, fixed-length 011, variable-length 111
– e: 9, fixed-length 100, variable-length 1101
– f: 5, fixed-length 101, variable-length 1100
• Fixed-length
– 3 bits per character, so 3 · 100,000 = 300,000 bits to code the entire file
• Variable-length
– (45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4) · 1,000 = 224,000 bits
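A quick arithmetic check of both totals, as a minimal Python sketch (the variable names and the total_bits helper are illustrative, not from the slides):

    # Frequencies in thousands, and the two codes from the table above
    freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
    fixed = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100', 'f': '101'}
    variable = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

    def total_bits(code):
        # Total = sum of frequency * codeword length, rescaled from thousands
        return sum(freq[c] * len(code[c]) for c in freq) * 1000

    print(total_bits(fixed))     # 300000
    print(total_bits(variable))  # 224000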
Prefix Codes
• Encoding is always simple for any binary character code
– Just concatenate the codewords representing each character
• A variable-length code is not necessarily uniquely decodable
– With the code: 0 01 10 101
– The string 00101 parses as 0 · 0 · 101 and also as 0 · 01 · 01
• Prefix codes
– No codeword is a prefix of any other codeword
– Decoding is unambiguous
• Example (see the decoder sketch below)
– With the code: 0 101 100 111 1101 1100
– 001011101 parses uniquely as 0 · 0 · 101 · 1101
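Because no codeword is a prefix of another, a decoder can commit to a codeword as soon as the bits read so far match one. A minimal Python sketch, assuming the example code above (the code table mapping codewords to the letters a–f is an illustrative choice):

    code = {'0': 'a', '101': 'b', '100': 'c', '111': 'd', '1101': 'e', '1100': 'f'}

    def decode(bits):
        out, word = [], ''
        for bit in bits:
            word += bit
            if word in code:          # the prefix property makes this match unambiguous
                out.append(code[word])
                word = ''
        return ''.join(out)

    print(decode('001011101'))  # 'aabe', i.e. 0 · 0 · 101 · 1101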
Representation of Prefix Codes
• A binary tree whose leaves are the given characters
• Each node x also stores the total frequency of the characters in the sub-tree rooted at x
• Decoding 001011101
– Under the fixed-length code it parses as 001 · 011 · 101
– Under the variable-length prefix code it parses uniquely as 0 · 0 · 101 · 1101
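Decoding walks this tree: start at the root, follow the left child on 0 and the right child on 1, and emit a character on reaching a leaf. A sketch under an assumed tuple representation (leaves are one-character strings, internal nodes are (left, right) pairs; the representation is mine, not the slides'):

    # Tree for the code a=0, c=100, b=101, f=1100, e=1101, d=111
    tree = ('a', (('c', 'b'), (('f', 'e'), 'd')))

    def decode(bits, root):
        out, node = [], root
        for bit in bits:
            node = node[0] if bit == '0' else node[1]   # 0 = left edge, 1 = right edge
            if isinstance(node, str):                   # reached a leaf: emit, restart at root
                out.append(node)
                node = root
        return ''.join(out)

    print(decode('001011101', tree))  # 'aabe'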
Cost of Binary Code Tree
• f(c) is the frequency of c in the file
• d_T(c) is the depth of c’s leaf in the tree (equivalently, the length of c’s codeword)
• The cost of a tree T is B(T) = Σ_{c ∈ C} f(c) · d_T(c)
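B(T) can also be computed by a single walk over the tree, accumulating depth on the way down. A minimal sketch, reusing the tuple tree representation assumed in the decoder above:

    freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}  # in thousands
    tree = ('a', (('c', 'b'), (('f', 'e'), 'd')))

    def cost(node, depth=0):
        # Leaves contribute f(c) * d_T(c); internal nodes recurse into both children
        if isinstance(node, str):
            return freq[node] * depth
        return cost(node[0], depth + 1) + cost(node[1], depth + 1)

    print(cost(tree) * 1000)  # 224000, matching the earlier calculation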
Constructing a Huffman Code
• Input:
– A set of n characters C
– For each character c ∈ C, the frequency f(c) of c
• Output:
– A tree T, called the Huffman tree, corresponding to an optimal prefix code
• General idea:
– Assign the characters that occur more frequently a shorter code (near the top of the tree)
• Method:
– Begin with a set of the |C| leaves
– Repeatedly, using a min-priority queue Q keyed on frequencies, identify and merge the two least-frequent objects
– Until all objects are merged into a single tree
Constructing a Huffman Code
Huffman(C)
    n = |C|
    Q = min-priority queue of C, keyed on freq
    for (i = 1 to n − 1) do
        allocate a new node z
        z.left = x = ExtractMin(Q)
        z.right = y = ExtractMin(Q)
        z.freq = x.freq + y.freq
        Insert(Q, z)
    return ExtractMin(Q)
• Running time O(n log n), where n = |C|: building Q takes O(n), and each of the n − 1 iterations performs a constant number of O(log n) queue operations
• The Huffman code is not unique (ties in the queue may be broken arbitrarily)
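The same procedure as a runnable Python sketch, using heapq as the min-priority queue; the function names and the tuple tree representation are illustrative choices, not part of the slides:

    import heapq

    def huffman_tree(freq):
        # Heap entries are (frequency, tie_breaker, tree); the integer tie
        # breaker keeps heapq from ever comparing two trees directly.
        heap = [(f, i, c) for i, (c, f) in enumerate(sorted(freq.items()))]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f_x, _, x = heapq.heappop(heap)  # x = ExtractMin(Q)
            f_y, _, y = heapq.heappop(heap)  # y = ExtractMin(Q)
            heapq.heappush(heap, (f_x + f_y, counter, (x, y)))  # merged node z
            counter += 1
        return heap[0][2]  # root of the Huffman tree

    def codewords(node, prefix=''):
        # Read the code off the tree: left edge = 0, right edge = 1
        if isinstance(node, str):
            return {node: prefix or '0'}
        left, right = node
        return {**codewords(left, prefix + '0'), **codewords(right, prefix + '1')}

    print(codewords(huffman_tree({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5})))
    # {'a': '0', 'c': '100', 'b': '101', 'f': '1100', 'e': '1101', 'd': '111'}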
Correctness
• Huffman codes are prefix codes
– All characters sit at leaf nodes, so no codeword (a root-to-leaf path) is a proper prefix of another: a proper prefix would end at an internal node
Correctness
• Claim:
– An optimal prefix code tree is a full tree (every internal node has two children)
– Otherwise, a node with a single child could be spliced out, shortening codewords and strictly decreasing the cost
– (The Huffman tree is a full tree by construction)
Correctness
• Lemma:
– There is an optimal prefix code tree in which the two symbols with smallest frequencies are siblings (in the last level)
• Proof:
– An optimal tree is a full binary tree, so its last level contains two sibling leaves
– If necessary, interchange the two symbols with smallest frequencies with two sibling symbols in the last level of the tree; this does not increase the cost (see the computation below)
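Why the interchange cannot increase the cost: swap a smallest-frequency symbol x with a symbol b at greater depth, so that f(x) ≤ f(b) and d_T(x) ≤ d_T(b), and call the resulting tree T′. In the notation of the cost function B(T):

    B(T) − B(T′) = f(x)·d_T(x) + f(b)·d_T(b) − f(x)·d_T(b) − f(b)·d_T(x)
                 = (f(b) − f(x)) · (d_T(b) − d_T(x)) ≥ 0

Hence B(T′) ≤ B(T), and the swapped tree is still optimal.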
Correctness
• Theorem:
– The Huffman code is an optimal prefix code
• Proof idea (from the slide figure): if the Huffman tree built for the smaller, merged alphabet is optimal, then its extension, re-expanding the merged leaf into the two siblings, is an optimal Huffman tree too; the argument is by induction
Proof of the Theorem
• By induction on the size of the alphabet.
• For |C| = 2 the result is trivially true.
• Assume by induction that the result holds whenever |C| < k; let |C| = k and let T be a Huffman tree of C.
• For the sake of contradiction, assume that T is not optimal; that is, there is a tree S such that B(S) < B(T).
• By the lemma, we can assume that the two symbols x and y with smallest frequencies are siblings in S.
• By the algorithm, since x and y are minimal, they are siblings in T.
• Let T′ and S′ be the trees obtained from T and S, respectively, by removing these two siblings and replacing their parent by a leaf with frequency x.freq + y.freq.
• T′ is a Huffman tree of a smaller alphabet than C, so by induction it is an optimal prefix code tree.
• B(S′) + x.freq + y.freq = B(S) < B(T) = B(T′) + x.freq + y.freq (see the computation below).
• It follows that B(S′) < B(T′), which contradicts the optimality of T′.
• Hence, T is optimal.
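The two equalities above follow directly from the cost function: x and y are sibling leaves at some depth d, their parent becomes a leaf of frequency x.freq + y.freq at depth d − 1 in the reduced tree, and every other leaf keeps its depth, so

    B(T) − B(T′) = (f(x) + f(y))·d − (f(x) + f(y))·(d − 1) = f(x) + f(y)

and the same computation gives B(S) = B(S′) + f(x) + f(y).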