15-211 Fundamental Data Structures and Algorithms

Aleks Nanevski
February 10, 2004

based on a lecture by Peter Lee

LZW Compression


Last Time…

Problem: data compression

Convert a string into a shorter string.
• Lossless – represents exactly the same information.
• Lossy – approximates the original information.

Uses of compression:
• Images over the web: JPEG
• Music: MP3
• General-purpose: ZIP, GZIP, JAR, …


Huffman trees


Huffman compression

Huffman trees provide a straightforward method for file compression.

1. Scan the file and compute frequencies

2. Build the code tree

3. Write code tree to the output file as a header

4. Scan input, encode, and write into the output file
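As a reminder of how the frequency scan and the tree building fit together, here is a minimal Python sketch. The function name `huffman_codes` is hypothetical, ties between equal weights are broken arbitrarily, and writing the header to a file is left out, so this illustrates the idea rather than the course's implementation.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each symbol of text to its Huffman bit string."""
    freq = Counter(text)                       # scan and compute frequencies
    # Heap entries are (weight, tiebreak, tree); a tree is a symbol or a pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                         # one distinct symbol: give it one bit
        return {heap[0][2]: "0"}
    count = len(heap)
    while len(heap) > 1:                       # build the code tree
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):            # internal node: 0 = left, 1 = right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                  # leaf: record the code word
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes("aaaabbc")
encoded = "".join(codes[ch] for ch in "aaaabbc")   # encode the input
```

The most frequent symbol ends up with the shortest code word, which is where the saving comes from.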


Huffman decompression

Read the header in the compressed file, and build the code tree

Read the rest of the file, decode using the tree

Write to output

Beating Huffman

How about doing better than Huffman?

Impossible! Huffman’s algorithm gives the optimal prefix code!

Right. But who says we have to use a prefix code?


Suppose we have a file containing

abcdabcdabcdabcdabcdabcd…

This could be expressed very compactly as

abcd^1000

Dictionary-based methods

Here is a simple idea:

Keep track of “words” that we have seen, and replace them with a code number when we see them again.

The code is typically shorter than the word.

We can maintain dictionary entries (word, code) and make additions to the dictionary as we read the input file.


Lempel & Ziv (1977/78)


Fred Hacker’s algorithm…

Fred now knows what to do…

Create the dictionary:

( <the-whole-file>, 1 )

Transmit 1, done.

Fred’s algorithm provides excellent compression, but…

…the receiver does not know what is in the dictionary!

And sending the dictionary is the same as sending the entire uncompressed file.

Thus, we can’t decompress the “1”.



…we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.

LZW Compression: The Binary Version

LZW = variant of Lempel-Ziv compression, by Terry Welch (1984)


Maintaining a dictionary

We need a way of incrementally building up a dictionary during compression in such a way that…

…someone who wants to uncompress can “rediscover” the very same dictionary

And we already know that a convenient way to build a dictionary incrementally is to use a trie

Binary LZW

In this method, we build up binary tries.

In a binary trie, each internal node has two children.

In addition, we will add the following:
• each left edge is marked 0
• each right edge is marked 1
• each leaf has a label from the set {0,…,n}

A binary trie

[Figure: a binary trie with left edges marked 0, right edges marked 1, and labeled leaves; the diagram is not recoverable from the transcript.]


Binary LZW: Compression

1. We start with a binary trie consisting of a root node and two children

left child labeled 0, and right labeled 1

2. We read the bits of the input file, and follow the trie

3. When a leaf is reached, we emit the label at the leaf

4. Then, add two new children to that leaf (converting it into an internal node)


Binary LZW: Compression, pt.2

5. The new left child takes the old label

6. The new right child takes a new label value that is one greater than the current maximum label value
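The six steps above can be sketched in a few lines of Python. The names `Node` and `binary_lzw_compress` are hypothetical, the input is taken as a string of '0' and '1' characters, and raising an error when the input ends inside the trie is a choice made only for this sketch.

```python
class Node:
    """Binary trie node: a leaf carries a label, an internal node has two children."""

    def __init__(self, label):
        self.label = label
        self.children = None  # None for a leaf, otherwise [left, right]


def binary_lzw_compress(bits):
    """Compress a string of '0'/'1' characters into a list of leaf labels."""
    root = Node(None)
    root.children = [Node(0), Node(1)]   # step 1: left child 0, right child 1
    max_label = 1
    codes = []
    node = root
    for bit in bits:
        node = node.children[int(bit)]   # step 2: follow the trie
        if node.children is None:        # step 3: reached a leaf, emit its label
            codes.append(node.label)
            max_label += 1
            # steps 4-6: the leaf becomes internal; the left child keeps the
            # old label, the right child gets the next unused label
            node.children = [Node(node.label), Node(max_label)]
            node.label = None
            node = root
    if node is not root:
        # Assumption of this sketch: refuse input that ends inside the trie.
        raise ValueError("input ended in the middle of the trie")
    return codes


print(binary_lzw_compress("10010110011"))  # [1, 0, 3, 4, 0, 2]
```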

Binary LZW: Compression example

[Figures: the trie after each compression step; the diagrams are not recoverable from the transcript, only the cumulative output.]

Input: 10010110011

Output: 1
Output: 10
Output: 103
Output: 1034
Output: 10340
Output: 103402

Binary LZW output

So from the input

10010110011

we get output

103402

To represent this output we can keep track of the number of labels n each time we emit a code, and use log(n) bits for that code.
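One way to read this remark, sketched in Python under stated assumptions: log(n) is taken to mean the base-2 logarithm rounded up, n starts at 2 (the labels 0 and 1) and grows by one with every emitted code, and the helper name `encode_codes` is hypothetical. The receiver can track n in the same way, so it knows how many bits to read for each code.

```python
def encode_codes(codes, initial_labels=2):
    """Write each code with just enough bits for the labels that exist so far."""
    n = initial_labels
    pieces = []
    for code in codes:
        width = max(1, (n - 1).bit_length())       # ceil(log2(n)) for n >= 2
        pieces.append(format(code, "0{}b".format(width)))
        n += 1                                     # each emitted code adds one label
    return pieces


print(encode_codes([1, 0, 3, 4, 0, 2]))
# ['1', '00', '11', '100', '000', '010']
```

For this input that is 14 bits in total, which is still longer than the 11 input bits.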

Binary LZW output

We started with input 10010110011

Encoded it as 103402, for which we get the bit sequence 001 000 011 100 000 010

This looks like an expansion instead of a compression: 18 output bits for 11 input bits.

But what if we have a larger input, with more repeating sequences?

Try it!


Binary LZW output

One can also use Huffman compression on the output…


Binary LZW termination

Note that binary LZW has a serious problem, in that the input might end while we are in the middle of the trie (instead of at a leaf node)

This is a nasty problem, which is why we won’t use this binary method.

But this is still good for illustration.



Binary LZW: Uncompress

To uncompress, we need to read the compressed file and rebuild the same trie as we go along

To do this, we need to maintain the trie and also the maximum label value
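A hedged Python sketch of the receiver's side. Instead of rebuilding the trie node by node, it keeps an equivalent table from each label to the root-to-leaf bit path of the leaf currently holding that label; this swap is made only to keep the example short, and the name `binary_lzw_uncompress` is hypothetical.

```python
def binary_lzw_uncompress(codes):
    """Rebuild the original bit string from a list of leaf labels."""
    paths = {0: "0", 1: "1"}   # initial trie: left leaf 0, right leaf 1
    max_label = 1
    out = []
    for code in codes:
        path = paths[code]     # the bits the compressor followed to reach this leaf
        out.append(path)
        max_label += 1
        # The leaf grows two children: left keeps the old label,
        # right takes the new maximum label.
        paths[code] = path + "0"
        paths[max_label] = path + "1"
    return "".join(out)


print(binary_lzw_uncompress([1, 0, 3, 4, 0, 2]))  # 10010110011
```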

Binary LZW: Uncompress example

[Figures: the trie rebuilt after each code is read; the diagrams are not recoverable from the transcript, only the cumulative output.]

Input: 103402

Output: 1
Output: 10
Output: 1001
Output: 1001011
Output: 100101100
Output: 10010110011

LZW Compression: The Byte Version

Byte method

The binary LZW method doesn’t really work
• we show it for illustrative purposes

Instead, we use a slightly more complicated version that works on bytes or characters
• We can think of each byte as a “character” in the range {0…255}

Byte method trie

Instead of a binary trie, we use a more general trie in which
• each node can have up to n children (where n is the size of the alphabet), one for each byte/character
• every node (not just the leaves) has an integer label from the set {0…m}, for some m
• except the root node, which has no label

Byte method LZW

We start with a trie that contains a root and n children
• one child for each possible character
• each child labeled 0…n−1

Then we compress as before, by walking down the trie
• but, after emitting a code and growing the trie, we must start from the root’s child for c, where c is the character that caused us to grow the trie

LZW: Byte method example

Suppose our entire character set consists only of the four letters:

{a, b, c, d}

Let’s consider the compression of the string

baddad

Byte LZW: Compress example

[Figures: the trie dictionary after each step of compressing baddad. The initial trie has a root with children a, b, c, d labeled 0, 1, 2, 3; the later diagrams are not recoverable from the transcript.]

Byte LZW output

So, the input

baddad

compresses to

10335

which again can be given in bit form, just like in the binary method…

…or compressed again using Huffman
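The byte-method walk can be reproduced with a small trie. This is a minimal Python sketch: `TrieNode` and `byte_lzw_compress` are hypothetical names, and the alphabet is passed in explicitly (four letters here) rather than fixed at 256 byte values.

```python
class TrieNode:
    """Trie node: every node except the root carries an integer label."""

    def __init__(self, label):
        self.label = label
        self.children = {}  # maps a character to a child TrieNode


def byte_lzw_compress(text, alphabet):
    """Compress text over the given alphabet into a list of code numbers."""
    root = TrieNode(None)
    for i, ch in enumerate(alphabet):
        root.children[ch] = TrieNode(i)      # one labeled child per character
    next_label = len(alphabet)
    codes = []
    node = root
    for ch in text:
        child = node.children.get(ch)
        if child is not None:
            node = child                     # keep walking down the trie
        else:
            codes.append(node.label)         # emit the code of the longest match
            node.children[ch] = TrieNode(next_label)   # grow the trie
            next_label += 1
            node = root.children[ch]         # restart from the root's child for ch
    if node is not root:
        codes.append(node.label)             # flush the final match
    return codes


print(byte_lzw_compress("baddad", "abcd"))  # [1, 0, 3, 3, 5]
```

Unlike the binary method, stopping in the middle of the trie is harmless here, because every node has a label that can be emitted.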


Byte LZW: Uncompress example

The uncompress step for byte LZW is the most complicated part of the entire process, but is largely similar to the binary method

[Figures: step-by-step uncompress example that rebuilds the trie dictionary, starting from the initial trie with children a, b, c, d labeled 0, 1, 2, 3; the diagrams are not recoverable from the transcript.]

LZW Byte method: An alternative presentation

Getting off the ground

Suppose we want to compress a file containing only letters a, b, c and d.

It seems reasonable to start with a dictionary

a:0 b:1 c:2 d:3

At least we can then deal with the first letter.

And the receiver knows how to start.


Growing pains

Now suppose the file starts like so:

a b b a b b …

We scan the a, look it up and output a 0.

After scanning the b, we have seen the word ab. So, we add it to the dictionary

a:0 b:1 c:2 d:3 ab:4


Growing pains

We already scanned the first b.

a b b a b b …

Then we get another b.

We output a 1 for the first b, and add bb to the dictionary

a:0 b:1 c:2 d:3 ab:4 bb:5



Right, so far zero compression.

We already scanned the second b.

a b b a b b …

After scanning a, we output 1 for the b, and put ba in the dictionary

… d:3 ab:4 bb:5 ba:6

Still zero compression.


But now…

We already scanned a.

a b b a b b …

We scan the next b, and ab:4 is in the dictionary.

We scan the next b, output 4, and put abb into the dictionary.

… d:3 ab:4 bb:5 ba:6 abb:7

We got compression, because 4 is shorter than ab.


We already scanned the last b

a b b a b b …

Suppose the input continues

a b b a b b b b a …

We scan the next b, and bb:5 is in the dictionary

We scan the next b, output 5, and put bbb into the dictionary

… ab:4 bb:5 ba:6 abb:7 bbb:8

And so on


More Hits

As our dictionary grows, we are able to replace longer and longer blocks by short code numbers.

a b b a b b b b a …

0 1 1 4 5 6

And we increase the dictionary at each step by adding another word.

Summary

We scan a sequence of symbols

a1 a2 a3 … ak

where each prefix is in the dictionary.

We stop when we fall out of the dictionary:

a1 a2 a3 … ak b


Summary (cont’d)

We output the code for a1 a2 a3 …. ak and

put a1 a2 a3 …. ak b into the dictionary.

Then we set

a1 = b

And start all over.


More importantly

Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end.

Start with the same initialization, then

Read one code number after the other, look up each one in the dictionary, and extend the dictionary as you go along.


Sort of

Let's take a closer look at an example.

Assume alphabet {a,b,c}.

The code for aabbaabb is 0 0 1 1 3 5.

The decoding starts with dictionary D:

0:a, 1:b, 2:c


Moving along

The first 4 code words are already in D.

0 0 1 1 3 5

and produce output a a b b.

As we go along, we extend D:

0:a, 1:b, 2:c, 3:aa, 4:ab, 5:bb

For the code numbers 3 5, get

a a b b a a b b



We have also added to D:

6:ba, 7:aab

But these entries are never used.

Everything is easy, since there is already an entry in D for each code number when we encounter it.


Is this it?

Unfortunately, no.

It may happen that we run into a code word without having an appropriate entry in D.

But, it can only happen in very special circumstances, and we can manufacture the missing entry.


A Bad Run

Consider input

a a b b b a a ==> 0 0 1 5 3

After reading 0 0 1, we output

a a b

and extend D with codes for aa and ab

0:a, 1:b, 2:c, 3:aa, 4:ab



We have read 0 0 1 from the input

0 0 1 5 3

The dictionary is

0:a, 1:b, 2:c, 3:aa, 4:ab

The next code number to read is 5, but it’s not in D.

How could this have happened?

Can we recover?

… narrowly averted

This problem only arises when, on the compressor end:

• the input contains a substring

…s ω s ω s…

• the compressor read s ω, output the code for s ω, and added s ω s to the dictionary under the next free code number c+1 (c being the largest code number so far).

• Here s is a single symbol, but ω is a (possibly empty) word.

… narrowly averted (pt. 2)

On the decompressor end, D contains

s ω

• but does not yet contain c+1: s ω s

• the decompressor has already output

x = s ω

and is now looking at the unknown code number c+1.

… narrowly averted (pt. 3)

But then the fix is to output

x + first(x)

where x is the last decompressed word, and first(x) the first symbol of x.

Because x = s ω was already output, we get the required

s ω s ω s

We also update the dictionary to contain the new entry x + first(x) = s ω s.

In our example we have read 0 0 1 from the input

0 0 1 5 3

The last decompressed word is b, and the next code number to read is 5. Thus

• s = b

• ω = empty

• The next word to output and add to D is

s ω s = bb

Let x be the last decompressed word.

Ordinarily, D contains a word y matching the input code number.

We output y and extend D with

x + first(y)

But sometimes the compressor immediately uses the word it has just added.

Then it must be that x = s ω, and we output

x + first(x) = s ω s

Example (extended)

Input codes: 0 0 1 5 3 6 7 9 5
Output:      aabbbaabbaaabaababb

Input   In D?   Output   Add to D
0               a
0       +       a        3:aa
1       +       b        4:ab
5       -       bb       5:bb
3       +       aa       6:bba
6       +       bba      7:aab
7       +       aab      8:bbaa
9       -       aaba     9:aaba
5       +       bb       10:aabab


Pseudo Code: Compression

Initialize dictionary D to all words of length 1.

Read all input characters:

output code numbers from D,

extend D whenever a new word appears.

New code words: just an integer counter.

Less Pseudo

initialize D;
c = nextchar;              // next input character
W = c;                     // a string
while( c = nextchar ) {
    if( W+c is in D )      // dictionary
        W = W + c;
    else {
        output code(W); add W+c to D; W = c;
    }
}
output code(W)
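A minimal Python rendering of this loop, as a sketch. It assumes a plain Python dict for D in place of a trie, takes the alphabet as an explicit argument, and uses the hypothetical name `lzw_compress`; new code numbers come from a simple counter, here the current size of D.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW code numbers for text over the given alphabet."""
    D = {ch: i for i, ch in enumerate(alphabet)}   # all words of length 1
    if not text:
        return []
    codes = []
    w = text[0]
    for c in text[1:]:
        if w + c in D:
            w = w + c                # still inside the dictionary, keep scanning
        else:
            codes.append(D[w])       # output code(W)
            D[w + c] = len(D)        # add W+c under the next free code number
            w = c                    # start over from the character that fell out
    codes.append(D[w])               # output the code for the final word
    return codes


print(lzw_compress("aabbaabb", "abc"))  # [0, 0, 1, 1, 3, 5]
```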


Pseudo Code: Decompression

Initialize dictionary D with all words of length 1.

Read all code numbers and

- output corresponding words from D,

- extend D at each step.

This time the dictionary is of the form

( integer, word )

Keys are integers, values words.


Less Pseudo

initialize D;

pc = nextcode; // first code number

x = word(pc); // corresponding word

output x;

First code number is easy: codes only a single symbol.

Remember as pc (previous code) and x (previous word).

More Less Pseudo

while ( c = nextcode ) {
    if ( c is in D ) {
        y = word(c);
        ww = x + first(y);
        insert ww in D;
        output y;
    }
    else {
        // The hard case: c is not yet in D
        y = x + first(x);
        insert y in D;
        output y;
    }
    pc = c;
    x = y;
}
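The decoding loop, including the hard case, can be sketched in Python as follows. The name `lzw_decompress` is hypothetical, D is an ordinary dict from code numbers to words, and rejecting any unknown code other than the next free one is an extra check added for this sketch.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the text from a list of LZW code numbers."""
    D = {i: ch for i, ch in enumerate(alphabet)}   # all words of length 1
    if not codes:
        return ""
    x = D[codes[0]]          # the first code always names a single symbol
    out = [x]
    for c in codes[1:]:
        if c in D:
            y = D[c]
            D[len(D)] = x + y[0]     # extend D with x + first(y)
        elif c == len(D):
            y = x + x[0]             # hard case: the word is x + first(x)
            D[c] = y
        else:
            raise ValueError("code number %d cannot occur here" % c)
        out.append(y)
        x = y                        # remember the previous word
    return "".join(out)


print(lzw_decompress([0, 0, 1, 5, 3], "abc"))  # aabbbaa
```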



One more detail…

One detail remains: how to build the dictionary for compression (decompression is easy).

We need to be able to scan through a sequence of symbols and check if they form a prefix of a word already in the dictionary.

We use tries for dictionaries.

[Figure: a trie whose root has children a, b, c, d labeled 0, 1, 2, 3, with deeper nodes for the longer words; the diagram is not recoverable from the transcript.]

Corresponds to dictionary

a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6



In the LZW situation, we can add the new word to the trie dictionary in O(1) steps after discovering that the string is no longer a prefix of a dictionary word.

Just add a new leaf to the last node touched.


LZW details

• In reality, one usually restricts the code words to be 12 or 16 bit integers.

• Hence, one may have to flush the dictionary every so often (i.e., proceed to compress the rest of the input with an empty dictionary).

• But we won’t bother with this.
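For the curious, here is one possible way to flush, sketched in Python under loud assumptions: "empty" is taken to mean "reset to the single-character words", the flush happens as soon as the dictionary reaches `max_size` entries (4096 for 12-bit codes), and the decompressor mirrors the rule one entry earlier because it always lags one entry behind. The function names and the flush rule are hypothetical; real formats often signal the flush with a reserved code instead.

```python
def lzw_compress_capped(text, alphabet, max_size):
    """LZW compression that resets the dictionary once it holds max_size entries."""
    initial = {ch: i for i, ch in enumerate(alphabet)}
    D = dict(initial)
    codes = []
    if not text:
        return codes
    w = text[0]
    for c in text[1:]:
        if w + c in D:
            w += c
        else:
            codes.append(D[w])
            D[w + c] = len(D)
            if len(D) >= max_size:
                D = dict(initial)        # flush: back to single characters
            w = c
    codes.append(D[w])
    return codes


def lzw_decompress_capped(codes, alphabet, max_size):
    """Inverse of lzw_compress_capped, flushing in step with the compressor."""
    initial = {i: ch for i, ch in enumerate(alphabet)}
    D = dict(initial)
    out = []
    x = None                             # previous word; None right after a flush
    for c in codes:
        if x is None:
            y = D[c]                     # first code after a flush: single symbol
        elif c in D:
            y = D[c]
            D[len(D)] = x + y[0]
        else:
            y = x + x[0]                 # hard case, as in the unbounded version
            D[len(D)] = y
        out.append(y)
        x = y
        if len(D) >= max_size - 1:       # the compressor has just flushed here
            D = dict(initial)
            x = None
    return "".join(out)


codes = lzw_compress_capped("abababababab", "ab", 5)
print(codes)                                   # [0, 1, 2, 0, 1, 2, 0, 1, 2]
print(lzw_decompress_capped(codes, "ab", 5))   # abababababab
```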


LZW details

Lastly, LZW generates as output a stream of integers.

It makes perfect sense to try to compress these further, e.g., by Huffman.

Summary of LZW

LZW is an adaptive, dictionary-based compression method.

Encoding is easy in LZW, but uses a special data structure (trie).

Decoding is slightly complicated, but requires no special data structures.