Noiseless Coding

Introduction

Noiseless coding: compression without distortion.

Basic concept: symbols with lower probabilities are represented by binary indices with longer length.

Methods: Huffman codes, Lempel-Ziv codes, arithmetic codes and Golomb codes.


Entropy

Consider a set of symbols S = {S1, ..., SN}.

The entropy of the symbols is defined as

H(S) = Σ_{i=1}^{N} P(Si) log2(1 / P(Si)),

where P(Si) is the probability of Si.


Example:

Consider a set of symbols {a, b, c} with P(a) = 1/4, P(b) = 1/4 and P(c) = 1/2.

The entropy of the symbols is then given by

H = P(a) log2(1/P(a)) + P(b) log2(1/P(b)) + P(c) log2(1/P(c))
  = (1/4)(2) + (1/4)(2) + (1/2)(1)
  = 1.5
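This computation can be checked numerically (a minimal sketch; the helper name `entropy` is my own):

```python
import math

def entropy(probs):
    """H = sum over symbols of p * log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs)

# Symbols {a, b, c} with P(a) = 1/4, P(b) = 1/4, P(c) = 1/2.
print(entropy([0.25, 0.25, 0.5]))  # 1.5
```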


Consider a message containing symbols in S.

Define the rate of a source coding technique as the average number of bits representing each symbol after compression.


Example:

Suppose the following message is to be compressed:

a a a b c a

Suppose an encoding technique uses 7 bits to represent the message.

The rate of the encoding technique is therefore 7/6 (since there are 6 symbols).


Shannon’s source coding theorem:

The lowest rate for encoding a message without distortion is the entropy of the symbols in the message.


Therefore, in an optimal noiseless source encoder, the average number of bits used to represent each symbol Si is

log2(1 / P(Si)).

It takes a larger number of bits to represent a symbol having small probability.


Because the entropy is the limit of the noiseless encoder, we usually call the noiseless encoder the entropy encoder.


Huffman Codes

We start with a set of symbols, where each symbol is associated with a probability.

Merge the two symbols having the lowest probabilities into a new symbol.


Repeat the merging process until all the symbols are merged into a single symbol.

Following the merging path, we can form the Huffman codes.
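The merging procedure can be sketched with a priority queue (a minimal sketch; the function and variable names are my own):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build Huffman codes by repeatedly merging the two least probable
    symbols; the labels along the merge path form the codes."""
    tiebreak = count()  # avoids comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # lowest probability
        p2, _, c2 = heapq.heappop(heap)  # second lowest
        # Prefix one branch's codes with '0' and the other's with '1'.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"a": 0.5, "b": 0.3, "c": 0.2})
print(codes)  # a gets a 1-bit code; b and c get 2-bit codes
```

The exact 0/1 labels may differ from the slide (the Huffman code is not unique), but the code lengths are the same.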


Example

Consider the following three symbols:

a (with prob. 0.5)

b (with prob. 0.3)

c (with prob. 0.2)


Merging process: first merge the two least probable symbols, b (0.3) and c (0.2), into a combined symbol of probability 0.5; then merge the result with a (0.5). Labelling the two branches of each merge with 1 and 0 gives the codes.

Huffman codes:

a 1

b 01

c 00


Example

Suppose the following message is to be compressed:

a a a b c a

The results of the Huffman coding are:

1 1 1 01 00 1

Total number of bits used to represent the message: 8 bits (Rate = 8/6 = 4/3).


If the message is not compressed by the Huffman codes, each symbol must be represented by 2 bits, so the total number of bits used to represent the message is 12 bits.

We have saved 4 bits using the Huffman codes.


Discussions

It does not matter how the symbols are arranged.

It does not matter how the final code tree is labeled (with 0s and 1s).

Hence, the Huffman code is not unique.


Lempel-Ziv Codes

Parse the input sequence into non-overlapping blocks of different lengths.

Construct a dictionary based on the blocks.

Use the dictionary for both encoding and decoding.


It is NOT necessary to pre-specify the probability associated with each symbol.


Dictionary Generation process

1. Initialize the dictionary to contain all blocks of length one.
2. Search for the longest block W which has appeared in the dictionary.
3. Encode W by its index in the dictionary.
4. Add W followed by the first symbol of the next block to the dictionary.
5. Go to Step 2.
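These steps can be sketched as follows (a minimal sketch; the names are my own, and the dictionary is keyed by strings for clarity):

```python
def lz_encode(message, alphabet):
    """Encode message (a string) using the dictionary scheme above."""
    dictionary = {sym: i for i, sym in enumerate(alphabet)}  # step 1
    output = []
    w = ""
    for c in message:
        if w + c in dictionary:      # step 2: extend the current block
            w += c
        else:
            output.append(dictionary[w])         # step 3: emit index of W
            dictionary[w + c] = len(dictionary)  # step 4: add W + next symbol
            w = c                                # step 5: continue from c
    output.append(dictionary[w])  # flush the final block
    return output, dictionary

# The message from the example below: a b a a b a
indices, d = lz_encode("abaaba", "ab")
print(indices)  # [0, 1, 0, 2, 0]
```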


Example

Consider the following input message

a b a a b a

Initial dictionary:

index entry

0 a

1 b


encoder side

W = a, output 0; add ab to the dictionary.

index entry

0 a

1 b

2 ab

a b a a b a


encoder side

W = b, output 1; add ba to the dictionary.

index entry

0 a

1 b

2 ab

3 ba

a b a a b a


W = a, output 0; add aa to the dictionary.

index entry

0 a

1 b

2 ab

3 ba

4 aa

encoder side

a b a a b a


W = ab, output 2; add aba to the dictionary.

index entry

0 a

1 b

2 ab

3 ba

4 aa

5 aba

encoder side

a b a a b a


W = a, output 0. Stop.

index entry

0 a

1 b

2 ab

3 ba

4 aa

5 aba

encoder side

a b a a b a


decoder side

Initial dictionary

Receive 0, generate a

index entry

0 a

1 b



decoder side

Receive 1, generate b; add ab to the dictionary.

index entry

0 a

1 b

2 ab



decoder side

Receive 0, generate a; add ba to the dictionary.

index entry

0 a

1 b

2 ab

3 ba



decoder side

Receive 2, generate ab; add aa to the dictionary.

index entry

0 a

1 b

2 ab

3 ba

4 aa



decoder side

Receive 0, generate a; add aba to the dictionary.

index entry

0 a

1 b

2 ab

3 ba

4 aa

5 aba



Example

Consider again the following message:

a a a b c a

The initial dictionary is given by

index entry

0 a

1 b

2 c


After the encoding process, the output of the encoder is given by

0 3 1 2 0

The final dictionary is given by

index entry

0 a

1 b

2 c

3 aa

4 aab

5 bc

6 ca


decoder side

Initial dictionary

Receive 0, generate a

index entry

0 a

1 b

2 c



decoder side

Receive 3, generate ? The decoder gets stuck!

We need the Welch correction to fix this problem.

index entry

0 a

1 b

2 c



Welch correction

It turns out that this behavior can arise whenever one sees a pattern of the form

xwxwx

where x is a single symbol, and w is either empty or a sequence of symbols such that xw already appears in the encoder and decoder table, but xwx does not.


In this case the encoder will send the index for xw, and add xwx to the table with a new index i.

Next it will parse xwx and send the new index i.

The decoder will receive the index i but will not yet have the corresponding word in the dictionary.



Therefore, when the decoder cannot find the corresponding word for an index i, the word must be xwx, where xw can be found from the last decoded symbols.
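With this correction, the decoder can be sketched as follows (a minimal sketch; the names are my own):

```python
def lz_decode(indices, alphabet):
    """Decode, applying the Welch correction when an index is not
    yet in the dictionary (the xwx case)."""
    dictionary = {i: sym for i, sym in enumerate(alphabet)}
    w = dictionary[indices[0]]
    out = [w]
    for i in indices[1:]:
        if i in dictionary:
            entry = dictionary[i]
        else:
            entry = w + w[0]  # Welch correction: the word must be xwx
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(out)

# The problematic message from the example: indices 0 3 1 2 0
print(lz_decode([0, 3, 1, 2, 0], "abc"))  # aaabca
```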


Here, the last decoded symbol is a.

Therefore, x = a and w is empty.

Hence, xwx = aa.

index entry

0 a

1 b

2 c

3 aa



decoder side

Receive 1, generate b; add aab to the dictionary.


index entry

0 a

1 b

2 c

3 aa

4 aab


decoder side

Receive 2 and 0, generate c and a. Final dictionary:


index entry

0 a

1 b

2 c

3 aa

4 aab

5 bc

6 ca


Example

Consider the following message

a b a b a b a

After the encoding process, the output of the encoder is given by

0 1 2 4

The final dictionary is given by

index entry

0 a

1 b

2 ab

3 ba

4 aba


decoder side

Receive 0, 1 and 2, generate a, b and ab

current dictionary

index entry

0 a

1 b

2 ab

3 ba


decoder side

Receive 4, generate ?

current dictionary

index entry

0 a

1 b

2 ab

3 ba


Here, the last decoded symbols are ab.

Therefore, x = a and w = b.

Hence, xwx = aba.


index entry

0 a

1 b

2 ab

3 ba

4 aba


Discussions

Theoretically, the size of the dictionary can grow infinitely large.

In practice, the dictionary size is limited. Once the limit is reached, no more entries are added. Welch recommended a dictionary of size 4096, which corresponds to 12 bits per index.


Discussions

The length of the indices may vary. When the number of entries n in the dictionary satisfies

2^(m-1) < n ≤ 2^m,

the indices can be m bits long.


Discussions

Using the message

a a a b c a

as an example, the encoded indices are

0 3 1 2 0

With variable-length indices these are transmitted as

00 11 001 010 000

Need 13 bits (Rate = 13/6)
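A small sketch of how the 13 bits arise (variable-width indices sized to the current dictionary; the names are my own):

```python
import math

def index_bits(indices, initial_size):
    """Total bits when each index is sent with m bits, where
    2^(m-1) < n <= 2^m and n is the current dictionary size;
    the dictionary grows by one entry after each emitted index."""
    total = 0
    n = initial_size
    for _ in indices:
        m = max(1, math.ceil(math.log2(n)))
        total += m
        n += 1  # a new entry is added after each emission
    return total

# Message a a a b c a: indices 0 3 1 2 0, initial dictionary {a, b, c}
print(index_bits([0, 3, 1, 2, 0], 3))  # 13
```

The widths are 2, 2, 3, 3, 3 bits as the dictionary grows from 3 to 7 entries.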


Discussions

The above examples, like most other illustrative examples in the literature, do not result in real compression: more bits are used to represent the indices than the original data. This is because the input data in the examples is too short.

In practice, the Lempel-Ziv algorithm works well (i.e., leads to actual compression) only when the input data is sufficiently large and there is sufficient redundancy in the data.


Discussions

Many popular programs (e.g., Unix compress and uncompress, gzip and gunzip, the GIF format, and Windows WinZip) are based on the Lempel-Ziv algorithm.


Arithmetic Codes

A message is represented by an interval of real numbers between 0 and 1.

As the message becomes longer, the interval needed to represent it becomes smaller, so the number of bits needed to specify that interval grows.


Successive symbols of the message reduce the size of the interval according to the symbol probabilities.


Example

Again we consider the following three symbols:

a (with prob. 0.5)

b (with prob. 0.3)

c (with prob. 0.2)

Suppose we also encode the same message as in the previous example:

a a a b c a


The encoding process

Using sub-intervals [0, 0.5) for a, [0.5, 0.8) for b and [0.8, 1) for c, the interval narrows as each symbol is encoded:

a: [0, 0.5)
a: [0, 0.25)
a: [0, 0.125)
b: [0.0625, 0.1)
c: [0.0925, 0.1)
a: [0.0925, 0.09625)   (final interval)
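The interval narrowing can be sketched as follows (a minimal sketch; the names and the explicit sub-interval table are my own, matching the figure):

```python
def arith_encode(message, intervals):
    """Narrow [low, high) according to each symbol's sub-interval."""
    low, high = 0.0, 1.0
    for sym in message:
        lo, hi = intervals[sym]
        rng = high - low
        low, high = low + rng * lo, low + rng * hi
    return low, high

intervals = {"a": (0.0, 0.5), "b": (0.5, 0.8), "c": (0.8, 1.0)}
low, high = arith_encode("aaabca", intervals)
print(low, high)  # approximately 0.0925 and 0.09625
```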


Final interval

The final interval therefore is

[0.0925, 0.09625)

Any number in this final interval can be used for the decoding process. For instance, we pick

0.09375 ∈ [0.0925, 0.09625)


At each step the decoder checks which sub-interval contains 0.09375:

0.09375 ∈ [0, 0.5)          → output a
0.09375 ∈ [0, 0.25)         → output a
0.09375 ∈ [0, 0.125)        → output a
0.09375 ∈ [0.0625, 0.1)     → output b
0.09375 ∈ [0.0925, 0.1)     → output c
0.09375 ∈ [0.0925, 0.09625) → output a


Decoder

Therefore, the decoder successfully identifies the source sequence

a a a b c a

Note that 0.09375 can be represented by the binary sequence

0 0 0 1 1
(0.5) (0.25) (0.125) (0.0625) (0.03125)

We only need 5 bits to represent the message (Rate = 5/6).
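The decoding loop can be sketched as follows (a minimal sketch; the names are my own, and the decoder is told the message length so it knows when to stop):

```python
def arith_decode(value, intervals, n_symbols):
    """Repeatedly find the sub-interval containing value, emit its
    symbol, and rescale value into that sub-interval."""
    out = []
    for _ in range(n_symbols):
        for sym, (lo, hi) in intervals.items():
            if lo <= value < hi:
                out.append(sym)
                value = (value - lo) / (hi - lo)  # rescale to [0, 1)
                break
    return "".join(out)

intervals = {"a": (0.0, 0.5), "b": (0.5, 0.8), "c": (0.8, 1.0)}
print(arith_decode(0.09375, intervals, 6))  # aaabca
```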


Discussions

Given the same message

a a a b c a

No compression: 12 bits
Huffman codes: 8 bits
Lempel-Ziv: 13 bits
Arithmetic codes: 5 bits


Discussions

The length of the interval may become very small for a long message, causing an underflow problem.


Discussions

The encoder does not transmit anything until the entire message has been encoded. In most applications an incremental mode is necessary.


Discussions

The symbol frequencies (i.e., probabilities) might vary with time. It is therefore desirable to use an adaptive symbol frequency model for encoding and decoding.


Golomb Codes

Well-suited for messages containing lots of 0’s and not too many 1’s.

Example:

Fax Documents


First step of the Golomb code:

Convert the input sequence into integers (the run lengths of 0’s before each 1).

Example:

– 00100000010000000001000000000010000000000000000000000000001

– 2,6,9,10,27
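The conversion can be sketched as follows (a minimal sketch; the helper name is my own):

```python
def zero_runs(bits):
    """Run lengths of 0's preceding each 1 in the bit string
    (assumes the string ends with a 1)."""
    return [len(run) for run in bits.split("1")[:-1]]

# The example sequence, built from its run lengths
bits = "0" * 2 + "1" + "0" * 6 + "1" + "0" * 9 + "1" + "0" * 10 + "1" + "0" * 27 + "1"
print(zero_runs(bits))  # [2, 6, 9, 10, 27]
```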


Second step of the Golomb code:

Convert the integers into an encoded bitstream.

– Select an integer m.
– For each integer n obtained from the first step, compute q and r, where

n = qm + r


– The binary code for n has two parts:
  1. q is coded in unary.
  2. r is coded in a fixed-length or variable-length code.
– Example: m = 5, so the value of r can be 0, 1, 2, 3 or 4; each value is assigned a variable-length code, denoted by r̂. (The code table was given in a figure.)
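A sketch of the full Golomb encoder, assuming unary coding of q as q 1’s followed by a 0 and the standard truncated-binary code for r (the slide’s table for r̂ was given in a figure, so this particular code table is an assumption):

```python
import math

def golomb(n, m):
    """Golomb code: unary(q) followed by truncated-binary(r), n = q*m + r."""
    q, r = divmod(n, m)
    unary = "1" * q + "0"  # assumed unary convention: q ones, then a zero
    b = math.ceil(math.log2(m))
    u = 2 ** b - m  # the first u remainders get the shorter (b-1)-bit code
    if r < u:
        vlc = format(r, f"0{b - 1}b")
    else:
        vlc = format(r + u, f"0{b}b")
    return unary + vlc

# m = 5: r in {0, 1, 2} -> 2 bits, r in {3, 4} -> 3 bits
for n in [2, 6, 9, 10, 27]:
    print(n, golomb(n, 5))
```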


– Example: the binary code for n is the unary code of q followed by r̂; the encoded bitstream is obtained by concatenating the codes of 2, 6, 9, 10 and 27. (The worked bitstream was given in a figure.)

