Noiseless Coding
Introduction
Noiseless Coding Compression without distortion
Basic Concept: Symbols with lower probabilities are represented by binary indices of longer length.
Methods: Huffman codes, Lempel-Ziv codes, Arithmetic codes, and Golomb codes.
Entropy
Consider a set of symbols S={S1,...,SN}.
The entropy of the symbols is defined as

H(S) = Σ_{i=1}^{N} P(Si) log2(1/P(Si))

where P(Si) is the probability of Si.
Example:
Consider a set of symbols {a,b,c} with P(a)=1/4, P(b)=1/4 and P(c)=1/2.
The entropy of the symbols is then given by

H = P(a) log2(1/P(a)) + P(b) log2(1/P(b)) + P(c) log2(1/P(c)) = 1.5
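The entropy formula above can be checked directly; a minimal Python sketch:

```python
import math

def entropy(probs):
    """H(S) = sum over i of P(Si) * log2(1 / P(Si))."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# The example above: P(a) = 1/4, P(b) = 1/4, P(c) = 1/2
print(entropy([0.25, 0.25, 0.5]))  # 1.5
```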
Consider a message containing symbols in S.
Define rate of a source coding technique as the average number of bits representing each symbol after compressing.
Example:
Suppose the following message is to be compressed:
a a a b c a
Suppose an encoding technique uses 7 bits to represent the message.
The rate of the encoding technique is therefore 7/6 (since there are 6 symbols).
The lowest rate for encoding a message without distortion is the entropy of the symbols in the message.
Shannon’s source coding theorem:
Therefore, in an optimal noiseless source encoder, the average number of bits used to represent each symbol Si is

log2(1/P(Si))
It takes a larger number of bits to represent a symbol having a small probability.
Because the entropy is the limit of the noiseless encoder, the noiseless encoder is usually called the entropy encoder.
Huffman Codes
We start with a set of symbols, where each symbol is associated with a probability.
Merge the two symbols having the lowest probabilities into a new symbol.
Repeat the merging process until all the symbols are merged into a single symbol.
Following the merging path, we can form the Huffman codes.
Example
Consider the following three symbols:
a (with prob. 0.5)
b (with prob. 0.3)
c (with prob. 0.2)
Huffman Codes:
a 1
b 01
c 00
Merging Process
[Figure: merging tree; at each merge one branch is labeled 0 and the other 1]
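The merging process can be sketched in Python with a priority queue. This is a minimal illustration; as noted in the Discussions slide below, the 0/1 labels it assigns may differ from the tree above, but the code lengths match:

```python
import heapq
from itertools import count

def huffman(probs):
    """Build Huffman codes by repeatedly merging the two
    lowest-probability symbol groups, as described above."""
    counter = count()  # tie-breaker so the heap never compares symbol tuples
    heap = [(p, next(counter), (sym,)) for sym, p in probs.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probs}
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)  # lowest-probability group
        p2, _, group2 = heapq.heappop(heap)  # second lowest
        for sym in group1:  # label one branch 0, the other 1
            codes[sym] = "0" + codes[sym]
        for sym in group2:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (p1 + p2, next(counter), group1 + group2))
    return codes

codes = huffman({"a": 0.5, "b": 0.3, "c": 0.2})
# code lengths match the example: a -> 1 bit, b and c -> 2 bits each
```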
Example
Suppose the following message is to be compressed:
a a a b c a
The results of the Huffman coding are:
1 1 1 01 00 1
Total # of bits used to represent the message: 8 bits (Rate = 8/6 = 4/3).
If the message is not compressed by the Huffman codes, each symbol must be represented by 2 bits, so the total # of bits used to represent the message is 12 bits.
We have saved 4 bits using the Huffman codes.
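Encoding with the code table is a direct per-symbol lookup; a two-line sketch using the codes from the example:

```python
# The Huffman codes from the example above
codes = {"a": "1", "b": "01", "c": "00"}
encoded = "".join(codes[s] for s in "aaabca")
print(encoded, len(encoded))  # 11101001 8  -> Rate = 8/6
```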
Discussions
It does not matter how the symbols are arranged.
It does not matter how the final code tree is labeled (with 0s and 1s).
The Huffman code is not unique.
Lempel-Ziv Codes
Parse the input sequence into non-overlapping blocks of different lengths.
Construct a dictionary based on the blocks.
Use the dictionary for both encoding and decoding.
It is NOT necessary to pre-specify the probability associated with each symbol.
Dictionary Generation process
1. Initialize the dictionary to contain all blocks of length one.
2. Search for the longest block W which has appeared in the dictionary.
3. Encode W by its index in the dictionary.
4. Add W followed by the first symbol of the next block to the dictionary.
5. Go to Step 2.
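The five steps above can be sketched as a short Python encoder (a minimal illustration, assuming single-character symbols):

```python
def lzw_encode(message, alphabet):
    """LZW encoder following steps 1-5 above."""
    dictionary = {sym: i for i, sym in enumerate(alphabet)}  # step 1
    output = []
    w = ""
    for sym in message:
        if w + sym in dictionary:
            w += sym                    # step 2: keep growing the longest match
        else:
            output.append(dictionary[w])           # step 3: emit index of W
            dictionary[w + sym] = len(dictionary)  # step 4: add W + next symbol
            w = sym                                # step 5: continue from here
    if w:
        output.append(dictionary[w])  # flush the final block
    return output, dictionary

output, final_dict = lzw_encode("abaaba", ["a", "b"])
print(output)  # [0, 1, 0, 2, 0] as in the walkthrough below
```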
Example
Consider the following input message
a b a a b a
Initial dictionary:
index entry
0 a
1 b
encoder side
W = a, output 0 Add ab to the dictionary
index entry
0 a
1 b
2 ab
a b a a b a
encoder side
W = b, output 1 Add ba to the dictionary
index entry
0 a
1 b
2 ab
3 ba
a b a a b a
W = a, output 0 Add aa to the dictionary
index entry
0 a
1 b
2 ab
3 ba
4 aa
encoder side
a b a a b a
W = ab, output 2 Add aba to the dictionary
index entry
0 a
1 b
2 ab
3 ba
4 aa
5 aba
encoder side
a b a a b a
W = a, output 0 Stop
index entry
0 a
1 b
2 ab
3 ba
4 aa
5 aba
encoder side
a b a a b a
decoder side
Initial dictionary
Receive 0, generate a
index entry
0 a
1 b
0
a
decoder side
Receive 1, generate b Add ab to the dictionary
index entry
0 a
1 b
2 ab
a
1
b
decoder side
Receive 0, generate a Add ba to the dictionary
index entry
0 a
1 b
2 ab
3 ba
a
0
b a
decoder side
Receive 2, generate ab Add aa to the dictionary
index entry
0 a
1 b
2 ab
3 ba
4 aa
a
2
b a a b
decoder side
Receive 0, generate a Add aba to the dictionary
index entry
0 a
1 b
2 ab
3 ba
4 aa
5 aba
a
0
b a a b a
Example
Consider again the following message:
a a a b c a
The initial dictionary is given by
index entry
0 a
1 b
2 c
After the encoding process, the output of the encoder is given by
0 3 1 2 0
The final dictionary is given by
index entry
0 a
1 b
2 c
3 aa
4 aab
5 bc
6 ca
decoder side
Initial dictionary
Receive 0, generate a
index entry
0 a
1 b
2 c
0
a
decoder side
Receive 3, generate ? The decoder gets stuck!
We need the Welch correction for this problem.
index entry
0 a
1 b
2 c
a
3
?
It turns out that this behavior can arise whenever one sees a pattern of the form
xwxwx
where x is a single symbol, and w is either empty or a sequence of symbols such that
xw already appears in the encoder and decoder table, but xwx does not.
Welch correction
In this case the encoder will send the index for xw, and add xwx to the table with a new index i.
Next it will parse xwx and send the new index i.
The decoder will receive the index i but will not yet have the corresponding word in the dictionary.
Welch correction
Therefore, when the decoder cannot find the corresponding word for an index i, the word must be
xwx,
where
xw can be found from the last decoded symbols.
Welch correction
Here, the last decoded word is a.
Therefore, x = a and w is empty.
Hence, xwx = aa.
index entry
0 a
1 b
2 c
3 aa
a
3
a a
decoder side
decoder side
Receive 1, generate b Add aab to the dictionary
a
1
a a b
index entry
0 a
1 b
2 c
3 aa
4 aab
decoder side
Receive 2 and 0, generate c and a Final dictionary
a
2
a a b c a
0
index entry
0 a
1 b
2 c
3 aa
4 aab
5 bc
6 ca
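The decoding procedure, including the Welch correction for an index that is not yet in the dictionary, can be sketched as follows (a minimal illustration over single-character symbols):

```python
def lzw_decode(indices, alphabet):
    """LZW decoder with the Welch correction: when an index is not yet
    in the dictionary, the word must be xwx, where xw is the last
    decoded word and x is its first symbol."""
    dictionary = {i: sym for i, sym in enumerate(alphabet)}
    prev = dictionary[indices[0]]
    message = prev
    for idx in indices[1:]:
        if idx in dictionary:
            entry = dictionary[idx]
        else:
            entry = prev + prev[0]  # Welch correction: xwx = (xw) + x
        dictionary[len(dictionary)] = prev + entry[0]  # add prev + next symbol
        message += entry
        prev = entry
    return message

# the "decoder gets stuck" example: indices 0 3 1 2 0 over {a, b, c}
print(lzw_decode([0, 3, 1, 2, 0], ["a", "b", "c"]))  # aaabca
```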
Example
Consider the following message
a b a b a b a
After the encoding process, the output of the encoder is given by
0 1 2 4
The final dictionary is given by
index entry
0 a
1 b
2 ab
3 ba
4 aba
decoder side
Receive 0, 1 and 2, generate a, b and ab
current dictionary
index entry
0 a
1 b
2 ab
3 ba

(received 0 1 2, decoded a b a b)
decoder side
Receive 4, generate ? current dictionary
index entry
0 a
1 b
2 ab
3 ba

(decoded so far: a b a b; received 4 → ?)
Here, the last decoded word is ab.
Therefore, x = a and w = b.
Hence, xwx = aba.
(received 4 → generate aba; decoded message: a b a b a b a)
index entry
0 a
1 b
2 ab
3 ba
4 aba
Discussions
Theoretically, the size of the dictionary can grow infinitely large.
In practice, the dictionary size is limited. Once the limit is reached, no more entries are added. Welch had recommended a dictionary of size 4096. This corresponds to 12 bits per index.
Discussions
The length of indices may vary. When the number of entries n in the dictionary satisfies

2^m ≥ n > 2^(m-1),

then the length of the indices can be m.
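Under the condition above, the index length m is simply the ceiling of log2 n; a one-line sketch:

```python
import math

def index_bits(n):
    """Smallest m with 2**m >= n > 2**(m - 1)."""
    return max(1, math.ceil(math.log2(n)))

# Welch's recommended dictionary size of 4096 entries -> 12-bit indices
print(index_bits(4096))  # 12
```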
Discussions
Using the message a a a b c a as an example, the encoded indices are 0 3 1 2 0, transmitted with variable-length indices as
00 11 (2-bit indices, while the dictionary has at most 4 entries)
001 010 000 (3-bit indices afterwards)
Need 13 bits (Rate = 13/6)
Discussions
The above examples, like most other illustrative examples in the literature, do not result in real compression. Actually, more bits are used to represent the indices than the original data. This is because the length of the input data in the examples is too short.
In practice, the Lempel-Ziv algorithm works well (leads to actual compression) only when the input data is sufficiently large and there is sufficient redundancy in the data.
Discussions
Many popular programs (e.g. Unix compress and uncompress, gzip and gunzip, GIF format and Windows WinZip) are based on the Lempel-Ziv algorithm.
Arithmetic Codes
A message is represented by an interval of real numbers between 0 and 1.
As the message becomes longer, the interval needed to represent it becomes smaller.
=> The number of bits needed to specify that interval grows.
Arithmetic Codes
Successive symbols of the message reduce the size of the interval according to the symbol probabilities.
Example
Again we consider the following three symbols:
a (with prob. 0.5)
b (with prob. 0.3)
c (with prob. 0.2)
Suppose we also encode the same message as in the previous example:
a a a b c a
The encoding process
Encoding a a a b c a with the subintervals a: [0, 0.5), b: [0.5, 0.8), c: [0.8, 1), the interval shrinks step by step:

start: [0, 1)
a: [0, 0.5)
a: [0, 0.25)
a: [0, 0.125)
b: [0.0625, 0.1)
c: [0.0925, 0.1)
a: [0.0925, 0.09625)

Final interval
The final interval therefore is
[0.0925, 0.09625)
Any number in this final interval can be used for the decoding process. For instance, we pick
0.09375 ∈ [0.0925, 0.09625)
Decoder
Decoding 0.09375 with the same subintervals a: [0, 0.5), b: [0.5, 0.8), c: [0.8, 1):

0.09375 ∈ [0, 0.5) → output a
0.09375 ∈ [0, 0.25) → output a
0.09375 ∈ [0, 0.125) → output a
0.09375 ∈ [0.0625, 0.1) → output b
0.09375 ∈ [0.0925, 0.1) → output c
0.09375 ∈ [0.0925, 0.09625) → output a
Therefore, the decoder successfully identifies the source sequence
a a a b c a
Note that 0.09375 can be represented by the binary sequence 0 0 0 1 1 (bit weights 0.5, 0.25, 0.125, 0.0625, 0.03125).
We only need 5 bits to represent the message (Rate = 5/6).
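The interval arithmetic above can be sketched in Python. This is a minimal float-based illustration, assuming the subinterval assignment a: [0, 0.5), b: [0.5, 0.8), c: [0.8, 1) from the example; practical coders use integer arithmetic and incremental output instead:

```python
def encode_interval(message, model):
    """Shrink [0, 1) symbol by symbol according to the probability model.
    model maps each symbol to its cumulative subinterval of [0, 1)."""
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        c_low, c_high = model[sym]
        low, high = low + width * c_low, low + width * c_high
    return low, high

def decode_value(value, model, n_symbols):
    """Recover n_symbols symbols from one value in the final interval."""
    out = []
    low, high = 0.0, 1.0
    for _ in range(n_symbols):
        width = high - low
        for sym, (c_low, c_high) in model.items():
            if low + width * c_low <= value < low + width * c_high:
                out.append(sym)
                low, high = low + width * c_low, low + width * c_high
                break
    return "".join(out)

# The subintervals from the example above
model = {"a": (0.0, 0.5), "b": (0.5, 0.8), "c": (0.8, 1.0)}
low, high = encode_interval("aaabca", model)   # ~ [0.0925, 0.09625)
print(decode_value(0.09375, model, 6))         # aaabca
```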
Discussions
Given the same message a a a b c a:
No compression: 12 bits
Huffman codes: 8 bits
Lempel-Ziv: 13 bits
Arithmetic codes: 5 bits
Discussions
The length of the interval may become very small for a long message, causing an underflow problem.
Discussions
The encoder does not transmit anything until the entire message has been encoded. In most applications an incremental mode is necessary.
Discussions
The symbol frequencies (i.e., probabilities) might vary with time. It is therefore desirable to use an adaptive symbol frequency model for encoding and decoding.
Golomb Codes
Well-suited for messages containing lots of 0’s and not too many 1’s.
Example:
Fax Documents
First step of Golomb Code:
Convert the input sequence into integers
Example:
– 00100000010000000001000000000010000000000000000000000000001
– 2,6,9,10,27
Second step of Golomb code:
Convert the integers into an encoded bitstream.
– Select an integer m.
– For each integer n obtained from the first step, compute q and r, where
n = qm + r
– The binary code for n has two parts:
1. q is coded in unary.
2. r can be coded in a fixed-length code or a variable-length code.
– Example: for m = 5, the value of r can be 0, 1, 2, 3 or 4. Their VLCs (denoted by r̂) are given in a table that did not survive transcription.
– The binary code for n has the following form: the unary code of q followed by r̂.
Therefore, the encoded bitstream is given by concatenating these codes for the integers 2, 6, 9, 10, 27.
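Both steps can be sketched in Python. The run extraction follows the first step exactly; for the second step, the slide's r̂ table was lost, so the sketch assumes the standard truncated binary code for m = 5 (00, 01, 10, 110, 111) and the common "q ones followed by a terminating zero" unary convention:

```python
def runs_of_zeros(bits):
    """First step: the run of 0s before each 1 becomes an integer.
    Assumes the sequence ends with a 1, as in the example."""
    return [len(run) for run in bits.split("1")[:-1]]

def golomb_encode(n, m=5):
    """Second step for one integer n: q in unary, then the r-code.
    The r-code table here is an assumed truncated binary code for
    m = 5, since the slide's table was lost."""
    q, r = divmod(n, m)                          # n = q*m + r
    r_codes = ["00", "01", "10", "110", "111"]   # r̂ for r = 0..4
    return "1" * q + "0" + r_codes[r]            # unary q, then r̂

seq = "00100000010000000001000000000010000000000000000000000000001"
print(runs_of_zeros(seq))  # [2, 6, 9, 10, 27]
```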
References
1. K. Sayood, Introduction to Data Compression, Morgan Kaufmann, 2000.