8/12/2019 Data Compression
Data Compression: With the increased emphasis on full-text databases, the problem of handling
the quantity of data becomes significant. Since the time required to search a
database depends heavily on the amount of data, efficient operation of
an information system requires both organizing the data well and finding
as efficient a representation for the data as possible. Thus there is growing
interest in data compression. Why is it needed?
1. Applications are growing ever larger: MP3, MPEG, TIFF, etc.
2. A fax page has about 4 million dots; uncompressed, it takes more than
1 minute to transmit over a 56 Kbps line. If the data is compressed by a
factor of 10, the transmission time is reduced to about 6 seconds per page.
3. TV / motion pictures use 30 pictures (frames) / second and about 200,000 pixels /
frame; color pictures require 3 bytes for each pixel (RGB). Each frame
has 200,000 x 24 = 4.8 Mbits, and a 2-hour movie requires 216,000 pictures, so
the total bits for such a movie = 216,000 x 4.8 Mbits = 1.0368 x 10^12 bits. This is
much higher than the capacity of DVDs.
Without compression, these applications would not be feasible.
A codec is called LOSSY if data is lost during compression, and it is called
LOSSLESS if no data is lost during compression.
1. Redundancy reduction (usually lossless):
Remove redundancy from the message.
2. Information reduction (usually lossy):
Reduce the total amount of information in the message.
This leads to a sacrifice of quality.
Two classes of text compression methods
Symbol-wise (or statistical) methods
Estimate probabilities of symbols - modeling step
Usually based on either arithmetic or Huffman coding
Dictionary methods
Replace fragments of text with a single code word (typically an index to an entry in the dictionary).
e.g. Ziv-Lempel coding, which replaces strings of
characters with a pointer to a previous occurrence of the string.
No probability estimates needed
Text Compression
text -> [model + encoder] -> compressed text -> [model + decoder] -> text
Information Theory
Entropy: Shannon borrowed the definition of entropy from statistical physics
to capture the notion of how much information is contained in the whole
alphabet. For a set of possible messages S, Shannon defined entropy as

H(S) = sum over s in S of p(s) * log2(1/p(s))

where p(s) is the probability of message s. The self-information
i(s) = log2(1/p(s)) represents the number of bits of information contained
in s, and, roughly speaking, the number of bits we should use to send that message.

C = (average original symbol length) / (average compressed symbol length)

Example: for the probabilities {0.25, 0.25, 0.25, 0.125, 0.125},
H(S) = 3 x 0.25 x log2(1/0.25) + 2 x 0.125 x log2(1/0.125) = 1.5 + 0.75 = 2.25 bits/symbol.
Redundancy: the average codeword length minus the entropy.
Compression ratio: the ratio between the average number of bits per symbol in
the original message and the same quantity for the coded message.
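To make the definitions concrete, here is a small Python sketch (the function name `entropy` is my own) that computes H(S) for the example distribution above:

```python
import math

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1/p(s)), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# The example distribution from this slide:
print(entropy([0.25, 0.25, 0.25, 0.125, 0.125]))  # 2.25
```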
1: Run-Length Encoding (RLE)
Based on the assumption that a file has a great deal of redundancy. Data is
considered just a string of symbols. RLE is good for fax and voice.
ABBCCDDDDDDDDDEEFGGGGG => ABBCCD#9EEFG#5
22 characters => 14 characters: (22 - 14)/22 = 36% reduction
Disadvantages:
1. We are unable to distinguish compressed text in the file from
uncompressed text.
2. Any numeric value will be interpreted as the beginning of a
compressed sequence.
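A minimal Python sketch of this RLE scheme, assuming '#' as the run marker and compressing only runs of four or more characters (the helper names and the `min_run` threshold are my own choices); note it inherits the disadvantages above if '#' or digits appear in the data itself:

```python
def rle_encode(text, marker="#", min_run=4):
    """Runs of min_run or more identical characters become
    char + marker + count (e.g. DDDDDDDDD -> D#9)."""
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        out.append(text[i] + marker + str(run) if run >= min_run
                   else text[i] * run)
        i = j
    return "".join(out)

def rle_decode(code, marker="#"):
    out, i = [], 0
    while i < len(code):
        if i + 1 < len(code) and code[i + 1] == marker:
            j = i + 2
            while j < len(code) and code[j].isdigit():
                j += 1
            out.append(code[i] * int(code[i + 2:j]))  # expand the run
            i = j
        else:
            out.append(code[i])
            i += 1
    return "".join(out)

s = "ABBCCDDDDDDDDDEEFGGGGG"
c = rle_encode(s)
print(c, len(s), "->", len(c))  # ABBCCD#9EEFG#5 22 -> 14
```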
2: Huffman coding
There is another algorithm whose performance is slightly better
than Run-Length Encoding: the famous Huffman coding.
The input to Huffman coding is the frequency distribution of the symbols to
be encoded. A binary tree is then constructed:
1. Initially, each symbol is considered a separate
binary tree.
2. The two trees with the lowest frequencies are chosen and
combined into a single tree whose assigned frequency
is the sum of the two given frequencies. The chosen
trees form the two branches of the new tree.
3. The process is repeated until only a single tree
remains. Then the two branches of every node are labeled 0
and 1 (0 on the left branch, but the order is not
important).
4. The code for each symbol can be read by following
the branches from the root to the symbol.
Huffman coding - Example
[Tree diagram: starting from the probabilities below, the two lowest-frequency
trees are merged repeatedly, giving intermediate node weights
0.1, 0.2, 0.3, 0.4, 0.6, and finally 1.0; labeling the branches 0/1
yields the codewords.]

Symbol  Prob.  Codeword
a       0.05   0000
b       0.05   0001
c       0.1    001
d       0.2    01
e       0.3    10
f       0.2    110
g       0.1    111
Huffman coding - Exercise
Code the sequence (aeebcddegfced) using the codeword table above, and
evaluate the entropy and compression ratio.
Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01
Average original symbol length = 3 bits
Average compressed symbol length = 34/13 ≈ 2.615 bits
C = 3 / (34/13) ≈ 1.147
H(X) = 2.5464 bits
Huffman coding - Notes
1. In Huffman coding, if at any time there is more than one way to
choose the smallest pair of probabilities, any such pair may be chosen.
2. A Huffman code is a variable-length code, with the more frequent symbols
being assigned shorter codes.
3. Huffman codes are good for data messages.
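The four construction steps can be sketched in Python with a heap (the function name is my own). Because of note 1, tie-breaking may produce codewords different from the table above, but the average code length comes out the same, 2.6 bits/symbol:

```python
import heapq

def huffman_codes(probs):
    """Build a Huffman code from {symbol: probability}: repeatedly
    merge the two lowest-frequency trees, then read codes root-to-leaf."""
    # Heap entries: (weight, tie-breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair of subtrees.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, tick, (t1, t2)))
        tick += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # 0 on the left branch
            walk(tree[1], prefix + "1")  # 1 on the right branch
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
         "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(probs)
avg = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)
print(avg)  # ~2.6 bits/symbol, matching the example table
```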
Lempel-Ziv Compression (LZ77):
LZ77 keeps track of the last n bytes of data seen, and when a phrase is encountered
that has already been seen, it outputs a pair of values corresponding to the
position of the phrase in the previously seen buffer of data and the length of
the phrase. The code consists of a set of triples <a, b, c>, where:
a = relative position of the longest match in the dictionary
b = length of the longest match
c = next character in the buffer beyond the longest match
Triples beginning with 0 identify new characters not previously seen.
Example: encoding the string Peter_Piper_pick.

Step  Output triple  Decoded text
1     (0,0,P)        P
2     (0,0,e)        e
3     (0,0,t)        t
4     (2,1,r)        er
5     (0,0,_)        _
6     (6,1,i)        Pi
7     (8,2,r)        per
8     (6,3,c)        _pic
9     (0,0,k)        k

After step 9 the decoded text is Peter_Piper_pick.
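A minimal encoder/decoder sketch in Python under the triple scheme above (function names are my own; the search here is case-sensitive and uses an unbounded window, so the exact triples it emits can differ from the table, which reuses 'P' for a lowercase match). The round trip is what matters:

```python
def lz77_encode(text):
    """Emit <a, b, c> triples: a = backward distance of the longest
    match, b = its length, c = the next character beyond the match
    (a = 0 marks a character not previously seen)."""
    out, i = [], 0
    while i < len(text):
        best_off, best_len = 0, 0
        for off in range(1, i + 1):
            length = 0
            # The match may overlap the current position, as LZ77 allows;
            # stop one short of the end so a next character always exists.
            while (i + length < len(text) - 1
                   and text[i + length - off] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])  # copy from `off` positions back
        out.append(ch)
    return "".join(out)

s = "Peter_Piper_pick"
triples = lz77_encode(s)
print(triples)
assert lz77_decode(triples) == s  # round trip restores the text
```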
Arithmetic Coding:
Arithmetic coding is based on the concept of interval subdividing.
In arithmetic coding a source ensemble is represented by an
interval between 0 and 1 on the real number line. Each symbol
of the ensemble narrows this interval. It uses the
probabilities of the source messages to successively narrow
the interval used to represent the ensemble.

Arithmetic Coding: Description
In the following discussion, we will use M as the size of the
alphabet of the data source,
N[x] as symbol x's probability, and
Q[x] as symbol x's cumulative probability (Q[x] = N[0] + N[1] + ... + N[x]).
Assuming we know the probabilities of each symbol, we can allocate to
each symbol an interval with width proportional to
its probability, such that the intervals do not overlap.
This can be done by using the cumulative probabilities as the two
ends of each interval: the two ends of each symbol x
are Q[x-1] and Q[x].
Symbol x is said to own the range [Q[x-1], Q[x]).
Arithmetic Coding: Encoder example

Symbol x  Probability N[x]  [Q[x-1], Q[x])
A         0.4               [0.0, 0.4)
B         0.3               [0.4, 0.7)
C         0.2               [0.7, 0.9)
D         0.1               [0.9, 1.0)

String: BCAB. Starting from [0, 1), each symbol narrows the interval:
B: [0.4, 0.7)
C: [0.61, 0.67)
A: [0.61, 0.634)
B: [0.6196, 0.6268)
Code sent: 0.6196 (any number inside the final interval identifies the string).
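The interval narrowing can be sketched in a few lines of Python (the function name is my own; a real codec would also emit bits incrementally and rescale, which is omitted here):

```python
def arith_encode(message, intervals):
    """Successively narrow [low, high): each symbol x maps the current
    interval onto its own sub-range [Q[x-1], Q[x])."""
    low, high = 0.0, 1.0
    for sym in message:
        q_lo, q_hi = intervals[sym]
        width = high - low
        low, high = low + width * q_lo, low + width * q_hi
    return low, high  # any number in [low, high) encodes the message

# Ranges from the table above:
intervals = {"A": (0.0, 0.4), "B": (0.4, 0.7),
             "C": (0.7, 0.9), "D": (0.9, 1.0)}
low, high = arith_encode("BCAB", intervals)
print(low, high)  # approximately 0.6196 and 0.6268
```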