8/12/2019 Data Compression
Data Compression: With the increased emphasis on full-text databases, the problem of handling
the quantity of data becomes significant. Since the time required to search a
database depends heavily on the amount of data, efficient operation of
an information system requires both organizing the data well and finding
as efficient a representation for the data as possible. Thus there is growing
interest in data compression. Why is it needed?
1. Applications are growing ever larger: MP3, MPEG, TIFF, etc.
2. A fax page has about 4 million dots; uncompressed, it takes more than
1 minute to transmit over a 56 Kbps line. If the data is compressed by a
factor of 10, the transmission time is reduced to about 6 seconds per page.
3. TV / motion pictures use 30 pictures (frames) / second and about 200,000 pixels /
frame; color pictures require 3 bytes for each pixel (RGB). Each frame
has 200,000 x 24 = 4.8 Mbits, and a 2-hour movie requires 216,000 pictures, so
the total bits for such a movie = 216,000 x 4.8 Mbits = 1.0368 x 10^12 bits. This is
much higher than the capacity of DVDs.
Without compression, these applications would not be feasible.
A codec is called LOSSY if data is lost during compression, and it is called
LOSSLESS if no data is lost during compression.
1. Redundancy reduction (usually lossless):
Remove redundancy from the message.
2. Information reduction (usually lossy):
Reduce the total amount of information in the message.
This leads to a sacrifice of quality.
Two classes of text compression methods
Symbol-wise (or statistical) methods
Estimate probabilities of symbols - modeling step
Usually based on either arithmetic or Huffman coding
Dictionary methods
Replace fragments of text with a single code word (typically an index to an entry in the dictionary).
e.g. Ziv-Lempel coding, which replaces strings of
characters with a pointer to a previous occurrence of the string.
No probability estimates needed
Text Compression
text -> [model + encoder] -> compressed text -> [model + decoder] -> text
Information Theory
Entropy: Shannon borrowed the definition of entropy from statistical physics
to capture the notion of how much information is contained in the whole
alphabet. For a set of possible messages S, Shannon defined entropy as

H(S) = sum over s in S of p(s) * log2(1/p(s))

where p(s) is the probability of message s. The self-information
i(s) = log2(1/p(s)) represents the number of bits of information contained
in s, and, roughly speaking, the number of bits we should use to send that message.

C = (average original symbol length) / (average compressed symbol length)

Example: for the probabilities {0.25, 0.25, 0.25, 0.125, 0.125},
H(S) = 3 x 0.25 x log2(1/0.25) + 2 x 0.125 x log2(1/0.125) = 1.5 + 0.75 = 2.25 bits/symbol.
Redundancy: the average codeword length minus the entropy.
Compression ratio: the ratio between the average number of bits per symbol in
the original message and the same quantity for the coded message.
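To make the definitions concrete, here is a small Python sketch (the function name `entropy` is my own) that computes H(S) for the example distribution above:

```python
import math

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1/p(s)), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# The example distribution from this slide:
print(entropy([0.25, 0.25, 0.25, 0.125, 0.125]))  # 2.25
```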
1: Run-Length Encoding (RLE)
Based on the assumption that a file has a great deal of redundancy. Data is
considered just a string of symbols. RLE is good for fax and voice.
ABBCCDDDDDDDDDEEFGGGGG => ABBCCD#9EEFG#5
22 characters => 14 characters: (22 - 14)/22 = 36% reduction
Disadvantages:
1. We are unable to distinguish compressed text in the file from
uncompressed text.
2. Any numeric value will be interpreted as the beginning of a
compressed sequence.
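A minimal Python sketch of this RLE scheme, assuming '#' as the run marker and compressing only runs of four or more characters (the helper names and the `min_run` threshold are my own choices); note it inherits the disadvantages above if '#' or digits appear in the data itself:

```python
def rle_encode(text, marker="#", min_run=4):
    """Runs of min_run or more identical characters become
    char + marker + count (e.g. DDDDDDDDD -> D#9)."""
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        out.append(text[i] + marker + str(run) if run >= min_run
                   else text[i] * run)
        i = j
    return "".join(out)

def rle_decode(code, marker="#"):
    out, i = [], 0
    while i < len(code):
        if i + 1 < len(code) and code[i + 1] == marker:
            j = i + 2
            while j < len(code) and code[j].isdigit():
                j += 1
            out.append(code[i] * int(code[i + 2:j]))  # expand the run
            i = j
        else:
            out.append(code[i])
            i += 1
    return "".join(out)

s = "ABBCCDDDDDDDDDEEFGGGGG"
c = rle_encode(s)
print(c, len(s), "->", len(c))  # ABBCCD#9EEFG#5 22 -> 14
```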
2: Huffman coding
There is another algorithm whose performance is slightly better
than Run-Length Encoding: the famous Huffman coding.
The input to Huffman coding is the frequency distribution of the symbols to
be encoded. A binary tree is then constructed:
1. Initially, each symbol is considered a separate
binary tree.
2. The two trees with the lowest frequencies are chosen and
combined into a single tree whose assigned frequency
is the sum of the two given frequencies. The chosen
trees form the two branches of the new tree.
3. The process is repeated until only a single tree
remains. Then the two branches of every node are labeled 0
and 1 (0 on the left branch, but the order is not
important).
4. The code for each symbol can be read by following
the branches from the root to the symbol.
Huffman coding - Example
[Tree diagram: starting from the probabilities below, the two lowest-frequency
trees are merged repeatedly, giving intermediate node weights
0.1, 0.2, 0.3, 0.4, 0.6, and finally 1.0; labeling the branches 0/1
yields the codewords.]

Symbol  Prob.  Codeword
a       0.05   0000
b       0.05   0001
c       0.1    001
d       0.2    01
e       0.3    10
f       0.2    110
g       0.1    111
Huffman coding - Exercise
Code the sequence (aeebcddegfced) using the codeword table above, and
evaluate the entropy and compression ratio.
Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01
Average original symbol length = 3 bits
Average compressed symbol length = 34/13 ≈ 2.615 bits
C = 3 / (34/13) ≈ 1.147
H(X) = 2.5464 bits
Huffman coding - Notes
1. In Huffman coding, if at any time there is more than one way to
choose the smallest pair of probabilities, any such pair may be chosen.
2. A Huffman code is a variable-length code, with the more frequent symbols
being assigned shorter codes.
3. Huffman codes are good for data messages.
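The four construction steps can be sketched in Python with a heap (the function name is my own). Because of note 1, tie-breaking may produce codewords different from the table above, but the average code length comes out the same, 2.6 bits/symbol:

```python
import heapq

def huffman_codes(probs):
    """Build a Huffman code from {symbol: probability}: repeatedly
    merge the two lowest-frequency trees, then read codes root-to-leaf."""
    # Heap entries: (weight, tie-breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair of subtrees.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, tick, (t1, t2)))
        tick += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # 0 on the left branch
            walk(tree[1], prefix + "1")  # 1 on the right branch
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
         "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(probs)
avg = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)
print(avg)  # ~2.6 bits/symbol, matching the example table
```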
Lempel-Ziv Compression (LZ77):
LZ77 keeps track of the last n bytes of data seen, and when a phrase is encountered
that has already been seen, it outputs a pair of values corresponding to the
position of the phrase in the previously seen buffer of data and the length of
the phrase. The code consists of a set of triples <a, b, c>, where:
a = relative position of the longest match in the dictionary
b = length of the longest match
c = next character in the buffer beyond the longest match
Triples beginning with 0 identify new characters not previously seen.
Example: encoding the string Peter_Piper_pick.

Step  Output triple  Decoded text
1     (0,0,P)        P
2     (0,0,e)        e
3     (0,0,t)        t
4     (2,1,r)        er
5     (0,0,_)        _
6     (6,1,i)        Pi
7     (8,2,r)        per
8     (6,3,c)        _pic
9     (0,0,k)        k

After step 9 the decoded text is Peter_Piper_pick.
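A minimal encoder/decoder sketch in Python under the triple scheme above (function names are my own; the search here is case-sensitive and uses an unbounded window, so the exact triples it emits can differ from the table, which reuses 'P' for a lowercase match). The round trip is what matters:

```python
def lz77_encode(text):
    """Emit <a, b, c> triples: a = backward distance of the longest
    match, b = its length, c = the next character beyond the match
    (a = 0 marks a character not previously seen)."""
    out, i = [], 0
    while i < len(text):
        best_off, best_len = 0, 0
        for off in range(1, i + 1):
            length = 0
            # The match may overlap the current position, as LZ77 allows;
            # stop one short of the end so a next character always exists.
            while (i + length < len(text) - 1
                   and text[i + length - off] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])  # copy from `off` positions back
        out.append(ch)
    return "".join(out)

s = "Peter_Piper_pick"
triples = lz77_encode(s)
print(triples)
assert lz77_decode(triples) == s  # round trip restores the text
```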
Arithmetic Coding:
Arithmetic coding is based on the concept of interval subdividing.
In arithmetic coding a source ensemble is represented by an
interval between 0 and 1 on the real number line. Each symbol
of the ensemble narrows this interval. It uses the
probabilities of the source messages to successively narrow
the interval used to represent the ensemble.

Arithmetic Coding: Description
In the following discussion, we will use M as the size of the
alphabet of the data source,
N[x] as symbol x's probability, and
Q[x] as symbol x's cumulative probability (Q[x] = N[0] + N[1] + ... + N[x]).
Assuming we know the probabilities of each symbol, we can allocate to
each symbol an interval with width proportional to
its probability, such that the intervals do not overlap.
This can be done by using the cumulative probabilities as the two
ends of each interval: the two ends of each symbol x
are Q[x-1] and Q[x].
Symbol x is said to own the range [Q[x-1], Q[x]).
Arithmetic Coding: Encoder example

Symbol x  Probability N[x]  [Q[x-1], Q[x])
A         0.4               [0.0, 0.4)
B         0.3               [0.4, 0.7)
C         0.2               [0.7, 0.9)
D         0.1               [0.9, 1.0)

String: BCAB. Starting from [0, 1), each symbol narrows the interval:
B: [0.4, 0.7)
C: [0.61, 0.67)
A: [0.61, 0.634)
B: [0.6196, 0.6268)
Code sent: 0.6196 (any number inside the final interval identifies the string).
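The interval narrowing can be sketched in a few lines of Python (the function name is my own; a real codec would also emit bits incrementally and rescale, which is omitted here):

```python
def arith_encode(message, intervals):
    """Successively narrow [low, high): each symbol x maps the current
    interval onto its own sub-range [Q[x-1], Q[x])."""
    low, high = 0.0, 1.0
    for sym in message:
        q_lo, q_hi = intervals[sym]
        width = high - low
        low, high = low + width * q_lo, low + width * q_hi
    return low, high  # any number in [low, high) encodes the message

# Ranges from the table above:
intervals = {"A": (0.0, 0.4), "B": (0.4, 0.7),
             "C": (0.7, 0.9), "D": (0.9, 1.0)}
low, high = arith_encode("BCAB", intervals)
print(low, high)  # approximately 0.6196 and 0.6268
```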