An Introduction to Data Compression
Transcript of lecture slides - Gabriele Monfardini, Corso di Basi di Dati Multimediali, a.a. 2005-2006
General information

Requirements:
- some programming skills (not so much...)
- knowledge of data structures
- ... some work!

Office hours: ... please write me an email: [email protected]
What is compression?

Intuitively, compression is a method "to press something into a smaller space". In our domain a better definition is "to make information shorter".
Some basic questions

- What is information?
- How can we measure the amount of information?
- Why is compression useful?
- How do we compress?
- How much can we compress?
What is information? - I

Commonly, the term information refers to the knowledge of some fact, circumstance or thought. For example, think about reading a newspaper: the news is the information.

- syntax: letters, punctuation marks, white spaces, grammar rules ...
- semantics: the meaning of the words and of the sentences
What is information? - II

In our domain, information is merely the syntax, i.e. we are interested in the symbols of the alphabet used to express the information.

In order to give a mathematical definition of information we need some principles of Information Theory.
The fundamental concept

A key concept in Information Theory is that information is conveyed by randomness. How much information does a biased coin whose outcome is always heads give us? What about another biased coin, whose outcome is heads with 90% probability?

We need a way to measure the amount of information quantitatively, in some mathematical sense.
The Uncertainty - I

Suppose we have a discrete random variable $X$ and $x$ is a particular outcome with probability $p(x)$.

uncertainty: $u(x) = -\log p(x) = \log(1/p(x))$

The units are given by the base of the logarithm:
- base 2: bits
- base e: nats
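As a quick numeric illustration (my own sketch, not from the slides; the function name `uncertainty` is an assumption):

```python
import math

def uncertainty(p: float, base: float = 2) -> float:
    """Self-information -log_base(p) of an outcome with probability p."""
    return -math.log(p, base)

print(uncertainty(0.5))  # 1.0 bit: a fair coin flip
print(uncertainty(0.9))  # ~0.152 bits: a likely outcome carries little information
print(uncertainty(0.1))  # ~3.32 bits: a rare outcome carries more
```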
The Uncertainty - II

Suppose the random variable takes values in $\{0, 1\}$.

If $p(0) = p(1) = 0.5$, each outcome has 1 bit of information.

If $p(0) = 1$ and $p(1) = 0$, the outcome 0 gives no information at all, while if the outcome were 1 its information would be infinite.
The Entropy

More useful is the entropy of a random variable $X$ with values in a space $\mathcal{X}$:

$H(X) = E[\text{uncertainty}] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$

The entropy is a measure of the average uncertainty of the random variable.
The entropy - examples

Consider again a r.v. with only two possible outcomes, 0 and 1:

$X = 1$ with probability $p$, $X = 0$ with probability $1 - p$; write $H(p)$ for its entropy.

In this case:
- for $p = 0.5$: $H(p) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$ bit
- for $p = 0.9$: $H(p) = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469$ bits
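The two values above can be checked numerically; a minimal sketch of my own:

```python
import math

def entropy(probs, base: float = 2) -> float:
    """H = -sum p*log(p), skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit
print(entropy([0.9, 0.1]))  # ~0.469 bits
```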
Compression and loss

- lossless: the decompressed message (file) is an exact copy of the original. Useful for text compression.
- lossy: some information is lost in the decompressed message (file). Useful for image and sound compression.

Ignore lossy compression for a while.
Definitions - I

A source code $C$ for a r.v. $X$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, the set of finite-length strings from a D-ary alphabet.

$C(x)$ is the codeword for $x$, and $l(x)$ is the length of $C(x)$.
Definitions - II

- non-singular code (... trivial ...): every element of $\mathcal{X}$ is mapped to a different string of $\mathcal{D}^*$: $x_i \neq x_j \Rightarrow C(x_i) \neq C(x_j)$
- extension $C^*$ of a code $C$: $C^*(x_1 x_2 \ldots x_n) = C(x_1) C(x_2) \ldots C(x_n)$
- uniquely decodable code: its extension is non-singular
Definitions - III

prefix (better: prefix-free) or instantaneous code: no codeword is a prefix of any other codeword. The advantage is that decoding has no need to look ahead.

Example of a code that is not prefix-free: a -> 11, b -> 110. On reading "... 11? ..." the decoder cannot tell whether 11 is an 'a' or the start of a 'b' without looking ahead.
Examples

X | Code 1 | Code 2 | Code 3 | Code 4
--|--------|--------|--------|-------
1 |   01   |   0    |   10   |   0
2 |  110   |  010   |   00   |   10
3 |  010   |   10   |   11   |  110
4 |  110   |   01   |  110   |  111

- Code 1: singular (2 and 4 share the codeword 110)
- Code 2: non-singular, but not uniquely decodable
- Code 3: uniquely decodable, but not instantaneous
- Code 4: instantaneous
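These properties can be tested mechanically. The helper below (a sketch of my own) checks whether a code is prefix-free, which is what makes Code 4 instantaneous:

```python
def is_prefix_free(codewords) -> bool:
    """True if no codeword is a prefix of another (instantaneous code)."""
    words = sorted(codewords)  # a prefix sorts immediately before its extensions
    return all(not words[i + 1].startswith(words[i])
               for i in range(len(words) - 1))

print(is_prefix_free(["10", "00", "11", "110"]))  # Code 3: False (11 is a prefix of 110)
print(is_prefix_free(["0", "10", "110", "111"]))  # Code 4: True
```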
Kraft Inequality - I

Theorem (Kraft Inequality). For any instantaneous code over an alphabet of size D, the codeword lengths $l_1, l_2, \ldots, l_m$ must satisfy

$\sum_{i=1}^{m} D^{-l_i} \le 1$

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.
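A quick numeric check of the inequality (my own sketch):

```python
def kraft_sum(lengths, D: int = 2) -> float:
    """Sum of D**(-l) over all codeword lengths; <= 1 for an instantaneous code."""
    return sum(D ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # Code 4 lengths: 0.5 + 0.25 + 0.125 + 0.125 = 1.0, OK
print(kraft_sum([1, 1, 2]))     # 1.25 > 1: no binary instantaneous code exists
```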
Kraft Inequality - II

Consider a complete D-ary tree:
- at level k there are $D^k$ nodes
- a node at level $p \le k$ has $D^{k-p}$ descendants at level k

(figure: a complete D-ary tree with levels 0, 1, 2, 3 marked)
Kraft Inequality - III

Proof (direct part). Consider a D-ary tree (not necessarily complete) representing the codewords: each path down the tree is a sequence of symbols, and each leaf (with its unique path) is a codeword. Let $l_{max}$ be the length of the longest codeword. A codeword of length $l_i \le l_{max}$, being a leaf, implies that at level $l_{max}$ there are $D^{l_{max} - l_i}$ missing nodes (the descendants it would have had in the complete tree).
Kraft Inequality - IV

The total number of possible nodes at level $l_{max}$ is $D^{l_{max}}$. Summing over all codewords,

$\sum_{i=1}^{m} D^{l_{max} - l_i} \le D^{l_{max}}$

Dividing by $D^{l_{max}}$,

$\sum_{i=1}^{m} D^{-l_i} \le 1$
Kraft Inequality - V

Proof (converse). Suppose (without loss of generality) that the codewords are ordered by length, i.e. $l_1 \le l_2 \le \ldots \le l_m$. Consider a D-ary tree and start assigning each codeword to a node, starting from $l_1$. For a generic codeword $i$ with length $l_i$, consider the set $K$ of codewords with length $l_k \le l_i$, except $i$ itself. Suppose there is no available node at level $l_i$; that is,

$\sum_{k \in K} D^{l_i - l_k} \ge D^{l_i}$
Kraft Inequality - VI

But this means that

$\sum_{k \in K} D^{-l_k} \ge 1$

Then

$\sum_{j \in K \cup \{i\}} D^{-l_j} > 1$

which is absurd. Hence the obtained tree represents an instantaneous code with the desired codeword lengths.
Models and coders

The model supplies the probabilities of the symbols (or of groups of symbols, as we will see later). The coder encodes and decodes starting from these probabilities.

(figure: text -> encoder -> compressed text -> decoder -> text, with a model feeding both encoder and decoder)
Good modeling is crucial

What happens if the true probabilities of the symbols to be coded are $p_i$ but we use $q_i$? Simply, the compressed text will be longer, i.e. the average number of bits/symbol will be greater.

It is possible to calculate the difference in bits/symbol between the two probability mass functions p and q, known as the relative entropy $D(p \| q) = \sum_i p_i \log(p_i / q_i)$.
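A minimal sketch (my own) computing the penalty in bits/symbol:

```python
import math

def relative_entropy(p, q, base: float = 2) -> float:
    """D(p||q): extra bits/symbol paid for coding source p with model q."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# True source is heavily skewed, but we code it with a uniform model:
print(relative_entropy([0.9, 0.1], [0.5, 0.5]))  # ~0.531 extra bits per symbol
```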
Finite-context models

In English text $p(x_i = u) \approx 0.02$, but $p(x_i = u \mid x_{i-1} = q) \approx 0.95$.

A finite-context model of order m uses the previous m symbols to make the prediction. Better modeling, but we need to estimate many more probabilities.
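To make the idea concrete, here is a sketch of my own of an order-1 model estimated from raw counts (the function name and training string are illustrative assumptions):

```python
from collections import Counter, defaultdict

def order1_model(text):
    """Estimate p(symbol | previous symbol) from adjacent-pair counts."""
    counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1
    return {ctx: {s: n / sum(c.values()) for s, n in c.items()}
            for ctx, c in counts.items()}

model = order1_model("squeaky queen quiz")
print(model["q"])  # {'u': 1.0}: in this sample, 'u' always follows 'q'
```

Note how an order-m model needs a probability table for every context of m symbols, which is why the number of probabilities to estimate grows quickly with m.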
Finite-state models

Although potentially more powerful (e.g. they can model whether an odd or even number of a's have occurred consecutively), they are not so popular.

Obviously the decoder uses the same model, so encoder and decoder are always in the same state.

(figure: a two-state model; transition probabilities a: 0.5, b: 0.5 from one state and a: 0.99, b: 0.01 from the other)
Static models

A model is static if we set up a reasonable probability distribution and use it for all the texts to be coded.

Poor performance in case of different kinds of sources (English text, financial data...).

One solution is to have K different models and to send the index of the model used... but cf. the book Gadsby by E. V. Wright, written entirely without the letter "e".
Adaptive models

In order to solve the problems of static modeling, adaptive (or dynamic) models begin with a bland probability distribution that is refined as more symbols of the text become known.

The encoder and the decoder have the same initial distribution, and the same rules to alter it.

There can also be adaptive models of order m > 0.
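A sketch of my own of an adaptive order-0 model; class and method names are illustrative assumptions. Both sides start from the same counts and apply the same update rule, so their probability estimates never diverge:

```python
class AdaptiveModel:
    """Order-0 adaptive model over a fixed alphabet, starting from uniform counts."""
    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}  # bland initial distribution

    def prob(self, symbol) -> float:
        return self.counts[symbol] / sum(self.counts.values())

    def update(self, symbol) -> None:
        self.counts[symbol] += 1                # same rule on encoder and decoder side

model = AdaptiveModel("ab")
for s in "aaab":
    print(s, round(model.prob(s), 3))  # estimates drift toward observed frequencies
    model.update(s)
```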
The zero-frequency problem

The situation in which a symbol is predicted with probability zero must be avoided, since such a symbol cannot be coded.

One solution: the total number of symbols in the text is increased by 1, and this 1/total probability is divided among all unseen symbols.

Another solution: augment the count of every symbol by 1.

Many more solutions... Which is the best? If the text is sufficiently long, the compression achieved is similar.
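The second solution (augment every count by 1) is add-one, or Laplace, smoothing; a minimal sketch of my own under that assumption:

```python
from collections import Counter

def smoothed_probs(text, alphabet):
    """Add-one smoothing: no symbol of the alphabet ever gets probability zero."""
    counts = Counter(text)
    total = len(text) + len(alphabet)  # every count is raised by 1
    return {s: (counts[s] + 1) / total for s in alphabet}

probs = smoothed_probs("mississippi", "mispz")
print(probs["z"])  # 1/16 = 0.0625: 'z' never occurred, yet it remains codable
```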
Symbolwise and dictionary models

The set of all possible symbols of a source is called the alphabet.

Symbolwise models provide an estimated probability for each symbol in the alphabet.

Dictionary models instead replace substrings in a text with codewords that identify each substring in a collection, called a dictionary or codebook.
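As a toy illustration of the dictionary idea (my own sketch, not a real scheme from the slides; the codebook contents are invented):

```python
def dictionary_encode(text, dictionary):
    """Greedily replace known substrings with their dictionary index."""
    out, i = [], 0
    while i < len(text):
        # try the longest dictionary entry matching at position i
        for entry in sorted(dictionary, key=len, reverse=True):
            if text.startswith(entry, i):
                out.append(dictionary[entry])
                i += len(entry)
                break
        else:
            out.append(text[i])  # literal symbol, not in the dictionary
            i += 1
    return out

codebook = {"the ": 0, "and ": 1, "ing ": 2}
print(dictionary_encode("the cat and the dog", codebook))
# [0, 'c', 'a', 't', ' ', 1, 0, 'd', 'o', 'g']
```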