Page 1:

Data compression

&

Information theory

Gjerrit Meinsma

Mathematisch cafe 2-10-’03

Page 2:

Claude Elwood Shannon (1916-2001)

Page 3:

Outline

1. Data compression

2. Universal compression algorithm

3. Mathematical model for ‘disorder’ (information & entropy)

4. Connection entropy and data compression

5. ‘Written languages have high redundancy’

(70% can be thrown away)

Page 4:

Data compression

For this LaTeX file:

(# bytes after compression (gzip)) / (# bytes before compression) = 0.295

Some thoughts:

1. Can this be beaten?

2. Is there a ‘fundamental compression ratio’?
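A ratio like this is easy to measure yourself; a minimal sketch, where the file name shannon.tex is just a stand-in for whatever file you want to test:

```python
import gzip

# Compression ratio = (# bytes after gzip) / (# bytes before).
# "shannon.tex" is a placeholder; point it at any file.
with open("shannon.tex", "rb") as f:
    data = f.read()

ratio = len(gzip.compress(data)) / len(data)
print(f"compression ratio: {ratio:.3f}")  # e.g. 0.295 for the slides' LaTeX source
```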

Page 5:

Two extremes

Theorem 1. The optimal compression ratio for this file is < 0.001.

Page 6:

Two extremes

Theorem 1. The optimal compression ratio for this file is < 0.001.

....because my own compression algorithm gjerritzip assigns a number to each file, and this LaTeX file happens to get number 1.

Page 7:

Two extremes

Theorem 1. The optimal compression ratio for this file is < 0.001.

....because my own compression algorithm gjerritzip assigns a number to each file, and this LaTeX file happens to get number 1.

gjerritzip probably doesn’t do too well on other files.

Page 8:

We want a universal compression algorithm

one such that for every file

(# bytes after compression) / (# bytes before compression) ≤ α

with α > 0 as small as possible.

Page 9:

We want a universal compression algorithm

one such that for every file

(# bytes after compression) / (# bytes before compression) ≤ α = 1

with α > 0 as small as possible.

Unfortunately:

Theorem 2.

There is no lossless compression that strictly reduces every file.

Page 10:

Proof.

Consider the two one-bit files ‘0’ and ‘1’.....

Page 11:

Proof.

Consider the two one-bit files ‘0’ and ‘1’.....

More convincing:

There are 2^N files of N bits.

There are

2^(N−1) + 2^(N−2) + · · · + 2^0 = 2^N − 1

files of less than N bits.

Funny: patents have been granted for universal compression algorithms, which we know cannot exist.
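The counting argument is easy to check directly; a minimal sketch:

```python
# There are 2^N files of exactly N bits, but only
# 2^0 + 2^1 + ... + 2^(N-1) = 2^N - 1 files of fewer than N bits,
# so no lossless (injective) coder can shrink every N-bit file.
N = 8
shorter_files = sum(2**k for k in range(N))
assert shorter_files == 2**N - 1
print(2**N, "files of", N, "bits;", shorter_files, "shorter files")
```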

Page 12:

Exploiting structure

Why do gzip, winzip work in practice?

Page 13:

Exploiting structure

Why do gzip, winzip work in practice?

Because computer files have structure.

gzip works for many, many files, but not for every file.

Page 14:

Exploiting structure

Why do gzip, winzip work in practice?

Because computer files have structure.

gzip works for many, many files, but not for every file.

...it’s time to define ‘structure’ mathematically

...the notion of information & entropy

Page 15:

Towards a definition of information & entropy

Compare

‘I will eat something today’

with

‘Balkenende has a lover’

The second is more surprising, and so supplies more information.

• Information is a measure of surprise

• (Entropy is a measure of disorder)

Page 16:

What do we want ‘information’ to be?

Page 17:

Example 3. I draw a card. Then I tell you:

Page 18:

Example 3. I draw a card. Then I tell you:

• ‘It is a spade’ [I supply information]

Page 19:

Example 3. I draw a card. Then I tell you:

• ‘It is a spade’ [I supply information]

• ‘It is an ace’ [I supply information once again]

Page 20:

Example 3. I draw a card. Then I tell you:

• ‘It is a spade’ [I supply information]

• ‘It is an ace’ [I supply information once again]

It seems reasonable to demand that

I(ace of spades) = I(spade) + I(ace)

That is to say:

I(A ∩ B) = I(A) + I(B) if A and B are independent

Page 21:

Axiom 4 (Version 1).

1. I(A ∩ B) = I(A) + I(B) if A and B independent

2. I(A) ≥ 0

3. I(A) = I(B) if P (A) = P (B)

Because of the last axiom, information is a function of probability:

Axiom 5 (Version 2). For all p, q ∈ (0, 1):

1. I(pq) = I(p) + I(q)

2. I(p) ≥ 0

3. and we add another: I(p) is continuous in p.

Page 22:

Theorem 6.

Then I is unique: I(p) = −k log(p), up to the scaling k > 0.

A different scaling factor k means a different unit (irrelevant).

From now on:

I(p) = − log2(p)
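A minimal sketch of this information function, checked on the card example above:

```python
import math

def info(p: float) -> float:
    """Information in bits of an event with probability p."""
    return -math.log2(p)

# Additivity for independent events: I(pq) = I(p) + I(q).
print(info(1/4 * 1/13))        # ace of spades: ~5.7 bits
print(info(1/4) + info(1/13))  # spade + ace: the same value
```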

Page 23:

Example 7 (Special cases).

[plot of I(p) = −log2(p) for p ∈ (0, 1)]

• I(0) = +∞, makes sense

• I(1) = 0, makes sense

• I(2^−k) = k, well...

Page 24:

Entropy = expectation of information

Definition 8. Entropy H := E(I)

Page 25:

Entropy = expectation of information

Definition 8. Entropy H := E(I)

Example 9 (Connection entropy and structure).

Consider the file

aaaaabaabbaaaaababaaaaabaaaaaabaabaaab....

and assume

• P(a) = p

• P(b) = 1 − p

Page 26:

then entropy per symbol is

H = E(I) = p I(p) + (1 − p) I(1 − p)

[plot of H against p for 0 ≤ p ≤ 1: zero at p = 0 and p = 1, maximum 1 at p = 1/2]

Makes sense:

p ≈ 1 : aaaaabaaaaaabaaaa little surprise

p ≈ 0 : bbbbbbbabbbbbbab little surprise

p = 1/2 : abaabbabaabbaaba maximal disorder (symmetry)
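A minimal sketch of this binary entropy function:

```python
import math

def H(p: float) -> float:
    """Entropy in bits per symbol of a binary source with P(a) = p."""
    if p in (0.0, 1.0):
        return 0.0  # no surprise at all
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(H(0.99))  # ~0.08: little surprise
print(H(0.5))   # 1.0: maximal disorder
```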

Page 27:

Example 10 (The abcd-file).

Consider a file with symbols ‘a’, ‘b’, ‘c’ and ‘d’:

bdaccdaadaabbcaaabbaaabaaacbdaab.....

and suppose that

P(a) = 1/2

P(b) = 1/4

P(c) = 1/8

P(d) = 1/8

Page 28:

Its entropy (per symbol) is

H = ∑_i p_i I(p_i)
  = (1/2) I(2^−1) + (1/4) I(2^−2) + 2 · (1/8) I(2^−3)
  = (1/2)(1) + (1/4)(2) + 2 · (1/8)(3)
  = 1/2 + 1/2 + 3/4
  = 1.75
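The same computation in a couple of lines:

```python
from math import log2

# H = sum of p_i * I(p_i) with I(p) = -log2(p), for the abcd source.
probs = [1/2, 1/4, 1/8, 1/8]
print(sum(-p * log2(p) for p in probs))  # 1.75
```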

Page 29:

Back to compression

Example 11 (The abcd-file). Consider again

bdaccdaadaabbcaaabbaaabaaacbdaab.....

with again

P(a) = 1/2

P(b) = 1/4

P(c) = 1/8

P(d) = 1/8

We want to code this in binary.

Page 30:

This is an obvious coding:

a ↔ 00
b ↔ 01
c ↔ 10
d ↔ 11

For example:

abba ↔ 00010100

This coding is decodable

Page 31:

As all code words ∈ {00, 01, 10, 11} consist of two bits:

(average) # bits per symbol = 2

Page 32:

As all code words ∈ {00, 01, 10, 11} consist of two bits:

(average) # bits per symbol = 2

Can this be done more efficiently?

Page 33:

Yes:

a ↔ 0 with probability 1/2

b ↔ 10 with probability 1/4

c ↔ 110 with probability 1/8

d ↔ 111 with probability 1/8

Symbols that appear more often have shorter code words

Page 34:

Yes:

a ↔ 0 with probability 1/2

b ↔ 10 with probability 1/4

c ↔ 110 with probability 1/8

d ↔ 111 with probability 1/8

Symbols that appear more often have shorter code words

average # bits per symbol = (1/2)(1) + (1/4)(2) + 2 · (1/8)(3) = 1.75
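Since no code word is a prefix of another, a bit stream decodes greedily from left to right; a minimal sketch of this code:

```python
# The code from the slide; no code word is a prefix of another,
# so decoding can proceed greedily, bit by bit.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
decode_table = {w: s for s, w in code.items()}

def encode(text: str) -> str:
    return "".join(code[s] for s in text)

def decode(bits: str) -> str:
    out, word = [], ""
    for bit in bits:
        word += bit
        if word in decode_table:  # a complete code word has been read
            out.append(decode_table[word])
            word = ""
    return "".join(out)

assert decode(encode("abba")) == "abba"
print(encode("abba"))  # 010100
```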

Page 35:

Can this be done still more efficiently?

Page 36:

Can this be done still more efficiently?

Shannon (1948): ‘No, because the entropy of the source is 1.75.’

Page 37:

Theorem 12.

H ≤ inf_codings E(L) ≤ H + 1

with L the # bits per symbol.

Page 38:

Example 13 (terrible coding). Suppose our file is

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

with P(a) = 1. This file has entropy (per symbol)

H = 0.

and for the ‘optimal’ coding

a ↔ 0

we have

E(L) = 1 = H + 1

Page 39:

Example 13 (terrible coding). Suppose our file is

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

with P(a) = 1. This file has entropy (per symbol)

H = 0.

and for the ‘optimal’ coding

a ↔ 0

we have

E(L) = 1 = H + 1

Message: Coding per symbol is not a good idea.

Page 40:

Coding sequences of symbols

What was not allowed was a coding like

aaaaaaaaaaaaaaaaaaaaaaaaa ↔ ‘25 a’s’ ↔ ....

Pity, but not too bad:

Page 41:

We may consider

absdfuhquwyhgvsdfvsefqwsddfhwdqqwsd

as a sequence of ‘letters’ (symbols)

but also as a sequence of N-letter blocks (symbols)

absdfuhquwyhgvsdfvsefqwsddfhwdqqwsd

This requires coding of many ‘symbols’

absdf ↔ 0110101000... ↔ ...

Page 42:

Entropy per N-letter symbol is

H_N = N · H

Page 43:

So Shannon says:

N · H ≤ inf E(L) ≤ N · H + 1    (L = # bits per N-letter symbol)

that is

H ≤ inf E(L/N) ≤ H + 1/N

Beautiful:

Theorem 14.

inf_codings E(# bits per letter) = H
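A numerical illustration of how block coding squeezes the rate toward H: take code word lengths ⌈−log2 P(block)⌉ on N-letter blocks (these lengths satisfy the Kraft inequality of Lemma 16 below, so a decodable coding with them exists). The source probabilities here are made up for the example:

```python
from itertools import product
from math import ceil, log2, prod

# i.i.d. binary source; the probabilities are made up for this illustration.
p = {"a": 0.9, "b": 0.1}
H = sum(-q * log2(q) for q in p.values())  # ~0.469 bits per letter

for N in (1, 2, 4, 8):
    # Code word length per N-letter block: l(block) = ceil(-log2 P(block)).
    EL = 0.0
    for block in product(p, repeat=N):
        q = prod(p[s] for s in block)
        EL += q * ceil(-log2(q))
    print(N, EL / N)  # bits per letter, squeezed into [H, H + 1/N]
```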

Page 44:

Entropy of languages—redundancy

Written languages have high redundancy:

Aoccdrnig to rscheearch at an Elingsh uinervtisy, it deosn’t

mttaer in waht oredr the ltteers in a wrod are, the olny

iprmoetnt tihng is taht the frist and lsat ltteer is at the rghit

pclae.

Languages are highly structured, so they do not seem to have high entropy.

Page 45:

How to determine the entropy of ‘English’

Silly approach:

P(a) = P(b) = · · · = P(z) = P(space) = 1/27 =⇒ H ≈ 4.8

Page 46:

First-order approximation:

P(a) = 0.0575
P(b) = 0.0128
⋮
P(e) = 0.0913
⋮
P(z) = 0.007

then

H ≈ 4.1
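Such a first-order estimate is easy to reproduce from any large English text; a minimal sketch, where the short text variable is a stand-in for a real corpus:

```python
from collections import Counter
from math import log2

def first_order_entropy(text: str) -> float:
    """Entropy estimate from single-letter frequencies only."""
    symbols = [c for c in text.lower() if c.isalpha() or c == " "]
    counts = Counter(symbols)
    n = len(symbols)
    return sum(-(k / n) * log2(k / n) for k in counts.values())

# On a large English corpus this comes out near 4.1 bits per letter.
text = "to be or not to be that is the question"  # stand-in corpus
print(first_order_entropy(text))
```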

Page 47:

Higher-order approximation:

P(be | to be or not to) = ?? ...

Then

H = ?

Page 48:

It is believed that

0.6 < H_English < 1.3

that is: about 1 bit per letter is enough!

Computers represent letters in ASCII (8 bits), so this compression ratio should be possible (in theory):

H_English / 8 < 1.3 / 8 ≈ 1/6

In other words:

Perfect zippers compress ‘English’ by a factor of about 6.

Page 49:

Proofs

Lemma 15. For every decodable coding

∑_i 2^−l_i ≤ 1

with l_i the length of code word number i.

Page 50:

Proof.

aaaaaaaaa (N letters) ↔ 1010001011 (∑_{k=1}^N l_{i_k} bits)
baaaaaaaa ↔ 0110110001111 ...
⋮
zzzzzzzzz ↔ 011011010001111

Each N-letter block gets a concatenated code word of length l := ∑_{k=1}^N l_{i_k}, and decodability implies

# code words of length l ≤ 2^l

Page 51:

S_N := (∑_i 2^−l_i)^N = ∑_{i_1,...,i_N} 2^−(l_{i_1} + ··· + l_{i_N})

then

S_N = ∑_l 2^−l A_l,   A_l := # code words of length l

Then A_l ≤ 2^l because uniquely decodable. So, with the sum running over l ≤ N · l_max,

S_N = ∑_l 2^−l A_l ≤ ∑_l 2^−l 2^l = N · l_max

Therefore

(∑_i 2^−l_i)^N ≤ N · l_max

The left-hand side is exponential in N, the right-hand side linear. Hence the base of the exponent is ≤ 1.

Page 52:

Lemma 16. If the l_i are such that

∑_i 2^−l_i ≤ 1

then there is a (directly) decodable coding.

Page 53:

Proof. Construct a tree with those lengths

(suppose lengths are {1, 2, 3, 3})

[binary tree with depth levels 0, 1, 2, 3]

Page 54:

[code tree assigning code words 0, 10, 110, 111 to the leaves; node weights 1, 1/2, 1/4, 1/8]

This one is decodable.
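The tree construction can be mimicked in code by handing out code words in order of increasing length, always taking the next free node at the required depth (a canonical-code construction, not spelled out on the slides):

```python
def code_from_lengths(lengths):
    """Build a prefix code with the given code word lengths,
    assuming they satisfy Kraft's inequality."""
    assert sum(2**-l for l in lengths) <= 1
    words, value, depth = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - depth)  # descend to depth l in the tree
        words.append(format(value, f"0{l}b"))
        value += 1             # next free node at this depth
        depth = l
    return words

print(code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```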

Page 55:

Proof that H ≤ E(L).

E(L) − H = ∑_i p_i l_i + ∑_i p_i log2(p_i)
         = ∑_i p_i log2(2^l_i p_i)
         = ∑_i p_i ln(2^l_i p_i) / ln(2)
         = (1/ln(2)) ∑_i p_i ln(2^l_i p_i)
         ≥ (1/ln(2)) ∑_i p_i (1 − 2^−l_i / p_i)      [ln(x) ≥ 1 − 1/x]
         = (1/ln(2)) (∑_i p_i − ∑_i 2^−l_i) ≥ 0.

Page 56:

Proof of

inf E(L) ≤ H + 1

is constructive: take

l_i = ⌈−log2(p_i)⌉

then

∑_i 2^−l_i ≤ ∑_i p_i = 1

so a decodable coding exists. We have

E(L) = ∑_i p_i l_i < ∑_i p_i (−log2(p_i) + 1) = H + 1.
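A numeric spot-check of both bounds with these constructive lengths, on a randomly generated distribution:

```python
from math import ceil, log2
import random

# Spot-check H <= E(L) < H + 1 for Shannon lengths l_i = ceil(-log2 p_i).
random.seed(1)
weights = [random.random() for _ in range(6)]
probs = [w / sum(weights) for w in weights]

H = sum(-p * log2(p) for p in probs)
EL = sum(p * ceil(-log2(p)) for p in probs)
assert H <= EL < H + 1
print(f"H = {H:.3f}, E(L) = {EL:.3f}")
```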
