
Source Coding

Efficient Data Representation

A.J. Han Vinck

DATA COMPRESSION / REDUCTION

• 1) INTRODUCTION: presentation of messages in binary format; conversion into binary form

• 2) DATA COMPRESSION: lossless – data compression without errors

• 3) DATA REDUCTION: lossy – data reduction using prediction and context

CONVERSION into DIGITAL FORM

1 SAMPLING: discrete exact time samples of the continuous (analog) signal are produced

[Figure: analog signal v(t) sampled at times T, 2T, 3T, 4T; digital sample values (e.g. 01, 11, 10, 10); sample rate R = 1/T]

2 QUANTIZING: approximation (lossy) into a set of discrete levels

3 ENCODING: representation of a level by a binary symbol
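To make the three steps concrete, here is a minimal Python sketch; the 1 kHz sine wave, the 8 kHz sample rate and the 4-level (2-bit) uniform quantizer are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of sampling, quantizing and encoding (illustrative parameters).
import math

f, R = 1000.0, 8000.0          # assumed signal frequency and sample rate (Hz)
T = 1.0 / R
levels = 4                     # 2-bit uniform quantizer

samples = [math.sin(2 * math.pi * f * n * T) for n in range(8)]                 # 1) sampling
quantized = [min(levels - 1, int((s + 1.0) / 2.0 * levels)) for s in samples]   # 2) quantizing
codes = [format(q, "02b") for q in quantized]                                   # 3) encoding
print(codes)                   # binary representation of the sample values
```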

HOW FAST SHOULD WE SAMPLE ?

principle:

- an analog signal can be seen as a sum of sine waves, with some highest sine frequency Fh

- unique reconstruction from its (exact) samples is possible if (Nyquist, 1928) the sample rate R ≥ 2 Fh

• We LIMIT the highest FREQUENCY of the SOURCE without introducing distortion!

EXAMPLES:

text: represent every symbol with 8 bits; storage: 8 × (500 pages) × 1000 symbols/page = 4 Mbit; compression possible to 1 Mbit (1:4)

speech: sampling speed 8000 samples/s; accuracy 8 bits/sample; needed transmission speed 64 kbit/s; compression possible to 4.8 kbit/s (1:10)

CD music: sampling speed 44.1 k samples/s; accuracy 16 bits/sample; needed storage capacity for one hour of stereo: 5 Gbit ≈ 1250 books; compression possible to 4 bits/sample (1:4)

digital pictures: 300 × 400 pixels × 3 colors × 8 bits/sample ≈ 2.9 Mbit/picture; for 25 images/second we need 75 Mb/s

2 hours of pictures need 540 Gbit ≈ 130,000 books; compression needed (1:100)
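A quick back-of-the-envelope check of these figures (a sketch; the slide rounds the video rate up to 75 Mb/s and the 2-hour figure to 540 Gbit):

```python
# Recompute the storage/rate figures used in the examples above.
text   = 8 * 500 * 1000            # 500-page book, 1000 symbols/page     -> 4 Mbit
speech = 8000 * 8                  # samples/s * bits/sample              -> 64 kbit/s
cd     = 44100 * 16 * 2 * 3600     # one hour of stereo                   -> ~5.1 Gbit
video  = 300 * 400 * 3 * 8 * 25    # pixels * colors * bits * pictures/s  -> ~72 Mbit/s (slide: 75)
movie  = video * 2 * 3600          # 2 hours of pictures                  -> ~520 Gbit (slide: 540)
print(text, speech, cd, video, movie)
```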

we have to reduce the amount of data !!

using: prediction and context

- statistical properties
- models
- perceptual properties

LOSSLESS: remove redundancy exact !!  

LOSSY: remove irrelevance with distortion !!

Shannon source coding theorem

• Assume independent source outputs; consider runs of outputs of length L

• we expect a certain “type of runs” and

– give a code word for an expected run with prefix ‘1’
– an unexpected run is transmitted as it appears with prefix ‘0’

• Example: throw a die 600 times, what do you expect?
• Example: throw a coin 100 times, what do you expect?

We start with a binary source

Assume: binary sequence x of length L

P(0) = 1 – P(1) = 1-p; t is the # of 1‘s

For L → ∞, and ε > 0 as small as desired:

Probability( |t/L – p| > ε ) → 0

i.e. Probability( |t/L – p| ≤ ε ) ≥ 1 – ε → 1   (1): L(p – ε) ≤ t ≤ L(p + ε) with high probability
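A small simulation of statement (1), assuming p = 0.3 and ε = 0.04 (the values used in the homework below): the probability that the relative number of ones stays within ε of p approaches 1 as L grows.

```python
# Estimate Probability(|t/L - p| <= eps) for a Bernoulli(p) source by simulation.
import random

p, eps, trials = 0.3, 0.04, 2000
for L in (100, 1000, 10000):
    inside = sum(abs(sum(random.random() < p for _ in range(L)) / L - p) <= eps
                 for _ in range(trials))
    print(L, inside / trials)      # tends to 1 as L grows
```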

Consequence 1:

Let A be the set of typical sequences i.e. obeying (1)

for these sequences: |t/L – p| ≤ ε

then, P(A) → 1 (as close as wanted, i.e. P(A) ≥ 1 – ε)

or: almost all observed sequences are typical and have about t ones

Note: we use the notation ≈ when we assume that L → ∞

Consequence 2:

The cardinality of the set A is

$$|A| \approx \binom{L}{Lp} \approx 2^{Lh(p)} \quad (t \approx Lp \text{ for sequences in } A),$$

i.e.

$$\log_2|A| \approx \log_2\binom{L}{Lp} \approx \log_2 2^{Lh(p)} = Lh(p), \qquad \frac{1}{L}\log_2|A| \approx h(p),$$

where $h(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function.

Shannon 1

Encode

every vector in A with L(h(p)+ε)+1 bits

every vector in Ac with L bits

– use a prefix to signal whether we have a typical sequence or not

The average codeword length:

K = (1 – ε)[L(h(p)+ε) + 1] + εL + 1 ≈ L·h(p) bits
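A toy implementation of this two-part code (a sketch; it enumerates the typical set explicitly, which is only feasible for small L):

```python
# Typical sequences get prefix '1' plus an index of ~L(h(p)+eps) bits;
# all other sequences get prefix '0' plus the L raw bits.
from math import ceil, log2

def encode(x, p, eps):
    L, t = len(x), sum(x)
    lo, hi = L * (p - eps), L * (p + eps)
    if lo <= t <= hi:                                     # typical sequence
        typical = [u for u in range(2 ** L)
                   if lo <= bin(u).count("1") <= hi]      # enumerate the set A
        idx = typical.index(int("".join(map(str, x)), 2))
        return "1" + format(idx, "0%db" % ceil(log2(len(typical))))
    return "0" + "".join(map(str, x))                     # atypical: send as is

print(encode([0, 1, 0, 0, 1, 0, 0, 0], p=0.3, eps=0.1))
```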

Shannon 1: converse

source output words X^L → encoder → Y^k: k output bits per L input symbols

H(X) = h(p); H(X^L) = L·h(p); k = L[h(p) – ε], ε > 0

encoder: assignment of 2^{L[h(p)–ε]} – 1 code words ≠ 0, or the all-zero code word

Pe = Prob(all-zero code word assigned) = Prob(error)

Shannon 1: converse

source output words X^L → encoder → Y^k: k output bits per L input symbols, with H(X) = h(p); H(X^L) = L·h(p); k = L[h(p) – ε], ε > 0

H(X^L, Y^k) = H(X^L) + H(Y^k | X^L) = L·h(p)

= H(Y^k) + H(X^L | Y^k)

≤ L[h(p) – ε] + h(Pe) + Pe·log2|source|   (Fano's inequality; here log2|source| = L)

L·h(p) ≤ L[h(p) – ε] + 1 + L·Pe  ⟹  Pe ≥ ε – 1/L > 0 !

typicality

Homework:

Calculate for ε = 0.04 and p = 0.3

h(p), |A|, P(A), (1 – ε)·2^{L(h(p)–ε)}, 2^{L(h(p)+ε)}

as a function of L
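A possible starting point for this homework (a sketch; |A| and P(A) are computed exactly by summing binomial terms over L(p–ε) ≤ t ≤ L(p+ε)):

```python
# h(p), |A|, P(A) and the two exponential bounds, as a function of L.
from math import comb, log2

def h(p):                                            # binary entropy function
    return -p * log2(p) - (1 - p) * log2(1 - p)

p, eps = 0.3, 0.04
for L in (50, 100, 200, 500):
    ts = [t for t in range(L + 1) if L * (p - eps) <= t <= L * (p + eps)]
    card_A = sum(comb(L, t) for t in ts)
    P_A = sum(comb(L, t) * p ** t * (1 - p) ** (L - t) for t in ts)
    print(L, h(p), card_A, P_A,
          (1 - eps) * 2 ** (L * (h(p) - eps)), 2 ** (L * (h(p) + eps)))
```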

Homework:

Repeat the same arguments for a more general source with entropy H(X)

Sources with independent outputs

• Let U be a source with independent outputs: U1 U2 … UL (subscript = time)

The set of typical sequences can be defined as

$$A = \left\{ U^L : \left| \frac{-\log_2 P(U^L)}{L} - H(U) \right| \le \varepsilon \right\}$$

Then, for large L:

$$\Pr\left\{ \left| \frac{-\log_2 P(U^L)}{L} - H(U) \right| > \varepsilon \right\} \le \frac{\mathrm{variance}(-\log_2 P(U))}{L\,\varepsilon^2} \to 0$$
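An empirical check of this definition (a sketch with a hypothetical three-symbol alphabet): the normalized log-probability of a long output sequence concentrates around H(U).

```python
# -log2 P(U^L) / L concentrates around H(U) for an i.i.d. source.
import random
from math import log2

P = {"a": 0.5, "b": 0.25, "c": 0.25}                     # assumed source probabilities
H = -sum(p * log2(p) for p in P.values())                # H(U) = 1.5 bits
for L in (100, 1000, 10000):
    seq = random.choices(list(P), weights=list(P.values()), k=L)
    print(L, H, sum(-log2(P[u]) for u in seq) / L)       # the last two columns get close
```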

Sources with independent outputs cont’d

To see how it works, we can write

$$\frac{-\log_2 P(U^L)}{L} = \frac{-\sum_{i=1}^{L}\log_2 P(U_i)}{L} = -\frac{N_1}{L}\log_2 P_1 - \frac{N_2}{L}\log_2 P_2 - \cdots - \frac{N_{|U|}}{L}\log_2 P_{|U|} \approx -\sum_{i=1}^{|U|} P_i \log_2 P_i = H(U)$$

where |U| is the size of the alphabet, Pi the probability that symbol i occurs, and Ni the number of occurrences of symbol i (so Ni/L, the fraction of occurrences, ≈ Pi for large L)

Sources with independent outputs cont’d

• the cardinality of the set A is |A| ≈ 2^{LH(U)}

• Proof: for every $u^L \in A$,

$$\left| \frac{-\log_2 P(u^L)}{L} - H(U) \right| \le \varepsilon \;\Rightarrow\; 2^{-L(H(U)+\varepsilon)} \le P(u^L) \le 2^{-L(H(U)-\varepsilon)},$$

and

$$1 \ge P(A) = \sum_{u^L \in A} P(u^L) \ge |A|\; 2^{-L(H(U)+\varepsilon)},$$

hence

$$|A| \le 2^{L(H(U)+\varepsilon)}.$$

encoding

Encode

every vector in A with L(H(U)+ε)+1 bits

every vector in Ac with L bits

– Use a prefix to signal whether we have a typical sequence or not

The average codeword length:

K = (1 – ε)[L(H(U)+ε) + 1] + εL + 1 ≈ L·H(U) + 1 bits

converse

• For converse, see binary source

Sources with memory

Let U be a source with memory

Output: U1 U2 … UL (subscript = time)

states: S ∈ {1, 2, …, |S|}

The entropy or minimum description length

H(U) = H(U1 U2 … UL) (use the chain rule)

= H(U1) + H(U2|U1) + … + H(UL | UL-1 … U2 U1)

≤ H(U1) + H(U2) + … + H(UL)   (use H(X) ≥ H(X|Y))

How to calculate?

Stationary sources with memory

H(UL | UL-1 … U2 U1) ≤ H(UL | UL-1 … U2) = H(UL-1 | UL-2 … U1)   (stationarity)

conditioning on less memory can only increase the entropy

conclusion: there must be a limit for the innovation for large L

Cont‘d

H(U1 U2 … UL)

= H(U1 U2 … UL-1) + H(UL | UL-1 … U2 U1)

≤ H(U1 U2 … UL-1) + 1/L·[H(U1) + H(U2|U1) + … + H(UL | UL-1 … U2 U1)]

= H(U1 U2 … UL-1) + 1/L·H(U1 U2 … UL)

thus: 1/L·H(U1 U2 … UL) ≤ 1/(L-1)·H(U1 U2 … UL-1)

conclusion: the normalized block entropy is non-increasing in L and non-negative, so it has a limit

Cont‘d

1/L·H(U1 U2 … UL) = 1/L·[H(U1) + H(U2|U1) + … + H(UL | UL-1 … U2 U1)]

≥ H(UL | UL-1 … U2 U1)

conclusion: the normalized block entropy is ≥ the innovation

H(U1 U2 … UL+j) ≤ H(U1 U2 … UL-1) + (j+1)·H(UL | UL-1 … U2 U1)

conclusion: the limit (large j) of the normalized block entropy is ≤ the innovation

THUS: limit of the normalized block entropy = limit of the innovation

Sources with memory: example

Note: H(U) = minimum representation length

H(U) ≤ average length of a practical scheme

English Text: First approach consider words as symbols

Scheme: 1. count word frequencies

2. Binary encode words

3. Calculate average representation length

Conclusion: we need 12 bits/word; average word length 4.5 letters

H(English) ≈ 2.6 bits/letter
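A minimal sketch of the word-based scheme above; `sample.txt` is a placeholder for any English text file, and the entropy of the word frequencies is used as the achievable bits/word.

```python
# Count word frequencies, compute bits/word (entropy) and bits/letter.
from collections import Counter
from math import log2

words = open("sample.txt", encoding="utf-8").read().lower().split()   # placeholder file
counts = Counter(words)
total = sum(counts.values())
bits_per_word = -sum(c / total * log2(c / total) for c in counts.values())
avg_word_len = sum(len(w) * c for w, c in counts.items()) / total
# for a large corpus this comes out near the slide's 12 bits/word and ~2.6 bits/letter
print(bits_per_word, avg_word_len, bits_per_word / avg_word_len)
```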

Homework: estimate the entropy of your favorite programming language

Example: Zipf‘s law

Procedure:

order the English words according to frequency of occurrence

Zipf‘s law: frequency of the word at position n is Fn = A/n

for English, A ≈ 0.1

[Figure: log-log plot of word frequency (1 down to 0.001) versus word rank (1 to 1000), following Fn = A/n]

Result: H(English) = 2.16 bits/letter

In general: Fn = a·n^(–b), where a and b are constants

Application: web-access, complexity of languages
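The 2.16 bits/letter figure can be reproduced from Zipf's law directly (a sketch, assuming a vocabulary size chosen so the frequencies Fn = 0.1/n sum to about 1, and the 4.5 letters/word average from the previous slide):

```python
# Entropy per word of the Zipf distribution F_n = A/n, converted to bits/letter.
from math import log2

A, N = 0.1, 12366                              # sum of A/n over n = 1..N is ~1
freqs = [A / n for n in range(1, N + 1)]
H_word = -sum(f * log2(f) for f in freqs)      # entropy in bits/word
print(sum(freqs), H_word, H_word / 4.5)        # ~1.0, ~9.7 bits/word, ~2.16 bits/letter
```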

Sources with memory: example

Another approach: H(U) = H(U1) + H(U2|U1) + … + H(UL | UL-1 … U2 U1)

Consider text as stationary, i.e. the statistical properties are fixed

Measure: P(a), P(b), etc. → H(U1) ≈ 4.1 bits

P(a|a), P(b|a), ... → H(U2|U1) ≈ 3.6 bits

P(a|aa), P(b|aa), ... → H(U3|U2 U1) ≈ 3.3 bits

...

Note: More given letters reduce the entropy.

Shannon's experiments give as a result that 0.6 ≤ H(English) ≤ 1.3 bits/letter
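A sketch of how the measurements above can be made from a text sample (`sample.txt` is again a placeholder); H(U2|U1) is obtained as H(U1 U2) – H(U1).

```python
# Estimate H(U1) and H(U2|U1) from letter unigram and bigram counts.
from collections import Counter
from math import log2

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

letters = [c for c in open("sample.txt", encoding="utf-8").read().lower() if c.isalpha()]
H1 = entropy(Counter(letters))                       # unigram entropy
H12 = entropy(Counter(zip(letters, letters[1:])))    # bigram (joint) entropy
print(H1, H12 - H1)     # roughly 4.1 and 3.6 bits/letter for large English samples
```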

Example: Markov model

A Markov model has states: S ∈ {1, …, |S|}

State probabilities Pt(S = i)

Transitions t → t+1 with probability P(st+1 = j | st = i)

Every transition determines a specific output, see the example below

[Figure: states i, j, k with transitions labeled P(st+1 = j | st = i) and P(st+1 = k | st = i); each transition is labeled with its binary output symbol (0 or 1)]

Markov Sources

Instead of H(U) = H(U1 U2 … UL), we consider

H(U′) = H(U1 U2 … UL, S1)

= H(S1) + H(U1|S1) + H(U2|U1 S1) + … + H(UL | UL-1 … U1 S1)

= H(U) + H(S1 | UL UL-1 … U1)

Markov property: output Ut depends on St only;

St and Ut determine St+1

H(U′) = H(S1) + H(U1|S1) + H(U2|S2) + … + H(UL|SL)

Markov Sources: further reduction

assume: stationarity, i.e. P(St=i) = P (S=i) for all t > 0 and all i

then: H(U1|S1) = H(Ui|Si), i = 2, 3, …, L

H(U1|S1) = P(S1=1)·H(U1|S1=1) + P(S1=2)·H(U1|S1=2) + … + P(S1=|S|)·H(U1|S1=|S|)

and H(U′) = H(S1) + L·H(U1|S1)

H(U) = H(U′) – H(S1 | U1 U2 … UL)

Per symbol: 1/L·H(U) = H(U1|S1) + 1/L·[H(S1) – H(S1 | U1 U2 … UL)], where the last term → 0 for large L
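A small sketch of this per-symbol quantity, using a hypothetical two-state stationary Markov source (the state probabilities and per-state output distributions are assumptions, not from the slides):

```python
# Per-symbol entropy ~ H(U1|S1) = sum_i P(S=i) * H(U1|S1=i) for a stationary Markov source.
from math import log2

def H(dist):                                    # entropy of a probability vector
    return -sum(p * log2(p) for p in dist if p > 0)

P_state = {1: 0.5, 2: 0.5}                      # assumed stationary state probabilities
P_out = {1: [0.5, 0.5], 2: [0.9, 0.1]}          # assumed P(U | S = i)
print(sum(P_state[i] * H(P_out[i]) for i in P_state))   # bits/symbol
```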

example

Pt(A)= ½

Pt (B)= ¼

Pt (C)= ¼

[Figure: state transition diagram over states A, B, C; transitions labeled with probabilities ½, ¼, ¼ and output symbols 0, 1, 2, and with P(A|B) = P(A|C) = ½ and binary output symbols 0, 1 on the remaining transitions]

The entropy for this example = 1¼ bits/symbol

Homework: check!

Homework: construct another example that is stationary and calculate H

Appendix to show (1)

For a binary sequence X of length n with P(Xi =1)=p

Let $y = \frac{1}{n}\sum_{i=1}^{n} x_i$; then

$$E(y) = E\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big) = p$$

and

$$\sigma_y^2 = E[(y-p)^2] = \frac{1}{n^2}\,E\Big[\Big(\sum_{i=1}^{n} x_i - np\Big)^2\Big] = \frac{1}{n}\,p(1-p).$$

Use: $E[(x_i)^2] = p$, so the variance of each $x_i$ is $p - p^2 = p(1-p)$.

Because the Xi are independent, the variance of the sum = sum of the variances

Then, with $y = t/n$, Chebyshev's inequality gives

$$\Pr(|y - E(y)| > \varepsilon) = \Pr\big((y - E(y))^2 > \varepsilon^2\big) \le \frac{\sigma_y^2}{\varepsilon^2} = \frac{p(1-p)}{n\,\varepsilon^2} \to 0.$$