
Information Theory

Ying Nian Wu, UCLA Department of Statistics

July 9, 2007, IPAM Summer School

Goal: A gentle introduction to the basic concepts in information theory

Emphasis: understanding and interpretations of these concepts

Reference: Elements of Information Theory by Cover and Thomas

Topics
• Entropy and relative entropy
• Asymptotic equipartition property
• Data compression
• Large deviation
• Kolmogorov complexity
• Entropy rate of a process

Entropy: randomness or uncertainty of a probability distribution

Notation: p(x) = \Pr(X = x); \Pr(X = a_k) = p_k.

Example:

    x:    a     b     c     d
    Pr:  1/2   1/4   1/8   1/8

In general, X takes values a_1, a_2, \ldots, a_K with probabilities p_1, p_2, \ldots, p_K.

Definition:

    H(X) = H(p_1, \ldots, p_K) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)

Entropy

Example:

    x:    a     b     c     d
    Pr:  1/2   1/4   1/8   1/8

    H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4}

(Logs are base 2, so entropy is measured in bits.)

Definition, for both discrete and continuous X:

    H(p) = E_p[-\log p(X)]

Recall E[f(X)] = \sum_x f(x) p(x).


Example (same distribution, tabulating -\log p):

    x:        a     b     c     d
    Pr:      1/2   1/4   1/8   1/8
    -\log p:   1     2     3     3

    H(X) = \frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{8}(3) + \frac{1}{8}(3) = \frac{7}{4}

    H(p) = E_p[-\log p(X)]
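A quick numerical check of this calculation, as a minimal Python sketch (the function name and the list of probabilities are just illustrations of the example above):

    import math

    def entropy(probs):
        # Shannon entropy in bits: H(p) = -sum_k p_k * log2(p_k)
        return -sum(p * math.log2(p) for p in probs if p > 0)

    p = [1/2, 1/4, 1/8, 1/8]   # Pr(a), Pr(b), Pr(c), Pr(d)
    print(entropy(p))          # 1.75, i.e. 7/4 bits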

Interpretation 1: cardinality

Uniform distribution: X \sim \mathrm{Unif}[A].

There are |A| elements in A, and all these choices are equally likely.

    H(X) = E[-\log p(X)] = E[\log |A|] = \log |A|

Entropy can be interpreted as the log of a volume or size. An n-dimensional cube has 2^n vertices, so entropy can also be interpreted as a dimensionality.

What if the distribution is not uniform?

Asymptotic equipartition property

Any distribution is essentially a uniform distribution under long-run repetition.

    X_1, X_2, \ldots, X_n \sim p(x) independently

    p(X_1, X_2, \ldots, X_n) = p(X_1) p(X_2) \cdots p(X_n)

Recall that if X \sim \mathrm{Unif}[A], then p(X) = 1/|A|, a constant.

Is p(X_1, \ldots, X_n) random?

But in some sense, it is essentially a constant!

Law of large numbers

    X_1, X_2, \ldots, X_n \sim p(x) independently

    \frac{1}{n} \sum_{i=1}^n f(X_i) \to E[f(X)] = \sum_x f(x) p(x)

The long-run average converges to the expectation.

Asymptotic equipartition property

    p(X_1, X_2, \ldots, X_n) = p(X_1) p(X_2) \cdots p(X_n)

    -\frac{1}{n} \log p(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^n \big(-\log p(X_i)\big) \to E[-\log p(X)] = H(p)

Intuitively, in the long run,

    p(X_1, \ldots, X_n) \approx 2^{-n H(p)}

Recall that if X \sim \mathrm{Unif}[A], then p(X) = 1/|A|, a constant.

Therefore it is as if (X_1, \ldots, X_n) \sim \mathrm{Unif}[A], with |A| = 2^{n H(p)}.

So the dimensionality per observation is H(p).
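A small simulation of this concentration, as a sketch (the distribution is the running example; the sample size is arbitrary): draw an i.i.d. sequence and check that -(1/n) log2 p(X_1, ..., X_n) is close to H(p).

    import math, random

    probs = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
    H = -sum(p * math.log2(p) for p in probs.values())          # 1.75 bits

    random.seed(0)
    n = 100000
    xs = random.choices(list(probs), weights=list(probs.values()), k=n)

    # -(1/n) log2 p(X_1, ..., X_n) = average of -log2 p(X_i)
    per_symbol = -sum(math.log2(probs[x]) for x in xs) / n
    print(H, per_symbol)   # the two values should be close for large n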

Asymptotic equipartition property

We can make it more rigorous

    X_1, X_2, \ldots, X_n \sim p(x) independently

    \Pr\left( \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - E[f(X)] \right| < \epsilon \right) > 1 - \epsilon \quad \text{for } n > n_0,

where E[f(X)] = \sum_x f(x) p(x).

Weak law of large numbers

Typical set

Recall: p(X_1, \ldots, X_n) \approx 2^{-n H(p)}, so (X_1, \ldots, X_n) is as if \mathrm{Unif}[A] with |A| = 2^{n H(p)}.

Typical set A_{n,\epsilon}: the set of sequences (x_1, x_2, \ldots, x_n) such that

    2^{-n(H(p)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(p)-\epsilon)}

Properties:

    p(A_{n,\epsilon}) \ge 1 - \epsilon \quad \text{for sufficiently large } n

    (1-\epsilon)\, 2^{n(H(p)-\epsilon)} \le |A_{n,\epsilon}| \le 2^{n(H(p)+\epsilon)}
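For small n the typical set can be enumerated by brute force; a sketch (n and epsilon are arbitrary choices, and the distribution is the running example):

    import math
    from itertools import product

    probs = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
    H = -sum(p * math.log2(p) for p in probs.values())           # 1.75

    n, eps = 8, 0.2
    lo, hi = 2 ** (-n * (H + eps)), 2 ** (-n * (H - eps))

    # Keep the sequences whose probability lies in [2^{-n(H+eps)}, 2^{-n(H-eps)}]
    typical = [seq for seq in product(probs, repeat=n)
               if lo <= math.prod(probs[x] for x in seq) <= hi]

    mass = sum(math.prod(probs[x] for x in seq) for seq in typical)
    print(len(typical), 2 ** (n * (H + eps)), mass)   # |A_{n,eps}|, its upper bound, p(A_{n,eps})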

Interpretation 2: coin flipping

Flip a fair coin: {Head, Tail}.
Flip a fair coin twice independently: {HH, HT, TH, TT}.
...
Flip a fair coin n times independently: 2^n equally likely sequences.

We may interpret entropy as the number of flips.

Example

    x:     a     b     c     d
    Pr:   1/4   1/4   1/4   1/4
    flips: HH    HT    TH    TT

The above uniform distribution amounts to 2 coin flips

Interpretation 2: coin flipping

Recall: p(X_1, \ldots, X_n) \approx 2^{-n H(p)}, and (X_1, \ldots, X_n) is as if \mathrm{Unif}[A] with |A| = 2^{n H(p)}.

So (X_1, \ldots, X_n) amounts to n H(p) flips, and X \sim p(x) amounts to H(p) flips.

Interpretation 2: coin flipping

    x:     a     b     c     d
    Pr:   1/2   1/4   1/8   1/8
    flips: H     TH    TTH   TTT

The flips follow a binary tree: H \to a; T, H \to b; T, T, H \to c; T, T, T \to d.

Interpretation 2: coin flipping

    x:        a     b     c     d
    Pr:      1/2   1/4   1/8   1/8
    -\log p:   1     2     3     3
    flips:    H     TH    TTH   TTT

    H(X) = \frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{8}(3) + \frac{1}{8}(3) = \frac{7}{4}

So the expected number of flips equals H(X) = 7/4.
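A simulation of this flipping scheme, as a sketch (the stopping rule follows the flip table above): generate symbols by fair coin flips and check that the average number of flips is about 7/4.

    import random

    def draw_symbol():
        # Sample from Pr(a, b, c, d) = (1/2, 1/4, 1/8, 1/8) by fair coin flips:
        # H -> a, TH -> b, TTH -> c, TTT -> d.  Returns (symbol, flips used).
        flips = 0
        for symbol in ('a', 'b', 'c'):
            flips += 1
            if random.random() < 0.5:       # "H": stop here
                return symbol, flips
        return 'd', flips                   # three "T"s in a row

    random.seed(0)
    samples = [draw_symbol() for _ in range(100000)]
    print(sum(f for _, f in samples) / len(samples))   # about 1.75 = H(X)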


Interpretation 3: coding

Example

Alphabet A = {a, b, c, d}:

    x:    a     b     c     d
    Pr:  1/4   1/4   1/4   1/4
    code: 00    01    10    11

    Code length = 2 = \log 4 = -\log(1/4)

Recall: p(X_1, \ldots, X_n) \approx 2^{-n H(p)}, and (X_1, \ldots, X_n) is as if \mathrm{Unif}[A] with |A| = 2^{n H(p)}.

How many bits are needed to code the elements in A? n H(p) bits.

This can be made more formal using the typical set A_{n,\epsilon}.

Interpretation 3: coding

    x:    a     b     c     d
    Pr:  1/2   1/4   1/8   1/8
    code: 0     10    110   111

    e.g., abacbd is encoded as 0 10 0 110 10 111

Prefix code: no codeword is a prefix of another, so the bit stream can be decoded unambiguously.

    E[l(X)] = \sum_x l(x) p(x) = 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{8} + 3 \cdot \frac{1}{8} = \frac{7}{4} \text{ bits}

The codewords correspond to the coin-flipping tree above, with H = 0 and T = 1.

The encoded output is like a sequence of coin flips: a completely random sequence, which cannot be further compressed.

This is the optimal code:

    l(x) = -\log p(x), \qquad E[l(X)] = H(p)
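A minimal sketch of this prefix code in Python (the codebook is the table above; the message is just an example):

    CODE = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
    PROBS = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}

    def encode(msg):
        return ''.join(CODE[x] for x in msg)

    def decode(bits):
        # Prefix property: read bits until they match a codeword, then emit the symbol.
        inv = {v: k for k, v in CODE.items()}
        out, buf = [], ''
        for b in bits:
            buf += b
            if buf in inv:
                out.append(inv[buf])
                buf = ''
        return ''.join(out)

    msg = 'abacbd'
    bits = encode(msg)
    print(bits, decode(bits) == msg)

    # Expected code length: sum_x p(x) l(x) = 7/4 bits = H(X)
    print(sum(PROBS[x] * len(CODE[x]) for x in PROBS))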

e.g., compare the two English words "I" and "probability": the more frequent word gets the shorter code.

Optimal code

    symbols:       a_1, a_2, \ldots, a_K
    probabilities: p_1, p_2, \ldots, p_K
    code lengths:  l_1, l_2, \ldots, l_K

Kraft inequality for prefix code

    \sum_k 2^{-l_k} \le 1

Minimize L = E[l(X)] = \sum_k p_k l_k subject to the Kraft inequality.

Optimal length: l_k^* = -\log p_k, giving L^* = H(p).
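A quick numerical check of the Kraft inequality and the optimal lengths for the running example (a sketch):

    import math

    probs   = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
    lengths = {'a': 1, 'b': 2, 'c': 3, 'd': 3}        # lengths of the codewords 0, 10, 110, 111

    print(sum(2 ** -l for l in lengths.values()))     # Kraft sum = 1.0 <= 1

    # Optimal lengths l_k* = -log2 p_k and the resulting L* = H(p)
    opt = {x: -math.log2(p) for x, p in probs.items()}
    print(opt)                                         # 1, 2, 3, 3: matches the code above
    print(sum(p * opt[x] for x, p in probs.items()))   # 1.75 = H(p)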

Wrong model

    symbols:    a_1, a_2, \ldots, a_K
    true (p):   p_1, p_2, \ldots, p_K
    wrong (q):  q_1, q_2, \ldots, q_K

    Optimal code:  L^* = E[l^*(X)] = -\sum_k p_k \log p_k

    Wrong code:    L = E[l(X)] = -\sum_k p_k \log q_k

    Redundancy:  L - L^* = \sum_k p_k \log \frac{p_k}{q_k}
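A numerical check of this identity, as a sketch (p is the running example; the wrong model q is taken, arbitrarily, to be uniform):

    import math

    p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}   # true distribution
    q = {'a': 1/4, 'b': 1/4, 'c': 1/4, 'd': 1/4}   # wrong model

    L_opt   = sum(p[x] * -math.log2(p[x]) for x in p)   # H(p) = 1.75 bits
    L_wrong = sum(p[x] * -math.log2(q[x]) for x in p)   # cross-entropy = 2.0 bits

    kl = sum(p[x] * math.log2(p[x] / q[x]) for x in p)
    print(L_wrong - L_opt, kl)                           # both equal 0.25 bits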

Box: All models are wrong, but some are useful

Relative entropy

For two distributions p = (p_1, \ldots, p_K) and q = (q_1, \ldots, q_K) on the same symbols a_1, \ldots, a_K:

    D(p \,\|\, q) = E_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_x p(x) \log \frac{p(x)}{q(x)}

Kullback-Leibler divergence


Relative entropy

Relative entropy is nonnegative:

    D(p \,\|\, q) = E_p\left[\log \frac{p(X)}{q(X)}\right] = -E_p\left[\log \frac{q(X)}{p(X)}\right] \ge -\log E_p\left[\frac{q(X)}{p(X)}\right] = 0,

since E_p[q(X)/p(X)] = \sum_x p(x) \frac{q(x)}{p(x)} = 1.

The inequality step is the Jensen inequality (\log is concave).

    D(p \,\|\, \mathrm{Unif}[A]) = E_p[\log p(X)] + \log |A| = \log |A| - H(p)

Types

    symbols:   a_1, a_2, \ldots, a_K
    prob (q):  q_1, q_2, \ldots, q_K
    freq:      n_1, n_2, \ldots, n_K

    X_1, X_2, \ldots, X_n \sim q(x) independently

    n_k = number of times X_i = a_k

    f_k = n_k / n, the normalized frequency; the vector f = (f_1, \ldots, f_K) is the type of the sequence

    (f_1, \ldots, f_K) \to (q_1, \ldots, q_K)


Law of large numbers

    f_k = \frac{n_k}{n} \to q_k

Refinement

What is the probability of observing a particular frequency vector f = (f_1, \ldots, f_K)?

    \Pr(f) = \Pr(n_1 = n f_1, \ldots, n_K = n f_K), \qquad p(x_1, \ldots, x_n) = \prod_i p(x_i)

    -\frac{1}{n} \log \Pr(f) \approx D(f \,\|\, q), \qquad \Pr(f) \approx 2^{-n D(f \| q)}

Large deviation
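A numerical illustration of this approximation, as a sketch (a binary alphabet, a fair-coin q, and an arbitrary type f are my choices here): compare the exact probability of observing the type with the exponent D(f || q).

    import math

    q = [0.5, 0.5]            # true distribution over a binary alphabet
    n = 1000
    counts = [700, 300]       # observed counts, i.e. the type f = (0.7, 0.3)
    f = [c / n for c in counts]

    # Exact log2-probability of observing this type under q (binomial count)
    log2_prob = math.log2(math.comb(n, counts[0])) + sum(c * math.log2(qk)
                                                         for c, qk in zip(counts, q))

    # Large-deviation exponent D(f || q)
    D = sum(fk * math.log2(fk / qk) for fk, qk in zip(f, q) if fk > 0)

    # Per-symbol exponent vs D(f || q); they agree up to a small polynomial correction
    print(-log2_prob / n, D)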

Kolmogorov complexity

Example: a string 011011011011…011

Program: for (i = 1 to n/3) write("011") end. This program can be translated into binary machine code.

Kolmogorov complexity = the length of the shortest machine code that reproduces the string; no probability distribution is involved.

If a long sequence is not compressible, then it has all the statistical properties of a sequence of coin flips:

    string = f(coin flips)

Joint and conditional entropy

Joint distribution table (rows a_1, \ldots, a_K; columns b_1, \ldots, b_J):

          b_1      b_2      \cdots   b_J
    a_1   p_{11}   p_{12}   \cdots   p_{1J}
    a_2   p_{21}   p_{22}   \cdots   p_{2J}
    \vdots
    a_K   p_{K1}   p_{K2}   \cdots   p_{KJ}

Joint distribution: p(x, y) = \Pr(X = x \,\&\, Y = y); \Pr(X = a_k \,\&\, Y = b_j) = p_{kj}.

Marginal distributions: p_X(x) = \Pr(X = x), p_Y(y) = \Pr(Y = y); \Pr(X = a_k) = p_{k\cdot} = \sum_j p_{kj}, \Pr(Y = b_j) = p_{\cdot j} = \sum_k p_{kj}.

e.g., eye color & hair color

Conditional distributions:

    \Pr(X = a_k \mid Y = b_j) = p_{kj} / p_{\cdot j}, \qquad \Pr(Y = b_j \mid X = a_k) = p_{kj} / p_{k\cdot}

    p_{X|Y}(x \mid y) = \Pr(X = x \mid Y = y), \qquad p_{Y|X}(y \mid x) = \Pr(Y = y \mid X = x)

    p(x, y) = p_X(x)\, p_{Y|X}(y \mid x) = p_Y(y)\, p_{X|Y}(x \mid y)

Joint and conditional entropy


Joint entropy:

    H(X, Y) = H(\{p_{kj}\}) = -\sum_{x,y} p(x, y) \log p(x, y) = E[-\log p(X, Y)]

Conditional entropy:

    H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x)

    H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x) = -\sum_x \sum_y p(x)\, p(y \mid x) \log p(y \mid x)

    H(Y \mid X) = \sum_{x,y} p(x, y) \big(-\log p(y \mid x)\big) = E[-\log p(Y \mid X)]

Since p(x, y) = p_X(x)\, p_{Y|X}(y \mid x),

    H(X, Y) = H(X) + H(Y \mid X)

Chain rule

    H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})

    I(X; Y) = D\big(p(x, y) \,\|\, p(x)\, p(y)\big) = E\left[\log \frac{p(X, Y)}{p(X)\, p(Y)}\right] = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}

Mutual information

    I(X; Y) = H(X) - H(X \mid Y)

    I(X; Y) = H(X) + H(Y) - H(X, Y)
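These identities can be checked directly on a small joint table; a sketch (the joint distribution is arbitrary):

    import math
    from collections import defaultdict

    # An arbitrary joint distribution p(x, y) on {0, 1} x {0, 1}
    p_xy = {(0, 0): 0.4, (0, 1): 0.1,
            (1, 0): 0.2, (1, 1): 0.3}

    def H(dist):
        # entropy in bits of a distribution given as {outcome: probability}
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    # Marginals p(x) and p(y)
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p

    # Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
    H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

    print(H(p_xy), H(p_x) + H_y_given_x)            # chain rule: H(X,Y) = H(X) + H(Y|X)

    # Mutual information, three equivalent ways
    I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
    print(I, H(p_x) + H(p_y) - H(p_xy), H(p_y) - H_y_given_x)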

Entropy rate

Stochastic process (X_1, X_2, \ldots, X_n, \ldots), not independent.

Markov chain:

    \Pr(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = \Pr(X_n = x_n \mid X_{n-1} = x_{n-1})

Entropy rate:

    H = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)

Stationary process: H = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1)

Stationary Markov chain: H = H(X_n \mid X_{n-1})
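For a stationary Markov chain, the entropy rate H(X_n | X_{n-1}) can be computed from the transition matrix and its stationary distribution; a sketch (the two-state transition matrix is an arbitrary choice):

    import math

    # Transition matrix P[i][j] = Pr(X_n = j | X_{n-1} = i) for a two-state chain
    P = [[0.9, 0.1],
         [0.4, 0.6]]

    # Stationary distribution pi (solves pi = pi P); explicit for two states
    a, b = P[0][1], P[1][0]
    pi = [b / (a + b), a / (a + b)]                     # [0.8, 0.2] here

    # Entropy rate: H = sum_i pi_i * H(row i) = H(X_n | X_{n-1})
    rate = sum(pi[i] * -sum(p * math.log2(p) for p in P[i] if p > 0) for i in range(2))
    print(pi, rate)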

Example (Shannon, 1948): successive approximations to English text (relevant to compression).

1. Zero-order approximation: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
2. First-order approximation: OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
3. Second-order approximation (digram structure as in English): ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
4. Third-order approximation (trigram structure as in English): IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
5. First-order word approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
6. Second-order word approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Summary

Entropy of a distribution measures randomness or uncertainty:
• log of the number of equally likely choices
• average number of coin flips
• average length of a prefix code
(Kolmogorov: length of the shortest machine code; randomness)

Relative entropy from one distribution to another measures the departure of the first from the second:
• coding redundancy
• large deviation

Conditional entropy, mutual information, entropy rate