Information and Coding Theory (CO349) – doc.ic.ac.ukherbert/teaching/ICA1S.pdf
Information and Coding Theory (CO349)
Coding and Decoding

Herbert [email protected]
Department of Computing, Imperial College London

Autumn 2017
Slide 1 of 40
Information and Codes
What will be covered in this course?
• Data and Information Representation
• Data and Information Transmission

We will look at these problems from a quantitative, statistical point of view as well as from an algebraic point of view.

In German, Dutch, Italian, etc.: CS is Informatik / Informatica, i.e. Information Science rather than Computer Science.
Slide 2 of 40
Practicalities
Two Lecturers
Herbert [email protected] 31
2 weeks until 30 October
Open-book coursework test 26 October (or 30?)

Mahdi [email protected] 31
2 weeks from 2 November
Open-book coursework test 23 November

Exam: Week 11, 11–15 December 2017, 2 hours (answer 3 out of 4 questions).
Different classes, different background, different applications.
Slide 3 of 40
Overview – Information (Theory)
Part I: Information Theory
• Simple Codes and Coding
• Revision: Probabilities
• Information Representation
• Information Transmission
• Information Manifestation

Part II: Coding Theory

• Revision: Linear Algebra
• Linear Codes
• Algebraic Error-Correcting Codes
• Information in Cryptography
Slide 4 of 40
Text Books
• Norman L. Biggs: Codes – An Introduction to Information, Communication and Cryptography, Springer, 2008.
• Ron Roth: Introduction to Coding Theory, Cambridge University Press, 2006.
• David Applebaum: Probability and Information, Cambridge University Press, 1996/2008.
• Robert McEliece: The Theory of Information and Coding, Cambridge University Press, 2002.
• T.M. Cover, J.A. Thomas: Elements of Information Theory, Wiley-Interscience, 2006.
• Robert B. Ash: Information Theory, Wiley, 1965; reprint: Dover, 1980.
Slide 5 of 40
Information and Security
Side-Channel Attacks (Kocher, 1996)
The problem appears in attacks against public-key encryption algorithms like RSA. In (optimised) versions of de/encryption (using modular exponentiation), properties of the secret key determine the execution time.

How much information about the secret key is revealed?

Differential Privacy (Dwork, 2006)
In large (statistical) databases an attacker can try to reveal information about individuals (de-anonymise them), e.g. there are only 3 under-25s with hair loss registered, and there are two people getting hair-loss treatment in SW7. Kim is in the database, is 21, and lives at 170 Queen's Gate.
How much information about individuals is revealed?
Slide 6 of 40
Information – A Measure of Surprise
The Entropy of a probability distribution p is:

H(p) = −∑_x p(x) log2(p(x)).

Example (Cover & Thomas)
Consider a Cheltenham horse race with 8 horses with winning chances (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64); then the entropy is 2.

Label the horses 0, 10, 110, 1110, 111100, 111101, 111110, 111111; then we need only 2 bits on average to report the winner.

For equally strong horses (1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8) we have an entropy of 3, and the minimal message length is also 3 bits.
Slide 7 of 40
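The horse-race numbers can be checked directly. A minimal sketch of the entropy formula (the function name `entropy` is ours, not from the slides):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) * log2(p(x)), in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# Winning chances from the Cheltenham example
race = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(race))       # 2.0
print(entropy([1/8] * 8))  # 3.0
```

The `if px > 0` guard follows the usual convention 0 · log2(0) = 0, so distributions with impossible outcomes are handled too.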
Surprise vs Prejudice
A father and son are on a fishing trip in the mountains of Wales. On the way back home their car has a serious accident.

The father is killed instantly and declared dead at the site of the accident. However, the son is severely injured and taken by ambulance to the nearest hospital.

When the son is brought into the operating theatre the surgeon exclaims "I can't do this, he is my son."
Scientific American, 1980s
Slide 8 of 40
Everything You Always Wanted to Know About Logarithms But Were Afraid to Ask

Defined (implicitly) via the equation:

x = b^(log_b(x))

Common logarithms: b = 2, b = 10, and b = e = 2.71828…

log_b(xy) = log_b(x) + log_b(y)
log_b(x^n) = n · log_b(x), …, log_b(x) = log_a(x) / log_a(b)

Example
log_2(64) = 6 as 2^6 = 64; log_2(1/64) = −6 as 1/64 = 2^(−6) (2^(−1) = 1/2).
Slide 9 of 40
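The change-of-base rule is what lets a calculator (or library) without a log_2 key compute binary logarithms. A small sketch (the function name `log_b` is ours):

```python
import math

def log_b(x, b):
    """Change of base: log_b(x) = log_a(x) / log_a(b), here with a = e."""
    return math.log(x) / math.log(b)

print(round(log_b(64, 2)))    # 6, since 2**6 == 64
print(round(log_b(1/64, 2)))  # -6, since 1/64 == 2**-6
# Product rule: log_b(x*y) = log_b(x) + log_b(y)
print(math.isclose(log_b(8 * 4, 2), log_b(8, 2) + log_b(4, 2)))  # True
```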
Messages and Strings
Definition
An alphabet is a finite set S; its members s ∈ S are referred to as symbols.

Definition
A message m in the alphabet S is a finite sequence of elements of S:

m = s_1 s_2 · · · s_n with s_i ∈ S, 1 ≤ i ≤ n.

A message is often also referred to as a string or word in S. The number n ∈ N is the length of m.

Notation for length: |m| = n; prefix: m|_k = s_1 s_2 · · · s_k (with k ≤ n).
Slide 10 of 40
Notation
Consider also the string of length zero, often denoted ε, with |ε| = 0.

By S^0 we denote the set of words containing only ε; S^n denotes the set of all strings of length n; and

S* = S^0 ∪ S^1 ∪ S^2 ∪ · · ·

Example
With the alphabet S = B = {0, 1} we can produce all binary messages of a certain length n. Concretely, for example:

B^3 = {000, 001, 010, 011, 100, 101, 110, 111}
Slide 11 of 40
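The sets S^n are easy to enumerate mechanically. A sketch using the standard library (the function name `words` is ours):

```python
from itertools import product

def words(alphabet, n):
    """S^n: all strings of length n over the given alphabet."""
    return [''.join(t) for t in product(alphabet, repeat=n)]

B3 = words('01', 3)
print(B3)       # ['000', '001', '010', '011', '100', '101', '110', '111']
print(len(B3))  # 8, i.e. 2**3
```

In general |S^n| = |S|^n, which the `len` check confirms for B^3.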
Coding
Definition
Let S and T be two alphabets. A code c is an injective function

c : S → T*

For each symbol s ∈ S the string c(s) ∈ T* is called the codeword for s. The set of all codewords

C = {c(s) | s ∈ S}

is also referred to as the code.

For |T| = 2, e.g. T = B, we have a binary code, for |T| = 3 a ternary code, and for |T| = b a b-ary code.
Slide 12 of 40
Example: Morse Code

Example
Developed from 1836 by Samuel F. B. Morse (1791–1872). This is a code c : {A, …, Z} → {•, –}* (with variations):

A • –       H • • • •   O – – –     V • • • –
B – • • •   I • •       P • – – •   W • – –
C – • – •   J • – – –   Q – – • –   X – • • –
D – • •     K – • –     R • – •     Y – • – –
E •         L • – • •   S • • •     Z – – • •
F • • – •   M – –       T –
G – – •     N – •       U • • –

Most famous (after April 1912) perhaps:

SOS ↦ • • • – – – • • •
Slide 13 of 40
Decoding
Definition
A code c : S → T* is extended to S* as follows: Given a string s_1 s_2 · · · s_n in S* we define

c(s_1 s_2 · · · s_n) = c(s_1) c(s_2) · · · c(s_n)

Note: This way we overload the function c (polymorphism).

Definition
A code c : S → T* is uniquely decodable (short: UD) if the extended code function c : S* → T* is injective.

This means that every string in T* corresponds to at most one message in S*.
Slide 14 of 40
Example
Example
Let S = {x, y, z} and take a code c : S → B* defined as

x ↦ 0    y ↦ 10    z ↦ 11

Take the original stream yzxxzyxzyy · · · in S* and encode it.

Is it always possible to decode the encoded stream in T* = B*?

Example
Same code as above. Consider the encoded stream

1000111000 · · ·

What is the original stream? Is it uniquely determined?
Slide 15 of 40
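The symbol-wise extension of c makes encoding a simple concatenation. A sketch using the example code above (the dictionary layout and function name are ours):

```python
# The example code: x -> 0, y -> 10, z -> 11
code = {'x': '0', 'y': '10', 'z': '11'}

def encode(message, code):
    """Extended code: c(s1 s2 ... sn) = c(s1) c(s2) ... c(sn)."""
    return ''.join(code[s] for s in message)

print(encode('yzxxzyxzyy', code))  # 10110011100111010
```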
The Decoding Problem

Given a sequence in T* can we (sequentially) decode it?

Example
Same code, with C = {0, 10, 11}. Consider 100111000 · · ·
Take the prefix of length 1, the first 1: it is not in the codebook C.
Take the prefix of length 2, i.e. 10: this matches y.
Consider the remainder stream 0111000 · · ·
Take the prefix of length 1, i.e. 0: this matches x.
Consider the remainder stream 111000 · · ·
Take the prefix of length 1, i.e. 1: it is not in the codebook C.
Take the prefix of length 2, i.e. 11: this matches z.

Example
Without the 'pause' in Morse code we could not decode it: IEEE (• •  •  •  •) becomes an undelimited • • • • •, which could equally be H E or E I E E.
Slide 16 of 40
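The sequential procedure above (grow a prefix until it matches a codeword, emit the symbol, reset) can be sketched as follows; it relies on the code being prefix-free, and the names are ours:

```python
code = {'x': '0', 'y': '10', 'z': '11'}
codebook = {cw: s for s, cw in code.items()}  # codeword -> symbol

def decode(stream, codebook):
    """Sequential decoding for a prefix-free code."""
    out, buf = [], ''
    for t in stream:
        buf += t                       # extend the current prefix
        if buf in codebook:
            out.append(codebook[buf])  # codeword matched: emit and reset
            buf = ''
    if buf:
        raise ValueError('stream ended inside a codeword')
    return ''.join(out)

print(decode('100111000', codebook))  # yxzyxx
```

Because the code is PF, the first codeword that matches is the only one that can: no longer codeword extends it.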
Prefix-Free

Definition
A code c : S → T* is prefix-free (short: PF) if there is no pair of codewords q = c(s) and q′ = c(s′) such that

q′ = qr for some non-empty word r ∈ T*.

Example
Let S = {w, x, y, z} with c : S → B* defined by

w ↦ 10    x ↦ 01    y ↦ 11    z ↦ 011

This code is not PF. It is not UD either, because:

wxwy ↦ 10011011 as well as wzz ↦ 10011011
Slide 17 of 40
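Whether a finite code is prefix-free can be checked by comparing all pairs of codewords. A quadratic sketch (the function name is ours):

```python
def is_prefix_free(codewords):
    """True iff no codeword is a proper prefix of another codeword."""
    for q in codewords:
        for q_prime in codewords:
            if q != q_prime and q_prime.startswith(q):
                return False  # q' = q r with r non-empty
    return True

print(is_prefix_free({'0', '10', '11'}))          # True
print(is_prefix_free({'10', '01', '11', '011'}))  # False: 011 = 01 + 1
```

For large codes a trie would do this in time linear in the total codeword length, but the pairwise check matches the definition most directly.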
PF implies UD
Theorem
If a code c : S → T* is prefix-free then it is uniquely decodable.

Proof (Indirect).
Take x = x_1 x_2 · · · x_m and y = y_1 y_2 · · · y_n such that c(x) = c(y), with c(x) = c(x_1) c(x_2) · · · c(x_m) and c(y) = c(y_1) c(y_2) · · · c(y_n).

The prefixes (of length k) must be the same, c(x)|_k = c(y)|_k, for any k. It might be that |c(x_1)| ≠ |c(y_1)|.

Assume w.l.o.g. |c(x_1)| ≤ |c(y_1)|, and take k = |c(x_1)|; then c(x)|_k = c(x_1) = c(y_1)|_k and thus c(y_1) = c(x_1) r for some r ∈ T*. If r is non-empty this contradicts PF; otherwise c(x_1) = c(y_1) and we repeat the argument on the remaining parts of c(x) = c(y). Formally: induction on the lengths m and n.
Slide 18 of 40
Strings as Paths in Trees
If we concentrate on binary codes with T = B then we can see any possible codeword in C as a (finite) path in a binary tree:

[Figure: a binary tree; from each node the left edge is labelled 0 and the right edge 1, so every path from the root spells a binary string.]
In general, we have to consider a b-ary tree for |T | = b.
Slide 19 of 40
Codes as (Sub)-Trees

The codewords in C can be represented by the end-points of the corresponding paths. If a code C is PF then no descendant of a codeword's node can itself represent a codeword in C.

Example
The code C = {0, 10, 110, 111} is PF; its corresponding tree is

[Figure: the code tree with leaves labelled 0, 10, 110 and 111.]
Slide 20 of 40
Parameters of a Code
Definition
Given a code c : S → T*, let n_i denote the number of symbols s ∈ S which are encoded by strings of length i, i.e.

n_i = |{s ∈ S : |c(s)| = i}| = |{s ∈ S : c(s) ∈ T^i}|

We refer to n_1, n_2, · · · , n_M as the parameters of c, where M is the maximal length of a codeword.

The number of all potential codewords of length i is given by b_i = |T^i| = b^i; in particular, b = |T| = b_1.

Note that n_i/b_i gives the fraction of possible codewords of length i which is actually used by the code ("filling rate").
Slide 21 of 40
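The parameters n_1, …, n_M are just a histogram of codeword lengths. A sketch (the function name is ours):

```python
from collections import Counter

def parameters(codewords):
    """Map each length i to n_i, the number of codewords of that length."""
    return dict(sorted(Counter(len(q) for q in codewords).items()))

print(parameters({'0', '10', '110', '111'}))  # {1: 1, 2: 1, 3: 2}
```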
Kraft-McMillan Number

Definition
Given a code c : S → T* with parameters n_1, n_2, · · · , n_M, the Kraft-McMillan number of c is defined as:

K = ∑_{i=1}^{M} n_i/b^i = n_1/b + n_2/b^2 + · · · + n_M/b^M

Example
Consider a binary code, i.e. b = b_1 = 2, with parameters n_1 = 1, n_2 = 1, n_3 = 2 (as in the last example); then b_i = 2^i, i.e. b_1 = 2, b_2 = 4, b_3 = 8, and

K = 1/2 + 1/4 + 2/8 = 8/8 = 1
Slide 22 of 40
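Computing K with exact rationals avoids floating-point noise in the fractions. A sketch (the names are ours):

```python
from fractions import Fraction

def kraft_number(params, b=2):
    """K = sum over lengths i of n_i / b**i, for params {i: n_i}."""
    return sum(Fraction(n, b**i) for i, n in params.items())

print(kraft_number({1: 1, 2: 1, 3: 2}))  # 1  (= 1/2 + 1/4 + 2/8)
print(kraft_number({2: 2, 3: 3, 4: 1}))  # 15/16
```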
Existence of PF Codes
Theorem (L.G. Kraft, 1949)
Given an alphabet T with |T| = b and (code) parameters n_1, n_2, · · · , n_M ∈ N. If K ≤ 1 then there exists a PF b-ary code c : S → T* with these parameters.

Example
Construct a PF binary code with n_2 = 2, n_3 = 3, n_4 = 1.
First check that K = 8/16 + 6/16 + 1/16 = 15/16 ≤ 1.

• Codewords of length 2: take any two, e.g. 00 and 01.
• Codewords of length 3: we can't take 00· · · or 01· · ·, but we can take any of the form 1 s_2 s_3 with s_2, s_3 ∈ B; in fact we need only three of 100, 101, 110, 111, say the first three, leaving 111 unused.
• Codeword(s) of length 4: take either 1110 or 1111.
Slide 23 of 40
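The constructive step in this example can be automated: process lengths in increasing order, always extending the prefixes that are still free. A sketch of this greedy construction (my own illustration, assuming $K \le 1$ has already been checked so the construction cannot run out of prefixes):

```python
def build_pf_code(params, alphabet):
    """Greedily build a prefix-free code with n_i codewords of length i.
    params maps length i -> n_i; assumes K <= 1, so enough prefixes remain."""
    codewords, prefixes = [], [""]  # free prefixes of the previous length
    for length in range(1, max(params) + 1):
        # extend every free prefix by one alphabet symbol
        candidates = [p + s for p in prefixes for s in alphabet]
        need = params.get(length, 0)
        codewords += candidates[:need]  # use the first n_i as codewords
        prefixes = candidates[need:]    # the rest remain free as prefixes
    return codewords

print(build_pf_code({2: 2, 3: 3, 4: 1}, "01"))
# ['00', '01', '100', '101', '110', '1110']
```

The output reproduces exactly the choices made in the example above: 00, 01 of length 2, then 100, 101, 110 of length 3, then 1110 of length 4.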
Proof of Kraft’s Theorem

Proof.
Any partial sum of the sum defining $K$ also fulfills $\sum_i \frac{n_i}{b^i} \le 1$, e.g.

$\frac{n_1}{b^1} \le 1$, $\quad \frac{n_1}{b^1} + \frac{n_2}{b^2} \le 1$, $\quad \frac{n_1}{b^1} + \frac{n_2}{b^2} + \frac{n_3}{b^3} \le 1$, etc.

This implies:

$n_1 \le b$, $\quad n_2 \le b(b^1 - n_1)$, $\quad n_3 \le b(b^2 - n_1 b^1 - n_2)$, etc.

or in general:

$n_i \le b\left(b^{i-1} - n_1 b^{i-2} - n_2 b^{i-3} - \cdots - n_{i-1}\right)$.

• Codeword length 1: since $n_1 \le b^1$ we can choose any $n_1$ of the $b$ words of length 1; $b - n_1$ words of length 1 remain as prefixes for words of length $2, 3, \ldots$
Slide 24 of 40
Proof of Kraft’s Theorem (cont.)

Proof (cont.)

• Codeword length 2: there are $b^2$ words of length 2, but $b^1 n_1$ of them are unavailable (they start with a used length-1 codeword), so $b^2 - b^1 n_1$ are left. Since $n_2 \le b(b - n_1)$ we can make our choices; thereafter $b^2 - n_1 b - n_2$ prefixes of length 2 are left.

• Codewords of length $i$: there are always $b^{i-1} - n_1 b^{i-2} - n_2 b^{i-3} - \cdots - n_{i-1}$ prefixes of length $i-1$ left, and from $K \le 1$ we have

$n_i \le b\left(b^{i-1} - n_1 b^{i-2} - n_2 b^{i-3} - \cdots - n_{i-1}\right)$

so we can choose $n_i$ codewords of length $i$ and enough prefixes remain.

Formally, the argument proceeds by induction on the maximal length of codewords.
Slide 25 of 40
Encoding Strings

Consider a code $c : S \to T^*$ with parameters $n_1, n_2, \ldots, n_M$. We aim to encode strings of length $r$. Let $q_r(i)$ be the number of strings of length $r$ in $S^*$ which are encoded as a string of length $i$ in $T^*$.

Example
Take $c : S \to T^*$ with parameters $n_1 = 2$, $n_2 = 3$; what is $q_2(i)$?
Given $x = x_1 x_2 \in S^2$, the codeword $c(x) \in T^*$ has length 2, 3 or 4.

• $|c(x_1 x_2)| = 2$: then $|c(x_1)| = 1 = |c(x_2)|$; since $n_1 = 2$ we have $q_2(2) = 2 \times 2 = 4$.
• $|c(x_1 x_2)| = 3$: then $|c(x_1)| = 1$ and $|c(x_2)| = 2$ or vice versa, so we get $q_2(3) = 2 \times 3 + 3 \times 2 = 12$.
• $|c(x_1 x_2)| = 4$: then $|c(x_1)| = 2 = |c(x_2)|$, thus $q_2(4) = 3 \times 3 = 9$.
Slide 26 of 40
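These counts can be verified by brute force over a concrete code with these parameters. A sketch (my own example: the slides only fix $n_1 = 2$, $n_2 = 3$, so I pick a ternary PF code a→0, b→1, c→20, d→21, e→22 with exactly those parameters):

```python
from itertools import product
from collections import Counter

# a concrete code with parameters n_1 = 2, n_2 = 3 (hypothetical, ternary alphabet)
c = {"a": "0", "b": "1", "c": "20", "d": "21", "e": "22"}

def q(r):
    """q_r(i): how many source strings of length r encode to a target string of length i."""
    lengths = (sum(len(c[s]) for s in word) for word in product(c, repeat=r))
    return Counter(lengths)

print(q(2))  # counts: q_2(2) = 4, q_2(3) = 12, q_2(4) = 9
```

All $5^2 = 25$ source strings of length 2 are accounted for: $4 + 12 + 9 = 25$.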
Generating Functions

Generating functions are an important tool for investigating sequences, e.g. enumerations of combinatorial objects.

Definition
For a sequence of numbers $q(1), q(2), q(3), \ldots$ the generating function $Q(x)$ is given as the polynomial (formal power series) in the unknown variable $x$:

$Q(x) = q(1)x + q(2)x^2 + q(3)x^3 + \cdots$

For a code $c : S \to T^*$ we consider, for $1 \le i \le rM$:

$q_r(i) = |\{s \in S^r \mid |c(s)| = i\}|$

and the generating function:

$Q_r(x) = q_r(1)x + q_r(2)x^2 + q_r(3)x^3 + \cdots + q_r(rM)x^{rM}$
Slide 27 of 40
Counting Principle (CP)

Theorem
Given a UD code $c : S \to T^*$ with $|c(s)| \le M$ for all $s \in S$ and generating functions $Q_r(x)$; then for all $r \ge 1$:

$Q_r(x) = Q_1(x)^r$.

Example
Take $c : S \to T^*$ with parameters $n_1 = 2$, $n_2 = 3$ as above. What is the relationship between $Q_1$ and $Q_2$?
Clearly $q_1(1) = 2 = n_1$ and $q_1(2) = 3 = n_2$, and thus

$Q_1(x) = 2x + 3x^2 \quad\text{and}\quad Q_2(x) = 4x^2 + 12x^3 + 9x^4$

and thus indeed $(Q_1(x))^2 = Q_1(x) Q_1(x) = Q_2(x)$.
Slide 28 of 40
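The identity $Q_r(x) = Q_1(x)^r$ can be checked numerically: multiplying polynomials is convolving their coefficient lists. A small sketch (my own illustration):

```python
def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists (index = power of x)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

# Q1(x) = 2x + 3x^2 as the coefficient list [q(0), q(1), q(2)]
Q1 = [0, 2, 3]
Q2 = poly_mul(Q1, Q1)
print(Q2)  # [0, 0, 4, 12, 9], i.e. Q2(x) = 4x^2 + 12x^3 + 9x^4
```

The result matches the $q_2(i)$ counts obtained by direct enumeration on the previous slide.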
Proof of the Counting Principle

Proof*.
Consider $Q_1(x)^r = (n_1 x + n_2 x^2 + n_3 x^3 + \cdots + n_M x^M)^r$ with $n_j = q_1(j)$. What is the coefficient $n$ of $x^k$ for $1 \le k \le rM$? To obtain any term of the form $n x^k$ we multiply $r$ terms from $Q_1(x)$, i.e. $n x^k = n_{i_1} x^{i_1} \times n_{i_2} x^{i_2} \times \cdots \times n_{i_r} x^{i_r}$ with $i_1 + i_2 + \cdots + i_r = k$, and then $n = n_{i_1} n_{i_2} \cdots n_{i_r}$. The coefficient of $x^k$ is therefore

$\sum_{i_1 + i_2 + \cdots + i_r = k} n_{i_1} n_{i_2} \cdots n_{i_r}$

Now consider $Q_r(x)$, or rather $q_r(k)$, i.e. the number of words of length $k$ encoded from strings of length $r$: each such codeword is formed as $c(x_1) c(x_2) \cdots c(x_r)$; we can choose any $x_j$ with $|c(x_j)| = i_j$ as long as $i_1 + i_2 + \cdots + i_r = k$, in which case there are $n_{i_1} n_{i_2} \cdots n_{i_r}$ possible (input) strings. In total we again have to consider all decompositions $i_1 + i_2 + \cdots + i_r = k$, so $q_r(k)$ equals the coefficient of $x^k$ in $Q_1(x)^r$.
Slide 29 of 40
Kraft-McMillan Number for UD Codes

Theorem (B. McMillan, 1956)
If there exists a uniquely decodable code $c : S \to T^*$ with parameters $n_1, n_2, \ldots, n_M \in \mathbb{N}$ then $K \le 1$.

$\exists\,\text{UD} \implies K \le 1 \implies \exists\,\text{PF}$, and $\text{PF} \implies \text{UD}$

Corollary
The following statements are equivalent for codes $c : S \to T^*$:
• There exists a UD code with parameters $n_1, n_2, \ldots, n_M$.
• It holds that $K \le 1$ for parameters $n_1, n_2, \ldots, n_M$.
• There exists a PF code with parameters $n_1, n_2, \ldots, n_M$.
Slide 30 of 40
Proof of McMillan’s Theorem

Proof (indirect)*.
Fix $r \ge 1$ and take $x = \frac{1}{b}$ in $Q_r(x)$; we get (with $M = \max_s |c(s)|$):

$Q_r\left(\tfrac{1}{b}\right) = \frac{q_r(1)}{b} + \frac{q_r(2)}{b^2} + \frac{q_r(3)}{b^3} + \cdots + \frac{q_r(rM)}{b^{rM}}$.

UD implies $q_r(j) \le b^j$ and thus $\frac{q_r(j)}{b^j} \le 1$ for all $1 \le j \le rM$, and therefore $Q_r\left(\tfrac{1}{b}\right) \le rM$. From $K = Q_1\left(\tfrac{1}{b}\right)$ and by the CP we have:

$K^r = \left(Q_1\left(\tfrac{1}{b}\right)\right)^r = Q_r\left(\tfrac{1}{b}\right) \le rM \quad\text{thus}\quad \frac{K^r}{r} \le M \;\;\forall r$.

Suppose $K > 1$, i.e. $K = 1 + k$ with $k > 0$; then $K^r = (1+k)^r = 1 + rk + \frac{1}{2}r(r-1)k^2 + \cdots$, hence $K^r > \frac{1}{2}r(r-1)k^2$, i.e. $\frac{K^r}{r} > \frac{1}{2}(r-1)k^2$, which for large $r$ contradicts $\frac{K^r}{r} \le M$.
Slide 31 of 40
Some Probability Theory

Probability theory is concerned with quantifying or measuring the chances that certain events happen.

We consider an event space, i.e. a finite set $\Omega$, together with a set $\mathcal{B} \subseteq \mathcal{P}(\Omega)$ of measurable sets in $\Omega$ which form a Boolean algebra (closed under union, intersection and complement). For finite sets one can use the power-set $\mathcal{B} = \mathcal{P}(\Omega)$.

Probabilities are then assigned via a measure, i.e. a function $\Pr : \mathcal{B} \to \mathbb{R}$ (also written $m : \mathcal{B} \to \mathbb{R}$ or $\mu : \mathcal{B} \to \mathbb{R}$).

Note: For (uncountably) infinite $\Omega$ one needs to develop a more general measure theory (e.g. with $\mathcal{B}$ a $\sigma$-algebra) to overcome problems like $\mu([0,1]) = 1$ but $\mu(\{x\}) = 0$ for all $x \in [0,1]$.
Slide 32 of 40
Finite Probability Spaces

Consider a finite measurable space $(\Omega, \mathcal{B})$ with $|\Omega| = n$.

Definition
A probability (measure) $\Pr : \mathcal{B} \to \mathbb{R}$ on $(\Omega, \mathcal{B})$ has to fulfill:
• $\Pr(\Omega) = 1$.
• $0 \le \Pr(A) \le 1$ for all $A \in \mathcal{B}$.
• $\Pr(A \cup B) = \Pr(A) + \Pr(B)$ for $A \cap B = \emptyset$.

Some further rules (which follow from the axioms above):
• $\Pr(\emptyset) = 0$,
• $\Pr(\bar{A}) = 1 - \Pr(A)$,
• $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$,
• etc.
Slide 33 of 40
Random Distributions

For finite probability spaces $(\Omega, \mathcal{B}, \Pr)$ with $|\Omega| = n$, we can define a probability (measure) via the atoms $\omega \in \Omega$.

Definition
A probability distribution is a function $p : \Omega \to [0,1]$ with

$\sum_{\omega \in \Omega} p(\omega) = 1$.

If we enumerate the elements of $\Omega$ in some arbitrary way as $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$ then we can also represent $p$ by a (row) vector in $\mathbb{R}^n$:

$p = (p(\omega_1), p(\omega_2), \ldots, p(\omega_n))$
Slide 34 of 40
Random Variables

Probability distributions on a finite $\Omega$ define a probability (measure) in the obvious way:

$\Pr(A) = \sum_{\omega \in A} p(\omega)$.

Definition
A random variable is a function $X : \Omega \to \mathbb{R}$.

We can represent random variables as (column) vectors in $\mathbb{R}^n$.

Example
Consider a die. The event space describes the top face of the die, i.e. $\Omega = \{⚀, ⚁, ⚂, ⚃, ⚄, ⚅\}$; define $X$ as the random variable counting the number of pips, e.g. $X(⚄) = 5$ etc.
Slide 35 of 40
Moments in Probability

Expectation Value. For a random variable $X$ and probability distribution $p$ we define:

$E(X) = \sum_{\omega \in \Omega} p(\omega) X(\omega) = \sum_i p_i X_i = \mu_X$

One can show: $E(X + Y) = E(X) + E(Y)$ and $E(\alpha X) = \alpha E(X)$.

Example
Consider a (fair) die and the previous random variable $X$; then

$E(X) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{21}{6}$

Variance. For a random variable $X$ and distribution $p$ we define:

$\mathrm{Var}(X) = E\left((X - E(X))^2\right) = E(X^2) - (E(X))^2$.

The standard deviation is $\sigma_X = \sqrt{\mathrm{Var}(X)}$.

Slide 36 of 40
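For a finite distribution both moments are one-line sums. A sketch for the fair die (my own illustration, using exact fractions so the results come out as $\frac{21}{6} = \frac{7}{2}$ and $\frac{35}{12}$ rather than floats):

```python
from fractions import Fraction

def E(p, X):
    """Expectation of random variable X (list of values) under distribution p."""
    return sum(pi * xi for pi, xi in zip(p, X))

def Var(p, X):
    """Variance via E(X^2) - (E(X))^2."""
    return E(p, [x * x for x in X]) - E(p, X) ** 2

p = [Fraction(1, 6)] * 6   # fair die
X = [1, 2, 3, 4, 5, 6]     # number of pips on the top face
print(E(p, X))             # 7/2, i.e. 3.5
print(Var(p, X))           # 35/12
```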
Bayes Theorem and Independence

Given two subsets $A$ and $B$ in a probability space $(\Omega, \mathcal{B}, \Pr)$, the conditional probability of $A$ given that $B$ has happened is

$\Pr_B(A) = \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}$

One can show Bayes Theorem, which states:

$\Pr_A(B) = \frac{\Pr_B(A)\Pr(B)}{\Pr(A)} = \frac{\Pr(A \mid B)\Pr(B)}{\Pr(A)} = \Pr(B \mid A)$.

Given two subsets $A$ and $B$ in a probability space $(\Omega, \mathcal{B}, \Pr)$, $A$ and $B$ are (probabilistically) independent if

$\Pr(B) = \Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$, or equivalently $\Pr(A \cap B) = \Pr(A)\Pr(B)$.
Slide 37 of 40
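These identities can be checked mechanically on a small finite space. A sketch using a fair die with the (hypothetical) events A = "even" and B = "at most 4", my own example:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def Pr(A):
    """Uniform measure on the die."""
    return Fraction(len(A), len(omega))

A = {2, 4, 6}     # "even"
B = {1, 2, 3, 4}  # "at most 4"

cond = lambda X, Y: Pr(X & Y) / Pr(Y)  # Pr(X | Y)

# Bayes: Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)
assert cond(B, A) == cond(A, B) * Pr(B) / Pr(A)

# here A and B happen to be independent: Pr(A and B) = Pr(A) Pr(B)
print(Pr(A & B), Pr(A) * Pr(B))  # 1/3 1/3
```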
Products and Probability

Given two probability spaces $(\Omega_1, \Pr_1)$ and $(\Omega_2, \Pr_2)$; to keep things simple use $\mathcal{B}_i = \mathcal{P}(\Omega_i)$. We can define a probability $\Pr$ on the cartesian product $\Omega = \Omega_1 \times \Omega_2$ via:

$\Pr(\langle \omega_1, \omega_2 \rangle) = \Pr_1(\omega_1)\Pr_2(\omega_2)$

If $\Pr_1$ and $\Pr_2$ correspond to probability distributions $p_1$ and $p_2$, and $p$ to $\Pr$, then $p = p_1 \otimes p_2$, i.e. the tensor product.

Caveat: Not all distributions on $\Omega_1 \times \Omega_2$ are a product.

Example
Consider $\Omega_1 = \{0, 1\}$ and $\Omega_2 = \{z, o\}$. The probability with $\Pr(\langle 0, z \rangle) = \Pr(\langle 1, o \rangle) = \frac{1}{2}$ and $\Pr(\langle 0, o \rangle) = \Pr(\langle 1, z \rangle) = 0$ cannot be represented as a product $p_1 \otimes p_2$.
However, one can show: $p = \frac{1}{2}(1,0) \otimes (1,0) + \frac{1}{2}(0,1) \otimes (0,1)$.
Slide 38 of 40
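Both claims of the example can be verified directly. A sketch (my own illustration; for $2 \times 2$ distributions, a product $p_1 \otimes p_2$ always satisfies $p_{00}\,p_{11} = p_{01}\,p_{10}$, since both sides equal $ab(1-a)(1-b)$):

```python
def tensor(p, q):
    """Tensor product of two distribution vectors, flattened row-major."""
    return [pi * qj for pi in p for qj in q]

# correlated distribution on {0,1} x {z,o}, ordered (0,z), (0,o), (1,z), (1,o)
p = [0.5, 0.0, 0.0, 0.5]

# product criterion fails: p[0]*p[3] = 0.25 but p[1]*p[2] = 0, so p is no product
assert p[0] * p[3] != p[1] * p[2]

# but p is the mixture (1/2)(1,0)x(1,0) + (1/2)(0,1)x(0,1):
mix = [0.5 * a + 0.5 * b
       for a, b in zip(tensor([1, 0], [1, 0]), tensor([0, 1], [0, 1]))]
assert mix == p
```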
Tensor/Kronecker Product

Given an $n \times m$ matrix $A$ and a $k \times l$ matrix $B$:

$A = \begin{pmatrix} a_{11} & \ldots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \ldots & a_{nm} \end{pmatrix} \qquad B = \begin{pmatrix} b_{11} & \ldots & b_{1l} \\ \vdots & \ddots & \vdots \\ b_{k1} & \ldots & b_{kl} \end{pmatrix}$

The tensor or Kronecker product $A \otimes B$ is an $nk \times ml$ matrix:

$A \otimes B = \begin{pmatrix} a_{11}B & \ldots & a_{1m}B \\ \vdots & \ddots & \vdots \\ a_{n1}B & \ldots & a_{nm}B \end{pmatrix}$

Special cases are square matrices ($n = m$ and $k = l$) and vectors (row: $n = k = 1$; column: $m = l = 1$).
Slide 39 of 40
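The block structure translates directly into code. A sketch with matrices as nested lists (my own illustration; numpy users would reach for `numpy.kron` instead):

```python
def kron(A, B):
    """Kronecker product of matrices given as nested lists:
    an (nk) x (ml) matrix built from the blocks a_ij * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
for row in kron(A, B):
    print(row)
# [0, 1, 0, 2]
# [1, 0, 2, 0]
# [0, 3, 0, 4]
# [3, 0, 4, 0]
```

The row-vector case ($n = k = 1$) recovers the tensor product of distributions from the previous slide.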
Correlation

The covariance of two random variables $X$ and $Y$ is:

$\mathrm{Cov}(X, Y) = E\left((X - E(X))(Y - E(Y))\right) = E(XY) - E(X)E(Y)$

The correlation coefficient is $\rho(X, Y) = \mathrm{Cov}(X, Y)/(\sigma_X \sigma_Y)$.

For independent random variables $X$ and $Y$, i.e. if we have $\Pr((X = x_j) \cap (Y = y_k)) = \Pr(X = x_j)\Pr(Y = y_k)$:

$E(XY) = E(X)E(Y) \quad\text{and}\quad \mathrm{Cov}(X, Y) = \rho(X, Y) = 0$.

Note that $\rho(X, Y) = 0$ does not imply independence.

Example
$\Pr(X = x) = \frac{1}{3}$ for $x = -1, 0, 1$ and $Y = X^2$; then $\rho(X, Y) = 0$.
Slide 40 of 40
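The example computes in a few lines; since $Y = X^2$ is a function of $X$, the two are clearly dependent, yet the covariance vanishes. A sketch (my own illustration):

```python
from fractions import Fraction

xs = [-1, 0, 1]
p = [Fraction(1, 3)] * 3

# expectation of f(X) under the uniform distribution p
E = lambda f: sum(pi * f(x) for pi, x in zip(p, xs))

EX, EY = E(lambda x: x), E(lambda x: x * x)   # Y = X^2
cov = E(lambda x: x * (x * x)) - EX * EY      # Cov(X, Y) = E(XY) - E(X)E(Y)
print(cov)  # 0
```

Here $E(XY) = E(X^3) = 0$ and $E(X) = 0$, so the covariance, and with it $\rho(X, Y)$, is zero despite the dependence.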