Entropy & Information Jilles Vreeken
29 May 2015
Question of the day
What is
information?
(and what do talking drums have to do with it?)
Bits and Pieces
What are: information, a bit, entropy, mutual information, divergence, information theory, …
Information Theory
Field founded by Claude Shannon in 1948 with “A Mathematical Theory of Communication”
a branch of statistics that is essentially about
uncertainty in communication
not what you say, but what you could say
The Big Insight
Communication is a series of discrete messages
each message reduces the recipient's uncertainty about a) the series and b) that message
the amount by which it does so is the amount of information
Uncertainty
Shannon showed that uncertainty can be quantified, linking physical entropy to messages
and defined the entropy of
a discrete random variable X as

H(X) = − Σ_i p(x_i) log p(x_i)
Optimal prefix-codes
A key result of Shannon entropy is that

−log2 p(x_i)

gives the length in bits of the optimal prefix code
for a message x_i
Codes and Lengths
A code C maps a set of messages X to a set of code words Y
L_C : X → ℕ is the code length function for C,
with L_C(x ∈ X) = |C(x ∈ X)| the length in bits of the code word y ∈ Y that C assigns to symbol x ∈ X.
Efficiency
Not all codes are created equal. Let C1 and C2 be two codes for a set of messages X.
1. We call C1 more efficient than C2 if for all x ∈ X, L1(x) ≤ L2(x), while for at least one x ∈ X, L1(x) < L2(x).
2. We call a code C for set X complete if there does not exist a code C′ that is more efficient than C.
A code is complete when it does not waste any bits.
The Most Important Slide
We only care about code lengths.
Actual code words are of no interest to us whatsoever.
Our goal is measuring complexity, not to instantiate an actual compressor.
My First Code
Let us consider a sequence S over a discrete alphabet X = {x1, x2, …, xn}.
As code C for S we can instantiate a block code, identifying the value of S_i ∈ X by an index over X, which requires a constant number of log2|X| bits per message in S, i.e., L(x_i) = log2|X|.
We can always instantiate a prefix-free code with code words of lengths L(x_i) = log2|X|.
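As a quick sanity check, the block-code length is just log2 of the alphabet size. A minimal Python sketch (my own illustration, not from the slides):

```python
import math

# Block code: every symbol of the alphabet gets the same code length,
# log2|X| bits per message, regardless of how often it occurs.
alphabet = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]  # |X| = 8
length = math.log2(len(alphabet))   # 3.0 bits per message

# When |X| is not a power of two, realizable code words need ceil(log2|X|) bits.
length5 = math.ceil(math.log2(5))   # 3 bits for |X| = 5
```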
Codes in a Tree
[figure: binary tree with root, inner nodes 0 and 1, and leaves 00, 01, 10, 11 as code words]
Beyond Uniform
What if we know the distribution p(x_i ∈ X) over S and it is not uniform?
We do not want to waste any bits, so using block codes is a bad idea.
We do not want to introduce any undue bias, so
we want an efficient code that is uniquely decodable without having to use arbitrary length stop-words.
We want an optimal prefix-code.
Prefix Codes
A code C is a prefix code iff there is no code word C(x) that is an extension of another code word C(x′). In other words, C defines a binary tree with the code words as leaves. How do we find the optimal tree?
[figure: binary tree with code word 0 extended to 00 and 01, and 1 as a leaf]
Shannon Entropy
Let p(x_i) be the probability of x_i ∈ X in S, then

H(S) = − Σ_{x_i ∈ X} p(x_i) log p(x_i)

is the Shannon entropy of S (wrt X)
(see Shannon 1948)

p(x_i): the “weight”, how often we see x_i
−log p(x_i): the number of bits needed to identify x_i under p
H(S): the average number of bits needed per message S_i ∈ S
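This definition can be sketched directly in Python (my own illustration, using the empirical distribution of a sequence):

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Shannon entropy in bits of the empirical distribution of a sequence."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h_fair = shannon_entropy("01" * 5000)   # fair-coin sequence: 1 bit per message
h_const = shannon_entropy("0" * 10000)  # constant sequence: 0 bits per message
```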
Optimal Prefix Code Lengths
What if the distribution of X in S is not uniform?
Let p(x_i) be the probability of x_i in S, then

L(x_i) = −log p(x_i)

is the length of the optimal prefix code for message x_i knowing distribution p
(see Shannon 1948)
Kraft’s Inequality
For any code C for finite alphabet X = {x1, …, xn}, the code word lengths L_C must satisfy the inequality

Σ_{x_i ∈ X} 2^{−L(x_i)} ≤ 1.

a) when a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths
b) when it holds with strict equality, the code is complete: it does not waste any part of the coding space
c) when it does not hold, the code is not uniquely decodable
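Checking the inequality is a one-liner; a small Python sketch (function name is mine):

```python
def kraft_sum(lengths):
    """Sum of 2^-L(x) over all code word lengths."""
    return sum(2.0 ** -l for l in lengths)

complete = kraft_sum([2, 2, 2, 2])  # == 1: complete, e.g. {00, 01, 10, 11}
wasteful = kraft_sum([1, 3])        # < 1: a prefix code exists, but wastes space
invalid = kraft_sum([1, 1, 2])      # > 1: not uniquely decodable
```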
What’s a bit?
Binary digit: the smallest and most fundamental piece of information, a yes or a no. Introduced by Claude Shannon in 1948; the name was coined by John Tukey. Bits have been in use for a long, long time, though: punch cards (1725, 1804), Morse code (1844), African “talking drums”.
Morse code
Natural language
Punishes “bad” redundancy: often-used words are shorter
Rewards useful redundancy:
cotxent alolws mishaireng/raeding
African Talking Drums
have used this for efficient, fast, long-distance communication
they mimic vocalized sounds: a tonal language, and a very reliable means of communication
Measuring bits
How much information does a given string carry? How many bits?
Say we have a binary string of 100000 “messages”
1) 00010001000100010001…000100010001000100010001000100010001
2) 01110100110100100110…101011101011101100010110001011011100
3) 00011000001010100000…001000100001000000100011000000100110
4) 0000000000000000000000000000100000000000000000000…0000000
Obviously, all four are 100000 bits long. But are they worth those 100000 bits?
So, how many bits?
Depends on the encoding!
What is the best encoding? One that takes the entropy of the data into account:
things that occur often should get a short code
things that occur seldom should get a long code
An encoding matching Shannon Entropy is optimal
Tell us! How many bits? Please?
In our simplest example we have

p(1) = 1/100000, p(0) = 99999/100000

|code(1)| = −log2(1/100000) ≈ 16.61 bits
|code(0)| = −log2(99999/100000) ≈ 0.0000144 bits

So, knowing p, our string contains
1 · 16.61 + 99999 · 0.0000144 ≈ 18.05 bits
of information
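The arithmetic above, reproduced in Python (a sketch using the probabilities from the slide):

```python
import math

n = 100000
p_one = 1 / n         # the single 1 in the string
p_zero = (n - 1) / n  # the 99999 zeros

# total information content: one expensive 1, many nearly-free 0s
bits = 1 * -math.log2(p_one) + (n - 1) * -math.log2(p_zero)
# ≈ 16.61 + 1.44 ≈ 18.05 bits in a 100000-bit string
```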
Optimal…
Shannon lets us calculate optimal code lengths. What about actual codes? 0.0000144 bits?
Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the optimum, but not of lowest expected length.
Fano gave his students an option: take the regular exam, or invent a better encoding.
David Huffman didn’t like exams; he invented Huffman codes (1952), optimal for symbol-by-symbol encoding with fixed probabilities.
(arithmetic coding is overall optimal, Rissanen 1976)
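For illustration, Huffman's construction fits in a few lines of Python. This is my own sketch; in line with the “we only care about code lengths” motto, it returns only the lengths, not the code words:

```python
import heapq
from itertools import count

def huffman_code_lengths(probs):
    """Code word length per symbol of a Huffman code, probs: {symbol: p}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {s: 0}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # repeatedly merge the two least probable subtrees;
        # every symbol inside them moves one level deeper
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# For dyadic probabilities, Huffman matches -log2 p exactly:
lengths = huffman_code_lengths({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
# lengths == {"a": 1, "b": 2, "c": 3, "d": 3}
```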
Optimality
To encode optimally, we need optimal probabilities
What happens if we donโt?
Measuring Divergence
The Kullback-Leibler divergence D(p ‖ q) measures the number of bits we “waste” when we use q while p is the “true” distribution

D(p ‖ q) = Σ_x p(x) log( p(x) / q(x) )
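A direct Python sketch of this formula (illustration only):

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; p and q are probability lists over the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Coding a skewed source with a uniform model wastes about half a bit:
waste = kl_divergence([0.9, 0.1], [0.5, 0.5])  # ≈ 0.531 bits per message
none = kl_divergence([0.5, 0.5], [0.5, 0.5])   # 0.0: no waste
```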
Multivariate Entropy
So far weโve been thinking about a single sequence of messages
How does entropy work for
multivariate data?
Simple!
Towards Mutual Information
Conditional Entropy is defined as

H(Y|X) = Σ_{x ∈ X} p(x) H(Y | X = x)

the average number of bits needed for a message y ∈ Y when we already know X
Mutual Information
the amount of information shared between two variables X and Y

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
        = Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log( p(x, y) / (p(x) p(y)) )

high I(X; Y) implies correlation, low I(X; Y) implies (near-)independence
Information is symmetric!
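The mutual-information identities can be checked numerically; a sketch with two hypothetical joint distributions (helper names are mine):

```python
import math

def entropy(dist):
    """Entropy in bits of a probability list."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def mutual_information(joint):
    """I(X;Y) from a joint distribution given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]           # marginal of X
    py = [sum(col) for col in zip(*joint)]     # marginal of Y
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y is a copy of X: I = H(X) = 1 bit
independent = [[0.25, 0.25], [0.25, 0.25]]  # I = 0 bits
```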
Information Gain (small aside)
Entropy and KL divergence are used in decision trees.
What is the best split in a tree? One that results in label distributions in the sub-nodes that are as homogeneous as possible: minimal entropy.
How do we compare over multiple options?
IG(Y, X) = H(Y) − H(Y|X)
Low-Entropy Sets
Goal: find sets of attributes that interact strongly
Task: mine all sets of attributes such that the entropy over their value instantiations is ≤ σ
Low-Entropy Sets
Theory of Computation | Probability Theory 1 | count
No  | No  | 1887
Yes | No  |  156
No  | Yes |  143
Yes | Yes |  219
entropy: 1.087 bits
(Heikinheimo et al. 2007)
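The 1.087 bits can be recomputed from the counts (a Python sketch, my own helper name):

```python
import math

def entropy_of_counts(counts):
    """Entropy in bits of the distribution given by a list of counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Joint value counts of (Theory of Computation, Probability Theory 1):
h = entropy_of_counts([1887, 156, 143, 219])  # ≈ 1.087 bits
```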
Low-Entropy Sets
Maturity Test | Software Engineering | Theory of Computation | count
No  | No  | No  | 1570
Yes | No  | No  |   79
No  | Yes | No  |   99
Yes | Yes | No  |  282
No  | No  | Yes |   28
Yes | No  | Yes |  164
No  | Yes | Yes |   13
Yes | Yes | Yes |  170
(Heikinheimo et al. 2007)
Low-Entropy Trees
[figure: low-entropy tree over Scientific Writing, Maturity Test, Software Engineering, Project, Theory of Computation, and Probability Theory 1]
(Heikinheimo et al. 2007)
Define the entropy of a tree T = (A, T1, …, Tk) as

H_T(T) = H(A | A1, …, Ak) + Σ_i H_T(T_i)

The tree T for an itemset A minimizing H_T(T) identifies directional explanations! For the example tree this decomposes as a chain of conditional entropies, e.g.

H(A) ≤ H(SW | MT, SE, TC, PT) + H(MT | SE, TC, PT) + H(SE) + H(TC | PT) + H(PT)
Entropy for Continuous-values
So far we only considered discrete-valued data
Lots of data is continuous-valued
(or is it?)
What does this mean for entropy?
Differential Entropy
h(X) = − ∫_X f(x) log f(x) dx
(Shannon, 1948)
Differential Entropy
How about… the entropy of Uniform(0, 1/2)?

−∫₀^{1/2} 2 log 2 dx = −log 2

Hm, negative?
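The negative value is easy to verify: for Uniform(a, b) the differential entropy works out to log2(b − a). A one-line check (my own illustration):

```python
import math

# Differential entropy of Uniform(a, b) is log2(b - a),
# which, unlike discrete entropy, can be negative:
h_uniform = math.log2(0.5 - 0.0)  # Uniform(0, 1/2) -> -1.0 bits
```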
Differential Entropy
In discrete data the step size “dx” is trivial. What is its effect here?

h(X) = − ∫_X f(x) log f(x) dx

(Shannon, 1948)
Impossibru?
No.
But youโll have to wait
till next week for the answer.
Conclusions
Information is related to the reduction in uncertainty of what you could say
Entropy is a core aspect of information theory: optimal prefix-code lengths, mutual information, and many other nice properties
Entropy for continuous data is… more tricky: differential entropy is a bit problematic
Thank you!