Entropy & Information - people.mmci.uni-saarland.dejilles//edu/tada15/slides/04_entropy_and...

Transcript
Page 1

Entropy & Information
Jilles Vreeken

29 May 2015

Page 2

Question of the day

What is

information?

(and what do talking drums have to do with it?)

Page 3

Bits and Pieces

What are: information? a bit? entropy? mutual information? divergence? information theory? …

Page 4

Information Theory

Field founded by Claude Shannon in 1948, 'A Mathematical Theory of Communication'

a branch of statistics that is essentially about

uncertainty in communication

not what you say, but what you could say

Page 5

The Big Insight

Communication is a series of discrete messages

each message reduces the uncertainty of the recipient
about a) the series and b) that message

by how much it does so
is the amount of information

Page 6

Uncertainty

Shannon showed that uncertainty can be quantified, linking physical entropy to messages

and defined the entropy of

a discrete random variable ๐‘‹ as

๐ป(๐‘‹) = โˆ’๏ฟฝ๐‘ƒ(๐‘ฅ๐‘–)log ๐‘ƒ(๐‘ฅ๐‘–)๐‘–

Page 7

Optimal prefix-codes

Shannon showed that uncertainty can be quantified, linking physical entropy to messages

A (key) result of Shannon entropy is that

−log2 P(x_i)

gives the length in bits of the optimal prefix code

for a message ๐‘ฅ๐‘–
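For instance (a made-up illustration, not from the slides): with P(a) = 1/2 and P(b) = P(c) = 1/4, the optimal code lengths are −log2(1/2) = 1 bit for a and −log2(1/4) = 2 bits for b and c, realized by the prefix code a → 0, b → 10, c → 11; note that 2^−1 + 2^−2 + 2^−2 = 1, so no coding space is wasted.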

Page 8

Codes and Lengths

A code C maps a set of messages X to a set of code words Y

๐ฟ๐ถ โ‹… is a code length function for ๐ถ

with ๐ฟ๐ถ ๐‘ฅ โˆˆ ๐‘‹ = |๐ถ ๐‘ฅ โˆˆ ๐‘Œ| the length in bits of the code word y โˆˆ ๐‘Œ that ๐ถ assigns to symbol ๐‘ฅ โˆˆ ๐‘‹.

Page 9

Efficiency

Not all codes are created equal. Let C1 and C2 be two codes for a set of messages X.

1. We call C1 more efficient than C2 if for all x ∈ X, L1(x) ≤ L2(x), while for at least one x ∈ X, L1(x) < L2(x).
2. We call a code C for set X complete if there does not exist a code C′ that is more efficient than C.

A code is complete when it does not waste any bits

Page 10

The Most Important Slide

We only care about code lengths

Page 11

The Most Important Slide

Actual code words are of no interest to us whatsoever.

Page 12

The Most Important Slide

Our goal is to measure complexity,

not to instantiate an actual compressor

Page 13

My First Code

Let us consider a sequence S over a discrete alphabet X = {x1, x2, …, xm}.

As code C for S we can instantiate a block code, identifying the value of s_i ∈ S by an index over X, which requires a constant number of log2 |X| bits per message in S, i.e., L(x_i) = log2 |X|.

We can always instantiate a prefix-free code with code words of lengths L(x_i) = log2 |X|.
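A minimal sketch of such a block code (my own illustration; the alphabet and sequence are hypothetical), using a fixed width of ceil(log2 |X|) bits per symbol:

import math

X = ['a', 'b', 'c', 'd', 'e']          # hypothetical alphabet, |X| = 5
width = math.ceil(math.log2(len(X)))   # fixed code length: ceil(log2 5) = 3 bits

# block code: each symbol is encoded as its index, written with 'width' bits
code = {x: format(i, f'0{width}b') for i, x in enumerate(X)}
print(code)                            # {'a': '000', 'b': '001', ..., 'e': '100'}

S = ['a', 'b', 'a', 'e']               # a toy sequence
print(''.join(code[s] for s in S), len(S) * width, 'bits')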

Page 14

Codes in a Tree

[Figure: a binary code tree; the root has children 0 and 1, with leaves 00, 01, 10, 11]

Page 15

Beyond Uniform

What if we know the distribution P(x_i ∈ X) over S and it is not uniform?

We do not want to waste any bits, so using block codes is a bad idea.

We do not want to introduce any undue bias, so

we want an efficient code that is uniquely decodable without having to use arbitrary length stop-words.

We want an optimal prefix-code.

Page 16

Prefix Codes

A code C is a prefix code iff there is no code word C(x) that is an extension of another code word C(x′).

Or, in other words, C defines a binary tree with the leaves as the code words.

How do we find the optimal tree?

[Figure: a prefix-code tree; the root has children 0 and 1, where 1 is a leaf and 0 has the leaves 00 and 01]

Page 17

Shannon Entropy

Let P(x_i) be the probability of x_i ∈ X in S, then

H(S) = −∑_{x_i ∈ X} P(x_i) log P(x_i)

is the Shannon entropy of S (wrt X)

(see Shannon 1948)

P(x_i) is the 'weight': how often we see x_i
−log P(x_i) is the number of bits needed to identify x_i under P
H(S) is the average number of bits needed per message s_i ∈ S
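A small sketch (mine, with a toy sequence) of computing H(S) from empirical frequencies:

import math
from collections import Counter

def entropy_of_sequence(S):
    # H(S) = -sum_{x in X} P(x) log2 P(x), with P estimated from S
    n = len(S)
    return -sum((c / n) * math.log2(c / n) for c in Counter(S).values())

S = list('aababcabca')              # toy sequence over X = {a, b, c}
print(entropy_of_sequence(S))       # ~1.485 bits per message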

Page 18

Optimal Prefix Code Lengths

What if the distribution of X in S is not uniform?

Let P(x_i) be the probability of x_i in S, then

L(x_i) = −log P(x_i)

is the length of the optimal prefix code for message x_i knowing distribution P

(see Shannon 1948)

Page 19

Kraft's Inequality

For any code C for finite alphabet X = {x1, …, xm}, the code word lengths L_C(·) must satisfy the inequality

∑_{x_i ∈ X} 2^{−L(x_i)} ≤ 1.

a) when a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths,
b) when it holds with strict equality, the code is complete: it does not waste any part of the coding space,
c) when it does not hold, the code is not uniquely decodable
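A quick sketch (mine, with made-up code lengths) checking the three cases of Kraft's inequality:

def kraft_sum(lengths):
    return sum(2 ** -l for l in lengths)

print(kraft_sum([1, 2, 2]))   # 1.0  -> a complete prefix code exists (e.g. 0, 10, 11)
print(kraft_sum([2, 2, 2]))   # 0.75 -> a prefix code exists, but it wastes coding space
print(kraft_sum([1, 1, 2]))   # 1.25 -> no uniquely decodable code has these lengths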

Page 20

What's a bit?

Binary digit: the smallest and most fundamental piece of information, yes or no
invented by Claude Shannon in 1948, name coined by John Tukey

Bits have been in use for a long, long time, though:
Punch cards (1725, 1804)
Morse code (1844)
African 'talking drums'

Page 21

Morse code

Page 22

Natural language

Punishes 'bad' redundancy: often-used words are shorter

Rewards useful redundancy:

cotxent alolws mishaireng/raeding

African Talking Drums have used this for efficient, fast, long-distance communication

mimic vocalized sounds (a tonal language); a very reliable means of communication

Page 23

Measuring bits

How much information does a given string carry? How many bits?

Say we have a binary string of 10000 'messages'

1) 00010001000100010001…000100010001000100010001000100010001
2) 01110100110100100110…101011101011101100010110001011011100
3) 00011000001010100000…001000100001000000100011000000100110
4) 0000000000000000000000000000100000000000000000000…0000000

obviously, all four are 10000 bits long. But, are they worth those 10000 bits?

Page 24

So, how many bits?

Depends on the encoding!

What is the best encoding? One that takes the entropy of the data into account:
things that occur often should get a short code
things that occur seldom should get a long code

An encoding matching Shannon Entropy is optimal

Page 25

Tell us! How many bits? Please?

In our simplest example we have

P(1) = 1/100000, P(0) = 99999/100000

|code(1)| = −log(1/100000) = 16.61 bits
|code(0)| = −log(99999/100000) = 0.0000144 bits

So, knowing P, our string contains

1 × 16.61 + 99999 × 0.0000144 = 18.049 bits

of information
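The arithmetic can be checked directly; a tiny sketch (mine) reproducing the slide's numbers. The exact total is roughly 18.05 bits; the slide's 18.049 comes from rounding the per-zero cost to 0.0000144 before multiplying.

import math

n_ones, n_zeros = 1, 99999
p1, p0 = n_ones / 100000, n_zeros / 100000
bits = n_ones * -math.log2(p1) + n_zeros * -math.log2(p0)
print(-math.log2(p1), -math.log2(p0), bits)   # 16.61, 0.0000144, ~18.05 bits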

Page 26

Optimal…

Shannon lets us calculate optimal code lengths; what about actual codes? 0.0000144 bits?

Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the optimum, but not of lowest expected length.

Fano gave his students an option: take the regular exam, or invent a better encoding.

David Huffman didn't like exams; he invented Huffman codes (1952), optimal for symbol-by-symbol encoding with fixed probabilities.

(arithmetic coding is overall optimal, Rissanen 1976)
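A compact sketch of Huffman's construction (my own illustration, not the slides' code), repeatedly merging the two least probable subtrees with Python's heapq:

import heapq, itertools

def huffman_code(probs):
    # probs: dict symbol -> probability; returns dict symbol -> code word
    tie = itertools.count()                      # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)          # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

print(huffman_code({'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}: lengths equal -log2 P(x) exactly here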

Page 27

Optimality

To encode optimally, we need optimal probabilities

What happens if we don't?

Page 28

Measuring Divergence

The Kullback-Leibler divergence from Q to P, denoted D(P ‖ Q), measures the number of bits we 'waste' when we use Q while P is the 'true' distribution

D(P ‖ Q) = ∑_i P(i) log ( P(i) / Q(i) )

Page 29

Multivariate Entropy

So far we've been thinking about a single sequence of messages

How does entropy work for

multivariate data?

Simple!

Page 30

Towards Mutual Information

Conditional Entropy is defined as

๐ป ๐‘‹ ๐‘Œ = ๏ฟฝ๐‘ƒ ๐‘ฅ ๐ป(๐‘Œ|๐‘‹ = ๐‘ฅ)๐‘ฅโˆˆX

โ€˜average number of bits

needed for message ๐‘ฅ โˆˆ ๐‘‹ knowing ๐‘Œ๐ถ

Not symmetric: in general H(X|Y) ≠ H(Y|X)
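A small sketch (mine, over a made-up joint distribution) of conditional entropy:

import math

# made-up joint distribution P(x, y) over X = {0, 1} and Y = {0, 1}
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def cond_entropy(P):
    # H(X|Y) = sum_y P(y) H(X|Y=y) = -sum_{x,y} P(x,y) log2 P(x|y)
    p_y = {}
    for (x, y), p in P.items():
        p_y[y] = p_y.get(y, 0.0) + p                  # marginal P(y)
    return -sum(p * math.log2(p / p_y[y]) for (x, y), p in P.items() if p > 0)

print(cond_entropy(P))   # ~0.72 bits, versus H(X) = 1 bit: knowing Y helps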

Page 31

Mutual Information

the amount of information shared between two variables X and Y

I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
        = ∑_{x ∈ X} ∑_{y ∈ Y} P(x, y) log ( P(x, y) / (P(x) P(y)) )

high I(X, Y) implies correlation, low I(X, Y) implies (near-)independence

Information is symmetric!
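Continuing the same made-up joint distribution, a sketch (mine) of mutual information:

import math

P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # made-up P(x, y)

def mutual_information(P):
    # I(X, Y) = sum_{x,y} P(x,y) log2( P(x,y) / (P(x) P(y)) )
    p_x, p_y = {}, {}
    for (x, y), p in P.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in P.items() if p > 0)

print(mutual_information(P))   # ~0.28 bits = H(X) - H(X|Y) = 1 - 0.72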

Page 32

Information Gain (small aside)

Entropy and KL are used in decision trees

What is the best split in a tree?

one that results in as homogeneous label distributions in the sub-nodes as possible: minimal entropy

How do we compare over multiple options? Information gain: IG(T, a) = H(T) − H(T|a)
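A made-up worked example (not from the slides): a node T with 10 labels, 6 positive and 4 negative, has H(T) ≈ 0.971 bits. A split on attribute a that sends 4 purely positive examples to one child and the remaining 2 positive / 4 negative to the other gives H(T|a) = 0.4 · 0 + 0.6 · 0.918 ≈ 0.551 bits, so IG(T, a) ≈ 0.420 bits.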

Page 33

Low-Entropy Sets

Goal: find sets of attributes that interact strongly

Task: mine all sets of attributes such that the entropy over their value instantiations is ≤ σ

Example (entropy: 1.087 bits):

Theory of Computation | Probability Theory 1 | count
No  | No  | 1887
Yes | No  | 156
No  | Yes | 143
Yes | Yes | 219

(Heikinheimo et al. 2007)
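The quoted 1.087 bits can be checked from the counts in the table; a small sketch (mine):

import math

counts = [1887, 156, 143, 219]     # (No,No), (Yes,No), (No,Yes), (Yes,Yes)
n = sum(counts)
H = -sum((c / n) * math.log2(c / n) for c in counts)
print(H)                           # ~1.09 bits, the '1.087 bits' quoted on the slide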

Page 34

Low-Entropy Sets

Maturity Test | Software Engineering | Theory of Computation | count
No  | No  | No  | 1570
Yes | No  | No  | 79
No  | Yes | No  | 99
Yes | Yes | No  | 282
No  | No  | Yes | 28
Yes | No  | Yes | 164
No  | Yes | Yes | 13
Yes | Yes | Yes | 170

(Heikinheimo et al. 2007)

Page 35

Low-Entropy Trees

[Example tree over the attributes Scientific Writing, Maturity Test, Software Engineering, Project, Theory of Computation, Probability Theory 1]

(Heikinheimo et al. 2007)

Define the entropy of a tree T = (A, T1, …, Tk), with root attribute A and subtrees T1, …, Tk rooted at A1, …, Ak, as

H_U(T) = H(A | A1, …, Ak) + ∑_j H_U(T_j)

The tree T for an itemset A minimizing H_U(T) identifies directional explanations!

๐ป ๐ด โ‰ค ๐ป(๐‘†๐‘†|๐‘€๐‘‡, ๐‘†๐‘†๐‘ƒ,๐‘‡๐ถ,๐‘ƒ๐‘‡) + ๐ป(๐‘€๐‘‡|๐‘†๐‘†๐‘ƒ,๐‘‡๐ถ,๐‘ƒ๐‘ƒ) + ๐ป ๐‘†๐‘†๐‘ƒ + ๐ป ๐‘‡๐ถ ๐‘ƒ๐‘‡ + ๐ป ๐‘ƒ๐‘‡

Page 36

Entropy for Continuous Values

So far we only considered discrete-valued data

Lots of data is continuous-valued

(or is it?)

What does this mean for entropy?

Page 37

Differential Entropy

โ„Ž ๐‘‹ = โˆ’๏ฟฝ๐‘“ ๐‘ฅ log๐‘“ ๐‘ฅ ๐‘๐‘ฅ๐—

(Shannon, 1948)

Page 38

Differential Entropy

How about… the entropy of Uniform(0, 1/2)?

h(X) = −∫_0^{1/2} 2 log 2 dx = −log 2

Hm, negative?
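A numeric sanity check (mine): approximating h(Uniform(0, 1/2)) with a simple Riemann sum indeed gives about −1 bit (i.e. −log2 2), so the negative value is not a computation error:

import math

def diff_entropy(f, a, b, dx=1e-4):
    # Riemann-sum approximation of h(X) = -∫ f(x) log2 f(x) dx over [a, b]
    xs = [a + i * dx for i in range(int((b - a) / dx))]
    return -sum(f(x) * math.log2(f(x)) * dx for x in xs)

# Uniform(0, 1/2) has density f(x) = 2 on [0, 1/2]
print(diff_entropy(lambda x: 2.0, 0.0, 0.5))   # ~ -1.0: differential entropy can be negative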

Page 39

Differential Entropy

In discrete data the step size 'dx' is trivial.

What is its effect here?

โ„Ž ๐‘‹ = โˆ’๏ฟฝ๐‘“ ๐‘ฅ log๐‘“ ๐‘ฅ ๐‘๐‘ฅ

๐—

(Shannon, 1948)

Page 40

Impossibru?

No.

But you'll have to wait

till next week for the answer.

Page 41

Conclusions

Information is related to the reduction in uncertainty of what you could say

Entropy is a core aspect of information theory
lots of nice properties: optimal prefix-code lengths, mutual information, etc.

Entropy for continuous data is… more tricky
differential entropy is a bit problematic

Page 42

Thank you!

Information is related to the reduction in uncertainty of what you could say

Entropy is a core aspect of information theory
lots of nice properties: optimal prefix-code lengths, mutual information, etc.

Entropy for continuous data is… more tricky
differential entropy is a bit problematic