Post on 28-Dec-2015
1
A(n) (extremely) brief/crude introduction to the minimum description length principle
jdu
2006-04
2
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
4
Introduction
• Example: data compression
– Description methods
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
5
Introduction
• Example: regression
– Model selection and overfitting
– Complexity of the model vs. Goodness of fit
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
6
Introduction
• Models vs. Hypotheses
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
7
Introduction
• Crude 2-part version of MDL
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
8
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
9
Probabilities and Codelengths
• Let X be a finite or countable set
– A code C(x) for X
  • 1-to-1 mapping from X to $\bigcup_{n>0} \{0,1\}^n$
  • $L_C(x)$: number of bits needed to encode x using C
– P: probability distribution defined on X
  • P(x): the probability of x
  • A sequence of (usually iid) observations $x_1, x_2, \ldots, x_n$: $x^n$
10
Probabilities and Codelengths
• Prefix codes: as examples of uniquely decodable codes
– no code word is a prefix of any other

symbol  code word
a       0
b       111
c       1011
d       1010
r       110
!       100
Source: http://www.cs.princeton.edu/courses/archive/spring04/cos126/
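The prefix property is what makes decoding possible without separators between code words. A minimal Python sketch using the code table from this slide; the function names are illustrative assumptions, not taken from the cited course material:

```python
# Code table copied from the slide; helper names are hypothetical.
CODE = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}


def is_prefix_free(code):
    """True if no code word is a prefix of a different code word."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)


def encode(text, code=CODE):
    """Concatenate the code words of the symbols in text."""
    return "".join(code[ch] for ch in text)


def decode(bits, code=CODE):
    """Read bits left to right, emitting a symbol whenever a code word completes."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return "".join(out)
```

Because no code word is a prefix of another, the first match found while scanning is always the right one, so `decode(encode(s))` returns `s` for any string over these six symbols.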
11
Probabilities and Codelengths
• Expected codelength of a code C: $E_P(L_C(x)) = \sum_{x \in X} P(x) L_C(x)$
– Lower bound: $H(P) = -\sum_{x \in X} P(x) \log_2 P(x)$
• Optimal code
– if it has minimum expected codelength over all uniquely decodable codes
– How to design one given P?
  • Huffman coding
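To make the two quantities concrete, here is a small Python sketch that evaluates the expected codelength of a code and the entropy lower bound. The distribution and the two sets of code lengths are made-up illustrations, not taken from the slides:

```python
import math


def expected_codelength(P, lengths):
    """Sum over x of P(x) * L_C(x), in bits."""
    return sum(p * lengths[x] for x, p in P.items())


def entropy(P):
    """-sum over x of P(x) * log2 P(x), in bits (terms with P(x) = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)


# Hypothetical example: a dyadic distribution and two candidate codes.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
matched_lengths = {"a": 1, "b": 2, "c": 3, "d": 3}  # L(x) = -log2 P(x)
fixed_lengths = {"a": 2, "b": 2, "c": 2, "d": 2}    # plain 2-bit code
```

For this P the matched code attains the bound (1.75 bits per symbol), while the fixed-length code spends 2 bits per symbol.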
12
Probabilities and Codelengths
• Huffman coding
Source: http://star.itc.it/caprile/teaching/algebra-superiore-2001/
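A minimal sketch of the Huffman construction in Python, assuming the distribution is given as a dict of probabilities. The function name and the 0/1 labeling of branches are illustrative choices; this is not a reproduction of the figure from the cited course:

```python
import heapq
from itertools import count


def huffman_code(P):
    """Return {symbol: code word} built by repeatedly merging the two least likely nodes."""
    if len(P) == 1:
        return {next(iter(P)): "0"}
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(p, next(tiebreak), {x: ""}) for x, p in P.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        # Prepend one bit to every code word in each of the two merged subtrees.
        merged = {x: "0" + w for x, w in code0.items()}
        merged.update({x: "1" + w for x, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]
```

For `{"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}` the resulting code words have lengths 1, 2, 3 and 3, which equal $-\log_2 P(x)$ for each symbol.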
13
Probabilities and Codelengths
• How to design a code for {1, 2, …, M}?
– Assuming a uniform distribution: 1/M for each number
– ~ $\log M$ bits
14
Probabilities and Codelengths
• How to design a code for all the positive integers?
– For each k
  • Describe it with $\lceil \log k \rceil$ 0s
  • Followed by a 1
  • Then encode k using the uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$
  • In total, ~ $2\log k + 1$ bits
– Can be refined…
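One way to realize this integer code is sketched below in Python. It assumes the run of 0s has length $\lceil \log_2 k \rceil$ and that the uniform code writes k - 1 in exactly that many bits; both choices are assumptions consistent with the stated total of about $2\log k + 1$ bits, and the function names are hypothetical:

```python
def encode_positive_int(k):
    """Prefix code for k >= 1: n zeros, a 1, then k - 1 written in n bits."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    n = (k - 1).bit_length()  # equals ceil(log2(k)) for k >= 1
    payload = format(k - 1, "b").zfill(n) if n else ""
    return "0" * n + "1" + payload


def decode_positive_int(bits):
    """Decode one integer from the front of bits; return (k, remaining bits)."""
    n = bits.index("1")            # number of leading zeros
    payload = bits[n + 1:2 * n + 1]
    if len(payload) != n:
        raise ValueError("truncated code word")
    k = int(payload, 2) + 1 if n else 1
    return k, bits[2 * n + 1:]
```

Since the leading zeros announce how many payload bits follow, code words can be concatenated and still decoded one after another, for example `decode_positive_int(encode_positive_int(5) + encode_positive_int(1))` yields 5 and leaves the code word for 1.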
15
Probabilities and Codelengths
• Let P be a probability distribution over X, then there exists a code C for X such that: $L_C(x) = \lceil -\log P(x) \rceil$
• Let C be a uniquely decodable code over X, then there exists a probability distribution P such that: $L_C(x) = -\log P(x)$
• For sequences (up to rounding): $L_C(x^n) = -\log P(x^n)$
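Both directions of this correspondence can be checked numerically through the Kraft sum $\sum_x 2^{-L(x)} \le 1$. The sketch below is illustrative only; the helper names and the example distribution are assumptions, and the distribution obtained from a code may sum to less than 1:

```python
import math


def code_lengths_from_distribution(P):
    """Lengths L(x) = ceil(-log2 P(x)); these always satisfy the Kraft inequality."""
    return {x: math.ceil(-math.log2(p)) for x, p in P.items() if p > 0}


def distribution_from_code_lengths(lengths):
    """P(x) = 2^(-L(x)); sums to at most 1 when the lengths come from a uniquely decodable code."""
    return {x: 2.0 ** -l for x, l in lengths.items()}


def kraft_sum(lengths):
    """Sum over x of 2^(-L(x))."""
    return sum(2.0 ** -l for l in lengths.values())
```

With the hypothetical distribution `{"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}` the lengths come out as 2, 2, 3 and 4 with Kraft sum 0.6875. Going the other way, the lengths 1, 3, 4, 4, 3, 3 of the prefix code shown earlier give a distribution that sums to exactly 1.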
16
Probabilities and Codelengths
• Codelength revisited
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
17
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
18
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– A sequence: $X_1, X_2, \ldots, X_N$
– Special case: 0-th order: Bernoulli model (biased coin)
  • Maximum Likelihood estimator
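For the Bernoulli model the maximum likelihood estimator is the relative frequency of 1s. As a reminder, writing $n_{[1]}$ for the number of 1s in $x^n$ (the symbol is an assumption here, chosen to match the count notation n[…] used on the following slides):

```latex
P(x^n \mid \theta) = \theta^{\,n_{[1]}} (1-\theta)^{\,n - n_{[1]}},
\qquad
\hat{\theta}(x^n) = \arg\max_{\theta \in [0,1]} P(x^n \mid \theta) = \frac{n_{[1]}}{n}
```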
19
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– Special case: first order Markov chain B(1)
  • MLE
20
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– $2^k$ parameters
  • theta[1|000…000] = n[1|000…000]/n[000…000]
  • theta[1|000…001]
  • …
  • theta[1|111…110]
  • theta[1|111…111]
– Log likelihood function: …
– MLE: …
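The counting estimator theta[1|context] = n[1|context]/n[context] can be sketched in a few lines of Python. This is an illustration under stated assumptions: the data are a string of "0"/"1" characters, the likelihood is conditioned on the first k symbols, and the function names are hypothetical:

```python
import math
from collections import Counter


def markov_mle(bits, k):
    """Return {context: theta[1|context]} for every length-k context seen in bits."""
    ones, totals = Counter(), Counter()
    for i in range(k, len(bits)):
        ctx = bits[i - k:i]          # the k symbols preceding position i
        totals[ctx] += 1
        ones[ctx] += bits[i] == "1"  # True counts as 1
    return {ctx: ones[ctx] / totals[ctx] for ctx in totals}


def log_likelihood(bits, k, theta):
    """log2 of P(x_{k+1}..x_n | x_1..x_k, theta).

    Raises ValueError if theta assigns probability 0 to an observed symbol,
    which cannot happen when theta is the MLE for the same data.
    """
    ll = 0.0
    for i in range(k, len(bits)):
        p1 = theta[bits[i - k:i]]
        ll += math.log2(p1 if bits[i] == "1" else 1.0 - p1)
    return ll
```

For example, `markov_mle("0010110111", 1)` gives theta[1|0] = 0.75 and theta[1|1] = 0.6, and with k = 0 the single context is the empty string, which reduces to the Bernoulli estimate.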
21
Crude MDL
• Question: Given data $D = x^n$, find the Markov chain that best explains D.
– We do not want to restrict ourselves to chains of fixed order
  • How to avoid overfitting?
  • Obviously, an (n-1)-th order Markov model would always fit the data the best
22
Crude MDL
• two-part MDL revisited
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
23
Crude MDL
• Description length of data given hypothesis
24
Crude MDL
• Description length of hypothesis
– The code should not change with the sample size n.
– Different codes will lead to preferences of different hypotheses
– How to design a code that
  • Leads to good inferences with small, practically relevant sample sizes?
25
Crude MDL
• An "intuitive" and "reasonable" code for k-th order Markov chain
– First describe k using $2\log k + 1$ bits
– Then describe the $d = 2^k$ parameters
  • Assume n is given in advance
– For each theta in the MLE {theta[1|000…000], …, theta[1|111…111]}, the best precision we can achieve by counting is 1/(n+1)
– Describe each theta with $\log(n+1)$ bits
– $L(H) = 2\log k + 1 + d\log(n+1)$
– $L(H) + L(D|H) = 2\log k + 1 + d\log(n+1) - \log P(D \mid k, \theta)$
– For a given k, only the MLE theta needs to be considered
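The two-part codelength above can be turned into a small order-selection routine. The Python sketch below is a hedged illustration rather than the slides' exact procedure: it applies the integer code to k + 1 so that k = 0 is allowed, charges 1 bit for each of the first k symbols (which have no full context), and uses hypothetical function names:

```python
import math
from collections import Counter


def data_bits(bits, k):
    """L(D|H) = -log2 P(D | k, MLE theta), plus k bits for the first k symbols (assumption)."""
    counts = Counter((bits[i - k:i], bits[i]) for i in range(k, len(bits)))
    total = float(k)
    for ctx in {c for c, _ in counts}:
        n_ctx = counts[(ctx, "0")] + counts[(ctx, "1")]
        for sym in "01":
            m = counts[(ctx, sym)]
            if m:
                total -= m * math.log2(m / n_ctx)
    return total


def hypothesis_bits(k, n):
    """L(H): about 2 log k + 1 bits for the order, log2(n + 1) bits per parameter."""
    order_bits = 2 * k.bit_length() + 1  # integer code applied to k + 1 (assumption)
    return order_bits + (2 ** k) * math.log2(n + 1)


def two_part_codelength(bits, k):
    """L(H) + L(D|H) for the k-th order Markov chain at its MLE."""
    return hypothesis_bits(k, len(bits)) + data_bits(bits, k)


def best_order(bits, max_k):
    """Order in 0..max_k with the smallest two-part codelength."""
    return min(range(max_k + 1), key=lambda k: two_part_codelength(bits, k))
```

On the strictly alternating sequence `"01" * 200` the routine selects order 1: higher orders fit equally well but pay for more parameters, while order 0 needs about one bit per symbol. On a constant sequence it selects order 0.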
26
Crude MDL
• Good news
– We have found a principled manner to encode data D using H
• Bad news
– We have not found clear guidelines to design codes for H
27
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
28
Refined MDL
• Universal codes and universal distributions
– maximum likelihood code depends on the data
  • How to describe the data in an unambiguous manner?
– Design a code such that for every possible observation, its codelength corresponds to its ML? - impossible
29
Refined MDL
• Worst-case regret
• Optimal universal model
30
Refined MDL
• Normalized maximum likelihood (NML)
• Minimizing -logNML
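For reference, the standard definitions behind the last two slides can be written as follows. The notation ($\bar{P}$ for a universal distribution, $\hat{\theta}(x^n)$ for the ML estimator within the model M, $\mathrm{COMP}_n$ for the complexity term) is an assumption chosen for this sketch and may differ from the symbols on the original slides:

```latex
% Regret of \bar{P} on x^n relative to M, and its worst case over all sequences
\mathcal{R}(\bar{P}, x^n) = -\log \bar{P}(x^n) - \bigl[-\log P(x^n \mid \hat{\theta}(x^n))\bigr],
\qquad
\mathcal{R}_{\max}(\bar{P}) = \max_{x^n} \mathcal{R}(\bar{P}, x^n)

% The NML distribution minimizes the worst-case regret
\bar{P}_{\mathrm{nml}}(x^n) = \frac{P(x^n \mid \hat{\theta}(x^n))}{\sum_{y^n} P(y^n \mid \hat{\theta}(y^n))}

% Codelength under NML: fit plus complexity
-\log \bar{P}_{\mathrm{nml}}(x^n) = -\log P(x^n \mid \hat{\theta}(x^n)) + \mathrm{COMP}_n(M),
\qquad
\mathrm{COMP}_n(M) = \log \sum_{y^n} P(y^n \mid \hat{\theta}(y^n))
```

The regret of $\bar{P}_{\mathrm{nml}}$ equals $\mathrm{COMP}_n(M)$ for every sequence, which is why the sum in the denominator acts as the complexity of the model.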
31
Refined MDL
• Complexity of a model
– The more sequences that can be fit well by an element of M, the larger M's complexity
– Would it lead to a "right" balance between complexity and fit?
  • Hopefully…
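As a concrete instance, the complexity term can be computed exactly for the Bernoulli model by grouping sequences according to their number of 1s. The Python sketch below is illustrative: the choice of the Bernoulli model, the function names and the base-2 logarithm are assumptions, and the direct summation is only practical for moderate n (up to roughly a thousand):

```python
import math


def bernoulli_complexity_bits(n):
    """log2 of the sum over all x^n of P(x^n | ML estimate), for n >= 1."""
    total = 0.0
    for m in range(n + 1):  # m = number of 1s; comb(n, m) sequences share it
        max_lik = (m / n) ** m * ((n - m) / n) ** (n - m)
        total += math.comb(n, m) * max_lik
    return math.log2(total)


def neg_log_nml_bits(bits):
    """-log2 NML(x^n) = fit term + complexity term, for a string of 0/1 characters."""
    n, m = len(bits), bits.count("1")
    fit = -sum(c * math.log2(c / n) for c in (m, n - m) if c)
    return fit + bernoulli_complexity_bits(n)
```

The complexity is 1 bit for n = 1 and grows slowly with n, and summing `2 ** -neg_log_nml_bits(x)` over all sequences of a fixed length gives 1, as it should for a probability distribution.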
32
Refined MDL
• General refined MDL
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
33
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
34
Other topics
• Mixture code
• Resolvability
• …
35
References
• Barron, A.; Rissanen, J. & Yu, B. (1998), 'The minimum description length principle in coding and modeling', IEEE Transactions on Information Theory 44(6), 2743-2760.
• Grünwald, P.D.; Myung, I.J. & Pitt, M.A. (2005), Advances in Minimum Description Length: Theory and Applications (Neural Information Processing), The MIT Press.
• Hall, P. & Hannan, E.J. (1988), 'On stochastic complexity and nonparametric density estimation', Biometrika 75(4), 705-714.