Information theory, MDL and human cognition
Transcript of Information theory, MDL and human cognition
Information theory, MDL and human cognition
Nick Chater, Department of Psychology, University College London
Overview
- Bayes and MDL: An overview
- Universality and induction
- Some puzzles about model fitting
- Cognitive science applications
Bayes and MDL: A simplified story
Shannon's coding theorem: for a distribution $\Pr(A)$, the optimal code assigns $-\log_2 \Pr(A)$ bits to code event $A$.

MDL model selection: choose the model $M$ that yields the shortest code for the data $D$, i.e., minimize $-\log_2 \Pr(D, M)$.
A simple equivalence
Minimize: $-\log_2 \Pr(D, M)$
⇔ Maximize: $\Pr(D, M) = \Pr(M \mid D)\,\Pr(D)$
⇔ Maximize: $\Pr(M \mid D)$

Just what Bayes recommends (if choosing a single model).

The equivalence generalizes to parametric models $M(\theta)$, to "full" Bayes, and in other ways. See Chater (1996) for an application to the simplicity and likelihood principles in perceptual organization.
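A toy numerical sketch of the equivalence (the models and probabilities below are made up for illustration): choosing the model with the shortest joint code length $-\log_2 \Pr(D, M)$ picks out exactly the maximum-posterior model, because $\Pr(D)$ is a shared constant.

```python
import math

# Hypothetical joint probabilities Pr(D, M) = Pr(D | M) * Pr(M) for two models.
joint = {"M1": 0.6 * 0.02, "M2": 0.4 * 0.01}

# MDL: minimize the code length -log2 Pr(D, M).
code_length = {m: -math.log2(p) for m, p in joint.items()}
mdl_choice = min(code_length, key=code_length.get)

# Bayes: maximize the posterior Pr(M | D) = Pr(D, M) / Pr(D).
pr_d = sum(joint.values())
posterior = {m: p / pr_d for m, p in joint.items()}
map_choice = max(posterior, key=posterior.get)

assert mdl_choice == map_choice  # shortest code = highest posterior
print(mdl_choice, code_length, posterior)
```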
Codes or priors? Which comes first? 1. The philosophical issue
Bayesian viewpoint: probabilities as basic
- a calculus for degrees of belief (probability theory)
- decision theory (probabilities meet action)
- the brain as a probabilistic calculation machine (whether belief propagation, dynamic programming…)
Simplicity/MDL viewpoint: codes as basic
- Rissanen: data is all there is; distributions are a fiction
- Code structure is primary; code interpretation is secondary
- Probabilities are defined over events; but "events" are cognitive constructs

(Leeuwenberg & Boselie, 1988)
Codes or priors? Which comes first? 2. The practical issue
MDL viewpoint: take codes as basic… when we know most about representation, e.g., grammars.

Bayesian viewpoint: take probabilities as basic… when we know most about probability, e.g., image statistics.
Bayesian viewpoint (e.g., Geisler et al., 2001)

Good continuation: most lines continue in the same direction in real images.
Simplicity/MDL viewpoint (e.g., Goldsmith, 2001)
In, e.g., linguistics, representations are given by theory, and we can roughly assess the complexity of grammars (by length). It is not so clear how to set a prior directly over all grammars (though one can define a generative process in simple cases…).

Example grammar fragment:
- S → NP VP
- VP → V NP
- VP → V NP PP
- NP → Det Noun
- NP → NP PP
- "Binding constraints"

Gzip as a handy approximation!?!
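One way to cash out the "gzip trick" is to use a standard compressor's output length as a crude, language-dependent stand-in for description length. A minimal sketch (the grammar strings are invented for illustration):

```python
import zlib

def approx_complexity(text: str) -> int:
    # Compressed length in bytes: a rough, compressor-relative proxy for
    # description length (not true Kolmogorov complexity).
    return len(zlib.compress(text.encode("utf-8"), 9))

compact_grammar = "S -> NP VP\nVP -> V NP\nNP -> Det Noun\n"
bloated_grammar = compact_grammar + "".join(
    f"NP -> Det Adj{i} Noun\n" for i in range(50)  # many near-duplicate rules
)

print(approx_complexity(compact_grammar), approx_complexity(bloated_grammar))
```

The proxy is crude: repetition helps the compressor, so it rewards redundancy differently from a principled code, but it often suffices for rough comparisons.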
Simplicity/MDL and Bayes are closely related
Let's now explore the simplicity perspective.
Overview
- Bayes and MDL: An overview
- Universality and induction
- Some puzzles about model fitting
- Cognitive science applications
The most neutral possible prior…
Suppose we want a prior so neutral that it never rules out a model. This is possible, if we limit ourselves to computable models.

Take a mixture of all (computable) priors, with weights $\alpha_i$ that decline fairly fast:
$$m(x) = \sum_i \alpha_i \, p_i(x)$$
Then this multiplicatively dominates all priors (though such neutral priors will mean slow learning). The $m(x)$ are "universal" priors.
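A finite toy version of this construction (a real universal prior mixes over all computable distributions; here just three hypothetical Bernoulli models): the mixture never assigns zero probability where some component doesn't, and it multiplicatively dominates each component up to its weight.

```python
# Toy "universal" mixture over three Bernoulli models of a binary sequence.
models = {0.2: 0.5, 0.5: 0.25, 0.8: 0.25}  # {bias theta: weight alpha_i}

def model_prob(theta, seq):
    p = 1.0
    for bit in seq:
        p *= theta if bit == 1 else 1.0 - theta
    return p

def m(seq):
    # m(x) = sum_i alpha_i p_i(x)
    return sum(alpha * model_prob(theta, seq) for theta, alpha in models.items())

seq = [1, 1, 0, 1, 1, 1, 0, 1]
for theta, alpha in models.items():
    # Multiplicative dominance: m(x) >= alpha_i * p_i(x) for every component i.
    assert m(seq) >= alpha * model_prob(theta, seq)
print(m(seq))
```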
The most neutral possible coding language
Universal programming languages (Java, Matlab, UTMs, etc.)

$K(x)$ = the length of the shortest program in Java, Matlab, or on a UTM that generates $x$ ($K$ is uncomputable).

Invariance theorem: for any languages $L_1$, $L_2$, there is a constant $c$ such that, for all $x$, $|K_{L_1}(x) - K_{L_2}(x)| \le c$.

This mathematically justifies talk of $K(x)$, not $K_{\text{Java}}(x)$, $K_{\text{Matlab}}(x)$, …
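An informal illustration of language-dependence (not the theorem itself): let two off-the-shelf compressors play the role of two description languages. Their code lengths for the same data differ, but not wildly, echoing the bounded constant $c$:

```python
import bz2
import zlib

def code_lengths(data: bytes):
    # Two "description languages": DEFLATE and bzip2.
    return len(zlib.compress(data, 9)), len(bz2.compress(data, 9))

for x in [b"ab" * 500, b"the cat sat on the mat " * 40, bytes(range(256)) * 4]:
    l1, l2 = code_lengths(x)
    print(len(x), l1, l2, l1 - l2)
```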
So does this mean that choice of language doesn’t matter?
Not quite! $c$ can be large.

And, for any $L_1$ and any $c_0$, there are $L_2$ and $x$ such that $|K_{L_1}(x) - K_{L_2}(x)| \ge c_0$.
The problem of the one-instruction code for the entire data set…
But Kolmogorov complexity can be made concrete…
Compact Universal Turing machines
- 210 bits: λ-calculus
- 272 bits: combinators

(due to John Tromp, 2007)

Not much room to hide, here!
Neutral priors and Kolmogorov complexity

A key result:
$$K(x) = -\log_2 m(x) + O(1)$$
where $m$ is a universal prior. Analogous to Shannon's source coding theorem.

And for any computable $q$:
$$K(x) \le -\log_2 q(x) + O(1)$$
for typical $x$ drawn from $q(x)$.

Any data $x$ that is likely under some sensible (computable) probability distribution has low $K(x)$.
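A rough numerical check of the second inequality (synthetic data; a generic compressor stands in, imperfectly, for the shortest program): for a typical sequence drawn from a computable $q$, the compressed length lands in the same ballpark as the Shannon code length $-\log_2 q(x)$.

```python
import math
import random
import zlib

random.seed(1)
theta = 0.9
bits = [1 if random.random() < theta else 0 for _ in range(4000)]
x = bytes(bits)  # one byte per symbol, for simplicity

# Shannon code length under the true source q, in bits.
neg_log_q = sum(-math.log2(theta if b else 1.0 - theta) for b in bits)
# A real compressor's code length, in bits (includes format overhead).
compressed_bits = 8 * len(zlib.compress(x, 9))

print(round(neg_log_q), compressed_bits)
```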
Prediction by simplicity
- Find the shortest 'program/explanation' for the current 'corpus' (a binary string)
- Predict using that program
- Strictly, use a 'weighted sum' of explanations, weighted by brevity
Prediction is possible (Solomonoff, 1978): the summed error has a finite bound:
$$\sum_{j=1}^{\infty} s_j \le \frac{\ln 2}{2}\, K(\mu)$$
where $s_j$ is the summed squared error between the prediction and the true probability on item $j$, and $\mu$ is the true (computable) distribution.

So prediction converges [faster than $1/(n \log n)$] for corpus size $n$.

Computability assumptions only (no stationarity needed).
Summary so far…
- Simplicity/MDL has close and deep connections with Bayes
- Defines a universal prior (i.e., one based on simplicity)
- Can be made "concrete"
- General prediction results
- A convenient "dual" framework to Bayes, when codes are easier than probabilities

Li, M. & Vitányi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). Berlin: Springer.
Overview
- Bayes and MDL: An overview
- Universality and induction
- Some puzzles about model fitting
- Cognitive science applications
A problem of model selection? Or: why simplicity won’t go away
Where do priors come from? Well, priors can be given by hyper-priors, and hyper-priors by hyper-hyper-priors. But it can't go on forever!

And we need priors over models we've only just thought of; and, in some contexts, over models we haven't yet thought of (!)

Code length in our representation language is a fixed basis. N.B. building probabilistic models = augmenting our language with new coding schemes.
The hidden role of simplicity…
Bayesian model selection prefers
$$y(x) = a_2 x^2 + a_1 x + a_0$$
not
$$y(x) = a_{125} x^{125} + a_{124} x^{124} + \dots + a_0$$
But who says how many parameters a function has got?
A trick…
Convert parameters to constants:
$$y(x) = a_{125} x^{125} + a_{124} x^{124} + \dots + a_0 \qquad \text{(126 parameters)}$$
$$y(x) = 0.003\,x^{125} + 0.02\,x^{124} + \dots + 3x - 24.3 \qquad \text{(0 parameters)}$$
The latter is hence favoured by Bayesian (and all other) model selection criteria.

All the virtues of theft over honest toil.
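A hedged numerical sketch of the pathology (synthetic data; BIC stands in for the model-selection criterion): the very same fitted line, re-declared with its coefficients as constants, scores as a zero-parameter model and wins automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.1, n)  # data generated by a noisy line

def bic(residuals, k, n):
    # BIC = k ln(n) - 2 ln(L), with a Gaussian likelihood at the MLE variance.
    sigma2 = np.mean(residuals**2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return k * np.log(n) - 2.0 * log_lik

a1, a0 = np.polyfit(x, y, 1)
residuals = y - (a1 * x + a0)

bic_honest = bic(residuals, k=2, n=n)  # pays for two fitted parameters
bic_cheat = bic(residuals, k=0, n=n)   # same curve, coefficients as "constants"

print(bic_honest, bic_cheat)  # the constant-laden model always scores better
```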
Zoubin's problem for ML

The maximum-likelihood (ML) Gaussian is a delta function on one data point.

An impressive fit!
A related problem for Bayes?

The mixture-of-delta-functions model (!)

An even more impressive fit!
Should the "cheating" model get a huge boost from this data?

No!
- Sense of moral outrage
- The model had to be fitted post hoc
- It would be different if I'd thought of it before the data arrived (cf. empirical Bayes)

Yes!
- But the order of data acquisition has no role in Bayes
- Confirmation is just the same, whenever I thought of the model

The model gets a spectacular boost; but it is even more spectacularly unlikely a priori…
So we need to take care with priors!
$y = x$: high prior; compact to state.
$y = 0.003\,x^{125} + 0.02\,x^{124} + \dots + 3x - 24$: low prior; not compact to state.

With a different representation language, we could have the opposite bias. But we start from where we are: our actual representations.

We can discover that things are simpler than we thought (i.e., simplicity is not quite so subjective…).
Overview
- Bayes and MDL: An overview
- Universality and induction
- Some puzzles about model fitting
- Cognitive science applications
There are quite a few…

| Domain | Principle | References |
| --- | --- | --- |
| Perceptual organization | Find the grouping that minimizes cost | Koffka, 1935; Leeuwenberg, 1971; Attneave & Frost, 1969 |
| Early vision | Efficient coding & transmission | Blakemore, 1990; Barlow, 1974; Srinivasan & Laughlin |
| Causal reasoning | Find the minimal belief network | Wedelind |
| Similarity | Similarity between representations measured by the code length between them | Chater & Vitányi, 2003; Hahn, Chater & Richardson, 2003 |
| Categorization | Categorize items to find the shortest code (high-level perceptual organization) | Pothos & Chater, 2002; Feldman, 2000 |
| Memory storage | Shorter codes are easier to store | Chater, 1999 |
| Memory retrieval | Explain interference by cue-trace complexity; a rational foundation for distinctiveness models | SIMPLE (Brown, Neath & Chater, 2005) |
| Language acquisition | Find the grammar that best explains the child's input | Chomsky, 1955; J. D. Fodor & Crain; Chater, 2004; Chater & Vitányi, 2005 |
Here:
- Perceptual organization
- Language acquisition
- Similarity and generalization
Long tradition of simplicity in perception (Mach, Koffka, Leeuwenberg); e.g., Gestalt laws
[Figure: six '+' elements in common motion. Ungrouped coding: 6 × 2 vectors, a position (x) and a velocity (v) for each element. Grouped coding: 6 + 1 vectors, six positions (x) plus one shared velocity (v).]
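The code-length arithmetic behind the figure, as a tiny sketch (the cost of b bits per vector is an assumption made purely for illustration):

```python
b = 16  # assumed bits per (position or velocity) vector
n = 6   # number of '+' elements

ungrouped = n * 2 * b  # a position and a velocity for each element: 6 x 2 vectors
grouped = (n + 1) * b  # six positions plus one shared velocity: 6 + 1 vectors

print(ungrouped, grouped)  # 192 vs 112 bits: grouping yields the shorter code
```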
And language acquisition: where it helps resolve an apparent learnability paradox
Undergeneral grammars predict that good sentences are not allowed: just wait until one turns up.

Overgeneral grammars predict that bad sentences are actually OK: detecting this seems to need negative evidence (say a bad sentence, and get corrected).
The logical problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)
Without negative evidence, the learner can never eliminate overgeneral grammars. "Mere" non-occurrence of sentences is not enough… because almost all acceptable sentences also never occur.

Backed up by formal results (Gold, 1967; though cf. Feldman and Horning et al.).

An argument for innateness?
An "ideal" learning set-up (cf. ideal observers)
- Linguistic environment: positive evidence only; computability
- Learning method: statistical; simplicity
- Measures of learning performance
Overgeneralization Theorem (Chater & Vitányi)
Suppose the learner has probability $\epsilon_j$ of erroneously guessing an ungrammatical $j$th word. Then:
$$\sum_{j=1}^{\infty} \epsilon_j \le K(\mu)\,\log_e 2$$
where $\mu$ is the true (computable) source.

Intuitive explanation: overgeneralization underloads the probabilities of grammatical sentences; small probabilities imply longer code lengths.
Absence as implicit negative evidence
Overgeneral grammars predict missing sentences, and their absence is a clue that the grammar is wrong.

The method can be "scaled down" to consider the learnability of specific linguistic constructions.
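A toy illustration of how absence accumulates (the "grammars" and probabilities are invented): an overgeneral grammar reserves probability mass for a sentence type that never occurs, so its code for the corpus grows steadily longer than the tight grammar's.

```python
import math

# Probabilities each "grammar" assigns to sentence types; the overgeneral one
# reserves mass for an ungrammatical type 'bad' that never appears.
tight = {"s1": 0.6, "s2": 0.4}
overgeneral = {"s1": 0.45, "s2": 0.3, "bad": 0.25}

corpus = ["s1", "s2", "s1", "s1", "s2"] * 20  # 100 sentences, no 'bad'

def code_length(grammar, corpus):
    return sum(-math.log2(grammar[s]) for s in corpus)

print(code_length(tight, corpus))        # shorter
print(code_length(overgeneral, corpus))  # longer, by ~ the KL divergence per item
```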
Similarity and categorization
Cognitive dissimilarity: the representational "distortion" required to get from $x$ to $y$: $D_U(x, y) = K(y \mid x)$.

Not symmetrical: $K(y \mid x) > K(x \mid y)$ when $K(y) > K(x)$. Deletion is easy…
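This can be approximated informally with a compressor, using $C(xy) - C(x)$ as a stand-in for $K(y \mid x)$ (a standard move in compression-based similarity measures; the strings below are invented): going from a rich object to a simple one (deletion) should typically cost fewer bits than the reverse.

```python
import zlib

def c(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def cond(x: bytes, y: bytes) -> int:
    # C(y | x) approximated as C(x + y) - C(x): the extra bits to add y given x.
    return c(x + y) - c(x)

simple = b"horse " * 100
rich = b"horse with wings and a saddle and golden hooves " * 100

# Deletion vs invention: expect cond(rich, simple) < cond(simple, rich),
# mirroring K(y|x) > K(x|y) when K(y) > K(x).
print(cond(rich, simple), cond(simple, rich))
```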
Shepard’s (1987) Universal Law
Generalization (strictly, confusability) is an exponential function of psychological "distance":
$$G(a, b) = A\, e^{-B\, D_U(S_a, S_b)}$$
A derivation
Shepard's generalization measure:
$$G(a, b) = \left[\frac{\Pr(R_a \mid S_b)\,\Pr(R_b \mid S_a)}{\Pr(R_a \mid S_a)\,\Pr(R_b \mid S_b)}\right]^{1/2}$$
so that
$$\log_2 G(a, b) = \tfrac{1}{2}\left(\log_2 \Pr(R_a \mid S_b) + \log_2 \Pr(R_b \mid S_a) - \log_2 \Pr(R_a \mid S_a) - \log_2 \Pr(R_b \mid S_b)\right)$$
For "typical" items:
$$\log_2 G(a, b) = -\tfrac{1}{2}\left(K(R_a \mid S_b) + K(R_b \mid S_a)\right) + o(1)$$
Assuming the items are of roughly the same complexity:
$$\log_2 G(a, b) = -\tfrac{1}{2}\left(K(R_a \mid S_b) + K(R_b \mid S_a)\right) + o(1) = -D_U(S_a, S_b) + o(1)$$
which yields the universal law:
$$G(a, b) = A\, e^{-B\, D_U(S_a, S_b)}$$
The asymmetry of similarity
What thing is this like?
And what is this like?
A heuristic measure of amount of information: Shannon’s guessing game…
1. Pony? 2. Cow? 3. Dog? … 345. Pegasus ✓

345!
Asymmetry of code lengths → asymmetry of similarity

Horse: guess #345 gets Pegasus. $-\log_2 \Pr(\#345)$ is very large.
Pegasus: guess #2 gets Horse. $-\log_2 \Pr(\#2)$ is small.

So Pegasus is more like Horse than Horse is like Pegasus.

Many other examples of asymmetry, and many measures (search times, memory confusions…), seem to fit this pattern.
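A back-of-envelope version of the guessing game (the Zipfian distribution over guess numbers is an assumption for illustration): the code length of a guess is $-\log_2$ of its probability, so guess #345 carries far more information than guess #2.

```python
import math

N = 1000                                   # size of the guessing vocabulary
z = sum(1.0 / k for k in range(1, N + 1))  # Zipf normalizing constant

def guess_code_length(k: int) -> float:
    # Code length in bits of the k-th guess under a Zipfian guess distribution.
    return -math.log2((1.0 / k) / z)

print(guess_code_length(2))    # ~3.9 bits: Horse is easy to reach from Pegasus
print(guess_code_length(345))  # ~11.3 bits: Pegasus is hard to reach from Horse
```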
Treisman & Souther (1985)
A simple array
A complex array
Summary
- MDL/Kolmogorov complexity: close relation with Bayes
- Basis for a "universal" prior
- A variety of applications to cognitive science