Revealing inductive biases with Bayesian models
Tom GriffithsUC Berkeley
with Mike Kalish, Brian Christian, and Steve Lewandowsky
Inductive problems
Example: the utterances "blicket toma", "dax wug", "blicket wug", generated by the grammar
  S → X Y
  X → {blicket, dax}
  Y → {toma, wug}
Learning languages from utterances
Learning functions from (x,y) pairs
Learning categories from instances of their members
Generalization requires induction
Generalization: predicting the properties of an entity from observed properties of others
What makes a good inductive learner?
• Hypothesis 1: more representational power
  – more hypotheses, more complexity
  – spirit of many accounts of learning and development
Some hypothesis spaces
• Linear functions: g(x) = p1 x + p0
• Quadratic functions: g(x) = p2 x^2 + p1 x + p0
• 8th-degree polynomials: g(x) = Σ_{j=0}^{8} p_j x^j
Minimizing squared error
[figures: least-squares fits for each hypothesis space]

Measuring prediction error
[figure]
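A minimal sketch (in Python; not part of the talk) of minimizing squared error in the linear hypothesis space, using the closed-form least-squares estimates:

```python
# Fit g(x) = p1*x + p0 by minimizing the sum of squared errors
# (ordinary least squares, closed form: p1 = cov(x, y) / var(x)).

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    p1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    p0 = my - p1 * mx
    return p0, p1

def squared_error(xs, ys, p0, p1):
    return sum((y - (p1 * x + p0)) ** 2 for x, y in zip(xs, ys))

# On exactly linear data the fit recovers the line and the error is ~0.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # y = 2x + 1
p0, p1 = fit_line(xs, ys)        # -> p0 = 1.0, p1 = 2.0
```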
What makes a good inductive learner?
• Hypothesis 1: more representational power
  – more hypotheses, more complexity
  – spirit of many accounts of learning and development
• Hypothesis 2: good inductive biases
  – constraints on hypotheses that match the environment
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
A simple schema for induction
• Data D are n pairs (x, y) generated from a function f
• Hypothesis space of functions, y = g(x)
• Error is E = (y − g(x))^2
• Pick the function g that minimizes the error on D
• Measure prediction error, averaging over x and y
Bias and variance
• A good learner makes (f(x) - g(x))2 small
• g is chosen on the basis of the data D
• Evaluate learners by the average of (f(x) - g(x))2 over data D generated from f
E_{p(D)}[(f(x) − g(x))^2] = (f(x) − E_{p(D)}[g(x)])^2 + E_{p(D)}[(g(x) − E_{p(D)}[g(x)])^2]

The first term is the (squared) bias; the second is the variance.
(Geman, Bienenstock, & Doursat, 1992)
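The decomposition can be verified in two lines; writing \bar{g}(x) = E_{p(D)}[g(x)] and suppressing the argument x:

```latex
\mathbb{E}_{p(D)}\big[(f - g)^2\big]
  = \mathbb{E}_{p(D)}\big[\big((f - \bar{g}) + (\bar{g} - g)\big)^2\big]
  = (f - \bar{g})^2
    + 2\,(f - \bar{g})\,\mathbb{E}_{p(D)}[\bar{g} - g]
    + \mathbb{E}_{p(D)}\big[(g - \bar{g})^2\big]
```

The cross term vanishes because E_{p(D)}[\bar{g} − g] = 0, leaving exactly the squared bias plus the variance.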
Making things more intuitive…
• The next few slides were generated by:
  – choosing a true function f(x)
  – generating a number of datasets D from p(x, y), defined by a uniform p(x) and y = f(x) plus noise
  – finding the function g(x) in the hypothesis space that minimized the error on D
• Comparing the average of g(x) to f(x) reveals the bias
• The spread of g(x) around the average is the variance
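The recipe above can be sketched directly. A toy version (assuming a quadratic f, a linear hypothesis space, and Gaussian noise; these specifics are illustrative, not taken from the slides):

```python
import random

random.seed(0)

def f(x):
    return x * x                       # the true function

def fit_line(xs, ys):
    # least-squares fit of g(x) = p1*x + p0
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    p1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    return my - p1 * mx, p1

x0 = 0.0                               # evaluate bias/variance at this point
preds = []
for _ in range(2000):                  # many datasets D ~ p(x, y)
    xs = [random.uniform(-1, 1) for _ in range(10)]   # uniform p(x)
    ys = [f(x) + random.gauss(0, 0.1) for x in xs]    # f(x) plus noise
    p0, p1 = fit_line(xs, ys)          # the g minimizing error on D
    preds.append(p1 * x0 + p0)         # g(x0) for this dataset

mean_g = sum(preds) / len(preds)
bias_sq = (f(x0) - mean_g) ** 2                                # squared bias
variance = sum((p - mean_g) ** 2 for p in preds) / len(preds)  # variance
```

A line fit to a parabola misses badly at x0 = 0 no matter how many datasets it sees (high bias), while its prediction varies comparatively little across datasets (low variance).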
Linear functions (n = 10)
[figure: pink is g(x) for each dataset, red is the average g(x), black is f(x); brackets mark the bias and the variance]

Quadratic functions (n = 10)
[figure: pink is g(x) for each dataset, red is the average g(x), black is f(x)]

8th-degree polynomials (n = 10)
[figure: pink is g(x) for each dataset, red is the average g(x), black is f(x)]
Bias and variance
(for our quadratic f(x), with n = 10)
• Linear functions: high bias, medium variance
• Quadratic functions: low bias, low variance
• 8th-order polynomials: low bias, super-high variance
In general…
• Larger hypothesis spaces result in higher variance, but lower bias, across a range of functions f(x)
• The bias-variance tradeoff: if we want a learner that has low bias on a range of problems, we pay a price in variance
• This is mainly an issue when n is small
  – the regime of much of human learning
Quadratic functions (n = 100)
[figure: pink is g(x) for each dataset, red is the average g(x), black is f(x)]

8th-degree polynomials (n = 100)
[figure: pink is g(x) for each dataset, red is the average g(x), black is f(x)]
The moral
• General-purpose learning mechanisms do not work well with small amounts of data
  – more representational power isn't always better
• To make good predictions from small amounts of data, you need a bias that matches the problem
  – these biases are the key to successful induction, and characterize the nature of an inductive learner
• So… how can we identify human inductive biases?
Bayesian inference
Reverend Thomas Bayes
• Rational procedure for updating beliefs
• Foundation of many learning algorithms
• Lets us make the inductive biases of learners precise
Bayes’ theorem
P(h|d) = P(d|h) P(h) / Σ_{h'∈H} P(d|h') P(h')

h: hypothesis
d: data
P(h|d): posterior probability
P(d|h): likelihood
P(h): prior probability
The denominator sums over the space of hypotheses H.
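As a toy illustration (the numbers are invented, not from the talk), Bayes' theorem over a discrete hypothesis space is a few lines of code:

```python
# P(h|d) = P(d|h) P(h) / sum over h' in H of P(d|h') P(h')

def posterior(prior, likelihood, d):
    unnorm = {h: likelihood[h][d] * p for h, p in prior.items()}
    z = sum(unnorm.values())       # the sum over the hypothesis space H
    return {h: v / z for h, v in unnorm.items()}

prior = {"h1": 0.5, "h2": 0.5}                       # P(h)
likelihood = {"h1": {"d": 0.8}, "h2": {"d": 0.2}}    # P(d|h)
post = posterior(prior, likelihood, "d")             # -> {"h1": 0.8, "h2": 0.2}
```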
Priors and biases
• Priors indicate the kind of world a learner expects to encounter, guiding their conclusions
• In our function learning example…
  – the likelihood assigns probability to data that decreases with the sum of squared errors (i.e., a Gaussian)
  – the priors are uniform over all functions in hypothesis spaces of different kinds of polynomials
  – having more functions corresponds to a belief in a more complex world…
Two ways of using Bayesian models
• Specify models that make different assumptions about priors, and compare their fit to human data
(Anderson & Schooler, 1991; Oaksford & Chater, 1994; Griffiths & Tenenbaum, 2006)
• Design experiments explicitly intended to reveal the priors of Bayesian learners
Iterated learning (Kirby, 2001)
What are the consequences of learners learning from other learners?
Objects of iterated learning
• Knowledge is communicated across generations through the provision of data by learners
• Examples:
  – religious concepts
  – social norms
  – myths and legends
  – causal theories
  – language
Analyzing iterated learning
[diagram: a chain of learners, alternating PL(h|d) and PP(d|h)]
PL(h|d): probability of inferring hypothesis h from data d
PP(d|h): probability of generating data d from hypothesis h
Markov chains
x → x → x → x → …
• Variables x(t+1) are independent of the history given x(t)
• Transition matrix: T = P(x(t+1) | x(t))
• Converges to a stationary distribution under easily checked conditions (i.e., if the chain is ergodic)
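Convergence to a stationary distribution is easy to check numerically; a sketch with an arbitrary two-state transition matrix (the numbers are illustrative only):

```python
# T[i][j] = P(x(t+1) = j | x(t) = i); each row sums to 1
T = [[0.9, 0.1],
     [0.5, 0.5]]

def step(dist, T):
    # one application of the transition matrix: dist' = dist @ T
    return [sum(dist[i] * T[i][j] for i in range(len(dist)))
            for j in range(len(T[0]))]

dist = [1.0, 0.0]                  # start deterministically in state 0
for _ in range(100):
    dist = step(dist, T)
# dist is now (numerically) the stationary distribution [5/6, 1/6],
# which satisfies pi = pi @ T, regardless of the starting state
```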
Analyzing iterated learning

d0 →PL(h|d)→ h1 →PP(d|h)→ d1 →PL(h|d)→ h2 →PP(d|h)→ d2 →PL(h|d)→ h3 → …

• A Markov chain on hypotheses: h1 → h2 → h3 → …, with transition probability Σd PL(h'|d) PP(d|h)
• A Markov chain on data: d0 → d1 → d2 → …, with transition probability Σh PP(d'|h) PL(h|d)
Iterated Bayesian learning
Assume learners sample from their posterior distribution:

PL(h|d) = PP(d|h) P(h) / Σ_{h'∈H} PP(d|h') P(h')
Stationary distributions
• The Markov chain on h converges to the prior, P(h)
• The Markov chain on d converges to the "prior predictive distribution":

P(d) = Σ_h P(d|h) P(h)

(Griffiths & Kalish, 2005)
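This convergence result can be checked by simulation. A toy chain (two hypotheses, two data values, invented probabilities) in which each learner samples a hypothesis from its posterior; the long-run frequency of hypotheses matches the prior:

```python
import random

random.seed(1)

prior = [0.8, 0.2]                 # P(h)
like = [[0.9, 0.1],                # like[h][d] = PP(d|h)
        [0.3, 0.7]]

def sample(probs):
    r, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if r < c:
            return i
    return len(probs) - 1

def learn(d):
    # sample h from the posterior PL(h|d), proportional to PP(d|h) P(h)
    w = [like[h][d] * prior[h] for h in range(2)]
    z = sum(w)
    return sample([x / z for x in w])

h = 0
counts = [0, 0]
steps = 20000
for _ in range(steps):             # the chain h -> d -> h' -> d' -> ...
    d = sample(like[h])            # current learner generates data
    h = learn(d)                   # next learner infers a hypothesis
    counts[h] += 1

freq = [c / steps for c in counts]  # close to prior = [0.8, 0.2]
```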
Explaining convergence to the prior
• Intuitively: data acts once, prior many times
• Formally: iterated learning with Bayesian agents is a Gibbs sampler on P(d,h)
(Griffiths & Kalish, in press)
Revealing inductive biases
• If iterated learning converges to the prior, it might provide a tool for determining the inductive biases of human learners
• We can test this by reproducing iterated learning in the lab, with stimuli for which human biases are well understood
Iterated function learning
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
(Kalish, Griffiths, & Lewandowsky, in press)
Function learning experiments
[figure: experimental interface, showing the stimulus, a response slider, and feedback]

Examine iterated learning with different initial data
[figure: responses across iterations 1–9 for each set of initial data]
Identifying inductive biases
• Formal analysis suggests that iterated learning provides a way to determine inductive biases
• Experiments with human learners support this idea
  – when stimuli for which biases are well understood are used, those biases are revealed by iterated learning
• What do inductive biases look like in other cases?
  – continuous categories
  – causal structure
  – word learning
  – language learning
Conclusions
• Solving inductive problems and forming good generalizations requires good inductive biases
• Bayesian inference provides a way to make assumptions about the biases of learners explicit
• Two ways to identify human inductive biases:
  – compare Bayesian models assuming different priors
  – design tasks to extract biases from Bayesian learners
• Iterated learning provides a lens for magnifying the inductive biases of learners
  – small effects for individuals are big effects for groups
Iterated concept learning
• Each learner sees examples from a species
• Identifies the species of four amoebae
• Iterated learning is run within subjects
(Griffiths, Christian, & Kalish, in press)
Two positive examples
[figure: data (d) and hypotheses (h)]

Bayesian model (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001)

P(h|d) = P(d|h) P(h) / Σ_{h'∈H} P(d|h') P(h')

d: 2 amoebae
h: a set of 4 amoebae

P(d|h) = (1/|h|)^m  if d ∈ h, and 0 otherwise

m: number of amoebae in the set d (= 2)
|h|: number of amoebae in the set h (= 4)

Because every hypothesis has the same size, the posterior is the renormalized prior:

P(h|d) = P(h) / Σ_{h': d ∈ h'} P(h')
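A sketch of this model in code (amoebae reduced to integer IDs; the hypotheses and prior below are invented for illustration):

```python
# Size-principle likelihood: P(d|h) = (1/|h|)^m if every example in d is in h,
# else 0 (Tenenbaum, 1999). With all |h| equal, the posterior is just the
# prior renormalized over the consistent hypotheses.

def posterior(hypotheses, prior, d):
    m = len(d)
    unnorm = [((1.0 / len(h)) ** m if all(x in h for x in d) else 0.0) * p
              for h, p in zip(hypotheses, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

hyps = [{1, 2, 3, 4}, {1, 2, 5, 6}, {5, 6, 7, 8}]   # sets of 4 amoebae
prior = [0.6, 0.3, 0.1]
post = posterior(hyps, prior, d=[1, 2])             # two positive examples
# -> [2/3, 1/3, 0.0]: the renormalized prior over the consistent hypotheses
```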
What is the prior?
Classes of concepts (Shepard, Hovland, & Jenkins, 1961)
[figure: Classes 1–6, defined over the dimensions shape, size, and color]

Experiment design (for each subject): 6 iterated learning chains and 6 independent learning "chains", one per class
Estimating the prior
[figure: hypotheses (h) × data (d)]

Estimated prior:
Class 1: 0.861
Class 2: 0.087
Class 3: 0.009
Class 4: 0.002
Class 5: 0.013
Class 6: 0.028

r = 0.952 between the Bayesian model and human subjects
Two positive examples (n = 20)
[figures: probability of each class by iteration, for human learners and the Bayesian model]
Three positive examples
[figure: data (d) and hypotheses (h)]

Three positive examples (n = 20)
[figures: probability of each class by iteration, for human learners and the Bayesian model]
Serial reproduction (Bartlett, 1932)
• Participants see stimuli, then reproduce them from memory
• The reproductions of one participant become the stimuli for the next
• Stimuli were interesting, rather than controlled
  – e.g., "War of the Ghosts"
Discovering the biases of models

Generic neural network:
[figure]

EXAM (Delosh, Busemeyer, & McDaniel, 1997):
[figure]

POLE (Kalish, Lewandowsky, & Kruschke, 2004):