How to win big by thinking straight about relatively trivial problems
Transcript of How to win big by thinking straight about relatively trivial problems
Tony Bell, University of California at Berkeley
Density Estimation
Make the model q like the reality p
by minimising the Kullback-Leibler divergence:

$D(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$

by gradient descent in a parameter $w$ of the model:

$\Delta w \propto -\frac{\partial D}{\partial w} = \left\langle \frac{\partial \log q(x)}{\partial w} \right\rangle_p$
THIS RESULT IS COMPLETELY GENERAL.
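As a minimal numerical sketch of this idea (not from the talk; the distributions, step size and iteration count are illustrative assumptions): fit a one-dimensional Gaussian model q to samples from a "reality" p by gradient ascent on ⟨log q⟩_p, which is the same as gradient descent on the KL divergence, since the entropy of p does not depend on the model parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=10_000)   # samples from the "reality" p

# Model q = N(mu, sigma^2). Minimising D(p||q) over the parameters is the
# same as maximising <log q(x)>_p, since H(p) is parameter-independent.
mu, log_sigma = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    sigma = np.exp(log_sigma)
    g_mu = np.mean(x - mu) / sigma**2                 # d<log q>/d mu
    g_ls = np.mean((x - mu) ** 2) / sigma**2 - 1.0    # d<log q>/d log sigma
    mu += lr * g_mu
    log_sigma += lr * g_ls

print(mu, np.exp(log_sigma))   # close to the true 2.0 and 1.5
```

The generality claimed above is visible here: nothing in the loop depends on q being Gaussian, only on being able to differentiate log q in its parameters.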
The passive case (action term = 0). For a general model distribution written in the 'Energy-based' form:

$q(x) = \frac{e^{-E(x)}}{Z}$

where $E(x)$ is the energy and $Z = \int e^{-E(x)}\,dx$ is the partition function (or zeroth moment), the gradient evaluates in the simple 'Boltzmann-like' form:

$\Delta w \propto \left\langle -\frac{\partial E}{\partial w} \right\rangle_p - \left\langle -\frac{\partial E}{\partial w} \right\rangle_q$

learn on data while awake; unlearn on the model's own samples while asleep.
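A hedged sketch of this Boltzmann-like gradient (my toy construction, not the talk's): an exponential-family model over three binary units with pairwise-product features, small enough that Z can be summed exactly. The update is the awake average minus the asleep average of the features.

```python
import numpy as np

# All 2^3 binary states of a toy model; features are the pairwise products.
states = np.array([[(i >> 2) & 1, (i >> 1) & 1, i & 1] for i in range(8)], float)
phi = np.stack([states[:, 0] * states[:, 1],
                states[:, 1] * states[:, 2],
                states[:, 0] * states[:, 2]], axis=1)    # shape (8, 3)

# "Data" distribution p: drawn from the same family with weights 2.0,
# so learning should recover theta = [2, 2, 2].
p = np.exp(phi @ np.array([2.0, 2.0, 2.0]))
p /= p.sum()

theta = np.zeros(3)        # model: q(x) = exp(theta . phi(x)) / Z
lr = 0.5
for _ in range(500):
    q = np.exp(phi @ theta)
    q /= q.sum()                       # exact Z: the state space is tiny
    theta += lr * (p @ phi - q @ phi)  # awake (data) minus asleep (model)
```

With $E(x) = -\theta \cdot \phi(x)$, the two averaged terms are exactly $\langle -\partial E/\partial w \rangle_p$ and $\langle -\partial E/\partial w \rangle_q$; for realistic models the asleep term must instead be estimated by sampling.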
The single-layer case
Linear transform: $u = Wx$

The score function: $\varphi(u) = -\frac{\partial \log q(u)}{\partial u}$

Learning rule (natural gradient): $\Delta W \propto \left( I - \left\langle \varphi(u)\, u^{\top} \right\rangle \right) W$
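The single-layer natural-gradient rule can be sketched as batch ICA on a 2x2 mixture. The tanh score is an assumed choice for sparse (super-Gaussian) sources, and the mixing matrix, learning rate and iteration count below are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two sparse (Laplacian) sources, linearly mixed: x = A s
n, T = 2, 20_000
s = rng.laplace(size=(n, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

W = np.eye(n)              # unmixing matrix, u = W x
lr = 0.03
for _ in range(1000):
    u = W @ x
    score = np.tanh(u)     # assumed score for a sparse prior
    # natural-gradient rule: dW proportional to (I - <score(u) u^T>) W
    W += lr * (np.eye(n) - (score @ u.T) / T) @ W

P = W @ A                  # should approach a scaled permutation matrix
```

At the fixed point $\langle \varphi(u) u^\top \rangle = I$, so the recovered components are decorrelated through the score nonlinearity, which is what removes the higher-order dependencies.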
Shaping density: many problems are solved by modelling in the transformed space.
For a non-loopy hypergraph, the score function is the important quantity.
Conditional Density Modeling
To model the conditional density p(x|y), use the rules:
This little-known fact has hardly ever been exploited. It can be used instead of regression everywhere.
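One way to see why conditional density modelling can replace regression (my illustrative construction, not the talk's method): condition a joint model and read off both the conditional mean, which is the regression answer, and the conditional spread, which regression discards. A histogram stands in for the learned joint model here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Heteroscedastic data: both the mean and the spread of y depend on x.
# A regression fit keeps only the mean; the conditional density keeps both.
T = 200_000
x = rng.uniform(-1, 1, T)
y = np.sin(3 * x) + (0.1 + 0.3 * np.abs(x)) * rng.normal(size=T)

# Model the joint with a histogram, then condition:
# p(y|x) = p(x,y) / p(x)  -- each row of `cond` is one conditional density
joint, xedges, yedges = np.histogram2d(x, y, bins=[40, 80])
cond = joint / joint.sum(axis=1, keepdims=True)

ymid = 0.5 * (yedges[:-1] + yedges[1:])
cond_mean = cond @ ymid                     # E[y|x]: the "regression" answer
cond_var = cond @ ymid**2 - cond_mean**2    # Var[y|x]: what regression drops
```

Here `cond_mean` tracks sin(3x) while `cond_var` grows with |x|; a regressor would report only the first and be silent about the second.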
Independent Components, Subspaces and Vectors (ICA, ISA, IVA)
DCA (ie: score function hard to get at due to Z)
IVA used for audio-separation in real room:
Score functions derived from sparse factorial and radial densities:
Results on real-room source separation:
Why does IVA work on this problem?
Because the score function, and thus the learning, is only sensitive to the amplitude of the complex vectors, representing correlations of amplitudes of frequency components associated with a single speaker. Arbitrary dependencies can exist between the phases of this vector. Thus all phase (ie: higher-order statistical structure) is confined within the vector and removed between them.
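The phase-invariance claim can be checked numerically. The radial score $\varphi(u) = u/\|u\|$ assumed below is the one that follows from a radial sparse density $q(u) \propto e^{-\|u\|}$; since only amplitudes enter the norm, arbitrary per-component phase rotations leave the score's amplitude pattern unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)

# Radial sparse prior q(u) ~ exp(-||u||) gives score phi(u) = u / ||u||
# (the factorial Laplacian prior would give phi_k(u) = sign(u_k) instead).
def score_radial(u):
    return u / np.linalg.norm(u)

# A complex frequency-domain vector for one source, and an arbitrary
# per-component phase rotation:
u = rng.normal(size=8) + 1j * rng.normal(size=8)
phases = np.exp(1j * rng.uniform(0, 2 * np.pi, size=8))

s1 = score_radial(u)
s2 = score_radial(phases * u)

# |phi(phases * u)| == |phi(u)|: the score, and hence the learning,
# sees only the amplitude structure; phase dependencies are untouched.
print(np.allclose(np.abs(s1), np.abs(s2)))   # True
```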
It’s a simple trick, just relaxing the independence assumptions in a way that fits speech. But we can do much more:
• build conditional models across frequency components
• make models for data that is even more structured:

Video is [time x space x colour]
Many experiments are [time x sensor x task-condition x trial]
[Figure: EEG decompositions across channels 1-16, 17-32 and 33-48, for time windows 0-8 and 0-1]
The big picture.
Behind this effort is an attempt to explore something called “The Levels Hypothesis”, which is the idea that in biology, in the brain, in nature, there is a kind of density estimation taking place across scales.
To explore this idea, we have a twofold strategy:
1. EMPIRICAL/DATA ANALYSIS: Build algorithms that can probe the EEG across scales, ie: across frequencies
2. THEORETICAL: Formalise mathematically the learning process in such systems.
LEVEL    | UNIT       | DYNAMICS                           | LEARNING
ecology  | society    | predation, symbiosis               | natural selection
society  | organism   | behaviour                          | sensory-motor learning
organism | cell       | spikes                             | synaptic plasticity (= STDP)
cell     | protein    | direct, voltage, Ca, 2nd messenger | gene expression, protein recycling
protein  | amino acid | molecular forces                   | molecular change
A Multi-Level View of Learning
LEARNING at a LEVEL is CHANGE IN INTERACTIONS between its UNITS,implemented by INTERACTIONS at the LEVEL beneath, and by extensionresulting in CHANGE IN LEARNING at the LEVEL above.
Increasing timescale.
Separation of timescales allows INTERACTIONS at one LEVEL to be LEARNING at the LEVEL above.
(Interactions = fast; learning = slow.)
Infomax between Layers (eg: V1 density-estimates Retina):
• square (in ICA formalism)
• feedforward
• information flows within a level
• predicts independent activity
• only models outside input

Infomax between Levels (eg: synapses density-estimate spikes):
• overcomplete
• includes all feedback
• information flows between levels
• arbitrary dependencies
• models input and intrinsic activity
[Figure: retina and V1, linked by synaptic weights; all neural spikes (t) and all synaptic ‘readouts’ (y) through synapses and dendrites; the pdf of all spike times and the pdf of all synaptic ‘readouts’]
If we can make this pdf uniform, then we have a model constructed from all synaptic and dendritic causality.
This SHIFT in looking at the problem alters the question so that, if it is answered, we have an unsupervised theory of ‘whole brain learning’.
Formalisation of the problem:
p is the ‘data’ distribution, q is the ‘model’ distribution,
w is a synaptic weight, and I(y,t) is the spike-synapse mutual information.
IF
THEN if we were doing classical Infomax, we would use the gradient:
BUT if one’s actions can change the data, THEN an extra term appears:
(1)
(2)
The first term is changing one’s model to fit the world; the extra term is changing the world to fit the model. It is easier to live in a world where one can change the world to fit the model, as well as changing one’s model to fit the world; therefore (2) must be easier than (1). This is what we are now researching.