How to win big by thinking straight about relatively trivial problems
Transcript of How to win big by thinking straight about relatively trivial problems
Tony Bell, University of California at Berkeley
Density Estimation
Make the model q like the reality p
by minimising the Kullback-Leibler divergence:

$D(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$

by gradient descent in a parameter $w$ of the model:

$\Delta w \propto -\frac{\partial D}{\partial w} = \left\langle \frac{\partial \log q(x)}{\partial w} \right\rangle_p$
THIS RESULT IS COMPLETELY GENERAL.
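As a minimal numerical sketch of this idea (not from the talk; the distributions, step size and iteration count are illustrative assumptions): fit a one-dimensional Gaussian model q to samples from a "reality" p by gradient ascent on ⟨log q⟩_p, which is the same as gradient descent on the KL divergence, since the entropy of p does not depend on the model parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=10_000)   # samples from the "reality" p

# Model q = N(mu, sigma^2). Minimising D(p||q) over the parameters is the
# same as maximising <log q(x)>_p, since H(p) is parameter-independent.
mu, log_sigma = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    sigma = np.exp(log_sigma)
    g_mu = np.mean(x - mu) / sigma**2                 # d<log q>/d mu
    g_ls = np.mean((x - mu) ** 2) / sigma**2 - 1.0    # d<log q>/d log sigma
    mu += lr * g_mu
    log_sigma += lr * g_ls

print(mu, np.exp(log_sigma))   # close to the true 2.0 and 1.5
```

The generality claimed above is visible here: nothing in the loop depends on q being Gaussian, only on being able to differentiate log q in its parameters.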
The passive case (action term = 0). For a general model distribution written in the 'Energy-based' form:

$q(x) = \frac{e^{-E(x)}}{Z}$

where $E(x)$ is the energy and $Z = \int e^{-E(x)}\,dx$ is the partition function (or zeroth moment), the gradient evaluates in the simple 'Boltzmann-like' form:

$\Delta w \propto \left\langle -\frac{\partial E}{\partial w} \right\rangle_p - \left\langle -\frac{\partial E}{\partial w} \right\rangle_q$

learn on data while awake; unlearn on the model's own samples while asleep.
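A hedged sketch of this Boltzmann-like gradient (my toy construction, not the talk's): an exponential-family model over three binary units with pairwise-product features, small enough that Z can be summed exactly. The update is the awake average minus the asleep average of the features.

```python
import numpy as np

# All 2^3 binary states of a toy model; features are the pairwise products.
states = np.array([[(i >> 2) & 1, (i >> 1) & 1, i & 1] for i in range(8)], float)
phi = np.stack([states[:, 0] * states[:, 1],
                states[:, 1] * states[:, 2],
                states[:, 0] * states[:, 2]], axis=1)    # shape (8, 3)

# "Data" distribution p: drawn from the same family with weights 2.0,
# so learning should recover theta = [2, 2, 2].
p = np.exp(phi @ np.array([2.0, 2.0, 2.0]))
p /= p.sum()

theta = np.zeros(3)        # model: q(x) = exp(theta . phi(x)) / Z
lr = 0.5
for _ in range(500):
    q = np.exp(phi @ theta)
    q /= q.sum()                       # exact Z: the state space is tiny
    theta += lr * (p @ phi - q @ phi)  # awake (data) minus asleep (model)
```

With $E(x) = -\theta \cdot \phi(x)$, the two averaged terms are exactly $\langle -\partial E/\partial w \rangle_p$ and $\langle -\partial E/\partial w \rangle_q$; for realistic models the asleep term must instead be estimated by sampling.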
The single-layer case
Linear transform: $u = Wx$

The score function: $\varphi(u) = -\frac{\partial \log q(u)}{\partial u}$

Learning rule (natural gradient): $\Delta W \propto \left( I - \left\langle \varphi(u)\, u^{\top} \right\rangle \right) W$
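The single-layer natural-gradient rule can be sketched as batch ICA on a 2x2 mixture. The tanh score is an assumed choice for sparse (super-Gaussian) sources, and the mixing matrix, learning rate and iteration count below are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two sparse (Laplacian) sources, linearly mixed: x = A s
n, T = 2, 20_000
s = rng.laplace(size=(n, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

W = np.eye(n)              # unmixing matrix, u = W x
lr = 0.03
for _ in range(1000):
    u = W @ x
    score = np.tanh(u)     # assumed score for a sparse prior
    # natural-gradient rule: dW proportional to (I - <score(u) u^T>) W
    W += lr * (np.eye(n) - (score @ u.T) / T) @ W

P = W @ A                  # should approach a scaled permutation matrix
```

At the fixed point $\langle \varphi(u) u^\top \rangle = I$, so the recovered components are decorrelated through the score nonlinearity, which is what removes the higher-order dependencies.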
Shaping density: many problems are solved by modelling in the transformed space.
For a non-loopy hypergraph, the score function is the important quantity.
Conditional Density Modeling
To model the conditional density p(x|y), use the rules:
This little-known fact has hardly ever been exploited. It can be used instead of regression everywhere.
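One way to see why conditional density modelling can replace regression (my illustrative construction, not the talk's method): condition a joint model and read off both the conditional mean, which is the regression answer, and the conditional spread, which regression discards. A histogram stands in for the learned joint model here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Heteroscedastic data: both the mean and the spread of y depend on x.
# A regression fit keeps only the mean; the conditional density keeps both.
T = 200_000
x = rng.uniform(-1, 1, T)
y = np.sin(3 * x) + (0.1 + 0.3 * np.abs(x)) * rng.normal(size=T)

# Model the joint with a histogram, then condition:
# p(y|x) = p(x,y) / p(x)  -- each row of `cond` is one conditional density
joint, xedges, yedges = np.histogram2d(x, y, bins=[40, 80])
cond = joint / joint.sum(axis=1, keepdims=True)

ymid = 0.5 * (yedges[:-1] + yedges[1:])
cond_mean = cond @ ymid                     # E[y|x]: the "regression" answer
cond_var = cond @ ymid**2 - cond_mean**2    # Var[y|x]: what regression drops
```

Here `cond_mean` tracks sin(3x) while `cond_var` grows with |x|; a regressor would report only the first and be silent about the second.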
Independent Components, Subspaces and Vectors (ICA, ISA, IVA)
DCA (ie: score function hard to get at due to Z)
IVA used for audio-separation in real room:
Score functions derived from sparse factorial and radial densities:
Results on real-room source separation:
Why does IVA work on this problem?
Because the score function, and thus the learning, is only sensitive to the amplitude of the complex vectors, representing correlations of amplitudes of frequency components associated with a single speaker. Arbitrary dependencies can exist between the phases of this vector. Thus all phase (ie: higher-order statistical structure) is confined within the vector and removed between them.
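The phase-invariance claim can be checked numerically. The radial score $\varphi(u) = u/\|u\|$ assumed below is the one that follows from a radial sparse density $q(u) \propto e^{-\|u\|}$; since only amplitudes enter the norm, arbitrary per-component phase rotations leave the score's amplitude pattern unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)

# Radial sparse prior q(u) ~ exp(-||u||) gives score phi(u) = u / ||u||
# (the factorial Laplacian prior would give phi_k(u) = sign(u_k) instead).
def score_radial(u):
    return u / np.linalg.norm(u)

# A complex frequency-domain vector for one source, and an arbitrary
# per-component phase rotation:
u = rng.normal(size=8) + 1j * rng.normal(size=8)
phases = np.exp(1j * rng.uniform(0, 2 * np.pi, size=8))

s1 = score_radial(u)
s2 = score_radial(phases * u)

# |phi(phases * u)| == |phi(u)|: the score, and hence the learning,
# sees only the amplitude structure; phase dependencies are untouched.
print(np.allclose(np.abs(s1), np.abs(s2)))   # True
```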
It’s a simple trick, just relaxing the independence assumptions in a way that fits speech. But we can do much more:
• build conditional models across frequency components
• make models for data that is even more structured:

Video is [time x space x colour]
Many experiments are [time x sensor x task-condition x trial]
[Figure: EEG decompositions across channels 1-16, 17-32 and 33-48, for time windows 0-8 and 0-1]
The big picture.
Behind this effort is an attempt to explore something called “The Levels Hypothesis”, which is the idea that in biology, in the brain, in nature, there is a kind of density estimation taking place across scales.
To explore this idea, we have a twofold strategy:
1. EMPIRICAL/DATA ANALYSIS: Build algorithms that can probe the EEG across scales, ie: across frequencies
2. THEORETICAL: Formalise mathematically the learning process in such systems.
LEVEL    | UNIT       | DYNAMICS                           | LEARNING
ecology  | society    | predation, symbiosis               | natural selection
society  | organism   | behaviour                          | sensory-motor learning
organism | cell       | spikes                             | synaptic plasticity (= STDP)
cell     | protein    | direct, voltage, Ca, 2nd messenger | gene expression, protein recycling
protein  | amino acid | molecular forces                   | molecular change
A Multi-Level View of Learning
LEARNING at a LEVEL is CHANGE IN INTERACTIONS between its UNITS,implemented by INTERACTIONS at the LEVEL beneath, and by extensionresulting in CHANGE IN LEARNING at the LEVEL above.
Increasing timescale.
Separation of timescales allows INTERACTIONS at one LEVEL to be LEARNING at the LEVEL above.
(Interactions = fast; learning = slow.)
Infomax between Layers (eg: V1 density-estimates Retina):
• square (in ICA formalism)
• feedforward
• information flows within a level
• predicts independent activity
• only models outside input

Infomax between Levels (eg: synapses density-estimate spikes):
• overcomplete
• includes all feedback
• information flows between levels
• arbitrary dependencies
• models input and intrinsic activity
[Figure: retina and V1, linked by synaptic weights; all neural spikes (t) and all synaptic ‘readouts’ (y) through synapses and dendrites; the pdf of all spike times and the pdf of all synaptic ‘readouts’]
If we can make this pdf uniform, then we have a model constructed from all synaptic and dendritic causality.
This SHIFT in looking at the problem alters the question so that, if it is answered, we have an unsupervised theory of ‘whole brain learning’.
Formalisation of the problem:
p is the ‘data’ distribution, q is the ‘model’ distribution,
w is a synaptic weight, and I(y,t) is the spike-synapse mutual information.
IF
THEN if we were doing classical Infomax, we would use the gradient:
BUT if one’s actions can change the data, THEN an extra term appears:
(1)
(2)
The first term is changing one’s model to fit the world; the extra term is changing the world to fit the model. It is easier to live in a world where one can change the world to fit the model, as well as changing one’s model to fit the world; therefore (2) must be easier than (1). This is what we are now researching.