Entropy and Information

For a random variable X with distribution p(x), entropy is given by

H[X] = − Σ_x p(x) log₂ p(x)

“Information” = mutual information: how much knowing the value of one random variable r (the response) reduces uncertainty about another random variable s (the stimulus).

Variability in response is due both to different stimuli and to noise. How much response variability is “useful”, i.e. can represent different messages, depends on the noise. Noise can be specific to a given stimulus.

Need to know the conditional distribution P(s|r) or P(r|s).

Take a particular stimulus s = s₀ and repeat many times to obtain P(r|s₀). Compute the variability due to noise: the noise entropy.

Information is the difference between the total response entropy and the mean noise entropy: I(s;r) = H[P(r)] − Σ_s P(s) H[P(r|s)] .
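To make the definition concrete, here is a minimal numpy sketch (not from the original slides) that evaluates I(s;r) = H[P(r)] − Σ_s P(s) H[P(r|s)] for a small, made-up joint table P(s, r).

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution (zero entries are ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution P(s, r): rows = stimuli, columns = responses.
P_sr = np.array([[0.30, 0.10, 0.05],
                 [0.05, 0.10, 0.40]])

P_s = P_sr.sum(axis=1)                 # marginal over stimuli
P_r = P_sr.sum(axis=0)                 # marginal over responses
P_r_given_s = P_sr / P_s[:, None]      # conditional P(r|s)

H_total = entropy(P_r)                                          # H[P(r)]
H_noise = np.sum(P_s * [entropy(row) for row in P_r_given_s])   # Σ_s P(s) H[P(r|s)]
print(f"I(s;r) = {H_total - H_noise:.3f} bits")
```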

How can one compute the entropy and information of spike trains?

Information in single cells

Strong et al., 1997; Panzeri et al.

Discretize the spike train into binary words w with letter size Δt and word length T. This takes into account correlations between spikes on timescales ≤ T.

Compute p_i = p(w_i); the naïve entropy is then H_naive = − Σ_i p_i log₂ p_i .

Many information calculations are limited by sampling: hard to determine P(w) and P(w|s)

Systematic bias from undersampling.

Correction for finite size effects: estimate the entropy from fractions of the data and extrapolate to infinite data size, e.g. H_est(N) = H_∞ + a/N + b/N².
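A rough sketch (not from the slides) of the word discretization and the naïve entropy estimate; the `words_from_spikes` helper, the Poisson spike train, and all parameter values are made up for illustration. Comparing the estimate on the full data and on half of the data gives a crude feel for the undersampling bias.

```python
import numpy as np
from collections import Counter

def words_from_spikes(spike_times, dt=0.002, T_word=0.016, T_total=10.0):
    """Binarize a spike train into letters of size dt, then group letters into words of length T_word."""
    n_bins = int(round(T_total / dt))
    binary = np.zeros(n_bins, dtype=int)
    binary[np.minimum((np.asarray(spike_times) / dt).astype(int), n_bins - 1)] = 1
    letters = int(round(T_word / dt))
    return [tuple(binary[i * letters:(i + 1) * letters]) for i in range(n_bins // letters)]

def naive_entropy(words):
    """H_naive = -sum_i p_i log2 p_i over the observed word frequencies."""
    counts = np.array(list(Counter(words).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Made-up Poisson spike train, only to exercise the functions.
rng = np.random.default_rng(0)
T_total, rate = 10.0, 40.0
spikes = np.sort(rng.uniform(0, T_total, rng.poisson(rate * T_total)))

w = words_from_spikes(spikes, T_total=T_total)
print("H_naive, full data:", naive_entropy(w))
# Undersampling bias check: the estimate from half the data is typically lower.
print("H_naive, half data:", naive_entropy(w[: len(w) // 2]))
```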

Information in single cells

Strong et al., 1997

Information is the difference between the variability driven by stimuli and that due to noise.

Take a stimulus sequence s and repeat many times.

For each time in the repeated stimulus, collect the set of words across repeats to estimate P(w|s(t)).

Should average over all s with weight P(s); instead, average over time:

H_noise = ⟨ H[P(w|s_i)] ⟩_i .

Choose length of repeated sequence long enough to sample the noise entropy adequately.

Finally, compute the information as a function of word length T and extrapolate to infinite T.
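A sketch of the direct-method bookkeeping under the assumptions above; the function names and the tiny word lists are hypothetical, and a real analysis would still need the finite-size correction and the extrapolation in word length described here.

```python
import numpy as np
from collections import Counter

def entropy_bits(samples):
    """Plug-in entropy (bits) of a list of hashable words."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def direct_information(words_unique, words_repeat, T_word):
    """
    words_unique : words from a long non-repeating stimulus (samples of P(w)).
    words_repeat : list of trials, each a list of words aligned in time across
                   repeats of the same stimulus (samples of P(w|s(t)) at each t).
    Returns an information-rate estimate in bits/s: (H_total - H_noise) / T_word.
    """
    H_total = entropy_bits(words_unique)
    n_times = len(words_repeat[0])
    H_noise = np.mean([entropy_bits([trial[t] for trial in words_repeat])
                       for t in range(n_times)])
    return (H_total - H_noise) / T_word

# Tiny made-up example: words are tuples of letters; 4 repeated trials x 3 time slices.
unique = [(0, 1), (1, 0), (0, 0), (1, 1), (0, 1), (1, 0)]
repeat = [[(0, 1), (1, 0), (0, 0)],
          [(0, 1), (1, 0), (0, 1)],
          [(0, 1), (1, 1), (0, 0)],
          [(0, 1), (1, 0), (0, 0)]]
print("info rate ~", direct_information(unique, repeat, T_word=0.016), "bits/s")
```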

Information in single cells

Reinagel and Reid, ‘00

Information in single cells

Obtain information rate of ~80 bits/sec or 1-2 bits/spike.

How much information does a single spike convey about the stimulus?

Key idea: the information that a spike gives about the stimulus is the reduction in entropy between the distribution of spike times not knowing the stimulus, and the distribution of times knowing the stimulus.

The response to an (arbitrary) stimulus sequence s is r(t).

Without knowing that the stimulus was s, the probability of observing a spike in a given bin is proportional to r̄, the mean rate, and to the size of the bin.

Consider a bin Δt small enough that it can only contain a single spike. Then in the bin at time t, P(spike|s) = r(t) Δt, while without knowledge of the stimulus P(spike) = r̄ Δt.

Information in single cells

Information in single cells

Now compute the entropy difference between the prior and the conditional distribution in each bin:

ΔH(t) = H[P(spike)] − H[P(spike|s)] .

Assuming r̄Δt ≪ 1 and r(t)Δt ≪ 1, and using log₂(1 − x) ≈ −x/ln 2, the no-spike terms cancel when averaged over time.

In terms of information per spike (divide by the mean spike count per bin, r̄Δt):

I_one spike = ⟨ (r(t)/r̄) log₂ ( r(t)/r̄ ) ⟩_t .

Note substitution of a time average for an average over the r ensemble.
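As a sanity check of the formula, here is a small sketch (not from the slides) that evaluates I_one spike from a made-up trial-averaged rate r(t).

```python
import numpy as np

def info_per_spike(rate):
    """
    Single-spike information, I = < (r(t)/rbar) log2(r(t)/rbar) >_t,
    computed from a trial-averaged rate r(t).  Bins with zero rate contribute zero.
    """
    ratio = np.asarray(rate, dtype=float)
    ratio = ratio / ratio.mean()
    safe = np.where(ratio > 0, ratio, 1.0)      # avoid log2(0); those terms are 0 anyway
    return np.mean(np.where(ratio > 0, ratio * np.log2(safe), 0.0))

# Made-up PSTH: a strongly modulated rate sampled every 2 ms.
t = np.arange(0, 2.0, 0.002)
rate = 20.0 * (1.0 + np.sin(2 * np.pi * 3 * t)) ** 2     # spikes/s

I = info_per_spike(rate)
print(f"~{I:.2f} bits/spike, ~{I * rate.mean():.0f} bits/s")
```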

(Figure: the prior and conditional spike-time distributions.)

Information in single cells

Given

I_one spike = ⟨ (r(t)/r̄) log₂ ( r(t)/r̄ ) ⟩_t ,

note that:
• It doesn’t depend explicitly on the stimulus.
• The rate r does not have to be the rate of spikes; it can be the rate of any event.
• Information is limited by spike precision, which blurs r(t), and by the mean spike rate.

Compute as a function of the bin size Δt:

Undersampled for small bins

Information in single cells

An example: temporal coding in the LGN (Reinagel and Reid ‘00)

Information in single cells

Apply the same procedure: collect word distributions for a random, then repeated stimulus.

Information in single cells

Use this to quantify how precise the code is, and over what timescales correlations are important.

Information in single cells

How important is information in multispike patterns?

The information in any given event E can be computed as I(E) = ⟨ (r_E(t)/r̄_E) log₂ ( r_E(t)/r̄_E ) ⟩_t , where r_E(t) is the rate of the event.

Define the synergy, the information gained from the joint symbol:

Syn(E₁,E₂) = I(E₁E₂) − [ I(E₁) + I(E₂) ] ,

or equivalently, Syn = I(E₁;E₂ | s) − I(E₁;E₂), the difference between the mutual dependence of the two events given the stimulus and their overall mutual dependence.

Brenner et al., ’00.

Negative synergy is called redundancy.
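A rough sketch of how one might estimate event information and synergy for spike pairs, in the spirit of Brenner et al. ’00; the helper names, the simulated Poisson trials, and the 10 ms pair interval are all made up, and real estimates would need the sampling corrections discussed above.

```python
import numpy as np

def event_information(event_times_per_trial, T, dt=0.005):
    """
    I(E) = < (r_E(t)/rbar_E) log2(r_E(t)/rbar_E) >_t for an event E, estimated from
    event times collected over repeated presentations of the same stimulus.
    """
    edges = np.arange(0.0, T + dt, dt)
    counts = np.zeros(len(edges) - 1)
    for times in event_times_per_trial:
        counts += np.histogram(times, bins=edges)[0]
    ratio = counts / counts.mean()                 # r_E(t) / rbar_E
    safe = np.where(ratio > 0, ratio, 1.0)
    return np.mean(np.where(ratio > 0, ratio * np.log2(safe), 0.0))

def pair_times(spike_times, interval, tol=0.002):
    """Times of the compound event: two successive spikes separated by `interval` (within tol)."""
    t = np.asarray(spike_times)
    return t[:-1][np.abs(np.diff(t) - interval) < tol]

# Made-up example: 20 repeats of a 2 s stimulus driving a time-locked rate.
rng = np.random.default_rng(1)
T, dt_gen = 2.0, 0.001
tgrid = np.arange(0, T, dt_gen)
rate = 30 * (1 + np.sin(2 * np.pi * 4 * tgrid))
trials = [tgrid[rng.random(len(tgrid)) < rate * dt_gen] for _ in range(20)]

I_spike = event_information(trials, T)
I_pair = event_information([pair_times(s, 0.01) for s in trials], T)
print("synergy =", I_pair - 2 * I_spike, "bits (negative means redundancy)")
```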

Information in single cells: multispike patterns

Brenner et al., ’00.

In the identified neuron H1, compute information in a spike pair, separated by an interval dt:

Information in single cells

Information in patterns in the LGN

Define pattern information as the difference between the extrapolated word information and the information in one letter:

Reinagel and Reid ‘00

Using information to evaluate neural models

We can use the information about the stimulus to evaluate our reduced-dimensionality models.

Information in the timing of 1 spike: by definition,

I_one spike = ⟨ (r(t)/r̄) log₂ ( r(t)/r̄ ) ⟩_t .

Using information to evaluate neural models

Given r(t) = r(s(t)): by Bayes’ rule,

r(s)/r̄ = P(spike|s) / P(spike) = P(s|spike) / P(s) ,

so by definition the time average above can be rewritten as an average over stimuli:

I_one spike = ∫ ds P(s|spike) log₂ [ P(s|spike) / P(s) ] .

So the information in the K-dimensional model is evaluated using the distribution of projections x = (s·v₁, …, s·v_K) (Bayes’ rule plus dimensionality reduction):

I_model = ∫ dᴷx P(x|spike) log₂ [ P(x|spike) / P(x) ] .
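A minimal sketch (assuming a 1D model for simplicity) of evaluating the model information from histograms of the projection x at all times versus at spike times; the Gaussian samples standing in for P(x) and P(x|spike) are made up.

```python
import numpy as np

def model_information(x_all, x_spike, n_bins=40):
    """
    I_model = sum_x P(x|spike) log2[ P(x|spike) / P(x) ], estimated from histograms of
    the stimulus projection x over all times (the prior) and at spike times.
    """
    edges = np.histogram_bin_edges(x_all, bins=n_bins)
    p_prior = np.histogram(x_all, bins=edges)[0].astype(float)
    p_spike = np.histogram(x_spike, bins=edges)[0].astype(float)
    p_prior /= p_prior.sum()
    p_spike /= p_spike.sum()
    mask = (p_spike > 0) & (p_prior > 0)
    return np.sum(p_spike[mask] * np.log2(p_spike[mask] / p_prior[mask]))

# Made-up 1D example: x is the projection of the stimulus onto the STA.
rng = np.random.default_rng(2)
x_all = rng.normal(0.0, 1.0, 100_000)      # P(x): projections at all times
x_spike = rng.normal(1.2, 0.8, 5_000)      # P(x|spike): projections at spike times
print("I_model ~", model_information(x_all, x_spike), "bits/spike")
```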

Using information to evaluate neural models

Here we used information to evaluate reduced models of the Hodgkin-Huxley neuron.

1D: STA only

2D: two covariance modes

Twist model

Adaptive coding

Just about every neuron adapts. Why?

• To stop the brain from pooping out
• To make better use of a limited dynamic range
• To stop reporting already known facts

All reasonable ideas.

What does that mean for coding? What part of the signal is the brain meant to read?

Adaptation can be a mechanism for early sensory systems to make use of statistical information about the environment.

How can the brain interpret an adaptive code? (Figure from The Basis of Sensation, Adrian, 1929.)

Adaptation to stimulus statistics: information

Rate, or spike frequency adaptation is a classic form of adaptation.

Let’s go back to the picture of neural computation we discussed before. Can adapt both:

• the system’s filters
• the input/output relation (threshold function)

Both are observed, and in both cases, the observed adaptations can be thought of as increasing information transmission through the system.

Information maximization as a principle of adaptive coding:

For optimum information transmission, coding strategy should adjustto the statistics of the inputs.

To compute the best strategy, have to impose constraints (Stemmler&Koch)

e.g. the variance of the output, or the maximum firing rate.

Adaptation of the input/output relation

Fly LMC cells.

Measured contrast in natural scenes.

If we constrain the maximum, the solution for the distribution of output symbols is P(r) = constant = 1/r_max. Take the output to be a nonlinear transformation on the input: r = g(s). From P(r) dr = P(s) ds, the optimal nonlinearity is the cumulative distribution of the inputs: g(s) = r_max ∫_{−∞}^{s} P(s′) ds′.

Laughlin ’81.
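A small sketch of Laughlin-style histogram equalization: the optimal input/output curve is the scaled cumulative distribution of the inputs. The gamma-distributed "contrast" samples below are a made-up stand-in for the measured natural-scene contrasts.

```python
import numpy as np

def optimal_nonlinearity(stim_samples, r_max=1.0, n_points=200):
    """
    Histogram equalization: the input/output curve that makes P(r) uniform is the
    scaled cumulative distribution of the inputs, g(s) = r_max * CDF(s).
    """
    s_grid = np.linspace(np.min(stim_samples), np.max(stim_samples), n_points)
    cdf = np.array([np.mean(stim_samples <= s) for s in s_grid])
    return s_grid, r_max * cdf

# Made-up skewed "natural contrast" distribution.
rng = np.random.default_rng(3)
contrasts = rng.gamma(shape=2.0, scale=0.2, size=50_000)

s_grid, g = optimal_nonlinearity(contrasts)
r = np.interp(contrasts, s_grid, g)       # outputs under the optimal curve
print("output quartiles (should be ~uniform):", np.percentile(r, [25, 50, 75]))
```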

Adaptation of filters

Change in retinal filters with different light level and contrast

Changes in V1 receptive fields with contrast

Dynamical adaptive coding

But is all adaptation to statistics on an evolutionary scale?

The world is highly fluctuating. Light intensities vary by 10¹⁰ over a day.

Expect adaptation to statistics to happen dynamically, in real time.

Retina: observe adaptation to variance, or contrast, over 10s of seconds.

Surprisingly slow: contrast gain control effects appear after 100s of milliseconds.

Also observed adaptation to spatial scale on a similar timescale.

Smirnakis et al., ‘97

Dynamical adaptive coding

The H1 neuron of the fly visual system.

Rescales input/output relation with steady state stimulus statistics.

Brenner et al., ‘00

Dynamical adaptive coding

As in the Smirnakis et al. paper, there is rate adaptation in response to the variance change.

Dynamical adaptive coding

This is a form of learning.

Does the timescale reflect the time required to learn the new statistics?

Dynamical adaptive coding

As we have done before, extract the spike-triggered average.

Dynamical adaptive coding

Compute the input/output relations, as we described before:

s = stim · STA ;

P(spike|s) = r_ave · P(s|spike) / P(s)

Do it at different times in the variance modulation cycle.

Find ongoing normalisation with respect to the stimulus standard deviation.
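A sketch of this procedure on simulated data (an LNP-style model neuron, not the fly H1 recordings): project the stimulus onto the STA, estimate P(s|spike)/P(s) by histogramming, and scale by the mean spike probability. Splitting the data by phase of the variance modulation and repeating gives the time-resolved input/output curves; all names and parameters here are illustrative.

```python
import numpy as np

def io_curve(stim, spikes, sta, n_bins=30):
    """
    Input/output relation from projections onto the STA, using Bayes' rule:
    s = stim · STA;  P(spike|s) = P(spike) * P(s|spike) / P(s),
    with P(s) and P(s|spike) estimated by histogramming the projections.
    """
    proj = stim @ sta
    edges = np.histogram_bin_edges(proj, bins=n_bins)
    p_all = np.histogram(proj, bins=edges)[0].astype(float)
    p_spk = np.histogram(proj[spikes.astype(bool)], bins=edges)[0].astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        g = spikes.mean() * (p_spk / p_spk.sum()) / (p_all / p_all.sum())
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, g

# Made-up LNP-style model neuron (white-noise stimulus, exponential filter).
rng = np.random.default_rng(4)
T, D = 50_000, 20
raw = rng.normal(size=T)
stim = np.stack([np.roll(raw, k) for k in range(D)], axis=1)   # stimulus history vectors
drive = stim @ np.exp(-np.arange(D) / 5.0)
spikes = (rng.random(T) < 0.05 * np.clip(drive, 0, None)).astype(int)

sta = stim[spikes.astype(bool)].mean(axis=0)
centers, g = io_curve(stim, spikes, sta)
print("P(spike|s) at first bins:", np.round(g[:5], 4))
# To probe adaptation, split the data by phase of the variance-modulation cycle and
# compute one curve per phase; rescaling s by the local stimulus standard deviation
# should collapse the curves onto each other.
```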

Dynamical adaptive coding

Take a more complex stimulus: randomly modulated white noise.

Not unlike natural stimuli (Ruderman and Bialek ’97)

Find continuous rescaling to the variance envelope.

Dynamical information maximisation

This should imply that information transmission is being maximized. We can compute the information directly and observe the timescale. How much information is available about the stimulus fluctuations? Return to the two-state switching experiment.

Method:

Present n different white noise sequences, randomly ordered, throughout the variance modulation.

Collect word responses indexed by time with respect to the cycle, P(w(t)).

Now divide according to probe identity, and compute I_t(w;s) = H[P(w(t))] − Σ_i P(s_i) H[P(w(t)|s_i)] , with P(s_i) = 1/n.

Similarly, one can compute information about the variance: I_t(w;σ) = H[P(w(t))] − Σ_i P(σ_i) H[P(w(t)|σ_i)] , with P(σ_i) = 1/2.

Convert to information/spike by dividing at each time by mean # of spikes.
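A minimal sketch of the time-resolved information at a single time slice, assuming the probe labels are equally likely; the word and label lists are made up.

```python
import numpy as np
from collections import Counter

def entropy_bits(samples):
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def time_resolved_info(words, labels):
    """
    I_t(w;s) = H[P(w(t))] - sum_i P(s_i) H[P(w(t)|s_i)] at one time slice t.
    `words`  : the word observed at this slice on each trial.
    `labels` : probe identity (or variance label) for each trial; uniform P(s_i) assumed.
    """
    classes = sorted(set(labels))
    H_total = entropy_bits(words)
    H_cond = np.mean([entropy_bits([w for w, l in zip(words, labels) if l == c])
                      for c in classes])
    return H_total - H_cond

# Toy slice: 8 trials, 2 probe stimuli, words are tuples of letters.
words  = [(1, 0), (1, 0), (0, 1), (1, 0), (0, 1), (0, 1), (0, 1), (1, 0)]
labels = [0, 0, 1, 0, 1, 1, 1, 0]
print("I_t =", time_resolved_info(words, labels), "bits")
```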

Tracking information in time

Adaptation and ambiguity

The stimulus normalization is leading to information recovery within 100ms.

If the stimulus is represented as normalised, how are the spikes to be interpreted upstream?

The rate conveys variance information, but with slow timescales.

Tracking information in time: the variance

What conveys variance information?

Where is the variance information and how can one decode it?

Notice that the interspike interval histograms in the different variance regimes are quite distinct.

Need to take log to see this clearly.

Could these intervals provide enough information, rapidly, to distinguish the variance?

Decoding the variance information

Decoding the variance information

Use signal detection theory.

Collect the steady-state interval distributions P(Δ|σ_i).

Then for a given observed interval Δ, compute the likelihood ratio P(Δ|σ₁)/P(Δ|σ₂).

After observing a sequence of intervals Δ₁, …, Δ_n, compute the log-likelihood ratio for the entire sequence: D_n = Σ_k log [ P(Δ_k|σ₁) / P(Δ_k|σ₂) ].

Since we can’t sample the joint distributions, we will assume that the intervals are independent (upper bound).

Now calculate the signal-to-noise ratio of D_n, ⟨D_n⟩²/var(D_n), as a function of n.
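A sketch of the decoding calculation under the independence assumption; here made-up exponential interval densities stand in for the measured steady-state distributions P(Δ|σ_i).

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical steady-state interval distributions for the two variance conditions:
# higher variance -> shorter intervals on average (exponential ISIs, for illustration only).
mean_isi = {1: 0.010, 2: 0.025}   # seconds

def log_likelihood_ratio(intervals):
    """D_n = sum_k log[ P(isi_k|sigma_1) / P(isi_k|sigma_2) ], assuming independent intervals."""
    isi = np.asarray(intervals)
    logp1 = -np.log(mean_isi[1]) - isi / mean_isi[1]   # log of exponential densities
    logp2 = -np.log(mean_isi[2]) - isi / mean_isi[2]
    return np.sum(logp1 - logp2)

def snr_vs_n(n, n_rep=2000):
    """Signal-to-noise ratio <D_n>^2 / var(D_n) when the true condition is sigma_1."""
    D = np.array([log_likelihood_ratio(rng.exponential(mean_isi[1], n)) for _ in range(n_rep)])
    return D.mean() ** 2 / D.var()

for n in (1, 2, 5, 8, 15):
    print(f"n = {n:2d} intervals: SNR = {snr_vs_n(n):.1f}")
```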

Decoding the variance information

On average, the number of intervals required for accurate discrimination is ~5-8.

Adaptive coding: conclusions

Have shown that information is available from the spike train in three forms:

• the timing of single spikes
• the rate
• the local distribution of spike intervals.

The adaptation properties in some systems serve to rapidly maximize information transmission through the system under conditions of changing stimulus statistics.

We demonstrated this for the variance: in other systems one can probe adaptation to more complex stimulus correlations (Meister).

Mechanisms remain open:
• intrinsic properties
• conductance-level learning (Tony, Stemmler & Koch, Turrigiano)
• circuit-level learning (Tony)

Conclusions

Characterising the neural computation: uncovering the richness of single neurons and systems

Using information to evaluate coding

Adaptation as a method for the brain to make use of stimulus statistics
• more examples?
• how is it implemented?

The rate dynamics: what’s going on

• Recall: no fixed timescale
• Consistent with power-law adaptation

Suggests that the rate behaves like fractional differentiation of the log-variance envelope.

Thorson and Biederman-Thorson, Science (1974)

A. Cockroach leg mechanoreceptor, to spinal distortionB. Spider slit sensillum, to 1200 Hz soundC. Stretch receptor of the crayfishD. Limulus eccentric-cell, to increase in light intensity

Fractional differentiation

power-law response to a step:

scaling “adaptive” response to a square wave:

Fourier representation (iω)^α: each frequency component is scaled by ω^α and its phase shifted by a constant phase απ/2.
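A short sketch of a fractional differentiator implemented through the Fourier representation, applied to a made-up square-wave log-variance envelope; α = 0.2 matches the exponent found from the sinusoid experiments below.

```python
import numpy as np

def fractional_derivative(x, alpha, dt=1.0):
    """
    Fractional differentiation of order alpha via the Fourier representation:
    each frequency component is multiplied by (i*omega)^alpha, i.e. its amplitude
    is scaled by omega^alpha and its phase shifted by the constant alpha*pi/2.
    """
    X = np.fft.rfft(x)
    omega = 2 * np.pi * np.fft.rfftfreq(len(x), d=dt)
    X[1:] *= (1j * omega[1:]) ** alpha     # leave the DC component untouched
    return np.fft.irfft(X, n=len(x))

# Made-up log-variance envelope: a slow square wave, as in the switching experiments.
dt = 0.01
t = np.arange(0, 40, dt)
envelope = np.where((t % 20) < 10, 1.0, -1.0)

response = fractional_derivative(envelope, alpha=0.2, dt=dt)
# The response to each step decays roughly as a power law rather than exponentially.
print("response just after a step:", np.round(response[1001:1006], 3))
```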

Linear analysis agrees

• Variance envelope ~ exp[sin t/T] for a range of frequencies 1/T

• Stimulate with a set of sine waves at different frequencies

T = 30s, 60s, 90s

(Figure: measured phase shift.)

Fits pretty well

From the sinusoid experiments, find exponent α ≈ 0.2.

(Figures: two-state switching and three-state switching.)

So it’s a fractional differentiator…

• connects with “universal” power-law behaviour of receptors

• uncommon to see it in a “higher computation”

• functional interpretation: whitening the stimulus spectrum (van Hateren)

• but what’s the mechanism? --- some ideas but we don’t know

• introduces long history dependence: linear realisation of long memory effects

• also has the property of emphasizing rapid changes and extending dynamic range (Adrian)