A COGNITIVE ARCHITECTURE FOR SENSORY PROCESSING …cip2014.conwiz.dk/files/principe_cip2014.pdf ·...

A COGNITIVE ARCHITECTURE

FOR SENSORY PROCESSING

Jose C. Principe, Ph.D.

Computational NeuroEngineering Laboratory (CNEL)

University of Florida

[email protected]

www.cnel.ufl.edu


Acknowledgments

My students:

Rakesh Chalasani

Goktug Cinar

This work was partially supported by ONR grant N00014-10-1-0375

Outline

• Brief overview

• Cognitive Sensory Processing

• Generative Hierarchical Models

• Convolution Models

• Conclusions

Principe J., Chalasani R., “Cognitive Architecture for Sensory Processing”, Proceedings of the IEEE, vol. 102, no. 4, pp. 514-525, 2014

Sensory Processing and Features

• What are good features for sensory processing?

• Note: this is crucial. “You cannot make good omelets with rotten eggs.”

• We really don’t know! Yet we keep searching for them almost exclusively in the sensory space (audio, video), as with SIFT and HOG.

• Does this make sense? Probably not, because the sensed signals are complex and variable (differences in illumination, shape, noise, context), while we need invariants in MODEL space!

• Are there alternatives? Of course there are!

Sensory Processing and Features

• We, biological organisms, have solved this problem long

ago (otherwise we would have been extinct!!!).

• Perception is an ACTIVE PROCESS, while our sensory

signal processing is PASSIVE!

• Fuster: “Perception is memory updating”

• Think of the object background segregation (visual or

auditory)

• “We see what we want to see”…. i.e. our brain

disambiguates the sensory signals according to our

expectations

Sensory Processing and Features

• Hermann von Helmholtz (1821-1894) “ the perceptual system

is an inference engine whose function is to infer the probable

causes of the sensory input”

• Cognitive science has provided an impressive increase in

knowledge of how the brain works. I recommend Joaquin

Fuster as “a must read”

• Cortex and the mind, Memory in the cortex, Prefrontal cortex

• The issue for us, signal processing and machine learning

experts, is how to translate these concepts mathematically

i.e. how to create a computational theory of perception.

• Marr: Vision; Bregman: Computational Auditory Scene Analysis

Neuro Anatomy of Visual System

• We share Helmholtz’ view that cortical function evolved to explain sensory inputs. As such we seek to understand the role of processing and stored experience in a machine learning framework for the decoding of sensory input.

• Therefore the goal is to create computational systems that explain the world using rich internal representations that can be made stable and discriminative for fast recall.

[Figure: hierarchy of visual areas, starting at the retina [1]]

[1] Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb Cortex. 1991 Jan-Feb;1(1):1-47.

Cognitive Model for Object Recognition in Video

Goal: develop a bidirectional, dynamical, adaptive, self-organizing, distributed and hierarchical model for sensory cortex processing using approximate Bayesian inference.


Why Cognitive?

Because it learns and infers autonomously to represent the external world, and uses this knowledge to disambiguate future inputs.

Bidirectional:

Goals/memory from the top levels can be used as time signals and mixed with incoming sensory data. It uses the architecture as constraints for the optimization.


Dynamic Elements:

Long and short term memory (as a signal) for in situ computation.

Naturally handles time with functional mappings, encodes uncertainty, learns online, and implements smooth constraints.


Generative models:

Self-organizing because it predicts the external inputs; recognition becomes an inverse problem.

Efficient way to parameterize the posterior and incorporate priors and uncertainty about causes.

Perceptual inference becomes online inference about latent or hidden causes of inputs.

Learning becomes just optimization of parameters.


Hierarchical and

distributed architecture:

Partitions computation, provides multiple scales for time and space, and creates a uniform architecture (same code) that can be parallelized.

Allows the use of the architecture as a constraint for the optimization.

Cognitive Models for Perception

Previous Work

• Predictive coding as a statistical model

• Olshausen and Field [1996] – Learning sparse codes for natural

images using L1-norm regularization.

• Rao and Ballard [1997] – Dynamic Models with space-time

receptive fields using Kalman filter.

• Lee and Mumford [2003] – Particle filtering for hierarchical

inference with empirical priors using the top-down prediction.

• Friston [2008] – Hierarchical Dynamic Models in generalized

coordinates of motion and empirical priors for continuous-time

signals.


Cognitive Models for Perception

Previous Work

• Deep learning networks use greedy layer-wise unsupervised methods to build a hierarchical model from data. The goal is to learn encoding and decoding concurrently (with different regularizations) using feedforward models

• Restricted Boltzmann Machine RBM – weight sharing

• Auto Encoders – denoising

• Sparsification – predictive sparse decompositions

• Convolution networks can also be used for the full image

• Our approach called Deep Predictive Coding Networks

(DPCN) relies on an efficient inference procedure to get a

more accurate latent representation (no encoder)



Sensory Processing Functional Principles
Hierarchical Dynamical Model with Unknown Inputs

• Generalized state space model with additive noise:

\[ x_t = A x_{t-1} + B u_t + v_t \]
\[ y_t = C x_t + D u_t + n_t \]

\(y_t\) – Observations; \(x_t\) – Hidden states; \(u_t\) – Causal states

• Hidden states model the history and the internal state.

• Causes model the “inputs” driving the system.

• Empirical Bayesian priors create a hierarchical model; the layer on the top tries to predict the causes for the layer below.

• Inferred causes act as observations to the higher linear dynamical system.

• Priors on the causes become empirical and are set by the top-down prediction.

• Causes in a particular layer are updated using the prediction error in the layer below.

• This forms explicit forward and backward connectivity between the layers.
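As a rough illustration of the single-layer generative equations, the following sketch simulates x_t = A x_{t-1} + B u_t + v_t and y_t = C x_t + D u_t + n_t. The function name `simulate_lds`, the toy dimensions, the zero initial state and the Gaussian noise levels are all assumptions made for this example; they are not taken from the talk.

```python
import numpy as np

def simulate_lds(A, B, C, D, u, state_noise=0.01, obs_noise=0.01, seed=0):
    """Simulate x_t = A x_{t-1} + B u_t + v_t and y_t = C x_t + D u_t + n_t."""
    rng = np.random.default_rng(seed)
    T = u.shape[0]
    n_states, n_obs = A.shape[0], C.shape[0]
    x = np.zeros(n_states)                      # assumed zero initial state
    xs = np.zeros((T, n_states))
    ys = np.zeros((T, n_obs))
    for t in range(T):
        v = state_noise * rng.standard_normal(n_states)  # state noise v_t
        n = obs_noise * rng.standard_normal(n_obs)       # observation noise n_t
        x = A @ x + B @ u[t] + v                # hidden-state transition
        ys[t] = C @ x + D @ u[t] + n            # observation equation
        xs[t] = x
    return xs, ys

# Toy example (hypothetical sizes): 4 states, 2 causes, 3 observations.
rng = np.random.default_rng(1)
A = 0.9 * np.eye(4)
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 4))
D = np.zeros((3, 2))
u = np.ones((50, 2))
xs, ys = simulate_lds(A, B, C, D, u)
```

In a hierarchy, the causes `u` inferred at one layer would play the role of the observations `y` for the layer above.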


Sensory Processing Functional Principles
Hierarchical Models


Sensory Processing Functional Principles
Hierarchical Dynamical Model with Unknown Inputs

• On a patch of the video image over time:

• 1) Feature extraction (inferring states - xt) by creating an overcomplete state representation of the dynamics in the patch

• 2) Pooling (inferring causes - ut) to extract invariants on the image across patches.

Computational Model – Single Layer
Feature extraction (inferring states)

Let y be a p-dimensional sequence of a 2D patch from the same location of a video sequence.

To infer the states x we use a dynamic sparse coding (DSC) model that maps y onto an overcomplete dictionary of k filters (k > p).

The energy function is the negative log likelihood

\[ -\log P(y_t, x_t \mid C, A) = E_1(y_t, x_t, C, A) = \tfrac{1}{2}\,\| y_t - C x_t \|_2^2 + \lambda\, \| x_t - A x_{t-1} \|_1 + \gamma\, \| x_t \|_1 \]

This optimization is not trivial because of the two l1 constraints.
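To make the three terms of the state energy concrete, here is a minimal sketch that evaluates it for given vectors. It assumes the 1/2 scaling on the quadratic term that appears in the combined energy functional later in the talk; the function name `dsc_energy` and the toy values are hypothetical.

```python
import numpy as np

def dsc_energy(y, x, x_prev, C, A, lam, gamma):
    """Evaluate 0.5*||y - C x||_2^2 + lam*||x - A x_prev||_1 + gamma*||x||_1."""
    recon = 0.5 * np.sum((y - C @ x) ** 2)             # reconstruction error
    innovation = lam * np.sum(np.abs(x - A @ x_prev))  # sparse state innovations
    sparsity = gamma * np.sum(np.abs(x))               # sparse states
    return recon + innovation + sparsity

# Toy example (hypothetical values): p = 2 observations, k = 3 states.
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
A = np.eye(3)
x = np.array([1.0, 0.0, -2.0])
energy = dsc_energy(C @ x, x, x, C, A, lam=2.0, gamma=0.5)
```

With a perfect reconstruction and an unchanged state, only the sparsity term contributes, which is a quick sanity check on an implementation.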

Computational Model – Single Layer
Sparse Coding in Dynamical Networks




Chalasani, R., and Principe, J.C, “Dynamic Sparse Coding with

Smoothing Proximal Gradient Method", Proc. ICASSP 2014, Florence, Italy


• Example: State estimation with known parameters


[Figure: steady-state rMSE versus observation dimensions for the Kalman Filter, the proposed method, and Sparse Coding]

Synthetic data: Gaussian process generated with sparseness (500 states, 20 non-zero elements). Observation matrix C is Gaussian, A is a permutation matrix.

Sparse Coding using FISTA (fast iterative shrinkage-thresholding algorithm)
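For reference, the static sparse coding baseline can be sketched with the standard FISTA iteration for 0.5*||y - C x||_2^2 + gamma*||x||_1. This is the textbook algorithm, not the dynamic sparse coding method proposed in the talk, and the problem sizes, `gamma` and iteration count below are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*||.||_1 (element-wise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista(y, C, gamma, n_iter=200):
    """Minimize 0.5*||y - C x||_2^2 + gamma*||x||_1 with FISTA."""
    L = np.linalg.norm(C, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(C.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = C.T @ (C @ z - y)         # gradient of the quadratic term at z
        x_new = soft_threshold(z - grad / L, gamma / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum step
        x, t = x_new, t_new
    return x

# Small synthetic problem (hypothetical sizes): 100 states, 5 non-zero.
rng = np.random.default_rng(0)
C = rng.standard_normal((50, 100))
C /= np.linalg.norm(C, axis=0)           # unit-norm columns
x_true = np.zeros(100)
x_true[rng.choice(100, 5, replace=False)] = 1.0
y = C @ x_true
x_hat = fista(y, C, gamma=0.01, n_iter=500)
```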

Computational Model – Single Layer
Pooling (inferring causes)

Learn invariant representations by taking advantage of spatial relationships in local neighborhoods.

A small group of states x representing contiguous patches are added (sum pooled).

Infer the d-dimensional causes by minimizing the energy functional

\[ -\log P(x_t, u_t \mid B) = E_2(u_t, x_t, B) = \sum_{n=1}^{N} \Big( \sum_{k=1}^{K} \big| \gamma_k \cdot x_{t,k}^{(n)} \big| \Big) + \beta\, \| u_t \|_1 \]

\[ \gamma_k = \gamma_0 \left[ \frac{1 + \exp(-[B u_t]_k)}{2} \right] \]

This l1 minimization also indirectly establishes nonlinear relations between causes and states.
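The role of the causes as multiplicative gains on the state sparsity penalty can be sketched as follows. The function names, the array layout (N neighbouring patches by K states) and the toy values are assumptions made for illustration only.

```python
import numpy as np

def cause_gains(u, B, gamma0):
    """gamma_k = gamma0 * (1 + exp(-[B u]_k)) / 2, one gain per state."""
    return gamma0 * (1.0 + np.exp(-(B @ u))) / 2.0

def pooling_energy(X, u, B, gamma0, beta):
    """E2 = sum_n sum_k |gamma_k * x_k^(n)| + beta * ||u||_1.

    X has shape (N, K): N neighbouring patches with K states each.
    B has shape (K, d) and u has shape (d,).
    """
    gains = cause_gains(u, B, gamma0)
    return np.sum(np.abs(gains * X)) + beta * np.sum(np.abs(u))

# Toy example (hypothetical values): N = 2 patches, K = 2 states, d = 3 causes.
X = np.array([[1.0, -2.0],
              [0.0, 3.0]])
B = np.ones((2, 3))
e_off = pooling_energy(X, np.zeros(3), B, gamma0=0.5, beta=0.1)
gains_on = cause_gains(np.ones(3), B, gamma0=0.5)
```

An active cause with positive weights lowers the corresponding gains, so the states it pools are penalized less; this is one way to read the nonlinear coupling between causes and states.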

Computational Model – Single Layer
Interpretation as free energy

The components of the energy functionals form a generative model specified as the log likelihood of observations, causes and parameters as

\[ -\log P(y_t, x_t, u_t, \theta) = -\log P(y_t \mid x_t, u_t, \theta) - \log P(x_t, u_t \mid \theta) - \log P(\theta) \]

\[ = \sum_{n=1}^{N} \Big( \tfrac{1}{2}\, \| y_t^{(n)} - C x_t^{(n)} \|_2^2 + \lambda\, \| x_t^{(n)} - A x_{t-1}^{(n)} \|_1 + \gamma\, \| x_t^{(n)} \|_1 + \sum_{k=1}^{K} \big| \gamma_k\, x_{t,k}^{(n)} \big| \Big) + \beta\, \| u_t \|_1 - \log P(\theta) \]

The latent variables can be efficiently inferred using proximal gradient methods.

Computational Model – Single Layer
Learning the parameters

The model parameters \(\theta = \{A, B, C\}\), with \(\theta_t = \theta_{t-1} + z_t\), are learned by dual estimation on the combined cost function, alternating inference (state estimation) with parameter updating (parameter estimation).

The parameters are updated using gradient descent with an additional temporal smoothness term.

For fixed x and u the gradients can be computed as

\[ \nabla_A E = \mathrm{sign}(x_t - A_t x_{t-1})\, x_t^T + V (A_t - A_{t-1}) \]

\[ \nabla_B E = \big( \exp(-B u_t) \cdot |x_t| \big)\, u_t^T + V (B_t - B_{t-1}) \]

\[ \nabla_C E = (y_t - C_t x_t)\, x_t^T + V (C_t - C_{t-1}) \]

Matrices C and B are column normalized.

[Diagram: State Estimation alternating with Parameter Estimation]
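One parameter-estimation step for the observation matrix could look like the sketch below. It assumes the quadratic term 0.5*||y - C x||_2^2, so the sign and scaling of the data gradient come from differentiating that term; the step size, the smoothness weight `smooth` and the function name `update_C` are hypothetical choices, not values from the talk.

```python
import numpy as np

def update_C(C, C_prev, y, x, step=0.01, smooth=0.1):
    """One gradient-descent step on C for fixed states x, then column normalization."""
    data_grad = -np.outer(y - C @ x, x)       # d/dC of 0.5*||y - C x||^2
    smooth_grad = smooth * (C - C_prev)       # temporal smoothness on the parameters
    C_new = C - step * (data_grad + smooth_grad)
    norms = np.linalg.norm(C_new, axis=0, keepdims=True)
    return C_new / np.maximum(norms, 1e-12)   # unit-norm columns

# Toy example (hypothetical sizes): 3 observations, 4 states.
rng = np.random.default_rng(0)
C0 = rng.standard_normal((3, 4))
C0 /= np.linalg.norm(C0, axis=0)
x = rng.standard_normal(4)
y = rng.standard_normal(3)
C1 = update_C(C0, C0, y, x)
```

In dual estimation this step would alternate with the inference of states and causes, with the parameters held fixed during inference.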

Learned Features and Invariances
Receptive Fields (feature detectors)

• Video database (Van Hateren’s)

• Input: 17 x 17 patches from different video sequences.

• States – 400 dim, causes 100 dim; pooling 2x2


Measurement Matrix C –bases

Each small block is a column

Layer 1 causes (matrix B) are

composed from neighborhoods

Learned Features and Invariances
Receptive Fields (feature detectors)


When a receptive field is active (left) the model predicts it will be active later (right)

Scatter plot of the 15 strongest connections of matrix A from current time

to next time (bars are pi/6).

t t+1,t+2,….

Learned Features and Invariances
Gabor modeling of the receptive fields

Each element (receptive field of 15x15) in C is fit with a Gabor function parameterized by center position, spatial orientation, and frequency.

Visualizing Invariances

From states to causes (orientation)

• Connection strength between first layer invariance matrix (B) and the observation matrix (C). Each subplot is one column of B when one column of C is active, and strength is color coded.



Visualizing Invariances
From states to causes (frequencies)

• Connection strength between first layer invariance matrix (B) and the observation matrix (C). Each subplot is one column of B when one column of C is active, and strength is color coded.

Classification Results
Single layer representation (inferred causes)

Dataset     FS      DSC     DSC-I   ConvNN
COIL-100    66.87   71.81   74.63   71.49
Animal      76.09   82.43   85.82   ---


12x12 patches, 2x2 pooling and a SVM classifier (4 labeled frames)

Multi-Layered Architecture


Tree structure with tiling

of scene at bottom

Computational model is uniform within a layer and across layers

Different spatial scales due to pooling, which also slows the time scale in upper layers

Learning is greedy (one layer at a time)

This creates a Markov chain across layers

Multi-Layered Architecture


Notice that the top layer predictions affect the lower layer, effectively creating constraints due to the topology.

Inference with Top-Down Connections


Top-down Influence

Predictions from the

top-layer dynamic

model non-linearly

enter the bottom-layer

inference.

Each frame is 32x32 pixels and each sequence is 100 frames long (30,000 long by concatenation)

Layer 1

Divide each frame into 20x20 (states 12x12)

Pool 2x2 states into one cause (Invariant unit)

Dimensions: States – 100; Causes – 40

Layer 2 Consider the causes in layer-1 as inputs

Pool 2x2 states into one cause

Dimensions: States – 60; Causes - 3

Recognition of Sequences in Noisy Data


[Figure panels: Layer-1 Causes; Layer-2 Causes]

Recognition of Sequences in Noisy Data

• Top-down influence: \(u_t^{(L+1)} = u_{t-1}^{(L)}\)

• 2-layered network with dimensions (100, 40) in layer 1 and (60, 3) in layer 2


[Figure: inferred causes for Object 1, Object 2 and Object 3 in three panels: Bottom-up (No Noise), Bottom-up (With Noise), Top-down (With Noise). Inputs: clean video and corrupted video (SNR = -1.2 dB).]

Scalable Architecture with Convolutional Models

• Advantage of Convolutional model:

• Scalable to large images.

• Invariance to translations.

• Efficient implementation using GPUs.

• Main components:

• Convolutional sparse coding in dynamic networks using

Convolutional FISTA [IJCNN, 2013] to infer states.

• Pooling/Unpooling between the states and the causes.

• Convolutional sparse coding to infer causes.


Chalasani R., Principe J., Ramakrishnan N., “A Fast Proximal Method for

Convolutional Sparse Coding”, Proc. IEEE IJCNN, Austin, Tx, 2013

Scalable Architecture with Convolutional Dynamical

Models (CDNs)


[Figure: single layer model. RGB input, filters, pooling and unpooling, filters.]

Convolutional Dynamical Models
State Space Equations

• Each channel \(I_t^m\) is modeled as a linear combination of K matrices convolved with filters \(C_{m,k}\)

• \(a_{k,k'}\) are the lateral connections and here we make \(a_{k,k'} = 1\) for \(k = k'\) because of the application (object recognition)

\[ I_t^m = \sum_{k=1}^{K} C_{m,k} * X_t^k + N_t^m, \qquad m \in \{1, 2, \dots, M\} \]

\[ X_t^k(i,j) = \sum_{k'=1}^{K} a_{k,k'}\, X_{t-1}^{k'}(i,j) + V_t^k(i,j) \]
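The two convolutional state-space equations can be sketched, without the noise terms, as below. The helper names, the use of a "full" convolution and the toy sizes are assumptions for illustration; the plain NumPy loop is chosen for clarity rather than speed.

```python
import numpy as np

def conv2d_full(x, h):
    """Full 2-D convolution of map x with filter h (plain NumPy loops)."""
    H, W = x.shape
    kh, kw = h.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(kh):
        for j in range(kw):
            out[i:i + H, j:j + W] += h[i, j] * x   # shifted, weighted copy of x
    return out

def synthesize_channel(filters, state_maps):
    """Noise-free I^m = sum_k C_{m,k} * X^k for a single channel m."""
    return sum(conv2d_full(X, C) for C, X in zip(filters, state_maps))

def propagate_states(state_maps, a):
    """Noise-free X_t^k = sum_k' a[k, k'] X_{t-1}^{k'}, maps stacked on axis 0."""
    return np.tensordot(a, state_maps, axes=([1], [0]))

# Toy example (hypothetical sizes): K = 2 state maps of 4x4, 3x3 filters.
rng = np.random.default_rng(0)
maps = rng.standard_normal((2, 4, 4))
filters = rng.standard_normal((2, 3, 3))
image = synthesize_channel(filters, maps)        # 6x6 reconstructed channel
next_maps = propagate_states(maps, np.eye(2))    # identity lateral connections
```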

Convolutional Dynamical Models
Optimization

• Energy function for state maps (x is a matrix):

• Energy function for cause maps (x is pooled):


Convolutional Dynamical Models
Implementation

• This convolution dynamical model can be stacked in trees

as before

• The internal connectivity between inputs and states can

be made sparse

• We decrease the model size in the hierarchy by using

max pooling between the state and causes (and

unpooling in the reverse direction)

• Inference is done as before (alternate fixing causes and

states)

• Parameter learning is done as before

• FISTA was extended to convolution networks
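The pooling and unpooling between states and causes can be sketched as a max pooling that remembers where each maximum came from, so that the reverse direction can place values back at those positions. This is a generic sketch; whether the original implementation pools raw or absolute values, and how it stores the positions, is not stated in the slides. The map height and width are assumed to be divisible by the pool size.

```python
import numpy as np

def max_pool(x, size=2):
    """Max pooling over non-overlapping size x size blocks; returns values and argmax indices."""
    H, W = x.shape
    blocks = x.reshape(H // size, size, W // size, size).transpose(0, 2, 1, 3)
    flat = blocks.reshape(H // size, W // size, size * size)
    idx = flat.argmax(axis=2)            # position of the maximum inside each block
    return flat.max(axis=2), idx

def unpool(pooled, idx, size=2):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    h, w = pooled.shape
    flat = np.zeros((h, w, size * size))
    rows, cols = np.indices((h, w))
    flat[rows, cols, idx] = pooled
    blocks = flat.reshape(h, w, size, size).transpose(0, 2, 1, 3)
    return blocks.reshape(h * size, w * size)

# Toy 4x4 state map (hypothetical values).
x = np.array([[1.0, 2.0, 0.0, 0.0],
              [3.0, 4.0, 0.0, 5.0],
              [0.0, 0.0, 7.0, 0.0],
              [6.0, 0.0, 0.0, 0.0]])
pooled, idx = max_pool(x)
restored = unpool(pooled, idx)
```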


Convolutional Dynamical Models
Inference in the Hierarchy

• To simplify inference, the state-space model at each layer predicts the most likely cause at the layer below (\(U_{l-1,t}\)), given only the previous states and the predicted causes from the layer above


Convolutional Dynamical Models
Learning in the Hierarchy

• Learning is done layer by layer starting from the

bottom

• To simplify learning, we do not consider any top down

connections for inference

• Filters are normalized to unit norm after learning

• The gradients are


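\[ \nabla_{C^{I}_{m,k'}} E_I = -2\, X_t^{k',I} * \Big( I_t^m - \sum_{k=1}^{K} C_{m,k} * X_t^{k,I} \Big) \]

\[ \nabla_{B^{I}_{m,d'}} E_I = -U_t^{d',I} * \Big[ \exp\Big\{ -\Big( \sum_{d=1}^{D} B_{k',m} * U_t^{d,I} \Big) \Big\} \cdot \mathrm{down}\big( X_t^{k',I} \big) \Big] \]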

Object Recognition- Training

• Learning on Van Hateren

natural video database

(128x128).

• Architecture:

• Layer 1: 16 states of 7x7

filters and 32 causes of

6x6 filters.

• Layer 2: 64 states of 7x7

filters and 128 causes.

• Pooling: 2 x 2 between

states and causes.


Layer 2 - Causes

Layer 1 - Causes

Self-Taught Learning


With the parameters learned from Van Hateren, classify images in

Caltech 101.

Extract features from a single bottom up inference (causes from both

layer 1 and 2 that are concatenated into a feature vector)

30 images for training, and for testing.

Object Recognition with Context


COIL-100 dataset:

72 frames per object.

Top-down inference is run

over each sequence

We assume that the test

data is partially available

during training.

So called “transductive”

learning.

Four frames per object for

training a linear SVM.

(0°, 90°, 180°, 270°)

Object Recognition- Results


Method                                                             Accuracy (%)
View-tuned network (VTU) [Wersing & Korner, 2003]                  79.10
Convolutional Nets with temporal coherence [Mobahi et al, 2009]    92.25
Stacked ISA with temporal coherence [Zou et al, 2012]              87.00
Our method, without temporal coherence                             79.45
Our method, with temporal coherence                                94.41
Our method, with temporal coherence + Top-down                     98.34

Testing Discriminability in Object Recognition


Honda/UCSD face data set (20 for training, 39 for testing) using the Viola-Jones face finding algorithm (on 20x20 patches). Histogram equalization is done. Two-layer model with dimensions (16, 48) in layer 1 and (64, 100) in layer 2, 5x5 filters, causes concatenated as features.

Testing Discriminability in Object Recognition


YouTube Celebrity face data set (the data are partitioned into 10 sets of 9 videos; 3 of each are used for training and 6 for testing) using the Viola-Jones face finding algorithm (30x30 patches). Average results plotted. Model of the same size but filters are 7x7. Histogram equalization is done.

Discriminability with Occlusion


Layer -2 Causes

Layer -1 Causes

Layer -1 States

Example Video frames

[VidTIMIT]

Extension to Auditory Processing

• We would like to show the “universality” of this type of modeling

approach by addressing auditory processing.

• Temporal theory: information in sound is coded in the temporal firing patterns of the auditory neurons connecting to the cochlea.

• Place theory: perception of sound depends on the location which vibrates in response to the sound along the basilar membrane.

• Consequently, dynamical systems would be a great fit. Audio streams can also be modeled as a single source, so they are easier to explain.

• Simplify the modeling approach using Kalman filters for

efficiency

• Integrate auditory and visual processing through the causes

stored in a content addressable memory (CAM).
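Since the auditory model relies on Kalman filters, a single predict/update step of the standard Kalman filter for one linear dynamical system is sketched below for reference. This is the textbook filter for x_t = A x_{t-1} + w_t, y_t = C x_t + n_t, not the nested hierarchical system itself; the noise covariances `Q` and `R` and the scalar example are assumptions.

```python
import numpy as np

def kalman_step(x, P, y, A, C, Q, R):
    """One predict/update step of the standard Kalman filter."""
    # Predict the state and its covariance.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update with the new observation y.
    S = C @ P_pred @ C.T + R                    # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new

# Scalar example (hypothetical values): static state, unit observation noise.
I1 = np.eye(1)
x1, P1 = kalman_step(np.zeros(1), I1, np.array([2.0]),
                     A=I1, C=I1, Q=np.zeros((1, 1)), R=I1)
```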


Nested Hierarchical Linear Dynamical

System (HLDS)

The linear model consists of one

measurement equation and multiple state

transition equations.

By design the top layer creates point

attractors (Brownian state) to extract

redundancies in the sound time structure.

The nested HLDS is driven bottom-up by

the observations, and top-down by the

states so indirectly it segments the input in

spectral uniform regions.


Cinar G., Príncipe J., “Clustering of Time Series Using a Hierarchical

Linear Dynamical System”, in Proc. ICASSP 2014, Florence, Italy

Point Attractors for Trumpet Notes

• Train with audio samples from Univ. of Iowa Musical Instrument notes (2 sec sustained notes) in the range E3-D6 for the nonvibrato Trumpet.

• The algorithm organizes in an unsupervised fashion the

different time structure of notes into point attractors in the

state space of the highest layer (Hopfield network).


Recognition Through Clustering

• Model system: 60,10,3 states, 36 msec windows, 1024

FFTs. System is real time.

• To assess classification accuracy we do Monte-Carlo runs

through all 35 notes (randomized order).

• Convergence is declared with 4 consecutive decisions to

the same output space location (clustering).

• The performance is tested for three noise levels.


                                                   35 dB SNR   15 dB SNR   -5 dB SNR
Classification Accuracy using Variance Test        96.94%      87.94%      5.43%
Classification Accuracy using all time instances   79.70%      71.95%      5.17%

Performance in Music Clips

• This setup of isolated notes is not ‘practical’. In a real life

scenario it would be almost impossible to find a music piece

that consists of notes sustained for 2 seconds.

• We started testing the performance of the algorithm in

very short music clips.

• We used the first two verses of Beethoven Symphony

#5.

• We create the clip from real trumpet recordings.

Performance in Music Clips

We play this motif

under different

conditions.

3 different levels of

amplitude

magnitudes.

Crescendo

Decrescendo

Different Tempos

Observe that the

same trajectory is

followed all through

these dynamic

changes.

Yesterday by The Beatles

• We created the Beatles song “Yesterday” using the Trumpet

notes in the database.

• We apply the same simple convergence criterion. Once

unanimous decisions are disrupted, windows are labeled as

“undecided”.

• Once the next convergence is declared, we go back in time

and fix the undecided windows accordingly.
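The convergence and back-filling rule described above can be sketched as follows. The sketch assumes that each window yields one cluster decision, that convergence needs four consecutive identical decisions (as stated for the isolated-note experiment), and that undecided windows are assigned to the next note that converges; the function name and the `None` marker for "undecided" are hypothetical.

```python
def label_windows(decisions, needed=4):
    """Label windows by note; undecided windows (None) are back-filled at the next convergence."""
    labels = [None] * len(decisions)
    current = None     # note currently declared, if any
    run_start = 0      # first window of the current run of identical decisions
    pending = 0        # first window that has not been labelled yet
    for i, d in enumerate(decisions):
        if i > 0 and d != decisions[i - 1]:
            run_start = i                      # a new run of decisions begins
        if current is not None and d != current:
            current = None                     # unanimity disrupted: undecided
        if current is None and i - run_start + 1 >= needed:
            current = d                        # convergence declared
            for j in range(pending, i + 1):
                labels[j] = d                  # go back and fix undecided windows
        elif current is not None:
            labels[i] = d
        if current is not None:
            pending = i + 1
    return labels

# Per-window cluster decisions for two notes with a noisy transition.
example = label_windows([1, 1, 1, 1, 1, 2, 3, 2, 2, 2, 2, 2])
```

Windows at the end of a clip that never reach a new convergence stay undecided in this sketch, which matches the remark that notes shorter than the convergence time are misclassified.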

-What the

algorithm “hears”

Yesterday by The Beatles

• Each window is played several times and clustering averaged

(can be parallelized for real time)

• After post-processing, the classification accuracy goes as high

as 93% (notes shorter than the convergence time are

misclassified).

• Clean notes from database are concatenated


CONCLUSIONS
High Level

• We have developed a preliminary version of a computational,

adaptive, self organized distributed and hierarchical model for

episodic memory in sensory processing.

• The Bayesian framework is general and very flexible to explain

the data. However the computational complexity is still huge and

requires specific hardware architectures.

• The sparseness constraint was critical to disambiguate time and

spatial features in video, while adaptive pooling was critical to

link model parameters to object and video invariant features.

• Preliminary results show that our model is capable of high

performance in video processing and it is very robust to

structured noise.

• The same basic principles were extended to auditory processing

(no sparseness), which shows their large appeal.


CONCLUSIONS

Specific novel aspects in the approach

• Paradigm shift in sensory processing: Switched from input

space feature design to discriminative model design to explain

the sensory data with sparse and invariant representations.

• The computational framework emphasizes self-organization

and distributed feedback between bottom up and top down

processing. It blends working memory (states) and long term

memory (parameters). It is ready to take advantage of prior

knowledge (cognitive memory).

• Illustrates the importance of dynamic modeling to describe

spatio-temporal data (video). Continuity over time simplifies

processing!


CONCLUSIONS

Specific novel aspects in the modeling

• New methodology to perform sparse coding with dynamic

models and showed its importance

• New methodology to learn invariant representations rather

than doing max or average pooling and showed

improvement in classification

• New methodology to implement top-down connections

that can disambiguate the object of interest in the video.

• Validated the role of dynamics to help deal with structured

noise in a hierarchical model

Future Directions

• Audio- Video Fusion
