A COGNITIVE ARCHITECTURE FOR SENSORY PROCESSING …cip2014.conwiz.dk/files/principe_cip2014.pdf ·...
A COGNITIVE ARCHITECTURE
FOR SENSORY PROCESSING
Jose C. Principe, Ph.D.
Computational NeuroEngineering Laboratory (CNEL)
University of Florida
www.cnel.ufl.edu
Acknowledgments
My students:
Rakesh Chalasani
Goktug Cinar
This work was partially supported by ONR grant N00014-10-1-0375
Outline
• Brief overview
• Cognitive Sensory Processing
• Generative Hierarchical Models
• Convolution Models
• Conclusions
Principe J., Chalasani R., “Cognitive Architecture for Sensory
Processing”, Proceedings of the IEEE, vol. 102, no. 4, pp. 514-525, 2014
Sensory Processing and Features
• What are good features for sensory processing?
• Note: this is crucial. “You cannot make good omelets with rotten eggs.”
• We really don’t know, but we keep using almost exclusively the sensory space (audio, video) to find them (SIFT, HOG).
• Does this make sense? Probably not, because of the complexity and variability of the sensed signals (differences in illumination, shape, noise, context), while we need invariants in MODEL space!
• Are there alternatives? Of course there are!
Sensory Processing and Features
• We, biological organisms, have solved this problem long
ago (otherwise we would have been extinct!!!).
• Perception is an ACTIVE PROCESS, while our sensory
signal processing is PASSIVE!
• Fuster: “Perception is memory updating”
• Think of the object background segregation (visual or
auditory)
• “We see what we want to see”…. i.e. our brain
disambiguates the sensory signals according to our
expectations
Sensory Processing and Features
• Hermann von Helmholtz (1821-1894) “ the perceptual system
is an inference engine whose function is to infer the probable
causes of the sensory input”
• Cognitive science has provided an impressive increase in
knowledge of how the brain works. I recommend Joaquin
Fuster as “a must read”
• Cortex and the mind, Memory in the cortex, Prefrontal cortex
• The issue for us, signal processing and machine learning
experts, is how to translate these concepts mathematically
i.e. how to create a computational theory of perception.
• Marr: Vision; Bregman: Computational Auditory Scene Analysis
Neuro Anatomy of Visual System
• We share Helmholtz’ view that cortical function evolved to explain sensory inputs. As such we seek to understand the role of processing and stored experience in a machine learning framework for the decoding of sensory input.
• Therefore the goal is to create computational systems that explain the world using rich internal representations that can be made stable and discriminative for fast recall.
[Figure label: RETINA]
1 Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb Cortex. 1991 Jan-Feb;1(1):1-47.
Cognitive Model for Object Recognition in Video
Goal: develop a bidirectional, dynamical, adaptive, self-organizing, distributed and hierarchical model for sensory cortex processing using approximate Bayesian inference.
Cognitive Model for Object Recognition in Video
Why Cognitive?
Because it learns and infers autonomously to represent the external world, and uses this knowledge to disambiguate future inputs.
Bidirectional:
Goals/memory from the top levels can be used as time signals and mixed with incoming sensory data. It uses the architecture as constraints for the optimization.
Cognitive Model for Object Recognition in Video
Dynamic Elements:
Long and short term memory (as a signal) for in situ computation.
Naturally handles time with functional mappings, encodes uncertainty, learns online, and implements smoothness constraints.
Cognitive Model for Object Recognition in Video
Generative models:
Self-organizing because it predicts the external inputs; recognition becomes an inverse problem.
Efficient way to parameterize the posterior and incorporate priors and uncertainty about causes.
Perceptual inference becomes online inference about latent or hidden causes of inputs.
Learning becomes just optimization of parameters.
Cognitive Model for Object Recognition in Video
Hierarchical and distributed architecture:
Partitions computation, provides multiple scales for time and space, and creates a uniform architecture (same code) that can be parallelized.
Allows the use of the architecture as a constraint for the optimization.
Cognitive Models for Perception
Previous Work
• Predictive coding as a statistical model
• Olshausen and Field [1996] – Learning sparse codes for natural
images using L1-norm regularization.
• Rao and Ballard [1997] – Dynamic Models with space-time
receptive fields using Kalman filter.
• Lee and Mumford [2003] – Particle filtering for hierarchical
inference with empirical priors using the top-down prediction.
• Friston [2008] – Hierarchical Dynamic Models in generalized
coordinates of motion and empirical priors for continuous-time
signals.
CNEL, University of Florida, University of Florida 13
Cognitive Models for Perception
Previous Work
• Deep learning networks use greedy layer-wise unsupervised methods to build a hierarchical model from data. The goal is to learn encoding and decoding concurrently (with different regularization) using feedforward models
• Restricted Boltzmann Machine RBM – weight sharing
• Auto Encoders – denoising
• Sparsification – predictive sparse decompositions
• Convolution networks can also be used for the full image
• Our approach called Deep Predictive Coding Networks
(DPCN) relies on an efficient inference procedure to get a
more accurate latent representation (no encoder)
Sensory Processing Functional Principles
Hierarchical Dynamical Model with Unknown Inputs
• Generalized state space model with additive noise:
$$x_t = A x_{t-1} + B u_t + v_t$$
$$y_t = C x_t + D u_t + n_t$$
$y_t$ – Observations; $x_t$ – Hidden states; $u_t$ – Causal states
• Hidden states model the history and the internal state.
• Causes model the “inputs” driving the system.
• Empirical Bayesian priors create a hierarchical model: the layer on the top tries to predict the causes for the layer below.
• Inferred causes act as observations to the higher linear dynamical system.
• Priors on the causes become empirical and are set by the top-down prediction.
• Causes in a particular layer are updated using the prediction error in the layer below.
• This forms explicit forward and backward connectivity between the layers.
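As a rough illustration of the state-space model on this slide, the Python sketch below simulates a single layer driven by a sparse cause sequence. The time-index convention (x_t computed from x_{t-1}), the dimensions, the noise level, and the name `simulate_layer` are assumptions made for the example and are not taken from the slides.

```python
import numpy as np

def simulate_layer(A, B, C, D, causes, noise_std=0.01, seed=0):
    """Simulate x_t = A x_{t-1} + B u_t + v_t and y_t = C x_t + D u_t + n_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[0])
    states, observations = [], []
    for u in causes:  # one cause vector per time step
        x = A @ x + B @ u + noise_std * rng.standard_normal(A.shape[0])
        y = C @ x + D @ u + noise_std * rng.standard_normal(C.shape[0])
        states.append(x)
        observations.append(y)
    return np.array(states), np.array(observations)

# Hypothetical sizes: 4 hidden states, 2 causes, 3 observed dimensions.
rng = np.random.default_rng(1)
A = 0.9 * np.eye(4)
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 4))
D = np.zeros((3, 2))
causes = np.zeros((50, 2))
causes[10:20, 0] = 1.0  # the first cause is active only for a short interval
X, Y = simulate_layer(A, B, C, D, causes)
```

In a hierarchy, the cause sequence inferred at one layer would play the role of `Y` for the layer above.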
Sensory Processing Functional Principles
Hierarchical Models
Sensory Processing Functional Principles
Hierarchical Dynamical Model with Unknown Inputs
• On a patch of the video image over time:
• 1) Feature extraction (inferring states - xt) by creating an overcomplete state representation of the dynamics in the patch
• 2) Pooling (inferring causes - ut) to extract invariants on the image across patches.
Computational Model – Single Layer
Feature extraction (inferring states)
Let y be a p-dimensional sequence of a 2D patch from the same location of a video sequence.
To infer the states x we use a dynamic sparse coding (DSC) model that maps y onto an overcomplete dictionary of k filters (k > p).
The energy function is the negative log likelihood:
$$-\log P(y_t, x_t \mid C, A) = E_1(y_t, x_t, C, A) = \|y_t - C x_t\|_2^2 + \lambda \|x_t - A x_{t-1}\|_1 + \gamma \|x_t\|_1$$
This optimization is not trivial because of the two ℓ1 constraints.
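One way to handle the two ℓ1 terms is to smooth the dynamic term and keep a proximal (soft-threshold) step for the sparsity term, in the spirit of the smoothing proximal gradient method cited on the following slides. The sketch below is a simplified illustration under assumptions: Huber-style smoothing with parameter `mu`, a fixed step size, and no acceleration. The function names and default values are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def dsc_energy(y, x, x_prev, C, A, lam, gam):
    """E1 = ||y - Cx||_2^2 + lam * ||x - A x_prev||_1 + gam * ||x||_1."""
    return (np.sum((y - C @ x) ** 2)
            + lam * np.sum(np.abs(x - A @ x_prev))
            + gam * np.sum(np.abs(x)))

def infer_state(y, x_prev, C, A, lam=0.1, gam=0.1, mu=0.01, n_iter=300):
    """Proximal gradient descent on E1 with the dynamic l1 term smoothed."""
    pred = A @ x_prev
    # Lipschitz constant of the smooth part: data term plus smoothed l1 term.
    lipschitz = 2.0 * np.linalg.norm(C, 2) ** 2 + lam / mu
    x = np.zeros(C.shape[1])
    for _ in range(n_iter):
        grad_fit = -2.0 * C.T @ (y - C @ x)
        grad_dyn = lam * np.clip((x - pred) / mu, -1.0, 1.0)  # Huber gradient
        x = soft_threshold(x - (grad_fit + grad_dyn) / lipschitz, gam / lipschitz)
    return x
```

Smaller `mu` follows the original ℓ1 term more closely but forces a smaller step, which is the usual trade-off with this kind of smoothing.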
Computational Model – Single Layer
Sparse Coding in Dynamical Networks
Chalasani, R., and Principe, J.C, “Dynamic Sparse Coding with
Smoothing Proximal Gradient Method", Proc. ICASSP 2014, Florence, Italy
Computational Model – Single Layer
Sparse Coding in Dynamical Networks
• Example: State estimation with known parameters
[Figure: steady-state rMSE versus observation dimensions (20 to 100) for the Kalman Filter, the Proposed method, and Sparse Coding]
Synthetic data: Gaussian process generated with sparseness (500 states, 20 non-zero elements). Observation matrix C is Gaussian, A is a permutation matrix.
Sparse Coding using FISTA (fast iterative shrinkage-thresholding algorithm)
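For reference, the sparse coding baseline named above can be sketched with a textbook FISTA loop. This is a generic version for the objective 0.5·||y − Cx||² + γ||x||₁, which is an assumed form of the baseline; the function names and defaults are hypothetical and nothing here reproduces the experiment in the figure.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(y, x, C, gam):
    return 0.5 * np.sum((y - C @ x) ** 2) + gam * np.sum(np.abs(x))

def fista(y, C, gam=0.05, n_iter=500):
    """FISTA for min_x 0.5 * ||y - Cx||_2^2 + gam * ||x||_1."""
    lipschitz = np.linalg.norm(C, 2) ** 2  # Lipschitz constant of the gradient
    x = np.zeros(C.shape[1])
    z = x.copy()   # extrapolated point
    t = 1.0
    for _ in range(n_iter):
        grad = C.T @ (C @ z - y)
        x_new = soft_threshold(z - grad / lipschitz, gam / lipschitz)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)  # momentum step
        x, t = x_new, t_new
    return x
```

Unlike the dynamic model, this baseline ignores the previous state entirely and codes each observation independently.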
Computational Model – Single Layer
Pooling (inferring causes)
Learn invariant representations by taking advantage of spatial relationships in local neighborhoods.
A small group of states x representing contiguous patches are added (sum pooled).
Infer the d-dimensional causes by minimizing the energy functional
$$-\log P(x_t, u_t \mid B) = E_2(u_t, x_t, B) = \sum_{n=1}^{N} \left( \sum_{k=1}^{K} \left| \gamma_k \cdot x_{t,k}^{(n)} \right| \right) + \beta \|u_t\|_1, \qquad \gamma_k = \gamma_0 \left[ \frac{1 + \exp(-[B u_t]_k)}{2} \right]$$
This ℓ1 minimization also indirectly establishes nonlinear relations between causes and states.
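With the states held fixed, the cause energy is a smooth function of u plus an ℓ1 penalty, so a proximal gradient loop is one plausible way to minimize it. The sketch below assumes that form (γ_k = γ0·(1 + exp(−[Bu]_k))/2 weighting the sum-pooled state magnitudes) and uses a simple step-halving rule; the function names, defaults, and the step rule are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def cause_energy(u, pooled, B, gamma0, beta):
    """E2 with states fixed; pooled[k] = sum over patches n of |x_k^(n)|."""
    gamma = 0.5 * gamma0 * (1.0 + np.exp(-(B @ u)))
    return float(gamma @ pooled + beta * np.sum(np.abs(u)))

def infer_causes(pooled, B, gamma0=1.0, beta=0.1, n_iter=100):
    """Proximal gradient on E2; a step is accepted only if E2 does not increase."""
    u = np.zeros(B.shape[1])
    step = 1.0
    for _ in range(n_iter):
        grad = -0.5 * gamma0 * (B.T @ (np.exp(-(B @ u)) * pooled))
        e_old = cause_energy(u, pooled, B, gamma0, beta)
        while step > 1e-10:
            u_new = soft_threshold(u - step * grad, step * beta)
            if cause_energy(u_new, pooled, B, gamma0, beta) <= e_old:
                u = u_new
                break
            step *= 0.5  # shrink the step until the energy stops increasing
    return u
```

A cause that becomes active lowers γ_k for the states it connects to through B, which is how the inferred causes relax the sparsity penalty on groups of states.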
Computational Model – Single Layer
Interpretation as free energy
The components of the energy functionals form a generative model specified as the log likelihood of observations, causes and parameters as
$$-\log P(y_t, x_t, u_t, \theta) = -\log P(y_t \mid x_t, u_t, \theta) - \log P(x_t, u_t \mid \theta) - \log P(\theta)$$
$$= \sum_{n=1}^{N} \left[ \frac{1}{2} \left\| y_t^{(n)} - C x_t^{(n)} \right\|_2^2 + \lambda \left\| x_t^{(n)} - A x_{t-1}^{(n)} \right\|_1 + \gamma \|x_t\|_1 + \left( \sum_{k=1}^{K} \left| \gamma_{t,k}\, x_{t,k}^{(n)} \right| \right) \right] + \beta \|u_t\|_1 - \log P(\theta)$$
The latent variables can be efficiently inferred using proximal gradient methods.
Computational Model – Single Layer
Learning the parameters
The model parameters are learned by dual estimation on the combined cost function, alternating inference with parameter updating.
The parameters are updated using gradient descent with an additional temporal smoothness term.
For fixed x and u the gradients can be computed as
$$\nabla_A E = \operatorname{sign}(x_t - A_t x_{t-1})\, x_t^T + \varsigma (A_t - A_{t-1})$$
$$\nabla_B E = \left( \exp(-B u_t) \cdot x_t \right) u_t^T + \varsigma (B_t - B_{t-1})$$
$$\nabla_C E = (y_t - C_t x_t)\, x_t^T + \varsigma (C_t - C_{t-1})$$
with $\theta = \{A, B, C\}$ and $\theta_t = \theta_{t-1} + z_t$.
Matrices C and B are column normalized.
[Diagram labels: State Estimation, Parameter Estimation]
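The alternation between state estimation and parameter estimation can be sketched in a reduced setting. The example below learns only the observation matrix C for a static sparse code, with plain ISTA for inference; it omits A, B, and the temporal smoothness term, and its gradient is derived directly from 0.5·||Y − XCᵀ||², so signs and constants may differ from the slide's shorthand. All names and hyperparameters are assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def infer_states(Y, C, gam, n_iter=50):
    """ISTA for 0.5 * ||y - Cx||^2 + gam * ||x||_1, applied to every row of Y."""
    lipschitz = np.linalg.norm(C, 2) ** 2
    X = np.zeros((Y.shape[0], C.shape[1]))
    for _ in range(n_iter):
        X = soft_threshold(X + (Y - X @ C.T) @ C / lipschitz, gam / lipschitz)
    return X

def update_dictionary(C, Y, X, lr):
    """One gradient step on the reconstruction error, then column normalization."""
    grad = -(Y - X @ C.T).T @ X / Y.shape[0]
    C = C - lr * grad
    return C / np.maximum(np.linalg.norm(C, axis=0), 1e-12)

def dual_estimation(Y, k, gam=0.1, lr=0.1, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((Y.shape[1], k))
    C /= np.linalg.norm(C, axis=0)
    for _ in range(n_epochs):
        X = infer_states(Y, C, gam)          # state estimation, parameters fixed
        C = update_dictionary(C, Y, X, lr)   # parameter estimation, states fixed
    return C, infer_states(Y, C, gam)
```

Column normalization after each update keeps the scale of the code from drifting into the dictionary, which would otherwise let the sparsity penalty be gamed.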
Learned Features and Invariances
Receptive Fields (feature detectors)
• Video database (Van Hateren’s)
• Input: 17 x 17 patches from different video sequences.
• States – 400 dim; causes – 100 dim; pooling 2x2
Measurement Matrix C –bases
Each small block is a column
Layer 1 causes (matrix B) are
composed from neighborhoods
Learned Features and Invariances
Receptive Fields (feature detectors)
When a receptive field is active at time t (left), the model predicts it will be active at later times t+1, t+2, … (right).
Scatter plot of the 15 strongest connections of matrix A from current time to next time (bars are pi/6).
Learned Features and Invariances
Gabor modeling of the receptive fields
Each element (receptive field of 15x15) in C is fit with a Gabor function parameterized by center position, spatial orientation, and frequency.
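For readers unfamiliar with the parameterization, a Gabor patch can be generated as a Gaussian envelope multiplied by an oriented sinusoid. The sketch below is only an illustration: the envelope width `sigma` and the `phase` are extra parameters assumed here, and the fitting procedure itself is not shown.

```python
import numpy as np

def gabor(size=15, center=(0.0, 0.0), theta=0.0, freq=0.2, sigma=3.0, phase=0.0):
    """Gabor patch: isotropic Gaussian envelope times an oriented cosine grating.

    center is an (x, y) offset in pixels from the middle of the patch,
    theta is the orientation in radians, freq is in cycles per pixel.
    """
    half = (size - 1) / 2.0
    rows, cols = np.mgrid[0:size, 0:size]
    x = cols - half - center[0]
    y = rows - half - center[1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the grating
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * freq * x_rot + phase)

patch = gabor(theta=np.pi / 6)
```

Fitting would then amount to searching these parameters so that the patch matches a learned column of C, for example by least squares.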
Visualizing Invariances
From states to causes (orientation)
• Connection strength between first layer invariance matrix (B) and the observation matrix (C). Each subplot is one column of B when one column of C is active, and strength is color coded.
Visualizing Invariances
From states to causes (frequencies)
• Connection strength between first layer invariance matrix (B) and the observation matrix (C). Each subplot is one column of B when one column of C is active, and strength is color coded.
Classification Results
Single layer representation (inferred causes)
Dataset     FS      DSC     DSC-I   ConvNN
COIL-100    66.87   71.81   74.63   71.49
Animal      76.09   82.43   85.82   ---
12x12 patches, 2x2 pooling and an SVM classifier (4 labeled frames)
Multi-Layered Architecture
Tree structure with tiling of scene at bottom
Computational model is uniform within and across layers
Different spatial scales due to pooling, which also slows the time scale in upper layers
Learning is greedy (one layer at a time)
This creates a Markov chain across layers
Multi-Layered Architecture
Notice that the top-layer predictions affect the lower layer, effectively creating constraints due to the topology.
Inference with Top-Down Connections
Top-down Influence
Predictions from the top-layer dynamic model non-linearly enter the bottom-layer inference.
Each frame is 32x32 pixels and each sequence is 100 frames long (30,000 long by concatenation).
Layer 1: Divide each frame into 20x20 (states 12x12)
Pool 2x2 states into one cause (invariant unit)
Dimensions: States – 100; Causes – 40
Layer 2: Consider the causes in layer 1 as inputs
Pool 2x2 states into one cause
Dimensions: States – 60; Causes – 3
Recognition of Sequences in Noisy Data
[Figure panels: Layer-1 Causes; Layer-2 Causes]
Recognition of Sequences in Noisy Data
• Top-down influence: $u_t^{(L+1)} = u_{t-1}^{(L)}$
• 2-layered network with dimensions (100, 40) in layer 1 and (60, 3) in layer 2
[Figure: three panels, Bottom-up (No Noise), Bottom-up (With Noise), and Top-down (With Noise), each showing Object 1, Object 2, and Object 3; inputs are Clean Video and Corrupted Video (SNR = -1.2 dB)]
Scalable Architecture with Convolutional Models
• Advantage of Convolutional model:
• Scalable to large images.
• Invariance to translations.
• Efficient implementation using GPUs.
• Main components:
• Convolutional sparse coding in dynamic networks using
Convolutional FISTA [IJCNN, 2013] to infer states.
• Pooling/Unpooling between the states and the causes.
• Convolutional sparse coding to infer causes.
Chalasani R., Principe J., Ramakrishnan N., “A Fast Proximal Method for
Convolutional Sparse Coding”, Proc. IEEE IJCNN, Austin, Tx, 2013
Scalable Architecture with Convolutional Dynamical
Models (CDNs)
[Figure: SINGLE LAYER MODEL, with labels RGB, filters, Pooling, unpooling]
Convolutional Dynamical Models
State Space Equations
• Each channel $I_t^m$ is modeled as a linear combination of K matrices convolved with filters $C_{m,k}$:
$$I_t^m = \sum_{k=1}^{K} C_{m,k} \ast X_t^k + N_t^m, \qquad m \in \{1, 2, \ldots, M\}$$
$$X_t^k(i,j) = \sum_{k'=1}^{K} a_{k,k'}\, X_{t-1}^{k'}(i,j) + V_t^k(i,j)$$
• $a_{k,k'}$ are the lateral connections, and here we make $a_{k,k'} = 1$ for $k = k'$ because of the application (object recognition)
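The two state-space equations can be read as a generative recipe: propagate the state maps, then synthesize each image channel as a sum of filter-by-map convolutions. The sketch below illustrates that reading with an FFT-based "full" convolution; the boundary handling, array layouts, and function names are assumptions made for the example.

```python
import numpy as np

def conv2_full(a, b):
    """Full 2-D linear convolution computed with zero-padded FFTs."""
    shape = (a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1)
    return np.real(np.fft.ifft2(np.fft.fft2(a, s=shape) * np.fft.fft2(b, s=shape)))

def synthesize_channels(filters, state_maps, noise_std=0.0, seed=0):
    """I^m = sum_k C[m, k] * X^k + N^m.

    filters: (M, K, fh, fw), state_maps: (K, H, W) -> (M, H + fh - 1, W + fw - 1)
    """
    rng = np.random.default_rng(seed)
    M, K, fh, fw = filters.shape
    _, H, W = state_maps.shape
    images = np.zeros((M, H + fh - 1, W + fw - 1))
    for m in range(M):
        for k in range(K):
            images[m] += conv2_full(filters[m, k], state_maps[k])
        images[m] += noise_std * rng.standard_normal(images[m].shape)
    return images

def propagate_states(state_maps, a, noise_std=0.0, seed=0):
    """X_t^k(i, j) = sum_k' a[k, k'] X_{t-1}^{k'}(i, j) + V_t^k(i, j)."""
    rng = np.random.default_rng(seed)
    nxt = np.tensordot(a, state_maps, axes=(1, 0))  # mix maps pixel by pixel
    return nxt + noise_std * rng.standard_normal(nxt.shape)
```

Passing an identity matrix for `a` corresponds to maps that evolve independently of one another.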
Convolutional Dynamical Models
Optimization
• Energy function for state maps (x is a matrix):
• Energy function for cause maps (x is pooled):
Convolutional Dynamical Models
Implementation
• This convolution dynamical model can be stacked in trees
as before
• The internal connectivity between inputs and states can
be made sparse
• We decrease the model size in the hierarchy by using
max pooling between the state and causes (and
unpooling in the reverse direction)
• Inference is done as before (alternate fixing causes and
states)
• Parameter learning is done as before
• FISTA was extended to convolution networks
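The pooling and unpooling pair mentioned in this list can be sketched as follows: pooling keeps the maximum of each block and remembers where it came from, and unpooling writes the pooled values back to those positions. Non-overlapping blocks, zero fill elsewhere, and the function names are assumptions of this sketch.

```python
import numpy as np

def max_pool(X, s=2):
    """Non-overlapping s x s max pooling; also returns the argmax inside each block."""
    h, w = X.shape
    hp, wp = h // s, w // s
    blocks = (X[:hp * s, :wp * s]
              .reshape(hp, s, wp, s)
              .transpose(0, 2, 1, 3)
              .reshape(hp, wp, s * s))
    idx = blocks.argmax(axis=2)
    pooled = np.take_along_axis(blocks, idx[..., None], axis=2)[..., 0]
    return pooled, idx

def unpool(pooled, idx, s=2):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    hp, wp = pooled.shape
    blocks = np.zeros((hp, wp, s * s))
    np.put_along_axis(blocks, idx[..., None], pooled[..., None], axis=2)
    return (blocks.reshape(hp, wp, s, s)
            .transpose(0, 2, 1, 3)
            .reshape(hp * s, wp * s))

X = np.arange(16.0).reshape(4, 4)
pooled, idx = max_pool(X)
restored = unpool(pooled, idx)
```

Keeping the indices is what lets a top-down prediction at the pooled resolution be projected back onto the state maps.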
Convolutional Dynamical Models
Inference in the Hierarchy
• To simplify inference, the state-space model at each layer predicts the most likely cause at the layer below ($U_{l-1,t}$), given only the previous states and the predicted causes from the layer above
Convolutional Dynamical Models
Learning in the Hierarchy
• Learning is done layer by layer starting from the bottom
• To simplify learning, we do not consider any top-down connections for inference
• Filters are normalized to unit norm after learning
• The gradients are
$$\nabla_{C_{m,k'}^{l}} E_l = -2\, X_t^{k',l} \ast \left( I_t^m - \sum_{k=1}^{K} C_{m,k} \ast X_t^{k,l} \right)$$
$$\nabla_{B_{m,d'}^{l}} E_l = -U_t^{d',l} \ast \left[ \exp\left\{ -\left( \sum_{d=1}^{D} B_{k',m} \ast U_t^{d,l} \right) \right\} \cdot \operatorname{down}\left( X_t^{k',l} \right) \right]$$
Object Recognition- Training
• Learning on Van Hateren
natural video database
(128x128).
• Architecture:
• Layer 1: 16 states of 7x7
filters and 32 causes of
6x6 filters.
• Layer 2: 64 states of 7x7
filters and 128 causes.
• Pooling: 2 x 2 between
states and causes.
[Figure panels: Layer 2 Causes; Layer 1 Causes]
Self-Taught Learning
With the parameters learned from Van Hateren, classify images in
Caltech 101.
Extract features from a single bottom up inference (causes from both
layer 1 and 2 that are concatenated into a feature vector)
30 images for training, and for testing.
Object Recognition with Context
COIL-100 dataset:
72 frames per object.
Top-down inference is run
over each sequence
We assume that the test
data is partially available
during training.
So called “transductive”
learning.
Four frames per object (0°, 90°, 180°, 270°) for training a linear SVM.
Object Recognition- Results
Methods                                                            Accuracy (%)
View-tuned network (VTU) [Wersing & Korner, 2003]                  79.10
Convolutional Nets with temporal coherence [Mobahi et al, 2009]    92.25
Stacked ISA with temporal coherence [Zou et al, 2012]              87.00
Our method; without temporal coherence                             79.45
Our method; with temporal coherence                                94.41
Our method; with temporal coherence + Top-down                     98.34
Testing Discriminability in Object Recognition
Honda/UCSD face data set (20 for training, 39 for testing) using the Viola-Jones face finding algorithm (on 20x20 patches). Histogram equalization is done. Two-layer model with (16, 48) in layer 1 and (64, 100) in layer 2, 5x5 filters, causes concatenated as features.
Testing Discriminability in Object Recognition
YouTube Celebrity face data set (the data are partitioned into 10 sets of 9 videos; 3 of each are used for training and 6 for testing) using the Viola-Jones face finding algorithm (30x30 patches). Average results plotted.
Model of same size but filters are 7x7. Histogram equalization is done.
Discriminability with Occlusion
[Figure: example video frames from VidTIMIT with Layer-1 States, Layer-1 Causes, and Layer-2 Causes]
Extension to Auditory Processing
• We would like to show the “universality” of this type of modeling
approach by addressing auditory processing.
• Temporal theory: information in sound is coded in the temporal firing patterns of the auditory neurons connecting to the cochlea.
• Place theory: perception of sound depends on the location along the basilar membrane that vibrates in response to the sound.
• Consequently, use of dynamical systems would be a great fit. Audio streams can also be modeled as a single source, so they are easier to explain.
• Simplify the modeling approach using Kalman filters for
efficiency
• Integrate auditory and visual processing through the causes
stored in a content addressable memory (CAM).
Nested Hierarchical Linear Dynamical
System (HLDS)
The linear model consists of one
measurement equation and multiple state
transition equations.
By design the top layer creates point
attractors (Brownian state) to extract
redundancies in the sound time structure.
The nested HLDS is driven bottom-up by the observations and top-down by the states, so indirectly it segments the input into spectrally uniform regions.
Cinar G., Príncipe J., “Clustering of Time Series Using a Hierarchical
Linear Dynamical System”, in Proc. ICASSP 2014, Florence, Italy
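Because every layer of the HLDS is linear with Gaussian noise, estimation can in principle rely on standard Kalman recursions. The sketch below shows a single generic predict/update cycle. It is not the nested multi-layer filter from the cited paper; the matrices in the usage example, including a random-walk (Brownian) state with A = I, are assumptions chosen for illustration.

```python
import numpy as np

def kalman_step(x, P, y, A, C, Q, R):
    """One predict/update cycle of a standard Kalman filter."""
    # Predict the next state and its covariance.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update with the new observation.
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = np.linalg.solve(S, C @ P_pred).T     # gain K = P_pred C^T S^{-1}
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new

# Random-walk state observed directly: the estimate settles near the input level.
x, P = np.zeros(1), np.eye(1)
A, C = np.eye(1), np.eye(1)
Q, R = 1e-4 * np.eye(1), 0.1 * np.eye(1)
for _ in range(200):
    x, P = kalman_step(x, P, np.array([1.0]), A, C, Q, R)
```

With a small process noise Q, a random-walk state changes slowly and settles to a nearly constant value while the input statistics stay the same, which is the kind of behavior a point attractor is meant to capture.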
Point Attractors for Trumpet Notes
• Train with audio samples from Univ. of Iowa Musical
Instrument notes (2 sec sustained notes) in the range E3-
D6 for the nonvibrato Trumpet.
• The algorithm organizes in an unsupervised fashion the
different time structure of notes into point attractors in the
state space of the highest layer (Hopfield network).
Recognition Through Clustering
• Model system: 60,10,3 states, 36 msec windows, 1024
FFTs. System is real time.
• To assess classification accuracy we do Monte-Carlo runs
through all 35 notes (randomized order).
• Convergence is declared with 4 consecutive decisions to
the same output space location (clustering).
• The performance is tested for three noise levels.
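The convergence rule in this list (four consecutive decisions at the same output location) is simple enough to state in code. The sketch below assumes the per-window decisions are already available as a sequence of cluster labels; the function name and return convention are hypothetical.

```python
def first_convergence(decisions, run=4):
    """Return (index, label) of the first window that completes `run` identical
    consecutive decisions, or None if that never happens."""
    count = 0
    for i, label in enumerate(decisions):
        if i > 0 and label == decisions[i - 1]:
            count += 1
        else:
            count = 1  # a different label restarts the run
        if count >= run:
            return i, label
    return None

result = first_convergence([1, 2, 2, 3, 3, 3, 3, 5])
```

Here the run of four 3s completes at index 6, so `result` is `(6, 3)`.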
                                                    35 dB SNR   15 dB SNR   -5 dB SNR
Classification Accuracy using Variance Test         96.94%      87.94%      5.43%
Classification Accuracy using all time instances    79.70%      71.95%      5.17%
Performance in Music Clips
• This setup of isolated notes is not ‘practical’. In a real life
scenario it would be almost impossible to find a music piece
that consists of notes sustained for 2 seconds.
• We started testing the performance of the algorithm in
very short music clips.
• We used the first two verses of Beethoven Symphony
#5.
• We create the clip from real trumpet recordings.
Performance in Music Clips
We play this motif
under different
conditions.
3 different levels of
amplitude
magnitudes.
Crescendo
Decrescendo
Different Tempos
Observe that the
same trajectory is
followed all through
these dynamic
changes.
Yesterday by The Beatles
• We created the Beatles song “Yesterday” using the Trumpet
notes in the database.
• We apply the same simple convergence criterion. Once
unanimous decisions are disrupted, windows are labeled as
“undecided”.
• Once the next convergence is declared, we go back in time
and fix the undecided windows accordingly.
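The labeling procedure described in these bullets can be sketched as a single pass over the window decisions. The interpretation coded here is an assumption: windows stay undecided (`None`) until a run of identical decisions is reached, at which point the preceding undecided windows receive the newly converged label.

```python
def label_windows(decisions, run=4):
    """Label windows by convergence, back-filling undecided windows (None)."""
    labels = [None] * len(decisions)
    count = 0
    for i, label in enumerate(decisions):
        count = count + 1 if i > 0 and label == decisions[i - 1] else 1
        if count >= run:
            labels[i] = label
            j = i - 1
            while j >= 0 and labels[j] is None:  # go back in time
                labels[j] = label
                j -= 1
    return labels

labels = label_windows([7, 7, 7, 7, 2, 5, 5, 5, 5, 5])
```

In this example the isolated decision `2` is overwritten by the next converged note, giving `[7, 7, 7, 7, 5, 5, 5, 5, 5, 5]`; windows after the last convergence would remain `None`.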
[Figure: what the algorithm “hears”]
Yesterday by The Beatles
• Each window is played several times and clustering averaged
(can be parallelized for real time)
• After post-processing, the classification accuracy goes as high
as 93% (notes shorter than the convergence time are
misclassified).
• Clean notes from database are concatenated
CONCLUSIONS
High Level
• We have developed a preliminary version of a computational, adaptive, self-organized, distributed and hierarchical model for episodic memory in sensory processing.
• The Bayesian framework is general and very flexible to explain
the data. However the computational complexity is still huge and
requires specific hardware architectures.
• The sparseness constraint was critical to disambiguate time and
spatial features in video, while adaptive pooling was critical to
link model parameters to object and video invariant features.
• Preliminary results show that our model is capable of high
performance in video processing and it is very robust to
structured noise.
• The same basic principles were extended to auditory processing
(no sparseness), which shows their large appeal.
CONCLUSIONS
Specific novel aspects in the approach
• Paradigm shift in sensory processing: Switched from input
space feature design to discriminative model design to explain
the sensory data with sparse and invariant representations.
• The computational framework emphasizes self-organization
and distributed feedback between bottom up and top down
processing. It blends working memory (states) and long term
memory (parameters). It is ready to take advantage of prior
knowledge (cognitive memory).
• Illustrates the importance of dynamic modeling to describe
spatio-temporal data (video). Continuity over time simplifies
processing!
CONCLUSIONS
Specific novel aspects in the modeling
• New methodology to perform sparse coding with dynamic
models and showed its importance
• New methodology to learn invariant representations rather
than doing max or average pooling and showed
improvement in classification
• New methodology to implement top-down connections
that can disambiguate the object of interest in the video.
• Validated the role of dynamics to help deal with structured
noise in a hierarchical model
Future Directions
• Audio- Video Fusion