How a Part of the Brain Might or Might Not Work: A New
Hierarchical Model of Object Recognition
by
Maximilian Riesenhuber
Diplom-Physiker, Universität Frankfurt, 1995
Submitted to the Department of Brain and Cognitive Sciences in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computational Neuroscience
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2000
© Massachusetts Institute of Technology 2000. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Brain and Cognitive Sciences
May 2, 2000
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Tomaso Poggio
Uncas and Helen Whitaker Professor
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Earl K. Miller
Co-Chair, Department Graduate Committee
How a Part of the Brain Might or Might Not Work: A New Hierarchical Model
of Object Recognition
by
Maximilian Riesenhuber
Submitted to the Department of Brain and Cognitive Sciences on May 2, 2000, in partial fulfillment of the
requirements for the degree of Doctor of Philosophy in Computational Neuroscience
Abstract
The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated representations, extending in a natural way the model of simple to complex cells of Hubel and Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to explore the biological feasibility of this class of models to explain higher-level visual processing, such as object recognition in cluttered scenes. We describe a new hierarchical model, HMAX, that accounts well for this complex visual task, is consistent with several recent physiological experiments in inferotemporal cortex, and makes testable predictions. Key to achieving invariance and robustness to clutter is a MAX-like response function of some model neurons which selects (an approximation to) the maximum activity over all the afferents, with interesting connections to “scanning” operations used in recent computer vision algorithms.
We then turn to the question of object recognition in natural (“continuous”) object classes, such as faces, which recent physiological experiments have suggested are represented by a sparse distributed population code. We performed two psychophysical experiments in which subjects were trained to perform subordinate-level discrimination in a continuous object class — images of computer-rendered cars — created using a 3D morphing system. By comparing the recognition performance of trained and untrained subjects we could estimate the effects of viewpoint-specific training and infer properties of the object class-specific representation learned as a result of training. We then compared the experimental findings to simulations in HMAX, to investigate the computational properties of a population-based object class representation. We find experimental evidence, supported by modeling results, that training builds a viewpoint- and class-specific representation that supplements a pre-existing representation with lower shape discriminability but greater viewpoint invariance.
Finally, we show how HMAX can be extended in a straightforward fashion to perform object categorization and to support arbitrary class hierarchies. We demonstrate the capability of our scheme, called “Categorical Basis Functions” (CBF), with the example domain of cat/dog categorization, and apply it to study some recent findings in categorical perception.
Thesis Supervisor: Tomaso Poggio
Title: Uncas and Helen Whitaker Professor
Acknowledgments
Thanks are due to quite a few people who have contributed in different ways to the gestation of
this thesis. First, of course, are my parents. I am extremely grateful for their untiring support and
encouragement over the years — especially to my father for convincing me to major in physics.
Second is Hans-Ulrich Bauer, who advised my Diplom thesis at the Institute for Theoretical
Physics of the University of Frankfurt. Without Hans-Ulrich and his urging to get my PhD in
computational neuroscience in the US I would never have applied to MIT and would now probably
be a bitter physics PhD working for McKinsey. To him this thesis is dedicated.
At MIT, I first want to thank my advisor, Tommy Poggio, for warning me about the smog at
Caltech, and for being the best advisor I could imagine: providing a lot of independence while
being very encouraging and supportive. Being exposed to his way of doing science I consider one
of the biggest assets of my PhD training.
Then there are the people on my thesis committee, all of whom have provided valuable input
and guidance along the way. Special thanks are due to Peter Dayan for a fine collaboration during
my first year that introduced me to quite a few new ideas. To Earl Miller for being so open to collab-
orations with wacky theorists. To Mike Tarr for a nice News & Views commentary [96] on our paper
[82], and a very stimulating visit which might lead to an even more stimulating collaboration. . .
To Pawan Sinha and Gadi Geiger for introducing me to the wonderful world of psychophysics
and providing invaluable advice when I ran my first experiment. To David Freedman and Andreas
Tolias for introducing me to the wonderful world of monkey physiology and being great
collaborators. To Christian Shelton for the “mother of all correspondence algorithms” that was so
instrumental in many parts of this thesis and beyond. To Valerie Pires, without whose help the
psychophysics in chapter 4 would have been nothing more than “suggestions for future work”,
and to Mary Pat Fitzgerald, “the mother of CBCL”, whose help in dealing with the subtleties of the
MIT administration was (and continues to be) greatly appreciated.
Then there is our fine department that really has it all “under one roof”. I could not have
imagined a better place to get my PhD.
Last but not least I want to gratefully acknowledge the generous support provided by a Gerald
J. and Marjorie J. Burnett Fellowship (1996–1998) and a Merck/MIT Fellowship in Bioinformatics
(1998–2000) that enabled me to pursue the studies described in this thesis.
Contents
1 Introduction 8
2 Hierarchical Models of Object Recognition in Cortex 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Are Cortical Models Really Bound by the “Binding Problem”? 26
3.1 Introduction: Visual Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Models of Visual Object Recognition and the Binding Problem . . . . . . . . . . . . . 27
3.3 A Hierarchical Model of Object Recognition in Cortex . . . . . . . . . . . . . . . . . . 29
3.4 Binding without a problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Recognition of multiple objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.2 Recognition in clutter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 The Individual is Nothing, the Class Everything:
Psychophysics and Modeling of Recognition in Object Classes 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Modeling: Representing Continuous Object Classes in HMAX . . . . . . . . . . . . . 50
4.3.1 The HMAX Model of Object Recognition in Cortex . . . . . . . . . . . . . . . 51
4.3.2 View-Dependent Object Recognition in Continuous Object Classes . . . . . . 51
4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.3 The Model Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 General Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 A note on object class representation and categorical perception 66
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Chorus of Prototypes (COP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 A Novel Scheme: Categorical Basis Functions (CBF) . . . . . . . . . . . . . . . . . . . 68
5.3.1 An Example: Cat/Dog Classification . . . . . . . . . . . . . . . . . . . . . . . 69
5.3.2 Introduction of parallel categorization schemes . . . . . . . . . . . . . . . . . 72
5.4 Interactions between categorization and discrimination: Categorical Perception . . . 72
5.4.1 Categorical Perception in CBF . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.2 Categorization with and without Categorical Perception . . . . . . . . . . . . 75
5.5 COP or CBF? — Suggestion for Experimental Tests . . . . . . . . . . . . . . . . . . . . 78
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 General Conclusions and Future Work 81
List of Figures
2-1 Invariance properties of one IT neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2-2 Sketch of the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2-3 Illustration of the highly nonlinear shape tuning properties of the MAX mechanism 17
2-4 Response of a sample model neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2-5 Average neuronal responses to scrambled stimuli . . . . . . . . . . . . . . . . . . . . 21
3-1 Cartoon of the Poggio and Edelman model of view-based object recognition . . . . . 30
3-2 Sketch of our hierarchical model of object recognition in cortex . . . . . . . . . . . . . 32
3-3 Recognition of two objects simultaneously . . . . . . . . . . . . . . . . . . . . . . . . 34
3-4 Stimulus/background example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3-5 Model performance: Recognition in clutter . . . . . . . . . . . . . . . . . . . . . . . . 36
4-1 Natural objects, and artificial objects used in previous object recognition studies . . . 42
4-2 The eight prototype cars used in the 8 car system . . . . . . . . . . . . . . . . . . . . . 45
4-3 Training task for Experiments 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4-4 Illustration of match/nonmatch pairs for Experiment 1 . . . . . . . . . . . . . . . . . 47
4-5 Testing task for Experiments 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4-6 Average performance of the trained subjects on the test task of Experiment 1 . . . . . 49
4-7 Average performance of untrained subjects on the test task of Experiment 1 . . . . . 50
4-8 Our model of object recognition in cortex . . . . . . . . . . . . . . . . . . . . . . . . . 52
4-9 Recognition performance of the model on the eight car morph space . . . . . . . . . 53
4-10 Dependence of average (one-sided) rotation invariance, Δr, on SSCU tuning width, σ 55
4-11 Dependence of invariance range on the number of afferents to each SSCU . . . . . . 55
4-12 Dependence of invariance range on the number of SSCUs . . . . . . . . . . . . . . . . 56
4-13 Effect of addition of noise to the SSCU representation . . . . . . . . . . . . . . . . . . 57
4-14 Cat/dog prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4-15 The “Other Class” effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4-16 Car object class-specific features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-17 Performance of the two-layer model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-18 The 15 prototypes used in the 15 car system . . . . . . . . . . . . . . . . . . . . . . . . 61
4-19 Average performance of trained subjects in Experiment 2 . . . . . . . . . . . . . . . . 62
4-20 Average performance of untrained subjects in Experiment 2 . . . . . . . . . . . . . . 63
5-1 Cartoon of the CBF categorization scheme . . . . . . . . . . . . . . . . . . . . . . . . . 70
5-2 Illustration of the cat/dog stimulus space . . . . . . . . . . . . . . . . . . . . . . . . . 71
5-3 Response of the cat/dog categorization unit . . . . . . . . . . . . . . . . . . . . . . . . 71
5-4 Sketch of the model to explain the influence of experience with categorization tasks
on object discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5-5 Average responses over all morph lines for the two networks . . . . . . . . . . . . . . 76
5-6 Comparison of Euclidean distances of activation patterns . . . . . . . . . . . . . . . . 77
5-7 Output of the categorization unit trained on the cat/dog categorization task . . . . . 80
5-8 Same as Fig. 5-7, but for a representation based on 30 units chosen by k-means . . . 80
Chapter 1
Introduction
Tell your hairdresser that you are working on vision and he will likely say “Vision? But that’s
easy!” Indeed, the apparent ease with which we perform object recognition even in cluttered scenes
and under difficult viewing conditions belies the amazing complexity of the visual system. This
became apparent with the groundbreaking studies of Hubel and Wiesel [36, 37] in the primary
visual cortices of cats and monkeys in the late 50s and 60s. Subsequently, many other visual areas
were discovered, with recent surveys listing over 30 visual areas linked in an intricate and still
unresolved pattern [22, 35]. This complex connection scheme can be coarsely divided up into two
pathways, the “What” pathway (the ventral stream) running from primary visual cortex, V1, via
V2 and V4 to inferotemporal cortex (IT), and the “Where” pathway (dorsal stream) from V1 to V2,
V3, MT, MST and other parietal areas [103]. In this framework, the “What” pathway is specialized
for object recognition whereas the “Where” pathway is concerned with spatial vision. Looking at
the ventral stream in more detail, Kobatake et al. [41] have reported a progression of the complexity
of cells’ preferred features and receptive field sizes as one progresses along the stream. While
neurons in V1 are tuned to oriented bars and have small receptive fields, cells in IT appear to prefer
more complex visual patterns such as faces (for a review on face cells, see [15]), and respond over
a wide range of positions and scales, pointing to a crucial role of IT cortex in object recognition (as
confirmed by a great number of physiological, lesion and neuropsychological studies [50, 94]).
These findings naturally prompted the question of how cells tuned to views of complex objects
showing invariance to size and position changes in IT could arise from small bar-tuned receptive
fields in V1, and how ultimately this neural substrate could be used to perform object recognition.
In humans, psychophysical experiments had given rise to two main competing theories of object
recognition: the structural description and the view-based theory (see [98] for a review of the
two theories and their experimental evidence). The former approach, its main protagonist being
the recognition-by-components theory of Biederman [5], holds that object recognition proceeds by
decomposing an object into a view-independent part-based description, while in the view-based
theory object recognition is based on the viewpoints in which objects had actually appeared. Experimental
evidence ([8, 48, 49, 97, 100], see also chapter 2) and computational considerations [19] appear
to favor the view-based theory. However, several challenges for view-based models remained; Tarr
and Bülthoff [98] very recently listed the following problems:*
1. tolerance to changes in viewing condition — while the system should allow fine shape discriminations,
it should not require a new representation for every change in viewing condition;
2. class generalization and representation of categories — the system should generalize from
familiar exemplars of a class to unfamiliar ones, and also be able to support categorization
schemes.
This thesis presents a simple hierarchical model of object recognition in cortex (chapter 5 shows
how the model can be extended to object categorization in a straightforward way), HMAX [96],
that addresses these challenges in a biologically plausible system. In particular, chapter 2 (a reprint
of [82], copyright Nature America Inc., reprinted by permission) introduces HMAX and compares
the visual properties of view-tuned units in the model, especially with respect to translation, scaling,
and rotation in depth of the visual scene, to those of neurons in inferotemporal cortex recorded
from in various experiments. Chapter 3 (a reprint of [81], copyright Cell Press, reprinted by permission)
shows that HMAX can even perform recognition in cluttered scenes, without having to
resort to special segmentation processes, which is of special interest in connection with the so-called
“Binding Problem”.
While the first two papers focus on “paperclip” objects that have been used extensively in psychophysical
[8, 48, 91], physiological [49] and computational studies [68], this object class has several
disadvantages (such as not being “nice” [107]) that make it unsuitable as a basis to investigate
recognition in natural object classes — the topic of chapter 4 (a reprint of [84]) — where objects
have similar 3D shape, such as faces. Instead, in chapters 4 and 5, stimuli for model and experiment
were generated using a novel 3D morphing system developed by Christian Shelton [90] that
allows us to generate morphed objects drawn from a space spanned by a set of prototype objects,
for instance cars (chapter 4), or cats and dogs (chapters 4 and 5). We show that the recognition
results obtainable in natural object classes represented using a population code where the activity
over a group of units codes for the identity of an object (as suggested by recent physiology studies
[112, 115]) are quite comparable to those for individual objects represented by “grandmother” units
(in chapter 2), that is, the performance of HMAX does not appear to be special to a certain object
*They also listed the need for a simple mechanism to measure perceptual similarity, in order “to generalize between exemplars or between views” ([98], p. 9), which thus appears to be a corollary of solving the two problems listed above.
class. In addition, simulations show that a population-based object representation provides several
computational advantages over a “grandmother” representation. We further present experimental
results from a psychophysical study in which we trained subjects using a discrimination paradigm
to build a representation of a novel object class. This representation was then probed by examining
how discrimination performance was affected by viewpoint changes. We find experimental
evidence, supported by the modeling results, that training builds a viewpoint- and class-specific
representation that supplements a pre-existing representation with lower shape discriminability
but greater viewpoint invariance. Chapter 5 (a reprint of [83]) finally shows how HMAX can be
extended in a straightforward way to perform object categorization, and to support arbitrary object
categorization schemes, with interesting opportunities for interactions between discrimination and
categorization as observed in categorical perception.
Chapter 2
Hierarchical Models of Object
Recognition in Cortex
Abstract
The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated
representations, extending in a natural way the model of simple to complex cells of Hubel and
Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to
explore the biological feasibility of this class of models to explain higher-level visual processing,
such as object recognition. We describe a new hierarchical model that accounts well for this
complex visual task, is consistent with several recent physiological experiments in inferotemporal
cortex and makes testable predictions. The model is based on a novel MAX-like operation on the
inputs to certain cortical neurons which may have a general role in cortical function.
2.1 Introduction
The recognition of visual objects is a fundamental cognitive task performed effortlessly by the brain
countless times every day while satisfying two essential requirements: invariance and specificity.
In face recognition, for example, we can recognize a specific face among many, while being rather
tolerant to changes in viewpoint, scale, illumination, and expression. The brain performs this and
similar object recognition and detection tasks fast [101] and well. But how?
Early studies [7] of macaque inferotemporal cortex (IT), the highest purely visual area in the
ventral visual stream, thought to have a key role in object recognition [103], reported cells tuned to
views of complex objects such as a face, i.e., the cells discharged strongly to the view of a face but
very little or not at all to other objects. A hallmark of these cells was the robustness of their firing
to stimulus transformations such as scale and position changes.
This finding presented an interesting question: How could these cells show strongly differing
responses to similar stimuli (e.g., two different faces) that activate the retinal photoreceptors in
similar ways, while showing response constancy to scaled and translated versions of the preferred
stimulus that cause very different activation patterns on the retina?
This puzzle was similar to one faced by Hubel and Wiesel on a much smaller scale two decades
earlier when they recorded from simple and complex cells in cat striate cortex [36]: both cell types
responded strongly to oriented bars, but whereas simple cells exhibited small receptive fields with
a strong phase dependence, that is, with distinct excitatory and inhibitory subfields, complex cells
had larger receptive fields and no phase dependence. This led Hubel and Wiesel to propose a
model in which simple cells with their receptive fields in neighboring parts of space feed into
the same complex cell, thereby endowing that complex cell with a phase-invariant response. A
straightforward (but highly idealized) extension of this scheme would lead all the way from simple
cells to “higher order hypercomplex cells” [37].
Starting with the Neocognitron [25] for translation-invariant object recognition, several hierarchical
models of shape processing in the visual system have subsequently been proposed to explain
how transformation-invariant cells tuned to complex objects can arise from simple cell inputs
[64, 111]. Those models, however, were not quantitatively specified or were not compared with
specific experimental data. Alternative models for translation- and scale-invariant object recognition
have been proposed, based on a controlling signal that either appropriately reroutes incoming
signals, as in the “shifter” circuit [2] and its extension [62], or modulates neuronal responses, as
in the “gain-field” models for invariant recognition [78, 88]. While recent experimental studies
[14, 56] have indicated that cells in macaque area V4 can show an attention-controlled shift or modulation
of their receptive field in space, there is still little evidence that this mechanism is used to
perform translation-invariant object recognition, or that a similar mechanism applies to other
transformations (such as scaling) as well.
The basic idea of the hierarchical model sketched by Perrett and Oram [64] was that invariance
to any transformation (not just image-plane transformations as in the case of the Neocognitron
[25]) could be built up by pooling over afferents tuned to various transformed versions of the
same stimulus. Indeed it was shown earlier [68] that viewpoint-invariant object recognition was
possible using such a pooling mechanism. A (Gaussian RBF) learning network was trained with
individual views (rotated around one axis in 3D space) of complex, paperclip-like objects to achieve
3D rotation-invariant recognition of these objects. In the network the resulting view-tuned units fed
into a view-invariant unit; they effectively represented prototypes between which the learning
network interpolated to achieve viewpoint invariance.
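The view-interpolation scheme just described can be illustrated with a minimal sketch. The Gaussian tuning function, the 30-degree tuning width, and the particular set of training views below are invented for illustration; they are not the parameters of the network in [68].

```python
import math

def view_tuned_response(view_angle, preferred_angle, sigma=30.0):
    """Gaussian tuning of a view-tuned unit around its training view
    (sigma is an arbitrary illustrative tuning width, in degrees)."""
    return math.exp(-(view_angle - preferred_angle) ** 2 / (2 * sigma ** 2))

def view_invariant_response(view_angle, training_views):
    """View-invariant unit: pooled response over view-tuned afferents.
    The view-tuned units act as prototypes between which the network
    effectively interpolates."""
    return sum(view_tuned_response(view_angle, v) for v in training_views)

# With training views every 60 degrees, the pooled response stays high
# even for intermediate, never-seen views: rotation invariance by pooling.
views = [0, 60, 120, 180, 240, 300]
for angle in (0, 30, 90):
    print(angle, round(view_invariant_response(angle, views), 2))
```

A handful of view-tuned prototypes is enough to keep the pooled response roughly constant across intermediate views, which is the essence of invariance-by-pooling.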
There is now quantitative psychophysical [8, 48, 95] and physiological evidence [6, 42, 49] for
the hypothesis that units tuned to full or partial views are probably created by a learning process
and also some hints that the view-invariant output is in some cases explicitly represented by (a
small number of) individual neurons [6, 49, 66].
A recent experiment [48, 49] required monkeys to perform an object recognition task using
novel “paperclip” stimuli the monkeys had never seen before. Here, the monkeys were required
to recognize views of “target” paperclips rotated in depth among views of a large number of “distractor”
paperclips of very similar structure, after being trained on a restricted set of views of each
target object. Following very extensive training on a set of paperclip objects, neurons were found
in anterior IT that selectively responded to the object views seen during training.
This design avoided two problems associated with previous physiological studies investigating
the mechanisms underlying view-invariant object recognition: First, by training the monkey
to recognize novel stimuli with which the monkey had not had any visual experience instead of
objects (e.g., faces) with which the monkey was quite familiar, it was possible to estimate the degree
of view-invariance derived from just one object view. Moreover, the use of a large number of
distractor objects made it possible to define view-invariance with respect to the distractors. This is a
key point, since only by being able to compare the response of a neuron to transformed versions of
its preferred stimulus with the neuron’s response to a range of (similar) distractor objects can the
VTU’s (view-tuned unit’s) invariance range be determined — just measuring the tuning curve is
not sufficient.
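This criterion — a transformed view counts as within the invariance range only while the target response still exceeds the best distractor response — can be made concrete with a small sketch. All response values below are made up for illustration; the actual analysis in [49] is of course more involved.

```python
def invariance_range(rotations, target_responses, distractor_responses):
    """Return the rotation values for which the response to the transformed
    target exceeds the strongest distractor response. The threshold is set
    by the distractors, not by zero, which is why the target tuning curve
    alone is not sufficient to determine the invariance range."""
    threshold = max(distractor_responses)
    return [r for r, resp in zip(rotations, target_responses)
            if resp > threshold]

rotations = [-60, -30, 0, 30, 60]       # degrees around the training view
target = [5.0, 22.0, 40.0, 25.0, 6.0]   # invented spike rates for the target
distractors = [12.0, 9.0, 8.0]          # invented best-distractor rates
print(invariance_range(rotations, target, distractors))  # -> [-30, 0, 30]
```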
The study [49] established (Fig. 2-1) that after training with just one object view there are cells
showing some degree of limited invariance to 3D rotation around the training view, consistent
with the view-interpolation model [68]. Moreover, the cells also exhibit significant invariance to
translation and scale changes, even though the object was only previously presented at one scale
and position.
These data put in sharp focus and in quantitative terms the question of the circuitry underlying
the properties of the view-tuned cells. While the original model [68] described how VTUs
could be used to build view-invariant units, it did not specify how the view-tuned units could
come about. The key problem is thus to explain in terms of biologically plausible mechanisms the
VTUs’ invariance to translation and scaling obtained from just one object view, which arises from
a trade-off between selectivity to a specific object and relative tolerance (i.e., robustness of firing) to
position and scale changes. Here, we describe a model that conforms to the main anatomical and
physiological constraints, reproduces the invariance data described above and makes predictions
for experiments on the view-tuned subpopulation of IT cells. Interestingly, the model is also consistent
with recent data from several other experiments regarding recognition in context [54], or the
presence of multiple objects in a cell’s receptive field [89].
[Figure 2-1 appears here: four panels (a)–(d); axes include Spike Rate vs. Distractor ID (10 best distractors), Rotation Around Y Axis, Azimuth and Elevation, Degrees of Visual Angle, and (Target Response)/(Mean of Best Distractors).]
Figure 2-1: Invariance properties of one neuron (modified from Logothetis et al. [49]). The figure shows the response of a single cell found in anterior IT after training the monkey to recognize paperclip-like objects. The cell responded selectively to one view of a paperclip and showed limited invariance around the training view to rotation in depth, along with significant invariance to translation and size changes, even though the monkey had only seen the stimulus at one position and scale during training. (a) shows the response of the cell to rotation in depth around the preferred view. (b) shows the cell’s response to the 10 distractor objects (other paperclips) that evoked the strongest responses. The lower plots show the cell’s response to changes in stimulus size, (c) (asterisk shows the size of the training view), and position, (d) (using the 1.9° size), resp., relative to the mean of the 10 best distractors. Defining “invariance” as yielding a higher response to transformed views of the preferred stimulus than to distractor objects, neurons exhibit an average rotation invariance of 42° (during training, stimuli were actually rotated by ���� in depth to provide full 3D information to the monkey; therefore, the invariance obtained from a single view is likely to be smaller), and translation and scale invariance on the order of ��� and �� octave around the training view, resp. (J. Pauls, personal communication).
[Figure 2-2 sketch: a hierarchy running, bottom to top, simple cells (S1), complex cells (C1), "composite feature" cells (S2), "complex composite" cells (C2), and view-tuned cells, with "weighted sum" and "MAX" labeling the two connection types.]
Figure 2-2: Sketch of the model. The model is a hierarchical extension of the classical paradigm [36] of building complex cells from simple cells. It consists of a hierarchy of layers with linear operations (“S” units in the notation of Fukushima [25], performing template matching, solid lines) and non-linear operations (“C” pooling units [25], performing a “MAX” operation, dashed lines). The non-linear MAX operation, which selects the maximum of the cell’s inputs and uses it to drive the cell, is key to the model’s properties and is quite different from the basically linear summation of inputs usually assumed for complex cells. These two types of operations respectively provide pattern specificity and invariance (to translation, by pooling over afferents tuned to different positions, and to scale (not shown), by pooling over afferents tuned to different scales).
2.2 Results
The model is based on a simple hierarchical feedforward architecture (Fig. 2-2). Its structure reflects
the assumption that invariance to position and scale on the one hand and feature specificity on
the other hand must be built up through separate mechanisms: to increase feature complexity, a
suitable neuronal transfer function is a weighted sum over afferents coding for simpler features,
i.e., a template match. But is summing over differently weighted afferents also the right way to
increase invariance?
From the computational point of view, the pooling mechanism should produce robust feature
detectors, i.e., measure the presence of specific features without being confused by clutter and con-
text in the receptive field. Consider a complex cell, as found in primary visual cortex, whose pre-
ferred stimulus is a bar of a certain orientation to which the cell responds in a phase-invariant way
[36]. Along the lines of the original complex cell model [36], one could think of the complex cells
as receiving input from an array of simple cells at different locations, pooling over which results in
the position-invariant response of the complex cell.
Two alternative idealized pooling mechanisms are: linear summation (“SUM”) with equal weights
(to achieve an isotropic response) and a nonlinear maximum operation (“MAX”), where the strongest
afferent determines the response of the postsynaptic unit. In both cases, if only one bar is present
in the receptive field, the response of a model complex cell is position invariant. The response
level would signal how similar the stimulus is to the afferents’ preferred feature. Consider now
the case of a complex stimulus, such as a paperclip, in the visual field. In the linear summation
case, complex cell response would still be invariant (as long as the stimulus stays in the cell’s re-
ceptive field), but the response level would no longer allow one to infer whether there actually was a
bar of the preferred orientation somewhere in the complex cell’s receptive field, as the output sig-
nal is a sum over all the afferents. That is, feature specificity is lost. In the MAX case, however,
the response would be determined by the most strongly activated afferent and hence would signal
the best match of any part of the stimulus to the afferents’ preferred feature. This ideal example
suggests that the MAX mechanism is capable of providing a more robust response in the case of
recognition in clutter or with multiple stimuli in the receptive field (cf. below). Note that a SUM re-
sponse with saturating nonlinearities on the inputs seems too brittle since it requires a case-by-case
adjustment of the parameters, depending on the activity level of the afferents.
Equally critical is the inability of the SUM mechanism to achieve size invariance: Suppose that
the afferents to a “complex” cell (which now could be a cell in V4 or IT, for instance) show some
degree of size and position invariance. If the “complex” cell were now stimulated with the same
object but at subsequently increasing sizes, an increasing number of afferents would become ex-
cited by the stimulus (unless the afferents showed no overlap in space or scale) and consequently
the excitation of the “complex” cell would increase along with the stimulus size, even though the
afferents show size invariance (this is borne out in simulations using a simplified two-layer model
[79])! For the MAX mechanism, however, cell response would show little variation even as stimulus
size increased since the cell’s response would be determined just by the best-matching afferent.
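The two failure modes of SUM pooling discussed here (mixing in clutter, and growing with stimulus size) can be made concrete in a few lines. This is an illustrative sketch with made-up activity values, not the thesis simulation code:

```python
import numpy as np

def pool(afferents, mode):
    """Idealized complex-cell pooling: equal-weight SUM or MAX."""
    a = np.asarray(afferents, dtype=float)
    return float(a.sum()) if mode == "SUM" else float(a.max())

# Clutter: one afferent matches the preferred feature well (0.9);
# added clutter drives the other afferents weakly.
clean   = [0.9, 0.0, 0.0, 0.0]
clutter = [0.9, 0.3, 0.4, 0.2]
assert pool(clutter, "MAX") == pool(clean, "MAX")   # still signals best match
assert pool(clutter, "SUM") >  pool(clean, "SUM")   # feature specificity lost

# Scale: a growing stimulus excites more size-tuned afferents with
# overlapping tuning, even though each afferent is itself "invariant".
small = [0.8, 0.1, 0.0, 0.0]
large = [0.8, 0.8, 0.8, 0.1]
assert pool(large, "SUM") >  pool(small, "SUM")     # SUM grows with size
assert pool(large, "MAX") == pool(small, "MAX")     # MAX stays flat
```

In both cases the MAX response tracks only the best-matching afferent, which is exactly the robustness property argued for in the text.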
These considerations (supported by quantitative simulations of the model, described below)
suggest that a sensible way of pooling responses to achieve invariance is via a nonlinear MAX
function, that is, by implicitly scanning (see Discussion) over afferents of the same type that differ
in the parameter of the transformation to which the response should be invariant (e.g., feature
size for scale invariance), and then selecting the best-matching of those afferents. Note that these
considerations apply to the case where different afferents to a pooling cell, e.g., those looking at
different parts of space, are likely to be responding to different objects (or different parts of the
same object) in the visual field (as is the case with cells in lower visual areas with their broad shape
tuning). Here, pooling by combining afferents would mix up signals caused by different stimuli.
However, if the afferents are specific enough to only respond to one pattern, as one expects in the
[Figure 2-3, panels (a) and (b): bar plots of normalized response (0–1); panel (b) bars labeled MAX, expt., and SUM.]
Figure 2-3: Illustration of the highly nonlinear shape tuning properties of the MAX mechanism. (a) Experimentally observed responses of IT cells obtained using a “simplification procedure” [113] designed to determine “optimal” features (responses normalized so that the response to the preferred stimulus is equal to 1). In that experiment, the cell originally responds quite strongly to the image of a “water bottle” (leftmost object). The stimulus is then “simplified” to its monochromatic outline, which increases the cell’s firing, and further to a paddle-like object consisting of a bar supporting an ellipse. While this object evokes a strong response, the bar or the ellipse alone produce almost no response at all (figure used by permission). (b) Comparison of experiment and model. Green bars show the responses of the experimental neuron from (a). Blue and red bars show the response of a model neuron tuned to the stem-ellipsoidal base transition of the preferred stimulus. The model neuron is at the top of a simplified version of the model shown in Fig. 2-2, in which there are only two types of S1 features at each position in the receptive field, tuned to the left and right side of the transition region, resp., which feed into C1 units that pool using a MAX function (blue bars) or a SUM function (red bars). The model neuron is connected to these C1 units so that its response is maximal when the experimental neuron’s preferred stimulus is in its receptive field.
final stages of the model, then pooling by using a weighted sum, as in the RBF network [68], where
VTUs tuned to different viewpoints were combined to interpolate between the stored views, is
advantageous.
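The weighted-sum stage just mentioned can be sketched as a Gaussian radial-basis-function unit whose center is the activity pattern stored from the training view. The width sigma = 0.2 and the activity vectors below are arbitrary illustrative choices, not parameters from the thesis:

```python
import numpy as np

def vtu_response(c2_pattern, center, sigma=0.2):
    """Gaussian RBF view-tuned unit: responds maximally when the afferent
    activity pattern matches the stored training-view pattern."""
    d2 = float(np.sum((np.asarray(c2_pattern) - np.asarray(center)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

center = np.array([0.9, 0.1, 0.7, 0.3])   # pattern evoked by the training view
assert vtu_response(center, center) == 1.0
# A nearby view still drives the unit; a very different pattern barely does.
near = vtu_response(center + 0.05, center)
far  = vtu_response([0.1, 0.9, 0.2, 0.8], center)
assert 0.0 < far < near < 1.0
```

Several such units tuned to different stored views can then be combined by a weighted sum to interpolate between views, as in the RBF network [68].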
MAX-like mechanisms at some stages of the circuitry appear to be compatible with recent neu-
rophysiological data. For instance, it has been reported [89] that when two stimuli are brought into
the receptive field of an IT neuron, that neuron’s response appears to be dominated by the stimulus
that produces a higher firing rate when presented in isolation to the cell — just as expected if a
MAX-like operation is performed at the level of this neuron or its afferents. Theoretical investiga-
tions into possible pooling mechanisms for V1 complex cells also support a maximum-like pooling
mechanism (K. Sakai & S. Tanaka, Soc. Neurosci. Abs., 23, 453, 1997). Additional indirect support
for a MAX mechanism comes from studies using a “simplification procedure” [113] or “complexity
reduction” [47] to determine the preferred features of IT cells, i.e., the stimulus components that are
responsible for driving the cell. These studies commonly find a highly nonlinear tuning of IT cells
(Fig. 2-3 (a)). Such tuning is compatible with the MAX response function (Fig. 2-3 (b), blue bars).
Note that a linear model (Fig. 2-3 (b), red bars) cannot reproduce this strong response change for
small changes in the input image.
In our model of view-tuned units (Fig. 2-2), the two types of operations, scanning and template
matching, are combined in a hierarchical fashion to build up complex, invariant feature detectors
from small, localized, simple cell-like receptive fields in the bottom layer which receive input from
the model “retina.” There need not be a strict alternation of these two operations: connections can
skip levels in the hierarchy, as in the direct C1→C2 connections of the model in Fig. 2-2.
The question remains whether the proposed model can indeed achieve response selectivity and
invariance compatible with the results from physiology. To investigate this question, we looked at
the invariance properties of 21 view-tuned units in the model, each tuned to a view of a different,
randomly selected paperclip, as used in the experiment [49].
Figure 2-4 shows the response of one model view-tuned unit to 3D rotation, scaling and trans-
lation around its preferred view (see Methods). The unit responds maximally to the training view,
with the response gradually falling off as the stimulus is transformed away from the training view.
As in the experiment, we can determine the invariance range of the VTU by comparing the response
to the preferred stimulus to the responses to the 60 distractors. The invariance range is then defined
as the range over which the model unit’s response is greater than to any of the distractor objects.
Thus, the model VTU shown in Fig. 2-4 shows rotation invariance of 24°, scale invariance of 2.6
octaves and translation invariance of 4.7° of visual angle. Averaging over all 21 units, we obtain
average rotation invariance over 30.9°, scale invariance over ��� octaves and translation invariance
over 4.6°.
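The invariance criterion just applied (response to the transformed target must exceed the response to every distractor) can be sketched as follows. The Gaussian tuning curve and the 0.4 distractor level are toy values, chosen so that this example happens to yield a 24-degree range like the unit described in the text:

```python
import numpy as np

def invariance_range(responses, params, distractor_max):
    """Width of the contiguous region around the peak over which the
    unit's response exceeds its strongest distractor response."""
    above = np.asarray(responses) > distractor_max
    lo = hi = int(np.argmax(responses))
    while lo > 0 and above[lo - 1]:
        lo -= 1
    while hi < len(above) - 1 and above[hi + 1]:
        hi += 1
    return params[hi] - params[lo]

angles = np.arange(50, 131, 4)               # tested viewpoints, degrees
resp = np.exp(-((angles - 90) / 15.0) ** 2)  # toy rotation tuning curve
assert invariance_range(resp, angles, distractor_max=0.4) == 24
```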
Units show invariance around the training view, over a range in good agreement with the exper-
imentally observed values. Some units (5/21), an example of which is given in Fig. 2-4 (d), show
tuning also for pseudo-mirror views (obtained by rotating the preferred paperclip by 180° in depth,
which produces a pseudo-mirror view of the object due to the paperclips’ minimal self-occlusion),
as observed in some experimental neurons [49].
While the simulation and experimental data presented so far dealt with settings in which one object
was presented in isolation, this is rarely the case in normal object recognition. More commonly,
the object to be recognized is situated in front of some background
or appears together with other objects, all of which are to be ignored if the object is to be recognized
successfully. More precisely, in the case of multiple objects in the receptive field, the responses of
the afferents feeding into a VTU tuned to a certain object should be affected as little as possible by
the presence of other “clutter objects.”
The MAX response function posited above for the pooling mechanism to achieve invariance has
the right computational properties to perform recognition in clutter: If the VTU’s preferred object
strongly activates the VTU’s afferents, then it is unlikely that other objects will interfere, as they
tend to activate the afferents less and hence will not usually influence the response due to the MAX
response function. In some cases (such as when there are occlusions of the preferred feature, or one
of the “wrong” afferents has a higher activation) clutter, of course, can affect the value provided
[Figure 2-4 plots: (a) response vs. stimulus size (0–160 pixels), with an inset showing responses to the 60 distractors; (b) response vs. viewing angle (50°–130°); (c) response vs. x and y translation (±4 deg); (d) response vs. viewing angle (0°–270°).]
Figure 2-4: Responses of a sample model neuron to different transformations of its preferred stimulus. The different panels show the same neuron’s response to (a) varying stimulus sizes (inset shows the response to 60 distractor objects, selected randomly from the paperclips used in the physiology experiments [49]), (b) rotation in depth and (c) translation. Training size was 64 × 64 pixels, corresponding to 2° of visual angle. (d) shows another neuron’s response to pseudo-mirror views (cf. text), with the dashed line indicating the neuron’s response to the “best” distractor.
by the MAX mechanism, thereby reducing the quality of the match at the final stage and thus
the strength of the VTU response. It is clear that to achieve the highest robustness to clutter, a
VTU should only receive input from cells that are strongly activated (i.e., that are relevant to the
definition of the object) by its preferred stimulus.
In the version of the model described so far, the penultimate layer contained only 10 cells corre-
sponding to 10 different features, which turned out to be sufficient to achieve invariance properties
as found in the experiment. Each VTU in the top layer was connected to all the afferents and hence
robustness to clutter is expected to be relatively low. Note that in order to connect a VTU to only the
subset of the intermediate feature detectors it receives strong input from, the number of afferents
should be large enough to achieve the desired response specificity.
The straightforward solution is to increase the number of features. Even with a fixed number of
different features in S1, the dictionary of S2 features can be expanded by increasing the number and
type of afferents to individual S2 cells (see Methods). In this “many feature” version of the model,
the invariance ranges for a low number of afferents are already comparable to the experimental
ranges — if each VTU is connected to the 40 (out of 256) C2 cells that are most strongly excited
by its preferred stimulus, model VTUs show an average scale invariance over ��� octaves, rotation
invariance over 36.2° and translation invariance over 4.4°. For the maximum of 256 afferents to
each cell, cells are rotation invariant over an average of 47°, scale invariant over 2.4 octaves and
translation invariant over 4.7°.
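The sparse connectivity described here amounts to a top-k selection over the C2 activity pattern evoked by the preferred stimulus. A minimal sketch with random stand-in data (not the thesis code):

```python
import numpy as np

def strongest_afferents(c2_activity, k=40):
    """Indices of the k C2 units most strongly excited by the VTU's
    preferred stimulus (40 of 256 in the text)."""
    return np.sort(np.argsort(c2_activity)[::-1][:k])

rng = np.random.default_rng(0)
c2 = rng.random(256)                       # stand-in C2 activity pattern
idx = strongest_afferents(c2, k=40)
assert len(idx) == 40
assert c2[idx].min() >= np.sort(c2)[-40]   # truly the 40 strongest units
```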
Simulations show [81] that this model is capable of performing recognition in context: Using
displays as inputs that contain the neuron’s preferred clip as well as another, distractor, clip, the
model is able to correctly recognize the preferred clip in 90% of the cases (for 40/256 afferents to
each neuron, the maximum rate is 94% for 18 afferents, dropping to 55% for 256/256 afferents,
compared to 40% in the original version of the model with 10 C2 units), i.e., the addition of the
second clip interfered with the activation caused by the first clip alone so much that in 10% of the
cases the response to the two clip display containing the preferred clip fell below the response to
one of the distractor clips. This reduction of the response to the two-stimulus display compared to
the response to the stronger stimulus alone has also been found in experimental studies [86, 89].
The question of object recognition in the presence of a background object was explored ex-
perimentally in a recent study [54], where a monkey had to discriminate (polygonal) foreground
objects irrespective of the (polygonal) background they appeared with. Recordings of IT neurons
showed that for the stimulus/background condition, neuronal response on average was reduced
to a quarter of the response to the foreground object alone, while the monkey’s behavioral perfor-
mance dropped much less. This is compatible with simulations in the model [81] that show that
even though a unit’s firing rate is strongly affected by the addition of the background pattern, it is
still in most cases well above the firing rate evoked by distractor objects, allowing the foreground
[Figure 2-5: (a) example scrambled stimulus; (b) average response (0–1) vs. number of tiles (1, 4, 16, 64, 256).]
Figure 2-5: Average responses of neurons in the many-feature version of the model to scrambled stimuli. (a) Example of a scrambled stimulus. The images (160 × 160 pixels) were created by subdividing the preferred stimulus of each neuron into 4, 16, 64, and 256, resp., “tiles” and randomly shuffling the tiles to create a scrambled image. (b) Average response of the 21 model neurons (with 40/256 afferents, as above) to the scrambled stimuli (solid blue curve), in comparison to the average normalized responses of IT neurons to scrambled stimuli (scrambled pictures of trees) reported in a very recent study [108] (dashed green curve).
object to be recognized successfully.
Our model relies on decomposing images into features. Should it then be fooled into confusing
a scrambled image with the unscrambled original? Superficially, one may be tempted to guess that
scrambling an image in pieces larger than the features should indeed fool the model. Simulations
(see Fig. 2-5) show that this is not the case. The reason lies in the large dictionary of filters/features
used that makes it practically impossible to scramble the image in such a way that all features are
preserved, even for a low number of features. Responses of model units drop precipitously as
the image is scrambled into progressively finer pieces, as confirmed very recently in a physiology
experiment [108] of which we became aware after obtaining this prediction from the model.
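The scrambling procedure used for these stimuli is straightforward to sketch. This is illustrative code, with a random image standing in for a preferred stimulus:

```python
import numpy as np

def scramble(image, n_tiles, rng):
    """Cut a square image into n_tiles square tiles (4, 16, 64, or 256)
    and reassemble them in random order, as for the Fig. 2-5 stimuli."""
    side = int(round(np.sqrt(n_tiles)))
    h = image.shape[0] // side
    tiles = [image[r*h:(r+1)*h, c*h:(c+1)*h]
             for r in range(side) for c in range(side)]
    out = np.empty_like(image)
    for i, j in enumerate(rng.permutation(side * side)):
        r, c = divmod(i, side)
        out[r*h:(r+1)*h, c*h:(c+1)*h] = tiles[j]
    return out

rng = np.random.default_rng(1)
img = rng.random((160, 160))     # model "retina" resolution
scr = scramble(img, 16, rng)
# Pixel content (and hence features smaller than a tile) is preserved,
# but the global arrangement is destroyed:
assert np.array_equal(np.sort(scr.ravel()), np.sort(img.ravel()))
```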
2.3 Discussion
We briefly outline the computational roots of the hierarchical model we described, how the MAX
operation could be implemented by cortical circuits and remark on the role of features and invari-
ances in the model.
A key operation in several recent computer vision algorithms for the recognition and classifi-
cation of objects [87, 92] is to scan a window across an image, through both position and scale, in
order to analyze at each step a subimage – for instance by providing it to a classifier that decides
whether the subimage represents the object of interest. Such algorithms have been successful in
achieving invariance to image plane transformations such as translation and scale. In addition, this
brute force scanning strategy eliminates the need to segment the object of interest before recogni-
tion: segmentation, even in complex and cluttered images, is routinely achieved as a byproduct of
recognition. The computational assumption that originally motivated the model described in this
paper was indeed that a MAX-like operation may represent the cortical equivalent of the “window
of analysis” in machine vision to scan through and select input data. Unlike a centrally controlled
sequential scanning operation, a mechanism like the MAX operation that locally and automatically
selects a relevant subset of inputs seems biologically plausible. A basic and pervasive operation
in many computational algorithms — not only in computer vision — is the search and selection
of a subset of data. Thus it is natural to speculate that a MAX-like operation may be replicated
throughout the cortex.
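The analogy can be made concrete: taking the best classifier score over all window positions is itself a MAX over position-shifted copies of the same detector. A toy sketch, with mean brightness standing in for a real window classifier:

```python
import numpy as np

def scan_and_max(image, score_fn, window=32, step=8):
    """Brute-force scanning: evaluate a window classifier at every
    position and keep the best score -- functionally a MAX over
    position-shifted 'afferents'."""
    h, w = image.shape
    return max(score_fn(image[r:r+window, c:c+window])
               for r in range(0, h - window + 1, step)
               for c in range(0, w - window + 1, step))

# The bright 16x16 patch is "found" (best score 256/1024 = 0.25)
# regardless of where it sits in the image -- translation invariance
# without any prior segmentation.
img = np.zeros((64, 64))
img[40:56, 8:24] = 1.0
assert abs(scan_and_max(img, lambda win: float(win.mean())) - 0.25) < 1e-12
```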
Simulations of a simplified two-layer version of the model [79], using soft-maximum approxima-
tions to the MAX operation (see Methods) in which the strength of the nonlinearity can be adjusted
by a parameter, show that its basic properties are preserved and structurally robust. But how is an
approximation of the MAX operation realized by neurons? It seems that it could be implemented
by several different, biologically plausible circuitries [1, 13, 17, 32, 44]. The most likely hypothesis
is that the MAX operation arises from cortical microcircuits of lateral, possibly recurrent, inhibition
between neurons in a cortical layer. An example is provided by the circuit proposed for the gain-
control and relative motion detection in the visual system of the fly [76], based on feedforward
(or recurrent) shunting presynaptic (or postsynaptic) inhibition by “pool” cells. One of its key
elements, in addition to shunting inhibition (an equivalent operation may be provided by linear
inhibition deactivating NMDA receptors), is a nonlinear transformation of the individual signals
due to synaptic nonlinearities or to active membrane properties. The circuit performs a gain control
operation and — for certain values of the parameters — a MAX-like operation. “Softmax” circuits
have been proposed in several recent studies [34, 45, 61] to account for similar cortical functions.
Together with adaptation mechanisms (underlying very short-term depression [1]), the circuit may
be capable of pseudo-sequential search in addition to selection.
Our novel claim here is that a MAX-like operation is a key mechanism for object recognition
in the cortex. The model described in this paper — including the stage from view-tuned to view-
invariant units [68] — is a purely feedforward hierarchical model. Backprojections – well known
to exist abundantly in cortex and playing a key role in other models of cortical function [59, 75] –
are not needed for its basic performance but are probably essential for the learning stage and for
known top-down effects — including attentional biases [77] — on visual recognition, which can be
naturally grafted into the inhibitory softmax circuits (see [61]) described earlier.
In our model, recognition of a specific object is invariant for a range of scales (and positions) af-
ter training with a single view at one scale, because its representation is based on features invariant
to these transformations. View invariance on the other hand requires training with several views
[68] because individual features sharing the same 2D appearance can transform very differently
under 3D rotation, depending on the 3D structure of the specific object. Simulations show that the
model’s performance is not specific to the class of paperclip objects: recognition results are similar
for, e.g., computer-rendered images of cars (and other objects).
From a computational point of view the class of models we have described can be regarded
as a hierarchy of conjunctions and disjunctions. The key aspect of our model is to identify the
disjunction stage with the build-up of invariances and to do it through a MAX-like operation. At
each conjunction stage the complexity of the features increases and at each disjunction stage so does
their invariance. At the last level (the C2 layer in this paper) it is only the presence and strength
of individual features, and not their relative geometry in the image, that matters. The dictionary
of features at that stage is overcomplete, so that the activities of the units measuring each feature
strength, independently of their precise location, can still yield a unique signature for each visual
pattern (cf. the SEEMORE system [52]).
The architecture we have described shows that this approach is consistent with the available experimental data and casts it into a class of models that is a natural extension of the hierarchical models first proposed by Hubel and Wiesel.
2.4 Methods
Basic model parameters. Patterns on the model “retina” (of 160 × 160 pixels — which corresponds to a 5° receptive field size (the literature [41] reports an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) are first filtered through a layer (S1) of simple cell-like receptive fields (first derivatives of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with standard deviations of 1.75 to 7.25 pixels in steps of 0.5 pixels; S1 filter responses were rectified dot products with the image patch falling into their receptive field, i.e., the output s1_j of an S1 cell with preferred stimulus w_j whose receptive field covers an image patch I_j is s1_j = |w_j · I_j|). Receptive field (RF) centers densely sample the input retina. Cells in the next (C1) layer each pool S1 cells (using the MAX response function, i.e., the output c1_i of a C1 cell with afferents s1_j is c1_i = max_j s1_j) of the same orientation over eight pixels of the visual field in each dimension and over all scales.
This pooling range was chosen for simplicity — invariance properties of cells were robust for different choices
of pooling ranges (cf. below). Different C1 cells were then combined in higher layers, either by combining
C1 cells tuned to different features to give S2 cells responding to co-activations of C1 cells tuned to different
orientations or to yield C2 cells responding to the same feature as the C1 cells but with bigger receptive fields.
In the simple version illustrated here, the S2 layer contains six features (all pairs of orientations of C1 cells looking at the same part of space) with a Gaussian transfer function (σ = 1, centered at 1, i.e., the response s2_k of an S2 cell receiving input from C1 cells c1_m, c1_n with receptive fields in the same location but responding to different orientations is s2_k = exp(−((c1_m − 1)² + (c1_n − 1)²)/(2σ²))), yielding a total of 10 cells in the C2 layer.
Here, C2 units feed into the view-tuned units, but in principle, more layers of S and C units are possible.
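The three response functions defined above can be transcribed directly. This is an illustrative transcription, not the original simulation code:

```python
import numpy as np

def s1_response(w, patch):
    """S1: rectified dot product of a (zero-sum, unit-norm) filter w
    with the image patch in its receptive field, s1 = |w . I|."""
    return abs(float(np.sum(np.asarray(w) * np.asarray(patch))))

def c1_response(s1_afferents):
    """C1: MAX over S1 afferents of one orientation, c1 = max_j s1_j."""
    return max(s1_afferents)

def s2_response(c1_m, c1_n, sigma=1.0):
    """S2: Gaussian tuned to co-activation of two C1 units of different
    orientation, centered at (1, 1)."""
    return float(np.exp(-((c1_m - 1.0) ** 2 + (c1_n - 1.0) ** 2)
                        / (2.0 * sigma ** 2)))

assert s2_response(1.0, 1.0) == 1.0     # both component features present
assert s2_response(0.0, 1.0) < s2_response(0.5, 1.0) < 1.0
assert c1_response([0.2, 0.7, 0.4]) == 0.7
```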
In the version of the model we have simulated, object specific learning occurs only at the level of the
synapses on the view-tuned cells at the top. More complete simulations will have to account for the effect of
visual experience on the exact tuning properties of other cells in the hierarchy.
Testing the invariance of model units. View-tuned units in the model were generated by recording the activity of units in the C2 layer feeding into the VTUs to each one of the 21 paperclip views and then setting the connection weights of each VTU, i.e., the center of the Gaussian associated with the unit, resp., to the corresponding activation. For rotation, viewpoints from 50° to 130° were tested (the training view was arbitrarily set to 90°) in steps of 4°. For scale, stimulus sizes from 16 to 160 pixels in half-octave steps (except for the last step, which was from 128 to 160 pixels), and for translation, independent translations of ���� pixels along each axis in steps of 16 pixels (i.e., exploring a plane of ����� ��� pixels) were used.
“Many feature” version. To increase the robustness to clutter of model units, the number of features in S2 was increased: Instead of the previous maximum of two afferents of different orientation looking at the same patch of space as in the version described above, each S2 cell now received input from four neighboring C1 units (in a 2 × 2 arrangement) of arbitrary orientation, giving a total of 4^4 = 256 different S2 types and finally 256 C2 cells as potential inputs to each view-tuned cell (in simulations, top-level units were sparsely connected to a subset of C2-layer units to gain robustness to clutter, cf. Results). As S2 cells now combined C1 afferents with receptive fields at different locations, and features a certain distance apart at one scale change their separation as the scale changes, pooling at the C1 level was now done in several scale bands, each of roughly a half-octave width in scale space (filter standard deviation ranges were 1.75–2.25, 2.75–3.75, 4.25–5.25, and 5.75–7.25 pixels, resp.), with the spatial pooling range in each scale band chosen accordingly (over neighborhoods of � � �, � � �, � , and �� � ��, respectively — note that system performance was robust with respect to the pooling ranges; simulations with neighborhoods of twice the linear size in each scale band produced comparable results, with a slight drop in the recognition of overlapping stimuli, as expected), as a simple way to improve the scale invariance of composite feature detectors in the C2 layer. Also, centers of C1 cells were chosen so that RFs overlapped by half a RF size in each dimension. A more principled way
would be to learn the invariant feature detectors, e.g., using the trace rule [23]. The straightforward connection
patterns used here, however, demonstrate that even a simple model shows tuning properties comparable to
the experiment.
Softmax approximation. In a simplified two-layer version of the model [79] we investigated the effects of approximations to the MAX operation on recognition performance. The model contained only one pooling stage, C1, where the strength of the pooling nonlinearity could be controlled by a parameter, p. There, the output c1_i of a C1 cell with afferents x_j was

c1_i = Σ_j [ exp(p |x_j|) / Σ_k exp(p |x_k|) ] x_j,

which performs a linear summation (scaled by the number of afferents) for p = 0 and the MAX operation for p → ∞.
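This pooling rule is easy to check numerically in its two limits. A minimal sketch, not the code of [79]:

```python
import numpy as np

def softmax_pool(x, p):
    """Soft-maximum pooling: each afferent x_j is weighted by
    exp(p|x_j|) / sum_k exp(p|x_k|). p = 0 gives the mean (linear
    summation scaled by the number of afferents); large p approaches
    the MAX operation."""
    x = np.asarray(x, dtype=float)
    w = np.exp(p * np.abs(x))
    return float(np.sum(w / w.sum() * x))

x = [0.2, 0.9, 0.4]
assert np.isclose(softmax_pool(x, 0.0), np.mean(x))    # linear limit
assert np.isclose(softmax_pool(x, 100.0), max(x))      # MAX limit
```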
2.5 Acknowledgments
Supported by grants from ONR, DARPA, NSF, ATR, and Honda. M.R. is supported by a Merck/MIT
Fellowship in Bioinformatics. T.P. is supported by the Uncas and Helen Whitaker Chair at the
Whitaker College, MIT. We are grateful to H. Bülthoff, F. Crick, B. Desimone, R. Hahnloser, C. Koch,
N. Logothetis, E. Miller, J. Pauls, D. Perrett, J. Reynolds, T. Sejnowski, S. Seung, and R. Vogels for
very useful comments and for reading earlier versions of this manuscript. We thank J. Pauls for
analyzing the average invariance ranges of his IT neurons and K. Tanaka for the permission to
reproduce Fig. 2-3 (a).
Chapter 3
Are Cortical Models Really Bound by
the “Binding Problem”?
Abstract
The usual description of visual processing in cortex is an extension of the simple to complex
hierarchy postulated by Hubel and Wiesel — a feedforward sequence of more and more complex
and invariant features. The capability of this class of models to perform higher level visual
processing such as viewpoint-invariant object recognition in cluttered scenes has been questioned
in recent years by several researchers, who in turn proposed an alternative class of models based
on the synchronization of large assemblies of cells, within and across cortical areas. The main
implicit argument for this novel and controversial view was the assumption that hierarchical
models cannot deal with the computational requirements of high level vision and suffer from
the so-called “binding problem”. We review the present situation and discuss theoretical and
experimental evidence showing that these perceived weaknesses of hierarchical models are unfounded.
In particular, we show that recognition of multiple objects in cluttered scenes, arguably among
the most difficult tasks in vision, can be done in a hierarchical feedforward model.
3.1 Introduction: Visual Object Recognition
Two problems make object recognition difficult:
1. The segmentation problem: Visual scenes normally contain multiple objects. To recognize in-
dividual objects, features must be isolated from the surrounding clutter and extracted from
the image, and the feature set must be parsed so that the different features are assigned to the
correct object. The latter problem is commonly referred to as the “Binding Problem” [110].
2. The invariance problem: Objects have to be recognized under varying viewpoints, lighting
conditions etc.
Interestingly, the human brain solves both problems quickly and with ease. Thorpe et al. [101]
report that visual processing in an object detection task in complex visual scenes can be achieved
in under 150 ms, which is on the order of the latency of the signal transmission from the retina
to inferotemporal cortex (IT), the highest area in the ventral visual stream thought to have a key
role in object recognition [103]; see also [72]. This impressive processing speed presents a strong
constraint for any model of object recognition.
3.2 Models of Visual Object Recognition and the Binding Problem
Hubel and Wiesel [37] were the first to postulate a model of visual object representation and recog-
nition. They recorded from simple and complex cells in the primary visual cortices of cats and mon-
keys and found that while both types preferentially responded to bars of a certain orientation, the
former had small receptive fields with a phase-dependent response while the latter had bigger re-
ceptive fields and showed no phase-dependence. This observation led them to hypothesize that
complex cells receive input from several simple cells. Continuing this model in a straightforward
fashion, they suggested [36] that the visual system is composed of a hierarchy of visual areas, from
simple cells all the way up to “higher order hypercomplex cells.”
Later studies [7] of macaque inferotemporal cortex (IT) described neurons tuned to views of
complex objects such as a face, i.e., the cells discharged strongly to a face seen from a specific
viewpoint but very little or not at all to other objects. A key property of these cells was their
scale and translation invariance, i.e., the robustness of their firing to stimulus transformations such as
changes in size or position in the visual field.
These findings inspired various models of visual object recognition such as Fukushima’s Neocog-
nitron [25] or, later, Perrett and Oram’s [64] outline of a model of shape processing, and Wallis and
Rolls’ VisNet [111], all of which share the basic idea of the visual system as a feedforward process-
ing hierarchy where invariance ranges and complexity of preferred features grow as one ascends
through the levels.
Models of this type prompted von der Malsburg [109] to formulate the binding problem. His
claim was that visual representations based on spatially invariant feature detectors were ambigu-
ous: “As generalizations are performed independently for each feature, information about neigh-
borhood relations and relative position, size and orientation is lost. This lack of information can
lead to the inability to distinguish between patterns that are composed of the same set of invariant
features. . . ” [110]. Moreover, as a visual scene containing multiple objects is represented by a set of
feature activations, a second problem lies in “singling out appropriate groups from the large back-
ground of possible combinations of active neurons” [110]. These problems would manifest them-
selves in various phenomena such as hallucinations (the feature sets activated by objects actually
present in the visual scene combine to yield the activation pattern characteristic of another object)
and the figure-ground problem (the inability to correctly assign image features to foreground ob-
ject and background), leading von der Malsburg to postulate the necessity of a special mechanism,
the synchronous oscillatory firing of ensembles of neurons, to bind features belonging to one object
together.
One approach to avoid these problems was presented by Olshausen et al. [62]: Instead of trying
to process all objects simultaneously, processing is limited to one object in a certain part of space
at a time, e.g., through “focussing attention” on a region of interest in the visual field, which is
then routed through to higher visual areas, ignoring the remainder of the visual field. The control
signal for the input selection in this model is thought to be provided in the form of the output of a
“blob-search” system that identifies possible candidates in the visual scene for closer examination.
While this top-down approach to circumventing the binding problem has intuitive appeal and is compatible with physiological studies that report top-down attentional modulation of receptive field properties (see the article by Reynolds & Desimone in this issue, or the recent study by Connor et al. [14]), such a sequential approach seems difficult to reconcile with the apparent speed with which object recognition can proceed even in very complex scenes containing many objects [72, 101]. It is also incompatible with reports of parallel processing of visual scenes, as observed in pop-out experiments [102], suggesting that object recognition does not depend only on explicit top-down selection in all situations.
A more head-on approach to the binding problem was taken in other studies that have called
into question the assumption that representations based on sets of spatially invariant feature detec-
tors are inevitably ambiguous. Starting with Wickelgren [114] in the context of speech recognition,
several studies have proposed how coding an object through a set of intermediate features, made
up of local arrangements of simpler features (e.g., using letter pairs, or higher order combinations,
instead of individual letters to code words — for instance, the word “tomaso” could be confused
with the word “somato” if both are coded by the sets of letters they are made up of; this ambiguity
is resolved, however, if they are represented through letter pairs) can sufficiently constrain the rep-
resentation to uniquely code complex objects without retaining global positional information (see
Mozer [58] for an elaboration of this idea and an implementation in the context of word recogni-
tion). The capabilities of such a representation based on spatially-invariant receptive fields were
recently analyzed in detail by Mel & Fiser [53] for the example domain of English text.
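The contrast between coding by single invariant features and by intermediate feature combinations can be made concrete in a few lines. The sketch below is purely illustrative (the helper names are our own), using the "tomaso"/"somato" example from the text:

```python
def bag_of_letters(word):
    # Unordered set of single "invariant features": all order information is lost.
    return frozenset(word)

def letter_pairs(word):
    # Unordered set of adjacent letter pairs: local arrangement is retained.
    return frozenset(word[i:i + 2] for i in range(len(word) - 1))

# Coded by individual letters, "tomaso" and "somato" are indistinguishable...
assert bag_of_letters("tomaso") == bag_of_letters("somato")
# ...but their letter-pair sets differ ("as" vs. "at", among shared pairs), so
# the intermediate-feature code disambiguates them without any global
# positional information.
assert letter_pairs("tomaso") != letter_pairs("somato")
```

Note that most pairs ("to", "om", "ma", "so") are still shared; a single differing local arrangement suffices to keep the codes distinct.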
In the visual domain, Mel [52] recently presented a model to perform invariant recognition of
a high number (100) of objects of different types, using a representation based on a large number
of feature channels. While the model performed surprisingly well for a variety of transformations,
recognition performance depended strongly on color cues, and did not seem as robust to scale
changes as experimental neurons [49]. Perrett & Oram [65] have recently outlined a conceptual
model based on very similar ideas of how a representation based on feature combinations could in
theory avoid the “Binding Problem”, e.g., by coding a face through a set of detectors for combina-
tions of face parts such as eye-nose or eyebrow-hairline. What has been lacking so far, however, is
a computational implementation quantitatively demonstrating that such a model can actually per-
form “real-world” subordinate visual object recognition to the extent observed in behavioral and
physiological experiments [48, 49, 54, 89], where effects such as scale changes, occlusion and over-
lap pose additional problems not found in an idealized text environment. In particular, unlike in
the text domain where the input consists of letter strings and the extraction of features (letter com-
binations) from the input is therefore trivial, the crucial task of invariant feature extraction from
the image is nontrivial for scenes containing complex shapes, especially when multiple objects are
present.
We have developed a hierarchical feedforward model of object recognition in cortex, described
in [82], as a plausibility proof that such a model can account for several properties of IT cells, in
particular the invariance properties of IT cells found by Logothetis et al. [49]. In the following,
we will show that such a simple model can perform invariant recognition of complex objects in
cluttered scenes and is compatible with recent physiological studies. This is a plausibility proof
that complex oscillation-based mechanisms are not necessarily required for these tasks and that the
binding problem seems to be a problem for only some models of object recognition.
3.3 A Hierarchical Model of Object Recognition in Cortex
Studies of receptive field properties along the ventral visual stream in the macaque, from primary
visual cortex, V1, to anterior IT report an overall trend of an increase of average feature complexity
and receptive field size throughout the stream [41]. While simple cells in V1 have small localized
Figure 3-1: (a) Cartoon of the Poggio and Edelman model [68] of view-based object recognition. The gray ovals correspond to view-tuned units that feed into a view-invariant unit (open circle). (b) Tuning curves of the view-tuned (gray) and the view-invariant units (black).
receptive fields and respond preferentially to simple shapes like bars, cells in anterior IT have been
found to respond to views of complex objects while showing great tolerance to scale and position
changes. Moreover, some IT cells seem to respond to objects in a view-invariant manner [6, 49, 66].
Our model follows this general framework. Previously, Poggio and Edelman [68] presented
a model of how view-invariant cells could arise from view-tuned cells (Fig. 3-1). However, they
did not describe any model of how the view-tuned units (VTUs) could come about. We have re-
cently developed a hierarchical model that closes this gap and shows how VTUs tuned to complex
features can arise from simple cell-like inputs. A detailed description of our model can be found
in [82] (for preliminary accounts refer to [79, 80], and also to [43]). We briefly review here some
of its main properties. The central idea of the model is that invariance to scaling and translation
and robustness to clutter on one hand, and feature complexity on the other hand require different
transfer functions, i.e., mechanisms by which a neuron combines its inputs to arrive at an output
value: While for the latter a weighted sum of different features, which makes the neuron respond
preferentially to a specific activity pattern over its afferents, is a suitable transfer function, increas-
ing invariance requires a different transfer function that pools over different afferents tuned to the
same feature but transformed to different degrees (e.g., at different scales to achieve scale invari-
ance). A suitable pooling function (for a computational justification, see [82]) is a so-called MAX
function, where the output of the neuron is determined by the strongest afferent, thus performing a “scan-
ning” operation over afferents tuned to different positions and scales. This is similar to the original
Hubel and Wiesel model of a complex cell receiving input from simple cells at different locations
to achieve phase-invariance.
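The two transfer functions can be stated compactly. The following sketch is illustrative only, with made-up activity values, and contrasts the template-matching and MAX-pooling operations:

```python
import numpy as np

def template_match(afferents, weights):
    # "S"-type operation: a weighted sum, maximal for a specific
    # activity pattern over the afferents (builds feature complexity).
    return float(np.dot(weights, afferents))

def max_pool(afferents):
    # "C"-type operation: the strongest afferent alone determines the
    # output, scanning over position- or scale-shifted copies of a feature.
    return float(np.max(afferents))

# The same feature detected at any of four pooled positions drives the MAX
# unit equally strongly, which is what yields the invariance:
assert max_pool([0.0, 0.9, 0.0, 0.0]) == max_pool([0.9, 0.0, 0.0, 0.0])
```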
In our model of object recognition in cortex (Fig. 3-2), the two types of operations, selection and
template matching, are combined in a hierarchical fashion to build up complex, invariant feature
detectors from small, localized, simple cell-like receptive fields in the bottom layer. In particular,
patterns on the model “retina” (of 160 × 160 pixels, which corresponds to a 5° receptive field size ([41] report an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) are first filtered through a layer (S1, adopting Fukushima’s nomenclature [25] of referring to feature-building cells as “S” cells and pooling cells as “C” cells) of simple cell-like receptive fields (first derivatives of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with standard deviations of 1.75 to 4.75 pixels in steps of 0.5 pixels). S1 filter responses are absolute values of the image “filtered” through the units’ receptive fields (more precisely, the rectified dot product of the cell’s receptive field with the corresponding image patch). Receptive field centers densely sample the input retina. Cells in the next layer (C1) each pool S1 cells of the same orientation over a range of scales and positions. Filters were grouped into four bands of neighboring sizes; sampling over position was done over square patches whose linear dimensions increased from the smallest to the largest filter band, with patches overlapping by half in each direction, to obtain more invariant cells responding to the same features as the S1 cells. Different C1 cells were then combined in higher layers — the figure illustrates two possibilities: either combining C1 cells tuned to different features to give S2 cells responding to co-activations of C1 cells tuned to different orientations, or yielding C2 cells responding to the same feature as the C1 cells but with bigger receptive fields (i.e., the hierarchy does not have to be a strict alternation of S and C layers). In the version described in this paper, there were no direct C1 → C2 connections, and each S2 cell received input from four neighboring C1 units (in a 2 × 2 arrangement) of arbitrary orientation, yielding a total of 4^4 = 256 different S2 cell types. S2 transfer functions were Gaussian (σ = 1, centered at 1). C2 cells then pooled inputs from all S2 cells of the same type, producing invariant feature detectors tuned to complex shapes. Top-level view-tuned units had Gaussian response functions, and each VTU received inputs from a subset of C2 cells (see below).
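The S1/C1 stages just described can be sketched in a few lines of NumPy. This is a simplified illustration, not the original implementation: only one orientation and one scale band are shown, scale pooling is omitted, and the filter size and pooling parameters are chosen for brevity:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def s1_filter(sigma, theta, size):
    # Oriented first-derivative-of-Gaussian filter: zero-sum,
    # square-normalized to 1 (illustrative parametrization).
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    u = x * np.cos(theta) + y * np.sin(theta)  # coordinate along the orientation
    g = -u * np.exp(-(x**2 + y**2) / (2 * sigma**2))
    g -= g.mean()                              # enforce zero sum
    return g / np.sqrt((g**2).sum())           # unit energy

def s1_response(image, kernel):
    # Rectified dot product of the filter with every image patch
    # (receptive field centers densely sample the retina).
    patches = sliding_window_view(image, kernel.shape)
    return np.abs(np.einsum('ijkl,kl->ij', patches, kernel))

def c1_pool(s1_map, pool):
    # MAX over local position neighborhoods, with patches
    # overlapping by half in each direction.
    windows = sliding_window_view(s1_map, (pool, pool))
    return windows[::pool // 2, ::pool // 2].max(axis=(2, 3))

img = np.random.rand(160, 160)                 # the model "retina"
k = s1_filter(sigma=1.75, theta=0.0, size=7)
c1 = c1_pool(s1_response(img, k), pool=8)
```

In the full model this would be repeated for each of the four orientations and each scale band, with the C1 MAX also taken across neighboring filter scales.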
This model had originally been developed to account for the transformation tolerance of view-
tuned units in IT as recorded from by Logothetis et al. [49]. It turns out, however, that the model
also has interesting implications for the binding problem.
3.4 Binding without a problem
To correctly recognize multiple objects in clutter, two problems must be solved: i) features must be
robustly extracted, and ii) based on these features, a decision has to be made about which objects
are present in the visual scene. The MAX operation can perform robust feature extraction (cf. [82]):
A MAX pooling cell that receives inputs from cells tuned to the same feature at, e.g., different lo-
cations, will select the most strongly activated afferent, i.e., its response will be determined by the
afferent with the closest match to its preferred feature in its receptive field. Thus, the MAX mech-
anism effectively isolates the feature of interest from the surrounding clutter. Hence, to achieve
[Figure 3-2, layer labels from bottom to top: simple cells (S1), complex cells (C1), “composite feature” cells (S2), “complex composite” cells (C2), view-tuned cells; connections alternate between “weighted sum” and “MAX” operations.]
Figure 3-2: Diagram of our hierarchical model [82] of object recognition in cortex. It consists of layers of linear units that perform a template match over their afferents (blue arrows), and of non-linear units that perform a “MAX” operation over their inputs, where the output is determined by the strongest afferent (green arrows). While the former operation serves to increase feature complexity, the latter increases invariance by effectively scanning over afferents tuned to the same feature but at different positions (to increase translation invariance) or scales (to increase scale invariance, not shown). In the version described in this paper, learning only occurred at the connections from the C2 units to the top-level view-tuned units.
robustness to clutter, a VTU should only receive input from cells that are strongly activated by the
VTU’s preferred stimulus (i.e., those features that are relevant to the definition of the object) and
thus less affected by clutter (which will tend to activate the afferents less and will therefore be ig-
nored by the MAX response function). Also, in such a scheme, two view-tuned neurons receiving
input from a common afferent feature detector will tend to both have strong connections to this fea-
ture detector. Thus, there will be little interference even if, due to its MAX response function, the common feature detector responded to only one (the stronger) of the two stimuli in its receptive field. Note that the situation would be hopeless for a response function that pools over all affer-
ents through, for example, a linear sum function: The response would always change when another
object is introduced in the visual field, making it impossible to disentangle the activations caused
by the individual stimuli without an additional mechanism such as, for instance, an attentional
sculpting of the receptive field or some kind of segmentation process.
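The difference between MAX and SUM pooling under clutter is easy to simulate. The following is a toy sketch with random activations; the numbers are illustrative, not fitted to the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_afferents = 40                           # cf. the 40 C2 afferents per VTU below
target = rng.random(n_afferents)           # afferent activity: preferred object alone
clutter = 0.5 * rng.random(n_afferents)    # weaker activity caused by a second object

# If each afferent performs a MAX over the two objects' contributions, every
# afferent that responds more strongly to the target is left unchanged:
max_scene = np.maximum(target, clutter)
frac_unchanged_max = np.mean(np.isclose(max_scene, target))

# A linear-sum afferent instead mixes the two contributions, so virtually
# every response shifts once the second object appears:
sum_scene = target + clutter
frac_unchanged_sum = np.mean(np.isclose(sum_scene, target))

# Most MAX responses survive the added clutter; essentially no SUM response does.
```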
In the following two sections we will show simulations that support these theoretical consider-
ations, and we will compare them to recent physiological experiments.
3.4.1 Recognition of multiple objects
The ability of the model neurons to perform recognition of multiple, non-overlapping objects was
investigated in the following experiment: 21 model neurons, each tuned to a view of a randomly
selected paperclip object (as used in theoretical [68], psychophysical [8, 48], and physiological [49]
studies on object recognition), were each presented with 21 displays consisting of that neuron’s
preferred clip combined with each of the 21 preferred clips (in the upper left and lower right corner
of the model retina, respectively; see Fig. 3-3 (a)), yielding 21 × 21 = 441 two-clip displays. Recognition perfor-
mance was evaluated by comparing the neuron’s response to these displays with its responses to 60
other, randomly chosen “distractor” paperclip objects (cf. Fig. 3-3). Following the studies on view-
invariant object recognition [8, 48, 82], an object is said to be recognized if the neuron’s response
to the two-clip displays (containing its preferred stimulus) is greater than to any of the distractor
objects. For 40 afferents to each view-tuned cell (i.e., the 40 C2 units excited most strongly by the
neuron’s preferred stimulus; this choice produced top-level neurons with tuning curves similar to
the experimental neurons [82]), we find that on average in 90% of the cases recognition of the neu-
ron’s preferred clip is still possible, indicating that there is little interference between the activations
caused by the two stimuli in the visual field. The maximum recognition ra