How a Part of the Brain Might or Might Not Work: A New
Hierarchical Model of Object Recognition
by
Maximilian Riesenhuber
Diplom-Physiker, Universität Frankfurt, 1995
Submitted to the Department of Brain and Cognitive Sciences in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computational Neuroscience
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2000
© Massachusetts Institute of Technology 2000. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Brain and Cognitive Sciences
May 2, 2000
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Tomaso Poggio
Uncas and Helen Whitaker Professor
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Earl K. Miller
Co-Chair, Department Graduate Committee
How a Part of the Brain Might or Might Not Work: A New Hierarchical Model
of Object Recognition
by
Maximilian Riesenhuber
Submitted to the Department of Brain and Cognitive Sciences on May 2, 2000, in partial fulfillment of the
requirements for the degree of Doctor of Philosophy in Computational Neuroscience
Abstract
The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated representations, extending in a natural way the model of simple to complex cells of Hubel and Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to explore the biological feasibility of this class of models to explain higher-level visual processing, such as object recognition in cluttered scenes. We describe a new hierarchical model, HMAX, that accounts well for this complex visual task, is consistent with several recent physiological experiments in inferotemporal cortex, and makes testable predictions. Key to achieving invariance and robustness to clutter is a MAX-like response function of some model neurons which selects (an approximation to) the maximum activity over all the afferents, with interesting connections to “scanning” operations used in recent computer vision algorithms.
We then turn to the question of object recognition in natural (“continuous”) object classes, such as faces, which recent physiological experiments have suggested are represented by a sparse distributed population code. We performed two psychophysical experiments in which subjects were trained to perform subordinate-level discrimination in a continuous object class — images of computer-rendered cars — created using a 3D morphing system. By comparing the recognition performance of trained and untrained subjects we could estimate the effects of viewpoint-specific training and infer properties of the object class-specific representation learned as a result of training. We then compared the experimental findings to simulations in HMAX, to investigate the computational properties of a population-based object class representation. We find experimental evidence, supported by modeling results, that training builds a viewpoint- and class-specific representation that supplements a pre-existing representation with lower shape discriminability but greater viewpoint invariance.
Finally, we show how HMAX can be extended in a straightforward fashion to perform object categorization and to support arbitrary class hierarchies. We demonstrate the capability of our scheme, called “Categorical Basis Functions” (CBF), with the example domain of cat/dog categorization, and apply it to study some recent findings in categorical perception.
Thesis Supervisor: Tomaso Poggio
Title: Uncas and Helen Whitaker Professor
Acknowledgments
Thanks are due to quite a few people who have contributed in different ways to the gestation of
this thesis. First, of course, are my parents. I am extremely grateful for their untiring support and
encouragement over the years — especially to my father for convincing me to major in physics.
Second is Hans-Ulrich Bauer, who advised my Diplom thesis at the Institute for Theoretical
Physics of the University of Frankfurt. Without Hans-Ulrich and his urging to get my PhD in
computational neuroscience in the US I would never have applied to MIT and would now probably
be a bitter physics PhD working for McKinsey. To him this thesis is dedicated.
At MIT, I first want to thank my advisor, Tommy Poggio, for warning me about the smog at
Caltech, and for being the best advisor I could imagine: providing a lot of independence while
being very encouraging and supportive. Being exposed to his way of doing science I consider one
of the biggest assets of my PhD training.
Then there are the people on my thesis committee, all of whom have provided valuable input
and guidance along the way. Special thanks are due to Peter Dayan for a fine collaboration during
my first year that introduced me to quite a few new ideas. To Earl Miller for being so open to collab-
orations with wacky theorists. To Mike Tarr for a nice News & Views commentary [96] on our paper
[82], and a very stimulating visit which might lead to an even more stimulating collaboration. . .
To Pawan Sinha and Gadi Geiger for introducing me to the wonderful world of psychophysics
and providing invaluable advice when I ran my first experiment. To David Freedman and Andreas
Tolias for introducing me to the wonderful world of monkey physiology and being great
collaborators. To Christian Shelton for the “mother of all correspondence algorithms” that was so
instrumental in many parts of this thesis and beyond. To Valerie Pires, without whose help the
psychophysics in chapter 4 would have been nothing more than “suggestions for future work”,
and to Mary Pat Fitzgerald, “the mother of CBCL”, whose help in dealing with the subtleties of the
MIT administration was (and continues to be) greatly appreciated.
Then there is our fine department that really has it all “under one roof”. I could not have
imagined a better place to get my PhD.
Last but not least I want to gratefully acknowledge the generous support provided by a Gerald
J. and Marjorie J. Burnett Fellowship (1996–1998) and a Merck/MIT Fellowship in Bioinformatics
(1998–2000) that enabled me to pursue the studies described in this thesis.
Contents
1 Introduction 8
2 Hierarchical Models of Object Recognition in Cortex 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Are Cortical Models Really Bound by the “Binding Problem”? 26
3.1 Introduction: Visual Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Models of Visual Object Recognition and the Binding Problem . . . . . . . . . . . . . 27
3.3 A Hierarchical Model of Object Recognition in Cortex . . . . . . . . . . . . . . . . . . 29
3.4 Binding without a problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Recognition of multiple objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.2 Recognition in clutter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 The Individual is Nothing, the Class Everything:
Psychophysics and Modeling of Recognition in Object Classes 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Modeling: Representing Continuous Object Classes in HMAX . . . . . . . . . . . . . 50
4.3.1 The HMAX Model of Object Recognition in Cortex . . . . . . . . . . . . . . . 51
4.3.2 View-Dependent Object Recognition in Continuous Object Classes . . . . . . 51
4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.3 The Model Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 General Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 A note on object class representation and categorical perception 66
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Chorus of Prototypes (COP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 A Novel Scheme: Categorical Basis Functions (CBF) . . . . . . . . . . . . . . . . . . . 68
5.3.1 An Example: Cat/Dog Classification . . . . . . . . . . . . . . . . . . . . . . . 69
5.3.2 Introduction of parallel categorization schemes . . . . . . . . . . . . . . . . . 72
5.4 Interactions between categorization and discrimination: Categorical Perception . . . 72
5.4.1 Categorical Perception in CBF . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.2 Categorization with and without Categorical Perception . . . . . . . . . . . . 75
5.5 COP or CBF? — Suggestion for Experimental Tests . . . . . . . . . . . . . . . . . . . . 78
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 General Conclusions and Future Work 81
List of Figures
2-1 Invariance properties of one IT neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2-2 Sketch of the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2-3 Illustration of the highly nonlinear shape tuning properties of the MAX mechanism 17
2-4 Response of a sample model neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2-5 Average neuronal responses to scrambled stimuli . . . . . . . . . . . . . . . . . . . . 21
3-1 Cartoon of the Poggio and Edelman model of view-based object recognition . . . . . 30
3-2 Sketch of our hierarchical model of object recognition in cortex . . . . . . . . . . . . . 32
3-3 Recognition of two objects simultaneously . . . . . . . . . . . . . . . . . . . . . . . . 34
3-4 Stimulus/background example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3-5 Model performance: Recognition in clutter . . . . . . . . . . . . . . . . . . . . . . . . 36
4-1 Natural objects, and artificial objects used in previous object recognition studies . . . 42
4-2 The eight prototype cars used in the 8 car system . . . . . . . . . . . . . . . . . . . . . 45
4-3 Training task for Experiments 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4-4 Illustration of match/nonmatch pairs for Experiment 1 . . . . . . . . . . . . . . . . . 47
4-5 Testing task for Experiments 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4-6 Average performance of the trained subjects on the test task of Experiment 1 . . . . . 49
4-7 Average performance of untrained subjects on the test task of Experiment 1 . . . . . 50
4-8 Our model of object recognition in cortex . . . . . . . . . . . . . . . . . . . . . . . . . 52
4-9 Recognition performance of the model on the eight car morph space . . . . . . . . . 53
4-10 Dependence of average (one-sided) rotation invariance, Δr, on SSCU tuning width, σ 55
4-11 Dependence of invariance range on the number of afferents to each SSCU . . . . . . 55
4-12 Dependence of invariance range on the number of SSCUs . . . . . . . . . . . . . . . . 56
4-13 Effect of addition of noise to the SSCU representation . . . . . . . . . . . . . . . . . . 57
4-14 Cat/dog prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4-15 The “Other Class” effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4-16 Car object class-specific features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-17 Performance of the two-layer model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-18 The 15 prototypes used in the 15 car system . . . . . . . . . . . . . . . . . . . . . . . . 61
4-19 Average performance of trained subjects in Experiment 2 . . . . . . . . . . . . . . . . 62
4-20 Average performance of untrained subjects in Experiment 2 . . . . . . . . . . . . . . 63
5-1 Cartoon of the CBF categorization scheme . . . . . . . . . . . . . . . . . . . . . . . . . 70
5-2 Illustration of the cat/dog stimulus space . . . . . . . . . . . . . . . . . . . . . . . . . 71
5-3 Response of the cat/dog categorization unit . . . . . . . . . . . . . . . . . . . . . . . . 71
5-4 Sketch of the model to explain the influence of experience with categorization tasks
on object discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5-5 Average responses over all morph lines for the two networks . . . . . . . . . . . . . . 76
5-6 Comparison of Euclidean distances of activation patterns . . . . . . . . . . . . . . . . 77
5-7 Output of the categorization unit trained on the cat/dog categorization task . . . . . 80
5-8 Same as Fig. 5-7, but for a representation based on 30 units chosen by k-means . . . 80
Chapter 1
Introduction
Tell your hairdresser that you are working on vision and he will likely say “Vision? But that’s
easy!” Indeed, the apparent ease with which we perform object recognition even in cluttered scenes
and under difficult viewing conditions belies the amazing complexity of the visual system. This
became apparent with the groundbreaking studies of Hubel and Wiesel [36, 37] in the primary
visual cortices of cats and monkeys in the late 50s and 60s. Subsequently, many other visual areas
were discovered, with recent surveys listing over 30 visual areas linked in an intricate and still
unresolved pattern [22, 35]. This complex connection scheme can be coarsely divided up into two
pathways, the “What” pathway (the ventral stream) running from primary visual cortex, V1, via
V2 and V4 to inferotemporal cortex (IT), and the “Where” pathway (dorsal stream) from V1 to V2,
V3, MT, MST and other parietal areas [103]. In this framework, the “What” pathway is specialized
for object recognition whereas the “Where” pathway is concerned with spatial vision. Looking at
the ventral stream in more detail, Kobatake et al. [41] have reported a progression of the complexity
of cells’ preferred features and receptive field sizes as one progresses along the stream. While
neurons in V1 are tuned to oriented bars and have small receptive fields, cells in IT appear to prefer
more complex visual patterns such as faces (for a review on face cells, see [15]), and respond over
a wide range of positions and scales, pointing to a crucial role of IT cortex in object recognition (as
confirmed by a great number of physiological, lesion and neuropsychological studies [50, 94]).
These findings naturally prompted the question of how cells tuned to views of complex objects
showing invariance to size and position changes in IT could arise from small bar-tuned receptive
fields in V1, and how ultimately this neural substrate could be used to perform object recognition.
In humans, psychophysical experiments had given rise to two main competing theories of object
recognition: the structural description and the view-based theory (see [98] for a review of the
two theories and their experimental evidence). The former approach, its main protagonist being
the recognition-by-components theory of Biederman [5], holds that object recognition proceeds by
decomposing an object into a view-independent part-based description, while in the view-based
theory object recognition is based on the viewpoints in which objects had actually appeared. Experimental
evidence ([8, 48, 49, 97, 100], see also chapter 2) and computational considerations [19] appear
to favor the view-based theory. However, several challenges for view-based models remained; Tarr
and Bülthoff [98] very recently listed the following problems:*
1. tolerance to changes in viewing condition — while the system should allow fine shape discriminations,
it should not require a new representation for every change in viewing condition;
2. class generalization and representation of categories — the system should generalize from
familiar exemplars of a class to unfamiliar ones, and also be able to support categorization
schemes.
This thesis presents a simple hierarchical model of object recognition in cortex (chapter 5 shows
how the model can be extended to object categorization in a straightforward way), HMAX [96],
that addresses these challenges in a biologically plausible system. In particular, chapter 2 (a reprint
of [82], copyright Nature America Inc., reprinted by permission) introduces HMAX and compares
the visual properties of view-tuned units in the model, especially with respect to translation, scaling,
and rotation in depth of the visual scene, to those of neurons in inferotemporal cortex recorded
from in various experiments. Chapter 3 (a reprint of [81], copyright Cell Press, reprinted by permission)
shows that HMAX can even perform recognition in cluttered scenes, without having to
resort to special segmentation processes, which is of special interest in connection with the so-called
“Binding Problem”.
While the first two papers focus on “paperclip” objects that have been used extensively in psychophysical
[8, 48, 91], physiological [49] and computational studies [68], this object class has several
disadvantages (such as not being “nice” [107]) that make it unsuitable as a basis to investigate
recognition in natural object classes — the topic of chapter 4 (a reprint of [84]) — where objects
have similar 3D shape, such as faces. Instead, in chapters 4 and 5, stimuli for model and experiment
were generated using a novel 3D morphing system developed by Christian Shelton [90] that
allows us to generate morphed objects drawn from a space spanned by a set of prototype objects,
for instance cars (chapter 4), or cats and dogs (chapters 4 and 5). We show that the recognition
results obtainable in natural object classes represented using a population code where the activity
over a group of units codes for the identity of an object (as suggested by recent physiology studies
[112, 115]) are quite comparable to those for individual objects represented by “grandmother” units
(in chapter 2), that is, the performance of HMAX does not appear to be special to a certain object
*They also listed the need for a simple mechanism to measure perceptual similarity, in order “to generalize between exemplars or between views” ([98], p. 9), which thus appears to be a corollary of solving the two problems listed above.
class. In addition, simulations show that a population-based object representation provides several
computational advantages over a “grandmother” representation. We further present experimental
results from a psychophysical study in which we trained subjects using a discrimination paradigm
to build a representation of a novel object class. This representation was then probed by examining
how discrimination performance was affected by viewpoint changes. We find experimental
evidence, supported by the modeling results, that training builds a viewpoint- and class-specific
representation that supplements a pre-existing representation with lower shape discriminability
but greater viewpoint invariance. Chapter 5 (a reprint of [83]) finally shows how HMAX can be
extended in a straightforward way to perform object categorization, and to support arbitrary object
categorization schemes, with interesting opportunities for interactions between discrimination and
categorization as observed in categorical perception.
Chapter 2
Hierarchical Models of Object
Recognition in Cortex
Abstract
The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated
representations, extending in a natural way the model of simple to complex cells of Hubel and
Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to
explore the biological feasibility of this class of models to explain higher-level visual processing,
such as object recognition. We describe a new hierarchical model that accounts well for this
complex visual task, is consistent with several recent physiological experiments in inferotemporal
cortex and makes testable predictions. The model is based on a novel MAX-like operation on the
inputs to certain cortical neurons which may have a general role in cortical function.
2.1 Introduction
The recognition of visual objects is a fundamental cognitive task performed effortlessly by the brain
countless times every day while satisfying two essential requirements: invariance and specificity.
In face recognition, for example, we can recognize a specific face among many, while being rather
tolerant to changes in viewpoint, scale, illumination, and expression. The brain performs this and
similar object recognition and detection tasks fast [101] and well. But how?
Early studies [7] of macaque inferotemporal cortex (IT), the highest purely visual area in the
ventral visual stream, thought to have a key role in object recognition [103], reported cells tuned to
views of complex objects such as a face, i.e., the cells discharged strongly to the view of a face but
very little or not at all to other objects. A hallmark of these cells was the robustness of their firing
to stimulus transformations such as scale and position changes.
This finding presented an interesting question: How could these cells show strongly differing
responses to similar stimuli (e.g., two different faces) that activate the retinal photoreceptors in
similar ways, while showing response constancy to scaled and translated versions of the preferred
stimulus that cause very different activation patterns on the retina?
This puzzle was similar to one faced by Hubel and Wiesel on a much smaller scale two decades
earlier when they recorded from simple and complex cells in cat striate cortex [36]: both cell types
responded strongly to oriented bars, but whereas simple cells exhibited small receptive fields with
a strong phase dependence, that is, with distinct excitatory and inhibitory subfields, complex cells
had larger receptive fields and no phase dependence. This led Hubel and Wiesel to propose a
model in which simple cells with their receptive fields in neighboring parts of space feed into
the same complex cell, thereby endowing that complex cell with a phase-invariant response. A
straightforward (but highly idealized) extension of this scheme would lead all the way from simple
cells to “higher order hypercomplex cells” [37].
Starting with the Neocognitron [25] for translation-invariant object recognition, several hierarchical
models of shape processing in the visual system have subsequently been proposed to explain
how transformation-invariant cells tuned to complex objects can arise from simple cell inputs
[64, 111]. Those models, however, were not quantitatively specified or were not compared with
specific experimental data. Alternative models for translation- and scale-invariant object recognition
have been proposed, based on a controlling signal that either appropriately reroutes incoming
signals, as in the “shifter” circuit [2] and its extension [62], or modulates neuronal responses, as
in the “gain-field” models for invariant recognition [78, 88]. While recent experimental studies
[14, 56] have indicated that cells in macaque area V4 can show an attention-controlled shift or modulation
of their receptive field in space, there is still little evidence that this mechanism is used to
perform translation-invariant object recognition, or that a similar mechanism applies to other
transformations (such as scaling) as well.
The basic idea of the hierarchical model sketched by Perrett and Oram [64] was that invariance
to any transformation (not just image-plane transformations as in the case of the Neocognitron
[25]) could be built up by pooling over afferents tuned to various transformed versions of the
same stimulus. Indeed it was shown earlier [68] that viewpoint-invariant object recognition was
possible using such a pooling mechanism. A (Gaussian RBF) learning network was trained with
individual views (rotated around one axis in 3D space) of complex, paperclip-like objects to achieve
3D rotation-invariant recognition of these objects. In the network the resulting view-tuned units fed
into a view-invariant unit; they effectively represented prototypes between which the learning
network interpolated to achieve viewpoint invariance.
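The view-interpolation scheme just described can be illustrated with a minimal sketch. The Gaussian tuning function, the 30-degree tuning width, and the particular set of training views below are invented for illustration; they are not the parameters of the network in [68].

```python
import math

def view_tuned_response(view_angle, preferred_angle, sigma=30.0):
    """Gaussian tuning of a view-tuned unit around its training view
    (sigma is an arbitrary illustrative tuning width, in degrees)."""
    return math.exp(-(view_angle - preferred_angle) ** 2 / (2 * sigma ** 2))

def view_invariant_response(view_angle, training_views):
    """View-invariant unit: pooled response over view-tuned afferents.
    The view-tuned units act as prototypes between which the network
    effectively interpolates."""
    return sum(view_tuned_response(view_angle, v) for v in training_views)

# With training views every 60 degrees, the pooled response stays high
# even for intermediate, never-seen views: rotation invariance by pooling.
views = [0, 60, 120, 180, 240, 300]
for angle in (0, 30, 90):
    print(angle, round(view_invariant_response(angle, views), 2))
```

A handful of view-tuned prototypes is enough to keep the pooled response roughly constant across intermediate views, which is the essence of invariance-by-pooling.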
There is now quantitative psychophysical [8, 48, 95] and physiological evidence [6, 42, 49] for
the hypothesis that units tuned to full or partial views are probably created by a learning process
and also some hints that the view-invariant output is in some cases explicitly represented by (a
small number of) individual neurons [6, 49, 66].
A recent experiment [48, 49] required monkeys to perform an object recognition task using
novel “paperclip” stimuli the monkeys had never seen before. Here, the monkeys were required
to recognize views of “target” paperclips rotated in depth among views of a large number of “distractor”
paperclips of very similar structure, after being trained on a restricted set of views of each
target object. Following very extensive training on a set of paperclip objects, neurons were found
in anterior IT that selectively responded to the object views seen during training.
This design avoided two problems associated with previous physiological studies investigating
the mechanisms underlying view-invariant object recognition: First, by training the monkey
to recognize novel stimuli with which the monkey had not had any visual experience instead of
objects (e.g., faces) with which the monkey was quite familiar, it was possible to estimate the degree
of view-invariance derived from just one object view. Moreover, the use of a large number of
distractor objects made it possible to define view-invariance with respect to the distractors. This is a
key point, since only by being able to compare the response of a neuron to transformed versions of
its preferred stimulus with the neuron’s response to a range of (similar) distractor objects can the
VTU’s (view-tuned unit’s) invariance range be determined — just measuring the tuning curve is
not sufficient.
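This criterion — a transformed view counts as within the invariance range only while the target response still exceeds the best distractor response — can be made concrete with a small sketch. All response values below are made up for illustration; the actual analysis in [49] is of course more involved.

```python
def invariance_range(rotations, target_responses, distractor_responses):
    """Return the rotation values for which the response to the transformed
    target exceeds the strongest distractor response. The threshold is set
    by the distractors, not by zero, which is why the target tuning curve
    alone is not sufficient to determine the invariance range."""
    threshold = max(distractor_responses)
    return [r for r, resp in zip(rotations, target_responses)
            if resp > threshold]

rotations = [-60, -30, 0, 30, 60]       # degrees around the training view
target = [5.0, 22.0, 40.0, 25.0, 6.0]   # invented spike rates for the target
distractors = [12.0, 9.0, 8.0]          # invented best-distractor rates
print(invariance_range(rotations, target, distractors))  # -> [-30, 0, 30]
```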
The study [49] established (Fig. 2-1) that after training with just one object view there are cells
showing some degree of limited invariance to 3D rotation around the training view, consistent
with the view-interpolation model [68]. Moreover, the cells also exhibit significant invariance to
translation and scale changes, even though the object was only previously presented at one scale
and position.
These data put in sharp focus and in quantitative terms the question of the circuitry underlying
the properties of the view-tuned cells. While the original model [68] described how VTUs
could be used to build view-invariant units, it did not specify how the view-tuned units could
come about. The key problem is thus to explain in terms of biologically plausible mechanisms the
VTUs’ invariance to translation and scaling obtained from just one object view, which arises from
a trade-off between selectivity to a specific object and relative tolerance (i.e., robustness of firing) to
position and scale changes. Here, we describe a model that conforms to the main anatomical and
physiological constraints, reproduces the invariance data described above and makes predictions
for experiments on the view-tuned subpopulation of IT cells. Interestingly, the model is also consistent
with recent data from several other experiments regarding recognition in context [54], or the
presence of multiple objects in a cell’s receptive field [89].
[Figure 2-1 appears here: four panels (a)–(d); axes include Spike Rate vs. Distractor ID (10 best distractors), Rotation Around Y Axis, Azimuth and Elevation, Degrees of Visual Angle, and (Target Response)/(Mean of Best Distractors).]
Figure 2-1: Invariance properties of one neuron (modified from Logothetis et al. [49]). The figure shows the response of a single cell found in anterior IT after training the monkey to recognize paperclip-like objects. The cell responded selectively to one view of a paperclip and showed limited invariance around the training view to rotation in depth, along with significant invariance to translation and size changes, even though the monkey had only seen the stimulus at one position and scale during training. (a) shows the response of the cell to rotation in depth around the preferred view. (b) shows the cell’s response to the 10 distractor objects (other paperclips) that evoked the strongest responses. The lower plots show the cell’s response to changes in stimulus size, (c) (asterisk shows the size of the training view), and position, (d) (using the 1.9° size), resp., relative to the mean of the 10 best distractors. Defining “invariance” as yielding a higher response to transformed views of the preferred stimulus than to distractor objects, neurons exhibit an average rotation invariance of 42° (during training, stimuli were actually rotated by ���� in depth to provide full 3D information to the monkey; therefore, the invariance obtained from a single view is likely to be smaller), and translation and scale invariance on the order of ��� and �� octave around the training view, resp. (J. Pauls, personal communication).
[Figure 2-2 sketch: a hierarchy running, bottom to top, simple cells (S1), complex cells (C1), "composite feature" cells (S2), "complex composite" cells (C2), and view-tuned cells, with "weighted sum" and "MAX" labeling the two connection types.]
Figure 2-2: Sketch of the model. The model is a hierarchical extension of the classical paradigm [36] of building complex cells from simple cells. It consists of a hierarchy of layers with linear operations (“S” units in the notation of Fukushima [25], performing template matching, solid lines) and non-linear operations (“C” pooling units [25], performing a “MAX” operation, dashed lines). The non-linear MAX operation, which selects the maximum of the cell’s inputs and uses it to drive the cell, is key to the model’s properties and is quite different from the basically linear summation of inputs usually assumed for complex cells. These two types of operations respectively provide pattern specificity and invariance (to translation, by pooling over afferents tuned to different positions, and to scale (not shown), by pooling over afferents tuned to different scales).
2.2 Results
The model is based on a simple hierarchical feedforward architecture (Fig. 2-2). Its structure reflects
the assumption that invariance to position and scale on the one hand and feature specificity on
the other hand must be built up through separate mechanisms: to increase feature complexity, a
suitable neuronal transfer function is a weighted sum over afferents coding for simpler features,
i.e., a template match. But is summing over differently weighted afferents also the right way to
increase invariance?
From the computational point of view, the pooling mechanism should produce robust feature
detectors, i.e., measure the presence of specific features without being confused by clutter and con-
text in the receptive field. Consider a complex cell, as found in primary visual cortex, whose pre-
ferred stimulus is a bar of a certain orientation to which the cell responds in a phase-invariant way
[36]. Along the lines of the original complex cell model [36], one could think of the complex cells
as receiving input from an array of simple cells at different locations, pooling over which results in
the position-invariant response of the complex cell.
Two alternative idealized pooling mechanisms are: linear summation (“SUM”) with equal weights
(to achieve an isotropic response) and a nonlinear maximum operation (“MAX”), where the strongest
afferent determines the response of the postsynaptic unit. In both cases, if only one bar is present
in the receptive field, the response of a model complex cell is position invariant. The response
level would signal how similar the stimulus is to the afferents’ preferred feature. Consider now
the case of a complex stimulus, such as a paperclip, in the visual field. In the linear summation
case, complex cell response would still be invariant (as long as the stimulus stays in the cell’s re-
ceptive field), but the response level would no longer allow one to infer whether there actually was a
bar of the preferred orientation somewhere in the complex cell’s receptive field, as the output sig-
nal is a sum over all the afferents. That is, feature specificity is lost. In the MAX case, however,
the response would be determined by the most strongly activated afferent and hence would signal
the best match of any part of the stimulus to the afferents’ preferred feature. This ideal example
suggests that the MAX mechanism is capable of providing a more robust response in the case of
recognition in clutter or with multiple stimuli in the receptive field (cf. below). Note that a SUM re-
sponse with saturating nonlinearities on the inputs seems too brittle since it requires a case-by-case
adjustment of the parameters, depending on the activity level of the afferents.
Equally critical is the inability of the SUM mechanism to achieve size invariance: Suppose that
the afferents to a “complex” cell (which now could be a cell in V4 or IT, for instance) show some
degree of size and position invariance. If the “complex” cell were now stimulated with the same
object but at subsequently increasing sizes, an increasing number of afferents would become ex-
cited by the stimulus (unless the afferents showed no overlap in space or scale) and consequently
the excitation of the “complex” cell would increase along with the stimulus size, even though the
afferents show size invariance (this is borne out in simulations using a simplified two-layer model
[79])! For the MAX mechanism, however, cell response would show little variation even as stimulus
size increased since the cell’s response would be determined just by the best-matching afferent.
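The two failure modes of SUM pooling discussed here (mixing in clutter, and growing with stimulus size) can be made concrete in a few lines. This is an illustrative sketch with made-up activity values, not the thesis simulation code:

```python
import numpy as np

def pool(afferents, mode):
    """Idealized complex-cell pooling: equal-weight SUM or MAX."""
    a = np.asarray(afferents, dtype=float)
    return float(a.sum()) if mode == "SUM" else float(a.max())

# Clutter: one afferent matches the preferred feature well (0.9);
# added clutter drives the other afferents weakly.
clean   = [0.9, 0.0, 0.0, 0.0]
clutter = [0.9, 0.3, 0.4, 0.2]
assert pool(clutter, "MAX") == pool(clean, "MAX")   # still signals best match
assert pool(clutter, "SUM") >  pool(clean, "SUM")   # feature specificity lost

# Scale: a growing stimulus excites more size-tuned afferents with
# overlapping tuning, even though each afferent is itself "invariant".
small = [0.8, 0.1, 0.0, 0.0]
large = [0.8, 0.8, 0.8, 0.1]
assert pool(large, "SUM") >  pool(small, "SUM")     # SUM grows with size
assert pool(large, "MAX") == pool(small, "MAX")     # MAX stays flat
```

In both cases the MAX response tracks only the best-matching afferent, which is exactly the robustness property argued for in the text.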
These considerations (supported by quantitative simulations of the model, described below)
suggest that a sensible way of pooling responses to achieve invariance is via a nonlinear MAX
function, that is, by implicitly scanning (see Discussion) over afferents of the same type that differ
in the parameter of the transformation to which the response should be invariant (e.g., feature
size for scale invariance), and then selecting the best-matching of those afferents. Note that these
considerations apply to the case where different afferents to a pooling cell, e.g., those looking at
different parts of space, are likely to be responding to different objects (or different parts of the
same object) in the visual field (as is the case with cells in lower visual areas with their broad shape
tuning). Here, pooling by combining afferents would mix up signals caused by different stimuli.
However, if the afferents are specific enough to only respond to one pattern, as one expects in the
[Figure 2-3, panels (a) and (b): bar plots of normalized response (0–1); panel (b) bars labeled MAX, expt., and SUM.]
Figure 2-3: Illustration of the highly nonlinear shape tuning properties of the MAX mechanism. (a) Experimentally observed responses of IT cells obtained using a “simplification procedure” [113] designed to determine “optimal” features (responses normalized so that the response to the preferred stimulus is equal to 1). In that experiment, the cell originally responds quite strongly to the image of a “water bottle” (leftmost object). The stimulus is then “simplified” to its monochromatic outline, which increases the cell’s firing, and further to a paddle-like object consisting of a bar supporting an ellipse. While this object evokes a strong response, the bar or the ellipse alone produce almost no response at all (figure used by permission). (b) Comparison of experiment and model. Green bars show the responses of the experimental neuron from (a). Blue and red bars show the response of a model neuron tuned to the stem-ellipsoidal base transition of the preferred stimulus. The model neuron is at the top of a simplified version of the model shown in Fig. 2-2, in which there are only two types of S1 features at each position in the receptive field, tuned to the left and right side of the transition region, resp., which feed into C1 units that pool using a MAX function (blue bars) or a SUM function (red bars). The model neuron is connected to these C1 units so that its response is maximal when the experimental neuron’s preferred stimulus is in its receptive field.
final stages of the model, then pooling by using a weighted sum, as in the RBF network [68], where
VTUs tuned to different viewpoints were combined to interpolate between the stored views, is
advantageous.
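The weighted-sum stage just mentioned can be sketched as a Gaussian radial-basis-function unit whose center is the activity pattern stored from the training view. The width sigma = 0.2 and the activity vectors below are arbitrary illustrative choices, not parameters from the thesis:

```python
import numpy as np

def vtu_response(c2_pattern, center, sigma=0.2):
    """Gaussian RBF view-tuned unit: responds maximally when the afferent
    activity pattern matches the stored training-view pattern."""
    d2 = float(np.sum((np.asarray(c2_pattern) - np.asarray(center)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

center = np.array([0.9, 0.1, 0.7, 0.3])   # pattern evoked by the training view
assert vtu_response(center, center) == 1.0
# A nearby view still drives the unit; a very different pattern barely does.
near = vtu_response(center + 0.05, center)
far  = vtu_response([0.1, 0.9, 0.2, 0.8], center)
assert 0.0 < far < near < 1.0
```

Several such units tuned to different stored views can then be combined by a weighted sum to interpolate between views, as in the RBF network [68].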
MAX-like mechanisms at some stages of the circuitry appear to be compatible with recent neu-
rophysiological data. For instance, it has been reported [89] that when two stimuli are brought into
the receptive field of an IT neuron, that neuron’s response appears to be dominated by the stimulus
that produces a higher firing rate when presented in isolation to the cell — just as expected if a
MAX-like operation is performed at the level of this neuron or its afferents. Theoretical investiga-
tions into possible pooling mechanisms for V1 complex cells also support a maximum-like pooling
mechanism (K. Sakai & S. Tanaka, Soc. Neurosci. Abs., 23, 453, 1997). Additional indirect support
for a MAX mechanism comes from studies using a “simplification procedure” [113] or “complexity
reduction” [47] to determine the preferred features of IT cells, i.e., the stimulus components that are
responsible for driving the cell. These studies commonly find a highly nonlinear tuning of IT cells
(Fig. 2-3 (a)). Such tuning is compatible with the MAX response function (Fig. 2-3 (b), blue bars).
Note that a linear model (Fig. 2-3 (b), red bars) cannot reproduce this strong response change for
small changes in the input image.
In our model of view-tuned units (Fig. 2-2), the two types of operations, scanning and template
matching, are combined in a hierarchical fashion to build up complex, invariant feature detectors
from small, localized, simple cell-like receptive fields in the bottom layer which receive input from
the model “retina.” There need not be a strict alternation of these two operations: connections can
skip levels in the hierarchy, as in the direct C1→C2 connections of the model in Fig. 2-2.
The question remains whether the proposed model can indeed achieve response selectivity and
invariance compatible with the results from physiology. To investigate this question, we looked at
the invariance properties of 21 view-tuned units in the model, each tuned to a view of a different,
randomly selected paperclip, as used in the experiment [49].
Figure 2-4 shows the response of one model view-tuned unit to 3D rotation, scaling and trans-
lation around its preferred view (see Methods). The unit responds maximally to the training view,
with the response gradually falling off as the stimulus is transformed away from the training view.
As in the experiment, we can determine the invariance range of the VTU by comparing the response
to the preferred stimulus to the responses to the 60 distractors. The invariance range is then defined
as the range over which the model unit’s response is greater than to any of the distractor objects.
Thus, the model VTU shown in Fig. 2-4 shows rotation invariance of 24°, scale invariance of 2.6
octaves and translation invariance of 4.7° of visual angle. Averaging over all 21 units, we obtain
average rotation invariance over 30.9°, scale invariance over ��� octaves and translation invariance
over 4.6°.
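The invariance criterion just applied (response to the transformed target must exceed the response to every distractor) can be sketched as follows. The Gaussian tuning curve and the 0.4 distractor level are toy values, chosen so that this example happens to yield a 24-degree range like the unit described in the text:

```python
import numpy as np

def invariance_range(responses, params, distractor_max):
    """Width of the contiguous region around the peak over which the
    unit's response exceeds its strongest distractor response."""
    above = np.asarray(responses) > distractor_max
    lo = hi = int(np.argmax(responses))
    while lo > 0 and above[lo - 1]:
        lo -= 1
    while hi < len(above) - 1 and above[hi + 1]:
        hi += 1
    return params[hi] - params[lo]

angles = np.arange(50, 131, 4)               # tested viewpoints, degrees
resp = np.exp(-((angles - 90) / 15.0) ** 2)  # toy rotation tuning curve
assert invariance_range(resp, angles, distractor_max=0.4) == 24
```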
Units show invariance around the training view, over a range in good agreement with the exper-
imentally observed values. Some units (5/21), an example of which is given in Fig. 2-4 (d), show
tuning also for pseudo-mirror views (obtained by rotating the preferred paperclip by 180° in depth,
which produces a pseudo-mirror view of the object due to the paperclips’ minimal self-occlusion),
as observed in some experimental neurons [49].
While the simulation and experimental data presented so far dealt with settings in which one object
was presented in isolation, this is rarely the case in normal object recognition. More commonly,
the object to be recognized is situated in front of some background
or appears together with other objects, all of which are to be ignored if the object is to be recognized
successfully. More precisely, in the case of multiple objects in the receptive field, the responses of
the afferents feeding into a VTU tuned to a certain object should be affected as little as possible by
the presence of other “clutter objects.”
The MAX response function posited above for the pooling mechanism to achieve invariance has
the right computational properties to perform recognition in clutter: If the VTU’s preferred object
strongly activates the VTU’s afferents, then it is unlikely that other objects will interfere, as they
tend to activate the afferents less and hence will not usually influence the response due to the MAX
response function. In some cases (such as when there are occlusions of the preferred feature, or one
of the “wrong” afferents has a higher activation) clutter, of course, can affect the value provided
[Figure 2-4 plots: (a) response vs. stimulus size (0–160 pixels), with an inset showing responses to the 60 distractors; (b) response vs. viewing angle (50°–130°); (c) response vs. x and y translation (±4 deg); (d) response vs. viewing angle (0°–270°).]
Figure 2-4: Responses of a sample model neuron to different transformations of its preferred stimulus. The different panels show the same neuron’s response to (a) varying stimulus sizes (inset shows the response to 60 distractor objects, selected randomly from the paperclips used in the physiology experiments [49]), (b) rotation in depth and (c) translation. Training size was 64 × 64 pixels, corresponding to 2° of visual angle. (d) shows another neuron’s response to pseudo-mirror views (cf. text), with the dashed line indicating the neuron’s response to the “best” distractor.
by the MAX mechanism, thereby reducing the quality of the match at the final stage and thus
the strength of the VTU response. It is clear that to achieve the highest robustness to clutter, a
VTU should only receive input from cells that are strongly activated (i.e., that are relevant to the
definition of the object) by its preferred stimulus.
In the version of the model described so far, the penultimate layer contained only 10 cells corre-
sponding to 10 different features, which turned out to be sufficient to achieve invariance properties
as found in the experiment. Each VTU in the top layer was connected to all the afferents and hence
robustness to clutter is expected to be relatively low. Note that in order to connect a VTU to only the
subset of the intermediate feature detectors it receives strong input from, the number of afferents
should be large enough to achieve the desired response specificity.
The straightforward solution is to increase the number of features. Even with a fixed number of
different features in S1, the dictionary of S2 features can be expanded by increasing the number and
type of afferents to individual S2 cells (see Methods). In this “many feature” version of the model,
the invariance ranges for a low number of afferents are already comparable to the experimental
ranges — if each VTU is connected to the 40 (out of 256) C2 cells that are most strongly excited
by its preferred stimulus, model VTUs show an average scale invariance over ��� octaves, rotation
invariance over 36.2° and translation invariance over 4.4°. For the maximum of 256 afferents to
each cell, cells are rotation invariant over an average of 47°, scale invariant over 2.4 octaves and
translation invariant over 4.7°.
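The sparse connectivity described here amounts to a top-k selection over the C2 activity pattern evoked by the preferred stimulus. A minimal sketch with random stand-in data (not the thesis code):

```python
import numpy as np

def strongest_afferents(c2_activity, k=40):
    """Indices of the k C2 units most strongly excited by the VTU's
    preferred stimulus (40 of 256 in the text)."""
    return np.sort(np.argsort(c2_activity)[::-1][:k])

rng = np.random.default_rng(0)
c2 = rng.random(256)                       # stand-in C2 activity pattern
idx = strongest_afferents(c2, k=40)
assert len(idx) == 40
assert c2[idx].min() >= np.sort(c2)[-40]   # truly the 40 strongest units
```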
Simulations show [81] that this model is capable of performing recognition in context: Using
displays as inputs that contain the neuron’s preferred clip as well as another, distractor, clip, the
model is able to correctly recognize the preferred clip in 90% of the cases (for 40/256 afferents to
each neuron, the maximum rate is 94% for 18 afferents, dropping to 55% for 256/256 afferents,
compared to 40% in the original version of the model with 10 C2 units), i.e., the addition of the
second clip interfered with the activation caused by the first clip alone so much that in 10% of the
cases the response to the two clip display containing the preferred clip fell below the response to
one of the distractor clips. This reduction of the response to the two-stimulus display compared to
the response to the stronger stimulus alone has also been found in experimental studies [86, 89].
The question of object recognition in the presence of a background object was explored ex-
perimentally in a recent study [54], where a monkey had to discriminate (polygonal) foreground
objects irrespective of the (polygonal) background they appeared with. Recordings of IT neurons
showed that for the stimulus/background condition, neuronal response on average was reduced
to a quarter of the response to the foreground object alone, while the monkey’s behavioral perfor-
mance dropped much less. This is compatible with simulations in the model [81] that show that
even though a unit’s firing rate is strongly affected by the addition of the background pattern, it is
still in most cases well above the firing rate evoked by distractor objects, allowing the foreground
[Figure 2-5: (a) example scrambled stimulus; (b) average response (0–1) vs. number of tiles (1, 4, 16, 64, 256).]
Figure 2-5: Average responses of neurons in the many-feature version of the model to scrambled stimuli. (a) Example of a scrambled stimulus. The images (160 × 160 pixels) were created by subdividing the preferred stimulus of each neuron into 4, 16, 64, and 256, resp., “tiles” and randomly shuffling the tiles to create a scrambled image. (b) Average response of the 21 model neurons (with 40/256 afferents, as above) to the scrambled stimuli (solid blue curve), in comparison to the average normalized responses of IT neurons to scrambled stimuli (scrambled pictures of trees) reported in a very recent study [108] (dashed green curve).
object to be recognized successfully.
Our model relies on decomposing images into features. Should it then be fooled into confusing
a scrambled image with the unscrambled original? Superficially, one may be tempted to guess that
scrambling an image in pieces larger than the features should indeed fool the model. Simulations
(see Fig. 2-5) show that this is not the case. The reason lies in the large dictionary of filters/features
used that makes it practically impossible to scramble the image in such a way that all features are
preserved, even for a low number of features. Responses of model units drop precipitously as
the image is scrambled into progressively finer pieces, as confirmed very recently in a physiology
experiment [108] of which we became aware after obtaining this prediction from the model.
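The scrambling procedure used for these stimuli is straightforward to sketch. This is illustrative code, with a random image standing in for a preferred stimulus:

```python
import numpy as np

def scramble(image, n_tiles, rng):
    """Cut a square image into n_tiles square tiles (4, 16, 64, or 256)
    and reassemble them in random order, as for the Fig. 2-5 stimuli."""
    side = int(round(np.sqrt(n_tiles)))
    h = image.shape[0] // side
    tiles = [image[r*h:(r+1)*h, c*h:(c+1)*h]
             for r in range(side) for c in range(side)]
    out = np.empty_like(image)
    for i, j in enumerate(rng.permutation(side * side)):
        r, c = divmod(i, side)
        out[r*h:(r+1)*h, c*h:(c+1)*h] = tiles[j]
    return out

rng = np.random.default_rng(1)
img = rng.random((160, 160))     # model "retina" resolution
scr = scramble(img, 16, rng)
# Pixel content (and hence features smaller than a tile) is preserved,
# but the global arrangement is destroyed:
assert np.array_equal(np.sort(scr.ravel()), np.sort(img.ravel()))
```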
2.3 Discussion
We briefly outline the computational roots of the hierarchical model we described, how the MAX
operation could be implemented by cortical circuits and remark on the role of features and invari-
ances in the model.
A key operation in several recent computer vision algorithms for the recognition and classifi-
cation of objects [87, 92] is to scan a window across an image, through both position and scale, in
order to analyze at each step a subimage – for instance by providing it to a classifier that decides
whether the subimage represents the object of interest. Such algorithms have been successful in
achieving invariance to image plane transformations such as translation and scale. In addition, this
brute force scanning strategy eliminates the need to segment the object of interest before recogni-
tion: segmentation, even in complex and cluttered images, is routinely achieved as a byproduct of
recognition. The computational assumption that originally motivated the model described in this
paper was indeed that a MAX-like operation may represent the cortical equivalent of the “window
of analysis” in machine vision to scan through and select input data. Unlike a centrally controlled
sequential scanning operation, a mechanism like the MAX operation that locally and automatically
selects a relevant subset of inputs seems biologically plausible. A basic and pervasive operation
in many computational algorithms — not only in computer vision — is the search and selection
of a subset of data. Thus it is natural to speculate that a MAX-like operation may be replicated
throughout the cortex.
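The analogy can be made concrete: taking the best classifier score over all window positions is itself a MAX over position-shifted copies of the same detector. A toy sketch, with mean brightness standing in for a real window classifier:

```python
import numpy as np

def scan_and_max(image, score_fn, window=32, step=8):
    """Brute-force scanning: evaluate a window classifier at every
    position and keep the best score -- functionally a MAX over
    position-shifted 'afferents'."""
    h, w = image.shape
    return max(score_fn(image[r:r+window, c:c+window])
               for r in range(0, h - window + 1, step)
               for c in range(0, w - window + 1, step))

# The bright 16x16 patch is "found" (best score 256/1024 = 0.25)
# regardless of where it sits in the image -- translation invariance
# without any prior segmentation.
img = np.zeros((64, 64))
img[40:56, 8:24] = 1.0
assert abs(scan_and_max(img, lambda win: float(win.mean())) - 0.25) < 1e-12
```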
Simulations of a simplified two-layer version of the model [79], using soft-maximum approxima-
tions to the MAX operation (see Methods) in which the strength of the nonlinearity can be adjusted
by a parameter, show that its basic properties are preserved and structurally robust. But how is an
approximation of the MAX operation realized by neurons? It seems that it could be implemented
by several different, biologically plausible circuitries [1, 13, 17, 32, 44]. The most likely hypothesis
is that the MAX operation arises from cortical microcircuits of lateral, possibly recurrent, inhibition
between neurons in a cortical layer. An example is provided by the circuit proposed for the gain-
control and relative motion detection in the visual system of the fly [76], based on feedforward
(or recurrent) shunting presynaptic (or postsynaptic) inhibition by “pool” cells. One of its key
elements, in addition to shunting inhibition (an equivalent operation may be provided by linear
inhibition deactivating NMDA receptors), is a nonlinear transformation of the individual signals
due to synaptic nonlinearities or to active membrane properties. The circuit performs a gain control
operation and — for certain values of the parameters — a MAX-like operation. “Softmax” circuits
have been proposed in several recent studies [34, 45, 61] to account for similar cortical functions.
Together with adaptation mechanisms (underlying very short-term depression [1]), the circuit may
be capable of pseudo-sequential search in addition to selection.
Our novel claim here is that a MAX-like operation is a key mechanism for object recognition
in the cortex. The model described in this paper — including the stage from view-tuned to view-
invariant units [68] — is a purely feedforward hierarchical model. Backprojections – well known
to exist abundantly in cortex and playing a key role in other models of cortical function [59, 75] –
are not needed for its basic performance but are probably essential for the learning stage and for
known top-down effects — including attentional biases [77] — on visual recognition, which can be
naturally grafted into the inhibitory softmax circuits (see [61]) described earlier.
In our model, recognition of a specific object is invariant for a range of scales (and positions) af-
ter training with a single view at one scale, because its representation is based on features invariant
to these transformations. View invariance on the other hand requires training with several views
[68] because individual features sharing the same 2D appearance can transform very differently
under 3D rotation, depending on the 3D structure of the specific object. Simulations show that the
model’s performance is not specific to the class of paperclip objects: recognition results are similar
for, e.g., computer-rendered images of cars (and other objects).
From a computational point of view the class of models we have described can be regarded
as a hierarchy of conjunctions and disjunctions. The key aspect of our model is to identify the
disjunction stage with the build-up of invariances and to do it through a MAX-like operation. At
each conjunction stage the complexity of the features increases and at each disjunction stage so does
their invariance. At the last level (the C2 layer in this paper) it is only the presence and strength
of individual features, and not their relative geometry in the image, that matters. The dictionary
of features at that stage is overcomplete, so that the activities of the units measuring each feature
strength, independently of their precise location, can still yield a unique signature for each visual
pattern (cf. the SEEMORE system [52]).
The architecture we have described shows that this approach is consistent with the available experimental data and casts it into a class of models that is a natural extension of the hierarchical models first proposed by Hubel and Wiesel.
2.4 Methods
Basic model parameters. Patterns on the model “retina” (of 160 × 160 pixels — which corresponds to a 5° receptive field size (the literature [41] reports an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) are first filtered through a layer (S1) of simple cell-like receptive fields (first derivatives of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with standard deviations of 1.75 to 7.25 pixels in steps of 0.5 pixels; S1 filter responses were rectified dot products with the image patch falling into their receptive field, i.e., the output s1_j of an S1 cell with preferred stimulus w_j whose receptive field covers an image patch I_j is s1_j = |w_j · I_j|). Receptive field (RF) centers densely sample the input retina. Cells in the next (C1) layer each pool S1 cells (using the MAX response function, i.e., the output c1_i of a C1 cell with afferents s1_j is c1_i = max_j s1_j) of the same orientation over eight pixels of the visual field in each dimension and over all scales.
This pooling range was chosen for simplicity — invariance properties of cells were robust for different choices
of pooling ranges (cf. below). Different C1 cells were then combined in higher layers, either by combining
C1 cells tuned to different features to give S2 cells responding to co-activations of C1 cells tuned to different
orientations or to yield C2 cells responding to the same feature as the C1 cells but with bigger receptive fields.
In the simple version illustrated here, the S2 layer contains six features (all pairs of orientations of C1 cells looking at the same part of space) with a Gaussian transfer function (σ = 1, centered at 1, i.e., the response s2_k of an S2 cell receiving input from C1 cells c1_m, c1_n with receptive fields in the same location but responding to different orientations is s2_k = exp(−((c1_m − 1)² + (c1_n − 1)²)/(2σ²))), yielding a total of 10 cells in the C2 layer.
Here, C2 units feed into the view-tuned units, but in principle, more layers of S and C units are possible.
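The three response functions defined above can be transcribed directly. This is an illustrative transcription, not the original simulation code:

```python
import numpy as np

def s1_response(w, patch):
    """S1: rectified dot product of a (zero-sum, unit-norm) filter w
    with the image patch in its receptive field, s1 = |w . I|."""
    return abs(float(np.sum(np.asarray(w) * np.asarray(patch))))

def c1_response(s1_afferents):
    """C1: MAX over S1 afferents of one orientation, c1 = max_j s1_j."""
    return max(s1_afferents)

def s2_response(c1_m, c1_n, sigma=1.0):
    """S2: Gaussian tuned to co-activation of two C1 units of different
    orientation, centered at (1, 1)."""
    return float(np.exp(-((c1_m - 1.0) ** 2 + (c1_n - 1.0) ** 2)
                        / (2.0 * sigma ** 2)))

assert s2_response(1.0, 1.0) == 1.0     # both component features present
assert s2_response(0.0, 1.0) < s2_response(0.5, 1.0) < 1.0
assert c1_response([0.2, 0.7, 0.4]) == 0.7
```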
In the version of the model we have simulated, object specific learning occurs only at the level of the
synapses on the view-tuned cells at the top. More complete simulations will have to account for the effect of
visual experience on the exact tuning properties of other cells in the hierarchy.
Testing the invariance of model units. View-tuned units in the model were generated by recording the activity of units in the C2 layer feeding into the VTUs to each one of the 21 paperclip views and then setting the connection weights of each VTU, i.e., the center of the Gaussian associated with the unit, resp., to the corresponding activation. For rotation, viewpoints from 50° to 130° were tested (the training view was arbitrarily set to 90°) in steps of 4°. For scale, stimulus sizes from 16 to 160 pixels in half-octave steps (except for the last step, which was from 128 to 160 pixels), and for translation, independent translations of ���� pixels along each axis in steps of 16 pixels (i.e., exploring a plane of ����� ��� pixels) were used.
“Many feature” version. To increase the robustness to clutter of model units, the number of features in S2 was increased: Instead of the previous maximum of two afferents of different orientation looking at the same patch of space as in the version described above, each S2 cell now received input from four neighboring C1 units (in a 2 × 2 arrangement) of arbitrary orientation, giving a total of 4^4 = 256 different S2 types and finally 256 C2 cells as potential inputs to each view-tuned cell (in simulations, top-level units were sparsely connected to a subset of C2-layer units to gain robustness to clutter, cf. Results). As S2 cells now combined C1 afferents with receptive fields at different locations, and features a certain distance apart at one scale change their separation as the scale changes, pooling at the C1 level was now done in several scale bands, each of roughly a half-octave width in scale space (filter standard deviation ranges were 1.75–2.25, 2.75–3.75, 4.25–5.25, and 5.75–7.25 pixels, resp.), with the spatial pooling range in each scale band chosen accordingly (over neighborhoods of � � �, � � �, � , and �� � ��, respectively — note that system performance was robust with respect to the pooling ranges; simulations with neighborhoods of twice the linear size in each scale band produced comparable results, with a slight drop in the recognition of overlapping stimuli, as expected), as a simple way to improve the scale invariance of composite feature detectors in the C2 layer. Also, centers of C1 cells were chosen so that RFs overlapped by half a RF size in each dimension. A more principled way
would be to learn the invariant feature detectors, e.g., using the trace rule [23]. The straightforward connection
patterns used here, however, demonstrate that even a simple model shows tuning properties comparable to
the experiment.
Softmax approximation. In a simplified two-layer version of the model [79] we investigated the effects of approximations to the MAX operation on recognition performance. The model contained only one pooling stage, C1, where the strength of the pooling nonlinearity could be controlled by a parameter, p. There, the output c1_i of a C1 cell with afferents x_j was

c1_i = Σ_j [ exp(p |x_j|) / Σ_k exp(p |x_k|) ] x_j,

which performs a linear summation (scaled by the number of afferents) for p = 0 and the MAX operation for p → ∞.
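This pooling rule is easy to check numerically in its two limits. A minimal sketch, not the code of [79]:

```python
import numpy as np

def softmax_pool(x, p):
    """Soft-maximum pooling: each afferent x_j is weighted by
    exp(p|x_j|) / sum_k exp(p|x_k|). p = 0 gives the mean (linear
    summation scaled by the number of afferents); large p approaches
    the MAX operation."""
    x = np.asarray(x, dtype=float)
    w = np.exp(p * np.abs(x))
    return float(np.sum(w / w.sum() * x))

x = [0.2, 0.9, 0.4]
assert np.isclose(softmax_pool(x, 0.0), np.mean(x))    # linear limit
assert np.isclose(softmax_pool(x, 100.0), max(x))      # MAX limit
```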
2.5 Acknowledgments
Supported by grants from ONR, DARPA, NSF, ATR, and Honda. M.R. is supported by a Merck/MIT
Fellowship in Bioinformatics. T.P. is supported by the Uncas and Helen Whitaker Chair at the
Whitaker College, MIT. We are grateful to H. Bülthoff, F. Crick, B. Desimone, R. Hahnloser, C. Koch,
N. Logothetis, E. Miller, J. Pauls, D. Perrett, J. Reynolds, T. Sejnowski, S. Seung, and R. Vogels for
very useful comments and for reading earlier versions of this manuscript. We thank J. Pauls for
analyzing the average invariance ranges of his IT neurons and K. Tanaka for the permission to
reproduce Fig. 2-3 (a).
Chapter 3
Are Cortical Models Really Bound by
the “Binding Problem”?
Abstract
The usual description of visual processing in cortex is an extension of the simple to complex
hierarchy postulated by Hubel and Wiesel — a feedforward sequence of more and more complex
and invariant features. The capability of this class of models to perform higher level visual
processing such as viewpoint-invariant object recognition in cluttered scenes has been questioned
in recent years by several researchers, who in turn proposed an alternative class of models based
on the synchronization of large assemblies of cells, within and across cortical areas. The main
implicit argument for this novel and controversial view was the assumption that hierarchical
models cannot deal with the computational requirements of high level vision and suffer from
the so-called “binding problem”. We review the present situation and discuss theoretical and
experimental evidence showing that these perceived weaknesses of hierarchical models are unfounded.
In particular, we show that recognition of multiple objects in cluttered scenes, arguably among
the most difficult tasks in vision, can be done in a hierarchical feedforward model.
3.1 Introduction: Visual Object Recognition
Two problems make object recognition difficult:
1. The segmentation problem: Visual scenes normally contain multiple objects. To recognize in-
dividual objects, features must be isolated from the surrounding clutter and extracted from
the image, and the feature set must be parsed so that the different features are assigned to the
correct object. The latter problem is commonly referred to as the “Binding Problem” [110].
2. The invariance problem: Objects have to be recognized under varying viewpoints, lighting
conditions etc.
Interestingly, the human brain solves both problems quickly and with ease. Thorpe et al. [101]
report that visual processing in an object detection task in complex visual scenes can be achieved
in under 150 ms, which is on the order of the latency of the signal transmission from the retina
to inferotemporal cortex (IT), the highest area in the ventral visual stream thought to have a key
role in object recognition [103]; see also [72]. This impressive processing speed presents a strong
constraint for any model of object recognition.
3.2 Models of Visual Object Recognition and the Binding Problem
Hubel and Wiesel [37] were the first to postulate a model of visual object representation and recog-
nition. They recorded from simple and complex cells in the primary visual cortices of cats and mon-
keys and found that while both types preferentially responded to bars of a certain orientation, the
former had small receptive fields with a phase-dependent response while the latter had bigger re-
ceptive fields and showed no phase-dependence. This observation led them to hypothesize that
complex cells receive input from several simple cells. Continuing this model in a straightforward
fashion, they suggested [36] that the visual system is composed of a hierarchy of visual areas, from
simple cells all the way up to “higher order hypercomplex cells.”
Later studies [7] of macaque inferotemporal cortex (IT) described neurons tuned to views of
complex objects such as a face, i.e., the cells discharged strongly to a face seen from a specific
viewpoint but very little or not at all to other objects. A key property of these cells was their
scale and translation invariance, i.e., the robustness of their firing to stimulus transformations such as
changes in size or position in the visual field.
These findings inspired various models of visual object recognition such as Fukushima’s Neocog-
nitron [25] or, later, Perrett and Oram’s [64] outline of a model of shape processing, and Wallis and
Rolls’ VisNet [111], all of which share the basic idea of the visual system as a feedforward process-
ing hierarchy where invariance ranges and complexity of preferred features grow as one ascends
through the levels.
Models of this type prompted von der Malsburg [109] to formulate the binding problem. His
claim was that visual representations based on spatially invariant feature detectors were ambigu-
ous: “As generalizations are performed independently for each feature, information about neigh-
borhood relations and relative position, size and orientation is lost. This lack of information can
lead to the inability to distinguish between patterns that are composed of the same set of invariant
features. . . ” [110]. Moreover, as a visual scene containing multiple objects is represented by a set of
feature activations, a second problem lies in “singling out appropriate groups from the large back-
ground of possible combinations of active neurons” [110]. These problems would manifest them-
selves in various phenomena such as hallucinations (the feature sets activated by objects actually
present in the visual scene combine to yield the activation pattern characteristic of another object)
and the figure-ground problem (the inability to correctly assign image features to foreground ob-
ject and background), leading von der Malsburg to postulate the necessity of a special mechanism,
the synchronous oscillatory firing of ensembles of neurons, to bind features belonging to one object
together.
One approach to avoid these problems was presented by Olshausen et al. [62]: Instead of trying
to process all objects simultaneously, processing is limited to one object in a certain part of space
at a time, e.g., through “focussing attention” on a region of interest in the visual field, which is
then routed through to higher visual areas, ignoring the remainder of the visual field. The control
signal for the input selection in this model is thought to be provided in the form of the output of a
“blob-search” system that identifies possible candidates in the visual scene for closer examination.
While this top-down approach to circumventing the binding problem has intuitive appeal and is compatible with physiological studies that report top-down attentional modulation of receptive field properties (see the article by Reynolds & Desimone in this issue, or the recent study by Connor et al. [14]), such a sequential approach seems difficult to reconcile with the apparent speed with which object recognition can proceed even in very complex scenes containing many objects [72, 101]. It is also incompatible with reports of parallel processing of visual scenes, as observed in pop-out experiments [102], suggesting that object recognition does not depend only on explicit top-down selection in all situations.
A more head-on approach to the binding problem was taken in other studies that have called
into question the assumption that representations based on sets of spatially invariant feature detec-
tors are inevitably ambiguous. Starting with Wickelgren [114] in the context of speech recognition,
several studies have proposed how coding an object through a set of intermediate features, made
up of local arrangements of simpler features (e.g., using letter pairs, or higher order combinations,
instead of individual letters to code words — for instance, the word “tomaso” could be confused
with the word “somato” if both are coded by the sets of letters they are made up of; this ambiguity
is resolved, however, if they are represented through letter pairs) can sufficiently constrain the rep-
resentation to uniquely code complex objects without retaining global positional information (see
Mozer [58] for an elaboration of this idea and an implementation in the context of word recogni-
tion). The capabilities of such a representation based on spatially-invariant receptive fields were
recently analyzed in detail by Mel & Fiser [53] for the example domain of English text.
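The contrast between coding by single invariant features and by intermediate feature combinations can be made concrete in a few lines. The sketch below is purely illustrative (the helper names are our own), using the "tomaso"/"somato" example from the text:

```python
def bag_of_letters(word):
    # Unordered set of single "invariant features": all order information is lost.
    return frozenset(word)

def letter_pairs(word):
    # Unordered set of adjacent letter pairs: local arrangement is retained.
    return frozenset(word[i:i + 2] for i in range(len(word) - 1))

# Coded by individual letters, "tomaso" and "somato" are indistinguishable...
assert bag_of_letters("tomaso") == bag_of_letters("somato")
# ...but their letter-pair sets differ ("as" vs. "at", among shared pairs), so
# the intermediate-feature code disambiguates them without any global
# positional information.
assert letter_pairs("tomaso") != letter_pairs("somato")
```

Note that most pairs ("to", "om", "ma", "so") are still shared; a single differing local arrangement suffices to keep the codes distinct.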
In the visual domain, Mel [52] recently presented a model to perform invariant recognition of
a high number (100) of objects of different types, using a representation based on a large number
of feature channels. While the model performed surprisingly well for a variety of transformations,
recognition performance depended strongly on color cues, and did not seem as robust to scale
changes as experimental neurons [49]. Perrett & Oram [65] have recently outlined a conceptual
model based on very similar ideas of how a representation based on feature combinations could in
theory avoid the “Binding Problem”, e.g., by coding a face through a set of detectors for combina-
tions of face parts such as eye-nose or eyebrow-hairline. What has been lacking so far, however, is
a computational implementation quantitatively demonstrating that such a model can actually per-
form “real-world” subordinate visual object recognition to the extent observed in behavioral and
physiological experiments [48, 49, 54, 89], where effects such as scale changes, occlusion and over-
lap pose additional problems not found in an idealized text environment. In particular, unlike in
the text domain where the input consists of letter strings and the extraction of features (letter com-
binations) from the input is therefore trivial, the crucial task of invariant feature extraction from
the image is nontrivial for scenes containing complex shapes, especially when multiple objects are
present.
We have developed a hierarchical feedforward model of object recognition in cortex, described
in [82], as a plausibility proof that such a model can account for several properties of IT cells, in
particular the invariance properties of IT cells found by Logothetis et al. [49]. In the following,
we will show that such a simple model can perform invariant recognition of complex objects in
cluttered scenes and is compatible with recent physiological studies. This is a plausibility proof
that complex oscillation-based mechanisms are not necessarily required for these tasks and that the
binding problem seems to be a problem for only some models of object recognition.
3.3 A Hierarchical Model of Object Recognition in Cortex
Studies of receptive field properties along the ventral visual stream in the macaque, from primary
visual cortex, V1, to anterior IT report an overall trend of an increase of average feature complexity
and receptive field size throughout the stream [41]. While simple cells in V1 have small localized
Figure 3-1: (a) Cartoon of the Poggio and Edelman model [68] of view-based object recognition. The gray ovals correspond to view-tuned units that feed into a view-invariant unit (open circle). (b) Tuning curves of the view-tuned (gray) and the view-invariant units (black).
receptive fields and respond preferentially to simple shapes like bars, cells in anterior IT have been
found to respond to views of complex objects while showing great tolerance to scale and position
changes. Moreover, some IT cells seem to respond to objects in a view-invariant manner [6, 49, 66].
Our model follows this general framework. Previously, Poggio and Edelman [68] presented
a model of how view-invariant cells could arise from view-tuned cells (Fig. 3-1). However, they
did not describe any model of how the view-tuned units (VTUs) could come about. We have re-
cently developed a hierarchical model that closes this gap and shows how VTUs tuned to complex
features can arise from simple cell-like inputs. A detailed description of our model can be found
in [82] (for preliminary accounts refer to [79, 80], and also to [43]). We briefly review here some
of its main properties. The central idea of the model is that invariance to scaling and translation
and robustness to clutter on one hand, and feature complexity on the other hand require different
transfer functions, i.e., mechanisms by which a neuron combines its inputs to arrive at an output
value: While for the latter a weighted sum of different features, which makes the neuron respond
preferentially to a specific activity pattern over its afferents, is a suitable transfer function, increas-
ing invariance requires a different transfer function that pools over different afferents tuned to the
same feature but transformed to different degrees (e.g., at different scales to achieve scale invari-
ance). A suitable pooling function (for a computational justification, see [82]) is a so-called MAX
function, where the output of the neuron is determined by the strongest afferent, thus performing a “scan-
ning” operation over afferents tuned to different positions and scales. This is similar to the original
Hubel and Wiesel model of a complex cell receiving input from simple cells at different locations
to achieve phase-invariance.
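The two transfer functions can be stated compactly. The following sketch is illustrative only, with made-up activity values, and contrasts the template-matching and MAX-pooling operations:

```python
import numpy as np

def template_match(afferents, weights):
    # "S"-type operation: a weighted sum, maximal for a specific
    # activity pattern over the afferents (builds feature complexity).
    return float(np.dot(weights, afferents))

def max_pool(afferents):
    # "C"-type operation: the strongest afferent alone determines the
    # output, scanning over position- or scale-shifted copies of a feature.
    return float(np.max(afferents))

# The same feature detected at any of four pooled positions drives the MAX
# unit equally strongly, which is what yields the invariance:
assert max_pool([0.0, 0.9, 0.0, 0.0]) == max_pool([0.9, 0.0, 0.0, 0.0])
```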
In our model of object recognition in cortex (Fig. 3-2), the two types of operations, selection and
template matching, are combined in a hierarchical fashion to build up complex, invariant feature
detectors from small, localized, simple cell-like receptive fields in the bottom layer. In particular,
patterns on the model “retina” (of 160 × 160 pixels, which corresponds to a 5° receptive field size ([41] report an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) are first filtered through a layer (S1, adopting Fukushima’s nomenclature [25] of referring to feature-building cells as “S” cells and pooling cells as “C” cells) of simple cell-like receptive fields (first derivatives of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with standard deviations of 1.75 to 4.75 pixels in steps of 0.5 pixels). S1 filter responses are absolute values of the image “filtered” through the units’ receptive fields (more precisely, the rectified dot product of the cell’s receptive field with the corresponding image patch). Receptive field centers densely sample the input retina. Cells in the next layer (C1) each pool S1 cells of the same orientation over a range of scales and positions. Filters were grouped into four bands of neighboring sizes; sampling over position was done over square patches whose linear dimensions increased from the smallest to the largest filter band, with patches overlapping by half in each direction, to obtain more invariant cells responding to the same features as the S1 cells. Different C1 cells were then combined in higher layers — the figure illustrates two possibilities: either combining C1 cells tuned to different features to give S2 cells responding to co-activations of C1 cells tuned to different orientations, or yielding C2 cells responding to the same feature as the C1 cells but with bigger receptive fields (i.e., the hierarchy does not have to be a strict alternation of S and C layers). In the version described in this paper, there were no direct C1 → C2 connections, and each S2 cell received input from four neighboring C1 units (in a 2 × 2 arrangement) of arbitrary orientation, yielding a total of 4^4 = 256 different S2 cell types. S2 transfer functions were Gaussian (σ = 1, centered at 1). C2 cells then pooled inputs from all S2 cells of the same type, producing invariant feature detectors tuned to complex shapes. Top-level view-tuned units had Gaussian response functions, and each VTU received inputs from a subset of C2 cells (see below).
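The S1/C1 stages just described can be sketched in a few lines of NumPy. This is a simplified illustration, not the original implementation: only one orientation and one scale band are shown, scale pooling is omitted, and the filter size and pooling parameters are chosen for brevity:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def s1_filter(sigma, theta, size):
    # Oriented first-derivative-of-Gaussian filter: zero-sum,
    # square-normalized to 1 (illustrative parametrization).
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    u = x * np.cos(theta) + y * np.sin(theta)  # coordinate along the orientation
    g = -u * np.exp(-(x**2 + y**2) / (2 * sigma**2))
    g -= g.mean()                              # enforce zero sum
    return g / np.sqrt((g**2).sum())           # unit energy

def s1_response(image, kernel):
    # Rectified dot product of the filter with every image patch
    # (receptive field centers densely sample the retina).
    patches = sliding_window_view(image, kernel.shape)
    return np.abs(np.einsum('ijkl,kl->ij', patches, kernel))

def c1_pool(s1_map, pool):
    # MAX over local position neighborhoods, with patches
    # overlapping by half in each direction.
    windows = sliding_window_view(s1_map, (pool, pool))
    return windows[::pool // 2, ::pool // 2].max(axis=(2, 3))

img = np.random.rand(160, 160)                 # the model "retina"
k = s1_filter(sigma=1.75, theta=0.0, size=7)
c1 = c1_pool(s1_response(img, k), pool=8)
```

In the full model this would be repeated for each of the four orientations and each scale band, with the C1 MAX also taken across neighboring filter scales.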
This model had originally been developed to account for the transformation tolerance of view-
tuned units in IT as recorded from by Logothetis et al. [49]. It turns out, however, that the model
also has interesting implications for the binding problem.
3.4 Binding without a problem
To correctly recognize multiple objects in clutter, two problems must be solved: i) features must be
robustly extracted, and ii) based on these features, a decision has to be made about which objects
are present in the visual scene. The MAX operation can perform robust feature extraction (cf. [82]):
A MAX pooling cell that receives inputs from cells tuned to the same feature at, e.g., different lo-
cations, will select the most strongly activated afferent, i.e., its response will be determined by the
afferent with the closest match to its preferred feature in its receptive field. Thus, the MAX mech-
anism effectively isolates the feature of interest from the surrounding clutter. Hence, to achieve
[Figure 3-2, layer labels from bottom to top: simple cells (S1), complex cells (C1), “composite feature” cells (S2), “complex composite” cells (C2), view-tuned cells; connections alternate between “weighted sum” and “MAX” operations.]
Figure 3-2: Diagram of our hierarchical model [82] of object recognition in cortex. It consists of layers of linear units that perform a template match over their afferents (blue arrows), and of non-linear units that perform a “MAX” operation over their inputs, where the output is determined by the strongest afferent (green arrows). While the former operation serves to increase feature complexity, the latter increases invariance by effectively scanning over afferents tuned to the same feature but at different positions (to increase translation invariance) or scales (to increase scale invariance, not shown). In the version described in this paper, learning only occurred at the connections from the C2 units to the top-level view-tuned units.
robustness to clutter, a VTU should only receive input from cells that are strongly activated by the
VTU’s preferred stimulus (i.e., those features that are relevant to the definition of the object) and
thus less affected by clutter (which will tend to activate the afferents less and will therefore be ig-
nored by the MAX response function). Also, in such a scheme, two view-tuned neurons receiving
input from a common afferent feature detector will tend to both have strong connections to this fea-
ture detector. Thus, there will be little interference even if, due to its MAX response function, the common feature detector responded to only one (the stronger) of the two stimuli in its receptive field. Note that the situation would be hopeless for a response function that pools over all affer-
ents through, for example, a linear sum function: The response would always change when another
object is introduced in the visual field, making it impossible to disentangle the activations caused
by the individual stimuli without an additional mechanism such as, for instance, an attentional
sculpting of the receptive field or some kind of segmentation process.
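The difference between MAX and SUM pooling under clutter is easy to simulate. The following is a toy sketch with random activations; the numbers are illustrative, not fitted to the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_afferents = 40                           # cf. the 40 C2 afferents per VTU below
target = rng.random(n_afferents)           # afferent activity: preferred object alone
clutter = 0.5 * rng.random(n_afferents)    # weaker activity caused by a second object

# If each afferent performs a MAX over the two objects' contributions, every
# afferent that responds more strongly to the target is left unchanged:
max_scene = np.maximum(target, clutter)
frac_unchanged_max = np.mean(np.isclose(max_scene, target))

# A linear-sum afferent instead mixes the two contributions, so virtually
# every response shifts once the second object appears:
sum_scene = target + clutter
frac_unchanged_sum = np.mean(np.isclose(sum_scene, target))

# Most MAX responses survive the added clutter; essentially no SUM response does.
```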
In the following two sections we will show simulations that support these theoretical consider-
ations, and we will compare them to recent physiological experiments.
3.4.1 Recognition of multiple objects
The ability of the model neurons to perform recognition of multiple, non-overlapping objects was
investigated in the following experiment: 21 model neurons, each tuned to a view of a randomly
selected paperclip object (as used in theoretical [68], psychophysical [8, 48], and physiological [49]
studies on object recognition), were each presented with 21 displays consisting of that neuron’s
preferred clip combined with each of the 21 preferred clips (in the upper left and lower right corner
of the model retina, respectively; see Fig. 3-3 (a)), yielding 21 × 21 = 441 two-clip displays. Recognition perfor-
mance was evaluated by comparing the neuron’s response to these displays with its responses to 60
other, randomly chosen “distractor” paperclip objects (cf. Fig. 3-3). Following the studies on view-
invariant object recognition [8, 48, 82], an object is said to be recognized if the neuron’s response
to the two-clip displays (containing its preferred stimulus) is greater than to any of the distractor
objects. For 40 afferents to each view-tuned cell (i.e., the 40 C2 units excited most strongly by the
neuron’s preferred stimulus; this choice produced top-level neurons with tuning curves similar to
the experimental neurons [82]), we find that on average in 90% of the cases recognition of the neu-
ron’s preferred clip is still possible, indicating that there is little interference between the activations
caused by the two stimuli in the visual field. The maximum recognition ra