
A Saliency-Based Search Mechanism for Overt and Covert Shifts of Visual Attention

Laurent Itti and Christof Koch
Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125.
Tel: (626) 395-6855; Fax: (626) 796-8876; Email: {itti,[email protected]

Most models of visual search, whether involving overt eye movements or covert shifts of attention, are based on the concept of a saliency map, that is, an explicit two-dimensional map that encodes the saliency or conspicuity of objects in the visual environment. Competition among neurons in this map gives rise to a single winning location that corresponds to the next attended target. Inhibiting this location automatically allows the system to attend to the next most salient location. We describe a detailed computer implementation of such a scheme, focusing on the problem of combining information across modalities (here orientation, intensity and color information) in a purely stimulus-driven manner. The model is applied to common psychophysical stimuli as well as to a very demanding visual search task. Its successful performance is used to address the extent to which the primate visual system carries out visual search via one or more such saliency maps, and how this can be tested.

1 Introduction

Most biological vision systems (including Drosophila [1]) appear to employ a serial computational strategy when inspecting complex visual scenes. Particular locations in the scene are selected based on their behavioral relevance or on local image cues. In primates, the identification of objects and the analysis of their spatial relationship usually involves either rapid, saccadic eye movements to bring the fovea onto the object, or covert shifts of attention.

It may seem ironic that brains employ serial processing, since one usually thinks of them as paradigmatic "massively parallel" computational structures. However, in any physical computational system, processing resources are limited, which leads to bottlenecks similar to those faced by the von Neumann architecture on conventional digital machines. Nowhere is this more evident than in the primate visual system, where the amount of information coming down the optic nerve (estimated to be on the order of 10^8 bits per second) far exceeds what the brain is capable of fully processing and assimilating into conscious experience. The strategy nature has devised for dealing with this bottleneck is to select certain portions of the input to be processed preferentially, shifting the processing focus from one location to another in a serial fashion.

Despite the widely shared belief in the general public that "we see everything around us," only a small fraction of the information registered by the visual system at any given time reaches levels of processing that directly influence behavior. This is vividly demonstrated by change blindness [2, 3, 4], in which significant image changes remain nearly invisible under natural viewing conditions, although observers demonstrate no difficulty in perceiving these changes once directed to them. Overt and covert attention controls access to these privileged levels and ensures that the selected information is relevant to behavioral priorities and objectives. Operationally, information can be said to be "attended" if it enters short-term memory and remains there long enough to be voluntarily reported. Thus, visual attention is closely linked to visual awareness [5].

But how is the selection of one particular spatial location accomplished? Does it involve primarily bottom-up, sensory-driven cues, or does expectation of the target's characteristics play a decisive role? A large body of literature has concerned itself with the psychophysics of visual search or orienting for targets in sparse arrays or in natural scenes, using either covert or overt shifts of attention (for reviews, see [6] or the survey article [7]).

Much evidence has accumulated in favor of a two-component framework for the control of where in a visual scene attention is deployed [8, 9, 10, 11, 12, 13, 14, 15, 16]: a bottom-up, fast, primitive mechanism that biases the observer towards selecting stimuli based on their saliency (most likely encoded in terms of center-surround mechanisms), and a second, slower, top-down mechanism with variable selection criteria, which directs the "spotlight of attention" under cognitive, volitional control. Whether visual consciousness can be reached by either saliency-based or top-down attentional selection, or by both, remains controversial.

Preattentive, parallel levels of processing do not represent all parts of a visual scene equally well, but instead provide a weighted representation, with strong responses to a few parts of the scene and poor responses to everything else. Indeed, in an awake monkey freely viewing a natural visual scene, there are not many locations which elicit responses in visual cortex comparable to those observed with isolated, laboratory stimuli [17].


Whether a given part of the scene elicits a strong or a poor response is thought to depend very much on "context", that is, on what stimuli are present in other parts of the visual field. In particular, the recently accumulated evidence for "non-classical" modulation of a cell's response by the presence of stimuli outside of the cell's receptive field provides direct support for the idea that different visual locations compete for activity [18, 19, 20]. Those parts which elicit a strong response are thought to draw visual attention to themselves and therefore to be experienced as "visually salient". Directing attention at any of the other parts is thought to require voluntary "effort".

Both modes of attention can operate at the same time, and visual stimuli have two ways of penetrating to higher levels of awareness: being willfully brought into the focus of attention, or winning the competition for saliency.

Koch and Ullman [21] introduced the idea of a saliency map to accomplish preattentive selection (see also the concept of a "master map" in [11]). This is an explicit two-dimensional map that encodes the saliency of objects in the visual environment. Competition among neurons in this map gives rise to a single winning location that corresponds to the most salient object, which constitutes the next target. If this location is subsequently inhibited, the system automatically shifts to the next most salient location, endowing the search process with internal dynamics (Fig. 1.a).

Many computational models of human visual search have embraced the idea of a saliency map under different guises [11, 22, 23, 24, 25]. The appeal of an explicit saliency map is the relatively straightforward manner in which it allows the input from multiple, quasi-independent feature maps to be combined and to give rise to a single output: the next location to be attended. Electrophysiological evidence points to the existence of several neuronal maps, in the pulvinar, the superior colliculus and the intraparietal sulcus, which appear to specifically encode for the saliency of a visual stimulus [26, 27, 28, 29].

However, some researchers reject the idea of a topographic map in the brain whose raison d'être is the representation of salient stimuli. In particular, Desimone and Duncan [30] postulate that selective attention is a consequence of interactions among feature maps, each of which encodes, in an implicit fashion, the saliency of a stimulus in that particular feature. We know of only a single implementation of this idea in terms of a computer algorithm [31].

We here describe a computer implementation of a preattentive selection mechanism based on the architecture of the primate visual system. We address the thorny problem of how information from different modalities (in the case treated here, from 42 maps encoding intensity, orientation and color in a center-surround fashion at a number of spatial scales) can be combined into a single saliency map. Our algorithm qualitatively reproduces human performance on a number of classical search experiments.

Vision algorithms frequently fail when confronted with realistic, cluttered images.

We therefore studied the performance of our search algorithm using high-resolution (6144×4096 pixels) photographs containing images of military vehicles in a complex rural background. Our algorithm shows, on average, superior performance compared to human observers searching for the same targets, although our system does not yet include any top-down, task-dependent tuning.

Finally, we discuss future computational work that needs to address the physiological evidence for multiple saliency maps, possibly operating in different coordinate systems (e.g., retina versus head coordinates), and the need to integrate information across saccades.

The work presented here is a considerable elaboration upon the model presented in [23] and has not been reported previously.

2 The Model

Our model is limited to the bottom-up control of attention, i.e., to the control of selective attention by the properties of the visual stimulus. It does not incorporate any top-down, volitional component. Furthermore, we are here only concerned with the localization of the stimuli to be attended ("Where"), not their identification ("What"). A number of authors [25, 32] have presented models for the neuronal expression of attention along the occipito-temporal pathway once spatial selection has occurred.

In the present work, we make the following four assumptions. First, visual input is represented, in early visual structures, in the form of iconic (appearance-based) topographic feature maps. Two crucial steps in the construction of these representations consist of center-surround computations in every feature at different spatial scales, and within-feature spatial competition for activity. Second, information from these feature maps is combined into a single map which represents the local "saliency" of any one location with respect to its neighborhood. Third, the maximum of this saliency map is, by definition, the most salient location at a given time, and it determines the next location of the attentional searchlight. And fourth, the saliency map is endowed with internal dynamics allowing the perceptive system to scan the visual input such that its different parts are visited by the focus of attention in order of decreasing saliency.

Fig. 1.b shows an overview of our model. Input is provided in the form of digitized images, from a variety of sources including a consumer-electronics NTSC video camera.

2.1 Extraction of Early Visual Features

Low-level vision features (color channels tuned to red, green, blue and yellow hues, orientation, and brightness) are extracted from the original color image at several spatial scales, using linear filtering. The different spatial scales are created using Gaussian pyramids [33], which consist of progressively low-pass filtering and subsampling the input image.
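As a rough, minimal sketch of this pyramid construction (not the authors' code: the exact low-pass kernel is not specified in the text, so a Gaussian blur of width 1 pixel and a SciPy library call are assumed here), each level can be obtained by blurring and decimating the previous one:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, depth=9):
    """Build a Gaussian pyramid by repeated low-pass filtering and subsampling.

    image: 2-D float array holding one feature channel (e.g., intensity).
    Returns a list of `depth` arrays; level 0 is the original image and each
    subsequent level is coarser by a factor of two in each dimension.
    """
    levels = [np.asarray(image, dtype=float)]
    for _ in range(depth - 1):
        blurred = gaussian_filter(levels[-1], sigma=1.0)  # low-pass before decimation
        levels.append(blurred[::2, ::2])                  # keep every other row/column
    return levels
```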


[Figure 1 appears here. Panel (a): the original Koch and Ullman architecture, with feature maps feeding a saliency map, a WTA stage and a central representation. Panel (b): the present model, with the input image linearly filtered at 8 spatial scales, center-surround differences and normalization yielding 7 features (6 maps per feature) for intensity (ON, OFF), colors (R, G, B, Y) and orientations (0, 45, 90, 135 degrees), conspicuity maps, linear combination into the saliency map, and a WTA with inhibition of return feeding the central representation.]

Figure 1: (a) Original model of saliency-based visual attention, adapted from Koch and Ullman [21]. Early visual features such as color, intensity or orientation are computed, in a massively parallel manner, in a set of pre-attentive feature maps based on retinal input (not shown). Activity from all feature maps is combined at each location, giving rise to activity in the topographic saliency map. The winner-take-all (WTA) network detects the most salient location and directs attention towards it, such that only features from this location reach a more central representation for further analysis. (b) Schematic diagram for the model used in this study. It directly builds on the architecture proposed in (a), but provides a complete implementation of all processing stages. Visual features are computed using linear filtering at eight spatial scales, followed by center-surround differences, which compute local spatial contrast in each feature dimension, for a total of 42 maps. An iterative lateral inhibition scheme instantiates competition for salience within each feature map. After competition, feature maps are combined into a single "conspicuity map" for each feature type. The seven conspicuity maps then are summed into the unique topographic saliency map. The saliency map is implemented as a 2-D sheet of Integrate-and-Fire (I&F) neurons. The WTA, also implemented using I&F neurons, detects the most salient location and directs attention towards it. An Inhibition-of-Return mechanism transiently suppresses this location in the saliency map, such that attention is autonomously directed to the next most salient image location. We here do not consider the computations necessary to identify a particular object at the attended location.

In our implementation, pyramids have a depth of 9 scales, providing horizontal and vertical image reduction factors ranging from 1:1 (level 0, the original input image) to 1:256 (level 8) in consecutive powers of two.

Each feature is computed in a center-surround structure akin to visual receptive fields. Using this biological paradigm renders the system sensitive to local spatial contrast in a given feature rather than to amplitude in that feature map. Center-surround operations are implemented in the model as differences between a fine and a coarse scale for a given feature: the center of the receptive field corresponds to a pixel at level c ∈ {2, 3, 4} in the pyramid, and the surround to the corresponding pixel at level s = c + δ, with δ ∈ {3, 4}.

Page 4: A Saliency-Based Search Mechanism for Overt and Covert Shifts of ...

We hence compute six feature maps for each type of feature (at scale pairs 2-5, 2-6, 3-6, 3-7, 4-7 and 4-8). Seven types of features, for which wide evidence exists in mammalian visual systems, are computed in this manner from the low-level pyramids: as detailed below, one feature type encodes for on/off image intensity contrast [34], two encode for red/green and blue/yellow double-opponent channels [35, 36], and four encode for local orientation contrast [37, 38].

The six feature maps for the intensity feature type encode the modulus of image luminance contrast, i.e., the absolute value of the difference between intensity at the center (one of the three c scales) and intensity in the surround (one of the six s = c + δ scales). To isolate chromatic information, each of the red, green and blue channels in the input image is first normalized by the intensity channel; a quantity corresponding to the double-opponency cells in primary visual cortex is then computed by center-surround differences across scales. Each of the six red/green feature maps is created by first computing (red − green) at the center, then subtracting (green − red) in the surround, and finally outputting the absolute value. Six blue/yellow feature maps are similarly created. Local orientation is obtained at all scales through the creation of oriented Gabor pyramids from the intensity image [39]. Four orientations are used (0°, 45°, 90° and 135°), and orientation feature maps are obtained from absolute center-surround differences between these channels. These maps encode, as a group, how different the average local orientation is between the center and surround scales. A more detailed mathematical description of the preattentive feature extraction stage has been presented previously [23].
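A compact sketch of these across-scale differences (our own illustration; nearest-neighbour upsampling stands in for whatever interpolation the original implementation used, and `pyr` is a 9-level Gaussian pyramid for one feature channel, as in the sketch above):

```python
import numpy as np

def center_surround(pyr, c, s):
    """Absolute across-scale difference between a fine "center" pyramid level c
    and a coarse "surround" level s, evaluated at the resolution of level c."""
    center = pyr[c]
    factor = 2 ** (s - c)
    # Nearest-neighbour upsampling of the coarse surround level, cropped to size.
    surround = np.kron(pyr[s], np.ones((factor, factor)))
    surround = surround[:center.shape[0], :center.shape[1]]
    return np.abs(center - surround)

def six_feature_maps(pyr):
    """The six (c, s) scale pairs used in the text: c in {2, 3, 4}, s = c + {3, 4}."""
    return {(c, c + d): center_surround(pyr, c, c + d)
            for c in (2, 3, 4) for d in (3, 4)}
```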

2.2 Combining Information Across Multiple Maps

Our modeling hypotheses assume the existence of a unique topographic saliency map. At each spatial location, activity from the 42 feature maps consequently needs to be combined into a unique scalar measure of salience. One major difficulty in such a combination resides in the fact that the different feature maps arise from different visual modalities, which encode a priori non-comparable stimulus dimensions: for example, how should a 10° orientation discontinuity compare to a 5% intensity contrast?

In addition, because of the large number of maps being combined, the system is faced with a severe signal-to-noise ratio problem: a salient object may elicit a strong peak of activity in only one or a few feature maps, tuned to the features of that object, while a larger number of feature maps, for example tuned to the features of distracting objects, may show strong peaks at numerous locations. For instance, a stimulus display containing one vertical bar among many horizontal bars yields an isolated peak of activity in the map tuned to vertical orientation at the scale of the bar; the same stimulus display, however, also elicits strong peaks of activity in the intensity channel at the locations of all bars, simply because each bar has high intensity contrast with the background. When all feature maps are combined into the saliency map, the isolated orientation pop-out is hence likely to be greatly weakened, at best, or even entirely lost, at worst, among the numerous strong intensity responses.

Previously, we have shown that the simplest feature combination scheme (normalizing each feature map to a fixed dynamic range and then summing all maps) yields very poor detection performance for salient targets in complex natural scenes [40]. One possible way to improve performance is to learn linear map combination weights, by providing the system with examples of targets to be detected. While performance improves greatly, this method presents the disadvantage of yielding different specialized models (that is, sets of synaptic weights), one for each type of target studied [40].

In the present study, we derive a generic model which does not impose any strong bias for any particular feature dimension. To this end, we implemented a simple within-feature spatial competition scheme, directly inspired by physiological and psychological studies of long-range cortico-cortical connections in early visual areas. These connections, which can span up to 6-8 mm in striate cortex, are thought to mediate "non-classical" response modulation by stimuli outside the cell's receptive field. In striate cortex, these connections are made by axonal arbors of excitatory (pyramidal) neurons in layers III and V [41, 42, 43, 44]. Non-classical interactions are thought to result from a complex balance of excitation and inhibition between neighboring neurons, as shown by electrophysiology [18, 19, 20], optical imaging [45], and human psychophysics [46, 47, 48].

Although much experimental work is being devoted to the characterization of these interactions, a precise quantitative understanding of them is still in the early stages [48]. Rather than attempting to propose a detailed quantitative account of such interactions, our model hence simply reproduces three widely observed features of those interactions. First, interactions between a center location and its non-classical surround appear to be dominated by an inhibitory component from the surround to the center [49], although this effect depends on the relative contrast between center and surround [20]; hence our model focuses on non-classical surround inhibition. Second, inhibition from non-classical surround locations is strongest from neurons which are tuned to the same stimulus properties as the center [18, 50, 51, 43, 52, 53]; as a consequence, our model implements interactions within each individual feature map rather than between maps. Third, inhibition appears strongest at a particular distance from the center [48], and weakens both with shorter and longer distances. These three remarks suggest that the structure of non-classical interactions can be coarsely modeled by a two-dimensional difference-of-Gaussians (DoG) connection pattern (Fig. 2).


The specific implementation of these interactions in our model is as follows. Each feature map is first normalized to a fixed dynamic range (between 0 and 1), in order to eliminate feature-dependent amplitude differences due to different feature extraction mechanisms. Each feature map is then iteratively convolved by a large 2-D DoG filter, the original image is added to the result, and negative results are set to zero after each iteration. The DoG filter, a section of which is shown in Fig. 2, yields strong local excitation at each visual location, which is counteracted by broad inhibition from neighboring locations. Specifically, we have:

$$\mathrm{DoG}(x,y) = \frac{c_{\mathrm{ex}}^{2}}{2\pi\sigma_{\mathrm{ex}}^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma_{\mathrm{ex}}^{2}}} \;-\; \frac{c_{\mathrm{inh}}^{2}}{2\pi\sigma_{\mathrm{inh}}^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma_{\mathrm{inh}}^{2}}} \qquad (1)$$

In our implementation, σ_ex = 2% and σ_inh = 25% of the input image width, c_ex = 0.5 and c_inh = 1.5 (Fig. 2).


Figure 2: (a) Gaussian pixel widths for the 9 scales used in the model. Scale σ = 0 corresponds to the original image, and each subsequent scale is coarser by a factor of 2. Two examples of the six center-surround receptive field types are shown, for scale pairs 2-5 and 4-8. (b) Illustration of the spatial competition for salience implemented within each of the 42 feature maps. Each map receives input from the linear filtering and center-surround stages. At each step of the process, the convolution of the map by a large Difference-of-Gaussians (DoG) kernel is added to the current contents of the map. This additional input coarsely models short-range excitatory processes and long-range inhibitory interactions between neighboring visual locations. The map is half-wave rectified, such that negative values are eliminated, making the iterative process non-linear. Ten iterations of the process are carried out before the output of each feature map is used in building the saliency map.

At each iteration of the normalization process, a given feature map M is subjected to the following transformation:

$$M \;\leftarrow\; \bigl|\, M + M * \mathrm{DoG} - C_{\mathrm{inh}} \,\bigr|_{\geq 0} \qquad (2)$$

where DoG is the 2-D Difference-of-Gaussians filter described above, |·|_{≥0} discards negative values, and C_inh is a constant inhibitory term (C_inh = 0.02 in our implementation, with the map initially scaled between 0 and 1). C_inh introduces a small bias towards slowly suppressing areas in which the excitation and inhibition balance almost exactly; such regions typically correspond to extended regions of uniform texture (depending on the DoG parameters), which we would not consider salient.

Each feature map is subjected to 10 iterations of the process described in Eq. 2. The choice of the number of iterations is somewhat arbitrary: in the limit of an infinite number of iterations, any non-empty map will converge towards a single peak (except for a few unrealistic, singular configurations), hence constituting only a poor representation of the scene, while with few iterations spatial competition is weak and inefficient. Two examples of the time evolution of this process are shown in Fig. 3, and illustrate that using on the order of 10 iterations yields adequate distinction between the two example images shown. As expected, feature maps with initially numerous peaks of similar amplitude are suppressed by the interactions, while maps with one or a few initially stronger peaks become enhanced. It is interesting to note that this within-feature spatial competition scheme resembles a "winner-take-all" network with localized inhibitory spread, which allows for a sparse distribution of winners across the visual scene (see [54] for a 1-D real-time implementation in analog VLSI).

After normalization, the feature maps for intensity, color, and orientation are summed across scales into three separate "conspicuity maps", one for intensity, one for color and one for orientation (Fig. 1b). Each conspicuity map is then subjected to another 10 iterations of Eq. 2. The motivation for the creation of three separate channels and their individual normalization is the hypothesis that similar features compete strongly for salience, while different modalities contribute independently to the saliency map. Although we are not aware of any supporting experimental evidence for this hypothesis, this additional step has the computational advantage of further enforcing that only a spatially sparse distribution of strong activity peaks is present within each visual feature type, before combination of all three types into the scalar saliency map.
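The iterative competition of Eqs. 1 and 2 can be sketched as follows. This is our own approximation: the convolution with the DoG kernel is realized as the difference of two Gaussian blurs (mathematically equivalent to Eq. 1, since each term is a scaled, normalized Gaussian), and boundary handling is left to the library defaults, which the text does not specify.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_map(feature_map, n_iter=10, c_ex=0.5, c_inh=1.5,
                  sigma_ex=0.02, sigma_inh=0.25, c_const=0.02):
    """Within-feature spatial competition (Eqs. 1-2): iterate DoG filtering,
    add the result to the map, subtract the constant inhibition, and rectify.
    sigma_ex / sigma_inh are given as fractions of the map width."""
    m = np.asarray(feature_map, dtype=float)
    m = (m - m.min()) / (m.max() - m.min() + 1e-12)       # scale to [0, 1]
    width = m.shape[1]
    for _ in range(n_iter):
        excit = c_ex ** 2 * gaussian_filter(m, sigma_ex * width)
        inhib = c_inh ** 2 * gaussian_filter(m, sigma_inh * width)
        m = m + excit - inhib - c_const                    # M + M*DoG - C_inh
        m[m < 0] = 0.0                                     # |.|>=0 rectification
    return m
```

Run on a map with a single strong peak, this process leaves the peak largely intact; run on a map with many comparable peaks, it drives the whole map towards zero, as illustrated in Fig. 3.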


[Figure 3 appears here: the two example feature maps of panels (a) and (b), shown at iterations 0, 2, 4, 6, 8, 10 and 12.]

Figure 3: (a) Iterative spatial competition for salience in a single feature map with one strongly activated location surrounded by several weaker ones. After a few iterations, the initial maximum has gained further strength while at the same time suppressing weaker activation regions. (b) Iterative spatial competition for salience in a single feature map containing numerous strongly activated locations. All peaks inhibit each other more or less equally, resulting in the entire map being suppressed.

2.3 The Saliency Map

After the within-feature competitive process has taken place in each conspicuity map, these maps are linearly summed into the unique saliency map, which resides at scale 4 (a reduction factor of 1:16 compared to the original image). At any given time, the maximum of the saliency map corresponds to the most salient stimulus, to which the focus of attention should be directed next in order to allow for more detailed inspection by neurons along the occipito-temporal pathway. To find the most salient location, we have to determine the maximum of the saliency map.


[Figure 4 appears here: input image; intensity, color and orientation contrast (conspicuity) maps; saliency map; and the attended locations at 104, 169, 223 and 274 ms.]

Figure 4: Example of the working of our model with a 512 × 384 pixel color image. Feature maps are extracted from the input image at several spatial scales, and are combined into three separate conspicuity maps (Intensity, Color and Orientation; see Fig. 1b) at scale 4 (32 × 24 pixels). The three conspicuity maps that encode for saliency within these three domains are combined and fed into the single saliency map (also 32 × 24 pixels). A neural winner-take-all network then successively selects, in order of decreasing saliency, the attended locations. Once a location has been attended to for some brief interval, it is transiently suppressed in the saliency map by the inhibition-of-return mechanism (dark round areas). Note how the inhibited locations recover over time (e.g., the first attended location has regained some activity at 274 ms), due to the integrative properties of the saliency map. The radius of the focus of attention was 64 pixels.

This maximum is selected by application of a winner-take-all algorithm. Different mechanisms have been suggested for the implementation of neural winner-take-all networks [21, 55]; in particular, see [56] for a multi-scale version of the winner-take-all network. In our model, we used a two-dimensional layer of integrate-and-fire neurons with strong global inhibition, in which the inhibitory population is reliably activated by any neuron in the layer (a more realistic implementation would consist of populations of neurons; for simplicity, we model such populations by a single neuron with very strong synapses). When the first of these integrate-and-fire cells fires (the winner), it generates a sequence of action potentials, causing the focus of attention (FOA) to shift to the winning location. These action potentials also activate the inhibitory population, which in turn inhibits all cells in the layer, resetting the network to its initial state.

In the absence of any further control mechanism, the system described so far would, in the case of a static scene, direct its focus of attention constantly to one location, since the same winner would always be selected. To avoid this undesirable behavior, we follow Koch and Ullman [21] and introduce inhibitory feedback from the winner-take-all (WTA) array to the saliency map.

When a spike occurs in the WTA network, the integrators in the saliency map transiently receive additional input with the spatial structure of a difference of Gaussians. The inhibitory center (with a standard deviation of half the radius of the FOA) is at the location of the winner; it and its neighbors become inhibited in the saliency map. As a consequence, attention switches to the next most conspicuous location (Fig. 4). Such an "inhibition of return" has been well demonstrated for covert attentional shifts in humans [57, 58]. There is much less evidence for inhibition of return for eye movements in either humans or trained monkeys [59].

The function of the excitatory lobes (half-width of four times the radius of the FOA) is to favor locality in the displacements of the focus of attention: if two locations are of nearly equal conspicuity, the one closest to the previous focus of attention will be attended next. This implementation detail directly follows the idea of "proximity preference" proposed by Koch and Ullman [21].

The time constants, conductances, and firing thresholds of the simulated neurons are chosen so that the FOA jumps from one salient location to the next in approximately 30-70 ms (simulated time; [60]), and so that an attended area is inhibited for approximately 500-900 ms (see Fig. 4). These delays vary for different locations with the strength of the saliency map input at those locations. The FOA therefore may eventually return to previously attended locations, as is observed psychophysically. These simulated time scales are related to the dynamical model of integrate-and-fire neurons used in our model (see http://www.klab.caltech.edu/~itti/ for the implementation source code, which clearly specifies all parameters of the simulated neurons using SI units).
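The combined effect of the WTA and inhibition of return can be caricatured without the spiking dynamics as the following loop (our own simplification: the winner is taken as the map argmax, only the inhibitory center of the feedback is modeled, and the excitatory proximity-preference lobes and the temporal recovery of inhibited locations are omitted):

```python
import numpy as np

def attention_scanpath(saliency, n_shifts=5, foa_radius=4.0, ior_strength=1.0):
    """Pick the saliency maximum, suppress a Gaussian-shaped neighbourhood of
    it (a crude inhibition of return), and repeat, returning the visited
    (row, col) locations in order of decreasing saliency."""
    s = np.asarray(saliency, dtype=float).copy()
    rows, cols = np.mgrid[0:s.shape[0], 0:s.shape[1]]
    scanpath = []
    for _ in range(n_shifts):
        winner = np.unravel_index(np.argmax(s), s.shape)   # WTA: most salient pixel
        winner = (int(winner[0]), int(winner[1]))
        scanpath.append(winner)
        d2 = (rows - winner[0]) ** 2 + (cols - winner[1]) ** 2
        ior = ior_strength * np.exp(-d2 / (2.0 * (foa_radius / 2.0) ** 2))
        s = np.maximum(s - ior, 0.0)                       # transient suppression
    return scanpath

# Toy example: a 24x32 saliency map with two blobs; the stronger one is visited first.
toy = np.zeros((24, 32))
toy[6, 8], toy[15, 25] = 1.0, 0.8
print(attention_scanpath(toy, n_shifts=2))                 # [(6, 8), (15, 25)]
```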


3 Results

We tested our model on a wide variety of real images, ranging from natural outdoor scenes to artistic paintings. All images were in color and contained significant amounts of noise, strong local variations in illumination, shadows and reflections, large numbers of "objects" often partially occluded, and strong textures. Most of these images can be interactively examined on the World Wide Web, at http://www.klab.caltech.edu/~itti/attention/. Overall, the results indicate that the system scans the image in an order which makes functional sense in most behavioral situations.

It should be noted, however, that it is not straightforward to establish objective criteria for the performance of the system with such images. Unfortunately, nearly all quantitative psychophysical data on attentional control are based on synthetic stimuli similar to those discussed in the next section. In addition, although the scan paths of overt attention (eye movements) have been extensively studied [61, 62], it is unclear to what extent the precise trajectories followed by the attentional spotlight are similar to the motion of covert attention. Most probably, the requirements and limitations (e.g., spatial and temporal resolutions) of the two systems are related but not identical [56, 63]. Although our model is mostly concerned with shifts of covert attention, and ignores all of the mechanistic details of eye movements, we attempt below a comparison between human and model target search times in complex natural scenes, using a database of images containing military vehicles hidden in a rural environment.

3.1 Pop-Out and Conjunctive Search

A first comparison of the model with humans can be made using the type of displays used in "visual search" tasks [11]. A typical experiment consists of a speeded alternative forced-choice task in which the presence of a certain item in the presented display has to be either confirmed or denied. It is known that stimuli which differ from nearby stimuli in a single feature dimension can be easily found in visual search, typically in a time which is nearly independent of the number of other items ("distractors") in the visual scene. In contrast, search times for targets which differ from distractors by a combination of features (a so-called "conjunctive task") are typically proportional to the number of distractors [9].

We generated three classes of synthetic images to simulate such experiments: (1) one red target (a rectangular bar) among green distractors (also rectangular bars) with the same orientation; (2) one red target among red distractors with orthogonal orientation; and (3) one red target among green distractors with the same orientation and red distractors with orthogonal orientation. In order not to artifactually favor any particular orientation, the orientation of the target was chosen randomly for every image generated. Also, in order not to obtain ceiling performance in the first two tasks, we added strong orientation noise to the stimuli (between -17 and +17 degrees with uniform probability) and strong color speckle noise to the entire image (each pixel in the image had a 15% uniform probability of becoming a maximally bright color among red, green, blue, cyan, purple, yellow and white). The positioning of the stimuli along a uniform grid was randomized (by up to ±40% of the spacing between stimuli, in the horizontal and vertical directions), to eliminate any possible influence of our discrete image representation (pixels) on the system. Twenty images were computed for each total number of bars per image, which varied between 4 and 36, yielding the evaluation of a total of 540 images. In each case, the task of our model was to locate the target, whose coordinates were externally known from the image generation process, at which point the search was terminated. We are here not concerned with the actual object recognition problem within the focus of attention. The diameter of the FOA was fixed to slightly more than the longest dimension of the bars.

Results are presented in Fig. 5 in terms of the number of false detections before the target was found. Clear pop-out was obtained for the first two tasks (color only and orientation only), independently of the number of distractors in the images. Slightly worse performance is found when the number of distractors is very small, which seems sensible since in these cases the distractors are nearly as salient as the target itself. Evaluation of these types of images without any of the distracting noise described above yielded systematic pop-out (target found as the first attended location) in all images. The conjunctive search task showed that the number of shifts of the focus of attention prior to the detection of the target increased linearly with the number of distractors. Notice that the large error bars in our results indicate that our model usually finds the target either quickly (in most cases) or only after scanning a large number of locations.
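As an illustration of how such displays can be parameterized (our own sketch: only the bar specifications are produced; the actual rendering, the bar size, the grid spacing value and the speckle noise applied to the rendered image are omitted or assumed):

```python
import numpy as np

def search_array_spec(n_bars, task, spacing=64.0, rng=None):
    """Return a list of (x, y, orientation_deg, color) bar specifications for one
    display, plus the index of the target bar.

    task: 'color' (red target among same-orientation green bars),
          'orientation' (red target among orthogonal red bars), or
          'conjunction' (both kinds of distractors)."""
    rng = rng or np.random.default_rng()
    side = int(np.ceil(np.sqrt(n_bars)))
    target_ori = rng.uniform(0.0, 180.0)            # target orientation chosen randomly
    target = int(rng.integers(n_bars))
    bars = []
    for i in range(n_bars):
        row, col = divmod(i, side)
        x = col * spacing + rng.uniform(-0.4, 0.4) * spacing   # +/-40% grid jitter
        y = row * spacing + rng.uniform(-0.4, 0.4) * spacing
        if i == target:
            ori, color = target_ori, "red"
        elif task == "color":
            ori, color = target_ori, "green"
        elif task == "orientation":
            ori, color = target_ori + 90.0, "red"
        else:  # conjunction: half share the target's orientation, half its color
            ori, color = (target_ori, "green") if i % 2 else (target_ori + 90.0, "red")
        bars.append((x, y, ori + rng.uniform(-17.0, 17.0), color))   # orientation noise
    return bars, target
```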


[Figure 5 appears here: number of false detections (0-20) as a function of the number of distractors (5-35), for the color pop-out, orientation pop-out and conjunctive search tasks.]

Figure 5: Model performance on noisy versions of pop-out and conjunctive tasks of the type pioneered by Treisman and Gelade [9]. Dark and light bars in the figure represent isoluminant red and green colored bars in the original displays, with strong speckle noise added. Dashed lines: chance level, based on the size of the simulated visual field and the size of the candidate recognition area (corresponding to the performance of an ideal observer who scans, on average, half of the distractors prior to target detection). Solid lines: performance of the model. Error bars: one standard deviation. The typical search slopes of human observers in feature search and conjunction search, respectively, are successfully reproduced by the model. Each stimulus was drawn inside a 64 × 64 pixel box, and the radius of the focus of attention was fixed to 32 pixels. For a fixed number of stimuli, we tested twenty randomly generated images in each task; the saliency map and winner-take-all were initialized to zero (corresponding to a uniformly black visual input) prior to each trial.

3.2 Search Performance in Complex Natural Scenes

We propose a second test in which target detection is evaluated using a database of complex natural images, each containing a military vehicle (the "target"). Contrary to our previous study with a simplified version of the model [23], which used low-resolution image databases with relatively large targets (typically about 1/10th of the width of the visual scene), this study uses very high resolution images (6144×4096 pixels), in which targets appear very small (typically 1/100th of the width of the image). In addition, in the present study, search time is compared between the model's predictions and the average measured search times from 62 normal human observers [7].

The 44 original photographs were taken during a DISSTAF (Distributed Interactive Simulation, Search and Target Acquisition Fidelity) field test in Fort Hunter Liggett, California, and were provided to us, along with all human data, by the TNO Human Factors Research Institute in the Netherlands [7]. The field of view for each image is 6.9° × 4.6°. Each scene contained one of nine possible military vehicles, at a distance ranging from 860 to 5822 meters from the observer. Each slide was digitized at 6144×4096 pixels resolution. Sixty-two human observers, aged between 18 and 45 years and with visual acuity better than 1.25 arcmin⁻¹, participated in the experiment (about half were women and half men). Subjects were first presented with 3 close-up views of each of the 9 possible target vehicles, followed by a test run of 10 trials. A Latin square design [64] was then used for the randomized presentation of the images. The slides were projected such that they subtended 65° × 46° of visual angle to the observers (corresponding to a linear magnification by about a factor of 10 compared to the original scenery). During each trial, observers pressed a button as soon as they had detected the target, and subsequently indicated at which location on a 10×10 projected grid they had found the target. Further details on these experiments can be found in [7, 65].

The model was presented with each image at full resolution. Contrary to the human experiment, no close-ups or test trials were presented to the model. The most generic form of the model described above was used, without any specific parameter adjustment for this experiment. Simulations for up to 10,000 ms of simulated time (about 200-400 attentional shifts) were run on a Digital Equipment Alpha 500 workstation. With these high-resolution images, the model comprised about 300 million simulated neurons. Each image was processed in about 15 minutes with a peak memory usage of 484 megabytes (for comparison, a 640 × 480 scene was typically processed in 10 seconds, and processing time scaled approximately linearly with the number of pixels). The focus of attention (FOA) was represented by a disk of radius 340 pixels (Figs. 6 and 7). Full coverage of the image by the FOA would hence require 123 shifts (with overlap); a random search would thus be expected to find the target after 61.5 shifts on average.


Figure 6: Example of an image from the database of 44 scenes depicting a military vehicle in a rural background. The algorithm operated on 24-bit color versions of these 6144×4096 pixel images (reproduced here only in grey-scale) and took on the order of 15 min of real time on a DEC Alpha workstation to carry out the saliency computation. (a) Original image; humans found the location of the vehicle in 2.6 sec on average. (b) The vehicle was determined to be the most salient object in the image, and was attended first by the model. Such a result indicates strong performance of the algorithm in terms of artificial vision using complex natural color scenes. After scaling of the model's simulated time such that it scans two to four locations per second on average, and adding a 1.5 sec period to account for the human's latency in motor response, the model found the target in 2.2 sec.


Figure 7: A more difficult example from the image database studied. (a) A grey-scale rendition of the color image. Humans found the location of the vehicle in 7.5 sec on average. (b) The target is not the most salient object, and the model searches the scene in order of decreasing saliency. The algorithm came to rest on the location of the target on the 17th shift, after 6.1 sec (using the same time scaling as in the previous figure).


The target was considered detected when the focus of attention intersected a binary mask representing the outline of the target, which was provided with the images. Two examples of scenes and model trajectories are presented in Figs. 6 and 7: in the first image, the target was immediately found by the model, while in the second, a serial search was necessary before the target could be found.

The model immediately found the target (first attended location) in seven of the 44 images. It quickly found the target (fewer than 20 shifts) in another 23 images. It found the target after more than 20 shifts in 11 images, and failed to find the target in 3 images. Overall, the model consequently performed surprisingly well, with a number of attentional shifts far below the expected 61.5 shifts of a random search in all but 6 images. In these 6 images, the target was extremely small (and hence not conspicuous at all), and the model cycled through a number of more salient locations.

The following analysis was performed to generate the plot presented in Fig. 8. First, a few outlier images were discarded: those in which the model did not find the target within 2000 ms of simulated time (about 40-80 shifts; 6 images), or in which half or more of the humans failed to find the target (3 images), for a total of 8 discarded images. An average of 3 overt shifts per second was assumed for the model, allowing us to scale the model's simulated time to real time. An additional 1.5 seconds was then added to the model time to account for human motor response time. With such calibration, the fastest reaction times for both model and humans were approximately 2 seconds, and the slowest approximately 15 seconds, for the 36 images analyzed.

The results plotted in Fig. 8 overall show a poor correlation between human and model search times. Surprisingly, however, the model appeared to find the target faster than humans in 3/4 of the images (points below the diagonal), despite the rather conservative scaling factors used to compare model to human time. In order to make the model's performance equal (on average) to that of humans, one would have to assume that humans shifted their gaze no faster than twice per second, which seems unrealistically slow under the circumstances of a speeded search task on a stationary, non-masked scene. Even if eye movements were that slow, humans would most probably still shift covert attention at a much faster rate between two overt fixations.
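The calibration just described is, in effect, an affine conversion from the model's simulated search to a predicted reaction time. A minimal per-shift sketch follows (a simplification: the text rescales simulated time rather than charging a fixed cost per shift, so the number it produces is only indicative):

```python
def predicted_reaction_time(n_shifts, shifts_per_second=3.0, motor_latency_s=1.5):
    """Crude conversion of a number of attentional shifts into a predicted
    search time in seconds, using the 3 shifts/s and 1.5 s values of the text."""
    return n_shifts / shifts_per_second + motor_latency_s

# Under this per-shift approximation, a target reached on the 17th shift maps
# to about 7.2 s (the rescaled simulated time reported for Fig. 7 is 6.1 s).
print(predicted_reaction_time(17))
```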

[Figure 8 appears here: scatter plot of model reaction time versus human reaction time, 0-16 sec on both axes (n = 62 observers); arrows (a) and (b) mark the examples of Figs. 6 and 7.]

Figure 8: Mean reaction time to detect the target for 62 human observers and for our deterministic algorithm. Eight of the 44 original images are not included, in which either the model or the humans failed to reliably find the target. For the 36 images studied, and using the same scaling of model time as in the previous two figures, the model was faster than humans in 75% of the images. In order to bring this performance down to 50% (equal performance for humans and model), one would have to assume that no more than two locations can be visited by the algorithm each second. Arrow (a) indicates the "pop-out" example of Fig. 6, and arrow (b) the more difficult example presented in Fig. 7.

4 Discussion

We have demonstrated that a relatively simple processing scheme, based on some of the key organizational principles of pre-attentive early visual cortical architectures (center-surround receptive fields, non-classical within-feature inhibition, multiple maps), in conjunction with a single saliency map, performs remarkably well at detecting salient targets in cluttered natural and artificial scenes.

Key properties of our model, in particular its usage of inhibition of return and the explicit coding of saliency independent of feature dimensions, as well as its behavior on some classical search tasks, are in good qualitative agreement with the human psychophysical literature.

It can be argued, based on the tentative scaling between simulated model time and human time described above (disregarding the fact that our computer implementation required on the order of 15 minutes to converge for the 6144×4096 pixel images, versus search times on the order of 2-20 seconds for human observers, and disregarding the fact that our algorithm did not deal with the problem of detecting the target within the focus of attention), that the bottom-up, saliency-based algorithm outperforms humans in a demanding but realistic target detection task involving camouflaged military vehicles.

One paradoxical explanation for this superior performance might be that top-down influences play a significant role in the deployment of attention in natural scenes. Top-down cues in humans might indeed bias the attentional shifts, according to the progressively constructed mental representation of the entire scene, in inappropriate ways. Our model lacks any high-level knowledge of the world and operates in a purely bottom-up manner. This does suggest that for certain (possibly limited) scenarios, such high-level knowledge might interfere with optimal performance. For instance, human observers are frequently tempted to follow roads or other structures, or may "consciously" decide to thoroughly examine the surroundings of salient buildings that have popped out, while the vehicle might be in the middle of a field or in a forest.


4.1 Computational Implications

The main difficulty we encountered was that of combining information from numerous feature maps into a unique scalar saliency map. Most of the results described above do not hold for intuitively simple feature combination schemes, such as straight summation. In particular, straight summation fails to reliably detect pop-outs in search arrays such as those shown in Fig. 5. The reason for this failure is that almost all feature maps contain numerous strong responses (e.g., the intensity maps show strong activity at all target and distractor elements, because of their high contrast with the black background); the target consequently has a very low signal-to-noise ratio when all maps are simply summed. Here, we proposed a novel solution, which finds direct support in the human and animal studies of non-classical receptive-field interactions.

The first computational implication of our model is that a simple, purely bottom-up mechanism performs surprisingly well on real data in the absence of task-dependent feedback. This is in direct contrast to some of the previous models of visual search, in which top-down bias was almost entirely responsible for the relative weighting between the feature types used [22].

Further, although we have implemented the early feature extraction mechanisms in a comparatively crude manner (e.g., by approximating center-surround receptive fields by simple pixel differences between a coarse and a fine scale version of the image), the model demonstrates a surprising level of robustness, which allows it to perform in a realistic manner on many complex natural images. We have previously studied the robustness of a pop-out signal in the presence of various amounts of added speckle noise (using a far less elaborate and biologically implausible approximation of our non-classical interactions), and found that the model is almost entirely insensitive to noise as long as such noise is not directly masking the main feature of the target in spatial frequency or chromatic frequency space [23]. We believe that such robustness is another consequence of the within-feature iterative scheme which we use to allow for the fusion of information from several dissimilar sources.

That our model yields robust performance on natural scenes is not too surprising when considering the evidence from a number of state-of-the-art object recognition algorithms [66, 67, 68, 69]. Many of these demonstrate superior performance when compared to classical image processing schemes, although these new algorithms are based on very simple feature detection filters, similar to the ones found in biological systems.

4.2 Neurobiological Implications

While our model reproduces certain aspects of human search performance in a qualitative fashion, a more quantitative comparison is premature for several reasons. Firstly, we have yet to incorporate a number of known features. For instance, we did not include any measure of saliency based on temporal stimulus onset or disappearance, or on motion [70].

We also have not yet integrated any retinal non-uniform sampling of the input images, although this is likely to strongly alter the saliency of peripherally viewed targets. Nor have we addressed the well-known asymmetries in search tasks [71]: when targets and non-targets in a visual search task are exchanged, visual search performance often changes too (e.g., it is easier to search for a curved line among straight distractors than for a straight line among curved distractors). Spatial "grouping" acting among stimuli is also known to dramatically affect search time performance [72] and has not been dealt with here. In principle, this can be addressed by incorporating excitatory, cooperative center-surround interactions among neurons both within and across feature maps. And, as discussed above, our model is completely oblivious to any high-level features in natural scenes, including social cues.

More importantly, a number of electrophysiological findings muddy the simple architecture our model operates under (Fig. 1b). Single-unit recordings in the visual system of the macaque indicate the existence of a number of distinct maps of the visual environment that appear to encode the saliency and/or the behavioral significance of targets. These include neurons in the superior colliculus, the inferior and lateral subdivisions of the pulvinar, the frontal eye fields and areas within the intraparietal sulcus [73, 26, 74, 28, 29]. What remains unclear is whether these different maps emphasize saliency for different behaviors or for different visuo-motor response patterns (for instance, for attentional shifts, eye or hand movements). If saliency is indeed encoded across multiple maps, this raises the question of how competition can act across these maps to ensure that only a single location is chosen as the next target of an attentional or eye shift.

Following Koch and Ullman's [21] original proposal that visual search is guided by the output of a selection mechanism operating on a saliency map, it now seems plausible that such a process does characterize processing in the entire visual system. Inhibition of return (IOR) is a critical component of such a search strategy, which essentially acts as memory. If its duration is reduced, the algorithm fails to find less salient objects because it endlessly cycles through the same set of more salient objects. For instance, if the time scale of IOR were reduced from 900 ms to 50 ms, the model would detect the most salient object, inhibit its location, then shift to the second most salient location, but it would subsequently come back to the most salient object, whose inhibition would have ceased during the attentional shift from the first to the second object. Under such conditions, the algorithm would never focus on anything other than the two most salient locations in the image. Our finding that IOR plays a critical role in purely bottom-up search may not necessarily disagree with recently suggested evidence that humans appear to use little or no memory during search [75]; while these authors do not refute the existence of IOR, a precise understanding of how bottom-up and top-down aspects of attention interact in human visual search remains to be elucidated.

Whether or not this implies that saliency is expressed explicitly in one or more visual field maps remains an open question.


If saliency is encoded (relatively) independently of stimulus dimensions, we might be able to achieve a dissociation between stimulus attributes and stimulus saliency. For instance, appropriate visual masks might prevent the attributes of a visual stimulus from being read out without affecting its saliency. Or we might be able to directly influence such maps, for instance using reversible pharmacological techniques in animals or transcranial magnetic stimulation (TMS) in human volunteers.

Alternatively, it is possible that stimulus saliency is not expressed independently of feature dimensions but is encoded implicitly within each specific feature map, as proposed by Desimone and Duncan [30]. This raises the question of how interactions among all of these maps give rise to the observed behavior of the system for natural scenes. Such an alternative has not yet been analyzed in depth by computational work (see, however, [31]).

Mounting psychophysical, electrophysiological, clinical and functional imaging evidence [76, 77, 78, 74, 79, 29] strongly implies that the neuronal structures underlying the selection and the expression of shifts in spatial attention and oculomotor processing are tightly linked. These areas include the deeper parts of the superior colliculus; parts of the pulvinar; the frontal eye fields in the macaque and their homologue in humans, the precentral gyrus; and areas in the intraparietal sulcus in the macaque and around the intraparietal and postcentral sulci and adjacent gyri in humans.

The close relationship between areas active during covert and during overt shifts of attention raises the issue of how information in these maps is integrated across saccades, in particular given the usage of both retinal and oculomotor coordinate systems in the different neuronal maps (see, for instance, [80]). This is an obvious question that will be explored by us in future computational work.

Finally, we can now wonder about the relationship between the saliency mechanism, the top-down volitional attentional selection process, and awareness. We have recently proposed a quantitative account of the action of spatial attention on various psychophysical thresholds for pattern discrimination, in terms of a strengthening of cooperative and competitive interactions among early visual filters [81]. How can such a scheme be combined with the current selection process based on purely bottom-up sensory data? Several possibilities come to mind. First, both processes might operate independently and both mediate access to visual awareness. Computationally, this can be implemented in a straightforward manner. Second, however, top-down attention might also directly interact with the single saliency map, for instance by influencing its constitutive elements via appropriate synaptic input. If the inhibition of return could be selectively inactivated at locations selected under volitional control, for example by shunting [82], then the winner-take-all and the attentional focus would remain at that location, ignoring for a while surrounding salient objects. Although such feedback to the saliency map seems plausible and is functionally useful, it certainly does not constitute all of the top-down attentional modulation of spatial vision [81].

Finally, independent saliency maps could operate for the different feature maps, and both saliency and volitional forms of attention could access them independently. Current experimental evidence does not allow us to unambiguously choose among these possibilities.

Acknowledgements: We thank Dr. Toet from the TNO Human Factors Research Institute, The Netherlands, for providing us with the database of military images and human search times on these images. This research was supported by NSF-ERC, NIMH and ONR.

References

[1] Heisenberg, M. & Wolf, R. Studies of Brain Function, Vol. 12: Vision in Drosophila (Springer-Verlag, Berlin, Germany, 1984).
[2] Rensink, R. A., O'Regan, J. K. & Clark, J. J. To see or not to see: the need for attention to perceive changes in scenes. Psychol Sci 8, 368-73 (1997).
[3] O'Regan, J. K., Rensink, R. A. & Clark, J. J. Change-blindness as a result of 'mudsplashes'. Nature 398, 34 (1999).
[4] Simons, D. J. & Levin, D. T. Failure to detect changes to attended objects. Investigative Ophthalmology & Visual Science 38, 3273 (1997).
[5] Crick, F. & Koch, C. Constraints on cortical and thalamic projections: the no-strong-loops hypothesis. Nature 391, 245-50 (1998).
[6] Niebur, E. & Koch, C. Computational architectures for attention. In The attentive brain (ed. Parasuraman, R.), 163-186 (MIT Press, Cambridge, MA, 1998).
[7] Toet, A., Bijl, P., Kooi, F. L. & Valeton, J. M. A high-resolution image dataset for testing search and detection models (TNO-TM-98-A020) (TNO Human Factors Research Institute, Soesterberg, The Netherlands, 1998).
[8] James, W. The Principles of Psychology (Harvard UP, Cambridge, MA, 1890/1981).
[9] Treisman, A. M. & Gelade, G. A feature-integration theory of attention. Cognit Psychol 12, 97-136 (1980).
[10] Bergen, J. R. & Julesz, B. Parallel versus serial processing in rapid pattern discrimination. Nature 303, 696-8 (1983).
[11] Treisman, A. Features and objects: the fourteenth Bartlett memorial lecture. Q J Exp Psychol [A] 40, 201-37 (1988).
[12] Nakayama, K. & Mackeben, M. Sustained and transient components of focal visual attention. Vision Res 29, 1631-47 (1989).


[13] Braun, J. & Sagi, D. Vision outside the focus of attention. Percept Psychophys 48, 45–58 (1990).
[14] Hikosaka, O., Miyauchi, S. & Shimojo, S. Orienting of spatial attention: its reflexive, compensatory, and voluntary mechanisms. Brain Res Cogn Brain Res 5, 1–9 (1996).
[15] Braun, J. & Julesz, B. Withdrawing attention at little or no cost: detection and discrimination tasks. Percept Psychophys 60, 1–23 (1998).
[16] Braun, J. Vision and attention: the role of training [letter; comment]. Nature 393, 424–5 (1998). Comment on: Nature 1997 Jun 19;387(6635):805–7.
[17] Gallant, J. L., Connor, C. E. & Van Essen, D. C. Neural activity in areas V1, V2 and V4 during free viewing of natural scenes compared to controlled viewing. Neuroreport 9, 85–90 (1998).
[18] Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J. & Davis, J. Visual cortical mechanisms detecting focal orientation discontinuities. Nature 378, 492–6 (1995).
[19] Sillito, A. M. & Jones, H. E. Context-dependent interactions and visual processing in V1. J Physiol Paris 90, 205–9 (1996).
[20] Levitt, J. B. & Lund, J. S. Contrast dependence of contextual effects in primate visual cortex. Nature 387, 73–6 (1997).
[21] Koch, C. & Ullman, S. Shifts in selective visual attention: towards the underlying neural circuitry. Hum Neurobiol 4, 219–27 (1985).
[22] Wolfe, J. M. Visual search in continuous, naturalistic stimuli. Vision Res 34, 1187–95 (1994).
[23] Itti, L., Koch, C. & Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1254–9 (1998).
[24] Niebur, E. & Koch, C. Control of selective visual attention: modeling the 'where' pathway. In Neural Information Processing Systems (NIPS*8) (eds. Touretzky, D., Mozer, M. & Hasselmo, M.), 802–808 (MIT Press, Cambridge, MA, 1996).
[25] Olshausen, B. A., Anderson, C. H. & Van Essen, D. C. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J Neurosci 13, 4700–19 (1993).
[26] Robinson, D. L. & Petersen, S. E. The pulvinar and visual salience. Trends Neurosci 15, 127–32 (1992).
[27] Rockland, K. S., Andresen, J., Cowie, R. J. & Robinson, D. L. Single axon analysis of pulvinocortical connections to several visual areas in the macaque. J Comp Neurol 406, 221–50 (1999).

[28] Gottlieb, J. P., Kusunoki, M. & Goldberg, M. E. The representation of visual salience in monkey parietal cortex. Nature 391, 481–4 (1998).
[29] Colby, C. L. & Goldberg, M. E. Space and attention in parietal cortex. Annu Rev Neurosci 22, 319–49 (1999).
[30] Desimone, R. & Duncan, J. Neural mechanisms of selective visual attention. Annu Rev Neurosci 18, 193–222 (1995).
[31] Hamker, F. H. The role of feedback connections in task-driven visual search. In Connectionist Models in Cognitive Neuroscience, Proc. of the 5th Neural Computation and Psychology Workshop (NCPW'98) (eds. von Heinke, D., Humphreys, G. W. & Olson, A.) (Springer Verlag, London, 1999).
[32] Beymer, D. & Poggio, T. Image representations for visual learning. Science 272, 1905–9 (1996).
[33] Burt, P. & Adelson, E. The Laplacian pyramid as a compact image code. IEEE Trans. on Communications 31, 532–540 (1983).
[34] Leventhal, A. The Neural Basis of Visual Function (Vision and Visual Dysfunction Vol. 4) (CRC Press, Boca Raton, FL, 1991).
[35] Luschow, A. & Nothdurft, H. C. Pop-out of orientation but no pop-out of motion at isoluminance. Vision Research 33, 91–104 (1993).
[36] Engel, S., Zhang, X. & Wandell, B. Colour tuning in human visual cortex measured with functional magnetic resonance imaging. Nature 388, 68–71 (1997).
[37] DeValois, R. L., Albrecht, D. G. & Thorell, L. G. Spatial-frequency selectivity of cells in macaque visual cortex. Vision Research 22, 545–559 (1982).
[38] Tootell, R. B., Hamilton, S. L., Silverman, M. S. & Switkes, E. Functional anatomy of macaque striate cortex. I. Ocular dominance, binocular interactions, and baseline conditions. J Neurosci 8, 1500–30 (1988).
[39] Greenspan, H., Belongie, S., Goodman, R., Perona, P., Rakshit, S. & Anderson, C. H. Overcomplete steerable pyramid filters and rotation invariance. In Proc. IEEE Computer Vision and Pattern Recognition (CVPR), Seattle, WA, 222–228 (Jun 1994).
[40] Itti, L. & Koch, C. A comparison of feature combination strategies for saliency-based visual attention systems. In SPIE Human Vision and Electronic Imaging IV (HVEI'99), San Jose, CA, (in press) (1999).
[41] Rockland, K. S. & Lund, J. S. Intrinsic laminar lattice connections in primate visual cortex. J Comp Neurol 216, 303–18 (1983).
[42] Gilbert, C. D., Das, A., Ito, M., Kapadia, M. & Westheimer, G. Spatial integration and cortical dynamics. Proc Natl Acad Sci U S A 93, 615–22 (1996).


[43] Gilbert, C. D. & Wiesel, T. N. Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J Neurosci 9, 2432–42 (1989).
[44] Gilbert, C. D. & Wiesel, T. N. Clustered intrinsic connections in cat visual cortex. J Neurosci 3, 1116–33 (1983).
[45] Weliky, M., Kandler, K., Fitzpatrick, D. & Katz, L. C. Patterns of excitation and inhibition evoked by horizontal connections in visual cortex share a common relationship to orientation columns. Neuron 15, 541–52 (1995).
[46] Polat, U. & Sagi, D. The architecture of perceptual spatial interactions. Vision Res 34, 73–8 (1994).
[47] Polat, U. & Sagi, D. Spatial interactions in human vision: from near to far via experience-dependent cascades of connections. Proc Natl Acad Sci U S A 91, 1206–9 (1994).
[48] Zenger, B. & Sagi, D. Isolating excitatory and inhibitory nonlinear spatial interactions involved in contrast detection. Vision Res 36, 2497–513 (1996).
[49] Cannon, M. W. & Fullenkamp, S. C. Spatial interactions in apparent contrast: inhibitory effects among grating patterns of different spatial frequencies, spatial positions and orientations. Vision Res 31, 1985–98 (1991).
[50] Knierim, J. J. & van Essen, D. C. Neuronal responses to static texture patterns in area V1 of the alert macaque monkey. J Neurophysiol 67, 961–80 (1992).
[51] Ts'o, D. Y., Gilbert, C. D. & Wiesel, T. N. Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J Neurosci 6, 1160–70 (1986).
[52] Malach, R., Amir, Y., Harel, M. & Grinvald, A. Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc Natl Acad Sci U S A 90, 10469–73 (1993).
[53] Malach, R. Cortical columns as devices for maximizing neuronal diversity. Trends Neurosci 17, 101–4 (1994).
[54] Horiuchi, T., Morris, T., Koch, C. & DeWeerth, S. Analog VLSI circuits for attention-based, visual tracking. In Neural Information Processing Systems (NIPS*9) (eds. Mozer, M., Jordan, M. & Petsche, T.), 706–712 (MIT Press, Cambridge, MA, 1997).
[55] Yuille, A. L. & Grzywacz, N. M. A mathematical analysis of the motion coherence theory. International Journal of Computer Vision 3, 155–75 (1989).

[56] Tsotsos, J. K., Culhane, S. M., Wai, W. Y. K., Lai, Y. H., Davis, N. & Nuflo, F. Modeling visual attention via selective tuning. Artificial Intelligence 78, 507–45 (1995).
[57] Posner, M. I., Cohen, Y. & Rafal, R. D. Neural systems control of spatial orienting. Philos Trans R Soc Lond B Biol Sci 298, 187–98 (1982).
[58] Kwak, H. W. & Egeth, H. Consequences of allocating attention to locations and to other attributes. Percept Psychophys 51, 455–64 (1992).
[59] Motter, B. C. & Belky, E. J. The guidance of eye movements during active visual search. Vision Res 38, 1805–15 (1998).
[60] Saarinen, J. & Julesz, B. The speed of attentional shifts in the visual field. Proc Natl Acad Sci U S A 88, 1812–4 (1991).
[61] Yarbus, A. L. Eye Movements and Vision (Plenum Press, New York, 1967).
[62] Noton, D. & Stark, L. Scanpaths in eye movements during pattern perception. Science 171, 308–11 (1971).
[63] Rao, R. P. N. & Ballard, D. H. An active vision architecture based on iconic representations. Artificial Intelligence 78, 461–505 (1995).
[64] Wagenaar, W. A. Note on the construction of digram-balanced latin squares. Psychol Bull 72, 384–86 (1969).
[65] Bijl, P., Kooi, F. K. & van Dorresteijn, M. Visual search performance for realistic target imagery from the DISSTAF field trials (TNO Human Factors Research Institute, Soesterberg, The Netherlands, 1997).
[66] Poggio, T. Image representations for visual learning. Lecture Notes in Computer Science 1206, 143 (1997).
[67] Niyogi, P., Girosi, F. & Poggio, T. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86, 2196–209 (1998).
[68] Malik, J. & Perona, P. Preattentive texture discrimination with early vision mechanisms. J Opt Soc Am [A] 7, 923–32 (1990).
[69] Simoncelli, E. P., Freeman, W. T., Adelson, E. H. & Heeger, D. J. Shiftable multiscale transforms. IEEE Transactions on Information Theory 38, 587–607 (1992).
[70] Hillstrom, A. P. & Yantis, S. Visual motion and attentional capture. Perception & Psychophysics 55, 399–411 (1994).


[71] Treisman, A. & Gormican, S. Feature analysis in early vision: evidence from search asymmetries. Psychol Rev 95, 15–48 (1988).
[72] Driver, J., McLeod, P. & Dienes, Z. Motion coherence and conjunction search: implications for guided search theory. Perception & Psychophysics 51, 79–85 (1992).
[73] LaBerge, D. & Buchsbaum, M. S. Positron emission tomographic measurements of pulvinar activity during an attention task. Journal of Neuroscience 10, 613–9 (1990).
[74] Kustov, A. A. & Robinson, D. L. Shared neural control of attentional shifts and eye movements. Nature 384, 74–7 (1996).
[75] Horowitz, T. S. & Wolfe, J. M. Visual search has no memory. Nature 394, 575–7 (1998).
[76] Shepherd, M., Findlay, J. M. & Hockey, R. J. The relationship between eye movements and spatial attention. Q J Exp Psychol 38, 475–491 (1986).
[77] Andersen, R. A., Bracewell, R. M., Barash, S., Gnadt, J. W. & Fogassi, L. Eye position effects on visual, memory, and saccade-related activity in areas LIP and 7a of macaque. J Neurosci 10, 1176–96 (1990).
[78] Sheliga, B. M., Riggio, L. & Rizzolatti, G. Orienting of attention and eye movements. Exp Brain Res 98, 507–22 (1994).
[79] Corbetta, M. Frontoparietal cortical networks for directing attention and the eye to visual locations: identical, independent, or overlapping neural systems? Proc Natl Acad Sci U S A 95, 831–8 (1998).
[80] Andersen, R. A. Multimodal integration for the representation of space in the posterior parietal cortex. Philos Trans R Soc Lond B Biol Sci 352, 1421–8 (1997).
[81] Lee, D. K., Itti, L., Koch, C. & Braun, J. Attention activates winner-take-all competition among visual filters. Nat Neurosci 2, 375–81 (1999).
[82] Koch, C. Biophysics of Computation: Information Processing in Single Neurons (Oxford University Press, Oxford, England, 1998).
