A Unified Theory of Attentional Control
Matthew H. Wilder, Michael C. Mozer and Christopher D. Wickens
mozer,[email protected]
Department of Computer Science and Institute of Cognitive Science
University of Colorado, Boulder, CO 80309-0430
University of Illinois, University of Colorado, and Alion Science: MA&D Operations
October 23, 2009
Abstract
Although diverse, theories of visual attention generally share the notion that attention is
controlled by some combination of three distinct strategies: (1) exogenous cueing from locally-
contrasting primitive visual features, such as abrupt onsets or color singletons (e.g., Itti et al.,
1998); (2) endogenous gain modulation of exogenous activations, used to guide attention to
task-relevant features (e.g., Navalpakkam and Itti, 2005; Wolfe, 1994, 2007); and (3) endogenous
prediction of likely locations of interest, based on task and scene gist (e.g., Torralba, Oliva,
Castelhano, and Henderson, 2006). We propose a unifying conceptualization in which attention
is controlled along two dimensions: the degree of task focus and the spatial scale of operation.
Previously proposed strategies—and their combinations—can be viewed as instances of this
mechanism. Thus, this theory serves not as a replacement for existing models, but as a means
of bringing them into a coherent framework. We implement this theory and demonstrate its
applicability to a wide range of attentional phenomena. The model successfully yields the trends
found in visual search tasks with synthetic images and makes predictions that correspond well
with human eye movement data for tasks involving real-world images. In addition, the theory
yields an unusual perspective on attention that places a fundamental emphasis on the role of
experience and task-related knowledge.
1 Introduction
The human visual system can be configured to perform a remarkable variety of arbitrary tasks.
For example, in a pile of coins, we can find the coin of a particular denomination, color, or shape,
determine whether there are more heads than tails, locate a coin that is foreign, or find a combi-
nation of coins that yields a certain total. The flexibility of the visual system to task demands is
achieved by control of visual attention.
Three distinct control strategies have been discussed in the literature. Earliest in chronological
order, exogenous control was the focus of both experimental research (e.g., Averbach and Coriell,
1961; Posner and Cohen, 1984; Treisman, 1982) and theoretical perspectives (e.g., Itti and Koch,
2000; Julesz, 1984; Koch and Ullman, 1985; Neisser, 1967). Exogenous control refers to the guidance
of attention to distinctive, locally contrasting visual features such as color, luminance, texture, and
abrupt onsets. Theories of exogenous control assume a saliency map, a spatiotopic map in which
activation in a location indicates saliency or likely relevance of that location. Activity in the saliency
map is typically computed in the following way. Primitive features are first extracted from the visual
field along dimensions such as intensity, color, and orientation. For each dimension, broadly tuned,
highly overlapping detectors are assumed that yield a coarse coding of the dimension. For example,
on the color dimension, one might posit spatial feature maps tuned to red, yellow, blue, and green.
Next, local contrast is computed for each feature—both in space and time—yielding an activity
map that specifies distinctive locations containing that feature. These feature-contrast maps are
combined together to yield a saliency map. Saliency thus corresponds to all locations that stand
out from their spatial or temporal neighborhood in terms of their primitive visual features.
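As a concrete illustration, the stages just described (coarse feature maps, local contrast, summation into a saliency map) can be sketched in a few lines of NumPy. This is a toy sketch, not the implementation of any of the cited models; the box-filter surround and the crude color-opponent features are simplifying assumptions of ours.

```python
import numpy as np

def box_blur(img, radius):
    """Separable box filter: a crude stand-in for the Gaussian smoothing
    used in real center-surround operators (zero padding at the borders)."""
    k = np.ones(2 * radius + 1) / (2 * radius + 1)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, out)

def feature_contrast(fmap, surround_r=4):
    """Spatial contrast of one feature map: the value at each pixel minus
    the local neighborhood average, rectified to be non-negative."""
    return np.maximum(fmap - box_blur(fmap, surround_r), 0.0)

def exogenous_saliency(rgb):
    """Sum of feature-contrast maps over crude color/intensity features."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    features = [r - g, g - r, b - (r + g) / 2, (r + g + b) / 3]
    return sum(feature_contrast(f) for f in features)

# a red singleton on a uniform gray field stands out from its neighborhood
img = np.full((32, 32, 3), 0.5)
img[16, 16] = [1.0, 0.0, 0.0]
s = exogenous_saliency(img)
assert np.unravel_index(np.argmax(s), s.shape) == (16, 16)
```

Only spatial contrast is modeled here; temporal contrast (abrupt onsets) would play the same role as a difference over frames.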
Attention need not be deployed in a purely exogenous manner but can be influenced by task
demands (e.g., Bacon and Egeth, 1994; Folk et al., 1992; Wolfe et al., 1989). The ability of in-
dividuals to attend based on features such as color or orientation has led to theories proposing
feature-based endogenous control (e.g., Mozer and Baldwin, 2008; Mozer, 1991; Navalpakkam and
Itti, 2005; Wolfe, 1994, 2007). In these theories, the contribution of feature-contrast maps to the
saliency map is weighted by endogenous gains on the feature-contrast maps, as depicted in Figure 1.
The result is that an image such as that in Figure 2a, which contains singletons in both color and
orientation, might yield a saliency activation map like that in Figure 2b if task contingencies involve
Figure 1: A depiction of feature-based endogenous control.
Figure 2: (a) a display containing two singletons, one in color and one in orientation; (b) a saliency map if color is a task-relevant feature; (c) a saliency map if orientation is a task-relevant feature.
color or like that in Figure 2c if the task contingencies involve orientation.
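In these theories, then, the endogenous contribution reduces to a gain vector applied before the feature-contrast maps are summed. The sketch below uses hypothetical feature-contrast maps and gain values to show how one display can yield either saliency pattern:

```python
import numpy as np

def guided_saliency(contrast_maps, gains):
    """Weighted sum of feature-contrast maps; the gain vector is the
    endogenous, task-dependent component (cf. Figure 1)."""
    return sum(gains[name] * cmap for name, cmap in contrast_maps.items())

# toy display: a color singleton at (2, 2) and an orientation singleton at
# (7, 7), represented directly by their (hypothetical) feature-contrast maps
red = np.zeros((10, 10)); red[2, 2] = 1.0
vertical = np.zeros((10, 10)); vertical[7, 7] = 1.0
maps = {"red": red, "green": np.zeros((10, 10)),
        "vertical": vertical, "horizontal": np.zeros((10, 10))}

# if color is task relevant, the color singleton dominates ...
color_task = guided_saliency(
    maps, {"red": 1.0, "green": 1.0, "vertical": 0.1, "horizontal": 0.1})
assert np.unravel_index(np.argmax(color_task), color_task.shape) == (2, 2)

# ... and if orientation is task relevant, the oriented singleton dominates
orient_task = guided_saliency(
    maps, {"red": 0.1, "green": 0.1, "vertical": 1.0, "horizontal": 1.0})
assert np.unravel_index(np.argmax(orient_task), orient_task.shape) == (7, 7)
```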
Experimental studies support a third attentional control strategy in which attention is guided
to visual field regions likely to be of interest based on the current task and global properties of
the scene (e.g., Biederman, 1972; Neider and Zelinsky, 2006; Siagian and Itti, 2007; Torralba et al.,
2006; Wolfe, 1998a). This type of scene-based endogenous control seems intuitive: if you are looking
for your keys in the kitchen, they are likely to be on a counter. If you are waiting for a ride on a
street, the car is likely to appear on the road, not on a building. Even without a detailed analysis
of the visual scene, its gist can be inferred—e.g., whether one is in a kitchen or on the street,
and the perspective from which the scene is viewed—and this gist can guide attention. Oliva and
Torralba (2001) show that scene identity can be inferred without any object information, a finding
that further supports the claim that scene-gist computation is an attentional process.
1.1 A Unifying Framework
Instead of conceiving of the three control strategies as distinct and unrelated mechanisms, a key
contribution of this work is to characterize the strategies as components of a broader control space.
As depicted in Figure 3, the control space spans two dimensions: task specificity and spatial scale.
Task specificity refers to the degree to which control exploits the current tasks and goals. High
task-specificity refers to situations in which the attentional system is strongly constrained by the
nature or properties of the task. Low task-specificity occurs in situations where the individual has
Figure 3: A two-dimensional control space that characterizes exogenous, feature-based endogenous, and scene-based endogenous control of visual attention.
no particular goal or when attention operates in a task-independent manner. In this paper we
focus on visual search, and consequently, tasks can be defined in the currency of objects—i.e., a
search for a specific object or a class of objects. Within the context of visual search, high task
specificity refers to search for a particular object (e.g., a red bike), and low task specificity refers
to exploration in the absence of a particular goal.
In the control space, the spatial scale dimension is analogous to the granularity of image process-
ing. At the local scale, small regions in the image are processed separately and saliency predictions
for one region are roughly independent of the predictions for a different region. Conversely, processing at the global spatial scale is more holistic—general scene properties extracted from the entire
image are used to make saliency predictions across the whole field of view.
The two dimensions of the control space are helpful in explicating the similarities and differences
between the three control strategies we have described. Exogenous control appears in the lower left
corner of the space in Figure 3 because it operates independently of current goals and uses the local
spatial scale. Feature-based and scene-based endogenous control are placed in the upper region
of the space because both operate with a high degree of task specificity. However, they reside at
different points along the spatial scale continuum. Feature-based endogenous control classifies a
small region as salient if specific features are present in that location. Scene-based endogenous
control uses the whole image to determine the gist of the scene. When searching for a specific
object or group of objects, the gist is then used to differentiate regions of the scene by relevance.
What benefit is there to laying out the three strategies in this control space? The control space
offers us the means to reconceptualize attentional control by illuminating the relationships among
differing control strategies—highlighting the dimensions along which one strategy is similar to and
different from another. In doing so, the control space suggests additional strategies that might
be employed by attention. We return to this issue later, when we recharacterize other attentional
phenomena in terms of the control space.
1.2 Combining Control Strategies
We have argued for the existence of at least three distinct attentional control strategies which fit
into a control space (Figure 3) that varies with task-specificity and spatial scale of processing.
Given the possibility of distinct control strategies, one might ask how the strategies are employed.
A simple hypothesis is that control processes operate at a single point in the control space, and that
point is selected so as to optimize performance in a specific environment. If the selected point was in
between two of the primary strategies depicted in Figure 3, it might appear that multiple strategies
were operating simultaneously. Alternatively, multiple strategies might be employed in parallel,
and attention would be governed by some combination of or arbitration among the strategies.
Recently, several hybrid models have been proposed that incorporate multiple strategies oper-
ating in parallel. Torralba, Oliva, Castelhano, and Henderson (2006) present a model—which we
refer to hereafter as TOCH—that integrates object-specific scene-based endogenous control with
standard exogenous control. Navalpakkam and Itti (2005) combine exogenous with feature-based
endogenous control in a model we refer to as NI. In the upgrade of Guided Search to the current
4.0 version, Wolfe (2007) added bottom-up exogenous activation and scene-gist processing to the
existing feature-based endogenous control. Siagian and Itti (2007) propose an architecture—which
we call SI —that computes gist and exogenous saliency in parallel using the same biologically plau-
sible visual features. The exogenous component drives the saliency map but can be aided by the
gist information.
Most of these hybrid models combine task specific processing at the global spatial scale with
some form of processing at the local spatial scale. One advantage of the control space, however, is
that it suggests an even wider range of strategies to employ in guiding attention. In this paper, we
present a framework that encompasses any combination of strategies in the control space. We view
this framework as a generalization of earlier hybrid models such as TOCH, NI, Guided Search, and
SI.
In order to implement combinations of control strategies, we must be able to specify a control
strategy at any point in the space. We have a good understanding of how to implement the primary control strategies, which lie near the corners of the space (depicted in Figure 3). An immediate challenge, however, is to characterize the interior of the control space—what it means for a control strategy to operate at an intermediate spatial scale and with an intermediate degree of task specificity. Parameterizing a model over a continuum of spatial scales seems straightforward, but how
would varying degrees of task specificity be implemented? To answer this question, it is useful to
digress and consider the role of experience in attention. By illuminating the relationship between
attentional strategies and experience, we formulate a novel hypothesis which strongly intertwines
attention and experience. This hypothesis offers a natural way to represent varying degrees of task
specificity.
1.3 Experience-based Attention
Consider the influence of experience on the three primary control strategies. Exogenous control
is typically envisioned as a hard-wired bottom-up process that is independent of an individual’s
experience. Experience plays a larger role in feature-based endogenous control: task instructions
or knowledge can modulate bottom-up processes to amplify task-relevant features. Nonetheless,
representations in the attentional system are still considered fixed and experience independent.
Scene-based endogenous control places a greater emphasis on experience. Performance in a partic-
ular environment—e.g., city streets, kitchens, etc.—depends on associations between coarse image
properties and scene types learned through experience.
These descriptions suggest that the influence of experience on attentional control varies depend-
ing on the specific strategy employed. In contrast to this classical view, we suggest here that all
attentional control strategies can be formally characterized as experience dependent.
We motivate this perspective with the observation that experience appears to have an effect
on attentional processes that are typically thought of as exogenous. For example, in figure-ground
assignment, Vecera et al. (2002) find that participants tend to select regions in the lower portion of a
synthetic black-white image as figure and regions in the upper portion of the image as background.
This result can be seen as a natural consequence of experience-based attention because individuals
have more experience attending to objects in the lower half of their visual fields. Peterson and Skow-
Grant (2004) provide another argument for the role of experience in figure-ground assignment by
showing directly that the choice of figure is biased by shape familiarity. For example, a silhouette of a face is preferred as the figure over the anti-silhouette. Because figure-ground segregation is
generally considered an early, preattentive process, this work implicates experience as a factor that
affects attentional allocation.
In the domain of real-world images, Cerf et al. (2008b) offer stronger evidence for experience-
driven attention by comparing a standard exogenous model of attention with a new model that
combines the exogenous model with a face detection model. The authors find that the addition of a
face detection component significantly improves the correspondence between model predictions and
human eye-movements. Surprisingly, an improvement is found even for images that do not contain
faces. One interpretation of this result is that because faces are encountered frequently and are
usually relevant, the attentional system is tuned to report face-like regions as salient. In subsequent
work, Cerf et al. (2008a) show that comparisons to human eye-movements are even better when
more object models contribute to the final saliency map. These results imply that experience is an
important factor in attention even in free-viewing situations when no task is being performed.
Another argument for the fundamental role of experience in attention comes from adaptation
effects. Senders (1964) shows that fixation frequency to components in an instrument panel corre-
sponds to the bandwidth (events per second) of that component. Over days of training, participants
learned the statistics of the instrument panel and allocated their attention according to those statis-
tics. More recently, Geng and Behrmann (2005) find that locations with high spatial probability
play a more important role in attention. Essentially this work suggests that the attentional system
encodes information about the frequency of events in various locations. Chun and Jiang (1998)
show that individuals adapt to predictive spatial configurations in a contextual cueing paradigm.
On some trials in this type of experiment, displays are presented that have a repeated distractor
configuration that predicts the target location. Other trials have a random, nonpredictive distractor
configuration. Participants show faster response times for the repeated predictive displays versus
the nonpredictive displays, a result which implies that attention learns these configurations through
experience. In an interesting study combining chess and contextual cueing, Brockmole et al. (2008)
show that experience with chess significantly affects performance on the search task. The authors
present participants with chess board displays, half of which contain piece configurations that are
predictive of the target location. Both novice and expert participants exhibit the standard contex-
tual cueing improvement of response time, but chess experts improve more quickly and ultimately
attain faster response times than the novices. Although the fact that attention can be adaptive
does not imply that attention always adapts, the results summarized in this section suggest the
parsimonious view that attentional processes constantly adapt to the ongoing stream of experience,
and therefore, that attention should be viewed as fundamentally knowledge or experience based.
1.4 Reconceptualizing the Control Space
Adopting the perspective that all varieties of attentional control are fundamentally experience de-
pendent, we return to consider our control space (Figure 3), and in particular, the task-specificity
dimension. A control strategy that has a high degree of task specificity necessarily requires expe-
rience with the particular task. However, control strategies that have low task dependence (e.g.,
exogenous control) have traditionally been viewed in terms of hardwired bottom-up and experience
independent mechanisms. Suppose instead that we considered exogenous control as having a de-
pendence on a broad range of past experience, i.e., as depending on every task. A pure exogenous
control strategy would thus be cast as “identify as salient locations that are interesting based on
the combination of all past experience on a wide variety of learned tasks.”
With our focus on visual search, each specific task corresponds to a target object, e.g., car keys,
wallet, car, traffic light. The task specificity dimension can be recast in terms of the size of the
subset of objects that guide attention—with decreasing specificity as the set size increases. A high
degree of task specificity might correspond to a single target object, e.g., a car or a person. An
intermediate degree of task specificity might be thought of as search for a more general class of
object, e.g., the set of all objects that are wheeled vehicles or living things. And low task specificity
would simply include all objects with which one has had experience.
Defining exogenous attention in this manner is an extension of the concept presented in Cerf
et al. (2008a,b). This generalization implies that pure bottom-up attention does not exist as a
separate mechanism but is rather a special case of what has traditionally been viewed as top-
down attentional control. In line with this claim, Folk et al. (1992) have found that involuntary
attention capture is contingent on the attentional control settings. In a cueing paradigm, the
authors manipulate the property of the cue, abrupt onset or color discontinuity, so that in some
cases it’s consistent with the target property and in other cases it’s inconsistent with the target
property. They find that the cue only captures attention when the property of the cue matches the
defining property of the search task—thus supporting the notion that bottom-up capture is in fact
dependent on the task settings.
With our reconceptualization of task specificity, we can describe any control strategy via a set
of modules that specialize in particular targets. The modules must also be characterized in terms
of the other dimension of the control space—the spatial scale at which they operate. This set of
modules can be cast in terms of a dual to the control space, which we will call the module grid,
depicted in Figure 4. Like the control space, the module grid has two dimensions, but the dimensions
focus on implementation. The rows of the module grid enumerate specific targets of attention, and the
columns specify an analog of spatial scale, which we call range of influence and explain shortly.
Each cell in the grid is a particular mechanism that produces a saliency map given a visual image.
Control strategies, which occupy a single point in control space (Figure 3), can occupy regions
in the module grid (Figure 4). Exogenous attention corresponds to a short range of influence and a
combination across all targets. Feature-based endogenous attention also operates with a short range
of influence, but only a small collection of targets are employed—specifically those that contain the
defining features. Scene-based endogenous attentional guidance utilizes a long range of influence
and incorporates a small set of targets.
In reformulating the second dimension of the control space (Figure 3), a dimension analogous
to spatial scale is defined in the module grid—depicted along the columns of Figure 4. We describe
this dimension in terms of the spatial range of influence that a visual feature in the image has
on the saliency map. A short range of influence implies that saliency map activation at a given
location is determined only by nearby visual features, whereas a long range of influence implies that
activation anywhere in the saliency map can be influenced by features anywhere in the visual field.
Emphasizing the role of experience, the actual range of influence a visual feature will have depends
on statistics of the task environment and past experience: even if the potential connectivity is
present for long-range influences, the realization of these influences will depend on the particular
task environment, determined through experience.
Through experience, each module becomes specialized for a particular target. We have shown that
Figure 4: A depiction of specific attentional mechanisms that are learned through experience. The two dimensions of this module grid relate to the nature of the experience and learning. This module grid can be viewed as a transformation of the control space that emphasizes specific mechanisms and the role of experience. Each control strategy—which corresponds to a point in control space—can be understood as a collection of grid squares in the module grid.
each primary strategy can be implemented as a subset of modules in one column of the module
grid. Similarly, the module grid can represent any combination of strategies in the control space if
multiple modules operate in parallel and their saliency maps are merged. We call this framework
TASC, an acronym for TAsk-specific and Spatial-scale Control. Rather than viewing TASC as an
alternative theory to existing models such as TOCH, Guided Search, and SI, we view TASC as a
generalization of these earlier theories, and each of these earlier theories can be seen as a specific
instantiation of TASC.
1.5 An Illustration of Saliency Over the Module Grid
To give a concrete intuition about the operation of TASC, we present an example illustrating the
model’s behavior. The input to TASC is an image—natural or synthetic—and the output from
TASC is a saliency map. TASC maintains a representation for each object target and range of
influence. Each representation yields a saliency map as shown in Figure 5, which corresponds
directly to the module grid in Figure 4. This example shows a street scene and the saliency maps
for three different objects: people, trees and sidewalks. In its full implementation, TASC would
maintain representations for a large collection of objects. At the short range of influence, saliency
maps generally make fine-grained predictions. These maps are similar to those obtained by feature-
Figure 5: A street scene and saliency maps produced by TASC in response to this input for three different target objects and four ranges of influence.
based endogenous models like Guided Search 2.0. In contrast, the saliency maps from modules with
a long range of influence show coarse regions where the object is likely to appear.
As mentioned above, an advanced attentional strategy involves combining multiple modules
from the module grid. Figure 6b presents an example of a combined saliency map that could result from TASC. This map is a combination across 11 objects and all ranges of influence. Figure 6d
shows separate saliency maps for each of the 4 ranges of influence where a combination is performed
across all 11 object modules. The map at the short range of influence corresponds to the exogenous
control strategy.
1.6 A Probabilistic Interpretation of TASC
Recent theories of visual attention have given a probabilistic interpretation to saliency (e.g., Bruce and Tsotsos, 2009; Gao et al., 2008; Kanan et al., 2009; Mozer et al., 2006; Mozer and Baldwin, 2008; Navalpakkam and Itti, 2005; Torralba et al., 2006; Zhang et al., 2008). In these theories,
Figure 6: A saliency map (b) for the image in (a) using the naive control strategy in TASC—combine all maps in the module grid. (c) the saliency map overlaid on the original image. (d) separate saliency maps for the 4 ranges of influence where all object modules are combined. These results come from a simulation with 11 different target object representations.
saliency at some location x in the visual field is related to the probability that a target appears at
that location, P(Tx|Fx), where Tx is a binary random variable indicating the presence or absence of
a target at x, and Fx denotes the set of visual features on which the saliency assessment is based.
Each module of TASC further conditions on the object that is the goal of search, o, and the range
of influence, r: P(Tx|Fx, O = o, R = r). As we describe in the next section on implementation,
TASC uses a linear neural net to approximate this conditional probability. Any control strategy
involves combining the predictions of one or more modules. The combination rule can also be
expressed in probabilistic terms. To do so, we define the notion of a goal, G, that determines the
relative importance of individual objects and ranges of influence. That is, we define a conditional
distribution, P (O,R|G), that specifies for a given goal how likely a particular module is to be
important. One can integrate out over the uncertainty in this distribution:
    P(Tx|G) = Σ_o Σ_r P(Tx|Fx, O = o, R = r) P(O = o, R = r|G).
We assume that P(O,R|G) has been learned through experience; however, we do not focus on this
form of attentional learning in the present paper. Given a specific goal, such as searching for a
person, P(O,R|G) will be peaked on a single target object and around ranges of influence that are
likely to be relevant. Given a more general goal, such as searching for a mode of transportation, the
probability mass in P(O,R|G) will be more widely distributed over objects. And given the most
general goal of all—to explore the visual environment—P(O,R|G) might either be based on priors
(i.e., by integrating out over the more specific goals), or might simply be uniform (i.e., all objects
and ranges are equally relevant). For simplicity we assume the latter in simulations of exogenous
control.
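The combination rule above is straightforward to state in code. In the sketch below, the module saliency maps and the values of P(O,R|G) are toy numbers, and the explicit normalization of the weights is our own bookkeeping:

```python
import numpy as np

def combine_modules(module_maps, module_weights):
    """Mixture over TASC modules:
    P(Tx|G) = sum over (o, r) of P(Tx|Fx, O=o, R=r) * P(O=o, R=r|G)."""
    total = sum(module_weights.values())           # normalize P(O,R|G)
    return sum((w / total) * module_maps[k] for k, w in module_weights.items())

# two hypothetical modules for one target object at two ranges of influence
maps = {("person", "short"): np.eye(4),            # fine-grained predictions
        ("person", "long"): np.full((4, 4), 0.25)} # coarse, diffuse predictions

# a specific goal peaks P(O,R|G) on the most relevant module ...
specific = combine_modules(
    maps, {("person", "short"): 0.9, ("person", "long"): 0.1})

# ... while free viewing uses a uniform P(O,R|G), as assumed for the
# simulations of exogenous control
free_view = combine_modules(maps, {k: 1.0 for k in maps})

assert np.isclose(specific[0, 0], 0.9 * 1.0 + 0.1 * 0.25)
assert np.allclose(free_view, 0.5 * np.eye(4) + 0.125)
```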
2 TASC Implementation
As with all models of attention, TASC assumes that computation of the saliency map can be
performed in parallel across the visual field with relatively simple operations. If computing saliency
were as computationally complex as recognizing objects, there would be little use for the saliency
map because the purpose of the saliency map is to provide rapid heuristic guidance to the visual
system. In implementing TASC, each module from the module grid is configured to perform a
mapping from image to saliency map. The modules all have the same architecture, though they are
parameterized to implement varying ranges of influence, and the association from image to saliency
map is established via an object-specific training procedure which we describe shortly.
2.1 Overview of Processing Stages
Rather than creating yet another model of attention, our aim in developing TASC was to generalize
existing models so that they can be viewed as instantiations of TASC. Specifically,
the module processing mechanisms were chosen to be consistent with key implementation details
of several models including NI, TOCH, and Guided Search. Figure 7 provides a schematic depic-
tion of the TASC architecture. There are 4 main processing stages: feature extraction, contrast
enhancement, dimensionality reduction, and associative mapping. A subset of these stages is found
in most attention models. Table 1 lays out the basic processing stages used in NI, Guided Search,
and TOCH.
Image processing in TASC begins with a feature extraction stage in which raw pixel values are
converted into basic information about the presence of features including color and orientation. All
visual attention models that operate on raw image data use a similar form of feature extraction,
though some include other features such as luminance.
Figure 7: A schematic depiction of the TASC processing stages.
Table 1: Processing stages of three existing models of attentional control
Stage                    | NI, Guided Search                | TOCH
-------------------------|----------------------------------|---------------------------------
feature extraction       | color, orientation, luminance    | color, orientation at multiple
                         | at multiple scales               | spatial scales
contrast enhancement     | via center-surround differencing | no
dimensionality reduction | no                               | sub-sampling and PCA
task association         | linear                           | mostly linear with a Gaussian
                         |                                  | squashing function
Feature extraction is followed by a contrast enhancement stage that mimics neural processing
mechanisms along the visual stream. This stage strengthens the response to regions that differ
significantly from their surround. TASC weights feature values by the ratio of center activity
to surround activity. NI implements contrast enhancement through center-surround differencing
across spatial frequencies and non-linear normalization. Contrast enhancement is not included in
TOCH.
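A minimal version of TASC's ratio-based enhancement might look as follows; the center and surround window sizes are illustrative choices of ours, not values taken from the model:

```python
import numpy as np

def local_mean(x, radius):
    """Mean over a (2*radius+1)-wide square neighborhood, clipped at edges."""
    h, w = x.shape
    out = np.empty_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = x[max(0, i - radius):i + radius + 1,
                          max(0, j - radius):j + radius + 1].mean()
    return out

def contrast_enhance(fmap, center_r=1, surround_r=4, eps=1e-6):
    """Weight each feature value by the ratio of center activity to
    surround activity, as in TASC's contrast-enhancement stage."""
    center = local_mean(fmap, center_r)
    surround = local_mean(fmap, surround_r)
    return fmap * center / (surround + eps)

f = np.full((16, 16), 0.1)
f[8, 8] = 1.0  # one locally distinctive feature value
enhanced = contrast_enhance(f)
# the distinctive location is boosted relative to its uniform background
assert enhanced[8, 8] / f[8, 8] > enhanced[0, 0] / f[0, 0]
```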
The third processing stage is dimensionality reduction. Dimensionality reduction is used in the
scene gist computations in TOCH to limit the redundancy of the feature data. This reduction is
not necessary in NI and Guided Search because saliency predictions are based only on local feature
data which have a manageable dimensionality. Similar to TOCH, TASC uses sub-sampling and
principal components analysis to perform dimensionality reduction. The amount of reduction varies
with the module’s range of influence. At the short range of influence, significantly more feature
information is utilized across the whole image than at the long range of influence. This approach
matches the continuum seen between NI and TOCH.
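This scale-dependent reduction can be sketched as sub-sampling followed by a PCA projection. The stride and component counts below are illustrative only, and the principal axes are computed directly from an SVD of the (fake) data rather than with any particular library routine:

```python
import numpy as np

def reduce_features(features, stride, n_components):
    """Spatial sub-sampling followed by projection onto the top principal
    components. features: (n_samples, height, width, n_channels)."""
    sub = features[:, ::stride, ::stride, :]      # coarser spatial sampling
    flat = sub.reshape(len(sub), -1)              # one row per image
    flat = flat - flat.mean(axis=0)               # center before PCA
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return flat @ vt[:n_components].T             # keep top components

rng = np.random.default_rng(0)
stacks = rng.normal(size=(50, 32, 32, 4))         # 50 fake feature-map stacks

# short range of influence: finer sampling, more components retained
short = reduce_features(stacks, stride=2, n_components=32)
# long range of influence: coarse sampling, few components, as in gist models
long_ = reduce_features(stacks, stride=8, n_components=8)

assert short.shape == (50, 32) and long_.shape == (50, 8)
```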
The final processing stage involves constructing an object-specific saliency map from the di-
mensionality reduced image data. NI, Guided Search and TOCH are all attention models that can
be tuned to a specific search goal. In NI, linear weights for the individual feature channels are
learned from a set of test images related to the search task. Guided Search uses a set of linear
gains to modulate the weight of each feature channel. TOCH associates global features with likely
object locations through a weighted sum of linear regressors where the weights depend on the global
features in a non-linear way. Each module in TASC builds a linear mapping between feature values
and target locations for a specific object.
2.2 Patch Processing
For a given target object, we implement 4 TASC modules that have varying ranges of influence from
short to long. A specific range of influence is implemented by dividing the image into overlapping
patches of a specific size. Each image patch contributes only to the corresponding patch in the
saliency map. For modules with a short range of influence, small patches are used so that saliency
predictions are based only on local features—i.e. those within the small patch. At the longest
range of influence, the entire image is treated as a single patch which enables image features to
Figure 8: A depiction of patch processing for the 4 TASC modules that vary from short range of influence (left) to long range of influence (right). Each patch can only make predictions about its corresponding region in the saliency map. Thus at the short range of influence, only local predictions are made. At the long range of influence, however, image properties in any region can contribute to the saliency values in any other region. In (b) see the saliency maps across the 4 ranges of influence for the image in (a) where the task involves searching for trees. For all but the longest range of influence, patches are tiled such that they overlap by 50 percent.
have an effect on all saliency locations no matter how distant. The short range of influence modules
correspond well to the NI model whereas the long range of influence modules are more similar to
the global gist processing component of TOCH. Figure 8 shows a depiction of the patch sizes across
the 4 spatial scales. In this implementation we use square input images and square patch sizes
though this is not a requirement of the TASC framework. The model has 4 different patch sizes for
the 4 ranges of influence, φ1 = 1.56, φ2 = 6.25, φ3 = 25, and φ4 = 100 ordered from short to long
range of influence, where φ represents the percent of the image the patch spans. For each range of
influence, the image is tiled by patches that overlap by 50 percent.
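For concreteness, the relationship between φ and the number of overlapping patches per dimension can be checked with a short sketch (the helper name is ours; square patches and 50 percent overlap are as described above):

```python
import math

def patches_per_dim(phi):
    """Number of 50%-overlapping patches along one image dimension,
    where phi is the percent of the image area a square patch spans."""
    side = math.sqrt(phi / 100.0)   # patch side as a fraction of the image side
    return round(2.0 / side - 1.0)  # 50% overlap means the stride is side/2

# phi values from short to long range of influence
print([patches_per_dim(phi) for phi in (1.56, 6.25, 25, 100)])  # [15, 7, 3, 1]
```

These counts match the 15x15, 7x7, 3x3, and 1x1 patch grids reported in the dimensionality reduction section.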
Each TASC module processes the image in a patchwise fashion, with each patch processed
independently of the others. Patches are not recombined until the saliency map is formed. The
processing mechanisms are the same for all patches regardless of the range of influence, though
some processing parameters differ. Below we describe each processing mechanism and then present
the parameters that differ between ranges of influence.
2.3 Feature Extraction
TASC begins with an image such as in Figure 9a. From this image the model records feature
activations along 8 different channels: 4 color opponency and 4 orientation as shown in Figure 10a.
The value of each color channel is the difference between the RGB value for that color and its
opposing color (red opposes green and blue opposes yellow). All negative color feature values are
set to 0. If R,G,B are the pixel values and Y = (R + G)/2, then the red, green, blue and yellow
feature values are given by fr = max(0, R − G), fg = max(0, G − R), fb = max(0, B − Y ), and
fy = max(0, Y − B). Technically the feature values should be denoted R(x, y) and fr(x, y) for
each pixel location, (x, y), but we drop the location specification when it’s not necessary. For
the orientation channels, the intensity image is convolved with Gabor filters tuned to 4 different
orientations: 0, 45, 90, 135. All feature extraction parameters are the same across the 4 ranges of
influence.
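The color-opponency computation above is simple enough to state directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def color_opponency(img):
    """img: H x W x 3 RGB array (float). Returns the four rectified
    color channels fr, fg, fb, fy defined in the text."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    Y = (R + G) / 2.0                 # yellow is the mean of red and green
    fr = np.maximum(0.0, R - G)
    fg = np.maximum(0.0, G - R)
    fb = np.maximum(0.0, B - Y)
    fy = np.maximum(0.0, Y - B)
    return fr, fg, fb, fy

# a pure red pixel: only the red and (half-strength) yellow channels respond
fr, fg, fb, fy = color_opponency(np.array([[[1.0, 0.0, 0.0]]]))
print(fr.item(), fg.item(), fb.item(), fy.item())  # 1.0 0.0 0.0 0.5
```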
The TASC feature extraction stage is very similar to feature extraction performed in the NI
and TOCH models. All models use the same process for extracting orientation values. NI uses
color opponency values similar to those used in TASC. In Itti and Koch (2000), which forms the
basis for the early stages of NI, two color channels are used: red-green and blue-yellow (negative
values allowed). TASC uses these same values but divides them into 4 channels to avoid negative
feature values.
The primary difference between feature extraction in TASC and the NI and TOCH models is
that TASC does not extract features at multiple spatial frequencies. Feature extraction utilizing
multiple spatial frequencies, obtained for example through the steerable pyramid (Simoncelli and
Freeman, 1995), is typically employed to improve the model’s ability to handle image datasets with
a diverse range of object and scene scales. To reduce the computational load, we chose to perform
feature extraction at a single spatial frequency in TASC and found the results satisfactory for our
simulations. Nevertheless, we envision a complete implementation of TASC that includes feature
extraction at multiple spatial frequencies and expect that this addition would improve the overall
performance.
Figure 9: An example image, (a), and the saliency maps, (b), for the four ranges of influence from short to long (left to right). Here the task is to search for the trees in the image.
Figure 10: Activities in TASC for the first two stages of processing: feature extraction (a) and contrast enhancement (b). The original image is shown in Figure 9a. The feature channels on top are red, green, blue and yellow. The four channels on the bottom correspond to orientations of 0, 45, 90, and 135 degrees from left to right.
2.4 Contrast Enhancement
Contrast enhancement is implemented in TASC by weighting each feature value by the ratio of
center activity to surround activity (the result of which is shown in Figure 10b). The model
computes center and surround activity by convolving the feature data for each channel by two
gaussian kernels. Both kernels have a peak of 1 but they differ in their standard deviation, with σc
corresponding to the center kernel and σs to the surround kernel. For each pixel and feature channel,
the model obtains a center value, cent, which measures the amount of activity close to that pixel
location and a surround value, surr, that measures the activity over the wider region surrounding
the location. The contrast value, c(x, y), is computed for each of the 8 feature channels as follows
using the red feature channel as an example:
c_r(x, y) = \frac{cent_r(x, y)}{surr_r(x, y)} f_r(x, y),

where

cent_r(x, y) = \sum_{i,j \in I} f_r(i, j)\, e^{-\frac{(x-i)^2}{2\sigma_c^2}} e^{-\frac{(y-j)^2}{2\sigma_c^2}} \quad \text{and} \quad surr_r(x, y) = \sum_{i,j \in I} f_r(i, j)\, e^{-\frac{(x-i)^2}{2\sigma_s^2}} e^{-\frac{(y-j)^2}{2\sigma_s^2}},
with (i, j) iterating over all pixel locations in the image I.
Because feature values are always non-negative, the surround kernel bounds the center kernel,
i.e. surr ≥ cent, and thus the ratio of center to surround will always be between 0 and 1. If most
of the activity for a specific feature is in the center, the ratio will be close to 1 and the feature
activity at that location will remain strong. If there is significant activity for the specific feature
in the surrounding regions, this ratio will be close to 0 and the feature value at that pixel will be
reduced in strength.
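The contrast equations can be transcribed directly; the brute-force double loop below is for exposition only (a real implementation would use convolution), and the toy image sizes are ours:

```python
import numpy as np

def contrast_enhance(f, sigma_c, sigma_s):
    """Weight each feature value by the ratio of center to surround
    activity, following the equations in the text. f is a 2-D
    non-negative feature map; sigma_c < sigma_s."""
    H, W = f.shape
    ii, jj = np.mgrid[0:H, 0:W]
    out = np.zeros_like(f)
    for x in range(H):
        for y in range(W):
            kc = np.exp(-(x - ii) ** 2 / (2 * sigma_c ** 2)) * \
                 np.exp(-(y - jj) ** 2 / (2 * sigma_c ** 2))
            ks = np.exp(-(x - ii) ** 2 / (2 * sigma_s ** 2)) * \
                 np.exp(-(y - jj) ** 2 / (2 * sigma_s ** 2))
            cent, surr = np.sum(f * kc), np.sum(f * ks)
            if surr > 0:
                out[x, y] = (cent / surr) * f[x, y]
    return out

# an isolated feature has all its activity in the center, so its ratio is 1
f = np.zeros((9, 9))
f[4, 4] = 1.0
print(contrast_enhance(f, 1.0, 4.0)[4, 4])  # 1.0
```

Because both kernels peak at 1 and the surround is wider, the ratio stays in [0, 1], as noted above.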
For the feature extraction and contrast enhancement stages, the model computes the f and c
values from the whole image rather than by patch. For feature extraction this avoids boundary
issues at the edges of the patch when convolving the image with the Gabor filter. For the contrast
enhancement stage this allows the surrounding areas of a location to have an influence on the
contrast value even if they reside in a different patch. The standard deviations for the center and
surround kernels are σc = 0.05N and σs = 0.5N , where the image dimension is N x N . These
parameters lead to a contrast mechanism that compares a local region to a large part of the image,
i.e. contrasting occurs at a global scale. The surround region could be shrunk to yield more local
contrasting. We chose these values primarily because the content of synthetic images used in visual
search tasks is usually spread across the whole image. These parameters were left fixed for all
other image datasets. However, a smaller surround would perhaps be more appropriate for real
world images. The same center and surround parameters are also used across all ranges of influence
in this implementation, though, these parameters could differ from one range of influence to the
next. For example, at the long range of influence, a larger surround value may be preferred. It is
also possible that several contrast values could be obtained by varying the center and surround
percentages. This would in essence provide another form of multi-scale processing.
2.5 Dimensionality Reduction
After the contrast enhancement stage, TASC maintains 8 feature values per pixel. For a patch with
many pixels, the dimensionality of this data is too high for most desired computations. Following
the methods used in TOCH for gist processing, the dimensionality is reduced significantly through
sub-sampling and principal components analysis while key information needed for classifying scenes
is still preserved.
2.5.1 Sub-sampling
As in TOCH, TASC uses a simple sub-sampling scheme where each pixel in the sub-sampled image
represents the mean value of all pixels in a corresponding unique, non-overlapping rectangular
region in the original data. Sub-sampling is performed only within feature channels so for each
sub-sampled region in the image the model still maintains 8 feature values. If the original patch
dimension is n x n and patches are sub-sampled by a factor of ρ, the new patch dimension becomes
n/ρ x n/ρ. Each pixel in the sub-sampled data corresponds to the average of all pixels in a ρ x ρ
block in the original data.
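Under this scheme, sub-sampling is a plain block average; a minimal sketch (the function name is ours):

```python
import numpy as np

def subsample(channel, rho):
    """Replace each non-overlapping rho x rho block of a 2-D feature
    channel with its mean, reducing an n x n patch to n/rho x n/rho."""
    n = channel.shape[0]
    assert n % rho == 0, "patch dimension must be divisible by rho"
    return channel.reshape(n // rho, rho, n // rho, rho).mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(subsample(x, 2))  # means of the four 2x2 blocks: 2.5, 4.5, 10.5, 12.5
```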
The extent of sub-sampling varies across ranges of influence in TASC. In essence, more sub-sampling is needed for the longer ranges of influence because the patches are larger and contain
more feature data. This approach fits well with both NI and TOCH. Specifically, NI operates
with a short range of influence and does not perform any sub-sampling while TOCH operates at a
long range of influence and consequently uses significant sub-sampling of feature data. The model
uses the following values of ρ for the four ranges of influence from short to long: ρ1 = 4, ρ2 = 8,
ρ3 = 8, and ρ4 = 16. These specific values were chosen to yield a reasonable computational load
in the following principal components analysis stage. However, changing these values does not
significantly affect the model’s performance.
2.5.2 Principal Components Analysis
After sub-sampling, the model uses principal components analysis (PCA) to further reduce the
dimensionality of data. This process extracts the critical feature properties of a patch and discards
extraneous feature information. PCA is performed separately for each of the 8 feature channels
and 4 ranges of influence. For each feature channel, the top ℓ principal components are preserved.
The PCA projections are derived from a set of patches extracted from our image database—using
several randomly distributed patches per image. The database contains roughly 2,500 images
including many varied real-world scenes and some synthetic visual search images. Note that for
a given feature channel and range of influence, the same PCA projection is used for all patches,
thus yielding a location independent reduction. Additionally, the same PCA projections are used
across all simulations described in the results section (including images with artificial displays and
real-world scenes).
We chose to perform PCA on each feature channel to preserve an equal amount of information
across all feature dimensions. The alternative would involve lumping all feature data together and
performing one PCA projection. In this case, PCA may find that a particular feature dimension
does not have significant discrimination power and thus would not maintain any information about
this feature channel in the top principal components. This could be problematic for visual search
tasks where the discarded feature plays a defining role. This problem could be avoided by carefully
controlling the database used for deriving the PCA projections; however, we found it preferable to
instead treat the individual feature channels independently.
If the original patch size is n x n, the dimensionality of each feature channel after sub-sampling is (n/ρ)². After PCA, there are only ℓ data points per channel. The full patch data, x, are obtained by concatenating the data from each feature channel into a vector with dimensionality 8ℓ. The number of principal components used varies across the ranges of influence; from short range to long range, ℓ1 = 1, ℓ2 = 3, ℓ3 = 9, and ℓ4 = 27. Because there are far more patches per image at
the shorter ranges of influence, the total number of data points per image decreases as the range
of influence becomes longer. At the shortest range of influence, there are 15x15=225 patches per
image (this number is odd because patches overlap). At progressively longer ranges of influence,
there are 7x7=49, 3x3=9, and 1x1=1 patches per image. From short to long range of influence, the
total data points used per image are 225x8x1=1800, 49x8x3=1176, 9x8x9=648, and 1x8x27=216
(patches x features x ℓ).
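The data-point bookkeeping in the preceding paragraph can be verified directly:

```python
# patches per image, channels, and principal components (l) at the four
# ranges of influence, from short to long
patches = [15 * 15, 7 * 7, 3 * 3, 1 * 1]
channels = 8
components = [1, 3, 9, 27]

totals = [p * channels * l for p, l in zip(patches, components)]
print(totals)  # [1800, 1176, 648, 216]
```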
Figure 11 provides an informative view into how the PCA stage functions by depicting which
regions each principal component favors for each feature channel. Each box in this figure displays
the values of a specific principal component vector for a specific channel. Lighter values correspond
Figure 11: Reconstructed principal component weights for the 8 feature channels (columns: red, green, blue, yellow, 0, 45, 90, 135). The top row corresponds to the first principal component. See text for more detailed explanation.
to stronger weights for the features at that location in the patch. Principal components decrease
in importance from top to bottom. The first principal component typically represents the DC level
of activity. Subsequent principal components provide more fine-grained spatial selectivity. The
principal component vectors reconstructed in this figure correspond to the second longest range
of influence for which 9 principal components are used. The PCA projections used for the other
ranges of influence exhibited a very similar pattern.
Figure 12 shows the representation for the image in Figure 9a after sub-sampling and PCA
at the 4 ranges of influence. Each row in Figures 12a-c shows 8 composite images, one for each
feature channel, formed by assembling the nth principal component value for each patch. The
top row corresponds to the first principal component and lower rows display subsequent principal
components with decreasing significance. Figure 12a is the representation for the shortest range
of influence. At the longest range of influence, Figure 12d, there is only one patch per image but
more principal components (now shown along the columns).
2.6 Task Associations and Learning
To implement modules that specialize for a variety of target objects, each module has a unique
collection of patch specific neural networks that link image features to object locations. Figure 9b
shows the output from an association network tuned to trees for the 4 ranges of influence. These
Figure 12: PCA activities in TASC at the four ranges of influence (shortest (a) to longest (d)) corresponding to the image in Figure 9a. A separate representation is maintained for each feature channel (here the ordering is: red, green, blue, yellow, 0, 45, 90, 135). PCA is computed at the patch level. Each block in these figures corresponds to a PCA value for a specific feature channel and patch. At the shortest range of influence there are 15x15 patches. At the longest range of influence, there is only one patch per image. In (a)–(c) the most important component is in the top row and less important components are in lower rows. In (d) the principal components decrease in importance from left to right. Lighter values correspond to greater activation.
are the final saliency maps for each module, which have been smoothed with a Gaussian filter.
For each patch in our image, there is a set of input units representing the patch data, xp, after
dimensionality reduction. Each set of input units is fully connected to a set of output units via a
linear mapping. To limit the capacity of this network, we restrict the rank of the linear mapping.
This is equivalent to adding a hidden layer with a limited number of units. For each range of
influence, the number of hidden units is restricted to a different r < 8ℓ. If the input to hidden
weight matrix for a patch is Vp and the hidden to output matrix is Wp, the linear mapping from
input to output, VpWp, has rank r. The model uses r1 = 1, r2 = 3, r3 = 9, and r4 = 27 for the 4
ranges of influence from short to long. This leads patches at the shorter range of influence to have
weaker descriptive ability. However, because there are far more patches at the shorter range of
influence, the module as a whole maintains adequate descriptive power. The output values for each
patch, op, are obtained by passing the output activations, xpVpWp, through the logistic sigmoid
function to yield meaningful saliency values between 0 and 1. For computational purposes, we
constrain the dimension of op to be smaller than the original patch dimensions because saliency
predictions do not need to be as fine grained as the original pixel information. The reduction is
such that the final saliency map has 32x32 pixels.
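The rank-restricted mapping amounts to a linear bottleneck followed by a logistic squashing. The sketch below uses illustrative sizes of our choosing (ℓ = 9, r = 9, and a 64-location patch footprint); the weights are random stand-ins for trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, r, d_out = 8 * 9, 9, 64              # 8 channels x l inputs; rank r
V = 0.1 * rng.standard_normal((d_in, r))   # input-to-hidden weights, V_p
W = 0.1 * rng.standard_normal((r, d_out))  # hidden-to-output weights, W_p

x_p = rng.standard_normal(d_in)            # dimensionality-reduced patch data
o_p = sigmoid(x_p @ V @ W)                 # patch saliency values in (0, 1)

# the composite linear mapping V_p W_p has rank at most r
print(np.linalg.matrix_rank(V @ W))
```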
To obtain the final saliency map, the outputs from the individual patches must be combined.
Because the patches overlap, multiple output units can correspond to the same region in the
complete saliency map. Over most of the image, 4 overlapping patches contribute to a single
location, though along the edges only 2 patches contribute and in the corners only one patch is
present. The final saliency value for a particular location, s(x, y), is computed as an average of all
patch output values at that location:
s(x, y) = \frac{1}{|\Omega(x, y)|} \sum_{p \in \Omega(x, y)} o_p(g_p(x, y)),
where Ω(x, y) is the set of patches that contribute to point (x, y) in the saliency map, |Ω(x, y)| is
the cardinality of that set, and gp(x, y) is a function that maps the coordinates in the saliency map
to patch coordinates for a particular patch p.
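The averaging rule can be sketched as follows; patch footprints are given here as explicit index slices, which stand in for the coordinate mapping g_p, and the toy map size is ours:

```python
import numpy as np

def combine_patches(patch_outputs, size=4):
    """Average overlapping patch outputs into one saliency map.
    patch_outputs: list of (row_slice, col_slice, o_p) triples; the
    slices give each patch's footprint in the size x size map."""
    s = np.zeros((size, size))
    counts = np.zeros((size, size))      # |Omega(x, y)| per location
    for rows, cols, op in patch_outputs:
        s[rows, cols] += op
        counts[rows, cols] += 1
    return s / np.maximum(counts, 1)

# two 50%-overlapping patches: the shared rows average the two outputs
s = combine_patches([(slice(0, 2), slice(0, 4), np.full((2, 4), 1.0)),
                     (slice(1, 3), slice(0, 4), np.full((2, 4), 0.0))])
print(s[0, 0], s[1, 0], s[3, 0])  # 1.0 0.5 0.0
```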
To train a network for a specific task, we use a supervised learning approach based on the
LabelMe image database (Russell et al., 2008), which contains images labeled by pixel according
Figure 13: (above) Two images from the LabelMe database, (below) pixel-by-pixel labeling of three objects—cars (red), people (green), and buildings (blue).
to the object present at that corresponding location. Figure 13 shows two labeled images used for
training with cars, people, and buildings labeled (note that objects are sometimes not labeled).
Typically, our training sets for each object consist of 500-1000 images. During training, each image
is presented to the model, the model computes a dimensionality reduced feature representation
for each patch, and this representation serves as input to the patch specific neural network. Each
network is trained with a target output containing values of 1 at all locations in the patch where a
given target object appears and 0 at all other locations. Training is performed in batches using the
back propagation algorithm to update weights. Training continues until a local optimum in error
is obtained.
2.7 Combining Multiple Modules
So far we have discussed how each module in the module grid computes a saliency map for a specific
image. Each individual module could be used to guide attention. However, to implement more
complex strategies, such as hybrid models like TOCH, it is necessary to combine the saliency maps
from a collection of modules. It is this combination of modules that gives the TASC framework
its ability to model attentional guidance that varies in task specificity and utilizes multiple spatial
scales.
Several different approaches are possible for combining the saliency maps from multiple modules.
As we suggested in the introduction, a probabilistic approach could assign a prior probability to
the relevance of each module for a specific goal. The combination could then integrate out over
the uncertainty in the relevance of the modules, essentially computing a weighted summation of
module outputs, where the weights correspond to the module’s prior probability. One problem
with this probabilistic rule is that it can yield the same result when one module is highly active
at a location as it would if many modules were modestly active at that location. If the module
with high activity is relevant to the goal, we’d intuitively expect the saliency at that location to
be high regardless of how the other modules respond. This observation suggests an alternative
approach to combining modules in which a subset of relevant modules is selected and the final
saliency value at a specific location corresponds to the maximum value at that location across the
subset of modules. This approach can be thought of as performing a disjunction across object
modules whereas the previous combination rule is more of a conjunction of modules. These two
approaches will produce similar results for highly constrained tasks because in the probabilistic
combination most of the mass of the distribution over modules will be concentrated on one module
and in the max combination the set of relevant modules will be small. When tasks become less constrained, however, the predictions of the two rules increasingly diverge. We implemented both
combination rules in TASC and found that the max rule produced saliency maps that were more
consistent with experimental results. This second strategy was used for all simulations presented
in the results section.
Combining visual information using a max rule is not new in the literature. Zhaoping and Snowden (2006) propose a selection process that computes the maximum over feature maps. Riesenhuber
and Poggio (1999) argue for a max-like combination of different feature detectors in their object
recognition model. Evidence for max rule combinations can also be found at the neural imple-
mentation level of analysis. Specifically, Gawne and Martin (2002) and Lampl et al. (2004) find
examples of max operations occuring in monkey and cat visual corticies. The general finding is
that activities of objects in multiple locations (within one receptive field) lead to an activation level
in the visual cortex that is most similar to the max of the individual object activations when they
are presented separately. An intuitive argument can also be made for applying the max rule in
real-world situations. For example, if attention is tuned to search for wheeled vehicles and the car
module gives a strong response at a location, then that location should be salient even if the bike,
truck, bus, train, cart, and scooter modules all have a weak response at that location.
Mathematically, this combination is defined as follows. Let so,r(x, y) be the saliency map value
for target object o and range of influence r at location (x, y). The values in the combined saliency
map are given by
s_c(x, y) = \max_{(o, r) \in M} s_{o, r}(x, y),
where M is the set of modules relevant to the current goal. In each of the simulations presented
in the next section, M is selected to fit the constraints of the search task. In most cases, M
contains modules associated with one specific target object, though at times the model combines
across multiple target objects. One can imagine that one range of influence might be more relevant
than another for a specific viewing goal, however, it is unclear how to select a subset of ranges for
that goal. Consequently, the model employs the most general strategy in all simulations, using all
ranges of influence for the relevant target objects. For free viewing simulations, i.e. pure exogenous
control, M contains every module in the module grid.
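The max rule, together with the wheeled-vehicle example above, can be sketched as (module names and saliency values are illustrative):

```python
import numpy as np

def combined_saliency(modules):
    """Max-rule combination over the relevant module set M.
    modules: dict mapping (object, range_of_influence) -> saliency map."""
    return np.maximum.reduce(list(modules.values()))

# a strong car response keeps a location salient even where the bike
# module responds only weakly
M = {("car", 1): np.array([[0.9, 0.1]]),
     ("bike", 1): np.array([[0.2, 0.3]])}
print(combined_saliency(M))  # [[0.9 0.3]]
```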
2.8 Key Components of the Implementation
At this point the model may seem rather complex in the number of stages to process an image
and obtain a saliency map. However, most of these components are present in existing attentional
models and don’t significantly contribute to the characteristics of TASC. Nevertheless, there are
several implementation decisions in TASC that are particularly important in defining the behavior
of the model. The feature extraction and contrast enhancement stages are mostly standard though
we prefer our specific implementation of contrast enhancement. For the PCA stage, we find it
important to reduce the dimensionality of the feature channels independently rather than operating
on all the data at once. This enables a high degree of reduction while ensuring that all feature
channels maintain an equally informative representation. Another key implementation decision is
the choice to restrict the association network to be linear. Linearity limits the complexity of the
mappings that can be learned and causes the model to perform a quick and dirty sort of object
recognition (we return to this point in the discussion). The final implementation decision that is
important in TASC is the use of the max, or disjunction, rule when combining modules. Another
point about this implementation worth mentioning, though not a specific design decision, is that
the same model is used for all simulations and the only difference from one simulation to the next
is the subset of modules used to obtain the final saliency map.
3 Simulation Results
Having explained our implementation of TASC, we now present simulations that demonstrate that
TASC accounts for a diverse set of findings in the attentional literature. We begin by testing
the model on classic visual search tasks. Next, we explore the contextual cueing paradigm and
observe how the TASC framework allows for the type of global contextual priming observed in
these experiments. We discuss how TASC provides an integrative framework that accommodates
other attentional phenomena including probabilistic spatial cueing and figure-ground assignment.
We conclude by testing TASC on real-world images and comparing the saliency predictions to eye-
movement data from human observers. In all these simulations, TASC operates on raw pixel images.
Historically, attentional models that simulate psychological findings used as input an abstract data
structure representing presegmented features. Here, we present a single model that uses the same
input representation for both artificial and real-world images.
3.1 Visual Search
Visual search tasks require that an individual detect a target element in a display surrounded by
distractor elements. Visual search is one of the most extensively studied tasks in the psychological
literature (see Wolfe (1998b) for a meta-analysis of a wide range of visual search experiments and
Wolfe and Horowitz (2004) for a review of the attributes that guide attention). The seminal work
of Treisman and Gelade (1980) revealed what appeared to be a fundamental dichotomy between
feature and conjunction search. In feature search, the target differs from all distractors along
a single feature dimension; for example, the target is red and the distractors are green (see the
display at the left of Figure 14a). In conjunction search, the target is defined by a conjunction
of features, and shares one feature in common with each distractor; for example, the target is
a red vertical bar among distractors that are green verticals or red horizontals (see Figure 14b).
Feature search is efficient in that search time is not dependent on the number of distractors in
the display; conjunction search is inefficient, due to the increased confusability between targets
and distractors. Any account of visual search needs to start by explaining the distinction between
feature and conjunction search. This explanation is a necessary but not sufficient condition for the
plausibility of a theory. The dichotomy between feature and conjunction search has given way to
Figure 14: Saliency maps at 4 ranges of influence for three different visual search paradigms: (a) feature search—a single red target among green distractors, (b) conjunction search—a single red vertical among green verticals and red horizontals, and (c) oddball search—a singleton exists in each image but the specific features of the target and the defining feature dimension are unknown before the trial.
many subtleties and quirks that populate the literature. Our goal in this work is not to explain
visual search in all its detail, but to demonstrate that TASC is a plausible candidate to tackle the
visual search literature.
In addition to the feature and conjunction search paradigms, we also explore oddball detection
or pop-out search. In an oddball-detection task, the target differs from distractors along one feature
dimension, but the feature dimension and the target feature value is not specified in advance of a
trial. For example, the participant might be required to find a red target among green elements,
a green element among red, a vertical among horizontals, or a horizontal among verticals (see the
display in Figure 14c). The target feature changes from trial to trial and thus part of the task
involves determining the relevant discrimination on the current trial.
To model visual search, we construct displays that contain elements varying along two feature
dimensions, color and orientation. To assess search performance, we require a measure of response
latency to detect a target as a function of the number of distractors in the display. (The efficiency
of search is reflected in the slope of this curve.) How are TASC’s saliency maps translated into
response latencies? We adopt an assumption from Guided Search 2.0 (Wolfe, 1994) regarding how
the saliency map is used in search. Guided Search supposes that display locations are examined
in order from most salient to least, and that response latency increases linearly with the number
of display locations examined. To compute the saliency rank of a display element, a Gaussian
filter is first applied to smooth the saliency map. The maximum saliency value in the immediate
region of the display element is then determined, and elements are ranked by saliency. One does
not need to assume serial search to obtain response latencies monotonic in saliency ranking. For
example, Guided Search 4.0 (Wolfe, 2007) proposes that saliency ranking determines the order in
which display locations are fed into a parallel but capacity limited asynchronous diffusion process
that performs object recognition. Under this assumption, response latency is affected not only by
saliency ranking but also by drift rates. Nevertheless, in the absence of familiarity differences among
the display elements, the capacity-limited parallel diffusion process roughly produces latencies
proportional to saliency ranking. One could embellish TASC to use the response accumulation
mechanism of Guided Search 4.0, but doing so would add little to the key results we present here.
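The ranking assumption can be sketched as follows; the Gaussian smoothing step is omitted here, and the radius of the "immediate region" is our assumption:

```python
import numpy as np

def target_rank(saliency, element_centers, target_index, radius=2):
    """Rank display elements from most to least salient by the peak
    saliency in the immediate region of each element; the target's
    rank is taken as proportional to response latency."""
    peaks = [saliency[max(0, r - radius):r + radius + 1,
                      max(0, c - radius):c + radius + 1].max()
             for r, c in element_centers]
    order = np.argsort(peaks)[::-1]          # element indices, most salient first
    return int(np.where(order == target_index)[0][0]) + 1

# a highly salient target is examined first, regardless of set size
s = np.zeros((10, 10))
s[2, 2], s[7, 7], s[7, 2] = 1.0, 0.4, 0.2
print(target_rank(s, [(2, 2), (7, 7), (7, 2)], target_index=0))  # 1
```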
Most visual search experiments include both target present and target absent trials. Target
absent displays provide insight into how search is terminated. Latencies are slower in target absent
trials but usually have a similar dependence on the number of distractors as in target present trials
for the same task. Search termination is not implemented in TASC for target absent trials and
the target rank measure is undefined for images that do not contain a target. Consequently, all
subsequent visual search simulations contain only target present trials.
3.1.1 Feature Search
To simulate feature search, we train a set of modules for each primitive feature as target. This
set corresponds to a row of the module grid, i.e., modules that operate across various ranges of
influence. For a primitive feature such as red, training is based on a set of labeled images that
contain vertical and horizontal line segments of all colors, and the red elements are indicated as
targets. The same set of images is used for training other primitive feature modules with the
appropriate elements indicated as targets. These other modules are used in the conjunction and
oddball simulations. The trained module is then presented with a set of 300 novel images, each
of which has exactly one target. The set of images is divided into an equal number of displays
with 4, 8, and 12 distractors. Figure 14a shows a sample display for red-target feature search with
8 distractors and the activation of saliency maps across four ranges of influence. The location
containing the target is highly salient at each range of influence, and no other locations are salient,
suggesting efficient search. As described in the model implementation section, the final saliency
[Plot: Mean Saliency Rank (1-5) vs. Number of Distractors (4, 8, 12), in three panels: (a) Single Feature, (b) Conjunction, (c) Oddball.]
Figure 15: The mean target ranking for three visual search simulations in TASC, grouped by the number of distractors in the display. Here we take target ranking to be proportional to response time. The task in (a) involves searching for a target defined on one feature dimension (i.e., find the red object amongst green distractors). As expected, the target rank is independent of the number of distractors. (b) Results for a classic conjunction task (i.e., search for red vertical bars amongst green vertical and red horizontal bars). Here the ranking increases with the size of the distractor set, suggesting that search is inefficient. (c) Results for an oddball detection task where the target is a singleton along one dimension, but which dimension is unknown before the trial. As expected, response times for this task are independent of the number of distractors. In general, predicted response latencies in oddball detection tasks are slower than in single feature search.
map used to evaluate performance for each image is a combination of the four ranges of influence
shown in Figure 14a. It is clear that the target rank in the combined saliency map for this image
is 1. Figure 15a shows the mean target ranking across all 300 images grouped by display size.
Regardless of the number of distractors in the display, TASC obtains a saliency ranking of 1 for the
target element, suggesting a fast response time that is independent of display size, consistent with
behavioral studies showing efficient feature search. Feature search is efficient in TASC because the
contrast enhancement stage strengthens the target feature value and the neural net only activates
regions that contain the target feature. The target feature in this simulation was red, however, the
same result was obtained for green, blue, vertical, and horizontal targets.
Although TASC is trained to perform feature search, the model is agnostic as to whether this
training corresponds to experience within an individual’s lifetime, or experience on an evolutionary
time scale. The main claim of the model is that feature search is not special; it relies on the same
processing mechanisms as any other search task.
3.1.2 Conjunction Search
In principle, TASC could implement conjunction search in two different ways. First, saliency of a
conjunction target could be defined as the combination of the saliency maps of the two primitive
feature modules that make up the conjunction, i.e., saliency of red as target and saliency of vertical
as target. The combination of red and vertical maps would be no different than the combination
of automobile and bicycle maps when the target was a wheeled vehicle. Alternatively, a set of
specialized modules could be trained on the conjunction target—e.g., a module for red vertical
targets—treating the conjunction just like any more complex object like a car or person. The
semantics of the two alternatives are somewhat different: the combination of primitive feature
modules detects red or vertical, whereas the specialized conjunction module should in principle
detect red and vertical.
The saliency maps in Figure 14b were obtained from a simulation using the first approach—i.e.
representing conjunction as the combination of two primitive features. As the figure illustrates, the
saliency maps for a conjunction of features have significant spurious activation. Locations in which
red items are present are salient, as are locations in which verticals are present. The exact pattern
of activity depends on the configuration of elements, because the contrast and dimensionality
reduction stages of the model depend on the local configurations. In contrast to the situation
with feature search, the target does not stand out in the saliency map. Figure 15b shows that
the saliency rank of the target increases steeply with the number of distractors in the display,
suggesting that TASC has inadequate computational power to reliably detect a conjunction target.
Note that the saliency activation map is not random. If it were random, then the mean rank would
on expectation be half the number of items in the display—2.5, 4.5, and 6.5 for displays with 4, 8,
and 12 distractors, respectively. The observed rankings are better, indicating that some information
about the conjunction is available to the model. This occurs simply because on expectation, the
saliency of a red vertical element should be higher than that of an element that is just red or just vertical.
A key reason why TASC is inefficient on conjunction search is that saliency maps for different
targets (here, red and vertical) are combined with a max operator rather than addition. The max
operator yields a roughly 1:1 saliency ratio for targets versus distractors, whereas summing activa-
tions from the two saliency maps yields a 2:1 saliency ratio on expectation. As we explained earlier,
the max function was motivated by the fact that it essentially implements a
disjunction, and disjunctions are needed to vary the task specificity as we have defined it. For
example, if the target is a wheeled vehicle, then a location should be salient if it contains either a
car or a bus or a motorcycle. As we earlier defined the control space, a fundamental dimension of
control is this degree of task specificity, and we conjecture that being able to control task specificity
is far more useful for an agent than having the capability to perform conjunction search. The
architecture is thus optimized for its typical use—searching for objects with varying degrees of
specificity—not the artificial laboratory task of conjunction search.
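The effect of the combination rule can be illustrated with a toy one-dimensional display. This is a hypothetical sketch: the maps below are idealized module outputs, not actual TASC activations.

```python
import numpy as np

# Idealized per-module saliency over five display locations.
# Location 0 holds the red-vertical target; locations 1-2 hold
# red-horizontal distractors; locations 3-4 hold green-vertical ones.
red_map      = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
vertical_map = np.array([1.0, 0.0, 0.0, 1.0, 1.0])

combined_max = np.maximum(red_map, vertical_map)   # disjunctive combination
combined_sum = red_map + vertical_map              # additive combination

# Max yields a 1:1 target-to-distractor saliency ratio: the target
# does not stand out. Summation yields a 2:1 ratio instead.
print(combined_max)  # [1. 1. 1. 1. 1.]
print(combined_sum)  # [2. 1. 1. 1. 1.]
```

Summation would make conjunction search efficient, but it would sacrifice the disjunctive semantics (salient if car or bus or motorcycle) on which the control-space notion of task specificity depends.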
Thus, we did not simply design inefficient conjunction search into TASC in order to be consis-
tent with behavioral data. Although the model’s design could be altered to improve conjunction
search, the alterations would have other negative consequences to the model’s abilities. Inefficient
conjunction search is a reflection of design trade offs in the cognitive architecture.
Simulations using the second approach—building one module specific to the conjunction—
proved to be more difficult. Evidence suggests that even with lots of training, individuals cannot
learn efficient conjunction search. This poses a challenge to the theory behind TASC which is
premised on the notion of expert modules for complex objects. Why can a car saliency module
be efficient but not a green vertical saliency module? The answer is likely rooted in the difference
between conjunction search targets in artificial displays and objects in real-world scenes. Local-
ization of a conjunction target requires precise alignment of color and shape features. In contrast,
features in real-world objects are generally more redundant and their exact alignment is usually not
needed for identification. We argue that the quick and dirty process embodied by TASC for object
recognition will likely lose this precise alignment of which color feature goes with which shape fea-
ture in a dense neighborhood of many colored shapes. In fact, experimental evidence suggests that
the density of conjunction search influences search efficiency: when display elements are spaced far
apart, search is more efficient. This finding is consistent with the claim that information about
which feature goes with which object is lost when many elements are packed into a small area.
This behavior also arises within the TASC implementation. When few principal components
are used to represent a patch, most of the spatial information of each feature is lost. If the patch
contains several elements, the model will confuse the feature properties of the different elements
and be unable to properly recognize the presence of a conjunction target. For example, consider
the shortest range of influence, which uses one principal component per feature channel for each
patch (the first component is usually the DC level, which records the overall activity in the patch). If the
patch contains a red-vertical and a green-horizontal, there will be an equal amount of activity in
the red, green, horizontal, and vertical channels for this patch. But this is the same activity that
would be obtained if the objects were red-horizontal and green-vertical. Thus the model cannot
distinguish between the two scenarios and conjunction search will be inefficient. Similar behavior
arises for the longer ranges of influence where patches cover a large region. Even though more
principal components are used for these patches, the spatial information for each feature is still
limited because a larger region has to be represented.
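The information loss can be made concrete with a toy patch representation. This is a hypothetical sketch with idealized binary channel activities, not the model's actual PCA encoding.

```python
import numpy as np

# Two 2-element patches; columns are feature channels: red, green, horiz, vert.
# Patch A: red-vertical + green-horizontal.
# Patch B: red-horizontal + green-vertical.
patch_a = np.array([[1, 0, 0, 1],   # red-vertical element
                    [0, 1, 1, 0]])  # green-horizontal element
patch_b = np.array([[1, 0, 1, 0],   # red-horizontal element
                    [0, 1, 0, 1]])  # green-vertical element

# A single DC component per channel records only the overall channel
# activity in the patch, discarding which element carried which feature.
dc_a = patch_a.sum(axis=0)
dc_b = patch_b.sum(axis=0)
print(dc_a, dc_b)  # [1 1 1 1] [1 1 1 1]
```

The two patches yield identical DC representations, so a module reading only these components cannot tell whether the patch contains a red-vertical conjunction target.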
In simulations where TASC was trained on a conjunction target (i.e., red and vertical), search
was more efficient than the previous analysis would predict. The boosted performance was likely due
to a mismatch between the patch sizes and the content of the displays. Patch sizes were chosen a
priori to obtain a set with varying sizes, and the size and layout of the visual search displays
in Figure 14 were picked arbitrarily. Because of these choices, the patches at the smallest range of
influence typically contain only one element, and therefore there is no confusion between elements.
We expect that the efficiency observed in this simulation is an artifact of this choice of
patch sizes. In fact, we observed that at the longer ranges of influence, where patch sizes are larger,
search was inefficient. Additionally, the shorter ranges of influence became less efficient when the
displays were made denser.
3.1.3 Oddball-detection Search
In an oddball-detection or pop-out search task, the target is defined only by its contrast with the
distractors. On one trial the target may be the red item among green, and on the next trial the
target may be the vertical among horizontals. The task precludes knowing the target’s featural
identity in advance of a trial. Consequently, oddball detection is facilitated not by endogenous
control, but by detecting local contrast for any feature on any dimension. The notion of exogenous
attention in TASC is ideal for oddball detection. In TASC, we define exogenous attention as the
combination (disjunction) of many object-specialized modules. To simulate the oddball-detection
task, TASC was trained on five primitive features: red, green, blue, horizontal, and vertical. The
model was then shown a set of novel images, each containing an array of 4, 8, or 12 homogeneous
distractors, and a single target having a different value on one feature dimension (color or form).
The critical dimension and feature values vary from trial to trial. On each trial, an overall saliency
map is obtained by combining across the five target objects and four ranges of influence.
Figure 14c presents an example of one trial of the oddball-detection task, and the resulting
saliency maps. The three shortest ranges of influence yield greater saliency for the target than other
locations, as does the combination across the four maps. However, the target-distractor saliency
ratio is lower for oddball detection than for feature search. Figure 15c shows the saliency ranking
for the oddball-detection simulation. Most importantly, the graph does not show a systematic
search slope: the target does pop out regardless of display size. However, as one would expect
by examination of the sample saliency maps in Figure 14, the mean saliency ranking of the target
is somewhat higher in oddball detection than in feature search, predicting that oddball detection
should be slower than feature search. Finding direct evidence of this prediction in the literature is
a challenge, but there is indirect evidence (e.g., Lamy et al., 2006), and the prediction is plausible
considering that feature search is the more narrowly delineated task.
TASC is efficient in oddball-detection tasks primarily because the contrast enhancement stage
amplifies the defining target feature at the target location. Likewise, all non-defining feature values
are suppressed because they are shared with the surround. If the display contains one horizontal bar among
many vertical bars, the contrast weight for the horizontal channel will be high because it is unique,
but the contrast weight for the vertical channel will be low because vertical elements are common.
Contrast among the color channels will be low for the color that is present and irrelevant for all other
colors. Because oddball detection is defined in TASC as the combination of all primitive feature
modules, all elements in the display will have some activity. This occurs because each element will
be active in two modules (the color module for the element’s color and the orientation module
associated with the element’s orientation). However, the activity for the oddball target will be
greatest because the activity in these modules will be roughly proportional to the feature values
after contrast enhancement. Thus the target will usually have the greatest saliency and search will
be independent of the number of distractors.
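This contrast-enhancement logic can be sketched as follows. The weighting rule (one minus feature prevalence) is a made-up stand-in for TASC's actual contrast stage, and the display encoding is idealized.

```python
import numpy as np

def contrast_enhance(features):
    """features: (n_elements, n_channels) binary activities.
    Rare features in the display are amplified, common ones suppressed;
    the weight 1 - prevalence stands in for the model's contrast stage."""
    prevalence = features.mean(axis=0)
    return features * (1.0 - prevalence)

# One horizontal bar (the oddball) among seven vertical bars, all red.
#                    red green horiz vert
display = np.array([[1, 0, 1, 0]] +          # oddball target
                   [[1, 0, 0, 1]] * 7)       # distractors
enhanced = contrast_enhance(display)

# Exogenous saliency: disjunctive (max) combination across the
# primitive-feature modules, whose activity we take as roughly
# proportional to the enhanced feature values.
saliency = enhanced.max(axis=1)
print(saliency.argmax())  # 0: the oddball is most salient
```

Because the weighting depends only on feature prevalence, the oddball wins regardless of how many distractors are added, matching the flat search slope in Figure 15c.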
[Plot in panel (b): Contextual Cueing (ms), from -20 to 100, vs. Epoch (1-6).]
Figure 16: (a) A sample contextual cueing display. (b) A reproduction of the main contextual cueing result from Chun and Jiang (1998) showing the difference in response times for nonpredictive and predictive displays. Responses are significantly faster for predictive displays.
3.2 Contextual Cueing
The previous visual search examples all involve guidance that is primarily driven by the features
of the target. However, experimental evidence has suggested that other factors also contribute to
attentional guidance. In their seminal work on contextual cueing, Chun and Jiang (1998) found
that repeated distractor configurations in a difficult visual search task led to faster response times.
Participants were shown displays like the one in Figure 16a containing a single target—a rotated
letter T of any color—among distractors—L’s at various orientations. The participants’ task was
to report the orientation of the T (pointing to the left or the right). Unbeknownst to participants,
some display configurations (i.e. the spatial layout of the distractors) were repeated over the course
of the experiment. In these predictive displays, the target and distractors appeared in the same
locations, although their orientations and colors could change from trial to trial. After several blocks
of trials, participants responded roughly 60 msec faster to predictive displays than to random or
nonpredictive displays. Figure 16b displays the contextual cueing effect reported in Chun and Jiang
(1998). This figure plots the difference in response time between the nonpredictive and predictive
cases, where a positive value indicates that predictive trials are faster than nonpredictive trials. A
significant difference is present after the first epoch which consists of 5 blocks each with 24 trials—
i.e. 5 presentations of each predictive display. Interestingly, participants in contextual cueing
experiments learn the configurations implicitly and are unable to explicitly differentiate between
predictive and nonpredictive configurations at the end of the experiment.
Contextual cueing indicates that attention in visual search can be guided by more than just the
local feature properties of the target. A complete model of attention must therefore be capable of
exploiting image properties that extend beyond the target. Here we demonstrate that the TASC
model leverages predictive spatial configurations to boost saliency values in relevant locations. As
in the previous simulations, a row of modules is trained for a specific target, which in this case is a
sideways T. This training stage corresponds to the practice trials in a contextual cueing experiment.
The images in the training phase are standard contextual cueing displays except that none of the
configurations are predictive. As in Chun and Jiang (1998), the model is then presented with
30 blocks of 24 images, 12 with predictive displays that repeat from block to block and 12 with
nonpredictive displays that always change. The displays were generated using the same methods
as in Chun and Jiang (1998). Because the contextual cueing effect depends on the sequential
presentation of trials, the association component of each TASC module was modified to perform
online learning while displays are presented in order. This differs from the batch learning used for
other simulations. Two computations occur at each trial: a forward pass in the neural network
yields a saliency map for the current display (used for evaluation) and a backward pass updates
the weights to reflect the current trial. In this way the association components of each module
are constantly changing such that performance on future trials is affected by recent experience.
Sequential predictions are inherently noisier because they are based on averages over fewer
trials than predictions computed across the entire experiment. To
account for this, results were obtained from the average of 5 simulations. As in the previous visual
search experiments, the target ranking in the final combined saliency map is used as a measure of
response latency. Figure 17a shows the difference in ranking between nonpredictive and predictive
trials and exhibits a pattern very similar to Chun and Jiang’s behavioral result in Figure 16b. In this
figure, the 30 blocks of trials are grouped into 6 epochs each with 5 blocks. Most importantly, both
figures show no difference between the two cases at the onset of the experiment and a noticeable
difference beginning in the second epoch. Figure 17b depicts the block by block learning pattern
for predictive and nonpredictive displays. Learning occurs quickly for the predictive displays and
performance is greatly facilitated compared to the minimal improvement for nonpredictive displays.
Figure 17a can be obtained from Figure 17b by averaging across groups of 5 blocks and computing
the difference between cases.
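The per-trial forward/backward computation can be sketched with a linear associator trained by the delta rule. This is a simplified stand-in for the module's association network: the grid size, configuration vector, and learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_loc = 16                            # coarse grid of display locations
W = np.zeros((n_loc, n_loc))          # association weights: configuration -> saliency
lr = 0.1

def run_trial(config, target_loc):
    """One trial: a forward pass yields a saliency map for the current
    display; a backward pass updates the weights to reflect this trial."""
    global W
    saliency = W @ config                       # forward pass
    t = np.zeros(n_loc)
    t[target_loc] = 1.0
    W += lr * np.outer(t - saliency, config)    # backward pass (delta rule)
    return saliency

# A predictive display: the same distractor configuration repeats from
# block to block, with the target always at location 3.
config = rng.random(n_loc)
for block in range(30):
    run_trial(config, 3)

# After repeated exposure, the configuration cues its target location;
# a nonpredictive display, changing every trial, would accumulate no
# such association.
print((W @ config).argmax())  # 3
```

Because learning happens on every trial, performance on future trials is shaped by recent experience, which is what distinguishes this online mode from the batch training used in the other simulations.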
The results presented in Figure 17 are obtained from the final saliency map that represents a
combination across all ranges of influence. But which ranges of influence are most responsible for
[Plots: (a) Mean Rank Difference (0-3) vs. Epoch (1-6); (b) Mean Saliency Rank (1-6) vs. Block (5-30), with separate curves for predictive and nonpredictive displays.]
Figure 17: (a) Contextual cueing simulation results showing the difference in rank between the nonpredictive and predictive displays. Predictive ranks are significantly lower after the first epoch. (b) Block-by-block results showing the mean rank for predictive and nonpredictive displays.
[Plot: Mean Rank (1-5) vs. Range of Influence (short to long), with separate curves for predictive and nonpredictive displays.]
Figure 18: Mean ranks for the 4 ranges of influence in TASC from the contextual cueing simulation. The first 5 blocks are removed from the averages. The two longest ranges of influence demonstrate significant facilitation for displays with predictive contexts.
producing the contextual cueing effect? Intuition implicates longer ranges of influence because short
ranges cannot detect configurations of distractors that extend beyond a local region. Support for
this intuition can be found by examining the saliency maps for each range of influence separately.
Figure 18 presents the average predictive and nonpredictive rank across the whole simulation for
the 4 ranges of influence. The first 5 blocks are removed from each average because significant
learning occurs during those trials. It is clear that the difference in rank increases as the range
of influence grows and that the target rank for predictive displays decreases with longer ranges of
influences. The two longest ranges of influence appear to be responsible for most of the response
time facilitation.
In the context of the TASC control space, contextual cueing, with its dependence on the long
ranges of influence, corresponds to an attentional strategy that primarily exploits the region of
the control space with high task specificity and a global spatial scale. A natural comparison can
be drawn between contextual cueing and scene-based endogenous control; in both cases, attention
is guided by coarse properties of the image. Similarly, a parallel can be made between distrac-
tor configurations and scene gists. However, TASC predicts that contextual cueing is not just a
consequence of the global spatial scale. As suggested by Figure 18, the second longest range of
influence—which covers quadrants of the image—is also capable of producing the effect. This dif-
ference between contextual cueing and scene-based endogenous control suggests another strategy
that can be placed in the TASC control space. Figure 19 adds a point for contextual cueing to
the control space that has relatively high task specificity and an intermediate spatial scale near
the global end of the spectrum. Contextual cueing is placed lower than scene-based endogenous
control on the task specificity axis because the same configuration could provide facilitation for a
variety of target objects. Thus the relationship between a specific target and the contextual cueing
configuration is weaker than the relationship between an object and the scene gist in a real-world
task. In essence, contextual cueing could still have an effect when the task is more loosely de-
fined. The prediction in TASC that contextual cueing draws from an intermediate spatial scale
is supported by a subsequent contextual cueing experiment that observed the contextual cueing
effect even when predictive configurations were constrained to one quadrant of the image (Olson
and Chun, 2002). Additionally, no response time facilitation was found in this experiment when
configurations distant to the target were repeated and local configurations were novel.
[Diagram: the control space, with Spatial Scale (local to global) on the horizontal axis and Task Specificity (low to high) on the vertical axis; points mark feature-based endogenous, exogenous, and scene-based endogenous control, plus contextual cueing and the lower region effect.]
Figure 19: The control space depiction with two points added for contextual cueing and the lower region effect. The experimental results from these two paradigms suggest that individuals employ control strategies located in different regions of the control space than the primary strategies.
Recent research on contextual cueing with real-world displays has led to alternative conclusions
about how global and local contexts affect response times (e.g., Brockmole et al., 2006). Contrary
to the result of Olson and Chun (2002), this research concludes that observers use global contex-
tual cues for attentional guidance far more than local contextual cues. Using simulated real-world
scenes, the authors manipulated objects in the local region of the target and the objects surrounding
that local region, which form the global context. As expected, the greatest response time facilita-
tion occurred when the local and global contexts were both predictive. However, when predictive
local contexts were compared to predictive global contexts, local contexts provided significantly less
facilitation than global contexts. The conflict between these results and those in Olson and Chun
seems to stem from the inherent difference between artificial displays and real-world scenes. The
TASC framework offers a potential resolution to this conflict by allowing for the possibility that
attentional emphasis is placed on different regions of the control space for these two experiments.
The global spatial scale could be of high importance for real-world scenes whereas intermediate
spatial scales might be preferred for less familiar images that lack an understandable global struc-
ture. This viewpoint fits with the concept that modules are used according to their relevance to
the current goals and environment. Experience may build an attentional bias for the global spatial
scale when the viewing environment consists of real-world scenes. This conclusion requires more
rigorous support; however, the notion that contextual guidance is tuned through experience to real-
world environments is partially bolstered by the findings in Brockmole et al. (2006) that predictive
configurations in real-world scenes are learned more quickly and ultimately lead to a greater re-
sponse time facilitation than predictive configurations in artificial search images. Experience with
the contextual properties of real-world scenes evidently results in a greater impact on attentional
guidance. We expect that TASC would produce results consistent with those in Brockmole et al.
(2006) if ranges of influence were selectively chosen based on the visual environment and the goals
of the task.
3.3 Spatial Probability Cues
To assure that distractor configurations are responsible for the effect found in contextual cueing
experiments, the arrangement of displays is controlled such that predictive target locations are
equally as probable as nonpredictive target locations. Motivation for this constraint comes from
findings in Geng and Behrmann (2005) that suggest that spatial probability is another factor that
influences attentional guidance. The authors report response latencies that are faster for targets
in locations that frequently contain a target compared to locations where targets rarely appear.
The implementation of TASC allows the model to naturally exploit spatial probability cues of this
sort when forming saliency predictions. Specifically, the neural networks of each module indirectly
encode the prior probability of a target appearing at each location within their scope (i.e. if a
specific location is frequently active, it will generally yield higher activations than infrequently
active locations). If the same target features appear in separate images at two different locations—
one with high probability and one with low probability—the networks will yield higher activation
for the target at the highly probable location and consequently the target-distractor saliency ratio
will be greater for that target and the response time faster.
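A minimal sketch of how frequency-dependent activation falls out of error-driven learning: the two-location display, one-hot features, and learning rate below are all invented for illustration, not taken from the Geng and Behrmann design or the TASC networks.

```python
import numpy as np

lr = 0.05
w = np.zeros(2)   # per-location association strength (location 0, location 1)

# Identical target features, but the target appears at location 0 on
# 80 trials and at location 1 on only 20 trials.
targets = [0] * 80 + [1] * 20
for loc in targets:
    x = np.zeros(2)
    x[loc] = 1.0                 # feature input at the target's location
    s = w * x                    # forward pass: per-location activation
    w += lr * (x - s) * x        # delta-rule update toward the target

# Each weight follows 1 - (1 - lr)^k, where k counts how often that
# location has contained the target; the frequent location responds more.
print(w.round(3))  # [0.983 0.642]
```

Given identical target features at a high-probability and a low-probability location, the high-probability location yields the higher activation, hence the greater target-distractor saliency ratio and the faster predicted response.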
Geng and Behrmann also find that response times associated with targets in low probability
locations are slower when a distractor appears in a highly probable location as opposed to when
the highly probable locations are empty. TASC would also produce this effect because the module
networks have weak discrimination abilities. If a network regularly gives a response to a high
probability region, then a distractor in that location would likely produce more activity than if it
were in a low probability location, thus hindering the ranking of the actual target.
The spatial probability cues in Geng and Behrmann are learned over a relatively short
time frame (within one group of trials). In contrast, spatial probability cues in Senders (1964) are
learned over the span of days and weeks. TASC is capable of learning at both of these time scales.
If modified to process trials sequentially, as in the contextual cueing simulation, the association
networks would show a short term preference for locations that more recently contained targets.
Additionally, if networks were exposed to regular patterns of information over a longer period of
time, the model would acquire a bias for frequently relevant locations. This behavior is a direct
consequence of the claim embedded in TASC that attention is fundamentally experience-based and
we elaborate on this point in the discussion.
3.4 Figure-ground Assignment
Expanding upon the long term learning of spatial probability cues, one might expect a spatial bias
for locations found to be important based on a lifetime of experience. Vecera et al. (2002) find
evidence for such a bias in what they refer to as the lower-region effect in figure-ground assignment.
The authors observed that for displays such as the blocky images in Figure 20, viewers tended to
perceive the lower region as the foreground figure regardless of the arrangement of colors. This
result can be interpreted as an attentional bias for the lower region—i.e. people attend the lower
region first and thus consider it the foreground figure. An intuitive explanation for this effect can
be drawn from the observation that in real-world situations, objects of interest are more often found
in the lower region of our field of view (things are on the ground more often than they are in the
sky).
The TASC framework offers an informative interpretation of the lower-region effect. Specifically,
this effect can be viewed as exogenous control that operates at a global spatial scale and has low
task specificity—i.e., it is based on the overall scene properties and requires no specific target of
search. This strategy is markedly different from scene-based endogenous control in that it operates
with low task specificity. We have indicated this new point in the control space in Figure 19. Lower
region, occupying the fourth corner of the control space, is a natural complement to the three
primary control strategies described previously. To demonstrate that TASC obtains the lower-
region effect, an instantiation with 11 object module rows was trained on a corpus of real-world
images using the following objects: car, person, tree, window, bike, light, building, sidewalk, road,
head, and sign. The model was then exposed to displays (shown in Figure 20) similar to those in
Vecera et al. (2002) with noise added to increase the amount of feature information in the lower
Figure 20: Sample displays from Vecera et al. (2002) and the associated outputs from TASC operating at the longest range of influence and with low task specificity, combining across 11 object modules trained on real-world images. A bias is shown for the lower region in both displays.
and upper regions. The saliency maps in Figure 20 are a combination across all object modules
using only the longest range of influence. Because the displays bear no resemblance to real-world
images, the activity patterns in the saliency maps are difficult to interpret. However, TASC does
show a clear bias for attending the lower region of the image regardless of the color layout—i.e.
there is more saliency activity in the bottom half of both maps. These saliency maps are obtained
from a limited instantiation of TASC with only 11 objects and thus one can justifiably be cautious
in making strong claims about this result. Nevertheless, the result is an emergent consequence of
training TASC solely on real-world scenes.
Throughout this paper, the term exogenous control has been used to refer to bottom-up atten-
tional processing at a local spatial scale. This usage is consistent with most of the literature on
attention. In the lower-region effect, however, we recognize a different form of exogenous behavior
that operates at a global spatial scale. From the perspective of the TASC framework, exogenous
attention—defined to be any attention that is driven primarily by factors outside the individual—
should correspond to the entire bottom region in the control space—i.e. low task specificity across
all spatial scales. This new conceptualization of exogenous attention may prove useful in evalu-
ating attentional behavior in unstructured viewing situations. We embrace this new perspective;
however, throughout the rest of the paper, we continue to use the term exogenous control with its
traditional meaning.
3.5 Real-world Images
The simulations in the previous sections demonstrate that TASC yields results consistent with the
findings in a variety of attentional experiments with artificial images. This consistency provides
a strong argument for the theory behind TASC. However, any complete model of visual attention
must also be tested on real-world images; after all, the ultimate goal of this inquiry is to understand
how people allocate attention in real-life situations.
In much of the attention literature, results for real-world images lack quantitative analysis,
offering instead only weak qualitative assessments. Torralba et al. (2006) provide a more rigorous
analysis in which saliency predictions from the TOCH model are compared directly to human eye-
movements recorded for the same set of images. The experiment in Torralba et al. (2006) involved
three different search tasks over a set of images that contained indoor and outdoor scenes. For
each target object—person, mug, and painting—an ordered sequence of fixation locations for each
image was collected from 8 participants. Saliency maps for each image were also obtained from
a version of TOCH tuned to the relevant target object. The model’s performance for each image
was measured by computing the percent of participant fixations within the most active regions of
the saliency map. These highly active regions contain all locations with a saliency value among
the top 20 percent of values and thus cover 20 percent of the image’s area. The saliency maps
of TOCH were compared to saliency maps representing cross-participant consistency and saliency
maps obtained from a simple bottom-up model similar to the model of Itti and Koch (2000). For
each image and participant, the cross-participant saliency map was constructed using a mixture of
Gaussians, one for every fixation location of every participant except the current participant. Cross-
participant consistency was then measured by computing the percent of the current participant’s
fixations located within the most active regions in this constructed saliency map. Cross-participant
consistency proved to be high and thus provided a useful upper limit for model comparison.
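For concreteness, this thresholded-region measure can be sketched in a few lines of Python. The function name and the toy saliency map below are our own illustration, not code from TOCH, TASC, or the original evaluation:

```python
import numpy as np

def percent_fixations_in_top_region(saliency, fixations, coverage=0.2):
    """Percent of fixations falling in the most salient `coverage` of the image.

    saliency  : 2-D array of saliency values.
    fixations : list of (row, col) fixation coordinates.
    """
    # Threshold at the (1 - coverage) quantile, so the super-threshold
    # region covers `coverage` of the image area.
    threshold = np.quantile(saliency, 1.0 - coverage)
    mask = saliency >= threshold
    hits = sum(mask[r, c] for r, c in fixations)
    return 100.0 * hits / len(fixations)

# Toy example: a saliency map with a uniformly high upper-left block.
sal = np.zeros((10, 10))
sal[:5, :5] = 1.0                           # high-saliency block (25 cells)
fixs = [(1, 1), (2, 3), (8, 8), (4, 4)]     # three of four fall in the block
print(percent_fixations_in_top_region(sal, fixs))  # 75.0
```

With `coverage=0.2`, the thresholded region matches the "top 20 percent of values" criterion used in the comparison above.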
To assess TASC’s performance on real-world images, we tested the model on the same set of
images used by Torralba et al. (2006). Other state-of-the-art models (e.g., Kanan et al., 2009)
have reported similar or better performance than TOCH on this dataset, but for simplicity, we
assess TASC's performance in the context of the results presented in Torralba et al. (2006). A
TASC instantiation with 3 object module rows was trained on the image corpus with the set of
test images removed. Saliency maps were generated from the object module row relevant to the
specific task and performance was measured using the method described above. Figure 21 presents
the performance of TASC alongside results reproduced from Torralba et al. (2006). The results are
divided by target object and by whether the target was present or absent. Each bar graph is further
divided along the abscissa by fixation number. Thus the light blue bar corresponding to fixation 1
in the upper left graph indicates that, for the person search task with target-present images, on
average about 65% of participants' first fixations fell within the top 20 percent
of the TASC saliency map for that image. Across all search cases and fixation numbers, TASC’s
performance is on average between the TOCH model and the simple bottom-up model (BU). It is
important to note that the TASC instantiation used here was not optimized for these images and
search tasks. Instead, the model parameters were carried over from the visual search simulations
with artificial images. Because characteristics of real-world images are quite different from those
of artificial search images, it seems likely that image processing parameters selected for synthetic
images would be sub-optimal for real-world images—e.g. the sizes used for the center and surround
regions in the contrast enhancement stage. Additionally, performance is limited in this simulation
by the naive assumption that for a given task, all ranges of influence are equally relevant. This
latter assumption is likely responsible for TASC’s relatively poor performance in the mug search.
This relationship can be understood by noting that the performance of the BU model is closest to
that of TOCH in the mug search task. Because TOCH consists of a bottom-up model filtered by global contextual
guidance, the similarity between TOCH and BU in this task suggests that mug locations are only
weakly predicted by global scene properties. Hence, the bottom-up component of TOCH, which
operates on local feature information, is responsible for most of the model’s predictive power. In
the naive version of TASC, however, all ranges of influence are combined equally thus causing a
dilution of the shortest range of influence by the longer ranges. In fact, the saliency maps obtained
from the shortest range of influence alone perform similarly to TOCH and BU in the mug search
task. Despite these limiting factors, the present result demonstrates that TASC’s performance for
real-world images is similar to other major models.
The results for the first three fixations in Figure 21 are obtained using the same saliency map.
But is there any relationship between successive fixations that might motivate the use of a different
saliency map for each fixation? Analysis of the eye-movement data in Torralba et al. (2006) reveals
that participants tend to saccade to locations in close proximity to the previous fixation location. Of
course some fixations select a completely new region, but these are less common. This observation
conflicts with the standard inhibition-of-return explanation used to generate fixation sequences in
other attentional models, including Itti and Koch (2000). The predictive quality of the last fixation
can be assessed by building a saliency map with a Gaussian centered around the last fixation
[Figure 21 panels: Person Present, Person Absent, Mug Present, Mug Absent, Painting Present, Painting Absent; abscissa: fixation number (1–3); ordinate: percent of fixations in thresholded region (20%); bars: Human, TASC, TOCH, BU.]
Figure 21: TASC performance on the eye-movement dataset used in Torralba et al. (2006). TASC is placed alongside TOCH, a bottom-up model (BU), and cross-participant consistency. The measure is the percent of human eye movements contained in the top 20% of the saliency map. The data are divided into the first three fixations.
[Figure 22: ordinate: percent of fixations; abscissa: fixation number (1–3); bars: TASC with last fixation, TASC, Just last fixation, TOCH.]
Figure 22: A model predicting the current fixation location based on the last fixation location performs comparably to major attention models. TASC shows an improvement when previous fixation information is included, suggesting that both models contain valuable information. The results here represent an average over the 3 tasks and target present/absent cases in the previous figure.
location. Performance for the current fixation can be measured by thresholding this last fixation
map in the same manner described above. The magenta bars in Figure 22 display the percent of
fixations that are in close proximity to the previous fixation—i.e. the percent of fixations that fall
within a circle covering 20 percent of the image and centered around the last fixation location.
For the first fixation, the saliency map is constructed using one centrally located Gaussian. The
results presented here represent an average across the three tasks and the present/absent cases.
Interestingly, the previous fixation alone proves to be as good a predictor of the current fixation
as other attentional models including TASC and TOCH. The figure also shows that performance
is improved when the TASC saliency map is averaged with the last fixation map, suggesting that
both elements provide distinctly useful information for predicting attentional shifts.
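The proximity check behind the magenta bars can be sketched as follows, under the assumption that "close proximity" means falling inside a circle whose area is 20 percent of the image; the function name and coordinates are hypothetical:

```python
import math

def within_last_fixation_circle(prev_fix, cur_fix, img_h, img_w, coverage=0.2):
    """True if cur_fix lies inside a circle centered on prev_fix whose
    area is `coverage` of the image (the 'last fixation' predictor)."""
    # Solve pi * r^2 = coverage * area for the circle radius.
    radius = math.sqrt(coverage * img_h * img_w / math.pi)
    dy, dx = cur_fix[0] - prev_fix[0], cur_fix[1] - prev_fix[1]
    return math.hypot(dy, dx) <= radius

# On a 100x100 image, the circle covering 20% of the area has radius ~25.2.
print(within_last_fixation_circle((50, 50), (60, 60), 100, 100))  # True
print(within_last_fixation_circle((50, 50), (10, 10), 100, 100))  # False
```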
The relationship between successive fixation locations is consistent with any model that encodes
the effort required to change the locus of attention, where shorter saccades are preferred because
they require less effort. The SEEV model proposed by Wickens (2007) explicitly incorporates this
effort component. SEEV is a utility-based framework that combines task-independent influences
with knowledge driven components. Specifically, SEEV models the probability of attending a
location as a linear combination of four factors: Salience, Effort, Expectancy, and Value. The
salience component corresponds to bottom-up attention as defined in models such as Itti and Koch
(2000). Effort represents the cost of attending a new location—i.e. the physical cost of moving
the eyes, head, or whole body—and has a negative influence on saliency. Expectancy is a top-
47
down component that represents the likelihood that a specific region contains information. The
value component models the importance of attending a specific region based on the gains (losses)
associated with fixating (not fixating) that region. As in TASC, experience plays a fundamental
role in tuning the expectancy and value components of SEEV. A utility-based theory such as SEEV
is compatible with the TASC framework, and combining the two would yield a more powerful
model that could capture results such as the relationship between successive fixations.
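The SEEV combination rule can be illustrated schematically. The weights and component values below are invented for illustration, not fitted SEEV parameters, and the score is left unnormalized rather than converted to a probability:

```python
def seev_score(salience, effort, expectancy, value,
               w_s=1.0, w_ef=1.0, w_ex=1.0, w_v=1.0):
    """Unnormalized attractiveness of a location under a SEEV-style linear
    combination; effort enters negatively since longer shifts cost more."""
    return w_s * salience - w_ef * effort + w_ex * expectancy + w_v * value

# A near, salient, informative location beats a distant, low-value one.
near = seev_score(salience=0.8, effort=0.1, expectancy=0.7, value=0.5)
far  = seev_score(salience=0.9, effort=0.9, expectancy=0.2, value=0.3)
print(near > far)  # True
```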
4 Discussion
We have introduced a unified framework for addressing a broad range of phenomena pertaining
to attentional control. This framework, instantiated in the TASC model, describes human visual
search in both naturalistic scenes and in artificial displays consisting of simple elements varying
in color and shape. Our goal in this work was to develop TASC to the point that it serves as an
existence proof of an attention theory that is based fundamentally on the role of experience, a theory
that makes contact with diverse phenomena and can serve as an integrative framework. TASC
borrows from many other theories of attention, and in so doing, it serves as a synthesis of existing
theories and as a means of characterizing the relationships among the theories. TASC follows a clear
trend in the literature toward models that are comprehensive in scope and incorporate multiple
attentional control strategies (e.g., Kanan et al., 2009; Navalpakkam and Itti, 2005; Torralba et al.,
2006; Zhang et al., 2008). TASC is perhaps distinguished by the fact that its distinct control
strategies are implemented by the same underlying mechanisms. Through simulations,
we showed that exogenous control, feature-based endogenous control, and scene-based endogenous
control could all emerge from the same underlying processing machinery.
Embedded in the TASC framework is a key proposition: attentional control is fundamentally
experience based. TASC consists of a collection of modules, each of which is specialized for
detecting a particular entity and is assumed to arise from experience with the world and to
adapt continually to the ongoing stream of experience. These entities can be what are traditionally
referred to as features, such as the color red, or simple shapes, such as a square, or complex,
hierarchical, articulated objects, such as a person. We’ve loosely referred to these entities as
objects, although this usage stretches the canonical notion of objects in vision. The important
point is that these objects are functionally defined—they are the goals of visual search, goals that
have been rewarded and learned in the past.
According to TASC, attentional control varies in the degree of task (object) specificity via the
combination of object modules. By this conception, what is typically considered top-down control
occurs when one or a small number of object modules contribute to saliency; and what is typically
considered bottom-up control occurs when all object modules contribute to saliency. Instead of
considering exogenous control as a search with no target in mind, TASC conceptualizes it as a
search for all possible objects. Consequently, TASC rejects the notion of pure bottom-up saliency,
instead favoring the hypothesis that control always operates in a top-down manner, tuned through
experience.
Failure of attentional control to exclude task-irrelevant signals can be attributed to limitations
on the degree of task specificity that can be achieved. On grounds of survival, it may be adaptive for
an organism not to become so focused on the task at hand that all extraneous warning signals from
the environment are suppressed. As a result, we suppose that critical object modules contribute to
saliency even when not pertinent to current goals. For example, the presence of a human face may
be important to many different goals, including basic survival in a community. Thus, attentional
capture can be considered not as a failure of attentional control to stay on task, but as the inability
of attentional control to narrow to a single task.
4.1 The role of experience
Many recent theories of attention have begun to focus on the role of experience in attentional
control. TASC is motivated in part by TOCH, whose contextual component, presented in Torralba
(2001, 2003), encodes and exploits correlations between scene gist and the locations of specific
objects. Rao et al. (2002) also implement a model for visual search in which saliency is based on
how well local visual features match a specific object template. Similarly, the SAIM model of Heinke
and Humphreys (1997, 2003) incorporates object templates to determine a focus of attention via
constraint-satisfaction dynamics.
These previous theories assume that experience leads to the learning of specific object models.
Recent Bayesian theories instead assume that experience results in learning of statistical properties
of the environment, which might steer attention to locations containing surprising visual informa-
tion, or might optimize attention to situations that are likely to occur in the future. While the
various Bayesian theories agree in claiming that attention is modulated by statistics of the envi-
ronment, they differ in terms of the time period over which statistics are collected. Some theories
rely on lifelong history (e.g., Kanan et al., 2009; Torralba, 2003; Zhang et al., 2008), some theories
rely on the recent history of experience, and suggest trial-to-trial modulations of attention (e.g.,
Itti and Baldi, 2009; Mozer and Baldwin, 2008; Mozer et al., 2006; Yu and Dayan, 2005), and
finally, some theories assume that experience doesn’t extend beyond the statistics of the current
image (e.g. Bruce and Tsotsos, 2009; Torralba et al., 2006, bottom-up component). Like all of these
theories, TASC can be cast in a probabilistic framework, as described in an earlier section of the
paper. TASC has the virtue of spanning a range of time scales of adaptation. Some simulations
we presented rely on experience on the time scale of years (e.g., attention in real-world scenes,
lower region bias in figure-ground assignment); other simulations rely on experience on the time
scale of minutes (e.g., contextual cueing). Regardless of how TASC plays out compared to other
probabilistic theories, we view TASC as the culmination of a shift in the field, from the assumption
of hardwired, fixed mechanisms of attention to the view that every aspect of attentional control
relies on adaptation to the ongoing stream of experience.
4.2 Adaptation of attentional control
Up to this point, we've used TASC to explain adaptation in a way focused on how object modules
arise from experience—i.e. how the neural nets change their associative weights to encode experience.
However, TASC suggests a complementary sort of adaptation, not yet discussed, which
occurs when the set of modules relevant to a particular goal changes. The TASC architecture
assumes that for a specific goal, saliency predictions are obtained by taking a disjunction across
a subset of modules that correspond to the objects and ranges of influence related to the goal.
The predictions obtained by TASC are thus dependent on the particular subset of modules. In the
simulations presented in this work, the subset of modules relevant to a particular search experiment
was selected in advance and remained fixed throughout the simulation. If instead this subset
were to change throughout the course of an experiment, attentional behavior would also change.
Modification of the set of relevant modules could serve as a form of adaptation of attentional control.
A natural extension to the approach in the current implementation, which gives all selected modules
equal weight, is to combine across all modules, giving each one a weight based on its relevance to
the goal. Under this approach, adaptation would occur if the module weights associated with each
goal were constantly revised to reflect experience.
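This weighted combination might be sketched as follows; `combine_modules` and the toy maps are our own illustration rather than the TASC implementation, and equal weighting recovers the scheme used in the simulations:

```python
import numpy as np

def combine_modules(module_maps, weights=None):
    """Combine per-object saliency maps into a single map.

    module_maps : array of shape (n_modules, H, W).
    weights     : optional per-module relevance weights; equal weighting
                  when omitted.
    """
    module_maps = np.asarray(module_maps, dtype=float)
    if weights is None:
        weights = np.ones(len(module_maps))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize relevance weights
    return np.tensordot(weights, module_maps, axes=1)

maps = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 3.0)])
combined_equal = combine_modules(maps)                  # every cell is 2.0
combined_biased = combine_modules(maps, weights=[3, 1])  # favors module 0: 1.5
```

Under this scheme, adaptation amounts to revising the weight vector associated with each goal as experience accumulates.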
4.3 The relation of attention and object recognition
Under the conceptualization of saliency proposed by TASC, the goal of attention is to identify
locations in the image containing objects of interest. So what distinguishes TASC from full-blown
object recognition models? The primary difference is that TASC is computationally limited by
the dimensionality reduction stage and the restrictions placed on the neural nets. Another point
that sets TASC apart from most object recognition models is that processing occurs in parallel
across the entire visual field without any recurrent loops. These differences are important because
computational simplicity and efficiency should go hand-in-hand with speed in the brain, and a quick
response is essential if saliency is to provide useful guidance for more detailed, computationally
intensive object recognition and scene interpretation processes.
In object recognition models, it is common to exploit as much of the feature data as possible,
even in the face of significant additional computational requirements. In essence, TASC employs a
contrasting strategy: discard as much feature data as possible while still preserving at least some
of the predictive qualities. Extraneous feature data is removed in the sub-sampling and principal
components analysis (PCA) stages of processing. PCA is often used by object recognition models to
organize the feature data in a more compact form without losing its expressive power. The difference
between PCA in TASC and other object recognition models is the number of principal components
preserved. Whereas object recognition models typically keep enough principal components such
that little is lost from the original data, TASC uses only a few principal components, thus extracting
just the primary trends of the data. The choice between different levels of dimensionality reduction
ultimately results in a trade-off between efficiency and precision of predictions. The design of
TASC is based on the assumption that attention sacrifices some precision in order to provide faster
guidance.
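The contrast can be made concrete with a small sketch of PCA-style reduction that keeps only a few components; the random data and the choice of k = 4 are illustrative, not TASC's actual parameters:

```python
import numpy as np

def project_top_k(X, k):
    """Project rows of X onto their top-k principal components.

    A heavy reduction (small k) keeps only the primary trends of the
    feature data, trading reconstruction precision for speed.
    """
    Xc = X - X.mean(axis=0)                      # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # k-dimensional codes

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
codes = project_top_k(X, k=4)                    # 50-D features -> 4-D codes
print(codes.shape)  # (200, 4)
```

A recognition model in this sketch would simply choose k large enough to preserve nearly all the variance; TASC's choice of a small k embodies the efficiency-for-precision trade-off described above.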
The linearity of the associative neural nets in TASC also limits the model’s computational
complexity. The output of each network is obtained via a rank-limited linear mapping of the
dimensionality-reduced feature data, implemented using a hidden layer with few units. The mapping
of the input units onto the hidden units can be viewed as an additional dimensionality reduction
step. The difference between this reduction and the PCA reduction is that the mapping is cus-
tomized for each object—i.e. PCA selects features that are generally informative across all images
whereas this mapping selects features that are important for a specific object. The features encoded
by the hidden units are then combined linearly to generate output activations. The linearity of the
system is broken when the output values are scaled by the sigmoid function. However, because the
sigmoid function is monotonic, the relative ranking of output units is preserved.
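A schematic version of such a rank-limited network, with illustrative random weights, shows that the sigmoid leaves the ranking of output units unchanged:

```python
import numpy as np

def rank_limited_net(x, W_in, W_out):
    """Linear map through a small hidden layer (rank limited by its size),
    followed by an elementwise sigmoid on the outputs."""
    hidden = x @ W_in                    # object-specific reduction
    out = hidden @ W_out                 # linear combination of hidden features
    return 1.0 / (1.0 + np.exp(-out))    # monotonic squashing

rng = np.random.default_rng(1)
W_in, W_out = rng.normal(size=(8, 3)), rng.normal(size=(3, 5))
x = rng.normal(size=(8,))
linear = x @ W_in @ W_out
squashed = rank_limited_net(x, W_in, W_out)
# The sigmoid is monotonic, so the ranking of output units is preserved.
print(np.array_equal(np.argsort(linear), np.argsort(squashed)))  # True
```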
Through this one-pass feedforward processing stream, the TASC modules estimate the proba-
bility that a specific object is present at each location given the surrounding visual features. These
estimates are computed quickly at the expense of high levels of precision, thus providing a coarse
form of object recognition or detection. This key property of the TASC theory implies that the
attentional system, in its essence, always performs a crude sort of object recognition, even when
the viewing goals do not involve specific objects. Objects are coarsely recognized from local feature
properties, global scene properties, and intermediate regional feature properties. Predictions from
these rudimentary object detection components are brought together to yield saliency predictions
customized for specific viewing situations. The details of these object components, the combina-
tion rules across them, and even the object definitions themselves are all constantly tuned through
experience to yield an adaptive attentional system capable of providing useful guidance across a
wide range of search goals and visual environments.
5 Acknowledgments
This research was supported by NSF BCS 0339103, NSF CSE-SMA 0509521, and NSF BCS-
0720375.
References
E. Averbach and A. Coriell. Short-term memory in vision. Bell Systems Technical Journal, 40:309–328, 1961.
W. Bacon and H. Egeth. Overriding stimulus-driven attentional capture. Perception and Psychophysics, 55:485–496, 1994.
I. Biederman. Perceiving real-world scenes. Science, 177:77–80, 1972.
J. Brockmole, M. Castelhano, and J. Henderson. Contextual cueing in naturalistic scenes: Global and local contexts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32:699–706, 2006.
J. Brockmole, D. Hambrick, D. Windisch, and J. Henderson. The role of meaning in contextual cueing: Evidence from chess expertise. Quarterly Journal of Experimental Psychology, 2008.
N. Bruce and J. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3):5, 1–24, 2009.
M. Cerf, E. P. Frady, and C. Koch. Using semantic content as cues for better scanpath prediction. Proceedings of the Symposium on Eye Tracking Research & Applications, pages 143–146, 2008a.
M. Cerf, J. Harel, W. Einhaeuser, and C. Koch. Predicting human gaze using low-level saliency combined with face detection. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008b.
M. Chun and Y. Jiang. Contextual cueing: Implicit learning and memory of visual context guides spatial attention. Cognitive Psychology, 36:28–71, 1998.
C. Folk, R. Remington, and J. Johnston. Involuntary covert orienting is contingent on attentional control settings. Journal of Experimental Psychology: Human Perception and Performance, 18:1030–1044, 1992.
D. Gao, V. Mahadevan, and N. Vasconcelos. On the plausibility of the discriminant center-surround hypothesis for visual saliency. Journal of Vision, 8(7):13, 1–18, 2008.
T. Gawne and J. Martin. Responses of primate visual cortical V4 neurons to simultaneously presented stimuli. Journal of Neurophysiology, 88:1128–1135, 2002.
J. Geng and M. Behrmann. Spatial probability as an attentional cue in visual search. Perception & Psychophysics, 67(7):1252–1268, 2005.
D. Heinke and G. Humphreys. SAIM: A model of visual attention and neglect. In International Conference on Artificial Neural Networks, pages 913–918, 1997.
D. Heinke and G. Humphreys. Attention, spatial representation and visual neglect: Simulating emergent attention and spatial memory in the selective attention for identification model (SAIM). Psychological Review, 110:29–87, 2003.
L. Itti and P. Baldi. Bayesian surprise attracts human attention. Vision Research, 49(10):1295–1306, 2009.
L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10-12):1489–1506, May 2000.
L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
B. Julesz. Toward an axiomatic theory of preattentive vision. In G. M. Edelman, W. E. Gall, and W. M. Cowan, editors, Dynamic Aspects of Neocortical Function, pages 595–612. Neurosciences Research Foundation, New York, 1984.
C. Kanan, M. Tong, L. Zhang, and G. Cottrell. SUN: Top-down saliency using natural statistics. Visual Cognition, 17(6-7):979–1003, 2009.
C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neuronal circuitry. Human Neurobiology, 4:219–227, 1985.
I. Lampl, D. Ferster, T. Poggio, and M. Riesenhuber. Intracellular measurements of spatial integration and the MAX operation in complex cells of the cat primary visual cortex. Journal of Neurophysiology, 92:2704–2713, 2004.
D. Lamy, T. Carmel, H. Egeth, and A. Leber. Effects of search mode and intertrial priming on singleton search. Perception and Psychophysics, 68:919–932, 2006.
M. Mozer. The Perception of Multiple Objects: A Connectionist Approach. MIT Press, Cambridge, MA, 1991.
M. Mozer and D. Baldwin. Experience-guided search: A theory of attentional control. In J. Platt, D. Koller, and Y. Singer, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
M. Mozer, M. Shettel, and S. Vecera. Control of visual attention: A rational account. In Y. Weiss, B. Schoelkopf, and J. Platt, editors, Neural Information Processing Systems 18, pages 923–930. MIT Press, Cambridge, MA, 2006.
V. Navalpakkam and L. Itti. Modeling the influence of task on attention. Vision Research, 45:205–231, 2005.
M. Neider and G. Zelinsky. Scene context guides eye movements during visual search. Vision Research, 46:614–621, 2006.
U. Neisser. Cognitive Psychology. Appleton-Century-Crofts, New York, 1967.
A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
I. Olson and M. Chun. Perceptual constraints on implicit learning of spatial context. Visual Cognition, 9:273–302, 2002.
M. Peterson and E. Skow-Grant. Memory and learning in figure-ground perception. In B. Ross and D. Irwin, editors, Cognitive Vision: Psychology of Learning and Motivation, volume 42, pages 1–34. 2004.
M. Posner and Y. Cohen. Components of visual orienting. In H. Bouma and D. G. Bouwhuis, editors, Attention and Performance X, pages 531–556. Erlbaum, Hillsdale, NJ, 1984.
R. Rao, G. Zelinsky, M. Hayhoe, and D. Ballard. Eye movements in iconic visual search. Vision Research, 42(11):1447–1463, 2002.
M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2:1019–1025, 1999.
B. Russell, A. Torralba, and K. Murphy. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, May 2008.
J. Senders. The human operator as a monitor and controller of multidegree of freedom systems. IEEE Transactions on Human Factors in Electronics, 5:2–6, 1964.
C. Siagian and L. Itti. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):300–312, 2007.
E. Simoncelli and W. Freeman. The steerable pyramid: A flexible architecture for multi-scale derivative computation. In 2nd Annual International Conference on Image Processing, 1995.
A. Torralba. Contextual modulation of target saliency. Advances in Neural Information Processing Systems, 14:1303–1310, 2001.
A. Torralba. Modeling global scene factors in attention. Journal of the Optical Society of America A, 20(7):1407–1418, 2003.
A. Torralba, A. Oliva, M. Castelhano, and J. Henderson. Contextual guidance of eye movements and attention in real-world scenes: The role of global features on object search. Psychological Review, 113(4):766–786, October 2006.
A. Treisman. Perceptual grouping and attention in visual search for features and objects. Journal of Experimental Psychology: Human Perception and Performance, 8:194–214, 1982.
A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12:97–136, 1980.
S. Vecera, E. Vogel, and G. Woodman. Lower region: A new cue for figure-ground assignment. Journal of Experimental Psychology: General, 131:194–205, 2002.
C. Wickens. Attention to attention and its applications: A concluding view. In A. Kramer, D. Wiegmann, and A. Kirlik, editors, Attention: From Theory to Practice. Oxford University Press, New York, 2007.
J. Wolfe. Guided search 2.0: A revised model of visual search. Psychonomic Bulletin and Review, 1:202–238, 1994.
J. Wolfe. Visual memory: What do you know about what you saw? Current Biology, 8:R303–R304, 1998a.
J. Wolfe. What can 1,000,000 trials tell us about visual search? Psychological Science, 9(1):33–39, 1998b.
J. Wolfe. Guided search 4.0: Current progress with a model of visual search. In W. Gray, editor, Integrated Models of Cognitive Systems, pages 99–119. Oxford, New York, 2007.
J. Wolfe and T. Horowitz. What attributes guide the deployment of visual attention and how do they do it? Nature Reviews Neuroscience, 5:1–7, 2004.
J. Wolfe, K. Cave, and S. Franzel. Guided search: An alternative to the feature integration model for visual search. Journal of Experimental Psychology: Human Perception and Performance, 15:419–433, 1989.
A. Yu and P. Dayan. Inference, attention, and decision in a Bayesian neural architecture. In Advances in Neural Information Processing Systems, volume 17, pages 1577–1584, 2005.
L. Zhang, M. Tong, T. Marks, H. Shan, and G. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32, 1–20, 2008.
L. Zhaoping and R. Snowden. A theory of a saliency map in primary visual cortex (V1) tested by psychophysics of colour-orientation interference in texture segmentation. Visual Cognition, 14:911–933, 2006.