One-Bit Radar Imaging Via Adaptive Binary Iterative Hard ...
Iterative and Adaptive Sampling with Spatial Attention for Black … · novel iterative and...
Transcript of Iterative and Adaptive Sampling with Spatial Attention for Black … · novel iterative and...
Iterative and Adaptive Sampling with Spatial Attention
for Black-Box Model Explanations
Bhavan Vasu∗
Kitware Inc.
Clifton Park, NY, USA 12065
Chengjiang Long∗
Kitware Inc.
Clifton Park, NY, USA 12065
Abstract
Deep neural networks have achieved great success in
many real-world applications, yet it remains unclear and
difficult to explain their decision-making process to an end-
user. In this paper, we address the explainable AI prob-
lem for deep neural networks with our proposed frame-
work, named IASSA, which generates an importance map
indicating how salient each pixel is for the models predic-
tion with an iterative and adaptive sampling module. We
employ an affinity matrix calculated on multi-level deep
learning features to explore long-range pixel-to-pixel cor-
relation, which can shift the saliency values guided by
our long-range and parameter-free spatial attention mod-
ule. Extensive experiments on the MS-COCO dataset show
that the proposed approach matches or exceeds the per-
formance of state-of-the-art black-box explanation meth-
ods. Our source code is available at https://github.
com/vbhavank/IASSA-Saliency .
1. Introduction
It is still unclear how a specific deep neural network
works, how certain it is about the decision making, etc, al-
though deep networks have achieved remarkable success in
multiple applications such as object recognition [42, 51, 9,
18, 16, 19, 17, 10, 38], object detection [21, 5, 30], image
labeling [15, 8], media forensics [33, 20, 14], medical diag-
nosis [43, 44], and autonomous driving [23, 4, 22]. How-
ever, due to the importance of explanation in understand-
ing and building trust in cognitive psychology and philoso-
phy [12, 13, 28, 45, 31], it is very critical to make the deep
neural networks more explainable and trustable, especially
to ensure that the decision-making mechanism is transpar-
ent and easily interpretable. Therefore, the problem of Ex-
plainable AI, i.e., providing explanations for an intelligent
models decision, especially in explaining classification de-
∗Equal contributions. This work was supervised by Chengjiang Long.
Figure 1. Visual comparison between the proposed method IASSA
and two state-of-the-art black-box explanation algorithms, i.e.,
LIME [26] and RISE [27], for the importance of producing ex-
planations.
cisions made by deep neural networks on natural images, at-
tracts much attention in artificial intelligence research [34].
Rather than explainable solutions [35, 50, 40, 29, 37] to
certain white-box models via calculating importance based
on the information like the network’s weights and gradients.
We advocate a more general explainable approach to pro-
duce a saliency map for an arbitrary network that is treated
as a black-box model, without requiring details about the
architecture and implementation. Such a saliency map can
show how important each image pixel is for the networks
prediction.
Recently, multiple explainable approaches have been
proposed for black-box models. LIME [26, 1] proposes
to draw random samples around the instance for an ex-
planation by fitting an approximate linear decision model.
However, such a superpixel based saliency method may not
group correct regions. RISE [27] explores the black-box
model by sub-sampling the input image via random masks
and generates the final importance map by a linear combi-
nation of the random binary masks. Although this is seem-
ingly simple yet surprisingly powerful approach for black-
box models, the results are still far from perfect, especially
2960
in complex scenes.
In this paper, inspired by RISE [27], we propose a
novel iterative and adaptive sampling with spatial attention
(IASSA) for explaining black-box models. We do not ac-
cess weights or gradients of the underlying model. We only
sample the input image randomly using a sliding window
during the initialization stage. And then an iterative and
adaptive sampling module is designed to generate sampling
masks for the next iteration, based on the adjusted attention
map which is obtained with the saliency map at the current
iteration and the long-range and parameter-free spatial at-
tention. Such an iterative procedure continues until conver-
gence i.e. no substantial change in the final saliency map.
The visual comparison of explanations obtained with LIME
and RISE is shown in Figure 1.
Regarding the long-range and parameter-free spatial at-
tention module, we apply a pre-trained model trained on
the large-scale ImageNet dataset to extract features for the
input image. Note that we combine multi-level contextual
features to better represent the image. Then we calculate an
affinity matrix and apply a softmax function to get spatial
attention. Since the affinity matrix covers pixel-to-pixel cor-
relations irrespective of them being local neighbors or not,
our attention covers long-range inter-dependencies. Also,
no parameters are required to be learned in this procedure.
Such a long-range and parameter-free spatial attention can
guide the saliency values in the obtained saliency map to-
wards highly correlated pixels. This can be very helpful to
identify sampling regions for adaptive sampling in the next
iteration.
Another contribution of our work is to evaluate for
”goodness” of an explanation. Besides previously used met-
rics like deletion, insertion and “Pointing Game” [27], we
also choose to use F-1 and IoU . We evaluate the final
saliency maps at both the image and pixel-level to highlight
the success of our approach in maximizing information con-
tained in each pixel. We argue that such a comprehensive
evaluation should be more trustable when compared with
the human-annotated importance. In our case, we assume
ground truth masks as representative of human interpreta-
tion of an object, as they are human-annotated.
To sum up, the technical contributions are of three-
folds: (1) we propose an iterative and adaptive sampling
for generating accurate explanations, based on the adjusted
saliency map generated by combining the saliency map ob-
tained from the previous iteration and the long-range and
parameter-free spatial attention map; (2) our long-range and
parameter-free attention module that incorporates “object-
ness” and guides our adaptive sampler with the help of
multi-level feature fusion; and (3) we further introduce an
evaluation scheme that tries to estimate goodness of an ex-
planation in a way that it is reliable and accurate.
We conduct extensive experiments on the popular and
vast MS-COCO dataset [11] and compare it with the state-
of-the-art methods in the field. The experimental results
demonstrate the efficacy of our proposed method.
2. Related work
The related work can be divided into two categories, i.e.,
white-box approaches and black-box approaches for pro-
ducing explanations.
White-box approaches rely on information such as the
model parameters or gradients, as well as the intermedi-
ate feature maps. Zeiler et. al. [47] visualize the inter-
mediate representation learned by CNNs using deconvolu-
tional networks. Explanations are achieved in other meth-
ods [25, 36, 46] by synthesizing an input image that highly
activates a neuron. Class activation maps (CAM) [52]
achieve class-specific importance at each location in an im-
age by computing a weighted sum of the activation values
at each location across all channels using a Global Average
Pooling layer (GAP). Such a method prevents us from us-
ing this approach to explain models lacking a native GAP
layer without additional re-training. Later, CAM was ex-
tended to Grad-CAM [35] by weighting the feature activa-
tion values at every location with the average gradient of
the class score (w.r.t. the feature activation values) for ev-
ery feature map channel. In addition, Zhang et. al. [50]
introduced a probabilistic winner-takes-all strategy to com-
pute the relative importance of neurons towards model pre-
dictions. Fong et. al. [7] and Cao et. al. [2] learn a per-
turbation mask that maximally affects the models output
by back-propagating the error signals through the model.
However, all of the above methods assume that the inter-
nal parameters of the underlying model are accessible as a
white-box. They achieve interpretability by incorporating
changes to a white-box based model and are constrained to
specific network architectures, limiting reproducibility on a
new dataset.
Black-box approaches treat the learning models as
purely black-box, without requiring access to any details of
the architecture and the implementation. LIME [32] tries
to fit an approximate linear decision model (LIME) in the
vicinity of a particular input. For a sufficiently complex
model, a linear approximation may not result in a faith-
ful representation of the non-linear model. Even though
LIME model produces good quality results on the MS-
COCO dataset, due to its reliance on super-pixels, it is not
the best at grouping object boundaries with activation. As
an improvement over LIME, RISE model [27] was pro-
posed to generate an importance map indicating how salient
each pixel is for the black-box model’s prediction. Such
a method estimates importance empirically by probing the
model with randomly masked versions of the input image
and obtaining the corresponding output probability. Note
that sampling methods to generate explanations have been
2961
Figure 2. The framework of our unsupervised saliency map extraction method for an explanation of a black-box model. Given an input
image, we perform a rough pass over the image to start the iterative process with large window size. The masked images are passed to
the black box classifier that predicts logit scores for each sample, the predicted logit scores are used to weight image regions to produce a
saliency map. Then an adjusted saliency map is generated by combining with the long-range and parameter-free spatial attention module
to guide the iterative and adaptive sampling module to generate the new sampling masks, which leads to a new saliency map. Such iterative
procedure continues until convergence. Note that the spatial attention module is built based on multi-level deep learning features via an
affinity matrix.
explored in the past [27, 48, 6]. Even though they produce
explanations for a wide variety of black-box model applica-
tions, their resolution is always limited by factors like sam-
pling sensitivity and strength of classifier.
In this paper, unlike the existing methods, we explore a
novel method to provide precise explanations for any appli-
cation that uses a deep neural network for feature extrac-
tion, irrespective of the multi-level features. We leverage a
long-range and parameter-free spatial attention to adjust the
saliency map. We propose an iterative and adaptive sam-
pling module with long-range and parameter-free attention
to determine important regions in an image. The proposed
system can also be adapted to perform co-saliency [6] by
weighting the final saliency map using a standard feature
comparison metric like Euclidean or Cosine distance. This
makes our approach robust to the form of explanation de-
sired and produces better quality saliency maps across dif-
ferent applications with little or no overhead in training.
3. Methodology
The proposed framework is illustrated in Figure 2. Given
an input image, we perform a rough pass to initialize our ap-
proach. The sampled image regions are passed to the black
box classifier that predicts logit scores for each sample, the
predicted logit scores are used to weight image regions to
produce an aggregated response map. The response map is
combined with the attention map obtained from the long-
range and parameter-free spatial attention module (LRPF-
SA) to get an adjusted saliency map. The attention module
also guides the iterative and adaptive sampling to sample
relevant regions in the next iteration. Such an iterative pro-
cedure continues until convergence. Note that the spatial
attention module is built on multi-level deep learning fea-
tures via an affinity matrix. In the following subsections,
we further explain our approaches in detail.
3.1. Iterative and Adaptive Sampling Module
We propose a novel iterative and adapting sampler that
is guided by our LRPF-SA to automatically pick sam-
pling regions of interest with an appropriate sampling factor
rather than weighting all image regions equally. Sampling
around important regions ensures faster convergence and
finer saliency maps. The iterative quality of our approach
also allows the users to control the quality of saliency
maps, which is inversely proportional to the amount of time
needed to generate them. We believe this is crucial in real-
time applications where the same explanation generator sys-
tem needs to be scaled according to user requirements with
minimal changes.
Given an image I , a black-box model f produces a score
vector of length c, where c is the number of classes the
black-box model was trained for. We sample the input im-
age I , using masks M : Λ → {0, 1} be a sliding window
of size W and stride S. Considering the masked version
2962
(I ⊙M) of I, where ⊙ represents element-wise multiplica-
tion, we compute the confidence scores for all the masked
images f(I ⊙ M). We define the importance of a pixel
λ ∈ Λ as the expected score over all possible masks M
conditioned on the event that pixel λ is observed. In other
words, when the scalar score f(I⊙M) is high for a chosen
mask m ∈ M , we can infer that the pixels preserved by m
are important. We define the importance of the pixel λ as
the expected score over all possible masks conditional on
the event that λ is observed, i.e..
S(I, f, λ) =∑
m
f(I ⊙M)P [M = m,M(λ) = 1], (1)
where
P [M = m,M(λ) = 1] =
{
0, if m(λ) = 0P [M = m], if m(λ) = 1
(2)
With Equation 1 and 2, we arrive at
S(I, f, λ) =1
P [M(λ) = 1]
∑
m
f(I⊙M).m(λ).P [M = m]
(3)
Considering that P [M(λ) = 1] = E[M(λ)], we rewrite
Equation 3 in matrix notation as
S(I, f, λ) =1
E[M ]
∑
m
f(I ⊙M).m.P [M = m] (4)
Using Monte Carlo sampling, at the iteration 0, the final
saliency map is computed as a weighted average of a col-
lection of masks Mk = {M1, . . . ,MN} by the following
approximation:
S(I, f, λ) ≈1
E[M ] ·N
N∑
i=1
f(I ⊙Mi).Mi(λ). (5)
When the black-box model f is associated with a class
c, then we can obtain a saliency map corresponding to c ac-
cording to Equation 4. Although most applications require
only the top-1 saliency map, our approach can be used to
obtain class specific salient structures.
The initial saliency map S0 is generated based on a slid-
ing window M0. After the initialization, we take the long-
range and parameter-free attention module A to adjust the
saliency map from Sk to S′k
at the k-th iteration by the fol-
lowing rules
S′k= βSk + (β − 1)A× Sk, (6)
where β is a regularizer to control the amount of influence
the attention network has towards generating the final ex-
planation. The intuition behind using both saliency and at-
tention maps is that, while the saliency map Sk is associated
Figure 3. An illustration of our LRPF-SA module that produces
attention maps used to guide the iterative and adaptive sampling
module.
with the output of a back-box model, we provide a new in-
sight with our proposed LRPF-SA (see next subsection) to
apply some spatial constraints with respect to the extracted
features. Therefore, by combining both forms of explana-
tions we hope to converge on an aggregated saliency map
that gives a complete picture of the image regions that in-
terest the system while conforming to object boundaries.
Then we use S′k
to guide the adaptive sampling for the
next iteration by
Mk+1 = HAR(S′k), (7)
where HAR(·) denotes the highest activated region obtained
by applying a threshold, resulting in a binary map that high-
lights all pixels containing the object of interest, i.e.,
HAR(S′k) = S′
k> Tthresh (8)
With the adaptive sampling masks Mk+1, we are able
to apply Equation 5 to obtain the saliency map Sk+1 at the
(k+1)-th iteration. And then S′k+1
is obtained using Equa-
tion 6 to get the adaptive sampling masks Mk+2 for gener-
ating the saliency map Sk+2 at the (k+ 2)-th iteration. It is
worth noting that the window size and stride can be gradu-
ally depreciated with respect to the number of iterations to
increase the resolution of saliency maps until there is very
little or no change in the quality of maps. The number of
iterations can also be fixed based on application require-
ments, when the user is willing to sacrifice the quality of
saliency maps for run-time.
3.2. LongRange and ParameterFree Spatial Attention
Obtaining an attention map from a deep learning model
is a well-researched topic [39, 49]. Recent developments
in minimizing attention generation overhead was proposed
in [41]. Inspired by [41], we propose a novel long-range and
parameter-free spatial attention (LRPF-SA) module. We
2963
make use of a deep network for feature extraction that en-
compasses activations from different levels of the network.
We believe by using activations from different levels of the
network we provide a true explanation about how the image
is perceived by the complete network, giving rise to hier-
archical salient concepts in the attention map. The saliency
maps are then used to choose from the hierarchical concepts
that match with image boundaries, thus giving rise to accu-
rate and reliable saliency maps.
In this paper, we use the pre-trained network learned on
the ImageNet dataset. Note that in the case of a new do-
main, the network can be adapted into the target domain
using methods proposed in [3]. Let Φ(I) be a pre-trained
deep network used to extract multi-level features that are
combined by upsampling and fusion. Finally, we use a soft-
max operation over the resulting affinity matrix to obtain an
attention map as showing in Figure 3.
Note that the Affinity matrix contains dependencies of
every pixel with all other pixels. Let Φ1(I), Φ2(I), Φ3(I)and Φ4(I) be the the features extracted from four different
levels of the feature extractor. Since we use a Φ1(I) of H×W × C1 dimensions, where H and W are the height and
width of the obtained feature maps and C1 is the number
of channels. The feature maps Φ2(I), Φ3(I) and Φ4(I) are
upsampled to H × W , with channel numbers C2, C3, and
C4. Upsampling the feature maps let us directly compute
an aggregated response using the following Equation
Φ(I) = Φ1(I)⊕ (Φ2(I)↑ ⊕ Φ3(I)↑ ⊕ Φ4(I)↑, (9)
where the subscript ↑ denotes an upsampling operation and
⊕ is an concatenation operation, and the long-range and
parameter-free spatial attention can be obtained by
A = softmax(Φ′(I)⊙ (Φ′(I))T ), (10)
where Φ′ is reshaped from Φ of dimension H ×W × C to
HW × C, and C =4∑
i=1
Ci is the number of channels in
Φ. Figure 3 shows an illustration of our LRPF-SA module
that produces attention maps used to guide the iterative and
adaptive sampling module. By using an attention mecha-
nism we hope to gain information related to the ”object-
ness”, hidden among pixels in an image.
3.3. Iterative Saliency Convergence
We propose to find the best possible saliency map that
captures the decision-making process of the underlying al-
gorithm in an iterative manner. Generating high-quality ex-
planations is a very time-consuming process and limits its
usage in real-time applications that require generating pre-
cise maps on large datasets. By gradually converging on the
optimal saliency map, we hope to let the user decide the rate
of convergence that fits their time budget, opening up possi-
bilities to use explanations in a wide variety of applications.
4. Experiments
One would wonder if we should consider an explana-
tion “good” if it represents the importance according to the
black-box classifier or if it conforms to object boundaries,
encouraging human trust in the explanation system. To ver-
ify the effectiveness of our proposed approach IASSA, we
conduct experiments on the MS-COCO dataset [11] and
evaluate explanations for their ability to best represent im-
age regions that the underlying model relies on and also
for their segmentation performance. By leveraging atten-
tion with model dependant saliency, the proposed approach
achieves better performance when evaluated for insertion,
deletion, intersection over union (IoU), F1-score, and a
pointing game score [27]. We believe we can leverage the
proposed explanation generation method to fine-tune mod-
els, especially deep learning classifiers in a closed loop us-
ing Attention Branch Networks [24].
Note that in this paper, the input images are resized to
224 × 224 to facilitate mask reuse and ease in feature ex-
traction. The IASSA module is initialized with a window
size of W of 45 and a stride S of 8 with step size 1.5 and
0.2 respectively. We use a β of 0.5 and a Tthresh of 0.3
to generate a new saliency map at the k-th iteration. We
present results for 25 iterations.
4.1. Evaluation Metrics
Evaluating the quality of saliency maps can be subjec-
tive to the kind of explanation. We evaluate the quality of
saliency maps using five different metrics: deletion, inser-
tion, IoU, F1-score, along with a pointing game score [27].
In deletion, given a saliency map and input image I we
gradually remove pixels based on their importance in the
saliency map, meanwhile monitoring the Area Under the
Curve (AUC). A sharp drop in activation as a function of
the fraction of pixels removed can be used to quantify the
quality of saliency maps. Analogously, in insertion, we re-
veal pixels gradually in the blurred image. The pixels can
be removed or added in several ways like setting the pixels
of interest to zero, image mean, gray value or blurring pix-
els. For deletion, we set pixels of interest to a constant grey
value. But the same evaluation protocol cannot be used for
insertion as the model would be biased towards grouping of
pixels introduced on an empty canvas.
To prevent the introduction of bias towards pixels group-
ing shapes, for insertion we unblur regions of the image,
under consideration. The IoU and F1-score are calculated
by applying a threshold Tthresh of 0.3 on the aggregated
saliency maps using Equation 8 and 9 obtained at the end of
i-th iterations. We also use a pointing game that considers
2964
(a) Input Image (b) LIME (c) RISE (d) IASSA (e) Input Image (f) LIME (g) RISE (h) IASSAFigure 4. Visual comparison between quality of explanations in the form of saliency maps obtained using black-box explanations LIME,
RISE and our proposed IASSA on the MS-COCO dataset.
an explanation as a positive hit when the highest activated
pixel lies on the object of interest. We average all perfor-
mance metrics at both image and pixel-level by normalizing
the performance by the number of pixels activated. The nor-
malization for per-pixel performance lets us fairly evaluate
explanations that might cover a region much larger than the
object of interest but also include the object.
4.2. Effectiveness of Iterative Adaptive SamplingModule with LRPFSA
We consider explanation generation as an optimiza-
tion problem, assuming there exists an optimal explana-
tion that encapsulates both model dependence and human
interpretable cues in an image. Converging on this opti-
mal explanation is conditioned upon parameters such as the
number of iterations k, regularizer β, and threshold Tthresh
(where β and Tthresh decide the convergence rate). We fix
the value for β and Tthresh, and evaluate the impact of k.
A qualitative analysis of the proposed explanation sys-
tem’s ability to converge on an optimal explanation can be
visualized in Figure 5. The obtained explanations contain
well-defined image boundaries at iteration 10 and slowly
converges to its peak performance at iteration 15. Figure 5
shows the improvement in the quality of explanations with
the increasing number of iterations. Figures 6 and 7 show
the quantitative performance both at an image and pixel-
level with increasing number of iterations. As we can ob-
serve, at the image-level, the proposed IASSA seems to
reach its peak performance at iteration 15 and deteriorate
post-peak due to oversampling. Whereas, when evaluated at
the pixel level, the proposed method IASSA’s performance
increases across all metrics but deletion, suggesting the re-
duction in the influence of model-dependent saliency.
4.3. Comparison with Stateoftheart approaches
Figure 4 shows results comparing explanations obtained
using the proposed method with LIME and RISE. The
saliency maps obtained by our IASSA highlight regions of
interest more accurately when compared to other state-of-
the-art approaches. For example, the success of our ap-
proach can be qualitatively visualized in the test image for
class “snowboard” in Figure 4 (row 4, column 5), while
there exists some ambiguity if the person in the input im-
age contributes to classification while using either LIME or
RISE. The IASSA clearly points at the snowboard to clas-
sify the image.
We also summarize the results quantitatively in Table 1.
From Table 1, we can observe that our proposed method
IASSA outperforms the two black-box model explanation
approaches with the added flexibility of generating expla-
2965
(a) Input Image (b) k = 5 (c) k = 10 (d) k = 15 (e) k = 20 (f) k = 25
Figure 5. Visualization of our IASSA’s saliency maps with increasing iteration number k on the MS-COCO dataset.
0.2
0.3
0.4
0.5
0.6
0.7
0.8
5 10 15 20 25
Deletion
Number of iterations (k)
Deletion
(a) Deletion
0
0.2
0.4
0.6
0.8
1
1.2
5 10 15 20 25
Insertion
Number of iterations (k)
Insertion
(b) Insertion
0.226
0.228
0.23
0.232
0.234
0.236
0.238
0.24
5 10 15 20 25
F-1 score
Number of iterations (k)
F-1 score
(c) F-1
0.14
0.142
0.144
0.146
0.148
0.15
0.152
0.154
0.156
0.158
0.16
5 10 15 20 25
IoU(%)
Number of iterations (k)
IoU
(d) IoU
0.42
0.425
0.43
0.435
0.44
0.445
0.45
0.455
0.46
5 10 15 20 25
Pointing Game
Number of iterations (k)
Pointing Game
(e) Pointing Game
Figure 6. Performance of our IASSA at the image level with increasing number of iterations on the MS-COCO dataset.
nations in an iterative manner enabling its application in
speed-critical explanation systems. When averaged at an
image level, LIME is severely affected, especially in point-
ing game due to instances when the pixels with the high-
est activation were not aligned with the ground truth mask.
The proposed model not only outperforms other explanation
mechanisms when evaluated for “goodness” for the under-
lying model but also maintains human trust in explanation.
Even though RISE obtains deletion metrics close to the
proposed system, IASSA gives the best of both worlds by
explaining the model underneath and encapsulating object-
ness information at the same time. While our IASSA per-
forms close to the best when evaluated at the image level,
the true merit of our approach can only be appreciated at
the pixel level. In an ideal explanation, we would expect all
the contributing regions to carry the highest activation pos-
sible. Black box explanation approaches are prone to error
in interpretation of an explanation due to extraneous image
regions that affect human trust in explanation. Normalizing
saliency maps with the number of pixels carrying the top
30% of the activations resolve this issue, resulting in a fair
evaluation. The iterative aspect of our IASSA makes it a
2966
0.00005
0.00005
0.00005
0.00005
0.00005
0.00006
0.00006
0.00006
5 10 15 20 25
Deletion
Number of iterations (k)
Deletion
(a) Deletion
0.00018
0.0002
0.00022
0.00024
0.00026
0.00028
0.0003
0.00032
0.00034
0.00036
5 10 15 20 25
Insertion
Number of iterations (k)
Insertion
(b) Insertion
0.00005
0.00006
0.00007
0.00008
0.00009
0.00010
0.00011
0.00012
5 10 15 20 25
F1-score
Number of iterations (k)
F1-score
(c) F-1
0.00003
0.00004
0.00004
0.00005
0.00005
0.00006
0.00006
0.00007
0.00007
5 10 15 20 25
IoU
Number of iterations (k)
IoU
(d) IoU
0.00010
0.00011
0.00012
0.00013
0.00014
0.00015
0.00016
0.00017
0.00018
5 10 15 20 25
Pointing Game
Number of iterations (k)
Pointing Game
(e) Pointing Game
Figure 7. Performance of our IASSA at the pixel level with increasing number of iterations on the MS-COCO dataset.
Table 1. Comparative evaluation in terms of deletion (lower is better) and insertion (higher is better), F-1 (higher is better), IoU (higher is
better), and Pointing Game (higher is better) scores at both image and pixel levels on the MS-COCO dataset.
Method Deletion ↓ Insertion ↑ F-1 ↑ IoU ↑ Pointing Game ↑
Image-level
LIME 0.900967 0.99 0.15390 0.09745 0.16461
RISE 0.1847 1.0 0.13837 0.13653 0.25
IASSA 0.18803 1.0 0.23658 0.15153 0.4216
Pixel-level
LIME 10.8526e-05 10.96158e-05 1.71177e-05 1.08447e-05 0.43671e-05
RISE 5.5423e-05 28.8669e-05 4.26672e-05 2.69240e-05 8.95937e-05
IASSA 5.50534e-05 35.33639e-05 10.5960e-05 6.9282e-05 17.79331e-05
Figure 8. Explanations affected by sampling artifact that results in what can be an accurate explanation for class handbag and oven, but
causes ambiguity due artifacts in the form of lines.
perfect match for applications that require the system to be
scaled with minimal overhead.
4.4. Discussion
Fine-tuning hyper-parameters such as W and S, β and
Tthresh plays a crucial role in determining performance.
Hyperparameters help users control the quality of expla-
nations and the algorithms convergence rate. Even though
setting hyperparameters requires some knowledge about the
underlying algorithm, we limit the range of values between
a standard range of (0.0, 1.0) as opposed to it being arbi-
trary. The proposed system can result in explanations con-
taining sampling artifacts due to a mismatch between win-
dow size W of stride S. To prevent this, we plan to look
into other sampling methods that are both faster and can get
a consensus on a larger image region at a time. Some exam-
ples of sampling artifacts are shown in Figure 8. Ultimately,
the proposed system takes an average of approximately 800
milliseconds per iteration to compute explanation on an im-
age of size 224 × 224 using a ResNet-50 architecture in
batches of 256. Since a majority of the run-time is spent in
loading the deep learning feature extractor, we advice using
large batch sizes to minimize model load time.
5. Conclusion
In this paper, we propose a novel iterative and adaptive
sampling with a parameter-free long-range spatial attention
for generating explanations for black-box models. The pro-
posed approach assists in bridging the gap between model
dependant explanation and human trustable explanation by
laying the path for future research into methodologies to
define “goodness” of an explanation. We prove the above
claim by evaluating our approach using a plethora of met-
rics like deletion, insertion, IoU, F-1 score, and pointing
game, at both the image and pixel levels. We believe the
explanations obtained using the proposed approach could
not only be used for the human to reason model decision
but also contains generalized class specific information that
could be fed back into the model to form a closed loop.
2967
References
[1] S. A. Bargal, A. Zunino, V. Petsiuk, J. Zhang, K. Saenko,
V. Murino, and S. Sclaroff. Guided zoom: Questioning
network evidence for fine-grained classification. ArXiv,
abs/1812.02626, 2018. 1
[2] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang,
L. Wang, C. Huang, W. Xu, et al. Look and think twice: Cap-
turing top-down visual attention with feedback convolutional
neural networks. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), pages 2956–2964,
2015. 2
[3] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep
clustering for unsupervised learning of visual features. In
European Conference on Computer Vision (ECCV), 2018. 5
[4] J. Choi, D. Chun, H. Kim, and H.-J. Lee. Gaussian yolov3:
An accurate and fast object detector using localization un-
certainty for autonomous driving. In The IEEE International
Conference on Computer Vision (ICCV), October 2019. 1
[5] B. Ding, C. Long, L. Zhang, and C. Xiao. Argan: At-
tentive recurrent generative adversarial network for shadow
detection and removal. In Proceedings of the IEEE In-
ternational Conference on Computer Vision (ICCV), pages
10213–10222, 2019. 1
[6] B. Dong, R. Collins, and A. Hoogs. Explainability for
content-based image retrieval. In The IEEE Conference
on Computer Vision and Pattern Recognition Workshop
(CVPRW), June 2019. 3
[7] R. C. Fong and A. Vedaldi. Interpretable explanations of
black boxes by meaningful perturbation. In Proceedings
of the IEEE International Conference on Computer Vision
(ICCV), pages 3429–3437, 2017. 2
[8] T. Hu, C. Long, L. Zhang, and C. Xiao. Vital: A visual inter-
pretation on text with adversarial learning for image labeling.
arXiv preprint arXiv:1907.11811, 2019. 1
[9] G. Hua, C. Long, M. Yang, and Y. Gao. Collaborative active
learning of a kernel machine ensemble for recognition. In
IEEE International Conference on Computer Vision (ICCV),
pages 1209–1216. IEEE, 2013. 1
[10] G. Hua, C. Long, M. Yang, and Y. Gao. Collaborative active
visual recognition from crowds: A distributed ensemble ap-
proach. IEEE Transactions on Pattern Analysis and Machine
Intelligence (T-PAMI), 40(3):582–594, 2018. 1
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
manan, P. Dollar, and C. L. Zitnick. Microsoft coco: Com-
mon objects in context. In European conference on computer
vision (ECCV), pages 740–755. Springer, 2014. 2, 5
[12] T. Lombrozo. The structure and function of explanations.
Trends in Cognitive Sciences, 10(10):464–470, 2006. 1
[13] T. Lombrozo. The instrumental value of explanations. Phi-
losophy Compass, 6(8):539–551, 2011. 1
[14] C. Long, A. Basharat, , and A. Hoogs. A coarse-to-fine deep
convolutional neural network framework for frame duplica-
tion detection and localization in forged videos. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), pages 1–10. IEEE, 2019.
1
[15] C. Long, R. Collins, E. Swears, and A. Hoogs. Deep neural
networks in fully connected crf for image labeling with so-
cial network metadata. In 2019 IEEE Winter Conference on
Applications of Computer Vision (WACV), pages 1607–1615.
IEEE, 2019. 1
[16] C. Long and G. Hua. Multi-class multi-annotator active
learning with robust gaussian process for visual recogni-
tion. In Proceedings of the IEEE International Conference
on Computer Vision (ICCV), pages 2839–2847, 2015. 1
[17] C. Long and G. Hua. Correlational gaussian processes for
cross-domain visual recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 118–126, 2017. 1
[18] C. Long, G. Hua, and A. Kapoor. Active visual recogni-
tion with expertise estimation in crowdsourcing. In IEEE
International Conference on Computer Vision (ICCV), pages
3000–3007. IEEE, 2013. 1
[19] C. Long, G. Hua, and A. Kapoor. A joint gaussian process
model for active visual recognition with expertise estimation
in crowdsourcing. International Journal of Computer Vision
(IJCV), 116(2):136–160, 2016. 1
[20] C. Long, E. Smith, A. Basharat, and A. Hoogs. A c3d-
based convolutional neural network for frame dropping de-
tection in a single video shot. In IEEE International Confer-
ence on Computer Vision and Pattern Recognition Workshop
(CVPRW) on Media Forensics. IEEE, 2017. 1
[21] C. Long, X. Wang, G. Hua, M. Yang, and Y. Lin. Accurate
object detection with location relaxation and regionlets re-
localization. In The 12th Asian Conf. on Computer Vision
(ACCV), pages 3000–3016. IEEE, 2014. 1
[22] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan.
Accurate monocular 3d object detection via color-embedded
3d reconstruction for autonomous driving. In The IEEE In-
ternational Conference on Computer Vision (ICCV), October
2019. 1
[23] A. I. Maqueda, A. Loquercio, G. Gallego, N. Garca, and
D. Scaramuzza. Event-based vision meets deep learning on
steering prediction for self-driving cars. In The IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
June 2018. 1
[24] M. Mitsuhara, H. Fukui, Y. Sakashita, T. Ogata, T. Hi-
rakawa, T. Yamashita, and H. Fujiyoshi. Embedding human
knowledge in deep neural network via attention map. arXiv
preprint arXiv:1905.03540, 2019. 5
[25] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and
J. Clune. Synthesizing the preferred inputs for neurons in
neural networks via deep generator networks. In Advances
in Neural Information Processing Systems (NeurIPS), pages
3387–3395, 2016. 2
[26] T. L. Pedersen and M. Benesty. lime: Local interpretable
model-agnostic explanations. R Package version 0.4, 1,
2018. 1
[27] V. Petsiuk, A. Das, and K. Saenko. Rise: Randomized in-
put sampling for explanation of black-box models. In Pro-
ceedings of the British Machine Vision Conference (BMVC),
2018. 1, 2, 3, 5
[28] B. A. Plummer, M. I. Vasileva, V. Petsiuk, K. Saenko, and
D. Forsyth. Why do these match? explaining the behavior of
2968
image similarity models. arXiv preprint arXiv:1905.10797,
2019. 1
[29] Z. Qi, S. Khorram, and F. Li. Visualizing deep networks
by optimizing with integrated gradients. arXiv preprint
arXiv:1905.00954, 2019. 1
[30] F. U. Rahman, B. Vasu, and A. Savakis. Resilience and self-
healing of deep convolutional object detectors. In 2018 25th
IEEE International Conference on Image Processing (ICIP),
pages 3443–3447. IEEE, 2018. 1
[31] M. T. Ribeiro, S. Singh, and C. Guestrin. ”why should i trust
you?”: Explaining the predictions of any classifier. In HLT-
NAACL Demos, 2016. 1
[32] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust
you?: Explaining the predictions of any classifier. In Pro-
ceedings of the 22nd ACM SIGKDD international confer-
ence on knowledge discovery and data mining (SIGKDD),
pages 1135–1144. ACM, 2016. 2
[33] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies,
and M. Niessner. Faceforensics++: Learning to detect ma-
nipulated facial images. In The IEEE International Confer-
ence on Computer Vision (ICCV), October 2019. 1
[34] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K.-
R. Muller. Explainable ai: Interpreting, explaining and visu-
alizing deep learning. Explainable AI: Interpreting, Explain-
ing and Visualizing Deep Learning, 2019. 1
[35] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam,
D. Parikh, and D. Batra. Grad-cam: Visual explanations
from deep networks via gradient-based localization. In Pro-
ceedings of the IEEE International Conference on Computer
Vision (CVPR), pages 618–626, 2017. 1, 2
[36] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside
convolutional networks: Visualising image classification
models and saliency maps. arXiv preprint arXiv:1312.6034,
2013. 2
[37] B. Vasu, F. U. Rahman, and A. Savakis. Aerial-cam: Salient
structures and textures in network class activation maps of
aerial imagery. In 2018 IEEE 13th Image, Video, and Mul-
tidimensional Signal Processing Workshop (IVMSP), pages
1–5. IEEE, 2018. 1
[38] B. Vasu and A. Savakis. Visualizing the resilience of deep
convolutional network interpretations. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion Workshops (CVPRW), pages 107–110, 2019. 1
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all
you need. In Advances in Neural Information Processing
Systems (NeurIPS), pages 5998–6008, 2017. 4
[40] J. Wagner, J. M. Kohler, T. Gindele, L. Hetzel, J. T. Wiede-
mer, and S. Behnke. Interpretable and fine-grained visual
explanations for convolutional neural networks. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 9097–9107, 2019. 1
[41] H. Wang, Y. Fan, Z. Wang, L. Jiao, and B. Schiele.
Parameter-free spatial attention network for person re-
identification. arXiv preprint arXiv:1811.12150, 2018. 4
[42] J. Wei, C. Long, H. Zou, and C. Xiao. Shadow inpainting
and removal using generative adversarial networks with slice
convolutions. Computer Graphics Forum (CGF), 38(7):381–
392, 2019. 1
[43] B. Wu, X. Sun, L. Hu, and Y. Wang. Learning with unsure
data for medical image diagnosis. In The IEEE International
Conference on Computer Vision (ICCV), October 2019. 1
[44] X. Xing, Q. Li, H. Wei, M. Zhang, Y. Zhan, X. S. Zhou,
Z. Xue, and F. Shi. Dynamic spectral graph convolution net-
works with assistant task training for early mci diagnosis. In
Medical Image Computing and Computer Assisted Interven-
tion (MICCAI), pages 639–646, 2019. 1
[45] C.-K. Yeh, C.-Y. Hsieh, A. S. Suggala, D. W. Inouye, and
P. D. Ravikumar. On the (in)fidelity and sensitivity for ex-
planations. In Advances in Neural Information Processing
Systems (NeurIPS), 2019. 1
[46] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson.
Understanding neural networks through deep visualization.
arXiv preprint arXiv:1506.06579, 2015. 2
[47] M. D. Zeiler and R. Fergus. Visualizing and understanding
convolutional networks. In European Conference on Com-
puter Vision (ECCV), pages 818–833. Springer, 2014. 2
[48] M. D. Zeiler and R. Fergus. Visualizing and understanding
convolutional networks. In European Conference on Com-
puter Vision (ECCV), pages 818–833. Springer, 2014. 3
[49] H. Zhang, I. J. Goodfellow, D. N. Metaxas, and
A. Odena. Self-attention generative adversarial networks.
arXiv:1805.08318, 2018. 4
[50] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and
S. Sclaroff. Top-down neural attention by excitation back-
prop. International Journal of Computer Vision (IJCV),
126(10):1084–1102, 2018. 1, 2
[51] L. Zhang, C. Long, X. Zhang, and C. Xiao. Ris-gan: Ex-
plore residual and illumination with generative adversarial
networks for shadow removal. In AAAI Conference on Arti-
ficial Intelligence (AAAI), 2020. 1
[52] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Tor-
ralba. Learning deep features for discriminative localization.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 2921–2929, 2016. 2
2969