Recurrent Iterative Gating Networks for Semantic Segmentation · VGG [34] do not employ any...

Recurrent Iterative Gating Networks for Semantic Segmentation

Rezaul Karim1, Md Amirul Islam2, Neil D. B. Bruce2,31University of Manitoba, 2Ryerson University, 3Vector Institute

[email protected], [email protected], [email protected]

Abstract

In this paper, we present an approach for Recurrent Iter-ative Gating called RIGNet. The core elements of RIGNetinvolve recurrent connections that control the flow of infor-mation in neural networks in a top-down manner, and differ-ent variants on the core structure are considered. The iter-ative nature of this mechanism allows for gating to spreadin both spatial extent and feature space. This is revealedto be a powerful mechanism with broad compatibility withcommon existing networks. Analysis shows how gating in-teracts with different network characteristics, and we alsoshow that more shallow networks with gating may be madeto perform better than much deeper networks that do notinclude RIGNet modules.

1. IntroductionThe problem of semantic segmentation and other pixel-

wise labeling problems [26, 7, 41, 16, 1, 28] has beenstudied in detail, and is important in image understandingand for various applications including autonomous driving.Moreover, many successful solutions have been proposedbased on deep neural networks [19, 34, 37, 13, 14]. Whileprogress continues to be made, this progress has also beenaccompanied by increasingly deep and complex models.While increasing the depth of the network may allow forstrong inferences to be made, this also risks poor spatialgranularity in labeling when pooling occurs and in somesense also acts as a complex template matching process.Given a deep model, the demand of making dense predic-tions for all pixels at once makes the problem more chal-lenging insofar as any relational reasoning must be han-dled in a single feedforward pass. There is good reasonto believe that more parsimonious solutions might be pre-sented from the careful guidance of how the image is in-terpreted from early layers to deep layers that encode morecomplex features. Evidence of this comes from both exam-ples of existing neural networks that consider such princi-ples [11, 31, 26, 17, 25, 16, 18, 6, 5, 40], and also the verysignificant role that recurrence, gain control and gating play

Figure 1. A recurrent iterative gating based model. A con-ceptual illustration of how higher layers of the network influencelower layers by gating information that flows forward. When ap-plied iteratively (left to right), this results in belief propagation forfeatures in ascending layers, that propagates over iterations bothspatially and in feature space.

in biological vision systems as a mechanisms for task, con-text or input dependent adaptation.

In particular, gating is one mechanism capable of con-trolling information flow with the possibility of filtering outambiguous interpretations of a scene. The precise mecha-nism by which this can be accomplished may take many dif-ferent forms. There are at least two principal mechanismsof importance in considering the role of recurrent gating:

1. To ensure that activation among earlier layers is con-sistent with the higher-level interpretation reached bylater layers. For example, if a deep layer signals withhigh confidence that a face is present, features thatcarry more importance for faces, or carry more dis-criminative value relevant to faces should be passedforward. More importantly, activation that arises frompatterns at the location of the face that are not relatedto faces should be suppressed.

2. In the case that local features in a scene are diagnosticof semantic category, there should be a mechanism forthis local diagnosticity to propagate outwards both infeatures carried forward and to envelop a larger spatialextent (see Fig. 1). That is, belief in a semantic con-cept may propagate spatially by virtue of the recurrentgating mechanism.

To this end, we propose an iterative gating mechanismwherein deeper features signal to earlier features what in-

arX

iv:1

811.

0804

3v1

[cs

.CV

] 2

0 N

ov 2

018

formation should be passed forward. Moreover, some net-works are marked by spatial pooling, the spatial extent ofactivation observed very deep in the network has the ca-pacity to propagate to a larger spatial region among earlierlayers. When this is done iteratively, there is no limit to thespatial extent over which strongly diagnostic features can bematched to less discriminative and more ambiguous regionsof the scene allowing for a consensus on the most likelylabeling to be determined through gating and iteration.

It is also the case that one might expect that more effi-cient handling of the traffic that flows through the networkmay allow for more parsimonious network architectures tohandle challenging problems.

In this paper we present one such mechanism, a Recur-rent Iterative Gating Network RIGNet. The results pre-sented in the paper consider how gating mechanisms mayinteract with different network properties including depth ofnetwork, spatial pooling, dilated filters and other character-istics. Through this presentation we show that a wide rangeof different network architectures reveal better performancethrough the use of RIG. Moreover, we show that simplerversions of networks (e.g. ResNet-50) can be as capable asmuch deeper networks (e.g. ResNet-101) when modified toform a RIG-Net and show that this result is true for severaldifferent networks.

This has important implications for semantic segmenta-tion, but also any dense image labeling problem and also inhow the nature of recurrent processing in general is viewed.In short, this presents a solution for improving the perfor-mance of any network by adding a simple canonical recur-rent gating mechanism.

2. Related WorkFeed-forward neural networks have shown tremendous

success in various visual recognition tasks. e.g. image clas-sification [34, 13, 14], semantic segmentation [26, 7, 41, 16,1, 28]. Recent approaches mostly focus on generating a dis-criminative feature representation by increasing the depth ofthe network. There has also been prior research in machinelearning [31, 3, 38, 33, 40, 5, 6] that incorporates feedbackin the learning process. In this paper, we carefully examinethe essence of feedback based learning, including the roleof iterative recurrence and provide an overview of some re-cent work that falls into different categories in the role offeedback.

Traditional feed-forward networks, e.g., AlexNet [19],VGG [34] do not employ any recurrence or feedback mech-anism inside the network and carry a large number of pa-rameters. Recent successful methods e.g. ResNet [13],GoogleNet [36], and Highway Networks [35] have struc-ture connected to earlier layers in the feed-forward networkin the form of residual connections. While the deeper net-works with skip connections counter the problem of degra-

dation due to depth, networks with residual connections ad-dress this with the identity shortcut. Moreover, several re-cent methods have shown the superiority of feedback basedapproaches [40, 5, 2, 21, 24, 22, 31] for particular tasks ofinterest. While most approaches focus on directly correct-ing the initial prediction iteratively, we instead propagate in-formation backward through the network to learn a compactrepresentation that is also strong in its predictions and playsan implicit role in correcting the final prediction made.

Inference in human and other biological vision sys-tems includes both short range and long range recurrentconnections[12] where feed-forward paths work in concertwith recurrent feedback to provide pre-attentive and atten-tive modes of vision [20]. Some approaches [3, 30, 30, 15]consider recurrent neural networks (e.g. LSTM [35]) for thetask of semantic segmentation or scene labeling.

Related to our proposed approach is the idea of learn-ing in an iterative manner in a feed-forward network withfeedback modulation. Recently proposed Feedback Net-works [40] present feedback based learning for the purposeof hierarchical taxonomy learning in image classification.IEF [21] proposed a feedback based pipeline that correctsthe initial prediction iteratively based on the feedback fromeach timestep. RCNN [31] introduced a recurrent convolu-tional neural network for scene labeling where the under-lying CNN takes an input image and the label predictionsfrom the previous iteration.

Architecturally, our work is closest to the iterative feed-back modulation strategy of recurrent feedback neural net-works which are unrolled for several iterations with suc-cessive feature refinement. Conceptually, our work is quitedifferent than prior works in the way that feedback is prop-agated through deep convolutional neural networks for fea-ture refinement, and we provide detailed analysis and sup-port of this approach that includes intuitive and empiricalgrounding. In summary, our proposed approach can bethought of as a novel formulation of feedback based recur-rent design on deep convolutional neural networks that canemulate attentive vision to facilitate top down attention (inspace and feature-space) and improve spatial and seman-tic context for inference. This implies an approach thathas a high degree of generality and desirable characteris-tics which we relate to characteristics of different networkarchitectures in the sections that follow.

3. Recurrent Iterative Gating NetworkIn this section, we present the recurrent iterative gating

mechanism with different recurrent unrolling mechanisms.We also highlight some theory towards the logical expla-nation for the recurrence gating module and several coreadvantages of different unrolling mechanisms. Finally, wepropose our top-down feedback based Recurrent IterativeGating Network, the RIGNet for semantic segmentation.

f i

r

f i

ϑ f 1

ϑ f 2

ϑf 3

ϑf 4

ϑ f o

ϑ

f(a)

(c)

frecurrent(b) f o

r

Figure 2. Illustration of (a) traditional iterative feed-forward net-work (b) iterative recurrent network where the hidden state servesas a feedback gate and (c) our recurrent iterative gating network forsemantic segmentation. In contrast to previous works, our frame-work involves recurrent iterative gating that control the flow ofinformation passed forward in a top-down manner.

3.1. Overview of RIGNet Formulation

The process involved in iterative feedback based ap-proaches has few key elements: (1) Output from somelayers/blocks in the network is fed-back to earlier layersthrough a gating module where the simplest gate may be anidentity transform or skip connection (2) The feedback iscombined with a representation at an earlier layer throughconcatenation/multiplication/addition to generate the inputfor next iteration and (3) The final output is generated atthe end of the last iteration. The input image undergoesshared convolutional stages repeatedly to predict labels ateach step. The premise behind this is that iterating the feed-forward modules (most cases RNN) a few times producesa different final output at the last iteration. However, thereis no backward interaction involving the feed-forward mod-ules. In the proposed RIGNet architecture, we take the out-put of each block of layers as feedback, modulating the sig-nal that forms the input of that block. The outcome of ourrecurrent feedback based approach is seen to be beneficialin few respects: (a) modulate the initial input with the feed-back signal to emulate attentive vision (b) a compact rep-resentation that can compete performance-wise with deeperarchitectures through unrolling for more iteration (c) thisallows a common architecture suitable for a wide range ofdevices with different computational throughput by varyingthe number of iterations (d) this introduces a hierarchicalstructure that leads to an implicit coarse-to-fine represen-tations to improve spatial and semantic context of the in-ference. Overall, this allows for refinement of feature spe-cific activations, and confidence for categories separated byspace to propagate over the image.

An illustration of the RIGNet architecture is shown inFig. 2. Unlike existing work, we integrate iterative feedbackmodules inside the feed-forward network which can be seenas rerouting the captured information based on informationthat flows in a backward direction. At a high-level, RIGNetmimics the cyclical structure of the human brain which iscreated by integrating iteration in the feedback modules ofa feed-forward network.

3.2. Iterative Gating Mechanism

In this section, we share the details of our proposed re-current iterative gating/feedback mechanism which is basedon stacking feedback modules in each stage of the feed-forward network. Recent works on semantic segmentationthat have shown success typically share common strate-gies involving dilation [8, 39, 14], encoder-decoder struc-ture [28, 16, 1], and coarse-to-fine refinement [16, 32, 6, 25]to balance semantic context and fine details by recoveringper-pixel categorization. Most of these approaches are in-creasingly precise in their performance but introduce addi-tional model complexity. Our main objective is to demon-strate a compact representation of complex deep networkswhich can achieve similar performance and is agnostic tothe network it is paired with. To accomplish this, wepropose a network with several iterative feedback mod-ules (shown as a feedback gate in Fig. 2) similar to re-current neural networks. More specifically, we apply re-current top-down feedback in a block-wise manner ratherthan layer-wise [23, 24] or over the full network [31]. Weargue that formulating top-down feedback layer-wise simi-lar to [23, 24] suffers from few major drawbacks: (1) addsa huge overhead in number of parameters (2) limiting fortransfer learning (3) the output of a single deeper layer maynot contain sufficiently rich semantic information for feed-back. Moreover, the recurrence over the whole network [31]is also unable to leverage semantic and spatial contextualinformation through feedback to refine the previous layers.In this context, the behaviour of filters remains fixed andonly varies based on the current label hypotheses rather thanany internal feature representations. In contrast, in RIGNetfeedback is taken from the output of a block of convolutionlayers resulting in a larger effective field of view and moreabstraction. Also, this mechanism allows the lower-levellayers to be influenced by the weight/activation of higher-level features resulting in refined weight/activation in theearlier layers that refines information passed forward andimportantly also allows for spatial propagation through in-ternal feature representations throughout the network. Themechanism is further defined by a parameter (ui) denoted asthe unroll iteration of the network that determines the num-ber of times the feedback loop will be instantiated to predictthe final output. The unrolling parameter (ui) can be viewedthe same as the unrolling steps in RNNs. Fig. 3 shows theunrolling effect on the network presented in Fig. 2 (c).

The final output is generated with multiple iterationsover the network where the number of iterations is deter-mined by the unroll iterations (ui). In the very first itera-tion, all feedback gates remain disconnected from the net-work (i.e. there is no gating without prior knowledge of theimage among subsequent layers). In subsequent iterations,the output of the previous iteration is subject to modula-tion by the feedback gate at every layer in the most extreme

f o

ϑf 4

ϑf i

ϑ f 1

ϑ f 2

ϑ f 3

ϑ

f i

ϑ f 1

ϑ f 2

ϑf 3

ϑf 4

ϑ f o

ϑ

= 1ui

= 2ui

Figure 3. Our proposed network when unrolled in different time-steps (ui = 1, 2..). The difference with the traditional approach isthat the loop connections feed into layers that share parameters with previous layers in addition to gating being controlled top-down.Consider iteration ui = 1 where the feed-forward network receives the input image and predicts the initial output which is gated with thecorresponding iterative gating module in iteration ui = 2.

case all the way from the first layer to the deepest layers.This operation repeats for all the stages of the network toobtain a prediction for the current iteration. The reverse in-formation flow towards the input all the way from outputlayers allows the earlier layers to be implicitly subject toadjusted parameters at the time of inference to provide guid-ance to remove ambiguity that may arise anywhere withinthe network. Similarly, the loopy structure inside the feed-back modules allows a shallower network to be influencedby a rich feature representation. The iterative nature allowsfor stage-wise refinement and spatial propagation of refine-ment. This implies that much simpler networks can performmore powerful inference given that their behaviour is notfixed and can be modulated by recurrent feedback.

3.3. Unroll Mechanism

For a recurrent approach to be practical, there is a needfor a specific recurrent structure (and implied unrollingmechanism) and also a constraint that the unrolling itera-tions should be finite. It is important to note that often onlya finite number of iterations is of value given that the fi-nal output will tend to converge on a certain set of labelsand show little change beyond a fixed number of iterations.Note also that when we set the unroll iteration ui = 1,RIGNet becomes a purely end-to-end feed-forward neuralnetwork. We explore two different unrolling mechanismsto control the flow of information in a top-down manner.

3.3.1 Sequential Unroll

Sequential unrolling involves the recurrence within blocktied to a particular iteration, with the order of recurrencefollowing subsequent blocks all involved in a single feed-forward pass. Activations are forwarded to the next recur-rent block when iterations in all previous network blockswithin a single recurrent iteration are finished. Fig. 4 de-picts the sequential unrolling mechanism when the networkis unrolled for three iterations (ui = 3). Conceptually, thesequential unroll mechanism increases the effective depthof the unrolled network by a multiplicative factor. For ex-ample, given a network with lt layers in the RIGNet formu-lation, lf of lt layers remain similar whereas we introduce

lr layers within the blockwise recurrent feedback stage. So,the feed-forward network has effective depth similar to theoriginal depth of the network ( le = lf + lr) howeverthe RIGNet with unroll iteration ui has effective depth ofle = lf + lr ∗ ui.

f 3

ϑ f 4

ϑf 1

ϑ f 2

ϑ f 4

ϑ

4

4f 4

ϑ

f 5

ϑf 5

ϑ

5

5f 5

ϑ

f 6

ϑf 6

ϑ 6

6 f 6

ϑ

= 1ui = 2ui = 3ui

Figure 4. Illustration of RIGNet with sequential unroll for threeiteration in last three feed-forward blocks. Note that f i

ϑ refers to afeed-forward block and F i denotes the recurrent feedback gate.

3.3.2 Parallel Unroll

In the parallel unrolling mechanism, the network initiallygathers the final representation (activations) of the first it-eration and then recurrence proceeds by way of feedbackfrom the first block to the last (deep → shallow). Theformulation of obtaining the final representation in subse-quent iterations is similar to the first iteration. Our pro-posed RIGNet in Fig. 3 is a feed-forward network with aparallel unrolling mechanism. Fig. 5 illustrates the paral-lel unrolling mechanism where a RIGNet with feedback inlast three blocks is unrolled for three iterations. The paral-lel unrolling mechanism has less effective depth comparedto the sequential one since it increases the effective depthmultiplicatively only for the last feedback block. To con-tinue with the example in Sec. 3.3.1, assume the last feed-back block has lrk layers among lr whereas the remainingblocks with feedback have lrj = lr− lrk layers. In this casethe effective depth of the RIGNet with parallel unrolling isle = lf +lrj +lrk ∗ui. This depth analysis between sequen-tial and parallel unrolling mechanisms reveals the impactof increasing the effective depth on performance improve-ment. This analysis also sheds light on the hypothesis thatan effective depth increase helps to improve the overall per-

f 4

ϑ

f 4

ϑ

f 4

ϑ

4

5

6

5

f 3

ϑ f 5

ϑf 1

ϑ f 2

ϑ

4

f 6

ϑ

f 5

ϑf 6

ϑ

f 5

ϑf 6

ϑ

= 1ui

= 2ui

= 3ui

6

Figure 5. Illustration of RIGNet with parallel unroll in the lastthree feed-forward blocks for three iterations.

formance as opposed to the role of long range semantic con-text (across layers and spatially) through feedback bringingimprovement. Additionally, parallel recurrence brings an-other advantage of generating a early prediction (coarse rep-resentation) at the end of each iteration which can be usedto facilitate taxonomy learning where some coarse grainedpredictions can be made at initial iterations (like vehicles)and then fine predictions in final iterations (like cars/bus).This can be done by backpropagation of loss defined bycoarse predictions in initial iteration similar to the conceptexplored in [40] for classification task. Since we are princi-pally interested in the final prediction, we only backpropa-gate loss once at the end of final iteration, but it should benoted that the generality of the RIGNet structure presents awide range of directions that may be explored further.

3.3.3 Parallel Unroll with Multi Range Feedback

In addition to our basic feedback model (RIGNet), wealso extend the concept of short range feedback to longrange feedback. Our multi range feedback is a best fit forthe parallel recurrence mechanism as it requires additionalfeedback from not only subsequent blocks but also deeperblocks. The concept of multi range feedback is more anal-ogous to biological vision systems and has been consid-ered to a limited degree in the context on image classifi-cation [27].

3.4. Iterative Gating Module Formulation

In this section we present the formulation for both ourbasic feedback gate and multi range feedback gate.

3.4.1 Basic Feedback Gate

Each feedback gate module takes the output of the forwardblock (next stage feature map) f i+1

ϑ as input and learns topass information relevant to gating backwards through thefollowing sequence of operations: First we apply an averagepooling (resultant feature map f i

′+1ϑ ) followed by a 3 × 3

convolution to obtain a feature map f i′+1

b which is capable

of carrying context with a larger field of view. We then ap-ply a sigmoid operation followed by bilinear upsampling toproduce a feature map whose spatial resolution is the sameas the input to the feedback gate. The resultant feature mapprovides the input to the feedback gate. The ith stage feed-back gate F i+1 combines the inputs through an element-wise product resulting in a modulated feature map f ir.We can summarize these operations as follows:

f i′+1

b = S(C3×3(Ap(f i+1ϑ ); Θ), f i

′+1b = ξ(f i

′+1b )

f ir = f i′+1

b ⊗ f i−1ϑ (1)

where Ap represents an average pooling operation and (Θ)denotes the parameters of the convolution C. ξ refers toupsampling operation.

3.4.2 Multi Range Feedback Gate

In a multi range feedback gate, the long range feedback istaken from the output of the last block. For example, we de-note the output from the last block as f ic at iteration i. Thelong range feedback f ic is first passed through convolutionand an interpolation layer resulting in a feature map of f i

′

c

with a number of channels and spatial dimensions matchingthe short range feedback f i+1

ϑ (see Eq.1). We then concate-nate f ic and f i+1

ϑ to obtain a feature map f i′

c . The rest ofthe operations of the feedback gate are similar to the basicfeedback gate with f i

′

c instead of f i+1ϑ .

4. ExperimentsWe perform a series of experiments to evaluate the per-

formance of RIGNet. We begin by providing implemen-tation details of RIGNet and baseline approaches. Ini-tially, we perform a controlled study to analyze and in-vestigate the contribution of each component in RIGNet.Then we experiment on the object-centric PASCAL VOC2012 dataset [10]. In addition to semantic segmentation, wealso explore how the recurrent iterative gating mechanismcan improve scene parsing performance on the COCO-Stuffdataset [4].

4.1. Implementation Details and Baseline Networks

Our experimental pipeline and the pre-trained models arebased on the open source toolbox PyTorch [29]. We be-gin with experiments using simpler networks and graduallymove to more complex networks to show performance im-provements related to different network architectures withthe recurrent iterative gating mechanism. First, we reportevaluation performance for the vanilla ResNet baselines de-noted as ResNet50-FCN, ResNet101-FCN and our corre-sponding proposed ResNet50-RIGNet, ResNet101-RIGNetnetworks. So given an input image I ∈ Rh×w×d, the net-works produce a feature map of size

⌊h32 ,

w32

⌋. Then we

extend our experiments with a more sophisticated networkarchitecture to examine the effectiveness of RIGNet. Wechoose DeepLabV2 [9] as our new baseline network dueto it’s superior performance on pixel-wise labeling tasks.DeepLabV2 uses the dilated structure to balance the seman-tic context and fine details, resulting in a feature map of size⌊h8 ,

w8

⌋given an input image I ∈ Rh×w×d.

For ease of representation, we use the following nota-tions to report numbers throughout the experiment section.Sequential unroll (Su), Parallel unroll (Pu), Parallel un-roll with multi-range feedback (Pf

u), unroll iteration (ui).Fb � j..i� refers to feedback used in blocks from i to j.

4.2. Analyzing RIGNet

In this section, we perform ablation studies to investigatethe role of the recurrent iterative gating mechanism. Wehighlight a few major facts to validate the design choices:1) the role of applying iterative gating modules in differentstages 2) the length of iteration in a gating module. 3) thedesign choice associated with the feedback gate 4) the influ-ence of network-wide vs block-wise feedback mechanism.

4.2.1 Unroll Mechanism and Feedback Blocks

To evaluate the value of applying iterative gating modulesin different convolutional stages, we perform a control studywhere we train models by adding iterative gating modulesstep by step to different layers to evaluate their effect on per-formance. More specifically, we train a feed-forward net-work by adding feedback modules at the last stage only andcompute mIoU for the final predictions. We repeat this op-eration several times until we reach the first convolutionalstage to examine the importance of integrating recurrentiterative gating mechanisms at the final layer, many deeplayers, or all layers. We first report mIoU for the vanillaResNet baselines in Table 1. Table 2 shows the depth-wiseperformance of the ResNet50-RIGNet architecture on thePASCAL VOC 2012 validation set. From this analysis, it isclear that inclusion of iterative gating modules improves theoverall performance gradually.

Methods Parameters Mean IoU

ResNet50-FCN 32-s 59.4ResNet101-FCN 32-s 65.3

Table 1. Quantitative results in terms of mIoU on PASCAL VOC2012 validation set for ResNet-FCN based baselines.

MethodsFeedback Blocks

�6� �6..5� �6..4� �6..3� �6..2� �6..1�Su 61.6 65.1 65.4 65.2 65.3 65.3Pu 62.3 64.8 65.2 64.9 64.8 65.1Pfu 61.6 66.3 66.7 66.8 67.1 67.2

Table 2. Comparison of mIoU for different variants of our pro-posed ResNet50-RIGNet architecture.

4.2.2 Unroll Iteration

To justify the significance of an iterative solution, we ex-amine the extent of the recurrent gating module in terms ofiterations. Table 3 shows the experimental results of the it-erative gating module per the discussion. We keep the bestperforming combination for the three different scenarios.

Iter. RIGNet (Su)�6..4� RIGNet (Pu)�6..4� RIGNet (Pfu)�6..1�

2 65.4 65.2 67.24 68.3 68.3 68.56 68.8 68.9 69.5

Table 3. Comparison of mIoU for varying number of unroll itera-tions with ResNet50-RIGNet variants on PASCAL VOC 2012.

For this analysis, we compare evaluation performancefor unroll iter 2, unroll iter 4, and unroll iter 6. We observethat overall performance progressively improves with eachsuccessive stage of iteration. We empirically found this ob-servation to be valid across datasets and different networkarchitectures. Interestingly, ResNet50-RIGNet with unrolliteration 3 outperforms the ResNet101-FCN which furthervalidates the impact of increasing the number of iterationsin recurrent gating.

4.2.3 Feedback Gate Design Choices

Additionally, concerning the design choice of feedbackgate, we try a variety of alternative design choices and re-port the number in Table 4. When we use (additive + ReLU)interaction in recurrent gating modules, ResNet50-RIGNetachieves 64.2% mIoU on the PASCAL VOC 2012 valida-tion set. In comparison, our proposed ResNet50-RIGNetwith an (multiplicative + sigmoid) interaction in the gat-ing modules achieves 65.2% mIoU. Multiplicative feedbackrouting is demonstrably valid from a performance point ofview, but also intuitive in that it provides a stronger capac-ity to resolve categorical ambiguity present among earlierlayers in the extreme case completely inhibiting activationin an earlier layer.

Methods Add + ReLU Mul + Sigmoid Mul + Tanh

ResNet50-RIGNet 64.2 65.2 65.2ResNet101-RIGNet 67.0 68.9 68.0

Table 4. Impact of the choice of activation function in iterativegating on PASCAL VOC 2012 validation set.

Moreover, we also investigate the impact of differentpooling operations (w/o pooling, max pool, and avg pool)in the design choice of recurrent iterative gating. Table 5presents the results of alternative design choice in termsof different pooling operations. For ResNet50-RIGNet allthe three different design choices achieve similar mIoU onPASCAL VOC 2012 val set. Interestingly, the ResNet101-RIGNet with an average pooling in the recurrent gatingachieves better performance compared to alternatives.

Methods w/o Pooling Max Pool Average Pool

ResNet50-RIGNet 65.2 65.2 65.2ResNet101-RIGNet 68.3 68.3 68.9

Table 5. Impact of choice of pooling operation in the iterative gat-ing with Pu�6..4�, ui = 2.

4.2.4 Feedback: Network-wide vs Block-wide

In Table 6, we present the results comparing different feed-back routing mechanisms shown in Fig. 2. Note that exist-ing works incorporate network-wide recurrence [31] witha shallower base network and the results comply with ourgeneral intuition of the superiority of the gated block-widefeedback mechanism.

Methods Recurrence Mechanism mIoU

ResNet50-FCN feed-forward 59.4RCNN Network-wide, similar to [31] 59.6RCNN Network-wide gated feedback 59.9

ResNet50-RIGNet routing similar to [22] 65.0ResNet50-RIGNet Pu �6..4�, ui = 2 65.2ResNet50-RIGNet Pf

u �6..1�, ui = 2 67.2Table 6. Comparison of network-wide vs block-wide recurrencebased methods on PASAL VOC val. Note that we implement theideas in [31] with ResNet50. Also, we adapt [22] by implementingtheir routing mechanism attached with our basic feedback gate.

4.3. Experiments on PASCAL VOC 2012

We evaluate the performance of our proposed recurrentiterative gating network on the PASCAL VOC 2012 dataset,one of the most commonly used semantic segmentationbenchmarks. Following prior works [26, 9, 17], we usethe augmented training set comprised of 10,581, 1449, and1456 images in training, validation, and testing respectively.The models are trained on the augmented training set andtested on the validation set. Table 7 shows results for thecomparison between our proposed approach and the ResNetbaselines on the validation set.

Method Parameters mIoU (%)

ResNet50 - 59.4ResNet50-RIGNet Pu�6..4�, ui = 6 68.9

ResNet101 - 65.3ResNet101-RIGNet Su�6..4�, ui = 6 71.2ResNet101-RIGNet Pu�6..4�, ui = 6 71.4ResNet101-RIGNet Pf

u � 6..1�, ui = 6 71.6

ResNet101 (8s) - 71.3ResNet101-RIGNet(8s) Pu� 6..4�, ui = 4 74.9

DeepLabV2 - 74.9DeepLabV2-RIGNet Pu� 6..4�, ui = 2 75.9

Table 7. PASCAL VOC 2012 validation set results for severalbaselines and RIGNet with different unroll mechanisms.

Note that we only report the best result for ResNet50-RIGNet in Table 7 since the depth-wise results are alreadyreported in Table 2. It is evident that, ResNet50-RIGNet andResNet101-RIGNet outperform the baselines significantlyin terms of mIOU achieving 68.9% and 71.6% respectively.It is worth mentioning that our proposed ResNet50-RIGNetyields mIoU better than ResNet101-FCN without any post-processing techniques providing a convincing case for thevalue of our iterative gating mechanism. We also exper-iment with ResNet101-FCN (stride 8) as a base networkand achieve superior performance (71.3% vs. 74.9% mIoU)compared to the baseline. As shown in Table 7, DeepLabv2-RIGNet outperforms the baseline significantly which furthervalidates the importance of iteratively refining initial out-comes through recurrent gating modules.

Figure 6 depicts a visual comparison of our approachwith respect to the baselines. We can see that ResNet50-RIGNet is capable of producing predictions superior toResNet101-FCN, and the RIG mechanism has a powerfulimpact on network performance for all cases.

Image ResNet50-FCN ResNet50-RIGNet ResNet101-FCN ResNet101-RIGNet

Figure 6. Qualitative results corresponding to the PASCAL VOC2012 validation set for 2 iteration.

4.4. Experiments on COCO-Stuff

To further confirm the value and generality of pro-posed recurrent iterative gating mechanism on scene pars-ing, we evaluate on the large-scale COCO-Stuff dataset.This dataset contains images of high-complexity includ-ing things and stuff. The COCO-Stuff dataset extends theCOCO annotation by adding dense pixel-wise stuff anno-tations and provides dense semantic labels for the wholescene, which has 9,000 training images and 1,000 test im-ages. It includes total annotation of 182 classes consistingof 91 thing classes and 91 stuff classes.

We use the same architectures and training proceduresmentioned for evaluating performance on the COCO-Stuffdataset. Table 8 shows the quantitative comparison ofour approach with respect to vanilla ResNet-FCN andDeepLabv2 based baselines. Our proposed ResNet50-RIGNet achieves better mIoU (28.8% vs. 24.3%) thanthe baseline. We further perform experiments on the

image gt ResNet50-FCN RIGNet Pu � 6� RIGNet Pu � 6..5� RIGNet Pu � 6..4� RIGNet Pfu � 6..1� ResNet101-FCN

Figure 7. Some samples of output quality after stage-wise addition of recurrent iterative gating modules. For each row, we show the inputimage, ground-truth, the vanilla ResNet-50 prediction, the predicted segmentation map of ResNet in a top→down manner when iterativegating is included, and the output of vanilla ResNet101-FCN.

COCO-Stuff 10k dataset with more sophisticated models(DeepLabV2-Res101). As shown in Table 8, RIGNet con-sistently outperforms the baselines by a significant margin.

Methods Parameters pAcc mAcc mIoU

ResNet50 - 57.2 35.2 24.3ResNet50-RIGNet Pu� 6..4�, ui = 2 59.8 38.2 26.6ResNet50-RIGNet Pu� 6..4�, ui = 6 61.4 40.0 28.8

ResNet101 - 58.7 38.2 26.4ResNet101-RIGNet Pu� 6..4�, ui = 2 60.8 39.4 28.0ResNet101-RIGNet Pu� 6..4�, ui = 6 62.6 41.1 29.5ResNet101-RIGNet Pf

u � 6..1�, ui = 6 62.3 41.6 29.9

DeepLabV2 - 65.4 44.6 34.1DeepLabV2-RIGNet Pu� 6..1�, ui = 2 66.1 46.6 35.0

Table 8. Comparison of several ResNet based networks w/o andw/ RIG on COCO-Stuff 10K validation set.

The superior performance achieved by RIGNet revealsthat integrating recurrent iterative gating modules in thefeed-forward network are very effective in capturing morecontextual information for labeling complex scenes.

5. DiscussionTowards examining the practical grounds for recurrent it-

erative gating networks with top-down feedback, we aimedto verify a few specific hypotheses with our experiments.Firstly, the recurrent iterative gating mechanism can al-low more parsimonious networks to outperform deeper ar-chitectures with careful selection of the gating structure.This is revealed to be the case for the RIGNet architec-ture, with clear evidence of ResNet50-RIGNet outperform-ing ResNet101-FCN with three iterations.

Secondly, our proposed RIGNet is more precise and se-mantically meaningful compared to the baselines with re-spect to qualitative results presented in Fig. 6. Fig. 7 illus-trates the impact of integrating a recurrent gating module.We can see that the recurrent iterative gating scheme pro-gressively improves the details of a predicted segmentationmap by recovering the missing spatial details which can beseen as coarse-to-fine refinement.

Moreover, the improvement in performance as a func-tion of recurrent gating depth (RIG blocks) reveals that theRIGNet formulation of the feed-forward convolutional net-work improves the representational power of the model byincorporating semantic and relational context in a top-downmanner. While there may exist alternatives for feedbackgate design, we present an intuitive way of designing re-current feedback gates. Given the demonstrable capabilityof the proposed mechanism, we expect that this work willinspire significant interest in exploring feedback based ap-proaches within future work including an emphasis on top-down processing, gating, and iteration.

6. Conclusion

We have presented a network mechanism that involvesRecurrent Iterative Gating for semantic segmentation,showing it is a valuable contribution in its compatibilitywith virtually any feedforward neural network providing ageneral mechanism for boosting performance, or allowingfor a simpler network. Our experimental results on twochallenging datasets demonstrate that the proposed modelperforms significantly better than the baselines. We alsofound that feedback based approaches are capable of devel-oping a meaningful coarse-to-fine representation similar tocoarse-to-fine refinement based approaches without bring-ing complexity into the network architecture. As one spe-cific example, ResNet-50 as a RIGNet is shown to outper-form ResNet-101.

Results presented reveal the capability for correctingerrors made in a single feedforward pass through RecurrentIterative Gating as an exciting and important direction forfuture work, which even for deep networks, allows forstronger representational capacity.

Acknowledgments: The authors acknowledge financialsupport from the NSERC Canada Discovery Grants pro-gram, MGS funding, and the support of the NVIDIA Cor-poration GPU Grant Program.

References[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A

deep convolutional encoder-decoder architecture for scenesegmentation. TPAMI, 2017. 1, 2, 3

[2] V. Belagiannis and A. Zisserman. Recurrent human poseestimation. In FG, 2017. 2

[3] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene la-beling with lstm recurrent neural networks. In CVPR, 2015.2

[4] H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing andstuff classes in context. arXiv:1612.03716, 2016. 5

[5] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Hu-man pose estimation with iterative error feedback. In CVPR,2016. 1, 2

[6] A. Casanova, G. Cucurull, M. Drozdzal, A. Romero,and Y. Bengio. On the iterative refinement of denselyconnected representation levels for semantic segmentation.arXiv:1804.11332, 2018. 1, 2, 3

[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, andA. L. Yuille. Semantic image segmentation with deep con-volutional nets and fully connected CRFs. In ICLR, 2015. 1,2

[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, andA. L. Yuille. Deeplab: Semantic image segmentation withdeep convolutional nets, atrous convolution, and fully con-nected crfs. arXiv:1606.00915, 2016. 3

[9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, andA. L. Yuille. Deeplab: Semantic image segmentation withdeep convolutional nets, atrous convolution, and fully con-nected crfs. TPAMI, 2018. 6, 7

[10] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams,J. Winn, and A. Zisserman. The pascal visual object classeschallenge: A retrospective. IJCV, 2015. 5

[11] A. Gazzaley and A. C. Nobre. Top-down modulation: bridg-ing selective attention and working memory. Trends in cog-nitive sciences, 2012. 1

[12] C. D. Gilbert and W. Li. Top-down influences on visual pro-cessing. Nature Reviews Neuroscience, 2013. 2

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learningfor image recognition. In CVPR, 2016. 1, 2

[14] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger.Densely connected convolutional networks. In CVPR, 2017.1, 2, 3

[15] Q. Huang, W. Wang, K. Zhou, S. You, and U. Neumann.Scene labeling using gated recurrent units with explicit longrange conditioning. arXiv:1611.07485, 2016. 2

[16] M. A. Islam, S. Naha, M. Rochan, N. Bruce, and Y. Wang.Label refinement network for coarse-to-fine semantic seg-mentation. arXiv:1703.00551, 2017. 1, 2, 3

[17] M. A. Islam, M. Rochan, N. D. B. Bruce, and Y. Wang. Gatedfeedback refinement network for dense image labeling. InCVPR, 2017. 1, 7

[18] M. A. Islam, M. Rochan, S. Naha, N. Bruce, and Y. Wang.Gated feedback refinement network for coarse-to-fine densesemantic image labeling. arXiv:1806.11266, 2018. 1

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNetclassification with deep convolutional neural networks. InNIPS, 2012. 1, 2

[20] V. A. Lamme and P. R. Roelfsema. The distinct modes of vi-sion offered by feedforward and recurrent processing. Trendsin neurosciences, 2000. 2

[21] K. Li, B. Hariharan, and J. Malik. Iterative instance segmen-tation. In CVPR, 2016. 2

[22] X. Li, Z. Jie, J. Feng, C. Liu, and S. Yan. Learning withrethinking: Recurrently improving convolutional neural net-works through feedback. Pattern Recognition, 2018. 2, 7

[23] M. Liang and X. Hu. Recurrent convolutional neural networkfor object recognition. In CVPR, 2015. 3

[24] M. Liang, X. Hu, and B. Zhang. Convolutional neural net-works with intra-layer recurrent connections for scene label-ing. In NIPS, 2015. 2, 3

[25] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic seg-mentation. In CVPR, 2017. 1, 3

[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutionalnetworks for semantic segmentation. In CVPR, 2015. 1, 2, 7

[27] A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. Ganguli,D. Sussillo, J. J. DiCarlo, and D. L. Yamins. Task-driven convolutional recurrent models of the visual system.arXiv:1807.00053, 2018. 5

[28] H. Noh, S. Hong, and B. Han. Learning deconvolution net-work for semantic segmentation. In ICCV, 2015. 1, 2, 3

[29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. De-Vito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Auto-matic differentiation in pytorch. 2017. 5

[30] Z. Peng, R. Zhang, X. Liang, X. Liu, and L. Lin. Geomet-ric scene parsing with hierarchical lstm. arXiv:1604.01931,2016. 2

[31] P. H. Pinheiro and R. Collobert. Recurrent convolutionalneural networks for scene labeling. In ICML, 2014. 1, 2,3, 7

[32] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollar. Learn-ing to refine object segments. In ECCV, 2016. 3

[33] B. Shuai, Z. Zuo, B. Wang, and G. Wang. Scene segmenta-tion with dag-recurrent neural networks. TPAMI, 2017. 2

[34] K. Simonyan and A. Zisserman. Very deep convolutionalnetworks for large-scale image recognition. In ICLR, 2015.1, 2

[35] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highwaynetworks. arXiv:1505.00387, 2015. 2

[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.Going deeper with convolutions. In CVPR. 2

[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.Going deeper with convolutions. In CVPR, 2015. 1

[38] A. Veit, M. J. Wilber, and S. Belongie. Residual networks be-have like ensembles of relatively shallow networks. In NIPS,2016. 2

[39] F. Yu and V. Koltun. Multi-scale context aggregation by di-lated convolutions. In ICLR, 2016. 3

[40] A. R. Zamir, T.-L. Wu, L. Sun, W. B. Shen, B. E. Shi, J. Ma-lik, and S. Savarese. Feedback networks. In CVPR, 2017. 1,2, 5

[41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid sceneparsing network. In CVPR, 2017. 1, 2

Recurrent Iterative Gating Networks for Semantic Segmentation · VGG [34] do not employ any...

Documents

Transcript of Recurrent Iterative Gating Networks for Semantic Segmentation · VGG [34] do not employ any...