

Feedback based Neural Networks

Te-Lin Wu
Department of Electrical Engineering
Stanford University
[email protected]

Bo-Kui Shen
Department of Computer Science
Stanford University
[email protected]

Joint work with Lin Sun, Dr. Amir R. Zamir, Prof. Silvio Savarese @ CVGL
Department of Computer Science, Stanford University

Abstract

The most successful learning models in deep learning are currently based on the paradigm of successive learning of representations followed by a decision layer. This is mostly actualized by feedforward multilayer neural networks, such as ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach, in which the representation is formed in an iterative manner according to the feedback received from the previous iteration's outcome.

We establish that a feedback based approach has several fundamental advantages over feedforward: it enables making early predictions at query time, its output conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We also investigate several new feedback mechanisms (e.g. skip connections in time) and design choices (e.g. feedback length). We hope this study offers new perspectives for investigating the feedback paradigm, in quest of more natural and practical learning models.

1. Introduction

A feedback based prediction is defined as approximating an ultimate outcome in an iterative manner where each iteration's operation is based upon the present outcome. Feedback is a common and powerful way of making predictions in various fields, ranging from control theory to psychology [27, 35, 2]. Utilizing feedback connections is also heavily exercised by biological organisms and the brain [16, 36, 7, 28], suggesting a core role for it in complex cognition. In this paper, we establish that a feedback based learning approach has several core advantages over the commonly employed feedforward paradigm, making it

[Figure 1: the model f is unrolled over iterations, producing o_1 = f(x), o_2 = f(x, o_1), o_3 = f(x, o_2) between t = T/3 and t = T; a taxonomy branch (Object → Vehicle → {Wheeled Vehicle → {Car, Bicycle}, Vessel → Canoe}) illustrates the coarse-to-fine output.]

Figure 1. A feedback based learning model. The basic idea is to make predictions in an iterative manner based on a notion of the so-far outcome. This provides several core advantages: I. enabling early predictions (given total inference time T, early predictions are made in fractions of T); II. naturally conforming to a taxonomy in the output space (with each iteration, a narrower part of the taxonomy, shown in red, is explored); and III. better grounds for curriculum based learning.

a worthwhile alternative. These advantages (elaborated below) are mostly attributed to the fact that the final prediction is made in an episodic, rather than one-time, manner, along with a direct notion of the thus-far output.
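This episodic scheme can be sketched in a few lines. The function f below is a hypothetical stand-in for the shared recurrent operation, not the paper's actual network:

```python
def feedback_predict(f, x, o0, iterations):
    """Iterative prediction: each step refines the previous outcome,
    i.e. o_1 = f(x, o_0), o_2 = f(x, o_1), ... as in Fig. 1."""
    outcome, outcomes = o0, []
    for _ in range(iterations):
        outcome = f(x, outcome)   # the same shared f at every iteration
        outcomes.append(outcome)  # each entry is an early prediction
    return outcomes

# Toy f: move the running estimate halfway toward a target encoded in x.
early = feedback_predict(lambda x, o: (x + o) / 2.0, 8.0, 0.0, 4)
# early predictions coarsely approximate the final one: [4.0, 6.0, 7.0, 7.5]
```

The key property this toy loop shares with the paper's model is that every intermediate outcome is a usable (if coarse) answer, available before the final iteration finishes.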

Early Predictions: The first advantage is providing early predictions, meaning that coarse estimations of the output can be provided in a fraction of the total inference time. This is shown in Fig. 1. This property is a result of episodic inference and is in contrast to feedforward, where a one-time output is provided only when the signal reaches the end of the network. This is of particular importance in practical scenarios, such as autonomous driving or robotics; for instance, imagine a self driving car that receives a cautionary heads up about possibly approaching a pedestrian on a


highway, without needing to wait for the final definite output. Such scenarios are abundant in practice, where time is usually crucial and limited computational resources can be reallocated on the fly based on early predictions, given a proper uncertainty measure such as Minimum Bayes Risk [26].

Taxonomy Compliance: Another advantage is making predictions that naturally conform to a hierarchical structure in the output space, e.g. a taxonomy. The early predictions can be expected to conform to a coarse classification, while the later iterations take the role of further decomposing the coarse class into finer classes. This is illustrated in the lower part of Fig. 1. This is again due to the fact that the predictions happen in an episodic manner, coupled with a coarse-to-fine representation that is naturally developed as a result of the feedback based training. Though weaker forms of this property can be achieved via feedforward as well, feedback provides a more natural basis for gaining it, even when the network is not specifically trained using a taxonomy.

Curriculum Learning: The previous advantage is closely related to the concept of Curriculum Learning [4], where gradually increasing the complexity of the learning task leads to training a better learner (as compared to no ordering based on complexity). Curriculum Learning is an important mechanism routinely employed both in human education and in machine learning systems [11, 4, 25]. For non-convex training criteria (such as CNNs), a curriculum is known to assist with finding better minima; in convex cases, a curriculum improves the convergence speed [4]. This is due to smoothing the objective criterion through the curriculum to guide the optimization towards better minima.

Since the prediction in a feedforward network happens in a one-time manner, the only opportunity for enforcing a curriculum is presenting the training data to the same full network ordered based on complexity (i.e. the first epochs formed of easy examples, and the later ones of hard examples). In contrast, the predictions in a feedback based model happen in an episodic form, and this enables enforcing a curriculum through the episodes of prediction for one query. In other words, sequential easy-to-hard predictions can be enforced for one datapoint (e.g., training the early episodes to predict the type of animal and the later episodes to predict the particular breed) without the need for any ordering of the training datapoints. Therefore, any taxonomy can be immediately used as a curriculum strategy.

The above fundamental advantages, along with several others, such as robustness to local confusions [6, 29], make feedback an attractive formulation for learning, regardless of the endpoint performance. In our context, we define feedback based prediction as a recurrent (weight) shared operation, where at each iteration, the problem output is generated and passed onto the next iteration through a hidden state. The next iteration then updates the generated output

using the shared operation and the received hidden state. We put forth a generic architecture for such networks and experimentally demonstrate the aforementioned advantages on various datasets. We also investigate various design choices in making such feedback networks (such as the feedback length, or skip connections in time to ease the flow of the signal). We show that a feedback based approach is applicable not just to temporal tasks: static problems, e.g. object classification, can also utilize its advantages. Though we show that the feedback approach achieves pleasing final results (on par with or better than feedforward), the primary goal of this paper is to establish the aforementioned conceptual properties, rather than optimizing for endpoint performance on any benchmark.

2. Related Work

Conventional feedforward networks, such as AlexNet [24] or VGG [37], do not employ recurrent or feedback-like mechanisms. Recently, several successful efforts introduced recurrence or recurrence-inspired mechanisms into feedforward models: ResNet [13] is one nominal example, introducing parallel residual connections, as well as hypernetworks [12], highway networks [40], stochastic depth [18], RCNN [30], GoogLeNet [41], etc. It is crucial to note that these methods are still indeed feedforward, as the introduction of recurrence does not necessarily imply feedback; the iterative injection of a notion of the output is essential for forming a proper feedback (precisely, the loss should be connected to each iteration, rather than to the last iteration only). In other words, such methods are essentially feedforward networks when rolled out in time. This is a primary difference between our approach and many existing works. We show that the feedback mechanism, besides the recurrence, is indeed critical for achieving the discussed advantages.

Another group of methods, mostly in the sequence modeling literature, developed feedback-like mechanisms in order to control the spatial selectivity of a model [44, 5, 33]. This is usually used for better modeling of long term dependencies, computational efficiency, or spatial localization. This category of methods is different from ours, as altering spatial selectivity does not have a clear contribution to many static tasks or to the discussed general advantages of feedback.

A few recent encouraging methods explicitly employed feedback connections [6, 3, 43, 29, 31, 20] with promising results for the task of interest. The majority of these methods are either task specific and/or attempt temporal modeling. They also do not put forth and investigate the core advantages of a general feedback based inference. We should also emphasize that our feedback is always within the representation space. This allows us to develop generic feedback based architectures without the requirement of developing task specific error-to-input functions [6].


[Figure 2: a stack of ConvLSTM modules unrolled in time, with losses L_0, L_1, ..., L_t connected to every iteration and a skip connection in time.]

Figure 2. Illustration of our core feedback model and skip connection when unrolled in time. The blue and green boxes represent convolutional operations and outputs, respectively.

Lastly, it is worth noting that curriculum learning [11, 25, 4] and predictions on a taxonomy [17, 39, 8, 10, 21] are also well investigated in the literature, though none of these works provide a feedback centric perspective, which is our focus.

3. Feedback Networks

Feedback based prediction has two requirements: (1) iterativeness and (2) having a direct notion of the posterior (output) in each iteration. We realize this by employing a recurrent neural network model and connecting the loss to each iteration (depicted in Fig. 2). In high level terms, the image undergoes a convolutional operation repeatedly and an output is generated each time; the recurrent convolutional operations are trained to produce the best output at each iteration, converging to a maximum in the end.

3.1. Convolutional LSTM Formulation

In this section, we share the details of our feedback model, which is based on stacking ConvLSTM [43] modules; a ConvLSTM essentially replaces the operations in an LSTM [15] cell with convolutional structures. An LSTM cell uses hidden states to pass information through the spatial and temporal directions. In the following paragraphs, we briefly describe the gate connections between stacked ConvLSTMs. We parametrize the temporal order with time t = 0, 1, ..., T and the spatial order with depth d = 0, 1, ..., D.

At depth d and time t, the output of a ConvLSTM cell is based on the spatial input (X_t^{d-1}), the temporal hidden state input (H_{t-1}^d), and the temporal cell gate input (C_{t-1}^d).

To compute the output of a ConvLSTM cell, the input gate i_t and forget gate f_t are used to control the information passing between hidden states, as described below:

i_t = σ(W_{d,xi}(X_t^{d-1}) + W_{d,hi}(H_{t-1}^d)),
f_t = σ(W_{d,xf}(X_t^{d-1}) + W_{d,hf}(H_{t-1}^d)).    (1)

Here W is a list of feedforward convolutional operations applied to X and H; it is parametrized by d but not t, since the weights of the convolutional filters are shared along the temporal dimension. Notice that which operations W contains is a design choice; the exact length of the list, in other words the physical depth of a ConvLSTM cell, is discussed in Sec. 3.2.

The cell gate C_t^d is computed as follows:

C̃_t = tanh(W_{d,xc}(X_t^{d-1}) + W_{d,hc}(H_{t-1}^d)),
C_t^d = f_t ◦ C_{t-1}^d + i_t ◦ C̃_t.    (2)

Finally, the hidden state H_t^d and output X_t^d are updated according to the output gate o_t and cell state C_t^d:

o_t = σ(W_{d,xo}(X_t^{d-1}) + W_{d,ho}(H_{t-1}^d)),
H_t^d = o_t ◦ tanh(C_t^d),
X_t^d = H_t^d,    (3)

where ‘◦’ denotes the Hadamard product. We also apply batch normalization [19] to each convolutional operation.

For every iteration, the loss is connected to the output of the last ConvLSTM cell in physical depth. Here, the post-processing of the ConvLSTM output (pooling, fully connected layer, etc.) is ignored for the sake of simplicity. L_t is the cross entropy loss at time t, while C denotes the correct target class number:

L_t = -log( e^{H_t^D[C]} / Σ_j e^{H_t^D[j]} ).    (4)

Connecting the loss to all iterations forces the network to tackle the entire task at each iteration; therefore, it cannot adopt a representation scheme like feedforward networks, which go from low level representations to high level representations as the depth increases (in the case of the feedback net, throughout different iterations), since low level representations are not sufficient for performing the whole classification task. This leads the network to form its representation across iterations in a coarse to fine manner (further discussed in Sec. 4.2.2).
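As a concrete sketch of Eq. (4) and of connecting the loss to every iteration, assuming the per-iteration final hidden states have already been mapped to class logits (NumPy, illustrative only):

```python
import numpy as np

def iteration_loss(logits, target):
    """Eq. (4): cross entropy of the softmax over one iteration's output."""
    z = logits - logits.max()               # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def feedback_loss(logits_per_iteration, target):
    """Total training loss: the loss is connected to every iteration,
    not only the last one, so each iteration must attempt the full task."""
    return sum(iteration_loss(h, target) for h in logits_per_iteration)

logits = [np.array([0.5, 0.1]), np.array([2.0, 0.2])]  # iterations t = 1, 2
total = feedback_loss(logits, target=0)
```

Dropping all but the last term of `feedback_loss` recovers the recurrently-feedforward variant discussed in Sec. 4.2.2.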

We initialize all X_t^0 as the image input inp, and all H_0^d as 0: ∀t ∈ {1, 2, · · · , T}, X_t^0 := inp, and ∀d ∈ {1, 2, · · · , D}, H_0^d := 0. The operation of the ConvLSTM cell above can be referred to by the simplified notation F(X_t^{d-1}, H_{t-1}^d).

3.2. Feedback Module Length

The feedback architecture design is flexible in the number of feedforward units within a feedback module and in how many such modules to stack. We can categorize our feedback networks according to the number of feedforward units (Conv + BN) within the ConvLSTM recurrent feedback module, i.e. the length of the feedback. This is shown in Fig. 3, where (a), (b), and (c) are named Stack-1, Stack-2, and Stack-All. For a Stack-i feedback module, i feedforward units are stacked within the ConvLSTM. Furthermore, such feedback modules can be concatenated to increase the


[Figure 3: three feedback module variants built from Conv + BatchNorm (+ ReLU) units with pooling, FC, and loss heads; the feedback connection backpropagates the output loss at each time step.]

Figure 3. Feedback networks with different feedback lengths.

physical depth D. Lastly, the loss is connected to each iteration's output; this design trains the network to attempt the task even at the first iteration and to pass the feedback message through the hidden output to future iterations. Which feedback length to pick is a design choice; we provide an empirical study of this in Sec. 4.2.1.

3.3. Temporal Skip Connection

In order to regulate the flow of signal through the network, we apply an identity skip connection to the feedback network. This was inspired by conceptually similar mechanisms, such as the residual connection of ResNet [13] and the recurrent skip coefficients in [45]. The skip connections adopted in our feedback model can be mathematically formulated as follows: the new input at time t is X̂_t^d = X_t^d + H_{t-n}^d, and the final representation is F(X̂_t^d, H_{t-n}^d, H_{t-1}^d). The skip connection is shown in Fig. 2, denoted by the red dashed line. We set n = 2 in our experiments.
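In code, the skip connection amounts to one addition before the cell is applied. A sketch, where `h_hist` is a hypothetical per-depth list of past hidden states indexed by time:

```python
import numpy as np

def skip_input(x_t, h_hist, t, n=2):
    """X_hat_t^d = X_t^d + H_{t-n}^d; before time n there is no state
    to skip from, so the input is left unchanged."""
    if t < n:
        return x_t
    return x_t + h_hist[t - n]

h_hist = [np.full(3, 2.0), np.full(3, 5.0)]  # H_0^d, H_1^d
x_hat = skip_input(np.ones(3), h_hist, t=2)  # adds H_0^d
```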

Besides the regularization effect, Table 1 quantifies the improvement made by such skip connections on CIFAR100 [23], using the Stack-2 architecture with physical depth 4, virtual depth 32, and 8 iterations.

Feedback Net             Top1    Top5
w/o skip connections     67.37   89.97
w/ skip connections      67.83   90.12

Table 1. Impact of skip connections in time on CIFAR100 [23].

3.4. Taxonomy Based Prediction

It is of particular practical value if the predictions of a model conform to a taxonomy; that is, making a correct coarse prediction about a query when a correct fine prediction cannot be made. Given a taxonomy on the labels (e.g., the ImageNet/CIFAR100 taxonomy), we can examine a network's capacity to make taxonomy based predictions from the fine classes' softmax distribution. For each fine class c_i, the probability of a data entry being in that class, given the weights W, is defined by the softmax as P(y_{c_i}|x; W) = e^{f_{y_{c_i}}} / Σ_j e^{f_j}. The probability of a data entry being in a higher level coarse class consisting of C_i = {c_1, c_2, ..., c_n} is thus the sum of the probabilities of the data entry being in each of its fine classes:

P(y_{C_i}|x; W) = Σ_{k∈1:n} P(y_{c_k}|x; W) = Σ_{k∈1:n} e^{f_{y_{c_k}}} / Σ_j e^{f_j}.    (5)

Therefore, we use a mapping matrix M, with element (i, j) equal to 1 if c_i ∈ C_j, to transform the fine class distribution into the coarse class distribution, which also gives us the loss for coarse prediction, L_Coarse; thus, a coarse prediction p_c is obtained through the fine prediction p_f. In Sec. 4.2.4, it will be shown that the outputs of the feedback network conform to a taxonomy, especially in early predictions.
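The fine-to-coarse transformation of Eq. (5) is a single matrix product with M. A small hypothetical taxonomy of five fine classes under two coarse classes:

```python
import numpy as np

# M[i, j] = 1 iff fine class c_i belongs to coarse class C_j
# (a made-up 5-fine / 2-coarse taxonomy for illustration).
M = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1],
              [0, 1]], dtype=float)

def coarse_distribution(p_fine, M):
    """Eq. (5): a coarse class's probability is the sum of the softmax
    probabilities of the fine classes it contains."""
    return p_fine @ M

p_fine = np.array([0.5, 0.2, 0.1, 0.1, 0.1])  # softmax over fine classes
p_coarse = coarse_distribution(p_fine, M)      # -> [0.7, 0.3]
```

Because each fine class belongs to exactly one coarse class, the coarse distribution sums to 1 whenever the fine one does.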

3.5. Curriculum Learning

As discussed in Sec. 1, feedback networks provide a new way of enforcing a curriculum in learning, and enable using a taxonomy as a curriculum strategy. We adopt a time varying loss to enforce the curriculum. We use an annealed version of a multi-task loss function at each time step for our N-iteration feedback networks, where the relationship of the coarse class loss L_Coarse(t) and the fine class loss L_Fine(t), parametrized by time t, is formulated as:

L(t) = γ L_Coarse(t) + (1 − γ) L_Fine(t),    (6)

where γ is the weight balancing the coarse and fine category contributions. Without loss of generality, we adopt a linear schedule γ = t/k, where t = 0, 1, 2, ..., k and k is the iteration at which we stop using the multi-task loss.

For object classification, the time varying loss function encourages the network to recognize objects from coarse to fine categories. In other words, the network learns from the root of a taxonomy tree to its leaves. In Sec. 4.2.5 it will be empirically shown that the feedback based approach utilizes this curriculum strategy well, unlike feedforward.
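Eq. (6) and the linear schedule reduce to a few lines; the losses here are scalar placeholders for L_Coarse(t) and L_Fine(t), and the schedule is the one as printed (γ = t/k, clamped after iteration k):

```python
def gamma_schedule(t, k):
    """Linear schedule of Sec. 3.5: gamma = t/k for t = 0..k,
    clamped once the multi-task phase ends at iteration k."""
    return min(t, k) / k

def curriculum_loss(l_coarse, l_fine, t, k):
    """Eq. (6): L(t) = gamma * L_Coarse(t) + (1 - gamma) * L_Fine(t)."""
    gamma = gamma_schedule(t, k)
    return gamma * l_coarse + (1.0 - gamma) * l_fine
```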

3.6. Analysis of Computation Graph Depth

Under proper hardware, the feedback model also has a speed advantage over the feedforward model, considering the difference between the depths of their computation graphs.¹ This is because the feedback network, with a shallower computation graph, is a better fit for parallelism than the feedforward network. In the following paragraphs, we discuss the computation graph depth of a feedforward model with depth D and that of a feedback model with the same virtual depth (consisting of m temporal iterations and physical depth n, D = m × n). Feedforward computation has the limitation that the representation X at depth i depends on the previous representation at depth i − 1, creating a serial process (X_i depends only on X_{i−1}).

¹The number of layers, number of parameters, etc. are not correct metrics of an algorithm's run-time.


[Figure 4: the feedforward chain X_1 → X_2 → ... → X_D versus the feedback grid of representations X_i^j up to X_m^n.]

Figure 4. Feedforward computation graph vs. feedback recurrent computation graph; the feedback connection is hidden for graph simplicity.

In our recurrent model, the representation X_i^j at temporal iteration i and physical depth j depends on two representations: that of the previous iteration, X_{i−1}^j (for stacked ConvLSTM, H_t^d = X_t^d, per Eq. 3), and that of the previous depth, X_i^{j−1}:

X_i^j = F(X_{i−1}^j, X_i^{j−1}).

The resulting computation graphs of feedforward and feedback are shown in Fig. 4; notice that although both graphs have the same node count, they have different depths (longest directed path in the graph): the feedforward depth is d_ff = D − 1 = mn − 1, while the feedback network has depth d_fb = m + n − 1. In a proper hardware scenario where one can do parallel computations to a reasonable extent, inference time can be well measured by the longest distance from the root to the target (the same as the graph's depth). An example of the feedback network's computation is shown in Fig. 5. Notice that for our computation, we have max(m, n) parallel computations at each time step. Therefore, the total prediction time of the feedforward network is larger than the feedback network's, since d_ff = mn − 1 > m + n − 1 = d_fb.

The time for an early iteration's prediction can also be measured by the longest distance from the root to the output at the k-th iteration. For the feedback network, this distance is d_fb_k = n + k − 1, while for the feedforward network it is d_ff_k = nk − 1. We have d_ff_k = nk − 1 > n + k − 1 = d_fb_k.

It is worth noting that the practical run-time of an algorithm depends on various factors, such as implementation and hardware. The purpose of the above analysis of the computation graph's depth is to exclude the impact of those factors and quantify the true potential of each method. The above derivation is applicable to training time as well, given proper hardware.

4. Experimental Results

Our experimental evaluations, performed on the three benchmarks of CIFAR100 [23], Stanford Cars [22], and MPII Human Pose [1], are provided in this section.

[Figure 5: the feedback graph laid out by physical depth and temporal iteration, computed in parallel waves at t = 1, ..., 5.]

Figure 5. Illustration of the computation of feedback's forward pass under proper hardware; feedback connections hidden for simplicity.

4.1. Baselines and Terminology

We first define our baselines and terminology.²

Physical Depth: the depth of the convolutional layers from the input layer to the output layer. For the feedback model, this represents the spatial depth, ignoring the temporal dimension.
Virtual Depth: the effective depth considering both spatial and temporal dimensions: physical depth × number of iterations (only applicable to our model).
Baseline Models: We compare with feedforward VGG [37] and feedforward ResNet [13] as two of the most commonly used models with the closest architectural resemblance to our convolutional layers. Both baselines have the same architecture, except for the residual connection. We use the same physical module architecture for our methods and the baselines. We also compare against the original authors' architectures [13] for the baselines. The kernel sizes and the transitions of filter numbers remain the same as the originals.
Auxiliary prediction layer (auxiliary loss): Feedforward baselines do not make episodic or mid-network predictions. However, to have a feedforward based baseline for episodic predictions and to investigate their upper bound performance in this regard, we connect a loss to different depths/layers of the feedforward baselines, along with new pooling and FC layers. This allows us to make predictions using the mid-network feedforward representations. We train these auxiliary layers by taking the fully trained feedforward network and training the auxiliary layers from the shallowest to the deepest layer, while freezing the convolutional weights.

4.2. CIFAR-100 and Analysis

We test our models and perform a comprehensive analysis on the CIFAR100 dataset [23]. CIFAR100 includes 100 classes containing 600 images each. The 100 classes are categorized into 20 superclasses. The original 100 classes are the fine level labels and the 20 superclasses are the coarse ones.

²The following naming convention is used. C(fi, fo, k, s): fi input and fo output convolutional filters, kernel size k × k, stride s, and N layers. ReLU: rectified linear unit. BN: batch normalization. BR = BN + ReLU. Avg(k, s): average pooling with spatial size k × k and stride s. FC(n, m): fully connected layer with n inputs and m outputs.


[Figure 6: Top-1 accuracy (%) versus physical/virtual depth (8 to 32) for the compared models.]

Figure 6. Evaluation of early predictions. Comparison of accuracy at the same virtual depths between the feedback (FB) model and feedforward (FF) baselines (VGG, ResNet, with or without auxiliary loss layers).

[Figure 7: E(N) (%) versus physical/virtual depth (8 to 32) for the compared models.]

Figure 7. Evaluation of taxonomy based prediction for feedback (FB) and feedforward (FF) networks trained with or without auxiliary layers. We use only the fine loss to train, except for the curriculum learned one.

Feedback Type    Top1    Top5
Stack-1          66.29   89.58
Stack-2          67.83   90.12
Stack-All        65.85   89.04

Table 2. Comparison of different feedback module lengths; all models have the same physical depth 4 and virtual depth 16.

4.2.1 Feedback Module Length

We first study the effect of the feedback module length, per the discussion in Sec. 3.2. Table 2 provides the results of this test, where the physical depth and iteration count are kept constant (physical depth 4, 4 iterations) across all models.

The results show that the best performance is achieved when the feedback length is neither too short nor too long. We found this observation to be valid across different tests and architectures, though the exact length may somewhat deviate from 2. In the following experiments, for different physical depths, we optimize the value of this hyperparameter empirically (often 2 or 3).

4.2.2 Early Prediction

We demonstrate the early prediction advantage of the feedback network in this section. We conduct a study at virtual depth 32 (similar results were achieved with other depths) between the feedback and feedforward networks with various designs. As shown in Fig. 6, at the same virtual depths of 8, 12, and 16, the feedback network already achieves satisfactory and increasing accuracies. The solid lines denote the basic feedforward networks with 32 layers; their early predictions are made using their final loss layer, but applied to mid-network representations. The results show that the feedforward network performs poorly on the first few layers' representations, as expected, suggesting that the abstractions learned there are not suitable for completing the output task. With the addition of trained auxiliary layers (dashed curves), the feedforward networks still fall considerably behind the feedback model. These results align with the hypothesis that a deep feedforward network forms its representation from low level features to high level abstractions, while the feedback model forms its representation in a coarse to fine manner (further discussed below). Although it is memory inefficient and wasteful in training, one can also achieve the effect of early prediction through an ensemble of feedforward models run in parallel (i.e., for every depth at which one desires a result, have a feedforward network of that depth). Since running an ensemble of ResNets in parallel has some ideal hardware requirements, we make the following comparison under the analysis of Sec. 3.6. Applying the computation graph depth analysis to our 48 layer virtual depth feedback model (physical depth 12 with 4 iterations), if we denote the time to finish one layer of convolution as T, then we have the i-th iteration's result at t_i = (12 + 3i)T. The first to last iterations' results (virtual depths 12, 24, 36, 48) thus come in at 12T, 15T, 18T, 21T; optimizing an ensemble of ResNets to return results at these time steps, we obtain an ensemble of ResNets with depths 12, 15, 18, 21. The performance comparison between the feedback network and such an ensemble of ResNets is shown in Table 3.

Model              12T     15T     18T     21T
Feedback Net       67.94   70.57   71.09   71.12
ResNet Ensemble    66.35   67.52   67.87   68.20

Table 3. Top1 accuracy comparison between the Feedback Net and an ensemble of ResNets that produce early predictions at the same computation graph depth time steps.

Model                    12      24      36      48
Feedback                 67.94   70.57   71.09   71.12
Feedback Disconnected    36.23   62.14   67.99   71.34

Table 4. The impact of the feedback mechanism on CIFAR100 over four iterations.

Feedback vs. No Feedback: To examine whether the observations so far and the early predictions demonstrated above are caused by the recurrent or the feedback mechanism, we performed a test by disconnecting the loss from all iterations

6

Page 7: Feedback based Neural Networks - Stanford University · rent neural network model and connecting the loss to each iteration (depicted in Fig.2). In high level terms, the im-age undergoes

Model P/D V/D Top1 Top5(%) (%)

48 - 70.04 90.9632 - 69.36 91.07

Feedforward 12 - 66.35 90.02(ResNet[13]) 8 - 64.23 88.95

128* - 70.92 91.28110* - 72.06 92.1264* - 71.01 91.4848* - 70.56 91.6032* - 69.58 91.55

Feedforward 48 - 55.08 82.1(VGG[37]) 32 - 63.56 88.41

12 - 64.65 89.268 - 63.91 88.90

Feedback Net 12 48 71.12 91.518 32 69.57 91.014 16 67.83 90.12

ResNet V2[14] 1001 - 77.29 -SwapOut [38] 32 fat - 77.28 -RCNN [30] 4 fat 16 68.25 -

Highway [40] 19 - 67.76 -Table 5. Endpoint performance comparison of different models onCIFAR-100 dataset; P/D (physical depth), V/D (virtual depth), baselinesdenoted with * are using standard architecture used in the original ResNetpaper.

except the last, thus making the model recurrently feedfor-ward. As shown in Table 4, although making the modelrecurrently feedforward marginally increases the final per-formance, it takes away its ability to make early predictions.
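The arrival-time bookkeeping of Sec. 4.2.2 (t_i = (12 + 3i)T for the 12-layer, 4-iteration model) can be reproduced in a few lines. The helper names below are ours, for illustration only; the constants are hard-coded from that model.

```python
# Arrival times of the feedback network's predictions, following the
# computation graph depth analysis of Sec. 3.6: with physical depth 12
# and skip connections in time, iteration i (i = 0..3) finishes at
# t_i = (12 + 3*i) * T, where T is the time of one conv layer.
PHYSICAL_DEPTH = 12
ITERATIONS = 4

def arrival_times(physical_depth=PHYSICAL_DEPTH, iterations=ITERATIONS):
    """Multiples of T at which each iteration's prediction is ready."""
    return [physical_depth + 3 * i for i in range(iterations)]

def virtual_depths(physical_depth=PHYSICAL_DEPTH, iterations=ITERATIONS):
    """Effective (virtual) depth reached by each iteration."""
    return [physical_depth * (i + 1) for i in range(iterations)]

times = arrival_times()    # [12, 15, 18, 21]  ->  12T, 15T, 18T, 21T
depths = virtual_depths()  # [12, 24, 36, 48]
# A feedforward ensemble matching these time steps needs ResNets of
# depth 12, 15, 18 and 21 -- far shallower than the virtual depths the
# feedback network reaches at the same moments (Table 3).
```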

4.2.3 Endpoint Performance Comparison

We compare various feedback networks with various feedforward models in terms of endpoint performance on the CIFAR-100 dataset. Our model is compared against baselines of the same virtual and physical depths.

Using the same conventions mentioned in the previous section, we define the detailed architectures as below:

(0) Recurrent block: Iterate(fi, fo, k, s, N, t) denotes the recurrent module in the feedback networks, which iterates t times through the feedforward blocks within:
-C(fi, fo, k, s), BR -{C(fo, fo, k, 1), BR}^(N-1)

(1) Pre-process and post-process: across all models we apply the pre-process -C(3, 16, 3, 1), BR and the post-process -Avg(8, 1) -FC(64, 100).

(2) Feedback Net with physical depth 8:
-Iterate(16, 32, 3, 2, 2, 4) -Iterate(32, 32, 3, 1, 2, 4)
-Iterate(32, 64, 3, 2, 2, 4) -Iterate(64, 64, 3, 1, 2, 4)

(3) Feedback Net with physical depth 12:
-Iterate(16, 16, 3, 1, 3, 4) -Iterate(16, 32, 3, 2, 3, 4)
-Iterate(32, 64, 3, 2, 3, 4) -Iterate(64, 64, 3, 1, 3, 4)

(4) Baseline models with physical depth D:
-C(16, 32, 3, 2), BR -{C(32, 32, 3, 1), BR}^(D/2-1)
-C(32, 64, 3, 2), BR -{C(64, 64, 3, 1), BR}^(D/2-1)
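As a sanity check on this notation, the following sketch (a hypothetical layer-counting helper, not code from our implementation) unrolls each Iterate block and confirms that the physical-depth-8 design above reaches virtual depth 32 over its 4 iterations.

```python
def expand_iterate(fi, fo, k, s, n, t):
    """Unroll Iterate(fi, fo, k, s, N, t): the physical block is one
    C(fi, fo, k, s) followed by N-1 copies of C(fo, fo, k, 1); the
    feedback network passes through it t times.  (Layer counting only;
    the actual temporal interleaving across blocks is defined in the
    model itself, not here.)"""
    one_pass = [(fi, fo, k, s)] + [(fo, fo, k, 1)] * (n - 1)
    return one_pass, one_pass * t  # (one pass, unrolled through time)

# Feedback Net with physical depth 8 (item (2) above), t = 4 iterations.
specs = [(16, 32, 3, 2, 2, 4), (32, 32, 3, 1, 2, 4),
         (32, 64, 3, 2, 2, 4), (64, 64, 3, 1, 2, 4)]
expanded = [expand_iterate(*s) for s in specs]
physical_depth = sum(len(p) for p, _ in expanded)  # 8 conv layers
virtual_depth = sum(len(u) for _, u in expanded)   # 32 after unrolling
print(physical_depth, virtual_depth)  # 8 32
```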

Figure 8. Qualitative results of classification on CIFAR-100. Each row shows a query along with its nearest neighbors at different depths (VD = 8 to 32 for the feedback network, D = 8 to 32 for the feedforward ResNet). Orange, blue, and gray represent 'correct fine class', 'correct coarse class but wrong fine class', and 'both incorrect', respectively.

Table 5 shows the overall performance comparison between our models and the baselines. Feedback networks outperform the baselines of the same physical depth and perform on par with baselines of the same virtual depth, showing that the advantages discussed above are not achieved at the expense of endpoint performance. The bottom part of Table 5 lists several recent methods that are not directly comparable to ours, as they use additional mechanisms that we did not implement in our model; we include them for the sake of completeness.

4.2.4 Taxonomy Based Prediction

We measure the capacity E(N) of a network N in making taxonomy based predictions as the probability of making a correct coarse prediction given that it made a wrong fine class prediction; in other words, how effectively it can correct a wrong fine class prediction to the correct coarse class via a taxonomy mapping:

E(N) = P(correct(p_c) | ¬correct(p_f); N). (7)
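Concretely, Eq. 7 can be estimated from a set of fine-class predictions. The sketch below is a minimal implementation; fine_to_coarse is an assumed lookup from fine label to coarse label (e.g., CIFAR-100's 100-to-20 superclass mapping), and the toy mapping in the example is invented for illustration.

```python
def taxonomy_capacity(fine_pred, fine_true, fine_to_coarse):
    """E(N) = P(coarse correct | fine wrong): among samples whose fine
    prediction is wrong, the fraction whose implied coarse class
    (via the taxonomy mapping) is still correct."""
    wrong = [(p, y) for p, y in zip(fine_pred, fine_true) if p != y]
    if not wrong:
        return 0.0
    hits = sum(fine_to_coarse[p] == fine_to_coarse[y] for p, y in wrong)
    return hits / len(wrong)

# Toy example: 4 fine classes mapped onto 2 coarse classes.
f2c = {0: 'A', 1: 'A', 2: 'B', 3: 'B'}
pred = [0, 1, 2, 3, 1]
true = [0, 0, 3, 2, 2]
# Wrong fine predictions: (1,0) coarse A==A hit; (2,3) B==B hit;
# (3,2) B==B hit; (1,2) A vs. B miss  ->  E = 3/4.
print(taxonomy_capacity(pred, true, f2c))  # 0.75
```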

The resulting taxonomy based prediction performance is illustrated in Fig. 7, and qualitative results are shown in Fig. 8. As the figure shows, the feedback network achieves good taxonomy based prediction even at the first iteration (in other words, at a shallow virtual depth), while the feedforward network does not achieve the same level of taxonomy capacity until the very last layers. This again aligns with the hypothesis that the feedback based


Figure 9. Timed-tSNE plots for five random CIFAR-100 classes, for the feedback and feedforward networks, showing how the representation evolves through depth/time for each method (i.e., how a datapoint moves). The brighter the arrow, the earlier the depth/iteration. Vector lengths are halved to avoid cluttering.

approach develops a coarse-to-fine representation, as opposed to the low-level-to-high-abstraction progression of feedforward networks. This is qualitatively observed in both Figures 8 and 9. Unlike the feedforward representation, which disentangles the classes only in the last few layers, the feedback representation is coarsely disentangled early on, and later changes mostly form fine separation regions. (More discussion in the supplementary material.)

4.2.5 Curriculum Learning

Model                            CL  Fine            Coarse
Feedback Net                     N   68.21           79.70
                                 Y   69.57 (+1.34%)  80.81 (+1.11%)
Feedforward ResNet w/ aux loss   N   69.36           80.29
                                 Y   69.24 (-0.12%)  80.20 (-0.09%)
Feedforward ResNet w/o aux loss  N   69.36           80.29
                                 Y   65.69 (-3.67%)  76.94 (-3.35%)
Feedforward VGG w/ aux loss      N   63.56           75.32
                                 Y   64.62 (+1.06%)  77.18 (+1.86%)
Feedforward VGG w/o aux loss     N   63.56           75.32
                                 Y   63.20 (-0.36%)  74.97 (-0.35%)

Table 6. Evaluation of the impact of Curriculum Learning (CL) on CIFAR-100. The feedback network best utilizes the curriculum.

In Table 6, we compare the performance of the networks when trained with the time-varying coarse-to-fine curriculum loss against training with the fine-only loss function. The best performance is achieved by the feedback network trained with the curriculum loss, and the performance boost from curriculum training is also the most significant for the feedback network. This shows that the feedback network is more receptive and adaptive to curriculum training. This finding also suggests that the network descends a taxonomy tree through iterative feedback when trained with a taxonomic curriculum (see the curriculum curve in Fig. 7).
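The time-varying coarse-to-fine loss can be sketched as below. This is an illustrative linear schedule only, not necessarily our exact annealing; curriculum_weights and its per-epoch ramp are assumptions for illustration.

```python
def curriculum_weights(epoch, total_epochs):
    """Linearly shift loss emphasis from the coarse to the fine labels.
    Returns (w_coarse, w_fine) with w_coarse + w_fine == 1."""
    w_fine = min(1.0, epoch / float(total_epochs))
    return 1.0 - w_fine, w_fine

def curriculum_loss(loss_coarse, loss_fine, epoch, total_epochs):
    """Total loss: weighted sum of the coarse and fine cross-entropies."""
    w_c, w_f = curriculum_weights(epoch, total_epochs)
    return w_c * loss_coarse + w_f * loss_fine

# Early training emphasizes the 20 coarse classes; later training
# emphasizes the 100 fine classes.
print(curriculum_weights(0, 60))   # (1.0, 0.0)
print(curriculum_weights(30, 60))  # (0.5, 0.5)
print(curriculum_weights(60, 60))  # (0.0, 1.0)
```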

4.3. Stanford Cars Dataset

In order to verify that curriculum training can further improve the performance of the feedback network, we also conduct the same set of curriculum training experiments on the Stanford Cars dataset [22], which has a hierarchical coarse-fine label structure like CIFAR-100. We provide the full experimental setup in the supplementary material.

Model                  CL  Fine            Coarse
Feedback Net           N   50.33           74.15
                       Y   53.37 (+3.04%)  80.70 (+6.55%)
Feedforward ResNet-24  N   49.09           72.60
                       Y   50.86 (+1.77%)  77.25 (+4.65%)
Feedforward VGG-24     N   41.04           67.65
                       Y   41.87 (+0.83%)  70.23 (+2.58%)

Table 7. Evaluations on the Cars dataset. The CL column denotes whether curriculum learning was employed. All methods have (virtual or physical) depth of 24 and were trained for 80 epochs.

To conduct a comparison study of curriculum learning's effect, we perform all training from scratch, without fine-tuning models pretrained on ImageNet [9] [32] or augmenting the dataset with additional images [42]. To suit the small amount of training data in the Cars dataset, we use shallow models for both the feedforward and feedback networks. The feedforward baseline has a depth of 24 layers, and the feedback network has physical depth 6 and iteration count 4, following the same designs as in Sec. 4.1 and Sec. 4.2.3; details are included in the supplementary material.

For the feedforward baseline (depth 24):
-C(3, 16, 3, 1), BR
-C(16, 32, 3, 2), BR -{C(32, 32, 3, 1), BR}^5
-C(32, 64, 3, 2), BR -{C(64, 64, 3, 1), BR}^5
-C(64, 128, 3, 2), BR -{C(128, 128, 3, 1), BR}^5
-Avg(12, 1) -FC(128, 196)

For the feedback model, we have virtual depth 24:
-C(3, 16, 3, 1), BR
-Iterate(16, 32, 3, 2, 2, 4)
-Iterate(32, 64, 3, 2, 2, 4)
-Iterate(64, 128, 3, 1, 2, 4)
-Avg(12, 1) -FC(128, 196)

The curriculum training is applied to the feedback model using the time-varying coarse-to-fine loss. The coarse loss is achieved through a mapping from the 196 fine car classes to 7 coarse car types. The performance comparison is presented in Table 7.

4.4. Human Pose Estimation

To be comprehensive in our study of feedback's effect on representation learning, we test the effect of feedback on a regression task, MPII, with a different architecture. The MPII Human Pose dataset [1] consists of 40k samples (28k training, 11k testing). Just as we applied feedback to a feedforward VGG-style model in classification, we apply feedback to the state-of-the-art MPII model, Hourglass [34]. We substitute the sequence of ResNet-like convolutional layers in one Hourglass stack with a ConvLSTM, which results in a shallower virtual depth. The performance comparison in Table 8 shows that we outperform the deeper baseline. A more comprehensive comparison and the architecture details are included in the supplementary material.

5. Conclusion

We demonstrated that a feedback based approach has several fundamental advantages over feedforward: early prediction, taxonomy compliance, and easy Curriculum Learning. The feedback networks' final results on different tasks are all on par with or better than existing feedforward alternatives. We hope our study provides the community with new perspectives and motivation to investigate this paradigm.

Method                 P/D  V/D  PCKh
Feedforward-Hourglass  24   -    77.6
Feedback-Hourglass     4    12   82.3

Table 8. Evaluations on the MPII Human Pose dataset.

References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
[2] S. J. Ashford and L. L. Cummings. Feedback as an individual resource: Personal strategies of creating information. Organizational Behavior and Human Performance, 32(3):370–398, 1983.
[3] V. Belagiannis and A. Zisserman. Recurrent human pose estimation. arXiv preprint arXiv:1605.02914, 2016.
[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
[5] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2956–2964, 2015.
[6] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. arXiv preprint arXiv:1507.06550, 2015.
[7] R. M. Cichy, D. Pantazis, and A. Oliva. Resolving human object recognition in space and time. Nature Neuroscience, 17(3):455–462, 2014.
[8] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In European Conference on Computer Vision, pages 48–64. Springer, 2014.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[10] N. Ding, J. Deng, K. P. Murphy, and H. Neven. Probabilistic label relation graphs with Ising models. In Proceedings of the IEEE International Conference on Computer Vision, pages 1161–1169, 2015.
[11] J. L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993.
[12] D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[16] C. S. Holling. Resilience and stability of ecological systems. Annual Review of Ecology and Systematics, pages 1–23, 1973.
[17] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori. Learning structured inference neural networks with label relations. arXiv preprint arXiv:1511.05616, 2015.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[21] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[22] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
[23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[25] K. A. Krueger and P. Dayan. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380–394, 2009.
[26] S. Kumar and W. Byrne. Minimum Bayes-risk decoding for statistical machine translation. Technical report, DTIC Document, 2004.
[27] E. B. Lee and L. Markus. Foundations of optimal control theory. Technical report, DTIC Document, 1967.
[28] T. S. Lee and D. Mumford. Hierarchical Bayesian inference in the visual cortex. JOSA A, 20(7):1434–1448, 2003.
[29] K. Li, B. Hariharan, and J. Malik. Iterative instance segmentation. arXiv preprint arXiv:1511.08498, 2015.
[30] M. Liang and X. Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3367–3375, 2015.
[31] Q. Liao and T. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
[32] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.


[33] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[34] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. arXiv preprint arXiv:1603.06937, 2016.
[35] A. G. Parlos, K. T. Chong, and A. F. Atiya. Application of the recurrent multilayer perceptron in modeling complex process dynamics. IEEE Transactions on Neural Networks, 5(2):255–266, 1994.
[36] N. C. Rust and J. J. DiCarlo. Selectivity and tolerance (invariance) both increase as visual information propagates from cortical area V4 to IT. The Journal of Neuroscience, 30(39):12978–12995, 2010.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] S. Singh, D. Hoiem, and D. Forsyth. Swapout: Learning an ensemble of deep architectures. arXiv preprint arXiv:1605.06465, 2016.
[39] Y. Song, M. Zhao, J. Yagnik, and X. Wu. Taxonomic classification for web-based videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 871–878, 2010.
[40] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[42] S. Xie, T. Yang, X. Wang, and Y. Lin. Hyper-class augmented and regularized deep learning for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2645–2654, 2015.
[43] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[44] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
[45] S. Zhang, Y. Wu, T. Che, Z. Lin, R. Memisevic, R. Salakhutdinov, and Y. Bengio. Architectural complexity measures of recurrent neural networks. arXiv preprint arXiv:1602.08210, 2016.
