
Published as a conference paper at ICLR 2018

ROUTING NETWORKS: ADAPTIVE SELECTION OF NON-LINEAR FUNCTIONS FOR MULTI-TASK LEARNING

Clemens Rosenbaum
College of Information and Computer Sciences
University of Massachusetts Amherst
140 Governors Dr., Amherst, MA 01003
[email protected]

Tim Klinger & Matthew Riemer
IBM Research AI
1101 Kitchawan Rd, Yorktown Heights, NY 10598
{tklinger,mdriemer}@us.ibm.com

ABSTRACT

Multi-task learning (MTL) with neural networks leverages commonalities in tasks to improve performance, but often suffers from task interference which reduces the benefits of transfer. To address this issue we introduce the routing network paradigm, a novel neural network and training algorithm. A routing network is a kind of self-organizing neural network consisting of two components: a router and a set of one or more function blocks. A function block may be any neural network – for example a fully-connected or a convolutional layer. Given an input the router makes a routing decision, choosing a function block to apply and passing the output back to the router recursively, terminating when a fixed recursion depth is reached. In this way the routing network dynamically composes different function blocks for each input. We employ a collaborative multi-agent reinforcement learning (MARL) approach to jointly train the router and function blocks. We evaluate our model against cross-stitch networks and shared-layer baselines on multi-task settings of the MNIST, mini-ImageNet, and CIFAR-100 datasets. Our experiments demonstrate a significant improvement in accuracy, with sharper convergence. In addition, routing networks have nearly constant per-task training cost while cross-stitch networks scale linearly with the number of tasks. On CIFAR-100 (20 tasks) we obtain cross-stitch performance levels with an 85% reduction in training time.

1 INTRODUCTION

Multi-task learning (MTL) is a paradigm in which multiple tasks must be learned simultaneously. Tasks are typically separate prediction problems, each with their own data distribution. In an early formulation of the problem, Caruana (1997) describes the goal of MTL as improving generalization performance by "leveraging the domain-specific information contained in the training signals of related tasks." This means a model must leverage commonalities in the tasks (positive transfer) while minimizing interference (negative transfer). In this paper we propose a new architecture for MTL problems called a routing network, which consists of two trainable components: a router and a set of function blocks. Given an input, the router selects a function block from the set, applies it to the input, and passes the result back to the router, recursively up to a fixed recursion depth. If the router needs fewer iterations it can decide to take a PASS action, which leaves the current state unchanged. Intuitively, the architecture allows the network to dynamically self-organize in response to the input, sharing function blocks for different tasks when positive transfer is possible, and using separate blocks to prevent negative transfer.

The architecture is very general, allowing many possible router implementations. For example, the router can condition its decision on both the current activation and a task label, or on just one or the other. It can also condition on the depth (number of router invocations), filtering the function module choices to allow layering. In addition, it can condition its decision for one instance on what was historically decided for other instances, to encourage re-use of existing functions for improved compression. The function blocks may be simple fully-connected neural network layers or whole networks, as long as the dimensionality of each function block allows composition with the previous function block choice. They needn't even be the same type of layer. Any neural network or part of a network can be "routed" by adding its layers to the set of function blocks, making the architecture applicable to a wide range of problems. Because the routers make a sequence of hard decisions, which are not differentiable, we use reinforcement learning (RL) to train them. We discuss the training algorithm in Section 3.1, but one way we have modeled this as an RL problem is to create a separate RL agent for each task (assuming task labels are available in the dataset). Each such task agent learns its own policy for routing instances of that task through the function blocks.

To evaluate we have created a "routed" version of the convnet used in (Ravi & Larochelle, 2017) and use three image classification datasets adapted for MTL learning: a multi-task MNIST dataset that we created, a mini-ImageNet data split as introduced in (Vinyals et al., 2016), and CIFAR-100 (Krizhevsky, 2009), where each of the 20 label superclasses is treated as a different task.1 We conduct extensive experiments comparing against cross-stitch networks (Misra et al., 2016) and the popular strategy of joint training with layer sharing as described in (Caruana, 1997). Our results indicate a significant improvement in accuracy over these strong baselines with a speedup in convergence and often orders of magnitude improvement in training time over cross-stitch networks.

2 RELATED WORK

Work on multi-task deep learning (Caruana, 1997) traditionally includes significant hand design of neural network architectures, attempting to find the right mix of task-specific and shared parameters. For example, many architectures share low-level features like those learned in shallow layers of deep convolutional networks or word embeddings across tasks and add task-specific architectures in later layers. By contrast, in routing networks, we learn a fully dynamic, compositional model which can adjust its structure differently for each task.

Routing networks share a common goal with techniques for automated selective transfer learning using attention (Rajendran et al., 2017) and learning gating mechanisms between representations (Stollenga et al., 2014), (Misra et al., 2016), (Ruder et al., 2017). In the latter two papers, experiments are performed on just 2 tasks at a time. We consider up to 20 tasks in our experiments and compare directly to (Misra et al., 2016).

Our work is also related to mixtures of experts architectures (Jacobs et al., 1991), (Jordan & Jacobs, 1994) as well as their modern attention-based (Riemer et al., 2016) and sparse (Shazeer et al., 2017) variants. The gating network in a typical mixtures of experts model takes in the input and chooses an appropriate weighting for the output of each expert network. This is generally implemented as a soft mixture decision as opposed to a hard routing decision, allowing the choice to be differentiable. Although the sparse and layer-wise variant presented in (Shazeer et al., 2017) does save some computational burden, the proposed end-to-end differentiable model is only an approximation and doesn't model important effects such as exploration vs. exploitation tradeoffs, despite their impact on the system. Mixtures of experts have recently been considered in the transfer learning setting (Aljundi et al., 2016); however, the decision process is modelled by an autoencoder-reconstruction-error-based heuristic and is not scaled to a large number of tasks.

In the use of dynamic representations, our work is also related to single-task and multi-task models that learn to generate weights for an optimal neural network (Ha et al., 2016), (Ravi & Larochelle, 2017), (Munkhdalai & Yu, 2017). While these models are very powerful, they have trouble scaling to deep models with a large number of parameters (Wichrowska et al., 2017) without tricks to simplify the formulation. In contrast, we demonstrate that routing networks can be applied to create dynamic network architectures for architectures like convnets by routing some of their layers.

Our work extends an emerging line of recent research focused on automated architecture search. In this work, the goal is to reduce the burden on the practitioner by automatically learning black-box algorithms that search for optimal architectures and hyperparameters. These include techniques based on reinforcement learning (Zoph & Le, 2017), (Baker et al., 2017), evolutionary algorithms (Miikkulainen et al., 2017), approximate random simulations (Brock et al., 2017), and adaptive growth (Cortes et al., 2016). To the best of our knowledge we are the first to apply this idea to multi-task learning. Our technique can learn to construct a very general class of architectures without the need for human intervention to manually choose which parameters will be shared and which will be kept task-specific.

1 All dataset splits and the code will be released with the publication of this paper.

Also related to our work is the literature on minimizing computation cost for single-task problems by conditional routing. These include decisions trained with REINFORCE (Denoyer & Gallinari, 2014), (Bengio et al., 2015), (Hamrick et al., 2017), Q-Learning (Liu & Deng, 2017), and actor-critic methods (McGill & Perona, 2017). Our approach differs however in the introduction of several novel elements. Specifically, our work explores the multi-task learning setting, it uses a multi-agent reinforcement learning training algorithm, and it is structured as a recursive decision process.

There is a large body of related work which focuses on continual learning, in which tasks are presented to the network one at a time, potentially over a long period of time. One interesting recent paper in this setting, which also uses the notion of routes ("paths"), but uses evolutionary algorithms instead of RL, is Fernando et al. (2017).

While a routing network is a novel artificial neural network formulation, the high-level idea of task-specific "routing" as a cognitive function is well founded in biological studies and theories of the human brain (Gurney et al., 2001), (Buschman & Miller, 2010), (Stocco et al., 2010).

3 ROUTING NETWORKS

Figure 1: Routing (forward) Example. Given the input (v, t), the router is invoked at depths 1, 2, and 3 (router(v, t, 1), router(v, t, 2), router(v, t, 3)), choosing one function block per depth from {f11, ..., f33}; in the depicted route y = f32(f21(f13(v, t))).

A routing network consists of two components: a router and a set of function blocks, each of which can be any neural network layer. The router is a function which selects from among the function blocks given some input. Routing is the process of iteratively applying the router to select a sequence of function blocks to be composed and applied to the input vector. This process is illustrated in Figure 1. The input to the routing network is an instance to be classified (v, t), where v ∈ R^d is a representation vector of dimension d and t is an integer task identifier. The router is given v, t and a depth (=1), the depth of the recursion, and selects from among the set of function block choices available at depth 1, {f13, f12, f11}, picking f13 (indicated with a dashed line in the figure). f13 is applied to the input (v, t) to produce an output activation. The router again chooses a function block from those available at depth 2 (if the function blocks are of different dimensions then the router is constrained to select dimensionally matched blocks to apply), and so on. Finally the router chooses a function block from the last (classification) layer function block set and produces the classification ŷ.

Algorithm 1: Routing Algorithm
input: x, t, n: x ∈ R^d, d the representation dim; t integer task id; n max depth
output: v, the vector result of applying the composition of the selected functions to the input x

1: v ← x
2: for i in 1 ... n do
3:   a ← router(v, t, i)
4:   if a ≠ PASS then
5:     v ← function_block_a(v)
6: return v

Algorithm 1 gives the routing procedure in detail. The algorithm takes as input a vector x, task label t, and maximum recursion depth n. It iterates n times, choosing a function block on each iteration and applying it to produce an output representation vector. A special PASS action (see Appendix Section 7.2 for details) just skips to the next iteration. Some experiments don't require a task label and in that case we just pass a dummy value. For simplicity we assume the algorithm has access to the router function and function blocks and don't include them explicitly in the input. The router decision function router : R^d × Z+ × Z+ → {1, 2, ..., k, PASS} (for d the input representation dimension and k the number of function blocks) maps the current representation v, task label t ∈ Z+, and current depth i ∈ Z+ to the index of the function block to route next in the ordered set of function blocks.
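For illustration, the loop in Algorithm 1 can be written directly in Python. This is a minimal sketch rather than the authors' implementation; the PASS sentinel value and the router/function-block callables are assumed interfaces.

```python
from typing import Callable, List

PASS = -1  # assumed sentinel index for the PASS action

def route(x, t: int, n: int,
          router: Callable[[object, int, int], int],
          function_blocks: List[Callable]):
    """Algorithm 1 (sketch): iteratively apply router-selected function blocks.

    x: input representation, t: integer task id, n: maximum recursion depth.
    router(v, t, i) returns an index into function_blocks or PASS.
    """
    v = x
    for i in range(1, n + 1):
        a = router(v, t, i)            # routing decision at depth i
        if a != PASS:
            v = function_blocks[a](v)  # compose the chosen block
    return v
```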


If the routing network is run for d invocations then we say it has depth d. For N function blocks, a routing network run to a depth d can select from N^d distinct trainable functions (the paths in the network). Any neural network can be represented as a routing network by adding copies of its layers as routing network function blocks. We can group the function blocks for each network layer and constrain the router to pick from layer 0 function blocks at depth 0, layer 1 blocks at depth 1, and so on. If the number of function blocks differs from layer to layer in the original network, then the router may accommodate this by, for example, maintaining a separate decision function for each depth.

3.1 ROUTER TRAINING USING RL

Algorithm 2: Router-Trainer: Training of a Routing Network.
input: A dataset D of samples (v, t, y), v the input representation, t an integer task label, y a ground-truth target label

1: for each sample s = (v, t, y) ∈ D do
2:   Do a forward pass through the network, applying Algorithm 1 to sample s. Store a trace T = (S, A, R, r_final), where S = sequence of visited states (s_i); A = sequence of actions taken (a_i); R = sequence of immediate action rewards (r_i) for action a_i; and the final reward r_final. The last output is the network's prediction ŷ, and the final reward r_final is +1 if the prediction ŷ is correct; -1 if not.
3:   Compute the loss L(ŷ, y) between prediction ŷ and ground truth y and backpropagate along the function blocks on the selected route to train their parameters.
4:   Use the trace T to train the router using the desired RL training algorithm.

We can view routing as an RL problem in the following way. The states of the MDP are the triples (v, t, i) where v ∈ R^d is a representation vector (initially the input), t is an integer task label for v, and i is the depth (initially 1). The actions are function block choices (and PASS) in {1, ..., k, PASS} for k the number of function blocks. Given a state s = (v, t, i), the router makes a decision about which action to take. For the non-PASS actions, the state is then updated s′ = (v′, t, i + 1) and the process continues. The PASS action produces the same representation vector again but increments the depth, so s′ = (v, t, i + 1). We train the router policy using a variety of RL algorithms and settings which we will describe in detail in the next section.

Regardless of the RL algorithm applied, the router and function blocks are trained jointly. For each instance we route the instance through the network to produce a prediction ŷ. Along the way we record a trace of the states s_i and the actions a_i taken, as well as an immediate reward r_i for action a_i. When the last function block is chosen, we record a final reward which depends on the prediction ŷ and the true label y.

Figure 2: Training (backward) Example. For the route of Figure 1, ŷ = f32(f21(f13(v, t))); the loss L(ŷ, y) is backpropagated through f32, f21, and f13, while the router is trained from the actions a1, a2, a3, their immediate rewards r1, r2, r3, and the final reward r_final.

We train the selected function blocks using SGD/backprop. In the example of Figure 1 this means computing gradients for f32, f21 and f13. We then use the computed trace to train the router using an RL algorithm. The high-level procedure is summarized in Algorithm 2 and illustrated in Figure 2. To keep the presentation uncluttered we assume the RL training algorithm has access to the router function, function blocks, loss function, and any specific hyper-parameters such as the discount rate needed for the training, and don't include them explicitly in the input.
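The procedure of Algorithm 2 can be outlined in code roughly as follows. This is a hedged sketch assuming a PyTorch model; the router.act/router.update hooks, the nested per-layer block lists, and the use of -1 for PASS are our own assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def train_step(x, t, y, function_blocks, router, optimizer, max_depth):
    """One Router-Trainer step (sketch): forward route, backprop blocks, train router.

    function_blocks[depth-1][block] holds the candidate blocks for that depth,
    so max_depth is assumed to equal len(function_blocks).
    """
    v, actions, states = x, [], []
    for i in range(1, max_depth + 1):
        states.append((v.detach(), t, i))
        a = router.act(v.detach(), t, i)       # hard routing decision (assumed method)
        actions.append(a)
        if a != -1:                            # -1 plays the role of PASS here
            v = function_blocks[i - 1][a](v)   # apply the chosen block at depth i

    y_hat = v
    loss = F.cross_entropy(y_hat, y)           # train the selected blocks by backprop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # final reward: +1 for a correct prediction, -1 otherwise (Section 3.1.1)
    correct = (y_hat.argmax(dim=1) == y).float()
    r_final = (2.0 * correct - 1.0).mean().item()
    immediate = [0.0] * len(actions)           # e.g. a collaboration reward per action
    router.update(trace=(states, actions, immediate, r_final))  # assumed RL update hook
    return loss.item(), r_final
```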

3.1.1 REWARD DESIGN

A routing network uses two kinds of rewards: immediate action rewards r_i given in response to an action a_i, and a final reward r_final, given at the end of the routing. The final reward is a function of the network's performance. For the classification problems focused on in this paper, we set it to +1 if the prediction was correct (ŷ = y), and −1 otherwise. For other domains, such as regression domains, the negative loss (−L(ŷ, y)) could be used.

We experimented with an immediate reward that encourages the router to use fewer function blocks when possible. Since the number of function blocks per layer needed to maximize performance is not known ahead of time (we just take it to be the same as the number of tasks), we wanted to see whether we could achieve comparable accuracy while reducing the number of function blocks ever chosen by the router, allowing us to reduce the size of the network after training. We experimented with two such rewards, multiplied by a hyper-parameter ρ ∈ [0, 1]: the average number of times that block was chosen by the router historically, and the average historical probability of the router choosing that block. We found no significant difference between the two approaches and use the average probability in our experiments. We evaluated the effect of ρ on final performance and report the results in Figure 12 in the appendix. We see there that generally ρ = 0.0 (no collaboration reward) or a small value works best and that there is relatively little sensitivity to the choice in this range.
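One plausible reading of the probability-based collaboration reward is sketched below; the running-average bookkeeping and the sign convention (a positive bonus of ρ times the block's average historical selection probability) are assumptions on our part, not a specification from the paper.

```python
import numpy as np

class CollaborationReward:
    """Immediate reward ~ rho * average historical probability of choosing a block."""

    def __init__(self, num_blocks: int, rho: float = 0.3):
        self.rho = rho
        self.avg_prob = np.zeros(num_blocks)  # running average selection probability
        self.count = 0

    def __call__(self, action_probs: np.ndarray, action: int) -> float:
        # update the running average of the router's probability for every block
        self.count += 1
        self.avg_prob += (action_probs - self.avg_prob) / self.count
        # reward re-use of blocks that the router has historically favored
        return float(self.rho * self.avg_prob[action])
```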

3.1.2 RL ALGORITHMS

Figure 3: Task-based routing. 〈value, task〉 is the input consisting of value, the partial evaluation of the previous function block (or input x), and the task label task. αi is a routing agent; αd is a dispatching agent. Panel (a) shows a single routing agent α0; panel (b) shows per-task routing agents α1, α2, ...; panel (c) shows a dispatching agent αd that hands the decision to one of the routing agents α1, α2, ....

To train the router we evaluate both single-agent and multi-agent RL strategies. Figure 3 shows three variations which we consider. In Figure 3(a) there is just a single agent which makes the routing decision. This is trained using either policy-gradient (PG) or Q-Learning. Figure 3(b) shows a multi-agent approach. Here there are a fixed number of agents and a hard rule which assigns the input instance to an agent responsible for routing it. In our experiments we create one agent per task and use the input task label as an index to the agent responsible for routing that instance. Figure 3(c) shows a multi-agent approach in which there is an additional agent, denoted αd and called a dispatching agent, which learns to assign the input to an agent, instead of using a fixed rule. For both of these multi-agent scenarios we additionally experiment with a MARL algorithm called Weighted Policy Learner (WPL).

We experiment with storing the policy both as a table and in the form of an approximator. The tabular representation has the invocation depth as its row dimension and the function block as its column dimension, with the entries containing the probability of choosing a given function block at a given depth. The approximator representation can consist of either one MLP that is passed the depth (represented as a 1-hot vector), or a vector of d MLPs, one for each decision/depth.

Both the Q-Learning and Policy Gradient algorithms are applicable with tabular and approximation-function policy representations. We use REINFORCE (Williams, 1992) to train both the approximation-function and tabular representations. For Q-Learning the table stores the Q-values in the entries. We use vanilla Q-Learning (Watkins, 1989) to train the tabular representation and train the approximators to minimize the ℓ2 norm of the temporal difference error.
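As a concrete illustration of the tabular Q-Learning variant, a per-task router could be implemented as in the following sketch; the table shape (depth × function blocks), the epsilon-greedy exploration, and the hyper-parameter values are assumptions, not settings reported in the paper.

```python
import numpy as np

class TabularQRouter:
    """Per-task tabular Q-Learning router: one (depth x blocks) Q-table."""

    def __init__(self, num_layers: int, num_blocks: int,
                 lr: float = 0.1, gamma: float = 0.99, eps: float = 0.1):
        self.q = np.zeros((num_layers, num_blocks))
        self.lr, self.gamma, self.eps = lr, gamma, eps

    def act(self, depth: int) -> int:
        # epsilon-greedy choice over function blocks at this depth (0-indexed)
        if np.random.rand() < self.eps:
            return np.random.randint(self.q.shape[1])
        return int(np.argmax(self.q[depth]))

    def update(self, depth: int, action: int, reward: float, done: bool):
        # one-step TD(0) backup; the next state is simply the next depth
        target = reward
        if not done:
            target += self.gamma * np.max(self.q[depth + 1])
        td_error = target - self.q[depth, action]
        self.q[depth, action] += self.lr * td_error
```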

Implementing the router decision policy using multiple agents turns the routing problem into a stochastic game, which is a multi-agent extension of an MDP. In stochastic games multiple agents interact in the environment and the expected return for any given policy may change without any action on that agent's part. In this view incompatible agents need to compete for blocks to train, since negative transfer will make collaboration unattractive, while compatible agents can gain by sharing function blocks. The agents' (locally) optimal policies will correspond to the game's Nash equilibrium.2

For routing networks, the environment is non-stationary since the function blocks are being trained as well as the router policy. This makes the training considerably more difficult than in the single-agent (MDP) setting. We have experimented with single-agent policy gradient methods such as REINFORCE but find they are less well adapted to the changing environment and changes in other agents' behavior, which may degrade their performance in this setting.

Algorithm 3: Weighted Policy Learner
input: A trace T = (S, A, R, r_final); n the maximum depth; R̄, the historical average returns (initialized to 0 at the start of training); γ the discount factor; and λ_π the policy learning rate
output: An updated router policy π

1: for each action a_i ∈ A do
2:   Compute the return: R_i ← r_final + Σ_{j=i}^{n} γ^(j−i) r_j
3:   Update the average return: R̄_i ← (1 − λ_π) R̄_i + λ_π R_i
4:   Compute the gradient: ∆(a_i) ← R_i − R̄_i
5:   Update the policy:
6:   if ∆(a_i) < 0 then ∆(a_i) ← ∆(a_i)(1 − π(a_i)) else ∆(a_i) ← ∆(a_i) π(a_i)
7: π ← simplex-projection(π + λ_π ∆)

One MARL algorithm specifically designed to address this problem, and which has also been shown to converge in non-stationary environments, is the Weighted Policy Learner (WPL) algorithm (Abdallah & Lesser, 2006), shown in Algorithm 3. WPL is a PG algorithm designed to dampen oscillation and push the agents to converge more quickly. This is done by scaling the gradient of the expected return for an action a according to the probability of taking that action π(a) (if the gradient is positive) or 1 − π(a) (if the gradient is negative). Intuitively, this has the effect of slowing down the learning rate when the policy is moving away from a Nash equilibrium strategy and increasing it when it approaches one. It is assumed that the historical average return R̄_i for each action a_i is initialized to 0 before the start of training. The function simplex-projection projects the updated policy values to make it a valid probability distribution. The projection is defined as clip(π)/Σ clip(π), where clip(x) = max(0, min(1, x)). The states S in the trace are not used by the WPL algorithm.

Details, including convergence proofs and more examples giving the intuition behind the algorithm, can be found in (Abdallah & Lesser, 2006). A longer explanation of the algorithm can be found in Section 7.4 in the appendix. The WPL update is defined only for the tabular setting. It is future work to adapt it to work with function approximators.
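A tabular rendering of the WPL update in Algorithm 3 might look like the sketch below; it follows the steps as stated above, with a separate policy row per depth, but the variable names and the per-depth table layout are our own assumptions.

```python
import numpy as np

def simplex_projection(pi: np.ndarray) -> np.ndarray:
    """Clip to [0, 1] and renormalize so the row is a valid probability distribution."""
    clipped = np.clip(pi, 0.0, 1.0)
    return clipped / clipped.sum()

def wpl_update(pi, avg_return, trace, gamma=0.99, lr=0.05):
    """One WPL update for a depth-indexed tabular policy.

    pi, avg_return: arrays of shape (max_depth, num_blocks)
    trace: (actions, rewards, r_final) recorded while routing one sample
    """
    actions, rewards, r_final = trace
    n = len(actions)
    for i, a in enumerate(actions):
        # discounted return from step i plus the final reward (Algorithm 3, step 2)
        ret = r_final + sum(gamma ** (j - i) * rewards[j] for j in range(i, n))
        # update the historical average return (step 3)
        avg_return[i, a] = (1 - lr) * avg_return[i, a] + lr * ret
        # gradient and WPL weighting (steps 4-6)
        delta = ret - avg_return[i, a]
        delta *= (1 - pi[i, a]) if delta < 0 else pi[i, a]
        # gradient step on the chosen action, then project back onto the simplex (step 7)
        pi[i] = simplex_projection(pi[i] + lr * delta * (np.arange(pi.shape[1]) == a))
    return pi, avg_return
```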

As we have described it, the training of the router and function blocks is performed independently after computing the loss. We have also experimented with adding the gradients from the router choices ∆(a_i) to those for the function blocks which produce their input. We found no advantage but leave a more thorough investigation for future work.

4 QUANTITATIVE RESULTS

We experiment with three datasets: multi-task versions of MNIST (MNIST-MTL) (Lecun et al., 1998), Mini-ImageNet (MIN-MTL) (Vinyals et al., 2016) as introduced by (Ravi & Larochelle, 2017), and CIFAR-100 (CIFAR-MTL) (Krizhevsky, 2009), where we treat the 20 superclasses as tasks. In the binary MNIST-MTL dataset, the task is to differentiate instances of a given class c from non-instances. We create 10 tasks and for each we use 1k instances of the positive class c and 1k each of the remaining 9 negative classes for a total of 10k instances per task during training, which we then test on 200 samples per task (2k samples in total). MIN-MTL is a smaller version of ImageNet (Deng et al., 2009) which is easier to train in reasonable time periods. For mini-ImageNet we randomly choose 50 labels and create tasks from 10 disjoint random subsets of 5 labels each chosen from these. Each label has 800 training instances and 50 testing instances – so 4k training and 250 testing instances per task. For all 10 tasks we have a total of 40k training instances. Finally, CIFAR-100 has coarse and fine labels for its instances. We follow existing work (Krizhevsky, 2009) creating one task for each of the 20 coarse labels and include 500 instances for each of the corresponding fine labels. There are 20 tasks with a total of 2.5k instances per task; 2.5k for training and 500 for testing. All results are reported on the test set and are averaged over 3 runs. The data are summarized in Table 1.

2 A Nash equilibrium is a set of policies for each agent where each agent's expected return will be lower if that agent unilaterally changes its policy.

Each of these datasets has interesting characteristics which challenge the learning in different ways. CIFAR-MTL is a "natural" dataset whose tasks correspond to human categories. MIN-MTL is randomly generated so will have less task coherence. This makes positive transfer more difficult to achieve and negative transfer more of a problem. And MNIST-MTL, while simple, has the difficult property that the same instance can appear with different labels in different tasks, causing interference. For example, in the "0 vs other digits" task, "0" appears with a positive label but in the "1 vs other digits" task it appears with a negative label.

Dataset      # Training   # Testing
CIFAR-MTL    50k          10k
MIN-MTL      40k          2.5k
MNIST-MTL    100k         2k

Table 1: Dataset training and testing splits

Our experiments are conducted on a convnet architecture (SimpleConvNet) which appeared recently in (Ravi & Larochelle, 2017). This model has 4 convolutional layers, each consisting of a 3x3 convolution and 32 filters, followed by batch normalization and a ReLU. The convolutional layers are followed by 3 fully connected layers, with 128 hidden units each. Our routed version of the network routes the 3 fully connected layers, and for each routed layer we supply one randomly initialized function block per task in the dataset. When we use neural net approximators for the router agents they are always 2-layer MLPs with a hidden dimension of 64. A state (v, t, i) is encoded for input to the approximator by concatenating v with a 1-hot representation of t (if used). That is, encoding(s) = concat(v, one_hot(t)).
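The routed fully connected stack described above could be organized as in this sketch. It is an illustrative PyTorch rendering assuming the stated sizes (128 hidden units, one block per task per routed layer) and is not the authors' released code; the router callable is an assumed interface.

```python
import torch
import torch.nn as nn

class RoutedFCStack(nn.Module):
    """Three routed fully connected layers: one function block per task per layer."""

    def __init__(self, in_dim: int, hidden: int, num_classes: int, num_tasks: int):
        super().__init__()
        dims = [(in_dim, hidden), (hidden, hidden), (hidden, num_classes)]
        # function_blocks[layer][block]: candidate nn.Linear blocks for that depth
        self.function_blocks = nn.ModuleList(
            nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(num_tasks))
            for d_in, d_out in dims
        )

    def forward(self, v: torch.Tensor, t: int, router) -> torch.Tensor:
        # router(v, t, depth) returns the index of the block to apply at this depth
        for depth, blocks in enumerate(self.function_blocks, start=1):
            a = router(v, t, depth)
            v = blocks[a](v)
            if depth < len(self.function_blocks):
                v = torch.relu(v)   # assumed nonlinearity between routed layers
        return v
```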

We did a parameter sweep to find the best learning rate and ρ value for each algorithm on each dataset. We use ρ = 0.0 (no collaboration reward) for CIFAR-MTL and MIN-MTL and ρ = 0.3 for MNIST-MTL. The learning rate is initialized to 10^-2 and annealed by dividing by 10 every 20 epochs. We tried both regular SGD as well as Adam (Kingma & Ba, 2014), but chose SGD as it resulted in marginally better performance. The SimpleConvNet has batch normalization layers but we use no dropout.

For one experiment, we dedicate a special "PASS" action to allow the agents to skip layers during training, which leaves the current state unchanged (routing-all-fc recurrent/+PASS). A detailed description of the PASS action is provided in the Appendix in Section 7.2.

All data are presented in Table 2 in the Appendix.

In the first experiment, shown in Figure 4, we compare different RL training algorithms on CIFAR-MTL. We compare five algorithms: MARL:WPL; a single-agent REINFORCE learner with a separate approximation function per layer; an agent-per-task REINFORCE learner which maintains a separate approximation function for each layer; an agent-per-task Q learner with a separate approximation function per layer; and an agent-per-task Q learner with a separate table for each layer. The best performer is the WPL algorithm, which outperforms the nearest competitor, tabular Q-Learning, by about 4%. We can see that (1) the WPL algorithm works better than a similar vanilla PG, which has trouble learning; (2) having multiple agents works better than having a single agent; and (3) the tabular versions, which just use the task and depth to make their predictions, work better here than the approximation versions, which all use the representation vector in addition to predict the next action.

Figure 4: Influence of the RL algorithm on CIFAR-MTL. Detailed descriptions of the implementation of each approach can be found in the Appendix in Section 7.3.

Figure 5: Comparison of Routing Architectures on CIFAR-MTL. Implementation details of each approach can be found in the Appendix in Section 7.3.

The next experiment compares the best-performing algorithm, WPL, against other routing approaches, including the already introduced REINFORCE: single agent (for which WPL is not applicable). All of these algorithms route the fully connected layers of the SimpleConvNet using the layering approach we discussed earlier. To make the next comparison clear we rename MARL:WPL to routing-all-fc in Figure 5 to reflect the fact that it routes all the fully connected layers of the SimpleConvNet, and rename REINFORCE: single agent to routing-all-fc single agent. We compare against several other approaches. One approach, routing-all-fc-recurrent/+PASS, has the same setup as routing-all-fc, but does not constrain the router to pick only from layer 0 function blocks at depth 0, etc. It is allowed to choose any function block from two of the layers (since the first two routed layers have identical input and output dimensions; the last is the classification layer). Another approach, soft-mixture-fc, is a soft version of the router architecture. This soft version uses the same function blocks as the routed version, but replaces the hard selection with a trained softmax attention (see the discussion below on cross-stitch networks for the details). We also compare against the single-agent architecture shown in Figure 3(a), called routing-all-fc single agent, and the dispatched architecture shown in Figure 3(c), called routing-all-fc dispatched. Neither of these approached the performance of the per-task agents. The best performer by a large margin is routing-all-fc, the fully routed WPL algorithm.

We next compare routing-all-fc on different domains against the cross-stitch networks of Misra et al. (2016) and two challenging baselines: task-specific-1-fc and task-specific-all-fc, described below.

Cross-stitch networks (Misra et al., 2016) are a kind of linear-combination model for multi-task learning. They maintain one model per task with a shared input layer, and "cross stitch" connection layers, which allow sharing between tasks. Instead of selecting a single function block in the next layer to route to, a cross-stitch network routes to all the function blocks simultaneously, with the input for a function block i in layer l given by a linear combination of the activations computed by all the function blocks of layer l−1. That is: input^l_i = Σ_{j=1}^{k} w^l_{ij} v_{l−1,j}, for learned weights w^l_{ij} and layer l−1 activations v_{l−1,j}. For our experiments, we add a cross-stitch layer to each of the routed layers of SimpleConvNet. We additionally compare to a similar "soft routing" version, soft-mixture-fc, in Figure 5. Soft routing uses a softmax to normalize the weights used to combine the activations of previous layers and it shares parameters for a given layer so that w^l_i = w^l_{i′} for all i, i′, l.
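To make the contrast with hard routing concrete, a cross-stitch-style combination layer can be sketched as follows. This is a schematic PyTorch rendering of the linear-combination equation above; the identity-biased weight initialization is our own choice, not the configuration of Misra et al. (2016).

```python
from typing import List

import torch
import torch.nn as nn

class CrossStitch(nn.Module):
    """Linear combination across k parallel function blocks (one per task)."""

    def __init__(self, num_blocks: int):
        super().__init__()
        # w[i, j]: weight of block j's activation in the input to block i
        self.w = nn.Parameter(torch.eye(num_blocks) * 0.9 + 0.1 / num_blocks)

    def forward(self, activations: List[torch.Tensor]) -> List[torch.Tensor]:
        stacked = torch.stack(activations, dim=0)           # (k, batch, dim)
        mixed = torch.einsum('ij,jbd->ibd', self.w, stacked)
        return list(mixed.unbind(dim=0))                    # input_i = sum_j w_ij * v_j
```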

Figure 6: Results on domain CIFAR-MTL

Figure 7: Results on domain MIN-MTL (mini-ImageNet)


The task-specific-1-fc baseline has a separate last fully connected layer for each task and shares the rest of the layers for all tasks. The task-specific-all-fc baseline has a separate set of all the fully connected layers for each task. These baseline architectures allow considerable sharing of parameters but also grant the network private parameters for each task to avoid interference. However, unlike routing networks, the choice of which parameters are shared for which tasks, and which parameters are task-private, is made statically in the architecture, independent of task.

The results are shown in Figures 6, 7, and 8. In each case the routing net routing-all-fc performs consistently better than the cross-stitch networks and the baselines. On CIFAR-MTL, the routing net beats cross-stitch networks by 7% and the next closest baseline, task-specific-1-fc, by 11%. On MIN-MTL, the routing net beats cross-stitch networks by about 2% and the nearest baseline, task-specific-1-fc, by about 6%. We surmise that the results are better on CIFAR-MTL because the task instances have more in common, whereas the MIN-MTL tasks are randomly constructed, making sharing less profitable.

On MNIST-MTL the random baseline is 90%. We experimented with several learning rates but were unable to get the cross-stitch networks to train well here. Routing nets beat the cross-stitch networks by 9% and the nearest baseline (task-specific-all-fc) by 3%. The soft version also had trouble learning on this dataset.

In all these experiments routing makes a significant difference over both cross-stitch networks and the baselines, and we conclude that a dynamic policy which learns the function blocks to compose on a per-task basis yields better accuracy and sharper convergence than simple static sharing baselines or a soft attention approach.

In addition, router training is much faster. On CIFAR-MTL for example, training time on a stable compute cluster was reduced from roughly 38 hours to 5.6, an 85% improvement. We have conducted a set of scaling experiments to compare the training computation of routing networks and cross-stitch networks trained with 2, 3, 5, and 10 function blocks. The results are shown in the appendix in Figure 15. Routing networks consistently perform better than cross-stitch networks and the baselines across all these problems. Adding function blocks has no apparent effect on the computation involved in training routing networks on a dataset of a given size. On the other hand, cross-stitch networks have a soft routing policy that scales computation linearly with the number of function blocks. Because the soft policy backpropagates through all function blocks and the hard routing policy only backpropagates through the selected block, the hard policy can much more easily scale to many-task learning scenarios that require many diverse types of functional primitives.

To explore why the multi-agent approach seems to do better than the single-agent, we manually compared their policy dynamics for several CIFAR-MTL examples. For these experiments ρ = 0.0, so there is no collaboration reward which might encourage less diversity in the agent choices. In the cases we examined we found that the single agent often chose just 1 or 2 function blocks at each depth, and then routed all tasks to those. We suspect that there is simply too little signal available to the agent in the early, random stages, and once a bias is established its decisions suffer from a lack of diversity.

Figure 8: Results on domain MNIST-MTL

The routing network on the other hand learns a policy which, unlike the baseline static models, partitions the network quite differently for each task, and also achieves considerable diversity in its choices, as can be seen in Figure 11. This figure shows the routing decisions made over the whole MNIST-MTL dataset. Each task is labeled at the top and the decisions for each of the three routed layers are shown below. We believe that because the routing network has separate policies for each task, it is less sensitive to a bias for one or two function blocks and each agent learns more independently what works for its assigned task.


Figure 9: The Policies of all Agents for the first function block layer for the first 100 samples of each task of MNIST-MTL

Figure 10: The Probabilities of all Agents of taking Block 7 for the first 100 samples of each task (totalling 1000 samples) of MNIST-MTL

5 QUALITATIVE RESULTS

To better understand the agent interaction we have created several views of the policy dynamics. First, in Figure 9, we chart the policy over time for the first decision. Each rectangle labeled Ti on the left represents the evolution of the agent's policy for that task. For each task, the horizontal axis is the number of samples per task and the vertical axis is actions (decisions). Each vertical slice shows the probability distribution over actions after having seen that many samples of its task, with darker shades indicating higher probability. From this picture we can see that, in the beginning, all task agents have high entropy. As more samples are processed each agent develops several candidate function blocks to use for its task, but eventually all agents converge to close to 100% probability for one particular block. In the language of games, the agents find a pure strategy for routing.

Figure 11: An actual routing map for MNIST-MTL (tasks T1-T10 at the top; one row of selected function blocks per routed layer).

In the next view of the dynamics, we pick one particular function block (block 7) and plot the probability, for each agent, of choosing that block over time. The horizontal axis is time (sample) and the vertical axis is the probability of choosing block 7. Each colored curve corresponds to a different task agent. Here we can see that there is considerable oscillation over time until two agents, pink and green, emerge as the "victors" for the use of block 7 and each assigns close to 100% probability for choosing it in routing their respective tasks. It is interesting to see that the eventual winners, pink and green, also show a strong interest in block 7 early on. We have noticed this pattern in the analysis of other blocks and speculate that the agents who want to use the block are being pulled away from their early Nash equilibrium as other agents try to train the block away.

Finally, in Figure 11 we show a map of the routing for MNIST-MTL. Here tasks are at the top and each layer below represents one routing decision. Conventional wisdom has it that networks will benefit from sharing early, using the first layers for common representations, diverging later to accommodate differences in the tasks. This is the setup for our baselines. It is interesting to see that this is not what the network learns on its own. Here we see that the agents have converged on a strategy which first uses 7 function blocks, then compresses to just 4, then again expands to use 5. It is not clear if this is an optimal strategy but it certainly does give an improvement over the static baselines.

6 FUTURE WORK

We have presented a general architecture for routing and a multi-agent router training algorithm which performs significantly better than cross-stitch networks, baselines, and other single-agent approaches. The paradigm can easily be applied to a state-of-the-art network to allow it to learn to dynamically adjust its representations.

As described in the section on Routing Networks, the state space to be learned grows exponentially with the depth of the routing, making it challenging to scale the routing to deeper networks in their entirety. It would be interesting to try hierarchical RL techniques (Barto & Mahadevan, 2003) here.

Our most successful experiments have used the multi-agent architecture with one agent per task, trained with the Weighted Policy Learner algorithm (Algorithm 3). Currently this approach is tabular but we are investigating ways to adapt it to use neural net approximators.

We have also tried routing networks in an online setting, training over a sequence of tasks for few-shot learning. To handle the iterative addition of new tasks we add a new routing agent for each and overfit it on the few-shot examples while training the function modules with a very slow learning rate. Our results so far have been mixed, but this is a very useful setting and we plan to return to this problem.

REFERENCES

Sherief Abdallah and Victor Lesser. Learning the task allocation game. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, pp. 850–857. ACM, 2006. URL http://dl.acm.org/citation.cfm?id=1160786.

Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. arXiv preprint arXiv:1611.06194, 2016.

Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. ICLR, 2017.

Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003. URL http://link.springer.com/article/10.1023/A:1025696116075.

Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. CoRR, abs/1511.06297, 2015. URL http://arxiv.org/abs/1511.06297.

Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. SMASH: one-shot model architecture search through hypernetworks. CoRR, abs/1708.05344, 2017. URL http://arxiv.org/abs/1708.05344.

Timothy J Buschman and Earl K Miller. Shifting the spotlight of attention: evidence for discrete computations in cognition. Frontiers in human neuroscience, 4, 2010.

Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, Jul 1997. ISSN 1573-0565. doi: 10.1023/A:1007379606734. URL https://doi.org/10.1023/A:1007379606734.

Corinna Cortes, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Adanet: Adaptive structural learning of artificial neural networks. arXiv preprint arXiv:1607.01097, 2016.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

Ludovic Denoyer and Patrick Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014.

Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. CoRR, abs/1701.08734, 2017. URL http://arxiv.org/abs/1701.08734.

Kevin Gurney, Tony J Prescott, and Peter Redgrave. A computational model of action selection in the basal ganglia. I. A new functional anatomy. Biological cybernetics, 84(6):401–410, 2001.

David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

Jessica B Hamrick, Andrew J Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W Battaglia. Metacontrol for adaptive imagination-based optimization. ICLR, 2017.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.

Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural computation, 6(2):181–214, 1994.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.

Lanlan Liu and Jia Deng. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. arXiv preprint arXiv:1701.00299, 2017.

Mason McGill and Pietro Perona. Deciding how to decide: Dynamic routing in artificial neural networks. International Conference on Machine Learning, 2017.

Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.

Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003, 2016.

Tsendsuren Munkhdalai and Hong Yu. Meta networks. International Conference on Machine Learning, 2017.

Janarthanan Rajendran, P. Prasanna, Balaraman Ravindran, and Mitesh M. Khapra. ADAAPT: attend, adapt, and transfer: Attentative deep architecture for adaptive policy transfer from multiple sources in the same domain. ICLR, abs/1510.02879, 2017. URL http://arxiv.org/abs/1510.02879.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. ICLR, 2017.

Matthew Riemer, Aditya Vempaty, Flavio Calmon, Fenno Heath, Richard Hull, and Elham Khabiri. Correcting forecasts with multifactor neural attention. In International Conference on Machine Learning, pp. 3010–3019, 2016.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142, 2017.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR, 2017.

Andrea Stocco, Christian Lebiere, and John R Anderson. Conditional routing of information to the cortex: A model of the basal ganglia's role in cognitive coordination. Psychological review, 117(2):541, 2010.

Marijn F Stollenga, Jonathan Masci, Faustino Gomez, and Juergen Schmidhuber. Deep networks with internal selective attention through feedback connections. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 3545–3553. Curran Associates, Inc., 2014.

Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. CoRR, abs/1606.04080, 2016. URL http://arxiv.org/abs/1606.04080.

Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989.

Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. arXiv preprint arXiv:1703.04813, 2017.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992. ISSN 0885-6125.

Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. ICLR, 2017.


7 APPENDIX

7.1 IMPACT OF RHO

Figure 12: Influence of the “collaboration reward” ρ on CIFAR-MTL. The architecture is routing-all-fc with WPL routing agents.

Figure 13: Comparison of per-task training cost for cross-stitch and routing networks. We add a function block per task and normalize the training time per epoch by dividing by the number of tasks to isolate the effect of adding function blocks on computation.

7.2 THE PASS ACTION

When routing networks, some resulting sets of function blocks can be applied repeatedly. While there might be other constraints, the prevalent one is dimensionality: input and output dimensions need to match. Applied to the SimpleConvNet architecture used throughout the paper, this means that of the fc layers (convolution → 48), (48 → 48), and (48 → #classes), the middle transformation can be applied an arbitrary number of times. In this case, the routing network becomes fully recurrent and the PASS action is applicable. This allows the network to shorten the recursion depth.

7.3 OVERVIEW OF IMPLEMENTATIONS

We have tested 9 different implementation variants of the routing architectures. The architectures are summarized in Tables 3 and 4. The columns are:

#Agents refers to how many agents are used to implement the router. In most of the experiments, each router consists of one agent per task. However, as described in 3.1, there are implementations with 1 and #tasks + 1 agents.


Epoch                           1    5    10   20   50   100

RL (Figure 4)
  REINFORCE: approx             20   20   20   20   20   20
  Qlearning: approx             20   20   20   20   24   25
  Qlearning: table              20   36   47   50   55   55
  MARL-WPL: table               31   53   57   58   60   60

arch (Figure 5)
  routing-all-fc                31   53   57   58   60   60
  routing-all-fc recursive      31   43   45   48   48   46
  routing-all-fc dispatched     20   23   28   37   42   41
  soft mixture-all-fc           20   24   27   30   32   35
  routing-all-fc single agent   20   23   33   42   44   44

CIFAR (Figure 6)
  routing-all-fc                31   53   57   58   60   60
  task specific-all-fc          21   29   33   36   42   42
  task specific-1-fc            27   34   39   42   48   49
  cross stitch-all-fc           26   37   42   49   52   53

MIN (Figure 7)
  routing-all-fc                34   54   57   55   58   57
  task specific-all-fc          22   30   37   43   47   48
  task specific-1-fc            29   38   43   46   51   51
  cross-stitch-all-fc           29   41   48   53   56   55

MNIST (Figure 8)
  routing-all-fc                90   90   98   99   99   99
  task specific-all-fc          90   91   94   95   95   96
  task specific-1-fc            90   90   91   92   93   95
  soft mixture-all-fc           90   90   90   90   90   90
  cross-stitch-all-fc           90   90   90   90   90   90

Table 2: Numeric results (in % accuracy) for Figures 4 through 8

Figure 15: Results on the first n tasks of CIFAR-MTL: (a) first 2 tasks, (b) first 3 tasks, (c) first 5 tasks, (d) first 10 tasks.


Name         Num Agents   Policy Representation                         Part of State = (v, t, d) Used
MARL:WPL     Num Tasks    Tabular (num layers x num function blocks)    t, d
REINFORCE    Num Tasks    Vector (num layers) of approx functions       v, t, d
Q-Learning   Num Tasks    Vector (num layers) of approx functions       v, t, d
Q-Learning   Num Tasks    Tabular (num layers x num function blocks)    t, d

Table 3: Implementation details for Figure 4. All approx functions are 2-layer MLPs with a hidden dim of 64.

Name                          Num Agents      Policy Representation                                   Part of State = (v, t, d) Used
routing-all-fc                Num Tasks       Tabular (num layers x num function blocks)              t, d
routing-all-fc non-layered    Num Tasks       Tabular (num layers x num function blocks)              t, d
soft-routing-all-fc           Num Tasks       Vector (num layers) of approx functions                 v, t, d
dispatched-routing-all-fc     Num Tasks + 1   Vector (num layers) of approx functions + dispatcher    v, t, d
single-agent-routing-all-fc   1               Vector (num layers) of approx functions                 v, t, d

Table 4: Implementation details for Figure 5. All approx functions are 2-layer MLPs with a hidden dim of 64.

Policy Representation There are two dominant representation variations, as described in 3.1. In the first, the policy is stored as a table. Since the table needs to store values for each of the different layers of the routing network, it is of size num layers × num actions. In the second, the policy is represented as a vector of MLPs with a hidden layer of dimension 64, one for each layer of the routing network. In this case the input to the MLP is the representation vector v concatenated with a one-hot representation of the task identifier.

Policy Input describes which parts of the state are used in the decision of the routing action. For tabular policies, the task is used to index the agent responsible for handling that task. Each agent then uses the depth as a row index into the table. For approximation-based policies, there are two variations. For the single-agent case the depth is used to index an approximation function which takes as input concat(v, one-hot(t)). For the multi-agent (non-dispatched) case the task label is used to index the agent and then the depth is used to index the corresponding approximation function for that depth, which is given concat(v, one-hot(t)) as input. In the dispatched case, the dispatcher is given concat(v, one-hot(t)) and predicts an agent index. That agent uses the depth to find the approximation function for that depth, which is then given concat(v, one-hot(t)) as input.
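The indexing scheme described above can be summarized in a short sketch; the class and attribute names (decision_fns, the dispatcher callable) are hypothetical and only illustrate how the state encoding and agent selection fit together.

```python
import torch
import torch.nn.functional as F

def encode_state(v: torch.Tensor, t: int, num_tasks: int) -> torch.Tensor:
    """encoding(s) = concat(v, one_hot(t)) for approximation-based policies."""
    one_hot = F.one_hot(torch.tensor(t), num_classes=num_tasks).float()
    return torch.cat([v, one_hot.expand(v.shape[0], -1)], dim=1)

def select_action(v, t, depth, agents, dispatcher=None, num_tasks=10):
    """Pick the routing agent, then its per-depth decision function."""
    s = encode_state(v, t, num_tasks)
    if dispatcher is not None:          # dispatched case: learned agent assignment
        agent = agents[int(dispatcher(s).argmax(dim=1)[0])]
    else:                               # multi-agent case: task label indexes the agent
        agent = agents[t]
    return agent.decision_fns[depth](s).argmax(dim=1)  # assumed per-depth MLPs
```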

7.4 EXPLANATION OF THE WEIGHTED POLICY LEARNER (WPL) ALGORITHM

The WPL algorithm is a multi-agent policy gradient algorithm designed to help dampen policy oscillation and encourage convergence. It does this by slowly scaling down the learning rate for an agent after a gradient change in that agent's policy. It determines when there has been a gradient change by using the difference between the immediate reward and the historical average reward for the action taken. Depending on the sign of the gradient, the algorithm is in one of two scenarios. If the gradient is positive then it is scaled by 1 − π(ai). Over time, if the gradient remains positive it will cause π(ai) to increase and so 1 − π(ai) will go to 0, slowing the learning. If the gradient is negative then it is scaled by π(ai). Here again, if the gradient remains negative over time it will cause π(ai) to decrease eventually to 0, slowing the learning again. Slowing the learning after gradient changes dampens the policy oscillation and helps drive the policies towards convergence.
