
Project Report: Stochastic Neural Architecture Search

XIAO DA, QI DI, LIU XINYI
Department of Information Engineering
The Chinese University of Hong Kong

Shatin, Hong Kong
xd018, qd018, [email protected]

Abstract

We implement the Stochastic Neural Architecture Search (SNAS) algorithm, a practical version of Neural Architecture Search (NAS). SNAS inherits the basic pipeline of NAS, training operation parameters as well as architecture parameters during the same round of back-propagation. In addition, SNAS introduces a novel search-gradient method to speed up the search process. To further investigate the performance of SNAS, we also implement Efficient Neural Architecture Search (ENAS), another fast and inexpensive version of NAS, as a comparison. The experimental results show that SNAS converges faster than ENAS with a much smaller parameter size, while reaching similar search validation accuracy.

1 Introduction

As the backbone technology supporting modern intelligent mobile applications, deep neural networks (DNNs) are the most commonly adopted machine learning technique and have become increasingly popular. Thanks to their ability to perform highly accurate and reliable inference, DNNs have been applied successfully in a broad spectrum of domains, from computer vision to speech recognition and natural language processing. However, parameter tuning is a very difficult task for deep models: the explosive number of combinations of hyper-parameters and network-structure parameters makes conventional random search and grid search very inefficient (2). Therefore, in recent years, structure search and hyper-parameter optimization for deep neural networks have become a research hot spot.

The performance of a deep learning algorithm depends not only on the network weights but also, to a large extent, on various hyper-parameters. One reason some published results are difficult to reproduce is that finding the optimal hyper-parameter values requires a lot of effort, and expert knowledge contributes only a little to this search.

Thus, there has been some work on hyper-parameter optimization in deep learning, which can be divided into two types. The first focuses on optimizing the training parameters (3) (learning rate, batch size, weight decay, etc.) using classical hyper-parameter optimization approaches such as random search, grid search, Bayesian optimization (4), reinforcement learning, and evolutionary algorithms. The second optimizes the network structure parameters, which is called Neural Architecture Search (NAS) (20). In this paper, we focus on NAS and implement an efficient NAS algorithm called SNAS (Stochastic Neural Architecture Search).

NAS can be considered a black-box optimization problem. Its mechanism is to define the search space first, then identify candidate network structures with a search strategy, evaluate them, and search in the next round according to the feedback. More specifically, NAS methods can be categorized along three dimensions: search space, search strategy, and performance estimation strategy.


Figure 1: The multi-branch and repetitive cells of architecture

Search space. The search space defines the variables of the optimization problem, and its definition basically follows the development of DNNs. Most early CNNs were chain-structured, so the search space in the initial NAS work mainly considered this structure: how many layers there are, what type each layer is, and the corresponding hyper-parameters of that type. Later, with the emergence of multi-branch structures such as ResNet (5), DenseNet (7) and skip connections, multi-branch structures were also considered in NAS, which increases the diversity of hyper-parameter combinations and produces more candidate structures. In addition, many DNN architectures contain repeated substructures (e.g., Inception, DenseNet, ResNet), called cells or blocks, so NAS began to consider such structures and proposed cell-based search: only the cell structure is searched, and the overall network is built by stacking these cells. For the search model to handle a network structure, the structure needs to be encoded; an architecture in the search space can be represented as a string or vector describing the structure, as in Figure 1.
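As a toy illustration of such an encoding (purely illustrative, not the exact representation used in our implementation), a cell can be written as a short sequence of (operation, input node) decisions:

# A toy vector/string encoding of one cell: for each intermediate node,
# record which earlier node it reads from and which operation it applies.
cell_encoding = [
    ("sep_conv_3x3", 0),   # node 2 applies a 3x3 separable conv to node 0
    ("max_pool_3x3", 1),   # node 3 applies 3x3 max pooling to node 1
    ("skip_connect", 2),   # node 4 is a skip connection from node 2
]
encoded_string = ";".join(f"{op}<-{src}" for op, src in cell_encoding)
# -> "sep_conv_3x3<-0;max_pool_3x3<-1;skip_connect<-2"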

Search strategy. The search strategy defines which algorithm is used to quickly and accurately find the optimal configuration of the network structure parameters. Common search methods include random search, Bayesian optimization, evolutionary algorithms, reinforcement learning, and gradient-based algorithms.

Performance estimation strategy. The most time-consuming part of NAS is the training of candidate models, and the purpose of this training is to evaluate the accuracy of a structure. To estimate the accuracy of a network without spending too much time on training, proxy measures are usually used as estimators: for example, the accuracy of a model trained on a small subset of the data, at a lower resolution, or for a small number of epochs. Although this generally underestimates the accuracy, what matters is not the absolute accuracy but the relative ranking between different networks. There are other ideas as well, such as borrowing surrogate models from engineering optimization to interpolate predictions from observed points, or transferring knowledge at the parameter level.

In this paper, we implement SNAS (18), which views the original NAS formulation as an MDP with a totally delayed reward in a deterministic environment: a whole series of actions corresponds to a single delayed reward, so TD iteration introduces bias, reduces search efficiency, and makes reinforcement learning inefficient for NAS tasks. SNAS aims to solve this efficiency problem caused by the delayed reward. It reformulates NAS so that the whole architecture decision is made jointly rather than as a multi-step MDP, and it adopts the training/testing loss of the generated sub-network, which is differentiable. In other words, the search space becomes continuous and the objective function differentiable, so gradient-based methods can be used to search more effectively.

2 Literature Review

Neural Architecture Search. NAS is the foundation of SNAS; it defines the original framework of search space, search strategy, and performance estimation strategy. It uses policy gradients to train a controller RNN that generates a sequence of tokens describing the structure of the child network (number of filters, filter height, stride width, etc.). The child network (containing only convolution layers) is then built from these tokens and trained until convergence, and its accuracy on the test set is used as the reward to adjust the parameters of the controller RNN.

Auxiliary mechanisms. Improving the efficiency of NAS is a prerequisite to extending it to more complicated vision tasks such as detection, as well as to larger data sets. In the complete NAS pipeline, parameter learning is the time-consuming step that attracts attention from the literature. Early on, auxiliary mechanisms such as performance prediction (6), iterative search, and hyper-network-generated weights successfully accelerated NAS to a certain degree.

Efficient Neural Architecture Search via Parameter Sharing. Moving beyond these auxiliary mechanisms, ENAS (11) is a state-of-the-art NAS framework that proposes parameter sharing among all possible child graphs, a choice also followed by SNAS. ENAS observed that the bottleneck of NAS is that each sub-network must be trained to convergence and then only its accuracy is used, discarding all trained weights, so the training results of previous rounds of sub-networks are wasted. ENAS uses parameter sharing to reuse the parameters obtained by other models with only minor modifications. More specifically, what ENAS learns and selects is the connections between nodes: different connections generate a large number of neural network structures, and choosing the optimal connections among them is equivalent to 'designing' a new neural network model.

DARTS: Differentiable architecture search (9). The most important motivation of SNAS is to leverage the gradient information in a generic differentiable loss to update the architecture distribution, a motivation shared by DARTS. In both SNAS and DARTS, the reward is made differentiable using the training/testing loss, R(Z) = L_θ(Z), so that architecture learning can exploit the gradients of this loss and be conducted jointly with operation parameter training. However, to avoid sampling and gradient back-propagation through discrete random variables, DARTS takes an analytical expectation at the input of each node over the operations on its incoming edges and optimizes a relaxed loss with deterministic gradients. After the architecture derivation step introduced in DARTS, the performance drops enormously and the parameters need to be retrained.

3 Methodology

The basic structure of a cell is shown in Figure 2 (11). It consists of nodes and edges. As in ENAS, a controller decides the final structure of the cell as well as the selected operations; the details are discussed in Section 3.1. In Section 3.2, we introduce the search space, including the connections between nodes and the operations on edges. In Section 3.3, we show how to relax this discrete search space into a continuous, differentiable expression. In Sections 3.4 and 3.5, we interpret the credit assignment and the resource constraint, both of which improve the search efficiency of SNAS. The numbers of nodes and operations are pre-defined manually, forming the fundamental frame of the cell, and the number of cells that constitute the final network is also defined manually. Therefore, the search space is still limited.

3.1 Efficient Neural Architecture Search: ENAS

ENAS was proposed in 2018 (11). It presents a more efficient gradient-based method for finding good architectures based on the NAS pipeline. The basic flowchart is shown in Figure 3.


Figure 2: The graph represents the entire search space while the red arrows define a child architecture option in the search space, which is decided by a controller. Here, node 1 is the input to the model whereas nodes 3 and 6 are the model's outputs.

The modification mainly involves two aspects. First, ENAS defines the concept of nodes and the controller's job. The controller's function can be divided into two parts: 1) deciding which two nodes should be connected, and 2) deciding which kind of operation should be applied between the two connected nodes. Second, ENAS makes all models share the operation parameters ω during training, which means that if a new model contains an operation learned before, it inherits that operation's parameters without further training. Thus the training of the whole search is accelerated. The gradient of the operation parameters ω is computed using a Monte Carlo estimate:

\nabla_\omega \, \mathbb{E}_{m \sim \pi(m;\theta)}[L(m;\omega)] \approx \frac{1}{M} \sum_{i=1}^{M} \nabla_\omega L(m_i;\omega) \qquad (1)

where E_{m∼π(m;θ)}[·] is the expectation over models sampled from the controller's policy π(m;θ); L(m;ω) is the standard cross-entropy loss; m_i are models sampled from π(m;θ); and M is set to 1 in practice. Besides, ENAS also manually fixes the number of nodes inside a cell, which further reduces the size of the search space.
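The following sketch illustrates this single-sample (M = 1) Monte Carlo update of the shared weights ω, assuming a PyTorch setup; controller.sample() and the shared_model(images, arch) interface are hypothetical placeholders rather than the actual ENAS code.

import torch
import torch.nn.functional as F

def update_shared_weights(controller, shared_model, omega_optimizer, images, labels):
    """One Monte Carlo step of Eq. (1) with M = 1: sample a child architecture m
    from the controller policy pi(m; theta), evaluate the cross-entropy loss of the
    shared weights omega on that child, and update omega only (theta stays frozen)."""
    with torch.no_grad():                    # the controller is not updated in this step
        arch = controller.sample()           # m ~ pi(m; theta)   (hypothetical API)
    logits = shared_model(images, arch)      # forward pass through the sampled child graph
    loss = F.cross_entropy(logits, labels)   # L(m; omega)
    omega_optimizer.zero_grad()
    loss.backward()                          # gradient w.r.t. the shared omega
    omega_optimizer.step()
    return loss.item()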

Figure 3: Flowchart of the model


3.2 Search Space

The search space consists of two sets. One is the set of all possible connections between nodes, in other words the set of edges; the other is the set of operations on the edges. In SNAS, we set the number of nodes to four, and the operation set contains: 5×5 and 3×3 separable convolution, 5×5 and 3×3 dilated separable convolution, 3×3 max pooling, 3×3 average pooling, skip connection, and the zero operation. Here the zero operation means the edge is not selected. In this way, we can represent an intermediate node as:

x_j = \sum_{i<j} O_{i,j}(x_i) \qquad (2)

where O_{i,j} is the selected operation at edge (i, j), and x_i and x_j are intermediate nodes. Because of the skip-connection operation, the nodes are ordered and the edges are directed, pointing from lower-indexed nodes to higher ones. We further represent the full set of architecture decisions in a cell with a distribution p(Z), which is generally tractable within a cell, and rewrite the intermediate nodes as:

x_j = \sum_{i<j} O_{i,j}(x_i) = \sum_{i<j} Z_{i,j}^{T} O_{i,j}(x_i) \qquad (3)

where Z_{i,j} is the one-hot random variable attached to edge (i, j). In SNAS, we assume that p(Z) is fully factorizable and parameterized by the architecture (search) parameters α, while θ denotes the operation parameters. Following the setting of NAS (20), the objective of SNAS is also the expected reward with respect to Z. However, we directly use the training/testing loss rather than the validation accuracy, so that both the search parameters and the operation parameters are optimized under one generic loss:

\mathbb{E}_{Z \sim p_\alpha(Z)}[R(Z)] = \mathbb{E}_{Z \sim p_\alpha(Z)}[L_\theta(Z)] \qquad (4)

The Monte Carlo estimation process is shown in Figure 4.

Figure 4: An intuitive visualization. Edges with different colours stand for different operations. Z_{i,j} is the random one-hot variable indicating the operation mask on edge (i, j). The objective is the expectation of the generic loss L.
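To make Eq. (3) concrete, the following sketch shows how an intermediate node can be computed as the Z-weighted sum over candidate operations on its incoming edges, assuming PyTorch; the operation modules listed here are simplified stand-ins for the actual SNAS operation set.

import torch
import torch.nn as nn

def candidate_ops(channels):
    """A simplified stand-in for the candidate operations on one edge."""
    return nn.ModuleList([
        nn.Conv2d(channels, channels, 3, padding=1),              # stand-in for sep_conv_3x3
        nn.Conv2d(channels, channels, 3, padding=2, dilation=2),  # stand-in for dil_conv_3x3
        nn.MaxPool2d(3, stride=1, padding=1),                     # max_pool_3x3
        nn.AvgPool2d(3, stride=1, padding=1),                     # avg_pool_3x3
        nn.Identity(),                                            # skip connection
    ])

def node_output(prev_nodes, edge_ops, edge_Z):
    """x_j = sum_{i<j} Z_{i,j}^T O_{i,j}(x_i)   (Eq. 3)
    prev_nodes: list of tensors x_i with i < j
    edge_ops:   edge_ops[i] is the ModuleList of candidate ops on edge (i, j)
    edge_Z:     edge_Z[i] is the (softened) one-hot vector Z_{i,j}"""
    x_j = 0
    for x_i, ops, z in zip(prev_nodes, edge_ops, edge_Z):
        outs = torch.stack([op(x_i) for op in ops], dim=0)     # (num_ops, N, C, H, W)
        x_j = x_j + torch.einsum('k,knchw->nchw', z, outs)     # weight each op output by Z
    return x_j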

3.3 Parameter Learning

To avoid the high variance of the likelihood-ratio trick (17) in optimizing the objective, and to exploit the differentiable nature of the loss, SNAS uses the concrete distribution (10) to relax the discrete architecture distribution into one that is continuous and differentiable:

Z_{i,j}^{k} = f_{\alpha_{i,j}}(G_{i,j}^{k}) = \frac{\exp\!\big((\log \alpha_{i,j}^{k} + G_{i,j}^{k})/\lambda\big)}{\sum_{l=0}^{n} \exp\!\big((\log \alpha_{i,j}^{l} + G_{i,j}^{l})/\lambda\big)} \qquad (5)


where Z_{i,j} is the softened one-hot random variable for the operation picked at edge (i, j); G_{i,j}^{k} = -\log(-\log(U_{i,j}^{k})) is the k-th Gumbel random variable; α_{i,j} is the architecture parameter; and λ is the temperature of the softmax. Maddison (10) proved that

p\big(\lim_{\lambda \to 0} Z_{i,j}^{k} = 1\big) = \alpha_{i,j}^{k} \Big/ \sum_{l=0}^{n} \alpha_{i,j}^{l} \qquad (6)

which makes the relaxation unbiased once it has converged. The gradients of the loss with respect to x_j, θ_{i,j}^{k} and α_{i,j}^{k} are:

\frac{\partial L}{\partial x_j} = \sum_{m>j} \frac{\partial L}{\partial x_m} Z_{j,m}^{T} \frac{\partial O_{j,m}(x_j)}{\partial x_j}, \quad
\frac{\partial L}{\partial \theta_{i,j}^{k}} = \frac{\partial L}{\partial x_j} Z_{i,j}^{k} \frac{\partial O_{i,j}^{k}(x_i)}{\partial \theta_{i,j}^{k}}, \quad
\frac{\partial L}{\partial \alpha_{i,j}^{k}} = \frac{\partial L}{\partial x_j} O_{i,j}^{T}(x_i)\big(\delta(k'-k) - Z_{i,j}\big) Z_{i,j}^{k} \frac{1}{\lambda \alpha_{i,j}^{k}} \qquad (7)
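A minimal sketch of the relaxation in Eq. (5), assuming PyTorch. Because the softened Z is a deterministic, differentiable function of α given the Gumbel noise, gradients of the generic loss flow back into the architecture parameters; the toy cost vector below exists only to show that α receives a gradient.

import torch

def sample_relaxed_Z(log_alpha, temperature):
    """Sample a softened one-hot Z_{i,j} for one edge, following Eq. (5).
    log_alpha:   tensor of log(alpha^k_{i,j}), shape (num_ops,)
    temperature: the softmax temperature lambda, annealed towards 0 during search"""
    u = torch.rand_like(log_alpha)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)   # G = -log(-log U)
    return torch.softmax((log_alpha + gumbel) / temperature, dim=-1)

log_alpha = torch.zeros(8, requires_grad=True)           # 8 candidate operations on this edge
Z = sample_relaxed_Z(log_alpha, temperature=1.0)
toy_loss = (Z * torch.arange(8, dtype=torch.float32)).sum()
toy_loss.backward()                                      # log_alpha.grad is now populated

PyTorch also provides torch.nn.functional.gumbel_softmax, which implements the same relaxation.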

3.4 Credit Assignment

In reinforcement learning, assigning credit to actions, both temporally and laterally, is a significant topic (12)(14)(16)(19). ENAS uses proximal policy optimization (PPO) (11) to optimize the architecture policy parameters, which assigns credit with TD learning. However, since the reward of the ENAS task is obtained only after the architecture search finishes, it is a delayed-reward task, and TD learning is biased under delayed rewards with exponentially slow correction, as proved by Arjona-Medina et al. (1). SNAS therefore makes no MDP assumption, and its reward equals the loss function, which is differentiable. The expected search gradient for the architecture parameters at edge (i, j) can be derived as follows:

\mathbb{E}_{Z \sim p(Z)}\!\left[\frac{\partial L}{\partial \alpha_{i,j}^{k}}\right] = \mathbb{E}_{Z \sim p(Z)}\!\left[\nabla_{\alpha_{i,j}^{k}} \log p(Z_{i,j}) \left[\frac{\partial L}{\partial x_j} O_{i,j}(x_i)\right]_{c}\right] \qquad (8)

where [·]_c emphasizes that the quantity inside is treated as a constant in the gradient calculation. This gradient is exactly a policy gradient for an edge whose assigned credit is:

R_{i,j} = -\left[\frac{\partial L}{\partial x_j} O_{i,j}(x_i)\right]_{c} \qquad (9)

The existence of skip-connection operations allows nodes to be involved in multiple layers, so their credits are integrated, weighted by O_{i,j}(x_i). Thus, there is no delayed reward for any architecture decision in SNAS. At the beginning, Z_{i,j} is continuous and the operations share the credit; as the temperature goes down, Z_{i,j} gets closer to one-hot.
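For illustration, the per-edge credit of Eq. (9) could be inspected with automatic differentiation as in the sketch below, assuming PyTorch; loss, x_j and op_output are placeholders for tensors taken from a forward pass in which x_j is retained in the autograd graph.

import torch

def edge_credit(loss, x_j, op_output):
    """R_{i,j} = -[ dL/dx_j * O_{i,j}(x_i) ]_c   (Eq. 9), computed for inspection only.
    loss:      the scalar generic loss L
    x_j:       the intermediate node tensor (must be part of the autograd graph)
    op_output: O_{i,j}(x_i), the output of the operation on edge (i, j)"""
    (grad_xj,) = torch.autograd.grad(loss, x_j, retain_graph=True)
    # [.]_c: both factors are treated as constants, so detach before the product.
    return -(grad_xj.detach() * op_output.detach()).sum()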

3.5 Resource Constraint

In addition to the search validation accuracy and training efficiency, the total forwarding time delay becomes another concern when considering deployment feasibility. This factor is taken into account in the loss function:

\mathbb{E}_{Z \sim p_\alpha(Z)}[L_\theta(Z) + \eta C(Z)] = \mathbb{E}_{Z \sim p_\alpha(Z)}[L_\theta(Z)] + \eta \, \mathbb{E}_{Z \sim p_\alpha(Z)}[C(Z)] \qquad (10)

where C(Z) is the time delay of the child network and η is the delay coefficient that controls the strength of the time-delay penalty. However, unlike L_θ(Z), C(Z) is not differentiable with respect to either θ or α. To make the two objectives compatible, we exploit the fact that C(Z) is a linear combination of all the one-hot random variables Z_{i,j}:

C(Z) = \sum_{i,j} C(Z_{i,j}) = \sum_{i,j} Z_{i,j}^{T} C(O_{i,j}) \qquad (11)

Since the feature-map size of each node is independent of the structural decisions, the distribution on each edge (i, j) can be optimized locally, and because of the linear-combination form this is a conservative decomposition of the global optimization. Moreover, p_α(Z) is fully factorizable in SNAS, which allows us to calculate E_{Z∼p_α(Z)}[C(Z)] analytically with the sum-product algorithm (8). Finally, we optimize the Monte Carlo estimate of E_{Z∼p_α(Z)}[C(Z)], written as

\mathbb{E}_{Z \sim p_\alpha(Z)}[C(Z)] = \sum_{i,j} \mathbb{E}_{Z \sim p_\alpha}\!\left[Z_{i,j}^{T} C(O_{i,j})\right] \qquad (12)

using policy gradient.
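Because p_α(Z) is fully factorizable and C(Z) is linear in the Z_{i,j}, the expected cost decomposes over edges. The sketch below takes the simpler analytic route for illustration (using the limiting per-operation probabilities of Eq. (6)), whereas the text above optimizes a Monte Carlo estimate with policy gradient; the per-operation cost table and η value are hypothetical, assuming PyTorch.

import torch

def expected_cost(edge_log_alphas, op_costs):
    """E_{Z ~ p_alpha}[C(Z)] = sum_{i,j} E[Z_{i,j}]^T C(O_{i,j})   (Eqs. 11-12).
    edge_log_alphas: dict {(i, j): tensor of architecture logits for that edge}
    op_costs:        tensor of per-operation time delays C(O_{i,j}), shape (num_ops,)"""
    total = torch.zeros(())
    for edge, log_alpha in edge_log_alphas.items():
        probs = torch.softmax(log_alpha, dim=-1)   # limiting probability of each op, cf. Eq. (6)
        total = total + (probs * op_costs).sum()
    return total

# Penalized objective of Eq. (10):
# eta = 0.01                                       # hypothetical delay coefficient
# loss = task_loss + eta * expected_cost(edge_log_alphas, op_costs)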

4 Experiments

Our experiments are divided into two stages: the neural architecture search (training) stage and the retraining stage. In stage one, the neural architecture with the best training validation accuracy is found with the SNAS algorithm, and the whole network is then constructed by stacking the learned cells. The learned network is fed into the second, retraining stage to optimize the operation parameters. All experiments are carried out on CIFAR-10 for image classification, and for ENAS the experiment stages are identical to those of SNAS. The code is released1.

4.1 Stage 1: Neural Architecture Search

The basic experimental setup is as follows. We use 50,000 images for training and 10,000 images for testing. Data transformation follows standard pre-processing and augmentation techniques; for example, images are first padded to 40 × 40 and then randomly cropped to 32 × 32. The operation parameters θ are optimized with momentum SGD, while the architecture distribution parameters α are optimized with Adam. The batch size is 16 and the number of channels is also 16. After applying the forwarding-time penalty, the cell structure is reshaped as in Figures 5(a) and 5(b). SNAS is trained for 125 epochs in total. The training and testing search validation accuracies are shown in Figures 6(a) and 6(b).
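A sketch of this search-stage configuration, assuming PyTorch/torchvision; the normalization statistics, learning rates and weight decay are common CIFAR-10 placeholders rather than values reported in the text, and the two parameter lists stand in for the real θ and α of the model.

import torch
import torchvision.transforms as T

# Standard CIFAR-10 pre-processing: pad to 40x40, randomly crop back to 32x32, normalize.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),   # 32 + 2*4 = 40, then crop to 32
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# Momentum SGD for the operation parameters theta, Adam for the architecture parameters alpha.
theta = [torch.nn.Parameter(torch.randn(16, 3, 3, 3))]   # stand-in for operation parameters
alpha = [torch.nn.Parameter(torch.zeros(8))]             # stand-in for architecture parameters
theta_opt = torch.optim.SGD(theta, lr=0.025, momentum=0.9, weight_decay=3e-4)
alpha_opt = torch.optim.Adam(alpha, lr=3e-4, betas=(0.5, 0.999))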

For ENAS, all settings are identical except that it is trained for 300 epochs. The performance of the ENAS algorithm is shown in Figures 7(a) and 7(b).

4.2 Stage 2: Retraining

In the retraining phase, we reuse the 50,000 training images and retrain for 600 epochs. The network consists of 20 cells, and the initial number of channels is increased from 16 to 36. We use momentum SGD as the optimizer with cosine learning-rate decay. The final performance of both the learned SNAS and ENAS networks is shown in the following table:

Architecture    Accuracy    Parameter size
SNAS            96.27%      2.9M
ENAS            97.01%      4.6M
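A sketch of the retraining schedule described above (momentum SGD with cosine learning-rate decay over 600 epochs), assuming PyTorch; the initial learning rate and weight decay are placeholders, and the single parameter tensor stands in for the 20-cell, 36-channel network.

import torch

params = [torch.nn.Parameter(torch.randn(36, 3, 3, 3))]   # stand-in for the full network
optimizer = torch.optim.SGD(params, lr=0.025, momentum=0.9, weight_decay=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600)

for epoch in range(600):
    # ... one epoch of training over the 50,000 CIFAR-10 training images ...
    optimizer.step()     # placeholder for the per-batch updates inside the epoch
    scheduler.step()     # anneal the learning rate along a cosine curve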

5 Discussion

From Figures 5(a) and 5(b), it is obvious that the cell architecture has been reconstructed. For example, in the normal cell before applying the resource constraint, nodes 0 and 1 are in series, while in the reduction cell they become parallel to each other. Besides, more separable convolution operations such as sep_conv_5×5 and sep_conv_3×3 are applied at the data-input stage to cut down the number of parameters, while more dilated convolution operations such as dil_conv_5×5 and dil_conv_3×3 are chosen at the deeper layers to retain the captured features. Comparing Figures 6(a) and 6(b) with 7(a) and 7(b), SNAS and ENAS both achieve a search validation accuracy of over 80% on the training set, albeit with different numbers of training epochs; on the testing set, ENAS seems to perform slightly better than SNAS, but at the expense of a much longer training period. SNAS also shows performance similar to ENAS in the retraining stage with a much smaller parameter size.

1 https://github.com/xdhhh/SNAS


Figure 5: (a) Normal cell (b) Reduction cell

Figure 6: (a) SNAS training accuracy (b) SNAS testing accuracy

However, restricted by the large search-space size and our limited computational resources, the batch size is small, which may directly discount the performance of SNAS. Several solutions could be introduced to mitigate this problem, such as applying the idea of binary neural networks (13) or implementing SuperKernel-based methods (15), to further reduce the size of the search space.

6 Conclusion

We implement the SNAS algorithm, a fast and economical version of NAS. To further investigate the improvement in performance, the ENAS algorithm is also implemented for comparison. The experimental results show that SNAS converges faster than ENAS, requiring only 125 training epochs compared with over 300 epochs for ENAS.


Figure 7: (a) ENAS training accuracy (b) ENAS testing accuracy

This is partially due to the large gap in parameter size between the two algorithms: 2.9 million for SNAS versus 4.6 million for ENAS. After retraining, SNAS reaches an evaluation accuracy of 96.27% while ENAS reaches 97.01%. Directly using the loss function as the reward has proved to be an effective way to simplify the optimization process. However, restricted by computational resources, we use a relatively small batch size during the architecture search phase of SNAS, which may influence its final performance. This can be addressed in future work.

References

[1] J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, and S. Hochreiter. RUDDER: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.
[2] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
[3] M. Claesen and B. De Moor. Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127, 2015.
[4] P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] K. Hoste, A. Phansalkar, L. Eeckhout, A. Georges, L. K. John, and K. De Bosschere. Performance prediction based on inherent program similarity. In International Conference on Parallel Architectures and Compilation Techniques (PACT), 2006.
[7] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[8] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 1998.
[9] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In ICLR, 2019. arXiv:1806.09055.
[10] C. J. Maddison. A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
[11] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
[12] D. Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 2000.
[13] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[14] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[15] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, and D. Marculescu. Single-Path NAS: Designing hardware-efficient ConvNets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019.
[16] G. Tucker, S. Bhupatiraju, S. Gu, R. E. Turner, Z. Ghahramani, and S. Levine. The mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031, 2018.
[17] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[18] S. Xie, H. Zheng, C. Liu, and L. Lin. SNAS: Stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2019.
[19] Z. Xu, H. van Hasselt, and D. Silver. Meta-gradient reinforcement learning. arXiv preprint arXiv:1805.09801, 2018.
[20] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
