
Accelerating Convolutional Networks via Global & Dynamic Filter Pruning

Shaohui Lin^{1,2}, Rongrong Ji^{1,2,∗}, Yuchao Li^{1,2}, Yongjian Wu^{3}, Feiyue Huang^{3}, Baochang Zhang^{4}

1 Fujian Key Laboratory of Sensing and Computing for Smart City, Xiamen University, China
2 School of Information Science and Engineering, Xiamen University, China

3 BestImage, Tencent Technology (Shanghai) Co., Ltd, China
4 School of Automation Science and Electrical Engineering, Beihang University, China

Contact author: [email protected]
∗ Corresponding author

Abstract

Accelerating convolutional neural networks has recently received ever-increasing research focus. Among the various approaches proposed in the literature, filter pruning has been regarded as a promising solution, due to its advantage of significant speedup and memory reduction for both the network model and the intermediate feature maps. However, most approaches prune filters in a layer-wise, fixed manner, which cannot dynamically recover previously removed filters, nor jointly optimize the pruned network across layers. In this paper, we propose a novel global & dynamic pruning (GDP) scheme to prune redundant filters for CNN acceleration. In particular, GDP first globally prunes the unsalient filters across all layers by proposing a global discriminative function based on prior knowledge of each filter. Second, it dynamically updates the filter saliency over the entire pruned sparse network, recovers mistakenly pruned filters, and then retrains the model to improve its accuracy. Specially, we effectively solve the corresponding non-convex optimization problem of the proposed GDP via stochastic gradient descent with greedy alternate updating. Extensive experiments show that the proposed approach achieves superior performance in accelerating several cutting-edge CNNs on the ILSVRC 2012 benchmark, compared with the state-of-the-art filter pruning methods.

1 Introduction

Convolutional neural networks (CNNs) have achieved remarkable success in various applications such as image classification [He et al., 2016; Krizhevsky et al., 2012; Simonyan and Zisserman, 2014], object detection [Girshick et al., 2014] and semantic segmentation [Long et al., 2015]. However, this promising performance is accompanied by a significant computation cost, which makes it difficult to deploy these CNNs in real-time applications without the support of highly efficient Graphics Processing Units (GPUs).

As a result, the acceleration of convolutional networks has become an emerging research topic.

Recent works on convolutional neural network acceleration can be categorized into three groups, i.e., low-rank decomposition, parameter quantization, and network pruning. Among them, network pruning has received ever-increasing research focus, since it yields limited memory footprints thanks to the small number of filter parameters and intermediate activations, which is highly desirable for memory-light online inference. Methods in network pruning can be further categorized into non-structured and structured pruning. Non-structured pruning [LeCun et al., 1989; Hassibi and Stork, 1993; Han et al., 2015a; 2015b] directly prunes parameters independently in each layer, which causes irregular memory access that adversely impacts the efficiency of online inference. Under such a circumstance, specialized hardware [Han et al., 2016] or software [Liu et al., 2015; Park et al., 2017] designs are required to further speed up the pruned unstructured CNNs. Instead, structured pruning [Anwar et al., 2015; Lebedev and Lempitsky, 2016; Wen et al., 2016; Li et al., 2016; Luo et al., 2017; Molchanov et al., 2017; Hu et al., 2016] aims at directly removing filters as a whole, which is far more efficient and independent of specialized hardware/software platforms. For instance, Anwar et al. [Anwar et al., 2015] introduced structured sparsity for either filter-wise or channel-wise convolutional filter selection, based on which filters are pruned with regularity by using particle filtering. Luo et al. [Luo et al., 2017] implicitly associated the convolutional filters in the current layer with the input channels of the next layer, based on which filters in the current layer are pruned via input channel selection of the next layer.

However, the existing structured pruning schemes prune the convolutional neural network in a layer-by-layer, fixed manner, which is less adaptive, less efficient, and less effective. First, in local pruning, iterative layer-wise pruning and local fine-tuning are required, which is computationally intensive. Second, the mistaken pruning of a salient filter is irretrievable, which is inflexible and prevents the pruned network from achieving optimal performance.

In this paper, we present a novel global & dynamic pruning (GDP) scheme that prunes redundant filters to address the above two issues.


[Figure 1 flowchart: a pre-trained model with full masks (input, conv layers, FC layers, output) is globally pruned, then the filters W∗ and the global mask m are dynamically and iteratively updated; ⊙ denotes the Khatri-Rao product.]

Figure 1: An illustration of GDP. Each colored rectangle (e.g., the red rectangle) is a filter in the filter set W∗, while a global mask m with binary values determines the saliency of filters (i.e., 1 indicates the corresponding filter is salient, and 0 indicates it is unsalient). First, a pre-trained model and a full global mask are employed to initialize the network. Then, the redundant filters are globally pruned across all layers by masking out the corresponding values as 0. Finally, the filters and the global mask are iteratively and dynamically updated to improve the accuracy of the pruned network. (Best viewed in color.)

Our scheme largely accelerates the pruned networks while reducing their accuracy loss. Unlike the previous schemes of layer-by-layer fixed filter pruning, our key innovation lies in evaluating the importance/saliency of each filter globally across all network layers, upon which we dynamically and iteratively prune and tune the network, with a mechanism to recall filters that were mistakenly pruned in previous iterations. Figure 1 shows the flowchart of the proposed framework. In particular, we first initialize the network with a pre-trained convolutional model and set the global mask of all filters to 1 (i.e., an external switch that determines whether a filter is pruned). Then, we design a global discriminative function to determine the saliency scores of individual filters. These scores guide us to globally prune the unsalient filters across all layers, which is equivalent to masking unsalient filters out as 0. Finally, we iteratively tune the sparse network and dynamically update the filter saliency in a top-down manner. Through these operations, filters that were previously masked out can be recalled, which significantly improves the accuracy of the pruned network. In terms of optimization, GDP can be described as a non-convex optimization problem, which is effectively solved via stochastic gradient descent with greedy alternate updating.

The proposed GDP is evaluated on the ImageNet 2012 dataset [Russakovsky et al., 2015] and implemented on the widely-used AlexNet [Krizhevsky et al., 2012], VGG-16 [Simonyan and Zisserman, 2014] and ResNet-50 [He et al., 2016]. Compared with the state-of-the-art filter pruning methods [Wen et al., 2016; Li et al., 2016; Luo et al., 2017; Molchanov et al., 2017; Hu et al., 2016], the proposed GDP scheme achieves superior performance: a 2.12× GPU speedup with 1.15% Top-5 accuracy loss on AlexNet, a 2.17× CPU speedup with 1.45% Top-5 accuracy loss on VGG-16, and a 1.93× CPU speedup with 2.16% Top-5 accuracy loss on ResNet-50.

2 Related Work

LeCun et al. [LeCun et al., 1989] and Hassibi et al. [Hassibi and Stork, 1993] proposed a saliency measurement to remove unimportant weights, which is determined by the second-order derivative matrix of the loss function with respect to the parameters. Recently, Han et al. [Han et al., 2015a; 2015b] proposed to prune parameters by iterative thresholding, removing unsalient weights with small absolute values. Guo et al. [Guo et al., 2016] pruned weights by using connection splicing to avoid incorrect pruning. However, such schemes only work locally, layer by layer. Different from connection splicing, the proposed dynamic updating is conducted in a global manner, which can restore important filters that were mistakenly removed across all layers.

In line with our work, a few methods have been proposed for filter-level/channel pruning (i.e., structured pruning), which can reduce both network size and inference time. Li et al. [Li et al., 2016] measured the importance of each filter by its ℓ1-norm and pruned unsalient filters together with their corresponding feature maps. Hu et al. [Hu et al., 2016] computed the Average Percentage of Zeros (APoZ) of each filter, i.e., the percentage of zero values in the output feature map associated with the corresponding filter, which serves as its score to guide pruning. Lebedev et al. [Lebedev and Lempitsky, 2016] and Wen et al. [Wen et al., 2016] utilized group sparsity regularization to prune convolutional filters in a group-wise fashion during training, which is however less efficient since only stochastic gradient descent is used. Recently, a new criterion based on the Taylor expansion was introduced in [Molchanov et al., 2017] to globally prune one filter at a time and then fine-tune the pruned network. However, it is prohibitively costly when applied to deep networks, as time-consuming fine-tuning has to be done after pruning each filter. Our method differs from all the above methods in that it globally removes unsalient filters across all layers, while dynamically restoring salient filters that were previously mistakenly removed.

Methods orthogonal to our work include low-rank decomposition [Denton et al., 2014; Jaderberg et al., 2014; Lin et al., 2016; 2017; Lebedev et al., 2014; Kim et al., 2015] and parameter quantization [Gong et al., 2014; Wu et al., 2016; Courbariaux et al., 2015; Courbariaux and Bengio, 2016; Rastegari et al., 2016], which have also been widely used to accelerate convolutional networks. Low-rank decomposition [Denton et al., 2014; Jaderberg et al., 2014; Lin et al., 2017; Lebedev et al., 2014; Kim et al., 2015] typically decomposes convolutional filters into a sequence of tensor-based convolutions with fewer parameters. For parameter quantization, Gong et al. [Gong et al., 2014] and Wu et al. [Wu et al., 2016] employed product quantization over parameters to reduce the redundancy in the parameter space. Recently, directly training models with binary weights has been proposed in [Courbariaux et al., 2015; Courbariaux and Bengio, 2016; Rastegari et al., 2016]. It is worth noting that our scheme can be integrated with the above orthogonal methods to further accelerate the pruned network.

3 Globally Dynamic Pruning

3.1 Notations

A CNN can be viewed as a feed-forward multi-layer architecture that maps an input image to a certain output vector.


In a CNN, the convolutional layers are the most time-consuming. Let us denote the set of feature maps in the l-th layer by Z_l ∈ R^{H_l×W_l×C_l}, with spatial size H_l × W_l and C_l individual maps (or channels). The feature maps can either be the input of the network, Z_0, or the output feature maps Z_l with l ∈ [1, 2, · · · , L]. In addition, we denote an individual feature map by Z^{(k)}_l ∈ R^{H_l×W_l} with k ∈ [1, 2, · · · , C_l]. The individual output feature map Z^{(k)}_l of the l-th convolutional layer is obtained by applying the convolution operator (∗) to a set of input feature maps with filters parameterized by W^{(k)}_l ∈ R^{d×d×C_{l−1}}¹, i.e.,

Z^{(k)}_l = f(Z_{l−1} ∗ W^{(k)}_l),   (1)

where f(·) is a non-linear activation function, e.g., the rectified linear unit (ReLU).

In many deep learning frameworks, such as Caffe [Jia et al., 2014] and TensorFlow [Abadi et al., 2016], the tensor-based convolution operator is reformulated as a matrix-by-matrix multiplication by lowering the input and reshaping the filters:

Z∗_l = f(Z∗_{l−1} × W∗_l),   (2)

where each row of the matrix Z∗_{l−1} ∈ R^{H_l W_l × d²C_{l−1}} corresponds to a spatial position of the output tensor, transformed from the input tensor Z_{l−1}, and the matrix W∗_l ∈ R^{d²C_{l−1}×C_l} is reshaped from the filters W_l.²

¹ For simplicity, we discuss the problem without the bias term.
² These efficient implementations can take advantage of highly optimized linear algebra packages, such as Intel MKL and BLAS.
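To make the lowering in Eq. 2 concrete, the following is a minimal NumPy sketch (our own illustration, assuming stride 1, no padding and no bias; the function names are not from the paper). The filter matrix must be reshaped with the same (row, column, channel) ordering used when unrolling the receptive fields.

```python
import numpy as np

def im2col(z, d):
    """Lower a feature map z of shape (H, W, C) into a matrix whose rows hold the
    d*d*C values of each receptive field (stride 1, no padding), cf. Z*_{l-1} in Eq. (2)."""
    H, W, C = z.shape
    Ho, Wo = H - d + 1, W - d + 1
    cols = np.empty((Ho * Wo, d * d * C))
    for i in range(Ho):
        for j in range(Wo):
            cols[i * Wo + j] = z[i:i + d, j:j + d, :].ravel()
    return cols

def conv_as_matmul(z_prev, w_mat, d):
    """Eq. (2): Z*_l = f(Z*_{l-1} x W*_l) with f = ReLU.
    w_mat has shape (d*d*C_{l-1}, C_l), i.e. one reshaped filter per column."""
    return np.maximum(im2col(z_prev, d) @ w_mat, 0.0)

# toy usage: an 8x8x3 input and 16 filters of spatial size 3x3
z0 = np.random.rand(8, 8, 3)
w_star = np.random.randn(3 * 3 * 3, 16)
z1 = conv_as_matmul(z0, w_star, d=3)   # shape (36, 16): one row per output position
```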

3.2 The Proposed Pruning Scheme

Our goal is to globally prune redundant filters. To that effect, a large network can be directly converted into a compact one without repeatedly evaluating each filter's saliency and fine-tuning the pruned network layer by layer. We introduce a global mask to temporarily mask out unsalient filters in each iteration during training. Therefore, Eq. 2 can be rewritten as:

Z∗_l = f(Z∗_{l−1} × (W∗_l ⊙ m_l)),  s.t.  l = 1, 2, · · · , L,   (3)

where m_l ∈ {0, 1}^{C_l} is a mask with binary values: m^k_l = 1 if the k-th filter in the l-th layer is salient, and 0 otherwise. ⊙ denotes the Khatri-Rao product operator.
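The effect of the masking in Eq. 3 can be seen in a small NumPy sketch (our own toy, using the matrix form of Eq. 2): multiplying the reshaped filter matrix W∗_l column-wise by the binary mask m_l zeroes out entire filters, so their output channels vanish from the layer response while the filter parameters themselves are kept and can later be re-enabled.

```python
import numpy as np

w_mat = np.random.randn(27, 4)               # reshaped filters of one layer: one column per filter
m_l = np.array([1.0, 0.0, 1.0, 0.0])         # binary mask: the 2nd and 4th filters are pruned

z_prev = np.random.rand(100, 27)             # lowered input Z*_{l-1} (one row per output position)
z_l = np.maximum(z_prev @ (w_mat * m_l), 0)  # Eq. (3) with f = ReLU

# Output channels of masked filters are exactly zero, so they can be skipped at inference time.
assert np.all(z_l[:, m_l == 0] == 0)
```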

As we argued, pruning filters in an irretrievable, fixed way is inflexible and ineffective in practice, and can cause severe performance loss. Note that the filter saliency may change dramatically after pruning a certain layer, since there exist complex interconnections among filters [Guo et al., 2016]. Therefore, dynamic pruning, i.e., enabling the roll-back of masked filters from a global perspective, is highly desirable to improve the discriminability of the pruned network.

To describe the objective function of the proposed GDP, we denote the filters of the entire network as W∗ = {W^{1∗}_1, W^{2∗}_1, · · · , W^{C_L∗}_L} and the global mask as m ∈ {0, 1}^{∑_{l=1}^{L} C_l}. We further assume a set of training examples D = {X = {X_1, X_2, · · · , X_N}, Y = {Y_1, Y_2, · · · , Y_N}}, where X_i and Y_i represent an input and a target output, respectively. We propose to solve the following optimization problem:

min L(Y, g(X; W∗, m))
s.t.  m = h(W∗),  ‖m‖_0 ≤ β ∑_{l=1}^{L} C_l,   (4)

where L(·) is a loss function for the pruned network, e.g., the cross-entropy loss. g(X; W∗, m) takes the input X, the filters W∗ and the global mask m, and maps them to an s-dimensional output (s is the number of classes). h(·) is a global discriminative function that determines the saliency values of the filters, which depends on prior knowledge of W∗. Each output entry of h(·) is binary, i.e., 1 if the corresponding filter is salient, and 0 otherwise.

Eq. 4 is the core function of our GDP framework; it is non-convex, and its solver is introduced in Sec. 3.3. β ∈ (0, 1] is a threshold that determines the sparsity of the pruned network. The problem in Eq. 4 is NP-hard because of the ‖·‖_0 operator. We simplify this NP-hard problem by bounding m with prior knowledge of W∗. Then, it is solved by greedily and alternately updating W∗ and m using stochastic gradient descent, which is described in detail in Sec. 3.3.

3.3 The Solver

We first investigate the constraint in Eq. 4, which can be relaxed by greedily selecting the β∑_{l=1}^{L} C_l most important filters; this determines the global discriminative function h based on the prior knowledge of W∗. Then, we only need to solve the objective function in Eq. 4 through stochastic gradient descent. Since every filter has a mask, we update W∗ as follows:

W∗_l = W∗_l − η ∂L(Y, g(X; W∗, m)) / ∂(W∗_l ⊙ m_l),  l = 1, · · · , L,   (5)

where W∗_l ∈ R^{d²C_{l−1}×C_l} contains the C_l filters of layer l and η is the learning rate. The global mask m and the filters W∗ are updated iteratively to dynamically adapt to the pruned network. Algorithm 1 presents the detailed optimization algorithm.

In Eq. 5, we employ back-propagation to compute the partial derivative of L with respect to W∗_l ⊙ m_l, instead of with respect to the filters W∗_l. In the framework of greedy alternate updating, m depends on the knowledge of W∗ and is implemented by the global discriminative function h, which is constructed by sorting the importance of each filter and assigning all entries a value of 0 or 1. After that, all filters W∗, masked by the global mask m, are updated to adapt to the newly pruned network.
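The key subtlety of Eq. 5 is that the gradient is taken with respect to the masked filters W∗_l ⊙ m_l but applied to the unmasked W∗_l, so currently masked filters keep receiving updates and can be recalled later. A tiny NumPy sketch on a masked linear layer with a squared loss (our own toy setting, not the paper's CNN training code) makes this explicit:

```python
import numpy as np

def gdp_step(w, m, x, y, lr=0.1):
    """One update of Eq. (5) on a toy masked linear layer with L = 0.5 * ||x(w ⊙ m) - y||^2.
    The gradient is computed w.r.t. the masked weights (w * m) but applied to the full w."""
    y_hat = x @ (w * m)               # forward pass with the mask, cf. Eq. (3)
    grad_masked = x.T @ (y_hat - y)   # dL/d(w ⊙ m), obtained here in closed form
    return w - lr * grad_masked       # Eq. (5): even columns with m = 0 are updated

# toy usage: 4 "filters" (columns), half of them currently masked out
rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 5)), rng.normal(size=(8, 4))
w, m = rng.normal(size=(5, 4)), np.array([1.0, 0.0, 1.0, 0.0])
w = gdp_step(w, m, x, y)
```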

Compared with existing layer-wise pruning solvers, the above solver has the following two advantages:

1. The saliency evaluation of filters is global (i.e., across all layers), and the corresponding pruning is conducted only once, rather than layer by layer.

2. We enable dynamic updating of filters that were incorrectly masked out, which constitutes a closed-loop procedure that improves the accuracy and flexibility of the pruned network.


Algorithm 1: The proposed global dynamic pruning scheme

Input: Training data D = {X, Y}, reference model W = {W^1_1, · · · , W^{C_L}_L}, sparsity threshold β, learning rate η, threshold of updating mask e, maximum iterations T.
Output: The updated parameters and their binary masks W∗ = {W^{1∗}_1, · · · , W^{C_L∗}_L}, m = {0, 1}^{∑_{l=1}^{L} C_l}.
1: Initialize W∗ by W, m = 1, and t = 0.
2: repeat
3:   Forward Pass: Choose a minibatch from D, and conduct forward propagation and loss computation with W∗ and m via Eq. 3.
4:   Backward Pass: Compute the gradient of the filters, ∇W∗_l = ∂L(Y, g(X; W∗, m)) / ∂(W∗_l ⊙ m_l).
5:   Update: If Mod((t + 1), e) == 0, then update m by the function h(·) based on the current W∗. Update W∗ via Eq. 5 and the current gradient ∇W∗_l.
6:   t := t + 1.
7: until convergence or t reaches the maximum iterations T.
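To make the control flow of Algorithm 1 concrete, here is a minimal runnable Python/NumPy sketch of the greedy alternating loop on a toy masked linear layer (our own simplification, not the paper's CNN setting; helper names are ours). `saliency_fn` stands in for the global discriminative function h(·); the paper's Taylor-expansion instantiation is given in Sec. 3.4, and a simple magnitude criterion is used as a stand-in in the usage below.

```python
import numpy as np

def top_beta_mask(scores, beta):
    """Keep the top beta-fraction of filters (mask entry 1), zero out the rest."""
    k = max(1, int(round(beta * scores.size)))
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[::-1][:k]] = 1.0
    return mask

def gdp_toy(x, y, saliency_fn, beta=0.5, lr=0.05, e=5, T=300, seed=0):
    """Algorithm 1 on one masked linear layer with a 0.5*MSE loss:
    alternate SGD updates of W* (Eq. 5) with periodic global mask updates."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(x.shape[1], y.shape[1]))   # W*: one column per "filter"
    m = np.ones(y.shape[1])                         # step 1: full global mask m = 1
    for t in range(T):
        y_hat = x @ (w * m)                         # step 3: forward pass with W* ⊙ m (Eq. 3)
        grad = x.T @ (y_hat - y) / len(x)           # step 4: gradient w.r.t. W* ⊙ m
        if (t + 1) % e == 0:                        # step 5: reselect the mask every e iterations
            m = top_beta_mask(saliency_fn(w, grad), beta)
        w = w - lr * grad                           # step 5: update all filters via Eq. 5
    return w, m

# Toy usage with a magnitude-based stand-in for h(.); the paper's TE criterion is in Sec. 3.4.
rng = np.random.default_rng(1)
x = rng.normal(size=(64, 8))
y = x @ rng.normal(size=(8, 6))
w, m = gdp_toy(x, y, saliency_fn=lambda w, g: np.abs(w).sum(axis=0), beta=0.5)
print("kept filters:", m)
```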

To accelerate the convergence of Algorithm 1, we set a low frequency for the global mask updating, controlled by the threshold e, and the global mask is not updated while the network is in the warm-up phase (i.e., right after a mask update). To explain, when the network has a large loss in the unstable warm-up phase, frequently updating the global mask cannot provide useful information to guide the network pruning. Therefore, we set e to a large value in the warm-up phase, which slows down the updating frequency of the global mask. After the warm-up phase, we decrease the value of e to accelerate the updating of both the global mask and the filter weights. The detailed setups of the threshold e for different networks are discussed in our experiments.

3.4 The Global Mask

To obtain the global mask m, a global discriminative function is further required to evaluate the importance of each filter. We introduce a criterion that measures the contribution of filters based on the Taylor expansion, termed TE.

Taylor expansion (TE). We propose a criterion based on the Taylor expansion, which identifies and removes redundant filters whose removal has a limited impact on the loss function. Let W^{k∗}_l be the k-th filter of the l-th layer, as presented in Sec. 3.2. For notational convenience, we consider the global output function whose global mask has all entries equal to 1, i.e., g(X; W∗, m = 1) = g(X; W∗). To give every filter a chance to be selected as salient, all entries in the global mask are first set to 1, and we have:

|∆L(Y, g(X; W^{k∗}_l))| = |L(Y, g(X; W^{k∗}_l = 0)) − L(Y, g(X; W∗))|,   (6)

where L(Y, g(X; W^{k∗}_l = 0)) evaluates the loss when the k-th filter of the l-th layer is pruned, while L(Y, g(X; W∗)) evaluates the loss when all filters are kept. To facilitate the discussion, the notation in Eq. 6 is simplified as:

|∆L(W^{k∗}_l)| = |L(D, W^{k∗}_l = 0) − L(D, W∗)|.   (7)

Therefore, we can estimate the change of the loss ∆L(W^{k∗}_l) by approximating L(D, W∗) with a first-order Taylor expansion at W^{k∗}_l = 0:

|∆L(W^{k∗}_l)| ≈ |(∂L(D, W∗)/∂W^{k∗}_l) · W^{k∗}_l|,   (8)

where the gradient ∂L(D, W∗)/∂W^{k∗}_l is obtained via back-propagation. Since the filter W^{k∗}_l is a d²C_{l−1}-dimensional vector, we calculate the change of the loss |∆L(W^{k∗}_l)| by accumulating the products of the loss function's gradient and the filter's own values³:

|∆L(W^{k∗}_l)| ≈ |∑_{r=1}^{R} (∂L(D, W∗)/∂W^{k∗}_{l,r}) · W^{k∗}_{l,r}|,   (9)

where R is the dimension of a filter. Therefore, we construct a function that measures the saliency score of a filter, i.e., f_T: R^{d²C_{l−1}} → R_+, with

f_T(W^{k∗}_l) = |∑_{r=1}^{d²C_{l−1}} (∂L(D, W∗)/∂W^{k∗}_{l,r}) · W^{k∗}_{l,r}|.   (10)

The saliency scores of all filters are sorted in descending order to form the global index set Ind_T = sort({f_T(W^{1∗}_1), · · · , f_T(W^{C_L∗}_L)}). Each element m_i, i = 1, 2, · · · , ∑_{l=1}^{L} C_l, of the global mask is then constructed by taking the corresponding top-β∑_{l=1}^{L} C_l indices in this set:

m_i = h_i(W∗) = 1 if i ∈ Ind_T[1 : β∑_{l=1}^{L} C_l], and 0 otherwise.   (11)

³ In practice, the entire training data is divided into M minibatches, and we average |∆L(W^{k∗}_l)| over the M minibatches.
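For reference, the TE score of Eq. 10 and the top-β global mask of Eq. 11 can be sketched as follows in NumPy (our own helper names; filters are assumed to be stored as columns of each layer's reshaped weight matrix, as in Eq. 2). The resulting score function could be plugged in as the `saliency_fn` of the Algorithm 1 sketch above.

```python
import numpy as np

def te_scores(w_mats, grad_mats):
    """Eq. (10): for each layer matrix W*_l of shape (d^2*C_{l-1}, C_l) and its gradient
    dL/dW*_l, score each filter (column) by |sum_r grad_{l,r} * w_{l,r}|; concatenate layers."""
    return np.concatenate([np.abs(np.sum(g * w, axis=0)) for w, g in zip(w_mats, grad_mats)])

def global_mask(scores, beta):
    """Eq. (11): m_i = 1 for the top beta * sum_l C_l saliency scores, 0 otherwise."""
    k = int(round(beta * scores.size))
    order = np.argsort(scores)[::-1]     # Ind_T: filter indices sorted by descending saliency
    m = np.zeros(scores.size)
    m[order[:k]] = 1.0
    return m

# toy usage: two "layers" with 4 and 6 filters; gradients would come from back-propagation
rng = np.random.default_rng(0)
w_mats = [rng.normal(size=(27, 4)), rng.normal(size=(36, 6))]
grad_mats = [rng.normal(size=w.shape) for w in w_mats]
m = global_mask(te_scores(w_mats, grad_mats), beta=0.7)   # keeps 7 of the 10 filters
```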

4 Experiments

4.1 Experimental Settings

We evaluate the proposed GDP approach on AlexNet [Krizhevsky et al., 2012], VGG-16 [Simonyan and Zisserman, 2014] and ResNet-50 [He et al., 2016] on the ImageNet 2012 dataset [Russakovsky et al., 2015], which contains about 1.2M training images and 50K validation images of 1,000 classes. Training images in the ImageNet dataset are rescaled to the size of 256×256, with a 224×224 (227×227 for AlexNet) crop randomly sampled from each image and its horizontal flip. The accuracy is measured on the validation set using single-view testing (central patch only).

Implementation Details. We implement our global dynamic pruning in TensorFlow [Abadi et al., 2016].


Model: VGG-16

Layer     FLOPs     FLOPs% (GDP-D)  FLOPs% (GDP)
Conv1_1   89.91M    56.25%          56.25%
Conv1_2   1.85B     33.44%          42.24%
Conv2_1   926.45M   32.97%          41.63%
Conv2_2   1.85B     54.21%          54.21%
Conv3_1   925.65M   51.12%          51.50%
Conv3_2   1.85B     51.55%          52.75%
Conv3_3   1.85B     98.44%          98.05%
Conv4_1   925.25M   58.79%          49.02%
Conv4_2   1.85B     35.94%          12.60%
Conv4_3   1.85B     38.69%          12.56%
Conv5_1   462.52M   46.73%          42.52%
Conv5_2   462.52M   55.23%          79.35%
Conv5_3   462.52M   50.56%          87.52%
Total     15.36B    51.16%          48.03%

Table 1: FLOPs comparison of GDP and GDP-D, when β is set to 0.7. FLOPs% is the percentage of remaining FLOPs.

Method   Hy-P  FLOPs    CPU (ms)  GPU (ms)  Top-1 Acc.  Top-5 Acc.
AlexNet  -     729.7M   2,990     36        56.60%      80.12%
SSL      -     559.3M   2,493     30        55.28%      78.88%
FMP      -     434.8M   1,839     27        54.73%      78.53%
GDP      0.7   455.2M   2,252     28        56.46%      80.01%
GDP      0.6   348.2M   1,760     22        55.83%      79.64%
GDP      0.5   263.1M   1,629     17        54.82%      78.97%

Table 2: Comparing different pruning methods for accelerating AlexNet. Hy-P denotes the setting of the hyper-parameter β; the CPU/GPU columns report inference time in ms with a batch size of 32 (the same in the following tables).

To get the baseline accuracy of each network, we train AlexNet, VGG-16 and ResNet-50 from scratch, following the same pre-processing and hyper-parameter settings as Krizhevsky et al. [Krizhevsky et al., 2012], Simonyan et al. [Simonyan and Zisserman, 2014] and He et al. [He et al., 2016], respectively. The results of each reference model are shown in Table 2, Table 3 and Table 4. We solve the optimization problem of Eq. 4 on an NVIDIA GTX 1080Ti GPU with 128GB of RAM. All models are trained for a total of 30 epochs with batch sizes of 128, 32 and 32 for AlexNet, VGG-16 and ResNet-50, respectively. The learning rate is set to an initial value of 0.001 and then scaled by 0.1 every 10 epochs. The weight decay is set to 0.0005 and the momentum to 0.9. To re-train the pruned network, we use an initial learning rate of 0.0001 for a total of 20 epochs, dropped by a constant factor of 10 every 10 epochs. In terms of the threshold e, we use different values for different CNNs. More specifically, for AlexNet we use e = 3 for the first 10 epochs, e = 2 for the second 10 epochs and e = 1 for the remaining epochs. When training VGG-16 and ResNet-50, we use e = 2 for the first 20 epochs and e = 1 for the remaining 10 epochs. In terms of the hyper-parameter β, we vary β over the set {0.5, 0.6, 0.7} to select the best trade-off between speedup rate and classification accuracy.
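For reference, the learning-rate and mask-update schedules just described can be written down as small helpers (a sketch in our own notation; the function names are not from the paper):

```python
def learning_rate(epoch, base_lr=0.001):
    """Pruning phase (Sec. 4.1): start at 0.001 and scale by 0.1 every 10 epochs."""
    return base_lr * (0.1 ** (epoch // 10))

def mask_update_threshold(epoch, net):
    """Per-network schedule of the mask-update threshold e described in Sec. 4.1."""
    if net == "alexnet":
        return 3 if epoch < 10 else (2 if epoch < 20 else 1)
    return 2 if epoch < 20 else 1   # VGG-16 and ResNet-50 share the same schedule
```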

Evaluation Protocols. The Top-1 and Top-5 classification accuracy on the validation set are employed as the evaluation protocol. We further measure the inference time under a batch size of 32 on a single-thread Intel Xeon E5-2620 CPU and an NVIDIA GTX TITAN X GPU to assess the trade-off between speedup ratio and classification accuracy.

[Figure 2 plots Top-1 accuracy (%) against GPU speedup ratio: (a) AlexNet acceleration; (b) VGG-16 acceleration.]

Figure 2: Comparison of different filter selection schemes for accelerating AlexNet and VGG-16. "Scratch" means that the network is trained from scratch, and "Ori" denotes the original CNNs. GDP-D refers to global filter pruning without dynamic updating.

Method   Hy-P  FLOPs   CPU (ms)  GPU (ms)  Top-1 Acc.  Top-5 Acc.
VGG-16   -     15.5B   10,824    322       70.32%      89.42%
FMP      -     4.2B    5,237     167       65.20%      84.86%
CP       -     4.9B    5,618     159       67.34%      87.92%
GDP      0.7   7.5B    7,122     205       69.88%      89.16%
GDP      0.6   6.4B    6,680     188       68.80%      88.77%
GDP      0.5   3.8B    4,979     139       67.51%      87.95%

Table 3: The results of accelerating VGG-16.

4.2 Quantitative Evaluation

We compare the proposed GDP method with state-of-the-art filter pruning methods, including structured sparsity learning (SSL) [Wen et al., 2016], ℓ1-norm pruning (L1) [Li et al., 2016], channel-based pruning (CP) [Luo et al., 2017], feature-map-based pruning (FMP) [Molchanov et al., 2017], and average percentage of zeros (APoZ) [Hu et al., 2016].

AlexNet and VGG-16 on ImageNet. Both AlexNet and VGG-16 contain several convolutional layers and 3 fully-connected layers. We first compare the proposed GDP to five layer-wise filter pruning methods in terms of GPU speedup and Top-1 accuracy. As shown in Figure 2, the results reveal three key observations: (1) Without dynamic updating (GDP-D), layer-wise pruning (e.g., L1 and APoZ) performs better than global pruning. To explain, GDP-D prunes many filters with potential inter-relations at once, which leads to a significant accuracy loss even with global fine-tuning. Instead, local fine-tuning is repeatedly applied in L1, APoZ, CP and FMP to reduce the accuracy loss after each layer is pruned. However, repeated local fine-tuning is quite time-consuming and seriously affects the pruning efficiency. For example, pruning each layer of VGG-16 with L1, APoZ or CP requires local fine-tuning for 10 epochs on average, amounting to 130 epochs in total to prune the entire network, which is 6.5 times and 2.5 times more epochs than GDP-D and GDP, respectively. (2) We randomly initialize a model with the same number of filters per layer as GDP and train it from scratch; it does not reach the accuracy of L1, APoZ or GDP. This result explicitly verifies that the initialization of a deep model is critical for non-convex optimization. (3) Dynamic updating is very effective in improving the network's discriminability in GDP. Compared with GDP-D, GDP employs dynamic updating to iteratively tune the filters in a retrievable way, which can restore misjudged salient filters and correspondingly improve the classification accuracy of the pruned network.


[Figure 3: (a) training loss vs. iteration number and testing Top-1 accuracy (%) vs. epoch when using GDP to prune VGG-16 at β = 0.7, 0.6, 0.5; (b) training loss and testing Top-1 accuracy when fine-tuning the pruned networks at the same β values.]

Figure 3: Comparison of different β for pruning VGG-16 via the GDP scheme.

Method     Hy-P  FLOPs   CPU (ms)  GPU (ms)  Top-1 Acc.  Top-5 Acc.
ResNet-50  -     3.86B   9,822     345       75.13%      92.30%
CP         -     2.44B   5,999     278       72.04%      90.67%
CP         -     1.71B   5,253     246       71.01%      90.02%
GDP        0.7   2.24B   6,616     279       72.61%      91.05%
GDP        0.6   1.88B   5,821     261       71.89%      90.71%
GDP        0.5   1.57B   5,012     242       70.93%      90.14%

Table 4: The results of accelerating ResNet-50.

For example, GDP achieves 3.9% higher Top-1 accuracy than GDP-D when pruning VGG-16 at about 2.32× speedup. Moreover, GDP tends to prune more filters in the layers with high computational complexity, which leads to a larger FLOPs reduction and a higher CNN speedup. For example, as shown in Table 1, GDP prunes more filters in the middle layers of VGG-16 with high computational complexity (e.g., Conv4_1, Conv4_2 and Conv4_3), while GDP-D tends to prune more filters in the last layers (e.g., Conv5_2 and Conv5_3). Equipped with dynamic updating, GDP achieves the best performance in pruning both AlexNet and VGG-16 among all filter pruning schemes. For AlexNet, GDP achieves 1.1% higher Top-1 accuracy than FMP at about 1.63× GPU speedup. For VGG-16, GDP increases the speedup rate to 2.32× with 67.51% Top-1 accuracy, compared to 2.03× with 67.37% Top-1 accuracy for CP. The detailed evolution of the training loss and the testing Top-1 accuracy when using the GDP scheme to prune VGG-16 is presented in Figure 3. The figure shows that GDP converges to a high accuracy after 30 epochs, and the classification accuracy of the pruned network can be further improved by a simple fine-tuning.

Subsequently, GDP is also compared with several state-of-the-art filter pruning methods in terms of FLOPs reduction and CPU/GPU speedup, as shown in Table 2 and Table 3. For comparison, SSL [Wen et al., 2016] employs filter-wise sparsity regularization and achieves a limited computation reduction, i.e., about a 170M FLOPs reduction with a 1.2× CPU speedup, while incurring a 1.32% loss in Top-1 accuracy. As for FMP [Molchanov et al., 2017], its motivation is similar to our TE mask, but with totally different filter selection and training designs: our GDP prunes the filters at once in a retrievable way, while FMP prunes one filter permanently at a time. As shown in Table 2, GDP achieves higher Top-1/5 accuracy with a larger FLOPs reduction than FMP. As for CP [Luo et al., 2017], it conducts a greedy local channel selection to prune the channels with the smallest channel approximation error. As shown in Table 3, CP yields a final pruned network with a 3.16× FLOPs reduction, a 1.9× CPU speedup and a 1.5% loss in Top-5 accuracy. Compared with CP, GDP prunes the redundant filters faster, in a global manner and without intermediate feature responses, and achieves better performance with about a 4× FLOPs reduction, a 2.17× CPU speedup and a 1.47% Top-5 accuracy loss.

ResNet-50 on ImageNet. ResNet-50 is a more compact structure with less redundancy than AlexNet and VGG-16. Since significantly fewer FLOPs are located in the last layer and the projection shortcut layer, we only prune the first two layers of each residual block and leave the last layer and the projection shortcut layer unchanged, so as to match the output dimensions. In fact, the FLOPs of the last convolutional layer can still be significantly reduced, since the large number of input channels has already been reduced by pruning filters in the second convolutional layer. As shown in Table 4, GDP achieves better performance in pruning ResNet-50. As more filters are pruned (i.e., as β decreases from 0.7 to 0.5), the FLOPs can be significantly reduced by GDP, with only a slight increase in the Top-5 accuracy loss. Compared with CP, GDP achieves both a higher Top-5 accuracy (90.14% vs. 90.02%) and faster inference.


[Figure 4 panels: (a) Epoch 0; (b) Epoch 1; (c) Epoch 30.]

Figure 4: Dynamic updating of filters, masks and output feature maps in the first layer of VGG-16. Left: filters and masks; right: output feature maps. In the left column, each rectangle contains a filter and its mask; the black ones indicate the mask is unchanged, and the red ones indicate the filter and mask are updated. In addition, the mask is shown as a smaller rectangle, in which a value of 1 indicates the corresponding filter is salient, and 0 indicates it is unsalient. In the right column, the correspondingly changed feature maps are shown in red boxes.

Concretely, GDP requires 5,012 ms of CPU and 242 ms of GPU online inference, versus 5,253 ms and 246 ms for CP. To explain, dynamic updating in GDP significantly improves the discriminability and generalization of the pruned network.

4.3 Visualization of Dynamic Updating

We have quantitatively verified the effectiveness of dynamic updating in our global pruning scheme. To show the process of dynamic updating, we visualize the filters, masks and output feature maps of the first convolutional layer of VGG-16 when using the proposed GDP method with β set to 0.5, as presented in Figure 4. Before pruning the filters (i.e., when all masks equal 1), low-level features (e.g., edge, color and corner detectors of various directions) can be found among the listed filters and output feature maps, as shown in Figure 4(a). After the first round of mask updating, the network is temporarily pruned by selecting the salient filters based on TE, and it is then updated to adapt to the pruned network, as shown in Figure 4(b). After 30 epochs, the network converges and adaptively obtains the salient filters and their masks through dynamic updating. In Figure 4(c), a number of filters, masks and output feature maps differ from those after the first update, which indicates that the saliency of some filters was mistakenly judged at the beginning but was successfully corrected during dynamic updating.

5 Conclusion

This work presents a global dynamic pruning (GDP) scheme to prune redundant filters for CNN acceleration. We employ a global discriminative function, based on prior knowledge of each filter, to globally prune the unsalient filters across all layers. To decrease the accuracy loss caused by incorrect global pruning, we dynamically update the filter saliency over the entire pruned sparse network. Specially, we further handle the corresponding non-convex optimization problem of the proposed GDP, which is effectively solved via stochastic gradient descent with greedy alternate updating. In experiments, the proposed GDP achieves superior performance in accelerating various cutting-edge CNNs on ILSVRC-12, compared with the state-of-the-art filter pruning methods.

Acknowledgments

This work is supported by the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503), the Natural Science Foundation of China (No. U1705262, No. 61772443, No. 61402388 and No. 61572410), the Post-Doctoral Innovative Talent Support Program under Grant BX201600094, the China Post-Doctoral Science Foundation under Grant 2017M612134, the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49), and the Natural Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).

References

[Abadi et al., 2016] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[Anwar et al., 2015] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015.

[Courbariaux and Bengio, 2016] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[Courbariaux et al., 2015] M. Courbariaux, Y. Bengio, and J. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.

[Denton et al., 2014] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, pages 1269–1277, 2014.

[Girshick et al., 2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.

[Gong et al., 2014] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

[Guo et al., 2016] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In NIPS, pages 1379–1387, 2016.

[Han et al., 2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015.

[Han et al., 2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, pages 1135–1143, 2015.

[Han et al., 2016] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: Efficient inference engine on compressed deep neural network. In ISCA, 2016.

[Hassibi and Stork, 1993] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, 1993.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[Hu et al., 2016] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

[Jaderberg et al., 2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

[Jia et al., 2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678. ACM, 2014.

[Kim et al., 2015] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[Lebedev and Lempitsky, 2016] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In CVPR, pages 2554–2564, 2016.

[Lebedev et al., 2014] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.

[LeCun et al., 1989] Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In NIPS, 1989.

[Li et al., 2016] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[Lin et al., 2016] Shaohui Lin, Rongrong Ji, Xiaowei Guo, and Xuelong Li. Towards convolutional neural networks compression via global error reconstruction. In IJCAI, pages 1573–1759, 2016.

[Lin et al., 2017] Shaohui Lin, Rongrong Ji, Chao Chen, and Feiyue Huang. Espace: Accelerating convolutional neural networks via eliminating spatial & channel redundancy. In AAAI, pages 1424–1430, 2017.

[Liu et al., 2015] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In CVPR, pages 806–814, 2015.

[Long et al., 2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.

[Luo et al., 2017] Jianhao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, 2017.

[Molchanov et al., 2017] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.

[Park et al., 2017] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. Faster cnns with direct sparse convolutions and guided pruning. In IJCAI, 2017.

[Rastegari et al., 2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Wen et al., 2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.

[Wu et al., 2016] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
