Constructing Fast Network through Deconstruction of Convolution

Yunho Jeon
School of Electrical Engineering, KAIST

[email protected]

Junmo Kim
School of Electrical Engineering, KAIST

[email protected]

Abstract

Convolutional neural networks have achieved great success in various vision tasks; however, they incur heavy resource costs. By using deeper and wider networks, network accuracy can be improved rapidly. However, in an environment with limited resources (e.g., mobile applications), heavy networks may not be usable. This study shows that naive convolution can be deconstructed into a shift operation and pointwise convolution. To cope with various convolutions, we propose a new shift operation called active shift layer (ASL) that formulates the amount of shift as a learnable function with shift parameters. This new layer can be optimized end-to-end through backpropagation, and it can provide optimal shift values. Finally, we apply this layer to a light and fast network that surpasses existing state-of-the-art networks. Code is available at https://github.com/jyh2986/Active-Shift.

1 Introduction

Deep learning has been applied successfully in various fields. For example, convolutional neural networks (CNNs) have been developed and applied successfully to a wide variety of vision tasks, and various network structures have been studied to this end[15, 19, 21, 20, 22, 6, 7, 9]. In particular, networks are being made deeper and wider because doing so improves accuracy. This approach has been facilitated by hardware developments such as graphics processing units (GPUs).

However, this approach increases the inference and training times and consumes more memory. Therefore, a large network might not be implementable in environments with limited resources, such as mobile applications. In fact, accuracy may need to be sacrificed in such environments. Two main types of approaches have been proposed to avoid this problem. The first approach is network reduction via pruning[5, 4, 17], in which the learned network is reduced while maximizing accuracy. This method can be applied to most general network architectures. However, it requires additional processes after or during training, and therefore, it may further increase the overall amount of time required for preparing the final networks.

The second approach is to use lightweight network architectures[10, 8, 18] or new components[13, 23, 26] to accommodate limited environments. This approach does not focus only on limited resource environments; it can provide better solutions for many applications by reducing resource usage while maintaining or even improving accuracy. Recently, grouped or depthwise convolution has attracted attention because it reduces the computational complexity greatly while maintaining accuracy. Therefore, it has been adopted in many new architectures[2, 24, 8, 18, 26].

Decomposing a convolution is effective for reducing the number of parameters and computational complexity. Initially, a large convolution was decomposed using several small convolutions[19, 22]. Binary pattern networks[13] have also been proposed to reduce network size.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

This raises the question of what the atomic unit is for composing a convolution. We show that a convolution can be deconstructed into two components: a 1×1 convolution and a shift operation.

Recently, shift operations have been used to replace spatial convolutions[23] and reduce the number of parameters and computational complexity. In this approach, shift amounts are assigned heuristically by grouping input channels. In this study, we formulate the shift operation as a learnable function with shift parameters and optimize it through training. This generalization affords many benefits. First, we do not need to assign shift values heuristically; instead, they can be trained from random initializations. Second, we can simulate convolutions with large receptive fields such as a dilated convolution[1, 25]. Finally, we can obtain a significantly improved tradeoff between performance and complexity.

The contributions of this paper are summarized as follows:

• We deconstruct a heavy convolution into two atomic operations: 1×1 convolution and shift operation. This deconstruction can greatly reduce the parameters and computational complexity.

• We propose an active shift layer (ASL) with learnable shift parameters that allows optimization through backpropagation. It can replace existing spatial convolutions while reducing the computational complexity and inference time.

• The proposed method is used to construct a light and fast network. This network shows state-of-the-art results with fewer parameters and low inference time.

2 Related Work

Decomposition of Convolution VGG[19] decomposes a 5×5 convolution into two 3×3 convolutions to reduce the number of parameters and simplify network architectures. GoogleNet[22] uses 1×7 and 7×1 spatial convolutions to simulate a 7×7 convolution. Lebedev et al. [16] decomposed a convolution into a sum of four convolutions with small kernels. Recently, depthwise separable convolution has been shown to achieve accuracy similar to naive convolution while reducing computational complexity[2, 8, 18].

In addition to decomposing a convolution into other types of convolutions, new units have been proposed. A fixed binary pattern[13] has been shown to be an efficient alternative to spatial convolution. A shift operation[23] can approximately simulate spatial convolution without any parameters or floating point operations (FLOPs). The active convolution unit (ACU)[11] and deformable convolution[3] showed that the input positions of a convolution can be learned by introducing continuous displacement parameters.

Mobile Architectures Various network architectures have been proposed for mobile applications with limited resources. SqueezeNet[10] designed a fire module for reducing the number of parameters and compressed the trained network to a very small size. ShuffleNet[26] used grouped 1×1 convolution to reduce dense connections while retaining the network width and suggested a shuffle layer to mix grouped features. MobileNet[18, 8] used depthwise convolution to reduce the computational complexity.

Network Pruning Network pruning[5, 4, 17] is not closely related to our work; however, this methodology has similar aims. It reduces the computational complexity of trained architectures while maintaining the accuracy of the original networks. It can also be applied to our networks for further optimization.

3 Method

The basic convolution has many weight parameters and a large computational complexity. If the dimension of the weight parameter is D × C × K, the computational complexity (in FLOPs) is

$$ (D \times C \times K) \times (W \times H), \qquad (1) $$

where K is the spatial dimension of the kernel (e.g., K is nine for a 3×3 convolution), C and D are the numbers of input and output channels, respectively, and the spatial dimension of the input feature is W × H.
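To make Eq. (1) concrete, here is a small Python sketch (ours, for illustration; the helper name conv_flops is not from the paper) that evaluates the formula for the 64-channel, 224×224 setting used later in Table 1:

```python
def conv_flops(D, C, K, W, H):
    """Multiply-accumulate count of a basic convolution, Eq. (1): (D*C*K) * (W*H)."""
    return (D * C * K) * (W * H)

# 3x3 convolution (K = 9), 64 input and 64 output channels, 224x224 feature map.
print(conv_flops(D=64, C=64, K=9, W=224, H=224))   # 1,849,688,064 -> ~1.85 GFLOPs
# The corresponding 1x1 convolution (K = 1) costs K = 9 times less,
# which matches the ~206M figure reported for a 1x1 convolution in Table 1.
print(conv_flops(D=64, C=64, K=1, W=224, H=224))   # 205,520,896 -> ~206 MFLOPs
```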

3.1 Deconstruction of Convolution

The basic convolution for one spatial position can be formulated as follows:

$$ y_{d,m,n} = \sum_{c}\sum_{k} w_{d,c,k} \cdot x_{c,\,m+i_k,\,n+j_k} = \sum_{k}\sum_{c} w_{d,c,k} \cdot x_{c,\,m+i_k,\,n+j_k}, \qquad (2) $$

where x is an element of the C × W × H input tensor, y is an element of the D × W × H output tensor, and w is an element of the D × C × K weight tensor W. k is the spatial index of the kernel, which points from the top-left to the bottom-right of the spatial dimensions of the kernel, and i_k and j_k are the displacement values for the corresponding kernel index k. The above equation can be converted to a matrix multiplication:

$$ Y = W^{+} \times X^{+} = \begin{bmatrix} W_{:,:,1} & W_{:,:,2} & \cdots & W_{:,:,K} \end{bmatrix} \begin{bmatrix} X^{1}_{:,:} \\ X^{2}_{:,:} \\ \vdots \\ X^{K}_{:,:} \end{bmatrix} = \sum_{k} W_{:,:,k} \times X^{k}_{:,:} = \sum_{k} W_{:,:,k} \times S_k(X), \qquad (3) $$

where W⁺ and X⁺ are reordered matrices for the weight and input, respectively. W_{:,:,k} represents the D × C matrix corresponding to the kernel index k, X is a C × (W·H) matrix, and X^k_{:,:} is a C × (W·H) matrix that represents a spatially shifted version of the input matrix X with shift amount (i_k, j_k). Then, W⁺ becomes a D × (K·C) matrix, X⁺ becomes a (K·C) × (W·H) matrix, and the output Y forms a D × (W·H) matrix.
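As a sanity check on Eq. (3), the following NumPy sketch (illustrative code, not the authors' implementation; the helper shift2d and all shapes are our choices) builds a 3×3 convolution as a sum of 1×1 convolutions, i.e., D×C matrix products, applied to integer-shifted copies of the input and compares it with a naive evaluation of Eq. (2):

```python
import numpy as np

def shift2d(x, di, dj):
    """S_k(X): out[c, m, n] = x[c, m + di, n + dj], with zeros outside the feature map."""
    C, H, W = x.shape
    out = np.zeros_like(x)
    src_i = slice(max(di, 0), H + min(di, 0)); dst_i = slice(max(-di, 0), H + min(-di, 0))
    src_j = slice(max(dj, 0), W + min(dj, 0)); dst_j = slice(max(-dj, 0), W + min(-dj, 0))
    out[:, dst_i, dst_j] = x[:, src_i, src_j]
    return out

rng = np.random.default_rng(0)
C, D, H, W = 4, 6, 8, 8
x = rng.standard_normal((C, H, W))
w = rng.standard_normal((D, C, 3, 3))            # weight tensor, kernel position indexed by (di+1, dj+1)

# Naive convolution, Eq. (2): loop over output positions and kernel offsets.
ref = np.zeros((D, H, W))
for m in range(H):
    for n in range(W):
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if 0 <= m + di < H and 0 <= n + dj < W:
                    ref[:, m, n] += w[:, :, di + 1, dj + 1] @ x[:, m + di, n + dj]

# Deconstruction, Eq. (3): one 1x1 convolution (a D x C matrix product) per shifted input.
decon = np.zeros((D, H * W))
for di in (-1, 0, 1):
    for dj in (-1, 0, 1):
        decon += w[:, :, di + 1, dj + 1] @ shift2d(x, di, dj).reshape(C, -1)

assert np.allclose(ref, decon.reshape(D, H, W))
```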

Eq. (3) shows that the basic convolution is simply the sum of 1×1 convolutions on shifted inputs. The shifted input X^k_{:,:} can be formulated using the shift function S_k that maps the original input to the shifted input corresponding to the kernel index k. The conventional convolution uses the usual shifted input with integer-valued shift amounts for each kernel index k. As an extreme case, if we can share the shifted inputs regardless of the kernel index, that is, S_k(X) = S(X), this simplifies to just one pointwise (i.e., 1×1) convolution and greatly reduces the computational complexity:

$$ \sum_{k} W_{:,:,k} \times S_k(X) = \sum_{k} W_{:,:,k} \times S(X) = \Big(\sum_{k} W_{:,:,k}\Big) \times S(X) = W \times S(X). \qquad (4) $$

However, in this extreme case, if only one shift is applied to all input channels, the receptive field of the convolution is too limited and the network will provide poor results. To overcome this problem, ShiftNet[23] introduced grouped shift, which applies different shift values by grouping input channels. This shift function can be represented by Eq. (5) and is followed by a single pointwise convolution Y = W × S_G(X).

$$ S_G(X) = \begin{bmatrix} X^{1}_{1:n,:} \\ X^{2}_{n+1:2n,:} \\ \vdots \\ X^{G}_{(G-1)n+1:,:} \end{bmatrix} \quad (5) \qquad\qquad S_C(X) = \begin{bmatrix} X^{1}_{1,:} \\ X^{2}_{2,:} \\ \vdots \\ X^{C}_{C,:} \end{bmatrix} \quad (6) $$

where n is the number of channels per kernel position (i.e., per shift group) and equals ⌊C/K⌋, and G is the number of shift groups. If C is a multiple of K, G equals K; otherwise, G is K+1. The shift function applies different shift values according to the group number rather than the kernel index. The amount of shift is assigned heuristically to cover all kernel dimensions, and a pointwise convolution is applied before and after the shift layer to make it invariant to the permutation of input and output channels.
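The heuristic assignment behind Eq. (5) can be made explicit with a short sketch (our illustration of the grouped-shift idea of ShiftNet[23], not its released code; the handling of leftover channels is one possible convention):

```python
def grouped_shift_assignment(C, kernel=3):
    """Per-channel integer shifts for grouped shift (Eq. (5)) with a kernel x kernel window."""
    K = kernel * kernel
    n = C // K                                           # channels per shift group
    r = kernel // 2
    offsets = [(i, j) for i in range(-r, r + 1) for j in range(-r, r + 1)]
    shifts = []
    for c in range(C):
        g = c // n if n > 0 else K                       # group index of channel c
        shifts.append(offsets[g] if g < K else (0, 0))   # leftover group gets no shift (a convention)
    return shifts

# 20 channels, 3x3 kernel: 9 groups of 2 channels covering all offsets, plus 2 leftover channels.
print(grouped_shift_assignment(20))
```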

3.2 Active Shift Layer

Applying the shift by group is a little artificial, and if the kernel size is large, the number of input channels per shift is reduced. To solve this problem, we suggest a depthwise shift layer that applies different shift values for each channel (Eq. (6)). This is similar to decomposing a convolution into depthwise separable convolutions: depthwise separable convolution first makes features with a depthwise convolution and mixes them with 1×1 convolutions. Similarly, a depthwise shift layer shifts each input channel, and the shifted inputs are mixed with 1×1 convolutions.

Figure 1: Comparison of shift operations: (a) grouped shift applies shifting to each group, with the shift amount assigned heuristically[23]; (b) ASL applies shifting to each channel using shift parameters that are optimized by training.

Because the shifted input S_C(X) goes through a single 1×1 convolution, this reduces the computational complexity by a factor of the kernel size K. More importantly, by removing both the spatial convolution and sparse memory access, it is possible to construct a network with only dense operations like 1×1 convolutions. This can provide a greater speed improvement than that achieved by reducing only the number of FLOPs.
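A minimal NumPy sketch of this depthwise shift + pointwise convolution path (ours, with integer shifts for simplicity; ASL below replaces the hand-picked shifts with learnable real-valued ones): each channel is moved by its own displacement, and a single dense D×C matrix product mixes the shifted channels.

```python
import numpy as np

def depthwise_shift(x, shifts):
    """S_C(X): shift channel c of x (C, H, W) by the integer pair shifts[c], zero padding."""
    C, H, W = x.shape
    out = np.zeros_like(x)
    for c, (di, dj) in enumerate(shifts):
        src_i = slice(max(di, 0), H + min(di, 0)); dst_i = slice(max(-di, 0), H + min(-di, 0))
        src_j = slice(max(dj, 0), W + min(dj, 0)); dst_j = slice(max(-dj, 0), W + min(-dj, 0))
        out[c, dst_i, dst_j] = x[c, src_i, src_j]
    return out

rng = np.random.default_rng(0)
C, D, H, W = 8, 16, 32, 32
x = rng.standard_normal((C, H, W))
w = rng.standard_normal((D, C))                       # the single 1x1 convolution
shifts = [(-1, 1), (0, 0), (1, 0), (0, -1), (1, 1), (-1, -1), (0, 1), (-1, 0)]

y = (w @ depthwise_shift(x, shifts).reshape(C, -1)).reshape(D, H, W)
print(y.shape)                                        # (16, 32, 32)

# Weight count: a 3x3 convolution needs D*C*9 weights, whereas shift + 1x1 needs
# D*C weights plus (for ASL) 2*C shift parameters.
print(D * C * 9, D * C + 2 * C)                       # 1152 vs 144
```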

Next, we consider how to assign the shift value for each channel. An exhaustive search over all possible combinations of shift values for each channel is intractable, and assigning values heuristically is suboptimal. We therefore formulate the shift values as a learnable function with an additional shift parameter θ_s that defines the amount of shift of each channel, and we call the new component the active shift layer (ASL):

$$ \theta_s = \{(\alpha_c, \beta_c) \mid 1 \le c \le C\}, \qquad (7) $$

where c is the index of the channel, and the parameters α_c and β_c define the horizontal and vertical amount of shift, respectively. If the parameters were restricted to integers, they would not be differentiable and could not be optimized. We relax the integer constraint and allow α_c and β_c to take real values; the value for a non-integer shift can then be calculated through interpolation. We used bilinear interpolation following [11]:

$$ x_{c,\,m+\alpha_c,\,n+\beta_c} = Z^{1}_{c}\,(1-\Delta\alpha_c)(1-\Delta\beta_c) + Z^{3}_{c}\,\Delta\alpha_c(1-\Delta\beta_c) + Z^{2}_{c}\,(1-\Delta\alpha_c)\Delta\beta_c + Z^{4}_{c}\,\Delta\alpha_c\Delta\beta_c, \qquad (8) $$

$$ \Delta\alpha_c = \alpha_c - \lfloor\alpha_c\rfloor, \qquad \Delta\beta_c = \beta_c - \lfloor\beta_c\rfloor, \qquad (9) $$

where (m, n) is the spatial position of the feature map, and the Z^i_c are the four nearest integer points for bilinear interpolation:

$$ Z^{1}_{c} = x_{c,\,m+\lfloor\alpha_c\rfloor,\,n+\lfloor\beta_c\rfloor}, \quad Z^{2}_{c} = x_{c,\,m+\lfloor\alpha_c\rfloor,\,n+\lfloor\beta_c\rfloor+1}, \quad Z^{3}_{c} = x_{c,\,m+\lfloor\alpha_c\rfloor+1,\,n+\lfloor\beta_c\rfloor}, \quad Z^{4}_{c} = x_{c,\,m+\lfloor\alpha_c\rfloor+1,\,n+\lfloor\beta_c\rfloor+1}. \qquad (10) $$
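A direct NumPy transcription of Eqs. (8)–(10) for a single channel (a sketch, not the released implementation; the function name and the zero treatment of out-of-bound positions are our assumptions):

```python
import numpy as np

def fractional_shift(xc, alpha, beta):
    """Output at (m, n) is xc bilinearly interpolated at (m + alpha, n + beta), Eqs. (8)-(10)."""
    H, W = xc.shape
    fa, fb = int(np.floor(alpha)), int(np.floor(beta))
    da, db = alpha - fa, beta - fb                     # fractional parts, Eq. (9)

    def gather(di, dj):
        """Z: the integer-shifted map xc[m + di, n + dj], zeros outside the feature map."""
        out = np.zeros_like(xc)
        src_i = slice(max(di, 0), H + min(di, 0)); dst_i = slice(max(-di, 0), H + min(-di, 0))
        src_j = slice(max(dj, 0), W + min(dj, 0)); dst_j = slice(max(-dj, 0), W + min(-dj, 0))
        out[dst_i, dst_j] = xc[src_i, src_j]
        return out

    z1, z2 = gather(fa, fb), gather(fa, fb + 1)
    z3, z4 = gather(fa + 1, fb), gather(fa + 1, fb + 1)
    return (z1 * (1 - da) * (1 - db) + z3 * da * (1 - db)      # Eq. (8)
            + z2 * (1 - da) * db + z4 * da * db)

x = np.arange(25, dtype=float).reshape(5, 5)
print(fractional_shift(x, 0.5, 0.0))   # rows blend with the next row; the last row is halved by the zero border
```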

By using interpolation, the shift parameters are differentiable and can therefore be trained through backpropagation. With the shift parameter θ_s, a conventional convolution can be formulated as follows with the ASL S^{θ_s}_C:

$$ Y = W \times S^{\theta_s}_{C}(X) = W \times \begin{bmatrix} X^{(\alpha_1,\beta_1)}_{1,:} \\ X^{(\alpha_2,\beta_2)}_{2,:} \\ \vdots \\ X^{(\alpha_C,\beta_C)}_{C,:} \end{bmatrix}. \qquad (11) $$
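For completeness (a short derivation we add; it is implied but not written out in the text), differentiating Eq. (8) with respect to the shift parameters, treating the floor terms as locally constant, gives

$$ \frac{\partial x_{c,\,m+\alpha_c,\,n+\beta_c}}{\partial \alpha_c} = (Z^{3}_{c} - Z^{1}_{c})(1-\Delta\beta_c) + (Z^{4}_{c} - Z^{2}_{c})\,\Delta\beta_c, \qquad \frac{\partial x_{c,\,m+\alpha_c,\,n+\beta_c}}{\partial \beta_c} = (Z^{2}_{c} - Z^{1}_{c})(1-\Delta\alpha_c) + (Z^{4}_{c} - Z^{3}_{c})\,\Delta\alpha_c, $$

so the gradient of each shift parameter is a weighted difference of neighboring pixel values, which is what allows α_c and β_c in Eq. (11) to be optimized end-to-end together with the convolution weights.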

Table 1: Comparison of inference time vs. number of FLOPs. A smaller number of FLOPs does not guarantee fast inference.

Layer            Inference time   FLOPs
3×3 dw-conv      39 ms            29M
1×1 conv         11 ms            206M
BN+Scale/Bias*   3 ms             <5M
ReLU             5 ms             <5M
Eltwise Sum      2 ms             <5M

* The BN implementation of Caffe [12] is not efficient because it is split into two components. We integrated them for fast inference.

Figure 2: Ratio of time to FLOPs, i.e., the inference time per 1M FLOPs. A lower value means that the unit runs efficiently. Time is measured using an Intel i7-5930K CPU with a single thread and averaged over 100 repetitions.

ASL affords many advantages compared to the previous shift layer[23] (Fig. 1). First, shift values do not need to be assigned manually; instead, they are learned through backpropagation during training. Second, ASL does not depend on the original kernel size. In previous studies, if the kernel size changed, the grouping for the shift also changed; ASL is independent of the kernel size and can enlarge its receptive field by increasing the amount of shift. This means that ASL can mimic dilated convolution[1, 25]. Furthermore, there is no need to permute channels to achieve the invariance property in ASL; therefore, it is not necessary to use a 1×1 convolution before ASL. Although our component has parameters, unlike the heuristic shift layer[23], these are almost negligible compared to the number of convolution parameters: if the number of input channels is C, ASL has only 2·C parameters.

3.3 Trap of FLOPs

FLOPs are widely used for comparing model complexity and are often considered proportional to the run time. However, a small number of FLOPs does not guarantee fast execution; memory access time can be a more dominant factor in real implementations. Because I/O devices usually access memory in units of blocks, many densely packed values may be read faster than a few widely scattered values. Therefore, an algorithm that is efficient in terms of both FLOPs and memory access is more desirable. Although a 1×1 convolution has many FLOPs, it is a dense matrix multiplication that is highly optimized through general matrix multiply (GEMM) functions. In contrast, although depthwise convolution reduces the number of parameters and FLOPs greatly, it requires fragmented memory access that is not easy to optimize.

To gain a better understanding, we performed simple experiments to observe the inference time of each unit. These experiments were conducted using a 224×224 image with 64 channels, and the output channel dimension of the convolutions is also 64. Table 1 shows the experimental results. Although 1×1 convolution has a much larger number of FLOPs than 3×3 depthwise convolution, the inference of 1×1 convolution is faster than that of 3×3 depthwise convolution.

The ratio of inference time to FLOPs is a useful measure for analyzing the efficiency of layers (Fig. 2). As expected, 1×1 convolution is the most efficient, and the other units have similar efficiency. It should be noted that although auxiliary layers are fast compared to convolutions, they are not that efficient from an implementation viewpoint; therefore, to make a fast network, we also have to consider these auxiliary layers. These results are derived using the popular deep learning package Caffe [12], which does not guarantee an optimized implementation. Nevertheless, they show that we should not rely on the number of FLOPs alone to compare network speeds.
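The flavor of Table 1 and Fig. 2 can be reproduced roughly with the sketch below (ours, not the paper's Caffe benchmark; absolute numbers depend on the BLAS backend and CPU, and the naive depthwise loop is not a fair stand-in for an optimized kernel). It only illustrates that an operation with roughly 7× more FLOPs can still be competitive, or faster, when it maps to a single dense GEMM:

```python
import time
import numpy as np

C, D, H, W = 64, 64, 224, 224
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H * W))                   # input as a C x (H*W) matrix
w_pw = rng.standard_normal((D, C))                    # 1x1 convolution weights
w_dw = rng.standard_normal((C, 3, 3))                 # depthwise 3x3 weights

def pointwise():
    return w_pw @ x                                   # one dense GEMM, ~206 MFLOPs

def depthwise():
    # naive depthwise 3x3 with zero padding: ~29 MFLOPs, but nine strided passes over memory
    xs = x.reshape(C, H, W)
    pad = np.pad(xs, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(xs)
    for di in range(3):
        for dj in range(3):
            out += w_dw[:, di, dj][:, None, None] * pad[:, di:di + H, dj:dj + W]
    return out

def bench(fn, reps=10):
    fn()                                              # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps * 1000   # ms per call

print(f"1x1 conv (GEMM) : {bench(pointwise):.1f} ms")
print(f"3x3 depthwise   : {bench(depthwise):.1f} ms")
```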

4 Experiment

To demonstrate the performance of our proposed method, we conducted several experiments on classification benchmark datasets. For ASL, the shift parameters are randomly initialized from a uniform distribution between -1 and 1. We used a normalized gradient following ACU[11] with an initial learning rate of 1e-2. Input images are normalized for all experiments.

4.1 Experiment on CIFAR-10/100

We conducted experiments to verify the basic performance of ASL on the CIFAR-10/100 datasets [14], which contain 50k training and 10k test 32×32 images. We used the conventional preprocessing [7]: padding four pixels of zeros on each side, random horizontal flipping, and random cropping. We trained for 64k iterations with an initial learning rate of 0.1, multiplied by 0.1 after 32k and 48k iterations.

We compared the results with those of ShiftNet[23], which applies shift operations heuristically. The basic building block follows the order BN-ReLU-1×1 Conv-BN-ReLU-ASL-1×1 Conv. The network size is controlled by multiplying the width of the first convolution of each residual block by the expansion rate (ε). Table 3 shows the results; ours are consistently better than the previous results by a large margin. We found that widening the base width is more efficient than increasing the expansion rate because it increases the width of all layers in the network. With the same depth of 20, the network with a base width of 46 achieved better accuracy than that with a base width of 16 and an expansion rate of 9. The last row of the table shows that our method provides better results with fewer parameters and a smaller depth. Because increasing depth also increases inference time, this approach is also better in terms of inference speed.
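To make the block structure concrete, here is a minimal PyTorch sketch of a pre-activation residual block in the BN-ReLU-1×1 Conv-BN-ReLU-ASL-1×1 Conv order (our reconstruction, not the released Caffe code; the ActiveShift module uses a per-channel loop and circular padding via torch.roll for brevity, and stride handling is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActiveShift(nn.Module):
    """Per-channel fractional shift with learnable (alpha, beta), bilinear as in Eqs. (8)-(10)."""
    def __init__(self, channels):
        super().__init__()
        # shift parameters theta_s, initialized U(-1, 1) as described in Section 4
        self.theta = nn.Parameter(torch.empty(channels, 2).uniform_(-1.0, 1.0))

    def forward(self, x):                              # x: (N, C, H, W)
        outs = []
        for c in range(x.size(1)):
            a, b = self.theta[c, 0], self.theta[c, 1]
            fa, fb = int(torch.floor(a)), int(torch.floor(b))
            da, db = a - fa, b - fb                    # fractional parts carry the gradient
            xc = x[:, c:c + 1]
            # four integer-shifted neighbours; roll = circular padding (a simplification)
            z1 = torch.roll(xc, (-fa, -fb), dims=(2, 3))
            z3 = torch.roll(xc, (-(fa + 1), -fb), dims=(2, 3))
            z2 = torch.roll(xc, (-fa, -(fb + 1)), dims=(2, 3))
            z4 = torch.roll(xc, (-(fa + 1), -(fb + 1)), dims=(2, 3))
            outs.append(z1 * (1 - da) * (1 - db) + z3 * da * (1 - db)
                        + z2 * (1 - da) * db + z4 * da * db)
        return torch.cat(outs, dim=1)

class BasicBlock(nn.Module):
    """Pre-activation residual block: BN-ReLU-1x1 Conv-BN-ReLU-ASL-1x1 Conv."""
    def __init__(self, channels, expansion=1):
        super().__init__()
        mid = channels * expansion
        self.bn1, self.conv1 = nn.BatchNorm2d(channels), nn.Conv2d(channels, mid, 1, bias=False)
        self.bn2, self.shift = nn.BatchNorm2d(mid), ActiveShift(mid)
        self.conv2 = nn.Conv2d(mid, channels, 1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.shift(F.relu(self.bn2(out)))
        return x + self.conv2(out)

block = BasicBlock(16, expansion=3)
y = block(torch.randn(2, 16, 32, 32))                  # -> (2, 16, 32, 32)
```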

Table 2: Comparison with networks using depthwise convolution. ASL makes the network faster and provides better results. B and DW denote the BN-ReLU layer and depthwise convolution, respectively. For a fair comparison of BN, we also conducted experiments on a network without BN-ReLU between the depthwise convolution and the last 1×1 convolution (1B-DW3-1).

Building block    C10     Inference Time (CPU*)   Training Time (GPU**)
1B-DW3-B-1        94.16   16 ms                   9h03
1B-DW3-1          93.97   15 ms                   7h41
1B-ASL-1 (ours)   94.5    10.6 ms                 5h53

* Intel i7-5930K   ** GTX Titan X (Maxwell)

Interestingly, our proposed layer not only reduces the computational complexity but can also improve network performance. Table 2 shows a comparison with networks using depthwise convolution. A focus on optimizing resources might reduce accuracy; nonetheless, our proposed architecture provided better results. This shows the possibility of extending ASL to general network structures. A network with ASL runs faster in terms of inference time, and training with ASL is also much faster owing to the reduction in sparse memory access and in the number of BNs.

Fig. 3 shows an example of the shift parameters of each layer after optimization (ASNet with a base width of 88). Large shift parameter values mean that the network can view a large receptive field. This is similar to the cases of ACU[11] and dilated convolution[1, 25], both of which enlarge the receptive field without additional weight parameters. An interesting phenomenon is observed: whereas the shift values of the other layers are irregular, the shift values in the layers with stride 2 (stage2/shift1, stage3/shift1) tend toward the center between pixels. This seems to compensate for the reduced resolution. These features make ASL more powerful than conventional convolution and can result in higher accuracy.

4.2 Experiment on ImageNet

To prove the generality of the proposed method, we conducted experiments on the ImageNet 2012 classification task. We did not apply intensive image augmentation; we simply used randomly flipped 224×224 crops from 256×256 images, following Krizhevsky et al. [15]. The initial learning rate was 0.1 with linear decay, and the weight decay was 1e-4. We trained for 90 epochs with a batch size of 256.

Our network is similar to a residual network with bottleneck blocks[6, 7], and Table 6 shows our network architecture. All building blocks are in a residual path, and we used pre-activation residuals[7]. We used the same basic blocks as in the previous experiment, and we used only one spatial convolution for the first layer. Because increasing the depth also increases the number of auxiliary layers, we expanded the network width to increase the accuracy, as in previous studies[8, 18, 23, 26]. The network width is controlled by the base width w.

Table 3: Comparison with ShiftNet. Our results are better by a large margin. The last row shows that our result is better with a smaller number of parameters and depth.

                                        ShiftNet[23]        ASNet (ours)
Depth¹   Base Width   ε   Param(M)      C10      C100       C10      C100
20       16           1   0.035         86.66    55.62      89.14    63.43
20       16           3   0.1           90.08    62.32      91.62    68.83
20       16           6   0.19          90.59    68.64      92.54    70.68
20       16           9   0.28          91.69    69.82      92.93    71.83
20       46           1   0.28          -        -          93.52    73.07
110      16           6   1.2           93.17    72.56      93.73    73.46
20       88           1   0.99          -        -          94.53    76.73

¹ Depths are counted without the shift layers, as noted in a previous study.

Figure 3: Trained shift values of each layer. Shift values are scattered to various positions, which enables the network to cover multiple receptive fields.

Table 5 shows the results for ImageNet; our method obtains much better results with fewer parameters compared to other networks. In particular, we surpass ShiftNet by a large margin, which indicates that ASL is much more powerful than heuristically assigned shift operations. In terms of inference time, we compared our network with MobileNetV2, one of the best-performing networks with fast inference time. MobileNetV2 has fewer FLOPs than our network; however, this does not guarantee a faster inference time, as noted in Section 3.3. Our network runs faster than MobileNetV2 because it has a smaller depth and no depthwise convolutions, which run slowly owing to memory access (Fig. 4).

4.3 Ablation Study

Although both ShiftNet[23] and ASL use shift operations, ASL achieves much better accuracy. This is because of a key characteristic of ASL, namely, that it learns real-valued shift amounts through the network itself. To clarify which aspects of ASL drive this improvement, we conducted additional experiments on ImageNet (AS-ResNet-w32). Table 4 shows the top-1 accuracy with the amount of improvement in parentheses. Grouped Shift (GS) indicates the same method as ShiftNet. Sampled Real (SR) indicates cases in which the shift values were initialized by sampling from a Gaussian distribution with standard deviation 1, to imitate the final state of shift values trained by ASL. Similarly, the values for Sampled Integer (SI) are obtained from a Gaussian distribution, but each sampled real number is rounded to an integer point. Training Real (TR) is the same as our proposed method, and only TR learns the shift values.

Comparing GS with SI suggests that random integer sampling is slightly better than heuristic assignment, owing to the potential expansion of the receptive field, as the amount of shift can be larger than 1. The result of SR shows that relaxing the shift values to the real domain, another key characteristic of ASL, turned out to be even more helpful. In terms of learning shift values, TR achieved the largest improvement, around two times that achieved by SR. These results show the effectiveness of learning real-valued shift parameters.

Table 4: Ablation study using AS-ResNet-w32 on ImageNet. The improvement from using ASL originates from expanding the domain of the shift parameters from integer to real and from learning the shifts.

Method            Shifting Domain   Initialization   Learning Shift   Top-1
Grouped Shift     Integer           Heuristic        X                59.8 (-)
Sampled Integer   Integer           N(0, 1)          X                60.1 (+0.3)
Sampled Real      Real              N(0, 1)          X                61.9 (+2.1)
Training Real     Real              U[−1, 1]         O                64.1 (+4.3)

Table 5: Comparison with other networks. Our networks achieve better results with a similar number of parameters. Compared to MobileNetV2, our network runs faster although it has a larger number of FLOPs. The table is sorted in descending order of the number of parameters.

                                                             Inference Time*
Network                 Top-1   Top-5   Param(M)   FLOPs(M)   CPU(ms)   GPU(ms)
MobileNetV1[8]          70.6    -       4.2        569        -         -
ShiftNet-A[23]          70.1    89.7    4.1        1.4G       74.1      10.04
MobileNetV2[18]         71.8    91      3.47       300        54.7      7.07
AS-ResNet-w68 (ours)    72.2    90.7    3.42       729        47.9      6.73
ShuffleNet-×1.5[26]     71.3    -       3.4        292        -         -
MobileNetV2-×0.75       69.8    89.6    2.61       209        40.4      6.23
AS-ResNet-w50 (ours)    69.9    89.3    1.96       404        32.1      6.14
MobileNetV2-×0.5        65.4    86.4    1.95       97         26.8      5.73
MobileNetV1-×0.5        63.7    -       1.3        149        -         -
SqueezeNet[10]          57.5    80.3    1.2        -          -         -
ShiftNet-B              61.2    83.6    1.1        371        31.8      7.88
AS-ResNet-w32 (ours)    64.1    85.4    0.9        171        18.7      5.37
ShiftNet-C              58.8    82      0.78       -          -         -

* Measured by Caffe [12] using an Intel i7-5930K CPU with a single thread and a GTX Titan X (Maxwell). Inference time for MobileNet and ShiftNet (including FLOPs) is measured using their network descriptions.

Figure 4: Comparison with MobileNetV2[18]. Our network achieves better results with the same inference time.

Table 6: Network structure for AS-ResNet. The network width is controlled by the base width w.

Input size   Output size   Operator          Output channel   Repeat   Stride
224²         112²          3×3 conv          w                1        2
112²         112²          basic block       w                1        1
112²         56²           basic block       w                3        2
56²          28²           basic block       2w               4        2
28²          14²           basic block       4w               6        2
14²          7²            basic block       8w               3        2
7²           1             global avg-pool   -                1        -
1            1             fc                1000             1        -
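The table can also be read as data. The following sketch (ours; it only re-derives the per-stage widths and spatial sizes of Table 6 for a chosen base width, and does not build the network) shows how the strides take a 224×224 input down to 7×7 before global average pooling:

```python
# Stages of AS-ResNet from Table 6: (operator, width multiplier, repeat, stride).
SPEC = [
    ("3x3 conv",    1, 1, 2),
    ("basic block", 1, 1, 1),
    ("basic block", 1, 3, 2),
    ("basic block", 2, 4, 2),
    ("basic block", 4, 6, 2),
    ("basic block", 8, 3, 2),
]

def walk(base_width, input_size=224):
    """Yield (operator, output channels, repeat, output spatial size) for each stage."""
    size = input_size
    for op, mult, repeat, stride in SPEC:
        size //= stride                     # only the first unit of a stage is strided
        yield op, mult * base_width, repeat, size

for row in walk(68):                        # AS-ResNet-w68
    print(row)                              # last stage: ('basic block', 544, 3, 7)
# Global average pooling and a 1000-way fully connected layer follow (last two rows of Table 6).
```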

5 Conclusion

In this study, we deconstructed convolution into a shift operation followed by a pointwise convolution, and we formulated the shift operation as a function with additional shift parameters. The amount of shift can be learned end-to-end through backpropagation, and the ability to learn shift values helps mimic various types of convolutions. By sharing the shifted input, the number of parameters and the computational complexity can be reduced greatly. We also showed that using ASL can improve network accuracy while reducing the number of network parameters and the inference time. Using the proposed layer, we suggested a fast and light network and achieved better results than existing networks. The use of ASL in more general network architectures could be an interesting extension of the present study.

References

[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations (ICLR), 2015.

[2] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807, July 2017.

[3] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, Oct 2017. doi: 10.1109/ICCV.2017.89.

[4] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[5] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pages 630–645. Springer, 2016.

[8] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[9] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[10] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

[11] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1846–1854. IEEE, 2017.

[12] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

[13] Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[14] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, pages 1097–1105. Curran Associates, Inc., 2012.

[16] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations (ICLR), 2014.

[17] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), 2017.

[18] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), pages 1–14, 2015.

[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[21] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In ICLR 2016 Workshop, 2016.

[22] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.

[23] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to spatial convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[24] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995. IEEE, 2017.

[25] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), 2016.

[26] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
