ZNNi: Maximizing the Inference Throughput of 3D Convolutional Networks on CPUs and GPUs

Aleksandar Zlateski∗, Kisuk Lee†
∗Electrical Engineering and Computer Science Dept.
†Brain and Cognitive Sciences Dept.
Massachusetts Institute of Technology
Cambridge, MA 02139 USA
∗[email protected], †[email protected]

H. Sebastian Seung
Neuroscience Institute and Computer Science Dept.
Princeton University
Princeton, NJ 08540 USA
[email protected]

Abstract—Sliding window convolutional networks (ConvNets) have become a popular approach to computer vision problems such as image segmentation and object detection and localization. Here we consider the parallelization of inference, i.e., the application of a previously trained ConvNet, with emphasis on 3D images. Our goal is to maximize throughput, defined as the number of output voxels computed per unit time. We propose CPU and GPU primitives for convolutional and pooling layers, which are combined to create CPU, GPU, and CPU-GPU inference algorithms. The primitives include convolution based on highly efficient padded and pruned FFTs. Our theoretical analyses and empirical tests reveal a number of interesting findings. For example, adding host RAM can be a more efficient way of increasing throughput than adding another GPU or more CPUs. Furthermore, our CPU-GPU algorithm can achieve greater throughput than the sum of CPU-only and GPU-only throughputs.

I. INTRODUCTION

Researchers have revived the use of convolutional networks (ConvNets) for computer vision, under the banner of "deep learning." The revival has been driven by increases in the speed of ConvNet training made possible by GPU implementations [1], [2], [3], [4]. This paper is focused on the inference problem, that is, applying a trained network to images to arrive at classified outputs as fast as possible. Fast inference is critical for big data applications involving large numbers of images and/or very large images. Billions of photos and millions of videos are shared online every day [5], [6]. Scientists are also generating large amounts of image data. For example, high-speed electron microscopy can generate a petascale 3D image from a cubic millimeter of brain in a few weeks [7]. We will focus on 3D ConvNets, which are relevant for both videos and 3D images. 2D ConvNets are regarded as a less costly special case.

In contemporary computer vision, researchers are most familiar with applying ConvNets for image-to-label transformations, where the input is an image and the output is trained to match known classes. However, ConvNets can also be used for image-to-image transformations. In this context, a ConvNet acts like a more complex version of a typical filtering operation in image processing. The ConvNet is applied to a window that slides over the input image. For each location of the window, the ConvNet outputs a set of numbers, effectively producing a set of output images, each with the same resolution as the input image.

A. Throughput of sliding window inference

Sliding window ConvNets have been applied to object detection and localization [8], [9]. Here each output image represents the probability that an object of a certain class is located at a voxel. They have also been applied to image segmentation, to produce an output image representing the probability that a voxel is a boundary between objects or not [10], and to semantic segmentation, the problem of labeling each voxel in an input image by the class of the object to which it belongs [11]. In general, sliding window ConvNets are increasing in popularity as they are applied to more and more problems in computer vision. However, image-to-image transformation is even more computationally costly than image-to-label, so the need for speeding up sliding window ConvNets is especially acute.

In large scale sliding window inference, the input image is divided into smaller input patches. These are transformed by the ConvNet into output patches, which are recombined to generate the output image. This divide-and-conquer approach is motivated by both time and space considerations. The computation can be sped up by assigning the patches to multiple workers. Also, if the computation were not divided, it might not fit in the RAM available to a single worker.

Each output patch is smaller than the input patch, because the sliding window is restricted to be entirely contained within the input image. (This is analogous to a "valid" convolution in MATLAB.) Therefore the input patches are chosen to overlap so that the output patches exactly cover the output image without overlap. (This is analogous to the overlap-save or overlap-scrap method for computing a single convolution described in signal processing textbooks.)

SC16; Salt Lake City, Utah, USA; November 2016. 978-1-4673-8815-3/16/$31.00 © 2016 IEEE


We define worker throughput as the number of voxels in the output patch divided by the time required for a single worker to process that patch. In this paper, a worker will be a single shared-memory machine, either CPU or GPU or a combination of the two. Our goal is to maximize worker throughput.

When computing ConvNet outputs for nearby locations of the sliding window, some of the arithmetic operations are identical. For efficiency it is important to reuse the results of these operations. For ConvNets with convolutional layers only, this happens naturally, as the output patch has the same resolution as the input patch. If there are pooling layers, however, the output patch produced by a ConvNet is a subsampling of the output of a sliding window ConvNet. To obtain the entire output image, one must compute all subsamplings with different offsets separately, and then combine them. This naïve algorithm is suboptimal because it fails to reuse computations for nearby output voxels.

Here we instead provide pooling primitives that compute max-pooling fragments (MPF), a more efficient strategy for sliding window computations [12], [13]. ELEKTRONN is the only other publicly available package with MPF support known to us (http://elektronn.org). The MPF algorithm computes the same results as the approaches known as "dilated convolution" [14], "strided kernels" [15], "max filtering" [16], or "filter rarefaction" [17].

Given a patch, MPF is able to efficiently reuse computation within that patch. We would like to use as large a patch as possible to maximize efficiency, but must divide the image into smaller patches to limit memory use. This division incurs more costs than just a loss of computational reuse: reuse cannot happen for nearby locations in different output patches, assuming the computations on the patches are done independently. For example, it is highly inefficient to make the input patch the same size as the sliding window, which results in no reuse at all across different locations of the sliding window. It is more efficient to increase the size of the input patch, to reduce the fraction of voxels near the border and thereby reduce the inefficiency due to lack of reuse. This is not the case for ConvNet training, where there is typically some range of input sizes that is optimal for training speed.

The desire for a larger input patch for computational efficiency creates pressure for more memory. Maximizing throughput therefore requires a trade-off between two competing factors: using more memory for larger patches and more reuse, versus keeping memory use below the available RAM.

II. NOVELTY, CONTRIBUTIONS AND RELATED WORK

We believe the focus on accelerating ConvNet throughput for inference is a conceptual novelty, in contrast to existing work focused on accelerating the training phase. To maximize inference throughput we propose a system (framework) consisting of primitives for different layer types and computing architectures, as well as methods to efficiently combine them. The elements of our system have varying degrees of novelty.

Fig. 1: Diagram of all layer primitives. The red primitives are wrapper primitives provided by cuDNNv4. The green primitives are the novel primitives introduced in this paper.

We consider ConvNet architectures that contain both convolutional and pooling layers. GPU implementations such as cuDNNv4 provide a number of primitives for both kinds of layers. A particular architecture is implemented by combining primitives. In line with this approach, we introduce a number of new layer primitives for the CPU and GPU (Fig. 1). These are designed to have low memory overhead, which should be important for high throughput as mentioned above.

Our new convolutional primitives are either direct or FFT-based. To the best of our knowledge, our system is the first application of well-studied pruned FFTs to any kind of deep learning. We introduce new implementations of pruned FFTs that are faster (for kernels¹, on average 5× for the CPU and 10× for the GPU) while having a small memory footprint. Our FFT-based convolutional primitive for the GPU is designed to use much less memory than the algorithm proposed by [18], [19] and implemented in fbfft.

We provide two new FFT-based convolutional primitives for the CPU. The main difference between the two is the parallelization strategy. As shown later, the task-parallel version is more efficient if the number of input images and the number of output images are both large, but at the cost of more memory overhead than the fork-join data parallel version. However, the task parallel algorithm is designed to use less memory than the one previously proposed for ConvNet training [16]. Finally, we also provide a new direct convolutional primitive for the CPU, but this turns out to be less useful than our FFT-based primitives in most circumstances.

In empirical tests, we combine the primitives to maximize throughput for CPU-only and GPU-only algorithms. In some cases, our FFT-based GPU primitives outperform the cuDNNv4 primitives by a large margin. The CPU-only algorithm may achieve higher throughput than the GPU-only algorithm because the CPU benefits from data locality and fast caches, even if it is capable of fewer floating point operations (FLOPs) per second.

To work around the limited onboard RAM of the GPU, we introduce a novel GPU + host RAM primitive. We show that using this new primitive can lead to much higher throughput, in spite of the slow PCI-E communication between the GPU and host RAM.

¹Convolution kernels, also known as filters, should not be confused with GPU kernels.

Finally, we propose a novel CPU-GPU method, in which the first layers of the ConvNet are computed by the CPU and the later layers are computed by the GPU. The algorithm is pipelined to utilize the CPU and GPU efficiently. This yields the highest throughput of all the algorithms, sometimes even greater than the sum of the throughputs of GPU + host RAM and CPU-only.

A relatively simple kind of data parallelism for utilizing both the CPU and the GPU is proposed by Caffe con Troll (CcT) [20]. Such "batch parallelism" is reasonable for training, but suboptimal for inference, because to fit in RAM the input patches must be small when the batch size is large. Furthermore, CcT handles neither 3D nor sliding window inference.

To the best of our knowledge, the only publicly available 3D sliding window max-pooling ConvNet implementations are the Caffe fork of [15] and ELEKTRONN [21] (both GPU based), and ZNN [16] (CPU based). ELEKTRONN is the only competitor that is specifically optimized for inference.

As a prelude to the layer primitives, we first introduce lower-level FFT primitives which turn out to be important for efficient convolution.

III. PRUNED FFT

Improving the efficiency of convolution is crucial, because this computation is performed so many times in a ConvNet. The advantages of FFT convolution have been shown for 2D ConvNets running on GPUs [18], [19], and for 3D ConvNets running on CPUs [16].

In FFT convolution the kernel and the image are zero-padded to a common size. Since the kernel is typically much smaller than the image, the padded kernel consists mostly of zeros. Ignoring the zeros is known as FFT pruning, and can provide a speedup [22].

The speedup from pruning FFTs is more modest for a single convolution, which requires one padded kernel FFT, one image FFT, and one inverse FFT. However, ConvNets, which are dominated by kernel FFTs, can benefit greatly.

We propose and implement an algorithm for pruned 3D FFTs, for both the CPU and the GPU. Pruning the 3D FFT of a kernel reduces the FLOPs by 3×. In practice, our approach achieves an average of 5× speedup over the naïve (zero-padding) approach on the CPU and 10× speedup on the GPU. The additional speedup on the CPU comes from increased memory locality, and on the GPU from better utilization of the available cores.

While our pruned FFT algorithms give a substantial speedup, understanding them is not necessary for understanding the rest of our contributions. The reader may prefer to skip to the next section on how pruned FFTs are used to compute the convolutional layers.

A. General algorithm

For 3D FFT-based convolution, the 3D images x and y are first zero-padded to the same size. The inverse FFT of the point-wise product of their transforms contains the result of the convolution. The images x and y can be zero-padded to any size, as long as their sizes are equal.

Fig. 2: Pruned FFTs. The dark blue voxels show the locations of the nonzero elements of the image after zero-padding.

A 3D FFT is obtained by computing 1D FFTs along the three dimensions. Some of these 1D FFTs are of an array with all elements equal to 0. These are unnecessary to compute, as the FFT of an all-zero signal is all zeros.

We can reduce the amount of computation by computing only the necessary 1D transforms. When computing the FFT of a trainable kernel of size k³ zero-padded to size n³, instead of naively computing n² 1D FFTs along each dimension, which takes Cn³ log n³ operations, we can first do only k² FFTs along one dimension, then k·n along the next, and finally n² along the last dimension, as shown in Fig. 2. This approach reduces the computational cost from Cn³ log n³ to Cn log n · [k² + k·n + n²]. As most of the FFTs are performed on kernels, and as the kernel sizes are usually much smaller than the image sizes (k ≪ n), we can reduce the computational cost by nearly two thirds.
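To make the counting argument concrete, the following sketch walks through the three passes for a k³ kernel zero-padded into an n³ buffer. It is illustrative only: it works on complex data throughout (the actual implementation uses a real-to-complex first pass, described next), and fft_1d is a hypothetical helper, e.g. a thin wrapper around a strided FFTW or MKL 1D plan.

#include <complex>
using cplx = std::complex<float>;

// Assumed helper: in-place 1D FFT of length n over elements spaced by stride.
void fft_1d(cplx* data, int n, long stride);

// buf holds n*n*n complex values in row-major [x][y][z] order, with the kernel
// copied into the k*k*k corner and everything else zero.
void pruned_fft3d(cplx* buf, int n, int k) {
    auto at = [&](int x, int y, int z) { return buf + ((long)x * n + y) * n + z; };
    // Pass 1, along z: only the k*k lines with x < k and y < k are nonzero.
    for (int x = 0; x < k; ++x)
        for (int y = 0; y < k; ++y)
            fft_1d(at(x, y, 0), n, 1);
    // Pass 2, along y: nonzero lines now satisfy x < k, so k*n transforms suffice.
    for (int x = 0; x < k; ++x)
        for (int z = 0; z < n; ++z)
            fft_1d(at(x, 0, z), n, n);
    // Pass 3, along x: every line may now be nonzero, so all n*n transforms are done.
    for (int y = 0; y < n; ++y)
        for (int z = 0; z < n; ++z)
            fft_1d(at(0, y, z), n, (long)n * n);
}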

B. CPU implementation

Suppose we would like to compute the FFT of an x × y × z image zero-padded to an x′ × y′ × z′ image. The x × y × z image is first zero-padded to x′ × y × z. This is easily implemented by doing a linear copy of the memory and zero-padding the rest. We then perform y·z 1D real-to-complex FFT transforms along the x direction. The FFTs are performed out of place into a pre-allocated, zero-initialized complex-valued (⌊x′/2⌋ + 1) × y′ × z′ image. We then perform in-place 1D transforms along the y direction, followed by the z direction.

The inverse FFT is computed by following the above steps in reverse order. The 1D FFTs can either be done serially, or by N workers in parallel (in a parallel for loop).

This method incurs a memory overhead of x′ × y × z, the space required for zero-padding the image along the x direction in the first step of the algorithm.

C. GPU primitives

On the GPU, we always perform FFTs on b 3D images simultaneously, in order to achieve high utilization of the many GPU threads.


A set of b 3D images can be represented as a 4D tensor. We need to perform 1D FFTs along the three least significant dimensions. Our algorithm computes the 3D FFTs as a series of tensor transforms² and 1D FFTs along the least significant dimension of the tensor.

When computing the FFT transforms of b 3D images, each of size x × y × z padded to x′ × y′ × z′, the size of the 4D input tensor I is b × x × y × z. First, in-place real-to-complex 1D transforms along the z direction are performed. We prepare the input by extending the 4D tensor along the z direction to fit the result. The transform will contain z′′ = z′/2 + 1 complex numbers, and we need twice as many reals. A 4D tensor I1 of size b × x × y × 2z′′ is first initialized to zero, and the appropriate elements from the input are filled in (element I_{i,j,k,l} is mapped to I1_{i,j,k,l}, while the rest of the elements of I1 are set to zero).

A batch of in-place real-to-complex 1D transforms (one for each of the b·x·y lines along the z direction) is then performed. The result represents a 4D complex tensor Î1 of size b × x × y × z′′. Note that all the 1D transforms are done on contiguous memory chunks (along the least significant dimension).

In the next step we perform in-place complex-to-complex transforms along the y direction. To do this, the elements of Î1 are permuted into another 4D tensor Î2 of size b × x × z′′ × y′, such that the element Î1_{i,j,k,l} is mapped to Î2_{i,j,l,k} and the rest of Î2 is zero-filled. We then perform in-place complex-to-complex transforms along the least significant dimension of Î2.

In the final step, we perform the transforms along the x direction. We permute Î2 into a new 4D complex tensor Î3 of size b × z′′ × y′ × x′. An element Î2_{i,j,k,l} is mapped to Î3_{i,k,l,j}. Complex-to-complex in-place transforms are performed along the least significant dimension of Î3.

As we only perform point-wise multiplications of transformed images, or take their inverse transforms, we can keep the result in this representation and avoid an extra permutation.

The inverse transform is performed by taking the same steps in reverse order.
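Each permutation is just an index remapping. A host-side sketch of the mapping used between the first two passes is shown below (illustrative only; the actual code performs it on the device, and the divisions and modulos here are what the implementation replaces with precomputed magic numbers, as described below).

#include <complex>
#include <vector>
using cplx = std::complex<float>;

// Element I1[b][x][y][z] of a b0 x x0 x y0 x z0 complex tensor is written to
// I2[b][x][z][y], where I2 has shape b0 x x0 x z0 x y1 with y1 >= y0; the extra
// entries along the new least significant dimension stay zero.
void permute_bxyz_to_bxzy(const std::vector<cplx>& in, std::vector<cplx>& out,
                          int b0, int x0, int y0, int z0, int y1) {
    out.assign((size_t)b0 * x0 * z0 * y1, cplx(0, 0));
    for (int b = 0; b < b0; ++b)
        for (int x = 0; x < x0; ++x)
            for (int y = 0; y < y0; ++y)
                for (int z = 0; z < z0; ++z) {
                    size_t src = (((size_t)b * x0 + x) * y0 + y) * z0 + z;
                    size_t dst = (((size_t)b * x0 + x) * z0 + z) * y1 + y;
                    out[dst] = in[src];
                }
}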

D. Implementation details

Our approach uses third-party libraries to perform each batch of 1D FFTs. Depending on the library implementation, the size to which we pad the 3D image can greatly influence the computational complexity.

On the CPU we use either fftw or Intel MKL, and pad the images (and kernels) to sizes that can be written in the form 2^a 3^b 5^c 7^d 11^e 13^f. When Intel MKL is used any such size is allowed; however, when fftw is used we only allow sizes for which e + f is either 0 or 1 [23], [24]. On the GPU we use cuFFT [25], which has optimized algorithms only for sizes of the form 2^a 3^b 5^c 7^d.
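For example, the smallest cuFFT-friendly padded size at least as large as a given n can be found with a simple search over 7-smooth numbers. A minimal sketch (our own illustration, not the paper's code):

#include <initializer_list>

// Smallest integer >= n whose only prime factors are 2, 3, 5 and 7 -- the
// sizes for which cuFFT has optimized algorithms.
int fft_optimal_size(int n) {
    for (int m = n; ; ++m) {
        int r = m;
        for (int p : {2, 3, 5, 7})
            while (r % p == 0) r /= p;
        if (r == 1) return m;
    }
}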

4D tensor permuting requires a lot of index calculation, which can involve many expensive division and modulus operations. Sometimes these operations are more expensive than the 1D FFT transforms themselves. We improve the performance by using pre-computed magic numbers and shifts as described in [26]. Image reshaping is easily implemented using the Thrust CUDA library [27].

²This is known as the permute function in MATLAB.

We limit the large cuFFT memory overhead for computing batches of 1D transforms by splitting the batch computation into sub-batches of 1D transforms. We make sure that the sub-batch size is still large enough to utilize all the computational power, but small enough to limit the memory overhead.

The memory overhead of the algorithm is due to the fact that we do out-of-place permuting of the 4D tensor, which requires space for b·x·y′·z′′ complex numbers. This, however, will not increase the memory overhead of our algorithm for convolutional layers on the GPU, as it already needs a temporary buffer of size b·x′·y′·z′′ for other purposes, which is reused as scratch space in the FFT transforms.

Additional, relatively small, overhead comes from the memory required by cuFFT to perform a batch of 1D transforms. By dividing the batch into sub-batches we essentially limit this overhead to a pre-defined constant amount of memory.

IV. CONVOLUTIONAL LAYERS

We begin with the primitives for the convolutional layers, which are the most computationally intensive.

The input to a convolutional layer is a tuple of f images, and the output a tuple of f′ images. We want to process a batch of S inputs to yield a batch of S outputs, via

    O_{s,j} = Σ_{i=1}^{f} w_{ji} ∗ I_{s,i}

for 1 ≤ s ≤ S and 1 ≤ j ≤ f′. Here I_{s,i} is the ith image of the sth input in the batch, O_{s,j} is the jth image of the sth output in the batch, and w_{ji} is the kernel from the ith image in an input tuple to the jth image in an output tuple.

We will assume 3D images and kernels. If I_{s,i} has size n = ⟨n_x, n_y, n_z⟩ and w_{ji} has size k = ⟨k_x, k_y, k_z⟩, then we can regard I as a 5D tensor of size S × f × n_x × n_y × n_z, w as a 5D tensor of size f′ × f × k_x × k_y × k_z, and O as a 5D tensor of size S × f′ × n′_x × n′_y × n′_z, where n′ = n − k + 1.

We will refer to the sizes of the 5D tensors I and O as the input and output shape, respectively. The relationship between input shape and output shape depends on kernel size as in Table I.
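As a concrete reference for the layer definition above, a direct (unoptimized) implementation is sketched below; tensors are flat arrays in row-major order and, for brevity, images and kernels are assumed cubic (n³ and k³). The kernel indices are flipped so that the operation is a true convolution rather than a correlation.

#include <vector>
#include <cstddef>

// O[s][j] = sum_i  w[j][i] (*) I[s][i]  with "valid" 3D convolution.
// I: S*f*n^3, w: f'*f*k^3, O: S*f'*n'^3 with n' = n - k + 1; O zero-initialized.
void conv_layer_reference(const std::vector<float>& I, const std::vector<float>& w,
                          std::vector<float>& O, int S, int f, int fp, int n, int k) {
    const int np = n - k + 1;
    auto iat = [&](int s, int i, int x, int y, int z) {
        return I[((((size_t)s * f + i) * n + x) * n + y) * n + z];
    };
    auto wat = [&](int j, int i, int x, int y, int z) {
        return w[((((size_t)j * f + i) * k + x) * k + y) * k + z];
    };
    for (int s = 0; s < S; ++s)
        for (int j = 0; j < fp; ++j)
            for (int i = 0; i < f; ++i)               // sum over input images
                for (int x = 0; x < np; ++x)
                    for (int y = 0; y < np; ++y)
                        for (int z = 0; z < np; ++z) {
                            float acc = 0.0f;
                            for (int a = 0; a < k; ++a)
                                for (int b = 0; b < k; ++b)
                                    for (int c = 0; c < k; ++c)
                                        // flipped kernel indices: true convolution
                                        acc += wat(j, i, a, b, c) *
                                               iat(s, i, x + k - 1 - a,
                                                         y + k - 1 - b,
                                                         z + k - 1 - c);
                            O[((((size_t)s * fp + j) * np + x) * np + y) * np + z] += acc;
                        }
}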

A. CPU primitives

We propose three parallel algorithms for the convolutional layer that are suited for multi-core CPUs. The first algorithm performs direct convolution, whereas the other two use FFT-based convolutions.

1) Direct convolution algorithm: The computation is parallelized by two parallel for loops such that each image of each output in the batch is computed in parallel on a different worker thread (see Algorithm 1). The parallel for loops are implemented using Intel Threading Building Blocks, such that the work is evenly divided over the available cores.


Layer                       Input shape    Output shape           FLOPs
Convolutional (direct)      S × f × n³     S × f′ × [n−k]³        S·f′·f·n³·k³
Convolutional (FFT-based)   S × f × n³     S × f′ × [n−k]³        S·3Cn³ log n·[f′+f] + 4S·f′·f·n³ + f·f′·Cn log n·[k² + k·n + n²]
Max pooling                 S × f × n³     S × f × [n/p]³         S·f·n³
Max fragment pooling        S × f × n³     [S·p³] × f × (n/p)³    S·f·n³·p³

TABLE I: Relation between input and output shapes for convolutional and pooling layers, along with computational complexities. Input shape is for a batch of S inputs, each of which is an f-tuple of 3D images with size n³, and output shape is analogous. The 3D kernel has size k³. The pooling window has size p³, and the constant C for the FFT complexity depends on the FFT implementation.

Algorithm 1 Multi-core algorithm for a convolutional layer using direct convolution.

CONVOLUTIONAL-FORWARD-DIRECT-CPU(I, w, S, f, f′, n, k)
 1  n′ = n − k + 1
 2  O = 5D-REAL-TENSOR(S, f′, n′_x, n′_y, n′_z)
 3  parallel for i = 0 to S − 1
 4      parallel for j = 0 to f′ − 1
 5          for k = 0 to f − 1
 6              O_{i,j} = O_{i,j} + CONVOLVE(I_{i,k}, w_{j,k})
 7  FREE-MEMORY(I)
 8  return O

CPU algorithm        Memory required
Direct (naïve)       S·f·n + S·f′·n′
Direct (MKL)         S·f·n + S·f′·n′ + T·n′
FFT algorithm 1      max { S·f·(n + ñ),   S·f′·n′ + (S·f + 1)·ñ }
FFT algorithm 2      max { S·f·(n + ñ),   S·(f + f′)·ñ + T·ñ,   S·f′·(n′ + ñ) }

GPU algorithm        Memory required
cuDNNv4 (default)    S·f·n + S·f′·n′
cuDNNv4 (precomp)    2S·f·n + S·f′·n′
FFT                  K + max { S·f·(n + ñ) + f·ñ,   S·(f + f′)·ñ + 2f·ñ,   S·f′·(n′ + ñ) + f′·ñ }

TABLE II: Memory required by different implementations. S is the batch size, f and f′ are the numbers of input/output images of the layer, n and n′ are the numbers of voxels in each input/output image, and ñ is the number of elements in the transformed image. K is a pre-defined constant amount of memory allocated for cuFFT, and T is the number of available cores for the CPU algorithms.

We provide one implementation using naïve convolution and another using Intel MKL. The latter is 2× faster on average, but requires extra memory for a temporary image where the result of a convolution is stored before it is accumulated into the output image. The memory overhead of both implementations is given in Table II.

2) Data parallel FFT-based algorithm: The computationally intensive operations are individually parallelized (see Algorithm 2). More specifically, each FFT and inverse FFT transform is done in parallel, as explained in the previous section. The PARALLEL-MAD function computes a series of multiply-add operations of complex numbers in parallel by dividing the range into roughly equal sub-ranges, each of which is executed on a single core.
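The PARALLEL-MAD step can be pictured as follows. The sketch uses plain std::thread to stay self-contained, whereas the actual implementation uses Intel Threading Building Blocks.

#include <complex>
#include <thread>
#include <vector>
using cplx = std::complex<float>;

// Parallel complex multiply-add: out[i] += a[i] * b[i] over a range of length
// len, split into roughly equal sub-ranges across T worker threads.
void parallel_mad(const cplx* a, const cplx* b, cplx* out, size_t len, int T) {
    std::vector<std::thread> workers;
    for (int t = 0; t < T; ++t) {
        size_t begin = len * t / T, end = len * (t + 1) / T;
        workers.emplace_back([=] {
            for (size_t i = begin; i < end; ++i)
                out[i] += a[i] * b[i];
        });
    }
    for (auto& w : workers) w.join();
}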

Algorithm 2 Multi-core algorithm for a convolutional layer using data parallel FFT-based convolution.

CONVOLUTIONAL-FORWARD-FFT-CPU1(I, w, S, f, f′, n, k)
 1  n′ = n − k + 1
 2  ñ = FFT-OPTIMAL-SIZE(n)
 3  Î = 5D-COMPLEX-TENSOR(S, f, ⌊ñ_x/2⌋ + 1, ñ_y, ñ_z)
 4  for i = 0 to S − 1
 5      for j = 0 to f − 1
 6          Î_{i,j} = PARALLEL-FFT(I_{i,j})
 7  FREE-MEMORY(I)
 8  O = 5D-REAL-TENSOR(S, f′, n′_x, n′_y, n′_z)
 9  Ô = 4D-COMPLEX-TENSOR(S, ⌊ñ_x/2⌋ + 1, ñ_y, ñ_z)
10  ŵ = 3D-COMPLEX-TENSOR(⌊ñ_x/2⌋ + 1, ñ_y, ñ_z)
11  for i = 0 to f′ − 1
12      for j = 0 to f − 1
13          ŵ = PARALLEL-FFT(w_{i,j})
14          for k = 0 to S − 1
15              PARALLEL-MAD(Î_{k,j}, ŵ, Ô_k)
16      for k = 0 to S − 1
17          O_{k,i} = PARALLEL-INVERSE-FFT(Ô_k)
18  FREE-MEMORY(Î)
19  FREE-MEMORY(Ô)
20  FREE-MEMORY(ŵ)
21  return O

The memory requirement of the algorithm equals the maximal amount of memory required at any single point in time during the execution, and is given in Table II.

3) Task parallel FFT-based algorithm: The main components of the task parallel algorithm are: (1) breaking up the computation required by the convolutional layer into tasks that operate on independent chunks of memory, (2) creating a task dependency graph, and (3) scheduling the tasks for execution.

There are five different task types:
• Input image transform tasks compute the forward FFT transform of a single input image.
• Kernel transform tasks compute the forward FFT transform of a single kernel.
• Multiply-add tasks compute the point-wise product of an input image transform and a kernel transform, accumulating the result into the appropriate output image transform.
• Output image transform tasks compute the inverse FFT of the appropriate accumulated image transform. This task is also responsible for adding the bias and applying the transfer function.
• Synchronization tasks, besides serving as synchronization points, are the only tasks responsible for (and the only ones allowed to) allocate and/or deallocate memory.


Fig. 3: Task dependency diagram of a task–based convolutional layer.

The task dependency graph of all the tasks required for computing the output images of a convolutional layer with four input and five output images, for a batch size of four, is shown in Fig. 3. Tasks are created and queued when all their dependencies have been satisfied. There are four synchronization tasks, effectively dividing the computation into three stages. The layout of the kernel transform tasks forms a grid, with the number of columns equal to the number of output images and the number of rows equal to the number of input images. The task in the ith column and jth row computes the transform of the kernel w_{i,j}. Furthermore, each such task has S dependent multiply-add tasks, where S is the batch size of the input (equal to four in Fig. 3). The kth dependent multiply-add task of a kernel transform task in column i and row j accumulates the product of the transforms of the jth input image of the kth batch and the filter w_{i,j} into the transform of the ith output image of the kth batch.

The tasks are executed by N worker threads, where N equals the number of available cores (or virtual cores, when hyper-threading is enabled). Each worker thread is pinned to a single core. This means that there is a one-to-one relation between the workers and the available cores: each worker is allowed to run only on a specific hardware core, as described in [28]. For this reason, we will use "worker thread" and "core" interchangeably.

The first synchronization task allocates memory for the FFT transforms of the input images. The number of dependent input image transform tasks equals the number of input images times the batch size S. They are then executed by the N worker threads, such that each worker picks up an arbitrary task and executes it.

The last thread to complete the execution of an input image transform task immediately executes the second synchronization task. This task deallocates the memory holding the input images, as their values are no longer required. It then allocates memory for the transforms of the output images. At this point M threads are chosen as primary threads, where M is the minimum of N (the total number of threads) and the number of output images. The primary threads are chosen so that they are evenly distributed over the physical chips. Each primary thread is given a temporary buffer equal to the size required to fit the transform of a padded kernel for that layer.

The kernel transform tasks and multiply-add tasks are then scheduled for execution based on their distance to the sink node of the task dependency graph, such that the more distant nodes are scheduled first. The scheduling has two additional constraints: (1) a kernel transform task can only be executed by a primary thread, and (2) its dependent multiply-add tasks can only be executed by worker threads that are pinned to cores on the same physical chip.

This strategy is chosen over the popular alternative approach to task scheduling based on work stealing [29], [30] because it divides the work more evenly over multi-chip machines and further increases cache locality. On a 4-way machine it yields more deterministic results: very little variance in run time and an average 20% speed improvement over the alternative.
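A minimal sketch of the bookkeeping implied by this scheme (illustrative only, not the actual ZNNi data structures): each task carries a count of unmet dependencies and a priority equal to its distance to the sink node, and becomes ready for execution when the count reaches zero.

#include <atomic>
#include <functional>
#include <vector>

struct Task {
    std::function<void()> run;          // the actual work (FFT, multiply-add, ...)
    std::atomic<int>      unmet_deps;   // task becomes ready when this reaches zero
    int                   priority;     // distance to the sink of the dependency graph
    std::vector<Task*>    dependents;   // tasks waiting on this one
};

// Called by a worker when it finishes a task: release the task's dependents.
void finish(Task& t, std::vector<Task*>& ready_queue) {
    for (Task* d : t.dependents)
        if (d->unmet_deps.fetch_sub(1) == 1)   // this was the last unmet dependency
            ready_queue.push_back(d);          // (the real scheduler orders by priority
                                               //  and by thread/chip constraints)
}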

The last multiply-add task to complete executes the third synchronization task. This task deallocates the memory buffers given to the primary threads, as well as the memory used to store the transforms of the input images. It also allocates the memory to store the final result, the output images.

The number of output image transform tasks equals the number of output images times the batch size. The tasks are executed by all N workers, such that each worker picks up an arbitrary task and executes it. The last output image transform task to finish also executes the final synchronization task, which frees the memory required for the output image transforms.

The memory required by the task parallel algorithm can be higher than that of the data parallel algorithm when many cores are available. The exact memory required equals the maximum of the memory required in each of the three stages, and is given in Table II.

The task-parallel approach has two main advantages over the data-parallel approach. The coarse-grained tasks allow fewer synchronization points, which raises the utilization of the available cores. In the data-parallel primitive, each convolution and point-wise summation is done using fork-join parallelism, which can induce severe overhead, as joins wait for the last thread to finish. Further, the design of the task dependency graph ensures efficient use of the L3 cache, which is especially beneficial for multi-chip systems. On a 4-way Intel Xeon E7-8890 v3 machine the task parallel algorithm is up to 10× faster than the data parallel one (for large enough f′·S and f·S).

B. GPU primitives

For the GPU, we propose three different algorithms. Two of them use cuDNN's 3D primitives, which are based on implicit matrix-matrix multiplication. The third, FFT-based implementation is based on our previously described algorithm for pruned FFTs.

1) Direct convolution using cuDNNv4: The two algorithms using direct convolution are implemented using cuDNN. cuDNN performs 3D convolution as an implicit matrix-matrix multiplication, meaning that the matrices are not actually created. The first algorithm uses extra memory to store pre-computed indices. The second algorithm, which we find 3-5× slower, does not require any extra memory.

2) FFT based algorithm: The FFT-based convolutional layer is based on the GPU implementation of the pruned FFT algorithm described in Section III-C.

Algorithm 3 FFT based convolutional layer algorithm for the GPU.

CONVOLUTIONAL-FORWARD-FFT-GPU(I, w, S, f, f′, n, k)
 1  n′ = n − k + 1
 2  Î = 5D-COMPLEX-TENSOR(S, f, ⌊n_x/2⌋ + 1, n_y, n_z)
 3  s = 4D-COMPLEX-TENSOR(f, ⌊n_x/2⌋ + 1, n_y, n_z)
 4  for i = 0 to S − 1
 5      Î_i = GPU-PARALLEL-FFT(I_i, s)
 6  FREE-MEMORY(I)
 7  Ô = 5D-COMPLEX-TENSOR(S, f′, ⌊n_x/2⌋ + 1, n_y, n_z)
 8  for i = 0 to f′ − 1
 9      ŵ_i = PARALLEL-FFT(w_i, s)
10      for j = 0 to S − 1
11          s = PARALLEL-MULT(ŵ_i, Î_j)
12          Ô_{j,i} = PARALLEL-ACCUMULATE(s)
13  FREE-MEMORY(Î)
14  FREE-MEMORY(s)
15  O = 5D-REAL-TENSOR(S, f′, n′_x, n′_y, n′_z)
16  s = 4D-COMPLEX-TENSOR(f′, ⌊n_x/2⌋ + 1, n_y, n_z)
17  for i = 0 to S − 1
18      O_i = GPU-PARALLEL-INVERSE-FFT(Ô_i)
19  FREE-MEMORY(Ô)
20  FREE-MEMORY(s)
21  return O

The algorithm, given in Algorithm 3, resembles the task-based CPU algorithm in the sense that it consists of three stages, with memory allocated/deallocated between the stages. Lines 2 and 3 allocate the memory required for the input image transforms and the scratch space required by the GPU-PARALLEL-FFT procedure (explained in Section III-C). The first stage (lines 4 and 5) computes the transforms of all the input images by performing f 3D FFTs in parallel. The memory used by the input images is then released, and memory for storing the FFTs of the output images is allocated (lines 6 and 7).

In the second stage (lines 8-12) we loop over the f′ output images. For each output image we compute the transforms of the f relevant kernels (the ones connecting each of the input images to the current output image). We then loop over the inputs in the batch, and for each one we compute the point-wise products of the relevant input image transforms with the relevant kernel transforms, producing f complex-valued images. The values of the f images are then accumulated into a single image, the transform of the appropriate output image. Note that we can reuse the scratch space s (used for GPU-PARALLEL-FFT) to store the point-wise products of the f transformed images.

The memory used by the input image transforms, and the scratch space, is then released. We then allocate memory for the output images as well as new scratch space of a different size, required for computing f′ inverse FFT transforms at once (lines 13-16).

Fig. 4: Decomposing a convolutional layer into multiple convolutional sub-layers.

In the final stage (lines 17 and 18) we compute the output images by looping over the batches and computing f′ inverse FFTs in parallel. Finally we free the memory of the output transforms and the scratch space.

The memory required by the algorithm is equal to the maximum of the memory required in each of the three stages (Table II).

C. GPU + host RAM primitive

Consider a convolutional layer whose input shape is (S, f, x, y, z) and output shape is (S, f′, x′, y′, z′). The computation performed by the layer can be divided into N sub-layers with input shapes (S_i, f_i, x, y, z) and output shapes (S_i, f′_i, x′, y′, z′). Fig. 4 illustrates how the computation of a convolutional layer with S = 1, f = 6, and f′ = 4 can be divided into N = 4 sub-layers, each having S_i = 1, f_i = 3 and f′_i = 2. The blue color represents the input images that have to be transferred to the GPU, the red color represents the memory that has to be allocated on the GPU, and the green color represents the results that have to be transferred back to the host. The computation of each sub-layer can be performed by any of the GPU-only primitives. The time required for processing the layer will equal the sum of the processing times of the sub-layers and the time required for memory transfers. To reduce the transfer overhead, two CUDA streams are used: one stream is responsible for uploading and downloading data from the device, while the other stream simultaneously performs the computation. Due to the GPU's memory limit, not all divisions are feasible.

For a given input shape, picking the optimal division is done by an exhaustive search with the following heuristic. When S > 1, we prefer divisions into sub-layers such that f_i = f, f′_i = f′, and S_i ≤ S. This essentially divides the batches of the input into sub-batches that are processed on the GPU. In the other case (S = 1) we consider all possible divisions and pick one of the GPU-only primitives described above based on empirical measurement.

The host memory requirement for this layer equals the amount of memory required to store the input and output tensors, and the GPU's on-board memory has to be large enough to accommodate each sub-layer.
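The division can be pictured as tiling the f × f′ grid of kernels, as in the following sketch for the S = 1 case. It is illustrative only: upload_inputs, gpu_conv_sublayer and download_and_accumulate are hypothetical helpers standing in for the CUDA-stream transfers and the GPU-only convolutional primitive, and f_i, f′_i are assumed to divide f, f′.

// Hypothetical stand-ins, not the paper's API.
void upload_inputs(const float* I, int first_input, int count);          // host -> GPU
void gpu_conv_sublayer(const float* w, int first_input, int in_count,
                       int first_output, int out_count);                 // GPU-only primitive
void download_and_accumulate(float* O, int first_output, int count);     // GPU -> host, "+="

// Process the layer tile by tile: each tile covers fi input and fpi output images.
void conv_layer_gpu_hostram(const float* I, const float* w, float* O,
                            int f, int fp, int fi, int fpi) {
    for (int j = 0; j < fp; j += fpi)          // which output images
        for (int i = 0; i < f; i += fi) {      // which input images
            upload_inputs(I, i, fi);
            gpu_conv_sublayer(w, i, fi, j, fpi);
            download_and_accumulate(O, j, fpi);
        }
}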

V. MAX-POOLING AND MAX-POOLING FRAGMENTS

Max pooling of an image of size n with window size p = ⟨p_x, p_y, p_z⟩ divides the image into blocks of size p. The maximum value is computed for each block, yielding an image of size ⟨n_x/p_x, n_y/p_y, n_z/p_z⟩. The input image size n is restricted such that n_x, n_y and n_z are divisible by p_x, p_y and p_z, respectively.

On the CPU, we implement the max-pooling layer so that the max-pooling of each image is performed in parallel (e.g. using a parallel for loop). For the GPU we use the cuDNN primitives for max-pooling.

When the input image has the same size as the ConvNet field of view, the output image consists of a single voxel.

Max pooling fragmentation of an image of size n with window size p produces p_x × p_y × p_z output images (fragments) by performing multiple max pooling operations on the image at offsets (x, y, z), where 0 ≤ x < p_x, 0 ≤ y < p_y, and 0 ≤ z < p_z. When the image size is such that n + 1 is divisible by p, the sizes of all produced fragments will equal ⟨⌊n_x/p_x⌋, ⌊n_y/p_y⌋, ⌊n_z/p_z⌋⟩.
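The sketch below spells out the fragmentation for a single cubic image with a cubic window (our own illustration; the parallel CPU and cuDNN-based GPU versions used in practice are described below).

#include <algorithm>
#include <limits>
#include <vector>

// Max-pooling fragmentation of a single n^3 image with a p^3 window, assuming
// n + 1 is divisible by p so that all fragments have the same size m^3, m = n/p.
void mpf_one_image(const std::vector<float>& in, int n, int p,
                   std::vector<std::vector<float>>& fragments) {
    const int m = n / p;
    auto at = [&](int x, int y, int z) { return in[((size_t)x * n + y) * n + z]; };
    fragments.assign((size_t)p * p * p, std::vector<float>((size_t)m * m * m));
    for (int ox = 0; ox < p; ++ox)
      for (int oy = 0; oy < p; ++oy)
        for (int oz = 0; oz < p; ++oz) {
          std::vector<float>& frag = fragments[((size_t)ox * p + oy) * p + oz];
          for (int i = 0; i < m; ++i)
            for (int j = 0; j < m; ++j)
              for (int k = 0; k < m; ++k) {
                float v = -std::numeric_limits<float>::infinity();
                // max over the p^3 window starting at the fragment's offset
                for (int a = 0; a < p; ++a)
                  for (int b = 0; b < p; ++b)
                    for (int c = 0; c < p; ++c)
                      v = std::max(v, at(ox + i * p + a,
                                         oy + j * p + b,
                                         oz + k * p + c));
                frag[((size_t)i * m + j) * m + k] = v;
              }
        }
}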

It is important to note that max-pooling fragmentation increases the batch size for subsequent layers. For an MPF layer, the number of output images equals the number of input images times the number of fragments p_x × p_y × p_z. This increase in the batch size has an impact on the parallelization of subsequent layers. Simple max-pooling does not change the batch size.

Our CPU implementation loops over all the f input images of each of the S inputs in a parallel for loop, and performs the max-pooling operation at each offset.

In the GPU implementation, for each valid offset (x, y, z) we invoke the cuDNN max-pooling primitive to compute the max-pooling of all input images at that offset.

Implementing a GPU + host RAM MPF layer turned out to be impractical. When the input is stored in host RAM, it is more efficient to compute the MPF layers using the CPU, even when very few cores are available. This is because of the expensive transfers to and from the device and the low computational complexity of the MPF layers.

VI. GPU-ONLY OR CPU-ONLY INFERENCE

By stringing together the CPU (or GPU) layer primitives defined above, we can now construct CPU-only (or GPU-only) algorithms for ConvNet inference. For each layer, we have a choice between several primitives. Each max-pooling layer can be replaced by an MPF layer. The size of the input patch and the number of inputs in the batch must also be chosen. These parameters and the primitives should be chosen to maximize throughput. Below we describe some theoretical considerations and empirical data about the optimal choice.

A. Maximizing throughput

The result of a network applied to an input I is obtained by sequentially applying each primitive. The input to the first layer's primitive is I, and the input to every subsequent primitive is the output of the previous layer's primitive. The output of the last layer is our desired result I′. Note that if MPF layers are used, the most significant dimension of the output can increase. For an input shape of (S, f, x, y, z) the output shape will be of the form (αS, f′, x′, y′, z′), where the value of α depends on the number of MPF layers used and their pooling window sizes. This output represents S sets, each having α fragments, which should be recombined to obtain the sliding-window result [12], [13].

Fig. 5: A network of the form CPCPCCCC with the first half executed one layer at a time, and the second half one batch at a time. The pooling window size of the MPF layers is 2 × 1 × 1.

The throughput of the network is therefore defined as

    SIZE(I′) / Σ_{1≤i≤L} TIME(Primitive_i, I_i)

where I_i is the input to the ith layer's primitive. The output shape will depend on the shape of the input I and the primitives chosen for each max-pooling layer. As the input shapes to each layer need to have integral sizes, not every combination of layer primitives and input shapes is allowed (see Table I). An additional constraint is that the memory required by the ith primitive to process input I_i has to be smaller than the memory available to the system (either the CPU or the GPU).

In the general case, the highest throughput network implementation can be found using an exhaustive search:
1) Loop over all choices (max-pooling or MPF) for the pooling layers. This will introduce constraints on the allowable input shapes.
2) Loop over all allowed input shapes.
3) Determine the best implementation for each convolutional layer.
This is possible because, for a fixed choice of max-pooling or MPF for each pooling layer and a fixed input shape, the time and space required by each convolutional layer is uniquely determined. We pick the fastest one that satisfies the memory constraint.
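A sketch of this search is given below. It is purely illustrative: Network, Shape and all helper functions are hypothetical placeholders for the real enumeration, timing and memory-checking routines.

#include <algorithm>
#include <cstddef>

// Exhaustive search over pooling-layer choices and input shapes, picking the
// fastest feasible primitive for each convolutional layer (step 3), and keeping
// the best overall throughput.
double best_throughput(const Network& net, std::size_t memory_limit) {
    double best = 0.0;
    for (const auto& pooling : all_pooling_choices(net))               // step 1
        for (Shape shape : allowed_input_shapes(net, pooling)) {       // step 2
            double total_time = 0.0;
            bool feasible = true;
            for (const auto& layer : net.layers) {                     // step 3
                auto prim = fastest_feasible_primitive(layer, shape, memory_limit);
                if (!prim) { feasible = false; break; }                // nothing fits in memory
                total_time += prim->estimated_time(shape);
                shape = output_shape(layer, shape);                    // per Table I
            }
            if (feasible)
                best = std::max(best, output_voxels(shape) / total_time);
        }
    return best;
}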

In the empirical measurements below, it will turn out that for our networks the highest throughput is obtained when all the max-pooling layers are replaced with MPF layers, and when the input batch size is one (S = 1). Additionally, higher throughput is achieved for larger input sizes.

VII. GPU + HOST RAM AND CPU–GPU INFERENCE

A. GPU + host RAM ConvNet execution

The simplest way to execute a network using the GPU for computation and host RAM for storage is to use the individually optimized GPU + host RAM layer described above for each convolutional layer, and the CPU implementation of an MPF layer for each pooling layer.

When using MPF layers for evaluating a max-pooling ConvNet, it turns out that better performance is achieved by minimizing the data transferred to and from the GPU. To understand how performance can be improved, we note the following property of a ConvNet:

A ConvNet with input shape I = (S, f, x, y, z), with S > 1, will have the batch size S′ of its output shape I′ = (S′, f′, x′, y′, z′) always divisible by S. The values I′(S′/S · i : S′/S · (i + 1), :, :, :, :) depend only on I(i, :, :, :, :). This means that concatenating the results of applying a layer to two inputs I1 of shape (S1, f, x, y, z) and I2 of shape (S2, f, x, y, z) equals the result of applying it to the concatenated input of shape (S1 + S2, f, x, y, z).

Consider the output shape of the first θ layers of a ConvNet. The computation of the remaining layers can be considered to be another ConvNet that takes the output of the θth layer as its input. If some of the first θ layers were MPF layers, the batch size Sθ of the θth layer's output will be greater than 1. Instead of processing the rest of the layers one layer at a time, one might be able to process all remaining layers for one sub-batch of Sθ at a time, using a GPU-only network. This reduces the memory transfer overhead, as no intermediate results have to be transferred back to the host. Fig. 5 illustrates the timeline of executing a network of the form CPCPCCCC by having the first four layers executed one layer at a time, and the rest one batch at a time.

Finding the optimal network execution strategy for this pattern becomes more complex. The first additional parameter is θ (0 ≤ θ ≤ L), where L is the number of layers of the given ConvNet. This parameter represents the number of layers that will be processed one layer at a time, using a GPU + host RAM primitive or a CPU MPF layer. The rest of the network is executed one (or more) batches at a time using GPU-only primitives.

For a given value of θ and a given input size, this approach has two limitations. First, the first θ layers have to fit in host RAM. Second, there has to exist a GPU-only network that can process the remaining layers on the GPU.

To find an optimal implementation, we consider every valid input shape and every valid θ, use the GPU + host RAM (or CPU MPF) primitives for the first θ layers, and separately optimize the remaining GPU-only network.

B. CPU-GPU ConvNet execution

Finally, inference can be done by utilizing both the CPU and the GPU. As in the GPU + host RAM approach described above, the network layers are divided into two groups. For the first θ layers, we use the optimal CPU implementation as defined in the previous section, and for the rest of the layers we use the optimal GPU implementation as defined above.

The CPU and the GPU form a producer-consumer pipeline. The CPU produces by computing the first θ layers for a given input image and queuing the result. The GPU consumes the data on the queue, taking as input the output of the θth layer, and yields the final output of the last layer.

This approach could generate huge memory overhead if the CPU produced data much faster than the GPU can consume it. For that reason, the CPU is not allowed to start working on the next input until the queue is empty, i.e., until the GPU has picked up and started executing the rest of the network for all the data on the queue. This essentially limits the queue to a maximal size of one.
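A minimal sketch of such a pipeline with an effective queue depth of one is shown below. cpu_first_layers and gpu_remaining_layers are hypothetical stand-ins for the optimized CPU-only and GPU-only sub-networks.

#include <condition_variable>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;
Tensor cpu_first_layers(const Tensor& in);      // assumed: layers 1..theta on the CPU
Tensor gpu_remaining_layers(const Tensor& in);  // assumed: remaining layers on the GPU

void pipeline(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) {
    std::mutex m;
    std::condition_variable cv;
    std::optional<Tensor> slot;                 // the queue: at most one item
    bool done = false;

    std::thread consumer([&] {                  // the GPU side
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return slot.has_value() || done; });
            if (!slot) break;                   // producer finished and queue is empty
            Tensor t = std::move(*slot);
            slot.reset();
            cv.notify_one();                    // let the CPU start the next input
            lk.unlock();
            outputs.push_back(gpu_remaining_layers(t));
        }
    });

    for (const Tensor& in : inputs) {           // the CPU side
        {   // wait until the GPU has picked up the previous item
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !slot.has_value(); });
        }
        Tensor mid = cpu_first_layers(in);
        {
            std::lock_guard<std::mutex> lk(m);
            slot = std::move(mid);
        }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_one();
    consumer.join();
}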

Layer   F-maps   n337      n537      n726      n926
1       80       Conv 2³   Conv 4³   Conv 6³   Conv 8³
2       80       Pool 2³   Pool 2³   Pool 2³   Pool 2³
3       80       Conv 3³   Conv 5³   Conv 7³   Conv 9³
4       80       Pool 2³   Pool 2³   Pool 2³   Pool 2³
5       80       Conv 3³   Conv 5³   Conv 7³   Conv 9³
6       80       Pool 2³   Pool 2³   Conv 7³   Conv 9³
7       80       Conv 3³   Conv 5³   Conv 7³   Conv 9³
8       80       Conv 3³   Conv 5³   Conv 7³   Conv 9³
9       80       Conv 3³   Conv 5³   –         –
10      3        Conv 3³   Conv 5³   –         –

TABLE III: ConvNet architectures of the benchmarked networks.

For a given value of θ and a given input size, the GPU will operate on the output of the θth layer, producing the final output. Hence the output of the θth layer has to be stored in host RAM, along with the memory allocated for the network output. As both the CPUs and the GPU can do work only when the queue is empty, the rest of the host RAM is available to the CPUs.

Finding the optimal implementation through an exhaustive search resembles the search in the previous section. For each valid input shape, we loop over all valid values of θ, and for each such division of the ConvNet we separately optimize the first θ CPU-only layers and the remaining GPU-only layers, keeping the memory limitations in mind.

VIII. BENCHMARKS

We expect that algorithm throughput will depend on the ConvNet architecture. We chose networks that are representative of the state of the art in dense prediction tasks [31], [32]. Such architectures use many convolutional and pooling layers to achieve a large field of view. Architectures with smaller kernels are more typical, but we also include some with larger kernels. The architectures of all four benchmarked networks are shown in Table III. A rectified linear (ReLU) transfer function is applied after each convolutional layer. The complexity of the transfer function has little influence on computation time, as it represents only a small fraction of the overall computation.

The benchmarks are performed on two machines. The first machine is a 4-way Intel Xeon E7-8890 v3 with a total of 72 cores (144 hyper-threads), 256 GB of RAM and a Titan X GPU (with 12 GB of on-board RAM). The second machine is an Amazon EC2 instance with 32 virtual cores and 244 GB of RAM (r3.8xlarge). The second machine is included as it is more readily available.

A. CPU-only and GPU-only results

The highest throughput for both CPU-only and GPU-only inference was obtained when all the max-pooling layers are replaced with MPF layers, and when the input batch size is one (S = 1). Additionally, higher throughput is achieved for larger input sizes.

Fig. 6: Maximal throughput (MVox/s) achieved vs. input image size, using GPU-only and CPU-only primitives, for the networks n337, n537, n726 and n926. Curves shown: CPU (72 cores), CPU (16 cores), and GPU.

The fact that MPF layers outperform max-pooling layers is not surprising; it has been shown that using MPF layers reduces the operation count required for computing a single output voxel [12], [13]. With a limited amount of memory, the value S = 1 allows for larger input images, and thus more computational reuse. On the other hand, when FFT-based convolution is used, both an increased image size and an increase in S reduce the computational cost [18], [19]. The fact that the throughput is highest when S = 1 for our networks makes sense, as we get the benefits of both the large input image and the reuse of kernel FFTs in later layers: MPF layers effectively increase the batch size of all subsequent layers (Table I).

Fig. 6 shows the throughput achieved on the four benchmarked networks (Table III) with batch size S = 1 and different image sizes. Generally, when the same primitives are used, the throughput increases with the size of the input image. The throughput drops suddenly when the input image size does not allow for the faster primitives due to their memory requirements, so less efficient ones have to be used.

It turns out that the optimal choice of primitives for the CPU is always the same regardless of the network. In all cases the first (convolutional) layer was optimized to the data-parallel FFT-based algorithm, and the rest of the convolutional layers used the task-based algorithm. This is expected because of the much higher cache locality of our FFT implementations compared to direct convolution, which plays an important role on the CPU with its very fast cache access and relatively slow RAM access.

The optimal choice of primitives for the GPU has a higher dependence on the ConvNet's architecture. The optimal choices of primitives for each network, as well as the optimal input image sizes, are given in Table IV. Interestingly, the implementation chosen for the first layer of all networks is the slower version of the cuDNN primitive. Even though this primitive is slower, it is able to process larger images, as it has lower memory requirements; there is a trade-off between layer speed and the maximal size of the input that a layer can process, and in this case it was more beneficial to be able to process larger inputs. As expected, the cuDNN direct convolution primitives outperform our FFT-based ones for small kernel sizes, while the FFT-based primitives are more efficient for larger kernels and larger image sizes.

              n337      n537      n726      n926
Input size    235^3     217^3     243^3     253^3
Layer 1       CuDNN1    CuDNN1    CuDNN1    CuDNN1
Layer 2       MPF       MPF       MPF       MPF
Layer 3       CuDNN1    FFT       FFT       CuDNN1
Layer 4       MPF       MPF       MPF       MPF
Layer 5       CuDNN2    FFT       FFT       FFT
Layer 6       MPF       MPF       FFT       FFT
Layer 7       CuDNN2    CuDNN2    FFT       FFT
Layer 8       CuDNN2    CuDNN2    FFT       FFT
Layer 9       CuDNN2    CuDNN2    -         -
Layer 10      CuDNN2    CuDNN2    -         -

TABLE IV: Optimal choice of primitives for the different layers.

Fig. 7: Maximal throughput achieved vs. memory consumption using GPU-only, CPU-only, CPU + host RAM, and CPU-GPU implementations for different image sizes. [Figure: throughput (MVox/s) vs. memory consumed (GB) for the four networks n337, n537, n726, and n926, comparing CPU (72 cores), GPU + host RAM, CPU+GPU, and GPU.]

As expected, cuDNN's direct convolution primitives outperform our FFT-based ones for small kernel sizes, while the FFT-based primitives are more efficient for larger kernels and larger image sizes.

B. GPU + host RAM and CPU–GPU results

In Fig. 7 we show the results of the exhaustive search (with S = 1) for the highest throughput on the four benchmarked networks (Table III) using the GPU + host RAM and CPU-GPU inference, alongside the CPU-only and GPU-only results. Instead of showing the input image size on the x axis, we show the memory required by the implementation, calculated as max{M_CPU, M_GPU}. This allows us to estimate the throughput on systems with similar CPU/GPU performance but less available memory.
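
As a concrete illustration of how Fig. 7 can be read, the following sketch (hypothetical helper code, not part of the released implementation; the numbers in main are made up) picks the fastest configuration whose consumed memory max{M_CPU, M_GPU} fits a given budget:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Config {
        double mem_cpu_gb;    // host RAM required by the configuration
        double mem_gpu_gb;    // GPU RAM required by the configuration
        double mvox_per_sec;  // measured throughput
    };

    // Memory consumed is reported as max{M_CPU, M_GPU}.
    double memory_consumed(const Config& c) {
        return std::max(c.mem_cpu_gb, c.mem_gpu_gb);
    }

    double best_throughput_under_budget(const std::vector<Config>& configs,
                                        double budget_gb) {
        double best = 0.0;
        for (const Config& c : configs)
            if (memory_consumed(c) <= budget_gb)
                best = std::max(best, c.mvox_per_sec);
        return best;
    }

    int main() {
        // Made-up example values for illustration only.
        std::vector<Config> configs = {{32, 10, 0.20}, {120, 12, 0.35}, {240, 12, 0.45}};
        std::printf("best throughput within 128 GB: %.2f MVox/s\n",
                    best_throughput_under_budget(configs, 128.0));
        return 0;
    }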

We observe that GPU + host RAM can greatly improve the throughput for kernel sizes of 5^3 and larger. This is reasonable: for very small kernels, the PCI-E transfer time dominates over the time spent doing computation on the GPU. However, as the image size increases, so does the granularity of the sub-layers for the GPU + host RAM primitives. This requires more transfers to and from the GPU, increasing the transfer overhead and yielding lower throughput.

Our CPU-GPU approach yields the highest throughput, which exceeds the CPU-only and GPU-only throughputs combined.


Network   Baseline (cuDNN)   Caffe    ELEKTRONN   ZNN        GPU-only   CPU-only   GPU + host RAM   CPU-GPU
n337      22,934.8           1.348    122,668     34,334.8   671,782    262,131    727,103          1,059,910
n537      1,048.68           –        –           9,494.5    29,352.1   194,683    147,965          334,163
n726      13,520.4           –        6,122       31,354.8   97,257.2   300,312    148,194          470,166
n926      2,667.86           –        –           20,908.6   35,051.3   249,190    104,946          375,295

TABLE V: Comparisons to other methods. A "–" indicates that the framework was unable to run the network.

This is expected, as the combined approach allows for larger input sizes while utilizing all available computational resources.

IX. COMPARISON TO OTHER ALGORITHMS

We compare our four approaches (GPU-only, CPU-only, GPU + host RAM, and CPU-GPU) with other publicly available implementations and show the results in Table V. All benchmarks are performed on the same hardware, the 4-way Intel Xeon E7-8890 v3 machine with 256 GB of RAM and a Titan X GPU.

The results of our approaches are based on the optimizations described in the previous sections. For the other approaches, we varied the input sizes, measured the throughput, and reported the highest value obtained, which was always correlated with the largest input size the framework was able to process.
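
For concreteness, the measurement procedure can be sketched as follows (illustrative only; run_inference and the field of view kFov are hypothetical placeholders for the framework and network under test):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>

    const int kFov = 53;  // assumed field of view of the network under test

    // Output voxels for a cubic input of the given side length
    // ("valid" sliding window: the output shrinks by kFov - 1 per dimension).
    long long output_voxels_for(int input_size) {
        long long out = input_size - kFov + 1;
        return out > 0 ? out * out * out : 0;
    }

    void run_inference(int /*input_size*/) {
        // Placeholder: invoke the framework under test here.
    }

    double measure_throughput(int input_size) {
        auto t0 = std::chrono::steady_clock::now();
        run_inference(input_size);
        auto t1 = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(t1 - t0).count();
        if (seconds <= 0.0) return 0.0;
        return output_voxels_for(input_size) / seconds;  // voxels per second
    }

    int main() {
        double best = 0.0;
        for (int s = 100; s <= 600; s += 50)  // sweep the input sizes the framework accepts
            best = std::max(best, measure_throughput(s));
        std::printf("reported throughput: %.1f voxels/s\n", best);
        return 0;
    }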

The baseline (cuDNN) approach consists of calling the cuDNN [33] primitives for convolution and max-pooling. Unlike the other approaches, this is not a general framework; it requires the user to write some code that calls into the low-level cuDNN primitives. We expect that a user with minimal programming experience could implement this. Our implementation was done in C++, but one could use cuDNN bindings for other languages. We expect that frameworks that support cuDNN primitives, such as Caffe [34] or Theano [35], should achieve similar throughput.

Caffe [34] is another GPU ConvNet framework. We benchmarked an unofficial fork that implements sliding window ConvNets using "strided kernels" [15]. The implementation is optimized for training and appears to have a large memory overhead, as we were only able to run the smallest of the networks.

ELEKTRONN [21] was the only competitor that provides inference optimization for 3D sliding window ConvNets using MPF. The package also uses cuDNN convolutional primitives. However, it was only able to process two of our four networks.

ZNN [16] is a framework optimized for training sliding window 3D ConvNets on multi-core and many-core CPUs using "max-filtering" followed by FFT-based "sparse convolution." ZNN was the best competitor for networks with filters of 5 × 5 × 5 or larger.

Our CPU-only, GPU-only, GPU + host RAM, and CPU-GPU implementations outperform all competitors. For the smallest ConvNet architecture, n337 with kernel sizes of 3^3, the next best competitor was ELEKTRONN, with approximately a tenth of the speed of our CPU-GPU approach. For all the other ConvNets, the next best competitor was ZNN, with approximately 15× smaller throughput.

X. CONCLUSIONS

Our system, ZNNi, achieves high throughput by providing novel CPU and GPU primitives for convolutional and pooling layers, which are designed to minimize memory overhead. A low memory footprint is important because processing a larger image tends to increase throughput, as fractionally less computation is wasted on the borders of the image. In our system, the optimal choice of primitives, as well as the optimal input size, is determined empirically. Our empirical results (Table IV) show that an apparently slower algorithm may end up having higher throughput if it can process a larger image within the constraint of the available RAM.

As expected, our CPU-GPU approach achieves the greatest throughput, which is higher than the CPU-only and GPU-only throughputs combined. For ConvNets representative of the state of the art in dense prediction, the achieved throughput is 10× or more that of other publicly available implementations of sliding window 3D ConvNets. All of our code has been made available as an open source project (https://github.com/seung-lab/ZNNi-release).

Our analysis was based solely on throughput (the number of output voxels per unit time) as the performance metric. A related metric is energy consumed per voxel. For a given CPU or GPU, maximizing throughput is equivalent to minimizing energy consumption per voxel, assuming that the power consumption is roughly constant in time. An interesting implication of our work is that in some situations the most economical way of increasing inference throughput may be to add host RAM rather than more GPUs or CPUs.
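
To spell out this equivalence (the symbols P, t, V, and E are introduced here for illustration and are not used elsewhere in the paper): if a device draws roughly constant power P and produces V output voxels in time t, the energy per voxel is

\[
\frac{E}{V} \;=\; \frac{P\,t}{V} \;=\; \frac{P}{V/t} \;=\; \frac{P}{\text{throughput}},
\]

so at fixed power, minimizing energy per voxel is the same as maximizing the throughput V/t.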

ACKNOWLEDGMENTS

We thank Kai Li and Nir Shavit for helpful discussions. We are grateful to Intel Corporation for providing the 4-way Intel Xeon E7-8890 v3 machine, and for supporting the Intel Parallel Computing Center at Princeton University. We acknowledge support from IARPA (D16PC00005), the Mathers Foundation, NIH/NINDS, and the U.S. Army Research Office (W911NF-12-1-0594). Kisuk Lee was supported by a Samsung Scholarship.

REFERENCES

[1] K. Chellapilla, S. Puri, and P. Simard, "High performance convolutional neural networks for document processing," in Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft, 2006.

[2] D. Scherer, H. Schulz, and S. Behnke, "Accelerating large-scale convolutional neural networks with parallel graphics multiprocessors," in Artificial Neural Networks–ICANN 2010. Springer, 2010, pp. 82–91.

[3] D. Strigl, K. Kofler, and S. Podlipnig, "Performance and scalability of GPU-based convolutional neural networks," in 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing. IEEE, 2010, pp. 317–324.

[4] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, no. 1, 2011, p. 1237.

[5] M. Meeker, "Internet trends 2014," http://www.kpcb.com/blog/2014-internet-trends, 2014 (accessed April 9, 2016).

[6] "Youtube changes at a rate of 33% a year."

[7] J. W. Lichtman, H. Pfister, and N. Shavit, "The big data challenges of connectomics," Nature Neuroscience, vol. 17, no. 11, pp. 1448–1454, 2014.

[8] O. Matan, C. J. Burges, Y. LeCun, and J. S. Denker, "Multi-digit recognition using a space displacement neural network," in NIPS. Citeseer, 1991, pp. 488–495.

[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," arXiv preprint arXiv:1312.6229, 2013.

[10] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. L. Briggman, M. N. Helmstaedter, W. Denk, and H. S. Seung, "Supervised learning of image restoration with convolutional networks," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.

[11] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano, "Toward automatic phenotyping of developing embryos from videos," Image Processing, IEEE Transactions on, vol. 14, no. 9, pp. 1360–1371, 2005.

[12] A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Fast image scanning with deep max-pooling convolutional neural networks," arXiv preprint arXiv:1302.1700, 2013.

[13] J. Masci, A. Giusti, D. Ciresan, G. Fricout, and J. Schmidhuber, "A fast learning algorithm for image segmentation with max-pooling convolutional networks," in Image Processing (ICIP), 2013 20th IEEE International Conference on. IEEE, 2013, pp. 2713–2717.

[14] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.

[15] F. Tschopp, "Efficient convolutional neural networks for pixelwise classification on heterogeneous hardware systems," arXiv preprint arXiv:1509.03371, 2015.

[16] A. Zlateski, K. Lee, and H. S. Seung, "ZNN - a fast and scalable algorithm for training 3D convolutional networks on multi-core and many-core shared memory machines," arXiv preprint arXiv:1510.06706, 2015.

[17] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[18] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," in International Conference on Learning Representations (ICLR 2014). CBLS, April 2014.

[19] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, "Fast convolutional nets with fbfft: A GPU performance evaluation," arXiv preprint arXiv:1412.7580, 2014.

[20] S. Hadjis, F. Abuzaid, C. Zhang, and C. Ré, "Caffe con Troll: Shallow ideas to speed up deep learning," in Workshop on Data Analytics in the Cloud (DanaC), 2015.

[21] M. P. I. F. M. Research. (2015) ELEKTRONN: a neural network toolkit. [Online]. Available: http://elektronn.org/

[22] H. V. Sorensen and C. S. Burrus, "Efficient computation of the DFT with only a subset of input or output points," IEEE Transactions on Signal Processing, vol. 41, no. 3, pp. 1184–1200, 1993.

[23] M. Frigo and S. G. Johnson, "FFTW users manual," Massachusetts Institute of Technology, 1999.

[24] ——, "FFTW: An adaptive software architecture for the FFT," in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 3. IEEE, 1998, pp. 1381–1384.

[25] C. Nvidia, "CUFFT library," 2010.

[26] H. S. Warren, Hacker's Delight. Pearson Education, 2013.

[27] N. Bell and J. Hoberock, "Thrust: A productivity-oriented library for CUDA," GPU Computing Gems Jade Edition, p. 359, 2011.

[28] J. Jeffers and J. Reinders, High Performance Parallelism Pearls Volume Two: Multicore and Many-core Programming Approaches. Morgan Kaufmann, 2015.

[29] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc., 2007.

[30] T. Willhalm and N. Popovici, "Putting Intel threading building blocks to work," in Proceedings of the 1st International Workshop on Multicore Software Engineering. ACM, 2008, pp. 3–4.

[31] K. Lee, A. Zlateski, A. Vishwanathan, and H. S. Seung, "Recursive training of 2D-3D convolutional networks for neuronal boundary detection," arXiv preprint arXiv:1508.04843, 2015.

[32] M. Helmstaedter, K. L. Briggman, S. C. Turaga, V. Jain, H. S. Seung, and W. Denk, "Connectomic reconstruction of the inner plexiform layer in the mouse retina," Nature, vol. 500, no. 7461, pp. 168–174, 2013.

[33] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.

[34] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.

[35] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: a CPU and GPU math expression compiler," in Proceedings of the Python for Scientific Computing Conference (SciPy), vol. 4. Austin, TX, 2010, p. 3.
