High-Performance VLSI Design for
Convolution Layer of Deep Learning
Neural Networks
Jian-Lin Zeng, Kuan-Hung Chen, Jhao-Yi Wang
Department of Electronic Engineering, Feng Chia University, Taichung, Taiwan
Abstract- In this paper, a high-performance Deep Convolutional Neural Network (DCNN)
hardware architecture, composed of three major parts, is proposed. The first part is the Convolution
Operation Unit (COU), which employs a Processing Element (PE) array to realize high-efficiency
convolution operations. The second part is the COU Management Unit, which controls the PE
array and keeps the PEs working in their most efficient state. The third part is the Storage and
Accumulation Unit (SAU), which stores and accumulates the partial sums produced in the
convolution process. We implemented this design in TSMC 40nm General technology. The
experimental results show that our design provides 32.2 GOPS for AlexNet [1] at a 200MHz
clock rate with a total memory cost of 134 kB. Compared with [4], we reduce the memory size by
26.17% and speed up the convolutional computation by 39.94% with lower hardware cost.
Index-Terms- convolutional neural networks (CNN), deep learning, processor architecture, CNN
accelerator.
I. INTRODUCTION
Deep Convolutional Neural Networks (DCNN) are widely used in artificial intelligence (AI) machines
and achieve excellent results in many modern AI applications. However, they come at the cost of a large
amount of computation and high energy consumption. Convolution operations account for more than 90%
of all operations in a DCNN; there are 664 mega multiply-accumulate operations (MAC OPs) in the AlexNet
[1] convolution layers. Fig. 1 shows the convolution operation. To overcome the above shortcomings, we
design a hardware accelerator that speeds up the convolution operations without consuming a lot of energy.
Moreover, since these operations are very similar and regular, they are very appropriate to accelerate in
hardware.
To reduce the burden of DRAM traffic, one possible way is to truncate the word length of the weights
and data. Our experimental results show that only 0.79% of top-5 accuracy is lost by truncating the word
length of images/weights to 8/8-bit for AlexNet from the previous 16/16-bit. Word-length truncation reduces
the memory size, the circuit area, and the cost of data movement. To improve performance, we use the
weight-stationary dataflow [4] to design the COU and use eight multipliers in each PE. Parallelized
multiplications and large local memories provide high performance and a low data-movement cost for the
hardware architecture. Furthermore, to reduce the memory cost of the weight-stationary dataflow, we share
one memory among seven PEs, which removes a massive amount of memory. The contributions
of this paper are summarized as follows:
1) We truncate the data word length of images and weights to reduce the costs of calculation and data movement.
2) We parallelize the multiplications to improve the computing performance.
3) We reduce the memory cost by sharing a local memory among PEs.
Fig. 1 Diagram of Convolution Operation
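As a software reference for the operation in Fig. 1, a convolution layer can be modeled as a set of nested loops over output channels, output pixels, input channels, and filter positions. The following is a minimal Python sketch with our own variable names; stride 1 and no padding are assumed, which is not necessarily the configuration used by the hardware:

```python
# Illustrative software model of one CNN convolution layer (stride 1, no padding).
# ifmap:   C x H x W input feature map (nested lists)
# filters: M x C x R x S filter weights
def conv_layer(ifmap, filters):
    C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    M = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1                      # output height / width
    ofmap = [[[0] * F for _ in range(E)] for _ in range(M)]
    for m in range(M):                               # each filter -> one output channel
        for e in range(E):                           # each output row
            for f in range(F):                       # each output column
                acc = 0
                for c in range(C):                   # accumulate over input channels
                    for r in range(R):
                        for s in range(S):
                            acc += ifmap[c][e + r][f + s] * filters[m][c][r][s]
                ofmap[m][e][f] = acc                 # one output pixel = many MACs
    return ofmap
```

The innermost three loops are exactly the MACs that dominate DCNN workloads and that the proposed accelerator parallelizes.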
II. NETWORK ON CHIP
When executing the convolution layers of state-of-the-art CNN models, the number of model parameters
is too large to ensure high performance and low energy consumption simultaneously. The major advantage
of utilizing a dedicated hardware architecture to execute convolution operations is the flexible dataflow.
Since the images and weights can be reused many times within a convolution layer, placing appropriate
storage space on chip to keep the reusable data can reduce the cost of data movement.
The general architecture of a CNN accelerator utilizes a PE array to calculate the convolution results,
building a network on chip by creating interconnections between the PEs. There are two major benefits to
this approach. The first is the low cost of partial-sum accumulation: partial sums can be accumulated in the
dataflow as shown in Fig. 2. The second is that data can be shared in the spatial architecture as shown in
Fig. 3.
Fig. 2 Accumulating partial sums in dataflow
Fig. 3 Sharing Data in the Spatial Architecture
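The first benefit above can be illustrated with a toy model in which each PE multiplies its stationary weight by an incoming pixel and adds the result to the partial sum flowing through the chain, so no partial sum has to travel back to a central adder or memory (the function and names below are ours, for illustration only):

```python
# Toy model of partial-sum accumulation along a PE chain (the idea of Fig. 2):
# each (w, x) pair represents one PE holding weight w and receiving pixel x.
def chain_accumulate(weights, pixels):
    psum = 0                              # the partial sum entering the chain
    for w, x in zip(weights, pixels):
        psum = psum + w * x               # each PE adds its product in-flow
    return psum                           # the finished sum leaves the last PE
```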
III. SYSTEM ARCHITECTURE
A. Overview
Fig. 4 shows the overall architecture of the proposed chip. All data are read into and written out of the
chip through the system interface. The Input Data Decoder decodes and buffers the data imported from the
system interface. The COU contains a PE array to process convolution operations and a 6 kB storage space
to store the reusable data. In addition, the COU is governed by the COU Management Unit, which maps the
PE array to convolution layers of various shapes. PEs are enabled when they are selected by the COU
Management Unit. At the output of the COU, the SAU stores and accumulates the partial-sum results
produced by the COU during the convolution process. The SAU consists of 28 partial-sum buffers to store
the partial sums exported from the COU and a 112 kB memory to store the accumulated partial-sum results.
After all of the convolution operations are finished, the data are exported serially through the system
bus.
Fig. 4 Overall Architecture of Proposed Chip
B. Input Data Decoder
In our design, the COU can be mapped to convolution layers of various shapes by the COU Management
Unit. Different convolution layers are configured with different parameters and different amounts of data
imported from outside. The tasks of the Input Data Decoder are to decode the input data and transmit them
to the corresponding storage space. There are three types of input data, i.e., parameters, weights, and images.
Fig. 5 shows the proposed Input Data Decoder architecture.
Fig. 5 Input Data Decoder Architecture
C. Convolution Operation Unit (COU)
The COU is the most important module in our design: it executes the major operations that accelerate
convolution. This module consists of a PE array to calculate partial-sum results and a 6 kB storage space to
store reusable data. There are 168 PEs in the PE array, which can be mapped to different convolution layers
for efficient computation. Convolution operations require many multiplications and additions, and
employing a sufficient amount of memory to reuse data improves their performance. Therefore, our PE
array is partitioned into 28 PE groups, and we place a 256-byte register file in each PE group to enhance
data-reuse efficiency. There are seven PEs in a PE group, and they share the group's 256-byte register file.
The COU architecture is shown in Fig. 6. This COU can be mapped to most of the well-known CNN models
[1, 8, 9, 12, 14, 15]; all of the convolution layers of these models can be mapped to this hardware
architecture by configuring different parameters. In the COU, a PE is enabled when it is selected by the
COU Management Unit during the mapping operation.
Fig. 6 COU Architecture
D. Processing Element (PE)
The Processing Element is the most basic unit of our design; all of the convolution results are calculated
by PEs. Each PE can process one dimension of pixel data in the CNN operations. To support various
convolution layers, the control circuits of a PE are usually complicated. The architecture of the PE is shown
in Fig. 7. There are eight multipliers and an adder tree in each PE. Each multiplier executes an 8/8-bit
multiplication, and the adder tree accumulates the multiplication results. Additionally, the multipliers share
the control circuits in the PE. In the best case, one PE can process eight operations (8 MACs) in one clock
cycle. The parallelized multiplications and additions provide outstanding performance for the convolution
operation.
Fig. 7 PE Architecture
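One PE cycle can be sketched in software as eight parallel multiplications reduced by a three-level adder tree. This is our own illustrative model, not the RTL; the names and the pairwise reduction order are assumptions:

```python
# Sketch of one PE cycle: eight 8-bit x 8-bit multiplications in parallel,
# then an 8 -> 4 -> 2 -> 1 adder tree, yielding 8 MACs per clock cycle.
def pe_cycle(weights8, pixels8):
    assert len(weights8) == 8 and len(pixels8) == 8
    products = [w * x for w, x in zip(weights8, pixels8)]   # 8 multipliers
    while len(products) > 1:                                # adder-tree levels
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]                                      # one partial sum out
```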
E. COU Management Unit
We employ a COU Management Unit to control all PEs in the COU so that convolution operations are
executed efficiently. This management unit configures the PE array to a suitable state when the parameters
of the current convolution layer are imported. Fig. 8 shows the COU Management Unit and the PE group
architecture.
Fig. 8 COU Management Unit and PE Groups Architecture
There are three tasks for the COU Management Unit, i.e., determining IDs for the PE groups, storing image
data into the image memory, and assigning image data and weights to the PEs, described in detail as follows:
1) Determining IDs for PE groups: Data transfer is a very important issue in a deep learning accelerator.
We reuse the data in the on-chip memory as much as possible to reduce the DRAM bandwidth requirement.
This can be achieved by assigning image and filter IDs. For instance, assigning the same Image ID and
different Filter IDs to different PE groups lets them process different kernels at the same time. This not only
increases data-processing parallelism but also reuses the loaded image data. On the other hand, the sliding-
window operation can be achieved by incrementing the Image ID: we keep the overlapped portion inside
the PE group and load only the new portion. Moreover, PEs can also be turned off if they are not mapped
in the current convolution layer. The data are then broadcast to the PE array according to the PE group IDs.
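The sliding-window reuse obtained by incrementing the Image ID can be pictured with a toy model: for a filter S columns wide at stride 1, each step retires the oldest column, keeps the S-1 overlapping columns inside the PE group, and loads only one new column from the image memory. The function below is our own illustration, not the control logic itself:

```python
# Toy model of sliding-window reuse: advance a filter-wide window of image
# columns by one stride-1 step, reloading only the single new column.
def slide_window(window, new_column):
    reused = window[1:]          # S-1 overlapping columns kept in the PE group
    return reused + [new_column] # only new_column is fetched from memory
```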
2) Storing image data in the image memory: Employing the memory reasonably enhances convolutional
computing efficiency. We use a 16 kB register file for this task. Image data are stored in the register file
until they are no longer reusable. However, some images of a convolution layer are too large to be stored
entirely in on-chip memory, so we partition the image into sub-images when the image size exceeds 16 kB.
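A simple way to picture this partitioning is a row-wise tiling that fits each sub-image into the 16 kB budget. The scheme below is our own sketch; the paper does not specify how the split is actually performed, so the row-wise strategy and the parameter names are assumptions:

```python
# Hypothetical row-wise partition of an image whose total size exceeds the
# 16 kB image memory, producing (start_row, end_row) tiles that each fit.
def partition_rows(num_rows, row_bytes, capacity=16 * 1024):
    rows_per_tile = max(1, capacity // row_bytes)  # rows that fit per sub-image
    tiles, start = [], 0
    while start < num_rows:
        tiles.append((start, min(start + rows_per_tile, num_rows)))
        start += rows_per_tile
    return tiles
```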
3) Assigning image data and weights to PE groups: In the convolution process, additional IDs are generated
by the COU Management Unit to transmit images and weights to each PE group. Weights are stored in the
register file of a PE group according to the Filter ID, and images are stored in the PEs according to the
Image ID. Since there is a 256-byte register file in each PE group, a PE can reuse weights by accessing the
register file. After an image has completed the operations with all of the weights in the register file, a new
image can be imported and multiplied with these weights again.
F. Storage and Accumulation Unit (SAU)
Each convolution layer in a CNN model has many channels. In our hardware architecture, we execute one
channel of operations per iteration and complete all of the operations of one convolution layer over many
iterations. The results of the iterations are stored and accumulated in the SAU, whose architecture is shown
in Fig. 9. The SAU completes the final accumulations of the convolution operations. There are two major
modules in this unit, i.e., the accumulators and the partial-sum memory. The partial-sum memory is a
112 kB storage space for the partial sums produced in the convolution operations. To satisfy the maximum
export throughput of the COU, this storage space is divided into 28 memory banks, each containing 4k
entries with a 64-bit word length. Hence, we use 28 partial-sum buffers for 16-bit to 64-bit conversion and
28 accumulators for partial-sum accumulation. We utilize four adders in each accumulator to perform four
additions in one clock cycle, as shown in Fig. 10.
Fig. 9 Storage and Accumulation Unit Architecture
Fig. 10 Accumulator Architecture
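The four-adder accumulator can be modeled as updating four consecutive partial sums in a memory bank per clock. The sketch below is our own illustration of that behavior, with hypothetical names and with the bank modeled as a plain list:

```python
# Sketch of one SAU accumulator cycle: four adders add four incoming partial
# sums into four consecutive entries of a partial-sum memory bank in parallel.
def accumulator_cycle(bank, base, incoming4):
    for i in range(4):                 # four adders working in one clock cycle
        bank[base + i] += incoming4[i]
    return bank
```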
IV. EXPERIMENTAL RESULTS
A. Word-Length Truncation
Convolution operations are typically calculated with 32-bit floating-point precision. However, calculating
them with fixed-point precision directly reduces the power consumption without reducing the final
classification accuracy. The literature [30] provides a quantization method for fixed-point implementations
of CNN models. To reduce the storage size and save the energy consumed by data movement, we evaluated
word-length truncation from 32-bit down to 1-bit; literature [10] shows classification results with 1-bit
operations. We ran an experiment on the 1000-class AlexNet [1] from ILSVRC-12 [16] and analyzed the
results. Fig. 11 shows the 1000-class top-5 classification accuracy of ILSVRC-12 AlexNet with different
word lengths. The experimental results show that only 0.79% of top-5 accuracy is sacrificed when we
truncate the word length from 32-bit to 8-bit, and the top-5 accuracy can still be kept at 75.97% when we
truncate the word length from 32-bit to 7-bit. For the 8-bit case, we use 1 sign bit and 7 integer bits to
represent the input feature map pixels: important pixels are normalized to large values, so the fraction part
of the input feature map pixels has only a very small impact. We use 1 sign bit and 7 fraction bits to
represent the weights, because the values of the weights are between -1 and 1.
Fig. 11 ILSVRC-12 AlexNet top-5 Accuracy in Fixed Point
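The two 8-bit formats described above can be sketched as follows. This is our own illustrative quantizer, assuming round-to-nearest and saturation at the range limits (the paper does not state its rounding mode):

```python
# Sketch of the two 8-bit fixed-point formats used in this work:
# activations: 1 sign bit + 7 integer bits  -> integers in [-128, 127]
# weights:     1 sign bit + 7 fraction bits -> multiples of 1/128 in [-1, 1)
def quantize_activation(x):
    return max(-128, min(127, int(round(x))))       # saturate to the int8 range

def quantize_weight(w):
    q = int(round(w * 128))                          # scale to 1/128 steps
    return max(-128, min(127, q)) / 128.0            # saturate, then rescale
```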
B. Performance and Gate-Count Analysis
According to the experimental results of word-length truncation, truncating the word length of each
convolution layer of AlexNet from 32-bit to 8-bit only reduces the top-5 accuracy from 79.86% to 79.07%.
We made four designs with different numbers of multipliers, different word lengths, and with or without
memory sharing. The parameters and experimental results are listed in Table I.
[Fig. 11 data, top-5 accuracy vs. word length: 4-bit, 0.58%; 5-bit, 0.61%; 6-bit, 49.73%; 7-bit, 75.97%;
8-bit, 79.07%; 9-bit, 79.48%; 16-bit, 79.78%; floating point, 79.86%.]
The designs are synthesized with Synopsys Design Compiler in TSMC 40nm General technology. The
performance results of these four designs are the sums of the processing times of the AlexNet convolution
layers. There are two clock domains in the overall system, i.e., 200MHz for the hardware IP clock and
60MHz for the system clock. For the off-chip memory, we use a 64-bit-wide single-port DRAM simulation
model. The accelerator fetches the parameters, images, and filters from the DRAM and exports the output
feature map pixels to the DRAM after the convolution operations are finished. The operations are processed
layer by layer until all of the convolution layers are complete. We provide 16-bit and 8-bit test patterns and
import them into each design according to its supported word length.
Design1(16b) is our first design; it refers to [4] and is the most similar to it. The major differences between
Design1(16b) and [4] are the memory size and the method of DRAM access. We then reduced the word
length from 16-bit to 8-bit based on the experimental results in the last section. Even though the bit width
of the data is reduced, the performance is still limited by the computing core; that is why we use eight
multipliers in Design2. The performance improves from 19.78 GOPS in Design1(8b) to 34.3 GOPS in
Design2, but the gate count also increases by 76%. To counter the increased gate-count cost, we share the
memories in Design3. Although the performance decreases from 34.3 to 32.2 GOPS, the total memory size
is reduced by 36 kB.
TABLE I
PARAMETERS OF EACH DESIGN AND THEIR EXPERIMENTAL RESULTS FOR ALEXNET @ 200MHZ
Designs                Design1(16b)   Design1(8b)   Design2(8b)   Design3(8b)
Technology             ------------ TSMC 40nm General CMOS ------------
Word-Length (I/W)      16/16          8/8           8/8           8/8
# of mul. in each PE   1              1             8             8
Sharing memory         No             No            No            Yes
Total memory (Byte)    170 k          170 k         170 k         134 k
Gate Count (NAND2)     2.3 M          1.7 M         2.2 M         1.3 M
Performance (GOPS)     15.6           19.78         34.3          32.2
We compare our design (Design3) with [4]. The major difference between our design and [4] is the PE
architecture. First, our design uses eight multipliers and an adder tree in each PE, whereas [4] uses only one
multiplier and an accumulator. Second, [4] uses a memory to store weights in each PE, whereas we share
one memory among seven PEs. Fig. 12 shows the processing times of our design and [4] for each
convolution layer of AlexNet at a 200MHz clock rate. We did not heavily optimize DRAM access, which
makes our design 4.02% slower than [4] when executing AlexNet convolution layer 1, because that layer
has a huge data-access requirement between the DRAM and the accelerator. In all other layers, our design
provides better performance, and in total our design improves the computation performance by 39.94%.
Fig. 12 Processing Times of AlexNet Convolution Layers
Fig. 13 shows the performance and NAND2 gate-count costs comparison results of our design with the
previous works. In hardware design point of view, both the performance and area costs are significant. High
performance means the hardware can provide exceptional calculation speed, and small gate-count indicates
that the chip has lower cost of production. When we consider both the performance and gate-count costs at
the same time, our design provides a prominent cost-performance ratio as shown in Fig. 13.
[Fig. 12 data, processing-time improvement of our design (Design3) over [4] per AlexNet convolution
layer: Layer 1, -4.02%; Layer 2, +143.16%; Layer 3, +27.71%; Layer 4, +29.94%; Layer 5, +8.47%;
Total, +39.94%.]
Fig. 13 Comparison Graph of Performance and Gate-Count
Table II lists the specifications of our design and the previous works. The table indicates that our design
has the smallest on-chip memory and the lowest gate count while providing superior performance.
Compared with [4], our design reduces the memory size by 26.17% and speeds up the convolution
operations by 39.94% with lower hardware costs.
TABLE II
SPECIFICATION COMPARISON FOR ALEXNET @ 200MHZ
Designs              Our Design (Design3)  [4]           [5]           [6]          [29]
Technology           40nm General CMOS     65nm LP CMOS  40nm LP CMOS  28nm FD-SOI  28nm FD-SOI
Supply Voltage       0.9 V                 1 V           1.1 V         1 V          0.575 V
Core Size (mm^2)     n/a                   12.25         2.4           1.87         34
Gate-Count (NAND2)   1.3 M                 1.85 M        1.6 M         1.95 M       > 5 M (estimated)
Total Memory (Byte)  134 k                 181.5 k       148 k         144 k        5.6 M
Word-Length (I/W)    8/8                   16/16         1-16          1-16         16/16 or 8/8
Support Filter Size  1-12                  1-12          All           All          1-12
Performance (GOPS)   32.2                  23.01         28.4          31.3         38.78
[Fig. 13 data, GOPS of AlexNet @ 200MHz vs. NAND2 gate count (1 OP = 1 MAC): Our Design,
32.2 GOPS / 1.3 M gates; [4], 23.01 GOPS / 1.85 M; [5], 28.4 GOPS / 1.6 M; [6], 31.3 GOPS / 1.95 M;
[29], 38.78 GOPS / > 5 M gates (estimated; die size 34 mm^2).]
V. CONCLUSION
This paper proposes a hardware accelerator for DCNNs. Compared with previous works, our design has
two distinguishing characteristics. First, we truncate the data word length for the AlexNet model to reduce
the data-movement costs while losing only a little accuracy. Second, we optimize the PE array structure to
achieve higher computation performance and lower memory cost. In other words, our design provides a
high cost-performance-ratio accelerator for DCNNs.
REFERENCES
[1] K. Alex, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural
Networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[2] J. Qiu et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” in
Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New
York, NY, USA, 2016, pp. 26–35.
[3] Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for
Convolutional Neural Networks,” in ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), 2016, pp. 367–379.
[4] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable
Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits, vol. 52,
no. 1, pp. 127–138, Jan. 2017.
[5] B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm
CMOS,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 903–914, Apr. 2017.
[6] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5 Envision: A 0.26-to-10TOPS/W
subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network
processor in 28nm FDSOI,” in IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp.
246–247.
[7] Z. Du et al., “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in Proceedings of the 42nd
Annual International Symposium on Computer Architecture, New York, NY, USA, 2015, pp. 92–104.
[8] C. Szegedy et al., “Going Deeper with Convolutions,” arXiv:1409.4842 [cs], Sep. 2014.
[9] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” arXiv:1612.08242 [cs], Dec. 2016.
[10] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using
Binary Convolutional Neural Networks,” arXiv:1603.05279 [cs], Mar. 2016.
[11] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,”
arXiv:1512.03385 [cs], Dec. 2015.
[13] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator
for large-scale convolutional neural networks,” in the 26th International Conference on Field
Programmable Logic and Applications (FPL), 2016, pp. 1–9.
[14] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image
Recognition,” arXiv:1409.1556 [cs], Sep. 2014.
[15] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely Connected Convolutional
Networks,” arXiv:1608.06993 [cs], Aug. 2016.
[16] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” arXiv:1409.0575 [cs],
Sep. 2014.
[17] M. Sankaradas et al., “A Massively Parallel Coprocessor for Convolutional Neural Networks,” in the
20th IEEE International Conference on Application-specific Systems, Architectures and Processors,
2009, pp. 53–60.
[18] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, “Towards an embedded biologically-inspired machine
vision processor,” in International Conference on Field-Programmable Technology, 2010, pp. 273–278.
[19] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A Dynamically Configurable
Coprocessor for Convolutional Neural Networks,” in Proceedings of the 37th Annual International
Symposium on Computer Architecture, New York, NY, USA, 2010, pp. 247–257.
[20] F. Conti and L. Benini, “A Ultra-low-energy Convolution Engine for Fast Brain-inspired Vision in
Multicore Clusters,” in Proceedings of the Design, Automation & Test in Europe Conference &
Exhibition, San Jose, CA, USA, 2015, pp. 683–688.
[21] L. Cavigelli and L. Benini, “Origami: A 803 GOp/s/W Convolutional Network Accelerator,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461–2475, Nov.
2017.
[22] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical
Precision,” arXiv:1502.02551 [cs, stat], Feb. 2015.
[23] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric accelerator design for
Convolutional Neural Networks,” in IEEE 31st International Conference on Computer Design (ICCD),
2013, pp. 13–19.
[24] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based Accelerator Design
for Deep Convolutional Neural Networks,” in Proceedings of the ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 2015, pp. 161–170.
[25] K. Benkrid and S. Belkacemi, “Design and implementation of a 2D convolution core for video
applications on FPGAs,” in Third International Workshop on Digital and Computational Video, 2002,
pp. 85–92.
[26] F. Cardells-Tormo, P. L. Molinet, J. Sempere-Agullo, L. Baldez, and M. Bautista-Palacios, “Area-
efficient 2D shift-variant convolvers for FPGA-based digital image processing,” in International
Conference on Field Programmable Logic and Applications, 2005., 2005, pp. 578–581.
[27] T. Chen et al., “DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-
learning,” in Proceedings of the 19th International Conference on Architectural Support for
Programming Languages and Operating Systems, New York, NY, USA, 2014, pp. 269–284.
[28] Y. Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv:1408.5093 [cs],
Jun. 2014.
[29] G. Desoli et al., “14.1 A 2.9TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for
intelligent embedded systems,” in IEEE International Solid-State Circuits Conference (ISSCC), 2017,
pp. 238–239.
[30] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, “Fixed Point Quantization of Deep Convolutional
Networks,” Nov. 2015.