High-Performance VLSI Design for
Convolution Layer of Deep Learning
Neural Networks
Jian-Lin Zeng, Kuan-Hung Chen, Jhao-Yi Wang
Department of Electronic Engineering, Feng Chia University, Taichung, Taiwan
Abstract- In this paper, a high-performance Deep Convolutional Neural Network (DCNN)
hardware architecture, composed of three major parts, is proposed. The first part is the Convolution
Operation Unit (COU), which employs a Processing Element (PE) array to realize high-efficiency
convolution operations. The second part is the COU Management Unit, which controls the PE
array and keeps the PEs working in their most efficient state. The third part is the Storage and
Accumulation Unit (SAU), which stores and accumulates the partial sums produced in the
convolution process. We implemented this design in TSMC 40nm General technology. The
experimental results show that our design provides 32.2 GOPS for AlexNet [1] at a 200MHz
clock rate with a total memory cost of 134 kB. Compared with [4], we reduce the memory size by
26.17% and speed up the convolutional computation by 39.94% with lower hardware cost.
Index-Terms- convolutional neural networks (CNN), deep learning, processor architecture, CNN
accelerator.
I. INTRODUCTION
Deep Convolutional Neural Networks (DCNN) are widely used in artificial intelligence (AI) machines
and achieve excellent results in many modern AI applications. However, they come at the cost of a large
amount of computation and high energy consumption. Convolution operations account for more than 90%
of all operations in a DCNN; there are 664 mega multiply-accumulate operations (MAC OPs) in the AlexNet
[1] convolution layers. Fig. 1 shows the convolution operation. To overcome the above shortcomings, we
design a hardware accelerator that speeds up the convolution operations without consuming a lot of energy.
Moreover, since these operations are very similar and regular, they are very appropriate to accelerate in
hardware.
To reduce the burden of DRAM traffic, one possible way is to truncate the word length of the weights
and data. Our experimental results show that only 0.79% of top-5 accuracy is lost by truncating the word
length of images/weights to 8/8-bit for AlexNet from the previous 16/16-bit. Word-length truncation reduces
the memory size, the circuit area, and the cost of data movement. To improve performance, we use the
weight-stationary dataflow [4] to design the COU and use eight multipliers in each PE. Parallelized
multiplications and large local memories provide high performance and a low data-movement cost for the
hardware architecture. Furthermore, to reduce the memory cost of the weight-stationary dataflow, we share
one memory among seven PEs, which removes a massive amount of memory. The contributions
of this paper are summarized as follows:
1) We truncate the data word length of images and weights to reduce the costs of calculation and data movement.
2) We parallelize the multiplications to improve the computing performance.
3) We reduce the memory cost by sharing a local memory among PEs.
Fig. 1 Diagram of Convolution Operation
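As a software reference for the operation in Fig. 1, a convolution layer can be modeled as a set of nested loops over output channels, output pixels, input channels, and filter positions. The following is a minimal Python sketch with our own variable names; stride 1 and no padding are assumed, which is not necessarily the configuration used by the hardware:

```python
# Illustrative software model of one CNN convolution layer (stride 1, no padding).
# ifmap:   C x H x W input feature map (nested lists)
# filters: M x C x R x S filter weights
def conv_layer(ifmap, filters):
    C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    M = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1                      # output height / width
    ofmap = [[[0] * F for _ in range(E)] for _ in range(M)]
    for m in range(M):                               # each filter -> one output channel
        for e in range(E):                           # each output row
            for f in range(F):                       # each output column
                acc = 0
                for c in range(C):                   # accumulate over input channels
                    for r in range(R):
                        for s in range(S):
                            acc += ifmap[c][e + r][f + s] * filters[m][c][r][s]
                ofmap[m][e][f] = acc                 # one output pixel = many MACs
    return ofmap
```

The innermost three loops are exactly the MACs that dominate DCNN workloads and that the proposed accelerator parallelizes.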
II. NETWORK ON CHIP
When executing the convolution layers of state-of-the-art CNN models, the number of model parameters
is too large to ensure high performance and low energy consumption simultaneously. The major advantage
of utilizing a dedicated hardware architecture to execute convolution operations is the flexible dataflow.
Since the images and weights can be reused many times within a convolution layer, placing appropriate
storage space on chip to keep the reusable data can reduce the cost of data movement.
The general architecture of a CNN accelerator utilizes a PE array to calculate the convolution results,
building a network on chip by creating interconnections between the PEs. There are two major benefits to
this approach. The first is the low cost of partial-sum accumulation: partial sums can be accumulated in the
dataflow as shown in Fig. 2. The second is that data can be shared in the spatial architecture as shown in
Fig. 3.
Fig. 2 Accumulating partial sums in dataflow
Fig. 3 Sharing Data in the Spatial Architecture
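The first benefit above can be illustrated with a toy model in which each PE multiplies its stationary weight by an incoming pixel and adds the result to the partial sum flowing through the chain, so no partial sum has to travel back to a central adder or memory (the function and names below are ours, for illustration only):

```python
# Toy model of partial-sum accumulation along a PE chain (the idea of Fig. 2):
# each (w, x) pair represents one PE holding weight w and receiving pixel x.
def chain_accumulate(weights, pixels):
    psum = 0                              # the partial sum entering the chain
    for w, x in zip(weights, pixels):
        psum = psum + w * x               # each PE adds its product in-flow
    return psum                           # the finished sum leaves the last PE
```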
III. SYSTEM ARCHITECTURE
A. Overview
Fig. 4 shows the overall architecture of the proposed chip. All data are read into and written out of the
chip through the system interface. The Input Data Decoder decodes and buffers the data imported from the
system interface. The COU contains a PE array to process convolution operations and a 6 kB storage space
to store the reusable data. In addition, the COU is governed by the COU Management Unit, which maps the
PE array to convolution layers of various shapes. PEs are enabled when they are selected by the COU
Management Unit. At the output of the COU, the SAU stores and accumulates the partial-sum results
produced by the COU during the convolution process. The SAU consists of 28 partial-sum buffers to store
the partial sums exported from the COU and a 112 kB memory to store the accumulated partial-sum results.
After all of the convolution operations are finished, the data are exported serially through the system
bus.
Fig. 4 Overall Architecture of Proposed Chip
B. Input Data Decoder
In our design, the COU can be mapped to convolution layers of various shapes by the COU Management
Unit. Different convolution layers are configured with different parameters and different amounts of data
imported from outside. The tasks of the Input Data Decoder are to decode the input data and transmit them
to the corresponding storage space. There are three types of input data, i.e., parameters, weights, and images.
Fig. 5 shows the proposed Input Data Decoder architecture.
Fig. 5 Input Data Decoder Architecture
C. Convolution Operation Unit (COU)
The COU is the most important module in our design: it executes the major operations that accelerate
convolution. This module consists of a PE array to calculate partial-sum results and a 6 kB storage space to
store reusable data. There are 168 PEs in the PE array, which can be mapped to different convolution layers
for efficient computation. Convolution operations require many multiplications and additions, and
employing a sufficient amount of memory to reuse data improves their performance. Therefore, our PE
array is partitioned into 28 PE groups, and we place a 256-byte register file in each PE group to enhance
data-reuse efficiency. There are seven PEs in a PE group, and they share the group's 256-byte register file.
The COU architecture is shown in Fig. 6. This COU can be mapped to most of the well-known CNN models
[1, 8, 9, 12, 14, 15]; all of the convolution layers of these models can be mapped to this hardware
architecture by configuring different parameters. In the COU, a PE is enabled when it is selected by the
COU Management Unit during the mapping operation.
Fig. 6 COU Architecture
D. Processing Element (PE)
The Processing Element is the most basic unit of our design; all of the convolution results are calculated
by PEs. Each PE can process one dimension of pixel data in the CNN operations. To support various
convolution layers, the control circuits of a PE are usually complicated. The architecture of the PE is shown
in Fig. 7. There are eight multipliers and an adder tree in each PE. Each multiplier executes an 8/8-bit
multiplication, and the adder tree accumulates the multiplication results. Additionally, the multipliers share
the control circuits in the PE. In the best case, one PE can process eight operations (8 MACs) in one clock
cycle. The parallelized multiplications and additions provide outstanding performance for the convolution
operation.
Fig. 7 PE Architecture
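One PE cycle can be sketched in software as eight parallel multiplications reduced by a three-level adder tree. This is our own illustrative model, not the RTL; the names and the pairwise reduction order are assumptions:

```python
# Sketch of one PE cycle: eight 8-bit x 8-bit multiplications in parallel,
# then an 8 -> 4 -> 2 -> 1 adder tree, yielding 8 MACs per clock cycle.
def pe_cycle(weights8, pixels8):
    assert len(weights8) == 8 and len(pixels8) == 8
    products = [w * x for w, x in zip(weights8, pixels8)]   # 8 multipliers
    while len(products) > 1:                                # adder-tree levels
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]                                      # one partial sum out
```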
E. COU Management Unit
We employ a COU Management Unit to control all PEs in the COU so that convolution operations are
executed efficiently. This management unit configures the PE array to a suitable state when the parameters
of the current convolution layer are imported. Fig. 8 shows the COU Management Unit and the PE group
architecture.
Fig. 8 COU Management Unit and PE Groups Architecture
There are three tasks for the COU Management Unit, i.e., determining IDs for the PE groups, storing image
data into the image memory, and assigning image data and weights to the PEs, described in detail as follows:
1) Determining IDs for PE groups: Data transfer is a very important issue in a deep learning accelerator.
We reuse the data in the on-chip memory as much as possible to reduce the DRAM bandwidth requirement.
This can be achieved by assigning image and filter IDs. For instance, assigning the same Image ID and
different Filter IDs to different PE groups lets them process different kernels at the same time. This not only
increases data-processing parallelism but also reuses the loaded image data. On the other hand, the sliding-
window operation can be achieved by incrementing the Image ID: we keep the overlapped portion inside
the PE group and load only the new portion. Moreover, PEs can also be turned off if they are not mapped
in the current convolution layer. The data are then broadcast to the PE array according to the PE group IDs.
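The sliding-window reuse obtained by incrementing the Image ID can be pictured with a toy model: for a filter S columns wide at stride 1, each step retires the oldest column, keeps the S-1 overlapping columns inside the PE group, and loads only one new column from the image memory. The function below is our own illustration, not the control logic itself:

```python
# Toy model of sliding-window reuse: advance a filter-wide window of image
# columns by one stride-1 step, reloading only the single new column.
def slide_window(window, new_column):
    reused = window[1:]          # S-1 overlapping columns kept in the PE group
    return reused + [new_column] # only new_column is fetched from memory
```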
2) Storing image data in the image memory: Employing the memory reasonably enhances convolutional
computing efficiency. We use a 16 kB register file for this task. Image data are stored in the register file
until they are no longer reusable. However, some images of a convolution layer are too large to be stored
entirely in on-chip memory, so we partition the image into sub-images when the image size exceeds 16 kB.
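A simple way to picture this partitioning is a row-wise tiling that fits each sub-image into the 16 kB budget. The scheme below is our own sketch; the paper does not specify how the split is actually performed, so the row-wise strategy and the parameter names are assumptions:

```python
# Hypothetical row-wise partition of an image whose total size exceeds the
# 16 kB image memory, producing (start_row, end_row) tiles that each fit.
def partition_rows(num_rows, row_bytes, capacity=16 * 1024):
    rows_per_tile = max(1, capacity // row_bytes)  # rows that fit per sub-image
    tiles, start = [], 0
    while start < num_rows:
        tiles.append((start, min(start + rows_per_tile, num_rows)))
        start += rows_per_tile
    return tiles
```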
3) Assigning image data and weights to PE groups: In the convolution process, additional IDs are generated
by the COU Management Unit to transmit images and weights to each PE group. Weights are stored in the
register file of a PE group according to the Filter ID, and images are stored in the PEs according to the
Image ID. Since there is a 256-byte register file in each PE group, a PE can reuse weights by accessing the
register file. After an image has completed the operations with all of the weights in the register file, a new
image can be imported and multiplied with these weights again.
F. Storage and Accumulation Unit (SAU)
Each convolution layer in a CNN model has many channels. In our hardware architecture, we execute one
channel of operations per iteration and complete all of the operations of one convolution layer over many
iterations. The results of the iterations are stored and accumulated in the SAU, whose architecture is shown
in Fig. 9. The SAU completes the final accumulations of the convolution operations. There are two major
modules in this unit, i.e., the accumulators and the partial-sum memory. The partial-sum memory is a
112 kB storage space for the partial sums produced in the convolution operations. To satisfy the maximum
export throughput of the COU, this storage space is divided into 28 memory banks, each containing 4k
entries with a 64-bit word length. Hence, we use 28 partial-sum buffers for 16-bit to 64-bit conversion and
28 accumulators for partial-sum accumulation. We utilize four adders in each accumulator to perform four
additions in one clock cycle, as shown in Fig. 10.
Fig. 9 Storage and Accumulation Unit Architecture
Fig. 10 Accumulator Architecture
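The four-adder accumulator can be modeled as updating four consecutive partial sums in a memory bank per clock. The sketch below is our own illustration of that behavior, with hypothetical names and with the bank modeled as a plain list:

```python
# Sketch of one SAU accumulator cycle: four adders add four incoming partial
# sums into four consecutive entries of a partial-sum memory bank in parallel.
def accumulator_cycle(bank, base, incoming4):
    for i in range(4):                 # four adders working in one clock cycle
        bank[base + i] += incoming4[i]
    return bank
```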
IV. EXPERIMENTAL RESULTS
A. Word-Length Truncation
Convolution operations are typically calculated with 32-bit floating-point precision. However, calculating
them with fixed-point precision directly reduces the power consumption without reducing the final
classification accuracy. The literature [30] provides a quantization method for fixed-point implementations
of CNN models. To reduce the storage size and save the energy consumed by data movement, we evaluated
word-length truncation from 32-bit down to 1-bit; literature [10] shows classification results with 1-bit
operations. We ran an experiment on the 1000-class AlexNet [1] from ILSVRC-12 [16] and analyzed the
results. Fig. 11 shows the 1000-class top-5 classification accuracy of ILSVRC-12 AlexNet with different
word lengths. The experimental results show that only 0.79% of top-5 accuracy is sacrificed when we
truncate the word length from 32-bit to 8-bit, and the top-5 accuracy can still be kept at 75.97% when we
truncate the word length from 32-bit to 7-bit. For the 8-bit case, we use 1 sign bit and 7 integer bits to
represent the input feature map pixels: important pixels are normalized to large values, so the fraction part
of the input feature map pixels has only a very small impact. We use 1 sign bit and 7 fraction bits to
represent the weights, because the values of the weights are between -1 and 1.
Fig. 11 ILSVRC-12 AlexNet top-5 Accuracy in Fixed Point
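The two 8-bit formats described above can be sketched as follows. This is our own illustrative quantizer, assuming round-to-nearest and saturation at the range limits (the paper does not state its rounding mode):

```python
# Sketch of the two 8-bit fixed-point formats used in this work:
# activations: 1 sign bit + 7 integer bits  -> integers in [-128, 127]
# weights:     1 sign bit + 7 fraction bits -> multiples of 1/128 in [-1, 1)
def quantize_activation(x):
    return max(-128, min(127, int(round(x))))       # saturate to the int8 range

def quantize_weight(w):
    q = int(round(w * 128))                          # scale to 1/128 steps
    return max(-128, min(127, q)) / 128.0            # saturate, then rescale
```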
B. Performance and Gate-Count Analysis
According to the experimental results of word-length truncation, truncating the word length of each
convolution layer of AlexNet from 32-bit to 8-bit only reduces the top-5 accuracy from 79.86% to 79.07%.
We made four designs with different numbers of multipliers, different word lengths, and with or without
memory sharing. The parameters and experimental results are listed in Table I.
[Fig. 11 data, top-5 accuracy vs. word length: 4-bit, 0.58%; 5-bit, 0.61%; 6-bit, 49.73%; 7-bit, 75.97%;
8-bit, 79.07%; 9-bit, 79.48%; 16-bit, 79.78%; floating point, 79.86%.]
The designs are synthesized with Synopsys Design Compiler in TSMC 40nm General technology. The
performance results of these four designs are the sums of the processing times of the AlexNet convolution
layers. There are two clock domains in the overall system, i.e., 200MHz for the hardware IP clock and
60MHz for the system clock. For the off-chip memory, we use a 64-bit-wide single-port DRAM simulation
model. The accelerator fetches the parameters, images, and filters from the DRAM and exports the output
feature map pixels to the DRAM after the convolution operations are finished. The operations are processed
layer by layer until all of the convolution layers are complete. We provide 16-bit and 8-bit test patterns and
import them into each design according to its supported word length.
Design1(16b) is our first design; it refers to [4] and is the most similar to it. The major differences between
Design1(16b) and [4] are the memory size and the method of DRAM access. We then reduced the word
length from 16-bit to 8-bit based on the experimental results in the last section. Even though the bit width
of the data is reduced, the performance is still limited by the computing core; that is why we use eight
multipliers in Design2. The performance improves from 19.78 GOPS in Design1(8b) to 34.3 GOPS in
Design2, but the gate count also increases by 76%. To counter the increased gate-count cost, we share the
memories in Design3. Although the performance decreases from 34.3 to 32.2 GOPS, the total memory size
is reduced by 36 kB.
TABLE I
PARAMETERS OF EACH DESIGN AND THEIR EXPERIMENTAL RESULTS FOR ALEXNET @ 200MHZ
Designs                Design1(16b)   Design1(8b)   Design2(8b)   Design3(8b)
Technology             ------------ TSMC 40nm General CMOS ------------
Word-Length (I/W)      16/16          8/8           8/8           8/8
# of mul. in each PE   1              1             8             8
Sharing memory         No             No            No            Yes
Total memory (Byte)    170 k          170 k         170 k         134 k
Gate Count (NAND2)     2.3 M          1.7 M         2.2 M         1.3 M
Performance (GOPS)     15.6           19.78         34.3          32.2
We compare our design (Design3) with [4]. The major difference between our design and [4] is the PE
architecture. First, our design uses eight multipliers and an adder tree in each PE, whereas [4] uses only one
multiplier and an accumulator. Second, [4] uses a memory to store weights in each PE, whereas we share
one memory among seven PEs. Fig. 12 shows the processing times of our design and [4] for each
convolution layer of AlexNet at a 200MHz clock rate. We did not heavily optimize DRAM access, which
makes our design 4.02% slower than [4] when executing AlexNet convolution layer 1, because that layer
has a huge data-access requirement between the DRAM and the accelerator. In all other layers, our design
provides better performance, and in total our design improves the computation performance by 39.94%.
Fig. 12 Processing Times of AlexNet Convolution Layers
Fig. 13 shows the performance and NAND2 gate-count costs comparison results of our design with the
previous works. In hardware design point of view, both the performance and area costs are significant. High
performance means the hardware can provide exceptional calculation speed, and small gate-count indicates
that the chip has lower cost of production. When we consider both the performance and gate-count costs at
the same time, our design provides a prominent cost-performance ratio as shown in Fig. 13.
[Fig. 12 data, processing-time improvement of our design (Design3) over [4] per AlexNet convolution
layer: Layer 1, -4.02%; Layer 2, +143.16%; Layer 3, +27.71%; Layer 4, +29.94%; Layer 5, +8.47%;
Total, +39.94%.]
Fig. 13 Comparison Graph of Performance and Gate-Count
Table II lists the specifications of our design and the previous works. The table indicates that our design
has the smallest on-chip memory and the lowest gate count while providing superior performance.
Compared with [4], our design reduces the memory size by 26.17% and speeds up the convolution
operations by 39.94% with lower hardware costs.
TABLE II
SPECIFICATION COMPARISON FOR ALEXNET @ 200MHZ
Designs              Our Design (Design3)  [4]           [5]           [6]          [29]
Technology           40nm General CMOS     65nm LP CMOS  40nm LP CMOS  28nm FD-SOI  28nm FD-SOI
Supply Voltage       0.9 V                 1 V           1.1 V         1 V          0.575 V
Core Size (mm^2)     n/a                   12.25         2.4           1.87         34
Gate-Count (NAND2)   1.3 M                 1.85 M        1.6 M         1.95 M       > 5 M (estimated)
Total Memory (Byte)  134 k                 181.5 k       148 k         144 k        5.6 M
Word-Length (I/W)    8/8                   16/16         1-16          1-16         16/16 or 8/8
Support Filter Size  1-12                  1-12          All           All          1-12
Performance (GOPS)   32.2                  23.01         28.4          31.3         38.78
[Fig. 13 data, GOPS of AlexNet @ 200MHz vs. NAND2 gate count (1 OP = 1 MAC): Our Design,
32.2 GOPS / 1.3 M gates; [4], 23.01 GOPS / 1.85 M; [5], 28.4 GOPS / 1.6 M; [6], 31.3 GOPS / 1.95 M;
[29], 38.78 GOPS / > 5 M gates (estimated; die size 34 mm^2).]
V. CONCLUSION
This paper proposes a hardware accelerator for DCNNs. Compared with previous works, our design has
two distinguishing characteristics. First, we truncate the data word length for the AlexNet model to reduce
the data-movement costs while losing only a little accuracy. Second, we optimize the PE array structure to
achieve higher computation performance and lower memory cost. In other words, our design provides a
high cost-performance-ratio accelerator for DCNNs.
REFERENCES
[1] K. Alex, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural
Networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[2] J. Qiu et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” in
Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New
York, NY, USA, 2016, pp. 26–35.
[3] Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for
Convolutional Neural Networks,” in ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), 2016, pp. 367–379.
[4] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable
Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits, vol. 52,
no. 1, pp. 127–138, Jan. 2017.
[5] B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm
CMOS,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 903–914, Apr. 2017.
[6] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5 Envision: A 0.26-to-10TOPS/W
subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network
processor in 28nm FDSOI,” in IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp.
246–247.
[7] Z. Du et al., “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in Proceedings of the 42nd
Annual International Symposium on Computer Architecture, New York, NY, USA, 2015, pp. 92–104.
[8] C. Szegedy et al., “Going Deeper with Convolutions,” arXiv:1409.4842 [cs], Sep. 2014.
[9] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” arXiv:1612.08242 [cs], Dec. 2016.
[10] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using
Binary Convolutional Neural Networks,” arXiv:1603.05279 [cs], Mar. 2016.
[11] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,”
arXiv:1512.03385 [cs], Dec. 2015.
[13] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator
for large-scale convolutional neural networks,” in the 26th International Conference on Field
Programmable Logic and Applications (FPL), 2016, pp. 1–9.
[14] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image
Recognition,” arXiv:1409.1556 [cs], Sep. 2014.
[15] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely Connected Convolutional
Networks,” arXiv:1608.06993 [cs], Aug. 2016.
[16] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” arXiv:1409.0575 [cs],
Sep. 2014.
[17] M. Sankaradas et al., “A Massively Parallel Coprocessor for Convolutional Neural Networks,” in the
20th IEEE International Conference on Application-specific Systems, Architectures and Processors,
2009, pp. 53–60.
[18] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, “Towards an embedded biologically-inspired machine
vision processor,” in International Conference on Field-Programmable Technology, 2010, pp. 273–278.
[19] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A Dynamically Configurable
Coprocessor for Convolutional Neural Networks,” in Proceedings of the 37th Annual International
Symposium on Computer Architecture, New York, NY, USA, 2010, pp. 247–257.
[20] F. Conti and L. Benini, “A Ultra-low-energy Convolution Engine for Fast Brain-inspired Vision in
Multicore Clusters,” in Proceedings of the Design, Automation & Test in Europe Conference &
Exhibition, San Jose, CA, USA, 2015, pp. 683–688.
[21] L. Cavigelli and L. Benini, “Origami: A 803 GOp/s/W Convolutional Network Accelerator,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461–2475, Nov.
2017.
[22] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical
Precision,” arXiv:1502.02551 [cs, stat], Feb. 2015.
[23] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric accelerator design for
Convolutional Neural Networks,” in IEEE 31st International Conference on Computer Design (ICCD),
2013, pp. 13–19.
[24] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based Accelerator Design
for Deep Convolutional Neural Networks,” in Proceedings of the ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 2015, pp. 161–170.
[25] K. Benkrid and S. Belkacemi, “Design and implementation of a 2D convolution core for video
applications on FPGAs,” in Third International Workshop on Digital and Computational Video, 2002,
pp. 85–92.
[26] F. Cardells-Tormo, P. L. Molinet, J. Sempere-Agullo, L. Baldez, and M. Bautista-Palacios, “Area-
efficient 2D shift-variant convolvers for FPGA-based digital image processing,” in International
Conference on Field Programmable Logic and Applications, 2005., 2005, pp. 578–581.
[27] T. Chen et al., “DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-
learning,” in Proceedings of the 19th International Conference on Architectural Support for
Programming Languages and Operating Systems, New York, NY, USA, 2014, pp. 269–284.
[28] Y. Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv:1408.5093 [cs],
Jun. 2014.
[29] G. Desoli et al., “14.1 A 2.9TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for
intelligent embedded systems,” in IEEE International Solid-State Circuits Conference (ISSCC), 2017,
pp. 238–239.
[30] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, “Fixed Point Quantization of Deep Convolutional
Networks,” Nov. 2015.