Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
Chen Huang and Frank Vahid
Dept. of Computer Science and Engineering, University of California, Riverside, USA
{chuang,vahid}@cs.ucr.edu
This work was supported in part by NSF CNS-1016792
Chen Huang UC Riverside
Outline
Haar-feature based object detection algorithm
Custom design space exploration: Feature mapping problem
Experimental results
Haar-Feature based object detection algorithm
A 20x20 sub-window moves across the original image (X axis 0 to 320, Y axis 0 to 240): (320 - 20) * (240 - 20) = 66,000 sub-windows
The same scan runs on scaled images, so faces are detected on different scales
[Figure: original image and scaled images, movement of sub-window, face found]
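The sub-window count on this slide can be reproduced with a short sketch. It follows the slide's formula `(W - 20) * (H - 20)` as written; the function names and the scale factor of 1.25 are assumptions made only for illustration, since the slide does not state how the scaled images are sized.

```python
def count_subwindows(width, height, window=20):
    """Sub-window positions, counted as on the slide: (W - window) * (H - window)."""
    return max(width - window, 0) * max(height - window, 0)


def scaled_sizes(width, height, window=20, factor=1.25):
    """Image sizes to scan, shrinking by `factor` (an assumed value) each step."""
    sizes = []
    w, h = width, height
    while w >= window and h >= window:
        sizes.append((w, h))
        w, h = int(w / factor), int(h / factor)
    return sizes


original = count_subwindows(320, 240)
all_scales = sum(count_subwindows(w, h) for w, h in scaled_sizes(320, 240))
print(original)    # 66000
print(all_scales)  # depends on the assumed scale factor
```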
Face detection in sub-window
Facial Haar features are applied to each 20 x 20 sub-window: Pass or Fail
Calculate Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B)
Constant time Pixel_Sum calculation: need 4 corner values
Integral image stores pixel sum of Rect (from top-left corner to this point)
Original image:
1 1 1
1 1 1
1 1 1
Integral image:
1 2 3
2 4 6
3 6 9
Pixel_Sum(R1) = P4 - P2 - P3 + P1 = 4
[Figure: rectangle R1 with corner values p1 p2 (top) and p3 p4 (bottom)]
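The integral-image example on this slide can be checked with a minimal sketch. The function names and the inclusive `(top, left, bottom, right)` rectangle convention are assumptions; the arithmetic is the four-corner formula from the slide.

```python
def integral_image(img):
    """Each entry holds the pixel sum of the rectangle from the top-left corner to that point."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def pixel_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom and columns left..right (inclusive), from 4 corner values."""
    def at(y, x):
        return ii[y][x] if y >= 0 and x >= 0 else 0

    p1 = at(top - 1, left - 1)
    p2 = at(top - 1, right)
    p3 = at(bottom, left - 1)
    p4 = at(bottom, right)
    return p4 - p2 - p3 + p1


def haar_value(ii, white_rects, black_rects):
    """Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B), summed over the given rectangles."""
    white = sum(pixel_sum(ii, *rect) for rect in white_rects)
    black = sum(pixel_sum(ii, *rect) for rect in black_rects)
    return white - black


ii = integral_image([[1, 1, 1], [1, 1, 1], [1, 1, 1]])
print(ii)                         # [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
print(pixel_sum(ii, 1, 1, 2, 2))  # 9 - 3 - 3 + 1 = 4
```

Here R1 is taken to be the bottom-right 2 x 2 block, which is the reading that reproduces the slide's result of 4.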
Cascade decision process
Frontal-face has 2000 features, divided into multiple stages
S1: 2 features
S2: 5 features
S3: 16 features
...
S22: 212 features
Pass every stage: Face detected
Fail any stage will reject current sub-window
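A minimal sketch of the cascade's early-reject behaviour follows. The stage representation (a list of feature callables plus a stage threshold) is an assumption for illustration; the slide only states that failing any stage rejects the current sub-window.

```python
def run_cascade(stages, window):
    """Return (detected, stage_number) where stage_number is the last stage evaluated."""
    for number, (features, stage_threshold) in enumerate(stages, start=1):
        stage_sum = sum(feature(window) for feature in features)
        if stage_sum < stage_threshold:
            return False, number   # Fail: reject the sub-window at this stage
    return True, len(stages)       # passed every stage: object detected


# Toy stages on a 3-value "window"; real stages hold 2, 5, 16, ... Haar features.
stages = [
    ([lambda w: 1.0 if w[0] > 0 else 0.0,
      lambda w: 1.0 if w[1] > 0 else 0.0], 2.0),
    ([lambda w: 1.0 if sum(w) > 5 else 0.0], 1.0),
]

print(run_cascade(stages, [1, 1, 9]))  # (True, 2)
print(run_cascade(stages, [0, 1, 9]))  # (False, 1)
```

Because most sub-windows fail in the first small stages, the number of examined features per window is usually far below the full feature count.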
Algorithm FPGA implementation
[Block diagram of the FPGA design: Video in, Frame grabber, Image scaler, Buffer controller, Integral image (20 x 20 sub-window), Classifier (Haar feature calculation/decision), Rectangle drawer, Video out (objects in rectangles)]
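Integral image and Classifier
[Block diagram: Video in, Frame grabber, Image scaler, Buffer controller, Integral image, Classifier, Rectangle drawer, Video out (objects in rectangles); the Classifier is expanded as follows]
Integral Image Buffer (20 x 20 17-bit register file)
Data delivery: corner values a1 a2 a3 a4, b1 b2 b3 b4, c1 c2 c3 c4 feed three Rect sum units
Rect sums are multiplied by constants (-1, x2, x2, x3) and added (+) to give the Feature value
Feature value is compared (>) with the Feature threshold; a mux selects the Left value or Right value
The selected value is added (+) to the Feature sum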
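The classifier datapath on this slide can be sketched in software. The function names are hypothetical, and the polarity (which of the Left value or Right value is selected when the feature value exceeds the threshold) is an assumption, since the slide shows only the comparison and the mux.

```python
def feature_value(rect_sums, weights):
    """Weighted sum of the rectangle sums (the 'multiply by constant' and '+' blocks)."""
    return sum(weight * rect for weight, rect in zip(weights, rect_sums))


def feature_vote(rect_sums, weights, threshold, left_value, right_value):
    """Compare with the feature threshold and select one value, as the mux does.

    Assumed polarity: above the threshold selects right_value.
    """
    value = feature_value(rect_sums, weights)
    return right_value if value > threshold else left_value


# Two-rectangle feature: whole region weighted -1, inner region weighted x3.
print(feature_value([40, 10], [-1, 3]))             # -10
print(feature_vote([40, 10], [-1, 3], 0, 0.2, 0.9))  # 0.2
```

The returned vote is what a stage would accumulate into its feature sum before the pass/fail decision.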
Communication bottleneck
General communication architecture: each classifier port reads the 20 x 20 integral image through a 400-to-1 mux
400-to-1 17-bit MUX: 2300 LUTs
12 MUXes: 27,600 LUTs, 40% of Virtex5 110T (69,120)
Drawbacks:
Does not scale well for multiple classifiers
Wire congestion problem
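The LUT figures above follow from simple arithmetic, sketched here. Extending the count to several classifiers assumes that every added classifier needs its own 12 full-size muxes, which is an illustration of the scaling drawback rather than a measured result.

```python
LUTS_PER_400_TO_1_MUX = 2300   # from the slide
PORTS_PER_CLASSIFIER = 12      # from the slide
VIRTEX5_110T_LUTS = 69120      # from the slide


def general_comm_luts(n_classifiers):
    """LUTs spent on 400-to-1 muxes if every classifier port gets a full mux (assumed)."""
    return n_classifiers * PORTS_PER_CLASSIFIER * LUTS_PER_400_TO_1_MUX


for n in (1, 2, 4):
    luts = general_comm_luts(n)
    print(n, luts, f"{100 * luts / VIRTEX5_110T_LUTS:.0f}% of device")
```

Under that assumption, three classifiers would already need more mux LUTs than the whole device provides.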
Custom communication architecture for multi-classifier
[Figure: the integral image feeds multiple classifiers (CF1 CF2 CF3 CF4) through a 400-1 mux]
Feature number by classifier number:
CF1 CF2 CF3 CF4
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Custom communication architecture for multi-classifier
[Figure: custom communication architecture; the integral image feeds multiple classifiers (CF1 CF2 CF3 CF4) through smaller per-port muxes]
Example ports: CF1_port1, CF2_port9, CF3_port7, CF4_port2
Example mux sizes: 24-1 mux, 9-1 mux, 24-1 mux, 16-1 mux
Feature number by classifier number:
CF1 CF2 CF3 CF4
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
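One way to read the custom architecture is that each classifier port only needs a mux as wide as the number of distinct integral-image entries it ever reads. The sketch below illustrates that idea; the function name and the request lists are hypothetical and do not come from the slides.

```python
def mux_inputs(port_requests):
    """Map each port to the sorted distinct integral-image entries it must reach."""
    return {port: sorted(set(entries)) for port, entries in port_requests.items()}


# Hypothetical entry indices requested by the features mapped to each port.
requests = {
    "CF1_port1": [5, 24, 46, 92, 24],
    "CF2_port9": [7, 7, 130],
}

widths = {port: len(inputs) for port, inputs in mux_inputs(requests).items()}
print(widths)  # {'CF1_port1': 4, 'CF2_port9': 2}
```

Compared with a 400-1 mux on every port, the width now depends on which features are mapped to which classifier, which is what makes the mapping an optimization problem.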
Feature mapping problem
Mapping 26 features into 4 Classifiers
Stage and feature (features laid out in rows across CF1 CF2 CF3 CF4):
Stage 1: 1 2 3 4 / 5
Stage 2: 6 7 8 9 / 10 11 12
Stage 3: 13 14 15 16 / 17 18 19 20 / 21 22 23 24 / 25 26
[Figure: classifier cascade Stage 1, Stage 2, ... Stage n; pass at every stage: Object found; Fail at any stage: Reject]
Feature mapping problem
Mapping 26 features into 4 Classifiers
#possible mapping grows exponentially with #features
Simulated Annealing neighbor: Swap, Migrate
Objective: Min (Total stage delay * Total wire number)
Total stage delay: Performance
Total wire number: Size
1 million iterations (30 min)
Stage and feature (features laid out in rows across CF1 CF2 CF3 CF4):
Stage 1: 1 2 3 4 / 5
Stage 2: 6 7 8 9 / 10 11 12
Stage 3: 13 14 15 16 / 17 18 19 20 / 21 22 23 24 / 25 26
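The search described on this slide can be sketched as a small simulated-annealing loop with Swap and Migrate moves and the product objective. The delay model, the wire model, the cooling schedule, and the per-feature entry lists are all assumptions chosen to keep the example self-contained; the slides do not define them.

```python
import math
import random


def cost(mapping, stage_of, entries_of, n_cf):
    """Objective from the slide: total stage delay * total wire number."""
    # Assumed delay model: classifiers run in parallel, so the busiest
    # classifier in a stage sets that stage's delay (in features).
    load = {}
    for feature, cf in mapping.items():
        key = (stage_of[feature], cf)
        load[key] = load.get(key, 0) + 1
    stage_delay = {}
    for (stage, _), count in load.items():
        stage_delay[stage] = max(stage_delay.get(stage, 0), count)
    # Assumed wire model: one wire per distinct integral-image entry
    # that a classifier has to read.
    wires = [set() for _ in range(n_cf)]
    for feature, cf in mapping.items():
        wires[cf].update(entries_of[feature])
    return sum(stage_delay.values()) * sum(len(w) for w in wires)


def neighbor(mapping, n_cf, rng):
    """Return a copy changed by one Swap or one Migrate move."""
    new = dict(mapping)
    features = list(new)
    if rng.random() < 0.5:
        a, b = rng.sample(features, 2)   # Swap: exchange two features' classifiers
        new[a], new[b] = new[b], new[a]
    else:
        f = rng.choice(features)         # Migrate: move one feature elsewhere
        new[f] = rng.randrange(n_cf)
    return new


def anneal(stage_of, entries_of, n_cf, iterations=5000, t0=50.0, seed=0):
    """Start from a round-robin mapping and return the best mapping seen."""
    rng = random.Random(seed)
    current = {f: i % n_cf for i, f in enumerate(sorted(stage_of))}
    current_cost = cost(current, stage_of, entries_of, n_cf)
    best, best_cost = current, current_cost
    for i in range(iterations):
        temperature = t0 * (1 - i / iterations) + 1e-9   # linear cooling (assumed)
        candidate = neighbor(current, n_cf, rng)
        candidate_cost = cost(candidate, stage_of, entries_of, n_cf)
        delta = candidate_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


# 26 features in 3 stages, as on the slide; the entry lists are made up.
stage_of = {f: 1 if f <= 5 else 2 if f <= 12 else 3 for f in range(1, 27)}
data_rng = random.Random(1)
entries_of = {f: data_rng.sample(range(400), 8) for f in stage_of}
round_robin = {f: i % 4 for i, f in enumerate(sorted(stage_of))}
start_cost = cost(round_robin, stage_of, entries_of, 4)
best, best_cost = anneal(stage_of, entries_of, 4)
print(start_cost, best_cost)
```

The sketch uses 5,000 iterations so it runs in a moment; the slide reports 1 million iterations (30 min) for the real feature sets.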
Automatic VHDL code generation
Scheduling for Classifier 1:
Feature mapping: 1, 4, 66, 3 (needs entry: 5, 24, 46, 92)
[Figure: Integral Image entries 5, 24, 46, 92 feed a MUX whose output dout drives Classifier 1; mux inputs 1 2 3 4 carry entries 24 5 92 46; a BRAM drives Select]
Structural RTL code for communication components:
Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
C1: classifier port map(dout, …);
Bram1: bram generic map(2, 1, 4, 3, …) port map(…., select);
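The select sequence `2, 1, 4, 3` in the BRAM generic map is consistent with the figure's mux input order (24, 5, 92, 46) and the needed entries (5, 24, 46, 92). The sketch below derives it and prints structural lines in the style of the slide. The helper names, the `sel`/`dout` signal names, and the `clk` port are hypothetical; `sel` is used because `select` is a reserved word in VHDL.

```python
def select_schedule(mux_inputs, needed_entries):
    """For each needed entry, the 1-based mux input that carries it."""
    position = {entry: index + 1 for index, entry in enumerate(mux_inputs)}
    return [position[entry] for entry in needed_entries]


def emit_vhdl(index, mux_inputs, schedule):
    """Structural lines for one mux and its select BRAM (port lists are assumed)."""
    ports = ", ".join(f"II({entry})" for entry in mux_inputs)
    rom = ", ".join(str(step) for step in schedule)
    return [
        f"Mux{index}: mux{len(mux_inputs)} port map({ports}, sel{index}, dout{index});",
        f"Bram{index}: bram generic map({rom}) port map(clk, sel{index});",
    ]


schedule = select_schedule([24, 5, 92, 46], [5, 24, 46, 92])
print(schedule)  # [2, 1, 4, 3]
for line in emit_vhdl(1, [24, 5, 92, 46], schedule):
    print(line)
```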
Review of custom design space exploration
Object detection application
Custom design space exploration: Program analysis, Design exploration, Design generation
Resource constraints, performance requirements
Map to different FPGAs
Pareto design points (execution time vs. size) for different number of classifiers
Communication bottleneck (400-1 mux): Feature mapping problem
Experiment scenarios
Different implementations
Desktop: Pentium4 3.0 GHz, fixed-point C
FPGA: 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF, 2 CF, 4 CF, 8 CF, 16 CF on Xilinx Virtex5 LX50T, LX110T, and LX155T
Feature sets
Face: 2135 features
Eye: 1066 features
Sample images: Face(simple), Face(complex), Eye
Classifier: 12 ports
Experiment: FPGA resource utilization
Map to different Xilinx Virtex5 FPGAs: LX50T (29,000), LX110T (69,000), LX155T (97,000)
Communication architecture: General comm. architecture (400-1 mux) vs. Custom comm. architecture (24-1 mux, 9-1 mux, 24-1 mux, 16-1 mux)
[Chart: design size (number of LUTs, axis 0 to 90,000), split into Comms and Static, for 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF(12 mux), 2 CF, 4 CF, 8 CF, 16 CF]
Components' timing info
Performance of different components (Image scaler, Buffer controller, Classifier) on the Xilinx Virtex5 110T FPGA:
65 MHz, 11 cycles/window
65 MHz, (3 + examined features/#CF) cycles/window
130 MHz, 6 cycles/pixel
Frame/sec: 124, 110, 0.6 (min), 201 (max)
Performance upper bound (110 fps)
[Block diagram: Video in, Frame grabber, Image scaler, Buffer controller, Integral image, Classifier, Rectangle drawer, Video out (objects in rectangles)]
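The classifier's cycle formula and the idea of a system upper bound can be sketched as follows. The function names and the `windows_per_frame` value are hypothetical; the slides do not state how many sub-windows a frame contains across all scales, so the printed frame rate is illustrative only.

```python
def classifier_cycles_per_window(examined_features, n_cf):
    """Slide formula: 3 + examined features / #CF cycles per sub-window."""
    return 3 + examined_features / n_cf


def frames_per_second(clock_hz, cycles_per_window, windows_per_frame):
    """Throughput of a component that spends a fixed cycle count per sub-window."""
    return clock_hz / (cycles_per_window * windows_per_frame)


def system_fps(component_fps):
    """The slowest component bounds the whole pipeline."""
    return min(component_fps)


cycles = classifier_cycles_per_window(examined_features=32, n_cf=16)
print(cycles)                                        # 5.0
print(frames_per_second(65e6, cycles, 100_000))      # windows_per_frame is assumed
print(system_fps([124, 110, 201]))                   # 110
```

With the slide's per-component figures, the minimum is the 110 fps upper bound.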
Performance comparison
FPGA implementations are 0.6 to 25X faster than desktop C (Pentium 4 3.0 GHz)
Upper bound (determined by buffer controller)
[Chart: performance (frame/sec., axis 0 to 120) for Desktop, 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF, 2 CF, 4 CF, 8 CF, 16 CF; series: Face(complex), Face(simple), Eye]
Comparison to previous work
Compared to Cho’s [FPGA 09] implementation of the same algorithm with 320x240 pixels on the same FPGA.
Design          Size(LUTs)   Performance(fps)
Cho's(1 CF)     64,143       17.5
Ours(1 CF)      45,713       19.3
Cho's(3 CFs)    84,232       28.8
Ours(16 CFs)    77,059       90.9
More scalable due to custom design space exploration
Ours(16 CFs) vs. Cho's(3 CFs): 3x faster with 8% fewer LUTs
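The summary ratios can be recomputed from the table. This is plain arithmetic on the figures listed; the dictionary layout and function name are just for illustration.

```python
designs = {
    "Cho's(1 CF)": (64143, 17.5),
    "Ours(1 CF)": (45713, 19.3),
    "Cho's(3 CFs)": (84232, 28.8),
    "Ours(16 CFs)": (77059, 90.9),
}


def compare(ours, theirs):
    """Return (speedup, fraction of LUTs saved) of `ours` relative to `theirs`."""
    our_luts, our_fps = designs[ours]
    their_luts, their_fps = designs[theirs]
    return our_fps / their_fps, 1 - our_luts / their_luts


speedup, saving = compare("Ours(16 CFs)", "Cho's(3 CFs)")
print(f"{speedup:.2f}x faster, {100 * saving:.1f}% fewer LUTs")  # 3.16x faster, 8.5% fewer LUTs
```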
Video Demo http://www.youtube.com/watch?v=gkQVanU5P5U
Conclusions
Effectively implemented object detection algorithm on a modern series of FPGAs
Custom design space exploration is necessary for complex applications
Future work: Implement more applications using custom search/optimization
Thank you!