Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature...
Transcript of Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature...
![Page 1: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/1.jpg)
Hardware Acceleration of Feature Detection and Description Algorithms on
Low‐Power Embedded Platforms
Onur Ulusel, Christopher Picardo, Christopher Harris, Sherief Reda, R. Iris Bahar,
School of Engineering, Brown University
![Page 2: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/2.jpg)
Image Processing in Mobile Systems
2
• Image processing is everywhere!– Input data has changed from words/numbers to images
– Sensors have improved dramatically
• Image processing is a major driving factor in technological advancement– Autonomization relies on image processing
• Mobile/Embedded platforms?? – Real‐time computing + limited data bandwidth prefer local computing to offloading to cloud
– BUT image processing can be very computationally intensive and power hungry
www.google.com
www.guardiantv.com
![Page 3: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/3.jpg)
Accelerating Image Processing on Low Power Embedded Platforms
• Meeting real time image processing requirements for many of these applications requires HW assisted acceleration
• Which algorithms do we accelerate?– Feature detection and feature description are key building blocks for image retrieval, biometric identification, visual odometry, etc.
– Computational efficient detection and analysis of image features is critical for performance and energy‐efficiency
3http://www.sybernautix.com/ https://blog.pivotal.iohttps://vision.in.tum.de
![Page 4: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/4.jpg)
Hardware Acceleration for Energy Constrained Image Processing
• Low power embedded platforms– Field Programmable Gate Arrays (FPGAs)– Graphical Processing Units (GPUs)– Low power general processors (CPUs)
4
FPGAs GPUs
XilinxVirtex 6 Xilinx Zynq 7020 1532 core NVIDIA
GeForce GTX 680192 core NVIDIA
Jetson TK1
Power 15 W <5W 195 W <12W
![Page 5: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/5.jpg)
Our Contributions
• Comparative study of feature detection and description algorithms–What are their computation kernel characteristics?
• Comparative study of platforms for embedded applications– Advantages/disadvantages of each platform?
• Accelerating algorithms on different platforms– How can algorithms be modified to better exploit available hardware of each platform?
– How does performance compare in terms of run time and energy consumption?
5
![Page 6: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/6.jpg)
Feature Detection
• What is a ‘feature’?– An “interesting” part of an image that can be used to identify objects
• Examples: Edges, corners, ridges, blobs
6Slide adapted from Darya Frolova, Denis Simakov, Weizmann Institude
flat region:no change within block
edge:no change
along the edge
corner:change in
all directions
![Page 7: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/7.jpg)
Feature Description• Given the features, uniquely describe them so they can be matched in other images
• Descriptors summarize characteristics of the features – E.g., intensity, orientation
• Descriptors should be distinctive and insensitive to local image deformations.
7Images from: R. Szeliski, Computer Vision: Algorithms and Applications
![Page 8: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/8.jpg)
Accuracy and Run‐time Comparisons
8
• HoG (Histogram of Gradient) based Descriptors– SIFT: Scale‐Invariant Feature Transform– SURF: Speeded Up Robust Features
• Binary Feature Descriptors– BRIEF: Binary Robust Independent Elementary Features– BRISK: Binary Robust Invariant Scalable Keypoints
1
10
100
1000
0
20
40
60
80
SIFT SURF BRIEF BRISK
CPU ru
n‐tim
e (m
s) ‐log scale
Accuracy Percetage Precision
Recall
Run‐time
(% of detected features that are correct)
(% of features detected)
![Page 9: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/9.jpg)
FAST: Features from Accelerated Segment Test
9
BresenhamCirclesliding
window
12‐pixel continuity? If so then feature
Pre‐compare pixels 1, 5, 9, and 13 to determine possibility for continuity
On average 98.5% of the comparisons fail the continuity test at the pre‐compare stage
Rosten and Drummond, ECCV’06
![Page 10: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/10.jpg)
BRIEF: Binary Robust Independent Elementary Features
• Compare intensities of pairs of points using Hamming distance
• BRIEF Sampling pattern–512 sampling pairs– For each pair, Xi is at (0,0) and Yitakes all possible values from coarse polar grid
– Sampling pairs are generated from a 31×31 region around center pixel
10
Chosen sampling pattern results in a 512‐bit characterization array
![Page 11: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/11.jpg)
BRISK: Binary Robust Invariant Scalable Keypoints
11
• BRISK uses custom sampling pattern• 512 sampling pairs generated from a 31×31 region (like BRIEF)
• Distinguishes between short/long pairs– Short pairs used similar to BRIEF to generate descriptor vectors based on intensity comparisons
– Long pairs used for orientation computation by rotating sampling pattern
BRISK sampling patternRed circles represent standard deviation of Gaussian smoothing
![Page 12: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/12.jpg)
Start
R d IFrame
Read Input Frame
h i l l
Bresenham circle
For each pixel , p, apply the 7x7 filter of Bresenham circle
is a corner
p is a corner
Generate N sampling pairs , Xi and Yi, around p
Xi > Yi
Di = 1
Di = 0
FeatureDetection
Brief FeatureDescription
Stop
False
False
Algorithm Flowchart
• FAST feature detection + BRIEF feature description
• Obtaining sampling window for feature description requires irregular access pattern
FAST
BRIEF
![Page 13: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/13.jpg)
Start
R d IFrame
Read Input Frame
F h i l l
Bresenham circle
For each pixel , p, apply the 7x7 filter of Bresenham circle
is a corner
p is a corner
Generate N sampling pairs , Xi and Yi, around p
Xi > Yi
Di = 1
Di = 0
FeatureDetection
False
False
Perform Orientation Compensation for BRISK
Stop
FeatureDescription
Algorithm Flowchart
• FAST feature detection + BRISK feature description
• BRISK requires an extra step for orientation compensation– A significant amount of extra hardware resources for this step
FAST
BRISK
![Page 14: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/14.jpg)
Experimental Embedded Platforms
• FPGA: MicroZED development board:– 28nm Zynq 7020 SoC– Artix‐7 FPGA + 1GB DDR3– dual‐core Arm Cortex A9 CPU (for debug and init. only)
• GPU & CPU: Jetson TK1 development kit– 28nm Tegra K1 SoC– Kepler GPU with 192 CUDA cores @ 950MHz– Quadcore ARM Cortex A15 CPU @ 2.5GHz (single core activated)– 2GB Memory– Running OpenCV versions of FAST, BRIEF, BRISK
14
![Page 15: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/15.jpg)
Feature Detection & Description:Block Diagram
Image data
Zig‐Zag Traversing Line Buffers
Mask Size Register Array
Buffer Address Generator andRegister Array
+‐
+‐
+‐
+‐
Pre‐compute units
Equa
lity X 12
+‐
Circle Comparator
XX
xx
xx
xx
x&ycoordina
tesEnable
& Sync signals
ARM Cortex A9
CPU
Central Interconn
ect
32b GP AX
I Master P
ort
10‐Line word
Buffe
r
Processing System (PS)
Control
Programmable Logic (PL)
is_corner
Smoothing & Region Generation
N‐wide comparator
Orientation Compensation
descrip
tor
AXI Interconn
ect
+‐
Memory InterfaceDDR3
15
orientation
![Page 16: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/16.jpg)
Feature Detection & Description:Block Diagram
Image data
Zig‐Zag Traversing Line Buffers
Mask Size Register Array
Buffer Address Generator andRegister Array
+‐
+‐
+‐
+‐
Pre‐compute units
Equa
lity X 12
+‐
Circle Comparator
XX
xx
xx
xx
x&ycoordina
tesEnable
& Sync signals
ARM Cortex A9
CPU
Central Interconn
ect
32b GP AX
I Master P
ort
10‐Line word
Buffe
r
Processing System (PS)
Control
Programmable Logic (PL)
is_corner
Smoothing & Region Generation
N‐wide comparator
Orientation Compensation
descrip
tor
AXI Interconn
ect
+‐
Memory InterfaceDDR3
16
orientation
FAST Feature Detection
![Page 17: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/17.jpg)
Feature Detection & Description:Block Diagram
Image data
Zig‐Zag Traversing Line Buffers
Mask Size Register Array
Buffer Address Generator andRegister Array
+‐
+‐
+‐
+‐
Pre‐compute units
Equa
lity X 12
+‐
Circle Comparator
XX
xx
xx
xx
x&ycoordina
tesEnable
& Sync signals
ARM Cortex A9
CPU
Central Interconn
ect
32b GP AX
I Master P
ort
10‐Line word
Buffe
r
Processing System (PS)
Control
Programmable Logic (PL)
is_corner
Smoothing & Region Generation
N‐wide comparator
Orientation Compensation
descrip
tor
AXI Interconn
ect
+‐
Memory InterfaceDDR3
17
orientation
BRIEF Descriptor
![Page 18: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/18.jpg)
Feature Detection & Description:Block Diagram
Image data
Zig‐Zag Traversing Line Buffers
Mask Size Register Array
Buffer Address Generator andRegister Array
+‐
+‐
+‐
+‐
Pre‐compute units
Equa
lity X 12
+‐
Circle Comparator
XX
xx
xx
xx
x&ycoordina
tesEnable
& Sync signals
ARM Cortex A9
CPU
Central Interconn
ect
32b GP AX
I Master P
ort
10‐Line word
Buffe
r
Processing System (PS)
Control
Programmable Logic (PL)
is_corner
Smoothing & Region Generation
N‐wide comparator
Orientation Compensation
descrip
tor
AXI Interconn
ect
+‐
Memory InterfaceDDR3
18
orientation
BRISK Descriptor
![Page 19: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/19.jpg)
Feature Detection & Description:Block Diagram
19
Feature detection logic
Image data
Zig‐Zag Traversing Line Buffers
Mask Size Register Array
Buffer Address Generator andRegister Array
+‐
+‐
+‐
+‐
Pre‐compute units
Equa
lity X 12
+‐
Circle Comparator
XX
xx
xx
xx
x&ycoordina
tesEnable &
Sync signals
ARM Cortex A9
CPU
Central Interconn
ect
32b GP AX
I Master P
ort
10‐Line word
Buffe
r
Processing System (PS)
Control
Programmable Logic (PL)
is_corner
Smoothing & Region Generation
N‐wide comparator
Orientation Compensation
descrip
tor
AXI Interconn
ect
+‐
Memory InterfaceDDR3
orientation
Descriptor logicData control logic for detection and description
![Page 20: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/20.jpg)
Results: Run‐time4.4
25.8
13.8
13.7
36.4
103.7
18.2
13.7
50.8
111.7
27.6
13.7
0
20
40
60
80
100
120
Intel i7 CPU ARM on Jetson Tegra GPU onJetson
Zynq FPGA
Run‐tim
e (m
s)
FAST FAST+BRIEF FAST+BRISK
20
![Page 21: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/21.jpg)
Results: Power & Energy
21
(detection) (detection + description)
21.5
6.5
4.3
2.20
22.2
6.2
6.3
2.27
22.4
6.3
6.3
2.31
0
5
10
15
20
25
Intel i7 CPU EmbeddedCPU
Tegra GPU Zynq FPGAPo
wer (W
)FAST FAST+BRIEF FAST+BRISK
94.0
166.8
59.3
30.00
809.4
696.4
114.1
30.96
1137
.7
705.1
174.3
31.58
0
200
400
600
800
1000
1200
Intel i7 CPU EmbeddedCPU
Tegra GPU Zynq FPGA
Energy (m
J)
FAST FAST+BRIEF FAST+BRISK
![Page 22: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/22.jpg)
Results: Profiling
• Feature description stalled due to memory throttle– Needs better data management
22
0.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% 35.0% 40.0% 45.0%
ExecutionDependency
Pipe Busy
MemoryThrottle
Not Selected
Stall Reasons during GPU Computation
Description Detection(BRIEF) (FAST)
0%
20%
40%
60%
80%
100%
FPGA: FAST FPGA: FAST + BRIEF GPU: FAST GPU: FAST + BRIEF
Instruction Distribution
Load/Store FP/Integer
• Feature description for GPU implementation has bump in load/store ops– Almost 10X more than just FAST
![Page 23: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/23.jpg)
Results: FPGA Resource Utilization
23
10% 14%27%
32%19%
37%
57%66%
37%
LookUp Tables Flip Flops Block Rams
FAST FAST+BRIEF FAST+BRISK
Lookup Tables
Flip Flops
BlockRAMs
FAST 4564 1551 8
FAST + BRIEF 14398 2093 11
FAST + BRISK 25575 7115 11
• BRISK requires significant amount of extra resources for smoothing and orienting
• Extra resources do not translate to much extra power
Distribution of resourcesResource utilization
![Page 24: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,](https://reader035.fdocuments.us/reader035/viewer/2022062306/5f1aabd0761fbd7b59658ff8/html5/thumbnails/24.jpg)
Conclusions
• FPGA outperforms CPUs and GPUs in terms of power & performance– FAST + BRISK: 36 fps vs. 147 fps– FPGA amenable to various HW optimizations:
• deep pipelining, optimized memory access, pre‐computation
• FGPA implementations better for handling multiple kernels– For GPUs, multiple kernels highly bounded by kernel scheduler and memory bottlenecks
– FPGA customization on layers better for tackling operations on multiple kernels. • Use profiling on GPU implementation as first step to FPGA optimization
– identify nature of bottlenecks– Customized FPGA HW can often better manage certain types of bottlenecks
24