AMETHYST: Image Registration Engine for Multiframe Processing · Multiframe Image Processing...
Transcript of AMETHYST: Image Registration Engine for Multiframe Processing · Multiframe Image Processing...
AMETHYST: Image Registration Engine for Multiframe Processing
Department of Electrical Engineering
Princeton University, Princeton, NJ 08544 USA
Mohammed Shoaib, Rich Stoakley, Matt Uyttendayle, and Jie Liu
Sensing and Energy Research Group
Microsoft Research
Multiframe processing (MFP) enables advanced algorithms for image analysis
MFP: Why should you care?
1
Multiframe Image Processing
Enhanced Image
Image 1
Image N
Image 2
– e.g., high-dynamic range (HDR) imaging, de-noising,
image stabilization, de-blurring, super-resolution
imaging, de-hazing, panoramic stitching, etc.
Short
Exposure
Long
Exposure
HDR ImageDe-noised ImageNoisy
Image 1
Noisy
Image 2
Stabilized ImageBlurry
Image 1
Blurry
Image 2
Blurry
Image 1
Blurry
Image 2
Blurry
Image 3
De-blurred Image
IMG SOURCES: (1) L. Zhang, CVPR ‘10 (2) M. Tico, Nokia ERDC ‘06, (3) A. Tomaszewska WSCG ‘07
Why is it hard?
2
E.g., HDR Photography
Typically, serial processing → frame delays cause issues:
1. Moving objects create artifacts
Frame misalignments lead to artifacts in fused image
2. Moving camera also creates artifacts
Senso
r
ISP
Low exposure
shotHigh
exposure shot
Mem
ory
CPU
Merge algorithm to create
HDR image
Senso
r
ISP
Mem
ory
~ 2 seconds/frame
IMG SOURCES: (1) L. Zhang, CVPR ‘10 (2) NVIDIA whitepaper, MWC ‘13
What are some existing solutions?
3
Solution 1: HDR Capture, e.g., Toshiba T4K05
Algorithmic solution is more interesting → needs no hardware change and scales to other applications
Solution 2: Algorithmic
Artifacts absent
HDR image
S-HDR image sensor
Image alignment
Compositing algorithm
Artifacts absent
HDR image
IMG SOURCES: (1) product brief T4K05 ‘13 (2) L. Zhang, CVPR ‘10
What are others doing about it?
4
Senso
r
ISP
Low exposure
shotHigh
exposure shot
Mem
ory
CPU
Merge algorithm to create
HDR image
Senso
r
ISP
Mem
ory
~ 2 seconds/frameFig: Camera architecture in current high-end mobile devices
Sensor ISPQuad-core
CPU
Memory
Sensor ISPTegra 4
GPU Cores
NVIDIA Computational Photography Engine
Fig: Chimera: The NVIDIA computational photography arch.
Tegra 4 Quad-core CPU
Memory
Senso
r
~ 0.2 seconds/frame
CPU GPU
ISP
E.g., NVIDIA: Tegra 4 (2014)
1st real-time HDR, 1st HDR panorama, 1st object tracking
Proprietary ISP-embedded algorithms use GPU for acceleration →~10x speedup and cost power
SOURCE: NVIDIA whitepaper, MWC ‘13
What are their limitations?
5
1. Current solutions: Modest speedups. Not generally applicable.
Image registration is a computational bottleneck → needs acceleration
Our target: ~100x speedup compared to software
Image Registration
Multiframe Compositing
IMG1
IMG1
IMG1
EnhancedImage
> 1.8 sec./frame ~1 sec. total
Image Alignment Image Fusing
2. Current algorithmic solutions: slow on CPUs
What have we done about it?
6
Sensor
F0 Frame Bus
CSI
Streaming Input
BF LS WB DM
CCM EE γ YUV
M0
Mn
G G G
G G G
G
GPUCPU
IPD DFE
HE IWP
AMETHYSTISP
VI
Composite Frames
F1 Fn C0 C1 Cm
Aligned Frames Bus
Fig: Proposed architecture for multi-frame image processing (MFP)
Specialized IP core for image registration Compositing
We propose an architecture for MFP that has a dedicated accelerator for image registration
What are our findings?
7
Technology 45 nm SOI
Area 0.15 mm2 (30k gates)
Memory < 2 MB
Frequency 1.1 GHz
Power 62.7 mW
Exec. Time 30ms/frame
Speed-up$ 37x over CPU
AMETHYST+ Performance Summary
AMETHYST shows speed-up of : 8x over GPU and 5x over FPGAat a power lower by : 14x than GPU and 3x than FPGA
0
0.1
0.2
0.3
0 200 400 600 800 1000
Tim
e/Fr
ame
(s)
Power Consumption (mW)
750
890
920
209.3245818
214.0819587
178.1161896
20
39.4
62.690
2
4
6
8
0 0.05 0.1 0.15 0.2 0.25
GFL
OP
S
Watts/GFLOPS
FPGA
GPU
GPU
FPGA
ASIC (AMETHYST)
ASIC (AMETHYST)
A4, PowerVR SGX535*
Tegra3, 12C GeForce*
Snapdragon, Adreno200*
Xilinx V6, 240k LC
Artix 7, 85k LC
Smartfusion2, 56k LUT
AMETHYST, 0.3 GHz
AMETHYST, 0.8 GHz
AMETHSYT, 1.1 GHz
Sensor
F0 Frame BusC
SIStreaming Input
GPUCPUAMETHYSTISPVI
Composite Frames
Fn
Aligned Frames Bus
C0 Cm
* performance and power are estimated values+ synthesis results for (IPD+DFE) blocks only $ assuming 60% cost due to (IPD + DFE)
What are our findings? Contd…
8
Highlights• State-of-the-art algorithm$ (from Photosynth)
2d-PE
2d-PE
2d-PE
2d-PE
ctrl regs
x
Idle
2d-PE
2d-PE
2d-PE
2d-PE
1d-PE
Ou
t
FIFO
FIFO
FIFO
+
+
1d-PE
FIFO
FIFO
FIFO
Kernel Matrix
Inp
ut
Mat
rix
Local FIFO
• 1st MFP engine for re-targetable applications
• Extensively configurable parallelism
• Multi-level data pipelining and interleaving• Systolic ops w/ 2-level vector reduction
F0 Frame Bus
AMETHYST: Architecture Block Diagram
F1 Fn
Aligned Frames Bus
Ix
Covariance
Iy
NMS
Harris IPD
Daisy (Sunflower) Feature Extraction
gradient histogram (e.g., SIFT, GLOH)
sine-weighted gradient quantization
quadrature steerable filters
Isotropic diff. of Gaussians (DoGs)
iterative
non-iterative
T-block Processing
Spatial Pooling
Normalization
Pat
ch B
uff
er
G Block Smoothing
G(x,σ)
Homography Estimation
Image Warping
RANSAC
Feat
ure
Bu
ffe
r
P P P
P P P
P P P
9x9 Jacobi SVD Interpolate
Kernel 0
Kernel 1
Kernel n
Re-sample
······························· ··
· ····· ···· ···· ·····
······
····· ····
$M. Brown et al., PAMI ‘10
Next
9
Technical steps:
– Finish implementing RTL for HE and IWP modules
Stage 1: FPGA validation
DDR 3/4 interface
Dat
a I/
O ARM Cortex Ax CPU
FPU and NEON
MMU
I-Cache D-CacheAMETHYST
MFP accelerator
DMA x Ch.
IRQ, DMA sync. SWDT
TTC
Clock
TTC
DebugUARTProgr. Interface
Stage 2: Silicon validation
ISP Core
AMETHYSTMFP accelerator
Stage 3: ISP integration
– Verify full-design on FPGA-based programmable SoCs (e.g., Zynq)– Develop HW-SW co-design with ARM core towards custom SoC– Perform physical design and post-layout validation of SoC
– Integrate silicon-proven design IP with ISP core