Image, Vision, and AI Processor For Mobile Devices Google's Fully … · 2018-08-20 · Pixel...
Transcript of Image, Vision, and AI Processor For Mobile Devices Google's Fully … · 2018-08-20 · Pixel...
Pixel Visual Core:Google's Fully Programmable Image, Vision, and AI Processor For Mobile DevicesJason Redgrave, Albert Meixner, Nathan Goulding-Hotta, Artem Vasilyev, and Ofer Shacham
1
Motivation: The Vision
2
No HDR+ HDR+
Software Motivation
3
● Imaging, Vision, and AI are fast evolving fields● Productizing new algorithms is difficult
○ ASIC design adds long delay into Research→Product cycle○ CPU efficiency limits complexity in mobile space○ Reliance on external vendors limits changes to the stack
● Software wants control, flexibility, and efficiency
Hardware Motivation
4
CPU
Ene
rgy
per O
p (p
J/O
p)
Performance (Op/sec)
GPU
ASICPVC’s IPU
10-20x
10-20x
10x
Not-programmable<1 pJ/Op>1 Tera Ops/Sec
Programmable>100 pJ/Op~1-10 Giga Ops/Sec
Programmable Image Processing Unit (IPU)<1 pJ/Op>1 Tera Ops/Sec
Domain-specific Software
● High-level programming model is Halide○ Domain-specific language for Image Processing
● IPU supports a subset of Halide language○ No floating point○ Limits on available memory access patterns
● Halide backend generates kernels and all API calls○ Proprietary API for resource allocation and execution control
5
Compilation
● Halide backend generates high-level, virtual ISA (vISA)○ RISC ISA with streaming-friendly memory model○ Cross-generational & Architecture-independent
● Final compilation into physical ISA (pISA)○ Compilation can be offline or on-device○ Generation-specific VLIW ISA○ All memory movements are explicit (no caches)
6
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem7
Conceptual View of Hardware: DAG of Kernels
LineBufferLineBuffer
DRAM
2D Stencil Processor
LineBuffer
LineBuffer
2D Stencil Processor
LineBuffer
2D Stencil Processor
2D Stencil Processor
DRAM
Camera
LineBuffer
LineBuffer
● Many cores, each fully programmable● Configurable DAG topology
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem
Hardware resources view
8
Line Buffer 0
Line Buffer 1
Line Buffer 2
Line Buffer N
From DRAM
To DRAM
Stencil processor 0
Shift Reg Shift Reg
Bus Ifc
Stencil processor M-1
Shift Reg Shift Reg
Bus Ifc
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem
Pixel Visual Core Architecture
9
IPU Core 2
IPU Core 1
IPU Core 4
IPU Core 3
IPU Core 6
IPU Core 5
IPU Core 8
IPU Core 7
PCIe
MIPIA53
LPD
DR
4
IPU IO Block● A53● LPDDR4● MIPI● PCIe● IPU
DRAM IPU SoC
Chip Specs: TSMC 28nm 6.0 x 7.2 mm 426 MHz 512 MB DRAM Power <4500 mW
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem
IPU Architecture
10
● 8x Cores○ STP○ LBP
● I/O○ DMA○ MMU○ SSP (Buffer)
● Ring NoC
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem
Stencil Processor (STP)
11
● Scalar lane○ Instruction RAM
● SHG (Load/Store)● 2D Array
○ 256 Compute lanes○ 144 Halo lanes○ Shared RAMs○ Shift network
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem
Compute Lane
12
● Single cycle!● Dual 16b ALU● 16b x 16b Multiply-Add● Memory
○ Register File○ Shared RAM (L0)
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem
Read Neighbor Network
13
● Output pixel depends on neighboring inputs
● One vertical or horizontal shift per cycle
● Shifts are circular (toroidal)
● 1/2/3/4 hops in each direction
ALU ALU
ALU ALU
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem
Line Buffer Pool (LBP)
14
● Data storage● Synchronization● 2D Line Buffer abstraction
(2D FIFO)
Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem
Virtually Tall Line Buffer
15
Sliding buffer memory (sb_memory)
extended fullwidth buffer
fb_rows
sb_rows
sb_cols
img_width
Read Sheet
full width buffer memory (fb_memory)
fb_rows
Reused from last pass
Will be reused in the next pass
Used only by the current pass
● Memory saving over full buffer
● Elasticity set by SB width● FB height set by kernel
reuse
Results
16
Align Merge Finish
2.8x 2.8x
5.8xPerformance
Align Merge Finish
6.9x 7.1x
16.7xEnergy Efficiency
7-16x more energy-efficient despite 3-generation process gap!
Mobile SoC (10nm) Pixel Visual Core (28nm)
17
T.J. Alumbaugh, Ke Bai, Fabrizio Basso, Trevor Bunker, Dinesh Chandrasekaran, Ed Chang, Robert Chapman, Arun Chauhan, Albert Chen, Chien-Yu Chen, Matt Cockrell, Jamison Collins, Joe Dao, Neeti Desai, Dusty DeWeese, Andrea Di Blas, Daniel Finchelstein, Arnd Geis, Filippo Gioachin, Richard Goe, Nathan Goulding-Hotta, Ben Gribstad, Cheng Gu,
Penny Gur, Nico Hailey, Ashok Halambi, Adam Hampson, Mahdi Hamzeh, Vince Harron, Sam Hasinoff, Zhijun He, Viresh Hingarh, Sean Howarth, Richard Hsu, John Keen, Asif Khan, Ji Yun Kim, Timothy Knight, Victor Leung, Marc Levoy, Yuan
Lin, Joshua Litt, Chenjie Luo, Bill Mark, Albert Meixner, Jon Michelson, Rob Mohr, Michael Moreno, Ben Mossawir, Rolf Mueller, Sean O'Boyle, Sarang Padalkar, Hyunchul Park, David Patterson, Alexander Perez, Karthika Periyathambi, Bob
Pflederer, Theodore Popp, Todd Poynor, Jing Pu, Shahriar Rabii, Ramesh Ramarao, Jason Redgrave, Masumi Reynders, Shac Ron, Sabarish Sankaranarayanan, Vaidyanathan Seetharaman, Ofer Shacham, Dillon Sharlet, Sean Silva, Saravana Soundararajan, Don Stark, Eino-Ville Talvala, Pei Zhao Tang, David Tolnay, Michelle Tomasko, Artem Vasilyev, Jaakko
Ventela, Nate Voorhies, David Warren, Kevin Watts, Huachang Xu, Catherine Yu, Rong Zhou, Qiuling Zhu
Questions?
18