Farabet and LeCun’s original talk: GtI NeuFlow: A Runtime...

Page 1: Farabet and LeCun’s original talk: GtI NeuFlow: A Runtime …ziyang.eecs.umich.edu/iesr/lectures/farabet11jun-present.pdf · 2020-04-10 · Presentation By: Brendan Adkins and Tarun

Presentation By: Brendan Adkins and Tarun Khubchandani | EECS 507 Winter 2020 | 1

NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision

Paper by: Clement Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello and Yann LeCun

Presentation by: Brendan Adkins and Tarun Khubchandani

Farabet and LeCun’s original talk: https://www.youtube.com/watch?v=KaJtT1K3GtI

Introduction and Background

Computer Vision

● Extract high level information from images

○ Form a relationship between high-dimensional data and a low-dimensional space

● Object Recognition

○ Dense feature extraction from regularly spaced samples

● GPUs increasing in prominence in CV

○ Inexpensive, easily available, easily programmable

○ Poor performance/power consumption compared to custom HW

Contribution

● Provide real-time detection, categorization and localization of pipelined megapixel images

○ 10x less power consumption than a laptop computer (~10W)

○ 100x speedup in application

● Similar work being carried out at NEC Labs, Stanford and KAIST

Architecture

Dataflow Grid

● 2D Grid of Processing Tiles (PT)

○ Bank of Processing Operators

○ Routing MUX connecting local data lines to global/neighbor tiles

● Smart DMA

○ Asynchronous data transfers with priority to off-chip memory

● Global/Local Data Lines

○ Global lines connect PT to SDMA

○ Local Data lines connect neighbors

● Runtime Configuration Bus

○ Reconfigure grid to specialize at runtime
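
As a rough illustration of the grid described above, the following toy Python model treats each processing tile as an operator choice plus a routing choice, both rewritten by a `configure` call that stands in for the runtime configuration bus. Everything here is hypothetical: the class names, the operator table, the constant second operand, and the simplification to a single row of tiles are assumptions for illustration, not the hardware interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical operator bank; the real tiles hold hardware operators.
OPERATORS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": lambda a, b: max(a, b),
}

@dataclass
class Tile:
    """One processing tile: an operator choice plus a routing choice."""
    op: Optional[str] = None   # which operator from the bank is active
    source: str = "global"     # read from the "global" line or the "neighbor" line
    operand: float = 0.0       # constant second operand (an assumption)

    def process(self, global_in: float, neighbor_in: float) -> float:
        x = global_in if self.source == "global" else neighbor_in
        if self.op is None:    # an unconfigured tile passes data through
            return x
        return OPERATORS[self.op](x, self.operand)

class Grid:
    """A single row of tiles chained through local (neighbor) lines."""
    def __init__(self, n_tiles: int) -> None:
        self.tiles: List[Tile] = [Tile() for _ in range(n_tiles)]

    def configure(self, settings: List[Tuple[int, str, str, float]]) -> None:
        """Stand-in for the configuration bus: rewrite tiles at runtime."""
        for index, op, source, operand in settings:
            self.tiles[index] = Tile(op, source, operand)

    def run(self, stream: List[float]) -> List[float]:
        out = []
        for value in stream:          # each value arrives on the global line
            neighbor = 0.0
            for tile in self.tiles:   # each result moves on to the next tile
                neighbor = tile.process(value, neighbor)
            out.append(neighbor)
        return out

grid = Grid(2)
# First configuration computes (x * 2) + 1.
grid.configure([(0, "mul", "global", 2.0), (1, "add", "neighbor", 1.0)])
scaled = grid.run([1.0, 2.0, 3.0])
# Reconfigure the same grid in place to compute max(x, 2) * 10.
grid.configure([(0, "max", "global", 2.0), (1, "mul", "neighbor", 10.0)])
clamped = grid.run([1.0, 2.0, 3.0])
```

The point of the sketch is that the same `Grid` object serves two different computations; only the per-tile settings change between runs.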

Runtime Reconfiguration

● FPGA:

○ Versatile, simple processing elements (~10^4/package)

○ ~ms reconfiguration time but ~hr synthesis time

● Multicore Processor:

○ Simple usage (extensions to programming languages for parallelism)

○ Far fewer processing elements (10-100)

● Proposed Architecture:

○ Halfway between above options

○ Applications specialise in vision

Optimized Processing Tiles

● Specialized heavily on 2D convolutions

○ Top row PTs are MACs used as convolvers, implemented in FPGA by hardwired MAC

○ Middle row is general purpose ops

○ Bottom row is non-linear mapping (normalization, linear activation, etc.). Done with look-up or linear decomposition

● Pipelined to have 1 result/cycle

● Pixels stored Q8.8 and scaled to 32-bit in operations
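
To make the Q8.8 bullet concrete, here is a minimal sketch of a 2D convolution built from multiply-accumulate steps on Q8.8 fixed-point values (16 bits, 8 of them fractional). The function names are hypothetical, and the code assumes the CNN convention of not flipping the kernel. Python integers are unbounded, so the accumulator here only stands in for the wider register; overflow of a real 32-bit accumulator is not modeled.

```python
FRAC_BITS = 8                 # Q8.8: 8 integer bits, 8 fractional bits
SCALE = 1 << FRAC_BITS
Q_MIN, Q_MAX = -32768, 32767  # signed 16-bit range

def to_q88(x: float) -> int:
    """Encode a real number as a signed Q8.8 integer, saturating at the limits."""
    return max(Q_MIN, min(Q_MAX, int(round(x * SCALE))))

def from_q88(q: int) -> float:
    """Decode a Q8.8 integer back to a float."""
    return q / SCALE

def conv2d_q88(image, kernel):
    """Valid-mode 2D convolution (no kernel flip) on Q8.8 integers."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            acc = 0  # wide accumulator: each Q8.8 x Q8.8 product is Q16.16
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]  # one MAC step
            # Drop 8 fractional bits to return to Q8.8, then saturate.
            out[r][c] = max(Q_MIN, min(Q_MAX, acc >> FRAC_BITS))
    return out

pixels = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
weights = [[0.5, 0.5], [0.5, 0.5]]
image_q = [[to_q88(v) for v in row] for row in pixels]
kernel_q = [[to_q88(v) for v in row] for row in weights]
result = [[from_q88(v) for v in row] for row in conv2d_q88(image_q, kernel_q)]
```

With these inputs every intermediate value is exactly representable, so `result` matches the floating-point answer; in general the final shift discards low-order bits.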

Architecture Constraints

● High throughput, but not necessarily low latency

○ Operations replicated in both dimensions

○ Number of similar computations > latency in pipelined processor

● Must be stallable

○ Allows any path to be configured, even if requiring more bandwidth than available

○ Achieved with FIFO buffer
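
The stallable requirement can be sketched with a bounded FIFO between a fast producer and a slower consumer: when the buffer is full, the producer waits a cycle instead of dropping data. This is a simplified software model under assumed parameters (`depth`, `consumer_period`); the class and function names are hypothetical and do not describe the actual hardware buffers.

```python
from collections import deque

class StallableFifo:
    """Bounded FIFO: a full buffer makes the producer stall, never drop data."""
    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.items = deque()

    def can_push(self) -> bool:
        return len(self.items) < self.depth

    def push(self, item) -> None:
        assert self.can_push(), "caller must stall when the FIFO is full"
        self.items.append(item)

    def can_pop(self) -> bool:
        return bool(self.items)

    def pop(self):
        return self.items.popleft()

def simulate(n_items: int, depth: int, consumer_period: int):
    """Producer offers one item per cycle; consumer takes one every
    `consumer_period` cycles. Returns (received items, stall count, cycles)."""
    fifo = StallableFifo(depth)
    produced, received, stalls, cycle = 0, [], 0, 0
    while len(received) < n_items:
        # Consumer side: drain one item on the cycles when it is ready.
        if cycle % consumer_period == 0 and fifo.can_pop():
            received.append(fifo.pop())
        # Producer side: push if there is room, otherwise stall this cycle.
        if produced < n_items:
            if fifo.can_push():
                fifo.push(produced)
                produced += 1
            else:
                stalls += 1
        cycle += 1
    return received, stalls, cycle

received, stalls, cycles = simulate(n_items=6, depth=2, consumer_period=2)
```

Every item arrives in order even though the path asks for twice the bandwidth the consumer can supply; the cost shows up as stall cycles rather than lost data.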

Architecture Constraints

● Configuration time ≈ system latency

○ Crucial to runtime reconfiguration, achieved with configuration bus

● Coarse grained processing elements

○ Maximize ratio between computing and routing

Smart DMA

● Custom engine to allow multiple asynchronous accesses

● Arbiter MUXes/DEMUXes access to memory with high bandwidth

● Ports can be configured to R/W specific chunks and communicate status to Control Unit

○ Dataflow: Operation driven fully by data
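
One way to picture the port and arbiter arrangement is a strict-priority loop that grants a single memory access per cycle to the most urgent unfinished port. The sketch below is an assumption-laden toy: the `Port` fields, the lower-number-wins priority rule, and the `done` status flag are hypothetical stand-ins, and the real arbitration policy is not given on the slide.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Port:
    name: str
    priority: int          # lower number is served first (an assumption)
    base: int              # start address of this port's chunk
    length: int            # number of words in the chunk
    write: bool = False
    data: List[int] = field(default_factory=list)  # words to write, or words read
    offset: int = 0

    @property
    def done(self) -> bool:
        """Status a control unit could poll: has the whole chunk been moved?"""
        return self.offset >= self.length

def arbitrate(ports: List[Port], memory: List[int]) -> List[str]:
    """Grant one memory access per cycle to the highest-priority unfinished port."""
    grants = []
    while any(not p.done for p in ports):
        port = min((p for p in ports if not p.done), key=lambda p: p.priority)
        addr = port.base + port.offset
        if port.write:
            memory[addr] = port.data[port.offset]
        else:
            port.data.append(memory[addr])
        port.offset += 1
        grants.append(port.name)   # record which port won this cycle
    return grants

memory = list(range(100, 116))     # 16 words of toy off-chip memory
reader = Port("kernel_read", priority=0, base=0, length=3)
writer = Port("result_write", priority=1, base=8, length=2, write=True, data=[7, 9])
grants = arbitrate([writer, reader], memory)
```

Here the read port finishes its three-word chunk before the lower-priority write port gets any cycles, regardless of the order in which the ports were listed.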

Compiler

Purpose

● Extracting levels of parallelism from graph descriptions of algorithms

● Graphs are given in the Torch5 environment

○ Matrix representation similar to MATLAB

● Known sequences of operators are matched to pre-optimized routines

Training a XOR gate in Torch 5, http://torch5.sourceforge.net/manual/newbieTutorial.html
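
The idea of matching known operator sequences to pre-optimized routines can be sketched as a greedy longest-match pass over a linear list of operators. This is only an illustration in Python: the routine table, the operator names, and the greedy strategy are all assumptions, and the slides do not describe how the actual compiler performs the matching.

```python
from typing import Dict, List, Tuple

# Hypothetical table mapping operator sequences to pre-optimized routine names.
ROUTINES: Dict[Tuple[str, ...], str] = {
    ("conv", "add_bias", "tanh"): "fused_conv_bias_tanh",
    ("conv", "tanh"): "fused_conv_tanh",
    ("subsample", "tanh"): "fused_subsample_tanh",
}

def match_routines(ops: List[str]) -> List[str]:
    """Greedy longest-match: replace known operator runs with fused routines."""
    patterns = sorted(ROUTINES, key=len, reverse=True)  # try longest patterns first
    plan, i = [], 0
    while i < len(ops):
        for pattern in patterns:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                plan.append(ROUTINES[pattern])
                i += len(pattern)
                break
        else:
            plan.append(ops[i])   # no pattern matched: keep the operator as-is
            i += 1
    return plan

network = ["conv", "add_bias", "tanh", "subsample", "tanh", "conv", "linear"]
plan = match_routines(network)
```

Operators with no matching pattern pass through unchanged, so the plan always covers the whole network.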

Parallelization Methods

● Across modules

○ Special cases

○ Cascading convolutions and nonlinear mapping

● Across images

○ Can use multiple PTs to convolve multiple inputs with a kernel at once

○ NeuFlow/LuaFlow’s strength and the simplest method

● Within an image
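
The across-images case, where several inputs are convolved with one shared kernel at the same time, has roughly the following structure. In this sketch each worker thread plays the role of a processing tile; the function names and the `n_tiles` parameter are hypothetical. Threads in CPython will not give a real speedup for this arithmetic, so the code shows only the shape of the parallelism, not its performance.

```python
from concurrent.futures import ThreadPoolExecutor

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (no kernel flip) in plain Python."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]

def convolve_many(images, kernel, n_tiles=4):
    """Apply one shared kernel to several inputs, one worker per 'tile'."""
    with ThreadPoolExecutor(max_workers=n_tiles) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(lambda img: convolve2d(img, kernel), images))

kernel = [[1, 0], [0, -1]]   # toy 2x2 diagonal-difference kernel
images = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]
outputs = convolve_many(images, kernel)
```

Because the inputs are independent, no coordination between workers is needed, which is consistent with the slide calling this the simplest method.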

Application and Performance Comparison

Application: Street Scenes

● Trained with LabelMe dataset of Spanish cities.

● 3 phases of training

● After training, network mapped to NeuFlow using LuaFlow

Phase 1: CN1

● 3 Convolutions

● Small kernels (5x5)

○ Small receptive field

● Focus on minimizing cross entropy to promote rare categories

Phase 2: CN2

● 3 Convolutions

● Kernels increased to 9x9 size

○ Receptive field 2% of image

Phase 3: CN3

● 4 Convolutions

● Kernels kept at 9x9 size

○ Receptive field 5% of image
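
To see why kernel size and depth both widen the receptive field across the three phases, the standard recurrence for stacked convolutions can be evaluated directly. The sketch assumes the convolutions are applied in sequence with stride 1 and no subsampling between them; the slides do not state strides or pooling, and any subsampling would enlarge these figures, so they should be read as illustrative rather than as the networks' actual receptive fields.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field, in input pixels along one axis, of stacked conv layers."""
    strides = strides or [1] * len(kernel_sizes)
    size, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        size += (k - 1) * jump  # each layer adds (k - 1) steps at the current spacing
        jump *= s               # stride multiplies the spacing between outputs
    return size

# Assumed stride-1 stacks matching the layer counts and kernel sizes on the slides.
cn1 = receptive_field([5, 5, 5])     # three 5x5 convolutions
cn2 = receptive_field([9, 9, 9])     # three 9x9 convolutions
cn3 = receptive_field([9, 9, 9, 9])  # four 9x9 convolutions
```

Under these assumptions the field grows from 13 pixels per side for CN1 to 25 for CN2 and 33 for CN3, which matches the direction of the trend on the slides.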

Performance Comparison

● V6

○ Competitive GOP rate

○ Strength in power efficiency indicates a potential use case in power-constrained systems where the speed of an mGPU would suffice

● IBM

○ Vast projected improvement in GOP rate and efficiency

○ Could fully eclipse the mGPU in speed and efficiency

Questions?
