
IMPLEMENTATION OF EFFICIENT, LOW POWER DEEP NEURAL NETWORKS

ON NEXT-GENERATION INTEL CLIENT PLATFORMS

Michael Deisher, michael.deisher@intel.com

Intel Corporation, Hillsboro, Oregon, USA

Andrzej Polonski, andrzej.polonski@intel.com

Intel Technology Poland, Gdansk, Poland

What is GNA?

Low power neural co-processor for continuous inference at the “edge”

Designed for Intel® Quark™, Intel Atom®, and Intel® Core™ based devices

Runs while application processor is in low power sleep state

Interfaces to system or private memory avoiding CPU cache pollution

[Block diagram: CPU Core0-Core3 share a memory bus to DRAM; the GNA, with its own SRAM and DSP, attaches over a separate memory bus]

[Diagram: system memory holds a list of layer descriptors together with weight matrices, bias vectors, PWL activation descriptors, and I/O buffers. The GNA block comprises a memory management unit, a wide SIMD math unit, an activation function approximator, and a sequencer; it streams in layer descriptors, weights, biases, PWL segments, and layer inputs, and streams out layer outputs.]

How does GNA work?

Neural network topology stored in memory as list of layer descriptors

Layer types: affine, diagonal affine, Gaussian mixture model, recurrent, 1D convolutional, transpose, copy

Activation function: piecewise linear (PWL) approximation (see the sketch after this list)

All-integer math (inputs, outputs, weights, biases)

Batching of layer inputs for better throughput

Optional on-the-fly pruning (affine layer)

Complex graphs (e.g., LSTM, GRU, TDNN) constructed from basic layer types

Stream processing model

• App processor configures memory, starts GNA, and sleeps or does useful work

• GNA signals when forward propagation is complete
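
To make the PWL approximation concrete, here is a minimal sketch of how a piecewise linear activation can be evaluated in all-integer arithmetic. The segment layout (base input, base output, Q15 slope) and the saturation rule are assumptions for illustration; the actual GNA PWL descriptor format is defined by the native library headers.

    #include <stdint.h>

    typedef struct {
        int32_t x_base;  /* input value where this segment begins */
        int16_t y_base;  /* output value at x_base                */
        int16_t slope;   /* fixed-point slope, assumed Q15 here   */
    } PwlSegment;

    /* Find the segment containing x, then extrapolate linearly from
     * its base point: y = y_base + slope * (x - x_base).           */
    static int16_t pwl_eval(const PwlSegment *seg, int nseg, int32_t x)
    {
        int i = 0;
        while (i + 1 < nseg && x >= seg[i + 1].x_base)
            i++;
        int64_t dy = ((int64_t)(x - seg[i].x_base) * seg[i].slope) >> 15;
        int32_t y = seg[i].y_base + (int32_t)dy;
        /* saturate to int16, as an all-integer pipeline would */
        if (y > INT16_MAX) y = INT16_MAX;
        if (y < INT16_MIN) y = INT16_MIN;
        return (int16_t)y;
    }

With enough segments, a few table entries like this can approximate sigmoid, tanh, or ReLU to the precision the integer pipeline needs.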

Layer Type       | Input Orientation | Output Orientation | Batch Size | Weight Width | PWL Activation | Partial Output
-----------------|-------------------|--------------------|------------|--------------|----------------|---------------
Affine           | column vector     | column vector      | 1-8        | 1B, 2B       | optional       | yes
Diagonal Affine  | column vector     | column vector      | 1-8        | 1B, 2B       | optional       | no
Convolutional    | row vector        | row vector         | 1          | 2B           | optional       | no
Gaussian Mix     | column vector     | column vector      | 1-8        | 1B           | no             | yes
Recurrent        | row vector        | row vector         | 1-8        | 1B, 2B       | required       | no
Copy             | row vector        | row vector         | 1-8        | N/A          | N/A            | no
Interleave       | row vector        | column vector      | 1-8        | N/A          | N/A            | no
Deinterleave     | column vector     | row vector         | 1-8        | N/A          | N/A            | no
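
The Interleave and Deinterleave rows amount to a batch transpose: a batch of row vectors (one frame per row) is rearranged into the column-vector orientation that affine layers consume, and back. The sketch below shows that rearrangement under an assumed contiguous int16 buffer layout; the hardware's exact buffer format is defined by the GNA documentation.

    #include <stdint.h>

    /* Interleave: batch of row vectors -> column-vector orientation. */
    void interleave(const int16_t *in, int16_t *out, int batch, int elems)
    {
        for (int b = 0; b < batch; b++)        /* rows: frames   */
            for (int e = 0; e < elems; e++)    /* cols: features */
                out[e * batch + b] = in[b * elems + e];
    }

    /* Deinterleave: the inverse transpose. */
    void deinterleave(const int16_t *in, int16_t *out, int batch, int elems)
    {
        for (int e = 0; e < elems; e++)
            for (int b = 0; b < batch; b++)
                out[b * elems + e] = in[e * batch + b];
    }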

How is GNA used?

Start with floating point neural network trained in framework of choice

Import using Intel® Deep Learning SDK Deployment Tool or Kaldi example

Link with Intel® Deep Learning SDK Inference Engine or GNA native library

GNA native library options

• Firmware API (for Intel® Quark™ running real-time operating system)

• Middleware API (for Intel Atom® and Intel® Core™ running Linux, Windows)

Complexity is hidden in layer descriptor list construction (a hypothetical descriptor sketch follows this list)

• Handled transparently by Intel® Deep Learning SDK model optimizer

• Full control via GNA native library API
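
As a purely hypothetical illustration of what a layer descriptor list might carry, here is a struct inferred from the memory-layout diagram earlier (descriptors referencing weights, biases, PWL segments, and I/O buffers). Every field name and type here is illustrative; the real structures come from the GNA library headers.

    #include <stdint.h>

    typedef struct {
        uint32_t layer_type;       /* affine, recurrent, copy, ...      */
        uint32_t n_inputs;         /* input vector length               */
        uint32_t n_outputs;        /* output vector length              */
        uint32_t batch_size;       /* 1-8 per the layer table           */
        const void *weights;       /* 1B or 2B integer weight matrix    */
        const void *biases;        /* bias vector                       */
        const void *pwl_segments;  /* optional PWL activation segments  */
        uint32_t n_pwl_segments;
        const void *input;         /* input buffer (or previous output) */
        void *output;              /* output buffer                     */
    } LayerDescriptor;             /* network = array of these in memory */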

[Pipeline diagram: Neural Network Training → Quantization & Graph Mapping → Inference Runtime. Tools per stage: Kaldi, TensorFlow, CNTK, Theano, Caffe → Intel DL SDK, GNA example code → Intel DL SDK, GNA native library]

API Function        | Description
--------------------|------------------------------------------------------------
GNADeviceOpen       | acquire handle to GNA device
GNADeviceClose      | release handle to GNA device
GNAAlloc            | allocate memory (and pin it so it cannot be swapped out)
GNAFree             | free GNA memory (after unpinning)
GNAPropagateForward | propagate inputs through all layers of the network
GNAWait             | wait until a propagation request completes or times out
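
The table implies a simple call sequence, sketched below. Only the six function names come from the slide; every prototype, argument, and constant in this sketch is an assumption for illustration, not the library's actual signatures.

    #include <stddef.h>

    /* Assumed prototypes -- the real ones live in the GNA headers. */
    void *GNADeviceOpen(void);
    void  GNADeviceClose(void *dev);
    void *GNAAlloc(void *dev, size_t bytes);
    void  GNAFree(void *dev, void *mem);
    int   GNAPropagateForward(void *dev, void *model);
    int   GNAWait(void *dev, int request, int timeout_ms);

    #define MODEL_BYTES (1u << 20)  /* illustrative model size */

    void run_inference_once(void)
    {
        void *dev = GNADeviceOpen();             /* acquire device handle */
        void *mem = GNAAlloc(dev, MODEL_BYTES);  /* pinned model memory   */

        /* ... write layer descriptors, weights, biases, PWL segments,
         *     and I/O buffers into mem, then copy the inputs in ...   */

        int req = GNAPropagateForward(dev, mem); /* start forward pass;
                                                    the CPU may now sleep */
        GNAWait(dev, req, 1000 /* ms timeout */);

        /* ... read layer outputs back from mem ... */

        GNAFree(dev, mem);                       /* unpin and free        */
        GNADeviceClose(dev);                     /* release device handle */
    }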

Intel® Deep Learning SDK Deployment Tool: enables full utilization of IA inference while abstracting the hardware from developers

Optimize:

• Imports trained models from all popular DL frameworks, regardless of training hardware

• Model canonicalization, compression, and quantization (see the sketch below)

Deploy:

• One API across all Intel hardware and systems

• Friendly inference solution: low footprint, easy API, control suitable for functional safety

• Optimizes inference execution per target hardware under the hood

Ease of use + embedded friendly + extra performance boost
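
For intuition, here is a minimal sketch of symmetric per-tensor quantization of the kind such a tool applies when mapping float weights to GNA's integer formats (2B weights shown). The scale rule is illustrative, not Intel's actual algorithm.

    #include <math.h>
    #include <stdint.h>

    /* Map float weights to int16 with a single symmetric scale;
     * returns the scale so activations can be dequantized later. */
    float quantize_q15(const float *w, int16_t *q, int n)
    {
        float maxabs = 0.0f;
        for (int i = 0; i < n; i++)
            if (fabsf(w[i]) > maxabs) maxabs = fabsf(w[i]);
        float scale = (maxabs > 0.0f) ? 32767.0f / maxabs : 1.0f;
        for (int i = 0; i < n; i++)
            q[i] = (int16_t)lrintf(w[i] * scale);  /* round to nearest */
        return scale;
    }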

[Diagram: (1) a trained model passes through the Optimize step (compress, quantize); (2) the Inference Engine dispatches to CPU, GPU, FPGA, GNA, …]

https://software.intel.com/en-us/deep-learning-sdk

What You Are Seeing: Live ASR Demonstration

Video playing TED talk by Aimee Mullins: "The opportunity of adversity"

Customer speech recognizer performing local real-time transcription

Acoustic model: 6-layer DNN with 2048 hidden nodes, trained on 3000 hours of data

Acoustic likelihood scoring is selectable:

• Native (customer optimized), GNA SW emulation (CPU), or GNA (HW)

Performance monitor shows drop in CPU residency with GNA HW

Corresponding drop in power consumption not shown

What You Are Seeing: Performance Demonstration

Real-time visualization of acoustic log-likelihoods

• DNN: additional layer added mapping outputs to phones

• Ordered by class: non-speech, unvoiced fricatives, voiced stops, unvoiced stops, voiced fricatives, liquids & glides, nasals, front vowels, mid vowels, back vowels, diphthongs

• LSTM: CTC-trained phone outputs in natural order

Repeating cycle: score on CPU, score on GNA, …

Shows difference in scoring speed between CPU and GNA HW

Batch scoring of a 7-layer, 2048-hidden-node DNN

Log-likelihoods color coded

CPU utilization in Perfmon

Scoring on GNA is 3x faster than a 1.1 GHz Atom CPU in this configuration**

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at [intel.com].

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel® Quark™, Intel Atom®, Intel® Core™, and Intel® Deep Learning SDK are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. For more information on benchmarking go to www.intel.com/benchmarks.

Offload saves about 50% of a 1.1 GHz Atom™ core in this configuration*

Switch from CPU to GNA here

Video playback with real-time transcription.

Recognition results match closed captioning.

[Perfmon trace annotations: alternating "Running on CPU" and "Running on GNA" regions]

[Visualization class labels: non-speech, fricatives, …, vowels]

*Note that other configurations (e.g., higher CPU clock, use of more CPU cores, etc.) may improve performance at cost of higher power consumption

**Note that other configurations (e.g., higher CPU clock, etc.) may have lower utilization benefit while retaining significant power reduction

Coming soon

[Color scale: low likelihood to high likelihood]