Jacek Czaja, Machine Learning Engineer, AI Product Group
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
4
Artificial Intelligence, Machine Learning & Deep Learning
Deep Learning use cases
Transport: Automated Driving
Finance: Financial forecasting
Finance: Customer support
Energy: Oil & gas search
Health: Pneumonia detection
Space: Lunar crater detection
Consumer: Speech/text search
[1]
Agriculture: Robotics
6
Why Now?
Bigger Data Better Hardware Smarter Algorithms
Image: 1000 KB / picture
Audio: 5000 KB / song
Video: 5,000,000 KB / movie
Transistor density doubles every 18 months
Cost / GB in 1995: $1000.00
Cost / GB in 2017: $0.02
Advances in algorithm innovation, including neural networks, leading to better accuracy in training models
7
Sharing
Companies share algorithms and topologies
Their gold is:
• Data
• Trained models
• Talent
8
Visual Understanding Research @ Intel Labs China

Innovate in cutting-edge visual cognition & machine learning technologies for smart computing, to enable novel usages and user experiences.

2D/3D Face & Emotion Engine
• Face analysis technology
• Multimodal emotion recognition
• …

Visual Parsing & Multimodal Analysis
• Automatic image/video captioning
• Visual question & answering
• …

Efficient DNN Design & Compression
• Efficient CNN algorithm design
• DNN model compression
• …
[Diagram: deep-learning-based visual recognition for face analysis & emotion recognition and visual parsing & multimodal analysis (FC layer + loss, 128-dim features)]
9
Machine Learning Types

Supervised: teach desired behavior with labeled data and infer on new data (labeled data → classified data)
Unsupervised: make inferences with unlabeled data and discover patterns (unlabeled data → clustered data)
Semi-supervised: a combination of supervised and unsupervised learning (labeled and unlabeled data → classified data)
Reinforcement: act in an environment to maximize reward; build autonomous agents that learn
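The supervised and unsupervised cases above can be contrasted in a few lines of NumPy. This is an illustrative sketch on made-up 2-D blob data (none of it from the deck): nearest-centroid classification stands in for supervised learning, and a k-means-style iteration for unsupervised clustering.

```python
import numpy as np

# Toy illustration: two well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])
y = np.array([0] * 50 + [1] * 50)   # labels: used only in the supervised case

# Supervised: learn class centroids from labeled data, then infer on new data.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(point):
    return int(np.argmin(np.linalg.norm(centroids - point, axis=1)))

# Unsupervised: discover the same structure from unlabeled data
# (k-means-style iteration; centers seeded from the data's extremes).
centers = np.stack([X.min(axis=0), X.max(axis=0)])
for _ in range(20):
    assign = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    centers = np.stack([X[assign == c].mean(axis=0) for c in (0, 1)])

pred = classify(np.array([4.8, 5.2]))   # a new, unlabeled point near blob_b
```

The supervised path needs `y` to build its centroids; the unsupervised path recovers essentially the same two groups without ever seeing a label.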
10
Training

data → Forward Propagation → output: 0.05, 0.10, 0.15, 0.20, … (person, cat, dog, bike)
expected: 0, 1, 0, …, 0 (person, cat, dog, bike)
penalty (error or cost) → Back Propagation
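The loop sketched on this slide fits in a few lines of NumPy: forward propagation produces class scores, a penalty (here, cross-entropy cost) compares them with the expected one-hot output, and back propagation turns that penalty into weight updates. The network size, learning rate, and data below are illustrative choices, not from the deck.

```python
import numpy as np

# Minimal single-layer softmax classifier trained on one example.
rng = np.random.default_rng(0)
n_features, n_classes = 8, 4                 # e.g. person / cat / dog / bike
W = rng.normal(scale=0.1, size=(n_features, n_classes))
x = rng.normal(size=(1, n_features))          # one training example
expected = np.array([[0.0, 1.0, 0.0, 0.0]])   # one-hot label: "cat"

penalties = []
for step in range(100):
    scores = x @ W                            # forward propagation
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # softmax output
    penalties.append(-np.log(p[0, 1]))        # penalty (error or cost)
    dscores = p - expected                    # back propagation:
    W -= 0.5 * (x.T @ dscores)                # gradient-descent weight update
```

Each iteration the penalty shrinks as the output distribution moves toward the expected one-hot vector.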
12
Inference

data → Forward Propagation → output: 0.02, 0.85, 0.07, …, 0.01 (person, cat, dog, bike)
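Inference reuses only the forward-propagation half of the loop: push data through the trained weights and read off the most probable class. Using this slide's example outputs:

```python
import numpy as np

# The slide's example forward-propagation outputs for one input image.
labels = ["person", "cat", "dog", "bike"]
output = np.array([0.02, 0.85, 0.07, 0.01])

prediction = labels[int(np.argmax(output))]   # -> "cat"
```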
AIPG Nervana Deep Learning Portfolio

DEEP LEARNING PLATFORM: Nervana Deep Learning Studio
• Data scientist and developer DL productivity tools
• DL cloud service for POCs, developers, and academics
• DL appliance for DLaaS

FRAMEWORKS: Frameworks; Titanium (HW mgmt.)
• Frameworks for developers
• Back-end APIs to Nervana Graph

PRODUCT SOFTWARE: Nervana Graph; MKL-DNN and other math libraries; HW transformers, non-x86 libraries
• Accelerate framework optimization on IA; open source
• For framework developers & Intel
• Multi-node optimizations
• Extend to non-DC inference products and use cases

PRODUCTS: datacenter; edge, client, gateway
• Comprehensive product portfolio
• General-purpose x86
• Dedicated DL NPU accelerators

SYSTEMS: Deep Learning Systems (node & rack reference designs, channel sales); Nervana Cloud (Intel branded)
• Enable direct and end customers with the Deep Learning System portfolio
• Intel branded under investigation

RESEARCH AND APPLICATION SUPPORT: Intel Brain Data Scientist Team; BDM & Direct Optimization Team
• Research new AI usages and models
• Develop POCs with customers to apply AI methods
• Enable customers to deploy products
14
15
AI Gateway/Edge
All purpose | Flexible acceleration | ADAS | Low-power vision

Intel® Processors: agile AI platforms. Range of performance and power for the widest variety of AI, gateway & edge workloads, including deep learning inference.

Intel® FPGA: enhanced DL inference. Acceleration for deep learning inference in real time with higher efficiency, across a wide range of workloads & configurations.

Mobileye EyeQ-5: autonomous driving. Real-time fused camera/radar inference, path planning, and road reconstruction in the vehicle.

Movidius Myriad-X: low-power computer vision. Computer vision engine using deep learning inference in gateways and devices.

*Knights Mill (KNM); select = single-precision highly-parallel workloads that generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth, e.g. energy (reverse time migration), deep learning training, etc. All products, computer systems, dates, and figures specified are preliminary based on current expectations and are subject to change without notice.
16
AI Datacenter
All purpose | Flexible acceleration | Deep learning

Intel® Xeon® Scalable Processors: known compute for AI. Scalable performance for the widest variety of AI & other datacenter workloads, including breakthrough deep learning training & inference.

Intel® FPGA: enhanced DL inference. Scalable acceleration for deep learning inference in real time with higher efficiency, across a wide range of workloads & configurations.

Intel® Nervana™ Neural Network Processor: deep learning by design. Scalable acceleration with best performance for intensive deep learning training & inference, period.

Most agile AI platform: Intel® Xeon® Scalable processor
• Built-in ROI: begin your AI journey today using existing, familiar infrastructure
• Potent performance: up to 2.2x deep learning training & inference performance vs. the prior generation¹; 113x with SW optimizations²
• Production-ready: robust support for the full range of AI deployments
1,2 Configuration details on slides 18, 20, 24. Source: Intel, measured as of November 2016.
Classic ML | Deep Learning | Reasoning | Emerging AI | Analytics | More
Scalable performance for the widest variety of AI & other datacenter workloads, including deep learning.
17
18
Performance Drivers for AI Workloads
Compute Bandwidth
SW Optimizations
19
Up to 3.4x Integer Matrix Multiply Performance on Intel® Xeon® Platinum 8180 Processor
Configuration details on slide 24. Source: Intel, measured as of June 2017.
Matrix multiply performance on Intel® Xeon® Platinum 8180 processor compared to Intel® Xeon® processor E5-2699 v4 (GEMM performance measured in GFLOPS, shown relative to a 1.0 baseline; higher is better):

| Workload | 1S Intel® Xeon® Processor E5-2699 v4 | 1S Intel® Xeon® Platinum 8180 Processor |
| SGEMM (FP32): single-precision floating-point general matrix multiply | 1.0 | 2.3 |
| IGEMM (INT8): integer general matrix multiply | 1.0 | 3.4 |

Enhanced matrix multiply performance on Intel® Xeon® Scalable Processors.
8-bit IGEMM will be available in Intel® Math Kernel Library (Intel® MKL) 2018 Gold, to be released by end of Q3 2017.
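Why an INT8 IGEMM can run faster than an FP32 SGEMM at a small accuracy cost can be sketched in NumPy. This illustrates the quantize-multiply-rescale pattern only; it is not Intel MKL's IGEMM API, and the per-tensor max-abs scaling is a simplifying assumption.

```python
import numpy as np

# Quantize FP32 operands to INT8, multiply with 32-bit accumulation
# (the structure a hardware IGEMM uses), then rescale back to FP32.
rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(64, 64)).astype(np.float32)
B = rng.uniform(-1.0, 1.0, size=(64, 64)).astype(np.float32)

scale_a = np.abs(A).max() / 127.0      # per-tensor scale (simplification)
scale_b = np.abs(B).max() / 127.0
A8 = np.round(A / scale_a).astype(np.int8)
B8 = np.round(B / scale_b).astype(np.int8)

# INT8 inputs, INT32 accumulators; rescale recovers the FP32 result.
C_int32 = A8.astype(np.int32) @ B8.astype(np.int32)
C_approx = C_int32.astype(np.float32) * (scale_a * scale_b)

rel_err = float(np.abs(C_approx - A @ B).max() / np.abs(A @ B).max())
```

The INT8 result is only approximately equal to the FP32 one, which is why the slides repeatedly note that INT8 inference throughput is higher while FP32 was used for the measured accuracy-sensitive numbers.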
20
AI Performance – Gen over Gen

INFERENCE THROUGHPUT: up to 2.4x higher Neon ResNet-18 inference throughput on the Intel® Xeon® Platinum 8180 processor compared to the Intel® Xeon® processor E5-2699 v4.

TRAINING THROUGHPUT: up to 2.2x higher Neon ResNet-18 training throughput on the Intel® Xeon® Platinum 8180 processor compared to the Intel® Xeon® processor E5-2699 v4.

Advance previous-generation AI workload performance with Intel® Xeon® Scalable Processors.
Inference throughput batch size: 1. Training throughput batch size: 256. Configuration details on slides 18, 20. Source: Intel, measured as of June 2017.
Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
21
AI Performance – Software + Hardware

INFERENCE THROUGHPUT: up to 138x higher inference throughput with Intel-optimized Caffe GoogleNet v1 and Intel® MKL on the Intel® Xeon® Platinum 8180 processor, compared to BVLC Caffe on the Intel® Xeon® processor E5-2699 v3.

TRAINING THROUGHPUT: up to 113x higher training throughput with Intel-optimized Caffe AlexNet and Intel® MKL on the Intel® Xeon® Platinum 8180 processor, compared to BVLC Caffe on the Intel® Xeon® processor E5-2699 v3.

Deliver significant AI performance with hardware and software optimizations on Intel® Xeon® Scalable Processors: optimized frameworks, optimized Intel® MKL libraries.
Inference using FP32. Batch sizes: Caffe GoogleNet v1 = 256, AlexNet = 256. Configuration details on slides 18, 25. Source: Intel, measured as of June 2017.
Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
22
Up to 2.4x Higher Inference Throughput on Intel® Xeon® Platinum 8180 Processor

Intel® Xeon® Platinum processors deliver inference throughput performance across different frameworks. Inference throughput shown in images/second: 2S Intel® Xeon® processor E5-2699 v4 (22C, 2.2 GHz) vs. 2S Intel® Xeon® Platinum 8180 processor (28C, 2.5 GHz):

| Framework | Topology | Batch size | E5-2699 v4 | Platinum 8180 |
| Caffe | AlexNet | 1024 | 1146 | 2656 |
| Caffe | GoogLeNet v1 | 1024 | 405 | 814 |
| Caffe | ResNet-50 | 1024 | 118 | 226 |
| Caffe | VGG-19 | 256 | 62 | 136 |
| TensorFlow | AlexNet ConvNet | 1024 | 2135 | 3382 |
| TensorFlow | GoogLeNet ConvNet | 1024 | 427 | 658 |
| TensorFlow | VGG ConvNet | 256 | 140 | 248 |
| MXNet | AlexNet | 1024 | 1093 | 2439 |
| MXNet | VGG-19 | 256 | 155 | 333 |
| MXNet | Inception V3 | 1024 | 164 | 250 |
| MXNet | ResNet-50 | 256 | 79 | 115 |
| Neon | AlexNet ConvNet | 1024 | 1305 | 2889 |
| Neon | GoogLeNet v1 ConvNet | 1024 | 445 | 1036 |
| Neon | ResNet-18 | 1024 | 286 | 672 |
Inference throughput measured with FP32 instructions. Inference with INT8 will be higher. Additional optimizations may further improve performance.
Source: Intel, measured as of June 2017.
23
Intel® Xeon® Scalable Processor Multi-node Performance
Source: Intel, measured as of August 2017.
ResNet-50 time to train (hours), weak-scaling minibatch, SKX-6148 / SKX-8180* (global minibatch scaled across nodes):

| Global minibatch (nodes) | Time to train (hours) |
| 32 (1 node) | 496.0 |
| 64 (2 nodes) | 247.5 |
| 128 (4 nodes) | 130.3 |
| 256 (8 nodes) | 62.9 |
| 512 (16 nodes) | 30.0 |
| 1024 (32 nodes) | 15.1 |
| 2048 (64 nodes) | 7.8 |
| 4096 (128 nodes) | 3.9 |
| 8192 (256 nodes) | 2.0 |
| 11264 (352 nodes, MB-32 per node) | 1.5 |
| 11264 (470 nodes, MB-24 per node) | 1.1 |
| 11264 (704 nodes, MB-16 per node) | 0.8 |
November 2017
24
https://arxiv.org/abs/1709.05011
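As a quick sanity check on the chart's endpoints (reading 496.0 hours on 1 node and 15.1 hours on 32 nodes off the slide), the weak-scaling speedup and efficiency work out as follows:

```python
# Weak-scaling figures read off this slide's ResNet-50 chart.
t_1_node = 496.0      # hours, global minibatch 32 (1 node)
t_32_nodes = 15.1     # hours, global minibatch 1024 (32 nodes)

speedup = t_1_node / t_32_nodes   # roughly 32.8x on 32x the nodes
efficiency = speedup / 32         # slightly above 1.0 as measured
```

Near-linear (here even slightly superlinear, as measured) scaling is what the chart is arguing; at the 11264-minibatch points the per-node minibatch shrinks (32, 24, 16), trading efficiency for further wall-clock reduction.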
Crest Family
Deep learning by design: scalable acceleration with best performance for intensive deep learning training & inference, period.

Custom hardware: unprecedented compute density; large reduction in time-to-train.
Blazing data access: 32 GB of in-package memory via HBM2 technology; 8 terabits/s of memory access speed.
High-speed scalability: 12 bidirectional high-bandwidth links; seamless data transfer via interconnects.

2017
Intel Nervana Lake Crest NPU Architecture

[Floorplan, not to scale: 12 processing clusters on an interposer; 4 HBM2 stacks, each with an HBM PHY and memory controller; 12 inter-chip links (ICL) plus ICC; PCIe x16 controller & DMA; management CPU; SPI, I2C, GPIO]
26
27
FlexPoint™ Numerical Format

Float16
• 11-bit mantissa precision (-1024 to 1023)
• Individual 5-bit exponents

Flex16
• 16-bit mantissa, 45% more precision than Float16 (-32,768 to 32,767)
• Tensor-wide shared 5-bit exponent

Example 3x3 tensor: as Float16, mantissas 929, -045, -195 / 935, -1011, 549 / -702, 923, 310 each carry their own exponent (DEC = 8, 7, 8 / 6, 7, 8 / 7, 6, 8); as Flex16, mantissas -13487, 29475, 22630 / 21964, -21581, 29857 / 29884, -26049, 30852 share a single exponent (DEC = 8).

Flex16 accuracy on par with Float32, but with much smaller cores.
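The shared-exponent idea can be prototyped in NumPy. This is a rough sketch of a Flex16-style block floating-point encode/decode to illustrate the concept; it is not Nervana's implementation, and the exponent-selection rule is an assumption.

```python
import numpy as np

def flex16_encode(x):
    # One shared exponent for the whole tensor: the smallest power of two
    # that keeps the largest magnitude inside the 16-bit mantissa range.
    exp = int(np.ceil(np.log2(np.abs(x).max() / 32767.0)))
    mantissa = np.round(x / 2.0 ** exp).astype(np.int16)
    return mantissa, exp

def flex16_decode(mantissa, exp):
    return mantissa.astype(np.float32) * 2.0 ** exp

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 3)).astype(np.float32)
mantissa, exp = flex16_encode(x)
x_hat = flex16_decode(mantissa, exp)
max_rel_err = float(np.abs(x_hat - x).max() / np.abs(x).max())
```

Because every element shares one exponent, multiply-accumulate on a Flex16 tensor reduces to integer arithmetic on the mantissas, which is where the "much smaller cores" claim comes from.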
29
Diversity in Deep Networks

Variety in network topology
▪ Recurrent NNs common for NLP/ASR, DAGs for GoogLeNet, networks with memory…

But there are a few well defined building blocks
▪ Convolutions common for image recognition tasks
▪ GEMMs for recurrent network layers (could be sparse)
▪ ReLU, tanh, softmax

[Figures: GoogLeNet, recurrent NN, CNN (AlexNet)]
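The "few well defined building blocks" listed above are small enough to write out directly. An illustrative NumPy sketch (all shapes are arbitrary):

```python
import numpy as np

def gemm(a, b):
    # Dense matrix multiply: the workhorse of recurrent and fully
    # connected layers.
    return a @ b

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

# One recurrent-style step followed by a softmax read-out.
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 16))                  # batch of hidden states
W_h = rng.normal(size=(16, 16))
W_out = rng.normal(size=(16, 4))

h_next = np.tanh(gemm(h, W_h))                # GEMM + tanh
probs = softmax(gemm(relu(h_next), W_out))    # ReLU + GEMM + softmax
```

This is why libraries like Intel MKL focus on a handful of primitives: fast GEMM, convolution, and a few elementwise nonlinearities cover most topologies.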
30
Naïve Convolution
https://en.wikipedia.org/wiki/Convolutional_neural_network
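In code, the naïve version is four nested loops doing one multiply-accumulate at a time (written as cross-correlation, the way deep learning frameworks implement "convolution"). Every output pixel re-reads the same input neighborhood, which is what makes it cache-unfriendly; the input and kernel below are toy values for illustration.

```python
import numpy as np

def naive_conv2d(image, kernel):
    # Valid convolution (no padding), one scalar multiply-accumulate
    # per innermost iteration.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            for di in range(kh):
                for dj in range(kw):
                    out[i, j] += image[i + di, j + dj] * kernel[di, dj]
    return out

img = np.arange(16, dtype=np.float64).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])      # toy diagonal-difference kernel
result = naive_conv2d(img, k)                # every entry is -5.0 for this img
```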
31
Cache Friendly Convolution
arxiv.org/pdf/1602.06709v1.pdf
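One common cache-friendlier formulation (in the spirit of the paper linked above, though the paper's own scheme is a direct convolution with blocked data layouts) lowers the convolution to a single GEMM via im2col: each input patch is gathered once into a contiguous row, after which the multiply runs with unit-stride access.

```python
import numpy as np

def im2col_conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # Gather every kh x kw patch into one row of an (oh*ow, kh*kw) matrix.
    cols = np.empty((oh * ow, kh * kw), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = image[i:i + kh, j:j + kw].ravel()
    # The convolution is now one dense matrix-vector product (a GEMM column).
    return (cols @ kernel.ravel()).reshape(oh, ow)

img = np.arange(16, dtype=np.float64).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])
result = im2col_conv2d(img, k)   # same output as the naive nested loops
```

The gather step costs extra memory traffic once, but the subsequent GEMM can then run at near-peak FLOPS with a tuned BLAS.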
Performance Optimization on Modern Platforms

Utilize all the cores
• OpenMP, MPI, TBB…
• Reduce synchronization events and serial code
• Improve load balancing

Vectorize / SIMD
• Unit-strided access per SIMD lane
• High vector efficiency
• Data alignment

Efficient memory/cache use
• Blocking
• Data reuse
• Prefetching
• Memory allocation

Hierarchical parallelism
• Fine-grained parallelism within a node. Sub-domain: 1) multi-level domain decomposition (e.g. across layers); 2) data decomposition (layer parallelism)
• Coarse-grained parallelism across nodes: domain decomposition

Scaling
• Improve load balancing
• Reduce synchronization events and all-to-all comms
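The "blocking / data reuse" item above is the key trick for GEMM-heavy deep learning code. A sketch of a cache-blocked matrix multiply; the block size is an illustrative choice, where a tuned kernel would pick it to fit the cache hierarchy:

```python
import numpy as np

def blocked_matmul(A, B, bs=32):
    # Tile the multiply so each (bs x bs) block of A and B is reused
    # from cache across a whole block of C, instead of streaming full
    # rows and columns from memory.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for k0 in range(0, k, bs):
                # One block-sized multiply; all three operands are small.
                C[i0:i0+bs, j0:j0+bs] += (A[i0:i0+bs, k0:k0+bs] @
                                          B[k0:k0+bs, j0:j0+bs])
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(96, 96))
B = rng.normal(size=(96, 96))
C = blocked_matmul(A, B)   # identical result to A @ B, different access order
```

Production libraries (Intel MKL, MKL-DNN) apply the same idea with register- and cache-level tiling plus prefetching.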
Deep Learning Frameworks
↓
Intel® MKL-DNN | Intel® Math Kernel Library (Intel® MKL)
↓
Xeon | Xeon Phi | FPGA

* GEMM matrix multiply building blocks are binary
Intel® MKL and Intel® MKL-DNN for Deep Learning

| Intel® MKL | Intel® MKL-DNN |
| DNN primitives + wide variety of other math functions | DNN primitives |
| C DNN APIs (C++ in the future) | C/C++ DNN APIs |
| Binary distribution | Open-source DNN code* |
| Free community license; premium support available as part of Parallel Studio XE | Apache 2.0 license |
| Broad-usage DNN primitives; not specific to individual frameworks | Multiple variants of DNN primitives as required for framework integrations |
| Quarterly update releases | Rapid development, ahead of Intel MKL releases |
Intel® Nervana™ Deep Learning Studio
Compress the Innovation Cycle to Accelerate Time-to-Solution

What it is: a comprehensive software suite that lets groups of data scientists shorten the innovation cycle and develop custom, enterprise-grade deep learning solutions in record time. Available as part of Intel® Nervana™ Cloud and the Intel® Nervana™ Deep Learning System.

Users: primarily data scientists; secondarily software developers who take trained deep learning models and integrate them into their applications.

Why it's important: developing a deep learning solution is both time-consuming and expensive, because costly data scientists spend too much time wrangling data and manually running hundreds of experiments to find the network topology and parameter combination that yields a converged model fitting their use case.

Learn more: intelnervana.com
Images
Video
Text
Speech
Tabular
Time series
Deep Learning Frameworks: Neon (more coming soon)
Intel® Nervana™ Deep Learning Studio
Intel® Nervana™ Hardware
35
High-Level Workflow

Data scientist: import dataset → label → build model (from the model library) → train → deploy to edge or cloud/server, yielding a trained model.

Multiple interface options: ncloud command-line interface, interactive notebooks, user interface.
36
Intel Developer Zone for Artificial Intelligence

Intel® Nervana™ AI Academy: deep learning frameworks, libraries, and additional tools; workshops, webinars, meetups & remote access.
software.intel.com/ai/academy
Intelnervana.com
39
[1] CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning