Jacek Czaja, Machine Learning Engineer, AI Product Group
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
4
Artificial Intelligence, Machine Learning & Deep Learning
Deep Learning use cases
Transport: Automated Driving
Finance: Financial forecasting
Finance: Customer support
Energy: Oil & gas search
Health: Pneumonia detection
Space: Lunar crater detection
Consumer: Speech/text search
[1]
Agriculture: Robotics
6
Why Now?
Bigger Data Better Hardware Smarter Algorithms
Image: 1000 KB / picture
Audio: 5000 KB / song
Video: 5,000,000 KB / movie
Transistor density doubles every 18 months
Cost / GB in 1995: $1000.00
Cost / GB in 2017: $0.02
Advances in algorithm innovation, including neural networks, leading to better accuracy in training models
7
Sharing
Companies share algorithms and topologies
Their gold is:
• Data
• Trained models
• Talent
8
Visual Understanding Research @ Intel Labs China

Innovate in cutting-edge visual cognition & machine learning technologies for smart computing, to enable novel usages and user experiences.

2D/3D Face & Emotion Engine
• Face analysis technology
• Multimodal emotion recognition
• …

Visual Parsing & Multimodal Analysis
• Automatic image/video captioning
• Visual question & answering
• …

Efficient DNN Design & Compression
• Efficient CNN algorithm design
• DNN model compression
• …
[Diagram: deep-learning-based visual recognition for face analysis & emotion recognition and visual parsing & multimodal analysis (FC layer + loss, 128-dim features)]
9
Machine Learning Types

Supervised: teach desired behavior with labeled data and infer on new data (labeled data → classified data)
Unsupervised: make inferences with unlabeled data and discover patterns (unlabeled data → clustered data)
Semi-supervised: a combination of supervised and unsupervised learning (labeled and unlabeled data → classified data)
Reinforcement: act in an environment to maximize reward; build autonomous agents that learn
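The supervised and unsupervised cases above can be contrasted in a few lines of NumPy. This is an illustrative sketch on made-up 2-D blob data (none of it from the deck): nearest-centroid classification stands in for supervised learning, and a k-means-style iteration for unsupervised clustering.

```python
import numpy as np

# Toy illustration: two well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])
y = np.array([0] * 50 + [1] * 50)   # labels: used only in the supervised case

# Supervised: learn class centroids from labeled data, then infer on new data.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(point):
    return int(np.argmin(np.linalg.norm(centroids - point, axis=1)))

# Unsupervised: discover the same structure from unlabeled data
# (k-means-style iteration; centers seeded from the data's extremes).
centers = np.stack([X.min(axis=0), X.max(axis=0)])
for _ in range(20):
    assign = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    centers = np.stack([X[assign == c].mean(axis=0) for c in (0, 1)])

pred = classify(np.array([4.8, 5.2]))   # a new, unlabeled point near blob_b
```

The supervised path needs `y` to build its centroids; the unsupervised path recovers essentially the same two groups without ever seeing a label.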
10
Training

data → Forward Propagation → output: 0.05, 0.10, 0.15, 0.20, … (person, cat, dog, bike)
expected: 0, 1, 0, …, 0 (person, cat, dog, bike)
penalty (error or cost) → Back Propagation
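The loop sketched on this slide fits in a few lines of NumPy: forward propagation produces class scores, a penalty (here, cross-entropy cost) compares them with the expected one-hot output, and back propagation turns that penalty into weight updates. The network size, learning rate, and data below are illustrative choices, not from the deck.

```python
import numpy as np

# Minimal single-layer softmax classifier trained on one example.
rng = np.random.default_rng(0)
n_features, n_classes = 8, 4                 # e.g. person / cat / dog / bike
W = rng.normal(scale=0.1, size=(n_features, n_classes))
x = rng.normal(size=(1, n_features))          # one training example
expected = np.array([[0.0, 1.0, 0.0, 0.0]])   # one-hot label: "cat"

penalties = []
for step in range(100):
    scores = x @ W                            # forward propagation
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # softmax output
    penalties.append(-np.log(p[0, 1]))        # penalty (error or cost)
    dscores = p - expected                    # back propagation:
    W -= 0.5 * (x.T @ dscores)                # gradient-descent weight update
```

Each iteration the penalty shrinks as the output distribution moves toward the expected one-hot vector.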
12
Inference

data → Forward Propagation → output: 0.02, 0.85, 0.07, …, 0.01 (person, cat, dog, bike)
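Inference reuses only the forward-propagation half of the loop: push data through the trained weights and read off the most probable class. Using this slide's example outputs:

```python
import numpy as np

# The slide's example forward-propagation outputs for one input image.
labels = ["person", "cat", "dog", "bike"]
output = np.array([0.02, 0.85, 0.07, 0.01])

prediction = labels[int(np.argmax(output))]   # -> "cat"
```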
AIPG Nervana Deep Learning Portfolio

DEEP LEARNING PLATFORM: Nervana Deep Learning Studio
• Data scientist and developer DL productivity tools
• DL cloud service for POCs, developers, and academics
• DL appliance for DLaaS

FRAMEWORKS: Frameworks; Titanium (HW mgmt.)
• Frameworks for developers
• Back-end APIs to Nervana Graph

PRODUCT SOFTWARE: Nervana Graph; MKL-DNN and other math libraries; HW transformers, non-x86 libraries
• Accelerate framework optimization on IA; open source
• For framework developers & Intel
• Multi-node optimizations
• Extend to non-DC inference products and use cases

PRODUCTS: datacenter; edge, client, gateway
• Comprehensive product portfolio
• General-purpose x86
• Dedicated DL NPU accelerators

SYSTEMS: Deep Learning Systems (node & rack reference designs, channel sales); Nervana Cloud (Intel branded)
• Enable direct and end customers with the Deep Learning System portfolio
• Intel branded under investigation

RESEARCH AND APPLICATION SUPPORT: Intel Brain Data Scientist Team; BDM & Direct Optimization Team
• Research new AI usages and models
• Develop POCs with customers to apply AI methods
• Enable customers to deploy products
14
15
AI Gateway/Edge
All purpose | Flexible acceleration | ADAS | Low-power vision

Intel® Processors: agile AI platforms. Range of performance and power for the widest variety of AI, gateway & edge workloads, including deep learning inference.

Intel® FPGA: enhanced DL inference. Acceleration for deep learning inference in real time with higher efficiency, across a wide range of workloads & configurations.

Mobileye EyeQ-5: autonomous driving. Real-time fused camera/radar inference, path planning, and road reconstruction in the vehicle.

Movidius Myriad-X: low-power computer vision. Computer vision engine using deep learning inference in gateways and devices.

*Knights Mill (KNM); select = single-precision highly-parallel workloads that generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth, e.g. energy (reverse time migration), deep learning training, etc. All products, computer systems, dates, and figures specified are preliminary based on current expectations and are subject to change without notice.
16
AI Datacenter
All purpose | Flexible acceleration | Deep learning

Intel® Xeon® Scalable Processors: known compute for AI. Scalable performance for the widest variety of AI & other datacenter workloads, including breakthrough deep learning training & inference.

Intel® FPGA: enhanced DL inference. Scalable acceleration for deep learning inference in real time with higher efficiency, across a wide range of workloads & configurations.

Intel® Nervana™ Neural Network Processor: deep learning by design. Scalable acceleration with best performance for intensive deep learning training & inference, period.

Most agile AI platform: Intel® Xeon® Scalable processor
• Built-in ROI: begin your AI journey today using existing, familiar infrastructure
• Potent performance: up to 2.2x deep learning training & inference performance vs. the prior generation¹; 113x with SW optimizations²
• Production-ready: robust support for the full range of AI deployments
1,2 Configuration details on slides 18, 20, 24. Source: Intel, measured as of November 2016.
Classic ML | Deep Learning | Reasoning | Emerging AI | Analytics | More
Scalable performance for the widest variety of AI & other datacenter workloads, including deep learning.
17
18
Performance Drivers for AI Workloads
Compute Bandwidth
SW Optimizations
19
Up to 3.4x Integer Matrix Multiply Performance on Intel® Xeon® Platinum 8180 Processor
Configuration details on slide 24. Source: Intel, measured as of June 2017.
Matrix multiply performance on Intel® Xeon® Platinum 8180 processor compared to Intel® Xeon® processor E5-2699 v4 (GEMM performance measured in GFLOPS, shown relative to a 1.0 baseline; higher is better):

| Workload | 1S Intel® Xeon® Processor E5-2699 v4 | 1S Intel® Xeon® Platinum 8180 Processor |
| SGEMM (FP32): single-precision floating-point general matrix multiply | 1.0 | 2.3 |
| IGEMM (INT8): integer general matrix multiply | 1.0 | 3.4 |

Enhanced matrix multiply performance on Intel® Xeon® Scalable Processors.
8-bit IGEMM will be available in Intel® Math Kernel Library (Intel® MKL) 2018 Gold, to be released by end of Q3 2017.
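Why an INT8 IGEMM can run faster than an FP32 SGEMM at a small accuracy cost can be sketched in NumPy. This illustrates the quantize-multiply-rescale pattern only; it is not Intel MKL's IGEMM API, and the per-tensor max-abs scaling is a simplifying assumption.

```python
import numpy as np

# Quantize FP32 operands to INT8, multiply with 32-bit accumulation
# (the structure a hardware IGEMM uses), then rescale back to FP32.
rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(64, 64)).astype(np.float32)
B = rng.uniform(-1.0, 1.0, size=(64, 64)).astype(np.float32)

scale_a = np.abs(A).max() / 127.0      # per-tensor scale (simplification)
scale_b = np.abs(B).max() / 127.0
A8 = np.round(A / scale_a).astype(np.int8)
B8 = np.round(B / scale_b).astype(np.int8)

# INT8 inputs, INT32 accumulators; rescale recovers the FP32 result.
C_int32 = A8.astype(np.int32) @ B8.astype(np.int32)
C_approx = C_int32.astype(np.float32) * (scale_a * scale_b)

rel_err = float(np.abs(C_approx - A @ B).max() / np.abs(A @ B).max())
```

The INT8 result is only approximately equal to the FP32 one, which is why the slides repeatedly note that INT8 inference throughput is higher while FP32 was used for the measured accuracy-sensitive numbers.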
20
AI Performance – Gen over Gen

INFERENCE THROUGHPUT: up to 2.4x higher Neon ResNet-18 inference throughput on the Intel® Xeon® Platinum 8180 processor compared to the Intel® Xeon® processor E5-2699 v4.

TRAINING THROUGHPUT: up to 2.2x higher Neon ResNet-18 training throughput on the Intel® Xeon® Platinum 8180 processor compared to the Intel® Xeon® processor E5-2699 v4.

Advance previous-generation AI workload performance with Intel® Xeon® Scalable Processors.
Inference throughput batch size: 1. Training throughput batch size: 256. Configuration details on slides 18, 20. Source: Intel, measured as of June 2017.
Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
21
AI Performance – Software + Hardware

INFERENCE THROUGHPUT: up to 138x higher inference throughput with Intel-optimized Caffe GoogleNet v1 and Intel® MKL on the Intel® Xeon® Platinum 8180 processor, compared to BVLC Caffe on the Intel® Xeon® processor E5-2699 v3.

TRAINING THROUGHPUT: up to 113x higher training throughput with Intel-optimized Caffe AlexNet and Intel® MKL on the Intel® Xeon® Platinum 8180 processor, compared to BVLC Caffe on the Intel® Xeon® processor E5-2699 v3.

Deliver significant AI performance with hardware and software optimizations on Intel® Xeon® Scalable Processors: optimized frameworks, optimized Intel® MKL libraries.
Inference using FP32. Batch sizes: Caffe GoogleNet v1 = 256, AlexNet = 256. Configuration details on slides 18, 25. Source: Intel, measured as of June 2017.
Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
22
Up to 2.4x Higher Inference Throughput on Intel® Xeon® Platinum 8180 Processor

Intel® Xeon® Platinum processors deliver inference throughput performance across different frameworks. Inference throughput shown in images/second: 2S Intel® Xeon® processor E5-2699 v4 (22C, 2.2 GHz) vs. 2S Intel® Xeon® Platinum 8180 processor (28C, 2.5 GHz):

| Framework | Topology | Batch size | E5-2699 v4 | Platinum 8180 |
| Caffe | AlexNet | 1024 | 1146 | 2656 |
| Caffe | GoogLeNet v1 | 1024 | 405 | 814 |
| Caffe | ResNet-50 | 1024 | 118 | 226 |
| Caffe | VGG-19 | 256 | 62 | 136 |
| TensorFlow | AlexNet ConvNet | 1024 | 2135 | 3382 |
| TensorFlow | GoogLeNet ConvNet | 1024 | 427 | 658 |
| TensorFlow | VGG ConvNet | 256 | 140 | 248 |
| MXNet | AlexNet | 1024 | 1093 | 2439 |
| MXNet | VGG-19 | 256 | 155 | 333 |
| MXNet | Inception V3 | 1024 | 164 | 250 |
| MXNet | ResNet-50 | 256 | 79 | 115 |
| Neon | AlexNet ConvNet | 1024 | 1305 | 2889 |
| Neon | GoogLeNet v1 ConvNet | 1024 | 445 | 1036 |
| Neon | ResNet-18 | 1024 | 286 | 672 |
Inference throughput measured with FP32 instructions. Inference with INT8 will be higher. Additional optimizations may further improve performance.
Source: Intel, measured as of June 2017.
23
Intel® Xeon® Scalable Processor Multi-node Performance
Source: Intel, measured as of August 2017.
ResNet-50 time to train (hours), weak-scaling minibatch, SKX-6148 / SKX-8180* (global minibatch scaled across nodes):

| Global minibatch (nodes) | Time to train (hours) |
| 32 (1 node) | 496.0 |
| 64 (2 nodes) | 247.5 |
| 128 (4 nodes) | 130.3 |
| 256 (8 nodes) | 62.9 |
| 512 (16 nodes) | 30.0 |
| 1024 (32 nodes) | 15.1 |
| 2048 (64 nodes) | 7.8 |
| 4096 (128 nodes) | 3.9 |
| 8192 (256 nodes) | 2.0 |
| 11264 (352 nodes, MB-32 per node) | 1.5 |
| 11264 (470 nodes, MB-24 per node) | 1.1 |
| 11264 (704 nodes, MB-16 per node) | 0.8 |
November 2017
24
https://arxiv.org/abs/1709.05011
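As a quick sanity check on the chart's endpoints (reading 496.0 hours on 1 node and 15.1 hours on 32 nodes off the slide), the weak-scaling speedup and efficiency work out as follows:

```python
# Weak-scaling figures read off this slide's ResNet-50 chart.
t_1_node = 496.0      # hours, global minibatch 32 (1 node)
t_32_nodes = 15.1     # hours, global minibatch 1024 (32 nodes)

speedup = t_1_node / t_32_nodes   # roughly 32.8x on 32x the nodes
efficiency = speedup / 32         # slightly above 1.0 as measured
```

Near-linear (here even slightly superlinear, as measured) scaling is what the chart is arguing; at the 11264-minibatch points the per-node minibatch shrinks (32, 24, 16), trading efficiency for further wall-clock reduction.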
Crest Family
Deep learning by design: scalable acceleration with best performance for intensive deep learning training & inference, period.

Custom hardware: unprecedented compute density; large reduction in time-to-train.
Blazing data access: 32 GB of in-package memory via HBM2 technology; 8 terabits/s of memory access speed.
High-speed scalability: 12 bidirectional high-bandwidth links; seamless data transfer via interconnects.

2017
Intel Nervana Lake Crest NPU Architecture

[Floorplan, not to scale: 12 processing clusters on an interposer; 4 HBM2 stacks, each with an HBM PHY and memory controller; 12 inter-chip links (ICL) plus ICC; PCIe x16 controller & DMA; management CPU; SPI, I2C, GPIO]
26
27
FlexPoint™ Numerical Format

Float16
• 11-bit mantissa precision (-1024 to 1023)
• Individual 5-bit exponents

Flex16
• 16-bit mantissa, 45% more precision than Float16 (-32,768 to 32,767)
• Tensor-wide shared 5-bit exponent

Example 3x3 tensor: as Float16, mantissas 929, -045, -195 / 935, -1011, 549 / -702, 923, 310 each carry their own exponent (DEC = 8, 7, 8 / 6, 7, 8 / 7, 6, 8); as Flex16, mantissas -13487, 29475, 22630 / 21964, -21581, 29857 / 29884, -26049, 30852 share a single exponent (DEC = 8).

Flex16 accuracy on par with Float32, but with much smaller cores.
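The shared-exponent idea can be prototyped in NumPy. This is a rough sketch of a Flex16-style block floating-point encode/decode to illustrate the concept; it is not Nervana's implementation, and the exponent-selection rule is an assumption.

```python
import numpy as np

def flex16_encode(x):
    # One shared exponent for the whole tensor: the smallest power of two
    # that keeps the largest magnitude inside the 16-bit mantissa range.
    exp = int(np.ceil(np.log2(np.abs(x).max() / 32767.0)))
    mantissa = np.round(x / 2.0 ** exp).astype(np.int16)
    return mantissa, exp

def flex16_decode(mantissa, exp):
    return mantissa.astype(np.float32) * 2.0 ** exp

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 3)).astype(np.float32)
mantissa, exp = flex16_encode(x)
x_hat = flex16_decode(mantissa, exp)
max_rel_err = float(np.abs(x_hat - x).max() / np.abs(x).max())
```

Because every element shares one exponent, multiply-accumulate on a Flex16 tensor reduces to integer arithmetic on the mantissas, which is where the "much smaller cores" claim comes from.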
29
Diversity in Deep Networks

Variety in network topology
▪ Recurrent NNs common for NLP/ASR, DAGs for GoogLeNet, networks with memory…

But there are a few well defined building blocks
▪ Convolutions common for image recognition tasks
▪ GEMMs for recurrent network layers (could be sparse)
▪ ReLU, tanh, softmax

[Figures: GoogLeNet, recurrent NN, CNN (AlexNet)]
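The "few well defined building blocks" listed above are small enough to write out directly. An illustrative NumPy sketch (all shapes are arbitrary):

```python
import numpy as np

def gemm(a, b):
    # Dense matrix multiply: the workhorse of recurrent and fully
    # connected layers.
    return a @ b

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

# One recurrent-style step followed by a softmax read-out.
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 16))                  # batch of hidden states
W_h = rng.normal(size=(16, 16))
W_out = rng.normal(size=(16, 4))

h_next = np.tanh(gemm(h, W_h))                # GEMM + tanh
probs = softmax(gemm(relu(h_next), W_out))    # ReLU + GEMM + softmax
```

This is why libraries like Intel MKL focus on a handful of primitives: fast GEMM, convolution, and a few elementwise nonlinearities cover most topologies.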
30
Naïve Convolution
https://en.wikipedia.org/wiki/Convolutional_neural_network
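In code, the naïve version is four nested loops doing one multiply-accumulate at a time (written as cross-correlation, the way deep learning frameworks implement "convolution"). Every output pixel re-reads the same input neighborhood, which is what makes it cache-unfriendly; the input and kernel below are toy values for illustration.

```python
import numpy as np

def naive_conv2d(image, kernel):
    # Valid convolution (no padding), one scalar multiply-accumulate
    # per innermost iteration.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            for di in range(kh):
                for dj in range(kw):
                    out[i, j] += image[i + di, j + dj] * kernel[di, dj]
    return out

img = np.arange(16, dtype=np.float64).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])      # toy diagonal-difference kernel
result = naive_conv2d(img, k)                # every entry is -5.0 for this img
```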
31
Cache Friendly Convolution
arxiv.org/pdf/1602.06709v1.pdf
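One common cache-friendlier formulation (in the spirit of the paper linked above, though the paper's own scheme is a direct convolution with blocked data layouts) lowers the convolution to a single GEMM via im2col: each input patch is gathered once into a contiguous row, after which the multiply runs with unit-stride access.

```python
import numpy as np

def im2col_conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # Gather every kh x kw patch into one row of an (oh*ow, kh*kw) matrix.
    cols = np.empty((oh * ow, kh * kw), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = image[i:i + kh, j:j + kw].ravel()
    # The convolution is now one dense matrix-vector product (a GEMM column).
    return (cols @ kernel.ravel()).reshape(oh, ow)

img = np.arange(16, dtype=np.float64).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])
result = im2col_conv2d(img, k)   # same output as the naive nested loops
```

The gather step costs extra memory traffic once, but the subsequent GEMM can then run at near-peak FLOPS with a tuned BLAS.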
Performance Optimization on Modern Platforms

Utilize all the cores
• OpenMP, MPI, TBB…
• Reduce synchronization events and serial code
• Improve load balancing

Vectorize / SIMD
• Unit-strided access per SIMD lane
• High vector efficiency
• Data alignment

Efficient memory/cache use
• Blocking
• Data reuse
• Prefetching
• Memory allocation

Hierarchical parallelism
• Fine-grained parallelism within a node. Sub-domain: 1) multi-level domain decomposition (e.g. across layers); 2) data decomposition (layer parallelism)
• Coarse-grained parallelism across nodes: domain decomposition

Scaling
• Improve load balancing
• Reduce synchronization events and all-to-all comms
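The "blocking / data reuse" item above is the key trick for GEMM-heavy deep learning code. A sketch of a cache-blocked matrix multiply; the block size is an illustrative choice, where a tuned kernel would pick it to fit the cache hierarchy:

```python
import numpy as np

def blocked_matmul(A, B, bs=32):
    # Tile the multiply so each (bs x bs) block of A and B is reused
    # from cache across a whole block of C, instead of streaming full
    # rows and columns from memory.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for k0 in range(0, k, bs):
                # One block-sized multiply; all three operands are small.
                C[i0:i0+bs, j0:j0+bs] += (A[i0:i0+bs, k0:k0+bs] @
                                          B[k0:k0+bs, j0:j0+bs])
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(96, 96))
B = rng.normal(size=(96, 96))
C = blocked_matmul(A, B)   # identical result to A @ B, different access order
```

Production libraries (Intel MKL, MKL-DNN) apply the same idea with register- and cache-level tiling plus prefetching.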
Deep Learning Frameworks
↓
Intel® MKL-DNN | Intel® Math Kernel Library (Intel® MKL)
↓
Xeon | Xeon Phi | FPGA

* GEMM matrix multiply building blocks are binary
Intel® MKL and Intel® MKL-DNN for Deep Learning

| Intel® MKL | Intel® MKL-DNN |
| DNN primitives + wide variety of other math functions | DNN primitives |
| C DNN APIs (C++ in the future) | C/C++ DNN APIs |
| Binary distribution | Open-source DNN code* |
| Free community license; premium support available as part of Parallel Studio XE | Apache 2.0 license |
| Broad-usage DNN primitives; not specific to individual frameworks | Multiple variants of DNN primitives as required for framework integrations |
| Quarterly update releases | Rapid development, ahead of Intel MKL releases |
Intel® Nervana™ Deep Learning Studio
Compress the Innovation Cycle to Accelerate Time-to-Solution

What it is: a comprehensive software suite that lets groups of data scientists shorten the innovation cycle and develop custom, enterprise-grade deep learning solutions in record time. Available as part of Intel® Nervana™ Cloud and the Intel® Nervana™ Deep Learning System.

Users: primarily data scientists; secondarily software developers who take trained deep learning models and integrate them into their applications.

Why it's important: developing a deep learning solution is both time-consuming and expensive, because costly data scientists spend too much time wrangling data and manually running hundreds of experiments to find the network topology and parameter combination that yields a converged model fitting their use case.

Learn more: intelnervana.com
Images
Video
Text
Speech
Tabular
Time series
Deep Learning Frameworks: Neon (more coming soon)
Intel® Nervana™ Deep Learning Studio
Intel® Nervana™ Hardware
35
High-Level Workflow

Data scientist: import dataset → label → build model (from the model library) → train → deploy to edge or cloud/server, yielding a trained model.

Multiple interface options: ncloud command-line interface, interactive notebooks, user interface.
36
Intel Developer Zone for Artificial Intelligence

Intel® Nervana™ AI Academy: deep learning frameworks, libraries, and additional tools; workshops, webinars, meetups & remote access.
software.intel.com/ai/academy
Intelnervana.com
39
[1] CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning