Dr. Ettikan Kandasamy Karuppiah

Director, Developers’ Ecosystem

2 November 2017

REAL WORLD PROBLEM SIMPLIFICATION USING DEEP LEARNING / AI

AI IS EVERYWHERE

“Find where I parked my car”

“Find the bag I just saw in this magazine”

“What movie should I watch next?”

TOUCHING OUR LIVES

Bringing a grandmother closer to her family by bridging the language barrier

Predicting a sick baby’s vitals, such as heart rate, blood pressure and survival rate

Enabling the blind to “see” their surroundings and read emotions on faces

AI FOR PUBLIC GOOD

Increasing public safety with smart video surveillance at airports & malls

Providing intelligent services in hotels, banks and stores

Separating weeds from crops during harvesting, reducing chemical usage by 90%

HOW A DEEP NEURAL NETWORK SEES

Image -> Raw data -> Low-level features -> Mid-level features -> High-level features -> “Audi A7”

KNOW YOUR PROBLEM WELL

AI MOMENTUM

TODAY

3,000 AI startups

BY 2020

85% of all customer service interactions will be powered by AI bots

$47B spend on AI technologies

20% of companies will dedicate workers to monitor and guide neural networks

EVERY INDUSTRY HAS AWOKEN TO AI

Organizations engaged with NVIDIA on Deep Learning: 1,549 (2014), 19,439 (2016)

Sectors: Higher Ed, Internet, Healthcare, Finance, Automotive, Government, Developer Tools, Others

RISE OF GPU COMPUTING

[Chart: performance over time, 1980 to 2020, log scale from 10^2 to 10^7. Single-threaded perf: 1.5X per year, slowing to 1.1X per year. GPU-Computing perf: 1.5X per year, 1000X by 2025. Layers: APPLICATIONS, SYSTEMS, ALGORITHMS, CUDA, ARCHITECTURE]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
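
The growth rates quoted on this chart can be sanity-checked with compound-growth arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the slide does not state the baseline year for the 1000X figure, so the code solves for the number of years instead of assuming one.

```python
import math


def years_to_reach(factor, rate_per_year):
    """Years of compound growth at rate_per_year needed to multiply by factor."""
    return math.log(factor) / math.log(rate_per_year)


def growth_after(years, rate_per_year):
    """Total multiplier after the given number of years of compound growth."""
    return rate_per_year ** years


# GPU-computing curve from the slide: 1.5X per year until 1000X is reached.
gpu_years = years_to_reach(1000, 1.5)

# Single-threaded curve from the slide: 1.1X per year over the same period.
cpu_gain = growth_after(gpu_years, 1.1)

print(f"1.5X per year reaches 1000X after about {gpu_years:.1f} years")
print(f"1.1X per year over the same period gives about {cpu_gain:.1f}X")
```

Under these rates, 1000X corresponds to roughly 17 years of 1.5X annual growth, while 1.1X per year over the same span yields only about 5X.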

Add GPUs: Accelerate Data Processing & Analytics

[Figure labels: GPU, CPU]

NVIDIA IGNITES THE AI BIG BANG

Artificial intelligence is the use of computers to simulate human intelligence.

AI amplifies our cognitive abilities, letting us solve problems where the complexity is too great, the information is incomplete, or the details are too subtle and require expert training.

Learning from data, a computer’s version of life experience, is how AI evolves.

GPU computing powers the computation required for deep neural networks to learn to recognize patterns from massive amounts of data.

This new computing model sparked the AI era.

NVIDIA DEEP LEARNING SDK
High Performance GPU-Acceleration for Deep Learning

VISION: Image Classification, Object Detection
SPEECH: Voice Recognition, Language Translation
BEHAVIOR: Recommendation Engines, Sentiment Analysis

DEEP LEARNING FRAMEWORKS: Mocha.jl

DEEP LEARNING: cuDNN
MATH LIBRARIES: cuBLAS, cuSPARSE, cuFFT
MULTI-GPU: NCCL

ANNOUNCING TESLA V100
GIANT LEAP FOR AI & HPC
VOLTA WITH NEW TENSOR CORE

21B xtors | TSMC 12nm FFN | 815 mm²

5,120 CUDA cores

7.5 FP64 TFLOPS | 15 FP32 TFLOPS

NEW 120 Tensor TFLOPS

20MB SM RF | 16MB Cache

16GB HBM2 @ 900 GB/s

300 GB/s NVLink

NEW TENSOR CORE

New CUDA TensorOp instructions & data formats

4x4 matrix processing array

D[FP32] = A[FP16] * B[FP16] + C[FP32]

Optimized for deep learning

Activation Inputs Weights Inputs Output Results
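
The mixed-precision operation D[FP32] = A[FP16] * B[FP16] + C[FP32] on a 4x4 tile can be emulated on a CPU to see what the data formats imply. This is a minimal sketch using only the Python standard library, not NVIDIA's implementation: the helper names are hypothetical, and rounding the accumulator to FP32 after every addition is an assumption made for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tensor_op_4x4(a, b, c):
    """Emulate D = A * B + C with FP16 inputs A, B and FP32 C and D."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # Inputs are quantized to FP16; the product of two FP16
                # values is exactly representable in FP32.
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                # Assumption: accumulate in FP32, rounding after each add.
                acc = to_fp32(acc + product)
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
weights = [[float(4 * i + j) for j in range(4)] for i in range(4)]
bias = [[1.0] * 4 for _ in range(4)]
result = tensor_op_4x4(identity, weights, bias)
print(result[1][2])
```

The point of the design is that the inputs take half the memory and bandwidth of FP32, while the wider accumulator limits the rounding error that would build up if sums were also kept in FP16.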

MODEL COMPLEXITY IS EXPLODING

2015, Microsoft ResNet: 7 ExaFLOPS, 60 Million Parameters

2016, Baidu Deep Speech 2: 20 ExaFLOPS, 300 Million Parameters

2017, Google NMT: 105 ExaFLOPS, 8.7 Billion Parameters
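
To get a feel for these figures, the total compute can be converted into wall-clock time at a given throughput. The sketch below assumes that "ExaFLOPS" on this slide means the total number of floating-point operations needed to train the model, and it uses the 15 FP32 TFLOPS peak quoted for Tesla V100 purely as an example rate. Real sustained throughput is lower than peak, so these are optimistic lower bounds, and the function name is made up for this example.

```python
SECONDS_PER_DAY = 86_400


def days_to_train(total_exaflops, sustained_tflops):
    """Days needed to execute total_exaflops * 1e18 operations
    at a constant rate of sustained_tflops * 1e12 operations per second."""
    total_operations = total_exaflops * 1e18
    operations_per_second = sustained_tflops * 1e12
    return total_operations / operations_per_second / SECONDS_PER_DAY


models = {
    "Microsoft ResNet (2015)": 7,
    "Baidu Deep Speech 2 (2016)": 20,
    "Google NMT (2017)": 105,
}

# Assumption: one device running at 15 TFLOPS with perfect utilization.
for name, exaflops in models.items():
    print(f"{name}: about {days_to_train(exaflops, 15):.1f} days")
```

Even under this idealized assumption, the 2017 model would need months on a single device at FP32 rates, which is the motivation for faster hardware and multi-GPU training.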

REVOLUTIONARY AI PERFORMANCE
3X Faster DL Training Performance

Over 80x DL Training Performance in 3 Years

[Chart: GoogLeNet Training Performance (Speedup vs K80), axis 0x to 100x. Bars: 1x K80 cuDNN2 (Q1 15), 4x M40 cuDNN3 (Q3 15), 8x P100 cuDNN6 (Q2 16), 8x V100 cuDNN7 (Q2 17)]

85% Scale-Out Efficiency
Scales to 64 GPUs with Microsoft Cognitive Toolkit

Multi-Node Training with NCCL 2.0 (ResNet-50)

8X P100: 18 Hours
8X V100: 7.4 Hours
64X V100: 1 Hour

ResNet50 Training for 90 Epochs with 1.28M images dataset | Using Caffe2 | V100 performance measured on pre-production hardware.

3X Reduction in Time to Train Over P100

LSTM Training (Neural Machine Translation)

2X CPU: 15 Days
1X P100: 18 Hours
1X V100: 6 Hours

Neural Machine Translation Training for 13 Epochs | German -> English, WMT15 subset | CPU = 2x Xeon E5 2699 V4 | V100 performance measured on pre-production hardware.
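
The training times on this slide can be turned into speedups and a scale-out efficiency with a few lines of arithmetic. This is a rough sketch using the rounded values printed on the charts. The function names are illustrative, and the efficiency computed here is not the slide's 85% figure, which is quoted for Microsoft Cognitive Toolkit and may come from a different measurement.

```python
def speedup(baseline_hours, new_hours):
    """How many times faster the new configuration is than the baseline."""
    return baseline_hours / new_hours


def scaling_efficiency(small_gpus, small_hours, large_gpus, large_hours):
    """Achieved speedup divided by the ideal (linear) speedup."""
    ideal = large_gpus / small_gpus
    return speedup(small_hours, large_hours) / ideal


# LSTM training (Neural Machine Translation), times from the slide.
v100_vs_p100 = speedup(18, 6)       # 1X P100 -> 1X V100
v100_vs_cpu = speedup(15 * 24, 6)   # 2X CPU (15 days) -> 1X V100

# ResNet-50 multi-node training, 8X V100 (7.4 h) -> 64X V100 (1 h).
resnet_efficiency = scaling_efficiency(8, 7.4, 64, 1.0)

print(f"V100 vs P100: {v100_vs_p100:.1f}x")
print(f"V100 vs CPU: {v100_vs_cpu:.0f}x")
print(f"8 -> 64 V100 scaling efficiency: {resnet_efficiency:.1%}")
```

The first ratio reproduces the slide's "3X Reduction in Time to Train Over P100" claim.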

Deep Learning Inferencing: TESLA V100 DELIVERS NEEDED RESPONSIVENESS WITH UP TO 99X MORE THROUGHPUT

Throughput (Images/Sec)

Model | CPU Server | Tesla V100
ResNet-50 | 47 i/s @ 21 ms latency | 4,647 i/s @ 7 ms latency
VGG-16 | 23 i/s @ 43 ms latency | 1,658 i/s @ 7 ms latency
GoogleNet | 136 i/s @ 7 ms latency | 3,270 i/s @ 7 ms latency

CPU: Xeon E5-2690 V4

GPU Servers: Dual Xeon E5-2690 [email protected] with 16GB PCIe GPUs, configs as shown. Ubuntu 14.04.5, CUDA 9.0.103, cuDNN 7.0.1.13; NCCL 2.0.4, TensorRT pre-release, data set: ImageNet. GPU: optimal batch size used to achieve 7 ms latency; CPU: batch size reduced to 1 if latency exceeds 7 ms.
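
The "up to 99X" headline follows directly from the per-model throughput figures on this slide. The short sketch below only restates that arithmetic; the dictionary layout and names are illustrative.

```python
# Images per second from the slide (CPU server vs Tesla V100).
results = {
    "ResNet-50": {"cpu": 47, "v100": 4647},
    "VGG-16": {"cpu": 23, "v100": 1658},
    "GoogleNet": {"cpu": 136, "v100": 3270},
}


def throughput_gain(cpu_images_per_sec, gpu_images_per_sec):
    """Ratio of GPU throughput to CPU throughput."""
    return gpu_images_per_sec / cpu_images_per_sec


gains = {
    model: throughput_gain(r["cpu"], r["v100"]) for model, r in results.items()
}
best_model = max(gains, key=gains.get)

for model, gain in gains.items():
    print(f"{model}: {gain:.1f}x")
print(f"Largest gain: {best_model}")
```

Note that the ResNet-50 and VGG-16 comparisons also differ in latency (21 ms and 43 ms on the CPU against 7 ms on the GPU), so the throughput ratio alone understates the difference in responsiveness.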

NVIDIA METROPOLIS: EDGE TO CLOUD AI CITY PLATFORM

Camera

CLOUD: Training and Inference

EDGE AND ON-PREMISES: Inference

DGX Appliance

Video recorder Server

JETSON TESLA/QUADRO

JETPACK, TENSOR RT, DEEPSTREAM

HPCG Performance Equivalency
Single GPU Server vs Multiple CPU-Only Servers

[Chart: # of CPU-only servers (axis 0 to 70) matched by 1 server with V100 GPUs (8x V100 GPUs shown). Speedup vs CPU server: 19x (19 CPU servers), 36x (37 CPU servers), 67x (67 CPU servers)]

CPU Server: Dual Xeon E5-2690 [email protected], GPU Servers: same CPU server w/ V100s PCIe. CUDA Version: CUDA 9.0.103; Dataset: 256x256x256 local size. To arrive at CPU node equivalence, we use measured benchmark with up to 8 CPU nodes. Then we use linear scaling to scale beyond 8 nodes.

HPCG Benchmark

Exercises computational and data access patterns that closely match a broad set of important HPC applications

VERSION: 3

ACCELERATED FEATURES: All

SCALABILITY: Multi-GPU and Multi-Node

MORE INFORMATION: http://www.hpcg-benchmark.org/index.html
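
The footnote on this slide describes how the CPU-server equivalence is derived: measured results for up to 8 CPU nodes, then linear scaling beyond 8. A minimal sketch of that procedure follows. The function name and all benchmark scores are hypothetical placeholders, not values from the slide, and the exact method NVIDIA used may differ in detail.

```python
def equivalent_cpu_nodes(measured, gpu_server_score):
    """Estimate how many CPU-only nodes match one GPU server.

    measured maps a CPU node count to the benchmark score measured
    on that many nodes. Within the measured range the smallest node
    count that reaches the GPU score is returned; beyond it, the
    per-node score at the largest measured point is scaled linearly.
    """
    for nodes in sorted(measured):
        if measured[nodes] >= gpu_server_score:
            return float(nodes)
    largest = max(measured)
    per_node_score = measured[largest] / largest
    return gpu_server_score / per_node_score


# Hypothetical CPU scores for 1, 2, 4 and 8 nodes (arbitrary units).
cpu_scores = {1: 10.0, 2: 19.8, 4: 39.0, 8: 76.0}

# Hypothetical GPU server score in the same units.
print(equivalent_cpu_nodes(cpu_scores, 636.5))
```

Linear extrapolation assumes the CPU cluster keeps scaling perfectly past 8 nodes, which tends to favor the CPU side of the comparison.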



REINVENTING OUR COMMUNITY

AUTOMOTIVE DEEP LEARNING

Deep Learning for Damage Estimation based on photo of the damage

Custom development for one of top 5 Insurance Companies

Total Data Available: 90,000 claims, ~380,000 car images

Scope of Work

Data Processed to Date*: 1,000 claims, 4,500 images

*Time lag in data migration, data clean-up and tagging images

Visualization of the activations of the convolutions over a car damage image

Visualization of the layers of a neural network and the back propagation process

Neural Network Demo

https://www.galacticar.ai/ -> Galaxy.ai : Artificial Intelligence Driven Sedan Damage Estimator

Business Model

Software as a Service

Annual fee + fixed fee every time the API is called

Integration with the insurer's mobile app

Use cases verified by the industry and insurance clients:

1) Claims triaging: whether the client should file a claim or not

2) Claims estimation from images of damaged sedans

(1) DEEP LEARNING IN FINANCE - TRADING

http://on-demand.gputechconf.com/gtc/2016/presentation/s6589-masahiko-todoriki-performance-improvement-algorithmic-trading.pdf

http://tradetechasia.wbresearch.com/ 20 Oct 2016

Generic Bigdata Operations

[Diagram: big data pipeline running on Virtualized Platform & Security Management and Data Warehousing/Storage (including Network Data)]

Data sources: Real-time Data Sources, Unstructured Data Sources, Structured Data Sources

Data Preparation: Data Acquisition, Data Staging, Data Cleansing, Data Quality & Integrity, Data Harmonization (Harmonization Terminologies, Harmonization & Ontology Mapping), Correction & Assurance, Scrambling & Protecting Fields/Data, Data Governance, Data Models & Schema

Checks: Check for different Representation and usage Ontology; Check for Correction Exception; Data Anonymity & Data Protection

Data Analytics: Structured Data Access, Data Modeling, Sentiment Analytics, Data Interpretation, Social Network Understanding, Real-time Data Ingestion, Network Analytics, Unstructured Data Collection, Data Statistics, Data ...

Data Dissemination: Published Data Marts/Lake, Data Visualization, Data Reporting & Visualization, Push & Pull Data Platform, Mobile Sharing, Tablet Sharing, PC Sharing

GPU DATABASES ARE EVEN FASTER
1.1 Billion Taxi Ride Benchmarks

Time in Milliseconds

System | Query 1 | Query 2 | Query 3 | Query 4
MapD DGX-1 | 21 | 80 | 150 | 372
MapD 4 x P100 | 30 | 99 | 269 | 696
Redshift 6-node | 1560 | 1250 | 2250 | 2970
Spark 11-node | 10190 | 8134 | 19624 | 85942

Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82

http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift-large-cluster.html

http://tech.marksblogg.com/billion-nyc-taxi-rides-spark-2-1-0-emr.html

Automated Crop Management

Silicon Valley-based Blue River Technology has developed a deep learning solution called LettuceBot that rolls through a field photographing 5,000 young plants a minute, using algorithms and machine vision to identify each sprout as lettuce or a weed. The company trained their neural network with GPUs and the Caffe deep learning framework.

Accurate within a quarter inch, the LettuceBot automatically pinpoints weeds, underdeveloped sprouts, and overplanted areas and then applies tiny doses of herbicide to maximize crop production.

Source : https://news.developer.nvidia.com/artificial-intelligence-helping-to-ensure-humanitys-future-food-supply/

https://developer.nvidia.com/deep-learning

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x

Artificial Intelligence Helps Identify Plant Species

Researchers from the Costa Rica Institute of Technology and French Agricultural Research Centre for International Development developed a deep learning algorithm to automatically identify plant specimens that have been pressed, dried and mounted on herbarium sheets.

According to the researchers, this is the first attempt to use deep learning to tackle the difficult taxonomic task of identifying species in natural-history collections.

Source : https://news.developer.nvidia.com/artificial-intelligence-helps-identify-plant-species-for-science/

HOW DOES IT WORK?

Train: whole slide image -> sample training data (normal, tumor) -> Convolutional Neural Network

Test: whole slide image -> overlapping image patches -> Convolutional Neural Network -> P(tumor) -> tumor prob. map (0.0 to 1.0)
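
The test-time flow on this slide (whole slide image, overlapping patches, P(tumor) per patch, probability map) can be sketched in plain Python. This is an illustration under stated assumptions, not the presenter's implementation: the classifier is a hypothetical stub standing in for the trained convolutional neural network, the image is a small list of lists rather than a real slide, and averaging the predictions of overlapping patches is one possible way to build the map.

```python
def patch_origins(length, patch, stride):
    """Start offsets of overlapping windows that cover [0, length)."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # make sure the far edge is covered
    return origins


def tumor_probability_map(image, patch, stride, classify):
    """Average per-patch P(tumor) over every pixel the patch covers."""
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in patch_origins(height, patch, stride):
        for left in patch_origins(width, patch, stride):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            probability = classify(tile)
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    total[y][x] += probability
                    count[y][x] += 1
    return [
        [total[y][x] / count[y][x] for x in range(width)]
        for y in range(height)
    ]


def mean_intensity_stub(tile):
    """Hypothetical stand-in for the CNN: mean pixel value in [0, 1]."""
    values = [value for row in tile for value in row]
    return sum(values) / len(values)


# Toy 4x6 "slide": left half normal (0.0), right half tumor (1.0).
slide = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]
prob_map = tumor_probability_map(slide, patch=4, stride=2,
                                 classify=mean_intensity_stub)
print(prob_map[0])
```

In a real pipeline the stub would be replaced by the trained network, and the patch size and stride would be chosen to balance resolution of the map against the number of patches to classify.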

SAFE AND SMART CITIES IS AN AI PROBLEM

1 billion installed security cameras WW (2020)

[Chart: installed security cameras, 2016 vs 2020, axis 0M to 1,000M]

30 billion frames per day

Challenging real world conditions

Traditional video analytics not trustworthy

AI achieves super human results

[Chart: Image Classification accuracy, 2010 to 2016. Series: Human, Hand-coded CV, Deep Learning. Labels: 74%, 9…]

AI driven intelligent video analytics

AI FOR VISUAL SEARCH IN MARKET PLACE

TESLA PLATFORM
Leading Data Center Platform for HPC and AI

APPLICATIONS & SERVICES: HPC (+400 More Applications); AI TRAINING & INFERENCE (Cognitive Services, Machine Learning Services)

INDUSTRY TOOLS: ECOSYSTEM TOOLS & LIBRARIES (C/C++); FRAMEWORKS; MODELS (ResNet, GoogleNet, AlexNet, DeepSpeech, Inception, BigLSTM)

NVIDIA SDK: COMPUTEWORKS (cuBLAS); DEEP LEARNING SDK (cuDNN, TensorRT, DeepStream SDK, NCCL)

TESLA GPU & SYSTEMS: TESLA GPU, NVLINK, SYSTEM OEM, CLOUD

DGX STACK
Fully integrated Deep Learning platform

Instant productivity: plug-and-play, supports every AI framework

Performance optimized across the entire stack

Docker/Container Development and Deployment model

Always up-to-date via the cloud

Mixed framework environments: containerized

Direct access to NVIDIA experts

DEEP LEARNING REQUIREMENTS

DEEP LEARNING NEEDS | DEEP LEARNING CHALLENGES | NVIDIA DELIVERS
Data Scientists | Demand far exceeds supply | DIGITS, DLI Training
Latest Algorithms | Rapidly evolving | DL SDK, GPU-Accelerated Frameworks
Fast Training | Impossible -> Practical | DGX-1, P100, P40, TITAN X
Deployment Platform | Must be available everywhere | TensorRT, P40, P4, Jetson, Drive PX

DEEP LEARNING SOFTWARE

https://developer.nvidia.com/deep-learning

NVIDIA DEEP LEARNING INSTITUTE

Training organizations and individuals to solve challenging problems using Deep Learning

On-site workshops and online courses presented by certified experts

Covering complete workflows for proven application use cases

Self-driving cars, recommendation engines, medical image classification, intelligent video analytics and more

Hands-on training for data scientists and software engineers

Nvidia AI Conference Singapore

23/24th October 2017

THANK YOU
