
CNN & cuDNN

Bin ZHOU

USTC, Jan. 2015

Acknowledgement

References:

1) Introducing NVIDIA® cuDNN – Sharan Chetlur, Software Engineer, CUDA Libraries and Algorithms Group

2) Multi-GPU Parallel Framework for Deep Convolutional Neural Networks (CNNs) and Its Application to Image Recognition -- http://data.qq.com/article?id=1516

CNN

Figure 1. ImageNet CNN Model

Recall BP Network

(Figure: neurons in layer l-1 fully connected to the neurons in layer l.)

BP Brief Review

• Cost (loss) function: evaluates the output of the network

• Common cost functions (standard forms below):

• MSE (Mean Squared Error)

• Cross-entropy
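For reference, the two losses in their standard form (notation assumed here rather than taken from the slide: $y_n$ is the network output for sample $n$, $t_n$ the target, and $N$ the number of samples):

$$
E_{\mathrm{MSE}} \;=\; \frac{1}{2N} \sum_{n=1}^{N} \lVert y_n - t_n \rVert^2,
\qquad
E_{\mathrm{CE}} \;=\; -\frac{1}{N} \sum_{n=1}^{N} \sum_{k} t_{nk} \log y_{nk}.
$$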

CNN Brief

Interpret an AI task as the evaluation of a complex function

Facial Recognition: Map a bunch of pixels to a name

Handwriting Recognition: Image to a character

Neural Network: a network of interconnected simple "neurons"

A neuron is typically made up of 2 stages:

Linear Transformation of data

Point-wise application of non-linear function

In a CNN, Linear Transformation is a convolution
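As a minimal sketch of those two stages in plain C++ (ReLU chosen here as the example non-linearity; the function name is illustrative):

#include <algorithm>
#include <cstddef>
#include <vector>

// One "neuron": a linear transformation of the inputs followed by a
// point-wise non-linear function (ReLU in this sketch).
float neuron(const std::vector<float>& w, const std::vector<float>& x, float b) {
    float z = b;
    for (std::size_t i = 0; i < x.size(); ++i)
        z += w[i] * x[i];          // stage 1: linear transformation
    return std::max(0.0f, z);      // stage 2: point-wise non-linearity
}

int main() {
    std::vector<float> w = {0.5f, -0.25f}, x = {1.0f, 2.0f};
    return neuron(w, x, 0.1f) > 0.0f ? 0 : 1;  // 0.5*1 - 0.25*2 + 0.1 = 0.1
}

In a CNN the same two stages apply, except that the linear transformation is a convolution rather than a dense dot product.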


cuDNN

Implementations of routines:

Convolution

Pooling

Softmax

Neuron activations, including: Sigmoid, Rectified linear (ReLU), Hyperbolic tangent (TANH)
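As one hedged example of calling these routines (this uses the modern descriptor-based cuDNN API, v5 and later; the v1 API current at the time of this talk passed the activation mode directly rather than through a descriptor):

#include <cudnn.h>
#include <cuda_runtime.h>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // One batch of activations: n=1 image, c=32 feature maps of 16x16.
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 32, 16, 16);

    float* data;
    cudaMalloc(&data, sizeof(float) * 32 * 16 * 16);

    // Describe the non-linearity; swap in CUDNN_ACTIVATION_SIGMOID or
    // CUDNN_ACTIVATION_TANH for the other activations listed above.
    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU,
                                 CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, data,
                           &beta, desc, data);  // in-place ReLU

    cudnnDestroyActivationDescriptor(act);
    cudaFree(data);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
    return 0;
}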


CNNs: Stacked Repeating Triplets

Convolution: filtering with a kernel; linear; strided

Activation: non-linear

Max-pooling: block-wise max, with a kernel and stride


Applications?

Can anyone enlighten me?

You can bring more brilliant applications


Multi-convolve Overview

Linear Transformation part of the CNN neuron

Main computational workload

80-90% of execution time

Generalization of the 2D convolution (a 4D tensor convolution)

Very compute intensive, therefore good for GPUs

However, not easy to implement efficiently


Multi-convolve, Pictorially

Good parallelism

Why do it once if you can do it n times? Batch the whole thing to get parallelism.


cuDNN: GPU-Accelerated CNN Library

Low-level library of GPU-accelerated routines; similar in intent to BLAS

Out-of-the-box speedup of Neural Networks

Developed and maintained by NVIDIA

Optimized for current and future NVIDIA GPU generations

First release focused on Convolutional Neural Networks


cuDNN Features

Flexible API: arbitrary dimension ordering, striding, and sub-regions for 4D tensors

Less memory, more performance: efficient forward and backward convolution routines with zero memory overhead

Easy integration: black-box implementation of convolution and other routines: ReLU, Sigmoid, Tanh, Pooling, Softmax


Tensor-4d: Important

Image batches described as a 4D Tensor[n, c, h, w] with stride support [nStride, cStride, hStride, wStride]

Allows flexible data layout

Easy access to subsets of features (Caffe's "groups")

Implicit cropping of sub-images

Plan to handle negative strides
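A minimal sketch of the stride mechanism (modern cuDNN API; the descriptor entry points in the 2015-era v1 release were named slightly differently). The second descriptor views only the first 48 feature maps of the same storage, which is how Caffe-style "groups" and sub-image cropping fall out of strides for free:

#include <cudnn.h>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // Fully packed NCHW batch: n=128 images, c=96 feature maps, 64x64 each.
    cudnnTensorDescriptor_t full;
    cudnnCreateTensorDescriptor(&full);
    cudnnSetTensor4dDescriptorEx(full, CUDNN_DATA_FLOAT,
                                 128, 96, 64, 64,               // n, c, h, w
                                 96 * 64 * 64, 64 * 64, 64, 1); // strides

    // Same storage, but only the first 48 feature maps of each image:
    // shrink c and keep the original strides.
    cudnnTensorDescriptor_t group;
    cudnnCreateTensorDescriptor(&group);
    cudnnSetTensor4dDescriptorEx(group, CUDNN_DATA_FLOAT,
                                 128, 48, 64, 64,
                                 96 * 64 * 64, 64 * 64, 64, 1);

    cudnnDestroyTensorDescriptor(group);
    cudnnDestroyTensorDescriptor(full);
    cudnnDestroy(handle);
    return 0;
}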

Example – OverFeat Layer 1
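The slide shows this as a worked figure; below is a hedged code sketch of the same layer against the modern cuDNN API (the 2015 v1 entry points differ). The layer parameters are assumed from OverFeat's fast model: 96 filters of 11x11 applied with stride 4 to 3x231x231 inputs, giving 96x56x56 outputs:

#include <cudnn.h>
#include <cuda_runtime.h>

int main() {
    const int N = 128, C = 3, H = 231, W = 231;      // input batch
    const int K = 96, R = 11, S = 11, U = 4, V = 4;  // filters and strides

    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               N, C, H, W);

    cudnnFilterDescriptor_t wDesc;
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               K, C, R, S);

    cudnnConvolutionDescriptor_t conv;
    cudnnCreateConvolutionDescriptor(&conv);
    cudnnSetConvolution2dDescriptor(conv, 0, 0, U, V, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // cuDNN derives the output size: p = q = (231 - 11) / 4 + 1 = 56.
    int n, k, p, q;
    cudnnGetConvolution2dForwardOutputDim(conv, xDesc, wDesc, &n, &k, &p, &q);

    cudnnTensorDescriptor_t yDesc;
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, k, p, q);

    float *x, *w, *y;
    cudaMalloc(&x, sizeof(float) * N * C * H * W);
    cudaMalloc(&w, sizeof(float) * K * C * R * S);
    cudaMalloc(&y, sizeof(float) * n * k * p * q);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, conv,
                            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                            nullptr, 0,  // this algorithm needs no workspace
                            &beta, yDesc, y);

    cudaFree(y); cudaFree(w); cudaFree(x);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyConvolutionDescriptor(conv);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}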


Real Code that Runs

Under Linux

Demonstration


Implementation 1: 2D Conv as a GEMV
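The slide body is a picture; the lowering its title names is standard and can be sketched as follows (assuming a valid, unit-dilation convolution with stride $(u, v)$ and an $R \times S$ filter $w$):

$$
y_{p,q} \;=\; \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} x_{\,pu+r,\;qv+s}\, w_{r,s}
\quad\Longrightarrow\quad
\mathbf{y} \;=\; \tilde{X}\,\mathrm{vec}(w),
$$

where row $(p,q)$ of $\tilde{X}$ is the unrolled $R \times S$ input patch for output position $(p,q)$: the whole 2D convolution collapses into one matrix-vector product (GEMV).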


Multi-convolve

More of the same, just a little different:

Longer dot products

More filter kernels

Batch of images, not just one

Mathematically:
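A standard way to write this (notation assumed: a batch of $N$ images with $C$ input channels, $K$ filters of size $R \times S$, stride $(u, v)$):

$$
y_{n,k,p,q} \;=\; \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} x_{n,c,\,pu+r,\,qv+s}\; w_{k,c,r,s}
$$

The dot products now run over $CRS$ elements (longer), there are $K$ of them per output position (more filter kernels), and everything repeats over $n$ (the batch).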


Implementation 2: Multi-convolve as GEMM
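In the same spirit as the GEMV lowering above (this slide, too, is pictorial), the batched form reshapes into a single matrix-matrix product:

$$
\underbrace{Y}_{(NPQ) \times K} \;=\; \underbrace{\tilde{X}}_{(NPQ) \times (CRS)} \;\; \underbrace{\tilde{W}}_{(CRS) \times K},
$$

i.e. one large GEMM instead of many small GEMVs, which is what makes the batched formulation so GPU-friendly.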


Performance


cuDNN Integration

cuDNN is already integrated into major open-source frameworks:

Caffe

Torch


Using Caffe with cuDNN

Accelerates Caffe layer types by 1.2x to 3x

On average, 36% faster overall for training on AlexNet

Integrated into the Caffe dev branch today! (official release with Caffe 1.0)

Seamless integration with a global switch

*CPU: 24-core E5-2697 v2 @ 2.4 GHz, Intel MKL 11.1.3

Caffe with cuDNN: No Programming Required

layers {
  name: "MyData"
  type: DATA
  top: "data"
  top: "label"
}

layers {
  name: "Conv1"
  type: CONVOLUTION
  bottom: "MyData"
  top: "Conv1"
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
  }
}

layers {
  name: "Conv2"
  type: CONVOLUTION
  bottom: "Conv1"
  top: "Conv2"
  convolution_param {
    num_output: 256
    kernel_size: 5
  }
}


Caffe with cuDNN: Life Is Easy

Install cuDNN

Uncomment the USE_CUDNN := 1 flag in Makefile.config when building Caffe.

Acceleration is automatic


NVIDIA® cuDNN Roadmap


cuDNN Availability

Free for registered developers!

Release 1 / Release 2 – RC

Available on Linux/Windows 64-bit

GPU support for Kepler and newer

Already done: Tegra K1 (Jetson board), Mac OS X support


Multi-GPU with CNN

Problem:

1) A single GPU has limited memory, which limits the size of the network

2) A single GPU is still too slow for some very large-scale networks


Multi-GPU Challenge

First, how to parallelize the whole process, to avoid or reduce data dependencies between different nodes

Data IO and distribution to the different nodes

Pipeline and IO/execution overlap to hide latency (see the stream sketch below)

Synchronization between all the nodes?
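As a hedged illustration of the IO/execution-overlap point, a minimal double-buffering sketch with CUDA streams (buffer sizes, names, and the empty kernel are illustrative, not code from the talk):

#include <cuda_runtime.h>

// Stand-in for one training step on a batch already resident on the GPU.
__global__ void train_step(float* batch, int n) {
    (void)batch; (void)n;  // forward/backward pass elided in this sketch
}

int main() {
    const size_t kBatchBytes = 1 << 20;
    const int kNumBatches = 16;

    // Pinned staging buffer: required for truly asynchronous copies.
    float* host;
    cudaMallocHost(&host, kBatchBytes);

    float* dev[2];
    cudaMalloc(&dev[0], kBatchBytes);
    cudaMalloc(&dev[1], kBatchBytes);

    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int b = 0; b < kNumBatches; ++b) {
        int s = b & 1;  // alternate buffers/streams (double buffering)
        // Copy batch b in one stream while batch b-1 still executes in the
        // other; copy engine and SMs work concurrently, hiding IO latency.
        cudaMemcpyAsync(dev[s], host, kBatchBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        train_step<<<64, 256, 0, stream[s]>>>(
            dev[s], (int)(kBatchBytes / sizeof(float)));
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(stream[1]); cudaStreamDestroy(stream[0]);
    cudaFree(dev[1]); cudaFree(dev[0]);
    cudaFreeHost(host);
    return 0;
}

(One staging buffer is reused across batches here for brevity; a real input pipeline would rotate staging buffers as well.)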


Multi-GPU Strategy


Data Distribution and IO/Exec Overlap


8-GPU Server


Pipeline and Stream Processing in CNN


Familiar? It's a DAG!


My Algorithm for DAG Auto-Parallelization
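The algorithm itself appears on the slide as a figure; as a hedged stand-in, here is a minimal sketch of the classic approach such schemes build on, Kahn-style level scheduling: repeatedly peel off the set of nodes whose dependencies are all satisfied, and launch each such "wave" concurrently (e.g., one CUDA stream per node):

#include <cstdio>
#include <vector>

// Group DAG nodes into waves: nodes within a wave are mutually independent
// and may execute in parallel. adj[u] lists the edges u -> v.
std::vector<std::vector<int>> waves(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> indeg(n, 0);
    for (const auto& out : adj)
        for (int v : out) ++indeg[v];

    std::vector<int> ready;
    for (int u = 0; u < n; ++u)
        if (indeg[u] == 0) ready.push_back(u);

    std::vector<std::vector<int>> result;
    while (!ready.empty()) {
        result.push_back(ready);  // one concurrently schedulable wave
        std::vector<int> next;
        for (int u : ready)
            for (int v : adj[u])
                if (--indeg[v] == 0) next.push_back(v);
        ready = std::move(next);
    }
    return result;
}

int main() {
    // Tiny CNN-like DAG: node 0 feeds two independent branches that rejoin.
    std::vector<std::vector<int>> adj = {{1, 2}, {3}, {3}, {}};
    int level = 0;
    for (const auto& wave : waves(adj)) {
        std::printf("wave %d:", level++);
        for (int u : wave) std::printf(" node %d", u);
        std::printf("\n");
    }
    return 0;  // prints: wave 0: node 0 / wave 1: nodes 1 2 / wave 2: node 3
}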


Test Case


A More Complex Case


Speed with Multi-GPUs

Configuration                      Speedup vs. 1 GPU
2 GPUs, model parallelism          1.71
2 GPUs, data parallelism           1.85
4 GPUs, data + model parallelism   2.52
4 GPUs, data parallelism           2.67


Conclusion

GPUs are very well suited to CNNs

cuDNN is easy to use and delivers good performance

Multi-GPU support is improving further

Carefully designed parallelism across multiple GPUs can achieve adequate scalability
