Deep learning in MATLAB - GTC On Demand · Deep learning in MATLAB From Concept to CUDA Code Roy...

28
1 © 2017 The MathWorks, Inc. Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics [email protected] 03-7660111 Ram Kokku Principal Engineer MathWorks [email protected]

Transcript of Deep learning in MATLAB - GTC On Demand · Deep learning in MATLAB From Concept to CUDA Code Roy...

  • 1© 2017 The MathWorks, Inc.

    Deep learning in MATLABFrom Concept to CUDA Code

    Roy Fahn

    Applications Engineer

    Systematics

    [email protected]

    03-7660111

    Ram Kokku

    Principal Engineer

    MathWorks

    [email protected]

  • 2

    Talk Outline

    Design Deep

    Learning & Vision

    Algorithms

    High Performance

    Deployment

    • Manage large image sets

    • Automate image labeling

    • Easy access to models

    • Pre-built training

    frameworks

    Automate compilation

    with GPU Coder

    On TitanXP: 7x faster than TensorFlow

    5x faster than pyCaffe2

    On Jetson: On par with TensorRT

    2x faster than C++-Caffe

    Accelerate and Scale

    Training

    • Acceleration with GPU’s

    • Scale to clusters

  • 3

    Example: Transfer Learning Workflow

    Transfer Learning

    Images

    New

    ClassifierLearn New

    Weights

    Modify

    Network

    Structure

    Load

    Reference

    NetworkLabels

    Training Data

    Labels: Cars, Trucks,

    BigTrucks, SUVs, Vans

  • 4

    Example: Transfer Learning in MATLAB

    Set up

    training

    dataset

    Split, shuffle, re-arrange images

    Read image, Data augmentation

    (clip, rotate, resize, etc)

    Easily manage large sets of images

    Single line of code to access images

    Operates on disk, database, big-data file system

  • 5

    Example: Transfer Learning in MATLAB

    Load

    Reference

    Network

    Set up

    training

    dataset

    Create DNNs in MATLAB

    1. Easy access to research models

    2. Caffe Model importer

    3. Build from scratch

  • 6

    Example: Transfer Learning in MATLAB

    Modify

    Network

    Structure

    Load

    Reference

    Network

    Set up

    training

    dataset

  • 7

    Example: Transfer Learning in MATLAB

    Modify

    Network

    Structure

    Load

    Reference

    Network

    Set up

    training

    dataset

  • 8

    Example: Transfer Learning in MATLAB

    Learn New

    Weights

    Modify

    Network

    Structure

    Load

    Reference

    Network

    Set up

    training

    dataset

    Many more training options

  • 9

    Deep learning on CPU, GPU, multi-GPU and clusters

    More GPUs

  • 10

    Deep learning on CPU, GPU, multi-GPU and clusters

    More GPUs

    Mo

    re C

    PU

    s

  • 11

    Deep learning on CPU, GPU, multi-GPU and clusters

    More GPUs

    Mo

    re C

    PU

    s

  • 13

    Visualizing and Debugging Intermediate Results

    Filters…

    Activations

    Deep Dream

    Training Accuracy Visualization Deep Dream

    Layer Activations Feature Visualization

    • Many options for visualizations and debugging• Examples to get started

  • 14

    GPU Coder for Deployment: New Product in R2017b

    Neural Networks

    Deep Learning, machine learning

    Image Processing and

    Computer Vision

    Image filtering, feature detection/extraction

    Signal Processing and

    Communications FFT, filtering, cross correlation,

    7x faster than state-of-art 700x faster than CPUs

    for feature extraction

    20x faster than

    CPUs for FFTs

    GPU Coder

    Accelerated implementation of

    parallel algorithms on GPUs

  • 15

    GPU Coder Compilation Flow

    GPU Coder

    CUDA Kernel creation

    Memory allocation

    Data transfer minimization

    • Library function mapping

    • Loop optimizations

    • Dependence analysis

    • Data locality analysis

    • GPU memory allocation

    • Data-dependence analysis

    • Dynamic memcpy reduction

  • 16

    Scalarized MATLAB

    GPU Coder Generates CUDA from MATLAB: saxpy

    CUDA kernel for GPU parallelization

    CUDA

    Vectorized MATLAB

    Loops and matrix operations are

    directly compiled in to kernels

  • 17

    Generated CUDA Optimized for Memory Performance

    Mandelbrot space

    CUDA kernel for GPU parallelization

    … …

    … …

    CUDA

    Kernel data allocation is

    automatically optimized

  • 20

    Algorithm Design to Embedded Deployment Workflow

    MATLAB algorithm

    (functional reference)

    Functional test1 Deployment

    unit-test

    2

    Desktop

    GPU

    C++

    Deployment

    integration-test

    3

    Desktop

    GPU

    C++

    Real-time test4

    Embedded GPU

    .mex .lib Cross-compiled

    .lib

    Build type

    Call CUDA

    from MATLAB

    directly

    Call CUDA from

    (C++) hand-

    coded main()

    Call CUDA from (C++)

    hand-coded main().

  • 21

    Demo: Alexnet Deployment with ‘mex’ Code Generation

  • 22

    Algorithm Design to Embedded Deployment on Tegra GPU

    MATLAB algorithm

    (functional reference)

    Functional test1

    (Test in MATLAB on host)

    Deployment

    unit-test

    2

    (Test generated code in

    MATLAB on host + GPU)

    Tesla

    GPU

    C++

    Deployment

    integration-test

    3

    (Test generated code within

    C/C++ app on host + GPU)

    Tesla

    GPU

    C++

    Real-time test4

    (Test generated code within

    C/C++ app on Tegra target)

    Tegra GPU

    .mex .lib Cross-compiled

    .lib

    Build type

    Call CUDA

    from MATLAB

    directly

    Call CUDA from

    (C++) hand-

    coded main()

    Call CUDA from (C++)

    hand-coded main().

    Cross-compiled on host

    with Linaro toolchain

  • 23

    Alexnet Deployment to Tegra: Cross-Compiled with ‘lib’

    Two small changes

    1. Change build-type to ‘lib’

    2. Select cross-compile toolchain

  • 24

    End-to-End Application: Lane Detection

    Transfer Learning

    Alexnet

    Lane detection

    CNN

    Post-processing

    (find left/right lane

    points)Image

    Image with

    marked lanes

    Left lane co-efficients

    Right lane co-efficients

    Output of CNN is lane parabola co-efficients according to: y = ax^2 + bx + c

    GPU coder generates code for whole application

    https://tinyurl.com/ybaxnxjg

    https://devblogs.nvidia.com/parallelforall/deep-learning-automated-driving-matlab/https://tinyurl.com/ybaxnxjg

  • 25

    How Good is Generated Code Performance

    Performance of image processing and computer vision

    Performance of CNN inference (Alexnet) on Titan XP GPU

    Performance of CNN inference (Alexnet) on Jetson (Tegra) TX2

  • 26

    GPU Coder for Image Processing and Computer Vision

    Distance

    transform

    Fog removal

    SURF feature

    extraction

    Ray tracing

    Stereo disparity

    Orders magnitude speedup over CPU

  • 27

    Alexnet Inference on NVIDIA Titan XP

    MATLAB GPU Coder

    (R2017b)

    TensorFlow (1.2.0)

    Caffe2 (0.8.1)

    Fra

    mes p

    er

    second

    Batch Size

    CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz

    GPU Pascal Titan Xp

    cuDNN v5

    Testing platform

    mxNet (0.10)

    MATLAB (R2017b)

    2x 7x5x

  • 28

    0

    50

    100

    150

    200

    250

    300

    350

    400

    1 16 32 64 128 256

    Alexnet Inference on Jetson TX2: Frame-Rate Performance

    MATLAB GPU Coder

    (R2017b)

    Fra

    me

    s p

    er

    se

    co

    nd

    Batch Size

    C++ Caffe

    (1.0.0-rc5)

    TensorRT (2.1)

    2x

    0.85x

  • 30

    Alexnet Inference on Jetson TX2: Memory PerformanceP

    ea

    k M

    em

    ory

    (M

    B)

    Batch Size

    MATLAB GPU Coder

    (R2017b)

    C++ Caffe

    (1.0.0-rc5)

    TensorRT 2.1

    (using giexec wrapper)

  • 31

    Design Your DNNs in MATLAB, Deploy with GPU Coder

    Design Deep

    Learning & Vision

    Algorithms

    High Performance

    Deployment

    Manage large image sets

    Automate image labeling

    Easy access to models

    Pre-built training

    frameworks

    Automate compilation

    with GPU Coder

    On TitanXP: 7x faster than TensorFlow

    5x faster than pyCaffe2

    On Jetson TX2: On par with TensorRT

    2x faster than C++-Caffe

    Accelerate and Scale

    Training

    Acceleration with GPU’s

    Scale to clusters

  • 32

    Check Out Deep Learning in MATLAB and GPU Coder

    GPU Coder

    Deep learning in MATLAB

    systematics.co.il\mwevents

    https://www.mathworks.com/products/gpu-coder.htmlhttps://www.mathworks.com/solutions/deep-learning.htmlhttp://www.systematics.co.il/mwevents