A GENTLE INTRODUCTION TO AI

Miguel Martínez - Solution Architect

AI, ML, DL

[Figure: timeline, 1950 to 2010]

vehicle

A NEW COMPUTING MODEL

Algorithms that Learn from Examples

Traditional Approach

Requires domain experts

Time consuming

Error prone

Not scalable to new problems

Expert Written

Computer

Program

vehicle

Deep Learning Approach

Learn from data

Easy to extend

Speedup with GPUs

CATS & DOGS

TRAINING

Learning a new capability from existing data

Deep Learning Framework

Untrained Neural Network

Trained Model (new capability)

INFERENCE

Applying this capability to new data

Application or Service

Trained Model (optimized for performance)

ORIGIN OF NEURAL NETWORKS

Biologically inspired computational units

Input

Output

dendrites

impulses carried toward cell body

impulses carried away from cell body

axon terminals

nucleus

cell body

A SIMPLE NEURON

Neurons apply weights to inputs to create output

Input, Neuron, Output
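The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the deck: the sigmoid activation, the bias term, and the example numbers are all assumptions made for the sketch.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, passed through a sigmoid (assumed activation)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs and weights, chosen only for illustration.
output = neuron(inputs=[1.0, 0.5], weights=[0.4, -0.2], bias=0.1)
print(round(output, 4))
```

Training would adjust the weights and bias; here they are fixed by hand to show only how an output is produced from the inputs.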

COMBINING NEURONS

Stacking neurons and layers creates a more powerful model

Additional neurons can be added to create a layer.

Multiple layers can also be added, resulting in input, hidden, and output layers.

Expanding the neural network size creates additional predictive power.

In feed forward neural networks, neurons are fully connected to surrounding layers.

Output

Hidden

Layers
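A fully connected feed-forward pass can be sketched as follows. This is a plain-Python illustration under stated assumptions: the 2-3-1 layer sizes, the sigmoid activation, and every weight value are hypothetical and not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """Fully connected layer: every neuron receives every input."""
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, neuron_weights)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def feed_forward(inputs, layers):
    """Pass the inputs through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs, one hidden layer of 3 neurons, 1 output neuron.
hidden = ([[0.5, -0.5], [0.3, 0.8], [-0.7, 0.2]], [0.0, 0.1, -0.1])
output = ([[1.0, -1.0, 0.5]], [0.0])
prediction = feed_forward([1.0, 2.0], [hidden, output])
print(prediction)
```

Adding more entries to the `layers` list is all it takes to make this toy network deeper, which is the stacking idea the slide describes.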

DEEP NEURAL NETWORKS (DNNS)

Neural networks with many layers enable deep learning

Input Layer, Many Hidden Layers, Output Layer

WHAT PROBLEM ARE YOU SOLVING?

Different Tasks for Different Problems

INPUTS: Text, Data, Images, Audio, Video

QUESTION | AI/DL TASK | EXAMPLE OUTPUTS

Is “it” present or not? | Detection | Fraud Detection

What type of thing is “it”? | Classification | Image/Text Classification

To what extent is “it” present? | Segmentation | Size/Shape Analysis

What is the likely outcome? | Prediction | Analytic Prediction

What will likely satisfy the objective? | Recommendation | Direction Recommendation

NVIDIA GTC

https://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php

“Building an Enterprise Machine Learning Center of Excellence” Zachary Hanif (Capital One)

“Juicing Up Ye Olde GPU Monte Carlo Code” Richard Hayden, Oleg Rasskazov (JP Morgan Chase)

“Extracting Data from Tables and Charts in Natural Document Formats” Philipp Meerkamp, David Rosenberg (Bloomberg)

“Detection of Financial Statement Fraud using Deep Autoencoder Networks” Timur Sattarov (PricewaterhouseCoopers GmbH WPG), Marco Schreyer (German Research Center for Artificial Intelligence)

GTC ONLINE FSI CONTENT

“Deep Thinking: The Challenges of Deep Learning and GPU Acceleration of Financial Data” Erind Brahimi (Wells Fargo)

“Finance - Parallel Processing for Derivative Pricing” Louis Scott (Federal Reserve Bank of New York)

“GPU Acceleration of Monte Carlo Simulation for Capital Markets and Insurance” Serguei Issakov (Numerix)

“Applying Deep Learning to Financial Market Signal Identification with News Data” Rafael Nicolas Fermin Cota, Andrew Tan (Triumph Asset Management)

“Practical Aspects of Porting Monte Carlo Exotic Derivative Pricing Engines to IBM Power 8+ with Tesla P100 GPUs” Oleg Rasskazov (JP Morgan Chase)

“Algorithmic Trading Strategy Performance Improvement Using Deep Learning” Masahiko Todoriki (Mizuho Securities Co., Ltd.)

“Designing a GPU-Based Counterparty Credit Risk System” Patrik Tennberg (TriOptima)

“Accelerating Derivatives Contracts Pricing Computation with GPGPUs” Daniel Augusto Magalhães Borges, Alexandre Barbosa (BMFBOVESPA)

“A True Story: GPU in Production for Intraday Risk Calculations” Regis Fricker (Societe Generale)

“Effortless GPU Models for Finance” Ben Young (SunGard)

“GPU Implementation of Explicit and Implicit Finite Difference Methods in Finance” Mike Giles (University of Oxford)

“Monte-Carlo Simulation of American Options with GPUs” Julien Demouth (NVIDIA)

“Domain Specific Languages for Financial Payoffs” Matthew Leslie (Bank of America Merrill Lynch)

“High Performance Counterparty Risk and CVA Calculations in Risk Management” Dominique Delarue, Azim Siddiqi (BNP Paribas)

“GPU-enabled Real-time Risk Pricing in Option Market Making” Cris Doloc (Chicago Trading Company)

“High Productivity Computational Finance on GPUs” Peter Phillips, Aamir Mohammad (Aon Benfield Securities)

“Leveraging GPGPU Technology for Valuation of Complex Insurance Products” Chris Stiefeling (Oliver Wyman Financial Services)

“kdb+ and GPUs for Market Data Analytics and Trading” Philip A. Beasley-Harling (Bank of America Merrill Lynch)

“How to Speed Up Financial Risk Management Cost Efficiently for Intra-day and Pre-deal CVA Calculations” Thomas Moser (Misys)

“Running Risk on GPUs” Tim Wood (ING Bank nv)

“Accelerating Pricing Models with virtual GPUs” Scott Donovan (Citadel Investment Group)

GTCE067

GTCE017

GPU ACCELERATED MACHINE LEARNING FOR BOND PRICE PREDICTION

Input: 100+ features per trade.

o Trade size / historical

o Coupon rate / time to maturity

o Bond rating

o Trade type buy/sell

o Reporting delays

o Current yield / yield to maturity

Output: Bond trading price.

Launch as many CUDA threads as there are data elements, and leverage the 5120 cores on a V100 to run multiple kernels in parallel.
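The "one thread per data element" pattern comes down to a small piece of launch arithmetic, sketched here in plain Python. The deck does not show its kernel or launch parameters, so the block size of 256 and the helper names are assumptions made for illustration.

```python
def launch_config(n_elements, threads_per_block=256):
    """Return (blocks, threads_per_block) so that every element gets one thread."""
    # Ceiling division: the last block may be only partly used.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

def element_index(block_idx, thread_idx, threads_per_block):
    """Global element index handled by a given thread of a given block."""
    return block_idx * threads_per_block + thread_idx

blocks, threads = launch_config(2**20)
print(blocks, threads)
```

Because the block count is rounded up, a real kernel would normally check that its element index is below `n_elements` before touching any data.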

https://bit.ly/2GeQLse

NEARLY 10X SPEED UP OVER CPU IMPLEMENTATION

[Figure: speedup (y-axis) of unoptimized CUDA and optimized CUDA versus P = 20 to 25 (x-axis), where N = 2^P is the number of rows]

NVIDIA GPU CLOUD

DIY GPU-accelerated AI and HPC deployments are complex and time consuming to build, test and maintain.

Development of software by the community is moving very fast.

Requires high level of expertise to manage driver, library, framework dependencies.

NVIDIA Libraries

NVIDIA Container

Runtime for Docker

NVIDIA Driver

NVIDIA GPU

Applications or

Frameworks

CHALLENGES WITH COMPLEX SOFTWARE

NVIDIA GPU CLOUD (NGC)

SIMPLE ACCESS TO GPU-ACCELERATED SOFTWARE

+60 GPU-Accelerated Containers

Deep learning, HPC applications and visualization tools, and partner applications.

Innovate in Minutes, Not Weeks

Optimized, pre-configured, and ready-to-run.

Always up to date

Monthly updates by NVIDIA to ensure maximum performance.

DEEP LEARNING | HPC APPS | HPC VIZ

GPU-ACCELERATED CONTAINERS

Tuned and tested to maximize performance.

Cross-stack optimizations.

Pre-integrated and ready-to-run.

Frameworks and applications are isolated.

NVIDIA CONTAINER RUNTIME FOR DOCKER

DOCKER ENGINE

NVIDIA DRIVER

HOST OS

MOUNTED NVIDIA DRIVER

CONTAINER OS

CUDA TOOLKIT

DEEP LEARNING FRAMEWORKS

DEEP LEARNING LIBRARIES

APPLICATIONS

NGC SOFTWARE STACK

USING NGC CONTAINERS

Data Scientists and Researchers: Eliminate setup time, focus on science and research

Developers: Work with the latest software with a known good starting point

Sysadmins: Deploy to production immediately

BENEFITS FOR A WIDE VARIETY OF USERS

WHY GPUs

BEYOND MOORE’S LAW

[Figure: CPU and GPU performance on a log scale (up to 10^7), 1980 to 2020. CPU performance: 1.5X per year, later 1.1X per year. GPU performance: 1.5X per year, by 2025.]

BEYOND MOORE’S LAW: 1000X EVERY 10 YEARS

ACCELERATED COMPUTING

COMPUTERS WRITING SOFTWARE

[Diagram labels: DEEP NEURAL NETWORK, PROGRAM]
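The per-year rates on this chart compound quickly, which a short calculation makes concrete. This is only arithmetic on the quoted rates; treating them as constant over the whole period is an assumption, and the function names are made up for the sketch.

```python
import math

def growth(rate_per_year, years):
    """Total improvement after compounding a yearly rate."""
    return rate_per_year ** years

def years_to_reach(factor, rate_per_year):
    """Years of compounding needed to reach a given total factor."""
    return math.log(factor) / math.log(rate_per_year)

print(round(growth(1.5, 10), 1))            # ten years at 1.5X per year -> 57.7
print(round(growth(1.1, 10), 1))            # ten years at 1.1X per year -> 2.6
print(round(years_to_reach(1000, 1.5), 1))  # years to reach 1000X at 1.5X per year -> 17.0
```

At 1.1X per year a decade yields only about 2.6X, while sustaining 1.5X per year reaches 1000X in roughly 17 years.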

TRADITIONAL COMPUTE CLUSTER

300 Dual-CPU Servers

180 kW

ACCELERATED DATACENTER

12 Accelerated Servers with 4x NVIDIA Tesla V100

1/3 the Cost

1/4 the Space

1/5 the Power

Experiment

EXPERIMENTAL NATURE OF DEEP LEARNING

Unacceptable training time

WHAT IS RAPIDS

GPU Accelerated End-to-End Data Science

[Diagram: Data Preparation, Model Training, and Visualization stages, all in GPU memory. Components: Analytics, Machine Learning, cuGraph (Graph Analytics), Deep Learning, cuXFilter (Visualization).]

RAPIDS is a set of open source libraries for GPU accelerating

data preparation and machine learning.

rapids.ai

cuDF

• GPU-accelerated data preparation and feature engineering

• Python drop-in Pandas replacement

cuML

• GPU-accelerated traditional machine learning libraries

• XGBoost, PCA, Kalman, K-means, k-NN, DBSCAN, tSVD…

cuGraph

• GPU-accelerated graph analytics libraries

cuXfilter

• Web Data Visualization library

• DataFrame kept in GPU-memory throughout the session

cuML roadmap

cuML Algorithms Available Q2-2019

XGBoost GBDT MGMN

XGBoost Random Forest MGMN

K-Means Clustering MG

K-Nearest Neighbors (KNN) MG

Principal Component Analysis (PCA) SG

Density-based Spatial Clustering of Applications with Noise (DBSCAN) SG

Truncated Singular Value Decomposition (tSVD) SG

Uniform Manifold Approximation and Projection (UMAP) SG MG

Kalman Filters (KF) SG

Ordinary Least Squares Linear Regression (OLS) SG

Stochastic Gradient Descent (SGD) SG

Generalized Linear Model, including Logistic (GLM) MG

Time Series (Holt-Winters) SG

Autoregressive Integrated Moving Average (ARIMA) SG

SG: Single GPU

MG: Multi-GPU

MGMN: Multi-GPU Multi-Node

Last updated 29.03.19

HOW TO START

Source code on GitHub | Containers on NGC & Docker | Conda & PIP packages

https://github.com/rapidsai | https://ngc.nvidia.com | https://anaconda.org/rapidsai

On-premises | In the cloud

Pascal architecture or better | Ubuntu 16.04 or 18.04 | CUDA 9.2 or 10.0

A step-by-step installation guide

(MS Azure)

1. Create an NC6s_v2 virtual machine, and select NVIDIA GPU Cloud Image for Deep Learning and HPC.

2. Connect to the virtual machine:

$ ssh -L 8080:localhost:8888 -L 8787:localhost:8787 username@public_ip_address

3. Pull the RAPIDS container from NGC. Run it.

$ docker pull nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04

$ docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \

nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04

4. Run JupyterLab:

(rapids)$ bash /rapids/notebooks/utils/start-jupyter.sh

5. Open your browser, and navigate to http://localhost:8080.

6. Navigate to the cuml folder for cuML examples, or the mortgage folder for XGBoost examples.

A step-by-step installation guide

(Amazon Web Services)

1. Create a p3.8xlarge virtual machine, and select NVIDIA Volta Deep Learning AMI as image.

2. Connect to the virtual machine:

$ ssh -L 8080:localhost:8888 -L 8787:localhost:8787 ubuntu@public_ip_address

3. Pull the RAPIDS container from NGC. Run it.

$ docker pull nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04

$ docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \

nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04

4. Run JupyterLab:

(rapids)$ bash /rapids/notebooks/utils/start-jupyter.sh

5. Open your browser, and navigate to http://localhost:8080.

6. Navigate to the cuml folder for cuML examples, or the mortgage folder for XGBoost examples.

PORT YOUR CODE

CPU vs GPU

Training results

CPU: 57.1 seconds

GPU: 4.28 seconds

System: AWS p3.8xlarge

CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, 32 vCPU cores, 244 GB RAM

GPU: Tesla V100 SXM2 16GB

PRINCIPAL COMPONENT ANALYSIS (PCA)

CPU code:

Specific: Import CPU algorithm

Common: Data loading and algo params

Common: Model training

GPU code:

Specific: Import GPU algorithm

Common: Data loading and algo params

Specific: DataFrame from Pandas to GPU

Common: Model training
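The split between "specific" and "common" steps can be sketched as a backend switch. The deck's own code did not survive extraction, so this is a hedged reconstruction of the pattern, not the presenter's code: `pick_backend` and `fit_pca` are hypothetical helpers, and whether cuML accepts a given input type directly (or needs a cuDF DataFrame first) depends on the cuML version.

```python
import importlib.util

def pick_backend(candidates=("cuml", "sklearn")):
    """Return the first importable package name from candidates, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None

def fit_pca(data, n_components=2, backend=None):
    """Fit PCA with the available backend and return the projected data."""
    backend = backend or pick_backend()
    if backend == "cuml":
        from cuml import PCA                    # Specific: import GPU algorithm
        # Specific: with cuML, pandas data is typically moved to the GPU first,
        # for example with cudf.DataFrame.from_pandas(data).
    elif backend == "sklearn":
        from sklearn.decomposition import PCA   # Specific: import CPU algorithm
    else:
        raise RuntimeError("neither cuml nor scikit-learn is installed")
    model = PCA(n_components=n_components)      # Common: algorithm parameters
    return model.fit_transform(data)            # Common: model training

print(pick_backend())
```

Only the import and the data transfer differ between the two paths; the parameter setup and the training call are the same, which is the point the slide makes.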

CPU vs GPU

Training results

CPU: ~9 minutes

GPU: 1.12 seconds

System: AWS p3.8xlarge

CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, 32 vCPU cores, 244 GB RAM

GPU: Tesla V100 SXM2 16GB

K-NEAREST NEIGHBORS

CPU code:

Specific: Import CPU algorithm

Common: Data loading and algo params

Specific: Model training

GPU code:

Specific: Import GPU algorithm

Common: Data loading and algo params

Specific: DataFrame from Pandas to GPU

Specific: Model training
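The training times quoted for PCA and k-nearest neighbors translate into speedup factors as follows. This is plain arithmetic on the reported numbers; reading "~9 minutes" as exactly 540 seconds is an approximation.

```python
def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds

pca = speedup(57.1, 4.28)
knn = speedup(9 * 60, 1.12)   # "~9 minutes" taken as 540 s (approximation)
print(f"PCA:  {pca:.1f}x")    # PCA:  13.3x
print(f"k-NN: {knn:.0f}x")    # k-NN: 482x
```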

TRAINING TIME COMPARISON

The bigger the dataset, the larger the difference in training performance between CPU and GPU.

Dataset size trained in 15 minutes:

CPU: ~130,000 rows

GPU: ~5,900,000 rows

Specs NC6s_v2

Cores (Broadwell 2.6 GHz)

GPU 1 x P100

Memory 112 GB

Local Disk ~700 GB SSD

Network Azure Network

CPU vs GPU

BENCHMARKS

Benchmark

200GB CSV dataset; Data preparation includes joins, variable transformations.

CPU Cluster Configuration

CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark

DGX Cluster Configuration

5x DGX-1 on InfiniBand network

[Figure: three bar charts of time in seconds (shorter is better) for 20, 30, 50, and 100 CPU nodes versus 5x DGX-1. Panels: cuDF - Load and Data Prep; cuML - XGBoost; End-to-End. Legend: cuDF (Load and Data Preparation), Data Conversion, XGBoost.]

WHAT IS XGBOOST

DEFINITION

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

It is a powerful tool for solving classification and regression problems in a supervised learning setting.

XGBOOST

WHO ENJOYS COMPUTER GAMES

Example of Decision Trees

Input: age, gender, hair colour, …

Does the person like computer games?

Age < 30

Is male?

+2 +1 -1 -1 -1

Prediction score in each leaf

COMBINE TREES FOR BETTER PREDICTIONS

Ensembled Decision Trees

Tree 1: Age < 30, Is male? Leaf scores: +2 +1 -1 -1 -1

Tree 2: Use computer daily? Leaf scores: +0.9 +0.9 +0.9 -0.9 -0.9

f(‘Bill’) = 2 + 0.9 = 2.9

f(‘Sam’) = -1 - 0.9 = -1.9
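The way an ensemble sums the leaf scores of its trees can be sketched in plain Python. The tree shapes below are one reading of the slide (the leaf layout is inferred from the score rows), and the attribute values given to Bill and Sam are hypothetical, picked so the totals match the slide.

```python
def tree_1(person):
    """Tree 1: split on age, then on gender for the younger branch."""
    if person["age"] < 30:
        return 2.0 if person["is_male"] else 1.0
    return -1.0

def tree_2(person):
    """Tree 2: a single split on daily computer use."""
    return 0.9 if person["uses_computer_daily"] else -0.9

def predict(person, trees=(tree_1, tree_2)):
    """Ensemble prediction: the sum of each tree's leaf score."""
    return sum(tree(person) for tree in trees)

# Hypothetical attributes for the two people named on the slide.
bill = {"age": 25, "is_male": True, "uses_computer_daily": True}
sam = {"age": 60, "is_male": True, "uses_computer_daily": False}
print(round(predict(bill), 1), round(predict(sam), 1))   # 2.9 -1.9
```

Gradient boosting builds such trees one after another, each new tree correcting the errors of the sum so far; this sketch shows only the final summation step.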

TRAINED MODELS VISUALIZATION

Single Decision Tree vs Ensembled Decision Trees

Source: https://goo.gl/GWNdEm

WHY XGBOOST

Winner of Caterpillar Kaggle Contest 2015

– Machinery component pricing

Winner of CERN Large Hadron Collider Kaggle Contest 2015

– Classification of rare particle decay phenomena

Winner of KDD Cup 2016

– Research institutions’ impact on the acceptance of submitted academic papers

Winner of ACM RecSys Challenge 2017

– Job posting recommendation

A STRONG HISTORY OF SUCCESS

On a Wide Range of Problems

WHICH ML ALGORITHM PERFORMS BETTER

Average Rank Across 165 Datasets

Source: https://goo.gl/R8Y8Pp

XGBOOST + RAPIDS

XGBoost

• Tuned for eXtreme performance and high efficiency

• Multi-GPU and Multi-Node Support

RAPIDS

• E2E data science & analytics pipeline entirely on GPU

• User-friendly Python interfaces

• Relies on CUDA primitives, exposes parallelism and high-memory bandwidth

• Dask integration for managing workers and data in distributed environments

LEARN MORE

www.rapids.ai

CODE EXAMPLES

LOADING DATA INTO A GPU DATAFRAME

cuDF code examples

Create an empty DataFrame, and add a column

Create a DataFrame with two columns

Load a CSV file into a GPU DataFrame

Use Pandas to load a CSV file, and copy its content into a GPU DataFrame
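The code shown on this slide did not survive extraction, so the following is a hedged sketch of the four listed operations, not the original. It assumes cuDF mirrors the pandas calls used here and falls back to pandas when cuDF is unavailable, so it can be tried without a GPU; the file name and column names are hypothetical.

```python
import os
import tempfile

try:
    import cudf as xdf            # GPU DataFrames; needs RAPIDS and an NVIDIA GPU
except Exception:
    try:
        import pandas as xdf      # CPU fallback: the calls below exist in both
    except ImportError:
        xdf = None                # neither library is installed

def build_frames():
    # Create an empty DataFrame, and add a column.
    df = xdf.DataFrame()
    df["key"] = [0, 1, 2, 3, 4]
    # Create a DataFrame with two columns.
    df2 = xdf.DataFrame({"key": [0, 1, 2], "value": [10.0, 11.0, 12.0]})
    return df, df2

def load_csv(path):
    # Load a CSV file into a DataFrame (a GPU DataFrame when xdf is cudf).
    return xdf.read_csv(path)

def pandas_to_gpu(path):
    # Use pandas to load a CSV file, then copy its content into a GPU DataFrame.
    # Requires cudf, so it is defined here but not called in the demo.
    import cudf
    import pandas as pd
    return cudf.DataFrame.from_pandas(pd.read_csv(path))

def demo():
    df, df2 = build_frames()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.csv")   # hypothetical file
        with open(path, "w") as handle:
            handle.write("a,b\n1,2\n3,4\n")
        csv_df = load_csv(path)
    return len(df), list(df2.columns), len(csv_df)

if xdf is not None:
    print(demo())
```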

WORKING WITH GPU DATAFRAMES

cuDF code examples

Return the first three rows as a new DataFrame

Row slicing with column selection

Find the mean and standard deviation of a column

Count number of occurrences per value, and number of unique values

Transform column values with a custom function

Change the data type of a column
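As a hedged sketch of the operations listed above (the slide's own code is missing), the example below uses calls that exist in pandas and, as far as I know, in current cuDF. The column names and values are hypothetical, and the custom-function call in particular has changed across cuDF releases (older versions used `applymap` on a Series), so treat that line as an assumption.

```python
try:
    import cudf as xdf            # GPU DataFrames; needs RAPIDS and an NVIDIA GPU
except Exception:
    try:
        import pandas as xdf      # CPU fallback with the same calls
    except ImportError:
        xdf = None

def make_frame():
    # Hypothetical example data.
    return xdf.DataFrame({"a": [1, 2, 3, 4, 5], "b": [0, 1, 0, 2, 0]})

def double(x):
    return x * 2

def explore(df):
    return {
        "head": df.head(3),                      # first three rows as a new DataFrame
        "slice": df.loc[1:3, ["a", "b"]],        # row slicing with column selection
        "mean": df["a"].mean(),                  # mean of a column
        "std": df["a"].std(),                    # standard deviation of a column
        "counts": df["b"].value_counts(),        # occurrences per value
        "n_unique": df["b"].nunique(),           # number of unique values
        "doubled": df["a"].apply(double),        # custom function (API varies by version)
        "as_float": df["a"].astype("float32"),   # change the data type of a column
    }

if xdf is not None:
    results = explore(make_frame())
    print(results["mean"], results["n_unique"])
```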

QUERY, SORT, GROUP, JOIN, …

cuDF code examples

Query a DataFrame with a boolean expression

Return the first ‘n’ rows ordered by ‘columns’

Sort a column by its values

One-hot encoding

Group by column with aggregate function

Join and merge DataFrames
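The six operations listed above can be sketched as follows. This is a hedged illustration, not the slide's lost code: it assumes cuDF mirrors these pandas calls (older cuDF releases exposed one-hot encoding through a different method), it falls back to pandas when cuDF is unavailable, and all column names and values are hypothetical.

```python
try:
    import cudf as xdf            # GPU DataFrames; needs RAPIDS and an NVIDIA GPU
except Exception:
    try:
        import pandas as xdf      # CPU fallback with the same calls
    except ImportError:
        xdf = None

def run_examples():
    df = xdf.DataFrame({
        "key": [0, 1, 2, 3, 4],
        "group": [0, 1, 0, 1, 0],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    })
    other = xdf.DataFrame({"key": [0, 2, 4], "label": [10, 20, 30]})
    return {
        # Query a DataFrame with a boolean expression.
        "queried": df.query("value > 2"),
        # Return the first n rows ordered by a column (largest first).
        "top": df.nlargest(2, "value"),
        # Sort a column by its values.
        "sorted": df["value"].sort_values(ascending=False),
        # One-hot encoding of the group column.
        "one_hot": xdf.get_dummies(df, columns=["group"]),
        # Group by column with an aggregate function.
        "grouped": df.groupby("group").agg({"value": "sum"}),
        # Join and merge DataFrames.
        "joined": df.merge(other, on="key", how="inner"),
    }

if xdf is not None:
    results = run_examples()
    print(len(results["queried"]), len(results["joined"]))
```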

SUMMARY

GPU Accelerated Data Science

RAPIDS is a set of open source libraries for GPU

accelerating data preparation and machine learning.

www.rapids.ai

ONE MORE THING
