GPU Computing Powering HPC and AI - Dell...2019/10/23 · Tensor...

Shinnosuke Furuya, Ph.D., HPC Developer Relations, NVIDIA

10/23/2019

GPU Computing Powering HPC and AI

NVIDIA: The AI Computing Company

Founded in 1993
Founder and CEO: Jensen Huang
12,000 employees
Fiscal year 2018 revenue: $9.7 billion
Market capitalization: $160 billion

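Automatic Recognition and Categorization of Car Images

Carsales, an online car sales service, receives as many as 20,000 image uploads every day. Sorting those photos by hand was time-consuming. Carsales implemented an application called Cyclops, a GPU-powered AI tool that automatically recognizes the vehicle type and camera angle in posted photos, classifies them, and notifies sellers of poor-quality photos. Cyclops saves 55 hours of manual work per day.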

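Smart Farm Machinery Accelerated by GPUs

With herbicide-resistant weeds on the rise, alternatives to simple herbicide spraying are needed. Blue River Technology (acquired by John Deere) responds with GPU-equipped smart farm machinery.

Blue River Technology's See & Spray machine uses image recognition to distinguish crops from weeds and sprays only the minimum herbicide required, reducing herbicide use by 90%.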

Early Detection of Lung Cancer

CT scans are increasingly used to diagnose lung cancer, adding to an already unmanageable workload for radiologists. Further, very small pulmonary nodules are difficult to spot with the human eye.

Powered by NVIDIA GPUs on the NVIDIA Clara platform, 12 Sigma Technologies' σ-Discover/Lung system automatically detects lung nodules as small as 0.01% of an image, analyzes malignancy with >90% accuracy, and provides a decision support tool to radiologists. When optimized on an NVIDIA T4 cluster, the system runs up to 18x faster.


TESLA PLATFORM

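NVIDIA GPU Products by Generation (Excerpt)

Architectures: Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018)

Tesla (data center and cloud; HPC, GRID, and DL product lines)
Fermi: M2070
Kepler: K80, K2, K520
Maxwell: M40, M60, M6, M10
Pascal: P100, P40, P4, P6
Volta: V100
Turing: T4

GeForce (gaming)
Fermi: GTX 580
Kepler: GTX 780
Maxwell: GTX 980
Pascal: GTX 1080, TITAN X
Volta: TITAN V
Turing: RTX 2080 SUPER, TITAN RTX

Quadro (professional graphics)
Fermi: 6000
Kepler: K6000
Maxwell: M6000
Pascal: GP100, P5000
Volta: GV100
Turing: RTX 8000, RTX 6000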

Fermi Architecture

Full-fledged support for double-precision floating-point (FP64) arithmetic

Pascal Architecture

Introduction of FP16 optimization: FP16 throughput is twice that of FP32

Volta / Turing Architecture

Introduction of Tensor Cores, mixed-precision matrix math units that operate on FP16 and FP32

On the Tesla V100, FP32 delivers 15.7 TFLOPS, while mixed-precision arithmetic on Tensor Cores delivers 125 TFLOPS (8x the theoretical performance)

The Turing architecture adds support for INT8 and INT4; the Tesla T4 delivers 260 TOPS in INT4

NVIDIA TESLA V100

A Giant Leap for AI and HPC: the Volta Architecture with Tensor Cores

21 billion transistors | TSMC 12nm FFN | 815 mm²
5,120 CUDA cores | 640 Tensor Cores
FP64 (Double) 7.8 TFLOPS | FP32 (Single) 15.7 TFLOPS
FP16/FP32 (Mixed) 125 TFLOPS
20 MB total register file | 16 MB cache
32 GB HBM2 at 900 GB/s
300 GB/s NVLink
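The peak figures on this slide can be approximately reproduced from the unit counts. The following is a rough sketch only: the 1.53 GHz boost clock, the rate of 2 floating-point operations per CUDA core per cycle (one fused multiply-add), and the FP64 rate of half the FP32 rate are assumptions not stated on the slide. The 128 operations per cycle per Tensor Core is the figure quoted on the Tensor Core slide of this deck.

```python
# Assumed values (not on the slide): boost clock, per-core FMA rate, FP64:FP32 ratio.
BOOST_CLOCK_HZ = 1.53e9
FLOPS_PER_CUDA_CORE_PER_CYCLE = 2      # one fused multiply-add
FP64_TO_FP32_RATIO = 0.5

# Values taken from the slides.
CUDA_CORES = 5120
TENSOR_CORES = 640
OPS_PER_TENSOR_CORE_PER_CYCLE = 128

fp32_tflops = CUDA_CORES * FLOPS_PER_CUDA_CORE_PER_CYCLE * BOOST_CLOCK_HZ / 1e12
fp64_tflops = fp32_tflops * FP64_TO_FP32_RATIO
tensor_tflops = TENSOR_CORES * OPS_PER_TENSOR_CORE_PER_CYCLE * BOOST_CLOCK_HZ / 1e12

print(f"FP32:   {fp32_tflops:.1f} TFLOPS")
print(f"FP64:   {fp64_tflops:.1f} TFLOPS")
print(f"Tensor: {tensor_tflops:.1f} TFLOPS")
print(f"Tensor / FP32 ratio: {tensor_tflops / fp32_tflops:.1f}x")
```

Under these assumptions the results land close to the 15.7, 7.8, and 125 TFLOPS quoted on the slide, and the ratio of Tensor Core to FP32 throughput comes out at 8x.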

NVIDIA DGX SYSTEMS

DGX-2
• 2 PFLOPS (Mixed Precision)
• 16x Tesla V100
• NVSwitch interconnect
• 8x IB EDR | 2x 100GbE

DGX Station
• 500 TFLOPS (Mixed Precision)
• 4x Tesla V100
• NVLink, fully connected
• 2x 10GbE

DGX-1
• 1 PFLOPS (Mixed Precision)
• 8x Tesla V100
• NVLink, hybrid cube
• 4x IB EDR | 2x 10GbE

VOLTA: A Major Boost in HPC Performance

[Chart: HPC application performance, relative to P100]

System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.

Summit Supercomputer: 200 PetaFlops, 4,608 Nodes

VOLTA: A Major Boost in Deep Learning Performance

[Charts: Images per Second, P100 vs. V100]
Training: P100 (FP32) vs. V100 (Tensor Cores), 2.4x faster
Inference: P100 (FP16) vs. V100 (Tensor Cores), 3.7x faster (TensorRT, 7 ms latency)

(*) The DL model is ResNet50

TENSOR CORES: Mixed-Precision Matrix Math Units

$$
D = A B + C,
\qquad
A = \begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}
$$

B and C are 4x4 matrices indexed in the same way as A. A and B are FP16; C and D are FP32 (or FP16).

Matrix FMA (Fused Multiply-Add)

Computes the multiply-accumulate of 4x4 matrices in one cycle. Performance: 128 operations/cycle/Tensor Core, 1024 operations/cycle/SM
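The matrix FMA above can be emulated numerically in NumPy. This is a minimal sketch of the arithmetic only (FP16 inputs, FP32 accumulation), not of how the hardware or any CUDA API is programmed; the function name `tensor_core_fma` is hypothetical. It also counts the operations in one 4x4 matrix FMA to show where the 128 operations per cycle come from.

```python
import numpy as np

def tensor_core_fma(a, b, c):
    """Emulate D = A @ B + C with FP16 inputs and FP32 accumulation."""
    a16 = a.astype(np.float16)   # inputs are rounded to FP16
    b16 = b.astype(np.float16)
    # Products of the FP16 values are accumulated in FP32.
    return a16.astype(np.float32) @ b16.astype(np.float32) + c.astype(np.float32)

rng = np.random.default_rng(0)
a, b, c = (rng.standard_normal((4, 4)) for _ in range(3))
d = tensor_core_fma(a, b, c)

# One 4x4 matrix FMA: each of the 16 outputs needs 4 multiplies and 4 adds.
ops_per_fma = 16 * (4 + 4)
# 1024 operations/cycle/SM then implies this many Tensor Cores per SM.
tensor_cores_per_sm = 1024 // ops_per_fma

print(d.dtype, ops_per_fma, tensor_cores_per_sm)
```

The result differs from a full FP64 computation only by the rounding of A and B to FP16, which is the trade-off the mixed-precision design accepts.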

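The Fusion of HPC and AI Is Transforming Science and Technology

AI
> Neural networks that learn patterns from large amounts of data
> Higher prediction accuracy and faster response

HPC
> Algorithms based on first principles
> Accurate results from proven models

Use cases: identifying chemical compounds, large-scale weather modeling, searching for geological faults, detecting plasma disruptions

90% Prediction Accuracy, Published in Nature, April 2019
Tensor Cores Achieved 1.13 EF, 2018 Gordon Bell Winner
Orders of Magnitude Speedup, 3M New Compounds in 1 Day
Time-to-solution Reduced from Weeks to 2 Hours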

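Using TENSOR Cores in Scientific Computing: Mixed-Precision Arithmetic

[Chart: Tesla V100 theoretical performance in TFLOPS: 7.8 (FP64), 15.7 (FP32), 125 (mixed precision)]

Sustaining plasma in a fusion reactor (Oak Ridge National Laboratory): FP16 solver, 3.5x faster

Urban earthquake simulation (Earthquake Research Institute, The University of Tokyo): FP16-FP21-FP32-FP64, 25x faster

Weather forecasting with mixed-precision arithmetic (University of Oxford): FP16/FP32/FP64, 4x faster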

HPL-AI and Iterative Refinement Solvers

HPL-AI on SUMMIT Records 3x the Performance

HPL-AI: a new approach to measuring the performance of AI supercomputing

The fusion of HPC and AI
HPC (Simulation) – FP64
AI (Machine Learning) – FP16, FP32

3x the performance on Summit, which is equipped with Tensor Core GPUs
FP64 (HPL): 149 PF
Mixed Precision (HPL-AI): 445 PF

Proposed by Prof. Jack Dongarra, et al.
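The idea behind an iterative refinement solver can be sketched in a few lines: solve the system in low precision, then repeatedly correct the solution using residuals computed in FP64. The sketch below is illustrative only and is not the HPL-AI benchmark code. It assumes FP32 as the stand-in for the low precision (NumPy has no FP16 linear solver), uses classical iterative refinement, and calls `np.linalg.solve` again on each pass instead of reusing a stored LU factorization as a real implementation would. The function name and the test matrix are hypothetical.

```python
import numpy as np

def mixed_precision_solve(a, b, iters=10, tol=1e-12):
    """Solve a @ x = b using low-precision solves plus FP64 refinement."""
    a_low = a.astype(np.float32)                      # low-precision copy of A
    x = np.linalg.solve(a_low, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - a @ x                                 # residual in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Correction computed in low precision, applied in FP64.
        d = np.linalg.solve(a_low, r.astype(np.float32))
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
a = rng.standard_normal((n, n)) + n * np.eye(n)       # well-conditioned test matrix
b = rng.standard_normal(n)

x = mixed_precision_solve(a, b)
rel_residual = np.linalg.norm(b - a @ x) / np.linalg.norm(b)
print(f"relative residual: {rel_residual:.2e}")
```

For a well-conditioned matrix such as this one, a few refinement passes bring the residual far below what a single FP32 solve reaches, while the expensive solves all run in the lower precision.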

Models Keep Growing More Complex and Larger: Even More Compute Power Is Needed

2015 - Microsoft ResNet: 7 ExaFLOPS, just under 1 month on 1 GPU
2016 - Baidu Deep Speech 2: 20 ExaFLOPS, 2.5 months on 1 GPU
2017 - Google NMT: 105 ExaFLOPS, more than 1 year on 1 GPU
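As a rough cross-check of these figures, the sustained single-GPU throughput they imply can be estimated by dividing the total operation count by the training time. This sketch reads "ExaFLOPS" as the total number of floating-point operations for training (10^18 operations), assumes a 30-day month, and treats "just under 1 month" and "more than 1 year" as exactly 1 month and 1 year, so the first and last values are only bounds.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30   # assumption for converting months to days

# (total operations, training days on 1 GPU), taken from the slide.
workloads = {
    "Microsoft ResNet (2015)": (7e18, 1 * DAYS_PER_MONTH),
    "Baidu Deep Speech 2 (2016)": (20e18, 2.5 * DAYS_PER_MONTH),
    "Google NMT (2017)": (105e18, 365),
}

# Implied sustained throughput in TFLOPS for each workload.
implied_tflops = {
    name: ops / (days * SECONDS_PER_DAY) / 1e12
    for name, (ops, days) in workloads.items()
}

for name, tflops in implied_tflops.items():
    print(f"{name}: about {tflops:.1f} TFLOPS sustained")
```

Under these assumptions all three workloads imply a sustained rate of roughly 3 TFLOPS, which suggests the three time estimates on the slide were made against a similar single-GPU baseline.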

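Distributed Training

Data parallelism: split the dataset
• Each GPU handles a different sub-dataset after the split
• Little data is exchanged between GPUs
• Subject to the large-batch-size problem

Model parallelism: split the model
• Each GPU handles a different sub-model after the split
• A large amount of data is exchanged between GPUs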
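Data parallelism can be illustrated without any GPUs. In the sketch below, each simulated worker computes a gradient on its own shard of the batch, and the gradients are then averaged, which stands in for the all-reduce step between GPUs. The linear model, the loss, and all names are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def grad_mse(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)**2) with respect to w."""
    return x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 5))   # one global batch of 64 samples
y = rng.standard_normal(64)
w = rng.standard_normal(5)         # model weights, replicated on every worker

num_workers = 4
x_shards = np.array_split(x, num_workers)
y_shards = np.array_split(y, num_workers)

# Each worker computes a gradient on its own equally sized shard.
local_grads = [grad_mse(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]

# Averaging the local gradients stands in for an all-reduce across GPUs.
averaged_grad = np.mean(local_grads, axis=0)

# Reference: gradient computed on the whole batch by a single worker.
full_grad = grad_mse(w, x, y)
print(np.allclose(averaged_grad, full_grad))
```

Only the gradient vector (the size of the model) is exchanged, not the data, which is why the traffic between GPUs is small. The effective batch size grows with the number of workers, which is the source of the large-batch-size problem noted above.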

The Cutting Edge of Distributed Training (Data Parallel)

ImageNet + ResNet50

Organization   Processor           DL framework   Time
Microsoft      Tesla P100 x8       Caffe          29 hours
Facebook       Tesla P100 x256     Caffe2         1 hour
Google         TPUv2 x256          TensorFlow     30 mins
PFN            Tesla P100 x1024    Chainer        15 mins
Tencent        Tesla P40 x2048     TensorFlow     6.6 mins
SONY           Tesla V100 x2176    NNL            3.7 mins
Google         TPUv3 x1024         TensorFlow     2.2 mins
Fujitsu        Tesla V100 x2048    MXNet          75 sec

NGC

Continuous Performance Improvement: Software Optimization Improves Performance on the Same Hardware

Monthly updates to the deep learning frameworks and the HPC software stack improve performance.

[Charts: throughput by container version]
MXNet: Images/Second for versions 18.02, 18.09, 19.02 (Mixed Precision | 128 Batch Size | ResNet-50 Training | 8x V100)
PyTorch: Tokens/Second for versions 18.05, 18.09, 19.02 (Mixed Precision | 128 Batch Size | GNMT | 8x V100)
TensorFlow: Images/Second for versions 18.02, 18.09, 19.02 (Mixed Precision | 256 Batch Size | ResNet-50 Training | 8x V100)
HPC Applications: speedup for Mar '18, Nov '18, Mar '19 (Speedup across Chroma, GROMACS, LAMMPS, QE, MILC, VASP, SPECFEM3D, NAMD, AMBER, GTC, RTM | 4x V100 v. Dual-Skylake | CUDA 9 for Mar '18 & Nov '18, CUDA 10 for Mar '19)

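The New NGC: a GPU-Optimized Software Hub That Simplifies Machine Learning and HPC Workflows

NGC
• More than 50 container images: DL, ML, HPC
• Pretrained models: natural language processing, image classification, object detection, and more
• Industry-specific solutions: medical image processing, advanced video analytics
• Model scripts: natural language processing, image classification, object detection, and more

Innovate Faster
Deploy Anywhere
Simplify Deployments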

NGC Model Scripts: Tensor Core Sample Scripts for a Variety of Use Cases and Frameworks

18 scripts are available
● Optimized for Tensor Cores
● Ready for trying out AMP right away
● Actively updated by NVIDIA
● SOTA models using Tensor Cores
● Provided as reference implementations
● Hyperparameters and source code are published

Get them here:
● NVIDIA NGC https://ngc.nvidia.com/catalog/model-scripts
● GitHub https://www.github.com/NVIDIA/deeplearningexamples
● NVIDIA NGC Framework containers https://ngc.nvidia.com/catalog/containers
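The slide does not describe how AMP works internally. One technique commonly used in mixed-precision training is loss scaling, which keeps very small gradient values from underflowing to zero in FP16. The NumPy sketch below illustrates only that numerical effect; it does not use any AMP API, and the gradient value and scale factor are arbitrary illustrative choices.

```python
import numpy as np

tiny_grad = 1e-8   # illustrative gradient value, below the smallest FP16 subnormal

# Stored directly in FP16, the value underflows to zero.
unscaled_fp16 = np.float16(tiny_grad)

# With loss scaling, the value is multiplied by a power of two before the
# FP16 conversion, then divided by the same factor in FP32 afterwards.
loss_scale = 65536.0   # illustrative scale factor
scaled_fp16 = np.float16(tiny_grad * loss_scale)
recovered = np.float32(scaled_fp16) / np.float32(loss_scale)

print("without scaling:", float(unscaled_fp16))
print("with scaling:   ", float(recovered))
```

Because the scale factor is a power of two, scaling and unscaling change only the exponent, so the recovered value is as accurate as FP16's mantissa allows.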

A Variety of Model Scripts
https://developer.nvidia.com/deep-learning-examples

Computer Vision
● SSD PyTorch
● SSD TensorFlow
● UNET-Industrial TensorFlow
● UNET-Medical TensorFlow
● ResNet-50 v1.5 MXNet
● ResNet-50 PyTorch
● ResNet-50 TensorFlow
● Mask R-CNN PyTorch

Speech & NLP
● GNMT v2 TensorFlow
● GNMT v2 PyTorch
● Transformer PyTorch
● BERT (Pre-training and Q&A) TensorFlow

Recommender Systems
● NCF PyTorch
● NCF TensorFlow

Text to Speech
● Tacotron2 and WaveGlow PyTorch

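NGC Pretrained Model Files

• Available for multiple frameworks: TensorRT, TensorFlow, PyTorch, MXNet
• Trained on a variety of datasets: ImageNet, MSCOCO, LibriSpeech, Wikipedia/BookCorpus, etc.
• Offered in multiple precisions: FP32, FP16, and INT8
• Key customer benefits:
  ○ Inference performance can be evaluated easily using pretrained models
  ○ Usable as building blocks for model ensemble pipelines
  ○ NVIDIA's inference benchmarks can be reproduced locally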

NGC CONTAINER REPLICATOR

Replicate the Latest Images Locally

• Creates a local replica of NGC container images
• Faster access | Reduced traffic | Storage savings
• Automatically downloads new images
• Exports container images to environments isolated from the network
• Generates Singularity images
• Can be run with Docker and Singularity

[Diagram: over time, a cron job copies new image versions (v1, then v2) from NGC to a local repository]

GitHub: https://github.com/NVIDIA/ngc-container-replicator
How-to guide: https://devblogs.nvidia.com/automating-downloads-ngc-container-replicator

Growing Adoption of NGC: Used by More Than 200 Supercomputing Centers and More Than 800 Universities

"According to our sampling data, NGC containers are used in 80% of the more than 100,000 jobs run with Singularity."

小川宏高, Team Leader, AI Cloud Research Team, National Institute of Advanced Industrial Science and Technology

https://blogs.nvidia.co.jp/2019/06/19/abci-adopts-ngc/

NGC Container User Guide

https://www.nvidia.com/content/dam/en-zz/ja/Solutions/cloud/NGC-User-Guide_JA.pdf

Reference Architecture for AI Infrastructure

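The Challenge in Adopting AI Is Building and Providing the Right Infrastructure

Adopting AI initiatives improved profit margins by 15%

40% of companies cite the lack of infrastructure suited to AI development as a challenge in adopting AI

source: 2018 CTA Market Research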

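A Scalable AI Infrastructure: Reflecting Know-How from NVIDIA's Own Deep Learning Data Center

Rack design
• Deliver DL training performance up to the limit of the operating specification
• Apply HPC best practices

Network
• Ethernet / IB based fabric
• 100 Gbps interconnect
• High bandwidth, ultra-low latency

Storage
• Datasets containing millions of objects
• Datasets of a terabyte or more
• High IOPS, low latency

Facility
• Rack design with high power density
• Reduced data center floor space through high FLOPS per watt

Software
• Infrastructure management software built around the cluster concepts needed to scale

Example:
• Autonomous driving data = 1 TB / hr
• Training data volume: 500 PB
• RN50: training on the scale of 113 days
• Target: 7 days
• Workload: 6 developers working concurrently
= 97-node cluster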
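The slide does not show how the 97-node figure is derived. One reading that reproduces it is sketched below, assuming that the 113 days refer to training on a single node, that training time scales linearly with the number of nodes, and that each of the 6 developers needs a full share of nodes at the same time.

```python
import math

single_node_days = 113   # RN50 training time from the slide (assumed: one node)
target_days = 7          # desired turnaround
developers = 6           # developers training concurrently

# Nodes one developer needs to finish in the target time, assuming linear scaling.
nodes_per_developer = single_node_days / target_days

# Total nodes for all developers, rounded up to whole nodes.
cluster_nodes = math.ceil(nodes_per_developer * developers)

print(f"{nodes_per_developer:.1f} nodes per developer, {cluster_nodes} nodes in total")
```

Real scaling is rarely perfectly linear, so a sizing exercise of this kind gives a lower bound rather than a guaranteed cluster size.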

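NVIDIA DGX POD™: A Reference Architecture for AI Infrastructure

• A reference architecture built on NVIDIA® DGX-1™
• Reflects the best practices and system concepts from deploying DGX SATURNV, NVIDIA's own data center environment
• Designed around deep learning training workflows
• A reference architecture that serves as the foundation for other platforms:
  • Can be upgraded to NVIDIA DGX-2™ servers
  • Foundation for industry-specific PODs, such as medical
• An ecosystem solution premised on validation with storage and network partners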

A Reference Architecture for High-Density Compute

DGX-1 POD: a 4-POD design premised on thermal management

9 DGX-1 servers
• 8 Tesla V100 GPUs each
• NVIDIA GPUDirect™ over RDMA support
• Run at MaxQ
• 100 GbE networking (up to 4 x 100 GbE)

12 storage nodes
• 192 GB RAM
• 3.8 TB SSD
• 100 TB HDD (1.2 PB total HDD)
• 50 GbE networking

Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4 x 100 GbE (up to 8)

Rack
• 35 kW power
• 42U x 1200 mm x 700 mm (minimum)
• Rear door cooler

• NVIDIA DGX POD
• Supports scalability to hundreds of nodes
• Follows the proven SATURNV architecture

DGX POD (DGX-1): Reference Architecture in a Single 35 kW High-Density Rack

Fits within a standard-height 42 RU data center rack
• Nine DGX-1 servers (9 x 3 RU = 27 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high speed network switches (1 or 2 RU)

In real-life DL application development, one to two DGX-1 servers per developer are often required
One DGX POD supports five developers (AV workload)
Each developer works on two experiments per day
One DGX-1/developer/experiment/day*

*300,000 0.5M images * 120 epochs @ 480 images/sec Resnet-18 backbone detection network per experiment
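The footnote's numbers can be turned into an approximate duration per experiment. The footnote lists both "300,000" and "0.5M" images and the source does not make clear how the two relate, so this sketch simply evaluates both, assuming the 480 images/sec throughput is sustained for the whole run.

```python
EPOCHS = 120
IMAGES_PER_SECOND = 480      # throughput quoted in the footnote
SECONDS_PER_HOUR = 3600

# Hours per experiment for each dataset size mentioned in the footnote.
hours_per_experiment = {
    images: images * EPOCHS / IMAGES_PER_SECOND / SECONDS_PER_HOUR
    for images in (300_000, 500_000)
}

for images, hours in hours_per_experiment.items():
    print(f"{images:,} images: about {hours:.1f} hours per experiment")
```

With 300,000 images the run takes a little under a day, which is in line with the slide's planning figure of one experiment per day on one DGX-1.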

DGX POD (DGX-2): Reference Architecture in a Single 35 kW High-Density Rack

Fits within a standard-height 48 RU data center rack
• Three DGX-2 servers (3 x 10 RU = 30 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high speed network switches (1 or 2 RU)

In real-life DL application development, one DGX-2 per developer minimizes model training time
One DGX POD supports at least three developers (AV workload)
Each developer works on two experiments per day
One DGX-2/developer/2 experiments/day*

*300,000 0.5M images * 120 epochs @ 480 images/sec Resnet-18 backbone detection network per experiment
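The rack-unit budgets of the two POD configurations can be checked by adding up the components listed on the DGX-1 and DGX-2 POD slides. This sketch assumes the larger of the two sizes (2 RU) for the high-speed network switches; the dictionary layout and names are illustrative only.

```python
# (count, rack units each) per component, taken from the two POD slides.
pods = {
    "DGX-1 POD": {"rack_ru": 42, "items": [(9, 3), (12, 1), (1, 1), (1, 2)]},
    "DGX-2 POD": {"rack_ru": 48, "items": [(3, 10), (12, 1), (1, 1), (1, 2)]},
}

# Total rack units consumed by each configuration.
used_ru = {
    name: sum(count * ru for count, ru in cfg["items"])
    for name, cfg in pods.items()
}

for name, cfg in pods.items():
    free = cfg["rack_ru"] - used_ru[name]
    print(f"{name}: {used_ru[name]} of {cfg['rack_ru']} RU used, {free} RU free")
```

The DGX-1 configuration fills a 42 RU rack exactly, while the DGX-2 configuration leaves a few units spare in a 48 RU rack.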


NVIDIA DGX-READY SOLUTION PARTNERS

NVIDIA Partners Ready to Build Your DGX SuperPODs

NVIDIA DGX SUPERPOD: AI LEADERSHIP REQUIRES AI INFRASTRUCTURE LEADERSHIP

Test Bed for Highest Performance Scale-Up Systems
• 9.4 PF on HPL | ~200 AI PF | #22 on Top500 list
• <2 mins To Train RN-50

Modular & Scalable GPU SuperPOD Architecture
• Built in 3 Weeks
• Optimized For Compute, Networking, Storage & Software

Integrates Fully Optimized Software Stacks
• Freely Available Through NGC

• 96 DGX-2H
• 10 Mellanox EDR IB per node
• 1,536 V100 Tensor Core GPUs
• 1 megawatt of power

Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC

NVIDIA DGX SUPERPOD

Mellanox EDR 100G InfiniBand Network
Mellanox Smart Director Switches
In-Network Computing Acceleration Engines
Fast and Efficient Storage Access with RDMA
Up to 130Tb/s Switching Capacity per Switch
Ultra-Low Latency of 300ns
Integrated Network Manager
Terabit-Speed InfiniBand Networking per Node

[Diagram: Rack 1 … Rack 16 holding 64 DGX-2 systems; compute backplane switch at 800 Gb/s per node; storage backplane switch at 200 Gb/s per node, connected to GPFS]
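The SuperPOD figures on these two slides are consistent with one another, as the short calculation below shows. It assumes 16 GPUs per DGX-2H (as stated in the MLPerf footnote of this deck) and 100 Gb/s per EDR InfiniBand link (the "EDR 100G" on the slide).

```python
DGX2H_NODES = 96
GPUS_PER_NODE = 16           # V100s per DGX-2H
EDR_LINKS_PER_NODE = 10
EDR_GBPS_PER_LINK = 100      # EDR InfiniBand, 100 Gb/s per link

total_gpus = DGX2H_NODES * GPUS_PER_NODE
per_node_gbps = EDR_LINKS_PER_NODE * EDR_GBPS_PER_LINK

# The two per-node bandwidth figures shown in the network diagram.
diagram_gbps = 800 + 200

print(f"{total_gpus} GPUs in total")
print(f"{per_node_gbps} Gb/s per node, diagram total {diagram_gbps} Gb/s")
```

Ten EDR links add up to 1 Tb/s per node, which matches both the "Terabit-Speed InfiniBand Networking per Node" claim and the sum of the two per-node figures in the diagram.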

MLPERF 2019: NVIDIA DGX SUPERPOD BREAKS AI RECORDS

Record Type: Max Scale (Minutes to Train)
Benchmark                                      Record
Object Detection (Heavy Weight) Mask R-CNN     18.47 Mins
Translation (Recurrent) GNMT                   1.8 Mins
Reinforcement Learning (MiniGo)                13.57 Mins

Record Type: Per Accelerator (Hours to Train)
Benchmark                                      Record
Object Detection (Heavy Weight) Mask R-CNN     25.39 Hrs
Object Detection (Light Weight) SSD            3.04 Hrs
Translation (Recurrent) GNMT                   2.63 Hrs
Translation (Non-recurrent) Transformer        2.61 Hrs
Reinforcement Learning (MiniGo)                3.65 Hrs

Per Accelerator comparison using reported performance for MLPerf 0.6 NVIDIA DGX-2H (16 V100s) compared to other submissions at same scale, except for MiniGo, where the NVIDIA DGX-1 (8 V100s) submission was used | MLPerf ID Max Scale: Mask R-CNN: 0.6-23, GNMT: 0.6-26, MiniGo: 0.6-11 | MLPerf ID Per Accelerator: Mask R-CNN, SSD, GNMT, Transformer: all use 0.6-20, MiniGo: 0.6-10

DGX SUPERPOD AI Software Stack

[Diagram: software stack components]
NVIDIA GPU Cloud Applications
Multi-Node NVIDIA GPU Containers
Docker
Kubernetes & Slurm
GPU Enabled
DGX SuperPOD Mgmt Software: Cluster Mgmt, Orchestration, Workload Scheduler
DGX SuperPOD

A Little Promotion

SNS

Facebook: NVIDIA AI Japan | https://www.facebook.com/NVIDIAAI.JP

Twitter: @NVIDIAAIJP | https://twitter.com/NVIDIAAIJP

Follow and Like us!

Example posts: event reports

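WEBINAR

Speed Up Training with AUTOMATIC MIXED PRECISION

佐々木邦暢, Data Center Solutions Architect

In this webinar you will learn:
• The benefits of mixed-precision arithmetic in neural network training
• The performance of mixed-precision arithmetic with Tensor Cores on the Volta / Turing architectures
• How to enable the Automatic Mixed Precision (AMP) feature

https://info.nvidia.com/jp-amp-webinar-reg-page.html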

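WEBINAR

Getting Started with GPU Computing Using OpenACC: Introduction

丹愛彦, HPC Solutions Architect

In this webinar you will learn:
• The benefits of starting GPU computing with OpenACC
• An overview of simple, portable programming with OpenACC
• Concrete OpenACC programming examples based on sample problems

https://info.nvidia.com/intro-openacc-jp-reg-page.html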
