Deep Learning at 15 PF
Supervised and Semi-supervised Pattern Classification for Scientific Data
Thorsten Kurth, Jian Zhang, Nadathur Satish, Evan Racah, Ioannis Mitliagkas, Md. Mostofa Ali Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, Prabhat, Pradeep Dubey
NIPS17 Long Beach, CA, USA DOI: 10.1145/3126908.3126916
High-Energy Physics (HEP) Finding New Physics Candidates
• Find rare signals of new particles in collisions (events) at the LHC
• Represent data from the cylindrical detector as a sparse 2D (224x224) image
• 3 channels for different instruments (hadron & EM calorimeters, track multiplicity)
• ~10M-event (7.4 TB) simulated dataset
• Existing selections on derived high-level physics variables used as a benchmark
• Binary classification task
(image: atlas.ch)
[Figure: the three detector channels (hadron calorimeter, EM calorimeter fraction, track multiplicity), plotted as rapidity vs. azimuth, for a Standard Model event (top row) and a New Physics event (bottom row)]
HEP Network Architecture
• sparse/lightweight layers
• 3 channels, suitable image dimensions
• total model size: ~2.3 MB
• training: SGD + momentum, per-layer learning rates, weight decay
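As a sanity check on the quoted model size, a quick back-of-envelope calculation converts ~2.3 MB into an approximate parameter count. This assumes 32-bit (4-byte) floating-point weights, which the slide does not state; the exact layer breakdown is not given here either.

```python
# Back-of-envelope: ~2.3 MB of weights -> approximate parameter count.
# Assumes fp32 (4 bytes/parameter); precision is not stated on the slide.
model_bytes = 2.3 * 1024 * 1024   # quoted model size, ~2.3 MB
params = model_bytes / 4          # parameters at 4 bytes each
print(round(params))              # roughly 600K parameters
```

Roughly 600K parameters is small by ImageNet-era standards, consistent with the "lightweight layers" bullet above.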
Climate Science Finding Hurricanes
• Locate and identify extreme weather phenomena in simulated climate data (CAM, 25 km resolution, 30 years)
• image size 768x1152, cropped to 768x768
• 16 channels (temperature, wind speed, pressure, etc.)
• not all images are annotated
• ~400K images (15 TB) of data
• semi-supervised bounding-box regression + classification (7 classes)
Climate Network Architecture
[Figure: encoder-decoder architecture built from 5x5 convolutions. Encoder feature maps: 16x768x1152 → 128x384x576 → 256x192x288 → 512x96x144 → 768x48x72 → 1024x24x36 → 1280x12x18 bottleneck, with 2x12x18 and 4x12x18 detection outputs; the decoder mirrors the encoder back up to 16x768x1152]
• only sparse layers
• regression/detection performed using YOLO (DOI: 10.1109/CVPR.2016.91)
• YOLO loss combined with a Euclidean L2 reconstruction loss from the autoencoder
• training: SGD + momentum/ADAM, weight decay
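The combined semi-supervised objective can be sketched as follows: the autoencoder's Euclidean (L2) reconstruction loss applies to every image, while the YOLO-style detection loss applies only to annotated ones. Here `yolo_loss` is a stand-in placeholder (squared error on box coordinates only, not the full YOLO objective with objectness and class terms), and `lam` is a hypothetical weighting hyperparameter, not a value from the paper.

```python
# Sketch of the combined loss: L2 reconstruction on all images,
# detection loss only where annotations exist (semi-supervised).

def l2_loss(recon, image):
    # Euclidean reconstruction error between decoder output and input
    return sum((r - x) ** 2 for r, x in zip(recon, image))

def yolo_loss(pred_boxes, true_boxes):
    # placeholder for the full YOLO objective (coords + objectness + class)
    return sum((p - t) ** 2
               for pb, tb in zip(pred_boxes, true_boxes)
               for p, t in zip(pb, tb))

def total_loss(recon, image, pred_boxes=None, true_boxes=None, lam=0.5):
    loss = lam * l2_loss(recon, image)        # all images, labeled or not
    if true_boxes is not None:                # annotated images only
        loss += yolo_loss(pred_boxes, true_boxes)
    return loss
```

Unlabeled images simply skip the detection term, which is what lets the unannotated part of the dataset contribute to training.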
Software Stack and Hardware
• Intel® Distribution of Caffe
  • fast DL framework with distributed-computing support
• Intel® Machine Learning Scaling Library (MLSL)
  • communication abstraction layer
  • here: uses cray-mpich as backend with RDMA optimizations
• Intel® Math Kernel Library (Intel® MKL)
  • library with highly optimized DL primitives (soon to be replaced/merged with the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN))
• Cori-KNL HPC system
  • 9688 Intel® Xeon Phi™ 7250 (Knights Landing) processor nodes
  • 96 GB DDR4 + 16 GB MCDRAM memory per node
  • 68 cores with 272 threads; 32 KB L1 per core, 1 MB L2 per 2 cores
  • high-speed Aries interconnect with dragonfly topology
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
Single Node Performance
[Figure: per-layer performance charts for the HEP model and the climate model]
• local batch size of 8
• HEP: 1.90 TFLOP/s/node
• Climate: 2.09 TFLOP/s/node
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Copyright © 2017, Intel Corporation
Single Node Performance Speedup Using Latest Libraries
• local batch size of 8
• HEP: 1.90 TFLOP/s/node in the paper, improved to 3.21 TFLOP/s/node with Winograd optimizations in Intel® MKL-DNN
[Figure: per-layer speedups (roughly 0.5x-3x) for the HEP model with MKL-DNN + Winograd, covering the forward/backward passes of conv1-conv5, pool1-pool2, relu1-relu2, the data layer, and the weight update]
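The Winograd speedup comes from computing small convolution tiles with fewer multiplications. Below is a minimal sketch of the classic F(2,3) algorithm, which produces two outputs of a 3-tap filter using four multiplications instead of six; this illustrates the technique only, and is not Intel MKL-DNN's actual implementation.

```python
# Winograd F(2,3): two outputs of a 3-tap correlation over a 4-sample tile,
# via 4 elementwise multiplications. Standard transform matrices.

BT = [[1,  0, -1,  0],
      [0,  1,  1,  0],
      [0, -1,  1,  0],
      [0,  1,  0, -1]]
G  = [[1.0,  0.0, 0.0],
      [0.5,  0.5, 0.5],
      [0.5, -0.5, 0.5],
      [0.0,  0.0, 1.0]]
AT = [[1, 1,  1,  0],
      [0, 1, -1, -1]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def winograd_f23(d, g):
    """Two correlation outputs of 3-tap filter g over 4-sample tile d."""
    U = matvec(G, g)                    # transformed filter
    V = matvec(BT, d)                   # transformed input tile
    M = [u * v for u, v in zip(U, V)]   # the 4 multiplications
    return matvec(AT, M)                # inverse transform -> 2 outputs

def direct(d, g):
    # reference: plain sliding-window correlation
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

In 2D the same idea is applied along both axes (e.g. F(2x2,3x3), 16 multiplications instead of 36), which is where the large gains on 3x3 convolution layers come from.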
Multi Node Parallelization Scheme
[Figure: data-parallel training pipeline with forward propagation through layers 1…N, back propagation in reverse, and an allreduce of each layer's gradients]
From Pradeep Dubey, "Scaling to Meet the Growing Needs of Artificial Intelligence (AI)", IDF 2016, https://software.intel.com/en-us/articles/scaling-to-meet-the-growing-needs-of-ai
• Use data parallelism for the Solver
• Each node takes part of the data and computes model updates independently without communication
• These updates are then collectively applied to the model
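The scheme above can be illustrated with a toy example: with equally sized shards, averaging the per-node gradients (the allreduce) reproduces the full-batch gradient exactly, which is why synchronous data parallelism behaves like serial SGD with a larger effective batch. This is a pure-Python sketch, not the MLSL API.

```python
# Toy synchronous data-parallel step for a scalar least-squares model y = w*x.

def grad(w, shard):
    # d/dw of mean((w*x - y)^2) over the shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    # stand-in for an allreduce: every node ends up with the mean
    return sum(values) / len(values)

data = [(float(x), 3.0 * x) for x in range(1, 9)]   # ground truth: y = 3x
shards = [data[i::4] for i in range(4)]             # 4 workers, 2 samples each

w = 0.0
g_parallel = allreduce_mean([grad(w, s) for s in shards])
g_serial = grad(w, data)   # identical to the averaged per-node gradients
```

The equality only holds exactly for equal shard sizes; a weighted average would be needed otherwise.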
Synchronous and Asynchronous Update Strategies
[Figure: left, synchronous workers combining weights via an allreduce followed by a sync step; right, asynchronous workers exchanging updates with a parameter server (async PS)]
• Synchronous:
  • well-understood convergence behavior
  • nodes have to wait for the reduction to complete (stragglers slow everyone down)
  • global (effective) batch size grows with the number of nodes
• Asynchronous:
  • no node waits for any other
  • resilient
  • stale gradients can impact the convergence rate
  • parameter server can become a bottleneck
Hybrid Update
• impact of stragglers reduced compared to fully synchronous mode
• negative impact on stochastic convergence is controlled
• finer control over the total batch size
• group size needs to be tuned
[Figure: G compute groups (compute group 1 … compute group G), each performing a synchronous allreduce and sync step internally, while the groups communicate asynchronously with a parameter server (async PS)]
[Figure: per-layer parameter servers (layer 1 PS … layer N PS); groups 1…G send model updates and receive the new model] (Hadjis et al., Omnivore, 2016)
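A sequential toy simulation of the hybrid scheme, assuming a simple scalar least-squares model: gradients are combined synchronously (allreduce) inside each group, and the groups then apply their updates to the shared weight one after another, modeling asynchronous arrival at the parameter server. Illustrative only; this is not the code used in the paper, and the learning rate is arbitrary.

```python
# Hybrid update sketch: synchronous inside a group, asynchronous across groups.

def allreduce_mean(values):
    return sum(values) / len(values)

def grad(w, shard):
    # d/dw of mean((w*x - y)^2) over the shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(float(x), 2.0 * x) for x in range(1, 9)]   # ground truth: y = 2x
# 2 groups of 2 workers, each worker holding 2 samples
groups = [[data[0:2], data[2:4]], [data[4:6], data[6:8]]]

w, lr = 0.0, 0.01
for group in groups:
    # synchronous allreduce over the group's workers
    g = allreduce_mean([grad(w, shard) for shard in group])
    # asynchronous application at the parameter server: the second group
    # reads the weight already updated by the first (arrival order matters)
    w -= lr * g
```

In the real system the groups run concurrently, so a group's gradient may be computed against weights that are already stale; the sequential loop here only models the arrival order.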
Strong Scaling Results
[Figure: speedup vs. number of nodes (66 cores/node), up to ~1000 nodes, for the HEP model and the climate model; curves for synchronous, hybrid with 2 groups, hybrid with 4 groups, and ideal scaling]
• batch size 2048 per group
• poor strong scaling beyond 512 nodes
• climate model scales better because its compute-to-communication ratio is higher
Weak Scaling Results
[Figure: speedup vs. number of nodes (66 cores/node), up to ~2000 nodes; HEP model with curves for synchronous, hybrid with 2/4/8 groups, and ideal scaling; climate model with curves for synchronous, hybrid with 4/8 groups, and ideal scaling]
• batch size 8 per node
• good weak scaling properties
• climate network shows almost ideal scaling
Impact of Asynchronous Updates
• asynchronous groups:
  • decrease the time per update
  • increase the number of iterations needed to reach the same loss
• the plot shows that performance can still be gained for a reasonable number of async groups
• much better improvement over the worst synchronous case (improved resiliency)
Full System Scale
• HEP model
  • 9594 worker nodes + 6 parameter servers
  • 9 groups, 8528 examples/group
  • 11.4 PFLOP/s sustained, 11.7 PFLOP/s peak performance
  • 1.3x improvement over the baseline in ~12 minutes (improvement in signal efficiency (true positive rate) over standard selection cuts at a given background rejection (false positive rate))
• Climate model
  • 9608 worker nodes + 14 parameter servers
  • 8 groups, 9608 examples/group
  • 13.3 PFLOP/s sustained, 15.1 PFLOP/s peak performance
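Dividing the sustained throughput by the worker count gives the implied per-node rate, which can be compared with the single-node measurements from earlier in the talk; the gap relative to 1.90/2.09 TFLOP/s/node reflects communication and straggler overheads at scale. Figures below are taken from this slide.

```python
# Implied per-node sustained rate at full system scale.
hep_pf, hep_nodes = 11.4, 9594           # HEP: sustained PFLOP/s, worker nodes
climate_pf, climate_nodes = 13.3, 9608   # Climate: sustained PFLOP/s, workers

hep_per_node = hep_pf * 1e3 / hep_nodes              # ~1.19 TFLOP/s/node
climate_per_node = climate_pf * 1e3 / climate_nodes  # ~1.38 TFLOP/s/node
```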
Summary
• presented large-scale deep learning results for two selected science problems
• a combination of synchronous and asynchronous algorithms, on-node optimizations, and tuned network topology is essential
• deep learning is well suited for large-scale HPC systems
• hyperparameter optimization at scale is difficult
Thank you