Deep Learning at 15 PF
Supervised and Semi-supervised Pattern Classification for Scientific Data
Thorsten Kurth, Jian Zhang, Nadathur Satish, Evan Racah, Ioannis Mitliagkas, Md. Mostofa Ali Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, Prabhat, Pradeep Dubey
NIPS17 Long Beach, CA, USA DOI: 10.1145/3126908.3126916
High-Energy Physics (HEP) Finding New Physics Candidates
• Find rare signals of new particles in collisions (events) at the LHC
• Represent data from the cylindrical detector as a sparse 2D (224x224) image
• 3 channels for different instruments (hadron & EM calorimeters, track multiplicity)
• ~10M-event (7.4 TB) simulated dataset
• Existing selections on derived high-level physics variables used as a benchmark
• Binary classification task
(image: atlas.ch)
[Figure: the three detector channels (hadron calorimeter, EM calorimeter fraction, track multiplicity), plotted as rapidity vs. azimuth, for a Standard Model event (top row) and a New Physics event (bottom row)]
HEP Network Architecture
• sparse/lightweight layers
• 3 channels, suitable image dimensions
• total model size: ~2.3 MB
• training: SGD + momentum, per-layer learning rates, weight decay
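As a sanity check on the quoted model size, a quick back-of-envelope calculation converts ~2.3 MB into an approximate parameter count. This assumes 32-bit (4-byte) floating-point weights, which the slide does not state; the exact layer breakdown is not given here either.

```python
# Back-of-envelope: ~2.3 MB of weights -> approximate parameter count.
# Assumes fp32 (4 bytes/parameter); precision is not stated on the slide.
model_bytes = 2.3 * 1024 * 1024   # quoted model size, ~2.3 MB
params = model_bytes / 4          # parameters at 4 bytes each
print(round(params))              # roughly 600K parameters
```

Roughly 600K parameters is small by ImageNet-era standards, consistent with the "lightweight layers" bullet above.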
Climate Science Finding Hurricanes
• Locate and identify extreme weather phenomena in simulated climate data (CAM, 25 km resolution, 30 years)
• image size 768x1152, cropped to 768x768
• 16 channels (temperature, wind speed, pressure, etc.)
• not all images are annotated
• ~400K images (15 TB) of data
• semi-supervised bounding-box regression + classification (7 classes)
Climate Network Architecture
[Figure: encoder-decoder architecture built from 5x5 convolutions. Encoder feature maps: 16x768x1152 → 128x384x576 → 256x192x288 → 512x96x144 → 768x48x72 → 1024x24x36 → 1280x12x18 bottleneck, with 2x12x18 and 4x12x18 detection outputs; the decoder mirrors the encoder back up to 16x768x1152]
• only sparse layers
• regression/detection performed using YOLO (DOI: 10.1109/CVPR.2016.91)
• YOLO loss combined with a Euclidean L2 reconstruction loss from the autoencoder
• training: SGD + momentum/ADAM, weight decay
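The combined semi-supervised objective can be sketched as follows: the autoencoder's Euclidean (L2) reconstruction loss applies to every image, while the YOLO-style detection loss applies only to annotated ones. Here `yolo_loss` is a stand-in placeholder (squared error on box coordinates only, not the full YOLO objective with objectness and class terms), and `lam` is a hypothetical weighting hyperparameter, not a value from the paper.

```python
# Sketch of the combined loss: L2 reconstruction on all images,
# detection loss only where annotations exist (semi-supervised).

def l2_loss(recon, image):
    # Euclidean reconstruction error between decoder output and input
    return sum((r - x) ** 2 for r, x in zip(recon, image))

def yolo_loss(pred_boxes, true_boxes):
    # placeholder for the full YOLO objective (coords + objectness + class)
    return sum((p - t) ** 2
               for pb, tb in zip(pred_boxes, true_boxes)
               for p, t in zip(pb, tb))

def total_loss(recon, image, pred_boxes=None, true_boxes=None, lam=0.5):
    loss = lam * l2_loss(recon, image)        # all images, labeled or not
    if true_boxes is not None:                # annotated images only
        loss += yolo_loss(pred_boxes, true_boxes)
    return loss
```

Unlabeled images simply skip the detection term, which is what lets the unannotated part of the dataset contribute to training.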
Software Stack and Hardware
• Intel® Distribution of Caffe
  • fast DL framework with distributed-computing support
• Intel® Machine Learning Scaling Library (MLSL)
  • communication abstraction layer
  • here: uses cray-mpich as backend with RDMA optimizations
• Intel® Math Kernel Library (Intel® MKL)
  • library with highly optimized DL primitives (soon to be replaced/merged with the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN))
• Cori-KNL HPC system
  • 9688 Intel® Xeon Phi™ 7250 (Knights Landing) processor nodes
  • 96 GB DDR4 + 16 GB MCDRAM memory per node
  • 68 cores with 272 threads; 32 KB L1 per core, 1 MB L2 per 2 cores
  • high-speed Aries interconnect with dragonfly topology
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
Single Node Performance
[Figure: per-layer performance charts for the HEP model and the climate model]
• local batch size of 8
• HEP: 1.90 TFLOP/s/node
• Climate: 2.09 TFLOP/s/node
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Copyright © 2017, Intel Corporation
Single Node Performance Speedup Using Latest Libraries
• local batch size of 8
• HEP: 1.90 TFLOP/s/node in the paper, improved to 3.21 TFLOP/s/node with Winograd optimizations in Intel® MKL-DNN
[Figure: per-layer speedups (roughly 0.5x-3x) for the HEP model with MKL-DNN + Winograd, covering the forward/backward passes of conv1-conv5, pool1-pool2, relu1-relu2, the data layer, and the weight update]
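The Winograd speedup comes from computing small convolution tiles with fewer multiplications. Below is a minimal sketch of the classic F(2,3) algorithm, which produces two outputs of a 3-tap filter using four multiplications instead of six; this illustrates the technique only, and is not Intel MKL-DNN's actual implementation.

```python
# Winograd F(2,3): two outputs of a 3-tap correlation over a 4-sample tile,
# via 4 elementwise multiplications. Standard transform matrices.

BT = [[1,  0, -1,  0],
      [0,  1,  1,  0],
      [0, -1,  1,  0],
      [0,  1,  0, -1]]
G  = [[1.0,  0.0, 0.0],
      [0.5,  0.5, 0.5],
      [0.5, -0.5, 0.5],
      [0.0,  0.0, 1.0]]
AT = [[1, 1,  1,  0],
      [0, 1, -1, -1]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def winograd_f23(d, g):
    """Two correlation outputs of 3-tap filter g over 4-sample tile d."""
    U = matvec(G, g)                    # transformed filter
    V = matvec(BT, d)                   # transformed input tile
    M = [u * v for u, v in zip(U, V)]   # the 4 multiplications
    return matvec(AT, M)                # inverse transform -> 2 outputs

def direct(d, g):
    # reference: plain sliding-window correlation
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

In 2D the same idea is applied along both axes (e.g. F(2x2,3x3), 16 multiplications instead of 36), which is where the large gains on 3x3 convolution layers come from.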
Multi Node Parallelization Scheme
[Figure: data-parallel training pipeline with forward propagation through layers 1…N, back propagation in reverse, and an allreduce of each layer's gradients]
From Pradeep Dubey, "Scaling to Meet the Growing Needs of Artificial Intelligence (AI)", IDF 2016, https://software.intel.com/en-us/articles/scaling-to-meet-the-growing-needs-of-ai
• Use data parallelism for the Solver
• Each node takes part of the data and computes model updates independently without communication
• These updates are then collectively applied to the model
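The scheme above can be illustrated with a toy example: with equally sized shards, averaging the per-node gradients (the allreduce) reproduces the full-batch gradient exactly, which is why synchronous data parallelism behaves like serial SGD with a larger effective batch. This is a pure-Python sketch, not the MLSL API.

```python
# Toy synchronous data-parallel step for a scalar least-squares model y = w*x.

def grad(w, shard):
    # d/dw of mean((w*x - y)^2) over the shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    # stand-in for an allreduce: every node ends up with the mean
    return sum(values) / len(values)

data = [(float(x), 3.0 * x) for x in range(1, 9)]   # ground truth: y = 3x
shards = [data[i::4] for i in range(4)]             # 4 workers, 2 samples each

w = 0.0
g_parallel = allreduce_mean([grad(w, s) for s in shards])
g_serial = grad(w, data)   # identical to the averaged per-node gradients
```

The equality only holds exactly for equal shard sizes; a weighted average would be needed otherwise.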
Synchronous and Asynchronous Update Strategies
[Figure: left, synchronous workers combining weights via an allreduce followed by a sync step; right, asynchronous workers exchanging updates with a parameter server (async PS)]
• Synchronous:
  • well-understood convergence behavior
  • nodes have to wait for the reduction to complete (stragglers slow everyone down)
  • global (effective) batch size grows with the number of nodes
• Asynchronous:
  • no node waits for any other
  • resilient
  • stale gradients can impact the convergence rate
  • parameter server can become a bottleneck
Hybrid Update
• impact of stragglers reduced compared to fully synchronous mode
• negative impact on stochastic convergence is controlled
• finer control over the total batch size
• group size needs to be tuned
[Figure: G compute groups (compute group 1 … compute group G), each performing a synchronous allreduce and sync step internally, while the groups communicate asynchronously with a parameter server (async PS)]
[Figure: per-layer parameter servers (layer 1 PS … layer N PS); groups 1…G send model updates and receive the new model] (Hadjis et al., Omnivore, 2016)
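A sequential toy simulation of the hybrid scheme, assuming a simple scalar least-squares model: gradients are combined synchronously (allreduce) inside each group, and the groups then apply their updates to the shared weight one after another, modeling asynchronous arrival at the parameter server. Illustrative only; this is not the code used in the paper, and the learning rate is arbitrary.

```python
# Hybrid update sketch: synchronous inside a group, asynchronous across groups.

def allreduce_mean(values):
    return sum(values) / len(values)

def grad(w, shard):
    # d/dw of mean((w*x - y)^2) over the shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(float(x), 2.0 * x) for x in range(1, 9)]   # ground truth: y = 2x
# 2 groups of 2 workers, each worker holding 2 samples
groups = [[data[0:2], data[2:4]], [data[4:6], data[6:8]]]

w, lr = 0.0, 0.01
for group in groups:
    # synchronous allreduce over the group's workers
    g = allreduce_mean([grad(w, shard) for shard in group])
    # asynchronous application at the parameter server: the second group
    # reads the weight already updated by the first (arrival order matters)
    w -= lr * g
```

In the real system the groups run concurrently, so a group's gradient may be computed against weights that are already stale; the sequential loop here only models the arrival order.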
Strong Scaling Results
[Figure: speedup vs. number of nodes (66 cores/node), up to ~1000 nodes, for the HEP model and the climate model; curves for synchronous, hybrid with 2 groups, hybrid with 4 groups, and ideal scaling]
• batch size 2048 per group
• poor strong scaling beyond 512 nodes
• climate model scales better because its compute-to-communication ratio is higher
Weak Scaling Results
[Figure: speedup vs. number of nodes (66 cores/node), up to ~2000 nodes; HEP model with curves for synchronous, hybrid with 2/4/8 groups, and ideal scaling; climate model with curves for synchronous, hybrid with 4/8 groups, and ideal scaling]
• batch size 8 per node
• good weak scaling properties
• climate network shows almost ideal scaling
Impact of Asynchronous Updates
• asynchronous groups:
  • decrease the time per update
  • increase the number of iterations needed to reach the same loss
• the plot shows that performance can still be gained for a reasonable number of async groups
• much better improvement over the worst synchronous case (improved resiliency)
Full System Scale
• HEP model
  • 9594 worker nodes + 6 parameter servers
  • 9 groups, 8528 examples/group
  • 11.4 PFLOP/s sustained, 11.7 PFLOP/s peak performance
  • 1.3x improvement over the baseline in ~12 minutes (improvement in signal efficiency (true positive rate) over standard selection cuts at a given background rejection (false positive rate))
• Climate model
  • 9608 worker nodes + 14 parameter servers
  • 8 groups, 9608 examples/group
  • 13.3 PFLOP/s sustained, 15.1 PFLOP/s peak performance
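Dividing the sustained throughput by the worker count gives the implied per-node rate, which can be compared with the single-node measurements from earlier in the talk; the gap relative to 1.90/2.09 TFLOP/s/node reflects communication and straggler overheads at scale. Figures below are taken from this slide.

```python
# Implied per-node sustained rate at full system scale.
hep_pf, hep_nodes = 11.4, 9594           # HEP: sustained PFLOP/s, worker nodes
climate_pf, climate_nodes = 13.3, 9608   # Climate: sustained PFLOP/s, workers

hep_per_node = hep_pf * 1e3 / hep_nodes              # ~1.19 TFLOP/s/node
climate_per_node = climate_pf * 1e3 / climate_nodes  # ~1.38 TFLOP/s/node
```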
Summary
• presented large-scale deep learning results for two selected science problems
• a combination of synchronous and asynchronous algorithms, on-node optimizations, and tuned network topology is essential
• deep learning is well suited for large-scale HPC systems
• hyperparameter optimization at scale is difficult
Thank you