Performance Optimizations for Deep Image Matting...

48
© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Performance Optimizations for Deep Image Matting in Photoshop Betty Leong | Photoshop Engineering Manager, Adobe Salil Tambe | Computer Vision Engineer, Adobe Chris Hebert | DevTech Engineer, Nvidia

Transcript of Performance Optimizations for Deep Image Matting...

Page 1: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Performance Optimizations for Deep Image Matting in PhotoshopBetty Leong | Photoshop Engineering Manager, Adobe

Salil Tambe | Computer Vision Engineer, Adobe

Chris Hebert | DevTech Engineer, Nvidia

Page 2: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Topics

2

▪ Introduction to Matting

▪ Deep Matting vs Photoshop Matting

▪ Deployment Challenges

▪ Optimization

▪ Results and Demo

▪ Conclusion

Page 3: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. 3

Select and Mask in Photoshop

Page 4: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. 4

Select and Mask in Photoshop

Page 5: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Matting in Photoshop

5

Page 6: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. 6

Deep Matting

Image Input Output

Xu et al. Deep Image Matting. CVPR 2017.

Brian Price (GTC 2017)

Page 7: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Topics

▪ Introduction to Matting

▪ Deep Matting vs Photoshop Matting

▪ Deployment Challenges

▪ Optimization

▪ Results

▪ Conclusion

7

Page 8: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. 8

Photoshop Matte Deep Matte

8

Image Image

Photoshop Matte Deep MatteCorrect matting for hairHair, should

be whiteTrimap

Photoshop Matte Deep Matte

Page 9: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. 9

Deep MattePhotoshop Matte

Background grass included

Grass excluded from matte

Trimap

Image

Page 10: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Topics

▪ Introduction to Matting

▪ Deep Matting vs Photoshop Matting

▪ Deployment Challenges

▪ Optimization

▪ Results and Demo

▪ Conclusion

10

Page 11: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. 11

• Resolution (320 x 320)

• Model size (80 MB)

• Memory

• Run time performance

• Cross platform support

Tech Transfer Challenges

Image Credits: Deep Image Matting [Xu et. al; CVPR 2017]

Page 12: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Challenges

▪ Interactive editing with Deep Matting should be possible, but it is computationally expensive!

Matting uses a VGG-Net based Encoder-Decoder network that requires > 1sec and 600mbmemory for inferencing a 320 x 320 image on Intel i-7 8650K CPU using caffe

▪ It should be deployable on all platforms supported by Photoshop

i.e. all combinations of Intel, AMD and Nvidia hardware on Mac and Windows

12

Page 13: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Deep Matting Deep Dive

13

Input trimap

+

Page 14: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Input trimap

+

Deep Matting Deep Dive

14

Page 15: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Inference Per Tile

15

Framework used: Caffe (with CUDA and CUDNN)

Deconv5

Deconv4

Deconv2

Conv2

Conv5

Conv1[320 x 320 x 64]

Conv4

Alpha Prediction

Image Credits: Deep Image Matting [Xu et. al; CVPR 2017]

[160 x 160 x 128]

Conv3[80 x 80 x 256]

[40 x 40 x 512]

Conv4[20 x 20 x 512]

[20 x 20 x 512]

[40 x 40 x 256]

[80 x 80 x 128]

[160 x 160 x 64]

Deconv3

[320 x 320 x 64]

[320 x 320 x 1]

Deconv1

Page 16: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Fine Network

16

+

+

Refinement Network

Final Matte

+

Page 17: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Explorations/Experiments

▪ Collaborate with Nvidia DevTech ProVis Team to come up with better per tile inference performance

▪ Chris Hebert – DevTech Engineer

▪ Inference customization with CuDNN kernels for optimal performance

17

Page 18: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Topics

▪ Introduction to Matting

▪ Deep Matting vs Photoshop Matting

▪ Deployment Challenges

▪ Optimization

▪ Results and Demo

▪ Conclusion

18

Page 19: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Caffe: Inference Timeline(nv prof)

19

160ms

Titan-X

Page 20: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Caffe: CPU Overhead

20

160ms

Titan-X

Layer weights transferred one a time

Page 21: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Caffe: Uses FFT(memory intensive)

21

160ms

Titan-X

FFT

FFT

Page 22: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

cuDNN – A bit like OpenGL for Neural Networks

▪ Networks for inferencing are not difficult to implement with cuDNN

▪ cuDNN Provides a set of common network operations

▪ Convolution

▪ Activation

▪ Tensor Ops – Add, multiply etc

▪ Highly optimized for respective HW architectures

▪ cuDNN is the backend for most frameworks that target NVIDIA Hardware

22

Page 23: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 1: Better memory management

▪ Pre-allocate the max buffer size required for the workspace

23

Page 24: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Output layers size

Conv1_1[320 x 320 x 64] Memory: 25mb

Conv1_2[320 x 320 x 64] Memory: 25mb

Conv2_1[160 x 160 x 128] Memory: 12.5mb

Conv2_2[160 x 160 x 128] Memory: 12.5mb

Conv3_1[80 x 80 x 256] Memory: 6.25mb

Conv3_2[80 x 80 x 256] Memory: 6.25mb

Conv3_3[80 x 80 x 256] Memory: 6.25mb

Conv4_1[40 x 40 x 512] Memory: 3.125mb

Conv4_2[40 x 40 x 512] Memory: 3.125mb

Conv4_2[40 x 40 x 512] Memory: 3.125mb

Conv4_3[20 x 20 x 512] Memory: 0.78125mb

Conv5_1[20 x 20 x 512] Memory: 0.78125mb

Conv5_2[20 x 20 x 512] Memory: 0.78125mb

Deconv1[40 x 40 x 256] Memory: 1.5625mb

Deconv2[80 x 80 x 128] Memory: 3.125mb

Deconv3[160 x 160 x 64] Memory: 6.25mb

Deconv4[320 x 320 x 64] Memory: 25mb

Deconv5[320 x 320 x 1] Memory: 0.390625mb

24

Page 25: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 1(a): Pre-allocate the max buffer size required

Conv1_1[320 x 320 x 64] Memory: 25mb

Conv1_2[320 x 320 x 64] Memory: 25mb

Conv2_1[160 x 160 x 128] Memory: 12.5mb

Conv2_2[160 x 160 x 128] Memory: 12.5mb

Conv3_1[80 x 80 x 256] Memory: 6.25mb

Conv3_2[80 x 80 x 256] Memory: 6.25mb

Conv3_3[80 x 80 x 256] Memory: 6.25mb

Conv4_1[40 x 40 x 512] Memory: 3.125mb

Conv4_2[40 x 40 x 512] Memory: 3.125mb

Conv4_2[40 x 40 x 512] Memory: 3.125mb

Conv4_3[20 x 20 x 512] Memory: 0.78125mb

Conv5_1[20 x 20 x 512] Memory: 0.78125mb

Conv5_2[20 x 20 x 512] Memory: 0.78125mb

Deconv1[40 x 40 x 256] Memory: 1.5625mb

Deconv2[80 x 80 x 128] Memory: 3.125mb

Deconv3[160 x 160 x 64] Memory: 6.25mb

Deconv4[320 x 320 x 64] Memory: 25mb

Deconv5[320 x 320 x 1] Memory: 0.390625mb

25

Page 26: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 1: Better memory management

▪ Pre-allocate the max buffer size required for the workspace

26

Conv1_1

A 25mb

B 25mb

Memory Pool2x Largest TensorConv1_2

Conv1_3

Pool1

Conv2_1

A

B

A

B

Page 27: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 1: Better memory management

▪ Pre-allocate the max buffer size required for the workspace

▪ Carefully choose convolution algorithm for performance and memory requirements

▪ FFT – Fast, but requires a lot of device workspace memory

▪ GEMM – In place general matrix multiply

▪ Winograd – fast, but unstable for large filter sizes.

27

Page 28: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 1: Better memory management

▪ Pre-allocate the max buffer size required for the workspace

▪ Carefully choose convolution algorithm for performance and memory requirements

▪ FFT – Fast, but requires a lot of device workspace memory

▪ GEMM – In place general matrix multiply

▪ Winograd – fast, but unstable for large filter sizes.

▪ Convolution algorithm chosen on a per layer basis

▪ According to per layer constraints

28

Page 29: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 1: Better memory management

▪ Pre-allocate the max buffer size required for the workspace

▪ Carefully choose convolution algorithm for performance and memory requirements

▪ FFT – Fast, but requires a lot of device workspace memory

▪ GEMM – In place general matrix multiply

▪ Winograd – fast, but unstable for large filter sizes.

▪ Convolution algorithm chosen on a per layer basis

▪ According to per layer constraints

▪ Share max per layer workspace memory between all layers

▪ Re-use the buffer for the computation of each layer

29

Page 30: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Conv1_1[3 x 3 x 4 x 64] Memory: 0.00878906mb

Conv1_2[3 x 3 x 64 x 64] Memory: 0.140625mb

Conv2_1[3 x 3 x 64 x 128] Memory: 0.28125mb

Conv2_2[3 x 3 x 128 x 128] Memory: 0.5625mb

Conv3_1[3 x 3 x 128 x 256] Memory: 1.125mb

Conv3_2[3 x 3 x 256 x 256] Memory: 2.25mb

Conv3_3[3 x 3 x 256 x 256] Memory: 2.25mb

Conv4_1[3 x 3 x 256 x 512] Memory: 4.5mb

Conv4_2[3 x 3 x 512 x 512] Memory: 9mb

Conv4_3[3 x 3 x 512 x 512] Memory: 9mb

Conv5_1[3 x 3 x 512 x 512] Memory: 9mb

Conv5_2[3 x 3 x 512 x 512] Memory: 9mb

Deconv5[5 x 5 x 512 x 512] Memory: 25mb

Deconv4[5 x 5 x 512 x 256] Memory: 12.5mb

Deconv3[5 x 5 x 256 x 128] Memory: 3.125mb

Deconv2[5 x 5 x 128 x 64] Memory: 0.78125mb

Deconv1[5 x 5 x 64 x 64] Memory: 0.390625mb

Optimization 1(b): Load the entire model in one go

30

Titan-X

Weights + biases ~ 87 MB

Page 31: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Conv1_1[3 x 3 x 4 x 64] Memory: 0.59mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

Conv1_2[3 x 3 x 64 x 64] Memory: 0.25mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv2_1[3 x 3 x 64 x 128] Memory: 0.50mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv2_2[3 x 3 x 128 x 128] Memory: 1mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv3_1[3 x 3 x 128 x 256] Memory: 2mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv3_2[3 x 3 x 256 x 256] Memory: 4mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv3_3[3 x 3 x 256 x 256] Memory: 4mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv4_1[3 x 3 x 256 x 512] Memory: 8mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv4_2[3 x 3 x 512 x 512] Memory: 16mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv4_3[3 x 3 x 512 x 512] Memory: 16mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

Conv5_1[3 x 3 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM

Conv5_2[3 x 3 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM

Deconv5[5 x 5 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

Deconv4[5 x 5 x 512 x 256] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM

Deconv3[5 x 5 x 256 x 128] Memory: 0.04mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

Deconv2[5 x 5 x 128 x 64] Memory: 0.15mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

Deconv1[5 x 5 x 64 x 64] Memory: 0.59mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

Optimization 1(c): Choose optimal convolution algorithm

31

Titan-X

Page 32: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 1

32

17ms41msStart-up time (one time overhead)

Titan-X

Page 33: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 1: Results

33

Image size Caffe Optimization 1 %Reduction

320 x 320 588mb 159mb 73%

640 x 640 4113mb 323mb 92%

960 x 960 8778mb 643mb 92.7%

1280 x 1280 Cannot run 977mb -

Memory

Titan-V

Image size Caffe Optimization 1 %Reduction

320 x 320 210ms 20ms 95%

640 x 640 540ms 68ms 87.4%

960 x 960 1.04sec 153ms 85%

1280 x 1280 cannot run 261ms -

Performance

Page 34: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 2: Use FP16 instead of FP32

▪ All the weights for convolution and de-convolution were converted to float16.

▪ Volta has hardware for FAST fp16 – TRUE_HALF_CONFIG

▪ On Pascal and below, store in fp16 but process in fp32 – PSEUDO_HALF_CONFIG

▪ Pooling and un-pooling indices were stored with 8 bits(for 2 x 2 kernel size).

34

Page 35: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 2: Use FP16 instead of FP32

▪ Results slightly different -> retrain with FP16

35

FP32 FP16 Abs difference

x100

Page 36: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 2: Results

36

1 FP32 FP16 %Reduction

320 x 320 20ms 13ms 35%

640 x 640 68ms 35ms 50.7%

960 x 960 153ms 87ms 43.1%

1280 x 1280 261ms 153ms 41.4%

Titan-V

Image size FP32 FP16 %Reduction

320 x 320 159mb 99mb 37.7%

640 x 640 323mb 210mb 35%

960 x 960 643mb 361mb 43.8%

1280 x 1280 977mb 559mb 42.8%

Performance

Memory

Titan-V

Page 37: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 3: Use Tensor Core on Volta

▪ Tensor Core performs half matrix multiply accumulate (HMMA)

▪ cuDNN 7.0 has optimizations for HMMA

▪ Convolutions must use

▪ CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED

▪ CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

▪ X,y,w tensors must be FP16

▪ Input and output filter maps must be multiple of 8 for alignment

37

Page 38: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 3: Use Tensor Core on Volta

Mixed Precision Matrix Math4x4 matrices

D = AB + C

D =

FP16 or FP32 FP16 FP16 FP16 or FP32

A0,0 A0,1 A0,2 A0,3

A1,0 A1,1 A1,2 A1,3

A2,0 A2,1 A2,2 A2,3

A3,0 A3,1 A3,2 A3,3

B0,0 B0,1 B0,2 B0,3

B1,0 B1,1 B1,2 B1,3

B2,0 B2,1 B2,2 B2,3

B3,0 B3,1 B3,2 B3,3

C0,0 C0,1 C0,2 C0,3

C1,0 C1,1 C1,2 C1,3

C2,0 C2,1 C2,2 C2,3

C3,0 C3,1 C3,2 C3,3

Page 39: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 3: Results

39

Image size FP16 FP16(with TensorCore) %Reduction

320 x 320 13ms 5ms 61.5%

640 x 640 35ms 15ms 57.4%

960 x 960 87ms 32ms 63.2%

1280 x 1280 153ms 53ms 65.3%

Titan-V

Image size FP16 FP16(with TensorCore) %Increase

320 x 320 99mb 111mb 12%

640 x 640 210mb 262mb 24.7%

960 x 960 361mb 533mb 47.6%

1280 x 1280 559mb 914mb 63.5%

Titan-V

Performance

Memory

Page 40: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 4: Network Fusing

▪ Original Caffe implementation used 2 networks:

▪ Coarse matting : VGG16 autoencoder

▪ Fine matting : shallow 4 layer cnn

▪ Input A : Original mean subtracted RGB as per course network

▪ Input B : Output of coarse network scaled back to 0:255 and mean subtracted.

40

Conv

Conv

Conv

Deconv

Deconv

Conv

Conv

Deconv

Mean Subtracted RGB + TrimapCP

UG

PU

CP

U

Mean Subtracted

Rescale coarse output

Coarse Network Fine Network

Final Output

Page 41: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 4: Network Fusing

▪ Original Caffe implementation used 2 networks:

▪ Coarse matting : VGG16 autoencoder

▪ Fine matting : shallow 4 layer CNN

▪ Input A : Original mean subtracted RGB as per course network

▪ Input B : Output of coarse network scaled back to 0:255 and mean subtracted.

▪ Causes unnecessary driver overhead copying to and from the CPU

▪ Pre and post processing can be done on the GPU

41

Page 42: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 4: Network Fusing

▪ Treat both networks as a single network.

▪ Keep mean subtracted RGB on GPU.

▪ Treat coarse output post processing as a custom network layer.

42

Conv

Conv

Conv

Deconv

Deconv

Trimap

CP

UC

oar

se

Preprocess

Postprocess

RGB

Conv

Conv

Deconv

Fin

e

Final Ouptut

Everything Stays On The GPU

Page 43: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Optimization 5: Layer fusing

▪ Some layer operations can be fused

▪ Convolution

▪ Bias Add

▪ Activation (eg. Relu)

▪ Advantages

▪ Reduces kernel launch overhead

▪ Some arithmetic operations can be combined (eg. FMAD)

▪ cuDNN as a combined version of Convolution+Bias+Activation

▪ cudnnStatus_t cudnnConvolutionBiasActivationForward(…)

▪ TensorRT will find the best fused configuration at serialization time

43

Page 44: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Topics

▪ Introduction to Matting

▪ Deep Matting vs Photoshop Matting

▪ Deployment Challenges

▪ Optimization

▪ Results and Demo

▪ Conclusion

44

Page 45: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Caffe vs Caffe2 vs Our Optimizations

45

Titan-V

210

540

1040

0

76

195

375

605

5 15 32 53

0

200

400

600

800

1000

1200

320 X 320 640 x 640 960 x 960 1280 x 1280

Tim

e(m

s)

Caffe Caffe2 Cudnn optimized

Performance

Page 46: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Caffe vs Caffe2 vs Our Optimizations

46

Titan-V

588

4113

8778

0235

1645

3582

6369

111 262533

914

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

320 X 320 640 x 640 960 x 960 1280 x 1280

Me

mo

ry(m

b)

Caffe Caffe2 Cudnn optimized

Memory

Page 47: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Summary

47

✓Do better memory management

✓Use optimal algorithm for convolution based on image and filter size(gemm/fft/winograd)

✓Do inference at lower precision(fp16 or uint8) if possible

✓Use hardware-specific optimizations(ex. HMMA on TensorCore)

✓Do layer fusion

Page 48: Performance Optimizations for Deep Image Matting …on-demand.gputechconf.com/gtc/2018/presentation/s8550...Performance Optimizations for Deep Image Matting in Photoshop Betty Leong

© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Conclusion and Future Work

48

▪ Explore WindowsML/DirectML for optimized cross platform inference

▪ Try TensorRT, Nvidia’s latest solution for optimized high performance inference