
Across-Stack Profiling and Characterization of State-of-the-art Machine Learning Models on GPU

Extended Abstract

ABSTRACT

The past few years have seen a surge in the use of Machine Learning (ML) and Deep Learning (DL) algorithms for traditional HPC tasks such as feature detection, numerical analysis, and graph analytics. While ML and DL help solve HPC tasks, their adoption has been hampered in part by the difficulty of understanding ML/DL models and their interactions with system utilization. Optimizing these algorithms requires characterizing their performance and resource utilization across the hardware/software (HW/SW) stack, but the lack of easy-to-use tools to automate the process and the reliance on researchers to perform manual characterization remain bottlenecks. To alleviate this, we propose an across-stack profiling scheme and integrate it within MLModelScope — a hardware- and software-agnostic tool for evaluating and benchmarking ML/DL at scale. We demonstrate the across-stack profiling and characterization functionality through the evaluation of state-of-the-art ML/DL models and present insights that are only made possible through this design.

1 INTRODUCTION

Every day, increasingly diverse Machine Learning (ML) and Deep Learning (DL) algorithms and workloads (collectively referred to as ML models) are introduced. These ML models appear at such a pace that researchers are hard-pressed to systematically analyze and study their performance and impact on system optimization. The major difficulty is the complex nature of these ML models, where performance is impacted by the interplay between frameworks, system libraries, compilers, and hardware platforms (or HW/SW stacks). We observe that the inability to rapidly characterize state-of-the-art model performance is partly due to the lack of tooling that allows researchers to introspect model performance across the HW/SW stack — while still being agile enough to cope with the diverse and fast-paced nature of the ML landscape.

The current practice of measuring and profiling ML models is cumbersome. It involves a concoction of tools, each aimed at capturing ML model performance characteristics at a different level of the HW/SW stack. Full-stack profiling thus means using multiple tools and (hopefully automatically) stitching their outputs together. A profiling tool that captures ML model characteristics at different granularities, coupled with automated aggregation and summarization of the results, would boost the productivity of researchers and help them understand model/system performance and identify bottlenecks.

We propose an across-stack profiling scheme and its integration with MLModelScope [5] — a HW/SW-agnostic platform for evaluating and benchmarking ML models at scale. We couple the profiling capability with automatic analyses that reveal insights which cannot be obtained easily through other tools or methods. Using our design, we characterized the model/layer/GPU kernel performance of several state-of-the-art models, and demonstrate our tool's capability to introspect model execution at different levels of the HW/SW stack, identify bottlenecks, and systematically compare model or system offerings.

ID  Name                       Peak Throughput (inputs/s)  Batch Size
1   MobileNet-v1               2585.5                      128
2   ResNet50-v1.5              996.3                       256
3   SSD-MobileNet-v1-300x300   35.5                        64
4   SSD-ResNet34-1200x1200     11.34                       1
5   Densenet-121               944.8                       128
6   ResNet152-v1               468.5                       256
7   Faster-RCNN-ResNet50       16.8                        4
8   Mask-RCNN-ResNet50-v2      4.4                         1

Table 1: Eight models from the MLPerf, AI-Matrix, and TensorFlow model zoos were selected for evaluation. We measured the peak throughput achieved on Amazon P3 and the corresponding batch size.

Figure 1: MLPerf ResNet50 v1.5 throughput across batch sizes.


This poster highlights the results for MLPerf ResNet50-v1.5; further results for all models are shown at mlmodelscope-sc19.netlify.com. The tool will also be demoed during the poster session.

2 ACROSS-STACK PROFILING AND CHARACTERIZATION

We extended MLModelScope to capture performance characteristics at different HW/SW abstraction levels — application, model, layer, GPU kernel, and hardware event. We focus our discussion on the model/layer/GPU kernel levels in this poster. To measure performance at model granularity, MLModelScope measures the time spent running the inference C API (TF_SessionRun for TensorFlow) within the inference pipeline. To capture layer timing, MLModelScope leverages the frameworks' existing profiling capabilities (RunOptions.TraceLevel for TensorFlow) and converts the framework profiles into MLModelScope's timing format. Finally, to obtain GPU kernel profiles, MLModelScope integrates with the NVIDIA CUDA Profiling Tools Interface (CUPTI) [4] library to capture CUDA API calls and events. The captured performance timeline (referred to as a "trace") can be processed on the fly or sent to a database for subsequent analyses.
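To make the layer-capture step concrete, the following is a minimal sketch (not MLModelScope's actual implementation) of collecting a layer-level trace through the TensorFlow 1.x RunOptions.TraceLevel mechanism mentioned above. The frozen-graph file name and tensor names are hypothetical placeholders.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API
from tensorflow.python.client import timeline

# Hypothetical frozen graph and tensor names; substitute the model under test.
graph_def = tf.GraphDef()
with tf.gfile.GFile("resnet50_v1_5_frozen.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

batch = np.random.rand(256, 224, 224, 3).astype(np.float32)

with tf.Session(graph=graph) as sess:
    # Enable the framework's built-in tracing (RunOptions.TraceLevel).
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    # The wall-clock time of this call corresponds to the model-granularity
    # measurement (TF_SessionRun); run_metadata holds the per-layer timings.
    sess.run("softmax_tensor:0",
             feed_dict={"input_tensor:0": batch},
             options=run_options,
             run_metadata=run_metadata)

    # Convert the framework profile into a Chrome-trace-style timeline,
    # analogous to converting it into MLModelScope's timing format.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("layer_trace.json", "w") as out:
        out.write(trace.generate_chrome_trace_format())
```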

We chose a set of state-of-the-art ML models (Table 1) for evaluation. Models 1–4 are from the MLPerf Inference v0.5 release [7], models 5–6 are from AI-Matrix [1], and models 7–8 are from the TensorFlow model zoo [9].


Figure 2: Layer occurrence percentage
Figure 3: Layer aggregated latency
Figure 4: Latency per layer
Figure 5: Allocated memory per layer
Figure 6: GPU and CPU latency per layer
Figure 7: Kernel flops per layer
Figure 8: Kernel DRAM reads per layer
Figure 9: Kernel DRAM writes per layer

We ran these models using NVIDIA's NGC [8] TensorFlow container v19.06 on an Amazon P3 [2] instance. The P3 instance is equipped with a Tesla V100-SXM2 GPU, which has a peak throughput of 15.7 TFLOPS and 900 GB/s global memory bandwidth.

Using the collected data, we demonstrate analyses that are either impossible or hard to perform using other tools. The results are grouped into three categories based on the characterization granularity. We present ResNet50-v1.5 from MLPerf Inference v0.5 [7] as an example. When discussing the layer and kernel characteristics, we show results for batch size 256, at which ResNet50-v1.5 achieves its peak throughput on this system. Full results are available on the aforementioned website.

Model Level Analysis — Using only model-level performance, we can compute (1) a model's throughput and latency. Figure 1 shows the throughput for ResNet50-v1.5 across different batch sizes, with the peak throughput shown in Table 1. As can be seen, the GPU throughput saturates at batch size 256; we then perform layer- and kernel-level analyses to further understand the performance.
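As an illustration of this model-level computation, the sketch below sweeps batch sizes and derives throughput from the measured latency; measure_latency is a hypothetical helper that runs one inference at the given batch size and returns its latency in seconds.

```python
def throughput_sweep(measure_latency,
                     batch_sizes=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Compute throughput (inputs/s) from model-level latency at each batch size."""
    results = {}
    for bs in batch_sizes:
        latency_s = measure_latency(bs)  # seconds for one batch at this size
        results[bs] = bs / latency_s     # inputs processed per second
    return results

# The peak throughput reported in Table 1 is then max(results.values()),
# and the corresponding batch size is the one at which that maximum occurs.
```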

Layer Level Analysis — By profiling the layers within a model, we can capture (2) the number of times each layer operator is executed (Figure 2) and (3) the aggregated latency or memory information of the executed layers (Figure 3). Using the same evaluation data, both (4) per-layer latency (Figure 4) and (5) per-layer memory usage (Figure 5) give insights into the latency and memory demands of each layer.
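A sketch of how such layer-level aggregates (Figures 2 and 3) can be computed is shown below, assuming the layer trace was saved in Chrome trace format as in the earlier sketch; the field names follow that format rather than MLModelScope's internal schema.

```python
import json
from collections import Counter, defaultdict

with open("layer_trace.json") as f:
    events = json.load(f)["traceEvents"]

occurrences = Counter()          # times each operator executed (Figure 2)
latency_us = defaultdict(float)  # aggregated latency per operator (Figure 3)

for ev in events:
    if ev.get("ph") == "X" and "dur" in ev:  # "complete" events carry a duration
        op = ev.get("name", "unknown")
        occurrences[op] += 1
        latency_us[op] += ev["dur"]          # duration in microseconds

print(occurrences.most_common(10))
print(sorted(latency_us.items(), key=lambda kv: -kv[1])[:10])
```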

GPU Kernel Level Analysis — For fine-grained performance analyses, we look at the ML model at the GPU kernel level. The (6) measured flops and DRAM reads and writes for each kernel can be correlated with each layer (shown in Figures 7, 8, and 9 respectively) or (7) aggregated to compute these metrics for the entire model. The kernel latency can be correlated with the layer timing to compute (8) the time not spent within GPU computation. Figure 6 shows that a non-negligible time is spent on the CPU for each layer computation. This hints at potential optimizations within TensorFlow for kernel fusion and scheduling.
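The sketch below illustrates one way to derive (8), the time not spent in GPU computation, by correlating layer spans with the GPU kernel spans captured through CUPTI; the event dictionaries and their fields are hypothetical and assume kernels that do not overlap one another on a common clock.

```python
def cpu_gap_per_layer(layers, kernels):
    """For each layer span, subtract the time covered by GPU kernels launched
    within that span from the layer's wall-clock latency.

    layers, kernels: lists of dicts with "name", "start", "end" timestamps
    (microseconds) on a common clock; kernels assumed non-overlapping.
    """
    gaps = {}
    for layer in layers:
        span = layer["end"] - layer["start"]
        gpu_time = sum(
            min(k["end"], layer["end"]) - max(k["start"], layer["start"])
            for k in kernels
            if k["start"] < layer["end"] and k["end"] > layer["start"]
        )
        gaps[layer["name"]] = span - gpu_time  # time not spent in GPU computation
    return gaps
```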

We further leverage the GPU kernel level analyses to (9) compute the arithmetic intensity and classify a model, layer, or kernel as either compute- or memory-bound. For ResNet50-v1.5 we find the most time-consuming GPU kernel (volta_scudnn_128x64_relu_interior_nn_v1 from the cuDNN [3] library) to be compute-bound; it takes 28.7% of the overall model latency. The second and third most time-consuming GPU kernels, scalar_product_op and scalar_sum_op, are from the Eigen [6] library and are memory-bound; they take 11.1% and 10.3% of the model latency respectively. The arithmetic intensity of each layer confirms that Conv operators are compute-bound whereas other layers such as Add, Mul, and Relu are memory-bound. Overall, ResNet50-v1.5 is compute-bound.
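This classification follows a roofline-style argument: a kernel's (or layer's, or model's) arithmetic intensity is compared against the machine balance of the V100 in the P3 instance, roughly 15.7 TFLOPS / 900 GB/s ≈ 17.4 flops per byte. A minimal sketch of that computation, using the per-kernel flops and DRAM traffic captured at the GPU kernel level, is shown below (a hedged illustration, not MLModelScope's exact implementation):

```python
PEAK_FLOPS = 15.7e12       # V100 peak single-precision flops/s (Section 2)
PEAK_BANDWIDTH = 900e9     # V100 peak global memory bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BANDWIDTH  # ~17.4 flops/byte

def classify(flops, dram_read_bytes, dram_write_bytes):
    """Classify a kernel/layer/model as compute- or memory-bound from its
    measured flop count and DRAM traffic."""
    bytes_moved = dram_read_bytes + dram_write_bytes
    intensity = flops / bytes_moved if bytes_moved else float("inf")
    bound = "compute-bound" if intensity > MACHINE_BALANCE else "memory-bound"
    return intensity, bound
```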

This poster highlights some of the analyses and insights that can be derived from MLModelScope. We encourage the reader to visit mlmodelscope-sc19.netlify.com to find characterizations of the other models in Table 1.

3 CONCLUSION

A big hurdle in deploying or optimizing ML workloads within the HPC ecosystem is understanding their performance characteristics. We extended MLModelScope with across-stack profiling capability and highlighted 9 types of analyses and conclusions that are automatically computed. MLModelScope helps researchers gain insights into the current sources of inefficiency in ML models and systems. We are currently working on using the data captured by MLModelScope to give users suggestions on the model/system to use for a dataset, and to perform intelligent scheduling within HPC clusters.



REFERENCES

[1] AI-Matrix 2019. AI-Matrix. https://github.com/alibaba/ai-matrix. (2019). Accessed: 2019-07-04.
[2] Amazon EC2 P3 2019. Amazon EC2 P3 Instances. https://aws.amazon.com/ec2/instance-types/p3/. (2019). Accessed: 2019-07-04.
[3] cuDNN 2018. NVIDIA cuDNN. https://developer.nvidia.com/cudnn. (2018). Accessed: 2019-07-04.
[4] CUPTI 2018. The CUDA Profiling Tools Interface. https://developer.nvidia.com/cuda-profiling-tools-interface. (2018). Accessed: 2019-07-04.
[5] Abdul Dakkak, Cheng Li, Jinjun Xiong, and Wen-Mei Hwu. 2019. Frustrated with Replicating Claims of a Shared Model? A Solution. arXiv preprint arXiv:1811.09737 (2019).
[6] Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org. (2010).
[7] mlperf 2019. MLPerf Inference. https://github.com/mlperf/inference. (2019). Accessed: 2019-07-04.
[8] NGC 2019. NVIDIA GPU-Accelerated Containers. https://www.nvidia.com/en-us/gpu-cloud/containers/. (2019). Accessed: 2019-07-04.
[9] tfzoo 2019. TensorFlow model zoo. https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md. (2019). Accessed: 2019-07-04.
