June 2019
DCGM OVERVIEW AND HANDS ON
2
TOOLS FOR MANAGING GPUs
Out-of-Band
GPU Metrics and Monitoring via BMC (SMBPBI)
Provides metrics (thermals, power, etc.) without the NVIDIA driver
Typically used by public CSPs (i.e., multi-tenant environments)
In-Band
Tools use the NVIDIA driver to provide GPU and NVSwitch metrics
DCGM and NVML (nvidia-smi) are in-band tools
Typically used in single-tenant environments
3
NVIDIA IN-BAND TOOLS ECOSYSTEM
NVML
▶ Customers building their own GPU metrics/monitoring stack using NVML
DCGM
▶ Customers integrating DCGM; CSPs for system validation
3rd Party Tools
▶ Cluster managers, job schedulers, TSDBs, visualization tools
4
HOW SHOULD I MANAGE MY GPUS?
NVML
▶ Stateless queries; can only query current data
▶ Low overhead while running, high overhead to develop
▶ Low-level control of GPUs
▶ Management app must run on the same box as the GPUs
DCGM
▶ Can query a few hours of metrics
▶ Provides health checks and diagnostics
▶ Can batch queries/operations to groups of GPUs
▶ Can be remote or local
3RD PARTY TOOLS
▶ Provide database, graphs, and a nice UI
▶ Need management node(s)
▶ Development already done; you just have to configure the tools
5
DATA CENTER GPU MANAGER (DCGM)
POLICY AND ALERTING
▶ Pre-configured Policies
▶ Job Level Statistics
▶ Stateful Configuration
GPU DIAGNOSTICS
▶ Software Deployment Tests
▶ Stress Tests
▶ Hardware Issues and Interface Tests (PCIe, NVLink)
CONFIGURATION MANAGEMENT
▶ Dynamic Power Capping
▶ Synchronous Clock Boost
▶ Fixed Clocks
ACTIVE HEALTH MONITORING
▶ Runtime Health Checks
▶ Prologue Checks
▶ Epilogue Checks
6
https://developer.nvidia.com/data-center-gpu-manager-dcgm
GPU Management in the Accelerated Data Center
DCGM OVERVIEW
Supported NVIDIA Hardware
● Fully supported on Tesla GPUs (Kepler+)
● Supported on Quadro, GeForce, and Titan GPUs (Maxwell+, since v1.3)
● Supports NVSwitch and DGX-2
● Driver R384 or later (Linux only)
SDK Installer Packages
● .deb and .rpm packages
● Includes binaries: CLI (dcgmi) and daemon (nv-hostengine)
● Libraries and headers (includes NVML)
● C and Python bindings and code samples
● Documentation: user guides and API docs
Latest Release: v1.5.6
7
AVAILABLE NVIDIA MANAGEMENT TOOLS
Software Stack
NVML
NVIDIA Driver
CUDA
Data Center GPU Manager (DCGM)
▶ Additional diagnostics (aka NVVS) and active health monitoring
▶ Policy management and more
NVIDIA Management Library (NVML)
▶ Low-level control of GPUs
▶ Included as part of the driver
▶ Header is part of CUDA Toolkit / DCGM
DCGM Daemon
DCGM-Based 3rd Party Tools
DCGMI
Client Lib Client Lib
GPU
Diagnostics
(NVVS)
8
Two ways to use DCGM
APIs
Accessible programmatically through APIs
C and Python supported
Embeddable as a library within third-party management tools
Python bindings also enable automation scripts
CLI
Command line interface: dcgmi
Simple and interactive
Favored by end users and system admins
Equivalent functionality is provided through both the APIs and the CLI
9
Two modes of running DCGM
Embedded mode
DCGM is loaded as a shared library by a third-party agent
The third-party agent periodically triggers data gathering and management activities
Standalone mode
DCGM is embedded in a daemon called the NVIDIA Host Engine
Suits clients that prefer to interact with a daemon
Suits multiple clients interacting with DCGM, not just one node agent
Suits users who want to use the CLI tool, dcgmi
Standalone mode is widely used for its flexibility and low maintenance cost.
10
Package Selection
▪ Linux: x64 and POWER
▪ Ubuntu (deb) and CentOS/RHEL (rpm)
▪ Normal version used for servers without NVSwitch
▪ FM version used for servers with NVSwitch, such as DGX-2/HGX-2
11
Install and Quick Start
▪ $ dpkg -i *.deb   (or: $ rpm -ivh *.rpm)
▪ $ nv-hostengine
▪ Parameter -b allows other hosts to access this daemon
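A minimal start-up sketch (the bind address is illustrative; adjust it for your network):
# Start the daemon and let other hosts connect to it
$ nv-hostengine -b 0.0.0.0
# Verify it is answering locally
$ dcgmi discovery -l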
12
Devices for Management
❑ Host: --host
❑ Group: -g
❑ GPU: --gpuid
Usage: dcgmi discovery
dcgmi discovery [--host <IP/FQDN>] -l
dcgmi discovery [--host <IP/FQDN>] -i <flags> [-g <groupId>] [-v]
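For example (the remote address is a hypothetical placeholder):
# List the GPUs known to the local host engine
$ dcgmi discovery -l
# Query a remote host engine instead
$ dcgmi discovery --host 10.0.0.2 -l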
13
DCGM CLI Usage
Group · Dmon · Policy · Job Stats · Health & Diagnostics · Topology · NVLINK
14
Group
15
Groups in DCGM
Groups: almost all DCGM operations take place on groups. Users can create, modify, and destroy collections of GPUs on the local node, using these constructs to control all subsequent DCGM activities.
Partitioned groups, consisting of only a subset of GPUs, are useful for job-level concepts such as job stats and health.
16
Groups: create, list, delete
Managing groups is simple with the "dcgmi group" subcommand.
"dcgmi group -d <groupId>" deletes a group.
"dcgmi group -h" lists more detailed usage.
17
Add GPUs to a group
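A minimal sketch (the group name, group ID 2, and GPU IDs are illustrative):
# Create an empty group named "job1"; the new group ID is printed
$ dcgmi group -c job1
# Add GPUs 0 and 1 to group 2
$ dcgmi group -g 2 -a 0,1
# Show the group's contents
$ dcgmi group -g 2 -i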
18
Health & Diag
19
Health: check the health status of selected subsystems
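A sketch of enabling watches and then polling them (the flag 'a' requests all watch subsystems; see "dcgmi health -h" for the full flag list):
# Enable background health watches for everything on group 1
$ dcgmi health -g 1 -s a
# Report the current health status of group 1
$ dcgmi health -g 1 -c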
20
ACTIVE HEALTH MONITORING & ANALYSIS
NON-INVASIVE CHECKS
Real-time monitoring and an aggregated health indicator
Checks the health of all GPU and NVSwitch subsystems:
PCIe, ECC, InfoROM, Power, Thermal, NVLink
dcgmi health --check -g 1
Health Monitor Report
+------------------+---------------------------------------------------------+
| Overall Health: Healthy |
+==================+=========================================================+
Run Health Check : Healthy System
dcgmi health -g 1 -c
Health Monitor Report
+----------------------------------------------------------------------------+
| Group 1 | Overall Health: Warning |
+==================+=========================================================+
| GPU ID: 0 | Warning |
| | PCIe system: Warning - Detected more than 8 PCIe |
| | replays per minute for GPU 0: 13 |
+------------------+---------------------------------------------------------+
| GPU ID: 1 | Warning |
| | InfoROM system: Warning - A corrupt InfoROM has been |
| | detected in GPU 1. |
+------------------+---------------------------------------------------------+
Run Health Check : System with problems
21
COMPREHENSIVE DIAGNOSTICS
ACTIVE HEALTH CHECKS
Identification, recovery, and isolation of failed GPUs and NVSwitches
Diagnostics to root-cause failures; pre- and post-job GPU health checks
Stress tests that exercise performance, bandwidth, power, and thermal characteristics
Multi-level diagnostic options ranging from a few seconds to several minutes
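The run level selects the depth of the checks; a sketch (durations are approximate and vary by GPU):
$ dcgmi diag -r 1   # quick software/deployment sanity checks (seconds)
$ dcgmi diag -r 2   # adds medium-length hardware tests (minutes)
$ dcgmi diag -r 3   # full suite including stress tests (longest)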
dcgmi diag -r 3
+---------------------------+-------------+
| Diagnostic | Result |
+===========================+=============+
|----- Deployment --------+-------------|
| Blacklist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| CUDA Toolkit Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Environment Variables | Pass |
| Page Retirement | Pass |
| Graphics Processes | Pass |
| Inforom | Pass |
+----- Hardware ----------+-------------+
| GPU Memory | Pass - All |
| Diagnostic | Pass - All |
+----- Integration -------+-------------+
| PCIe | Pass - All |
+----- Stress ------------+-------------+
| SM Stress | Pass - All |
| Targeted Stress | Pass - All |
| Targeted Power | Warn - All |
| Memory Bandwidth | Pass - All |
+---------------------------+-------------+
22
Policy
23
Policy: define what to do when a condition (-T temperature, -P power, -e ECC error, -n NVLink error, -x XID error) occurs
24
FLEXIBLE GPU GOVERNANCE POLICIES
With Existing Tools
▶ Continuous monitoring by the user
▶ Identify GPUs with double-bit errors
▶ Manually perform a GPU reset to correct problems
Using DCGM
▶ Auto-detects double-bit errors, performs a GPU reset, and notifies the user
Condition · Action · Notification
Condition: watch for DBEs. Action: reset the GPU. Notification: callback.
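A sketch of setting such a policy from the CLI (the action/validation codes follow "dcgmi policy -h"; group 1 is illustrative):
# On group 1, watch for double-bit ECC errors (-e);
# action 1 = GPU reset, validation 1 = short system validation afterwards
$ dcgmi policy -g 1 --set 1,1 -e
# Block and print notifications as violations occur
$ dcgmi policy -g 1 --reg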
25
Job stats
26
MANAGING JOB LIFECYCLE
Which GPUs did my job run on?
How much of each GPU did my job use?
Were there any error or warning conditions during my job (ECC errors, clock throttling, etc.)?
Are the GPUs healthy and ready for the next job?
Create a GPU group and check its health
Start job stats
Run the job
Stop job stats
Display job stats
(the dcgmi command sequence is sketched below)
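A sketch of that sequence (group ID 2 and the job name "demojob" are illustrative):
# Enable stats recording for group 2
$ dcgmi stats -g 2 -e
# Mark the start of the job
$ dcgmi stats -g 2 -s demojob
# ... run the job ...
# Mark the end of the job
$ dcgmi stats -x demojob
# Display the collected statistics (as on the next slide)
$ dcgmi stats -j demojob -v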
27
JOB STATISTICS
dcgmi stats --job demojob -v -g 2
Successfully retrieved statistics for job: demojob.
+----------------------------------------------------------------------------+
| GPU ID: 0 |
+==================================+=========================================+
|----- Execution Stats ----------+-----------------------------------------|
| Start Time | Wed Mar 7 10:02:34 2018 |
| End Time | Wed Mar 7 10:10:00 2018 |
| Total Execution Time (sec) | 445.48 |
| No. of Processes | 1 |
| Compute PID | 23112 |
+----- Performance Stats --------+-----------------------------------------+
| Energy Consumed (Joules) | 1437 |
| Max GPU Memory Used (bytes) | 120324096 |
| SM Clock (MHz) | Avg: 998, Max: 1177, Min: 405 |
| Memory Clock (MHz) | Avg: 2068, Max: 2505, Min: 324 |
| SM Utilization (%) | Avg: 76, Max: 100, Min: 0 |
| Memory Utilization (%) | Avg: 0, Max: 1, Min: 0 |
| PCIe Rx Bandwidth (megabytes) | Avg: 0, Max: 0, Min: 0 |
| PCIe Tx Bandwidth (megabytes) | Avg: 0, Max: 0, Min: 0 |
+----- Event Stats --------------+-----------------------------------------+
| Single Bit ECC Errors | 5 |
| Double Bit ECC Errors | 0 |
| PCIe Replay Warnings | 0 |
| Critical XID Errors | 0 |
+----- Slowdown Stats -----------+-----------------------------------------+
| Due to - Power (%) | 0 |
| - Thermal (%) | Not Supported |
| - Reliability (%) | Not Supported |
| - Board Limit (%) | Not Supported |
| - Low Utilization (%) | Not Supported |
| - Sync Boost (%) | 0 |
+----------------------------------+-----------------------------------------+
Detailed stats show utilization, performance, and more…
28
Config
29
GPU CONFIGURATION MANAGEMENT
Initialization: Configure all GPUs (global group)
Per-job basis: Individual partitioned group settings
Maintains settings across driver restarts, GPU resets or at job start
Supports SET, GET and ENFORCE
SUPPORTED SETTINGS · MAINTAINS CONFIGURATION
dcgmi config -g 1 --set -P 200
Configuration successfully set.
Set power limit; ECC mode was similarly disabled for this group (see the target configuration below)
dcgmi config -g 1 --get
+--------------------------+------------------------+------------------------+
| all_gpu_group | | |
| Group of 2 GPUs | TARGET CONFIGURATION | CURRENT CONFIGURATION |
+==========================+========================+========================+
| Sync Boost | Not Specified | Disabled |
| SM Application Clock | Not Specified | 705 |
| Memory Application Clock | Not Specified | 2600 |
| ECC Mode | Disabled | Disabled |
| Power Limit | 200 | 225 |
| Compute Mode | Not Specified | E. Process |
+--------------------------+------------------------+------------------------+
Get Group config [Note DCGM performed reset]
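ENFORCE can be exercised from the CLI as well; a one-line sketch:
# Re-apply the group's target configuration to the hardware (e.g. after a GPU reset)
$ dcgmi config -g 1 --enforce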
30
dmon
31
dmon: monitor GPU stats
32
List the fields available for monitoring
$dcgmi dmon -l
33
Create a field group for monitoring (see the sketch below)
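A sketch (the field IDs are from DCGM's field list: 150 = GPU temperature, 155 = power usage, 203 = GPU utilization):
# Create a field group named "perf" from three field IDs
$ dcgmi fieldgroup -c perf -f 150,155,203
# List the field groups that now exist
$ dcgmi fieldgroup -l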
34
Monitor a field or a field group
$ dcgmi dmon -e <fieldIds> / -f <fieldGroupId>
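For example (a sketch; -d is the sampling interval in ms and -c the sample count):
# Sample temperature, power, and utilization once per second, ten times
$ dcgmi dmon -e 150,155,203 -d 1000 -c 10
# Or monitor a previously created field group on device group 2
$ dcgmi dmon -f 1 -g 2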
35
Integrating Prometheus & Grafana
36
Prometheus
❑ https://prometheus.io/
❑ Open Source systems monitoring and alerting toolkit
❑ Third-party exporters
37
dcgm-exporter
https://github.com/NVIDIA/gpu-monitoring-tools
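The node-exporter command below assumes a dcgm-exporter container is already running and writing metrics to the shared /run/prometheus volume; a sketch of starting it (image name per the gpu-monitoring-tools README of the time; it may have changed since):
$ docker run -d --rm --runtime=nvidia --name=nvidia-dcgm-exporter nvidia/dcgm-exporter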
$ docker run -d --rm --net="host" --pid="host" --volumes-from nvidia-dcgm-exporter:ro \
    quay.io/prometheus/node-exporter --collector.textfile.directory="/run/prometheus"
$ curl localhost:9100/metrics
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
dcgm_dec_utilization{gpu="2",uuid="GPU-e0725273-7885-235f-be70-5a333ac6fd63"} 0
dcgm_dec_utilization{gpu="3",uuid="GPU-eea85835-2787-da88-cf6f-a8b790f6ec2c"} 0
dcgm_dec_utilization{gpu="4",uuid="GPU-fc9dd930-84df-d270-dc93-dc3122bb901f"} 0
dcgm_dec_utilization{gpu="5",uuid="GPU-95320c1d-4f08-66a8-e834-c731136a7822"} 0
# HELP dcgm_ecc_dbe_aggregate_total Total number of double-bit persistent ECC errors.
# TYPE dcgm_ecc_dbe_aggregate_total counter
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_ecc_dbe_aggregate_total{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
dcgm_ecc_dbe_aggregate_total{gpu="2",uuid="GPU-e0725273-7885-235f-be70-5a333ac6fd63"} 0
dcgm_ecc_dbe_aggregate_total{gpu="3",uuid="GPU-eea85835-2787-da88-cf6f-a8b790f6ec2c"} 0
dcgm_ecc_dbe_aggregate_total{gpu="4",uuid="GPU-fc9dd930-84df-d270-dc93-dc3122bb901f"} 0
dcgm_ecc_dbe_aggregate_total{gpu="5",uuid="GPU-95320c1d-4f08-66a8-e834-c731136a7822"} 0
38
Prometheus config file
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'szdgxstation'
    static_configs:
      - targets: ['127.0.0.1:9100']
  - job_name: 'bjdlserver'
    static_configs:
      - targets: ['10.19.203.85:9100']
./prometheus --config.file=***.yml
39
DCGM + Prometheus
40
Grafana and Data Source Configuration: the open platform for beautiful analytics and monitoring
41
DCGM + Prometheus + Grafana