Tesla Cluster Monitoring & Management
Introductions
Robert Alexander
— CUDA Tools Software Engineer at NVIDIA
— Tesla Software Group
Overview
Management and Monitoring APIs
Kepler Tesla Power Management Features
Health and Diagnostics
Monitoring and Management
NVIDIA Display Driver
NVML (C API)
nvidia-smi (command line)
pyNVML (Python API)
nvidia::ml (Perl API)
NVML Supported OSes
Windows:
— Windows 7
— Windows Server 2008 R2
— 64-bit only
Linux:
— All supported by driver
— 32-bit
— 64-bit
NVML Supported GPUs
NVIDIA Tesla Brand:
— All
NVIDIA Quadro Brand:
— Kepler – All
— Fermi - 4000, 5000, 6000, 7000, M2070-Q
NVIDIA VGX Brand:
— All
— Supported in the hypervisor
NVML Queries
• Board serial number, GPU UUID
• PCI Information
• GPU utilization, memory utilization, performance state (P-state)
• GPU compute processes, PIDs
• Power draw, temperature, fan speeds
• Clocks
• ECC errors
• Events API
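As an illustration of these queries, here is a small pyNVML sketch (call and field names per the pynvml bindings; a starting point rather than a complete monitor):

from pynvml import *

nvmlInit()
try:
    for index in range(nvmlDeviceGetCount()):
        gpu = nvmlDeviceGetHandleByIndex(index)
        util = nvmlDeviceGetUtilizationRates(gpu)      # GPU / memory utilization in percent
        mem = nvmlDeviceGetMemoryInfo(gpu)             # total / used / free in bytes
        temp = nvmlDeviceGetTemperature(gpu, NVML_TEMPERATURE_GPU)
        print("%s: util %d%%, mem %d/%d MB, temp %d C" % (
            nvmlDeviceGetName(gpu), util.gpu,
            mem.used // (1024 * 1024), mem.total // (1024 * 1024), temp))
        for proc in nvmlDeviceGetComputeRunningProcesses(gpu):
            print("  compute process PID %d" % proc.pid)
finally:
    nvmlShutdown()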
NVML Commands
• Enable or disable ECC
• Change compute mode (applies only to CUDA)
— Default: multiple contexts allowed
— Exclusive Process
— Exclusive Thread
— Prohibited
• Change persistence mode (Linux)
— Keeps the NVIDIA driver loaded when no clients are connected
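The same commands are available programmatically; a minimal pyNVML sketch (these calls generally require root privileges, and an ECC change only takes effect at the next reboot):

from pynvml import *

nvmlInit()
try:
    gpu = nvmlDeviceGetHandleByIndex(0)
    nvmlDeviceSetEccMode(gpu, NVML_FEATURE_ENABLED)                    # pending until reboot
    nvmlDeviceSetComputeMode(gpu, NVML_COMPUTEMODE_EXCLUSIVE_PROCESS)  # CUDA compute mode
    nvmlDeviceSetPersistenceMode(gpu, NVML_FEATURE_ENABLED)            # Linux only
finally:
    nvmlShutdown()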
nvidia-smi
ralexander@ralexander-test:~> nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Wed May 16 11:24:16 2012
Driver Version : 295.54
Attached GPUs : 1
GPU 0000:02:00.0
    Product Name : Tesla C2050
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : xxxxxxxxxx
    GPU UUID : GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    VBIOS Version : 70.00.23.00.02
    Inforom Version
        OEM Object : 1.0
        ECC Object : 1.0
        Power Management Object : N/A
    PCI
        Bus : 0x02
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x06D110DE
        Bus Id : 0000:02:00.0
        Sub System Id : 0x077110DE
        GPU Link Info
            PCIe Generation
                Max : 2
                Current : 2
            Link Width
                Max : 16x
                Current : 16x
    Fan Speed : 30 %
    Performance State : P0
    Memory Usage
        Total : 2687 MB
        Used : 6 MB
        Free : 2681 MB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : 0
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : 0
    Temperature
        Gpu : 56 C
    Power Readings
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : 573 MHz
        SM : 1147 MHz
        Memory : 1494 MHz
    Max Clocks
        Graphics : 573 MHz
        SM : 1147 MHz
        Memory : 1500 MHz
    Compute Processes : None
nvidia-smi - XML
ralexander@ralexander-test:~> nvidia-smi -q -x
<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v3.dtd">
<nvidia_smi_log>
    <timestamp>Wed May 16 11:33:28 2012</timestamp>
    <driver_version>295.54</driver_version>
    <attached_gpus>1</attached_gpus>
    <gpu id="0000:02:00.0">
        <product_name>Tesla C2050</product_name>
        <display_mode>Disabled</display_mode>
        <persistence_mode>Disabled</persistence_mode>
        <driver_model>
            <current_dm>N/A</current_dm>
            <pending_dm>N/A</pending_dm>
        …
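The XML form is meant for scripting; for example, a small Python sketch that shells out to nvidia-smi and pulls a few fields with ElementTree (element names as in the DTD above):

import subprocess
import xml.etree.ElementTree as ET

# Run nvidia-smi and parse its XML report
xml_report = subprocess.check_output(["nvidia-smi", "-q", "-x"])
root = ET.fromstring(xml_report)

print("Driver version: " + root.findtext("driver_version"))
for gpu in root.findall("gpu"):
    print(gpu.get("id") + " : " + gpu.findtext("product_name"))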
pyNVML Example
ralexander@ralexander-test $ python
>>> from pynvml import *
>>> nvmlInit()
>>> count = nvmlDeviceGetCount()
>>> for index in range(count):
...     h = nvmlDeviceGetHandleByIndex(index)
...     print nvmlDeviceGetName(h)
...
Tesla C2075
Tesla C2075
pyNVML Example Continued
>>> gpu = nvmlDeviceGetHandleByIndex(0)
>>> print nvmlDeviceGetClockInfo(gpu, NVML_CLOCK_SM)
101
>>> print nvmlDeviceGetMaxClockInfo(gpu, NVML_CLOCK_SM)
1147
>>> print nvmlDeviceGetPowerUsage(gpu)
31899
>>> nvmlShutdown()
Note: nvmlDeviceGetPowerUsage reports milliwatts; clock values are reported in megahertz.
Other clock types: NVML_CLOCK_GRAPHICS, NVML_CLOCK_MEM.
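pyNVML raises NVMLError when a query fails or is not supported on a board, so monitoring scripts usually guard individual calls; a minimal sketch (the power query is one example, since the C2050 above reports Power Draw as N/A):

from pynvml import *

nvmlInit()
try:
    gpu = nvmlDeviceGetHandleByIndex(0)
    try:
        # Power draw in milliwatts; convert to watts for display
        print("Power draw: %.1f W" % (nvmlDeviceGetPowerUsage(gpu) / 1000.0))
    except NVMLError as err:
        # Boards without power readings raise here
        print("Power draw not supported: %s" % err)
finally:
    nvmlShutdown()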
Downloads
NVML SDK
— http://developer.nvidia.com/tesla-deployment-kit
Python NVML Bindings
— http://pypi.python.org/pypi/nvidia-ml-py/
Perl NVML Bindings
— http://search.cpan.org/~nvbinding/nvidia-ml-pl/
Third Party Software
Adaptive Computing
Bright Computing
Platform Computing
Ganglia
Penguin Computing
Univa
Ganglia – NVML plug-in
Data from http://www.ncsa.illinois.edu/
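For reference, a gmond Python metric module built on pyNVML might look roughly like the sketch below (module interface per Ganglia's Python module convention; the metric name and parameters are illustrative, not taken from the plug-in shown here):

# Illustrative gmond metric module using pyNVML (minimal, single metric)
from pynvml import *

def read_gpu_temp(name):
    gpu = nvmlDeviceGetHandleByIndex(0)
    return nvmlDeviceGetTemperature(gpu, NVML_TEMPERATURE_GPU)

def metric_init(params):
    nvmlInit()
    return [{
        'name': 'gpu0_temp',
        'call_back': read_gpu_temp,
        'time_max': 90,
        'value_type': 'uint',
        'units': 'C',
        'slope': 'both',
        'format': '%u',
        'description': 'GPU 0 temperature',
        'groups': 'gpu',
    }]

def metric_cleanup():
    nvmlShutdown()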
Out of Band API
Why out of band?
— Doesn't use the CPU or operating system
— Lights-out management
— Minimizes performance jitter
Subset of in-band functionality
— ECC
— Power Draw
— Temperature
— Static info – Serial number, UUID
Out of Band API
Requires system vendor integration
BMC can control and monitor the GPU
— e.g., control system fans based on GPU temperature
May be exposed through IPMI
Kepler Power Management
New Kepler Only APIs
Set Power Limit
Set Fixed Maximum Clocks
Query Performance Limiting Factors
Set Power Limit
Limit the amount of power the GPU can consume
Exposed in NVML and nvidia-smi
Set power budgets and power policies
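Assuming a pynvml build that exposes the Kepler power-management entry points, setting a limit might look like this sketch (values in milliwatts, clamped to the board's supported range; requires root):

from pynvml import *

nvmlInit()
try:
    gpu = nvmlDeviceGetHandleByIndex(0)
    min_mw, max_mw = nvmlDeviceGetPowerManagementLimitConstraints(gpu)
    target_mw = 150 * 1000                      # example budget: 150 W
    nvmlDeviceSetPowerManagementLimit(gpu, max(min_mw, min(max_mw, target_mw)))
finally:
    nvmlShutdown()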
Set Fixed Maximum Clocks
Clocks are selected from a set of supported clocks
Overridden by over-power or over-thermal events
Fixed performance when multiple GPUs operate in lockstep:
— Equivalent performance
— Reliable performance
— Saves power
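A sketch of selecting fixed clocks from the supported set (again assuming a pynvml build with the Kepler applications-clocks calls; requires root):

from pynvml import *

nvmlInit()
try:
    gpu = nvmlDeviceGetHandleByIndex(0)
    # Pick the highest supported memory clock, then the highest graphics clock valid for it
    mem_mhz = max(nvmlDeviceGetSupportedMemoryClocks(gpu))
    gr_mhz = max(nvmlDeviceGetSupportedGraphicsClocks(gpu, mem_mhz))
    nvmlDeviceSetApplicationsClocks(gpu, mem_mhz, gr_mhz)
finally:
    nvmlShutdown()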
Lockstep with Dynamic Performance
[Figure sequence: clock timelines for GPU0–GPU3 over time; one frame is annotated with a thermal event on a single GPU]
Query Performance Limiting Factors
GPU clocks adjust based on the environment
— If the GPU goes over its thermal or power limits, it will reduce performance
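On Kepler, NVML exposes the limiting factors as a bitmask of clock throttle reasons; a sketch (constant names per the pynvml bindings that wrap this API):

from pynvml import *

nvmlInit()
try:
    gpu = nvmlDeviceGetHandleByIndex(0)
    reasons = nvmlDeviceGetCurrentClocksThrottleReasons(gpu)   # bitmask of throttle reasons
    if reasons & nvmlClocksThrottleReasonSwPowerCap:
        print("Clocks reduced: power limit reached")
    if reasons & nvmlClocksThrottleReasonHwSlowdown:
        print("Clocks reduced: hardware slowdown (thermal or power brake)")
    if reasons & nvmlClocksThrottleReasonApplicationsClocksSetting:
        print("Clocks capped by applications clocks setting")
finally:
    nvmlShutdown()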
Health and Diagnostics
nvidia-healthmon
— Quick health check
— Suggests remedies to software and system configuration problems
— Helps users help themselves
nvidia-healthmon
What it’s not
— Not a full hardware diagnostic
— Not comprehensive
nvidia-healthmon – Feature Set
Basic CUDA and NVML sanity check
Diagnosis of GPU failure-to-initialize problems
Check for conflicting drivers (e.g. VESA)
InfoROM validation
Poorly seated GPU detection
Check for disconnected power cables
ECC error detection and reporting
Bandwidth test
nvidia-healthmon – Use Cases
Cluster scheduler’s prologue / epilogue script
Health and diagnostic suites
— Designed to integrate into third-party tools
After provisioning a cluster node
Run directly, manually
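For example, a scheduler prologue can run nvidia-healthmon and pull the node from service when it fails; a minimal sketch (assumes a nonzero exit code signals failure, and offline_node is a site-specific placeholder):

import subprocess
import sys

def offline_node(reason):
    # Site-specific placeholder: mark this node offline in the scheduler
    print("Offlining node: " + reason)

# Run the health check before handing the node to a job (path is site-specific)
rc = subprocess.call(["nvidia-healthmon"])
if rc != 0:
    offline_node("nvidia-healthmon failed with exit code %d" % rc)
    sys.exit(1)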
nvidia-healthmon
ralexander@ralexander-test:~> ./nvidia-healthmon
Loading Config: SUCCESS
Global Tests:
    NVML Sanity: SUCCESS
    Tesla Devices Count: SKIPPED
    Result: 1 success, 0 errors, 0 warnings, 1 did not run
-----------------------------------------------------------
GPU 0000:02:00.0 #0 : Tesla C2050 (Serial: xxxxxxxxxxx):
    NVML Sanity: SUCCESS
    InfoROM: SKIPPED
    GEMENI InfoROM: SKIPPED
    ECC: SUCCESS
    CUDA Sanity: SUCCESS
    PCIe Maximum Link Generation: SKIPPED
    PCIe Maximum Link Width: SKIPPED
    PCI Seating: SUCCESS
    PCI Bandwidth: SKIPPED
    Result: 4 success, 0 errors, 0 warnings, 5 did not run

5 success, 0 errors, 0 warnings, 6 did not run
WARNING: One or more tests didn't run.
Contact Info and Links
Robert Alexander
developer.nvidia.com/nvidia-management-library-nvml
forums.nvidia.com
Thanks!
Questions?