Tesla Cluster Monitoring & Management - GTC 2012


Transcript of Tesla Cluster Monitoring & Management - GTC 2012

Page 1: Tesla Cluster Monitoring & Management - GTC 2012

Tesla Cluster Monitoring & Management

Page 2: Tesla Cluster Monitoring & Management - GTC 2012

Introductions

Robert Alexander

— CUDA Tools Software Engineer at NVIDIA

— Tesla Software Group

Page 3: Tesla Cluster Monitoring & Management - GTC 2012

Overview

Management and Monitoring APIs

Kepler Tesla Power Management Features

Health and Diagnostics

Page 4: Tesla Cluster Monitoring & Management - GTC 2012

Monitoring and Management

NVIDIA Display Driver

NVML — C API

nvidia-smi — command line

pyNVML — Python API

nvidia::ml — Perl API

Page 5: Tesla Cluster Monitoring & Management - GTC 2012

NVML Supported OSes

Windows:

— Windows 7

— Windows Server 2008 R2

— 64-bit only

Linux:

— All supported by driver

— 32-bit

— 64-bit

Page 6: Tesla Cluster Monitoring & Management - GTC 2012

NVML Supported GPUs

NVIDIA Tesla Brand:

— All

NVIDIA Quadro Brand:

— Kepler – All

— Fermi - 4000, 5000, 6000, 7000, M2070-Q

NVIDIA VGX Brand:

— All

— Supported in the hypervisor

Page 7: Tesla Cluster Monitoring & Management - GTC 2012

NVML Queries

• Board serial number, GPU UUID

• PCI Information

• GPU utilization, memory utilization, pstate

• GPU compute processes, PIDs

• Power draw, temperature, fan speeds

• Clocks

• ECC errors

• Events API
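
A minimal pyNVML sketch of a few of these queries, assuming the pynvml bindings shown later in this deck are installed; fields a given board does not instrument (e.g. power draw on the C2050 shown later) are raised as an NVMLError:

# Sketch: read several of the NVML fields listed above for each GPU.
from pynvml import *

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        h = nvmlDeviceGetHandleByIndex(i)
        print("GPU %d: %s" % (i, nvmlDeviceGetName(h)))
        print("  UUID        : %s" % nvmlDeviceGetUUID(h))
        print("  PCI bus id  : %s" % nvmlDeviceGetPciInfo(h).busId)
        util = nvmlDeviceGetUtilizationRates(h)
        print("  Utilization : gpu %d %%, memory %d %%" % (util.gpu, util.memory))
        print("  Temperature : %d C" % nvmlDeviceGetTemperature(h, NVML_TEMPERATURE_GPU))
        print("  SM clock    : %d MHz" % nvmlDeviceGetClockInfo(h, NVML_CLOCK_SM))
        try:
            # Power draw is reported in milliwatts; not available on every board.
            print("  Power draw  : %.1f W" % (nvmlDeviceGetPowerUsage(h) / 1000.0))
        except NVMLError:
            print("  Power draw  : N/A")
finally:
    nvmlShutdown()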

Page 8: Tesla Cluster Monitoring & Management - GTC 2012

NVML Commands

• Enable or Disable ECC

• Change Compute mode

— Applies only to CUDA

— Default – multiple contexts

— Exclusive Process

— Exclusive Thread

— Prohibited

• Change persistence mode (Linux)

— Keeps the NVIDIA driver loaded
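
The same commands are available from pyNVML. A hedged sketch, assuming root privileges (an ECC change only becomes the current mode after the next reboot or GPU reset, matching the Current/Pending split in the nvidia-smi output later in this deck):

# Sketch: drive the management commands above from pyNVML (run as root).
from pynvml import *

nvmlInit()
try:
    h = nvmlDeviceGetHandleByIndex(0)

    # Enable ECC; shows up as the "Pending" mode until the next reset/reboot.
    nvmlDeviceSetEccMode(h, NVML_FEATURE_ENABLED)

    # Allow only one CUDA process on this GPU at a time.
    nvmlDeviceSetComputeMode(h, NVML_COMPUTEMODE_EXCLUSIVE_PROCESS)

    # Linux only: keep the NVIDIA driver loaded even with no clients attached.
    nvmlDeviceSetPersistenceMode(h, NVML_FEATURE_ENABLED)
finally:
    nvmlShutdown()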

Page 9: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-smi

ralexander@ralexander-test:~> nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Wed May 16 11:24:16 2012

Driver Version : 295.54

Attached GPUs : 1

GPU 0000:02:00.0

Product Name : Tesla C2050

Display Mode : Disabled

Persistence Mode : Disabled

Page 10: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-smi

Driver Model

Current : N/A

Pending : N/A

Serial Number : xxxxxxxxxx

GPU UUID : GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

VBIOS Version : 70.00.23.00.02

Inforom Version

OEM Object : 1.0

ECC Object : 1.0

Power Management Object : N/A

PCI

Bus : 0x02

Device : 0x00

Domain : 0x0000

Device Id : 0x06D110DE

Page 11: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-smi

Bus Id : 0000:02:00.0

Sub System Id : 0x077110DE

GPU Link Info

PCIe Generation

Max : 2

Current : 2

Link Width

Max : 16x

Current : 16x

Fan Speed : 30 %

Performance State : P0

Memory Usage

Total : 2687 MB

Used : 6 MB

Free : 2681 MB

Page 12: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-smi

Compute Mode : Default

Utilization

Gpu : 0 %

Memory : 0 %

Ecc Mode

Current : Enabled

Pending : Enabled

ECC Errors

Volatile

Single Bit

Device Memory : 0

Register File : 0

L1 Cache : 0

L2 Cache : 0

Total : 0

Page 13: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-smi

Double Bit

Device Memory : 0

Register File : 0

L1 Cache : 0

L2 Cache : 0

Total : 0

Aggregate

Single Bit

Device Memory : N/A

Register File : N/A

L1 Cache : N/A

L2 Cache : N/A

Total : 0

Page 14: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-smi

Double Bit

Device Memory : N/A

Register File : N/A

L1 Cache : N/A

L2 Cache : N/A

Total : 0

Temperature

Gpu : 56 C

Power Readings

Power Management : N/A

Power Draw : N/A

Power Limit : N/A

Page 15: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-smi

Clocks

Graphics : 573 MHz

SM : 1147 MHz

Memory : 1494 MHz

Max Clocks

Graphics : 573 MHz

SM : 1147 MHz

Memory : 1500 MHz

Compute Processes : None

Page 16: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-smi - XML

ralexander@ralexander-test:~> nvidia-smi -q -x

<?xml version="1.0" ?>

<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v3.dtd">

<nvidia_smi_log>

<timestamp>Wed May 16 11:33:28 2012</timestamp>

<driver_version>295.54</driver_version>

<attached_gpus>1</attached_gpus>

<gpu id="0000:02:00.0">

<product_name>Tesla C2050</product_name>

<display_mode>Disabled</display_mode>

<persistence_mode>Disabled</persistence_mode>

<driver_model>

<current_dm>N/A</current_dm>

<pending_dm>N/A</pending_dm>
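
The XML form is convenient to post-process from a script. A minimal sketch, assuming nvidia-smi is on the PATH and using only element names visible in the excerpt above:

# Sketch: run "nvidia-smi -q -x" and pull a few fields out of the XML log.
import subprocess
import xml.etree.ElementTree as ET

xml_log = subprocess.check_output(["nvidia-smi", "-q", "-x"])
root = ET.fromstring(xml_log)

print("Driver version: %s" % root.findtext("driver_version"))
for gpu in root.findall("gpu"):
    print("%s  %s  persistence=%s" % (
        gpu.get("id"),                      # PCI bus id attribute
        gpu.findtext("product_name"),
        gpu.findtext("persistence_mode")))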

Page 17: Tesla Cluster Monitoring & Management - GTC 2012

pyNVML Example

ralexander@ralexander-test $ python

>>> from pynvml import *

>>> nvmlInit()

>>> count = nvmlDeviceGetCount()

>>> for index in range(count):

...     h = nvmlDeviceGetHandleByIndex(index)

...     print nvmlDeviceGetName(h)

Tesla C2075

Tesla C2075

Page 18: Tesla Cluster Monitoring & Management - GTC 2012

pyNVML Example Continued

>>> gpu = nvmlDeviceGetHandleByIndex(0)

>>> print nvmlDeviceGetClockInfo(gpu, NVML_CLOCK_SM)

101

>>> print nvmlDeviceGetMaxClockInfo(gpu, NVML_CLOCK_SM)

1147

>>> print nvmlDeviceGetPowerUsage(gpu)

31899

>>> nvmlShutdown()

Power usage (nvmlDeviceGetPowerUsage) is reported in milliwatts; clock values are in megahertz.

Other clock types: NVML_CLOCK_GRAPHICS, NVML_CLOCK_MEM
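
Putting the calls together, a short polling loop of the kind a monitoring plug-in might run; a sketch using only queries shown or listed earlier:

# Sketch: sample one GPU once per second for ten seconds.
import time
from pynvml import *

nvmlInit()
try:
    gpu = nvmlDeviceGetHandleByIndex(0)
    for _ in range(10):
        util = nvmlDeviceGetUtilizationRates(gpu)
        temp = nvmlDeviceGetTemperature(gpu, NVML_TEMPERATURE_GPU)
        sm = nvmlDeviceGetClockInfo(gpu, NVML_CLOCK_SM)
        print("util=%d%% mem=%d%% temp=%dC sm=%dMHz" % (util.gpu, util.memory, temp, sm))
        time.sleep(1)
finally:
    nvmlShutdown()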

Page 20: Tesla Cluster Monitoring & Management - GTC 2012

Third Party Software

Adaptive Computing

Bright Computing

Platform Computing

Ganglia

Penguin Computing

Univa

Page 21: Tesla Cluster Monitoring & Management - GTC 2012

Ganglia – NVML plug-in

Data from http://www.ncsa.illinois.edu/

Page 22: Tesla Cluster Monitoring & Management - GTC 2012

Out of Band API

Why Out of band?

— Doesn’t use CPU or operating system

— Lights out management

— Minimize performance jitter

Subset of in band functionality

— ECC

— Power Draw

— Temperature

— Static info – Serial number, UUID

Page 23: Tesla Cluster Monitoring & Management - GTC 2012

Out of Band API

Requires system vendor integration

BMC can control and monitor GPU

— Control system fans based on GPU temperature

IPMI may be exposed

Page 24: Tesla Cluster Monitoring & Management - GTC 2012

Kepler Power Management

New Kepler Only APIs

Set Power Limit

Set Fixed Maximum Clocks

Query Performance Limiting Factors

Page 25: Tesla Cluster Monitoring & Management - GTC 2012

Set Power Limit

Limit the amount of power GPU can consume

Exposed in NVML and nvidia-smi

Set power budgets and power policies
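
A hedged pyNVML sketch of this control, assuming a Kepler-class board and root privileges; limits are given in milliwatts and must sit inside the board's reported constraints:

# Sketch: cap the GPU's power draw via NVML (Kepler only, run as root).
from pynvml import *

nvmlInit()
try:
    h = nvmlDeviceGetHandleByIndex(0)
    lo_mw, hi_mw = nvmlDeviceGetPowerManagementLimitConstraints(h)
    target_mw = max(lo_mw, min(hi_mw, 150 * 1000))   # e.g. a 150 W budget, clamped
    nvmlDeviceSetPowerManagementLimit(h, target_mw)
    print("Power limit now %.0f W" % (nvmlDeviceGetPowerManagementLimit(h) / 1000.0))
finally:
    nvmlShutdown()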

Page 26: Tesla Cluster Monitoring & Management - GTC 2012

Set Fixed Maximum Clocks

From a set of supported clocks

Will be overridden by over-power or over-thermal events

Fixed performance when multiple GPUs operate in lock step

— Equivalent Performance

— Reliable Performance

— Save Power
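
A hedged sketch of pinning clocks through NVML's applications-clocks calls, assuming a Kepler-class board and root privileges; the chosen clocks are still lowered if the GPU exceeds its power or thermal limit:

# Sketch: fix the GPU to the highest supported clock pair (run as root).
from pynvml import *

nvmlInit()
try:
    h = nvmlDeviceGetHandleByIndex(0)
    # Clocks may only be chosen from the supported set, in MHz.
    mem_mhz = max(nvmlDeviceGetSupportedMemoryClocks(h))
    gfx_mhz = max(nvmlDeviceGetSupportedGraphicsClocks(h, mem_mhz))
    nvmlDeviceSetApplicationsClocks(h, mem_mhz, gfx_mhz)
    print("Applications clocks: mem=%d MHz, graphics=%d MHz" % (mem_mhz, gfx_mhz))
    # nvmlDeviceResetApplicationsClocks(h) would restore the default behavior.
finally:
    nvmlShutdown()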

Page 27: Tesla Cluster Monitoring & Management - GTC 2012

Lockstep with Dynamic Performance

GPU0

GPU1

GPU2

GPU3

Time

Page 28: Tesla Cluster Monitoring & Management - GTC 2012

Lockstep with Dynamic Performance

GPU0

GPU1

GPU2

GPU3

Thermal Event

Time

Page 29: Tesla Cluster Monitoring & Management - GTC 2012

Lockstep with Dynamic Performance

GPU0

GPU1

GPU2

GPU3

Time

Page 30: Tesla Cluster Monitoring & Management - GTC 2012

Lockstep with Dynamic Performance

GPU0

GPU1

GPU2

GPU3

Time

Page 31: Tesla Cluster Monitoring & Management - GTC 2012

Query Performance Limiting Factors

GPU clocks will adjust based on environment

When over its thermal or power limit, the GPU will reduce performance
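
One way to query the limiting factor from pyNVML is the clocks-throttle-reasons bitmask; a sketch assuming that query is available in the installed NVML/pyNVML version:

# Sketch: report why the GPU is not running at its full requested clocks.
from pynvml import *

nvmlInit()
try:
    h = nvmlDeviceGetHandleByIndex(0)
    reasons = nvmlDeviceGetCurrentClocksThrottleReasons(h)   # bitmask
    if reasons & nvmlClocksThrottleReasonSwPowerCap:
        print("Throttled: software power cap")
    if reasons & nvmlClocksThrottleReasonHwSlowdown:
        print("Throttled: hardware slowdown (thermal or power brake)")
    if reasons & nvmlClocksThrottleReasonGpuIdle:
        print("Clocks lowered because the GPU is idle")
    if reasons == nvmlClocksThrottleReasonNone:
        print("Running at full requested clocks")
finally:
    nvmlShutdown()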

Page 32: Tesla Cluster Monitoring & Management - GTC 2012

Health and Diagnostics

nvidia-healthmon

— Quick health check

— Suggest remedies to SW and system configuration problems

— Help users help themselves

Page 33: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-healthmon

What it’s not

— Not a full HW diagnostic

— Not comprehensive

Page 34: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-healthmon – Feature Set

Basic CUDA and NVML sanity check

Diagnosis of GPU failure-to-initialize problems

Check for conflicting drivers (e.g. VESA)

InfoROM validation

Poorly seated GPU detection

Check for disconnected power cables

ECC error detection and reporting

Bandwidth test

Page 35: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-healthmon – Use Cases

Cluster scheduler’s prologue / epilogue script

Health and diagnostic suites

— Designed to integrate into third party tools

After provisioning cluster node

Run directly, manually
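
For the prologue/epilogue case, a minimal sketch of a wrapper that marks the node unhealthy when nvidia-healthmon fails; the binary location and the non-zero-exit-on-error convention are assumptions here:

# Sketch: prologue-style wrapper around nvidia-healthmon.
import subprocess
import sys

proc = subprocess.Popen(["./nvidia-healthmon"],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
out, _ = proc.communicate()
print(out.decode("utf-8", "replace"))

if proc.returncode != 0:
    # Report failure to the scheduler so the job is not started on this node.
    sys.exit(proc.returncode)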

Page 36: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-healthmon

ralexander@ralexander-test:~> ./nvidia-healthmon

Loading Config: SUCCESS

Global Tests:

NVML Sanity: SUCCESS

Tesla Devices Count: SKIPPED

Result: 1 success, 0 errors, 0 warnings, 1 did not run

-----------------------------------------------------------

GPU 0000:02:00.0 #0 : Tesla C2050 (Serial: xxxxxxxxxxx):

NVML Sanity: SUCCESS

Page 37: Tesla Cluster Monitoring & Management - GTC 2012

nvidia-healthmon

InfoROM: SKIPPED

GEMENI InfoROM: SKIPPED

ECC: SUCCESS

CUDA Sanity: SUCCESS

PCIe Maximum Link Generation: SKIPPED

PCIe Maximum Link Width: SKIPPED

PCI Seating: SUCCESS

PCI Bandwidth: SKIPPED

Result: 4 success, 0 errors, 0 warnings, 5 did not run

5 success, 0 errors, 0 warnings, 6 did not run

WARNING: One or more tests didn't run.

Page 39: Tesla Cluster Monitoring & Management - GTC 2012

Thanks!

Page 40: Tesla Cluster Monitoring & Management - GTC 2012

Questions?

Page 41: Tesla Cluster Monitoring & Management - GTC 2012

Tesla Cluster Monitoring & Management