Impact of GPU Virtualization on Higher Education |...

Post on 06-Jul-2018

S3467: Impact of GPU Virtualization on Higher Education

Didier Contis
College of Engineering / Georgia Tech
didier.contis@coe.gatech.edu

How all of this started…

Back in early 2007, we were trying to address the following issues:

• Student Computer Ownership policy did not address Engineering Applications Licensing problems.

• Despite growing enrollment, funding for computer labs is more difficult to get.

• 24 x 7 access to computer labs = physical security and support issues.

• Computer labs are inefficient (e.g. SPACE, power, cooling).

• Student population is increasingly mobile and geographically dispersed.

A picture is worth a thousand words…

Objective: Support Pedagogy and Delivery Modes

Provide elastic capabilities for
• Design
• Simulation
• Experimentation

Which are accessible on-demand from
• Anywhere
• Anytime
• Any device (and we mean any)

The Vlab and Matrix Projects @GT

Application
• VMware View
• Citrix XenDesktop 5.6
• Red Hat VDI
• Windows RDS

Hypervisor
• VMware ESX 4.x
• Microsoft Hyper-V 2008 R2
• Microsoft Hyper-V 2012
• Red Hat KVM
• Xen 6.0.2

Server
• CoE Servers
• CoA Servers
• IAC Servers
• CoS Servers
• CoB Servers
• 1312 cores, 7.15TB memory

Storage
• EMC NS-120 (30TB)
• NetApp 3240 (76.8TB)
• EQL PS-6000E (10TB)

Network
• 2 x Cisco Nexus 5596 + Nexus 2K Expanders

AE / CEE / ME 1770

• Introduction to Engineering Graphics and Visualization

• Required class. Spring 2013 Semester: 12 sections / 40 students per section

• Course description: Introduction to engineering graphics and visualization, including sketching, line drawing, and solid modeling. Development and interpretation of drawings and specifications for product realization.

• Course built around the latest versions of AutoCAD and Autodesk Inventor Professional

• Supported by two computer labs (40 seats each)

VDI and 3D CAD
1770 Course – Example of a team project
Design of the Atlantic Station Millennium Gate
http://www.thegateatlanta.com/

The Problem of Rendering with VDI

Computer labs used by 1770 Course due for refresh

80 x Dell T3500 workstations to be replaced

What do we do?

• VDI or NOT ??

• If we go VDI, how do we implement virtual GPU?

• Need a solution operational on the 1st day of Fall 2012 Semester class: Monday August 20th !!!

The Challenge early Spring 2012

“We couldn’t prove that it couldn’t be done… So we decided it could be.”
Tony Tamasi, NVIDIA Senior Vice President of Content and Strategy

Major risk but only realistic solution with August 20th 2012 as a target deadline.

Could it have been done differently…. Debatable

Lots of things would need to fall into place

Our solution: Win 2012 VDI & GRID K1

How we did it…

Microsoft VDI + RemoteFX Architecture

RemoteFX GPU Enabled Hyper-V nodes

Our Server Hardware Configuration

Dell R720
• 2 x E5-2660 (16 cores – 95W)
• 192GB of memory
• 2 x 10GbE NICs
• 2 x 1100W power supplies

Virtual GPU: NVIDIA GRID K1
GPU: 4 Kepler GPUs
CUDA cores: 768 (192 / GPU)
Memory size: 16GB DDR3 (4GB / GPU)
Max power: 130 W
Form factor: Dual slot ATX, 10.5”
Display IO: None
Aux power requirement: 6-pin connector
PCIe: x16
PCIe generation: Gen3 (Gen2 compatible)
Cooling solution: Passive
# users: 4 - 100¹
OpenGL: 4.3
Microsoft DirectX: 11
VGX Hypervisor support: Yes

Hyper-V Configuration

Virtualizing GPUs

!!! Microsoft RDVH (Remote Desktop Virtualization Host) role enabled

Virtual Machine RemoteFX Configuration

RemoteFX 3D Adapter Configuration
RemoteFX VM Configuration

220MB

Theoretical density using one 1920x1200 screen per VM
# of VMs per K1 GPU: 18
# of VMs per K1 board: 72

Note: 72 x average of 2.5GB memory usage per VM = 180GB, which fits the server memory footprint (192GB).
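The density figures above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the 220MB value on the slide is the video memory reserved per VM for one 1920x1200 screen, and the function and constant names are made up for illustration. The other numbers come from the slides (4GB per GPU, 4 GPUs per K1 board, 2.5GB average VM memory, 192GB per server).

```python
# Sketch of the density arithmetic shown on the slide.
# Assumption: 220MB is the per-VM video memory reservation for one
# 1920x1200 screen. All names below are illustrative only.

GPU_MEMORY_MB = 4 * 1024     # 4GB per Kepler GPU on a GRID K1
GPUS_PER_BOARD = 4           # a GRID K1 board carries 4 GPUs
VRAM_PER_VM_MB = 220         # assumed per-VM reservation (see above)
AVG_RAM_PER_VM_GB = 2.5      # average VM memory usage from the slide
SERVER_RAM_GB = 192          # Dell R720 memory from the slide


def theoretical_density(gpu_memory_mb, vram_per_vm_mb, gpus_per_board):
    """Return (VMs per GPU, VMs per board), rounding down to whole VMs."""
    vms_per_gpu = gpu_memory_mb // vram_per_vm_mb
    return vms_per_gpu, vms_per_gpu * gpus_per_board


vms_per_gpu, vms_per_board = theoretical_density(
    GPU_MEMORY_MB, VRAM_PER_VM_MB, GPUS_PER_BOARD
)
ram_needed_gb = vms_per_board * AVG_RAM_PER_VM_GB

print("VMs per K1 GPU:", vms_per_gpu)
print("VMs per K1 board:", vms_per_board)
print("Host memory needed (GB):", ram_needed_gb)
print("Fits in server memory:", ram_needed_gb <= SERVER_RAM_GB)
```

Changing VRAM_PER_VM_MB shows how quickly density drops with larger or multiple screens.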

Microsoft VDI + RemoteFX Architecture

How everything ties together

Our Windows 8 VDI Collections

Civil Engineering Computer Lab – Before
40 Dell Precision T3500 + 19” Screens

Civil Engineering Computer Lab – After
40 Dell Wyse Z90D7 + 23.5” Screens

Customized WES7 Image with RDP 8.0 Client

Access via Microsoft Web UI

Lessons Learned

Students using Win8 on 1st day of class was a non‐event

There is a cost for being on the bleeding edge

Pre-production hardware (1st GPU card lasted 17h)

Beta drivers

SAN LUN alignment problem == bad performance

Tracking post‐doc who saturates building uplink every day 

“Sabotage” by our Friends of Central IT 

Most of our problems were self-inflicted

How to monitor virtualized GPU usage?
Strategy #1: nvidia-smi tool and scripts

    nvidia-smi -q --display=UTILIZATION,PERFORMANCE --loop=60 --filename=c:\Temp\NVIDIA_Log_2.txt

    ==============NVSMI LOG==============
    Timestamp : Wed Mar 06 16:35:26 2013
    Driver Version : 310.90

    Attached GPUs : 4
    GPU 0000:07:00.0
        Performance State : P0
        Clocks Throttle Reasons : N/A
        Utilization
            Gpu : 21 %
            Memory : 11 %

    GPU 0000:08:00.0
        Performance State : P0
        Clocks Throttle Reasons : N/A
        Utilization
            Gpu : 61 %
            Memory : 33 %

    [……]
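A log like the one above can be turned into per-GPU numbers with a small script. The following is a minimal sketch, not the scripts actually used: the function name and the sample text are illustrative, the indentation of the sample is assumed, and only the two GPUs visible in the excerpt are included (the rest of the log is elided on the slide).

```python
import re

# Excerpt of the log shown on the slide (indentation assumed).
SAMPLE_LOG = """\
==============NVSMI LOG==============
Timestamp : Wed Mar 06 16:35:26 2013
Driver Version : 310.90

Attached GPUs : 4
GPU 0000:07:00.0
    Performance State : P0
    Clocks Throttle Reasons : N/A
    Utilization
        Gpu : 21 %
        Memory : 11 %

GPU 0000:08:00.0
    Performance State : P0
    Clocks Throttle Reasons : N/A
    Utilization
        Gpu : 61 %
        Memory : 33 %
"""


def parse_nvsmi_log(text):
    """Return one dict per Timestamp block: {"timestamp", "gpus": {bus_id: %}}."""
    samples = []
    current_gpu = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Timestamp"):
            # A new sample starts at every Timestamp line.
            samples.append({"timestamp": line.split(":", 1)[1].strip(), "gpus": {}})
            current_gpu = None
        elif re.match(r"^GPU [0-9A-Fa-f:.]+$", line):
            # Section header such as "GPU 0000:07:00.0".
            current_gpu = line.split(None, 1)[1]
        else:
            match = re.match(r"^Gpu\s*:\s*(\d+)\s*%$", line)
            if match and samples and current_gpu:
                samples[-1]["gpus"][current_gpu] = int(match.group(1))
    return samples


samples = parse_nvsmi_log(SAMPLE_LOG)
for sample in samples:
    # Cumulated usage = sum of the per-GPU utilization percentages.
    cumulated = sum(sample["gpus"].values())
    print(sample["timestamp"], sample["gpus"], "cumulated:", cumulated)
```

With --loop=60 the log file contains one such block per minute, so the same function returns one entry per sample, ready to be plotted.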

Trying to visualize nvidia-smi results

[Chart: coe-hyperv401g, 3/6/2013. Cumulated usage in % from each GPU (GPU #1 to GPU #4). Y-axis: 0 to 300. X-axis: time of day, about 8:00 to 17:55, one tick every 11 minutes.]

[Chart: coe-hyperv402g, 3/6/2013. Cumulated usage in % from each GPU (GPU #1 to GPU #4). Y-axis: 0 to 350. X-axis: time of day, about 8:00 to 17:54, one tick every 11 minutes.]

A closer look: visualizing nvidia-smi results

[Chart: coe-hyperv401g, March 6th 2013, 12:55 to 14:00, 1-minute sampling. Series: GPU #1 to GPU #4. Y-axis: 0 to 300.]

How to monitor GPUs and VMs?
Strategy #2: Microsoft Windows Server 2012 Perfmon tool

Trying to visualize 12h of Perfmon data

157MB of data for 12 hours !!!

A closer look at one Perfmon RemoteFX value
Let’s focus on the TDR timeouts from coe-hyperv401g – March 6th 2013

No TDR timeout – RemoteFX Root GPU
It is a good thing…

TDR: Timeout Detection and Recovery
Detects when the GPU stops responding. If necessary, tries to fix it via a re-initialization, avoiding the need for reboots.
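With 157MB of Perfmon data per 12 hours, checking one counter by hand is tedious. Below is a hedged sketch of how a Perfmon CSV export could be scanned for non-zero TDR samples. The counter path in the sample is hypothetical (the real RemoteFX counter names are not given on the slides), and the function simply matches any column whose header contains a given substring.

```python
import csv
import io

# Hypothetical sample in the Perfmon CSV export layout: first column is
# the timestamp, the others are counter paths. Counter names are made up.
SAMPLE_CSV = r'''"(PDH-CSV 4.0) (Eastern Standard Time)(300)","\\HOST\Example GPU Object\TDR count","\\HOST\Example GPU Object\Other"
"03/06/2013 08:00:59.000","0","12"
"03/06/2013 08:01:59.000"," ","15"
"03/06/2013 08:02:59.000","0","9"
'''


def nonzero_samples(csv_text, counter_substring):
    """Return (timestamp, counter, value) for every non-zero matching sample."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Column 0 holds the timestamp; match the other headers case-insensitively.
    wanted = [
        index for index, name in enumerate(header)
        if index > 0 and counter_substring.lower() in name.lower()
    ]
    hits = []
    for row in reader:
        for index in wanted:
            cell = row[index].strip() if index < len(row) else ""
            if not cell:
                continue  # missing samples are exported as blanks
            value = float(cell)
            if value != 0:
                hits.append((row[0], header[index], value))
    return hits


events = nonzero_samples(SAMPLE_CSV, "TDR")
print("No TDR timeout" if not events else events)
```

Listing non-zero samples, rather than summing, avoids assuming whether the counter is cumulative or per-interval.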

GPU… Potential single point of failure?
What happens when the host GPU card or driver crashes…

HOST DRIVER CRASH

ONE OF THE STUDENT VMs

Not all applications are created equal…

Performance Tuner Results Log
Version: 19.0.2.0
Date of Last Tune: 1/7/2013

Machine Configuration
---------------------
Processor Speed : 2.2 GHz
RAM : 1660 MB

3D Device
---------
Name : Microsoft RemoteFX Graphics Device - WDDM
Manufacturer : Microsoft
Chip set : RemoteFX Graphics Device - WDDM
Memory : 191 MB
Driver : 6.2.9200.16384

Your machine contains a 3D Device that is not certified.

[…]

Current application driver: Software

AutoCAD 2013 SP 1.1
Autodesk Inventor 2013 SP 1.1

“Who you gonna call” when your CAD apps crash

• Inventor 2013 SP1.1 Update 1

• Windows 8 64bit patched

• Windows 2012 patched

• NVIDIA WHQL GRID 310.90 driver

• Production GRID K1 board

• Dell R720 with BIOS 1.4.8

Problematic CAD Software certification in a Virtualized Environment…

Final Thoughts and Future Work

Does the technology work? Yes !!!
• Technology is maturing quickly. Expect things to move very quickly in the next 6 to 12 months.

Virtual GPU… Is it a “Game Changer” for VDI ??
• Not exactly. But it does satisfy a BIG need we had.

We DO need more (management) integration between Virtual GPU and Hypervisors.
• Better dashboard / monitoring from the Hypervisor
• Session load balancing / VM placement based on GPU usage.
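To make the last wish concrete, here is a purely illustrative sketch of GPU-aware placement: pick the host whose GPUs are, on average, the least busy. Nothing here is an existing hypervisor feature or API. The host names and utilization values are hypothetical, and a real implementation would also have to consider memory, CPU and free video memory.

```python
def pick_host(gpu_utilization_by_host):
    """Return the host name with the lowest average GPU utilization (%).

    gpu_utilization_by_host maps a host name to a list of per-GPU
    utilization percentages, e.g. as collected with nvidia-smi.
    """
    if not gpu_utilization_by_host:
        raise ValueError("no hosts to choose from")

    def average(host):
        values = gpu_utilization_by_host[host]
        return sum(values) / len(values)

    return min(gpu_utilization_by_host, key=average)


# Hypothetical hosts and utilization values, for illustration only.
current_load = {
    "host-a": [21, 61, 40, 35],
    "host-b": [10, 15, 20, 5],
}
print("Place next VM on:", pick_host(current_load))
```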

What have we learned?

Mixing hardware

Dell R720 + GRID cards (K1 / K2) == building block

K1 for 80% of workload needs / K2 for remaining 20%

Future direction

K1 K2+

Mixing Technologies

Can run different hypervisor / VDI solution / App publishing on top of the brick.

Testing XenDesktop.Next / XenApp.Next / RemoteFX.Next…

Future direction

Backup Slides

Thin Client Challenges

• Why WES7 and not ThinOS?

• Active Directory not feasible for these devices

• Secure, limited access by local user account

• Access to Citrix and RemoteFX (W8) Pools

– WES7 with RDC 6.2.9200 update

• Reduced workload for local support

• Central management of client image

– Wyse Device Manager

Creating the W8 Master and Collection

• Create Base OS Image

• Provide Departmental RDP Access to Master

• Departmental Software Install and Testing

• Save copy of Virtual Hard Drive

• Sysprep

• Create Collection with Virtual GPU and User Profile Disks enabled

• Apply GPO

sysprep -generalize -oobe -shutdown -mode:vm

Collection Properties

CEE W8 Desktop