
GPUvm: Why Not Virtualizing GPUs at the Hypervisor?

Yusuke Suzuki*, in collaboration with

Shinpei Kato**, Hiroshi Yamada***, Kenji Kono*

* Keio University, ** Nagoya University

*** Tokyo University of Agriculture and Technology

Graphic Processing Unit (GPU)

• GPUs are used for data-parallel computations

– Composed of thousands of cores

– Peak double-precision performance exceeds 1 TFLOPS

– Performance-per-watt of GPUs outperforms CPUs

• GPGPU is widely accepted for various uses

– Network systems [Jang et al. ’11], file systems [Silberstein et al. ’13] [Sun et al. ’12], DBMS [He et al. ’08], etc.

[Figure: GPU with many cores, per-core L1 caches, a shared L2 cache, and video memory, attached to a CPU and main memory]

Motivation

• GPUs are not first-class citizens of cloud computing environments

– Cannot multiplex a GPU among virtual machines (VMs)

– Cannot consolidate VMs that run GPGPU applications

• GPU virtualization is necessary

– Virtualization is the norm in the cloud

[Figure: a hypervisor on one physical machine shares a single GPU among multiple VMs]

Virtualization Approaches

• Categorized into three approaches

1. I/O pass-through

2. API remoting

3. Para-virtualization

I/O pass-through

• Amazon EC2 GPU instance, Intel VT-d

– Assign physical GPUs to VMs directly

– Multiplexing is impossible

[Figure: the hypervisor assigns physical GPUs to VMs directly; each GPU belongs to exactly one VM]

API remoting

• GViM [Gupta et al. ’09], rCUDA [Duato et al. ’10], VMGL [Lagar-Cavilla et al. ’07], etc.

– Forward API calls from VMs to the host’s GPUs

– API and version compatibility problems

– Enlarges the trusted computing base (TCB)

[Figure: wrapper libraries in the VMs forward API calls to the host’s library and driver; a library version mismatch (v4 in the host vs. v5 in a guest) breaks compatibility]

Para-virtualization

• VMware SVGA2 [Dowty ’09], LoGV [Gottschalk et al. ’10]

– Expose an ideal GPU device model to VMs

– Guest device driver must be modified or rewritten

[Figure: para-virtualized (PV) drivers in the VMs talk to the host driver and the GPU through hypercalls]

Goals

• Fully virtualize GPUs

– allow multiple VMs to share a single GPU

– without any driver modification

• Vanilla driver can be used “as is” in VMs

• GPU runtime can be used “as is” in VMs

• Identify performance bottlenecks of full virtualization

– GPU details are not open

[Figure: each VM runs an unmodified library and driver on top of a virtual GPU; GPUvm multiplexes the virtual GPUs onto the physical GPU]

Outline

• Motivation & Goals

• GPU Internals

• Proposal: GPUvm

• Experiments

• Related Work

• Conclusion

GPU Internals

• PCIe-connected discrete GPU (NVIDIA, AMD GPUs)

• The driver accesses the GPU with MMIO through PCIe BARs (see the MMIO sketch after this slide)

• Three major components

– GPU computing cores, GPU channels, and GPU memory

[Figure: the driver and applications on the CPU reach the GPU’s channels, computing cores, and memory via MMIO through PCIe BARs]
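To make the MMIO path concrete, here is a minimal C sketch of register access through a mapped BAR. The register offsets are invented (the real NVIDIA register map is not publicly documented), and the BAR is simulated with ordinary memory so the sketch compiles and runs; a real driver would mmap() the PCIe BAR resource instead.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical register offsets -- the real register map is undocumented. */
#define REG_CHANNEL_ENABLE 0x0040
#define REG_PAGE_TABLE_PTR 0x0100

static volatile uint32_t *bar0;   /* would come from mmap()ing the PCIe BAR */

static void mmio_write(uint32_t off, uint32_t val) { bar0[off / 4] = val; }
static uint32_t mmio_read(uint32_t off)            { return bar0[off / 4]; }

int main(void)
{
    /* Simulate BAR0 with plain memory so the sketch is runnable anywhere. */
    bar0 = calloc(1 << 16, 1);

    mmio_write(REG_PAGE_TABLE_PTR, 0x1000); /* point a channel at a page table */
    mmio_write(REG_CHANNEL_ENABLE, 1);      /* enable the channel              */
    printf("channel enabled: %u\n", mmio_read(REG_CHANNEL_ENABLE));

    free((void *)bar0);
    return 0;
}
```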

GPU Channel & Computing Cores

• A GPU channel is a hardware unit used to submit commands to the GPU computing cores (a command-ring sketch follows this slide)

• The number of GPU channels is fixed

• Multiple channels can be active at a time

[Figure: applications enqueue GPU commands into channels; the commands are executed on the computing cores]
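The channel abstraction can be pictured as a per-application command ring: the CPU enqueues commands and advances a put pointer, and the GPU executes up to it. The command encoding below is invented purely for illustration; the real command stream format is device-specific.

```c
#include <stdint.h>
#include <stdio.h>

#define RING_SLOTS 256

/* Invented command encoding, for illustration only. */
struct gpu_cmd { uint32_t opcode; uint32_t arg; };

struct gpu_channel {
    struct gpu_cmd ring[RING_SLOTS];
    uint32_t put;   /* written by the CPU                          */
    uint32_t get;   /* advanced by the GPU as it executes commands */
};

/* Enqueue one command; returns 0 on success, -1 if the ring is full. */
static int channel_submit(struct gpu_channel *ch, uint32_t opcode, uint32_t arg)
{
    uint32_t next = (ch->put + 1) % RING_SLOTS;
    if (next == ch->get)
        return -1;                       /* ring full */
    ch->ring[ch->put].opcode = opcode;
    ch->ring[ch->put].arg    = arg;
    ch->put = next;                      /* a real driver would also write the PUT register via MMIO */
    return 0;
}

int main(void)
{
    struct gpu_channel ch = { .put = 0, .get = 0 };
    channel_submit(&ch, /*LAUNCH=*/1, /*kernel id=*/42);
    printf("pending commands: %u\n", (ch.put - ch.get) % RING_SLOTS);
    return 0;
}
```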

GPU Memory

• Memory accesses from the computing cores are confined by GPU page tables (a translation sketch follows this slide)

[Figure: each channel holds a pointer to a GPU page table, which translates GPU virtual addresses to GPU physical addresses in GPU memory]
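As a rough illustration of the confinement, here is a single-level GVA-to-GPA translation in C. Real GPU page tables are multi-level and their entry formats are undocumented, so the PTE layout below is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12
#define PT_ENTRIES  1024
#define PTE_PRESENT 0x1ull

/* Assumed PTE format: bit 0 = present, upper bits = page frame address. */
static uint64_t page_table[PT_ENTRIES];

/* Translate a GPU virtual address to a GPU physical address.
 * Returns 0 and sets *gpa on success, -1 if the mapping is absent. */
static int gpu_translate(uint64_t gva, uint64_t *gpa)
{
    uint64_t idx = (gva >> PAGE_SHIFT) % PT_ENTRIES;
    uint64_t pte = page_table[idx];
    if (!(pte & PTE_PRESENT))
        return -1;                       /* access not allowed: no mapping */
    *gpa = (pte & ~0xFFFull) | (gva & 0xFFF);
    return 0;
}

int main(void)
{
    /* Map GVA page 5 to GPA page 9. */
    page_table[5] = (9ull << PAGE_SHIFT) | PTE_PRESENT;

    uint64_t gpa;
    if (gpu_translate((5ull << PAGE_SHIFT) + 0x10, &gpa) == 0)
        printf("GVA 0x5010 -> GPA 0x%llx\n", (unsigned long long)gpa);
    return 0;
}
```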

Unified Address Space

• GPU and CPU memory spaces are unified

– A GPU virtual address (GVA) can be translated to a CPU physical address as well as to a GPU physical address (GPA)

[Figure: through the unified address space, one GPU page table maps GVAs both to GPU memory (GPAs) and to CPU memory (CPU physical addresses)]

DMA handling in GPU

• DMAs from the computing cores are issued with GVAs

– Confined by the GPU page tables

• DMAs must be isolated between VMs

[Figure: Memcpy(GVA1, GVA2): GVA1 is translated to a GPA in GPU memory and GVA2 to a CPU physical address; the DMA between them is confined by the GPU page table]

Outline

• Motivation & Goals

• GPU Internals

• Proposal: GPUvm

• Experiments

• Related Work

• Conclusion

GPUvm overview

• Isolate GPU channels, computing cores, and memory

[Figure: GPU channels and GPU memory are partitioned between VM1 and VM2; the computing cores are time-shared; each VM sees its own virtual GPU]

GPUvm Architecture

• Expose a virtual GPU to each VM and intercept & aggregate MMIO accesses to it (an interception sketch follows this slide)

• Maintain the virtual GPU views and arbitrate accesses to the physical GPU

[Figure: GPUvm sits in the hypervisor/host, exposes one virtual GPU per VM, intercepts MMIO from the unmodified guest drivers, and arbitrates access to the physical GPU]
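A hedged sketch of how an intercepted MMIO write might be handled: the access always updates the VM's virtual-GPU view, and only virtualization-sensitive registers trigger extra work (here, shadowing a page-table pointer). The register offset and structure names are hypothetical, not GPUvm's actual code.

```c
#include <stdint.h>
#include <stdio.h>

#define VGPU_REG_SPACE 0x1000
#define REG_PT_PTR     0x0100   /* hypothetical virtualization-sensitive register */

struct vgpu {
    int      vmid;
    uint32_t regs[VGPU_REG_SPACE / 4];   /* this VM's view of the GPU registers */
};

/* Called when a guest MMIO write to its virtual GPU traps into the hypervisor. */
static void vgpu_mmio_write(struct vgpu *v, uint32_t off, uint32_t val)
{
    v->regs[off / 4] = val;              /* always update the virtual GPU view */

    if (off == REG_PT_PTR) {
        /* Sensitive: the guest loaded a page-table pointer.  GPUvm would build
         * or refresh the shadow page table and issue the shadow's address to
         * the physical register instead of the guest's value. */
        printf("VM%d: shadow the guest page table at 0x%x\n", v->vmid, val);
    }
    /* Non-sensitive writes are simply aggregated and issued to the physical GPU. */
}

int main(void)
{
    struct vgpu v = { .vmid = 1 };
    vgpu_mmio_write(&v, REG_PT_PTR, 0x4000);
    return 0;
}
```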

GPUvm components

1. GPU shadow page table

– Isolate GPU memory

2. GPU shadow channel

– Isolate GPU channels

3. GPU fair-share scheduler

– Isolate time on the GPU computing cores

GPU Shadow Page Table

• Create GPU shadow page tables

– Memory accesses from the GPU computing cores are confined by the GPU shadow page tables (a shadowing sketch follows this slide)

[Figure: GPUvm keeps each shadow page table consistent with the guest's GPU page table; accesses from a VM's channels are allowed only to the GPU memory assigned to that VM]
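The shadowing step itself can be sketched as a scan that copies present guest entries into the shadow table after translating guest GPU-physical addresses to host addresses. The flat single-level table, PTE format, and gpa_to_hpa() partitioning below are simplifying assumptions, not the real data structures.

```c
#include <stdint.h>
#include <stdio.h>

#define PT_ENTRIES  1024
#define PAGE_SHIFT  12
#define PTE_PRESENT 0x1ull

/* Hypothetical guest-physical -> host-physical translation kept by GPUvm. */
static uint64_t gpa_to_hpa(int vmid, uint64_t gpa)
{
    return gpa + ((uint64_t)vmid << 30);   /* each VM gets its own memory slice */
}

/* Rebuild the shadow table from the guest table for one VM's channel. */
static void shadow_page_table_sync(int vmid,
                                   const uint64_t *guest_pt,
                                   uint64_t *shadow_pt)
{
    for (int i = 0; i < PT_ENTRIES; i++) {
        uint64_t pte = guest_pt[i];
        if (!(pte & PTE_PRESENT)) {
            shadow_pt[i] = 0;              /* unmapped stays unmapped */
            continue;
        }
        uint64_t gpa = pte & ~0xFFFull;
        /* GPUvm would also reject GPAs outside the VM's memory partition. */
        shadow_pt[i] = gpa_to_hpa(vmid, gpa) | PTE_PRESENT;
    }
}

int main(void)
{
    static uint64_t guest_pt[PT_ENTRIES], shadow_pt[PT_ENTRIES];
    guest_pt[3] = 0x2000ull | PTE_PRESENT;        /* guest maps page 3 -> GPA 0x2000 */
    shadow_page_table_sync(/*vmid=*/1, guest_pt, shadow_pt);
    printf("shadow_pt[3] = 0x%llx\n", (unsigned long long)shadow_pt[3]);
    return 0;
}
```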

GPU Shadow Page Table & DMA

• DMAs are also confined by the GPU shadow page tables

– Since DMAs are issued with GVAs

• Other DMAs can be intercepted through MMIO handling

[Figure: a DMA issued with GVAs is allowed only to the issuing VM's own memory; the shadow page table blocks DMAs into another VM's memory]

GPU Shadow Channel

• Channels are logically partitioned among VMs

• Maintain mappings between virtual and shadow channels (a mapping sketch follows this slide)

[Figure: each VM's virtual channels 0, 1, 2, … are mapped to a disjoint range of shadow channels on the GPU, e.g. VM1 to shadow channels 0, 1, 2, … and VM2 to shadow channels 64, 65, 66, …]
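A minimal sketch of the virtual-to-shadow channel mapping, assuming 128 physical channels split evenly between two VMs (0-indexed VM IDs); with 64 channels per VM, the second VM's virtual channel 2 lands on shadow channel 66, matching the slide's example. These counts are assumptions for illustration.

```c
#include <stdio.h>

#define PHYS_CHANNELS   128   /* fixed by the hardware; the value is assumed here */
#define MAX_VMS         2
#define CHANNELS_PER_VM (PHYS_CHANNELS / MAX_VMS)

/* Map a guest-visible virtual channel id to the shadow channel it owns.
 * Returns -1 if the guest asks for a channel outside its partition. */
static int virt_to_shadow_channel(int vmid, int vchan)
{
    if (vchan < 0 || vchan >= CHANNELS_PER_VM)
        return -1;
    return vmid * CHANNELS_PER_VM + vchan;
}

int main(void)
{
    /* First VM's channel 2 -> shadow channel 2; second VM's channel 2 -> 66. */
    printf("VM0 vchan 2 -> %d\n", virt_to_shadow_channel(0, 2));
    printf("VM1 vchan 2 -> %d\n", virt_to_shadow_channel(1, 2));
    return 0;
}
```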

GPU Fair-Share Scheduler

• Schedules non-preemptive command executions (a simplified fair-share sketch follows this slide)

• Employs the BAND scheduling algorithm [Kato et al. ’12]

• GPUvm can also employ existing algorithms

– VGRIS [Yu et al. ’13], Pegasus [Gupta et al. ’12], TimeGraph [Kato et al. ’11], Disengaged Scheduling [Menychtas et al. ’14]

[Figure: each VM's virtual channels feed a per-VM queue; the fair-share scheduler dispatches the queues to the shadow channels, time-sharing the GPU between VM1 and VM2]
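GPUvm's scheduler implements the BAND algorithm [Kato et al. ’12]; the sketch below is not BAND, only a much simpler least-attained-service loop that conveys the basic idea of charging each VM for its non-preemptive kernel executions and picking the least-served VM next.

```c
#include <stdio.h>

#define NUM_VMS 2

/* Simplified fair-share state: accumulated GPU time per VM (microseconds).
 * Illustrative only; BAND additionally compensates for non-preemptive
 * overruns and bandwidth constraints. */
static long used_us[NUM_VMS];

/* Pick the VM with the least accumulated GPU time among those with work. */
static int pick_next_vm(const int *has_work)
{
    int best = -1;
    for (int vm = 0; vm < NUM_VMS; vm++) {
        if (!has_work[vm])
            continue;
        if (best < 0 || used_us[vm] < used_us[best])
            best = vm;
    }
    return best;
}

int main(void)
{
    int has_work[NUM_VMS] = { 1, 1 };
    /* Simulate dispatches: VM0's kernels run 3000 us, VM1's run 1000 us. */
    for (int i = 0; i < 6; i++) {
        int vm = pick_next_vm(has_work);
        long cost = (vm == 0) ? 3000 : 1000;
        used_us[vm] += cost;             /* charged after the kernel finishes */
        printf("dispatch %d: VM%d (total %ld us)\n", i, vm, used_us[vm]);
    }
    return 0;
}
```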

Optimization Techniques

• Introduce several optimization techniques to reduce the overhead caused by GPUvm

1. BAR Remap

2. Lazy Shadowing

3. Para-virtualization

BAR Remap

• MMIO through the PCIe BARs is intercepted by GPUvm

• Allow direct BAR accesses to non-virtualization-sensitive areas (a range-check sketch follows this slide)

[Figure: Naive vs. BAR Remap. Naive: every guest MMIO to the virtual BAR is intercepted by GPUvm, aggregated, and issued to the physical BAR. BAR Remap: accesses to non-sensitive areas go directly to the physical BAR, while sensitive areas (e.g. TLB-flush registers) are still intercepted]
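The core of BAR Remap is a classification of BAR offsets into sensitive and non-sensitive ranges; non-sensitive pages can be mapped straight through to the guest, so their accesses never exit to GPUvm. The ranges below are hypothetical examples.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct range { uint64_t start, end; };

/* Hypothetical virtualization-sensitive BAR regions (e.g. channel control and
 * TLB-flush/page-table registers); everything else can be mapped directly. */
static const struct range sensitive[] = {
    { 0x0000, 0x1000 },   /* channel control          */
    { 0x8000, 0x9000 },   /* TLB flush / page-table ptr */
};

static bool bar_offset_is_sensitive(uint64_t off)
{
    for (size_t i = 0; i < sizeof(sensitive) / sizeof(sensitive[0]); i++)
        if (off >= sensitive[i].start && off < sensitive[i].end)
            return true;
    return false;
}

int main(void)
{
    /* Sensitive pages stay trapped (MMIO intercepted); the rest are remapped
     * so the guest's loads/stores hit the physical BAR without a VM exit. */
    for (uint64_t off = 0; off < 0x10000; off += 0x1000)
        printf("0x%05llx: %s\n", (unsigned long long)off,
               bar_offset_is_sensitive(off) ? "intercept" : "direct");
    return 0;
}
```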

Lazy Shadowing

• Page-fault-driven shadowing cannot be applied

– When a fault occurs, the computation cannot be resumed

• Scanning the entire page tables incurs high overhead

• Delay reflecting updates into the shadow page tables until the channel is used (a sketch follows this slide)

[Figure: naive shadowing scans the guest page tables on every update (three scans in the example); lazy shadowing ignores intermediate updates and scans only once, when the channel becomes active]
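Lazy shadowing can be sketched as a dirty flag per channel: guest page-table updates only set the flag, and a single full scan happens when the channel becomes active. This is a schematic of the policy, not GPUvm's implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Per-channel shadowing state (simplified). */
struct shadow_state {
    bool dirty;        /* guest page table changed since the last scan */
    int  scans;        /* how many full page-table scans were done     */
};

/* Guest wrote its GPU page table: with lazy shadowing we only note it. */
static void on_guest_pt_update(struct shadow_state *s)
{
    s->dirty = true;                      /* naive shadowing would rescan here */
}

/* The channel is about to run: now the shadow must be made consistent. */
static void on_channel_activate(struct shadow_state *s)
{
    if (s->dirty) {
        s->scans++;                       /* one full scan instead of three */
        s->dirty = false;
    }
}

int main(void)
{
    struct shadow_state s = { false, 0 };
    on_guest_pt_update(&s);               /* three guest updates ...           */
    on_guest_pt_update(&s);
    on_guest_pt_update(&s);
    on_channel_activate(&s);              /* ... collapsed into a single scan  */
    printf("page-table scans: %d\n", s.scans);
    return 0;
}
```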

Para-virtualization

• Shadowing is still a major source of overhead

• Provide para-virtualized driver

– Manipulate page table entries through hypercalls (similar to Xen's direct paging)

– Provide a multicall interface that batches several hypercalls into one (borrowed from Xen; a batching sketch follows this slide)

• Eliminate cost of scanning entire page tables
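The PV driver's multicall can be pictured as batching page-table-entry updates and issuing one hypercall for the whole batch. The hypercall record below is invented for illustration; it only mirrors the idea borrowed from Xen's multicall interface.

```c
#include <stdint.h>
#include <stdio.h>

#define BATCH_MAX 64

/* Invented hypercall record: "set shadow PTE idx to val for this channel". */
struct pte_update { uint32_t channel; uint32_t idx; uint64_t val; };

static struct pte_update batch[BATCH_MAX];
static int batch_len;

/* Queue one PTE update instead of trapping to the hypervisor immediately. */
static void queue_pte_update(uint32_t ch, uint32_t idx, uint64_t val)
{
    batch[batch_len++] = (struct pte_update){ ch, idx, val };
}

/* One hypercall applies the whole batch (the "multicall"). */
static void multicall_flush(void)
{
    printf("hypercall: applying %d PTE updates in one trap\n", batch_len);
    batch_len = 0;
}

int main(void)
{
    for (uint32_t i = 0; i < 16; i++)
        queue_pte_update(/*channel=*/0, i, (0x2000ull + 0x1000ull * i) | 1);
    multicall_flush();    /* 16 updates, a single guest-to-hypervisor crossing */
    return 0;
}
```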

Outline

• Motivation & Goals

• GPU Internals

• Proposal: GPUvm

• Experiments

• Related Work

• Conclusion

Evaluation Setup

• Implementation

– Xen 4.2.0, Linux 3.6.5

– Nouveau [http://nouveau.freedesktop.org/]

• Open-source device driver for NVIDIA GPUs

– Gdev [Kato et al. ’12]

• Open-source CUDA runtime

• Xeon E5-24700, NVIDIA Quadro 6000 GPU

• Schemes

– Native: non-virtualized

– FV Naive: full virtualization w/o optimizations

– FV Optimized: full virtualization w/ optimizations

– PV: para-virtualization

Overhead

• Significant overhead over Native

– Can be mitigated by optimization techniques

– PV is faster than FV since it eliminates shadowing

– PV is still 2-3x slower than Native

• Due to hypercalls and MMIO interception

[Figure: relative time for madd (short-term workload), log-scaled: FV Naive 275.39, FV Optimized 40.10, PV 2.08, Native 1.00; optimizations reduce shadowing and MMIO handling, and PV eliminates shadowing]

Performance at Scale

• FV incurs large overhead in the 4- and 8-VM cases

– Since page shadowing locks GPU resources

[Figure: total time (seconds), split into kernel execution time and virtualization overhead, for FV Optimized, PV, and Native with 1, 2, 4, and 8 GPU contexts]

Performance Isolation

• Under FIFO and CREDIT, a long-running task occupies the GPU

• BAND achieves fair sharing

[Figure: GPU utilization (%) over time (0-50 s) of a short task and a long task under the FIFO, CREDIT, and BAND schedulers]

Outline

• Motivation & Goals

• GPU Internals

• Proposal: GPUvm

• Experiments

• Related Work

• Conclusion

Related Work

• I/O pass-through

– Amazon EC2

• API remoting

– GViM [Gupta et al. ’09], vCUDA [Shi et al. ’12], rCUDA [Duato et al. ’10], VMGL [Lagar-Cavilla et al. ’07], gVirtuS [Giunta et al. ’10]

• Para-virtualization

– VMware SVGA2 [Dowty ’09], LoGV [Gottschalk et al. ’10]

• Full-virtualization

– XenGT [Tian et al. ’14]

– The GPU architecture is different (integrated Intel GPU)

Conclusion

• GPUvm shows the design of full GPU virtualization

– GPU shadow page table

– GPU shadow channel

– GPU fair-share scheduler

• Full virtualization exhibits non-trivial overhead

– MMIO handling

• Intercepting TLB flushes and scanning page tables

– Optimizations and para-virtualization reduce this overhead

– However, it is still 2-3 times slower than native