Transcript of "GPUvm: Why Not Virtualizing GPUs at the Hypervisor?" · 2019-11-29
GPUvm: Why Not Virtualizing GPUs at the Hypervisor?
Yusuke Suzuki*, in collaboration with
Shinpei Kato**, Hiroshi Yamada***, Kenji Kono*
* Keio University, ** Nagoya University,
*** Tokyo University of Agriculture and Technology
Graphics Processing Unit (GPU)
• GPUs are used for data-parallel computations
– Composed of thousands of cores
– Peak double-precision performance exceeds 1 TFLOPS
– Performance-per-watt of GPUs outperforms CPUs
• GPGPU is widely accepted for various uses
– Network systems [Jang et al. '11], file systems [Silberstein et al. '13] [Sun et al. '12], DBMS [He et al. '08], etc.
[Figure: GPU with many cores behind per-core L1 caches and a shared L2 cache, backed by video memory; the CPU sits beside it with main memory]
Motivation
• GPUs are not first-class citizens of cloud computing environments
– Cannot multiplex a GPU among virtual machines (VMs)
– Cannot consolidate VMs that run GPGPU applications
• GPU virtualization is necessary
– Virtualization is the norm in the cloud
[Figure: a hypervisor on one physical machine shares a single GPU among multiple VMs]
Virtualization Approaches
• Categorized into three approaches
1. I/O pass-through
2. API remoting
3. Para-virtualization
I/O pass-through
• Amazon EC2 GPU instance, Intel VT-d
– Assign physical GPUs to VMs directly
– Multiplexing is impossible
[Figure: the hypervisor assigns each physical GPU directly to one VM; no sharing]
API remoting
• GViM [Gupta et al. '09], rCUDA [Duato et al. '10], VMGL [Lagar-Cavilla et al. '07], etc.
– Forward API calls from VMs to the host's GPUs
– API and version compatibility problems
– Enlarges the trusted computing base (TCB)
[Figure: wrapper libraries (v4 and v5) in the VMs forward API calls through the hypervisor to the host's library v4 and driver; library versions can mismatch]
Para-virtualization
• VMware SVGA2 [Dowty et al. '09], LoGV [Gottschalk et al. '10]
– Expose an ideal GPU device model to VMs
– Guest device driver must be modified or rewritten
[Figure: para-virtual (PV) drivers in the VMs issue hypercalls through the hypervisor to the host driver and GPU]
Goals
• Fully virtualize GPUs
– allow multiple VMs to share a single GPU
– without any driver modification
• Vanilla driver can be used “as is” in VMs
• GPU runtime can be used “as is” in VMs
• Identify performance bottlenecks of full virtualization
– GPU details are not open…
[Figure: each VM keeps its vanilla library and driver and talks to a virtual GPU multiplexed onto the physical GPU]
Outline
• Motivation & Goals
• GPU Internals
• Proposal: GPUvm
• Experiments
• Related Work
• Conclusion
GPU Internals
• PCIe-connected discrete GPU (NVIDIA, AMD GPUs)
• The driver accesses the GPU with MMIO through PCIe BARs (sketched below)
• Three major components
– GPU computing cores, GPU channel and GPU memory
[Figure: the driver and apps on the CPU reach the GPU over PCIe BARs via MMIO; inside the GPU sit the GPU channels, the computing cores, and GPU memory]
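As a concrete picture of this MMIO path, here is a minimal user-space sketch that maps a PCIe BAR and reads a register, which is what a driver's register accesses boil down to. The sysfs resource file is Linux's standard PCI interface; the device address and register offset are made-up examples.

/* Minimal sketch: touching GPU registers through a PCIe BAR with MMIO.
 * The sysfs resource file is the standard Linux interface; REG_OFFSET
 * and the PCI address are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REG_OFFSET 0x1700u  /* hypothetical register offset inside BAR0 */

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open BAR0"); return 1; }

    /* Map BAR0 into our address space; loads/stores become MMIO. */
    volatile uint32_t *bar0 = mmap(NULL, 0x2000, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (bar0 == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    uint32_t val = bar0[REG_OFFSET / 4];   /* MMIO read  */
    bar0[REG_OFFSET / 4] = val;            /* MMIO write */
    printf("reg@%#x = %#x\n", REG_OFFSET, val);

    munmap((void *)bar0, 0x2000);
    close(fd);
    return 0;
}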
GPU Channel & Computing Cores
• A GPU channel is a hardware unit for submitting commands to the GPU computing cores
• The number of GPU channels is fixed
• Multiple channels can be active at a time (see the sketch below)
[Figure: each app submits GPU commands through its own GPU channel; the commands are executed on the computing cores]
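To illustrate what a channel does, here is a small sketch of command submission modeled as a ring buffer ("pushbuffer") plus a doorbell register. The layout and the doorbell are hypothetical, not the actual NVIDIA channel format.

/* Sketch of command submission through a GPU channel, modeled as a
 * ring buffer plus a PUT pointer and a doorbell MMIO register. */
#include <stdint.h>

#define RING_ENTRIES 1024

struct gpu_channel {
    uint32_t cmds[RING_ENTRIES];    /* command words visible to the GPU  */
    uint32_t put;                   /* next slot the CPU will write      */
    volatile uint32_t *doorbell;    /* MMIO register: tell GPU new PUT   */
};

/* Queue one command word and ring the doorbell so the channel feeds it
 * to the computing cores. */
static void channel_submit(struct gpu_channel *ch, uint32_t cmd)
{
    ch->cmds[ch->put] = cmd;
    ch->put = (ch->put + 1) % RING_ENTRIES;
    *ch->doorbell = ch->put;        /* GPU fetches up to PUT and executes */
}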
GPU Memory
• Memory accesses from computing cores are confined by GPU page tables
[Figure: each GPU channel holds a pointer to a GPU page table; memory accesses from the computing cores are translated from GPU virtual addresses to GPU physical addresses through that table]
Unified Address Space
• GPU and CPU memory spaces are unified
– A GPU virtual address (GVA) is translated to CPU physical addresses as well as GPU physical addresses (GPA)
[Figure: through the GPU page table, one GVA range maps to GPAs in GPU memory and another to CPU physical addresses in CPU memory]
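A sketch of that translation step, assuming a hypothetical single-level PTE format with one bit selecting the aperture (GPU memory vs. CPU memory); real GPU page tables are multi-level.

/* Sketch of the unified address space: a (hypothetical) GPU page-table
 * entry whose target bit says whether the GVA maps to GPU memory (a GPA)
 * or to CPU memory (a CPU physical address). */
#include <stdint.h>

#define PTE_PRESENT       (1u << 0)
#define PTE_TARGET_SYSRAM (1u << 1)   /* 0 = video memory, 1 = CPU memory */

struct translation {
    uint64_t phys;      /* GPA or CPU physical address */
    int      in_sysram; /* which aperture the access goes to */
};

/* Translate a GVA; single-level table and 4 KiB pages for brevity.
 * Returns 0 on success, -1 if not mapped (the access would fault). */
static int gpu_translate(const uint64_t *page_table, uint64_t gva,
                         struct translation *out)
{
    uint64_t pte = page_table[gva >> 12];
    if (!(pte & PTE_PRESENT))
        return -1;
    out->phys = (pte & ~0xfffull) | (gva & 0xfff);
    out->in_sysram = !!(pte & PTE_TARGET_SYSRAM);
    return 0;
}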
DMA handling in GPU
• DMAs from computing cores are issued with GVAs
– Confined by GPU page tables
• DMAs must be isolated between VMs
[Figure: Memcpy(GVA1, GVA2) — GVA1 is translated to a GPA in GPU memory and GVA2 to a CPU physical address in CPU memory, so the DMA crosses PCIe under page-table control]
Outline
• Motivation & Goals
• GPU Internals
• Proposal: GPUvm
• Experiments
• Related Work
• Conclusion
GPUvm overview
• Isolate GPU channels, computing cores & memory
[Figure: the GPU channels and GPU memory are partitioned, one share assigned to VM1 and one to VM2; the computing cores are time-shared; each VM sees its own virtual GPU]
GPUvm Architecture
• Expose a virtual GPU to each VM, and intercept & aggregate MMIO accesses to it
• Maintain virtual GPU views and arbitrate accesses to the physical GPU
[Figure: GPUvm runs in the hypervisor/host below the VMs; it exposes a virtual GPU to each VM's library and driver, and intercepts their MMIOs before anything reaches the physical GPU]
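A sketch of what that interception looks like, with made-up offsets and policy: the virtual BAR is left unmapped in the guest, so every access faults into GPUvm, which updates the VM's virtual-GPU view and forwards only what is safe.

/* Sketch of MMIO interception for a guest write to the virtual BAR. */
#include <stdbool.h>
#include <stdint.h>

struct virtual_gpu {
    uint32_t regs[0x10000 / 4];     /* this VM's view of the register file */
};

/* Hypothetical policy: offsets below 0x400 carry channel and page-table
 * state, so they must be emulated rather than passed through. */
static bool safe_to_forward(uint32_t offset)
{
    return offset >= 0x400;
}

/* Called from the hypervisor's fault handler on a guest MMIO write. */
static void vgpu_mmio_write(struct virtual_gpu *vgpu, uint32_t offset,
                            uint32_t value, volatile uint32_t *phys_bar)
{
    vgpu->regs[offset / 4] = value;     /* keep the virtual GPU view       */
    if (safe_to_forward(offset))
        phys_bar[offset / 4] = value;   /* aggregate onto the real BAR     */
    /* else: emulate the effect (e.g., rebuild shadow state) before any
     * access reaches the physical GPU */
}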
GPUvm components
1. GPU shadow page table
– Isolate GPU memory
2. GPU shadow channel
– Isolate GPU channels
3. GPU fair-share scheduler
– Isolate each VM's share of time on the GPU computing cores
GPU Shadow Page Table
• Create GPU shadow page tables
– Memory accesses from GPU computing cores are confined by GPU shadow page tables
[Figure: each VM's virtual GPU has guest GPU page tables; GPUvm keeps GPU shadow page tables consistent with them, so a channel's GVA accesses are allowed only inside its own VM's region of GPU memory and denied elsewhere]
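A sketch of shadow-table construction under simplifying assumptions (single-level tables, per-VM contiguous memory partitions, invented helper names): guest entries are translated into host entries, and anything pointing outside the VM's partition is rejected.

/* Sketch: rebuild a GPU shadow page table from a guest's page table. */
#include <stdint.h>

#define N_PTES 512
#define PTE_PRESENT (1u << 0)

struct vm_partition {
    uint64_t base, size;            /* this VM's slice of GPU memory */
};

static int shadow_rebuild(const uint64_t *guest_pt, uint64_t *shadow_pt,
                          const struct vm_partition *part)
{
    for (int i = 0; i < N_PTES; i++) {
        uint64_t gpte = guest_pt[i];
        if (!(gpte & PTE_PRESENT)) { shadow_pt[i] = 0; continue; }

        /* Guest "physical" pages are offsets into the VM's partition. */
        uint64_t hpa = part->base + (gpte & ~0xfffull);
        if (hpa + 0x1000 > part->base + part->size)
            return -1;              /* outside the partition: deny        */

        shadow_pt[i] = hpa | (gpte & 0xfffull);  /* keep permission bits  */
    }
    return 0;
}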
GPU Shadow Page Table & DMA
• DMA is also confined by GPU shadow page tables
– Since DMA is issued with the GVA
• Other DMAs can be intercepted by MMIO handling
[Figure: a DMA issued with GVAs goes through the VM's GPU shadow page table; a transfer into VM1's own memory is allowed, while one into VM2's memory is not]
GPU Shadow Channel
• Channels are logically partitioned for VMs
• Maintain mappings between virtual & shadow channels (see the sketch below)
[Figure: VM1's virtual channels 0, 1, 2, … map to GPU shadow channels 0, 1, 2, … (assigned to VM1), while VM2's virtual channels 0, 1, 2, … map to shadow channels 64, 65, 66, … (assigned to VM2)]
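The mapping itself can be as simple as a fixed per-VM block of hardware channels; this sketch assumes the 64-channel blocks shown in the figure.

/* Sketch of the virtual-to-shadow channel mapping with fixed per-VM
 * blocks: virtual channel i of a VM maps to block_base + i. */
#define CHANNELS_PER_VM 64

/* Returns the hardware (shadow) channel backing a VM's virtual channel,
 * or -1 if the virtual channel id is out of range. */
static int shadow_channel(int vm_id, int virt_chan)
{
    if (virt_chan < 0 || virt_chan >= CHANNELS_PER_VM)
        return -1;
    return vm_id * CHANNELS_PER_VM + virt_chan;
}
/* Example: VM1 (vm_id 0) channel 2 -> 2; VM2 (vm_id 1) channel 2 -> 66. */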
GPU Fair-Share Scheduler
• Schedules non-preemptive command executions (see the sketch below)
• Employs the BAND scheduling algorithm [Kato et al. '12]
• GPUvm can employ existing algorithms
– VGRIS [Yu et al. '13], Pegasus [Gupta et al. '12], TimeGraph [Kato et al. '11], Disengaged Scheduling [Menychtas et al. '14]
[Figure: each VM's virtual channels feed a per-VM queue; the GPU fair-share scheduler dispatches from the queues onto the shadow channels, time-sharing the GPU between VM1 and VM2]
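For intuition, a deliberately simplified fair-share sketch over per-VM queues; this is not the BAND algorithm from [Kato et al. '12], it only shows the structure: pick the VM with the least accumulated GPU time, run its commands to completion (non-preemptive), then charge it.

/* Simplified fair-share scheduling sketch over per-VM queues. */
#include <stdint.h>

#define N_VMS 2

struct vm_queue {
    uint64_t gpu_time_used;         /* accumulated execution time (us)   */
    int      has_work;              /* commands pending in this queue    */
};

static int pick_next_vm(struct vm_queue q[N_VMS])
{
    int best = -1;
    for (int i = 0; i < N_VMS; i++)
        if (q[i].has_work &&
            (best < 0 || q[i].gpu_time_used < q[best].gpu_time_used))
            best = i;
    return best;                    /* -1 means the GPU can idle */
}

static void charge(struct vm_queue *q, uint64_t exec_us)
{
    /* Non-preemptive: the charge is known only after the commands finish. */
    q->gpu_time_used += exec_us;
}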
Optimization Techniques
• Introduce several optimization techniques to reduce overhead caused by GPUvm
1. BAR Remap
2. Lazy Shadowing
3. Para-virtualization
BAR Remap
• MMIO through PCIe BARs is intercepted by GPUvm
• Allow direct BAR accesses to the non-virtualization-sensitive areas (classification sketched below)
[Figure: Naive — every guest-driver MMIO to the virtual BAR, sensitive or not, is intercepted, aggregated, and issued to the physical BAR. With BAR Remap — only accesses to the sensitive areas (e.g., TLB flush) are intercepted; the non-sensitive areas are accessed directly]
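A sketch of the classification BAR Remap needs, with made-up ranges: only offsets inside the sensitive ranges (channel descriptors and the TLB-flush trigger in this example) keep being intercepted; everything else maps straight through.

/* Sketch: decide whether a BAR offset must stay intercepted. */
#include <stdbool.h>
#include <stdint.h>

struct bar_range { uint32_t start, end; };

/* Hypothetical sensitive areas of BAR0. */
static const struct bar_range sensitive[] = {
    { 0x000000, 0x001000 },   /* channel descriptors        */
    { 0x100000, 0x100100 },   /* TLB flush trigger register */
};

static bool must_intercept(uint32_t offset)
{
    for (unsigned i = 0; i < sizeof sensitive / sizeof sensitive[0]; i++)
        if (offset >= sensitive[i].start && offset < sensitive[i].end)
            return true;
    return false;    /* non-sensitive: guest page maps straight to the BAR */
}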
Lazy Shadowing
• Page-fault-driven shadowing cannot be applied
– When a fault occurs, the computation cannot be resumed
• Scanning entire page tables incurs high overhead
• Delay reflecting updates into the shadow page tables until the channel is used (sketched below)
[Figure: timeline — the naive approach scans the guest page table on every update (3 scans before the channel becomes active); lazy shadowing ignores the intermediate updates and scans only once, when the channel becomes active]
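A sketch of the lazy policy, with illustrative helper names: a guest TLB flush only marks the shadow state stale, and the full scan happens once, at channel activation.

/* Sketch of lazy shadowing. */
#include <stdbool.h>

/* Placeholder for the expensive full-table scan and shadow rebuild. */
static void rescan_guest_page_table(void) { /* ... */ }

struct shadow_state {
    bool stale;    /* guest changed its page table since the last scan */
};

/* Naive shadowing would rescan here, on every intercepted TLB flush. */
static void on_guest_tlb_flush(struct shadow_state *s)
{
    s->stale = true;
}

/* Lazy shadowing: pay for one scan, only when the channel is activated. */
static void on_channel_activate(struct shadow_state *s)
{
    if (s->stale) {
        rescan_guest_page_table();   /* one scan, however many flushes */
        s->stale = false;
    }
}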
Para-virtualization
• Shadowing is still a major source of overhead
• Provide a para-virtualized driver (batching sketched below)
– Manipulates page table entries through hypercalls (similar to Xen direct paging)
– Provides a multicall interface that batches several hypercalls into one (borrowed from Xen)
• Eliminates the cost of scanning entire page tables
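A sketch of how a multicall-style batch can look from the PV driver's side; the hypercall number, entry layout, and function names are illustrative, modeled on Xen's multicall idea rather than GPUvm's actual interface.

/* Sketch: batch page-table-entry updates into one hypervisor crossing. */
#include <stdint.h>

#define HC_SET_PTE 1            /* hypothetical hypercall: write one PTE */
#define MAX_BATCH  64

struct hypercall { uint32_t op; uint64_t args[2]; };

struct multicall_buf {
    struct hypercall calls[MAX_BATCH];
    int n;
};

static int batch_set_pte(struct multicall_buf *mc, uint64_t idx, uint64_t pte)
{
    if (mc->n >= MAX_BATCH)
        return -1;              /* caller should flush the batch first */
    mc->calls[mc->n++] = (struct hypercall){ HC_SET_PTE, { idx, pte } };
    return 0;
}

/* One guest/hypervisor crossing replaces n individual hypercalls. */
extern long hypercall_multicall(struct hypercall *calls, int n);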
Outline
• Motivation & Goals
• GPU Internals
• Proposal: GPUvm
• Experiments
• Related Work
• Conclusion
Evaluation Setup
• Implementation
– Xen 4.2.0, Linux 3.6.5
– Nouveau [http://nouveau.freedesktop.org/]
• Open-source device driver for NVIDIA GPUs
– Gdev [Kato et al. '12]
• Open-source CUDA runtime
• Xeon E5-2470, NVIDIA Quadro 6000 GPU
• Schemes
– Native: non-virtualized
– FV Naive: full virtualization w/o optimizations
– FV Optimized: FV w/ optimizations
– PV: para-virtualization
Overhead
• Significant overhead over Native
– Can be mitigated by the optimization techniques
– PV is faster than FV since it eliminates shadowing
– PV is still 2-3x slower than Native
• Remaining costs: hypercalls, MMIO interception
[Figure: relative time on madd (short-term workload), log-scaled — FV Naive 275.39x, FV Optimized 40.10x, PV 2.08x, Native 1.00x; annotations: FV Optimized reduces shadowing & MMIO handling, PV eliminates shadowing]
Performance at Scale
• FV incurs large overhead in the 4- and 8-VM cases
– Since page shadowing locks GPU resources
[Figure: time in seconds, split into kernel execution time and virtualization overhead, for FV Optimized, PV, and Native at 1, 2, 4, and 8 GPU contexts]
Performance Isolation
• Under FIFO and CREDIT, a long-running task occupies the GPU
• BAND achieves fair sharing
[Figure: GPU utilization (%) over 0-50 seconds for a short task and a long task, one panel each for FIFO, CREDIT, and BAND]
Outline
• Motivation & Goals
• GPU Internals
• Proposal: GPUvm
• Experiments
• Related Work
• Conclusion
Related Work
• I/O pass-through
– Amazon EC2
• API remoting
– GViM [Gupta et al. '09], vCUDA [Shi et al. '12], rCUDA [Duato et al. '10], VMGL [Lagar-Cavilla et al. '07], gVirtuS [Giunta et al. '10]
• Para-virtualization
– VMware SVGA2 [Dowty et al. '09], LoGV [Gottschalk et al. '10]
• Full virtualization
– XenGT [Tian et al. '14]
– Targets a different GPU architecture (integrated Intel GPU)
Conclusion
• GPUvm shows the design of full GPU virtualization
– GPU shadow page table
– GPU shadow channel
– GPU fair-share scheduler
• Full virtualization exhibits non-trivial overhead
– MMIO handling
– Intercepting TLB flushes and scanning page tables
– Optimizations and para-virtualization reduce this overhead
– However, it is still 2-3 times slower than native