I/O Page Faults
Ilya Lesokhin
I/O Page Faults
Research Thesis
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
Ilya Lesokhin
Submitted to the Senate
of the Technion — Israel Institute of Technology
Heshvan 5776 Haifa November 2015
This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty
of Computer Science.
Results pertaining to the InfiniBand setup were generated by the Mellanox team; the
Ethernet results were produced by the author of this thesis.
Acknowledgements
I would like to thank my advisor, Prof. Dan Tsafrir, for teaching me how to do
research and for pushing me when I wanted to give up. I would also like to thank my fellow
students Nadav Amit, Omer Peleg and Muli Ben-Yehuda for many fruitful discussions
and the technical help they provided during my work on this thesis.
I thank Mellanox, and especially the architecture team, Liran Liss, Shachar Raindel
and Haggai Eran, for providing the hardware, many of the results, and help with the
writing; without their support this work would not have been possible. Finally, I would
like to thank my parents for supporting me all along.
The generous financial help of the Technion is gratefully acknowledged.
Contents
List of Figures
Abstract 1
Abbreviations and Notations 3
1 Introduction 5
2 Background 7
2.1 PCIe Primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Virtual Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 InfiniBand Primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Motivation 11
4 Basic IOPF Support 13
4.1 Non-Recoverable Failures . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 IOPF support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5 The Page Fault Latency Problem 19
6 HW Transport IOPF Support 21
7 SW Transport IOPF Support 23
7.1 The Cold Ring Problem with TCP . . . . . . . . . . . . . . . . . . . . . 23
7.2 The Backup Ring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
8 Evaluation 33
8.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
8.2 Cost of IOPFs on ConnectX-3 . . . . . . . . . . . . . . . . . . . . . . . . 34
8.3 Cost of IOPFs and Invalidations on ConnectX-IB . . . . . . . . . . . . . 35
8.4 Network Transport and IOPF Interplay . . . . . . . . . . . . . . . . . . 37
8.4.1 Impact of Periodic IOPFs on Bandwidth . . . . . . . . . . . . . . 37
8.4.2 Cold Ring Problem and Backup Ring . . . . . . . . . . . . . . . 40
8.5 System-level IOPF Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 42
8.5.1 Cloud and Web 2.0 Environment . . . . . . . . . . . . . . . . . . 42
8.5.2 Applications with Direct-I/O . . . . . . . . . . . . . . . . . . . . 48
9 Related Work 51
9.1 Existing Direct Device Assignment Solutions . . . . . . . . . . . . . . . 51
9.2 Generic IOPF support . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.3 Networking IOPF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.4 GPUs and other accelerators . . . . . . . . . . . . . . . . . . . . . . . . 52
9.5 Handling latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
10 Discussion and Future Work 55
10.1 Problems with ATS/PRI . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
10.2 Optimizations and Future Work . . . . . . . . . . . . . . . . . . . . . . . 56
11 Conclusion 59
Hebrew Abstract i
List of Figures
4.1 IOPF (1 – 4) and invalidation (a – d) flows . . . . . . . . . . . . . . . . 16
7.1 High level design of the backup ring. . . . . . . . . . . . . . . . . . . . . 24
7.2 Software pseudo-code of the backup ring . . . . . . . . . . . . . . . . . . 25
7.3 Hardware pseudo-code for the backup ring. . . . . . . . . . . . . . . . . 27
7.4 lwIP main loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7.5 Pseudo-code for the backup ring approximation. . . . . . . . . . . . . . 30
7.6 lwIP vs. Linux performance evaluation. . . . . . . . . . . . . . . . . . . 32
8.1 Minor IOPF handling breakdown for ConnectX-3. . . . . . . . . . . . . 34
8.2 (a) IOPF and (b) invalidation flow execution breakdown on ConnectX-IB. 36
8.3 Throughput of a stream benchmark in the presence of rIOPFs of varying
frequencies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.4 Transient operation of a TCP stream benchmark over time in the presence
of minor rIOPFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
8.5 (a), (b) Startup with 64 entries in receive ring. (c) Time it takes to
perform 10,000 operations as a function of receive ring size. . . . . . . . 41
8.6 Pinning vs. IOPF with dynamic working set: (a) with IOPFs (b) with
pinning (c) combined throughput . . . . . . . . . . . . . . . . . . . . . . 43
8.7 The transition period. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.8 Flipping the working set with different swap devices. . . . . . . . . . . . 45
8.9 No swap experiment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.10 Rare good results with HDD. . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8.11 System never recovers (90-10-90). . . . . . . . . . . . . . . . . . . . . . . . 47
8.12 (a) Storage bandwidth with single initiator and varying memory limit.
(b) Memory usage with multiple initiators and a fixed memory limit. . . 47
8.13 IMB running time for different MPI operations by message size. The
ratio between the copying and the pinning run-times is shown in the labels. 49
Abstract
Virtual memory is used in most modern general purpose computer systems. This
invention simplifies systems and increases their usability and efficiency. In recent years,
I/O devices have also started using virtual addresses. However, support for I/O page faults is
still lacking. I/O devices are designed under the assumption that the virtual addresses
they use are always valid, and software is forced to make sure that this is
indeed the case. This deficiency deprives one class of software of the benefits of
virtual memory: it prevents memory overcommitment, complicates the programming
model and hinders administration. The affected software class is exclusively comprised
of software that performs direct I/O, which is the act of accessing I/O devices without
any involvement of intermediary privileged software such as the operating system (OS)
kernel or the hypervisor. Prominent examples of this are direct device assignment of
SR-IOV (single root I/O virtualization) instances in virtualization scenarios and
kernel-bypass access to I/O devices by user space applications.
This thesis presents working hardware and software support for I/O page faults
(IOPFs) in a network interface card (NIC). It describes the challenges involved in
implementing this support and demonstrates that an IOPF-enabled NIC allows for
efficient memory overcommitment.
Abbreviations and Notations
I/O : Input/Output
IOPF : I/O page fault
rIOPF : receive I/O page fault
MMU : Memory Management Unit
IOMMU : I/O Memory Management Unit
TLB : Translation Lookaside Buffer
IOTLB : I/O Translation Lookaside Buffer
VA : Virtual Address
IOVA : I/O Virtual Address
OS : Operating System
DMA : Direct Memory Access
NIC : Network Interface Card
TX : Transmit
RX : Receive
PT : Page Table
PTE : Page Table Entry
IP : Internet Protocol
TCP : Transmission Control Protocol
VM : Virtual Machine
API : Application Programming Interface
TPS : Transactions Per Second
HPC : High-Performance Computing
CWND : Congestion Window
RTT : Round-Trip Time
Chapter 1
Introduction
The availability of physical memory often determines the performance of the system
for a given workload. Lack of sufficient memory may even render the system unusable.
Virtual memory, which was introduced in the 1960s, allows a computing system to run
multiple workloads concurrently while sharing the physical memory in a transparent
manner. In addition, virtual memory optimizes physical memory usage by holding only
the necessary working set of each workload.
Virtual memory is usually implemented by the CPU’s memory management unit
(MMU), and is thus not exposed to I/O devices. I/O memory management units
(IOMMUs), which are integrated in modern servers and devices, provide similar MMU
services to I/O devices. However, until recently, I/O devices were not able to inform the
operating system about page fault events. This deficiency limits the I/O devices’ virtual
memory support to isolation only; I/O devices cannot handle dynamically changing
working sets.
Nonetheless, IOMMU-enabled devices are prevalent today. Prominent examples
include direct device assignment of I/O devices to virtual machines (VMs) [RS07,
WSC+07, YBYW08], high performance computing (HPC) applications [JLJ+04], and
packet processing applications [Int]. In §2 we provide a brief overview of PCIe, virtual memory, and recent advancements in IOMMU technology.
By adding support for I/O Page Faults (IOPF), I/O devices become true first-class
citizens in virtual memory (§3). IOPF support provides the means for I/O devices to
directly access virtual memory pages, which are not guaranteed to be resident in
physical memory at the time of access. It comprises two complementary mechanisms:
(1) allowing an I/O device to request from the OS physical mappings of currently
non-present pages on demand; (2) allowing the OS to invalidate mappings on the I/O
device. In this work we focus on NICs, one of the most demanding classes of I/O devices,
and provide a prototype implementation of IOPF support.
Initially, we detail the design trade-offs and the implementation of the fundamental
building blocks necessary for IOPF (§4). These building blocks have broad applicability, and may serve any class of I/O device that supports IOPFs.
Next, we observe that basic IOPF support is not enough. For a large class of I/O
devices it is crucial to efficiently tolerate the increased latency incurred by page faults,
which now occurs directly in the I/O fast path. We elaborate on this observation for
the general case and for the specific case of networking (§5).
We then describe how IOPF latency may be tolerated in two prominent implementation
approaches in use today for high-end network devices: HW-offloaded (§6) and
SW-managed (§7) transport protocols. In the former, IOPF and transport processing
are coupled together, and the intimate knowledge between the two may be leveraged
accordingly. In the latter, the I/O device provides the basic IOPF support, while SW
manages the transport (e.g., TCP). Here, the coordination is less tight, but the SW
implementation allows more freedom in the design space.
In §8 we provide a performance evaluation of IOPF. We begin by showing the basic costs of a single IOPF in our implementation, followed by the effects of recurring page
faults on the network bandwidth. We show that even with a page fault probability
of 2⁻¹⁵, we achieve full network bandwidth. Next, we demonstrate the performance
gains of IOPF in multiple real-world deployment scenarios. We examine virtualization,
storage and HPC scenarios. In both the virtualization and storage use cases, the IOPF
implementation achieved about 80% performance gain compared to the current art. The
IOPF implementation was also able to pack more VMs on the same physical machine.
In the HPC scenario, the IOPF implementation achieved performance on par with the
current state of the art, while simplifying the programming model significantly.
We provide an overview of related commercial and academic works (§9) and discuss insights gained in this work and future directions (§10). To the best of our knowledge, this is the first detailed study and evaluation of IOPF in real-world systems. We
conclude in §11.
Chapter 2
Background
2.1 PCIe Primer
Presently, most I/O devices are connected to the computer using PCIe (Peripheral
Component Interconnect Express) [PS14]. PCIe was designed by PCI-SIG to replace
its predecessor, PCI, which is a true bus. In contrast, PCIe is a point-to-point link. It is
nevertheless commonly referred to as a “bus” because it is backward compatible with
PCI and, as a result, it behaves like a bus from the software perspective.
The actual topology, however, is not a bus. It consists of many point-to-point links
and switches that connect all the peripheral devices to the PCIe root complex device.
The latter typically resides on die and is responsible for connecting peripheral devices
to the CPU and memory. It facilitates three important functionalities, as follows.
The first functionality is Memory Mapped I/O (MMIO). The physical memory space
contains at least one contiguous address interval, denoted the PCI MMIO range. Such a
range is owned by the root complex and is thus ignored by the memory controller. Any
memory operation issued by the CPU that is directed at a PCI MMIO range is handled
by the root complex. The latter converts the operations into PCIe requests, which are
then fulfilled by the corresponding I/O devices. This mechanism is denoted MMIO. It
is used to communicate with the I/O devices, e.g., by allowing access to their registers
as if they are “ordinary” memory.
We note in passing that MMIO operations can be translated to either PCIe memory
operations or PCIe I/O operations. I/O operations are slower and should only be used
for initialization. The PCIe I/O and memory operations operate in different address
spaces. Each device has a fixed size configuration space in the I/O address space. The
address of this space can be calculated using the device ID. The device ID itself is
composed of bus, device and function numbers. The configuration spaces expose, among
other things, the base address registers (BARs) of the devices, through which devices
are notified to which address ranges in the PCIe memory address space they should
respond. The BARs themselves must therefore be configured before the PCIe memory
operations can be used.
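To make the addressing concrete, the following sketch shows how a configuration-space address is commonly formed from the bus, device and function numbers using the legacy 0xCF8/0xCFC I/O-port mechanism. This is an illustrative example rather than part of the thesis prototype; PCIe systems may instead expose the same configuration space through a memory-mapped ECAM region.

    #include <stdint.h>

    /* Build the value written to I/O port 0xCF8; the selected register is
     * then read or written through I/O port 0xCFC. */
    static uint32_t pci_config_addr(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
    {
        return (1u << 31)                       /* enable bit                    */
             | ((uint32_t)bus << 16)            /* bus number                    */
             | ((uint32_t)(dev & 0x1f) << 11)   /* device number (5 bits)        */
             | ((uint32_t)(fn & 0x07) << 8)     /* function number (3 bits)      */
             | (reg & 0xfc);                    /* dword-aligned register offset */
    }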
The second functionality facilitated by the root complex is direct memory access
(DMA), which allows devices to access the main memory. I/O devices perform DMA
by issuing PCIe memory read and write requests. The root complex processes these
requests and passes them to the memory controller similarly to how the CPU does it.
DMA accesses carry a bit that specifies whether cache coherency should be applied to
them.
The third root complex functionality is interrupt delivery. PCIe devices trigger
interrupts to receive the attention of the CPU asynchronously. Interrupts are triggered
similarly to DMAs, i.e., by the device writing to a special address.¹ The PCIe root
complex also implements the IOMMU functionality. MMIO operations to the device
are translated by the MMU, whereas DMA operations from the device are translated by
the IOMMU. Interrupt requests, which are essentially PCIe write operations, are also
translated by the IOMMU—a functionality that is commonly referred to as interrupt
remapping.
2.2 Virtual Memory
Virtual memory is used in most modern general purpose computer systems [Den70].
This invention simplifies systems and increases their usability and efficiency. Virtual
memory isolates processes from each other. Each process has its own virtual address
space. Independently compiled applications can reside and reference memory at any
location without risking conflicts.
Not all virtual address ranges referenced by a process always reside in physical
memory. Only the current working set is resident. The process relies on the OS
and underlying paging mechanisms to track changes in the working set and adjust
the memory mapping accordingly. Locality of reference [Den05] makes paging work
well [SK09]. It allows substantial savings by overcommitting in physical memory. Only
a small portion of the large virtual address spaces is actually mapped to the limited
physical memory.
As the active working sets change, OSes use secondary storage (e.g., disks) as
an extension to physical memory. Data that is not part of the active working set is
transparently moved to the secondary storage. If an application attempts to access
data that was moved, a page fault exception is raised. This exception is handled by the
OS, which brings the data back from the secondary storage to memory. During this
handling, the application is suspended. Once the data is again in memory, the virtual
memory mapping is updated, and the application is resumed. Page faults are typically
classified as major or minor depending on whether disk access is required to satisfy a
missing virtual-to-physical page mapping.
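As a hedged illustration of this classification (not part of the thesis itself), the POSIX getrusage() call reports how many minor and major page faults the calling process has taken:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;

        getrusage(RUSAGE_SELF, &ru);
        printf("minor page faults (no disk access): %ld\n", ru.ru_minflt);
        printf("major page faults (disk access):    %ld\n", ru.ru_majflt);
        return 0;
    }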
The OS uses free physical memory as a disk cache. A special API [Gal95] allows
the application to access this cache directly, using the virtual memory mechanism. In a
¹ PCIe also supports legacy PCI interrupts. Those interrupts are not implemented as memory writes.
balanced computing system, disk paging activity usually occurs only during sporadic
transient periods. These periods can occur for example, when new processes are spawned
and use the physical memory of currently idle processes or rarely accessed cached disk
blocks.
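One such API is POSIX mmap(). The sketch below is an illustrative assumption about the kind of interface alluded to above (the file name data.bin is made up and error handling is omitted): it maps a file so that the page cache is accessed directly through virtual memory, with pages faulted in on demand.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);   /* hypothetical file */
        struct stat st;
        fstat(fd, &st);

        /* No data is read here; each page is brought in by a page fault on
         * first access, directly from the OS disk cache when possible. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        volatile char first = p[0];            /* triggers a minor or major fault */
        (void)first;

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }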
Virtual memory provides additional system-wide benefits. Prominent examples in-
clude speeding up process startup time by reading only the needed parts of the executable;
efficient fork() implementation using copy-on-write (CoW); and reducing the physical
memory footprint using active de-duplication [Wal02a] or page compression [Gup09].
Until recently, virtual memory was the sole domain of applications running on the
CPU, but its use is now expanding to peripheral devices as well. System IOMMUs
reside between the I/O device and memory, and map virtual I/O address ranges
that are provided to the device into physical page frames in a secure way. IOMMUs
are typically used in direct HW pass-through to VMs [YBYW08] and user-level I/O
[CBD+98, Sch01, SD06]. In addition, some device classes, such as RDMA devices
[vE98], employ embedded IOMMU units within the device. All of these IOMMU devices
address only memory mapping for isolation; they are not able to indicate a page fault
to the OS. In the past year, commercial devices that support paging have been showing
up in the market [PS09, SBJS15]. However, none of these implementations examine the
impact of a page fault during an I/O operation. Thus, the implications of introducing
the full benefits of virtual memory to I/O devices remain unexplored.
2.3 InfiniBand Primer
InfiniBand [Inf15] is a computer networking standard widely used in high performance
computing. InfiniBand exposes both a send/receive semantic and a remote direct
memory access (RDMA) semantic for data transfers.
Send/receive is the standard semantic familiar from Ethernet. It is also
called the two-sided semantic, as software on both sides is involved in the data transfer. The
receiver must post a buffer large enough to receive the incoming data and the sender
needs to tell the NIC what buffer to send. Since the receiver usually cannot know the
size of incoming messages in advance, there is usually an agreed upon maximal message
size and all the buffers posted by the receiver are of that size. This creates inefficiencies
for both the sender and the receiver. The sender is limited in the amount of data it can
send in one message, forcing it to split large data transfers into multiple transactions
while the receiver wastes memory when the sender passes messages that are smaller
than the maximal message size.
The RDMA semantic, also called the one-sided semantic, allows data transfer with software
involvement from only one of the parties. RDMA does require initial software involvement
on both sides to establish a connection and decide which memory areas are accessible to
other parties. But after this initial setup, one of the parties can issue RDMA read and
RDMA write operations to access the memory of the remote party without any software
involvement from the remote party. This semantic reduces the overhead of data transfer
for both parties. The initiating party is free to use the optimal transaction size for the
transfer and the other party does not incur any software overhead. We note that while
RDMA formally refers only to the one-sided semantic described above, it is commonly
used informally for all InfiniBand user-level I/O operations, including send and receive.
While one could imagine an application where all the communication is done using
RDMA, this is usually not the case. Consider, for example, a remote storage application
with a client and a server. Since the data is stored on secondary storage, the client
usually cannot access it directly using RDMA operations. Even if it could access all
the data, we would probably want some server involvement to synchronize concurrent
access from multiple clients. Consequently, RDMA is typically used only for large data
transfers and there is usually a side channel for control. In our storage example, the
client would use the control channel to say: “I want to read k blocks starting at block
number x”, the server would respond with “Ok, it is available at address p, please let
me know when you are done”. The client would then use RDMA to read the actual
data in the blocks and use the control channel once again to notify the server when
it is done. The control channel is typically implemented with send/receive semantics
because the software on the receiver does want to be notified about the reception of a
control message and be able to do something about it.
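To make the storage example concrete, the following sketch shows how the client side might post the RDMA read using the standard libibverbs API. It is a minimal illustration, not code from the thesis: the queue pair, the memory region, and the (remote_addr, rkey) pair advertised by the server over the control channel are assumed to have been set up beforehand, and error handling is omitted.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Read the blocks the server advertised at (remote_addr, rkey) into
     * local_buf. No software runs on the server during this transfer. */
    static int rdma_read_blocks(struct ibv_qp *qp, struct ibv_mr *mr,
                                void *local_buf, size_t len,
                                uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_READ,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,   /* request a completion entry */
        };
        struct ibv_send_wr *bad_wr;

        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad_wr);
    }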
Unlike the Ethernet standard, which ends at the link layer, InfiniBand also includes
a specification for the transport layer, encouraging implementation of that layer in
hardware. The most commonly used transport is the reliable connection (RC) transport,
which provides reliable data transfer between two parties similarly to TCP/IP. Other
notable transports are unreliable connection (UC) and unreliable datagram (UD); like
UDP/IP, they are both unreliable. The difference is that UC is connection oriented and
mandates a send queue per connection, while UD supports multicast and the use of a single
send queue to talk to multiple parties. One notable place where UD, rather than RC,
is typically used is in the implementation of the IP over InfiniBand (IPoIB) [CK06] driver.
IPoIB allows regular applications designed to be used in the IP environment to work
over InfiniBand networks. The rationale for using UD in IPoIB is that it is simpler to
add InfiniBand headers to IP packets and send them with a shared send queue than to
maintain multiple reliable send queues corresponding to all the active remote parties.
Chapter 3
Motivation
Direct I/O applications and virtualized systems with direct device assignment have two
ways of accessing memory: directly, using the CPU, and indirectly, by asking an I/O
device to access the memory on the application's behalf. While CPU page faults are
supported on all modern general purpose systems, support for IOPFs is almost
nonexistent. Consequently, such systems are forced to use pinning and cannot enjoy the full
benefits of virtual memory.
                     Translation only    Translation + page fault support
isolation                   ✓                          ✓
on-demand paging            ✗                          ✓
swapping                    ✗                          ✓
overcommitment              ✗                          ✓
easy programming            ✗                          ✓
page migration              ✗                          ✓

Table 3.1: Benefits of virtual memory
Table 3.1 lists the benefits of virtual memory and specifies whether they can be
provided by a system that only supports translation or whether page fault support is
also required. As we can see, the only benefit that a translation-only virtual memory
system can provide is isolation. Such a system cannot provide on-demand paging or
swapping, because a page fault is required to page in the data. It cannot provide copy
on write (CoW), because a page fault upon write is required to copy the data and
break the CoW mapping. As a result of the limitations above, there is no memory
overcommitment in such a system. The ability to overcommit memory allows programs
to allocate memory even if there is not enough physical memory to satisfy the allocation.
This ability greatly simplifies the programming model, as it relieves the programmer from
writing fall-back code paths for every memory allocation failure. In typical systems today,
memory allocation failures in user space are so rare that it is acceptable for a program
to simply terminate when an allocation fails. In addition to all the overcommitment-related
benefits mentioned above, page faults are also required for page migration, which
allows compacting the physical memory for more efficient usage.
With IOPF support, the inherent benefits of virtual memory apply seamlessly to
application buffers used for direct I/O. Storage servers, for example, may allocate large
buffer pools up front to accommodate the worst case, but reference only “hot” buffers
in the common case. As a result, the I/O memory footprint follows the current working
set.
In other applications, such as HPC, large memory ranges may be mapped for remote
direct memory access (RDMA) by I/O devices [Inf15]. A remote host in an RDMA-
capable network may read and write local application memory directly (after a proper
key exchange) without involving the (local) CPU. Here, the working set is determined
by remote I/O activity rather than by the local CPU.
In the context of virtualization, direct device assignment mandates pinning the entire
VM address space [ABYTS11], even though the VMs themselves are held in virtual
memory. IOPF maintains the benefits of the large body of work done on over-committing
VM memory without giving up the performance advantages of direct I/O.
Virtual memory moves the complex memory management code employed by state
of the art direct I/O applications to the operating system. Complex code that decides
what data should be in memory at any given time can be discarded. Finally, pinning
memory requires special administrative privileges. IOPF support will allow running
unprivileged direct I/O applications.
Chapter 4
Basic IOPF Support
When we add an indirection level and decide that DMA transactions should use virtual
rather than physical addresses, we also have to decide how to handle translation failures:
a situation where a DMA transaction references a virtual address that has no translation
or has permissions which do not allow the transaction to complete. We can leave the
I/O devices oblivious to the possibility of translation failure and force the software to
avoid them, or we can inform the I/O devices about translation failures and give the
devices a chance to do something about them. The VT-d specification [Int14] uses the terms
non-recoverable and recoverable address translation failures to describe those options.
4.1 Non-Recoverable Failures
The simpler and widely used option is to keep the I/O devices oblivious to translation
failures. A DMA transaction to a virtual address that cannot be translated is treated
the same as a DMA transaction to an invalid physical address. I/O devices are not
notified about translation failures, and do not require any modification to work in this
mode. Under this design option, a DMA access should never encounter a translation
failure. Such a failure, if it does happen, indicates that either the device or its driver
are misbehaving. The IOMMU nevertheless detects and reports translation failures.
But OSes currently do not have a standard interface for drivers to register translation
failure callbacks [Cor14]. As a result, all the IOMMU driver can currently do is to log
the failure. Even if OSes did have an interface for registering such callbacks, recovering
from this failure could be prohibitively difficult or altogether impossible, because most
I/O devices are designed under the assumption that DMA operations do not fail.
DMA Read When a device issues a read, it expects an answer. The IOMMU indeed
returns an answer when the corresponding translation fails, but this answer is a generic
“unsupported request”[Int14], namely, the device has no way of knowing that the failure
was due to a translation failure. The specific reaction of devices to such generic failures
varies. The network and disk controllers that we tested exhibited various undesirable
behaviors, ranging from getting stuck to corrupting the filesystem. Devices may, in
principle, behave in a more civilized manner by raising an interrupt, informing their
driver to reset the device. Conceivably, devices can ignore the read error and continue
as if nothing bad happened. For example, a sound card can skip the data that it was not
able to read and be silent for a moment. Such an approach, however, might colossally
fail if a disk controller is involved. For example, if the controller is instructed to read
from memory and write the content to the disk, then silently ignoring a DMA read error
would likely result in data loss, because the OS would rightfully expect the information
to be persistent on disk. This example coincides with the behavior we empirically
observed.
DMA Write DMA write operations are more challenging than reads with respect to
address translation failures. Whereas DMA read operations expect a response, DMA
write operations are conducted in a “write and forget” manner. Namely, devices are not
acknowledged when DMA write operations complete.¹ Consequently, there is no way
for the IOMMU to inform devices that write operations have failed, so the operations
fail silently upon translation failures. Silent failures are usually worse than other
outcomes such as, say, crashing. Assume, for example, that the host directly assigns a
disk controller to a guest without pinning the associated memory in a system where
translation failures cause DMA writes to fail silently. Further consider that the guest
intends to read and run an executable from the disk. The guest will therefore instruct
the disk controller to read the relevant blocks and DMA write them to memory. If a
translation failure occurs during the DMA write, it will fail silently, without writing the
requested data to memory. Being oblivious to the IOPF, the disk controller will next
inform the guest that the operation has completed successfully. The guest, in turn, will
run the executable, which will crash when the CPU will try to execute “random” data
that was supposed to be overwritten with the executable code but was left unchanged
due to the DMA write failure. The crash might happen at the worst possible moment,
for example, when trying to save a big document that the user has been working on for
hours. Worse, the executable could be a kernel module and thus crash the OS entirely.
Note that in such a scenario, the hypervisor is notified about the IOPF through the
IOMMU, but there is no standard way for the hypervisor to notify the guest about the
error.
NIC translation failures The above examples for what might go wrong when
encountering address translation failures are disk-related. Seemingly, the network is
not as “trustworthy” as the disk. For example, one trusts a disk write to transpire as
¹ PCIe does have a link layer ack. But this ack only means that the packet containing the write request has been received successfully. It does not mean that the write operation has been successfully completed. Furthermore, the entity sending the ack is not necessarily the root complex and hence has no knowledge as to whether the DMA write was successful. Rather, it may be a PCIe switch that has no understanding of (I/O) virtual memory semantics.
requested, whereas one does not equivalently trust a packet to reach the other side. As
a result, network packets have checksums and sequence numbers to recover from packet
loss and data corruption. If random data is passed to the network stack, it will most
likely be dropped. The reason for the drop can be an unknown protocol, bad checksum
or bad sequence number. Consequently, when the NIC receives a packet, fails to write
its content to memory due to an IOPF and informs the NIC’s driver that a packet was
received, the most likely scenario is that the packet will get dropped.
However, we would like to point out that the disk related failure scenarios we
mentioned earlier are actually applicable to NIC translation failures when working with
NFS (network file system). Many modern NICs have hardware checksum offloading
and scatter gather capabilities. In order to take advantage of the hardware checksum
offloading, the network stack is usually designed to skip the checksum checking step
in cases where the hardware has confirmed that the checksum is good. When the
NIC receives a good packet but fails to write its content to memory, it will actually
report good checksum because it checked the original packet and not the content that
the network stack will see in memory. In such a scenario, the packet will still most
likely get dropped due to a bad header. However, if we further assume that the packet
crosses a page boundary, or that the NIC’s driver uses the scatter gather capabilities
and receives each incoming packet into multiple pages, then it is possible that a part
of the packet containing the header will be written successfully to memory while the
rest of the packet will not be. In such a scenario, the header is valid and there is
no software checksum check because the network stack blindly trusts the hardware
checksum checking capabilities. As a result, the network stack will give corrupted data
to the application using it. If the application is an NFS client, we would experience a
failure similar to the disk read failure and if the application is an NFS server we would
experience a failure similar to the disk write failure.²
4.2 IOPF support
The second option for handling translation failures is to inform the device when a
translation failure occurs and allow the device to respond. We mentioned earlier that
the VT-d spec uses the term recoverable address translation failures for this option.
However, the VT-d spec actually refers specifically to the ATS/PRI standards by PCI-
SIG, and there are other ways to implement this design option. For the purpose of
this thesis we will use the term I/O page fault (IOPF) for a translation failure and
IOPF support for any design where devices are notified about IOPFs and have the
hardware/software interfaces described next in order to be able to handle IOPFs
gracefully. Existing I/O devices need to be modified in order to support IOPFs, and as
a result, this support is still very rare. In fact, the only other implementation we are
² Application-level data integrity testing can save us from data corruption in those cases. However, it is only supported in NFSv4, which is not widely deployed.
[Figure: the OS, driver, I/O device and I/O page tables, connected by the IOPF flow (1)–(4) and the invalidation flow (a)–(d)]
Figure 4.1: IOPF (1 – 4) and invalidation (a – d) flows
aware of is the GPU in AMD Kaveri. We developed two devices with IOPF support: a
Mellanox ConnectX-3 40G Ethernet NIC, and a Mellanox Connect-IB 56G InfiniBand
Host Channel Adapter (HCA). Both adapters employ a similar internal IOMMU, which
initially assumed that all PTE (page table entries) mappings are valid.
Page Faults To support page faults, we allowed the internal IOMMU to hold invalid
mappings. The resulting IOPF flow is illustrated in Figure 4.1 and described as follows:
(1) The device starts processing an I/O request. It consults the I/O page tables and
determines that one of the pages involved is not present. (2) The device raises an
IOPF interrupt to its driver in order to resolve the page faults. (3) The driver calls
get_user_pages() to obtain the relevant pages from the OS, and immediately calls
put_page() to avoid pinning them. (4) The driver updates the I/O page table with
the corresponding physical addresses and informs the device that the IOPF has been
resolved, allowing it to resume normal operation.
In the unlikely case that the memory is not available, for example due to an access
attempt to unallocated virtual memory, the device driver notifies the hardware that the
page fault could not be resolved. The hardware follows the same error semantics that
InfiniBand [Inf15] defines for local access errors.
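A minimal sketch of the driver side of this flow is shown below. It is schematic rather than the actual Mellanox driver code: get_user_pages() and put_page() are real Linux kernel interfaces (whose exact signatures vary across kernel versions), while iopf_dev, MAX_IOPF_PAGES, io_pagetable_set(), device_fail_iopf() and device_resolve_iopf() are hypothetical names standing in for steps (3) and (4).

    #include <linux/mm.h>

    /* Schematic IOPF handler, invoked from the driver's IOPF interrupt path. */
    static void handle_iopf(struct iopf_dev *dev, unsigned long va, int npages)
    {
            struct page *pages[MAX_IOPF_PAGES];
            int i, got;

            /* (3) Ask the OS to bring in the missing pages (call is
             * simplified; the signature differs between kernel versions). */
            got = get_user_pages(va, npages, FOLL_WRITE, pages, NULL);
            if (got <= 0) {
                    /* Memory is genuinely unavailable: report a local access error. */
                    device_fail_iopf(dev, va);
                    return;
            }

            /* (4) Install the physical addresses in the I/O page table and
             * immediately drop the references so the pages are not pinned. */
            for (i = 0; i < got; i++) {
                    io_pagetable_set(dev, va + i * PAGE_SIZE, page_to_phys(pages[i]));
                    put_page(pages[i]);
            }

            /* Tell the device the fault is resolved so it can resume. */
            device_resolve_iopf(dev, va);
    }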
Invalidations As the memory pages accessed by the device are no longer pinned, the
OS is allowed to unmap and reuse pages at will. This requires an invalidation flow,
in which the OS notifies the device that a virtual mapping is no longer valid. The
invalidation flow is illustrated in Figure 4.1 and described as follows: (a) The OS decides
to change a virtual mapping and asks the driver via the Linux kernel MMU notifiers
infrastructure [Arc08] to remove the old mapping and stop the device from using it. (b)
The driver updates the I/O page tables and issues an invalidation to the device. (c) The
device acknowledges the invalidation and stops using the relevant mapping. (d) The
driver notifies the OS that the old mapping has been removed and that the relevant
pages can be reused.
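The sketch below illustrates how step (a) might be wired up through the MMU notifiers infrastructure. It is an assumption-laden outline rather than the thesis driver: the mmu_notifier API is real Linux kernel machinery whose callback set has changed across kernel versions, while iopf_ctx, io_pagetable_clear() and device_invalidate() are hypothetical names for the driver-side pieces of steps (b)–(d).

    #include <linux/mmu_notifier.h>

    struct iopf_ctx {
            struct mmu_notifier mn;
            struct iopf_dev *dev;
    };

    /* Called by the OS whenever mappings in [start, end) are invalidated. */
    static void iopf_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
    {
            struct iopf_ctx *ctx = container_of(mn, struct iopf_ctx, mn);

            io_pagetable_clear(ctx->dev, start, end);  /* (b) update I/O page tables */
            device_invalidate(ctx->dev, start, end);   /* (b)+(c) device acknowledges
                                                          and stops using the mapping */
            /* (d) Returning tells the OS the pages may now be reused. */
    }

    static const struct mmu_notifier_ops iopf_mn_ops = {
            .invalidate_range = iopf_invalidate_range,
    };

    /* Registration (once per address space, error handling omitted):
     *     ctx->mn.ops = &iopf_mn_ops;
     *     mmu_notifier_register(&ctx->mn, current->mm);
     */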
We note that multiple IOPFs and invalidations may execute concurrently and software-
based locking is used for synchronization. Page fault handling might naturally block on
an invalidation. Due to this fact, the locking scheme had to ensure that invalidations
never block on page fault handling. If an IOPF collides with an invalidation, the IOPF
handling is aborted and restarted after the invalidation completes.
Chapter 5
The Page Fault Latency Problem
When a CPU page fault occurs, the current thread of execution is halted until the
operating system handles the page fault. Similarly, for devices that perform purely local
work (GPU, FPGA, ASIC accelerators and local storage devices), it is usually possible
to suspend the specific execution context until the page fault is resolved.
However, there exists a large class of devices for which pausing I/O due to a page
fault disrupts normal I/O operation, even if the average I/O rate in the presence of
page faults does not limit the desired throughput. Sensor data, audio sampling, video
input, and CD-ROM burning are examples of such I/O. In the context of network
devices, this problem makes it difficult to send and receive packets in a timely manner.
We denote by rIOPF (receive IOPF) the scenario in which a NIC encounters an IOPF
while receiving a packet from the network. Arguably, the simplest approach is to drop
all incoming packets designated to the faulting ring until the rIOPF is resolved. The
problem with this approach is that if the drop is done without informing the transport
layer, performance will be greatly impacted due to the relatively large timeout values
used today – 200 milliseconds in TCP [VPS+09], and around 4 seconds with InfiniBand,
which assumes a (nearly) lossless network. In the next chapters, we elaborate on how
we deal with the rIOPF problem for the specific cases of both HW and SW transports.
For transmission, the situation is less complicated as suspending a transmit ring
for the duration of an IOPF will only delay the transmission and will not cause data
loss. However, network protocols might rely on delivering acknowledgments in a timely
manner because the peer might interpret the delay as an indication of packet loss. This
does not pose a problem in practice due to the large network timeouts, as mentioned
above, and in our implementation we suspend the sending queue until page faults are
resolved.
Chapter 6
HW Transport IOPF Support
When the page fault and the transport are both handled in the same hardware unit,
interfacing them is relatively easy. This is the case for InfiniBand adapters. We
specifically consider the reliable connection (RC) transport.
The InfiniBand wire protocol supports explicit link level flow control. Therefore, a
naïve implementation could block the incoming network traffic until an rIOPF is resolved.
This would allow the implementation to be work preserving. However, blocking the
flow control for an extended period of time will lead to congestion spreading [AAK+08,
SCS+14] in the network. Therefore, an end-to-end mechanism is needed to handle such
a situation.
The RC protocol contains an end-to-end mechanism for the receiver to stop the
sender. When an rIOPF is encountered, the receiver sends a receiver-not-ready (RNR)
negative acknowledgment (NACK) packet. This NACK notifies the sender that it should
pause the transmission of the relevant flow for a specified period of time T. This
notification explicitly informs the sender that the packet was lost, allowing the sender to
retransmit quickly rather than using the generic 4-second retransmission timeout
mentioned earlier. The notification also stops the sender early and reduces the amount
of data that will be lost due to the rIOPF. We note that in this solution some data is lost
and retransmission is required. However, retransmission is possible because the protocol
is reliable and so, by definition, the sender must keep the data until it is acknowledged.
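The following C-like pseudo-code sketches what the receive-side decision might look like. It is an illustration of the mechanism described above, not the actual HCA implementation, and all helper names (translation_present(), send_rnr_nack(), schedule_iopf_resolution(), and so on) are made up.

    /* RC receive path with rIOPF handling (illustrative pseudo-code). */
    void rc_receive(qp_t *qp, packet_t *pkt)
    {
        if (!translation_present(qp, pkt->dest_va)) {
            /* Ask the sender to pause this flow for time T and retransmit,
             * instead of waiting for the ~4 second transport timeout. */
            send_rnr_nack(qp, pkt->psn, RNR_TIMER_T);
            drop_packet(pkt);                      /* data is lost locally ...      */
            schedule_iopf_resolution(qp, pkt->dest_va);
            return;                                /* ... but the reliable sender
                                                      keeps it until acknowledged  */
        }
        dma_write(qp, pkt);                        /* normal fast path */
        send_ack(qp, pkt->psn);
    }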
We further note that in InfiniBand packet loss is decoupled from congestion control,
and consequently the retransmission does not have a negative impact on future
transmissions. This is not the case in TCP/IP.
We show in §8 that this “drop and stop the sender” solution works reasonably well in the context of the InfiniBand reliable connection transport.
Chapter 7
SW Transport IOPF Support
While for hardware transport we could rely on the transport state during page fault
handling, in software transports such information is not readily available. In the prevalent
setup of TCP/IP over Ethernet, the transport layer is implemented in software, while
page faults are handled by the hardware and the device driver in the hypervisor. The
hardware and the hypervisor, being only aware of the Ethernet layer, cannot handle
the rIOPF by sending an RNR-NACK equivalent for TCP. An unmodified guest’s
TCP/IP stack, on the other hand, has the full state information, but is not aware
that a page fault has occurred, and does not have the required information from the
TCP/IP headers of the faulting packet. Consequently, for the software transport case
we cannot implement the RNR-NACK solution, and we are left with the drop and wait
for retransmit option.
Initially, we hoped this option would suffice to support TCP/IP. This hope was
based on the fact that TCP/IP stacks usually use a limited size memory pool for
transmission and reception buffers, and the buffers are frequently used, so we expected
IOPFs to be relatively rare. However, when we tested this approach with TCP/IP, we
discovered a critical problem. We call it the cold ring problem.
7.1 The Cold Ring Problem with TCP
Our implementation uses a separate page table in the device, and does not pre-populate
it. Consequently, when we first start an application, the receive ring is “cold”. Namely,
the receive buffers are not mapped and rIOPFs are not as rare as one would expect
during steady state operation. We discovered that a “cold” receive ring poses a unique
problem: TCP retransmission and congestion avoidance results in a near-deadlock of
the communication. The cold ring problem is not limited merely to startup situations.
It can also happen, for example, when the VM is resumed from suspension or brought
back from swap.
New TCP connections start in a slow start phase in which they send at a very low
rate in order to avoid exceeding the network capacity. Drops are considered a sign
[Figure: the guest VM/application, the hypervisor/OS kernel, and the NIC, with the numbered packet paths (1)–(4) and the page fault (pf) indication]
Figure 7.1: High level design of the backup ring.
of congestion and cause TCP to reduce the transmission rate even further. Similarly,
during the connection establishment stage, TCP utilizes an exponential backoff scheme
to avoid overloading the network and the receiver. However, if the cause of the packet
loss is an IOPF, communication all but stops. The transmitter will be waiting for
acknowledgments from the receiver before it increases the transmission speed. Instead,
due to timeouts, it will actually try to reduce the transmission rate. At the same time,
the receiver will wait for more packets to arrive from the network to page-in the receive
ring. The effective visible behavior strongly resembles a deadlock. Nearly no network
traffic is sent, with both parties waiting for the retransmission timeout. In some of the
cases, the issue is so severe that the TCP stack announces a failure to the application
layer. This happens once the maximal retry number is exceeded. We demonstrate such
an issue in §8.4.2.In addition to suspend/resume of a VM or a startup case, a ring can become cold
due to other reasons. Specifically, operations such as NUMA migration or fork can
cause the same effect.
7.2 The Backup Ring
In para-virtualized guests, the receive ring is also likely to be cold from time to time.
However, the problem is much less severe thanks to buffering performed by the hypervisor
virtual switch for the guest traffic. When considering where to buffer such packets, we
rule out buffering on the NIC itself, since adding enough on-chip memory to buffer a
major page fault would be too expensive. The backup ring solution to the cold ring
problem is based upon these observations.
The design of the solution can be seen in Figure 7.1. Traffic is received from the
network by the NIC (1). For each incoming packet, the NIC inspects the designated
receive buffer of the Guest VM. If this buffer is available, the data is written directly to
it (2). However, if a page fault is encountered while writing into the buffer, the packet
is written to a backup ring owned by the hypervisor (3). After the hypervisor fixes the
page fault, it copies the packet into the original receive buffer (4). To maintain ordering,
the NIC skips receive descriptors that encountered page faults. For the same reasons,
the NIC does not report the reception of new packets to the guest until all previous
struct ring_data {
    int size;                  // size of the ring
    descriptor_t *descriptor;  // the ring itself (an array of `size` descriptors)
    int head;
    int tail;
    bit *bitmap;
};

struct bbentry_t {
    int ringID;
    int index;
    int bitmap_index;
    packet_t pkt;
};

// unresolved IOPFs
list IOPF_list;

void br_interrupt() {
    head = get_head(br);
    tail = get_tail(br);
    while (tail != head) {
        bbentry_t e = br_re_arm(tail);
        IOPF_list.append(e);
        tail = (tail + 1) % ring_size(br);
    }
    br_update_tail(tail);
    wake_rIOPF_thread();
}

void rIOPF_thread() {
    while (true) {
        if (IOPF_list.size() == 0)
            wait_for_rIOPF();
        assert(IOPF_list.size() > 0);
        bbentry_t e = IOPF_list.pop();
        r = get_ring(e.ringID);
        if (!has_room(r))
            wait_for_tail_change(r);
        make_present(r.descriptor[e.index]);
        store_packet(r.descriptor[e.index], e.pkt);
        r.bitmap[e.bitmap_index] = 0;
        free(e);
        // triggers the NIC's resolve_rIOPFs flow
        resolve_rIOPFs(r);
    }
}
Figure 7.2: Software pseudo-code of the backup ring
page faults have been handled.
The backup ring mechanism allows a graceful fallback. During page fault, the guest
machine behaves exactly like a guest with a para-virtualized or emulated NIC. Namely,
the hypervisor will have to buffer incoming traffic for the guest. When the buffer space
for this guest is exhausted, incoming packets are dropped. At the same time, if the
IOPF is resolved in a timely manner, this will only cause minor latency jitter instead
of packet loss. In the common case of page fault free traffic, the guest experiences
high performance thanks to direct device assignment. We note in passing that while
the backup ring solution is described and evaluated in the context of TCP/IP over
Ethernet, it is also applicable to the unreliable transport protocols of InfiniBand, which
lack hardware retransmission.
Figure 7.2 contains pseudo-code describing how the hypervisor manages the backup
ring. Note that in our design the guest is unaware of the backup ring and does not need
to be modified in order to benefit from the backup buffer mechanism.
The struct ring_data contains data that the hypervisor maintains for each ring.
size is the size of the ring. descriptor is a pointer to the ring itself, which is an array
of size buffer descriptors. head is the index of the next descriptor to be used by the NIC
and tail is the index of the next descriptor to be consumed by the software. bitmap
is a special bitmap used to track which packets experienced rIOPFs and allows the
NIC to continue storing new incoming packets in the guest’s ring even when there are
pending unresolved rIOPFs. The bitmap is initialized with all bits set to zero. Its size
is controlled by the hypervisor and it limits the number of packets the hypervisor will
store for a specific guest.
We also explored a simpler design where the existence of an unresolved rIOPF would
cause all subsequent packets to be stored in the backup ring. But we were concerned
that such a design could get stuck in an operation mode where all the packets go through
the backup ring because a new packet always arrives before the NIC is notified that the
previous packet has been copied to the guest’s ring.
The interrupt handler br interrupt is called after a new packet has been stored
in the backup buffer. This function moves used entries in the backup ring to a list of
unresolved rIOPFs and posts a new buffer to the backup ring so that it will not run
out of buffer for new entries. In addition, it wakes up a thread whose job is to resolve
the rIOPFs. This thread is required because handling a rIOPF might require sleeping,
which is forbidden in an interrupt context.
The rIOPF thread executes rIOPF thread(). The first thing it does is to check
whether there are any rIOPFs it needs to resolve. If there are none, it goes to sleep until
a rIOPF occurs. After the thread is woken up, it makes sure that the designated ring
has an available descriptor where it can store the faulting packet. The condition is likely
to be true for most of the packets. However, our backup ring mechanism allows the
hypervisor to buffer more than “ring size” packets for a guest. The reason behind this
decision is that after a rIOPF occurs and until it is resolved, the NIC does not notify
the guest about the reception of new packets. Consequently, the guest does not post
new buffers in the receive ring and there is an overflow risk. As a result of this design,
the number of faulting packets in the backup ring might exceed the number of entries
in the original receive ring. Since the hypervisor does not want to drop packets, it will
have to process a number of packets, ask the NIC to report that they were received,
and then wait for the guest to post new buffers. The waiting can be implemented either
in software only, by using sleep and polling, or it can be hardware assisted. The
hypervisor will ask the NIC to raise an interrupt when the guest changes the tail of the
ring and go to sleep until the interrupt arrives. After making sure that there is a free
descriptor, the IOPF thread will make sure that the descriptor and the buffers are all
present. It will copy the packet into the descriptor's buffer, update the faulting-packets
bitmap, and tell the NIC that an IOPF has been resolved.
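As an illustration of the software-only variant, the following sketch shows one way wait_for_tail_change() from Figure 7.2 could be realized with sleep and polling; the helper read_guest_tail() and the polling interval are made up for the example, and a hardware-assisted variant would instead arm a tail-change interrupt and sleep until it fires.

    /* Software-only wait: poll the ring tail until the guest posts new buffers. */
    void wait_for_tail_change(struct ring_data *r)
    {
        int old_tail = r->tail;

        while (read_guest_tail(r) == old_tail)
            usleep(100);        /* brief back-off between polls */
    }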
For simplicity’s sake, our pseudo code handles the faulting packets one at a time. A
practical implementation is likely to use batching and possibly multiple threads when
handling the faulting packet. Such an implementation would read a predefined number
of faulting packets and start the I/O required to make all of them present. After the
I/O is done, the hypervisor will copy the faulting packets one at a time and tell the
NIC that the whole chunk has been resolved using a single operation.
Figure 7.3 shows the corresponding NIC hardware pseudo-code. The struct ring_t
contains the state the hardware maintains for each ring. The head_offset variable is
used to determine where the next incoming packet should be stored. When there are no
unresolved rIOPFs, head_offset is zero and the packet is stored in head. When there
are unresolved rIOPFs, head keeps pointing to the index of the first rIOPF-triggering
descriptor, as we cannot inform the guest about the reception of new packets before this
struct ring_t {
    int size;
    descriptor_t *descriptor;
    int tail;
    int head;
    int head_offset;
    int bitmap_index;
    bit *bitmap;
};

// HW - receive packet: invoked if pkt is specifically
// designated for r, or if broadcast
void recv(ring_t r, packet_t pkt) {
    if (!store_in_ring(r, pkt))
        store_in_backup(r, pkt);
}

// HW - invoked by the hypervisor after a rIOPF has been resolved
void resolve_rIOPFs(ring_t r) {
    i = r.bitmap_index;
    while (r.head_offset > 0 && r.bitmap[i] == 0) {
        atomic {
            r.head_offset--;
            r.head = (r.head + 1) % r.size;
        }
        i = (i + 1) % r.bitmap_size;
    }
    // takes care of coalescing
    r.raise_isr();
}

bool ring_overflow(ring_t r) {
    tail = r.tail;
    if (tail < r.head)
        tail += r.size;
    return r.head + r.head_offset >= tail;
}

bool store_in_ring(ring_t r, packet_t pkt) {
    if (!ring_overflow(r) && r.is_descriptor_present(r.tail)) {
        head = (r.head + r.head_offset) % r.size;
        r.descriptor[head].store(pkt);
        if (r.head_offset != 0)
            r.head_offset++;
        else {
            r.head = (r.head + 1) % r.size;
            // takes care of coalescing
            r.raise_isr();
        }
        return TRUE;
    }
    return FALSE;
}

void store_in_backup(ring_t r, packet_t pkt) {
    offset = r.head_offset;
    if (offset < r.bitmap_size && backup.tail != backup.head) {
        offset += r.bitmap_index;
        head = (r.head + r.head_offset) % r.size;
        backup.descriptor[backup.head].store(
            concat(r.id, head, offset, pkt));
        backup.head = (backup.head + 1) % backup.size;
        r.bitmap[offset] = 1;
        r.head_offset++;
        // takes care of coalescing
        backup.raise_isr();
    }
}
Figure 7.3: Hardware pseudo-code for the backup ring.
rIOPF has been resolved. The NIC always tries to store a new incoming packet at head
+ head_offset. bitmap is the bitmap we mentioned earlier when we talked about the
ring_data struct, and bitmap_index is the index of the bit in the bitmap corresponding to
the descriptor at index head in the ring.
The recv() function describes how a NIC handles a packet, pkt, designated for the
ring r. The NIC first checks whether the packet can be stored in the ring r and if it
cannot, the packet is redirected to the backup ring. store_in_ring() tries to store the
packet in the ring r. The conditions for using the ring are that the target index does not
exceed the tail and that the relevant descriptor and buffers are present.¹ Assuming the
conditions above hold, the packet is stored directly in the ring. If there are unresolved
rIOPFs for this ring we only advance head_offset, and if there are no unresolved
rIOPFs we advance head and raise an interrupt to signal the reception of a new packet
in the ring r.
Alternatively, if the packet is redirected to the backup ring, store_in_backup() is
executed. It checks that the distance from the first unresolved packet does not exceed
the bitmap size and that there is room in the backup ring. If this is not the case,
the packet is dropped. Assuming the packet is not dropped, we append additional
information to the packet in order to help the hypervisor resolve the rIOPF, mark the
bitmap, and raise an interrupt to the hypervisor. We also advance head_offset to skip
an entry in the designated ring.
The resolve_rIOPFs() flow is executed when the hypervisor notifies the NIC that
a rIOPF has been resolved. It uses the bitmap to update head to point to the next
unresolved rIOPF, or to the top of the ring if there are none. We note that while this
loop might take some time, it does not have to be atomic with respect to packet reception.
Only head and head_offset must be updated together, because the destination of new
packets is determined by their sum.
7.3 Implementation
Our backup ring evaluation had to overcome two hardware limitations. The hardware
IOPF implementation did not support redirecting packets to a secondary receive ring
upon a page fault, nor did it support the combination of IOPFs and SR-IOV.
Consequently, to evaluate the backup ring solution, we approximated it using software.
Lightweight virtualization To address the lack of IOPF support for SR-IOV instances,
the evaluation was performed with lightweight virtualization. Each lightweight virtual
machine used a user space TCP/IP stack and kernel-bypass technology based upon
IB-verbs [Inf15]. Linux cgroups [Men] were used to limit the memory available
to each lightweight virtual machine.
¹ A real implementation would also check that the incoming packet is not too large for the corresponding descriptor and drop the packet, without storing it in the backup ring, if this is not the case. For simplicity, we ignore this complication.
A good user space TCP/IP stack was surprisingly hard to find. The candidates
we considered were OpenOnload [Rid], libvma, and lwIP. OpenOnload only works on
Solarflare NICs, and porting it to a Mellanox NIC is difficult. libvma was closed source
when we started the work, and only became open source when we already had a working
system. It is an interesting library, but making it work requires some effort, so it should
be considered future work.
The option we chose was lwIP. We discovered that it is targeted at embedded
systems and as a result has many shortcomings: it is single threaded, its malloc is very
slow, it supports neither window scaling nor hardware offloading, and its socket API
is very slow. To address these shortcomings, we replaced the malloc that came with
lwIP with dlmalloc [Lea] and borrowed some patches from libvma, which is also based
on lwIP. Among other things, this gave us support for window scaling.
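For reference, one common way to route lwIP's allocations through an external allocator such as dlmalloc is via its lwipopts.h configuration; the sketch below shows the relevant options, though we do not claim this is exactly how our port wires them.

/* lwipopts.h (sketch) */
#define MEM_LIBC_MALLOC  1  /* use the C library malloc()/free() instead of lwIP's own heap */
#define MEMP_MEM_MALLOC  1  /* allocate the memp pools through mem_malloc() as well */
/* dlmalloc is then linked in as the process' malloc()/free() implementation. */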
To address the poor performance of lwIP's socket API, we followed the advice of
[Gol] and used the raw API of lwIP. Unlike the socket API, the raw API is event driven
rather than sequential. The main loop of an application using the raw API should look
like the one in Figure 7.4 [SC]. The application itself should be event driven, and the
application code should be called from lwIP callbacks. The lwIP callbacks are
invoked on different events: when a new connection is established, when new data is received, etc.
As a result, porting a generic application to use the raw API is nontrivial.
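As an illustration of this structure, the following sketch registers raw-API callbacks for a TCP listener; the echo_* handlers and the port number are placeholders, but tcp_new/tcp_bind/tcp_listen/tcp_accept/tcp_recv are the standard lwIP raw-API calls.

#include "lwip/tcp.h"

/* Called by lwIP whenever new data arrives on the connection. */
static err_t echo_recv(void *arg, struct tcp_pcb *pcb, struct pbuf *p, err_t err) {
    if (p == NULL) {              /* the remote side closed the connection */
        tcp_close(pcb);
        return ERR_OK;
    }
    /* ... application logic would consume p->payload here ... */
    tcp_recved(pcb, p->tot_len);  /* re-open the receive window */
    pbuf_free(p);
    return ERR_OK;
}

/* Called by lwIP whenever a new connection is established. */
static err_t echo_accept(void *arg, struct tcp_pcb *newpcb, err_t err) {
    tcp_recv(newpcb, echo_recv);
    return ERR_OK;
}

void setup_listener(void) {
    struct tcp_pcb *pcb = tcp_new();
    tcp_bind(pcb, IP_ADDR_ANY, 11211);  /* placeholder port */
    pcb = tcp_listen(pcb);
    tcp_accept(pcb, echo_accept);
}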
Backup ring approximation Due to the inability to redirect packets to a secondary
receive ring upon a page fault, we had to settle for an approximation of the
real backup ring. Instead of redirecting only rIOPF-triggering packets, we used an
existing hardware feature to duplicate all incoming packets into a secondary receive ring.
This secondary ring is populated with pre-allocated pinned buffers. In the absence of
rIOPFs, the duplicated packets stored in the secondary ring are ignored and discarded
by the software. However, when packets are dropped from the primary receive ring due
to a rIOPF, copies of those packets are still written to the secondary ring, allowing us to
avoid packet loss. When an rIOPF occurs, the software is notified and starts collecting
the dropped packets from the secondary ring. The software then waits for the rIOPF to
be resolved before forwarding copies of those packets to the network stack. The copying
void discard_duplicates(bool block) {
    do {
        while (packets_to_discard != discarded_packets &&
               !secondary_ring.is_empty()) {
            secondary_ring.post_buffer(secondary_ring.consume_buffer());
            discarded_packets++;
        }
        // if the caller asked to block, keep discarding duplicates
    } while (block && packets_to_discard != discarded_packets);
}

void sync(void) {
    in_sync = FALSE;
    while (!in_sync && !secondary_ring.is_empty()) {
        p = secondary_ring.dequeue();
        if (!primary_ring.is_empty() &&
            memcmp(primary_ring.peek(), p, len(p)) == 0) {
            unprocessed.enqueue(primary_ring.consume_buffer());
            in_sync = TRUE;
        } else {
            // primary_ring.is_empty() || memcmp(...) != 0
            new_buf = alloc_buf();
            memcpy(new_buf, p, len(p));
            unprocessed.enqueue(new_buf);
        }
        secondary_ring.post_buffer(p);
    }
}

void fast_path(void) {
    while (!primary_ring.is_empty()) {
        unprocessed.enqueue(primary_ring.consume_buffer());
        primary_ring.post_buffer(alloc_buf());
        packets_to_discard++;  // a duplicate of this packet is in the backup
    }
    // throw away duplicated packets from the backup, but don't block:
    discard_duplicates(/*block = */ FALSE);
}

void slow_path(void) {
    clear_rIOPF_flag();
    // block until there are no more duplicates
    discard_duplicates(/*block = */ TRUE);
    // remove all packets from the backup ring
    sync();
    resolve_rIOPF();
    // sync with the primary receive queue
    sync();
}

void poll_RX(void) {
    rIOPF = check_rIOPF_flag();
    fast_path();
    if (rIOPF)
        slow_path();
    if (!unprocessed.is_empty())
        // pass a packet to the network stack
        netif_input(unprocessed.dequeue());
}
Figure 7.5: Pseudo-code for the backup ring approximation.
is done to allow reusing the pinned buffers for the secondary ring and to improve the
approximation. Since the hardware neither skips faulting receive ring entries nor reports
how many packets are dropped during an rIOPF, packet content matching is used to
detect when to switch back to the primary ring.
The pseudo-code of our approximation is shown in Figure 7.5. The main function
of interest is poll_RX(), whose job is to decide which packet should be passed next to
netif_input(), which does the network stack processing. It checks the rIOPF flag
and then executes fast_path(), which re-arms the primary ring and moves all
new packets to a software-maintained queue of unprocessed packets. After removing the
new packets from the primary ring, discard_duplicates() is used to discard copies of
those packets from the secondary ring. This is done in a nonblocking manner to improve
performance in the absence of rIOPFs. If the rIOPF flag was set before executing
fast_path(), slow_path() is also invoked. It first discards possible remaining
duplicates and calls sync() to drain and re-arm the secondary ring. Next, the rIOPF is
resolved, making the primary ring operational again, and sync() is called once more
to drain and re-arm the secondary ring until it is either empty or we find a matching
packet in the primary ring. Finally, a packet from the unprocessed queue is pushed to
the network stack.
We note that our approximation does not have a clear guest/host separation and
lacks context switches; as a result, it cannot tell us much about the CPU usage of
the real solution. However, we believe it approximates the delayed passing of faulting
packets to the network stack and can teach us about the behavior of the network and
the network stack in the presence of IOPFs.
Figure 7.6 shows the performance of a memcached port that uses our lwIP network
stack. We can see that in the absence of IOPFs our backup approximation does not
significantly harm performance. We further see that our memcached port, which uses
user space I/O, is about twice as fast as the native version up to a value size of 4 KB.
For large value sizes the native version benefits significantly from hardware TCP
offloading, which is not supported in lwIP.
[Plot omitted: normalized throughput (top) and throughput in Gb/s (bottom) as a function of value size, 1B to 256KB, for lwip (pin), lwip backup buffer (no pin), linux (pin), and linux w/ offload (pin).]
Figure 7.6: lwIP vs. Linux performance evaluation.
Chapter 8
Evaluation
8.1 Methodology
We evaluate the impact of IOPFs in the context of network devices. Initially, we strive
to characterize the latency of a single IOPF or a single invalidation event. We then look
at the cold ring problem, measuring its effect and the efficacy of the solution. In §8.4, we
measure the impact of an IOPF on the network behavior of the system under synthetic
load. In the following subsections we evaluate the performance impact of using IOPFs
in a variety of real-life scenarios. We examine use cases of high performance computing
(HPC) interconnect fabric, Web 2.0 workloads, and storage system workloads.
Experimental Setup The setup for the TCP-over-Ethernet evaluation comprises
two identical Dell PowerEdge R210 II Rack Server machines that communicate through
Mellanox ConnectX-3 40 Gbit/sec Ethernet NICs. The NICs are connected back-to-
back. Each machine has 8GB of 1333MHz memory and a single-socket 4-core Intel Xeon
E3-1220 CPU running at 3.10GHz. The machines run Ubuntu 13.10 with a Linux 3.11.4
kernel modified to support IOPFs.
The HPC and storage experiments used a test cluster with 8 computing nodes. The
nodes were HP ProLiant DL380p Gen8 servers, with a dual-socket Intel Xeon E5-2697
v2 (IvyBridge) CPU and 128GB of RAM. Each node had a single Connect-IB card
installed. The cluster was connected by a single SwitchX-2 SX6036 switch. The nodes
ran RedHat 7.0, with kernel version 3.10.0-123.el7.x86_64. We used a Connect-IB driver
based upon the driver in the Mellanox OFED 2.4 package.
Both setups were tuned for performance and to avoid reporting artifacts caused by
nondeterministic events. All power optimizations, namely sleep states (C-states) and dynamic
voltage and frequency scaling (DVFS), were turned off. Hyper-threading was disabled.
In our backup ring implementation, the receiver duplicates each packet into two
buffers. The receiver’s PCIe bus becomes a bottleneck compared to the transmitter
and the network. This asymmetry causes packet loss that disturbs our measurements.
In order to avoid this, Ethernet flow control [IEE97] was enabled. In addition, this
[Plot omitted: minor IOPF handling time in microseconds for the kernel=event/user=poll, kernel=poll/user=poll, kernel=event/user=event, and kernel=poll/user=event configurations, broken down into interrupt latency, until IOPF thread starts, read faulting ring entry, get_user_pages()+update IO PT, invalidate IOTLB, and pagefault resolve.]
Figure 8.1: Minor IOPF handling breakdown for ConnectX-3.
bottleneck gives the backup ring approach an unfair disadvantage compared to the drop
configuration. To address this issue, our driver duplicates the incoming packets in both
configurations. When working in the drop configuration, the copy is simply discarded.
8.2 Cost of IOPFs on ConnectX-3
Figure 8.1 shows the results of a micro-benchmark measuring how long it takes to
resolve an IOPF in the ConnectX-3 implementation. We limited our kernel to use only
1 CPU. We instrumented the Linux kernel to store timestamps at various points in
the IOPF handling flow and wrote a simple application that does many iterations of
the following: Take a timestamp, post a packet to be sent, do a busy wait until the
corresponding completion appears in the CQ, retrieve the timestamps recorded by the
kernel, and log all the timestamps. The average of 1 million iterations can be seen
in the leftmost column of Figure 8.1. The breakdown is as follows. First there is an
’interrupt latency’, which includes the time it takes the NIC to identify and raise an
IOPF interrupt. It also includes a very minor software component of waking up an IOPF
thread to handle the IOPF. Next, there is a small scheduling delay represented by ’until
IOPF thread starts’. The IOPF thread then asks the NIC to read the faulting entry
from the ring on its behalf. It then calls get_user_pages() to make sure the pages are
present and updates the I/O page table. Finally, there are two commands, 'invalidate
IOTLB' and 'pagefault resolve', that need to be issued in order to resolve the
IOPF. After seeing those results we were a bit surprised that handling a single IOPF
takes so long, as a minor CPU page fault on the same machine takes about 0.3 µs. We
noticed that a significant portion of the time is spent waiting for the NIC to execute
the relevant commands. Looking at the code, we noticed that commands are executed
in an event driven manner. Namely, after a command is issued to the NIC, the issuer
goes to sleep until the NIC sends an interrupt to notify the issuer about the command's
completion. We were concerned that so much time is spent waiting for commands to
complete because the scheduler might introduce an arbitrarily long delay between the
time that the command issuer is woken up and the time it receives the CPU.
In order to see whether our concern was real, we modified the kernel to work in a
polling mode, whereby instead of sleeping after issuing a command, the issuer enters
a busy wait loop, continuously asking the NIC for the command status until the NIC
reports it has completed. Scheduling is thereby avoided, at the price of the CPU not doing
any useful work while commands are being issued to the NIC. The results can be seen
in the second column of Figure 8.1. To our surprise, the results were significantly worse.
In particular, the average time until the IOPF thread begins executing was significantly
lengthened. We investigated the issue and found that every so often the scheduler
keeps running our user level application for a relatively long time after the kernel IOPF
thread is woken up. This behavior did not happen in every iteration of our send
loop, but when it did happen the resulting delay was long enough to skew the average.
Because Linux scheduling is quite irrelevant to our research, and because this issue only
occurs when we run the kernel driver in the non-standard polling mode, we decided
not to pinpoint the exact scenario that causes this anomaly.¹ Instead, we modified our
user application to work in an event driven mode. After a packet is queued for sending,
rather than doing a busy wait loop, our application goes to sleep and asks to be woken
up when something is posted to the CQ. This way, our application is sleeping while
the IOPF is handled and does not compete with the IOPF thread for the CPU. The
anomaly is thus avoided. The results of this modification with the kernel working in
either polling or event driven mode can be seen in the next two columns of Figure 8.1.
We can see that using polling in the kernel is indeed slightly faster. However, the
difference is negligible, justifying the use of the event driven mode, which allows the
CPU to do useful work while waiting for the NIC to execute a command.
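For reference, the event driven mode on the user side corresponds to the standard libibverbs completion-channel pattern, sketched below with error handling omitted; the exact flow in our benchmark may differ.

#include <infiniband/verbs.h>

/* Sleep until a completion arrives instead of busy-polling the CQ,
 * so the CPU is free to run the kernel IOPF thread in the meantime. */
static void wait_for_completion(struct ibv_comp_channel *channel, struct ibv_cq *cq) {
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    ibv_req_notify_cq(cq, 0);                    /* arm the CQ */
    ibv_get_cq_event(channel, &ev_cq, &ev_ctx);  /* blocks until the completion interrupt */
    ibv_ack_cq_events(ev_cq, 1);
    while (ibv_poll_cq(cq, 1, &wc) > 0)
        ;                                        /* drain the completions */
}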
8.3 Cost of IOPFs and Invalidations on ConnectX-IB
On ConnectX-IB, we evaluated both the IOPF and the invalidation flows for 4KB
(page size) messages and for larger 4MB messages. Large messages are native to
InfiniBand/RDMA-based communication. They are also applicable to Ethernet when
¹ However, for real implementations, one might want to take measures to avoid this situation. Namely, it is undesirable for the IOPF handling thread to have to wait for the user application to release the CPU while this application is busy polling the NIC waiting for a packet to be sent.
[Plot omitted: execution time in microseconds for 4KB and 4MB messages; (a) IOPF flow, broken down into trigger interrupt [hw only], os overhead [sw only], update hw PT [sw + hw], and resume process [hw only]; (b) invalidation flow, broken down into check shadow PT [sw only], update hw PT [sw + hw], and update shadow PT [sw only].]
Figure 8.2: (a) IOPF and (b) invalidation flow execution breakdown on ConnectX-IB.
using offloads [FHL+05]. The evaluation was performed using an InfiniBand request-
response micro-benchmark modified to call madvise(..., MADV_DONTNEED) and initialize
the memory that was going to be used for transmitted messages. The call itself
triggered the invalidation flow, and the send operation that came next triggered the
IOPF flow. The initialization is important because omitting it would have caused the
relevant pages to be zeroed out during the IOPF flow. The benchmark ran on a Linux
kernel that was instrumented using kprobes to store timestamps at points of interest
along those flows.
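Roughly, each instrumented iteration triggers the two flows as sketched below; buf and len stand for the registered send buffer and its size, and the constant used for initialization is arbitrary.

#include <sys/mman.h>
#include <string.h>
#include <stddef.h>

static void trigger_invalidation_and_iopf(void *buf, size_t len) {
    /* Dropping the pages invalidates their device mappings,
     * exercising the invalidation flow. */
    madvise(buf, len, MADV_DONTNEED);
    /* Re-initialize from the CPU so the pages are populated and the
     * subsequent IOPF does not also pay for zero-filling them. */
    memset(buf, 0xAB, len);
    /* ... the next send posted to the NIC now triggers the IOPF flow ... */
}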
Figure 8.2(a) shows the timing breakdown of the IOPF flow. The different message
sizes allow us to identify what parts of the page-in process are sensitive to the amount
of data mapped. By comparing the results for the different sizes, we can tell that the
fixed page-in cost amounts to about 220 µs. The cost per page is about 100 ns, roughly
2-3 memory accesses. For an IOPF of a single page, the run time is dominated by the
HW overheads. This is composed of the ‘trigger interrupt’ and the ‘process resume’
phases. The former is the time it takes the hardware to notice and report an IOPF,
while the latter is the time required for the hardware to resume the transmission of a
ring after the IOPF is resolved. Together, these phases account for 90% of the fixed page-in cost.
As the number of pages requested in the page fault increases, the ‘os overhead’ phase
becomes more dominant. During this phase, the driver detects the page fault, reads
the ring to determine which virtual pages need to be mapped, and asks the OS for the
physical addresses. Based upon finer profiling, we know that the major time consumer
is the OS providing the driver with physical page addresses. The increased size of the
page-in request also prolongs the ‘update hw PT’ phase, in which the hardware page
table is updated with the appropriate PTEs.
Figure 8.2(b) shows the timing breakdown of the invalidation flow. First, in the
‘check shadow PT’ phase, the software finds the memory region in which the invalidation
occurred, and scans a shadow copy of the hardware PT to see whether any of the
mappings were visible to the hardware. If the pages were not mapped through an IOPF,
this is the only overhead incurred. If the pages were mapped, the driver updates the
hardware table, in the ‘update hw PT’ phase. Finally, in the ‘update shadow PT’ phase,
the driver updates the shadow copy of the hardware PT to indicate that the pages are no
longer visible to the hardware. Parallelizing the 'update hw PT' and 'update shadow PT'
phases is possible. However, doing so involves a lock granularity trade-off, which hurts
the common case.
8.4 Network Transport and IOPF Interplay
The page-in latency measured above is relevant to all IOPF-capable devices. We now
move to measuring phenomena that arise from the interplay between IOPFs and the
network-specific transport.
8.4.1 Impact of Periodic IOPFs on Bandwidth
We used a simple stream benchmark to measure the impact of periodic IOPFs on bandwidth.
The benchmark strongly resembles netperf’s TCP STREAM benchmark [J+96]. The
sender side performed 64 KB sends in an infinite loop using a standard Linux TCP stack.
The receiver side ran our lwIP stack and discarded packets as soon as they arrived,
while keeping track of how much data was received. To allow comparison between
Ethernet and InfiniBand, we also used the ib_send_bw InfiniBand micro-benchmark from
the perftest package as an equivalent InfiniBand stream workload. In order to highlight
the impact of IOPFs and suppress other memory pressure phenomena, we synthetically
generated rIOPFs at a variable frequency. The benchmark pre-faulted the entire receive
ring at start, so that the cold ring problem would not affect the measurements.
We forced a minor IOPF using mprotect. We changed the permission of the relevant
page to read-only and then back to read/write.² The mprotect calls invalidated the
page mapping and forced an IOPF upon the next access. Triggering a major IOPF is
more complicated. Our implementation used writes to a file opened with O_DIRECT
to evict pages from the page cache. An mmap of the same file then experienced a major
page fault when accessing the same page following the write. We also had to modify
lwIP's code such that it would not touch the relevant page prematurely; lwIP touched
the page before posting it to the NIC's receive ring, causing a major page fault on
the CPU instead of a major IOPF. To evaluate the hardware-based RNR-NACK
implementation, we ran a similar InfiniBand benchmark, modifying ib_send_bw to
trigger a minor page fault once every X messages.
² We use mprotect, rather than madvise, as here we want to preserve the meta-data in the page.
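The fault-injection mechanics are roughly as follows (a sketch; addr, dio_fd, and the alignment handling are placeholders, and the page-cache eviction behavior of O_DIRECT writes is kernel dependent):

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <stddef.h>

/* Minor rIOPF: flipping the protection invalidates the page's mapping,
 * so the NIC's next access to it faults. */
static void force_minor_fault(void *addr, size_t page_size) {
    mprotect(addr, page_size, PROT_READ);
    mprotect(addr, page_size, PROT_READ | PROT_WRITE);
}

/* Major rIOPF: a direct-I/O write to the file region backing the mapped
 * page drops the cached copy, so the next access pages it in from disk.
 * dio_fd must be opened with O_DIRECT and aligned_buf suitably aligned. */
static void force_major_fault(int dio_fd, void *aligned_buf,
                              off_t file_off, size_t page_size) {
    pwrite(dio_fd, aligned_buf, page_size, file_off);
}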
[Plot omitted: throughput in Gb/s as a function of rIOPF frequency (2^-10 to 2^-25) for the minor drop, major drop, minor backup, and major backup configurations (top), and for the minor hw configuration (bottom).]
Figure 8.3: Throughput of a stream benchmark in the presence of rIOPFs of varying frequencies.
The results are shown in Figure 8.3. Note that due to the different setup, the
InfiniBand benchmark has a different y-axis. The backup ring approximation significantly
improves performance for both major and minor page faults. In the case of drop, the type
of page fault does not matter because the TCP retransmission timeout is significantly
longer than the time it takes to resolve a major page fault. The hardware implementation,
shown in the lower figure, notifies the remote sender immediately upon a page fault.
The notification allows the sender to use a relatively short IOPF-specific timeout,
[Plot omitted: behavior over time (1 to 5 seconds) of the backup, drop, and pinning configurations: (a) throughput in Gbps, (b) cwnd in packets, (c) retransmitted packets, (d) recovered packets.]
Figure 8.4: Transient operation of a TCP stream benchmark over time in the presence of minor rIOPFs.
resulting in significant performance improvement relative to drop. Nevertheless, network
utilization-wise, this solution is less efficient than the backup ring solution.
Figure 8.4 examines the steady state behavior of the TCP stream benchmark over
time, given a fixed rIOPF frequency of one in 1M packets. We added a baseline pinning
configuration in which we run our stream test with no IOPFs at all. Figure 8.4(a) shows
that the IOPF configurations experience a decrease in throughput when a rIOPF occurs.