
Transcript
  • I/O Page Faults

    Ilya Lesokhin

    Technion - Computer Science Department - M.Sc. Thesis MSC-2015-21 - 2015


  • I/O Page Faults

    Research Thesis

    Submitted in partial fulfillment of the requirements

    for the degree of Master of Science in Computer Science

    Ilya Lesokhin

    Submitted to the Senate

    of the Technion — Israel Institute of Technology

    Heshvan 5776 Haifa November 2015


  • This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty

    of Computer Science.

    Results pertaining to the InfiniBand setup were generated by the Mellanox team; the Ethernet results were produced by the author of the thesis.

    Acknowledgements

    I would like to thank my advisor, Prof. Dan Tsafrir, for teaching me how to do research and pushing me when I wanted to give up. I would also like to thank fellow students Nadav Amit, Omer Peleg and Muli Ben-Yehuda for many fruitful discussions and the technical help they provided during my work on this thesis.

    I thank Mellanox, and especially the architecture team, Liran Liss, Shachar Raindel and Haggai Eran, for providing the hardware, many of the results, and help with the writing; without their support this work would not have been possible. Finally, I would like to thank my parents for supporting me all along.

    The generous financial help of the Technion is gratefully acknowledged.


  • Contents

    List of Figures
    Abstract
    Abbreviations and Notations
    1 Introduction
    2 Background
        2.1 PCIe Primer
        2.2 Virtual Memory
        2.3 InfiniBand Primer
    3 Motivation
    4 Basic IOPF Support
        4.1 Non-Recoverable Failures
        4.2 IOPF support
    5 The Page Fault Latency Problem
    6 HW Transport IOPF Support
    7 SW Transport IOPF Support
        7.1 The Cold Ring Problem with TCP
        7.2 The Backup Ring
        7.3 Implementation
    8 Evaluation
        8.1 Methodology
        8.2 Cost of IOPFs on ConnectX-3
        8.3 Cost of IOPFs and Invalidations on ConnectX-IB
        8.4 Network Transport and IOPF Interplay
            8.4.1 Impact of Periodic IOPFs on Bandwidth
            8.4.2 Cold Ring Problem and Backup Ring
        8.5 System-level IOPF Evaluation
            8.5.1 Cloud and Web 2.0 Environment
            8.5.2 Applications with Direct-I/O
    9 Related Work
        9.1 Existing Direct Device Assignment Solutions
        9.2 Generic IOPF support
        9.3 Networking IOPF
        9.4 GPUs and other accelerators
        9.5 Handling latency
    10 Discussion and Future Work
        10.1 Problems with ATS/PRI
        10.2 Optimizations and Future Work
    11 Conclusion
    Hebrew Abstract

  • List of Figures

    4.1 IOPF (1 – 4) and invalidation (a – d) flows
    7.1 High level design of the backup ring.
    7.2 Software pseudo-code of the backup ring
    7.3 Hardware pseudo-code for the backup ring.
    7.4 lwIP main loop
    7.5 Pseudo-code for the backup ring approximation.
    7.6 lwIP vs. Linux performance evaluation.
    8.1 Minor IOPF handling breakdown for ConnectX-3.
    8.2 (a) IOPF and (b) invalidation flow execution breakdown on ConnectX-IB.
    8.3 Throughput of a stream benchmark in the presence of rIOPFs of varying frequencies.
    8.4 Transient operation of a TCP stream benchmark over time in the presence of minor rIOPFs.
    8.5 (a), (b) Startup with 64 entries in receive ring. (c) Time it takes to perform 10,000 operations as a function of receive ring size.
    8.6 Pinning vs. IOPF with dynamic working set: (a) with IOPFs (b) with pinning (c) combined throughput
    8.7 The transition period.
    8.8 Flipping the working set with different swap devices.
    8.9 No swap experiment.
    8.10 Rare good results with HDD.
    8.11 System never recovers (90-10-90).
    8.12 (a) Storage bandwidth with single initiator and varying memory limit. (b) Memory usage with multiple initiators and a fixed memory limit.
    8.13 IMB running time for different MPI operations by message size. The ratio between the copying and the pinning run-times is shown in the labels.

  • Abstract

    Virtual memory is used in most modern general-purpose computer systems. This invention simplifies systems and increases their usability and efficiency. In recent years, I/O devices have also started using virtual addresses. However, support for I/O page faults is still lacking. I/O devices are designed under the assumption that the virtual addresses they use are always valid, and software is forced to make sure that this is indeed the case. This deficiency deprives one class of software of the benefits of virtual memory: it prevents memory overcommitment, complicates the programming model, and hinders administration. The affected software class is exclusively comprised of software that performs direct I/O, which is the act of accessing I/O devices without any involvement of intermediary privileged software such as the operating system (OS) kernel or the hypervisor. Prominent examples are direct device assignment of SR-IOV (single root I/O virtualization) instances in virtualization scenarios and kernel-bypass access to I/O devices by user-space applications.

    This thesis presents working hardware and software support for I/O page faults (IOPFs) in a network interface card (NIC). It describes the challenges involved in implementing this support and demonstrates that an IOPF-enabled NIC allows for efficient memory overcommitment.


  • Abbreviations and Notations

    I/O : Input/Output

    IOPF : I/O page fault

    rIOPF : receive I/O page fault

    MMU : Memory Management Unit

    IOMMU : I/O Memory Management Unit

    TLB : Translation Lookaside Buffer

    IOTLB : I/O Translation Lookaside Buffer

    VA : Virtual Address

    IOVA : I/O Virtual Address

    OS : Operating System

    DMA : Direct Memory Access

    NIC : Network Interface Card

    TX : Transmit

    RX : Receive

    PT : Page Table

    PTE : Page Table Entry

    IP : Internet Protocol

    TCP : Transmission Control Protocol

    VM : Virtual Machine

    API : Application Programming Interface

    TPS : Transactions Per Second

    HPC : High-Performance Computing

    CWND : Congestion Window

    RTT : Round-Trip Time


  • Chapter 1

    Introduction

    The availability of physical memory often determines the performance of the system

    for a given workload. Lack of sufficient memory may even render the system unusable.

    Virtual memory, which was introduced in the 1960s, allows a computing system to run

    multiple workloads concurrently while sharing the physical memory in a transparent

    manner. In addition, virtual memory optimizes physical memory usage by holding only

    the necessary working set of each workload.

    Virtual memory is usually implemented by the CPU’s memory management unit

    (MMU), and is thus not exposed to I/O devices. I/O memory management units

    (IOMMUs), which are integrated in modern servers and devices, provide similar MMU

    services to I/O devices. However, until recently, I/O devices were not able to inform the

    operating system about page fault events. This deficiency limits the I/O devices’ virtual

    memory support to isolation only; I/O devices cannot handle dynamically changing

    working sets.

    Nonetheless, IOMMU-enabled devices are prevalent today. Prominent examples include direct device assignment of I/O devices to virtual machines (VMs) [RS07, WSC+07, YBYW08], high performance computing (HPC) applications [JLJ+04], and packet processing applications [Int]. In §2 we provide a brief overview of PCIe, virtual memory, and recent advancements in IOMMU technology.

    By adding support for I/O Page Faults (IOPF), I/O devices become true first-class citizens in virtual memory (§3). IOPF support provides the means for I/O devices to directly access virtual memory pages, which are not guaranteed to be resident in physical memory at the time of access. It comprises two complementary mechanisms: (1) allowing an I/O device to request from the OS physical mappings of currently non-present pages on demand; (2) allowing the OS to invalidate mappings on the I/O device. In this work we focus on NICs, one of the most demanding classes of I/O devices, and provide a prototype implementation of IOPF support.

    Initially, we detail the design trade-offs and the implementation of the fundamental building blocks necessary for IOPF (§4). These building blocks have broad applicability, and may serve any class of I/O device that supports IOPFs.


  • Next, we observe that basic IOPF support is not enough. For a large class of I/O devices it is crucial to efficiently tolerate the increased latency incurred by page faults, which now occur directly in the I/O fast path. We elaborate on this observation for the general case and for the specific case of networking (§5).

    We then describe how IOPF latency may be tolerated in two prominent implementation approaches in use today for high-end network devices: HW-offloaded (§6) and SW-managed (§7) transport protocols. In the first, IOPF and transport processing are coupled together, and the intimate knowledge between the two may be leveraged accordingly. In the latter, the I/O device provides the basic IOPF support, while SW manages the transport (e.g., TCP). Here, the coordination is less tight, but the SW implementation allows more freedom in the design space.

    In §8 we provide a performance evaluation of IOPF. We begin by showing the basic costs of a single IOPF in our implementation, followed by the effects of recurring page faults on the network bandwidth. We show that even with a page fault probability of 2⁻¹⁵, we achieve full network bandwidth. Next, we demonstrate the performance gains of IOPF in multiple real-world deployment scenarios. We examine virtualization, storage and HPC scenarios. In both the virtualization and storage use cases, the IOPF implementation achieved about 80% performance gain compared to the current art. The IOPF implementation was also able to pack more VMs on the same physical machine. In the HPC scenario, the IOPF implementation achieved performance on par with the current state of the art, while simplifying the programming model significantly.

    We provide an overview of related commercial and academic works (§9) and discuss insights gained in this work and future directions (§10). To the best of our knowledge, this is the first detailed study and evaluation of IOPF in real-world systems. We conclude in §11.


  • Chapter 2

    Background

    2.1 PCIe Primer

    Presently, most I/O devices are connected to the computer using PCIe (Peripheral Component Interconnect Express) [PS14]. PCIe was designed by PCI-SIG to replace its predecessor, PCI, which is a true bus. PCIe, in contrast, is a point-to-point link. It is nevertheless commonly referred to as a “bus” because it is backward compatible with PCI and, as a result, it behaves like a bus from the software perspective.

    The actual topology, however, is not a bus. It consists of many point-to-point links

    and switches that connect all the peripheral devices to the PCIe root complex device.

    The latter typically resides on die and is responsible for connecting peripheral devices

    to the CPU and memory. It facilitates three important functionalities, as follows.

    The first functionality is Memory Mapped I/O (MMIO). The physical memory space

    contains at least one contiguous address interval, denoted the PCI MMIO range. Such a

    range is owned by the root complex and is thus ignored by the memory controller. Any

    memory operation issued by the CPU that is directed at a PCI MMIO range is handled

    by the root complex. The latter converts the operations into PCIe requests, which are

    then fulfilled by the corresponding I/O devices. This mechanism is denoted MMIO. It

    is used to communicate with the I/O devices, e.g., by allowing access to their registers

    as if they are “ordinary” memory.
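
    To make the MMIO mechanism concrete, the following user-space sketch maps a device's first BAR through Linux's sysfs interface and reads a register from it with an ordinary load. The PCI address 0000:01:00.0 is a placeholder, and a real driver would of course perform such accesses from the kernel; this is an illustration only, not code from the thesis.

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            /* Placeholder BDF; resource0 corresponds to the device's BAR0 (needs root). */
            const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource0";
            int fd = open(path, O_RDWR | O_SYNC);
            if (fd < 0) { perror("open"); return 1; }

            size_t len = 4096;              /* map just the first page of the BAR */
            volatile uint32_t *regs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, 0);
            if (regs == MAP_FAILED) { perror("mmap"); return 1; }

            /* A CPU load from this mapping becomes a PCIe memory read to the device. */
            printf("register 0: 0x%08x\n", regs[0]);

            munmap((void *)regs, len);
            close(fd);
            return 0;
        }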

    We note in passing that MMIO operations can be translated to either PCIe memory

    operations or PCIe I/O operations. I/O operations are slower and should only be used

    for initialization. The PCIe I/O and memory operations operate in different address spaces. Each device has a fixed-size configuration space in the I/O address space. The

    address of this space can be calculated using the device ID. The device ID itself is

    composed of bus, device and function numbers. The configuration spaces expose, among

    other things, the base address registers (BARs) of the devices, through which devices

    are notified to which address ranges in the PCIe memory address space they should

    respond. The BARs themselves must therefore be configured before the PCIe memory

    operations can be used.
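
    As an illustration of how a configuration-space address is formed from the bus, device and function numbers, here is a sketch using the legacy x86 port-I/O mechanism (ports 0xCF8/0xCFC). Modern systems normally use the memory-mapped (ECAM) mechanism, and on Linux one would simply read /sys/bus/pci/devices/<BDF>/config, so this is illustrative only.

        #include <stdio.h>
        #include <stdint.h>
        #include <sys/io.h>   /* iopl(), outl(), inl() -- x86 Linux, needs root */

        #define PCI_CONFIG_ADDRESS 0xCF8
        #define PCI_CONFIG_DATA    0xCFC

        /* Build a legacy configuration-space address from bus/device/function and a
         * register offset; bit 31 enables the access. */
        static uint32_t pci_config_addr(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off)
        {
            return 0x80000000u | ((uint32_t)bus << 16) | ((uint32_t)dev << 11) |
                   ((uint32_t)func << 8) | (off & 0xFCu);
        }

        static uint32_t pci_config_read32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off)
        {
            outl(pci_config_addr(bus, dev, func, off), PCI_CONFIG_ADDRESS);
            return inl(PCI_CONFIG_DATA);
        }

        int main(void)
        {
            if (iopl(3) != 0) { perror("iopl"); return 1; }
            /* Vendor/device ID at offset 0x00, BAR0 at offset 0x10 of device 00:00.0. */
            printf("00:00.0 id   = 0x%08x\n", pci_config_read32(0, 0, 0, 0x00));
            printf("00:00.0 BAR0 = 0x%08x\n", pci_config_read32(0, 0, 0, 0x10));
            return 0;
        }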


  • The second functionality facilitated by the root complex is direct memory access

    (DMA), which allows devices to access the main memory. I/O devices perform DMA

    by issuing PCIe memory read and write requests. The root complex processes these

    requests and passes them to the memory controller similarly to how the CPU does it.

    DMA accesses carry a bit that specifies whether cache coherency should be applied to

    them.

    The third root complex functionality is interrupt delivery. PCIe devices trigger interrupts to receive the attention of the CPU asynchronously. Interrupts are triggered similarly to DMAs, i.e., by the device writing to a special address.¹ The PCIe root complex also implements the IOMMU functionality. MMIO operations to the device are translated by the MMU, whereas DMA operations from the device are translated by the IOMMU. Interrupt requests, which are essentially PCIe write operations, are also translated by the IOMMU, a functionality that is commonly referred to as interrupt remapping.

    ¹ PCIe also supports legacy PCI interrupts; those interrupts are not implemented as memory writes.

    2.2 Virtual Memory

    Virtual memory is used in most modern general purpose computer systems [Den70].

    This invention simplifies systems and increases their usability and efficiency. Virtual

    memory isolates processes from each other. Each process has its own virtual address

    space. Independently compiled applications can reside and reference memory at any

    location without risking conflicts.

    Not all virtual address ranges referenced by a process always reside in physical

    memory. Only the current working set is resident. The process relies on the OS

    and underlying paging mechanisms to track changes in the working set and adjust

    the memory mapping accordingly. Locality of reference [Den05] makes paging work

    well [SK09]. It allows substantial savings by overcommitting physical memory. Only

    a small portion of the large virtual address spaces is actually mapped to the limited

    physical memory.

    As the active working sets change, OSes use secondary storage (e.g., disks) as

    an extension to physical memory. Data that is not part of the active working set is

    transparently moved to the secondary storage. If an application attempts to access

    data that was moved, a page fault exception is raised. This exception is handled by the

    OS, which brings the data back from the secondary storage to memory. During this

    handling, the application is suspended. Once the data is again in memory, the virtual

    memory mapping is updated, and the application is resumed. Page faults are typically

    classified as major or minor depending on whether disk access is required to satisfy a

    missing virtual-to-physical page mapping.
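
    As a small illustration of on-demand paging, the following user-space sketch maps a large anonymous region and touches only a fraction of its pages; mincore() then shows that only the touched pages are actually backed by physical frames. This example is an editorial illustration, not part of the thesis's implementation.

        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            size_t page = (size_t)sysconf(_SC_PAGESIZE);
            size_t npages = 1024;                 /* a few MiB of virtual memory */
            unsigned char *buf = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) { perror("mmap"); return 1; }

            /* Touch only every 64th page; each first touch triggers a (minor) page
             * fault, and only then does the OS back that page with a physical frame. */
            for (size_t i = 0; i < npages; i += 64)
                buf[i * page] = 1;

            unsigned char vec[1024];
            if (mincore(buf, npages * page, vec) != 0) { perror("mincore"); return 1; }

            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;
            printf("%zu of %zu pages resident\n", resident, npages);

            munmap(buf, npages * page);
            return 0;
        }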

    The OS uses free physical memory as a disk cache. A special API [Gal95] allows the application to access this cache directly, using the virtual memory mechanism. In a balanced computing system, disk paging activity usually occurs only during sporadic transient periods. These periods can occur, for example, when new processes are spawned

    and use the physical memory of currently idle processes or rarely accessed cached disk

    blocks.

    Virtual memory provides additional system-wide benefits. Prominent examples in-

    clude speeding up process startup time by reading only the needed parts of the executable;

    efficient fork() implementation using copy-on-write (CoW); and reducing the physical

    memory footprint using active de-duplication [Wal02a] or page compression [Gup09].

    Until recently, virtual memory was the sole domain of applications running on the

    CPU but its use is now expanding to peripheral devices as well. System IOMMUs

    reside between the I/O device and memory, and map virtual I/O address ranges

    that are provided to the device into physical page frames in a secure way. IOMMUs

    are typically used in direct HW pass-through to VMs [YBYW08] and user-level I/O

    [CBD+98, Sch01, SD06]. In addition, some device classes, such as RDMA devices

    [vE98], employ embedded IOMMU units within the device. All of these IOMMU devices

    address only memory mapping for isolation; they are not able to indicate a page fault

    to the OS. In the past year, commercial devices that support paging have been showing

    up in the market [PS09, SBJS15]. However, none of these implementations examine the

    impact of a page fault during an I/O operation. Thus, the implications of introducing

    the full benefits of virtual memory to I/O devices remain unexplored.

    2.3 InfiniBand Primer

    InfiniBand [Inf15] is a computer networking standard widely used in high-performance computing. InfiniBand exposes both a send/receive semantic and a remote direct memory access (RDMA) semantic for data transfers.

    The send/receive is the standard semantic we are used to from Ethernet. It is also

    called the two-sided semantic as software on both sides is involved in data transfer. The

    receiver must post a buffer large enough to receive the incoming data and the sender

    needs to tell the NIC what buffer to send. Since the receiver usually cannot know the

    size of incoming messages in advance, there is usually an agreed upon maximal message

    size and all the buffers posted by the receiver are of that size. This creates inefficiencies

    for both the sender and the receiver. The sender is limited in the amount of data it can

    send in one message, forcing it to split large data transfers into multiple transactions

    while the receiver wastes memory when the sender passes messages that are smaller

    than the maximal message size.
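
    For concreteness, this is roughly what posting one such fixed-size receive buffer looks like with the standard libibverbs API, assuming a created queue pair and a buffer already registered with ibv_reg_mr(); error handling and the surrounding setup are omitted, and this is a sketch rather than the thesis's actual code.

        #include <stddef.h>
        #include <stdint.h>
        #include <infiniband/verbs.h>

        /* Post one fixed-size receive buffer to a queue pair's receive queue.
         * Assumes buf was registered with ibv_reg_mr() and lkey is its local key. */
        static int post_recv_buffer(struct ibv_qp *qp, void *buf, size_t len, uint32_t lkey)
        {
            struct ibv_sge sge = {
                .addr   = (uintptr_t)buf,
                .length = (uint32_t)len,
                .lkey   = lkey,
            };
            struct ibv_recv_wr wr = {
                .wr_id   = (uintptr_t)buf,   /* returned in the completion */
                .sg_list = &sge,
                .num_sge = 1,
            };
            struct ibv_recv_wr *bad_wr = NULL;

            return ibv_post_recv(qp, &wr, &bad_wr);
        }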

    The RDMA semantic, also called the one-sided semantic, allows data transfer with software involvement from only one of the parties. RDMA does require initial software involvement

    on both sides to establish a connection and decide which memory areas are accessible to

    other parties. But after this initial setup, one of the parties can issue RDMA read and

    RDMA write operations to access the memory of the remote party without any software


    involvement from the remote party. This semantic reduces the overhead of data transfer

    for both parties. The initiating party is free to use the optimal transaction size for the

    transfer and the other party does not incur any software overhead. We note that while

    RDMA formally refers only to the one-sided semantic described above, it is commonly used informally for all InfiniBand user-level I/O operations, including send and receive.

    While one could imagine an application where all the communication is done using

    RDMA, this is usually not the case. Consider, for example, a remote storage application

    with a client and a server. Since the data is stored on secondary storage, the client

    usually cannot access it directly using RDMA operations. Even if it could access all

    the data, we would probably want some server involvement to synchronize concurrent

    access from multiple clients. Consequently, RDMA is typically used only for large data

    transfers and there is usually a side channel for control. In our storage example, the

    client would use the control channel to say: “I want to read k blocks starting at block number x”; the server would respond with “OK, it is available at address p, please let me know when you are done”. The client would then use RDMA to read the actual

    data in the blocks and use the control channel once again to notify the server when

    it is done. The control channel is typically implemented with send/receive semantics

    because the software on the receiver does want to be notified about the reception of a

    control message and be able to do something about it.
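
    Continuing the storage example, the client-side RDMA read might look roughly as follows with libibverbs, assuming a connected RC queue pair and a remote address and rkey obtained over the control channel. This is an illustrative sketch, not the thesis's code.

        #include <stdint.h>
        #include <infiniband/verbs.h>

        /* Issue a one-sided RDMA read: pull 'len' bytes from (remote_addr, rkey)
         * into a locally registered buffer, with no software on the remote side. */
        static int rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                             uint64_t remote_addr, uint32_t rkey, uint32_t len)
        {
            struct ibv_sge sge = {
                .addr   = (uintptr_t)local_buf,
                .length = len,
                .lkey   = lkey,
            };
            struct ibv_send_wr wr = {
                .wr_id      = 1,
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IBV_WR_RDMA_READ,
                .send_flags = IBV_SEND_SIGNALED,   /* ask for a completion when done */
            };
            wr.wr.rdma.remote_addr = remote_addr;  /* obtained over the control channel */
            wr.wr.rdma.rkey        = rkey;

            struct ibv_send_wr *bad_wr = NULL;
            return ibv_post_send(qp, &wr, &bad_wr);
        }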

    Unlike the Ethernet standard, which ends at the link layer, InfiniBand also includes specifications for the transport layer, encouraging implementation of that layer in hardware. The most commonly used transport is the reliable connection (RC) transport, which provides reliable data transfer between two parties, similarly to TCP/IP. Other notable transports are unreliable connection (UC) and unreliable datagram (UD); like UDP/IP, they are both unreliable. The difference is that UC is connection-oriented and mandates a send queue per connection, while UD supports multicast and allows using a single send queue to talk to multiple parties. One notable place where UD, rather than RC, is typically used is the implementation of the IP over InfiniBand (IPoIB) [CK06] driver.

    IPoIB allows regular applications designed to be used in the IP environment to work

    over InfiniBand networks. The rationale for using UD in IPoIB is that it is simpler to

    add InfiniBand headers to IP packets and send them with a shared send queue than to

    maintain multiple reliable send queues corresponding to all the active remote parties.


  • Chapter 3

    Motivation

    Direct I/O applications and virtualized systems with direct device assignment have two ways of accessing memory: directly, using the CPU, and indirectly, by asking an I/O device to access the memory on the application's behalf. While CPU page faults are supported on all modern general-purpose systems, support for IOPFs is almost non-existent. Consequently, such systems are forced to use pinning and cannot enjoy the full benefits of virtual memory.

                            Translation only    Translation + page fault support
    isolation                      ✓                          ✓
    on demand paging               ✗                          ✓
    swapping                       ✗                          ✓
    overcommitment                 ✗                          ✓
    easy programming               ✗                          ✓
    page migration                 ✗                          ✓

    Table 3.1: Benefits of virtual memory

    Table 3.1 lists the benefits of virtual memory and specifies whether they can be provided by a system that only supports translation or whether page fault support is also required. As we can see, the only benefit that a translation-only virtual memory system can provide is isolation. Such a system cannot provide on-demand paging or swapping, because a page fault is required to page in the data. It cannot provide copy-on-write (CoW), because a page fault upon write is required to copy the data and break the CoW mapping. As a result of the limitations above, there is no memory overcommitment in such a system. The ability to overcommit memory allows programs to allocate memory even if there is not enough physical memory to satisfy the allocation. This ability greatly simplifies the programming model, as it relieves the programmer from writing fall-back code paths for every memory allocation failure. In typical systems today, memory allocation failures in user space are so rare that it is acceptable for a program to simply terminate when an allocation fails. In addition to all the overcommitment-related benefits mentioned above, page faults are also required for page migration, which allows compacting the physical memory for more efficient usage.

    With IOPF support, the inherent benefits of virtual memory apply seamlessly to

    application buffers used for direct I/O. Storage servers, for example, may allocate large

    buffer pools up front to accommodate the worst case, but reference only “hot” buffers

    in the common case. As a result, the I/O memory footprint follows the current working

    set.

    In other applications, such as HPC, large memory ranges may be mapped for remote

    direct memory access (RDMA) by I/O devices [Inf15]. A remote host in an RDMA-

    capable network may read and write local application memory directly (after a proper

    key exchange) without involving the (local) CPU. Here, the working set is determined

    by remote I/O activity rather than by the local CPU.

    In the context of virtualization, direct device assignment mandates pinning the entire

    VM address space [ABYTS11], even though the VMs themselves are held in virtual

    memory. IOPF maintains the benefits of the large body of work done on over-committing

    VM memory without giving up the performance advantages of direct I/O.

    Virtual memory moves the complex memory management code employed by state

    of the art direct I/O applications to the operating system. Complex code that decides

    what data should be in memory at any given time can be discarded. Finally, pinning

    memory requires special administrative privileges. IOPF support will allow running

    unprivileged direct I/O applications.


  • Chapter 4

    Basic IOPF Support

    When we add an indirection level and decide that DMA transactions should use virtual

    rather than physical addresses, we also have to decide how to handle translation failures:

    a situation where a DMA transaction references a virtual address that has no translation

    or has permissions which do not allow the transaction to complete. We can leave the

    I/O devices oblivious to the possibility of translation failure and force the software to

    avoid them, or we can inform the I/O devices about translation failures and give the devices a chance to do something about them. The VT-d spec [Int14] uses the terms

    non-recoverable and recoverable address translation failures to describe those options.

    4.1 Non-Recoverable Failures

    The simpler and widely used option is to keep the I/O devices oblivious to translation

    failures. A DMA transaction to a virtual address that cannot be translated is treated

    the same as a DMA transaction to an invalid physical address. I/O devices are not

    notified about translation failures, and do not require any modification to work in this

    mode. Under this design option, a DMA access should never encounter a translation

    failure. Such a failure, if it does happen, indicates that either the device or its driver

    are misbehaving. The IOMMU nevertheless detects and reports translation failures.

    But OSes currently do not have a standard interface for drivers to register translation

    failure callbacks [Cor14]. As a result, all the IOMMU driver can currently do is to log

    the failure. Even if OSes did have an interface for registering such callbacks, recovering

    from this failure could be prohibitively difficult or altogether impossible, because most

    I/O devices are designed under the assumption that DMA operations do not fail.

    DMA Read When a device issues a read, it expects an answer. The IOMMU indeed

    returns an answer when the corresponding translation fails, but this answer is a generic

    “unsupported request” [Int14]; namely, the device has no way of knowing that the failure

    was due to a translation failure. The specific reaction of devices to such generic failures

    varies. The network and disk controllers that we tested exhibited various undesirable


    behaviors, ranging from getting stuck to corrupting the filesystem. Devices may, in

    principle, behave in a more civilized manner by raising an interrupt, informing their

    driver to reset the device. Conceivably, devices can ignore the read error and continue

    as if nothing bad happened. For example, a sound card can skip the data that it was not

    able to read and be silent for a moment. Such an approach, however, might colossally

    fail if a disk controller is involved. For example, if the controller is instructed to read

    from memory and write the content to the disk, then silently ignoring a DMA read error

    would likely result in data loss, because the OS would rightfully expect the information

    to be persistent on disk. This example coincides with the behavior we empirically

    observed.

    DMA Write   DMA write operations are more challenging than reads with respect to address translation failures. Whereas DMA read operations expect a response, DMA write operations are conducted in a “write and forget” manner. Namely, devices are not acknowledged when DMA write operations complete.¹ Consequently, there is no way for the IOMMU to inform devices that write operations have failed, so the operations fail silently upon translation failures. Silent failures are usually worse than other outcomes such as, say, crashing. Assume, for example, that the host directly assigns a disk controller to a guest without pinning the associated memory in a system where translation failures cause DMA writes to fail silently. Further consider that the guest intends to read and run an executable from the disk. The guest will therefore instruct the disk controller to read the relevant blocks and DMA write them to memory. If a translation failure occurs during the DMA write, it will fail silently, without writing the requested data to memory. Being oblivious to the IOPF, the disk controller will next inform the guest that the operation has completed successfully. The guest, in turn, will run the executable, which will crash when the CPU tries to execute “random” data that was supposed to be overwritten with the executable code but was left unchanged due to the DMA write failure. The crash might happen at the worst possible moment, for example, when trying to save a big document that the user has been working on for hours. Worse, the executable could be a kernel module and thus crash the OS entirely. Note that in such a scenario, the hypervisor is notified about the IOPF through the IOMMU, but there is no standard way for the hypervisor to notify the guest about the error.

    ¹ PCIe does have a link-layer ack, but this ack only means that the packet containing the write request has been received successfully; it does not mean that the write operation has been successfully completed. Furthermore, the entity sending the ack is not necessarily the root complex and hence has no knowledge as to whether the DMA write was successful. Rather, it may be a PCIe switch that has no understanding of (I/O) virtual memory semantics.

    NIC translation failures   The above examples of what might go wrong when encountering address translation failures are disk-related. Seemingly, the network is not as “trustworthy” as the disk. For example, one trusts a disk write to transpire as requested, whereas one does not equivalently trust a packet to reach the other side. As

    a result, network packets have checksums and sequence numbers to recover from packet

    loss and data corruption. If random data is passed to the network stack, it will most

    likely be dropped. The reason for the drop can be an unknown protocol, bad checksum

    or bad sequence number. Consequently, when the NIC receives a packet, fails to write

    its content to memory due to an IOPF and informs the NIC’s driver that a packet was

    received, the most likely scenario is that the packet will get dropped.

    However, we would like to point out that the disk related failure scenarios we

    mentioned earlier are actually applicable to NIC translation failures when working with

    NFS (network file system). Many modern NICs have hardware checksum offloading

    and scatter gather capabilities. In order to take advantage of the hardware checksum

    offloading, the network stack is usually designed to skip the checksum checking step

    in cases where the hardware has confirmed that the checksum is good. When the

    NIC receives a good packet but fails to write its content to memory, it will actually report a good checksum, because it checked the original packet and not the content that the network stack will see in memory. In such a scenario, the packet will still most likely get dropped due to a bad header. However, if we further assume that the packet

    crosses a page boundary, or that the NIC’s driver uses the scatter gather capabilities

    and receives each incoming packet into multiple pages, then it is possible that a part

    of the packet containing the header will be written successfully to memory while the

    rest of the packet will not be. In such a scenario, the header is valid and there is

    no software checksum check because the network stack blindly trusts the hardware

    checksum checking capabilities. As a result, the network stack will give corrupted data

    to the application using it. If the application is an NFS client, we would experience a failure similar to the disk read failure, and if the application is an NFS server, we would experience a failure similar to the disk write failure.²

    ² Application-level data integrity testing can save us from data corruption in those cases. However, it is only supported in NFSv4, which is not widely deployed.

    4.2 IOPF support

    The second option for handling translation failures is to inform the device when a

    translation failure occurs and allow the device to respond. We mentioned earlier that

    the VT-d spec uses the term recoverable address translation failures for this option.

    However, the VT-d spec actually refers specifically to the ATS/PRI standards by PCI-SIG, and there are other ways to implement this design option. For the purposes of this work, we will use the term I/O page fault (IOPF) for a translation failure, and IOPF support for any design where devices are notified about IOPFs and have the hardware/software interfaces described next in order to be able to handle IOPFs

    gracefully. Existing I/O devices need to be modified in order to support IOPFs, and as

    a result, this support is still very rare. In fact, the only other implementation we are aware of is the GPU in AMD Kaveri. We developed two devices with IOPF support: a Mellanox ConnectX-3 40G Ethernet NIC and a Mellanox Connect-IB 56G InfiniBand Host Channel Adapter (HCA). Both adapters employ a similar internal IOMMU, which initially assumed that all PTE (page table entry) mappings are valid.

    Figure 4.1: IOPF (1 – 4) and invalidation (a – d) flows. (The figure depicts the OS, the driver, the I/O device, and the I/O page tables, with the numbered IOPF steps and the lettered invalidation steps between them.)

    Page Faults   To support page faults, we allowed the internal IOMMU to hold invalid mappings. The resulting IOPF flow is illustrated in Figure 4.1 and described as follows: (1) The device starts processing an I/O request. It consults the I/O page tables and determines that one of the pages involved is not present. (2) The device raises an IOPF interrupt to its driver in order to resolve the page fault. (3) The driver calls get_user_pages() to obtain the relevant pages from the OS, and immediately calls put_page() to avoid pinning them. (4) The driver updates the I/O page table with the corresponding physical addresses and informs the device that the IOPF has been resolved, allowing it to resume normal operation.
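
    A minimal sketch of what steps (3) and (4) might look like in a Linux driver is shown below. The exact get_user_pages() signature has changed across kernel versions, and my_device, program_io_pte() and notify_device_fault_resolved() are hypothetical stand-ins for the device-specific parts; this is not the thesis's actual driver code.

        #include <linux/mm.h>
        #include <linux/io.h>
        #include <linux/errno.h>

        struct my_device;  /* hypothetical device context */
        void program_io_pte(struct my_device *dev, unsigned long va, phys_addr_t pa);
        void notify_device_fault_resolved(struct my_device *dev, unsigned long va);

        /* Resolve one IOPF for a faulting I/O virtual address (steps 3 and 4). */
        static int resolve_iopf(struct my_device *dev, unsigned long va)
        {
            struct page *page;
            long got;

            /* (3) Fault the page in and take a reference to it; the signature of
             * get_user_pages() shown here is the recent one and differs in older kernels. */
            got = get_user_pages(va & PAGE_MASK, 1, FOLL_WRITE, &page);
            if (got != 1)
                return got < 0 ? (int)got : -EFAULT;

            /* (4) Install the translation in the device's I/O page table ... */
            program_io_pte(dev, va & PAGE_MASK, page_to_phys(page));

            /* ... and drop the reference immediately so the page is not pinned;
             * the MMU-notifier-driven invalidation flow keeps the mapping safe. */
            put_page(page);

            notify_device_fault_resolved(dev, va);
            return 0;
        }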

    In the unlikely case that the memory is not available, for example due to an access

    attempt to unallocated virtual memory, the device driver notifies the hardware that the

    page fault could not be resolved. The hardware follows the same error semantics that

    InfiniBand [Inf15] defines for local access errors.

    Invalidations As the memory pages accessed by the device are no longer pinned, the

    OS is allowed to unmap and reuse pages at will. This requires an invalidation flow,

    in which the OS notifies the device that a virtual mapping is no longer valid. The

    invalidation flow is illustrated in Figure 4.1 and described as follows: (a) The OS decides

    to change a virtual mapping and asks the driver via the Linux kernel MMU notifiers

    infrastructure [Arc08] to remove the old mapping and stop the device from using it. (b)

    The driver updates the I/O page tables and issues an invalidation to the device. (c) The

    device acknowledges the invalidation and stops using the relevant mapping. (d) The

    driver notifies the OS that the old mapping has been removed and that the relevant


    pages can be reused.

    We note that multiple IOPFs and invalidations may execute concurrently, and software-based locking is used for synchronization. Page fault handling might naturally block on an invalidation. Due to this fact, the locking scheme had to ensure that invalidations never block on page fault handling. If an IOPF collides with an invalidation, the IOPF handling is aborted and restarted after the invalidation completes.
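
    A sketch of how a driver might hook into the Linux MMU notifier infrastructure for step (a) is shown below. The callback signature matches kernels of roughly the thesis's vintage and has since changed, and my_device, invalidate_io_ptes() and flush_device_iotlb() are hypothetical placeholders for the device-specific parts.

        #include <linux/kernel.h>
        #include <linux/mm.h>
        #include <linux/mmu_notifier.h>

        struct my_device {
            struct mmu_notifier mn;
            /* ... device state ... */
        };

        void invalidate_io_ptes(struct my_device *dev, unsigned long start, unsigned long end);
        void flush_device_iotlb(struct my_device *dev, unsigned long start, unsigned long end);

        /* Step (a): the kernel calls back whenever a mapping of the registered mm
         * is about to change. */
        static void my_invalidate_range_start(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start, unsigned long end)
        {
            struct my_device *dev = container_of(mn, struct my_device, mn);

            /* (b) Mark the I/O PTEs non-present and issue an invalidation ...        */
            invalidate_io_ptes(dev, start, end);
            /* (c) ... wait until the device acknowledges it stopped using them.      */
            flush_device_iotlb(dev, start, end);
            /* (d) Returning lets the OS reuse the underlying pages.                  */
        }

        static const struct mmu_notifier_ops my_mn_ops = {
            .invalidate_range_start = my_invalidate_range_start,
        };

        static int register_invalidations(struct my_device *dev, struct mm_struct *mm)
        {
            dev->mn.ops = &my_mn_ops;
            return mmu_notifier_register(&dev->mn, mm);
        }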


  • Chapter 5

    The Page Fault Latency Problem

    When a CPU page fault occurs, the current thread of execution is halted until the

    operating system handles the page fault. Similarly, for devices that perform purely local

    work (GPU, FPGA, ASIC accelerators and local storage devices), it is usually possible

    to suspend the specific execution context until the page fault is resolved.

    However, there exists a large class of devices for which pausing I/O due to a page

    fault disrupts normal I/O operation, even if the average I/O rate in the presence of

    page faults does not limit the desired throughput. Sensor data, audio sampling, video

    input, and CD-ROM burning are examples of such devices. In the context of network devices, this problem makes it difficult to send and receive packets in a timely manner.

    We denote by rIOPF (receive IOPF) the scenario in which a NIC encounters an IOPF

    while receiving a packet from the network. Arguably, the simplest approach is to drop

    all incoming packets designated to the faulting ring until the rIOPF is resolved. The

    problem with this approach is that if the drop is done without informing the transport

    layer, performance will be greatly impacted due to the relatively large timeout values

    used today – 200 milliseconds in TCP [VPS+09], and around 4 seconds with InfiniBand,

    which assumes a (nearly) lossless network. In the next chapters, we elaborate on how

    we deal with the rIOPF problem for the specific cases of both HW and SW transports.

    For transmission, the situation is less complicated as suspending a transmit ring

    for the duration of an IOPF will only delay the transmission and will not cause data

    loss. However, network protocols might rely on delivering acknowledgments in a timely

    manner because the peer might interpret the delay as an indication of packet loss. This

    does not pose a problem in practice due to the large network timeouts, as mentioned

    above, and in our implementation we suspend the sending queue until page faults are

    resolved.


  • Chapter 6

    HW Transport IOPF Support

    When the page fault and the transport are both handled in the same hardware unit,

    interfacing them is relatively easy. This is the case for InfiniBand adapters. We

    specifically consider the reliable connection (RC) transport.

    The InfiniBand wire protocol supports explicit link-level flow control. Therefore, a naïve implementation could block the incoming network traffic until an rIOPF is resolved.

    This would allow the implementation to be work preserving. However, blocking the

    flow control for an extended period of time will lead to congestion spreading [AAK+08,

    SCS+14] in the network. Therefore, an end-to-end mechanism is needed to handle such

    a situation.

    The RC protocol contains an end-to-end mechanism for the receiver to stop the

    sender. When an rIOPF is encountered, the receiver sends a receiver-not-ready (RNR)

    negative acknowledgment (NACK) packet. This NACK notifies the sender that it should pause the transmission of the relevant flow for a specified period of time T. This notification explicitly informs the sender that the packet was lost, allowing the sender to retransmit quickly rather than using the generic 4-second retransmission timeout

    mentioned earlier. The notification also stops the sender early and reduces the amount

    of data that will be lost due to the rIOPF. We note that in this solution some data is lost

    and retransmission is required. However, retransmission is possible because the protocol

    is reliable and so, by definition, the sender must keep the data until it is acknowledged.
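
    With the standard verbs API, the RNR behavior described above is controlled by two queue-pair attributes, roughly as sketched below; a real RTR/RTS transition sets several additional attributes and mask bits, so this fragment is illustrative rather than complete and is not taken from the thesis's implementation.

        #include <infiniband/verbs.h>

        static int set_rnr_behavior(struct ibv_qp *qp)
        {
            struct ibv_qp_attr attr = {0};
            int ret;

            /* Receiver side (part of the RTR transition): how long the peer is asked
             * to wait after an RNR NAK before retrying; encoded value 12 corresponds
             * to roughly 0.64 ms. */
            attr.qp_state = IBV_QPS_RTR;
            attr.min_rnr_timer = 12;
            ret = ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_MIN_RNR_TIMER /* | ... */);
            if (ret)
                return ret;

            /* Sender side (part of the RTS transition): 7 means "retry indefinitely"
             * after RNR NAKs instead of reporting an error to software. */
            attr.qp_state = IBV_QPS_RTS;
            attr.rnr_retry = 7;
            return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_RNR_RETRY /* | ... */);
        }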

    We further note that in InfiniBand packet loss is decoupled from congestion con-

    trol and consequently the retransmission does not have a negative impact on future

    transmissions. This is not the case in TCP/IP.

    We show in §8 that this “drop and stop the sender” solution works reasonably well in the context of the InfiniBand reliable connection transport.


  • Chapter 7

    SW Transport IOPF Support

    While for hardware transport we could rely on the transport state during page fault

    handling, in software transports such information is not readily available. In the prevalent

    setup of TCP/IP over Ethernet, the transport layer is implemented in software, while

    page faults are handled by the hardware and the device driver in the hypervisor. The

    hardware and the hypervisor, being only aware of the Ethernet layer, cannot handle

    the rIOPF by sending an RNR-NACK equivalent for TCP. An unmodified guest’s

    TCP/IP stack, on the other hand, has the full state information, but is not aware

    that a page fault has occurred, and does not have the required information from the

    TCP/IP headers of the faulting packet. Consequently, for the software transport case

    we cannot implement the RNR-NACK solution, and we are left with the drop and wait

    for retransmit option.

    Initially, we hoped this option would suffice to support TCP/IP. This hope was

    based on the fact that TCP/IP stacks usually use a limited size memory pool for

    transmission and reception buffers, and the buffers are frequently used, so we expected

    IOPFs to be relatively rare. However, when we tested this approach with TCP/IP, we

    discovered a critical problem. We call it the cold ring problem.

    7.1 The Cold Ring Problem with TCP

    Our implementation uses a separate page table in the device, and does not pre-populate

    it. Consequently, when we first start an application, the receive ring is “cold”. Namely,

    the receive buffers are not mapped and rIOPFs are not as rare as one would expect

    during steady state operation. We discovered that a “cold” receive ring poses a unique

    problem: TCP retransmission and congestion avoidance result in a near-deadlock of

    the communication. The cold ring problem is not limited merely to startup situations.

    It can also happen, for example, when the VM is resumed from suspension or brought

    back from swap.

    New TCP connections start in a slow start phase in which they send at a very low

    rate in order to avoid exceeding the network capacity. Drops are considered a sign of congestion and cause TCP to reduce the transmission rate even further. Similarly,

    during the connection establishment stage, TCP utilizes an exponential backoff scheme

    to avoid overloading the network and the receiver. However, if the cause of the packet

    loss is an IOPF, communication all but stops. The transmitter will be waiting for

    acknowledgments from the receiver before it increases the transmission speed. Instead,

    due to timeouts, it will actually try to reduce the transmission rate. At the same time,

    the receiver will wait for more packets to arrive from the network to page-in the receive

    ring. The effective visible behavior strongly resembles a deadlock. Nearly no network

    traffic is sent, with both parties waiting for the retransmission timeout. In some of the

    cases, the issue is so severe that the TCP stack announces a failure to the application

    layer. This happens once the maximal retry number is exceeded. We demonstrate such

    an issue in §8.4.2. In addition to suspend/resume of a VM or a startup case, a ring can become cold for other reasons. Specifically, operations such as NUMA migration or fork can

    cause the same effect.

    7.2 The Backup Ring

    In para-virtualized guests, the receive ring is also likely to be cold from time to time.

    However, the problem is much less severe thanks to buffering performed by the hypervisor

    virtual switch for the guest traffic. When considering where to buffer such packets, we

    rule out buffering on the NIC itself, since adding enough on-chip memory to buffer a

    major page fault would be too expensive. The backup ring solution to the cold ring

    problem is based upon these observations.

    Figure 7.1: High level design of the backup ring. (The figure shows packets arriving from the network at the NIC (1) and being written either directly into the guest VM's receive buffer (2) or, upon a page fault, into a backup ring owned by the hypervisor/OS kernel (3), from which they are later copied into the guest buffer (4).)

    The design of the solution can be seen in Figure 7.1. Traffic is received from the network by the NIC (1). For each incoming packet, the NIC inspects the designated

    receive buffer of the Guest VM. If this buffer is available, the data is written directly to

    it (2). However, if a page fault is encountered while writing into the buffer, the packet

    is written to a backup ring owned by the hypervisor (3). After the hypervisor fixes the

    page fault, it copies the packet into the original receive buffer (4). To maintain ordering,

    the NIC skips receive descriptors that encountered page faults. For the same reasons,

    the NIC does not report the reception of new packets to the guest until all previous page faults have been handled.

        struct ring_data {
            // size of ring
            int size;
            // the ring itself
            descriptor_t *descriptor;
            int head;
            int tail;
            bit *bitmap;
        };

        // unresolved IOPFs
        list IOPF_list;

        struct brentry_t {
            int ringID;
            int index;
            int bitmap_index;
            packet_t pkt;
        };

        void br_interrupt() {
            head = get_head(br);
            tail = get_tail(br);
            while (tail != head) {
                brentry_t e = br_re_arm(tail);
                IOPF_list.append(e);
                tail = (tail + 1) % ring_size(br);
            }
            br_update_tail(tail);
            wake_rIOPF_thread();
        }

        void rIOPF_thread() {
            while (true) {
                if (IOPF_list.size() == 0)
                    wait_for_rIOPF();
                assert(IOPF_list.size() > 0);
                brentry_t e = IOPF_list.pop();
                r = get_ring(e.ringID);
                if (!has_room(r))
                    wait_for_tail_change(r);
                make_present(r.descriptor[e.index]);
                store_packet(r.descriptor[e.index], e.pkt);
                r.bitmap[e.bitmap_index] = 0;
                free(e);
                // triggers the NIC's resolve_rIOPFs flow
                resolve_rIOPFs(r);
            }
        }

    Figure 7.2: Software pseudo-code of the backup ring

    The backup ring mechanism allows a graceful fallback. During a page fault, the guest machine behaves exactly like a guest with a para-virtualized or emulated NIC. Namely, the hypervisor will have to buffer incoming traffic for the guest. When the buffer space for this guest is exhausted, incoming packets are dropped. At the same time, if the IOPF is resolved in a timely manner, this will only cause minor latency jitter instead of packet loss. In the common case of page-fault-free traffic, the guest experiences high performance thanks to direct device assignment. We note in passing that while the backup ring solution is described and evaluated in the context of TCP/IP over Ethernet, it is also applicable to the unreliable transport protocols of InfiniBand, which lack hardware retransmission.

    Figure 7.2 contains pseudo-code describing how the hypervisor manages the backup

    ring. Note that in our design the guest is unaware of the backup ring and does not need

    to be modified in order to benefit from the backup buffer mechanism.

    The struct ring_data contains the data that the hypervisor maintains for each ring. size is the size of the ring. descriptor is a pointer to the ring itself, which is an array of size buffer descriptors. head is the index of the next descriptor to be used by the NIC, and tail is the index of the next descriptor to be consumed by the software. bitmap is a special bitmap used to track which packets experienced rIOPFs; it allows the NIC to continue storing new incoming packets in the guest's ring even when there are pending unresolved rIOPFs. The bitmap is initialized with all bits set to zero. Its size is controlled by the hypervisor, and it limits the number of packets the hypervisor will store for a specific guest.

    We also explored a simpler design where the existence of an unresolved rIOPF would


    cause all subsequent packets to be stored in the backup ring. But we were concerned

    that such a design could get stuck in an operation mode where all the packets go through

    the backup ring because a new packet always arrives before the NIC is notified that the

    previous packet has been copied to the guest’s ring.

    The interrupt handler br_interrupt() is called after a new packet has been stored in the backup buffer. This function moves used entries in the backup ring to a list of unresolved rIOPFs and posts a new buffer to the backup ring so that it will not run out of buffers for new entries. In addition, it wakes up a thread whose job is to resolve the rIOPFs. This thread is required because handling an rIOPF might require sleeping,

    which is forbidden in an interrupt context.

    The rIOPF thread executes rIOPF_thread(). The first thing it does is check whether there are any rIOPFs it needs to resolve. If there are none, it goes to sleep until an rIOPF occurs. After the thread is woken up, it makes sure that the designated ring has an available descriptor where it can store the faulting packet. The condition is likely to be true for most of the packets. However, our backup ring mechanism allows the hypervisor to buffer more than “ring size” packets for a guest. The reason behind this decision is that after an rIOPF occurs and until it is resolved, the NIC does not notify the guest about the reception of new packets. Consequently, the guest does not post new buffers in the receive ring and there is an overflow risk. As a result of this design, the number of faulting packets in the backup ring might exceed the number of entries in the original receive ring. Since the hypervisor does not want to drop packets, it will have to process a number of packets, ask the NIC to report that they were received, and then wait for the guest to post new buffers. The waiting can be implemented either in software only, by using sleep and polling, or it can be hardware assisted: the hypervisor will ask the NIC to raise an interrupt when the guest changes the tail of the ring and go to sleep until the interrupt arrives. After making sure that there is a free descriptor, the IOPF thread will make sure that the descriptor and the buffers are all present. It will copy the packet into the descriptor's buffer, update the faulting-packets bitmap, and tell the NIC that an IOPF has been resolved.

For simplicity's sake, our pseudo-code handles the faulting packets one at a time. A

practical implementation is likely to use batching, and possibly multiple threads, when

handling the faulting packets (as sketched below). Such an implementation would read a predefined number

of faulting packets and start the I/O required to make all of them present. After the

I/O is done, the hypervisor would copy the faulting packets one at a time and tell the

NIC that the whole chunk has been resolved using a single operation.
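
A batched variant could be organized roughly as follows. All helper names here (dequeue_unresolved, start_page_in, wait_for_page_in, and so on) are hypothetical; the sketch only illustrates the structure described in the previous paragraph.

// Sketch of a batched rIOPF resolver (hypothetical helpers).
#define BATCH 32
void resolve_riopf_batch(struct ring_data *r) {
    struct faulting_pkt *batch[BATCH];
    int n = dequeue_unresolved(r, batch, BATCH);  // grab up to BATCH pending rIOPFs
    for (int i = 0; i < n; i++)
        start_page_in(batch[i]);                  // start the I/O for every faulting page
    wait_for_page_in(batch, n);                   // wait once for all of the I/O to finish
    for (int i = 0; i < n; i++) {
        wait_for_free_descriptor(r, batch[i]);    // the guest may need to post new buffers
        copy_to_guest_ring(r, batch[i]);          // copy the packets one at a time
    }
    nic_resolve_riopfs(r);                        // one command resolves the whole chunk
}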

Figure 7.3 shows the corresponding NIC hardware pseudo-code. The struct ring_t

contains the state the hardware maintains for each ring. The head_offset variable is

used to determine where the next incoming packet should be stored. When there are no

unresolved rIOPFs, head_offset is zero and the packet is stored at head. When there

are unresolved rIOPFs, head keeps pointing to the index of the first rIOPF-triggering

descriptor, as we cannot inform the guest about the reception of new packets before this

rIOPF has been resolved.


struct ring_t {
    int size;
    descriptor_t *descriptor;
    int tail;
    int head;
    int head_offset;
    int bitmap_size;   // number of bits in bitmap (added for completeness; used below)
    int bitmap_index;
    bit *bitmap;
};

// HW - receive packet: invoked if pkt is
// specifically designated for r, or if broadcast
void recv(ring_t r, Packet pkt) {
    if (!store_in_ring(r, pkt))
        store_in_backup(r, pkt);
}

// HW - invoked by the hypervisor after a rIOPF has been resolved
void resolve_rIOPFs(ring_t r) {
    i = r.bitmap_index;
    while (r.head_offset > 0 && r.bitmap[i] == 0) {
        atomic { r.head_offset--; r.head = (r.head + 1) % r.size; }
        i = (i + 1) % r.bitmap_size;
    }
    r.raise_isr();   // takes care of coalescing
}

bool ring_overflow(ring_t r) {
    tail = r.tail;
    if (tail < r.head)
        tail += r.size;
    return r.head + r.head_offset >= tail;
}

bool store_in_ring(ring_t r, Packet pkt) {
    if (!ring_overflow(r) && r.is_descriptor_present(r.tail)) {
        head = (r.head + r.head_offset) % r.size;
        r.descriptor[head].store(pkt);
        if (r.head_offset != 0)
            r.head_offset++;
        else {
            r.head = (r.head + 1) % r.size;
            r.raise_isr();   // takes care of coalescing
        }
        return TRUE;
    }
    return FALSE;
}

void store_in_backup(ring_t r, Packet pkt) {
    offset = r.head_offset;
    if (offset < r.bitmap_size && backup.tail != backup.head) {
        offset = (offset + r.bitmap_index) % r.bitmap_size;   // wrap within the bitmap
        head = (r.head + r.head_offset) % r.size;
        backup.descriptor[backup.head].store(concat(r.id, head, offset, pkt));
        backup.head = (backup.head + 1) % backup.size;
        r.bitmap[offset] = 1;
        r.head_offset++;
        backup.raise_isr();   // takes care of coalescing
    }
}

Figure 7.3: Hardware pseudo-code for the backup ring.


The NIC always tries to store a new incoming packet at index head + head_offset. bitmap

is the bitmap mentioned earlier in the description of the ring data, and bitmap_index is

the index of the bit in the bitmap corresponding to the descriptor at index head in the ring.

The recv() function describes how the NIC handles a packet, pkt, designated for the

ring r. The NIC first checks whether the packet can be stored in the ring r; if it

cannot, the packet is redirected to the backup ring. store_in_ring() tries to store the

packet in the ring r. The conditions for using the ring are that the target index does not

exceed the tail and that the relevant descriptor and buffers are present1. Assuming the

conditions above hold, the packet is stored directly in the ring. If there are unresolved

rIOPFs for this ring, we only advance head_offset; if there are no unresolved

rIOPFs, we advance head and raise an interrupt to signal the reception of a new packet

in the ring r.

Alternatively, if the packet is redirected to the backup ring, store_in_backup() is

executed. It checks that the distance from the first unresolved packet does not exceed

the bitmap size and that there is room in the backup ring. If this is not the case,

the packet is dropped. Assuming the packet is not dropped, we append additional

information to the packet in order to help the hypervisor resolve the rIOPF, mark the

bitmap, and raise an interrupt to the hypervisor. We also advance head_offset to skip

an entry in the designated ring.

The resolve_rIOPFs() flow is executed when the hypervisor notifies the ring that

a rIOPF has been resolved. It uses the bitmap to update head to point to the next

unresolved rIOPF, or to the top of the ring if there are none. We note that while this

loop might take some time, it does not have to be atomic with respect to packet reception.

Only head and head_offset must be updated together, because the destination of new

packets is determined by their sum.

    7.3 Implementation

Our backup ring evaluation had to overcome two hardware limitations. The hardware

IOPF implementation did not support redirecting packets to a secondary receive ring

upon a page fault. Nor did it support the combination of IOPF and SR-IOV.

Consequently, to evaluate the backup ring solution, we approximated it using software.

Lightweight virtualization To address the lack of support for SR-IOV instances,

the evaluation was performed with lightweight virtualization. Each lightweight virtual

machine used a user space TCP/IP stack and kernel-bypass technology based upon

IB-verbs [Inf15]. Linux cgroups [Men] were used to limit the memory available to each

lightweight virtual machine.

1A real implementation would also check that the incoming packet is not too large for the corresponding descriptor and drop the packet without storing it to the backup ring if this is not the case. For simplicity, we ignore this complication.


while (1) {
    /* Poll the driver, get any outstanding frames,
       allocate memory for them, and call netif->input. */
    poll_driver(netif);
    /* Handle all system timeouts for all core
       protocols and the application. */
    sys_check_timeouts();
}

Figure 7.4: lwIP main loop.

A good user space TCP/IP stack was surprisingly hard to find. The candidates

we considered were OpenOnload [Rid], libvma, and lwIP. OpenOnload only works on

Solarflare NICs, and porting it to a Mellanox NIC is difficult. libvma was closed source

when we started the work and only became open source when we already had a working

system. It is an interesting library, but making it work requires some effort, so it should

be considered future work.

The option we chose was lwIP. We discovered that it is targeted at embedded

systems and as a result has many shortcomings: it is single threaded, its malloc is very

slow, it has no support for window scaling or hardware offloading, and its socket API

is very slow. To address these shortcomings, we replaced the malloc that came with

lwIP with dlmalloc [Lea] and borrowed some patches from libvma, which is also based

on lwIP. Among other things, this gave us support for window scaling.

To address the poor performance of lwIP's socket API, we followed the advice of

[Gol] and used the raw API of lwIP. Unlike the socket API, the raw API is event driven

rather than sequential. The main loop of an application using the raw API should look

like the one in Figure 7.4 [SC]. The application itself should be event driven, and the

application code should be called from lwIP callbacks. The lwIP callbacks are

invoked on different events: when a new connection is established, when new data is received, etc.

As a result, porting a generic application to use the raw API is nontrivial.
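
To illustrate what the raw API looks like, the following sketch registers the accept and receive callbacks for a TCP listener. The port number and the processing step are placeholders, and error handling is omitted; it is not taken from our memcached port.

#include "lwip/tcp.h"

// Receive callback: invoked by lwIP whenever data arrives on the connection.
static err_t on_recv(void *arg, struct tcp_pcb *tpcb, struct pbuf *p, err_t err) {
    if (p == NULL) {              // the remote side closed the connection
        tcp_close(tpcb);
        return ERR_OK;
    }
    // ... application-specific processing of p->payload goes here ...
    tcp_recved(tpcb, p->tot_len); // re-open the advertised receive window
    pbuf_free(p);
    return ERR_OK;
}

// Accept callback: invoked when a new connection is established.
static err_t on_accept(void *arg, struct tcp_pcb *newpcb, err_t err) {
    tcp_recv(newpcb, on_recv);    // register the receive callback for this connection
    return ERR_OK;
}

void server_init(void) {
    struct tcp_pcb *pcb = tcp_new();
    tcp_bind(pcb, IP_ADDR_ANY, 7777);  // 7777 is an arbitrary example port
    pcb = tcp_listen(pcb);
    tcp_accept(pcb, on_accept);
}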

Backup ring approximation Due to the inability to redirect packets to a secondary

receive ring upon a page fault, we had to settle for an approximation of the

real backup ring. Instead of redirecting only rIOPF-triggering packets, we used an

existing hardware feature to duplicate all incoming packets into a secondary receive ring.

This secondary ring is populated with pre-allocated pinned buffers. In the absence of

rIOPFs, the duplicated packets stored in the secondary ring are ignored and discarded

by the software. However, when packets are dropped from the primary receive ring due

to a rIOPF, copies of those packets are still written to the secondary ring, allowing us to

avoid packet loss. When a rIOPF occurs, the software is notified and starts collecting

the dropped packets from the secondary ring. The software then waits for the rIOPF to

be resolved before forwarding copies of those packets to the network stack. The copying

is done to allow reusing the pinned buffers of the secondary ring and to improve the

approximation. Since the hardware neither skips faulting receive ring entries nor reports

how many packets are dropped during a rIOPF, packet content matching is used to

detect when to switch back to the primary ring.


void discard_duplicates(bool block) {
    do {
        while (packets_to_discard != discarded_packets &&
               !secondary_ring.is_empty()) {
            secondary_ring.post_buffer(secondary_ring.consume_buffer());
            discarded_packets++;
        }
        // if the caller asked to block, keep discarding duplicates
    } while (block && packets_to_discard != discarded_packets);
}

void sync(void) {
    in_sync = FALSE;
    while (!in_sync && !secondary_ring.is_empty()) {
        p = secondary_ring.dequeue();
        if (!primary_ring.is_empty() &&
            memcmp(primary_ring.peek(), p, len(p)) == 0) {
            unprocessed.enqueue(primary_ring.consume_buffer());
            in_sync = TRUE;
        } else {
            // primary_ring.is_empty() || memcmp(...) != 0
            new_buf = alloc_buf();
            memcpy(new_buf, p, len(p));
            unprocessed.enqueue(new_buf);
        }
        secondary_ring.post_buffer(p);
    }
}

void fast_path(void) {
    while (!primary_ring.is_empty()) {
        unprocessed.enqueue(primary_ring.consume_packet());
        primary_ring.post_buffer(alloc_buf());
        packets_to_discard++;   // from backup
    }
    // throw duplicated packets from backup but don't block:
    discard_duplicates(/*block = */ FALSE);
}

void slow_path(void) {
    clear_rIOPF_flag();
    // block until there are no more duplicates
    discard_duplicates(/*block = */ TRUE);
    // remove all packets from backup
    sync();
    resolve_rIOPF();
    // sync with the primary receive queue
    sync();
}

void poll_RX(void) {
    rIOPF = check_rIOPF_flag();
    fast_path();
    if (rIOPF)
        slow_path();
    if (!unprocessed.is_empty())
        // pass a packet to the network stack
        netif_input(unprocessed.dequeue());
}

Figure 7.5: Pseudo-code for the backup ring approximation.


The pseudo-code of our approximation is shown in Figure 7.5. The main function

of interest is poll_RX(), whose job is to decide which packet should be passed next to

netif_input(), which does the network stack processing. It checks the rIOPF flag

and then executes fast_path(), which re-arms the primary ring and moves all

new packets to a software-maintained unprocessed-packets queue. After removing the

new packets from the primary ring, discard_duplicates() is used to discard copies of

those packets from the secondary ring. This is done in a nonblocking manner to improve

the performance in the absence of rIOPFs. If the rIOPF flag was set before executing

fast_path(), slow_path() is also invoked. It first discards possible remaining

duplicates and calls sync() to drain and re-arm the secondary ring. Next, the rIOPF is

resolved, making the primary ring operational again, and sync() is called once more

to drain and re-arm the secondary ring until it is either empty or we find a matching

packet in the primary ring. Finally, a packet from the unprocessed queue is pushed to

the network stack.


We note that our approximation does not have a clear guest/host separation and

lacks context switches; as a result, it cannot tell us much about the CPU usage of

the real solution. However, we believe it approximates the delayed passing of faulting

packets to the network stack and can teach us about the behavior of the network and

the network stack in the presence of IOPFs.

Figure 7.6 shows the performance of a memcached port that uses our lwIP network

stack. We can see that in the absence of IOPFs, our backup approximation does not

significantly harm performance. We further see that our memcached port, which uses

user space I/O, is about twice as fast as the native version up to a value size of 4 KB.

For large value sizes, the native version benefits significantly from hardware TCP

offloading, which is not supported in lwIP.


[Figure: two panels plotting normalized throughput and throughput [Gb/s] against value size (1B to 256KB) for four configurations: lwip (pin), lwip backup buffer (no pin), linux (pin), and linux w/ offload (pin).]

Figure 7.6: lwIP vs. Linux performance evaluation.


Chapter 8

    Evaluation

    8.1 Methodology

    We evaluate the impact of IOPFs in the context of network devices. Initially, we strive

    to characterize the latency of a single IOPF or a single invalidation event. We then look

at the cold ring problem, measuring its effect and the efficacy of the solution. In §8.4, we measure the impact of an IOPF on the network behavior of the system under synthetic

    load. In the following subsections we evaluate the performance impact of using IOPFs

    in a variety of real-life scenarios. We examine use cases of high performance computing

    (HPC) interconnect fabric, Web 2.0 workloads, and storage systems workloads.

Experimental Setup The setup for the TCP over Ethernet evaluation comprises

two identical Dell PowerEdge R210 II Rack Server machines that communicate through

Mellanox ConnectX-3 40 Gbit/sec Ethernet NICs. The NICs are connected back-to-back.

Each machine has 8GB of 1333MHz memory and a single-socket 4-core Intel Xeon

E3-1220 CPU running at 3.10GHz. The machines run Ubuntu 13.10 with a Linux 3.11.4

kernel modified to support IOPFs.

The HPC and storage experiments used a test cluster with 8 computing nodes. The

nodes were HP ProLiant DL380p Gen8 servers with dual-socket Intel Xeon E5-2697

v2 (Ivy Bridge) CPUs and 128GB of RAM. Each node had a single Connect-IB card

installed. The cluster was connected by a single SwitchX-2 SX6036 switch. The nodes

ran RedHat 7.0 with kernel version 3.10.0-123.el7.x86_64. We used a Connect-IB driver

based upon the driver in the Mellanox OFED 2.4 package.

    Both setups were tuned for performance and to avoid reporting artifacts caused by

nondeterministic events. All power optimizations, namely sleep states (C-states) and dynamic

voltage and frequency scaling (DVFS), were turned off. Hyper-threading was disabled.

    In our backup ring implementation, the receiver duplicates each packet into two

    buffers. The receiver’s PCIe bus becomes a bottleneck compared to the transmitter

    and the network. This asymmetry causes packet loss that disturbs our measurements.

    In order to avoid this, Ethernet flow control [IEE97] was enabled. In addition, this


bottleneck gives the backup ring approach an unfair disadvantage compared to the drop

configuration. To address this issue, our driver duplicates the incoming packets in both

configurations. When working in the drop configuration, the copy is simply discarded.

[Figure: breakdown of minor IOPF handling time (microseconds) for four configurations (kernel=event/user=poll, kernel=poll/user=poll, kernel=event/user=event, kernel=poll/user=event); components: interrupt latency, until IOPF thread starts, read faulting ring entry, get_user_pages()+update IO PT, invalidate IOTLB, pagefault resolve.]

Figure 8.1: Minor IOPF handling breakdown for ConnectX-3.

    8.2 Cost of IOPFs on ConnectX-3

Figure 8.1 shows the results of a micro-benchmark measuring how long it takes to

resolve an IOPF in the ConnectX-3 implementation. We limited our kernel to use only

one CPU. We instrumented the Linux kernel to store timestamps at various points in

the IOPF handling flow and wrote a simple application that does many iterations of

the following: take a timestamp, post a packet to be sent, busy-wait until the

corresponding completion appears in the CQ, retrieve the timestamps recorded by the

kernel, and log all the timestamps (this measurement loop is sketched below). The average of 1 million iterations can be seen

    in the leftmost column of Figure 8.1. The breakdown is as follows. First there is an

    ’interrupt latency’, which includes the time it takes the NIC to identify and raise an

    IOPF interrupt. It also includes a very minor software component of waking up an IOPF

    thread to handle the IOPF. Next, there is a small scheduling delay represented by ’until

    IOPF thread starts’. The IOPF thread then asks the NIC to read the faulting entry

from the ring on its behalf. It then calls get_user_pages() to make sure the pages are

present and updates the I/O page table. Finally, there are two commands, 'invalidate

IOTLB' and 'pagefault resolve', that need to be issued in order to resolve the IOPF.
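
The measurement loop can be sketched as follows. ibv_post_send() and ibv_poll_cq() are the standard IB-verbs calls; struct sample and read_kernel_timestamps() are hypothetical stand-ins for our instrumentation, not actual interfaces of the driver.

#include <infiniband/verbs.h>
#include <x86intrin.h>

struct sample { unsigned long long user_start; /* + kernel-recorded timestamps */ };

// One measurement iteration: timestamp, send, busy-wait on the CQ,
// then collect the timestamps recorded by the instrumented kernel.
static void measure_once(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_send_wr *wr, struct sample *out) {
    struct ibv_send_wr *bad_wr;
    struct ibv_wc wc;

    out->user_start = __rdtsc();          // user-space timestamp
    ibv_post_send(qp, wr, &bad_wr);       // post one packet to be sent
    while (ibv_poll_cq(cq, 1, &wc) == 0)  // busy-wait until the completion appears in the CQ
        ;
    read_kernel_timestamps(out);          // hypothetical: retrieve the kernel's timestamps
}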


After seeing those results, we were a bit surprised that handling a single IOPF

takes so long, as a minor CPU page fault on the same machine takes about 0.3 µs. We

    noticed that a significant portion of the time is spent waiting for the NIC to execute

    the relevant commands. Looking at the code, we noticed that commands are executed

    in an event driven manner. Namely, after a command is issued to the NIC, the issuer

goes to sleep until the NIC sends an interrupt to notify the issuer about the command's

completion. We were concerned that so much time is spent waiting for commands to

complete because the scheduler might introduce an arbitrarily long delay between the

time the command issuer is woken up and the time it receives the CPU.

To see whether our concern was real, we modified the kernel to work in a

polling mode, whereby instead of sleeping after issuing a command, the issuer enters

a busy-wait loop, continuously asking the NIC for the command status until the NIC

reports that it has completed. Scheduling is thereby avoided, at the price of the CPU not doing

any useful work while commands are being issued to the NIC. The results can be seen

in the second column of Figure 8.1. To our surprise, the results were significantly worse.

In particular, the average time until the IOPF thread begins executing was significantly

lengthened. We investigated the issue and found that every so often the scheduler

keeps running our user-level application for a relatively long time after the kernel IOPF

thread is woken up. This behavior did not happen in every iteration of our send

    loop, but when it did happen the resulting delay was long enough to skew the average.

    Because Linux scheduling is quite irrelevant to our research, and because this issue only

    occurs when we run the kernel driver in the non-standard polling mode, we decided

    not to pinpoint the exact scenario that causes this anomaly1. Instead we modified our

    user application to work in an event driven mode. After a packet is queued for sending,

    rather than doing a busy wait loop, our application goes to sleep and asks to be woken

    up when something is posted to the CQ. This way, our application is sleeping while

    the IOPF is handled and does not compete with the IOPF thread for the CPU. The

    anomaly is thus avoided. The results of this modification with the kernel working in

    either polling or event driven mode can be seen in the next two columns of Figure 8.1.

    We can see that using polling in the kernel is indeed slightly faster. However, the

    difference is negligible, justifying the use of the event driven mode, which allows the

    CPU to do useful work while waiting for the NIC to execute a command.
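
In IB-verbs terms, the event-driven user application replaces the busy-wait with a completion channel, roughly as sketched below. This is an illustration of the standard verbs pattern, not our exact benchmark code; it assumes the CQ was created on channel and armed with ibv_req_notify_cq() before the send was posted.

#include <infiniband/verbs.h>

// Wait for a completion without busy-polling: sleep on the CQ's completion
// channel until the NIC signals that something was posted to the CQ.
static int wait_completion(struct ibv_comp_channel *channel) {
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))  // sleep until the CQ event arrives
        return -1;
    ibv_ack_cq_events(ev_cq, 1);
    ibv_req_notify_cq(ev_cq, 0);                     // re-arm for the next completion
    return ibv_poll_cq(ev_cq, 1, &wc);               // reap the completion itself
}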

    8.3 Cost of IOPFs and Invalidations on ConnectX-IB

On ConnectX-IB, we evaluated both the IOPF and the invalidation flows for 4KB

(page size) messages and for larger 4MB messages. Large messages are native to

InfiniBand/RDMA-based communication. They are also applicable to Ethernet when using offloads [FHL+05].

1 However, for real implementations, one might want to take measures to avoid this situation. Namely, it is undesirable for the IOPF handling thread to have to wait for the user application to release the CPU while this application is busy polling the NIC waiting for a packet to be sent.


[Figure: stacked-bar breakdowns of execution time (microseconds) for 4KB and 4MB messages. Panel (a), IOPF flow: trigger interrupt [hw only], os overhead [sw only], update hw PT [sw + hw], resume process [hw only]. Panel (b), invalidation flow: check shadow PT [sw only], update hw PT [sw + hw], update shadow PT [sw only].]

Figure 8.2: (a) IOPF and (b) invalidation flow execution breakdown on ConnectX-IB.

The evaluation was performed using an InfiniBand request-response

micro-benchmark modified to call madvise(..., MADV_DONTNEED) and initialize

the memory that was going to be used for transmitted messages. The call itself

    triggered the invalidation flow, and the send operation that came next triggered the

    IOPF flow. The initialization is important because omitting it would have caused the

    relevant pages to be zeroed out during the IOPF flow. The benchmark ran on a Linux

    kernel that was instrumented using kprobes to store timestamps at points of interest

    along those flows.
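
In code, the per-iteration modification amounts to something like the following. This is a sketch, not the benchmark's exact code; buf and len denote the message buffer and its size.

#include <sys/mman.h>
#include <string.h>

// Drop the pages backing the message buffer, then re-populate them from the CPU.
static void drop_and_reinit(void *buf, size_t len) {
    madvise(buf, len, MADV_DONTNEED);  // invalidates the mapping: triggers the invalidation flow
    memset(buf, 0x5a, len);            // initialization: the CPU faults the pages back in, so the
                                       // subsequent IOPF does not also have to zero them out
}
// The next send operation that references buf then triggers the IOPF flow.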

    Figure 8.2(a) shows the timing breakdown of the IOPF flow. The different message

    sizes allow us to identify what parts of the page-in process are sensitive to the amount

    of data mapped. By comparing the results for the different sizes, we can tell that the

    fixed page-in cost amounts to about 220 µs. The cost per page is about 100 ns, roughly

    2-3 memory accesses. For an IOPF of a single page, the run time is dominated by the

    HW overheads. This is composed of the ‘trigger interrupt’ and the ‘process resume’

    phases. The former is the time it takes the hardware to notice and report an IOPF,


while the latter is the time required for the hardware to resume the transmission of a

ring after the IOPF is resolved. These phases account for 90% of the fixed page-in cost.

    As the number of pages requested in the page fault increases, the ‘os overhead’ phase

    becomes more dominant. During this phase, the driver detects the page fault, reads

    the ring to determine which virtual pages need to be mapped, and asks the OS for the

    physical addresses. Based upon finer profiling, we know that the major time consumer

    is the OS providing the driver with physical page addresses. The increased size of the

    page-in request also prolongs the ‘update hw PT’ phase, in which the hardware page

    table is updated with the appropriate PTEs.

    Figure 8.2(b) shows the timing breakdown of the invalidation flow. First, in the

    ‘check shadow PT’ phase, the software finds the memory region in which the invalidation

    occurred, and scans a shadow copy of the hardware PT to see whether any of the

    mappings were visible to the hardware. If the pages were not mapped through an IOPF,

    this is the only overhead incurred. If the pages were mapped, the driver updates the

    hardware table, in the ‘update hw PT’ phase. Finally, in the ‘update shadow PT’ phase,

the driver updates the shadow copy of the hardware PT to indicate that the pages are no

longer visible to the hardware. Parallelizing the 'update shadow PT' and 'update hw PT'

phases is possible. However, doing so involves a lock-granularity trade-off, which hurts

    the common case.

    8.4 Network Transport and IOPF Interplay

    The page-in latency measured above is relevant to all IOPF-capable devices. We now

    move to measuring phenomena that arise from the interplay between IOPFs and the

    network-specific transport.

    8.4.1 Impact of Periodic IOPFs on Bandwidth

We used a simple stream benchmark to measure the impact of periodic IOPFs on bandwidth.

The benchmark strongly resembles netperf's TCP_STREAM benchmark [J+96]. The

sender side performed 64 KB sends in an infinite loop using a standard Linux TCP stack.

The receiver side ran our lwIP stack and discarded received packets as soon as it received

them, while keeping track of how much data was received. To allow comparison between

Ethernet and InfiniBand, we also used the ib_send_bw InfiniBand micro-benchmark from

    the perftest package as an equivalent InfiniBand stream workload. In order to highlight

    the impact of IOPFs and suppress other memory pressure phenomena, we synthetically

    generated rIOPFs at a variable frequency. The benchmark pre-faulted the entire receive

    ring at start, so that the cold ring problem would not affect the measurements.
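
The sender side is essentially the following loop (a minimal sketch; socket setup is omitted and sock is assumed to be an already-connected TCP socket):

#include <sys/socket.h>

// Sender: 64 KB sends in an infinite loop over a standard Linux TCP socket.
static void stream_sender(int sock) {
    static char buf[64 * 1024];
    for (;;)
        send(sock, buf, sizeof(buf), 0);
}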

We forced a minor IOPF using mprotect: we changed the permission of the relevant

page to read-only and then back to read/write2. The mprotect calls invalidated the

page mapping and forced an IOPF upon the next access.

2We use mprotect, rather than madvise, as here we want to preserve the meta-data in the page.


Triggering a major IOPF is

more complicated. Our implementation used writes to a file opened with O_DIRECT

    to evict pages from the page cache. An mmap of the same file experienced a major

    page fault when accessing the same page following the write. We also had to modify

    lwIP’s code such that it would not touch the relevant page prematurely. LwIP touched

    the page before posting it to the NIC’s receive ring, causing a major page fault on

the CPU, as opposed to a major IOPF. To evaluate the hardware-based

RNR-NACK implementation in a similar manner, we ran an equivalent InfiniBand benchmark:

we modified ib_send_bw to trigger a minor page fault once every X messages.
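
The minor-IOPF trigger boils down to the following mprotect round trip (a sketch; page and page_size denote the targeted receive-buffer page and the system page size):

#include <sys/mman.h>

// Flip the page's protection to read-only and back. The protection changes
// invalidate the device's mapping of the page, so the NIC's next access to it
// raises a minor IOPF, while the data itself stays resident in memory.
static void force_minor_riopf(void *page, size_t page_size) {
    mprotect(page, page_size, PROT_READ);
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}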

[Figure: throughput [Gb/s] as a function of rIOPF frequency (2^-10 to 2^-25). Upper panel (Ethernet): minor drop, major drop, minor backup, major backup. Lower panel (InfiniBand): minor hw.]

Figure 8.3: Throughput of a stream benchmark in the presence of rIOPFs of varying frequencies.

The results are shown in Figure 8.3. Note that due to the different setup, the

    InfiniBand benchmark has a different y-axis. The backup ring approximation significantly

    improves performance for both major and minor page faults. In the case of drop, the type

    of page fault does not matter because the TCP retransmission timeout is significantly

    longer than the time it takes to resolve a major page fault. The hardware implementation,

    shown in the lower figure, notifies the remote sender immediately upon a page fault.

    The notification allows the sender to use a relatively short IOPF-specific timeout,


resulting in significant performance improvement relative to drop. Nevertheless, network

utilization-wise, this solution is less efficient than the backup ring solution.

[Figure: four panels plotted against time [seconds] for the backup, drop, and pinning configurations: (a) throughput [Gbps], (b) cwnd [packets], (c) retransmitted packets, (d) recovered packets.]

Figure 8.4: Transient operation of a TCP stream benchmark over time in the presence of minor rIOPFs.

    Figure 8.4 examines the steady state behavior of the TCP stream benchmark over

    time, given a fixed rIOPF frequency of one in 1M packets. We added a baseline pinning

    configuration in which we run our stream test with no IOPFs at all. Figure 8.4(a) shows

that the IOPF configurations experience a decrease in throughput when a rIOPF occurs.