
Transcript
  • I/O Page Faults

    Ilya Lesokhin

    Technion - Computer Science Department - M.Sc. Thesis MSC-2015-21 - 2015


  • I/O Page Faults

    Research Thesis

    Submitted in partial fulfillment of the requirements

    for the degree of Master of Science in Computer Science

    Ilya Lesokhin

    Submitted to the Senate

    of the Technion — Israel Institute of Technology

    Heshvan 5776 Haifa November 2015


  • This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty

    of Computer Science.

    Results pertaining to the InfiniBand setup were generated by the Mellanox team; the Ethernet results were produced by the author of the thesis.

    Acknowledgements

    I would like to thank my advisor, Prof. Dan Tsafrir, for teaching me how to do research and pushing me when I wanted to give up. I would also like to thank fellow students Nadav Amit, Omer Peleg and Muli Ben-Yehuda for many fruitful discussions and the technical help they provided during my work on this thesis.

    I thank Mellanox, and especially the architecture team, Liran Liss, Shachar Raindel and Haggai Eran, for providing the hardware, many of the results, and help with the writing; without their support this work would not have been possible. Finally, I would like to thank my parents for supporting me all along.

    The generous financial help of the Technion is gratefully acknowledged.


  • Contents

    List of Figures
    Abstract
    Abbreviations and Notations
    1 Introduction
    2 Background
        2.1 PCIe Primer
        2.2 Virtual Memory
        2.3 InfiniBand Primer
    3 Motivation
    4 Basic IOPF Support
        4.1 Non-Recoverable Failures
        4.2 IOPF support
    5 The Page Fault Latency Problem
    6 HW Transport IOPF Support
    7 SW Transport IOPF Support
        7.1 The Cold Ring Problem with TCP
        7.2 The Backup Ring
        7.3 Implementation
    8 Evaluation
        8.1 Methodology
        8.2 Cost of IOPFs on ConnectX-3
        8.3 Cost of IOPFs and Invalidations on ConnectX-IB
        8.4 Network Transport and IOPF Interplay
            8.4.1 Impact of Periodic IOPFs on Bandwidth
            8.4.2 Cold Ring Problem and Backup Ring
        8.5 System-level IOPF Evaluation
            8.5.1 Cloud and Web 2.0 Environment
            8.5.2 Applications with Direct-I/O
    9 Related Work
        9.1 Existing Direct Device Assignment Solutions
        9.2 Generic IOPF support
        9.3 Networking IOPF
        9.4 GPUs and other accelerators
        9.5 Handling latency
    10 Discussion and Future Work
        10.1 Problems with ATS/PRI
        10.2 Optimizations and Future Work
    11 Conclusion
    Hebrew Abstract

  • List of Figures

    4.1 IOPF (1 – 4) and invalidation (a – d) flows
    7.1 High level design of the backup ring.
    7.2 Software pseudo-code of the backup ring
    7.3 Hardware pseudo-code for the backup ring.
    7.4 lwIP main loop
    7.5 Pseudo-code for the backup ring approximation.
    7.6 lwIP vs. Linux performance evaluation.
    8.1 Minor IOPF handling breakdown for ConnectX-3.
    8.2 (a) IOPF and (b) invalidation flow execution breakdown on ConnectX-IB.
    8.3 Throughput of a stream benchmark in the presence of rIOPFs of varying frequencies.
    8.4 Transient operation of a TCP stream benchmark over time in the presence of minor rIOPFs.
    8.5 (a), (b) Startup with 64 entries in receive ring. (c) Time it takes to perform 10,000 operations as a function of receive ring size.
    8.6 Pinning vs. IOPF with dynamic working set: (a) with IOPFs (b) with pinning (c) combined throughput
    8.7 The transition period.
    8.8 Flipping the working set with different swap devices.
    8.9 No swap experiment.
    8.10 Rare good results with HDD.
    8.11 System never recovers (90-10-90).
    8.12 (a) Storage bandwidth with single initiator and varying memory limit. (b) Memory usage with multiple initiators and a fixed memory limit.
    8.13 IMB running time for different MPI operations by message size. The ratio between the copying and the pinning run-times is shown in the labels.

  • Abstract

    Virtual memory is used in most modern general-purpose computer systems. This invention simplifies systems and increases their usability and efficiency. In recent years, I/O devices have also started using virtual addresses. However, support for I/O page faults is still lacking. I/O devices are designed under the assumption that the virtual addresses they use are always valid, and software is forced to make sure that this is indeed the case. This deficiency deprives one class of software of the benefits of virtual memory: it prevents memory overcommitment, complicates the programming model, and hinders administration. The affected software class is exclusively comprised of software that performs direct I/O, which is the act of accessing I/O devices without any involvement of intermediary privileged software such as the operating system (OS) kernel or the hypervisor. Prominent examples are direct device assignment of SR-IOV (single root I/O virtualization) instances in virtualization scenarios and kernel-bypass access to I/O devices by user-space applications.

    This thesis presents working hardware and software support for I/O page faults (IOPFs) in a network interface card (NIC). It describes the challenges involved in implementing this support and demonstrates that an IOPF-enabled NIC allows for efficient memory overcommitment.


  • Abbreviations and Notations

    I/O : Input/Output

    IOPF : I/O page fault

    rIOPF : receive I/O page fault

    MMU : Memory Management Unit

    IOMMU : I/O Memory Management Unit

    TLB : Translation Lookaside Buffer

    IOTLB : I/O Translation Lookaside Buffer

    VA : Virtual Address

    IOVA : I/O Virtual Address

    OS : Operating System

    DMA : Direct Memory Access

    NIC : Network Interface Card

    TX : Transmit

    RX : Receive

    PT : Page Table

    PTE : Page Table Entry

    IP : Internet Protocol

    TCP : Transmission Control Protocol

    VM : Virtual Machine

    API : Application Programming Interface

    TPS : Transactions Per Second

    HPC : High-Performance Computing

    CWND : Congestion Window

    RTT : Round-Trip Time


  • Chapter 1

    Introduction

    The availability of physical memory often determines the performance of the system

    for a given workload. Lack of sufficient memory may even render the system unusable.

    Virtual memory, which was introduced in the 1960s, allows a computing system to run

    multiple workloads concurrently while sharing the physical memory in a transparent

    manner. In addition, virtual memory optimizes physical memory usage by holding only

    the necessary working set of each workload.

    Virtual memory is usually implemented by the CPU’s memory management unit

    (MMU), and is thus not exposed to I/O devices. I/O memory management units

    (IOMMUs), which are integrated in modern servers and devices, provide similar MMU

    services to I/O devices. However, until recently, I/O devices were not able to inform the

    operating system about page fault events. This deficiency limits the I/O devices’ virtual

    memory support to isolation only; I/O devices cannot handle dynamically changing

    working sets.

    Nonetheless, IOMMU-enabled devices are prevalent today. Prominent examples include direct device assignment of I/O devices to virtual machines (VMs) [RS07, WSC+07, YBYW08], high performance computing (HPC) applications [JLJ+04], and packet processing applications [Int]. In §2 we provide a brief overview of PCIe, virtual memory, and recent advancements in IOMMU technology.

    By adding support for I/O Page Faults (IOPF), I/O devices become true first-class citizens in virtual memory (§3). IOPF support provides the means for I/O devices to directly access virtual memory pages, which are not guaranteed to be resident in physical memory at the time of access. It comprises two complementary mechanisms: (1) allowing an I/O device to request from the OS physical mappings of currently non-present pages on demand; (2) allowing the OS to invalidate mappings on the I/O device. In this work we focus on NICs, one of the most demanding classes of I/O devices, and provide a prototype implementation of IOPF support.

    Initially, we detail the design trade-offs and the implementation of the fundamental building blocks necessary for IOPF (§4). These building blocks have broad applicability, and may serve any class of I/O device that supports IOPFs.


  • Next, we observe that basic IOPF support is not enough. For a large class of I/O devices it is crucial to efficiently tolerate the increased latency incurred by page faults, which now occur directly in the I/O fast path. We elaborate on this observation for the general case and for the specific case of networking (§5).

    We then describe how IOPF latency may be tolerated in two prominent implementation approaches in use today for high-end network devices: HW-offloaded (§6) and SW-managed (§7) transport protocols. In the first, IOPF and transport processing are coupled together, and the intimate knowledge between the two may be leveraged accordingly. In the latter, the I/O device provides the basic IOPF support, while SW manages the transport (e.g., TCP). Here, the coordination is less tight, but the SW implementation allows more freedom in the design space.

    In §8 we provide a performance evaluation of IOPF. We begin by showing the basic costs of a single IOPF in our implementation, followed by the effects of recurring page faults on the network bandwidth. We show that even with a page fault probability of 2⁻¹⁵, we achieve full network bandwidth. Next, we demonstrate the performance gains of IOPF in multiple real-world deployment scenarios. We examine virtualization, storage and HPC scenarios. In both the virtualization and storage use cases, the IOPF implementation achieved about 80% performance gain compared to the current art. The IOPF implementation was also able to pack more VMs on the same physical machine. In the HPC scenario, the IOPF implementation achieved performance on par with the current state of the art, while simplifying the programming model significantly.

    We provide an overview of related commercial and academic works (§9) and discuss insights gained in this work and future directions (§10). To the best of our knowledge, this is the first detailed study and evaluation of IOPF in real-world systems. We conclude in §11.


  • Chapter 2

    Background

    2.1 PCIe Primer

    Presently, most I/O devices are connected to the computer using PCIe (Peripheral Component Interconnect Express) [PS14]. PCIe was designed by PCI-SIG to replace its predecessor, PCI, which is a true bus. PCIe, in contrast, is a point-to-point link. It is nevertheless commonly referred to as a “bus” because it is backward compatible with PCI and, as a result, it behaves like a bus from the software perspective.

    The actual topology, however, is not a bus. It consists of many point-to-point links

    and switches that connect all the peripheral devices to the PCIe root complex device.

    The latter typically resides on die and is responsible for connecting peripheral devices

    to the CPU and memory. It facilitates three important functionalities, as follows.

    The first functionality is Memory Mapped I/O (MMIO). The physical memory space

    contains at least one contiguous address interval, denoted the PCI MMIO range. Such a

    range is owned by the root complex and is thus ignored by the memory controller. Any

    memory operation issued by the CPU that is directed at a PCI MMIO range is handled

    by the root complex. The latter converts the operations into PCIe requests, which are

    then fulfilled by the corresponding I/O devices. This mechanism is denoted MMIO. It

    is used to communicate with the I/O devices, e.g., by allowing access to their registers

    as if they are “ordinary” memory.
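
    To make the MMIO mechanism concrete, the following user-space sketch maps a device's first BAR through Linux's sysfs interface and reads a register from it with an ordinary load. The PCI address 0000:01:00.0 is a placeholder, and a real driver would of course perform such accesses from the kernel; this is an illustration only, not code from the thesis.

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            /* Placeholder BDF; resource0 corresponds to the device's BAR0 (needs root). */
            const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource0";
            int fd = open(path, O_RDWR | O_SYNC);
            if (fd < 0) { perror("open"); return 1; }

            size_t len = 4096;              /* map just the first page of the BAR */
            volatile uint32_t *regs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, 0);
            if (regs == MAP_FAILED) { perror("mmap"); return 1; }

            /* A CPU load from this mapping becomes a PCIe memory read to the device. */
            printf("register 0: 0x%08x\n", regs[0]);

            munmap((void *)regs, len);
            close(fd);
            return 0;
        }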

    We note in passing that MMIO operations can be translated to either PCIe memory

    operations or PCIe I/O operations. I/O operations are slower and should only be used

    for initialization. The PCIe I/O and memory operations operate in different address spaces. Each device has a fixed-size configuration space in the I/O address space. The

    address of this space can be calculated using the device ID. The device ID itself is

    composed of bus, device and function numbers. The configuration spaces expose, among

    other things, the base address registers (BARs) of the devices, through which devices

    are notified to which address ranges in the PCIe memory address space they should

    respond. The BARs themselves must therefore be configured before the PCIe memory

    operations can be used.
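
    As an illustration of how a configuration-space address is formed from the bus, device and function numbers, here is a sketch using the legacy x86 port-I/O mechanism (ports 0xCF8/0xCFC). Modern systems normally use the memory-mapped (ECAM) mechanism, and on Linux one would simply read /sys/bus/pci/devices/<BDF>/config, so this is illustrative only.

        #include <stdio.h>
        #include <stdint.h>
        #include <sys/io.h>   /* iopl(), outl(), inl() -- x86 Linux, needs root */

        #define PCI_CONFIG_ADDRESS 0xCF8
        #define PCI_CONFIG_DATA    0xCFC

        /* Build a legacy configuration-space address from bus/device/function and a
         * register offset; bit 31 enables the access. */
        static uint32_t pci_config_addr(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off)
        {
            return 0x80000000u | ((uint32_t)bus << 16) | ((uint32_t)dev << 11) |
                   ((uint32_t)func << 8) | (off & 0xFCu);
        }

        static uint32_t pci_config_read32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off)
        {
            outl(pci_config_addr(bus, dev, func, off), PCI_CONFIG_ADDRESS);
            return inl(PCI_CONFIG_DATA);
        }

        int main(void)
        {
            if (iopl(3) != 0) { perror("iopl"); return 1; }
            /* Vendor/device ID at offset 0x00, BAR0 at offset 0x10 of device 00:00.0. */
            printf("00:00.0 id   = 0x%08x\n", pci_config_read32(0, 0, 0, 0x00));
            printf("00:00.0 BAR0 = 0x%08x\n", pci_config_read32(0, 0, 0, 0x10));
            return 0;
        }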


  • The second functionality facilitated by the root complex is direct memory access

    (DMA), which allows devices to access the main memory. I/O devices perform DMA

    by issuing PCIe memory read and write requests. The root complex processes these

    requests and passes them to the memory controller similarly to how the CPU does it.

    DMA accesses carry a bit that specifies whether cache coherency should be applied to

    them.

    The third root complex functionality is interrupt delivery. PCIe devices trigger interrupts to receive the attention of the CPU asynchronously. Interrupts are triggered similarly to DMAs, i.e., by the device writing to a special address.¹ The PCIe root complex also implements the IOMMU functionality. MMIO operations to the device are translated by the MMU, whereas DMA operations from the device are translated by the IOMMU. Interrupt requests, which are essentially PCIe write operations, are also translated by the IOMMU, a functionality that is commonly referred to as interrupt remapping.

    ¹ PCIe also supports legacy PCI interrupts; those interrupts are not implemented as memory writes.

    2.2 Virtual Memory

    Virtual memory is used in most modern general purpose computer systems [Den70].

    This invention simplifies systems and increases their usability and efficiency. Virtual

    memory isolates processes from each other. Each process has its own virtual address

    space. Independently compiled applications can reside and reference memory at any

    location without risking conflicts.

    Not all virtual address ranges referenced by a process always reside in physical

    memory. Only the current working set is resident. The process relies on the OS

    and underlying paging mechanisms to track changes in the working set and adjust

    the memory mapping accordingly. Locality of reference [Den05] makes paging work

    well [SK09]. It allows substantial savings by overcommitting physical memory. Only

    a small portion of the large virtual address spaces is actually mapped to the limited

    physical memory.

    As the active working sets change, OSes use secondary storage (e.g., disks) as

    an extension to physical memory. Data that is not part of the active working set is

    transparently moved to the secondary storage. If an application attempts to access

    data that was moved, a page fault exception is raised. This exception is handled by the

    OS, which brings the data back from the secondary storage to memory. During this

    handling, the application is suspended. Once the data is again in memory, the virtual

    memory mapping is updated, and the application is resumed. Page faults are typically

    classified as major or minor depending on whether disk access is required to satisfy a

    missing virtual-to-physical page mapping.
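
    As a small illustration of on-demand paging, the following user-space sketch maps a large anonymous region and touches only a fraction of its pages; mincore() then shows that only the touched pages are actually backed by physical frames. This example is an editorial illustration, not part of the thesis's implementation.

        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            size_t page = (size_t)sysconf(_SC_PAGESIZE);
            size_t npages = 1024;                 /* a few MiB of virtual memory */
            unsigned char *buf = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) { perror("mmap"); return 1; }

            /* Touch only every 64th page; each first touch triggers a (minor) page
             * fault, and only then does the OS back that page with a physical frame. */
            for (size_t i = 0; i < npages; i += 64)
                buf[i * page] = 1;

            unsigned char vec[1024];
            if (mincore(buf, npages * page, vec) != 0) { perror("mincore"); return 1; }

            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;
            printf("%zu of %zu pages resident\n", resident, npages);

            munmap(buf, npages * page);
            return 0;
        }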

    The OS uses free physical memory as a disk cache. A special API [Gal95] allows the application to access this cache directly, using the virtual memory mechanism. In a balanced computing system, disk paging activity usually occurs only during sporadic transient periods. These periods can occur, for example, when new processes are spawned

    and use the physical memory of currently idle processes or rarely accessed cached disk

    blocks.

    Virtual memory provides additional system-wide benefits. Prominent examples in-

    clude speeding up process startup time by reading only the needed parts of the executable;

    efficient fork() implementation using copy-on-write (CoW); and reducing the physical

    memory footprint using active de-duplication [Wal02a] or page compression [Gup09].

    Until recently, virtual memory was the sole domain of applications running on the

    CPU but its use is now expanding to peripheral devices as well. System IOMMUs

    reside between the I/O device and memory, and map virtual I/O address ranges

    that are provided to the device into physical page frames in a secure way. IOMMUs

    are typically used in direct HW pass-through to VMs [YBYW08] and user-level I/O

    [CBD+98, Sch01, SD06]. In addition, some device classes, such as RDMA devices

    [vE98], employ embedded IOMMU units within the device. All of these IOMMU devices

    address only memory mapping for isolation; they are not able to indicate a page fault

    to the OS. In the past year, commercial devices that support paging have been showing

    up in the market [PS09, SBJS15]. However, none of these implementations examine the

    impact of a page fault during an I/O operation. Thus, the implications of introducing

    the full benefits of virtual memory to I/O devices remain unexplored.

    2.3 InfiniBand Primer

    InfiniBand [Inf15] is a computer networking standard widely used in high-performance computing. InfiniBand exposes both a send/receive semantic and a remote direct memory access (RDMA) semantic for data transfers.

    The send/receive is the standard semantic we are used to from Ethernet. It is also

    called the two-sided semantic as software on both sides is involved in data transfer. The

    receiver must post a buffer large enough to receive the incoming data and the sender

    needs to tell the NIC what buffer to send. Since the receiver usually cannot know the

    size of incoming messages in advance, there is usually an agreed upon maximal message

    size and all the buffers posted by the receiver are of that size. This creates inefficiencies

    for both the sender and the receiver. The sender is limited in the amount of data it can

    send in one message, forcing it to split large data transfers into multiple transactions

    while the receiver wastes memory when the sender passes messages that are smaller

    than the maximal message size.
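
    For concreteness, this is roughly what posting one such fixed-size receive buffer looks like with the standard libibverbs API, assuming a created queue pair and a buffer already registered with ibv_reg_mr(); error handling and the surrounding setup are omitted, and this is a sketch rather than the thesis's actual code.

        #include <stddef.h>
        #include <stdint.h>
        #include <infiniband/verbs.h>

        /* Post one fixed-size receive buffer to a queue pair's receive queue.
         * Assumes buf was registered with ibv_reg_mr() and lkey is its local key. */
        static int post_recv_buffer(struct ibv_qp *qp, void *buf, size_t len, uint32_t lkey)
        {
            struct ibv_sge sge = {
                .addr   = (uintptr_t)buf,
                .length = (uint32_t)len,
                .lkey   = lkey,
            };
            struct ibv_recv_wr wr = {
                .wr_id   = (uintptr_t)buf,   /* returned in the completion */
                .sg_list = &sge,
                .num_sge = 1,
            };
            struct ibv_recv_wr *bad_wr = NULL;

            return ibv_post_recv(qp, &wr, &bad_wr);
        }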

    The RDMA semantic, also called the one-sided semantic, allows data transfer with software involvement from only one of the parties. RDMA does require initial software involvement

    on both sides to establish a connection and decide which memory areas are accessible to

    other parties. But after this initial setup, one of the parties can issue RDMA read and

    RDMA write operations to access the memory of the remote party without any software


    involvement from the remote party. This semantic reduces the overhead of data transfer

    for both parties. The initiating party is free to use the optimal transaction size for the

    transfer and the other party does not incur any software overhead. We note that while

    RDMA formally refers only to the one-sided semantic described above, it is commonly used informally for all InfiniBand user-level I/O operations, including send and receive.

    While one could imagine an application where all the communication is done using

    RDMA, this is usually not the case. Consider, for example, a remote storage application

    with a client and a server. Since the data is stored on secondary storage, the client

    usually cannot access it directly using RDMA operations. Even if it could access all

    the data, we would probably want some server involvement to synchronize concurrent

    access from multiple clients. Consequently, RDMA is typically used only for large data

    transfers and there is usually a side channel for control. In our storage example, the

    client would use the control channel to say: “I want to read k blocks starting at block number x”; the server would respond with “OK, it is available at address p, please let me know when you are done”. The client would then use RDMA to read the actual

    data in the blocks and use the control channel once again to notify the server when

    it is done. The control channel is typically implemented with send/receive semantics

    because the software on the receiver does want to be notified about the reception of a

    control message and be able to do something about it.
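
    Continuing the storage example, the client-side RDMA read might look roughly as follows with libibverbs, assuming a connected RC queue pair and a remote address and rkey obtained over the control channel. This is an illustrative sketch, not the thesis's code.

        #include <stdint.h>
        #include <infiniband/verbs.h>

        /* Issue a one-sided RDMA read: pull 'len' bytes from (remote_addr, rkey)
         * into a locally registered buffer, with no software on the remote side. */
        static int rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                             uint64_t remote_addr, uint32_t rkey, uint32_t len)
        {
            struct ibv_sge sge = {
                .addr   = (uintptr_t)local_buf,
                .length = len,
                .lkey   = lkey,
            };
            struct ibv_send_wr wr = {
                .wr_id      = 1,
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IBV_WR_RDMA_READ,
                .send_flags = IBV_SEND_SIGNALED,   /* ask for a completion when done */
            };
            wr.wr.rdma.remote_addr = remote_addr;  /* obtained over the control channel */
            wr.wr.rdma.rkey        = rkey;

            struct ibv_send_wr *bad_wr = NULL;
            return ibv_post_send(qp, &wr, &bad_wr);
        }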

    Unlike the Ethernet standard, which ends at the link layer, InfiniBand also includes specifications for the transport layer, encouraging implementation of that layer in hardware. The most commonly used transport is the reliable connection (RC) transport, which provides reliable data transfer between two parties, similarly to TCP/IP. Other notable transports are unreliable connection (UC) and unreliable datagram (UD); like UDP/IP, they are both unreliable. The difference is that UC is connection-oriented and mandates a send queue per connection, while UD supports multicast and allows using a single send queue to talk to multiple parties. One notable place where UD, rather than RC, is typically used is the implementation of the IP over InfiniBand (IPoIB) [CK06] driver.

    IPoIB allows regular applications designed to be used in the IP environment to work

    over InfiniBand networks. The rationale for using UD in IPoIB is that it is simpler to

    add InfiniBand headers to IP packets and send them with a shared send queue than to

    maintain multiple reliable send queues corresponding to all the active remote parties.


  • Chapter 3

    Motivation

    Direct I/O applications and virtualized systems with direct device assignment have two ways of accessing memory: directly, using the CPU, and indirectly, by asking an I/O device to access the memory on the application's behalf. While CPU page faults are supported on all modern general-purpose systems, support for IOPFs is almost non-existent. Consequently, such systems are forced to use pinning and cannot enjoy the full benefits of virtual memory.

                            Translation only    Translation + page fault support
    isolation                      ✓                          ✓
    on demand paging               ✗                          ✓
    swapping                       ✗                          ✓
    overcommitment                 ✗                          ✓
    easy programming               ✗                          ✓
    page migration                 ✗                          ✓

    Table 3.1: Benefits of virtual memory

    Table 3.1 lists the benefits of virtual memory and specifies whether they can be provided by a system that only supports translation or whether page fault support is also required. As we can see, the only benefit that a translation-only virtual memory system can provide is isolation. Such a system cannot provide on-demand paging or swapping, because a page fault is required to page in the data. It cannot provide copy-on-write (CoW), because a page fault upon write is required to copy the data and break the CoW mapping. As a result of the limitations above, there is no memory overcommitment in such a system. The ability to overcommit memory allows programs to allocate memory even if there is not enough physical memory to satisfy the allocation. This ability greatly simplifies the programming model, as it relieves the programmer from writing fall-back code paths for every memory allocation failure. In typical systems today, memory allocation failures in user space are so rare that it is acceptable for a program to simply terminate when an allocation fails. In addition to all the overcommitment-related benefits mentioned above, page faults are also required for page migration, which allows compacting the physical memory for more efficient usage.

    With IOPF support, the inherent benefits of virtual memory apply seamlessly to

    application buffers used for direct I/O. Storage servers, for example, may allocate large

    buffer pools up front to accommodate the worst case, but reference only “hot” buffers

    in the common case. As a result, the I/O memory footprint follows the current working

    set.

    In other applications, such as HPC, large memory ranges may be mapped for remote

    direct memory access (RDMA) by I/O devices [Inf15]. A remote host in an RDMA-

    capable network may read and write local application memory directly (after a proper

    key exchange) without involving the (local) CPU. Here, the working set is determined

    by remote I/O activity rather than by the local CPU.

    In the context of virtualization, direct device assignment mandates pinning the entire

    VM address space [ABYTS11], even though the VMs themselves are held in virtual

    memory. IOPF maintains the benefits of the large body of work done on over-committing

    VM memory without giving up the performance advantages of direct I/O.

    Virtual memory moves the complex memory management code employed by state

    of the art direct I/O applications to the operating system. Complex code that decides

    what data should be in memory at any given time can be discarded. Finally, pinning

    memory requires special administrative privileges. IOPF support will allow running

    unprivileged direct I/O applications.


  • Chapter 4

    Basic IOPF Support

    When we add an indirection level and decide that DMA transactions should use virtual

    rather than physical addresses, we also have to decide how to handle translation failures:

    a situation where a DMA transaction references a virtual address that has no translation

    or has permissions which do not allow the transaction to complete. We can leave the

    I/O devices oblivious to the possibility of translation failure and force the software to

    avoid them, or we can inform the I/O devices about translation failures and give the devices a chance to do something about them. The VT-d spec [Int14] uses the terms

    non-recoverable and recoverable address translation failures to describe those options.

    4.1 Non-Recoverable Failures

    The simpler and widely used option is to keep the I/O devices oblivious to translation

    failures. A DMA transaction to a virtual address that cannot be translated is treated

    the same as a DMA transaction to an invalid physical address. I/O devices are not

    notified about translation failures, and do not require any modification to work in this

    mode. Under this design option, a DMA access should never encounter a translation

    failure. Such a failure, if it does happen, indicates that either the device or its driver

    are misbehaving. The IOMMU nevertheless detects and reports translation failures.

    But OSes currently do not have a standard interface for drivers to register translation

    failure callbacks [Cor14]. As a result, all the IOMMU driver can currently do is to log

    the failure. Even if OSes did have an interface for registering such callbacks, recovering

    from this failure could be prohibitively difficult or altogether impossible, because most

    I/O devices are designed under the assumption that DMA operations do not fail.

    DMA Read When a device issues a read, it expects an answer. The IOMMU indeed

    returns an answer when the corresponding translation fails, but this answer is a generic

    “unsupported request” [Int14]; namely, the device has no way of knowing that the failure

    was due to a translation failure. The specific reaction of devices to such generic failures

    varies. The network and disk controllers that we tested exhibited various undesirable


    behaviors, ranging from getting stuck to corrupting the filesystem. Devices may, in

    principle, behave in a more civilized manner by raising an interrupt, informing their

    driver to reset the device. Conceivably, devices can ignore the read error and continue

    as if nothing bad happened. For example, a sound card can skip the data that it was not

    able to read and be silent for a moment. Such an approach, however, might colossally

    fail if a disk controller is involved. For example, if the controller is instructed to read

    from memory and write the content to the disk, then silently ignoring a DMA read error

    would likely result in data loss, because the OS would rightfully expect the information

    to be persistent on disk. This example coincides with the behavior we empirically

    observed.

    DMA Write   DMA write operations are more challenging than reads with respect to address translation failures. Whereas DMA read operations expect a response, DMA write operations are conducted in a “write and forget” manner. Namely, devices are not acknowledged when DMA write operations complete.¹ Consequently, there is no way for the IOMMU to inform devices that write operations have failed, so the operations fail silently upon translation failures. Silent failures are usually worse than other outcomes such as, say, crashing. Assume, for example, that the host directly assigns a disk controller to a guest without pinning the associated memory in a system where translation failures cause DMA writes to fail silently. Further consider that the guest intends to read and run an executable from the disk. The guest will therefore instruct the disk controller to read the relevant blocks and DMA write them to memory. If a translation failure occurs during the DMA write, it will fail silently, without writing the requested data to memory. Being oblivious to the IOPF, the disk controller will next inform the guest that the operation has completed successfully. The guest, in turn, will run the executable, which will crash when the CPU tries to execute “random” data that was supposed to be overwritten with the executable code but was left unchanged due to the DMA write failure. The crash might happen at the worst possible moment, for example, when trying to save a big document that the user has been working on for hours. Worse, the executable could be a kernel module and thus crash the OS entirely. Note that in such a scenario, the hypervisor is notified about the IOPF through the IOMMU, but there is no standard way for the hypervisor to notify the guest about the error.

    ¹ PCIe does have a link-layer ack, but this ack only means that the packet containing the write request has been received successfully; it does not mean that the write operation has been successfully completed. Furthermore, the entity sending the ack is not necessarily the root complex and hence has no knowledge as to whether the DMA write was successful. Rather, it may be a PCIe switch that has no understanding of (I/O) virtual memory semantics.

    NIC translation failures   The above examples of what might go wrong when encountering address translation failures are disk-related. Seemingly, the network is not as “trustworthy” as the disk. For example, one trusts a disk write to transpire as requested, whereas one does not equivalently trust a packet to reach the other side. As

    a result, network packets have checksums and sequence numbers to recover from packet

    loss and data corruption. If random data is passed to the network stack, it will most

    likely be dropped. The reason for the drop can be an unknown protocol, bad checksum

    or bad sequence number. Consequently, when the NIC receives a packet, fails to write

    its content to memory due to an IOPF and informs the NIC’s driver that a packet was

    received, the most likely scenario is that the packet will get dropped.

    However, we would like to point out that the disk related failure scenarios we

    mentioned earlier are actually applicable to NIC translation failures when working with

    NFS (network file system). Many modern NICs have hardware checksum offloading

    and scatter gather capabilities. In order to take advantage of the hardware checksum

    offloading, the network stack is usually designed to skip the checksum checking step

    in cases where the hardware has confirmed that the checksum is good. When the

    NIC receives a good packet but fails to write its content to memory, it will actually report a good checksum, because it checked the original packet and not the content that the network stack will see in memory. In such a scenario, the packet will still most likely get dropped due to a bad header. However, if we further assume that the packet

    crosses a page boundary, or that the NIC’s driver uses the scatter gather capabilities

    and receives each incoming packet into multiple pages, then it is possible that a part

    of the packet containing the header will be written successfully to memory while the

    rest of the packet will not be. In such a scenario, the header is valid and there is

    no software checksum check because the network stack blindly trusts the hardware

    checksum checking capabilities. As a result, the network stack will give corrupted data

    to the application using it. If the application is an NFS client, we would experience a failure similar to the disk read failure, and if the application is an NFS server, we would experience a failure similar to the disk write failure.²

    ² Application-level data integrity testing can save us from data corruption in those cases. However, it is only supported in NFSv4, which is not widely deployed.

    4.2 IOPF support

    The second option for handling translation failures is to inform the device when a

    translation failure occurs and allow the device to respond. We mentioned earlier that

    the VT-d spec uses the term recoverable address translation failures for this option.

    However, the VT-d spec actually refers specifically to the ATS/PRI standards by PCI-SIG, and there are other ways to implement this design option. For the purposes of this work, we will use the term I/O page fault (IOPF) for a translation failure, and IOPF support for any design where devices are notified about IOPFs and have the hardware/software interfaces described next in order to be able to handle IOPFs

    gracefully. Existing I/O devices need to be modified in order to support IOPFs, and as

    a result, this support is still very rare. In fact, the only other implementation we are aware of is the GPU in AMD Kaveri. We developed two devices with IOPF support: a Mellanox ConnectX-3 40G Ethernet NIC and a Mellanox Connect-IB 56G InfiniBand Host Channel Adapter (HCA). Both adapters employ a similar internal IOMMU, which initially assumed that all PTE (page table entry) mappings are valid.

    Figure 4.1: IOPF (1 – 4) and invalidation (a – d) flows. (The figure depicts the OS, the driver, the I/O device, and the I/O page tables, with the numbered IOPF steps and the lettered invalidation steps between them.)

    Page Faults   To support page faults, we allowed the internal IOMMU to hold invalid mappings. The resulting IOPF flow is illustrated in Figure 4.1 and described as follows: (1) The device starts processing an I/O request. It consults the I/O page tables and determines that one of the pages involved is not present. (2) The device raises an IOPF interrupt to its driver in order to resolve the page fault. (3) The driver calls get_user_pages() to obtain the relevant pages from the OS, and immediately calls put_page() to avoid pinning them. (4) The driver updates the I/O page table with the corresponding physical addresses and informs the device that the IOPF has been resolved, allowing it to resume normal operation.
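
    A minimal sketch of what steps (3) and (4) might look like in a Linux driver is shown below. The exact get_user_pages() signature has changed across kernel versions, and my_device, program_io_pte() and notify_device_fault_resolved() are hypothetical stand-ins for the device-specific parts; this is not the thesis's actual driver code.

        #include <linux/mm.h>
        #include <linux/io.h>
        #include <linux/errno.h>

        struct my_device;  /* hypothetical device context */
        void program_io_pte(struct my_device *dev, unsigned long va, phys_addr_t pa);
        void notify_device_fault_resolved(struct my_device *dev, unsigned long va);

        /* Resolve one IOPF for a faulting I/O virtual address (steps 3 and 4). */
        static int resolve_iopf(struct my_device *dev, unsigned long va)
        {
            struct page *page;
            long got;

            /* (3) Fault the page in and take a reference to it; the signature of
             * get_user_pages() shown here is the recent one and differs in older kernels. */
            got = get_user_pages(va & PAGE_MASK, 1, FOLL_WRITE, &page);
            if (got != 1)
                return got < 0 ? (int)got : -EFAULT;

            /* (4) Install the translation in the device's I/O page table ... */
            program_io_pte(dev, va & PAGE_MASK, page_to_phys(page));

            /* ... and drop the reference immediately so the page is not pinned;
             * the MMU-notifier-driven invalidation flow keeps the mapping safe. */
            put_page(page);

            notify_device_fault_resolved(dev, va);
            return 0;
        }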

    In the unlikely case that the memory is not available, for example due to an access

    attempt to unallocated virtual memory, the device driver notifies the hardware that the

    page fault could not be resolved. The hardware follows the same error semantics that

    InfiniBand [Inf15] defines for local access errors.

    Invalidations As the memory pages accessed by the device are no longer pinned, the

    OS is allowed to unmap and reuse pages at will. This requires an invalidation flow,

    in which the OS notifies the device that a virtual mapping is no longer valid. The

    invalidation flow is illustrated in Figure 4.1 and described as follows: (a) The OS decides

    to change a virtual mapping and asks the driver via the Linux kernel MMU notifiers

    infrastructure [Arc08] to remove the old mapping and stop the device from using it. (b)

    The driver updates the I/O page tables and issues an invalidation to the device. (c) The

    device acknowledges the invalidation and stops using the relevant mapping. (d) The

    driver notifies the OS that the old mapping has been removed and that the relevant


    pages can be reused.

    We note that multiple IOPFs and invalidations may execute concurrently, and software-based locking is used for synchronization. Page fault handling might naturally block on an invalidation. Due to this fact, the locking scheme had to ensure that invalidations never block on page fault handling. If an IOPF collides with an invalidation, the IOPF handling is aborted and restarted after the invalidation completes.
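
    A sketch of how a driver might hook into the Linux MMU notifier infrastructure for step (a) is shown below. The callback signature matches kernels of roughly the thesis's vintage and has since changed, and my_device, invalidate_io_ptes() and flush_device_iotlb() are hypothetical placeholders for the device-specific parts.

        #include <linux/kernel.h>
        #include <linux/mm.h>
        #include <linux/mmu_notifier.h>

        struct my_device {
            struct mmu_notifier mn;
            /* ... device state ... */
        };

        void invalidate_io_ptes(struct my_device *dev, unsigned long start, unsigned long end);
        void flush_device_iotlb(struct my_device *dev, unsigned long start, unsigned long end);

        /* Step (a): the kernel calls back whenever a mapping of the registered mm
         * is about to change. */
        static void my_invalidate_range_start(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start, unsigned long end)
        {
            struct my_device *dev = container_of(mn, struct my_device, mn);

            /* (b) Mark the I/O PTEs non-present and issue an invalidation ...        */
            invalidate_io_ptes(dev, start, end);
            /* (c) ... wait until the device acknowledges it stopped using them.      */
            flush_device_iotlb(dev, start, end);
            /* (d) Returning lets the OS reuse the underlying pages.                  */
        }

        static const struct mmu_notifier_ops my_mn_ops = {
            .invalidate_range_start = my_invalidate_range_start,
        };

        static int register_invalidations(struct my_device *dev, struct mm_struct *mm)
        {
            dev->mn.ops = &my_mn_ops;
            return mmu_notifier_register(&dev->mn, mm);
        }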


  • Chapter 5

    The Page Fault Latency Problem

    When a CPU page fault occurs, the current thread of execution is halted until the

    operating system handles the page fault. Similarly, for devices that perform purely local

    work (GPU, FPGA, ASIC accelerators and local storage devices), it is usually possible

    to suspend the specific execution context until the page fault is resolved.

    However, there exists a large class of devices for which pausing I/O due to a page

    fault disrupts normal I/O operation, even if the average I/O rate in the presence of

    page faults does not limit the desired throughput. Sensor data, audio sampling, video

    input, and CD-ROM burning are examples of such devices. In the context of network devices, this problem makes it difficult to send and receive packets in a timely manner.

    We denote by rIOPF (receive IOPF) the scenario in which a NIC encounters an IOPF

    while receiving a packet from the network. Arguably, the simplest approach is to drop

    all incoming packets designated to the faulting ring until the rIOPF is resolved. The

    problem with this approach is that if the drop is done without informing the transport

    layer, performance will be greatly impacted due to the relatively large timeout values

    used today – 200 milliseconds in TCP [VPS+09], and around 4 seconds with InfiniBand,

    which assumes a (nearly) lossless network. In the next chapters, we elaborate on how

    we deal with the rIOPF problem for the specific cases of both HW and SW transports.

    For transmission, the situation is less complicated as suspending a transmit ring

    for the duration of an IOPF will only delay the transmission and will not cause data

    loss. However, network protocols might rely on delivering acknowledgments in a timely

    manner because the peer might interpret the delay as an indication of packet loss. This

    does not pose a problem in practice due to the large network timeouts, as mentioned

    above, and in our implementation we suspend the sending queue until page faults are

    resolved.


  • Chapter 6

    HW Transport IOPF Support

    When the page fault and the transport are both handled in the same hardware unit,

    interfacing them is relatively easy. This is the case for InfiniBand adapters. We

    specifically consider the reliable connection (RC) transport.

    The InfiniBand wire protocol supports explicit link-level flow control. Therefore, a naïve implementation could block the incoming network traffic until an rIOPF is resolved.

    This would allow the implementation to be work preserving. However, blocking the

    flow control for an extended period of time will lead to congestion spreading [AAK+08,

    SCS+14] in the network. Therefore, an end-to-end mechanism is needed to handle such

    a situation.

    The RC protocol contains an end-to-end mechanism for the receiver to stop the

    sender. When an rIOPF is encountered, the receiver sends a receiver-not-ready (RNR)

    negative acknowledgment (NACK) packet. This NACK notifies the sender that it should pause the transmission of the relevant flow for a specified period of time T. This notification explicitly informs the sender that the packet was lost, allowing the sender to retransmit quickly rather than using the generic 4-second retransmission timeout

    mentioned earlier. The notification also stops the sender early and reduces the amount

    of data that will be lost due to the rIOPF. We note that in this solution some data is lost

    and retransmission is required. However, retransmission is possible because the protocol

    is reliable and so, by definition, the sender must keep the data until it is acknowledged.
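
    With the standard verbs API, the RNR behavior described above is controlled by two queue-pair attributes, roughly as sketched below; a real RTR/RTS transition sets several additional attributes and mask bits, so this fragment is illustrative rather than complete and is not taken from the thesis's implementation.

        #include <infiniband/verbs.h>

        static int set_rnr_behavior(struct ibv_qp *qp)
        {
            struct ibv_qp_attr attr = {0};
            int ret;

            /* Receiver side (part of the RTR transition): how long the peer is asked
             * to wait after an RNR NAK before retrying; encoded value 12 corresponds
             * to roughly 0.64 ms. */
            attr.qp_state = IBV_QPS_RTR;
            attr.min_rnr_timer = 12;
            ret = ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_MIN_RNR_TIMER /* | ... */);
            if (ret)
                return ret;

            /* Sender side (part of the RTS transition): 7 means "retry indefinitely"
             * after RNR NAKs instead of reporting an error to software. */
            attr.qp_state = IBV_QPS_RTS;
            attr.rnr_retry = 7;
            return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_RNR_RETRY /* | ... */);
        }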

    We further note that in InfiniBand packet loss is decoupled from congestion con-

    trol and consequently the retransmission does not have a negative impact on future

    transmissions. This is not the case in TCP/IP.

    We show in §8 that this “drop and stop the sender” solution works reasonably well in the context of the InfiniBand reliable connection transport.


  • Chapter 7

    SW Transport IOPF Support

    While for hardware transport we could rely on the transport state during page fault

    handling, in software transports such information is not readily available. In the prevalent

    setup of TCP/IP over Ethernet, the transport layer is implemented in software, while

    page faults are handled by the hardware and the device driver in the hypervisor. The

    hardware and the hypervisor, being only aware of the Ethernet layer, cannot handle

    the rIOPF by sending an RNR-NACK equivalent for TCP. An unmodified guest’s

    TCP/IP stack, on the other hand, has the full state information, but is not aware

    that a page fault has occurred, and does not have the required information from the

    TCP/IP headers of the faulting packet. Consequently, for the software transport case

    we cannot implement the RNR-NACK solution, and we are left with the drop and wait

    for retransmit option.

    Initially, we hoped this option would suffice to support TCP/IP. This hope was

    based on the fact that TCP/IP stacks usually use a limited size memory pool for

    transmission and reception buffers, and the buffers are frequently used, so we expected

    IOPFs to be relatively rare. However, when we tested this approach with TCP/IP, we

    discovered a critical problem. We call it the cold ring problem.

    7.1 The Cold Ring Problem with TCP

    Our implementation uses a separate page table in the device, and does not pre-populate

    it. Consequently, when we first start an application, the receive ring is “cold”. Namely,

    the receive buffers are not mapped and rIOPFs are not as rare as one would expect

    during steady state operation. We discovered that a “cold” receive ring poses a unique

    problem: TCP retransmission and congestion avoidance result in a near-deadlock of

    the communication. The cold ring problem is not limited merely to startup situations.

    It can also happen, for example, when the VM is resumed from suspension or brought

    back from swap.

    New TCP connections start in a slow start phase in which they send at a very low

    rate in order to avoid exceeding the network capacity. Drops are considered a sign of congestion and cause TCP to reduce the transmission rate even further. Similarly,

    during the connection establishment stage, TCP utilizes an exponential backoff scheme

    to avoid overloading the network and the receiver. However, if the cause of the packet

    loss is an IOPF, communication all but stops. The transmitter will be waiting for

    acknowledgments from the receiver before it increases the transmission speed. Instead,

    due to timeouts, it will actually try to reduce the transmission rate. At the same time,

    the receiver will wait for more packets to arrive from the network to page-in the receive

    ring. The effective visible behavior strongly resembles a deadlock. Nearly no network

    traffic is sent, with both parties waiting for the retransmission timeout. In some of the

    cases, the issue is so severe that the TCP stack announces a failure to the application

    layer. This happens once the maximal retry number is exceeded. We demonstrate such

    an issue in §8.4.2. In addition to suspend/resume of a VM or a startup case, a ring can become cold for other reasons. Specifically, operations such as NUMA migration or fork can

    cause the same effect.

    7.2 The Backup Ring

    In para-virtualized guests, the receive ring is also likely to be cold from time to time.

    However, the problem is much less severe thanks to buffering performed by the hypervisor

    virtual switch for the guest traffic. When considering where to buffer such packets, we

    rule out buffering on the NIC itself, since adding enough on-chip memory to buffer a

    major page fault would be too expensive. The backup ring solution to the cold ring

    problem is based upon these observations.

    Figure 7.1: High level design of the backup ring. (The figure shows packets arriving from the network at the NIC (1) and being written either directly into the guest VM's receive buffer (2) or, upon a page fault, into a backup ring owned by the hypervisor/OS kernel (3), from which they are later copied into the guest buffer (4).)

    The design of the solution can be seen in Figure 7.1. Traffic is received from the network by the NIC (1). For each incoming packet, the NIC inspects the designated

    receive buffer of the Guest VM. If this buffer is available, the data is written directly to

    it (2). However, if a page fault is encountered while writing into the buffer, the packet

    is written to a backup ring owned by the hypervisor (3). After the hypervisor fixes the

    page fault, it copies the packet into the original receive buffer (4). To maintain ordering,

    the NIC skips receive descriptors that encountered page faults. For the same reasons,

    the NIC does not report the reception of new packets to the guest until all previous page faults have been handled.

        struct ring_data {
            // size of ring
            int size;
            // the ring itself
            descriptor_t *descriptor;
            int head;
            int tail;
            bit *bitmap;
        };

        // unresolved IOPFs
        list IOPF_list;

        struct brentry_t {
            int ringID;
            int index;
            int bitmap_index;
            packet_t pkt;
        };

        void br_interrupt() {
            head = get_head(br);
            tail = get_tail(br);
            while (tail != head) {
                brentry_t e = br_re_arm(tail);
                IOPF_list.append(e);
                tail = (tail + 1) % ring_size(br);
            }
            br_update_tail(tail);
            wake_rIOPF_thread();
        }

        void rIOPF_thread() {
            while (true) {
                if (IOPF_list.size() == 0)
                    wait_for_rIOPF();
                assert(IOPF_list.size() > 0);
                brentry_t e = IOPF_list.pop();
                r = get_ring(e.ringID);
                if (!has_room(r))
                    wait_for_tail_change(r);
                make_present(r.descriptor[e.index]);
                store_packet(r.descriptor[e.index], e.pkt);
                r.bitmap[e.bitmap_index] = 0;
                free(e);
                // triggers the NIC's resolve_rIOPFs flow
                resolve_rIOPFs(r);
            }
        }

    Figure 7.2: Software pseudo-code of the backup ring

    The backup ring mechanism allows a graceful fallback. During a page fault, the guest machine behaves exactly like a guest with a para-virtualized or emulated NIC. Namely, the hypervisor will have to buffer incoming traffic for the guest. When the buffer space for this guest is exhausted, incoming packets are dropped. At the same time, if the IOPF is resolved in a timely manner, this will only cause minor latency jitter instead of packet loss. In the common case of page-fault-free traffic, the guest experiences high performance thanks to direct device assignment. We note in passing that while the backup ring solution is described and evaluated in the context of TCP/IP over Ethernet, it is also applicable to the unreliable transport protocols of InfiniBand, which lack hardware retransmission.

    Figure 7.2 contains pseudo-code describing how the hypervisor manages the backup

    ring. Note that in our design the guest is unaware of the backup ring and does not need

    to be modified in order to benefit from the backup buffer mechanism.

    The struct ring_data contains the data that the hypervisor maintains for each ring. size is the size of the ring. descriptor is a pointer to the ring itself, which is an array of size buffer descriptors. head is the index of the next descriptor to be used by the NIC, and tail is the index of the next descriptor to be consumed by the software. bitmap is a special bitmap used to track which packets experienced rIOPFs; it allows the NIC to continue storing new incoming packets in the guest's ring even when there are pending unresolved rIOPFs. The bitmap is initialized with all bits set to zero. Its size is controlled by the hypervisor, and it limits the number of packets the hypervisor will store for a specific guest.

    We also explored a simpler design where the existence of an unresolved rIOPF would


    cause all subsequent packets to be stored in the backup ring. But we were concerned

    that such a design could get stuck in an operation mode where all the packets go through

    the backup ring because a new packet always arrives before the NIC is notified that the

    previous packet has been copied to the guest’s ring.

    The interrupt handler br_interrupt() is called after a new packet has been stored in the backup buffer. This function moves used entries in the backup ring to a list of unresolved rIOPFs and posts a new buffer to the backup ring so that it will not run out of buffers for new entries. In addition, it wakes up a thread whose job is to resolve the rIOPFs. This thread is required because handling an rIOPF might require sleeping,

    which is forbidden in an interrupt context.

    The rIOPF thread executes rIOPF_thread(). The first thing it does is check whether there are any rIOPFs it needs to resolve. If there are none, it goes to sleep until an rIOPF occurs. After the thread is woken up, it makes sure that the designated ring has an available descriptor where it can store the faulting packet. The condition is likely to be true for most of the packets. However, our backup ring mechanism allows the hypervisor to buffer more than “ring size” packets for a guest. The reason behind this decision is that after an rIOPF occurs and until it is resolved, the NIC does not notify the guest about the reception of new packets. Consequently, the guest does not post new buffers in the receive ring and there is an overflow risk. As a result of this design, the number of faulting packets in the backup ring might exceed the number of entries in the original receive ring. Since the hypervisor does not want to drop packets, it will have to process a number of packets, ask the NIC to report that they were received, and then wait for the guest to post new buffers. The waiting can be implemented either in software only, by using sleep and polling, or it can be hardware assisted: the hypervisor will ask the NIC to raise an interrupt when the guest changes the tail of the ring and go to sleep until the interrupt arrives. After making sure that there is a free descriptor, the IOPF thread will make sure that the descriptor and the buffers are all present. It will copy the packet into the descriptor's buffer, update the faulting-packets bitmap, and tell the NIC that an IOPF has been resolved.

For simplicity's sake, our pseudo-code handles the faulting packets one at a time. A

practical implementation is likely to use batching, and possibly multiple threads, when

handling the faulting packets (as sketched below). Such an implementation would read a predefined number

of faulting packets and start the I/O required to make all of them present. After the

I/O is done, the hypervisor would copy the faulting packets one at a time and tell the

NIC that the whole chunk has been resolved using a single operation.
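
A batched variant could be organized roughly as follows. All helper names here (dequeue_unresolved, start_page_in, wait_for_page_in, and so on) are hypothetical; the sketch only illustrates the structure described in the previous paragraph.

// Sketch of a batched rIOPF resolver (hypothetical helpers).
#define BATCH 32
void resolve_riopf_batch(struct ring_data *r) {
    struct faulting_pkt *batch[BATCH];
    int n = dequeue_unresolved(r, batch, BATCH);  // grab up to BATCH pending rIOPFs
    for (int i = 0; i < n; i++)
        start_page_in(batch[i]);                  // start the I/O for every faulting page
    wait_for_page_in(batch, n);                   // wait once for all of the I/O to finish
    for (int i = 0; i < n; i++) {
        wait_for_free_descriptor(r, batch[i]);    // the guest may need to post new buffers
        copy_to_guest_ring(r, batch[i]);          // copy the packets one at a time
    }
    nic_resolve_riopfs(r);                        // one command resolves the whole chunk
}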

Figure 7.3 shows the corresponding NIC hardware pseudo-code. The struct ring_t

contains the state the hardware maintains for each ring. The head_offset variable is

used to determine where the next incoming packet should be stored. When there are no

unresolved rIOPFs, head_offset is zero and the packet is stored at head. When there

are unresolved rIOPFs, head keeps pointing to the index of the first rIOPF-triggering

descriptor, as we cannot inform the guest about the reception of new packets before this

rIOPF has been resolved.


struct ring_t {
    int size;
    descriptor_t *descriptor;
    int tail;
    int head;
    int head_offset;
    int bitmap_size;   // number of bits in bitmap (added for completeness; used below)
    int bitmap_index;
    bit *bitmap;
};

// HW - receive packet: invoked if pkt is
// specifically designated for r, or if broadcast
void recv(ring_t r, Packet pkt) {
    if (!store_in_ring(r, pkt))
        store_in_backup(r, pkt);
}

// HW - invoked by the hypervisor after a rIOPF has been resolved
void resolve_rIOPFs(ring_t r) {
    i = r.bitmap_index;
    while (r.head_offset > 0 && r.bitmap[i] == 0) {
        atomic { r.head_offset--; r.head = (r.head + 1) % r.size; }
        i = (i + 1) % r.bitmap_size;
    }
    r.raise_isr();   // takes care of coalescing
}

bool ring_overflow(ring_t r) {
    tail = r.tail;
    if (tail < r.head)
        tail += r.size;
    return r.head + r.head_offset >= tail;
}

bool store_in_ring(ring_t r, Packet pkt) {
    if (!ring_overflow(r) && r.is_descriptor_present(r.tail)) {
        head = (r.head + r.head_offset) % r.size;
        r.descriptor[head].store(pkt);
        if (r.head_offset != 0)
            r.head_offset++;
        else {
            r.head = (r.head + 1) % r.size;
            r.raise_isr();   // takes care of coalescing
        }
        return TRUE;
    }
    return FALSE;
}

void store_in_backup(ring_t r, Packet pkt) {
    offset = r.head_offset;
    if (offset < r.bitmap_size && backup.tail != backup.head) {
        offset = (offset + r.bitmap_index) % r.bitmap_size;   // wrap within the bitmap
        head = (r.head + r.head_offset) % r.size;
        backup.descriptor[backup.head].store(concat(r.id, head, offset, pkt));
        backup.head = (backup.head + 1) % backup.size;
        r.bitmap[offset] = 1;
        r.head_offset++;
        backup.raise_isr();   // takes care of coalescing
    }
}

Figure 7.3: Hardware pseudo-code for the backup ring.


The NIC always tries to store a new incoming packet at index head + head_offset. bitmap

is the bitmap mentioned earlier in the description of the ring data, and bitmap_index is

the index of the bit in the bitmap corresponding to the descriptor at index head in the ring.

The recv() function describes how the NIC handles a packet, pkt, designated for the

ring r. The NIC first checks whether the packet can be stored in the ring r; if it

cannot, the packet is redirected to the backup ring. store_in_ring() tries to store the

packet in the ring r. The conditions for using the ring are that the target index does not

exceed the tail and that the relevant descriptor and buffers are present1. Assuming the

conditions above hold, the packet is stored directly in the ring. If there are unresolved

rIOPFs for this ring, we only advance head_offset; if there are no unresolved

rIOPFs, we advance head and raise an interrupt to signal the reception of a new packet

in the ring r.

Alternatively, if the packet is redirected to the backup ring, store_in_backup() is

executed. It checks that the distance from the first unresolved packet does not exceed

the bitmap size and that there is room in the backup ring. If this is not the case,

the packet is dropped. Assuming the packet is not dropped, we append additional

information to the packet in order to help the hypervisor resolve the rIOPF, mark the

bitmap, and raise an interrupt to the hypervisor. We also advance head_offset to skip

an entry in the designated ring.

The resolve_rIOPFs() flow is executed when the hypervisor notifies the ring that

a rIOPF has been resolved. It uses the bitmap to update head to point to the next

unresolved rIOPF, or to the top of the ring if there are none. We note that while this

loop might take some time, it does not have to be atomic with respect to packet reception.

Only head and head_offset must be updated together, because the destination of new

packets is determined by their sum.

    7.3 Implementation

Our backup ring evaluation had to overcome two hardware limitations. The hardware

IOPF implementation did not support redirecting packets to a secondary receive ring

upon a page fault. Nor did it support the combination of IOPF and SR-IOV.

Consequently, to evaluate the backup ring solution, we approximated it using software.

Lightweight virtualization To address the lack of support for SR-IOV instances,

the evaluation was performed with lightweight virtualization. Each lightweight virtual

machine used a user space TCP/IP stack and kernel-bypass technology based upon

IB-verbs [Inf15]. Linux cgroups [Men] were used to limit the memory available to each

lightweight virtual machine.

1A real implementation would also check that the incoming packet is not too large for the corresponding descriptor and drop the packet without storing it to the backup ring if this is not the case. For simplicity, we ignore this complication.


while (1) {
    /* Poll the driver, get any outstanding frames,
       allocate memory for them, and call netif->input. */
    poll_driver(netif);
    /* Handle all system timeouts for all core
       protocols and the application. */
    sys_check_timeouts();
}

Figure 7.4: lwIP main loop.

A good user space TCP/IP stack was surprisingly hard to find. The candidates

we considered were OpenOnload [Rid], libvma, and lwIP. OpenOnload only works on

Solarflare NICs, and porting it to a Mellanox NIC is difficult. libvma was closed source

when we started the work and only became open source when we already had a working

system. It is an interesting library, but making it work requires some effort, so it should

be considered future work.

The option we chose was lwIP. We discovered that it is targeted at embedded

systems and as a result has many shortcomings: it is single threaded, its malloc is very

slow, it has no support for window scaling or hardware offloading, and its socket API

is very slow. To address these shortcomings, we replaced the malloc that came with

lwIP with dlmalloc [Lea] and borrowed some patches from libvma, which is also based

on lwIP. Among other things, this gave us support for window scaling.

To address the poor performance of lwIP's socket API, we followed the advice of

[Gol] and used the raw API of lwIP. Unlike the socket API, the raw API is event driven

rather than sequential. The main loop of an application using the raw API should look

like the one in Figure 7.4 [SC]. The application itself should be event driven, and the

application code should be called from lwIP callbacks. The lwIP callbacks are

invoked on different events: when a new connection is established, when new data is received, etc.

As a result, porting a generic application to use the raw API is nontrivial.
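
To illustrate what the raw API looks like, the following sketch registers the accept and receive callbacks for a TCP listener. The port number and the processing step are placeholders, and error handling is omitted; it is not taken from our memcached port.

#include "lwip/tcp.h"

// Receive callback: invoked by lwIP whenever data arrives on the connection.
static err_t on_recv(void *arg, struct tcp_pcb *tpcb, struct pbuf *p, err_t err) {
    if (p == NULL) {              // the remote side closed the connection
        tcp_close(tpcb);
        return ERR_OK;
    }
    // ... application-specific processing of p->payload goes here ...
    tcp_recved(tpcb, p->tot_len); // re-open the advertised receive window
    pbuf_free(p);
    return ERR_OK;
}

// Accept callback: invoked when a new connection is established.
static err_t on_accept(void *arg, struct tcp_pcb *newpcb, err_t err) {
    tcp_recv(newpcb, on_recv);    // register the receive callback for this connection
    return ERR_OK;
}

void server_init(void) {
    struct tcp_pcb *pcb = tcp_new();
    tcp_bind(pcb, IP_ADDR_ANY, 7777);  // 7777 is an arbitrary example port
    pcb = tcp_listen(pcb);
    tcp_accept(pcb, on_accept);
}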

Backup ring approximation Due to the inability to redirect packets to a secondary

receive ring upon a page fault, we had to settle for an approximation of the

real backup ring. Instead of redirecting only rIOPF-triggering packets, we used an

existing hardware feature to duplicate all incoming packets into a secondary receive ring.

This secondary ring is populated with pre-allocated pinned buffers. In the absence of

rIOPFs, the duplicated packets stored in the secondary ring are ignored and discarded

by the software. However, when packets are dropped from the primary receive ring due

to a rIOPF, copies of those packets are still written to the secondary ring, allowing us to

avoid packet loss. When a rIOPF occurs, the software is notified and starts collecting

the dropped packets from the secondary ring. The software then waits for the rIOPF to

be resolved before forwarding copies of those packets to the network stack. The copying

is done to allow reusing the pinned buffers of the secondary ring and to improve the

approximation. Since the hardware neither skips faulting receive ring entries nor reports

how many packets are dropped during a rIOPF, packet content matching is used to

detect when to switch back to the primary ring.


void discard_duplicates(bool block) {
    do {
        while (packets_to_discard != discarded_packets &&
               !secondary_ring.is_empty()) {
            secondary_ring.post_buffer(secondary_ring.consume_buffer());
            discarded_packets++;
        }
        // if the caller asked to block, keep discarding duplicates
    } while (block && packets_to_discard != discarded_packets);
}

void sync(void) {
    in_sync = FALSE;
    while (!in_sync && !secondary_ring.is_empty()) {
        p = secondary_ring.dequeue();
        if (!primary_ring.is_empty() &&
            memcmp(primary_ring.peek(), p, len(p)) == 0) {
            unprocessed.enqueue(primary_ring.consume_buffer());
            in_sync = TRUE;
        } else {
            // primary_ring.is_empty() || memcmp(...) != 0
            new_buf = alloc_buf();
            memcpy(new_buf, p, len(p));
            unprocessed.enqueue(new_buf);
        }
        secondary_ring.post_buffer(p);
    }
}

void fast_path(void) {
    while (!primary_ring.is_empty()) {
        unprocessed.enqueue(primary_ring.consume_packet());
        primary_ring.post_buffer(alloc_buf());
        packets_to_discard++;   // from backup
    }
    // throw duplicated packets from backup but don't block:
    discard_duplicates(/*block = */ FALSE);
}

void slow_path(void) {
    clear_rIOPF_flag();
    // block until there are no more duplicates
    discard_duplicates(/*block = */ TRUE);
    // remove all packets from backup
    sync();
    resolve_rIOPF();
    // sync with the primary receive queue
    sync();
}

void poll_RX(void) {
    rIOPF = check_rIOPF_flag();
    fast_path();
    if (rIOPF)
        slow_path();
    if (!unprocessed.is_empty())
        // pass a packet to the network stack
        netif_input(unprocessed.dequeue());
}

Figure 7.5: Pseudo-code for the backup ring approximation.


The pseudo-code of our approximation is shown in Figure 7.5. The main function

of interest is poll_RX(), whose job is to decide which packet should be passed next to

netif_input(), which does the network stack processing. It checks the rIOPF flag

and then executes fast_path(), which re-arms the primary ring and moves all

new packets to a software-maintained unprocessed-packets queue. After removing the

new packets from the primary ring, discard_duplicates() is used to discard copies of

those packets from the secondary ring. This is done in a nonblocking manner to improve

the performance in the absence of rIOPFs. If the rIOPF flag was set before executing

fast_path(), slow_path() is also invoked. It first discards possible remaining

duplicates and calls sync() to drain and re-arm the secondary ring. Next, the rIOPF is

resolved, making the primary ring operational again, and sync() is called once more

to drain and re-arm the secondary ring until it is either empty or we find a matching

packet in the primary ring. Finally, a packet from the unprocessed queue is pushed to

the network stack.


We note that our approximation does not have a clear guest/host separation and

lacks context switches; as a result, it cannot tell us much about the CPU usage of

the real solution. However, we believe it approximates the delayed passing of faulting

packets to the network stack and can teach us about the behavior of the network and

the network stack in the presence of IOPFs.

Figure 7.6 shows the performance of a memcached port that uses our lwIP network

stack. We can see that in the absence of IOPFs, our backup approximation does not

significantly harm performance. We further see that our memcached port, which uses

user space I/O, is about twice as fast as the native version up to a value size of 4 KB.

For large value sizes, the native version benefits significantly from hardware TCP

offloading, which is not supported in lwIP.


[Figure: two panels plotting normalized throughput and throughput [Gb/s] against value size (1B to 256KB) for four configurations: lwip (pin), lwip backup buffer (no pin), linux (pin), and linux w/ offload (pin).]

Figure 7.6: lwIP vs. Linux performance evaluation.


Chapter 8

    Evaluation

    8.1 Methodology

    We evaluate the impact of IOPFs in the context of network devices. Initially, we strive

    to characterize the latency of a single IOPF or a single invalidation event. We then look

at the cold ring problem, measuring its effect and the efficacy of the solution. In §8.4, we measure the impact of an IOPF on the network behavior of the system under synthetic

    load. In the following subsections we evaluate the performance impact of using IOPFs

    in a variety of real-life scenarios. We examine use cases of high performance computing

    (HPC) interconnect fabric, Web 2.0 workloads, and storage systems workloads.

Experimental Setup The setup for the TCP over Ethernet evaluation comprises

two identical Dell PowerEdge R210 II Rack Server machines that communicate through

Mellanox ConnectX-3 40 Gbit/sec Ethernet NICs. The NICs are connected back-to-back.

Each machine has 8GB of 1333MHz memory and a single-socket 4-core Intel Xeon

E3-1220 CPU running at 3.10GHz. The machines run Ubuntu 13.10 with a Linux 3.11.4

kernel modified to support IOPFs.

The HPC and storage experiments used a test cluster with 8 computing nodes. The

nodes were HP ProLiant DL380p Gen8 servers with dual-socket Intel Xeon E5-2697

v2 (Ivy Bridge) CPUs and 128GB of RAM. Each node had a single Connect-IB card

installed. The cluster was connected by a single SwitchX-2 SX6036 switch. The nodes

ran RedHat 7.0 with kernel version 3.10.0-123.el7.x86_64. We used a Connect-IB driver

based upon the driver in the Mellanox OFED 2.4 package.

    Both setups were tuned for performance and to avoid reporting artifacts caused by

nondeterministic events. All power optimizations, namely sleep states (C-states) and dynamic

voltage and frequency scaling (DVFS), were turned off. Hyper-threading was disabled.

    In our backup ring implementation, the receiver duplicates each packet into two

    buffers. The receiver’s PCIe bus becomes a bottleneck compared to the transmitter

    and the network. This asymmetry causes packet loss that disturbs our measurements.

    In order to avoid this, Ethernet flow control [IEE97] was enabled. In addition, this


bottleneck gives the backup ring approach an unfair disadvantage compared to the drop

configuration. To address this issue, our driver duplicates the incoming packets in both

configurations. When working in the drop configuration, the copy is simply discarded.

[Figure: breakdown of minor IOPF handling time (microseconds) for four configurations (kernel=event/user=poll, kernel=poll/user=poll, kernel=event/user=event, kernel=poll/user=event); components: interrupt latency, until IOPF thread starts, read faulting ring entry, get_user_pages()+update IO PT, invalidate IOTLB, pagefault resolve.]

Figure 8.1: Minor IOPF handling breakdown for ConnectX-3.

    8.2 Cost of IOPFs on ConnectX-3

Figure 8.1 shows the results of a micro-benchmark measuring how long it takes to

resolve an IOPF in the ConnectX-3 implementation. We limited our kernel to use only

one CPU. We instrumented the Linux kernel to store timestamps at various points in

the IOPF handling flow and wrote a simple application that does many iterations of

the following: take a timestamp, post a packet to be sent, busy-wait until the

corresponding completion appears in the CQ, retrieve the timestamps recorded by the

kernel, and log all the timestamps (this measurement loop is sketched below). The average of 1 million iterations can be seen

    in the leftmost column of Figure 8.1. The breakdown is as follows. First there is an

    ’interrupt latency’, which includes the time it takes the NIC to identify and raise an

    IOPF interrupt. It also includes a very minor software component of waking up an IOPF

    thread to handle the IOPF. Next, there is a small scheduling delay represented by ’until

    IOPF thread starts’. The IOPF thread then asks the NIC to read the faulting entry

from the ring on its behalf. It then calls get_user_pages() to make sure the pages are

present and updates the I/O page table. Finally, there are two commands, 'invalidate

IOTLB' and 'pagefault resolve', that need to be issued in order to resolve the IOPF.
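
The measurement loop can be sketched as follows. ibv_post_send() and ibv_poll_cq() are the standard IB-verbs calls; struct sample and read_kernel_timestamps() are hypothetical stand-ins for our instrumentation, not actual interfaces of the driver.

#include <infiniband/verbs.h>
#include <x86intrin.h>

struct sample { unsigned long long user_start; /* + kernel-recorded timestamps */ };

// One measurement iteration: timestamp, send, busy-wait on the CQ,
// then collect the timestamps recorded by the instrumented kernel.
static void measure_once(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_send_wr *wr, struct sample *out) {
    struct ibv_send_wr *bad_wr;
    struct ibv_wc wc;

    out->user_start = __rdtsc();          // user-space timestamp
    ibv_post_send(qp, wr, &bad_wr);       // post one packet to be sent
    while (ibv_poll_cq(cq, 1, &wc) == 0)  // busy-wait until the completion appears in the CQ
        ;
    read_kernel_timestamps(out);          // hypothetical: retrieve the kernel's timestamps
}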


After seeing those results, we were a bit surprised that handling a single IOPF

takes so long, as a minor CPU page fault on the same machine takes about 0.3 µs. We

    noticed that a significant portion of the time is spent waiting for the NIC to execute

    the relevant commands. Looking at the code, we noticed that commands are executed

    in an event driven manner. Namely, after a command is issued to the NIC, the issuer

goes to sleep until the NIC sends an interrupt to notify the issuer about the command's

completion. We were concerned that so much time is spent waiting for commands to

complete because the scheduler might introduce an arbitrarily long delay between the

time the command issuer is woken up and the time it receives the CPU.

To see whether our concern was real, we modified the kernel to work in a

polling mode, whereby instead of sleeping after issuing a command, the issuer enters

a busy-wait loop, continuously asking the NIC for the command status until the NIC

reports that it has completed. Scheduling is thereby avoided, at the price of the CPU not doing

any useful work while commands are being issued to the NIC. The results can be seen

in the second column of Figure 8.1. To our surprise, the results were significantly worse.

In particular, the average time until the IOPF thread begins executing was significantly

lengthened. We investigated the issue and found that every so often the scheduler

keeps running our user-level application for a relatively long time after the kernel IOPF

thread is woken up. This behavior did not happen in every iteration of our send

    loop, but when it did happen the resulting delay was long enough to skew the average.

    Because Linux scheduling is quite irrelevant to our research, and because this issue only

    occurs when we run the kernel driver in the non-standard polling mode, we decided

    not to pinpoint the exact scenario that causes this anomaly1. Instead we modified our

    user application to work in an event driven mode. After a packet is queued for sending,

    rather than doing a busy wait loop, our application goes to sleep and asks to be woken

    up when something is posted to the CQ. This way, our application is sleeping while

    the IOPF is handled and does not compete with the IOPF thread for the CPU. The

    anomaly is thus avoided. The results of this modification with the kernel working in

    either polling or event driven mode can be seen in the next two columns of Figure 8.1.

    We can see that using polling in the kernel is indeed slightly faster. However, the

    difference is negligible, justifying the use of the event driven mode, which allows the

    CPU to do useful work while waiting for the NIC to execute a command.
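
In IB-verbs terms, the event-driven user application replaces the busy-wait with a completion channel, roughly as sketched below. This is an illustration of the standard verbs pattern, not our exact benchmark code; it assumes the CQ was created on channel and armed with ibv_req_notify_cq() before the send was posted.

#include <infiniband/verbs.h>

// Wait for a completion without busy-polling: sleep on the CQ's completion
// channel until the NIC signals that something was posted to the CQ.
static int wait_completion(struct ibv_comp_channel *channel) {
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))  // sleep until the CQ event arrives
        return -1;
    ibv_ack_cq_events(ev_cq, 1);
    ibv_req_notify_cq(ev_cq, 0);                     // re-arm for the next completion
    return ibv_poll_cq(ev_cq, 1, &wc);               // reap the completion itself
}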

    8.3 Cost of IOPFs and Invalidations on ConnectX-IB

On ConnectX-IB, we evaluated both the IOPF and the invalidation flows for 4KB

(page size) messages and for larger 4MB messages. Large messages are native to

InfiniBand/RDMA-based communication. They are also applicable to Ethernet when using offloads [FHL+05].

1 However, for real implementations, one might want to take measures to avoid this situation. Namely, it is undesirable for the IOPF handling thread to have to wait for the user application to release the CPU while this application is busy polling the NIC waiting for a packet to be sent.


[Figure: stacked-bar breakdowns of execution time (microseconds) for 4KB and 4MB messages. Panel (a), IOPF flow: trigger interrupt [hw only], os overhead [sw only], update hw PT [sw + hw], resume process [hw only]. Panel (b), invalidation flow: check shadow PT [sw only], update hw PT [sw + hw], update shadow PT [sw only].]

Figure 8.2: (a) IOPF and (b) invalidation flow execution breakdown on ConnectX-IB.

The evaluation was performed using an InfiniBand request-response

micro-benchmark modified to call madvise(..., MADV_DONTNEED) and initialize

the memory that was going to be used for transmitted messages. The call itself

    triggered the invalidation flow, and the send operation that came next triggered the

    IOPF flow. The initialization is important because omitting it would have caused the

    relevant pages to be zeroed out during the IOPF flow. The benchmark ran on a Linux

    kernel that was instrumented using kprobes to store timestamps at points of interest

    along those flows.
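
In code, the per-iteration modification amounts to something like the following. This is a sketch, not the benchmark's exact code; buf and len denote the message buffer and its size.

#include <sys/mman.h>
#include <string.h>

// Drop the pages backing the message buffer, then re-populate them from the CPU.
static void drop_and_reinit(void *buf, size_t len) {
    madvise(buf, len, MADV_DONTNEED);  // invalidates the mapping: triggers the invalidation flow
    memset(buf, 0x5a, len);            // initialization: the CPU faults the pages back in, so the
                                       // subsequent IOPF does not also have to zero them out
}
// The next send operation that references buf then triggers the IOPF flow.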

    Figure 8.2(a) shows the timing breakdown of the IOPF flow. The different message

    sizes allow us to identify what parts of the page-in process are sensitive to the amount

    of data mapped. By comparing the results for the different sizes, we can tell that the

    fixed page-in cost amounts to about 220 µs. The cost per page is about 100 ns, roughly

    2-3 memory accesses. For an IOPF of a single page, the run time is dominated by the

    HW overheads. This is composed of the ‘trigger interrupt’ and the ‘process resume’

    phases. The former is the time it takes the hardware to notice and report an IOPF,


while the latter is the time required for the hardware to resume the transmission of a

ring after the IOPF is resolved. These phases account for 90% of the fixed page-in cost.

    As the number of pages requested in the page fault increases, the ‘os overhead’ phase

    becomes more dominant. During this phase, the driver detects the page fault, reads

    the ring to determine which virtual pages need to be mapped, and asks the OS for the

    physical addresses. Based upon finer profiling, we know that the major time consumer

    is the OS providing the driver with physical page addresses. The increased size of the

    page-in request also prolongs the ‘update hw PT’ phase, in which the hardware page

    table is updated with the appropriate PTEs.

    Figure 8.2(b) shows the timing breakdown of the invalidation flow. First, in the

    ‘check shadow PT’ phase, the software finds the memory region in which the invalidation

    occurred, and scans a shadow copy of the hardware PT to see whether any of the

    mappings were visible to the hardware. If the pages were not mapped through an IOPF,

    this is the only overhead incurred. If the pages were mapped, the driver updates the

    hardware table, in the ‘update hw PT’ phase. Finally, in the ‘update shadow PT’ phase,

the driver updates the shadow copy of the hardware PT to indicate that the pages are no

longer visible to the hardware. Parallelizing the 'update shadow PT' and 'update hw PT'

phases is possible. However, doing so involves a lock-granularity trade-off, which hurts

    the common case.

    8.4 Network Transport and IOPF Interplay

    The page-in latency measured above is relevant to all IOPF-capable devices. We now

    move to measuring phenomena that arise from the interplay between IOPFs and the

    network-specific transport.

    8.4.1 Impact of Periodic IOPFs on Bandwidth

We used a simple stream benchmark to measure the impact of periodic IOPFs on bandwidth.

The benchmark strongly resembles netperf's TCP_STREAM benchmark [J+96]. The

sender side performed 64 KB sends in an infinite loop using a standard Linux TCP stack.

The receiver side ran our lwIP stack and discarded received packets as soon as it received

them, while keeping track of how much data was received. To allow comparison between

Ethernet and InfiniBand, we also used the ib_send_bw InfiniBand micro-benchmark from

    the perftest package as an equivalent InfiniBand stream workload. In order to highlight

    the impact of IOPFs and suppress other memory pressure phenomena, we synthetically

    generated rIOPFs at a variable frequency. The benchmark pre-faulted the entire receive

    ring at start, so that the cold ring problem would not affect the measurements.
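
The sender side is essentially the following loop (a minimal sketch; socket setup is omitted and sock is assumed to be an already-connected TCP socket):

#include <sys/socket.h>

// Sender: 64 KB sends in an infinite loop over a standard Linux TCP socket.
static void stream_sender(int sock) {
    static char buf[64 * 1024];
    for (;;)
        send(sock, buf, sizeof(buf), 0);
}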

We forced a minor IOPF using mprotect: we changed the permission of the relevant

page to read-only and then back to read/write2. The mprotect calls invalidated the

page mapping and forced an IOPF upon the next access.

2We use mprotect, rather than madvise, as here we want to preserve the meta-data in the page.


Triggering a major IOPF is

more complicated. Our implementation used writes to a file opened with O_DIRECT

    to evict pages from the page cache. An mmap of the same file experienced a major

    page fault when accessing the same page following the write. We also had to modify

    lwIP’s code such that it would not touch the relevant page prematurely. LwIP touched

    the page before posting it to the NIC’s receive ring, causing a major page fault on

the CPU, as opposed to a major IOPF. To evaluate the hardware-based

RNR-NACK implementation in a similar manner, we ran an equivalent InfiniBand benchmark:

we modified ib_send_bw to trigger a minor page fault once every X messages.
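
The minor-IOPF trigger boils down to the following mprotect round trip (a sketch; page and page_size denote the targeted receive-buffer page and the system page size):

#include <sys/mman.h>

// Flip the page's protection to read-only and back. The protection changes
// invalidate the device's mapping of the page, so the NIC's next access to it
// raises a minor IOPF, while the data itself stays resident in memory.
static void force_minor_riopf(void *page, size_t page_size) {
    mprotect(page, page_size, PROT_READ);
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}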

[Figure: throughput [Gb/s] as a function of rIOPF frequency (2^-10 to 2^-25). Upper panel (Ethernet): minor drop, major drop, minor backup, major backup. Lower panel (InfiniBand): minor hw.]

Figure 8.3: Throughput of a stream benchmark in the presence of rIOPFs of varying frequencies.

The results are shown in Figure 8.3. Note that due to the different setup, the

    InfiniBand benchmark has a different y-axis. The backup ring approximation significantly

    improves performance for both major and minor page faults. In the case of drop, the type

    of page fault does not matter because the TCP retransmission timeout is significantly

    longer than the time it takes to resolve a major page fault. The hardware implementation,

    shown in the lower figure, notifies the remote sender immediately upon a page fault.

    The notification allows the sender to use a relatively short IOPF-specific timeout,


resulting in significant performance improvement relative to drop. Nevertheless, network

utilization-wise, this solution is less efficient than the backup ring solution.

[Figure: four panels plotted against time [seconds] for the backup, drop, and pinning configurations: (a) throughput [Gbps], (b) cwnd [packets], (c) retransmitted packets, (d) recovered packets.]

Figure 8.4: Transient operation of a TCP stream benchmark over time in the presence of minor rIOPFs.

    Figure 8.4 examines the steady state behavior of the TCP stream benchmark over

    time, given a fixed rIOPF frequency of one in 1M packets. We added a baseline pinning

    configuration in which we run our stream test with no IOPFs at all. Figure 8.4(a) shows

that the IOPF configurations experience a decrease in throughput when a rIOPF occurs.