Realizing the Performance Potential of the Virtual Interface Architecture

Page 1: Realizing the Performance Potential of the Virtual Interface Architecture

Realizing the Performance Potential of the Virtual Interface Architecture

Evan Speight, Hazim Abdel-Shafi, and John K. Bennett

Rice University, Dept. of Electrical and Computer Engineering

Presented by Constantin Serban, R.U.

Page 2: Realizing the Performance Potential of the Virtual Interface Architecture

VIA Goals

• Communication infrastructure for System Area Networks (SANs)

• Targets mainly high speed cluster applications

• Efficiently harnesses the communication performance of underlying networks

Page 3: Realizing the Performance Potential of the Virtual Interface Architecture

Trends

• Peak bandwidth has increased by two orders of magnitude over the past decade, while user latency has decreased only modestly.

• The latency introduced by the protocol is typically several times the latency of the transport layer.

• The problem is especially acute for small messages

Page 4: Realizing the Performance Potential of the Virtual Interface Architecture

Targets

VI architecture addresses the following issues:

• Decrease the latency especially for small messages (used in synchronization)

• Increase the aggregate bandwidth (only a fraction of the peak bandwidth is utilized)

• Reduce the CPU processing spent on message overhead

Page 5: Realizing the Performance Potential of the Virtual Interface Architecture

Overhead

Overhead mainly comes from two sources:

• Every network access requires one or two traps into the kernel – the user/kernel mode switch is time consuming

• Usually two data copies occur:

– From the user buffer to the message passing API

– From the message layer to the kernel buffer

Page 6: Realizing the Performance Potential of the Virtual Interface Architecture

VIA approach

• Remove the kernel from the critical path

– Move communication code out of the kernel into user space

• Provide a zero-copy protocol

– Data is sent from and received directly into the user buffer; no message copy is performed
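The difference between the two-copy path and the zero-copy path can be sketched in a few lines of Python. This is only an analogy, not VIA code: `CopyCounter`, `send_with_copies` and `send_zero_copy` are hypothetical names, and a `memoryview` stands in for the NIC reading the user buffer in place.

```python
class CopyCounter:
    """Counts how many times the payload bytes are duplicated (illustrative only)."""

    def __init__(self):
        self.copies = 0


def send_with_copies(user_buf, counter):
    """Traditional path: user buffer -> message layer -> kernel buffer."""
    msg_layer = bytearray(user_buf)      # copy 1: into the message passing layer
    counter.copies += 1
    kernel_buf = bytes(msg_layer)        # copy 2: into the kernel buffer
    counter.copies += 1
    return kernel_buf                    # what the NIC would transmit


def send_zero_copy(user_buf, counter):
    """Zero-copy path: hand out a view of the user buffer itself."""
    # A memoryview references the same memory, so no payload bytes are duplicated
    # and the counter is left untouched.
    return memoryview(user_buf)
```

With a `bytearray` as the user buffer, a later change to the buffer is visible through the returned view, which shows that both refer to the same memory.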

Page 7: Realizing the Performance Potential of the Virtual Interface Architecture

VIA emerged as a standardization effort from Compaq, Intel, and Microsoft

It was built on several academic ideas:

• The main architecture is most similar to U-Net

• Essential features are derived from VMMC

Among current implementations:

– GigaNet cLan – VIA implemented in hardware

– Tandem ServerNet – VIA emulated by a software driver

– Myricom Myrinet – VIA emulated in firmware

Page 8: Realizing the Performance Potential of the Virtual Interface Architecture

VIA architecture

Page 9: Realizing the Performance Potential of the Virtual Interface Architecture

VIA operations

Set-Up/Tear-Down:

• VIA is a point-to-point, connection-oriented protocol

• VI endpoint: the core concept in VIA

• Register/De-Register Memory

• Connect/Disconnect

• Transmit

• Receive

• RDMA

Page 10: Realizing the Performance Potential of the Virtual Interface Architecture

VIA operations

Set-Up/Tear-Down:

• VIA is a point-to-point, connection-oriented protocol

• VI endpoint: the core concept in VIA

• The VipCreateVi function creates a VI endpoint in user space.

• The user-level library passes the call to the kernel agent, which passes the creation information to the NIC.

• The OS thus controls the application's access to the NIC
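The call path described above (user library, then kernel agent, then NIC) can be sketched as a toy model. None of this is the real VIA library interface: `UserLibrary`, `KernelAgent`, `Nic` and the `allowed_pids` policy are assumptions made for illustration, showing only why routing creation through the kernel lets the OS control access.

```python
class Nic:
    """Hypothetical NIC: remembers which endpoints exist and who owns them."""

    def __init__(self):
        self.endpoints = {}                      # vi_id -> owning process id

    def install_endpoint(self, vi_id, owner_pid):
        self.endpoints[vi_id] = owner_pid


class KernelAgent:
    """Hypothetical kernel agent: the only component that talks to the NIC at setup."""

    def __init__(self, nic, allowed_pids):
        self.nic = nic
        self.allowed_pids = set(allowed_pids)    # assumed access policy
        self._next_id = 0

    def create_vi(self, pid):
        if pid not in self.allowed_pids:
            raise PermissionError(f"process {pid} may not use the NIC")
        vi_id = self._next_id
        self._next_id += 1
        self.nic.install_endpoint(vi_id, pid)    # pass creation info down to the NIC
        return vi_id


class UserLibrary:
    """Hypothetical user-level library: forwards setup calls to the kernel agent."""

    def __init__(self, agent, pid):
        self.agent = agent
        self.pid = pid

    def create_vi(self):
        return self.agent.create_vi(self.pid)
```

Only set-up goes through the kernel agent in this sketch; the data path is deliberately left out, matching the idea that the kernel is kept off the critical path.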

Page 11: Realizing the Performance Potential of the Virtual Interface Architecture

VIA operations - cont’d

Register/De-Register Memory:

• All data buffers and descriptors reside in registered memory

• The NIC performs DMA I/O operations in this registered memory

• Registration pins the pages in physical memory, provides a handle to manipulate the pages, and transfers their addresses to the NIC

• It is performed once, usually at the beginning of the communication session
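A minimal sketch of the registration idea, assuming a hypothetical `RegisteredMemory` table: registering a buffer returns a handle, and the simulated DMA write is refused for anything that is not registered. Real page pinning cannot be shown in plain Python, so the table entry merely stands in for pinned pages.

```python
class RegisteredMemory:
    """Toy registration table; an entry stands in for pages pinned for the NIC."""

    def __init__(self):
        self._regions = {}       # handle -> bytearray
        self._next_handle = 1

    def register(self, buf):
        """Register a buffer once and return the handle used for later transfers."""
        handle = self._next_handle
        self._next_handle += 1
        self._regions[handle] = buf
        return handle

    def deregister(self, handle):
        del self._regions[handle]

    def dma_write(self, handle, offset, data):
        """Simulated NIC DMA: only allowed inside a registered region."""
        if handle not in self._regions:
            raise KeyError("buffer is not registered")
        region = self._regions[handle]
        if offset < 0 or offset + len(data) > len(region):
            raise ValueError("DMA outside the registered region")
        region[offset:offset + len(data)] = data
```

Registering once and reusing the handle mirrors the last bullet: the cost is paid at the start of the session rather than on every message.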

Page 12: Realizing the Performance Potential of the Virtual Interface Architecture

VIA operations - cont’d

Connect/Disconnect:

• Before communication, each endpoint is connected to a remote endpoint

• The connection is passed to the kernel agent and down to the NIC

• VIA does not define any addressing scheme; implementations can use existing schemes

Page 13: Realizing the Performance Potential of the Virtual Interface Architecture

VIA operations - cont’d

Transmit/receive:

• The sender builds a descriptor for the message to be sent. The descriptor points to the actual data buffer. Both the descriptor and the data buffer reside in a registered memory area.

• The application then posts a doorbell to signal the availability of the descriptor. The doorbell contains the address of the descriptor.

• The doorbells are maintained in an internal queue inside the NIC

Page 14: Realizing the Performance Potential of the Virtual Interface Architecture

VIA operations - cont’d

Transmit/receive (cont’d):

• Meanwhile, the receiver creates a descriptor that points to an empty data buffer and posts a doorbell in the receiver NIC queue

• When the doorbell in the sender queue reaches the head of the queue, the NIC follows a double indirection (doorbell to descriptor, descriptor to data buffer) and sends the data into the network.

• The first doorbell/descriptor is picked up from the receiver queue and its buffer is filled with the data
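The transmit/receive steps can be walked through with a small simulation. It is a sketch under stated assumptions, not a VIA implementation: `Descriptor`, `ToyNic` and `deliver` are hypothetical names, descriptor "addresses" are plain integers, and the network is replaced by a direct buffer-to-buffer transfer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    buffer: bytearray      # data buffer the descriptor points to
    length: int = 0        # valid bytes (send) or bytes received (receive)
    done: bool = False     # set once the toy NIC has processed the descriptor


class ToyNic:
    """Hypothetical NIC model: a descriptor table plus two doorbell queues."""

    def __init__(self):
        self.descriptors = {}        # address -> Descriptor ("registered memory")
        self.send_queue = deque()    # doorbells, each holding a descriptor address
        self.recv_queue = deque()
        self._next_addr = 0x1000

    def place(self, desc):
        """Store a descriptor and return the address a doorbell will carry."""
        addr = self._next_addr
        self._next_addr += 0x40
        self.descriptors[addr] = desc
        return addr

    def post_send(self, addr):
        self.send_queue.append(addr)     # ring the send doorbell

    def post_recv(self, addr):
        self.recv_queue.append(addr)     # ring the receive doorbell


def deliver(sender, receiver):
    """Move one message from the head of the send queue to the head of the receive queue."""
    if not sender.send_queue or not receiver.recv_queue:
        return False                     # nothing to send, or no receive posted yet
    # Double indirection: doorbell -> descriptor -> data buffer.
    src = sender.descriptors[sender.send_queue[0]]
    dst = receiver.descriptors[receiver.recv_queue[0]]
    if src.length > len(dst.buffer):
        raise ValueError("receive buffer too small")
    sender.send_queue.popleft()
    receiver.recv_queue.popleft()
    dst.buffer[:src.length] = src.buffer[:src.length]
    dst.length = src.length
    src.done = dst.done = True
    return True
```

In this model a send cannot complete until the receiver has posted a descriptor, which is why the receiver posts its empty buffer ahead of time.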

Page 15: Realizing the Performance Potential of the Virtual Interface Architecture

VIA operations - cont’d

RDMA:

• As a mechanism derived from VMMC, VIA allows Remote DMA operations: RDMA Read and Write

• Each node allocates a receive buffer and registers it with the NIC. Additional structures that contain read and write pointers to the receive buffers are exchanged during connection setup

• Each node can read from and write to addresses on the remote node directly.

• These operations pose potential implementation problems.
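A toy model of the RDMA steps, assuming hypothetical `Node`, `connect`, `rdma_write` and `rdma_read` helpers: each node registers a receive buffer, the handles are exchanged at connection setup, and either side can then access the other's buffer directly. Integer handles stand in for the pointer structures mentioned on the slide.

```python
class Node:
    """Hypothetical cluster node with NIC-registered receive buffers."""

    def __init__(self, name):
        self.name = name
        self.registered = {}      # handle -> bytearray
        self.peer_handle = None   # learned during connection setup
        self._next = 1

    def register(self, size):
        """Allocate a receive buffer and register it; returns its handle."""
        handle = self._next
        self._next += 1
        self.registered[handle] = bytearray(size)
        return handle


def connect(a, a_handle, b, b_handle):
    """Exchange receive-buffer handles between the two nodes at setup time."""
    a.peer_handle, b.peer_handle = b_handle, a_handle


def rdma_write(remote, handle, offset, data):
    """Write directly into the remote node's registered buffer."""
    region = remote.registered[handle]
    if offset < 0 or offset + len(data) > len(region):
        raise ValueError("RDMA write outside the registered buffer")
    region[offset:offset + len(data)] = data


def rdma_read(remote, handle, offset, length):
    """Read directly from the remote node's registered buffer."""
    region = remote.registered[handle]
    if offset < 0 or offset + length > len(region):
        raise ValueError("RDMA read outside the registered buffer")
    return bytes(region[offset:offset + length])
```

The bounds checks hint at one class of implementation problem: the remote side must be protected against accesses outside what it registered.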

Page 16: Realizing the Performance Potential of the Virtual Interface Architecture

Evaluation Benchmarks

• Two VI implementations:

– GigaNet cLan: bandwidth 125 MB/s, latency 480 ns

– Tandem ServerNet: bandwidth 50 MB/s, latency 300 ns

• Performance measured:

– Bandwidth and latency

– Polling vs. blocking

– CPU utilization

Page 17: Realizing the Performance Potential of the Virtual Interface Architecture

Bandwidth

Page 18: Realizing the Performance Potential of the Virtual Interface Architecture

Latency

Page 19: Realizing the Performance Potential of the Virtual Interface Architecture

Latency Polling/Blocking

Page 20: Realizing the Performance Potential of the Virtual Interface Architecture

CPU utilization

Page 21: Realizing the Performance Potential of the Virtual Interface Architecture

MPI performance using VIA

• The challenge is to deliver performance to distributed applications

• Software layers such as MPI usually sit between VIA and the application: they improve usability but add overhead

• How can this layer be optimized to use VIA efficiently?

Page 22: Realizing the Performance Potential of the Virtual Interface Architecture

MPI VIA - performance

Page 23: Realizing the Performance Potential of the Virtual Interface Architecture

MPI observations

• The difference between MPI-UDP and MPI-VIA-baseline is remarkable

• MPI-VIA-baseline still falls dramatically short of VIA-Native

• Several improvements are proposed to bring MPI-VIA closer to VIA-Native by reducing MPI overhead

Page 24: Realizing the Performance Potential of the Virtual Interface Architecture

MPI Improvements

• Eliminating unnecessary copies: the MPI UDP and VIA implementations use a single set of receive buffers, so data must be copied to the application. Improvement: allow the user to register any buffer

• Choosing a synchronization primitive: all synchronization formerly used OS constructs/events. A better implementation uses processor swap instructions

• No acknowledge: remove the message acknowledgment by switching to a reliable VIA mode

Page 25: Realizing the Performance Potential of the Virtual Interface Architecture

VIA - Disadvantages

• Polling vs. blocking synchronization – a tradeoff between CPU consumption and overhead

• Memory registration: locking large amounts of memory makes virtual memory mechanisms inefficient. Registering/deregistering on the fly is slow

• Point-to-point vs. multicast: VIA lacks multicast primitives. Implementing multicast on top of the existing point-to-point mechanism makes communication inefficient
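The polling vs. blocking tradeoff in the first bullet can be illustrated with ordinary Python threads. This is an analogy only: `wait_polling`, `wait_blocking` and `complete_later` are hypothetical helpers, a timer thread stands in for the NIC completing a descriptor, and the spin count is a rough stand-in for CPU consumption.

```python
import threading


def wait_polling(flag):
    """Spin until the completion flag is set; returns the number of wasted checks."""
    spins = 0
    while not flag["ready"]:
        spins += 1               # CPU is busy the whole time, but reacts immediately
    return spins


def wait_blocking(event, timeout=1.0):
    """Sleep until signalled; the CPU is free, but wake-up goes through the OS."""
    return event.wait(timeout)


def complete_later(action, delay=0.01):
    """Stand-in for the NIC finishing a transfer a little later."""
    timer = threading.Timer(delay, action)
    timer.start()
    return timer


if __name__ == "__main__":
    flag = {"ready": False}
    t = complete_later(lambda: flag.update(ready=True))
    print("polling checks:", wait_polling(flag))
    t.join()

    done = threading.Event()
    t = complete_later(done.set)
    print("blocking wait signalled:", wait_blocking(done))
    t.join()
```

The polling count varies from run to run; the point is only that it grows with the wait time, while the blocking version does no work until it is woken.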

Page 26: Realizing the Performance Potential of the Virtual Interface Architecture

Conclusion

• Low latency for small messages. Small messages have a strong impact on application behavior

• Significant improvement over UDP communication (does this still hold after recent hardware TCP/UDP implementations?)

• At the expense of an uncomfortable API