Low Latency Networking slides
![Page 1: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/1.jpg)
Low Latency Networking
Glenford Mapp
Digital Technology Group
Computer Laboratory
http://www.cl.cam.ac.uk/Research/DTG/~gem11
![Page 2: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/2.jpg)
What is Latency?
• The time taken to send a unit of data between two points in a network
• A low latency network is one in which the design of the hardware, systems and protocols is geared towards minimizing the time taken to move units of data between any two points on that network
![Page 3: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/3.jpg)
Throughput
• Number of bytes of data transferred per second between two points
• Doesn’t high throughput imply low latency?
• Not necessarily
  – A bus vs a car travelling along a section of road
• Which has the higher throughput?
• Which has the lower latency?
![Page 4: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/4.jpg)
Throughput vs Latency
• In simplest form:
  – Throughput ~ C / Latency
  – C = instantaneous capacity
    • Number of units that are handled per operation
• So if C is large you can get good throughput even if your latency is not low
• Low latency does not necessarily imply high throughput if C also gets smaller
  – ATM is a good example
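The relation Throughput ~ C / Latency can be tried out with a few made-up numbers. This is only an illustrative sketch: the function name `throughput` and the bus and car figures are assumptions, not values from the slides.

```python
def throughput(capacity_units: float, latency_s: float) -> float:
    """Units delivered per second: Throughput ~ C / Latency."""
    return capacity_units / latency_s


# Hypothetical figures for one trip along the same section of road.
BUS_CAPACITY, BUS_LATENCY_S = 50, 600.0   # 50 passengers, 10-minute trip
CAR_CAPACITY, CAR_LATENCY_S = 4, 300.0    # 4 passengers, 5-minute trip

bus_throughput = throughput(BUS_CAPACITY, BUS_LATENCY_S)
car_throughput = throughput(CAR_CAPACITY, CAR_LATENCY_S)

if __name__ == "__main__":
    print(f"bus: {bus_throughput:.3f} passengers/s, latency {BUS_LATENCY_S:.0f} s")
    print(f"car: {car_throughput:.3f} passengers/s, latency {CAR_LATENCY_S:.0f} s")
```

With these assumed figures the bus wins on throughput because its C is large, while the car still has the lower latency.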
![Page 5: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/5.jpg)
Throughput Claims
• Look carefully at high throughput claims.
  – Have they decreased the latency?
    • Per unit operation is faster
    • Software -> Hardware (ATM)
  – Have they increased instantaneous capacity?
    • Serial -> Parallel; Parallel -> Serial
• In most designs we have a mixture of both
  – Manufacturers will generally allow increased latency if capacity greatly increases
![Page 6: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/6.jpg)
Who cares about latency?
• Why is latency important?
• Some applications are more affected by latency than by throughput
  – Voice
    • Also affected by jitter
  – Networked Games
  – Interactive sessions
![Page 7: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/7.jpg)
Lessons from Computers
• Consider the Mainframe in the time-sharing era. 1963-1976
• Studies showed that user productivity was halved when the mainframe’s response time increased from 0.5 to 3 seconds
• Mainframe optimised for throughput
  – Maximize the number of people using it
• High throughput
![Page 8: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/8.jpg)
Lessons from Computers
• But the more people logged on, the slower the machine became; by noon the response time would increase markedly and user productivity would fall
• Key factor in the development of PCs
• Famous saying
  – I love the Alto (first PC) because it does not run faster at night!
![Page 9: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/9.jpg)
A look at the Internet
• Not really designed for low latency
• Designed to be adaptable and robust
• But the new applications we want the Internet to support need low latency
  – Web servers
  – Voice over IP
  – Networked Games, etc
![Page 10: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/10.jpg)
Components of Network Latency
• Hardware
  – Different hardware capacities and limitations
    • Ethernet – variable packet size; max 1500 bytes
    • ATM – uses fixed 53-byte cells
• Network Routers and Switches
  – Queueing strategies
  – Overload/Congestion strategy
![Page 11: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/11.jpg)
Components of Network Latency
• System Latency
  – Moving the packet between the application and the network interface
  – OS latency
    • The operating system handling the packet
  – Application Latency
    • Application must acquire resources (e.g. CPU) in order to send or consume data
![Page 12: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/12.jpg)
Traditional Networking – A closer look
• Look at a packet being received by the host machine and delivered up to the application
• At the lowest level, packet enters the network interface card (NIC) – ends up in a buffer or fifo on the card. Card generates an interrupt.
![Page 13: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/13.jpg)
Traditional Networking cont’d
• Interrupt Handler runs, data is moved into a system buffer in main memory.
• Packet is placed on a receive queue
  – In Linux there is one network receive queue
    • Packets from all the network interfaces are placed on that queue
• Packet is marked for system processing
  – Interrupt Handler ends
![Page 14: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/14.jpg)
Traditional Networking cont’d
• System processing
  – Packet is taken up the protocol stack
    • IP processing; TCP processing
  – Connection information associated with the packet is used to find the corresponding socket
    • Socket ~ Src (IPaddr, TCP port), Dest (IPaddr, TCP port)
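The socket lookup step can be sketched as a table keyed by the source and destination (address, port) pairs. This is a minimal illustration, not how any particular kernel does it: `ConnKey`, `SocketTable`, the addresses and the socket name are all hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class ConnKey(NamedTuple):
    """Source and destination (IP address, TCP port) of one connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class SocketTable:
    """Hypothetical lookup table from connection 4-tuple to a socket name."""

    def __init__(self) -> None:
        self._sockets: Dict[ConnKey, str] = {}

    def register(self, key: ConnKey, socket_name: str) -> None:
        self._sockets[key] = socket_name

    def demultiplex(self, key: ConnKey) -> Optional[str]:
        # Returns None when no socket matches the packet's connection information.
        return self._sockets.get(key)


table = SocketTable()
table.register(ConnKey("10.0.0.1", 40000, "10.0.0.2", 80), "web-socket")
```

Every incoming packet pays for this lookup before it can be queued on a socket, which is part of the protocol-level cost of the traditional path.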
![Page 15: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/15.jpg)
Traditional Networking cont’d
• Queue the packet on the socket structure and see if any application threads are waiting for incoming data
• If so, copy the data from system buffer to the user buffer and wake up the thread
• Application has to wait until it gets the CPU to consume data
![Page 16: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/16.jpg)
Analysis of Traditional Networking
• Interrupt systems – potentially infinite latency
  – Processing of packets in the queue is affected by the rate of incoming packets
• Copying data adds to latency
• OS sits between two worlds
  – It de-multiplexes the packet and decides its final destination
  – It also ensures that the relevant application is scheduled to receive the data. This is called application synchronisation
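How the arrival rate affects packet latency can be seen in a very small queue model. The sketch below is an assumption-laden toy (one FIFO queue, fixed arrival gap, fixed processing time); the function name and the numbers are invented for illustration.

```python
from typing import List


def packet_delays(arrival_gap: float, service_time: float, n_packets: int) -> List[float]:
    """Delay (arrival to end of processing) of each packet on a single FIFO queue."""
    delays = []
    free_at = 0.0                      # time at which the processor is next idle
    for i in range(n_packets):
        arrival = i * arrival_gap
        start = max(arrival, free_at)  # wait behind earlier packets if necessary
        free_at = start + service_time
        delays.append(free_at - arrival)
    return delays


# Packets arrive more slowly than they are processed: delay stays constant.
light_load = packet_delays(arrival_gap=2.0, service_time=1.0, n_packets=5)
# Packets arrive faster than they are processed: delay grows with every packet.
overload = packet_delays(arrival_gap=1.0, service_time=2.0, n_packets=5)
```

In the overloaded case the delay keeps growing for as long as the arrivals continue, which is the sense in which the latency is potentially unbounded.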
![Page 17: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/17.jpg)
[Diagram labels: Application Layer, Socket Interface, Socket layer in OS, System Buffers, NIC, Network]
![Page 18: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/18.jpg)
Cross Talk Issues
• Interrupt level
  – While an application is running on the processor, network interrupts occur on incoming packets for other processes.
• Protocol level
  – Packets for all applications are multiplexed and de-multiplexed in the kernel
• Application level
  – All applications must share resources, so sometimes I must wait a long time before I get the processor.
![Page 19: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/19.jpg)
Some ways to improve Traditional Networking
• User level network interfaces
  – UNET - Matt Welsh (1995-1998)
• Zero copy architectures
  – Virtual memory mapping techniques
• Vertical Partitioning of Operating Systems
![Page 20: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/20.jpg)
UNET
• Application has an interface to talk directly to a network device
• Doesn’t involve the kernel in things like protocol processing, etc.
• Uses per application message queues to send and receive data
• Novel idea at the time – complicates what applications need to do
![Page 21: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/21.jpg)
UNET Endpoint
[Diagram labels: Communication segment, Send queue, Free queue, Recv queue]
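An endpoint with a communication segment and send, receive and free queues might be modelled as follows. This is a simplified sketch, not the UNET implementation: the class `Endpoint`, its method names and the buffer sizes are assumptions, and the "NIC" side is simulated in software.

```python
from collections import deque


class Endpoint:
    """Toy model of a user-level endpoint: one buffer area plus three queues."""

    def __init__(self, n_buffers: int = 4, buf_size: int = 64) -> None:
        self.buf_size = buf_size
        self.segment = bytearray(n_buffers * buf_size)  # communication segment
        self.free_q = deque(range(n_buffers))  # buffers the simulated NIC may fill
        self.recv_q = deque()                  # (buffer index, length) of received data
        self.send_q = deque()                  # payloads waiting for the simulated NIC

    # --- application side ---
    def send(self, payload: bytes) -> None:
        self.send_q.append(payload)

    def receive(self):
        if not self.recv_q:
            return None
        idx, length = self.recv_q.popleft()
        start = idx * self.buf_size
        data = bytes(self.segment[start:start + length])
        self.free_q.append(idx)                # hand the buffer back for reuse
        return data

    # --- simulated device side ---
    def nic_deliver(self, payload: bytes) -> bool:
        if not self.free_q or len(payload) > self.buf_size:
            return False                       # no usable buffer: message is dropped
        idx = self.free_q.popleft()
        start = idx * self.buf_size
        self.segment[start:start + len(payload)] = payload
        self.recv_q.append((idx, len(payload)))
        return True

    def nic_transmit(self):
        return self.send_q.popleft() if self.send_q else None
```

Note that the application, not the kernel, has to manage the queues and recycle buffers, which is the extra complexity the slides mention.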
![Page 22: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/22.jpg)
Zero-Copy Architecture
• No need to copy data up to the application
• DMA from network buffers in NIC card straight into system buffers
• Use VM techniques to map the relevant system buffers into the address space of the application
![Page 23: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/23.jpg)
Vertical Partitioning of the OS
• So UNET gave applications an abstract network card so there was less multiplexing of data.
• Why not go all the way and do more partitioning of OS resources
• So CPU is carefully partitioned, file systems and disk devices also carefully partitioned
![Page 24: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/24.jpg)
Pegasus project - Cambridge
• Studied system support for multimedia applications
• Developed a new operating system called Nemesis which adopted a vertical approach
  – Most of the operating system functions were in shared libraries which executed in the user’s process space
  – System-wide page table, so no copying
![Page 25: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/25.jpg)
Vertical Approach
[Diagram labels: Processes, Shared Libraries, Normal OS]
![Page 26: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/26.jpg)
Why haven’t these ideas been universally implemented?
• Some were explored
  – VIA is a hardware idea based on UNET
  – Replace PCI bus
  – Devices have receive, send and completion queues and are connected along a high-speed serial bus
  – One or two products out there but fell out of favour
• Infiniband - now popular – extension of VIA
![Page 27: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/27.jpg)
Ideas not universal
• Zero copy and VM ideas explored in some operating systems, e.g. the Spring OS by Sun. Some ideas made their way into Solaris, and into Windows 2000 and XP via Mach and NT
• Nemesis was too radical for prime time– QoS ideas have been taken up by others
![Page 28: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/28.jpg)
But the real reason was..
• That processor and network speeds have been increasing fast enough to keep traditional networking in the picture.
• If you simply want to browse the Web and read email, then it is OK
• However, there is a looming problem
![Page 29: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/29.jpg)
Network speeds still going up!
• We have gone from 10 Mbps in 1987 to 10 Gbps in 2004 and beyond.
• Processor will not be able to keep up
  – Interrupt rate is phenomenal
• Buses like the PCI bus cannot keep up
  – Move to PCI Express (Switch Fabric)
• Workstation can presently saturate the network but the tide is rapidly turning!
• Network traffic will soon be able to cripple your PC
![Page 30: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/30.jpg)
Need a system that is less interrupt-dependent
• Two main approaches
  – No OS processing whatsoever
    • including no interrupts
    • data is moved by hardware
    • OS is used to set up where the data is moved to
  – Apply more processing power but target it on the network interface
![Page 31: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/31.jpg)
Shared Memory Model
• Data transfer is accomplished by writing to memory addresses in the local address space of the process
• This data is captured by the local network card and serialized into packets which are transferred over the network to the remote machine which writes the data to remote addresses.
![Page 32: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/32.jpg)
How does it actually work?
• A region of the local address space of the process is mapped to an IO region on the card. That mapping is usually made using standard memory-mapping techniques.
  – In Unix the mmap call is used.
• Same thing is done on the remote side
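The memory-mapping step can be tried out with Python's `mmap` module. In this sketch a temporary file stands in for the card's IO region (an assumption, since no such card is available here), and two mappings of that file play the roles of the writing and reading sides; the function name is hypothetical.

```python
import mmap
import os
import tempfile


def demo_shared_mapping(size: int = 4096) -> bytes:
    """Write through one mapping and read the same bytes through another."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)          # give the backing file a fixed size
        writer = mmap.mmap(fd, size)    # shared mapping used for writing
        reader = mmap.mmap(fd, size)    # second mapping of the same region
        writer[0:5] = b"hello"          # an ordinary memory write, no send call
        seen = bytes(reader[0:5])
        writer.close()
        reader.close()
        return seen
    finally:
        os.close(fd)
        os.remove(path)
```

The point of the model is that the transfer is just a store to a mapped address; with a real shared-memory NIC the second mapping would live on the remote machine.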
![Page 33: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/33.jpg)
Shared Memory Model
[Diagram labels: Process VM, NIC, packets, NIC, Process VM]
![Page 34: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/34.jpg)
How is the association between the local and remote regions made?
• Fixed
  – In early SMMs, it was fixed.
  – All processors on the network share the same region.
• Flexible
  – Needs a communications channel to set up the mapping between regions
![Page 35: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/35.jpg)
Fixed SMM
[Diagram labels: Process VM space; Proc A, Proc B, Proc C, Proc D]
![Page 36: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/36.jpg)
Dynamic SMM
[Diagram labels: Process VM space; Proc A, Proc B, Proc C, Proc D]
![Page 37: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/37.jpg)
SMM
• Been around a long time
  – Used to communicate between processors in a cluster.
• The SMM is divided into pages, some of which can be mapped between two processes while the rest can be mapped globally
![Page 38: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/38.jpg)
Problems with SMM
• Since no interrupts are involved and the OS is no longer in the loop, it’s hard to inform the remote node that data has been sent and is waiting to be read
• Major problem is therefore not the transfer, but application synchronization
![Page 39: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/39.jpg)
Application Synchronization Solutions
• Polling:
  – the receiver keeps polling certain addresses to see if a data transfer has occurred
  – This is expensive (wasting local CPU) and only relevant if there is a real chance of a data transfer.
  – Could be used to provide a form of distributed synchronization - spinning on a remote address
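Polling on an address can be sketched with a shared byte array and a flag byte. This is only a model: a `bytearray` stands in for the mapped region, a thread stands in for the remote writer, and the names `poll_flag`, `demo_polling` and the flag offset are assumptions.

```python
import threading
import time


def poll_flag(region: bytearray, flag_offset: int, timeout_s: float = 5.0) -> int:
    """Spin until the flag byte becomes 1; return the number of polls, or -1 on timeout."""
    deadline = time.monotonic() + timeout_s
    polls = 0
    while time.monotonic() < deadline:
        polls += 1
        if region[flag_offset] == 1:
            return polls
    return -1


def demo_polling():
    region = bytearray(16)          # stands in for the mapped data region
    flag_offset = 15                # hypothetical "data ready" byte

    def remote_writer():
        time.sleep(0.01)            # the transfer arrives a little later
        region[0:4] = b"data"       # payload first ...
        region[flag_offset] = 1     # ... then the flag

    writer = threading.Thread(target=remote_writer)
    writer.start()
    polls = poll_flag(region, flag_offset)
    writer.join()
    return polls, bytes(region[0:4])
```

The returned poll count is typically large even for a short wait, which is the CPU cost the slide warns about.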
![Page 40: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/40.jpg)
Application Synchronization Solutions
• VM signalling
  – Pagefault or access violations
  – Example: page is only mapped locally when there is data to be read. If I access the page when there is no data, then a pagefault occurs and I am blocked until the owner writes to the page
![Page 41: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/41.jpg)
VM Signalling
• If I wish to read and there is data to be read then the page is mapped into my address space read-only.
• If I attempt to write to the page, a pagefault occurs and I am blocked until I can acquire the write lock for the page
• Not scalable, too closely coupled to the VM system
![Page 42: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/42.jpg)
Out-of-Band signaling
• Use a separate channel outside the data transfer region to signal that data has been transferred.
• For example, writing to a special set of addresses would cause an interrupt to be generated at the remote end
![Page 43: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/43.jpg)
Out-of-Band Signalling
• So you would transfer the data by writing to your local address
• Afterwards you would write to a special address associated with that memory region
• An interrupt occurs on the other side and the OS works out which buffer you are referring to and wakes up the waiting process
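The three steps above can be modelled in a few lines. In this sketch a `threading.Event` per region stands in for the remote interrupt and the OS wake-up; the class `RemoteNode`, its method names and the region sizes are all hypothetical.

```python
import threading


class RemoteNode:
    """Toy receiver: data regions plus one 'interrupt' event per region."""

    def __init__(self, n_regions: int, region_size: int = 64) -> None:
        self.regions = [bytearray(region_size) for _ in range(n_regions)]
        self.lengths = [0] * n_regions
        self._ready = [threading.Event() for _ in range(n_regions)]

    # --- sender side ---
    def write_data(self, region_id: int, payload: bytes) -> None:
        if len(payload) > len(self.regions[region_id]):
            raise ValueError("payload larger than region")
        self.regions[region_id][:len(payload)] = payload
        self.lengths[region_id] = len(payload)

    def write_signal_address(self, region_id: int) -> None:
        # Stands in for the interrupt: wake whoever waits on this region.
        self._ready[region_id].set()

    # --- receiver side ---
    def wait_for_data(self, region_id: int, timeout_s: float = 5.0):
        if not self._ready[region_id].wait(timeout_s):
            return None
        self._ready[region_id].clear()
        return bytes(self.regions[region_id][:self.lengths[region_id]])


def demo_out_of_band() -> bytes:
    node = RemoteNode(n_regions=2)
    result = []
    receiver = threading.Thread(target=lambda: result.append(node.wait_for_data(1)))
    receiver.start()
    node.write_data(1, b"hello")     # step 1: transfer the data
    node.write_signal_address(1)     # step 2: write to the special address
    receiver.join()                  # step 3: the waiting process has been woken
    return result[0]
```

The receiver blocks instead of spinning, but the wake-up path goes through the processor and operating system on the remote side.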
![Page 44: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/44.jpg)
Out-of-Band Signalling
• Out-of-Band Signalling still involves the processor to achieve application synchronization
• Adds to the overall transfer latency
  – Ex. Memory Channel
    • data transfer 2.9 us
    • acquire spin lock 120 us
• Increases the expense of the NIC
![Page 45: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/45.jpg)
History of SMM
• Used to be extremely proprietary
• DEC Memory Channel best known
  – Used a fixed shared memory region of 512 MB divided into 64K pages, each page being 8 KB
  – Very versatile, can share pages between one or more processes. Uses broadcast facilities
  – Average latencies 10-25 us
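The Memory Channel page arithmetic can be checked directly. The sketch assumes binary prefixes (1 MB = 2^20 bytes, 8K = 8192 bytes); the helper `page_of` is hypothetical and only shows how an offset in the region splits into a page number and an offset within the page.

```python
REGION_BYTES = 512 * 1024 * 1024   # 512 MB shared region
PAGE_BYTES = 8 * 1024              # 8 KB pages

N_PAGES = REGION_BYTES // PAGE_BYTES


def page_of(offset: int) -> tuple:
    """Split a byte offset in the region into (page number, offset within page)."""
    if not 0 <= offset < REGION_BYTES:
        raise ValueError("offset outside the shared region")
    return divmod(offset, PAGE_BYTES)
```

Under these assumptions `N_PAGES` is 65536, i.e. the 64K pages quoted on the slide.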
![Page 46: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/46.jpg)
SCI - Scalable Coherent Interface
• IEEE Standard 1596-1992
• Uses high speed unidirectional links
  – Parallel links 16 bits, 500 MHz (8 Gbit/s)
  – Serial G-Link technology (1 Gbit/s)
• Packet-based transfer
  – header = 16 bytes; data = 0, 16, 64 or 256 bytes
  – queue and signal interrupts
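The SCI figures above lend themselves to a quick calculation. This sketch assumes one 16-bit transfer per clock cycle for the parallel link and counts only the 16-byte header as overhead (any other per-packet overhead is ignored); the function names are made up for the example.

```python
def link_rate_bits_per_s(width_bits: int, clock_hz: float) -> float:
    """Raw link rate, assuming one transfer of `width_bits` per clock cycle."""
    return width_bits * clock_hz


def payload_efficiency(data_bytes: int, header_bytes: int = 16) -> float:
    """Fraction of each packet that is payload rather than header."""
    return data_bytes / (data_bytes + header_bytes)


parallel_link = link_rate_bits_per_s(16, 500e6)          # 16 bits at 500 MHz
efficiencies = {n: payload_efficiency(n) for n in (0, 16, 64, 256)}
```

Under these assumptions the parallel link comes to 8 Gbit/s, and a 64-byte payload gives 80% efficiency, rising to about 94% for 256 bytes.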
![Page 47: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/47.jpg)
SCI cont’d
• Can do cache-coherency (optional)
• Latency < 10 us
• Modern cards use 64-bit, 66 MHz buses (5.33 Gbit/s)
• Big player: Dolphin Interconnect
  – Sun uses their boards to build megaservers
![Page 48: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/48.jpg)
Processor Intensive Approach PIA
• We offload networking by using a processor on the NIC
• Myrinet - best-known exponent
  – Full duplex data links 2 Gbit/s
  – Bus: 64-bit 133 MHz PCI-X
  – PC - 255 MHz RISC & Memory
![Page 49: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/49.jpg)
Myrinet cont’d
• Packet-based
  – Header, packet type, payload
• Host Computer controls the NIC
  – runs an MCP program
• Myrinet controls around 39% of the cluster market
![Page 50: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/50.jpg)
Performance
• Latency around 6.3 us
  – Climbs to over 100 us for messages over 10000 bytes
• One-way throughput 248 MB/s
  – Messages over 1000 bytes
• Two-way throughput 489 MB/s
  – Messages over 10000 bytes
• Throughput between Unix processes on different hosts
  – 1.98 Gbit/s (uni), 3.9 Gbit/s (bi)
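The MB/s and Gbit/s figures on this slide are the same measurements in different units, as a short conversion shows. The sketch assumes decimal units (1 MB = 10^6 bytes, 1 Gbit = 10^9 bits); the function name is invented for the example.

```python
def mbytes_per_s_to_gbits_per_s(mbytes_per_s: float) -> float:
    """Convert MB/s to Gbit/s using decimal units."""
    return mbytes_per_s * 8 / 1000


one_way_gbits = mbytes_per_s_to_gbits_per_s(248)   # one-way figure from the slide
two_way_gbits = mbytes_per_s_to_gbits_per_s(489)   # two-way figure from the slide
```

This gives roughly 1.98 Gbit/s one way and 3.9 Gbit/s two way, matching the last bullet.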
![Page 51: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/51.jpg)
Comparing SCI and Myrinet
• Latencies are about the same
• SCI much faster for clusters of 8 or fewer
  – but slows exponentially as the number of PCs increases
• Myrinet is better for large systems (> 64)
• Software appears more complete with Myrinet
![Page 52: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/52.jpg)
Recent developments in Low Latency Systems
• Collapsed LAN project (CLAN)
  – 1997 - 2002, AT&T Laboratories-Cambridge
  – project originally centred around using fibre technology throughout the building
  – remoting PCs; just have mouse, keyboard and display in your office and put the PC in the server room
  – bought some SCI cards and got some systems going
![Page 53: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/53.jpg)
CLAN project
• Faced the application synchronization problem
• Came up with a novel solution called Tripwire
  – in-band synchronization
  – an event is signalled on the receiver when data is written to a special address in the data region during the data transfer
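The Tripwire idea can be modelled as a data region that fires a callback when a write covers a watched address. This is a software sketch for illustration only, not the CLAN design: the class `TripwireRegion`, its methods and the addresses used are assumptions.

```python
from typing import Callable, Dict


class TripwireRegion:
    """Data region that signals an event when a watched address is written."""

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)
        self._tripwires: Dict[int, Callable[[int], None]] = {}

    def set_tripwire(self, address: int, callback: Callable[[int], None]) -> None:
        self._tripwires[address] = callback

    def write(self, address: int, payload: bytes) -> None:
        end = address + len(payload)
        if address < 0 or end > len(self.mem):
            raise ValueError("write outside the data region")
        self.mem[address:end] = payload
        # The check is part of the data transfer itself: no separate signal channel.
        for watched, callback in list(self._tripwires.items()):
            if address <= watched < end:
                callback(watched)
```

A receiver would typically place the tripwire on the last byte it expects, so that the event fires once the whole message has arrived, with no polling and no second write by the sender.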
![Page 54: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/54.jpg)
Tripwire
[Diagram labels: Processes, Tripwire]
![Page 55: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/55.jpg)
CLAN Project
• Applications can therefore set Tripwires and be notified when they fire
  – no spinning, no extra hardware for out-of-band signalling
• Latency:
  – DWORD - RTT = 3.7 us
  – 1 KB IP transfer - 225 Mbit/s, RTT = 100 us
  – Throughput 910 Mbit/s on a 33 MHz, 32-bit bus
![Page 56: Low Latency Networking slides](https://reader034.fdocuments.us/reader034/viewer/2022042601/54b759324a795905078b4605/html5/thumbnails/56.jpg)
Will Low Latency ever make it into the Mainstream?
• Some low latency 1 Gigabit/s NICs on the market
• Unfortunately 1 Gigabit/s market is now in the commodity phase.
• Real battle is shaping up in the 10 Gbit/s market
  – CLAN project -> Level 5 Networks -> Solarflare