Scalable Networking for Next-Generation Computing Platforms
Yoshio Turner*, Tim Brecht*‡, Greg Regnier§, Vikram Saletore§, John Janakiraman*, Brian Lynn*
*Hewlett Packard Laboratories  §Intel Corporation  ‡University of Waterloo
14 Feb 2004, SAN-3 workshop – HPCA-10
Outline
• Motivation: enable applications to scale to next-generation network and I/O performance on standard computing platforms
• Proposed technology strategy:
  – Embedded Transport Acceleration (ETA)
  – Asynchronous I/O (AIO) programming model
• Web server application evaluation vehicle
• Evaluation plan
• Conclusions
Motivation: Next-Generation Platform Requirements
• Low-overhead packet and protocol processing for next-generation commodity interconnects (e.g., 10 GigE)
  – Current systems: performance impeded by interrupts, context switches, data copies
  – Existing proposals include:
    • TCP Offload Engines (TOE): special hardware; cost and time-to-market issues
    • RDMA: new protocol; requires support at both endpoints
• Increased I/O concurrency for high link utilization
  – I/O bandwidth is increasing
  – I/O latency is fixed or slowly decreasing toward its limit → need a larger number of in-flight operations to fill the pipe
Proposed Technology Strategy
• Embedded Transport Acceleration (ETA) architecture
  – Intel Labs project: prototype architecture dedicates one or more processors to perform all network packet processing -- "Packet Processing Engines" (PPEs)
  – Low-overhead processing: PPE interacts with network interfaces and applications directly via cache-coherent shared memory (bypassing the OS kernel)
  – Application interface: VIA-style user-level communication
• Asynchronous I/O (AIO) programming model
  – Split, two-phase file/socket operations:
    • Post an I/O operation request: non-blocking call
    • Asynchronously receive completion event information
  – High I/O concurrency even for a single-threaded application
  – Initial focus: ETA socket AIO (future extensions to file AIO)
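The two-phase split can be sketched in a few lines of Python. This is purely illustrative (the real DUSI interface is a C-level, shared-memory mechanism); the `AioEngine` class and its method names are invented for this sketch, with a background thread standing in for the PPE.

```python
import io
import queue
import threading

class AioEngine:
    """Toy engine mimicking the post/completion split of socket AIO."""

    def __init__(self):
        self._requests = queue.Queue()   # posted operation requests
        self.events = queue.Queue()      # completion event queue
        threading.Thread(target=self._run, daemon=True).start()

    def post_read(self, tag, fileobj, nbytes):
        # Phase 1: non-blocking post; returns immediately to the caller.
        self._requests.put((tag, fileobj, nbytes))

    def _run(self):
        # The engine (standing in for the PPE) performs the I/O and
        # delivers a completion event carrying the caller's tag.
        while True:
            tag, fileobj, nbytes = self._requests.get()
            self.events.put((tag, fileobj.read(nbytes)))

engine = AioEngine()
engine.post_read("req-1", io.BytesIO(b"hello world"), 5)
tag, data = engine.events.get()          # Phase 2: retrieve completion
```

Because the post never blocks, one thread can keep many operations in flight at once, which is exactly what the concurrency argument above requires.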
Key Advantages
• Potentially enables Ethernet and TCP to approach the latency and throughput of System Area Networks
• Uses standard system processor/memory resources:
  – Automatically tracks semiconductor cost-performance trends
  – Leverages microarchitecture trends: multiple cores, hardware multi-threading
  – Leverages standard software development environments → rapid development
• Extensibility: fully programmable PPE to support evolving data center functionality
  – Unified IP-based fabric for all I/O
  – RDMA
• AIO increases network-centric application scalability
Overview of the ETA Architecture
• Partitioned server architecture:
  – Host: application execution
  – Packet Processing Engine (PPE): all packet processing
• Host-PPE Direct Transport Interface (DTI)
  – VIA/InfiniBand-like queuing structures in cache-coherent shared host memory (OS bypass)
  – Optimized for sockets/TCP
• Direct User Socket Interface (DUSI)
  – Thin software layer to support user-level applications
ETA Overview: Partitioned Architecture
[Figure: partitioned server. Host CPU(s) run user applications and the kernel (legacy sockets, direct access, file system, iSCSI); the PPE runs TCP/IP and the packet driver. Host and PPE communicate through shared memory over the ETA host interface; the PPE connects to the network fabric (LAN, storage, IPC).]
ETA Overview: Direct Transport Interface (DTI) Queuing Structure
• Asynchronous socket operations: connect, accept, listen, etc.
• TCP buffering semantics – an anonymous buffer pool supports non-pre-posted or out-of-order receive packets
[Figure: DTI queuing structure. Shared host memory between the host and the Packet Processing Engine holds, per DTI: a Tx queue, an Rx queue, an event queue, data buffers, an anonymous buffer pool, and doorbells.]
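The anonymous-pool behavior can be illustrated with a small sketch (class and method names invented; the real DTI lives in cache-coherent shared memory): a packet that arrives before any receive descriptor is posted is parked in the anonymous pool, then copied out when the application posts a buffer.

```python
from collections import deque

class Dti:
    """Toy DTI: posted Rx descriptors plus an anonymous pool for
    packets that arrive with no receive buffer posted."""

    def __init__(self):
        self.rx_queue = deque()      # posted receive descriptors
        self.anon_pool = deque()     # early / out-of-order packet data
        self.event_queue = deque()   # completion events

    def post_recv(self, buf):
        # Drain any data already parked in the anonymous pool first.
        if self.anon_pool:
            data = self.anon_pool.popleft()
            buf[:len(data)] = data
            self.event_queue.append(("recv_complete", len(data)))
        else:
            self.rx_queue.append(buf)

    def packet_arrived(self, data):
        # PPE side: use a posted descriptor if one exists,
        # otherwise fall back to the anonymous buffer pool.
        if self.rx_queue:
            buf = self.rx_queue.popleft()
            buf[:len(data)] = data
            self.event_queue.append(("recv_complete", len(data)))
        else:
            self.anon_pool.append(data)

dti = Dti()
dti.packet_arrived(b"early")         # no descriptor posted yet
buf = bytearray(5)
dti.post_recv(buf)                   # drains the anonymous pool
```

This preserves TCP semantics (nothing is dropped just because the application was slow to post), at the cost of one copy for the non-pre-posted path.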
API for Asynchronous I/O (AIO)
• Layer a socket AIO API above the ETA architecture
  – Investigate the impact of AIO API features on application structure and performance
• Initial focus: ETA Direct User Socket Interface (DUSI) API
  – Provides asynchronous socket operations: connect, listen, accept, send, receive
• AIO examples:
  – File/socket: Windows AIO with completion ports, POSIX AIO
  – File I/O: Linux AIO (recently introduced)
  – Socket I/O with OS bypass: ETA DUSI, Open Group Sockets API Extensions
ETA Direct User Socket Interface (DUSI) AIO API
• Queuing structure setup for sockets:
  – One Direct Transport Interface (DTI) per socket
  – Event queues: created separately from DTIs
• Memory registration:
  – Pin user-space memory regions and provide address-translation information to ETA for zero-copy transfers
  – Provide access keys (protection tags)
• Application posts socket I/O operation requests to DTI Tx and Rx work queues
• PPE delivers operation completion events to DTI event queues
• Both operation posting and event delivery are lightweight (no OS involvement)
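The protection-tag idea behind memory registration can be sketched as follows (all names invented for illustration; real registration pins pages and records address translations, which Python cannot express): the application registers a region and gets back an access key, and the engine refuses any transfer that presents an unknown key.

```python
import secrets

class MemoryRegistry:
    """Toy memory-registration table: register a region, get back an
    access key (protection tag); the engine rejects transfers that
    present an unknown key."""

    def __init__(self):
        self._regions = {}   # access key -> registered buffer

    def register(self, buf):
        key = secrets.token_hex(4)     # stands in for a protection tag
        self._regions[key] = buf
        return key

    def transfer_in(self, key, data):
        # Engine side: validate the tag before touching memory.
        if key not in self._regions:
            raise PermissionError("bad protection tag")
        self._regions[key][:len(data)] = data

reg = MemoryRegistry()
buf = bytearray(4)
key = reg.register(buf)
reg.transfer_in(key, b"data")        # valid tag: fill registered region
try:
    reg.transfer_in("bogus", b"data")
except PermissionError:
    denied = True
```

The key point is that validation happens on the engine side, so a buggy or malicious application cannot direct the PPE at memory it never registered.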
AIO Event Queue Binding
• AIO API design issue: assignment of events to event queues
  – Flexible binding enables applications to separate or group events to facilitate operation scheduling
• DUSI: each DTI work queue can be bound, at socket creation, to any event queue
  – Allows separating or grouping events from different sockets
  – Allows separating events by type (transmit, receive)
• Alternatives for event queue binding:
  – Windows: per-socket
  – Linux and POSIX AIO: per-operation
  – Open Group Sockets API Extensions: per-operation-type
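The flexibility of per-work-queue binding can be shown with a toy model (names invented): each socket's tx and rx work queues are bound at creation to arbitrary event queues, here grouping all receive completions from two sockets on one queue and all transmit completions on another.

```python
from collections import deque

class ToySocket:
    """At creation, each work queue (tx, rx) is bound to an
    arbitrary event queue, mimicking DUSI-style binding."""

    def __init__(self, name, tx_eventq, rx_eventq):
        self.name = name
        self.tx_eventq = tx_eventq
        self.rx_eventq = rx_eventq

    def complete_send(self):
        self.tx_eventq.append((self.name, "send_complete"))

    def complete_recv(self):
        self.rx_eventq.append((self.name, "recv_complete"))

# Separate events by type across sockets: one shared queue for all
# receives, another for all transmits.
rx_events, tx_events = deque(), deque()
a = ToySocket("a", tx_events, rx_events)
b = ToySocket("b", tx_events, rx_events)
a.complete_recv()
b.complete_send()
```

Per-socket binding (as in Windows) or per-operation binding (Linux/POSIX AIO) are special cases of this scheme, which is why the flexible form subsumes the alternatives listed above.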
Retrieving AIO Completion Events
• AIO API design issue: application interface for retrieving events
• DUSI: lightweight mechanism bypassing the OS
  – Event queues in shared memory
  – Callbacks: similar to Windows
  – Event tags
• Application monitoring of multiple event queues:
  – Poll for events (acceptable for a small number of queues)
  – If no events are pending, block in the OS on multiple queues
    • Uncommon case in a busy server – acceptable here to use the OS signaling mechanism
    • Useful for simultaneous use of different AIO APIs
  – Race conditions: user-level responsibility
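The poll-then-block pattern can be sketched as follows (a simplified model with invented names; a condition variable stands in for the OS signaling mechanism): the common case is a pure user-level poll over all queues, and only when every queue is empty does the caller fall back to a blocking wait.

```python
import threading
from collections import deque

class EventMux:
    """Poll several event queues; only when all are empty fall back
    to a blocking wait (standing in for the OS signaling path)."""

    def __init__(self, nqueues):
        self.queues = [deque() for _ in range(nqueues)]
        self._cond = threading.Condition()

    def deliver(self, qidx, event):
        with self._cond:
            self.queues[qidx].append(event)
            self._cond.notify()

    def next_event(self):
        while True:
            # Fast path: user-level poll over all queues, no OS call.
            for q in self.queues:
                if q:
                    return q.popleft()
            # Slow path: nothing pending, block until a delivery.
            with self._cond:
                if not any(self.queues):
                    self._cond.wait(timeout=1.0)

mux = EventMux(2)
mux.deliver(1, "recv_complete")
event = mux.next_event()             # found by polling, no blocking
```

Note the re-check of the queues under the lock before waiting: without it, a delivery between the poll and the wait would be missed, which is the kind of race the slide assigns to user-level code.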
AIO for Files and Sockets
• File AIO support
  – OS (e.g., Linux AIO, POSIX AIO)
  – Future: ETA support for file I/O (e.g., via iSCSI or DAFS)
• Unified application processing of file/socket events
  – ETA PPE and OS kernel may both supply event queues
    • Blocking on event queues of different types is facilitated by using the OS signal mechanism (as in DUSI)
    • Unified event queues may be desirable: require efficient coordination of ETA and OS access to event queues
  – Support for zero-copy sendfile(): integration of ETA with OS management of the shared file buffer cache in system memory
Initial Demonstration Vehicle: Web Server Application
• Plan: demonstrate the value of ETA/AIO for network-centric applications
• Initial target: web server application
  – A single request may require multiple I/Os
  – Stresses system resources (especially OS resources)
  – Must multiplex thousands to tens of thousands of concurrent connections
• Web server architecture alternatives:
  – SPED (single-process event-driven)
  – MP (multi-process) or MT (multi-threaded)
  – Hybrid: AMPED (asymmetric multi-process event-driven)
  – The AIO model favors SPED for raw performance
The userver
• Open-source micro web server
• Extensive tracing and statistics facilities
• SPED model – run one process per host CPU
• Previous support for Unix non-blocking socket I/O and event notification via Linux epoll()
• Modified to support socket AIO (eventually file AIO)
  – Generic AIO interface: can be mapped to a variety of underlying AIO APIs (DUSI, Linux AIO, etc.)
• Comparison: web server performance with and without the ETA engine
  – With standard Linux: processes share the file buffer cache, using sendfile() for zero-copy file transfer
  – With ETA: mmap() files into a shared address space
Web Server Event Scheduling
• Balance accepting new connections with processing of existing connections
• Scheduling:
  – Separate queues for accept(), read(), and write()/close() completion events
  – Process based on current queue lengths
• Early results with non-blocking I/O: accept-processing frequency
[Figure: throughput impact of the frequency of accepting new connections]
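One simple instance of queue-length-based scheduling is longest-queue-first; the talk only says processing is based on current queue lengths, so this concrete policy (and the queue names) is an illustrative assumption.

```python
from collections import deque

def schedule(queues):
    """Pick the completion queue to service next: longest queue
    first, so neither accepts nor existing connections starve."""
    name = max(queues, key=lambda n: len(queues[n]))
    return name if queues[name] else None

queues = {
    "accept": deque(["c3"]),
    "read": deque(["c1", "c2"]),
    "write_close": deque(),
}
first = schedule(queues)     # "read" has the most pending events
```

Keeping accept events on their own queue is what makes the accept rate a tunable knob at all: the scheduler can throttle or favor it independently of read and write/close processing.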
Evaluation Plans
• Goal: evaluate the approach; compare to design alternatives
• Construct a functional prototype of the proposed stack (Linux)
  – Extend the existing ETA prototype kernel-level interface to user level with OS bypass (DUSI)
  – Extend the userver to use socket AIO, with a mapping layer to DUSI
  – Evaluate on a 10 GigE-based client/server setup using a SPECweb-type workload
• Current ETA prototype: promising kernel-level micro-benchmark performance
• Expectation: ETA + AIO will show significantly higher scalability than the existing Linux network implementation
Proposed Stack/Comparison
[Figure: proposed stack. User level: uServer with an AIO path (AIO mapping layer over the ETA Direct User Sockets Interface, DUSI) compared against a sockets path (Linux sockets library). Kernel level: Linux UDP/TCP/RAW over IP, the ETA kernel agent, and the ETA Packet Processing Engine with its packet driver over the network interfaces. The DTI data path and a control path connect user level to the PPE.]
Kernel-Level ETA Prototype
[Figure: two charts comparing Linux and ETA. Left: transmit – idle CPU fraction (0.00–1.20) and throughput in Gigabits (0–3) vs. transmit size in bytes (256, 512, 1024, 4096, 65536). Right: receive – idle CPU fraction (0.00–1.20) and throughput in Gigabits (0–2.5) vs. receive size in bytes (same sizes). Series: Linux idle CPU, ETA idle CPU, Linux rate, ETA rate.]
Evaluation Plans: Analyses and Comparisons
• Compare the proposed stack to a well-tuned conventional system: checksum offload, TCP segmentation offload, interrupt moderation (NAPI)
• Examine micro-architectural impacts: use VTune/OProfile to measure CPU, memory, and cache usage; interrupts; data copies; context switches
• Comparison to TOE
• Extend the analysis to application domains beyond the web server: e.g., storage, transaction processing
• Port a highly scalable user-level threading package (UC Berkeley Capriccio project) to ETA
  – Benefit: familiar threaded programming model with efficient "under the hood" AIO and OS bypass
Summary
• Proposed a technology strategy combining ETA and AIO to enable industry-standard platforms to scale to next-generation network performance
• Cost-performance, time-to-market, and flexibility advantages over alternative approaches
• Ethernet/TCP can approach the performance of today's SANs – toward a unified data center I/O fabric based on commodity hardware
• Status:
  – Promising initial experimental results for kernel-level ETA
  – Prototype implementation of the proposed stack nearly complete
  – Testing environment setup based on 10 GigE