Institute of Computer Science
Foundation for Research and Technology – Hellas, Greece
Computer Architecture and VLSI Systems Laboratory

Exploiting Spatial Parallelism in Ethernet-based Cluster Interconnects

Stavros Passas, George Kotsis, Sven Karlsson, and Angelos Bilas
Motivation

Typically, clusters today use multiple interconnects:
- Interprocess communication (IPC): Myrinet, InfiniBand, etc.
- I/O: Fibre Channel, SCSI
- Fast LAN: 10 GigE

However, this increases system and management cost.
Can we use a single interconnect for all types of traffic? Which one?
High network speeds: 10-40 Gbit/s
Trends and Constraints

Most interconnects use a similar physical layer, but differ in:
- The protocol semantics and guarantees they provide
- The protocol implementation on the NIC and in the network core

Higher-layer protocols (e.g. TCP/IP, NFS) are independent of the interconnect technology.
10+ Gbps Ethernet is particularly attractive, but…
- It is typically associated with higher overheads
- It requires more support at the edge, due to its simpler network core
This Work

How well can a protocol do over 10-40 GigE?
- Scale throughput efficiently over multiple links
- Analyze protocol overhead at the host CPU
- Propose and evaluate optimizations for reducing host CPU overhead, implemented without H/W support
Outline

- Motivation
- Protocol design over Ethernet
- Experimental results
- Conclusions and future work
Standard Protocol Processing

Sources of overhead (marked in the sketch below):
- System call to issue an operation
- Memory copies at the sender and the receiver
- Protocol packet processing
- Interrupt notifications, for freeing the send-side buffer and for packet arrival
- Extensive device accesses
- Context switch from the interrupt to the receive thread for packet processing
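To make the list concrete, here is a minimal user-space model of the standard path, with each overhead marked in a comment. All names (proto_send, rx_thread, the buffers) are illustrative assumptions; the real protocol runs inside the kernel and drives an actual NIC.

/* Hypothetical model of the standard path; names are illustrative,
 * not from the paper. The real path runs in the kernel. */
#include <stdio.h>
#include <string.h>

#define MTU 1500

static char nic_tx_buf[MTU];   /* NIC-visible send buffer */
static char sock_rx_buf[MTU];  /* kernel-side receive buffer */

/* Entered via a system call (overhead: system call). */
static void proto_send(const char *user_buf, size_t len)
{
    memcpy(nic_tx_buf, user_buf, len); /* send-side copy (overhead: memory copy) */
    /* build headers, checksums (overhead: protocol packet processing) */
    /* MMIO doorbell write to the NIC (overhead: extensive device accesses) */
    /* a TX-completion interrupt later frees nic_tx_buf
       (overhead: interrupt notification) */
}

/* An RX interrupt wakes this thread (overhead: interrupt notification,
 * plus a context switch from the interrupt to the receive thread). */
static void rx_thread(char *user_buf, size_t len)
{
    /* strip headers, reassemble (overhead: protocol packet processing) */
    memcpy(user_buf, sock_rx_buf, len); /* receive-side copy (overhead: memory copy) */
}

int main(void)
{
    char msg[] = "payload", out[sizeof msg];
    proto_send(msg, sizeof msg);
    memcpy(sock_rx_buf, nic_tx_buf, sizeof msg); /* simulated wire */
    rx_thread(out, sizeof out);
    printf("delivered: %s\n", out);
    return 0;
}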
Our Base Protocol

Improves on MultiEdge [IPDPS'07]:
- Support for multiple links, with different schedulers
- H/W coalescing for send- and receive-side interrupts
- S/W coalescing in the interrupt handler (see the sketch below)

Still requires:
- System calls
- One copy at the send side and one at the receive side
- A context switch in the receive path
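As a rough illustration of the S/W coalescing point: the handler drains every packet that has arrived since the interrupt fired, so back-to-back arrivals cost a single interrupt. The ring layout, NIC hooks, and re-arm race handling below are assumptions, not the paper's implementation.

/* Hypothetical sketch of S/W interrupt coalescing in the handler. */
#include <stdbool.h>

#define RING 256

struct rx_ring {
    volatile unsigned head, tail;  /* NIC advances head, host advances tail */
    void *pkt[RING];
};

static void protocol_process(void *pkt) { (void)pkt; /* stub */ }
static void nic_enable_irq(struct rx_ring *r) { (void)r; /* stub: re-arm IRQ */ }
static bool nic_irq_pending(struct rx_ring *r) { return r->tail != r->head; }

/* One hardware interrupt drains all pending packets: arrivals during
 * processing are handled for free instead of raising new interrupts. */
static void rx_irq_handler(struct rx_ring *r)
{
    do {
        while (r->tail != r->head)
            protocol_process(r->pkt[r->tail++ % RING]);
        nic_enable_irq(r);
        /* re-check to close the race between draining and re-arming */
    } while (nic_irq_pending(r));
}

int main(void)
{
    static struct rx_ring r = { .head = 3 };  /* pretend 3 packets arrived */
    rx_irq_handler(&r);  /* all three handled in one "interrupt" */
    return 0;
}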
Evaluation Methodology

Research questions:
- How does the protocol scale with the number of links?
- What are the important overheads at 10 Gbit/s?
- What is the impact of link scheduling?

We use two nodes connected back-to-back:
- Dual-CPU (Opteron 244)
- 1-8 links of 1 Gbit/s (Intel)
- 1 link of 10 Gbit/s (Myricom)

We focus on:
- Throughput: end-to-end, as reported by the benchmarks
- Detailed CPU breakdowns: extensive kernel instrumentation
- Packet-level statistics: flow control, out-of-order arrivals
Throughput Scalability: One Way
What If…

We were able to avoid certain overheads?
- Interrupts: use polling instead
- Data copying: remove the copies from the send and receive paths

We examine two more protocol configurations:
- Poll: realistic, but consumes one CPU (sketched below)
- NoCopy: artificial, as data are not actually delivered
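A minimal sketch of what "use polling instead" can look like, reusing the hypothetical ring layout from the coalescing sketch above: one CPU sweeps the receive rings of all links, so the receive path has no interrupts and no context switch, at the cost of permanently occupying that CPU.

/* Hypothetical sketch of the Poll configuration. */
#include <stdbool.h>

#define NLINKS 4
#define RING   256

struct rx_ring { volatile unsigned head, tail; void *pkt[RING]; };

static struct rx_ring rings[NLINKS];
static volatile bool running = true;

static void protocol_process(void *pkt) { (void)pkt; /* stub */ }

static void poll_loop(void)
{
    do {
        for (int i = 0; i < NLINKS; i++) {  /* one sweep over all links */
            struct rx_ring *r = &rings[i];
            while (r->tail != r->head)
                protocol_process(r->pkt[r->tail++ % RING]);
        }
    } while (running);  /* in practice: spin until shutdown, burning one CPU */
}

int main(void)
{
    rings[2].head = 1;  /* pretend one packet arrived on link 2 */
    running = false;    /* let the demo exit after a single sweep */
    poll_loop();
    return 0;
}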
Poll Results
NoCopy Results
Memory Throughput

Copy performance is related to memory throughput.
Max memory throughput (NUMA, with Linux support):
- Read: 20 Gbit/s
- Write: 15 Gbit/s
Max copy throughput: 8 Gbit/s per CPU accessing local memory.
Overall, multiple links approach the memory throughput, so copies will become increasingly important in the future.
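One way to sanity-check these numbers (my reasoning, not stated on the slide): a copy streams each byte through one read and one write, so if the two bandwidths are consumed serially per byte, the peak copy rate is bounded by the harmonic combination

\[
R_{\mathrm{copy}} \;\le\; \left(\frac{1}{R_{\mathrm{read}}} + \frac{1}{R_{\mathrm{write}}}\right)^{-1}
= \left(\frac{1}{20} + \frac{1}{15}\right)^{-1} \approx 8.6\ \mathrm{Gbit/s}
\]

which agrees with the measured ~8 Gbit/s per CPU.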
Packet Scheduling for Multiple Links

We evaluated three packet schedulers (sketched below):
- Static round robin (SRR): suitable for identical links
- Weighted static round robin (WSRR): assigns packets proportionally to link throughput; does not consider link load
- Weighted dynamic (WD): assigns packets proportionally to link throughput and considers link load
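A minimal sketch of the scheduling criteria, assuming each link exposes a nominal throughput (its weight) and a bytes-in-flight count (its load). The structures and the exact WD criterion (pick the minimal load/weight ratio) are illustrative interpretations of the bullets above, not the paper's code.

/* Illustrative scheduler sketches; layout and criteria are assumptions. */
#include <stddef.h>

struct link {
    double weight;   /* nominal link throughput, e.g. in Gbit/s */
    size_t inflight; /* bytes currently queued on this link (its load) */
};

/* SRR: plain rotation; sensible only when all links are identical. */
static size_t srr_pick(size_t nlinks, size_t *next)
{
    size_t i = *next;
    *next = (*next + 1) % nlinks;
    return i;
}

/* WD: send the packet to the link whose queue drains soonest, i.e. the
 * one with the minimal load/weight ratio. WSRR uses the weights alone,
 * ignoring the current load term. */
static size_t wd_pick(const struct link *l, size_t nlinks)
{
    size_t best = 0;
    for (size_t i = 1; i < nlinks; i++)
        if (l[i].inflight / l[i].weight < l[best].inflight / l[best].weight)
            best = i;
    return best;
}

int main(void)
{
    size_t rr = 0;
    struct link links[5] = {
        {1, 0}, {1, 0}, {1, 0}, {1, 0}, {10, 0}  /* the 4x1 + 1x10 setup */
    };
    links[srr_pick(5, &rr)].inflight += 1500;    /* SRR: just rotates */
    for (int p = 0; p < 20; p++)                 /* WD: favors the 10G link */
        links[wd_pick(links, 5)].inflight += 1500;
    return 0;
}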
Multi-link Scheduler Results

Setup: 4x1 + 1x10, NoCopy + Poll
Lessons Learned

Multiple links introduce overheads:
- The base protocol scales up to 4 x 1 Gbit/s links
- Removing interrupts allows scaling to 6 x 1 Gbit/s links
- Beyond 6 Gbit/s, copying becomes dominant; removing copies allows scaling to 8-10 Gbit/s
- Dynamic weighted scheduling performs best: 10% better than the simpler alternative (WSRR)

Future work:
1) Eliminate even the single copy: use page remapping without H/W support
2) More efficient interrupt coalescing: share the interrupt handler among multiple NICs
3) Distribute the protocol over multiple cores, possibly dedicating cores to network processing
Related Work

User-level communication systems & protocols (Myrinet, InfiniBand, etc.):
- Break the kernel abstraction and require H/W support
- Not successful with commercial applications and I/O

iWARP:
- Requires H/W support
- Ongoing work and efforts

TCP/IP optimizations and offload:
- Complex and expensive
- Important for WAN setups rather than datacenters