Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer...

19
Institute of Computer Science oundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laborator Exploiting Spatial Parallelism in Ethernet-based Cluster Interconnects Stavros Passas, George Kotsis, Sven Karlsson, and Angelos Bilas

Transcript of Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer...

Page 1: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

Institute of Computer ScienceFoundation for Research and Technology – Hellas

Greece

Computer Architecture and VLSI Systems Laboratory

Exploiting Spatial Parallelism in Ethernet-based Cluster

Interconnects

Stavros Passas, George Kotsis, Sven Karlsson, and Angelos

Bilas

Page 2: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 2

Motivation Typically, clusters today use multiple interconnects

Interprocess communication (IPC): myrinet, infiniband, etc

IO: fibre channel, scsi Fast LAN: 10 GigE

However, this increases system and management cost

Can we use a single interconnect for all types of traffic? Which one?

High network speeds 10-40 GBit/s

Page 3: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 3

Trends and Constraints Most interconnects use similar physical layer, but

differ in Protocol semantics and guarantees they provide Protocol implementation on the NIC and network core

Higher layer protocols (e.g. TCP/IP, NFS) are independent of the interconnect technology

10+ Gbps Ethernet is particularly attractive, but … Typically associated with higher overheads Requires more support at the edge due to simpler net

core

Page 4: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 4

This Work How well can a protocol do over 10-40 GigE?

Scale throughput efficiently over multiple links

Analyze protocol overhead at the host CPU

Propose and evaluate optimizations for reducing host CPU overhead Implemented without H/W support

Page 5: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 5

Outline

Motivation Protocol design over Ethernet Experimental results Conclusions and future work

Page 6: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

Standard Protocol Processing

Sources of overhead System call to issue operation Memory copies at sender and receiver Protocol packet processing Interrupt notification for freeing send-side buffer, packet arrival Extensive device accesses Context switch from interrupt to receive thread for packet

processing

FORTH-ICS CARV/scalable 6

Page 7: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 7

Our Base Protocol

Improves on MultiEdge [IPDPS’07] Support for multiple links with different

schedulers H/W coalescing for send- & receive-side

interrupts S/W coalescing in interrupt handler

Still requires System calls One copy at send and one at receive side Context switch in receive path

Page 8: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 8

Evaluation Methodology Research questions

How does the protocol scale with the number of links? What are the important overheads at 10 Gbits/s? What is the impact of link scheduling?

We use two nodes connected back-to-back Dual-CPU (Opteron 244) 1-8 links of 1 Gbit/s (Intel) 1 link of 10 Gbit/s (Myricom)

We focus on Throughput: end-to-end, reported by benchmarks Detailed CPU breakdowns: extensive kernel

instrumentation Packet-level statistics: flow-control, out-of-order

Page 9: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 9

Throughput Scalability: One Way

Page 10: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 10

What If…

We were able to avoid certain overheads Interrupts

Use polling instead

Data copying Remove copies from send and receive path

We examine two more protocol configurations Poll: Realistic, but consumes one CPU NoCopy: Artificial, as data are not delivered

Page 11: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 11

Poll Results

Page 12: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 12

NoCopy Results

Page 13: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

Memory Throughput Copy performance related to memory throughput

Max memory throughput (NUMA w/ Linux support) Read: 20 GBits/s Write: 15 GBits/s

Max copy throughput 8 GBits/s per CPU accessing local memory

Overall, multiple links approach memory throughput Copies important in future

FORTH-ICS CARV/scalable 13

Page 14: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 14

Packet Scheduling for Multiple Links

Evaluated three packet schedulers Static round robin (SRR)

Suitable for identical links

Weighted static round robin (WSRR) Assign packets proportionally to link throughput Does not consider link load

Weighted dynamic (WD) Assign packets proportionally to link throughput Consider link load

Page 15: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 15

Multi-link Scheduler Results

Setup 4x1 + 1x10 NoCopy + Poll

Page 16: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 16

Lessons Learned

Multiple links introduce overheads Base protocol scales up-to 4 x 1 Gbit/s links Removing interrupts allows scaling to 6 x 1

Gbit/s links

Beyond 6 GBits/s copying becomes dominant Removing copies allows scaling to 8-10

GBits/s

Dynamic weighted performs best 10% better over simpler alternative (SWRR)

Page 17: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

Future work

1) Eliminate even single copy Use page remapping without H/W support

2) More efficient interrupt coalescing Share interrupt handler among multiple

NICs

3) Distribute protocol over multiple cores Possibly dedicate cores to network

processing

FORTH-ICS CARV/scalable 17

Page 18: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 18

Related Work User-level communication systems & protocols

Myrinet, Infiniband, etc. Break kernel abstraction and require h/w support Not successful with commercial applications and IO

iWARP Requires H/W support Ongoing work and efforts

TCP/IP optimizations and offload Complex and expensive Important for WAN setups, rather than datacenters

Page 19: Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

FORTH-ICS CARV/scalable 19

Thank you!

Questions?

Contact:Stavros Passas

[email protected]