Cisco usNIC: how it works, how it is used in Open MPI


Description

In this talk, I expand on the slides I presented at the Madrid, Spain EuroMPI conference in September 2013 (I re-used some of the slides from that Madrid presentation, but there's a bunch of new content in the latter half of the slide deck). This talk is a technical deep dive into how Cisco's usNIC technology works, and how we use that technology in the BTL plugin that we wrote for Open MPI. I originally gave this talk at Lawrence Berkeley Labs on Thursday, November 7, 2013.

Transcript of Cisco usNIC: how it works, how it is used in Open MPI

Page 1: Cisco usNIC: how it works, how it is used in Open MPI


Cisco Userspace NIC (usNIC)

Jeff Squyres, Cisco Systems, Inc., November 7, 2013

Page 2: Cisco usNIC: how it works, how it is used in Open MPI


Yes, we sell servers now

Page 3: Cisco usNIC: how it works, how it is used in Open MPI


Cisco UCS servers: record-setting Intel Ivy Bridge, 1U and 2U servers

Cisco 2 x 10Gb VIC: ultra-low-latency Ethernet (yes, really!)

Cisco 10/40Gb Nexus switches: 40Gb top-of-rack and core switches

Page 4: Cisco usNIC: how it works, how it is used in Open MPI


Cisco UCS: Many Server Form Factors, One System

Rack:
•  UCS C220 M3: ideal for HPC compute-intensive applications (2-socket)
•  UCS C240 M3: perfect as HPC cluster head nodes or IO nodes (2-socket)
•  UCS C420 M3: 4-socket rack server for large-memory compute workloads

Blade:
•  UCS B200 M3: blade form factor, 2-socket
•  UCS B420 M3: 4-socket blade for large-memory compute workloads

Industry-leading compute without compromise

Page 5: Cisco usNIC: how it works, how it is used in Open MPI


Market appetite for innovation fuels UCS growth: UCS #2 and climbing

Worldwide x86 server blade market share: demand for data center innovation has vaulted Cisco Unified Computing System (UCS) to the #2 leader in the fast-growing segment of the x86 server market.

•  UCS impacting growth of established vendors like HP
•  Legacy offerings flat-lining or in decline
•  Cisco growth out-pacing the market
•  Customers have shifted 19.3% of the global x86 blade server market to Cisco, and over 26% in the Americas

(Source: IDC Worldwide Quarterly Server Tracker, Q1 2013 Revenue Share, May 2013)

Page 6: Cisco usNIC: how it works, how it is used in Open MPI


•  Best CPU Performance: 16 world records
•  Best Virtualization & Cloud Performance: 8 world records
•  Best Database Performance: 9 world records
•  Best Enterprise Application Performance: 18 world records
•  Best Enterprise Middleware Performance: 14 world records
•  Best HPC Performance: 15 world records

Page 7: Cisco usNIC: how it works, how it is used in Open MPI


One wire to rule them all:
•  Commodity traffic (e.g., ssh)
•  Cluster / hardware management
•  File system / IO traffic
•  MPI traffic

10G or 40G with real QoS

Page 8: Cisco usNIC: how it works, how it is used in Open MPI


Cisco Nexus: years of experience rolled into dependable solutions

High-density, low-latency 10/40Gb switches:
•  Nexus 3548: 190ns port-to-port latency (L2 and L3); 48 10Gb / 12 40Gb ports; created for HPC / HFT
•  Nexus 6004: 1μs port-to-port latency; 384 10Gb / 96 40Gb ports

Page 9: Cisco usNIC: how it works, how it is used in Open MPI


Spine / leaf topology characteristics:
•  3 hops
•  Low oversubscription - non-blocking
•  < ~3.5 μs depending on config and workload
•  10G or 40G capable
•  Spine: 4 to 16 wide
•  Leaf: determined by spine density

Spine - Leaf | Switches    | Port Scale        | Latency                | Spines | Leafs
10G Fabric   | 6004 - 6001 | 18,432 x 10G, 3:1 | ~3 μs, cut-through     | 16     | 384
40G Fabric   | 6004 - 6004 | 7,680 x 40G, 5:1  | ~3 μs, cut-through     | 16     | 96
Mixed Fabric | 6004 - 6001 | 4,680 x 10G, 3:1  | ~3 μs, S&F             | 4      | 96
10G Fabric   | 6004 - 3548 | 12,288 x 10G, 3:1 | ~1.5 μs, cut-through   | 16     | 384
40G Fabric   | 6004 - 3548 | 1,152 x 40G, 1:1  | ~1.5 μs, cut-through   | 6      | 96
Mixed Fabric | 6004 - 3548 | 3,072 x 10G, 3:1  | ~1.5 μs, S&F           | 4      | 96

…many other configurations are also possible

Page 10: Cisco usNIC: how it works, how it is used in Open MPI


Two-level spine (Spine2 over Spine1 over Leaf) topology characteristics:
•  3 hops per pod; 5 hops for east-west traffic across the data center
•  Low oversubscription - non-blocking
•  < ~3.5 μs depending on config and workload
•  10G or 40G capable
•  Two spine layers

Spine2-Spine1-Leaf | Switches           | Port Scale        | Latency                  | Spine2 | Spine1 | Leafs
10G Fabric         | 6004 - 6004 - 6001 | 55,296 x 10G, 3:1 | ~3-5 μs, cut-through     | 48     | 16 x 6 | 192
40G Fabric         | 6004 - 6004 - 6004 | 23,040 x 40G, 5:1 | ~3-5 μs, cut-through     | 48     | 16     | 48
Mixed Fabric       | 6004 - 6004 - 6001 | 18,432 x 10G, 3:1 | ~3-5 μs, S&F             | 32     | 4 x 8  | 48
10G Fabric         | 6004 - 6004 - 3548 | 24,576 x 10G, 2:1 | ~1.5-3.5 μs, cut-through | 32     | 16 x 4 | 192
40G Fabric         | 6004 - 6004 - 3548 | 2,304 x 40G, 1:1  | ~1.5-3.5 μs, cut-through | 24     | 6 x 8  | 48
Mixed Fabric       | 6004 - 6004 - 3548 | 9,216 x 10G, 2:1  | ~1.5-3.5 μs, S&F         | 24     | 6 x 8  | 48

Page 11: Cisco usNIC: how it works, how it is used in Open MPI


Page 12: Cisco usNIC: how it works, how it is used in Open MPI


•  Direct access to the NIC hardware from Linux userspace: operating-system bypass, via the Linux Verbs API (UD)

•  Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency: 2nd-generation 80Gbps Cisco ASIC; 2 x 10Gbps Ethernet ports (2 x 40Gbps coming …soon…); PCI and mezzanine form factors

•  Half-round-trip (HRT) ping-pong latencies (Intel E5-2690 v2 servers): raw back-to-back 1.57μs; MPI back-to-back 1.85μs; through MPI + Nexus 3548 2.05μs

These numbers keep going down.
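To make the first bullet concrete, here is a minimal sketch (not Cisco's actual code; error handling omitted) of the standard libibverbs calls a userspace process makes to get an unreliable-datagram (UD) queue pair on a verbs device such as usnic_0; the usNIC-specific behavior lives in the userspace plugin and kernel module that implement these calls for the VIC:

    #include <infiniband/verbs.h>

    struct ibv_qp *open_ud_qp(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        struct ibv_context *ctx  = ibv_open_device(devs[0]);  /* e.g., usnic_0 */
        struct ibv_pd      *pd   = ibv_alloc_pd(ctx);
        struct ibv_cq      *cq   = ibv_create_cq(ctx, 64, NULL, NULL, 0);

        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr  = 64, .max_recv_wr  = 64,
                         .max_send_sge = 1,  .max_recv_sge = 1 },
            .qp_type = IBV_QPT_UD,   /* unreliable datagram, as usNIC uses */
        };
        return ibv_create_qp(pd, &attr);
    }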

Page 13: Cisco usNIC: how it works, how it is used in Open MPI


[Diagram: two software stacks on the same host. On the TCP/IP side, the application calls a userspace sockets library, which goes through the kernel TCP stack and a general Ethernet driver (Cisco VIC driver) down to the Cisco VIC hardware. On the usNIC side, the application calls a userspace verbs library that talks to the Cisco VIC hardware directly for the send and receive fast path; the kernel verbs IB core and Cisco usNIC driver are involved only for bootstrapping and setup.]

Page 14: Cisco usNIC: how it works, how it is used in Open MPI


[Diagram: via the userspace verbs library, MPI directly injects L2 frames to the network and receives L2 frames directly from the VIC, with no kernel involvement.]
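Since the fast path deals in raw L2 frames, "direct inject" amounts to laying down an ordinary Ethernet header in front of the payload and posting the whole buffer as one UD datagram. A sketch of that idea (the ethertype value below is a placeholder, not the one usNIC actually uses):

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    struct eth_hdr {
        uint8_t  dst[6];      /* destination MAC: the peer's VIC port */
        uint8_t  src[6];      /* source MAC: our PF/VF */
        uint16_t ethertype;   /* network byte order */
    } __attribute__((packed));

    static size_t build_frame(uint8_t *buf, const uint8_t dst[6],
                              const uint8_t src[6],
                              const void *payload, size_t len)
    {
        struct eth_hdr *h = (struct eth_hdr *)buf;
        memcpy(h->dst, dst, 6);
        memcpy(h->src, src, 6);
        h->ethertype = htons(0x88B5);            /* placeholder ethertype */
        memcpy(buf + sizeof(*h), payload, len);  /* payload follows header */
        return sizeof(*h) + len;
    }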

Page 15: Cisco usNIC: how it works, how it is used in Open MPI


[Diagram: the VIC is an SR-IOV NIC. Each MPI process owns queue pairs (QPs) directly on the VIC. Outbound L2 frames go straight from a process's QP to the wire; inbound L2 frames are steered by the VIC's classifier to the owning process's QP. The x86 chipset's VT-d I/O MMU sits between the VIC and host memory.]

Page 16: Cisco usNIC: how it works, how it is used in Open MPI


[Diagram: each VIC physical port is exposed as a PCI Physical Function (PF) with its own MAC address (e.g., aa:bb:cc:dd:ee:ff and aa:bb:cc:dd:ee:fe). Each PF is carved into many SR-IOV Virtual Functions (VFs), and each VF hosts one or more queue pairs (QPs).]

Page 17: Cisco usNIC: how it works, how it is used in Open MPI


[Diagram: each MPI process is bound to QPs on VFs under one of the PFs (physical ports), with the Intel IO MMU translating between process virtual addresses and physical memory for the VIC's DMA.]

Page 18: Cisco usNIC: how it works, how it is used in Open MPI


•  The IO MMU is used for physical ↔ virtual memory translation

•  The usnic verbs driver programs (and deprograms) the IOMMU

[Diagram: the Intel IO MMU sits between the VIC and RAM, mapping the userspace process's virtual addresses to physical addresses.]
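In verbs terms, the IOMMU programming happens when a buffer is registered. A sketch, assuming nothing usNIC-specific beyond the standard call:

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        /* ibv_reg_mr() is the point at which the usnic driver can pin
         * these pages and program the IOMMU, so the VIC can DMA
         * directly to/from this virtual address range. */
        return ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    }

When the memory region is destroyed (ibv_dereg_mr()), the mapping is deprogrammed again.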

Page 19: Cisco usNIC: how it works, how it is used in Open MPI


•  For the purposes of this talk, let's assume that each physical port has one Linux ethX device

•  Each ethX device corresponds to a PF

•  Each usnic_Y device corresponds to an ethX device

[Diagram: a VIC with two fiber ports: physical port 0 is eth4 / usnic_0, physical port 1 is eth5 / usnic_1.]
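One hypothetical way to discover the usnic_Y → ethX pairing from a program is to read sysfs; the exact sysfs layout below is an assumption for illustration, not something stated in the talk:

    #include <dirent.h>
    #include <stdio.h>

    void print_netdev_for(const char *usnic_dev)   /* e.g., "usnic_0" */
    {
        char path[256];
        /* Assumed layout: the net device appears under the verbs
         * device's PCI function in sysfs. */
        snprintf(path, sizeof(path),
                 "/sys/class/infiniband/%s/device/net", usnic_dev);
        DIR *d = opendir(path);
        struct dirent *e;
        while (d && (e = readdir(d)) != NULL)
            if (e->d_name[0] != '.')
                printf("%s -> %s\n", usnic_dev, e->d_name);  /* e.g., eth4 */
        if (d)
            closedir(d);
    }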

Page 20: Cisco usNIC: how it works, how it is used in Open MPI


[lstopo output: Intel Xeon E5-2690 ("Sandy Bridge") server; 2 sockets, 8 cores per socket (2 hardware threads each), 64GB per socket, 128GB total. NUMA node 0 carries four Intel Ethernet ports (eth0-eth3) and the first VIC's two ports (eth4 / usnic_0, eth5 / usnic_1); NUMA node 1 carries the disk (sda) and the second VIC's two ports (eth6 / usnic_2, eth7 / usnic_3).]

Page 21: Cisco usNIC: how it works, how it is used in Open MPI


[The same lstopo topology output as the previous page, shown twice.]

Page 22: Cisco usNIC: how it works, how it is used in Open MPI


[Diagram: Open MPI's layered architecture, from top to bottom: Application → Open MPI layer (OMPI) → point-to-point messaging layer (PML) → byte transfer layer (BTL) → operating system → hardware.]

Page 23: Cisco usNIC: how it works, how it is used in Open MPI


[Diagram: MPI_Send / MPI_Recv (etc.) calls flow into the OB1 PML, which drives four usnic BTL modules: /dev/usnic_0 and /dev/usnic_1 on VIC 0, /dev/usnic_2 and /dev/usnic_3 on VIC 1.]
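A toy illustration (not Open MPI's actual interface) of what this picture implies: the PML fragments a large message and stripes the fragments round-robin across the available usnic BTL modules, so all four devices carry traffic. btl_send() here is a hypothetical stand-in for a per-module send function:

    #include <stddef.h>

    #define NUM_BTLS 4    /* one usnic BTL module per usnic_Y device */

    extern void btl_send(int btl, const char *frag, size_t len);  /* hypothetical */

    static void send_striped(const char *buf, size_t len, size_t frag_size)
    {
        size_t off = 0;
        int btl = 0;
        while (off < len) {
            size_t n = (len - off < frag_size) ? (len - off) : frag_size;
            btl_send(btl, buf + off, n);
            btl = (btl + 1) % NUM_BTLS;   /* stripe across all modules */
            off += n;
        }
    }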

Page 24: Cisco usNIC: how it works, how it is used in Open MPI


•  Byte Transfer Layer: point-to-point transfer plugins in the OMPI layer; no protocol is assumed / required

•  The "usnic" BTL:
  •  Uses unreliable datagram (UD) verbs
  •  Handles all fragmentation and re-assembly (vs. the PML)
  •  Retransmissions and ACKs handled in software
  •  Sliding-window retransmission scheme
  •  Direct inject / direct receive of L2 Ethernet frames

Page 25: Cisco usNIC: how it works, how it is used in Open MPI


•  One BTL module for each usNIC verbs device

•  Each module has two UD queue pairs:
  •  A priority queue for small and control packets
  •  A data queue for up-to-MTU-sized data packets

•  Each QP has its own completion queue (CQ)

•  The QPs may or may not be on the same VF

•  The overall BTL glue polls the CQs for each device: first the priority CQs, then the data CQs (see the sketch below)
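A sketch of that polling order using the standard verbs completion call (shapes simplified; this is not the BTL's actual code):

    #include <infiniband/verbs.h>

    void btl_progress(struct ibv_cq **prio_cq, struct ibv_cq **data_cq,
                      int ndevices)
    {
        struct ibv_wc wc[32];
        int i, n;

        /* Priority CQs first: small/control packets jump the queue. */
        for (i = 0; i < ndevices; i++) {
            n = ibv_poll_cq(prio_cq[i], 32, wc);
            /* ... handle the n priority completions ... */
        }
        /* Then the data CQs: MTU-sized fragments. */
        for (i = 0; i < ndevices; i++) {
            n = ibv_poll_cq(data_cq[i], 32, wc);
            /* ... handle the n data completions ... */
        }
        (void)n;
    }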

Page 26: Cisco usNIC: how it works, how it is used in Open MPI


•  "Raw" latency (no MPI, no verbs) is 1.57μs

•  MPI back-to-back latency on Sandy Bridge is 1.85μs

•  Verbs is responsible for about 80ns of the difference (not related to the MPI API)

•  All the rest of OMPI adds only about 200ns

Breakdown: raw 1.57μs + verbs ~80ns + OMPI ~200ns ≈ 1.85μs.

Page 27: Cisco usNIC: how it works, how it is used in Open MPI


•  Deferred and piggy-backed ACKs

[Diagram: timelines between process A and process B. Immediate: B ACKs each message as it arrives (Msg → ACK N). Deferred: B receives several messages (Msg, Msg, Msg) and sends one ACK N+2 covering all of them. Deferred + piggybacked: B carries the ACK N+2 inside its next outgoing message (Msg+ACK N+2), so no standalone ACK packet is needed.]
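A sketch of the deferral logic these timelines imply; the state and thresholds are illustrative, not the BTL's actual values:

    /* Per-peer receive state for the sliding window. */
    struct peer {
        unsigned recvd_seq;      /* highest in-order sequence received */
        unsigned acked_seq;      /* highest sequence we have ACKed */
        double   last_ack_time;  /* seconds */
    };

    void maybe_send_ack(struct peer *p, double now, int have_outgoing)
    {
        if (p->recvd_seq == p->acked_seq)
            return;                         /* nothing new to acknowledge */

        if (have_outgoing) {
            /* Piggyback: stamp ACK(recvd_seq) into the outgoing header. */
            p->acked_seq = p->recvd_seq;
        } else if (now - p->last_ack_time > 100e-6) {  /* illustrative 100μs */
            /* Deferred standalone ACK covering everything up to recvd_seq. */
            p->acked_seq = p->recvd_seq;
            p->last_ack_time = now;
        }
    }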

Page 28: Cisco usNIC: how it works, how it is used in Open MPI


Normal send path:
1.  The host writes the WQ descriptor, then writes the new WQ index to the VIC via PIO
2.  The VIC reads the WQ descriptor from host RAM (the VIC now has the buffer address)
3.  The VIC reads the buffer from RAM
4.  The VIC sends the buffer on the wire

Page 29: Cisco usNIC: how it works, how it is used in Open MPI


Optimized send path:
1.  The host writes the WQ descriptor, then writes the index plus an encoded buffer address to the VIC in a single PIO write
2.  The VIC can read the packet buffer from RAM immediately, in parallel with reading the WQ descriptor
3.  The VIC sends the buffer on the wire ~400ns sooner
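A sketch of what that single PIO write might look like; the register layout and address encoding here are hypothetical, purely to show the shape of the optimization:

    #include <stdint.h>

    static inline void ring_doorbell(volatile uint64_t *doorbell,
                                     uint16_t wq_index, uint64_t buf_addr)
    {
        /* Pack the WQ index and a truncated buffer address into one
         * 64-bit posted PCIe write, so the VIC can start DMA-ing the
         * packet without waiting to read the WQ descriptor first. */
        uint64_t v = ((uint64_t)wq_index << 48)
                   | (buf_addr & 0xFFFFFFFFFFFFULL);   /* low 48 bits */
        *doorbell = v;
    }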

Page 30: Cisco usNIC: how it works, how it is used in Open MPI


•  Minimize the length of the priority receive queue: using 2048 different receive buffers is 200ns worse than using 64

•  This is the result of an IOMMU cache effect

•  We therefore scale the length of the priority RQ with the number of processes in the job (a sketch follows)

[Diagram: the Intel IO MMU between the VIC and the userspace process's virtual memory; the point is to use a small slice of the registered receive buffers instead of a large one.]
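A sketch of that scaling policy (the constants are illustrative, not the BTL's actual values):

    static int prio_rq_depth(int nprocs_in_job)
    {
        int depth = 8 * nprocs_in_job;    /* a few buffers per peer */
        if (depth < 64)
            depth = 64;                   /* floor for tiny jobs */
        if (depth > 2048)
            depth = 2048;                 /* cap: avoid IOMMU-cache thrash */
        return depth;
    }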

Page 31: Cisco usNIC: how it works, how it is used in Open MPI


•  Use fast paths wherever possible; be friendly to the optimizer and the instruction cache

•  This made a noticeable difference (!)

    if (fastpathable)
        do_it_inline();
    else
        call_slower_path();
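One common way (on GCC/Clang) to be friendly to the optimizer is a branch-prediction hint, so the slow path is compiled out of line; whether the usnic BTL uses exactly this construct is not stated here:

    #define likely(x) __builtin_expect(!!(x), 1)

    extern void do_it_inline(void);
    extern void call_slower_path(void);

    static void send_fragment(int fastpathable)
    {
        if (likely(fastpathable))
            do_it_inline();       /* hot path, kept small and inlined */
        else
            call_slower_path();   /* cold path, moved out of line */
    }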

Page 32: Cisco usNIC: how it works, how it is used in Open MPI


[The same lstopo topology output as page 20, with the cores on one socket highlighted: MPI processes running on these cores…]

Page 33: Cisco usNIC: how it works, how it is used in Open MPI


[Same topology, same highlighted cores: MPI processes running on these cores… only use these usNIC devices (the ones local to the processes' NUMA node) for short messages.]

Page 34: Cisco usNIC: how it works, how it is used in Open MPI


[Same topology, same highlighted cores: MPI processes running on these cores… use ALL usNIC devices for long messages.]

Page 35: Cisco usNIC: how it works, how it is used in Open MPI


•  Everything above the firmware is open source:

•  Open MPI: distributing Cisco Open MPI 1.6.5; upstream in Open MPI 1.7.3

•  Libibverbs plugin

•  Verbs kernel module

Page 36: Cisco usNIC: how it works, how it is used in Open MPI


Hardware

•  Cisco UCS C220 M3 rack server:
  •  Intel E5-2690 processor, 2.9 GHz (3.3 GHz Turbo), 2 sockets, 8 cores/socket
  •  1600 MHz DDR3 memory, 16 x 8 GB, 128 GB installed
  •  Cisco VIC 1225 with ultra-low-latency networking usNIC driver

•  Cisco Nexus 3548: 48-port 10 Gbps ultra-low-latency Ethernet switch

Software

•  OS: CentOS 6.4; kernel: 2.6.32-358.el6.x86_64 (SMP)
•  NetPIPE (ver 3.7.1)
•  Intel MPI Benchmarks (ver 3.2.4)
•  High Performance Linpack (ver 2.1)
•  Other: Intel C Compiler (ver 13.0.1), Open MPI (ver 1.6.5), Cisco usNIC (1.0.0.7x)

Page 37: Cisco usNIC: how it works, how it is used in Open MPI


[Chart: NetPIPE latency (μs) and throughput (Mbps) vs. message size (1 byte to 8MB) for Cisco usNIC. Key results: 2.05 μs latency for small messages; 9.3 Gbps throughput.]

Page 38: Cisco usNIC: how it works, how it is used in Open MPI


[Chart: Intel MPI Benchmarks PingPong and PingPing latency (μs) and throughput (MB/s) vs. message size (4 bytes to 4MB). Key results: 2.05 μs PingPong latency, 2.10 μs PingPing latency; PingPing and PingPong latency track together!]

Page 39: Cisco usNIC: how it works, how it is used in Open MPI


[Chart: Intel MPI Benchmarks SendRecv and Exchange latency (μs) and throughput (MB/s) vs. message size (4 bytes to 4MB). Key results: 2.11 μs SendRecv latency, 2.58 μs Exchange latency; full bi-directional performance for both Exchange and SendRecv.]

Page 40: Cisco usNIC: how it works, how it is used in Open MPI


High Performance Linpack (HPL) scaling:

# of CPU cores | 16     | 32     | 64      | 128     | 256     | 512
GFLOPS         | 340.51 | 673.68 | 1271.14 | 2647.09 | 5258.27 | 9773.45

GFLOPS = FLOPS/cycle x num CPU cores x freq (GHz); E5-2690 max GFLOPS = 8 x 16 x 3.3 = 422 GFLOPS per node.

Single-node HPL score (16 cores): 340.51 GFLOPS*
32-node HPL score (512 cores): 9,773.45 GFLOPS
Efficiency relative to the single-machine score: (9,773.45) / (340.51 x 32) x 100 = 89.69%

* Score may improve with additional compiler settings or newer compiler versions

Page 41: Cisco usNIC: how it works, how it is used in Open MPI

Thank you.