Customer Case Study: Tier-1 US Service Provider


Case Study: Industry's Largest NFV Deployment

Collaboration between Red Hat, Dell and Big Switch for a tier-1 US Service Provider embracing large-scale NFV deployments on OpenStack with SDN

NFV deployments represent some of the most demanding workloads in OpenStack clouds, yet the economic and operational promise of NFV makes this a high-value technical challenge for service providers worldwide. This paper discusses the collaboration between Dell, Red Hat and Big Switch for a tier-1 US service provider in the industry's largest deployment of NFV infrastructure to date. Four key areas of collaboration were needed to bring the deployment from lab to production:

 

§ Resiliency & Performance at Scale
§ Design & Deployment Flexibility
§ Reducing Operational Complexity
§ Integrating Security & Analytics

 

Software developers from Red Hat and Big Switch worked together on a daily basis over months, leveraging over $1m of test hardware from Dell, to accelerate the open community engineering process and deliver a high-quality, validated NFV Pod architecture. As a result of the collaboration, multiple improvements were made to upstream open source code to align the final Pod design with key design and operational considerations of a large service provider network infrastructure.

     

This SDN/NFV collaboration highlights the open source leadership of Red Hat, the SDN expertise of Big Switch and the proven service and support at scale from Dell.

Figure 1: Pod Design At A Glance. The pod combines Big Switch SDN Controllers (a physical appliance pair) + Red Hat OpenStack 7.1 (with Neutron) + Red Hat Enterprise Linux with Switch Light VX (on Dell R630 compute nodes), with Switch Light OS on the leaf switches (10G/40G Dell ON) and spine switches (40G Dell ON).


Key Network Design Challenges

The key network challenges for the OpenStack NFV pod deployment fell into five major categories:

• Resiliency At Scale: To achieve scale, the design followed a hyperscale-inspired "core and pod" approach [1], with a 12-rack pod design replicated across a number of data centers in the US. The 12-rack pod, a multi-million dollar investment, was replicated at both Dell and Big Switch labs to test the system under stress. Resiliency was required at every level: in the vSwitch, the leaf, the spine, the network services and the ingress/egress to the datacenter core and other pods.

• No Bandwidth Bottlenecks: NFV workloads put extreme stress on the network in many dimensions – east/west bandwidth, north/south bandwidth, intra-vSwitch bandwidth and logical L2/L3 bandwidth. Neither bandwidth limitations from legacy protocols like spanning tree nor packet hair-pinning across the fabric for overlay gateway purposes were acceptable, yet VNF instances needed to be provisioned in any rack at any time. The system as a whole required optimized bandwidth characteristics from vSwitch to leaf to spine in both normal running operations and in partial failure scenarios.

• Logical Network Design Flexibility: The pod design needed to accommodate NFV workloads that each had unique logical network requirements, yet needed to share the same physical leaf/spine fabric and vSwitches. Rather than a one-size-fits-all L2/L3 approach, this design needed to accommodate NFV-specific public L2 networks, public L3 networks, private L2 networks, tenant-managed service chains with FWaaS and LBaaS, provider-managed service chains transparent to the tenants, virtual tenant network functions, physical provider network functions with capacity for high bandwidth broadcast, and a range of connectivity options to numerous external networks. All of these options needed to be mixed-and-matched in peaceful co-existence in the same physical pod at the same time, with relevant provisioning workflows automated by OpenStack.

• Reduced Operational Complexity: Operational complexity for the NFV deployment for this engagement came in two forms: a) lifecycle management of the network control systems relative to the OpenStack control systems, and b) training for design/install/troubleshooting of the network control system itself. The first required tight integration between Big Switch and Red Hat. The end result – a leaf-spine CLOS fabric that can be upgraded in less time than it takes to update an iPhone, without impacting production workloads or OpenStack control systems – is unique in the industry. The second leveraged Big Switch's "One Big Switch" metaphor, detailed below.

• Integrated Security & Visibility: To ensure that the NFV Pod is compliant and secure against intrusions and other threats, it was important to design an out-of-band monitoring capability for E-W traffic as well as an inline protection mechanism for N-S traffic as part of the overall Pod design. Key requirements for this visibility infrastructure were: a scale-out design that grew with the Pod; support for multi-tenant/multi-tool environments; and ease of deployment and operation.

Pod Design At A Glance

The general pod design includes one services/connectivity/control rack and 12 compute racks (Figure 1).

§ Services/connectivity/control rack holds the SDN controllers, OpenStack controllers, various physical provider-side network services and the ingress/egress gateways to networks connecting to the pod. While this rack represents only 10% of the physical space, it represents 90% of the engineering effort involved in the design.

§ Compute racks are intended as a scale-out design, with 12 per pod in the initial deployment. This was designed to evolve over time as more capacity per location is required, and some locations have power/cooling constraints and require flexibility in server density. The networking for each compute rack features Big Switch's Switch Light OS at each top of rack, running on Dell ON switch hardware. The first-generation pod design used Open vSwitch, while the second generation uses Big Switch's Switch Light VX (a "P+V" fabric design) running on Dell compute nodes.

[1] See this article co-authored by Petr Lapukhov, Architect at Facebook, and Kyle Forster, Founder of Big Switch: http://www.infoworld.com/article/2608992/data-center/data-center-rethinking-the-data-center-network.html


For network visibility and monitoring, SPAN ports from each top of rack switch were intended to integrate with Big Switch's Big Monitoring Fabric. This enabled on-demand and granular E-W traffic monitoring (including intra-host traffic using RSPAN). In phase 1, DDoS mitigation tools were connected inline to protect all N-S traffic and managed from the Big Monitoring Fabric controller.

Resiliency At Scale

To validate the resiliency of the NFV pod design at scale, large-scale test beds (>$1.5m each) were constructed in both Dell and Big Switch facilities. The cross-vendor team used a "Chaos Monkey" methodology pioneered by Netflix, culminating in a test with 640 forced network failures in under 30 minutes with no impact to workload performance [2].

In a 'chaos monkey' style test, random network failures were injected into the pod while running 'worst case' workloads, including the Hadoop TeraSort benchmark. Within the testing window, Big Cloud Fabric SDN controllers were forced to fail over every 30 seconds, a random switch was forced to fail every 8 seconds and a random link was forced to fail every 4 seconds.
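The cadence above can be pictured with a minimal failure-injection sketch in Python. This is not Big Switch's actual test harness; the fail_* functions and the switch and link names are hypothetical stubs standing in for whatever tooling drives the real test bed.

```python
import random
import time

# Hypothetical inventory of fabric elements in one pod (names are illustrative).
SWITCHES = [f"leaf-{i}" for i in range(1, 25)] + [f"spine-{i}" for i in range(1, 5)]
LINKS = [f"link-{i}" for i in range(1, 201)]

def fail_controller():
    """Hypothetical stub: force an SDN controller fail-over."""
    print("forcing controller fail-over")

def fail_switch():
    """Hypothetical stub: take a randomly chosen leaf or spine switch down."""
    print("failing switch", random.choice(SWITCHES))

def fail_link():
    """Hypothetical stub: admin-down a randomly chosen fabric link."""
    print("failing link", random.choice(LINKS))

# Cadence from the test described above: controller fail-over every 30 s,
# a switch failure every 8 s, a link failure every 4 s, over a ~30-minute window.
SCHEDULE = [(fail_controller, 30), (fail_switch, 8), (fail_link, 4)]
DURATION_S = 30 * 60

def run_chaos():
    start = time.monotonic()
    next_fire = {action: period for action, period in SCHEDULE}
    failures = 0
    while (elapsed := time.monotonic() - start) < DURATION_S:
        for action, period in SCHEDULE:
            if elapsed >= next_fire[action]:
                action()
                failures += 1
                next_fire[action] += period
        time.sleep(0.5)
    print(f"injected {failures} failures in {DURATION_S // 60} minutes")

if __name__ == "__main__":
    run_chaos()
```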

No Bandwidth Bottlenecks

NFV workloads put extreme stress on the network in many dimensions – east/west bandwidth, north/south bandwidth, intra-vSwitch bandwidth and logical L2/L3 bandwidth. A leaf-spine CLOS design, popularized by Google [3], has become the common approach for extreme east/west/north/south bandwidth requirements. However, the traditional alphabet soup of protocols used to replicate the Google design with legacy networking products often results in data center designs that are extremely fragile in the face of partial failures, particularly at the host, or that significantly constrain workload placement. For VNF deployments, these downsides make such approaches a non-starter. A modern leaf-spine CLOS design, using centralized SDN control designed to see the network from spine to leaf to vSwitch, was the optimal answer for this design.
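As a rough illustration of the sizing reasoning behind "no bandwidth bottlenecks", the sketch below computes a leaf switch's oversubscription ratio (host-facing bandwidth divided by spine-facing uplink bandwidth) in normal operation and with one spine out of service. The port counts and speeds are assumptions chosen for illustration, not the provider's actual bill of materials.

```python
def leaf_oversubscription(host_ports, host_gbps, uplinks, uplink_gbps):
    """Ratio of host-facing bandwidth to uplink bandwidth for one leaf switch."""
    return (host_ports * host_gbps) / (uplinks * uplink_gbps)

# Assumed leaf configuration: 48 x 10G host-facing ports and 6 x 40G uplinks,
# consistent with the 10G/40G leaf and 40G spine hardware noted in Figure 1.
normal = leaf_oversubscription(host_ports=48, host_gbps=10, uplinks=6, uplink_gbps=40)

# Partial-failure case: one uplink (one spine path) lost per leaf.
degraded = leaf_oversubscription(host_ports=48, host_gbps=10, uplinks=5, uplink_gbps=40)

print(f"normal operation:  {normal:.1f}:1 oversubscription")   # 2.0:1
print(f"one uplink failed: {degraded:.1f}:1 oversubscription")  # 2.4:1
```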

Figure 3: Leaf-Spine CLOS Fabric Architecture

[2] For more details on Big Switch's Chaos Monkey testing for OpenStack networking, see http://go.bigswitch.com/rs/bigswitchnetworks/images/Chaos%20Monkey%20and%20Big%20Cloud%20Fabric.pdf

[3] For a history of leaf-spine CLOS designs at Google, see http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf

Figure 3 annotations: leaf-spine CLOS extended all the way down to the vSwitch; maximized bandwidth use across all active links; designed-in coverage of all partial failure cases from vSwitch to leaf to spine to controllers to OpenStack orchestration (compared to the 'alphabet soup' of protocols); fully distributed L3 and Floating IP functions (no packet hair-pins); end-to-end analytics and troubleshooting tools from vSwitch to leaf to spine.

Figure 2: Data Center Scale Test Setup. The test bed places Big Cloud Fabric SDN controllers (a centralized control plane) over a leaf-spine fabric connecting virtual machine racks, bare metal servers and storage, and services and connectivity racks with scale-out ingress/egress.


Logical Network Flexibility

The pod design needed to accommodate NFV workloads that each had unique logical network requirements, yet needed to share the same physical leaf/spine fabric and vSwitches. Rather than a one-size-fits-all L2/L3 approach, this design needed to accommodate numerous NFV-specific L2/L3/service designs. These included:

• Public L2 networks with workload-specific routers for ingress/egress
• Public (routable) L3 networks connected via BGP and static routes to the various service provider networks
• Private L2 networks for workloads requiring inter-VNF broadcast and L2 multicast connectivity
• Tenant-managed service chains with FWaaS, LBaaS and other services managed by workload-specific teams on their own operational schedules
• Provider-managed service chains, transparent to the tenants, to serve as corporate standards across a wide variety (but not all) of the NFV workloads loaded onto the pod
• A mix of both virtual network functions and physical network functions inserted into the service chains mentioned above to service NFV workloads
• A mix of both virtual network functions and part-virtual/part-physical network functions making up an NFV workload (i.e. specialized physical equipment and high-rate storage)

Where applicable, workflows required for provisioning these networks needed to be orchestrated through OpenStack APIs and user interfaces.
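As a hedged illustration of that orchestration, the sketch below uses the openstacksdk library to create two of the logical network types listed above (a public provider L2 network and a private tenant L2 network) and to attach a tenant-managed router as the anchor point for FWaaS/LBaaS service chains. The cloud name, network names, VLAN ID, physnet label and CIDRs are illustrative assumptions, not the provider's actual automation; in the deployment described here, the equivalent workflows were driven through the OpenStack APIs and user interfaces noted above.

```python
import openstack  # openstacksdk

# Connect using credentials from clouds.yaml; the cloud name is an assumption.
conn = openstack.connect(cloud="nfv-pod")

# Public (routable) provider L2 network; segment type/ID and physnet label
# are deployment-specific placeholders.
public_net = conn.network.create_network(
    name="vnf-public-l2",
    provider_network_type="vlan",
    provider_physical_network="physnet1",
    provider_segmentation_id=210,
    is_router_external=True,
)
conn.network.create_subnet(
    name="vnf-public-l2-subnet",
    network_id=public_net.id,
    ip_version=4,
    cidr="198.51.100.0/24",
    is_dhcp_enabled=False,
)

# Private tenant L2 network for inter-VNF broadcast/multicast traffic.
private_net = conn.network.create_network(name="vnf-private-l2")
private_subnet = conn.network.create_subnet(
    name="vnf-private-l2-subnet",
    network_id=private_net.id,
    ip_version=4,
    cidr="10.20.0.0/24",
)

# Tenant-managed router stitching the private network to the public network,
# the starting point for tenant-managed service chains.
router = conn.network.create_router(
    name="vnf-tenant-router",
    external_gateway_info={"network_id": public_net.id},
)
conn.network.add_interface_to_router(router, subnet_id=private_subnet.id)
```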

Reduced Operational Complexity

NFV designs in the lab can be incredibly complex, representing unbounded operational risk. To address those risks, ease of deployment and management of day-to-day operations were critical elements for this design.

§ OpenStack Deployment: This was addressed with a powerful, simplified and automated cloud installation tool from Red Hat, the RHEL OSP 7 director, which also provides system-wide health checking and complete lifecycle management. The integration of the BCF networking installer with RHEL OSP 7 director provides a completely integrated workflow that not only makes the system installation process seamless and predictable, but also ensures the stability and rapid convergence of the system upon subsequent upgrades of its components.

 

§ Pod Operations: In order to make this system intuitive for networking professionals, the pod design used Big Cloud Fabric's "One Big Switch" operational metaphor (Figure 5). From an operations perspective, the SDN controllers feel and act just like chassis supervisors, the spine switches feel just like a chassis backplane, and the leaf and vSwitches feel just like chassis line cards. This metaphor dramatically reduced the training required when integrating the new pod into existing operational processes.

Figure 4: RHEL OpenStack Platform Director


Figure 5: One "Big Switch"

With complex NFV workloads riding on top of a layer of OpenStack automation, which itself rides on top of an SDN fabric, network health, history and troubleshooting tools were a key challenge for the deployment. With integration from vSwitch to leaf to spine, the visibility of the Big Cloud Fabric "P+V" design dramatically reduced operational concerns with this kind of deployment. According to a recent ACG research study, these tools allow for troubleshooting 12x faster than traditional network designs for these types of pods [4].

 

Integrated Security & Visibility

To ensure that the NFV Pod is compliant and secure against intrusions and other threats, Big Monitoring Fabric was used to monitor East-West traffic (intra-pod) and North-South traffic (inline). Big Monitoring Fabric is provisioned and managed through a centralized single pane of glass: the Big Monitoring Fabric controller CLI, GUI or REST APIs. In addition to delivering relevant traffic to dedicated tools (e.g. a DDoS appliance in an inline deployment), Big Monitoring Fabric also supports built-in analytics and troubleshooting, as shown in Figure 6.
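For a sense of how such provisioning can be scripted against a controller's REST interface, here is a minimal sketch using the Python requests library. The controller address, URL path, payload fields and authentication handling are hypothetical placeholders; the actual Big Monitoring Fabric REST resources are defined in the product documentation and will differ from what is shown here.

```python
import requests

CONTROLLER = "https://bmf-controller.example.net:8443"  # placeholder address
session = requests.Session()
session.verify = "/etc/pki/bmf-ca.pem"                  # assumed CA bundle path
session.headers.update({"Cookie": "session_cookie=EXAMPLE-TOKEN"})  # placeholder auth

def create_monitoring_policy(name, filter_interfaces, delivery_interface):
    """Steer selected SPAN/tap traffic to a tool port.

    The path and payload below are illustrative stand-ins, not the
    controller's documented schema.
    """
    payload = {
        "name": name,
        "filter-interfaces": filter_interfaces,       # ports receiving SPAN/RSPAN feeds
        "delivery-interfaces": [delivery_interface],  # port facing the analysis tool
    }
    resp = session.post(f"{CONTROLLER}/api/v1/policy", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Example: mirror two top-of-rack SPAN feeds to a DDoS analysis tool.
if __name__ == "__main__":
    create_monitoring_policy(
        name="ew-ddos-monitor",
        filter_interfaces=["rack01-tor-span", "rack02-tor-span"],
        delivery_interface="ddos-tool-port",
    )
```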

[4] The entire ACG study, showing 12x faster troubleshooting times, 20x faster software upgrade times and 12x faster pod expansion times, is available at http://go.bigswitch.com/rs/974-WXR-561/images/Economic%20Advantages%20of%20Open%20SDN%20Fabrics%20-%20ACG%20Research.pdf

Figure 5 contrasts a traditional chassis pair (supervisors, backplane, line cards) with the Big Cloud Fabric equivalent: the SDN controller pair plays the role of the supervisors, the spine switches play the role of the backplane, and the leaf switches serving the compute workload and services & connectivity racks play the role of the line cards.

Analytics panels shown: Health, History, and machine-assisted troubleshooting.


Figure 6: Integrated Visibility & Analytics

 

To Learn More

§ Big Cloud Fabric Overview: More details available at: http://bigswitch.com/sdn-products/big-cloud-fabric

§ Red Hat OpenStack Platform Overview: More details available at: https://access.redhat.com/documentation/en/red-hat-enterprise-linux-openstack-platform/7/

§ Big Monitoring Fabric Overview: More details available at: http://bigswitch.com/products/big-monitoring-fabric

§ Big Switch Labs: Get hands-on experience with the seamless integration of OpenStack and Big Cloud Fabric (P+V Edition) using Big Switch's Neutron plugin. Available online, for free: http://labs.bigswitch.com

§ BCF Starter Kits: Big Switch offers this fully tested, scalable OpenStack networking solution in several Big Cloud Fabric starter kits, pre-configured with hardware, cables, support and physical+virtual Big Cloud Fabric software, starting at $49k. For more details, download the brochure at: http://bigswitch.com/starter-kits

§ Test Setup Details: Details of the scale testing architecture and chaos monkey testing installation and methodology are available on request. Email [email protected].

 

 

ABOUT BIG SWITCH

Big Switch Networks is the market leader in bringing hyperscale data center networking technologies to a mainstream data center audience. The company is taking three key hyperscale technologies -- OEM/ODM bare metal and open Ethernet switch hardware, sophisticated SDN control software, and core-and-pod data center designs -- and leveraging them in fit-for-purpose products designed for use in enterprises, cloud providers, and service providers. For additional information, email [email protected], follow @bigswitch, or visit www.bigswitch.com.

Big Switch Networks, Big Cloud Fabric, Big Monitoring Fabric, Switch Light OS, and Switch Light VX are trademarks or registered trademarks of Big Switch Networks, Inc. All other trademarks, service marks, registered marks, or registered service marks are the property of their respective owners.