Less is More: Trading a little Bandwidth for Ultra-Low Latency in the Data Center
Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda

Transcript of the slide deck (23 slides).

Page 1:

Less is More: Trading a little Bandwidth for Ultra-Low Latency in the Data Center

Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda

Page 2:

Latency in Data Centers

• Latency is becoming a primary performance metric in the data center.

• Low-latency applications:
  – High-frequency trading
  – High-performance computing
  – Large-scale web applications
  – RAMClouds (want < 10 μs RPCs)

• Desire predictable, low-latency delivery of individual packets.

Page 3:

Why Does Latency Matter?

[Diagram: a traditional application keeps its app logic and data structures on one machine; a large-scale web application spreads app logic across many servers, so answering "Who does Alice know? What has she done?" touches data (pics, videos, apps) scattered over the cluster.]

• Latency limits the data access rate
  ➢ Fundamentally limits applications

• Possibly 1000s of RPCs per operation
  ➢ Microseconds matter, even at the tail (e.g., 99.9th percentile)

Page 4:

Reducing Latency

• Software and hardware are improving
  – Kernel bypass, RDMA; RAMCloud: software processing ~1 μs
  – Low-latency switches forward packets in a few 100 ns
  – Baseline fabric latency (propagation, switching) under 10 μs is achievable.

• Queuing delay: random and traffic-dependent
  – Can easily reach 100s of microseconds or even milliseconds

• One 1500 B packet = 12 μs @ 1 Gbps (see the quick check below)

Goal: Reduce queuing delays to zero.
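For reference, a minimal back-of-envelope check of the 12 μs figure quoted above (a Python sketch; only the packet size and link rate come from the slide):

    # Serialization delay of one MTU-sized packet at 1 Gbps.
    packet_bits = 1500 * 8        # 1500-byte packet
    link_rate_bps = 1e9           # 1 Gbps link
    delay_us = packet_bits / link_rate_bps * 1e6
    print(f"{delay_us:.0f} us")   # -> 12 us: one queued packet already costs 12 us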

Page 5:

Data Center Workloads

• Short messages [100 B – 10 KB]  →  need Low Latency
• Large flows [1 MB – 100 MB]     →  need High Throughput

Low Latency AND High Throughput: we want baseline fabric latency AND high throughput.

Page 6:

Why do we need buffers?

• Main reason: to create "slack"
  – Handle temporary oversubscription
  – Absorb TCP's rate fluctuations as it discovers path bandwidth

• Example: bandwidth-delay product rule of thumb
  – A single TCP flow needs B ≥ C×RTT of buffering for 100% throughput (see the sketch below).

[Figure: throughput vs. buffer size. With B ≥ C×RTT, throughput reaches 100%; with B < C×RTT, it falls short.]
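As an illustration of the rule of thumb, a minimal Python sketch; the link rate and RTT below are assumed example values, not numbers from the slides:

    # Buffer needed by the C x RTT rule of thumb for one TCP flow.
    C_bps = 10e9                      # assumed link rate: 10 Gbps
    rtt_s = 100e-6                    # assumed data-center RTT: 100 us
    bdp_bytes = C_bps * rtt_s / 8
    print(f"B >= {bdp_bytes / 1e3:.0f} KB")   # -> 125 KB of buffering per flow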

Page 7:

Overview of our Approach

Main idea:
• Use "phantom queues"
  – Signal congestion before any queuing occurs

• Use DCTCP [SIGCOMM'10]
  – Mitigate the throughput loss that can occur without buffers

• Use hardware pacers
  – Combat burstiness due to offload mechanisms like LSO and interrupt coalescing

Page 8:

Review: DCTCP

Switch:
• Set ECN mark when Queue Length > K.

[Diagram: a queue of size B with marking threshold K; arrivals above K are marked, below K they are not.]

Source:
• React in proportion to the extent of congestion → fewer fluctuations
  – Reduce window size based on the fraction of marked packets (see the sketch below).

  ECN Marks       TCP                 DCTCP
  1011110111      Cut window by 50%   Cut window by 40%
  0000000001      Cut window by 50%   Cut window by  5%
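To make the table concrete, here is a simplified per-window sketch of the DCTCP source reaction (the gain g is an assumed parameter; this is an illustration, not the kernel patch):

    # alpha is a running estimate of the fraction of packets marked with ECN.
    def dctcp_window_update(cwnd, alpha, marked, acked, g=1 / 16):
        F = marked / acked                 # fraction of this window's packets marked
        alpha = (1 - g) * alpha + g * F    # EWMA of the marking fraction
        cwnd = cwnd * (1 - alpha / 2)      # alpha ~ 0.8 -> ~40% cut; alpha ~ 0.1 -> 5% cut
        return cwnd, alpha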

Page 9:

DCTCP vs TCP

Setup: Windows 7 hosts, Broadcom 1 Gbps switch
Scenario: 2 long-lived flows, ECN marking threshold = 30 KB

[Plot: queue length (KBytes) over time for TCP vs. DCTCP.]

– From Alizadeh et al. [SIGCOMM'10]

Page 10:

Achieving Zero Queuing Delay

[Diagram: incoming traffic into a switch port of capacity C. With TCP, the buffer fills and queuing delay is ~1–10 ms; with DCTCP, the queue hovers around the marking threshold K and delay is ~100 μs. The target is ~zero latency.]

How do we get this?

Page 11:

Phantom Queue

[Diagram: a "bump on the wire" (NetFPGA implementation) sits on a switch link of speed C; the phantom queue drains at γC and marks packets above a marking threshold.]

• Key idea:
  – Associate congestion with link utilization, not buffer occupancy
  – Virtual Queue (Gibbens & Kelly 1999, Kunniyur & Srikant 2001)

• Drain rate γC with γ < 1: creates "bandwidth headroom" (see the sketch below)
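A minimal software sketch of the phantom-queue idea (the drain factor and marking threshold below are assumed example values; the real module is a NetFPGA bump on the wire):

    class PhantomQueue:
        """Counts bytes as if the link drained at gamma*C, and marks ECN on
        the virtual backlog even though the real queue may be empty."""
        def __init__(self, C_bps=1e9, gamma=0.95, mark_thresh_bytes=6000):
            self.drain_Bps = gamma * C_bps / 8    # virtual drain rate, bytes/sec
            self.thresh = mark_thresh_bytes
            self.vq = 0.0                         # virtual queue length, bytes
            self.last_t = 0.0

        def on_packet(self, t, size_bytes):
            # Drain for the elapsed time, add the packet, then decide on marking.
            self.vq = max(0.0, self.vq - (t - self.last_t) * self.drain_Bps)
            self.last_t = t
            self.vq += size_bytes
            return self.vq > self.thresh          # True -> set the ECN mark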

Page 12:

Throughput & Latency vs. PQ Drain Rate

[Two plots vs. PQ Drain Rate (600–1000 Mbps), one line per ECN marking threshold (ecn1k, ecn3k, ecn6k, ecn15k, ecn30k): Throughput [Mbps] (left) and Mean Switch Latency [μs] (right).]

Page 13:

The Need for Pacing

• TCP traffic is very bursty
  – Made worse by CPU-offload optimizations like Large Send Offload and Interrupt Coalescing
  – Causes spikes in queuing, increasing latency

• Example: a 1 Gbps flow on a 10 Gbps NIC is sent as 65 KB bursts every 0.5 ms (see the quick check below).
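A back-of-envelope check of that example (a sketch; only the 65 KB / 0.5 ms / 10 Gbps numbers come from the slide):

    burst_bytes = 65 * 1024
    period_s = 0.5e-3
    avg_rate = burst_bytes * 8 / period_s       # ~1.06 Gbps average, i.e. a "1 Gbps" flow
    on_wire = burst_bytes * 8 / 10e9            # but each burst leaves the NIC in ~53 us
    print(f"avg {avg_rate / 1e9:.2f} Gbps, burst lasts {on_wire * 1e6:.0f} us at 10 Gbps")
    # Downstream 1 Gbps links must buffer most of each 65 KB burst -> latency spikes.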

Page 14:

Impact of Interrupt Coalescing

  Interrupt Coalescing   Receiver CPU (%)   Throughput (Gbps)   Burst Size (KB)
  disabled               99                 7.7                 67.4
  rx-frames=2            98.7               9.3                 11.4
  rx-frames=8            75                 9.5                 12.2
  rx-frames=32           53.2               9.5                 16.5
  rx-frames=128          30.7               9.5                 64.0

More interrupt coalescing → lower CPU utilization & higher throughput, but more burstiness.

Page 15:

Hardware Pacer Module

[Diagram: outgoing packets from the server NIC pass through a flow-association table; flows selected for pacing go through a token-bucket rate limiter (rate R, backlog Q_TB) before TX, while un-paced traffic bypasses it.]

• Algorithmic challenges:
  – At what rate to pace? Found dynamically (see the sketch below):

      R ← (1 − η)·R + η·R_measured + β·Q_TB

  – Which flows to pace? Elephants: on each ACK with the ECN bit set, begin pacing the flow with some probability.
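A minimal sketch of the pacer's rate adaptation (η and β values are assumed for illustration; the real pacer is implemented in the NetFPGA):

    def update_pacing_rate(R, R_measured, Q_TB, eta=0.125, beta=16.0):
        """Blend the current pacing rate with the measured flow rate, plus a
        term proportional to the token-bucket backlog so the pacer speeds up
        when packets accumulate behind it."""
        return (1 - eta) * R + eta * R_measured + beta * Q_TB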

Page 16:

Throughput & Latency vs. PQ Drain Rate (with Pacing)

[Two plots vs. PQ Drain Rate (600–1000 Mbps), one line per ECN marking threshold (ecn1k–ecn30k): Throughput [Mbps] (left) and Mean Switch Latency [μs] (right), with mean latency annotated at 5 μs.]

Page 17:

No Pacing vs Pacing (Mean Latency)

[Two plots of Mean Switch Latency [μs] vs. PQ Drain Rate (600–1000 Mbps) for ECN thresholds ecn1k–ecn30k: without pacing (left) and with pacing (right, annotated at 5 μs).]

Page 18:

No Pacing vs Pacing (99th Percentile Latency)

[Two plots of 99th Percentile Latency [μs] vs. PQ Drain Rate (600–1000 Mbps) for ECN thresholds ecn1k–ecn30k: without pacing (left) and with pacing (right, annotated at 21 μs).]

Page 19:

The HULL Architecture

[Diagram: hosts run DCTCP congestion control and a hardware pacer; switch links carry phantom queues.]

Page 20:

Implementation and Evaluation

[Diagram: testbed topology with servers S1–S10 and NetFPGAs NF1–NF6 attached to switch SW1.]

• Implementation
  – PQ, Pacer, and Latency Measurement modules implemented in NetFPGA
  – DCTCP in Linux (patch available online)

• Evaluation
  – 10-server testbed
  – Numerous micro-benchmarks
    • Static & dynamic workloads
    • Comparison with an 'ideal' 2-priority QoS scheme
    • Different marking thresholds, switch buffer sizes
    • Effect of parameters
  – Large-scale ns-2 simulations

Page 21:

Dynamic Flow Experiment (20% load)

• 9 senders → 1 receiver (80% 1 KB flows, 20% 10 MB flows)

                       Switch Latency (μs)      10 MB FCT (ms)
                       Avg       99th           Avg      99th
  TCP                  111.5     1,224.8        110.2    349.6
  DCTCP-30K            38.4      295.2          106.8    301.7
  DCTCP-6K-Pacer       6.6       59.7           111.8    320.0
  DCTCP-PQ950-Pacer    2.8       18.6           125.4    359.9

Page 22:

Conclusion

• The HULL architecture combines
  – Phantom queues
  – DCTCP
  – Hardware pacing

• We trade some bandwidth (which is relatively plentiful) for significant latency reductions (often 10–40x compared to TCP and DCTCP).

Page 23:

Thank you!