
Transcript of Network-aware Data Management Middleware for High Throughput Flows --- AT&T Research, Bedminster, NJ – Talk – March 16, 2015

Page 1: Network-aware Data Management Middleware for High Throughput Flows --- AT&T Research Bedminster, NJ – Talk – March 16 2015

Network-aware Data Management Middleware for High Throughput Flows

March 16, 2015

Mehmet Balman
http://balman.info
Performance Engineer at VMware Inc.
Guest Scientist at Berkeley Lab

1

Page 2:

About me:

• 2013: Performance, Central Engineering, VMware, Palo Alto, CA

• 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)

• 2005: Center for Computation & Technology (CCT), Baton Rouge, LA

• Computer Science, Louisiana State University (2010, 2008)
• Bogazici University, Istanbul, Turkey (2006, 2000)

Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D.
Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S.
Parallel Tetrahedral Mesh Refinement, M.S.

2  

Page 3:

Why Network-aware? Networking is one of the major components in many of the solutions today

• Distributed data and compute resources
• Collaboration: data to be shared between remote sites
• Data centers are complex network infrastructures

• What further steps are necessary to take full advantage of future networking infrastructure?
• How are we going to deal with performance problems?
• How can we enhance data management services and make them network-aware?

New collaborations between data management and networking communities.

3  

Page 4:

Two major players:
• Abstraction and Programmability
  • Rapid development, intelligent services
  • Orchestrating compute, storage, and network resources together
  • Integration and deployment of complex workflows
  • Virtualization (+containers)
  • Distributed storage (storage wars)
  • Open source (if you can't fix it, you don't own it)

• Performance Gap:
  • The limitation is current system software versus foreseen speeds: hardware is fast, software is slow
  • The latency/throughput mismatch will lead to new innovations

4  

Page 5:

Outline

• Data Streaming in High-bandwidth Networks
  • Climate100: Advanced Networking Initiative and 100Gbps Demo
  • MemzNet: Memory-Mapped Network Zero-copy Channels
  • Core Affinity and End System Tuning in High-Throughput Flows

• Network Reservation and Online Scheduling
  • FlexRes: A Flexible Network Reservation Algorithm
  • SchedSim: Online Scheduling with Advance Provisioning

• Performance Engineering and Virtualized Solutions
  • Software Defined Storage

5  

Page 6:

100Gbps networking has finally arrived!

Applications' Perspective
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.

1Gbps to 10Gbps transition (10 years ago):
applications did not run 10 times faster just because more bandwidth was available

6  

Page 7:

ANI 100Gbps Demo

• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (Cosmology)
• Data movement of large datasets with many files (Climate analysis)

7  

Page 8:

Earth System Grid Federation (ESGF)

• Over 2,700 sites
• 25,000 users

• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB

• Remote Data Analysis
• Bulk Data Movement

8

Page 9:

Application's Perspective: Climate Data Analysis

9  

Page 10:

The lots-of-small-files problem!

File-centric tools?

[Diagram: FTP vs. RPC. FTP: request a file → send file, repeated per file; RPC: request data → send data.]

• Keep the network pipe full
• We want out-of-order and asynchronous send/receive
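The idea above can be sketched as a toy block-tagging scheme (names and layout hypothetical, not the talk's actual protocol): instead of a per-file request/response loop, every block carries its own (file, offset) bookkeeping, so blocks can travel out of order over concurrent streams while the pipe stays full.

```python
def files_to_blocks(files, block_size=4):
    """Split many small files into self-describing blocks.

    Each block carries (file name, offset, payload), so no per-file
    request/response round trip is needed and blocks may be sent in
    any order over concurrent streams.
    """
    blocks = []
    for name, data in files.items():
        for off in range(0, len(data), block_size):
            blocks.append((name, off, data[off:off + block_size]))
    blocks.reverse()  # mimic out-of-order arrival
    return blocks

def reassemble(blocks):
    """Rebuild the files from tagged blocks, whatever the arrival order."""
    out = {}
    for name, off, chunk in sorted(blocks, key=lambda b: (b[0], b[1])):
        out[name] = out.get(name, b"") + chunk
    return out
```

Because ordering lives in each block's tag rather than in the transfer protocol, the receiver never stalls waiting for the "next" file.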

  10  

Page 11:

Many  Concurrent  Streams  

(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, for a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair at source and destination, started simultaneously. Each peak represents a different test: 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., at concurrency level 4 there are 40 streams in total).

   

11  

Page 12:

ANI Testbed 100Gbps (10x10 NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50MB

Effects  of  many  concurrent  streams  

12  

Page 13:

Analysis of Core Affinities (NUMA Effect)

Nathan Hanford et al., NDM'13

Sandy Bridge Architecture

Receive process

13

Page 14:

14  

Analysis of Core Affinities (NUMA Effect)

Nathan Hanford et al., NDM'14

Page 15:

100Gbps demo environment

RTT: Seattle – NERSC: 16ms
     NERSC – ANL: 50ms
     NERSC – ORNL: 64ms

15  

Page 16:

Framework for the Memory-mapped Network Channel

+ Synchronization mechanism for RoCE
- Keep the pipe full for remote analysis

16  

Page 17:

Moving climate files efficiently

17  

Page 18:

Advantages
• Decoupling I/O and network operations
  • front-end (I/O processing)
  • back-end (networking layer)

• Not limited by the characteristics of the file sizes
  • On-the-fly tar approach: bundling and sending many files together

• Dynamic data channel management
  The parallelism level can be increased or decreased, both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants). MemzNet is not file-centric: bookkeeping information is embedded inside each block.
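A minimal sketch of such a self-describing block (the layout is hypothetical, not MemzNet's actual wire format): bookkeeping travels in a small fixed header, and the data section is padded out to a page boundary, as in the demo configuration.

```python
import struct

HEADER = struct.Struct("!QQI")  # file id, offset, payload length
PAGE = 4096                     # assumed system page size

def pack_block(file_id, offset, payload):
    """Embed bookkeeping inside the block itself; pad the header so
    the data section starts on a page boundary."""
    header = HEADER.pack(file_id, offset, len(payload))
    pad = (-HEADER.size) % PAGE
    return header + b"\x00" * pad + payload

def unpack_block(block):
    """Recover (file_id, offset, payload) from a packed block."""
    file_id, offset, length = HEADER.unpack_from(block, 0)
    start = HEADER.size + (-HEADER.size) % PAGE
    return file_id, offset, block[start:start + length]
```

Since every block identifies itself, back-end threads can be added or removed without any per-file renegotiation on the channel.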

18  

Page 19:

MemzNet’s  Architecture  for  data  streaming  

19  

Page 20:

100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB
• Each block's data section was aligned according to the system page size.
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel.
• At ANL/ORNL, 4 front-end threads for processing received data blocks.
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.

20  

Page 21:

83Gbps    throughput  

21  

Page 22:

MemzNet’s  Performance    

TCP  buffer  size  is  set  to  50MB    

[Charts: MemzNet vs. GridFTP throughput, in the 100Gbps demo and on the ANI Testbed]

22  

Page 23:

Challenge?
• High bandwidth brings new challenges!
• We need a substantial amount of processing power, and the involvement of multiple cores, to fill a 40Gbps or 100Gbps network
• Fine-tuning, both in the network and application layers, to take advantage of the higher network capacity.
• Incremental improvement in current tools?
• We cannot expect every application to tune and improve every time we change the link technology or speed.

 23  

Page 24:

MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement

MemzNet is an initial effort to put a new layer between the application and the transport layer.

• The main goal is to define a network channel so applications can use it directly, without the burden of managing/tuning the network communication.

Tech report: LBNL-6177E
24

Page 25:

MemzNet = New Execution Model
• Luigi Rizzo's netmap
  • proposes a new API to send/receive data over the network
• RDMA programming model
  • MemzNet as a memory-management component
• IX: a data-plane OS (Adam Belay et al. @ Stanford; similar to MemzNet's model)
• mTCP (event-based; replaces send/receive in user level)
• Tanenbaum et al.: minimizing context switches, proposing MONITOR/MWAIT for synchronization

25  

Page 26:

Problem Domain: ESnet's OSCARS

26  

[ESnet network map: international R&E peerings (CANARIE, GÉANT/NORDUnet, GLORIAD, SINET, AARnet, CLARA/CUDI, CERN/USLHCNet, Internet2/NLR/DREN/NISN/NASA, and Asia-Pacific networks such as KREONET2, TWAREN, TRANSPAC, REANNZ), hub cities (Seattle, Sunnyvale, Sacramento, Boise, Denver, Albuquerque, El Paso, Houston, Kansas City, Chicago, Nashville, Atlanta, Washington DC, New York, Boston), and DOE sites PNNL, SLAC, LBNL, AMES, FNAL, ANL, ORNL, PPPL, BNL, JLAB]

• Connecting experimental facilities and supercomputing centers
• On-Demand Secure Circuits and Advance Reservation System
• Guaranteed bandwidth between collaborating institutions, delivered as network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager)

OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time)

End-to-end Reservation: Storage + Network

Page 27:

Reservation Request
• Between edge routers. Need to ensure availability of the requested bandwidth from source to destination for the requested time interval

• R = { n_source, n_destination, M_bandwidth, t_start, t_end }
  • source/destination end-points
  • requested bandwidth
  • start/end times

Committed reservations between t_start and t_end are examined. The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
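The per-link part of that check can be sketched as follows (a simplification; the real system also runs the path computation): the residual bandwidth of a link over the requested interval is its capacity minus the peak load of committed reservations overlapping that interval.

```python
def residual_bandwidth(capacity, commitments, t_start, t_end):
    """Residual bandwidth on one link over [t_start, t_end).

    commitments: list of (bw, t1, t2) tuples.  A reservation overlaps
    the request if t1 < t_end and t2 > t_start; what limits the request
    is the peak overlapping load at any instant, so sweep event points.
    """
    events = []
    for bw, t1, t2 in commitments:
        if t1 < t_end and t2 > t_start:
            events.append((max(t1, t_start), bw))   # load goes up
            events.append((min(t2, t_end), -bw))    # load goes down
    load, peak = 0, 0
    for _, delta in sorted(events):
        load += delta
        peak = max(peak, load)
    return capacity - peak
```

A request for bandwidth M is feasible on the link exactly when M is at most this residual value.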

27  

Page 28:

Reservation  

28  

• Components (Graph):
  • node (router), port, link (connecting two ports)
  • engineering metric (~latency)
  • maximum bandwidth (capacity)

• Reservation: source, destination, path, time
  • (time t1, t3) A -> B -> D (900Mbps)
  • (time t2, t3) A -> C -> D (400Mbps)
  • (time t4, t5) A -> B -> D (800Mbps)

[Topology: nodes A, B, C, D with links of 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; timeline showing Reservations 1–3 over t1–t5]

Page 29:

Example

(time t1, t2): A to D (600Mbps)? NO
               A to D (500Mbps)? YES

[Topology, each link labeled available / reserved (capacity): 0 Mbps / 900Mbps (900Mbps); 100 Mbps / 900Mbps (1000Mbps); 800 Mbps / 0Mbps (800Mbps); 500 Mbps / 0Mbps (500Mbps); 300 Mbps / 0Mbps (300Mbps)]

Active reservations:
reservation 1: (time t1, t3) A -> B -> D (900Mbps)
reservation 2: (time t1, t3) A -> C -> D (400Mbps)
reservation 3: (time t4, t5) A -> B -> D (800Mbps)

29

Page 30:

Example

(time t1, t3): A to D (500Mbps)? NO
               A to C (500Mbps)? NO (not max-flow!)

[Topology, each link labeled available / reserved (capacity): 0 Mbps / 900Mbps (900Mbps); 100 Mbps / 900Mbps (1000Mbps); 400 Mbps / 400Mbps (800Mbps); 100 Mbps / 400Mbps (500Mbps); 300 Mbps / 0Mbps (300Mbps)]

Active reservations:
reservation 1: (time t1, t3) A -> B -> D (900Mbps)
reservation 2: (time t1, t3) A -> C -> D (400Mbps)
reservation 3: (time t4, t5) A -> B -> D (800Mbps)

30

Page 31:

Alternative Approach: Flexible Reservations

• If the requested bandwidth cannot be guaranteed:
  • Trial-and-error until an available reservation is found
  • The client is not given other possible options

• How can we enhance the OSCARS reservation system?
  • Be flexible: submit constraints, and the system suggests possible reservation options satisfying the given requirements

R's = { n_source, n_destination, M_MAXbandwidth, D_dataSize, t_earliestStart, t_latestEnd }

The reservation engine finds the reservation
R = { n_source, n_destination, M_bandwidth, t_start, t_end }
for the earliest completion or for the shortest duration, where M_bandwidth ≤ M_MAXbandwidth and t_earliestStart ≤ t_start < t_end ≤ t_latestEnd.

31

Page 32:

Bandwidth Allocation (time-dependent)

Modified Dijkstra's algorithm (max available bandwidth):

• The bottleneck constraint is not additive
• (A QoS constraint is additive in shortest-path search, etc.)

The maximum bandwidth available for allocation from a source node to a destination node

32
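The modified Dijkstra search can be sketched as follows (the graph in the usage example is illustrative, not the slides' exact topology): the relaxation step takes the minimum of the path's bottleneck so far and the link's available bandwidth, instead of adding link costs.

```python
import heapq

def widest_path(graph, src, dst):
    """Modified Dijkstra: maximize the bottleneck (minimum-capacity)
    link along the path rather than minimizing an additive cost.

    graph: {node: {neighbor: available_bandwidth}}.  Returns the max
    bandwidth allocatable from src to dst, or 0 if unreachable.
    """
    best = {src: float("inf")}
    heap = [(-best[src], src)]  # max-heap via negated bandwidth
    while heap:
        neg_bw, u = heapq.heappop(heap)
        bw = -neg_bw
        if u == dst:
            return bw
        if bw < best.get(u, 0):
            continue  # stale heap entry
        for v, cap in graph[u].items():
            cand = min(bw, cap)  # bottleneck, not additive
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0
```

For example, with `{"A": {"B": 900, "C": 1000}, "B": {"D": 800}, "C": {"D": 500}, "D": {}}`, the widest path from A to D goes through B with a bottleneck of 800Mbps.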


Page 33:

Analogous Example
• A vehicle travelling from city A to city B
• There are multiple cities between A and B, connected by separate highways
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow to reach city B from city A as early as possible (earliest completion)
• Or we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)

33

Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey

Page 34:

Time steps

n  Time steps between t1 and t13

[Timeline: Reservations 1–3 over t1–t13; the boundaries t1, t4, t6, t7, t9, t12, t13 partition time into steps (Res 1: t1–t4; Res 1,2: t4–t6; Res 2: t6–t7; Res 3: t9–t12), labeled ts1–ts4]

Max (2r+1) time steps, where r is the number of reservations

34
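The time-step partition can be computed directly (a sketch; the reservation times in the test mirror the slides' running example):

```python
def time_steps(reservations, search_start, search_end):
    """Partition [search_start, search_end] at reservation boundaries.

    reservations: list of (t_start, t_end) pairs.  With r committed
    reservations there are at most 2r boundaries, hence at most 2r + 1
    time steps; within one step the set of active reservations (and so
    the available bandwidth on every link) is constant.
    """
    points = {search_start, search_end}
    for t1, t2 in reservations:
        if search_start < t1 < search_end:
            points.add(t1)
        if search_start < t2 < search_end:
            points.add(t2)
    cuts = sorted(points)
    return list(zip(cuts, cuts[1:]))  # consecutive (from, to) steps
```

Each returned step is then a candidate building block for the time windows searched next.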

Page 35:

Static Graphs

[Four topology snapshots, one per time step over t1–t9: G(ts1), G(ts2), G(ts3), G(ts4), each labeling every link of A, B, C, D with the bandwidth available during that step (e.g., 0, 100, 800, 500, 300 Mbps in one step; 0, 100, 400, 100, 300 Mbps in the next; 900, 1000, 400, 100, 300 Mbps; 900, 1000, 800, 500, 300 Mbps)]

35

Page 36:

Time Windows

[Topology snapshots for combined windows over t1–t9: composing consecutive time steps into a time window takes the per-link minimum of the step graphs (bottleneck constraint), e.g., G(tw) = G(ts1) × G(ts2) for tw = ts1 + ts2, and G(tw) = G(ts3) × G(ts4) for tw = ts3 + ts4]

Max (s × (s + 1))/2 time windows, where s is the number of time steps

36

Page 37:

Time Window List (special data structures)

Time windows list: [ now … infinite ]

new reservation: reservation 1, start t1, end t10
→ [ now … t1 ] [ t1 … t10: Res 1 ] [ t10 … infinite ]

new reservation: reservation 2, start t12, end t20
→ [ now … t1 ] [ t1 … t10: Res 1 ] [ t10 … t12 ] [ t12 … t20: Res 2 ] [ t20 … infinite ]

37

Careful software design makes the implementation fast and efficient
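A sketch of that data structure (a plain Python list for clarity; a production version would use a balanced tree or skip list): inserting a reservation splits the affected windows at its boundaries and tags the windows it covers.

```python
def insert_reservation(windows, start, end, label):
    """Insert a reservation into an ordered time-window list.

    windows: list of [t_from, t_to, set_of_labels] covering the whole
    timeline.  The reservation's start and end become new window
    boundaries; every window fully inside [start, end] gets the label.
    """
    out = []
    for t_from, t_to, labels in windows:
        # Cut this window at any reservation boundary that falls inside it.
        cuts = sorted({t_from, t_to} |
                      {t for t in (start, end) if t_from < t < t_to})
        for a, b in zip(cuts, cuts[1:]):
            tag = set(labels)
            if start <= a and b <= end:
                tag.add(label)
            out.append([a, b, tag])
    return out
```

Starting from the single window `[now, infinite]`, the two insertions on the slide produce exactly the five windows shown above.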

Page 38:

Performance

max-bandwidth path ~ O(n^2), where n is the number of nodes in the topology graph.

In the worst case, we may need to search all time windows, (s × (s + 1))/2, where s is the number of time steps. If there are r committed reservations in the search period, there can be a maximum of 2r + 1 different time steps in the worst case. Overall, the worst-case complexity is bounded by O(r^2 n^2).

Note: r is relatively very small compared to the number of nodes n

38

Page 39:

Example

Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
Reservation 3: (time t9, t12) A -> B -> D (700Mbps)

[Topology: nodes A, B, C, D with links of 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; timeline of Reservations 1–3 over t1–t13]

Request from A to D (earliest completion): max bandwidth = 200Mbps, volume = 200Mbps × 4 time slots, earliest start = t1, latest finish = t13

39

Page 40:

Search Order - Time Windows

[Time steps: Res 1 (t1–t4), Res 1,2 (t4–t6), Res 2 (t6–t7), (t7–t9), Res 3 (t9–t12), (t12–t13)]

Time windows searched, with max bandwidth from A to D (window length in time steps):
1. t1–t4: 900Mbps (3)
2. t4–t6: 100Mbps (2)
3. t1–t6: 100Mbps (5)
4. t6–t7: 900Mbps (1)
5. t4–t7: 100Mbps (3)
6. t1–t7: 100Mbps (6)
7. t7–t9: 900Mbps (2)
8. t6–t9: 900Mbps (3)
9. t4–t9: 100Mbps (5)
10. t1–t9: 100Mbps (8)

Reservation: (A to D) (100Mbps) start=t1, end=t9

40

Page 41:

Search Order - Time Windows: Shortest duration?

[Time steps: Res 1, Res 1,2, Res 2, Res 3 over t1, t4, t6, t7, t9, t12, t13]

Time windows searched, with max bandwidth from A to D (window length in time steps):
1. t9–t12: 200Mbps (3)
2. t12–t13: 900Mbps (1)
3. t9–t13: 200Mbps (4)

Reservation: (A to D) (200Mbps) start=t9, end=t13

• From A to D, max bandwidth = 200Mbps, volume = 175Mbps × 4 time slots, earliest start = t1, latest finish = t13

earliest completion: (A to D) (100Mbps) start=t1, end=t8
shortest duration: (A to D) (200Mbps) start=t9, end=t12.5
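The two preference orders can be sketched as a simple evaluation over candidate (bandwidth, start, latest-end) options (a simplification of the full time-window search; the numbers in the test follow the slide's example, with volume 175Mbps × 4 = 700):

```python
def preference_orders(options, volume):
    """Pick the best allocation under each preference for a fixed volume.

    options: list of (bandwidth, t_start, t_latest_end) candidates.
    Sending `volume` at `bandwidth` takes volume / bandwidth time units,
    so completion = t_start + volume / bandwidth; a candidate is
    feasible only if it completes by its latest end time.
    """
    feasible = []
    for bw, t_start, t_latest in options:
        duration = volume / bw
        completion = t_start + duration
        if completion <= t_latest:
            feasible.append({"bw": bw, "start": t_start,
                             "end": completion, "duration": duration})
    earliest = min(feasible, key=lambda f: f["end"])       # earliest completion
    shortest = min(feasible, key=lambda f: f["duration"])  # shortest duration
    return earliest, shortest
```

With the slide's two candidate windows, 100Mbps from t1 and 200Mbps from t9, this reproduces both answers: earliest completion at t8 and shortest duration ending at t12.5.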

 

41  

Page 42:

Source > Network > Destination

[Topology A, B, C, D with links of 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; end hosts n1 and n2 attached at the edges]

Now we have multiple requests

42

Page 43:

With start/end times
• Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
• If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period

• Unsplittable Flow Problem:
  • an undirected graph
  • route demand from source(s) to destination(s) and maximize/minimize the total profit/cost

43

The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem)

Page 44:

Methodology
• Displace other jobs to open space for the new request
  • we can shift at most n jobs
• Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives an opportunity for co-allocation)
• Gives a polynomial approximation algorithm
• The preference order converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest-deadline-first)
  • Earliest completion + shortest duration
  • Minimize concurrency
• Even random ranking would work (relaxation in an NP-hard problem)

 

44  

Page 45:

       

45  

Page 46:

Recall Time Windows

[Time steps: Res 1 (t1–t4), Res 1,2 (t4–t6), Res 2 (t6–t7), (t7–t9), Res 3 (t9–t12), (t12–t13)]

Time windows searched, with max bandwidth from A to D (window length in time steps):
1. t1–t4: 900Mbps (3)
2. t4–t6: 100Mbps (2)
3. t1–t6: 100Mbps (5)
4. t6–t7: 900Mbps (1)
5. t4–t7: 100Mbps (3)
6. t1–t7: 100Mbps (6)
7. t7–t9: 900Mbps (2)
8. t6–t9: 900Mbps (3)
9. t4–t9: 100Mbps (5)
10. t1–t9: 100Mbps (8)

Reservation: (A to D) (100Mbps) start=t1, end=t9

46

Page 47:

Test

In real life, the number of nodes and the number of reservations in a given search interval are limited.

See the AINA'13 paper for results, plus a comparison of different preference metrics.

47

Page 48:

Autonomic Provisioning System

• Generate constraints automatically (without user input)
  • Volume (elephant flow?)
  • True deadline, if applicable
  • End-host resource availability
  • Burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring
• Minimize operational cost
• An alternative to manual traffic engineering

What is the incentive to make correct reservations?

48

Page 49:

[Diagram: Experimental facility A and Data node B (web access) connected across a wide-area SDN to Data Center 1 and Data Center 2]

(1) Experimental facility A generates 30T of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited

(2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1P total size, when new data is available in data center 2

(3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months

49

Page 50:

Example
• The experimental facility periodically transfers data (i.e., every night)
• Data replication happens occasionally, and it will take a week to move 1P of data. It could get delayed a couple of hours with no harm

• Wide-area download traffic will increase gradually; most of the traffic will be during the day.

• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (allocating some bandwidth to confirm that it would finish within a week, as usual)

50  

Page 51:

Virtual Circuit Reservation Engine

Autonomic provisioning system + monitoring

Reservation Engine:
• Select optimal path/time/bandwidth
• Maximize the number of admitted requests
• Increase overall system utilization and network efficiency
• Dynamically update the selected routing path for network efficiency
• Modify existing reservations dynamically to open space/time for new requests

51

Page 52:

Performance Engineer?

• Sample projects:
  • VSAN (Virtual SAN)
  • VVOL (Virtual Volumes)

• Important aspects of performance engineering:
  • Be a part of the initial development phase
  • Develop techniques to analyze performance problems
  • Make sure performance issues are addressed correctly

52  

Page 53:

THANK YOU

Any Question/Comment?

Mehmet Balman
[email protected]

http://balman.info

53