Network-aware Data Management Middleware for High Throughput Flows --- AT&T Research Bedminster, NJ...
Network-aware Data Management Middleware for High Throughput Flows
March 16, 2015
Mehmet Balman http://balman.info Performance Engineer at VMware Inc. Guest Scientist at Berkeley Lab
1
About me:
Ø 2013: Performance, Central Engineering, VMware, Palo Alto, CA
Ø 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
Ø 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
v Computer Science, Louisiana State University (2010, 2008) v Bogazici University, Istanbul, Turkey (2006, 2000)
Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D. Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S. Parallel Tetrahedral Mesh Refinement, M.S.
2
Why Network-aware? Networking is one of the major components in many of today's solutions
• Distributed data and compute resources • Collaboration: data to be shared between remote sites • Data centers are complex network infrastructures
ü What further steps are necessary to take full advantage of future networking infrastructure?
ü How are we going to deal with performance problems? ü How can we enhance data management services and make them network-aware?
New collaborations between data management and networking communities.
3
Two major players: • Abstraction and Programmability
• Rapid development, intelligent services • Orchestrating compute, storage, and network resources together • Integration and deployment of complex workflows
• Virtualization (+containers) • Distributed storage (storage wars) • Open Source (if you can't fix it, you don't own it)
• Performance Gap: • Limitation is current system software vs. foreseen speed: • Hardware is fast, Software is slow
• Latency/throughput mismatch will lead to new innovations
4
Outline
• Data Streaming in High-bandwidth Networks • Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo • MemzNet: Memory-Mapped Network Zero-copy Channels • Core Affinity and End-System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling • FlexRes: A Flexible Network Reservation Algorithm • SchedSim: Online Scheduling with Advance Provisioning
• Performance Engineering and Virtualized Solutions
• Software Defined Storage
5
100Gbps networking has finally arrived!
Applications' Perspective: Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
1Gbps to 10Gbps transition (10 years ago):
Applications did not run 10 times faster just because more bandwidth was available
6
ANI 100Gbps Demo
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (Cosmology)
• Data movement of large datasets with many files (Climate analysis)
7
Earth System Grid Federation (ESGF)
8
• Over 2,700 sites • 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB • IPCC Fourth Assessment Report (AR4): 35TB
• Remote Data Analysis • Bulk Data Movement
Application’s Perspective: Climate Data Analysis
9
lots-of-small-files problem!
file-centric tools?
FTP (file-centric): request a file -> send file; request a file -> send file; ...
RPC: request data -> send data
• Keep the network pipe full
• We want out-of-order and asynchronous send/receive
10
Many Concurrent Streams
(a) total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, for a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, 64 concurrent streams per job were initiated for 5min intervals (e.g. when the concurrency level is 4, there are 40 streams in total).
11
ANI testbed 100Gbps (10x10 NICs, three hosts): Interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs - 5min intervals], TCP buffer size is 50M
Effects of many concurrent streams
12
Analysis of Core Affinities (NUMA Effect)
13 Nathan Hanford et al. NDM’13
Sandy Bridge Architecture
Receive process
14
Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al. NDM’14
100Gbps demo environment
RTT: Seattle - NERSC 16ms, NERSC - ANL 50ms, NERSC - ORNL 64ms
15
Framework for the Memory-mapped Network Channel
+ Synchronization mechanism for RoCE - Keep the pipe full for remote analysis
16
Moving climate files efficiently
17
Advantages • Decoupling I/O and network operations
• front-end (I/O processing) • back-end (networking layer)
• Not limited by the characteristics of the file sizes • On-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management: can increase/decrease the parallelism level both in the network communication and in I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants). MemzNet is not file-centric; bookkeeping information is embedded inside each block.
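Because bookkeeping information travels inside each block, blocks are self-describing and can arrive out of order. A minimal sketch of such block framing (the header layout here is hypothetical, not MemzNet's actual wire format):

```python
import struct

# Hypothetical block header: (file_id, offset, payload length) packed
# ahead of the payload, so each block is self-describing and the
# receiver needs no per-file session state.
HEADER = struct.Struct("!QQI")  # file_id (u64), offset (u64), length (u32)

def frame_block(file_id, offset, payload):
    """Prepend bookkeeping so blocks can be reassembled out of order."""
    return HEADER.pack(file_id, offset, len(payload)) + payload

def unframe_block(block):
    file_id, offset, length = HEADER.unpack_from(block)
    payload = block[HEADER.size:HEADER.size + length]
    return file_id, offset, payload

blk = frame_block(7, 4096, b"climate-data")
assert unframe_block(blk) == (7, 4096, b"climate-data")
```

With this framing, a receiver can write each payload at its stated file offset regardless of arrival order, which is what lets the sender keep the pipe full.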
18
MemzNet’s Architecture for data streaming
19
100Gbps Demo • CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB • Each block's data section was aligned according to the system pagesize
• 1GB cache both at the client and the server • At NERSC, 8 front-end threads on each host for reading data files in parallel
• At ANL/ORNL, 4 front-end threads for processing received data blocks
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection
20
83Gbps throughput
21
MemzNet’s Performance
TCP buffer size is set to 50MB
MemzNet GridFTP
100Gbps demo
ANI Testbed
22
Challenge? • High bandwidth brings new challenges!
• We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
• Fine-tuning, both in network and application layers, to take advantage of the higher network capacity
• Incremental improvement in current tools? • We cannot expect every application to tune and improve every time we change the link technology or speed
23
MemzNet • MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
• Main goal is to define a network channel so applications can directly use it without the burden of managing/tuning the network communication.
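To illustrate the intended usage model, here is a toy in-process stand-in for such a channel: I/O front-end threads deposit blocks, network back-end threads drain them, and either side's thread count can change without touching the other. All names are illustrative; this is not the MemzNet API.

```python
import queue
import threading

class MemoryChannel:
    """Toy stand-in for a MemzNet-style channel: a bounded block buffer
    decoupling producers (I/O front-end) from consumers (network
    back-end). Names are illustrative, not the actual MemzNet API."""
    def __init__(self, capacity_blocks=256):
        self.buf = queue.Queue(maxsize=capacity_blocks)

    def put_block(self, block):
        # Called by I/O front-end threads; blocks when the buffer is full,
        # which naturally throttles readers to the network rate.
        self.buf.put(block)

    def get_block(self):
        # Called by network back-end threads; blocks until data arrives.
        return self.buf.get()

ch = MemoryChannel()
received = []
t = threading.Thread(target=lambda: received.append(ch.get_block()))
t.start()
ch.put_block(b"block-0")
t.join()
assert received == [b"block-0"]
```

The application only sees `put_block`/`get_block`; how many threads sit on either side, and how the network is tuned, stays hidden behind the channel.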
24 Tech report: LBNL-‐6177E
MemzNet = New Execution Model • Luigi Rizzo's netmap
• proposes a new API to send/receive data over the network
• RDMA programming model • MemzNet as a memory-management component
• IX: Data Plane OS (Adam Belay et al. @ Stanford; similar to MemzNet's model)
• mTCP (event-driven / replaces send/receive in user level) • Tanenbaum et al.: minimizing context switches, proposing MONITOR/MWAIT for synchronization
25
Problem Domain: ESnet's OSCARS
26
[ESnet network map: peerings with international R&E networks (ASIA-PACIFIC: ASGC, KREONET2, TWAREN, KAREN, SINET, TRANSPAC, REANNZ, NUS-GP, ODN, BNP/HEPNET; AUSTRALIA: AARnet; LATIN AMERICA: CLARA, CUDI, AMPATH; CANADA: CANARIE; RUSSIA AND CHINA: GLORIAD; EUROPE: GÉANT, NORDUNET; FRANCE: OpenTransit; CERN/USLHCNet; LHCONE; US R&E: DREN, Internet2, NLR, NISN, NASA, USDOI) and DOE sites (PNNL, SLAC, AMES, PPPL, BNL, ORNL, JLAB, FNAL, ANL, LBNL) across hubs including Seattle, Sunnyvale, Sacramento, Boise, Denver, Albuquerque, El Paso, Houston, Kansas City, Chicago, Nashville, Atlanta, Washington DC, New York, Boston]
• Connecting experimental facilities and supercomputing centers
• On-Demand Secure Circuits and Advance Reservation System • Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager)
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time)
End-to-end Reservation: Storage+Network
Reservation Request • Between edge routers: need to ensure availability of the requested bandwidth from source to destination for the requested time interval
v R = { n_source, n_destination, M_bandwidth, t_start, t_end }
v source/destination end-points v requested bandwidth v start/end times
Committed reservations between t_start and t_end are examined. The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
27
Reservation
28
v Components (Graph): v node (router), port, link (connecting two ports) v engineering metric (~latency) v maximum bandwidth (capacity)
v Reservation: v source, destination, path, time v (time t1, t3) A -> B -> D (900Mbps) v (time t2, t3) A -> C -> D (400Mbps) v (time t4, t5) A -> B -> D (800Mbps)
[Topology: nodes A, B, C, D; link capacities A-B 900Mbps, B-D 1000Mbps, A-C 800Mbps, C-D 500Mbps, B-C 300Mbps]
Example
(time t1, t2): A to D (600Mbps)? NO. A to D (500Mbps)? YES.
Per-link state during the interval, shown as available / reserved (capacity):
A-B: 0 / 900 (900Mbps); B-D: 100 / 900 (1000Mbps); A-C: 800 / 0 (800Mbps); C-D: 500 / 0 (500Mbps); B-C: 300 / 0 (300Mbps)
Active reservations:
reservation 1: (time t1, t3) A -> B -> D (900Mbps)
reservation 2: (time t1, t3) A -> C -> D (400Mbps)
reservation 3: (time t4, t5) A -> B -> D (800Mbps)
available / reserved (capacity)
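The slide's yes/no answers can be reproduced with a brute-force check: subtract committed bandwidth from each link, then take the best bottleneck over all simple paths. A sketch using the example's numbers (OSCARS computes an actual path; this only checks feasibility):

```python
from itertools import permutations

# Link capacities and committed bandwidth during the queried interval
# (slide example: reservation 1, A->B->D at 900Mbps, is active).
capacity = {("A","B"): 900, ("B","D"): 1000, ("A","C"): 800,
            ("C","D"): 500, ("B","C"): 300}
reserved = {("A","B"): 900, ("B","D"): 900, ("A","C"): 0,
            ("C","D"): 0, ("B","C"): 0}

def available(u, v):
    e = (u, v) if (u, v) in capacity else (v, u)
    return capacity[e] - reserved[e]

def max_path_bandwidth(src, dst, nodes=("A","B","C","D")):
    """Best single-path (unsplittable) bandwidth: max over simple paths
    of the bottleneck, i.e. the minimum available link bandwidth."""
    inner = [n for n in nodes if n not in (src, dst)]
    best = 0
    for k in range(len(inner) + 1):
        for mid in permutations(inner, k):
            path = (src,) + mid + (dst,)
            edges = list(zip(path, path[1:]))
            if all((e in capacity or e[::-1] in capacity) for e in edges):
                best = max(best, min(available(u, v) for u, v in edges))
    return best

assert max_path_bandwidth("A", "D") == 500   # via A-C-D
# so a 600Mbps request is rejected and a 500Mbps request is granted
```

Brute-force enumeration is only for illustration on this 4-node graph; the deck's later slides replace it with a modified Dijkstra search.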
29
Example
(time t1, t3): A to D (500Mbps)? NO. A to C (500Mbps)? NO (single-path reservations, not max-flow!)
Per-link state during (t1, t3), shown as available / reserved (capacity):
A-B: 0 / 900 (900Mbps); B-D: 100 / 900 (1000Mbps); A-C: 400 / 400 (800Mbps); C-D: 100 / 400 (500Mbps); B-C: 300 / 0 (300Mbps)
Active reservations:
reservation 1: (time t1, t3) A -> B -> D (900Mbps)
reservation 2: (time t1, t3) A -> C -> D (400Mbps)
reservation 3: (time t4, t5) A -> B -> D (800Mbps)
available / reserved (capacity)
30
Alternative Approach: Flexible Reservations
• If the requested bandwidth cannot be guaranteed: • trial-and-error until an available reservation is found • the client is not given other possible options
• How can we enhance the OSCARS reservation system? • Be flexible:
• Submit constraints, and the system suggests possible reservation options satisfying the given requirements
31
R's = { n_source, n_destination, M_MAXbandwidth, D_dataSize, t_EarliestStart, t_LatestEnd }. The reservation engine finds the reservation
R = { n_source, n_destination, M_bandwidth, t_start, t_end } for the earliest completion or for the shortest duration, where M_bandwidth <= M_MAXbandwidth and t_EarliestStart <= t_start < t_end <= t_LatestEnd.
Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (max available bandwidth):
• Bottleneck constraint (not additive)
• (QoS constraints are additive in shortest path, etc.)
32 The maximum bandwidth available for allocation from a source node to a destination node
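A sketch of the modified Dijkstra search: because the bottleneck metric is not additive, relaxation takes the minimum along the path and keeps the maximum per node, instead of summing edge weights. Illustrative code, using the example topology's link capacities:

```python
import heapq

def widest_path(adj, src, dst):
    """Modified Dijkstra: maximize the bottleneck (minimum link
    bandwidth) along a path. Relaxation uses min() along the path and
    max() at each node, instead of the usual additive sums."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]
    while heap:
        width, u = heapq.heappop(heap)
        width = -width
        if u == dst:
            return width
        if width < best.get(u, 0):
            continue  # stale heap entry
        for v, bw in adj.get(u, []):
            w = min(width, bw)          # bottleneck so far
            if w > best.get(v, 0):
                best[v] = w
                heapq.heappush(heap, (-w, v))
    return 0

# Link capacities from the example topology (undirected).
edges = [("A","B",900), ("B","D",1000), ("A","C",800),
         ("C","D",500), ("B","C",300)]
adj = {}
for u, v, bw in edges:
    adj.setdefault(u, []).append((v, bw))
    adj.setdefault(v, []).append((u, bw))

assert widest_path(adj, "A", "D") == 900   # A -> B -> D
```

In the reservation engine this runs on the available (capacity minus reserved) bandwidths of a time window, not on raw capacities.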
Analogous Example
• A vehicle travels from city A to city B
• There are multiple cities between A and B connected by separate highways
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow to reach city B from city A as early as possible (earliest completion)
• Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
33
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey
Time steps
• Time steps between t1 and t13
[Timeline: Reservation 1 (t1-t6), Reservation 2 (t4-t7), Reservation 3 (t9-t12) partition the interval into time steps: Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), idle (t7-t9), Res 3 (t9-t12), idle (t12-t13)]
Max (2r+1) time steps, where r is the number of reservations
34
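The (2r+1) bound follows because each reservation contributes at most two breakpoints, its start and its end. A small sketch of deriving time steps from reservation intervals:

```python
def time_steps(reservations, horizon_start, horizon_end):
    """Break the search horizon into time steps at reservation start/end
    times. r reservations contribute at most 2r breakpoints, hence at
    most 2r + 1 steps between consecutive breakpoints."""
    points = {horizon_start, horizon_end}
    for start, end, _name in reservations:
        points.update((start, end))
    pts = sorted(p for p in points if horizon_start <= p <= horizon_end)
    return list(zip(pts, pts[1:]))

# Reservation intervals from the deck's running example.
res = [(1, 6, "Res1"), (4, 7, "Res2"), (9, 12, "Res3")]
steps = time_steps(res, 0, 13)
assert len(steps) <= 2 * len(res) + 1
```

Within a single step the set of active reservations is constant, so one static graph of available bandwidths describes the whole step.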
Static Graphs (available bandwidth per link in each time step):
G(ts1), t1-t4, Res 1 active: A-B 0, B-D 100, A-C 800, C-D 500, B-C 300 (Mbps)
G(ts2), t4-t6, Res 1,2 active: A-B 0, B-D 100, A-C 400, C-D 100, B-C 300 (Mbps)
G(ts3), t6-t7, Res 2 active: A-B 900, B-D 1000, A-C 400, C-D 100, B-C 300 (Mbps)
G(ts4), t7-t9, none active: A-B 900, B-D 1000, A-C 800, C-D 500, B-C 300 (Mbps)
35
Time Windows
Adjacent time steps combine into time windows by taking the per-link minimum (bottleneck constraint):
G(tw) = G(ts1) x G(ts2) for tw = ts1+ts2 (t1-t6, Res 1,2): A-B 0, B-D 100, A-C 400, C-D 100, B-C 300 (Mbps)
G(tw) = G(ts3) x G(ts4) for tw = ts3+ts4 (t6-t9, Res 2): A-B 900, B-D 1000, A-C 400, C-D 100, B-C 300 (Mbps)
Max (s x (s + 1))/2 time windows, where s is the number of time steps
36
Time Window List (special data structure)
Initial list: [now, infinite)
new reservation: reservation 1, start t1, end t10
-> [now, t1) [t1, t10: Res 1) [t10, infinite)
new reservation: reservation 2, start t12, end t20
-> [now, t1) [t1, t10: Res 1) [t10, t12) [t12, t20: Res 2) [t20, infinite)
37
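A sketch of this insertion logic: the list always covers [now, infinity), and each new reservation splits the segments it overlaps. This is an illustrative data structure, not the production implementation:

```python
import math

class TimeWindowList:
    """Sketch of the time-window list: a sorted list of
    (start, end, reservations) segments covering [now, infinity).
    Inserting a reservation splits the segments it overlaps."""
    def __init__(self, now=0):
        self.windows = [(now, math.inf, frozenset())]

    def add(self, name, start, end):
        out = []
        for (s, e, res) in self.windows:
            if e <= start or s >= end:      # no overlap: keep as-is
                out.append((s, e, res))
                continue
            if s < start:                   # piece before the reservation
                out.append((s, start, res))
            out.append((max(s, start), min(e, end), res | {name}))
            if e > end:                     # piece after the reservation
                out.append((end, e, res))
        self.windows = out

twl = TimeWindowList(now=0)
twl.add("Res1", 1, 10)
twl.add("Res2", 12, 20)
assert [(s, e) for s, e, _ in twl.windows] == \
    [(0, 1), (1, 10), (10, 12), (12, 20), (20, math.inf)]
```

Each segment then carries exactly the reservations active in it, which is what the window search iterates over.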
Careful software design makes the implementation fast and efficient
Performance
max-bandwidth path search ~ O(n^2), where n is the number of nodes in the topology graph. In the worst case, we may need to search all time windows, (s x (s + 1))/2, where s is the number of time steps. If there are r committed reservations in the search period, there can be at most 2r + 1 different time steps. Overall, the worst-case complexity is bounded by O(r^2 n^2). Note: r is relatively very small compared to the number of nodes n. 38
Example
Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
[Topology: link capacities A-B 900Mbps, B-D 1000Mbps, A-C 800Mbps, C-D 500Mbps, B-C 300Mbps; timeline t1-t13 with the three reservations as above]
Request from A to D (earliest completion): max bandwidth = 200Mbps, volume = 200Mbps x 4 time slots, earliest start = t1, latest finish = t13
39
Search Order - Time Windows
[Time steps: Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), idle (t7-t9), Res 3 (t9-t12), idle (t12-t13)]
Candidate time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9
Max bandwidth from A to D per window:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
40
Search Order - Time Windows: Shortest duration?
[Time steps: Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), idle (t7-t9), Res 3 (t9-t12), idle (t12-t13)]
Candidate time windows: t9-t12, t12-t13, t9-t13 (Res 3 active in t9-t12)
Max bandwidth from A to D per window:
1. 200Mbps (3)
2. 900Mbps (1)
3. 200Mbps (4)
Reservation: (A to D) (200Mbps) start=t9 end=t13
Ø from A to D, max bandwidth = 200Mbps, volume = 175Mbps x 4 time slots, earliest start = t1, latest finish = t13
earliest completion: (A to D) (100Mbps) start=t1 end=t8
shortest duration: (A to D) (200Mbps) start=t9 end=t12.5
41
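Once each candidate window's maximum available bandwidth is known, choosing between earliest completion and shortest duration reduces to a scan. A simplified sketch with this example's numbers (the windows and bandwidths are hard-coded here; the real engine derives them from the reservation graph):

```python
def plan(windows, volume, max_bw_cap):
    """Given candidate time windows as (start, end, max_available_bw)
    and a data volume, return the (bw, start, finish) reservations for
    earliest completion and for shortest duration."""
    earliest = shortest = None
    for start, end, bw in windows:
        bw = min(bw, max_bw_cap)        # request caps the usable rate
        if bw <= 0:
            continue
        duration = volume / bw
        if start + duration > end:
            continue                    # volume does not fit in window
        finish = start + duration
        if earliest is None or finish < earliest[2]:
            earliest = (bw, start, finish)
        if shortest is None or duration < shortest[2] - shortest[1]:
            shortest = (bw, start, finish)
    return earliest, shortest

# Slide example: volume = 175Mbps x 4 slots, request cap 200Mbps;
# usable windows: t1-t9 at 100Mbps and t9-t13 at 200Mbps.
windows = [(1, 9, 100), (9, 13, 200)]
e, s = plan(windows, 700, 200)
assert e == (100, 1, 8)      # earliest completion: t1-t8 at 100Mbps
assert s == (200, 9, 12.5)   # shortest duration: t9-t12.5 at 200Mbps
```

Both results match the slide: finishing earliest means starting immediately at the lower rate, while the shortest transfer waits for the wider window.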
Source > Network > Destination
[Topology as before, with end hosts n1 and n2 attached at the edges: A-B 900Mbps, B-D 1000Mbps, A-C 800Mbps, C-D 500Mbps, B-C 300Mbps]
Now we have multiple requests
42
With start/end times • Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
• If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
• Unsplittable Flow Problem:
• An undirected graph • Route demands from source(s) to destination(s) and maximize/minimize the total profit/cost
43
The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem)
Methodology • Displace other jobs to open space for the new request
• we can shift at most n jobs • Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives opportunity for co-allocation) • Gives a polynomial approximation algorithm
• The preference ordering converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest-deadline-first)
• Earliest completion + shortest duration • Minimize concurrency
• Even random ranking would work (relaxation in an NP-hard problem)
44
45
Recall Time Windows
[Time steps: Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), idle (t7-t9), Res 3 (t9-t12), idle (t12-t13)]
Candidate time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9
Max bandwidth from A to D per window: 1. 900Mbps (3), 2. 100Mbps (2), 3. 100Mbps (5), 4. 900Mbps (1), 5. 100Mbps (3), 6. 100Mbps (6), 7. 900Mbps (2), 8. 900Mbps (3), 9. 100Mbps (5), 10. 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
46
Test
47
In real life, the number of nodes and the number of reservations in a given search interval are limited. See the AINA'13 paper for results, plus a comparison with different preference metrics.
Autonomic Provisioning System
• Generate constraints automatically (without user input) • Volume (elephant flow?) • True deadline, if applicable • End-host resource availability • Burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring • Minimize operational cost
• Alternative to manual traffic engineering
What is the incentive to make correct reservations?
48
Data Center 1
Data Center 2
Data node B (web access)
Experimental facility A
* (1) Experimental facility A generates 30TB of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited
* (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1PB total size, when new data is available in data center 2
* (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months
Wide-area SDN
49
Example • The experimental facility periodically transfers data (i.e., every night) • Data replication happens occasionally, and it will take a week to move 1PB of data; it could get delayed a couple of hours with no harm
• Wide-area download traffic will increase gradually; most of the traffic will be during the day
• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (allocating some bandwidth to confirm that it finishes within a week, as usual)
50
Virtual Circuit Reservation Engine
Autonomic provisioning system
monitoring
Reservation Engine – Select optimal path/time/bandwidth
– Maximize the number of admitted requests – Increase overall system utilization and network efficiency
– Dynamically update the selected routing path for network efficiency – Modify existing reservations dynamically to open space/time for new requests
51
Performance Engineer?
• Sample projects:
• VSAN (Virtual SAN)
• VVOL (Virtual Volumes)
• Important aspects of performance engineering:
• Be a part of the initial development phase • Develop techniques to analyze performance problems
• Make sure performance issues are addressed correctly
52