Network-aware Data Management Middleware for High Throughput Flows --- AT&T Research Bedminster, NJ...
Network-aware Data Management Middleware for High Throughput Flows
March 16, 2015
Mehmet Balman http://balman.info Performance Engineer at VMware Inc. Guest Scientist at Berkeley Lab
1
About me:
Ø 2013: Performance, Central Engineering, VMware, Palo Alto, CA
Ø 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
Ø 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
v Computer Science, Louisiana State University (2010, 2008) v Bogazici University, Istanbul, Turkey (2006, 2000)
Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D. Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S. Parallel Tetrahedral Mesh Refinement, M.S.
2
Why Network-aware? Networking is one of the major components in many of today's solutions
• Distributed data and compute resources • Collaboration: data to be shared between remote sites • Data centers are complex network infrastructures
ü What further steps are necessary to take full advantage of future networking infrastructure?
ü How are we going to deal with performance problems? ü How can we enhance data management services and make them network-aware?
New collaborations between data management and networking communities.
3
Two major players: • Abstraction and Programmability
• Rapid development, intelligent services • Orchestrating compute, storage, and network resources together • Integration and deployment of complex workflows
• Virtualization (+containers) • Distributed storage (storage wars) • Open Source (if you can't fix it, you don't own it)
• Performance Gap: • Limitation is current system software vs. foreseen speed: • Hardware is fast, Software is slow
• Latency/throughput mismatch will lead to new innovations
4
Outline
• Data Streaming in High-bandwidth Networks • Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo • MemzNet: Memory-Mapped Network Zero-copy Channels • Core Affinity and End-System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling • FlexRes: A Flexible Network Reservation Algorithm • SchedSim: Online Scheduling with Advance Provisioning
• Performance Engineering and Virtualized Solutions
• Software Defined Storage
5
100Gbps networking has finally arrived!
Applications' Perspective: Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
1Gbps to 10Gbps transition (10 years ago):
Applications did not run 10 times faster just because more bandwidth was available
6
ANI 100Gbps Demo
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (Cosmology)
• Data movement of large datasets with many files (Climate analysis)
7
Earth System Grid Federation (ESGF)
8
• Over 2,700 sites • 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB • IPCC Fourth Assessment Report (AR4): 35TB
• Remote Data Analysis • Bulk Data Movement
Application’s Perspective: Climate Data Analysis
9
lots-of-small-files problem!
file-centric tools?
FTP (file-centric): request a file -> send file; request a file -> send file; ...
RPC: request data -> send data
• Keep the network pipe full
• We want out-of-order and asynchronous send/receive
10
Many Concurrent Streams
(a) total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, for a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, 64 concurrent streams per job were initiated for 5min intervals (e.g. when the concurrency level is 4, there are 40 streams in total).
11
ANI testbed 100Gbps (10x10 NICs, three hosts): Interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs - 5min intervals], TCP buffer size is 50M
Effects of many concurrent streams
12
Analysis of Core Affinities (NUMA Effect)
13 Nathan Hanford et al. NDM’13
Sandy Bridge Architecture
Receive process
14
Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al. NDM’14
100Gbps demo environment
RTT: Seattle - NERSC 16ms, NERSC - ANL 50ms, NERSC - ORNL 64ms
15
Framework for the Memory-mapped Network Channel
+ Synchronization mechanism for RoCE - Keep the pipe full for remote analysis
16
Moving climate files efficiently
17
Advantages • Decoupling I/O and network operations
• front-end (I/O processing) • back-end (networking layer)
• Not limited by the characteristics of the file sizes • On-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management: can increase/decrease the parallelism level both in the network communication and in I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants). MemzNet is not file-centric; bookkeeping information is embedded inside each block.
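Because bookkeeping information travels inside each block, blocks are self-describing and can arrive out of order. A minimal sketch of such block framing (the header layout here is hypothetical, not MemzNet's actual wire format):

```python
import struct

# Hypothetical block header: (file_id, offset, payload length) packed
# ahead of the payload, so each block is self-describing and the
# receiver needs no per-file session state.
HEADER = struct.Struct("!QQI")  # file_id (u64), offset (u64), length (u32)

def frame_block(file_id, offset, payload):
    """Prepend bookkeeping so blocks can be reassembled out of order."""
    return HEADER.pack(file_id, offset, len(payload)) + payload

def unframe_block(block):
    file_id, offset, length = HEADER.unpack_from(block)
    payload = block[HEADER.size:HEADER.size + length]
    return file_id, offset, payload

blk = frame_block(7, 4096, b"climate-data")
assert unframe_block(blk) == (7, 4096, b"climate-data")
```

With this framing, a receiver can write each payload at its stated file offset regardless of arrival order, which is what lets the sender keep the pipe full.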
18
MemzNet’s Architecture for data streaming
19
100Gbps Demo • CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB • Each block's data section was aligned according to the system pagesize
• 1GB cache both at the client and the server • At NERSC, 8 front-end threads on each host for reading data files in parallel
• At ANL/ORNL, 4 front-end threads for processing received data blocks
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection
20
83Gbps throughput
21
MemzNet’s Performance
TCP buffer size is set to 50MB
MemzNet GridFTP
100Gbps demo
ANI Testbed
22
Challenge? • High bandwidth brings new challenges!
• We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
• Fine-tuning, both in network and application layers, to take advantage of the higher network capacity
• Incremental improvement in current tools? • We cannot expect every application to tune and improve every time we change the link technology or speed
23
MemzNet • MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
• Main goal is to define a network channel so applications can directly use it without the burden of managing/tuning the network communication.
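To illustrate the intended usage model, here is a toy in-process stand-in for such a channel: I/O front-end threads deposit blocks, network back-end threads drain them, and either side's thread count can change without touching the other. All names are illustrative; this is not the MemzNet API.

```python
import queue
import threading

class MemoryChannel:
    """Toy stand-in for a MemzNet-style channel: a bounded block buffer
    decoupling producers (I/O front-end) from consumers (network
    back-end). Names are illustrative, not the actual MemzNet API."""
    def __init__(self, capacity_blocks=256):
        self.buf = queue.Queue(maxsize=capacity_blocks)

    def put_block(self, block):
        # Called by I/O front-end threads; blocks when the buffer is full,
        # which naturally throttles readers to the network rate.
        self.buf.put(block)

    def get_block(self):
        # Called by network back-end threads; blocks until data arrives.
        return self.buf.get()

ch = MemoryChannel()
received = []
t = threading.Thread(target=lambda: received.append(ch.get_block()))
t.start()
ch.put_block(b"block-0")
t.join()
assert received == [b"block-0"]
```

The application only sees `put_block`/`get_block`; how many threads sit on either side, and how the network is tuned, stays hidden behind the channel.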
24 Tech report: LBNL-‐6177E
MemzNet = New Execution Model • Luigi Rizzo's netmap
• proposes a new API to send/receive data over the network
• RDMA programming model • MemzNet as a memory-management component
• IX: Data Plane OS (Adam Belay et al. @ Stanford; similar to MemzNet's model)
• mTCP (event-driven / replaces send/receive in user level) • Tanenbaum et al.: minimizing context switches, proposing MONITOR/MWAIT for synchronization
25
Problem Domain: ESnet's OSCARS
26
[ESnet network map: peerings with international R&E networks (ASIA-PACIFIC: ASGC, KREONET2, TWAREN, KAREN, SINET, TRANSPAC, REANNZ, NUS-GP, ODN, BNP/HEPNET; AUSTRALIA: AARnet; LATIN AMERICA: CLARA, CUDI, AMPATH; CANADA: CANARIE; RUSSIA AND CHINA: GLORIAD; EUROPE: GÉANT, NORDUNET; FRANCE: OpenTransit; CERN/USLHCNet; LHCONE; US R&E: DREN, Internet2, NLR, NISN, NASA, USDOI) and DOE sites (PNNL, SLAC, AMES, PPPL, BNL, ORNL, JLAB, FNAL, ANL, LBNL) across hubs including Seattle, Sunnyvale, Sacramento, Boise, Denver, Albuquerque, El Paso, Houston, Kansas City, Chicago, Nashville, Atlanta, Washington DC, New York, Boston]
• Connecting experimental facilities and supercomputing centers
• On-Demand Secure Circuits and Advance Reservation System • Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager)
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time)
End-to-end Reservation: Storage+Network
Reservation Request • Between edge routers: need to ensure availability of the requested bandwidth from source to destination for the requested time interval
v R = { n_source, n_destination, M_bandwidth, t_start, t_end }
v source/destination end-points v requested bandwidth v start/end times
Committed reservations between t_start and t_end are examined. The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
27
Reservation
28
v Components (Graph): v node (router), port, link (connecting two ports) v engineering metric (~latency) v maximum bandwidth (capacity)
v Reservation: v source, destination, path, time v (time t1, t3) A -> B -> D (900Mbps) v (time t2, t3) A -> C -> D (400Mbps) v (time t4, t5) A -> B -> D (800Mbps)
[Topology: nodes A, B, C, D; link capacities A-B 900Mbps, B-D 1000Mbps, A-C 800Mbps, C-D 500Mbps, B-C 300Mbps]
Example
(time t1, t2): A to D (600Mbps)? NO. A to D (500Mbps)? YES.
Per-link state during the interval, shown as available / reserved (capacity):
A-B: 0 / 900 (900Mbps); B-D: 100 / 900 (1000Mbps); A-C: 800 / 0 (800Mbps); C-D: 500 / 0 (500Mbps); B-C: 300 / 0 (300Mbps)
Active reservations:
reservation 1: (time t1, t3) A -> B -> D (900Mbps)
reservation 2: (time t1, t3) A -> C -> D (400Mbps)
reservation 3: (time t4, t5) A -> B -> D (800Mbps)
available / reserved (capacity)
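The slide's yes/no answers can be reproduced with a brute-force check: subtract committed bandwidth from each link, then take the best bottleneck over all simple paths. A sketch using the example's numbers (OSCARS computes an actual path; this only checks feasibility):

```python
from itertools import permutations

# Link capacities and committed bandwidth during the queried interval
# (slide example: reservation 1, A->B->D at 900Mbps, is active).
capacity = {("A","B"): 900, ("B","D"): 1000, ("A","C"): 800,
            ("C","D"): 500, ("B","C"): 300}
reserved = {("A","B"): 900, ("B","D"): 900, ("A","C"): 0,
            ("C","D"): 0, ("B","C"): 0}

def available(u, v):
    e = (u, v) if (u, v) in capacity else (v, u)
    return capacity[e] - reserved[e]

def max_path_bandwidth(src, dst, nodes=("A","B","C","D")):
    """Best single-path (unsplittable) bandwidth: max over simple paths
    of the bottleneck, i.e. the minimum available link bandwidth."""
    inner = [n for n in nodes if n not in (src, dst)]
    best = 0
    for k in range(len(inner) + 1):
        for mid in permutations(inner, k):
            path = (src,) + mid + (dst,)
            edges = list(zip(path, path[1:]))
            if all((e in capacity or e[::-1] in capacity) for e in edges):
                best = max(best, min(available(u, v) for u, v in edges))
    return best

assert max_path_bandwidth("A", "D") == 500   # via A-C-D
# so a 600Mbps request is rejected and a 500Mbps request is granted
```

Brute-force enumeration is only for illustration on this 4-node graph; the deck's later slides replace it with a modified Dijkstra search.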
29
Example
(time t1, t3): A to D (500Mbps)? NO. A to C (500Mbps)? NO (single-path reservations, not max-flow!)
Per-link state during (t1, t3), shown as available / reserved (capacity):
A-B: 0 / 900 (900Mbps); B-D: 100 / 900 (1000Mbps); A-C: 400 / 400 (800Mbps); C-D: 100 / 400 (500Mbps); B-C: 300 / 0 (300Mbps)
Active reservations:
reservation 1: (time t1, t3) A -> B -> D (900Mbps)
reservation 2: (time t1, t3) A -> C -> D (400Mbps)
reservation 3: (time t4, t5) A -> B -> D (800Mbps)
available / reserved (capacity)
30
Alternative Approach: Flexible Reservations
• If the requested bandwidth cannot be guaranteed: • trial-and-error until an available reservation is found • the client is not given other possible options
• How can we enhance the OSCARS reservation system? • Be flexible:
• Submit constraints, and the system suggests possible reservation options satisfying the given requirements
31
R's = { n_source, n_destination, M_MAXbandwidth, D_dataSize, t_EarliestStart, t_LatestEnd }. The reservation engine finds the reservation
R = { n_source, n_destination, M_bandwidth, t_start, t_end } for the earliest completion or for the shortest duration, where M_bandwidth <= M_MAXbandwidth and t_EarliestStart <= t_start < t_end <= t_LatestEnd.
Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (max available bandwidth):
• Bottleneck constraint (not additive)
• (QoS constraints are additive in shortest path, etc.)
32 The maximum bandwidth available for allocation from a source node to a destination node
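A sketch of the modified Dijkstra search: because the bottleneck metric is not additive, relaxation takes the minimum along the path and keeps the maximum per node, instead of summing edge weights. Illustrative code, using the example topology's link capacities:

```python
import heapq

def widest_path(adj, src, dst):
    """Modified Dijkstra: maximize the bottleneck (minimum link
    bandwidth) along a path. Relaxation uses min() along the path and
    max() at each node, instead of the usual additive sums."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]
    while heap:
        width, u = heapq.heappop(heap)
        width = -width
        if u == dst:
            return width
        if width < best.get(u, 0):
            continue  # stale heap entry
        for v, bw in adj.get(u, []):
            w = min(width, bw)          # bottleneck so far
            if w > best.get(v, 0):
                best[v] = w
                heapq.heappush(heap, (-w, v))
    return 0

# Link capacities from the example topology (undirected).
edges = [("A","B",900), ("B","D",1000), ("A","C",800),
         ("C","D",500), ("B","C",300)]
adj = {}
for u, v, bw in edges:
    adj.setdefault(u, []).append((v, bw))
    adj.setdefault(v, []).append((u, bw))

assert widest_path(adj, "A", "D") == 900   # A -> B -> D
```

In the reservation engine this runs on the available (capacity minus reserved) bandwidths of a time window, not on raw capacities.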
Analogous Example
• A vehicle travels from city A to city B
• There are multiple cities between A and B connected by separate highways
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow to reach city B from city A as early as possible (earliest completion)
• Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
33
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey
Time steps
• Time steps between t1 and t13
[Timeline: Reservation 1 (t1-t6), Reservation 2 (t4-t7), Reservation 3 (t9-t12) partition the interval into time steps: Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), idle (t7-t9), Res 3 (t9-t12), idle (t12-t13)]
Max (2r+1) time steps, where r is the number of reservations
34
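The (2r+1) bound follows because each reservation contributes at most two breakpoints, its start and its end. A small sketch of deriving time steps from reservation intervals:

```python
def time_steps(reservations, horizon_start, horizon_end):
    """Break the search horizon into time steps at reservation start/end
    times. r reservations contribute at most 2r breakpoints, hence at
    most 2r + 1 steps between consecutive breakpoints."""
    points = {horizon_start, horizon_end}
    for start, end, _name in reservations:
        points.update((start, end))
    pts = sorted(p for p in points if horizon_start <= p <= horizon_end)
    return list(zip(pts, pts[1:]))

# Reservation intervals from the deck's running example.
res = [(1, 6, "Res1"), (4, 7, "Res2"), (9, 12, "Res3")]
steps = time_steps(res, 0, 13)
assert len(steps) <= 2 * len(res) + 1
```

Within a single step the set of active reservations is constant, so one static graph of available bandwidths describes the whole step.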
Static Graphs (available bandwidth per link in each time step):
G(ts1), t1-t4, Res 1 active: A-B 0, B-D 100, A-C 800, C-D 500, B-C 300 (Mbps)
G(ts2), t4-t6, Res 1,2 active: A-B 0, B-D 100, A-C 400, C-D 100, B-C 300 (Mbps)
G(ts3), t6-t7, Res 2 active: A-B 900, B-D 1000, A-C 400, C-D 100, B-C 300 (Mbps)
G(ts4), t7-t9, none active: A-B 900, B-D 1000, A-C 800, C-D 500, B-C 300 (Mbps)
35
Time Windows
Adjacent time steps combine into time windows by taking the per-link minimum (bottleneck constraint):
G(tw) = G(ts1) x G(ts2) for tw = ts1+ts2 (t1-t6, Res 1,2): A-B 0, B-D 100, A-C 400, C-D 100, B-C 300 (Mbps)
G(tw) = G(ts3) x G(ts4) for tw = ts3+ts4 (t6-t9, Res 2): A-B 900, B-D 1000, A-C 400, C-D 100, B-C 300 (Mbps)
Max (s x (s + 1))/2 time windows, where s is the number of time steps
36
Time Window List (special data structure)
Initial list: [now, infinite)
new reservation: reservation 1, start t1, end t10
-> [now, t1) [t1, t10: Res 1) [t10, infinite)
new reservation: reservation 2, start t12, end t20
-> [now, t1) [t1, t10: Res 1) [t10, t12) [t12, t20: Res 2) [t20, infinite)
37
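A sketch of this insertion logic: the list always covers [now, infinity), and each new reservation splits the segments it overlaps. This is an illustrative data structure, not the production implementation:

```python
import math

class TimeWindowList:
    """Sketch of the time-window list: a sorted list of
    (start, end, reservations) segments covering [now, infinity).
    Inserting a reservation splits the segments it overlaps."""
    def __init__(self, now=0):
        self.windows = [(now, math.inf, frozenset())]

    def add(self, name, start, end):
        out = []
        for (s, e, res) in self.windows:
            if e <= start or s >= end:      # no overlap: keep as-is
                out.append((s, e, res))
                continue
            if s < start:                   # piece before the reservation
                out.append((s, start, res))
            out.append((max(s, start), min(e, end), res | {name}))
            if e > end:                     # piece after the reservation
                out.append((end, e, res))
        self.windows = out

twl = TimeWindowList(now=0)
twl.add("Res1", 1, 10)
twl.add("Res2", 12, 20)
assert [(s, e) for s, e, _ in twl.windows] == \
    [(0, 1), (1, 10), (10, 12), (12, 20), (20, math.inf)]
```

Each segment then carries exactly the reservations active in it, which is what the window search iterates over.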
Careful software design makes the implementation fast and efficient
Performance
max-bandwidth path search ~ O(n^2), where n is the number of nodes in the topology graph. In the worst case, we may need to search all time windows, (s x (s + 1))/2, where s is the number of time steps. If there are r committed reservations in the search period, there can be at most 2r + 1 different time steps. Overall, the worst-case complexity is bounded by O(r^2 n^2). Note: r is relatively very small compared to the number of nodes n. 38
Example
Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
[Topology: link capacities A-B 900Mbps, B-D 1000Mbps, A-C 800Mbps, C-D 500Mbps, B-C 300Mbps; timeline t1-t13 with the three reservations as above]
Request from A to D (earliest completion): max bandwidth = 200Mbps, volume = 200Mbps x 4 time slots, earliest start = t1, latest finish = t13
39
Search Order - Time Windows
[Time steps: Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), idle (t7-t9), Res 3 (t9-t12), idle (t12-t13)]
Candidate time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9
Max bandwidth from A to D per window:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
40
Search Order - Time Windows: Shortest duration?
[Time steps: Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), idle (t7-t9), Res 3 (t9-t12), idle (t12-t13)]
Candidate time windows: t9-t12, t12-t13, t9-t13 (Res 3 active in t9-t12)
Max bandwidth from A to D per window:
1. 200Mbps (3)
2. 900Mbps (1)
3. 200Mbps (4)
Reservation: (A to D) (200Mbps) start=t9 end=t13
Ø from A to D, max bandwidth = 200Mbps, volume = 175Mbps x 4 time slots, earliest start = t1, latest finish = t13
earliest completion: (A to D) (100Mbps) start=t1 end=t8
shortest duration: (A to D) (200Mbps) start=t9 end=t12.5
41
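Once each candidate window's maximum available bandwidth is known, choosing between earliest completion and shortest duration reduces to a scan. A simplified sketch with this example's numbers (the windows and bandwidths are hard-coded here; the real engine derives them from the reservation graph):

```python
def plan(windows, volume, max_bw_cap):
    """Given candidate time windows as (start, end, max_available_bw)
    and a data volume, return the (bw, start, finish) reservations for
    earliest completion and for shortest duration."""
    earliest = shortest = None
    for start, end, bw in windows:
        bw = min(bw, max_bw_cap)        # request caps the usable rate
        if bw <= 0:
            continue
        duration = volume / bw
        if start + duration > end:
            continue                    # volume does not fit in window
        finish = start + duration
        if earliest is None or finish < earliest[2]:
            earliest = (bw, start, finish)
        if shortest is None or duration < shortest[2] - shortest[1]:
            shortest = (bw, start, finish)
    return earliest, shortest

# Slide example: volume = 175Mbps x 4 slots, request cap 200Mbps;
# usable windows: t1-t9 at 100Mbps and t9-t13 at 200Mbps.
windows = [(1, 9, 100), (9, 13, 200)]
e, s = plan(windows, 700, 200)
assert e == (100, 1, 8)      # earliest completion: t1-t8 at 100Mbps
assert s == (200, 9, 12.5)   # shortest duration: t9-t12.5 at 200Mbps
```

Both results match the slide: finishing earliest means starting immediately at the lower rate, while the shortest transfer waits for the wider window.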
Source > Network > Destination
[Topology as before, with end hosts n1 and n2 attached at the edges: A-B 900Mbps, B-D 1000Mbps, A-C 800Mbps, C-D 500Mbps, B-C 300Mbps]
Now we have multiple requests
42
With start/end times • Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
• If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
• Unsplittable Flow Problem:
• An undirected graph • Route demands from source(s) to destination(s) and maximize/minimize the total profit/cost
43
The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem)
Methodology • Displace other jobs to open space for the new request
• we can shift at most n jobs • Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives opportunity for co-allocation) • Gives a polynomial approximation algorithm
• The preference ordering converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest-deadline-first)
• Earliest completion + shortest duration • Minimize concurrency
• Even random ranking would work (relaxation in an NP-hard problem)
44
45
Recall Time Windows
[Time steps: Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), idle (t7-t9), Res 3 (t9-t12), idle (t12-t13)]
Candidate time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9
Max bandwidth from A to D per window: 1. 900Mbps (3), 2. 100Mbps (2), 3. 100Mbps (5), 4. 900Mbps (1), 5. 100Mbps (3), 6. 100Mbps (6), 7. 900Mbps (2), 8. 900Mbps (3), 9. 100Mbps (5), 10. 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
46
Test
47
In real life, the number of nodes and the number of reservations in a given search interval are limited. See the AINA'13 paper for results, plus a comparison with different preference metrics.
Autonomic Provisioning System
• Generate constraints automatically (without user input) • Volume (elephant flow?) • True deadline, if applicable • End-host resource availability • Burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring • Minimize operational cost
• Alternative to manual traffic engineering
What is the incentive to make correct reservations?
48
Data Center 1
Data Center 2
Data node B (web access)
Experimental facility A
* (1) Experimental facility A generates 30TB of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited
* (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1PB total size, when new data is available in data center 2
* (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months
Wide-area SDN
49
Example • The experimental facility periodically transfers data (i.e., every night) • Data replication happens occasionally, and it will take a week to move 1PB of data; it could get delayed a couple of hours with no harm
• Wide-area download traffic will increase gradually; most of the traffic will be during the day
• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (allocating some bandwidth to confirm that it finishes within a week, as usual)
50
Virtual Circuit Reservation Engine
Autonomic provisioning system
monitoring
Reservation Engine – Select optimal path/time/bandwidth
– Maximize the number of admitted requests – Increase overall system utilization and network efficiency
– Dynamically update the selected routing path for network efficiency – Modify existing reservations dynamically to open space/time for new requests
51
Performance Engineer?
• Sample projects:
• VSAN (Virtual SAN)
• VVOL (Virtual Volumes)
• Important aspects of performance engineering:
• Be a part of the initial development phase • Develop techniques to analyze performance problems
• Make sure performance issues are addressed correctly
52