Efficient Long Distance End to End Throughput from the Campus
Over 100G Networks
Successes and Challenges to, from and within the Campus
Infrastructure Environment
Harvey B Newman and Artur Barczyk, California Institute of Technology
NSF CC-NIE Meeting, Washington DC, April 30, 2014
SC06 BWC: Fast Data Transfer (FDT), http://monalisa.cern.ch/FDT
An easy-to-use open source Java application that runs on all major platforms
Uses an asynchronous multithreaded system to achieve smooth, linear data flow:
Streams a dataset (list of files) continuously through an open TCP socket
No protocol start/stops between files
Sends buffers at a rate matched to the monitored capability of the end-to-end path
Uses independent threads to read and write on each physical device
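As a rough illustration of the pattern described above (not the actual FDT code, which is Java), a reader thread and a writer loop connected by a bounded buffer queue can stream a list of files through one open socket with no per-file restarts:

```python
import queue
import threading

BUF_SIZE = 64 * 1024  # fixed-size buffers, pooled in the real tool

def stream_files(files, sock, buf_size=BUF_SIZE):
    """Stream a list of file-like objects through one open socket.

    No per-file protocol start/stop: a reader thread fills a bounded
    queue of buffers, and the caller's thread drains it to the socket.
    The bounded queue applies back-pressure when the network is slower
    than the disks. Returns the number of bytes sent.
    """
    q = queue.Queue(maxsize=16)  # bounded: rate-matches reader to writer

    def reader():
        for f in files:
            while True:
                chunk = f.read(buf_size)
                if not chunk:
                    break  # next file, same socket, no restart
                q.put(chunk)
        q.put(None)  # sentinel: end of the whole dataset

    t = threading.Thread(target=reader)
    t.start()
    sent = 0
    while True:
        chunk = q.get()
        if chunk is None:
            break
        sock.sendall(chunk)
        sent += len(chunk)
    t.join()
    return sent
```

In the real application one such reader/writer pair runs per physical device, which is what keeps each disk streaming sequentially.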
SC06 BWC: Stable disk-to-disk flows Tampa-Caltech: 10-to-10 and 8-to-8 1U server-pairs for 9 + 7 = 16 Gbps; then solid overnight, using one 10G link
17.77 Gbps BWC peak; + 8.6 Gbps to and from Korea
By SC07: ~70-100 Gbps per rack of low-cost 1U servers (I. Legrand)
Forward to 2014: Long-distance Wide Area 100G Data Transfers
Caltech SC’13 Demo: Solid 99-100G throughput on one 100G wave; up to 325G WAN traffic
BUT: using 100G infrastructure efficiently for production revealed several challenges, mostly end-system related:
Need IRQ affinity tuning
Multi-core support (multi-threaded applications)
Storage controller limitations, mainly in the SW driver
CPU-controller-NIC flow control?
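IRQ affinity tuning of the kind listed above is typically done on Linux by writing CPU bitmasks into /proc/irq/&lt;n&gt;/smp_affinity, spreading a NIC's queue interrupts across the cores local to the NIC. A minimal sketch; the IRQ numbers and CPU layout in the example are illustrative assumptions, and writing the files requires root:

```python
def cpu_affinity_mask(cpus):
    """Hex bitmask string for /proc/irq/<n>/smp_affinity.

    Each bit selects one CPU, e.g. CPUs [0, 2] -> bits 0 and 2 -> '5'.
    """
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "x")

def pin_irq(irq, cpus, proc_root="/proc"):
    """Pin one interrupt to a set of CPUs (root required on a real host)."""
    with open(f"{proc_root}/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_affinity_mask(cpus))

# Illustrative use: spread 8 NIC queue IRQs (hypothetical numbers 64..71)
# one-per-core across the NUMA node assumed local to the NIC (cores 0-7):
# for i, irq in enumerate(range(64, 72)):
#     pin_irq(irq, [i])
```

On a production host the irqbalance daemon usually has to be disabled first, or it will overwrite these assignments.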
It is increasingly easy to saturate 100G infrastructure with well-prepared demonstration equipment, using the aggregate traffic of several hosts.
70-74 Gbps Caltech – Internet2 – ANA100 – CERN
Note: single server, multiple TCP streams, using the FDT tool
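As a toy illustration of the multi-stream approach (the real FDT stream handling is more elaborate), the byte range of a transfer can be partitioned into one contiguous slice per parallel TCP stream:

```python
def stream_offsets(size, n_streams):
    """Split a transfer of `size` bytes into (offset, length) ranges,
    one contiguous range per parallel TCP stream.

    The remainder is spread over the first streams so the ranges differ
    in length by at most one byte.
    """
    base, rem = divmod(size, n_streams)
    ranges, off = [], 0
    for i in range(n_streams):
        length = base + (1 if i < rem else 0)
        ranges.append((off, length))
        off += length
    return ranges
```

Each range would then be sent on its own socket, letting several TCP congestion windows grow in parallel on a long-RTT path.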
Network Path Layout: Caltech (CHOPIN: CC-NIE) – CENIC – Internet2 – ANA100 – Amsterdam (SURFnet) – CERN (US LHCNet)
[Diagram: Caltech (Pasadena) servers Sandy01, Sandy03, Echo-6 and Echo-7 connect over 2 x 40G and 3 x 40G links to a Brocade 100GE switch and a Cisco 15454, through the CHOPIN 100G campus infrastructure and CENIC to Internet2 AL2S, then over ANA-100 to Caltech @ CERN (Cisco 15454). Credit: Azher Mughal, Ramiro Voicu]
CHOPIN: 100G Advanced Networking + Science
Driver targets for 2014-15: LIGO Scientific Collaboration; astro sky surveys; VOs; geodetic + seismic networks; genomics (on-chip gene sequencing)
100G TCP Tests CERN-Caltech This Week (Cont’d)
Peaks of ~83 Gbps on some AL2S segments
You need strong and willing partners: Caltech Campus, CENIC, Internet2, ANA-100, SURFnet, CERN + engagement
100G TCP Tests CERN-Caltech: An Issue
Server 1: ~58 Gbps; Server 2: only ~12 Gbps
• Server 2: newer generation (E5-2690 v2 Ivy Bridge); same chassis as Server 1; issue with the newer CPUs and the Mellanox 40GE NICs; engaged with the vendors (Mellanox, Intel, LSI)
Expect further improvements once this issue is resolved.
Lessons learned: Need a strong team with the right talents, a systems approach, and especially strong partnerships: regional, national, global; manufacturers
CHOPIN Network Layout (CC-NIE Grant)
100GE backbone capacity, operational
External connectivity to major carriers including CENIC, ESnet, Internet2 and PacWave
LIGO and IPAC are in the process of joining, using 10G and 40G links
CHOPIN WAN Connections
External connectivity to CENIC, ESnet, Internet2 and PacWave
Able to create Layer 2 paths using the Internet2 OESS portal over the AL2S US footprint
Dynamic circuits through Internet2 ION over the 100GE path

CHOPIN – CMS Tier2 Integration
Caltech CMS fully integrated with the 100GE backbone
IP peering with Internet2 and UFL at 100GE … ready for the next LHC run
Current peaks are around 8 Gbps
Key Issue and Approach to a Solution: Next Generation System for Data Intensive Research
Present solutions will not scale. We need:
An agile architecture exploiting globally distributed grids, clouds, specialized (e.g. GPU) and opportunistic resources
A services system that provisions it all, moves the data flexibly and dynamically, and behaves coherently
Examples do exist, with smaller but still very large scope
A pervasive, autonomous agent architecture that deals with and reduces complexity
Requires talented system developers with a deep appreciation of networks, ready to exploit the potential of new paradigms such as SDN and NDN
[MonALISA screenshots: grid job lifelines; grid network topology; automated transfers on dynamic networks; real-time grid monitoring, topology and ops control; the ALICE worldwide grid]
Advanced Network Services for Experiments (ANSE)
Problems Encountered; Solutions Discovered and Further Challenges Uncovered
Harvey B Newman and Artur Barczyk
California Institute of Technology
NSF CC-NIE Meeting, Washington DC, April 30, 2014
ANSE: Advanced Network Services for (LHC) Experiments
NSF CC-NIE Funded, 4 US Institutes: Caltech, Vanderbilt, Michigan, UT Arlington. A US ATLAS / US CMS Collaboration
Goal: provide more efficient, deterministic workflows
Method: Interface advanced network services, including dynamic circuits, with the LHC data management systems: PanDA in (US) ATLAS, PhEDEx in (US) CMS
Includes leading personnel for the data production systems: Kaushik De (PanDA Lead), Tony Wildish (PhEDEx Lead)
Performance measurements with PhEDEx and FDT for CMS
FDT sustained rates: ~1500 MB/sec; average over 24 hrs: ~1360 MB/sec
Difference due to delay in starting jobs; bumpy plot due to binning and job size
[Plots: 24-hour throughput reported by PhEDEx, and throughput as reported by MonALISA, both shown as 1h moving averages]
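The 1h moving averages quoted above can be computed from raw (timestamp, value) throughput samples with a simple trailing window; this is a generic sketch, not the MonALISA or PhEDEx code:

```python
from collections import deque

def moving_average(samples, window):
    """Trailing moving average over (timestamp, value) samples.

    Timestamps are in seconds and must be non-decreasing; `window` is
    the averaging interval in seconds (3600 for a 1h moving average).
    Returns a list of (timestamp, average-over-window) pairs.
    """
    buf = deque()
    out = []
    total = 0.0
    for t, v in samples:
        buf.append((t, v))
        total += v
        # Drop samples that have fallen out of the trailing window.
        while buf and buf[0][0] <= t - window:
            total -= buf.popleft()[1]
        out.append((t, total / len(buf)))
    return out
```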
PhEDEx testbed in ANSE
T2_ANSE_Geneva & T2_ANSE_Amsterdam
• High-capacity link with dynamic circuit creation between storage nodes
• PhEDEx and storage nodes separate
• 4x4 SSD RAID 0 arrays, 16 physical CPU cores per machine
PhEDEx throughput on a shared path
(with 5 Gbps of UDP cross traffic)
Seamless switchover; no interruption of service
CMS: PhEDEx and Dynamic Circuits
[Diagram: T2_ANSE_Amsterdam (sandy01-ams, wood1-ams) and T2_ANSE_Geneva (sandy01-gva, hermes2), connected by both a high-speed WAN circuit and a shared path]
Latest efforts: integrating circuit awareness into the FileDownload agent:
• Prototype is backend agnostic; no modifications to the PhEDEx DB
• All control logic is in the FileDownload agent
• Transparent for all other PhEDEx instances
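A minimal sketch of what "backend agnostic" can mean in such an agent: the download logic talks to a small circuit interface, so swapping the provisioning system (NSI, OSCARS, ...) only swaps that object. All class and method names here are illustrative, not the actual PhEDEx prototype:

```python
class CircuitBackend:
    """Minimal interface any circuit provisioning backend must implement."""
    def request(self, src, dst, bandwidth_gbps):
        raise NotImplementedError
    def teardown(self, circuit_id):
        raise NotImplementedError

class FakeBackend(CircuitBackend):
    """Stand-in backend for exercising the agent logic without a network."""
    def __init__(self):
        self.active = {}
        self._next = 0
    def request(self, src, dst, bandwidth_gbps):
        self._next += 1
        self.active[self._next] = (src, dst, bandwidth_gbps)
        return self._next
    def teardown(self, circuit_id):
        del self.active[circuit_id]

class FileDownloadAgent:
    """All circuit control logic lives here: request a circuit before a
    transfer batch, tear it down after, and fall back to the shared
    routed path if provisioning fails."""
    def __init__(self, backend):
        self.backend = backend
    def transfer(self, src, dst, files, bandwidth_gbps=10):
        try:
            cid = self.backend.request(src, dst, bandwidth_gbps)
        except Exception:
            cid = None  # fall back to the shared path
        moved = list(files)  # placeholder for the real transfer loop
        if cid is not None:
            self.backend.teardown(cid)
        return moved
```

Keeping the interface this narrow is what makes the agent transparent to other instances: nothing outside it needs to know whether a circuit was used.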
PhEDEx throughput on a dedicated path
[Plot: Testing circuit integration into the Download agent; PhEDEx transfer rates (in MB/sec), 1h moving average]
Using dynamic circuits in PhEDEx allows for more deterministic workflows, useful for co-scheduling CPU with data movement
Vlad Lapadatescu, Tony Wildish
25M Jobs at > 100 Sites Now Completed Each Month
6X Growth in 3 Years (2010-13):
Production and Distributed Analysis
Kaushik De
STEP 1: Import network information into PanDA
STEP 2: Use network information directly to optimize workflow for data transfer/access, at a higher level than individual transfers alone
Start with simple use cases leading to measurable improvements in workflow/user experience
A New Plateau
1. Faster user analysis: Analysis jobs normally go to sites with local data; this sometimes leads to long wait times due to queuing. Could use network information to assign work to ‘nearby’ sites with idle CPUs and good connectivity.
2. Optimal cloud selection: Tier2s are connected to Tier1 “Clouds” manually by the ops team (and may be attached to multiple Tier1s). To be automated using network information; algorithm under test.
3. PD2P = PanDA Dynamic Data Placement: Asynchronous and usage-based. Repeated use of data, or a backlog in processing, triggers additional copies; rebrokerage of queues follows the new data locations. PD2P is a natural fit for network integration: use the network for strategic replication + site selection (to be tested soon), and try SDN provisioning, since this usually involves large datasets.
USE CASES Kaushik De
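Use case 1 above can be sketched as a simple brokering rule: run at the data site if it has idle CPUs, otherwise pick the candidate site with the best measured connectivity from the data site. The site names, thresholds and throughput numbers below are invented for illustration:

```python
def pick_site(data_site, sites, throughput_mbps, min_idle_cpus=1):
    """Broker a job using network information.

    `sites` maps site name -> {"idle_cpus": int}; `throughput_mbps`
    maps (src, dst) pairs -> measured throughput. Prefer the site
    holding the data; otherwise send the job to the 'nearby' site with
    the best measured throughput from the data site and idle CPUs.
    """
    if sites[data_site]["idle_cpus"] >= min_idle_cpus:
        return data_site
    candidates = [
        s for s in sites
        if s != data_site and sites[s]["idle_cpus"] >= min_idle_cpus
    ]
    if not candidates:
        return data_site  # queue locally as a last resort
    return max(candidates, key=lambda s: throughput_mbps[(data_site, s)])
```

A real broker would fold in queue depths and data transfer cost as well; the point is only that a throughput measurement, once imported, can drive the site choice directly.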
DYNES (NSF MRI-R2): Dynamic Circuits Nationwide, led by Internet2 with Caltech
DYNES is extending circuit capabilities to ~50 US campuses; this turns out to be nontrivial
Will be an integral part of the point-to-point service in LHCONE
Partners: I2, Caltech, Michigan, Vanderbilt. Working with I2 and ESnet on dynamic circuit software issues
http://internet2.edu/dynes
Extended the OSCARS scope; transition: DRAGON to PSS, OESS
Challenges Encountered
perfSONAR deployment status: For meaningful results, we need most LHC computing sites equipped with perfSONAR nodes. This is work in progress.
Easy-to-use perfSONAR API: This was missing, but a REST API has been made available recently.
Inter-domain dynamic circuits: Intra-domain systems have been in production for some time; e.g., ESnet has run OSCARS as a production tool for several years, and OESS (OpenFlow-based) is also in production, single domain.
Inter-domain circuit provisioning continues to be hard. Implementations are fragile; error recovery tends to require manual intervention.
A holistic approach is needed: pervasive monitoring + tracking of configuration state changes; intelligent clean-up and timeout handling.
The NSI framework needs faster standardization, adoption and implementation among the major networks, or a future SDN-based solution: for example OpenFlow and OpenDaylight.
Some of the DYNES Challenges Encountered; Approaches to a Solution
Some of the issues encountered in both the control and data planes came from immaturity of the implementations at the time:
Failed requests left configuration on switches, causing subsequent failures
Too long a time to get failure notifications, which blocks serialized requests
Error messages often erratic; hard to find the root cause of a problem
End-to-end VLAN translation not always resulting in a functional data plane
Static data plane configuration needs changes upon upgrades
Grid certificate validity (1 year) across 40+ sites led to frequent expiration issues (not DYNES-specific!)
Solution: We use Nagios to monitor certificate states at all DYNES sites, generating early warnings to the local administrators. An alternate solution would be to create a DYNES CA and administer certificates in a coordinated way; this requires a responsible party.
DYNES path forward:
Working with a selected subset of sites on getting automated tests failure-free
Taking input from these sites to propagate changes to the others, and/or deploy NSI
If funding allows (future proposal): an SDN-based multidomain solution
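The Nagios-style early warning described above boils down to flagging certificates whose notAfter date falls inside a warning window. A generic sketch (extracting the dates from the actual certificates, e.g. via openssl, is out of scope here; the 30-day window is an assumed default):

```python
from datetime import datetime, timedelta

def cert_warnings(not_after, now, warn_days=30):
    """Return site -> days-remaining for certificates expiring soon.

    `not_after` maps site name -> certificate notAfter datetime; a site
    is flagged when its certificate expires within `warn_days` of `now`
    (already-expired certificates get a negative days count).
    """
    window = timedelta(days=warn_days)
    alerts = {}
    for site, expiry in not_after.items():
        remaining = expiry - now
        if remaining <= window:
            alerts[site] = remaining.days
    return alerts
```

A Nagios plugin wrapping this would exit WARNING when the dict is non-empty and CRITICAL when any value is negative, which is how the early warnings reach the local administrators.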