Experiences and results from implementing the QBone Scavenger


Transcript of Experiences and results from implementing the QBone Scavenger

Page 1: Experiences and results from implementing the QBone Scavenger


Experiences and results from implementing the QBone Scavenger

Les Cottrell – SLAC
Presented at the CENIC meeting, San Diego, May 2002

www.slac.stanford.edu/grp/scs/talk/cenic-may02.html

Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP

Page 2: Experiences and results from implementing the QBone Scavenger

Outline

• Needs for High Energy & Nuclear Physics (HENP)

• Why we need scavenger service

• What is scavenger service

• How is it used

• Results of tests with 10 Mbps, 100 Mbps and 2 Gbps bottlenecks

• How we could use it all

Page 3: Experiences and results from implementing the QBone Scavenger

HENP Experiment Model

• World wide collaborations necessary for large undertakings

• Regional computer centers in France, Italy, UK & US
  – Spending Euros on a data center at SLAC not attractive
  – Leverage local equipment & expertise

• Resources available to all collaborators

• Requirements – bulk (60% of SLAC traffic):
  – Bulk data replication (current goal > 100 MBytes/s)
  – Optimized cached read access to 10-100 GB from a 1 PB data set

Page 4: Experiences and results from implementing the QBone Scavenger

Data requirements for HEP

• HEP accelerator experiments generate 10's to 100's of MBytes/s of raw data (100 MBytes/s == 3.6 TB/hr)
  – Already heavily filtered in trigger hardware/software to choose only "potentially" interesting events
  – Data rate limited by the ability to record and use the data

• Data is analyzed to reconstruct tracks, events etc. from the electronics signal data
  – Requires computing resources at several (tier 1) sites worldwide; for BaBar this includes France, UK & Italy
  – Data has to be sent to the sites, and reconstructions have to be shared

• Reconstructed data is summarized into an object-oriented database providing parameters of the events

• Summarized data is analyzed by physicists around the world looking for physics and equipment understanding: thousands of physicists in hundreds of institutions in tens of countries

• In addition, Monte Carlo methods are used to create simulated events for comparison with real events
  – Also very CPU intensive, so done at multiple sites such as LBNL, LLNL and Caltech, with results shared with other sites

Page 5: Experiences and results from implementing the QBone Scavenger

HENP Data Grid Hierarchy

[Diagram: the multi-tier HENP data grid, with approximate capacities and link speeds as labelled]
• Tier 0 (+1): Online System at the experiment (raw rate ~PByte/sec) feeding CERN (700k SI95, ~1 PB disk, tape robot) at ~100-400 MBytes/sec
• Tier 1: regional centers (FNAL: 200k SI95, 600 TB; IN2P3 Center, INFN Center, RAL Center), linked at ~2.5 Gbits/sec
• Tier 2: Tier2 Centers, linked at ~2.5 Gbps
• Tier 3: institutes (~0.25 TIPS), linked at 100-1000 Mbits/sec; each institute has ~10 physicists working on one or more analysis "channels"
• Tier 4: physicists' workstations and physics data caches
• CERN/Outside resource ratio ~1:2; Tier0/(Tier1)/(Tier2) ~1:1:1

Page 6: Experiences and results from implementing the QBone Scavenger

HEP Next Gen. Networks needs

• Providing rapid access to event samples and subsets from massive data stores
  – From ~400 Terabytes in 2001, ~Petabytes by 2002, ~100 Petabytes by 2007, to ~1 Exabyte by ~2012

• Providing analyzed results with rapid turnaround, by coordinating and managing the LIMITED computing, data handling and NETWORK resources effectively

• Enabling rapid access to the data and the collaboration
  – Across an ensemble of networks of varying capability

• Advanced integrated applications, such as Data Grids, rely on seamless operation of our LANs and WANs
  – With reliable, quantifiable (monitored), high performance
  – For "Grid-enabled" event processing and data analysis, and collaboration

Page 7: Experiences and results from implementing the QBone Scavenger

Throughputs today

• Can get 400 Mbits/s TCP throughput regularly from SLAC to well-connected sites on production ESnet or Internet2 within the US
• Need big windows & multiple streams, > 500 MHz CPUs (see the rough window-size calculation below)
• Usually a single transfer is disk limited to < 70 Mbits/s

[Chart: measured TCP throughputs to various sites; trans-Atlantic paths marked with an asterisk]

Also see http://www-iepm.slac.stanford.edu/monitoring/bulk/; and the Internet2 E2E Initiative: http://www.internet2.edu/e2e
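Why big windows are needed: a rough bandwidth-delay-product calculation. This is an illustrative sketch; the 70 ms round-trip time is an assumed cross-country value, not a figure from the talk.

```python
# Bandwidth-delay product: the TCP window needed to keep a 400 Mbit/s
# transfer busy over an assumed 70 ms round-trip time.
rate_bps = 400e6                     # target throughput, bits per second
rtt_s = 0.070                        # assumed round-trip time, seconds
window_bytes = rate_bps * rtt_s / 8  # bits in flight, converted to bytes
print(f"needed window ~ {window_bytes / 1e6:.1f} MBytes")  # ~3.5 MBytes
```

That is far beyond typical default TCP window sizes, hence the need for large windows and/or multiple parallel streams.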

Page 8: Experiences and results from implementing the QBone Scavenger

Why do we need even higher speeds

• Data growth exceeds Moore's law
• New experiments coming on line
• Experiment with higher speeds:
  – Understand next limitations:
    • End hosts: disks, memory, compression
    • Application steering, windows, streams, tuning stacks, choosing replicas…
    • Improve or replace TCP stacks, forward error correction, non-congestion-related losses…
    • Coexistence, need for QoS, firewalls…
• Set expectations
• Change mindset
• NTON enabled us to be prepared to change from shipping tapes to using the network, and assisted in more realistic planning

Page 9: Experiences and results from implementing the QBone Scavenger

In addition…

• Requirements – interactive:
  – Remote login, video conferencing, document sharing, joint code development, co-laboratory (remote operations, reduced travel, more humane shifts)
  – Modest bandwidth – often < 1 Mbps
  – Emphasis on quality of service & sub-second responses

• How to get the best of both worlds:
  – Use all available bandwidth
  – Minimize impact on others

• One answer is to be a scavenger

Page 10: Experiences and results from implementing the QBone Scavenger

What is QBSS

• QBSS stands for QBone Scavenger Service. It is an Internet2 initiative to let users and applications:
  – take advantage of otherwise unused bandwidth
  – without affecting performance of the default best-effort class of service

• QBSS corresponds to a specific Differentiated Services Code Point (DSCP): DSCP = 001000 (binary) = 8 (decimal). Since the DSCP occupies the top six bits of the ToS octet, the corresponding ToS byte value is 0x20 (32 decimal), the TOS value used in the tests below.

• The IPv4 ToS (Type of Service) octet looks like (bits 0-7):
  – Bits 0-2 = Class selector
  – Bits 0-5 = DSCP (Differentiated Services Code Point)
  – Bits 6-7 = Explicit Congestion Notification (ECN)

Page 11: Experiences and results from implementing the QBone Scavenger

How is it used

• Users can voluntarily mark their traffic with the QBSS codepoint (a minimal sketch follows below):
  – much as they would type nice on Unix
• Routers can mark packets for users/applications
• Routers that see traffic marked with the QBSS codepoint can:
  – Be configured to handle it:
    • Forward it at a lower priority than best-effort traffic, with the possibility of expanding bandwidth when other traffic is not using all the capacity
  – Not know about it:
    • Treat it as regular Best Effort (DSCP 000000)
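To illustrate voluntary marking, here is a minimal sketch (not the SLAC or Internet2 tooling; the destination host and port are placeholders) of how a Unix/Linux application can set the QBSS codepoint on a TCP socket before sending:

```python
import socket

# QBSS DSCP is 001000 binary (8); it sits in the top six bits of the
# ToS octet, so the ToS byte value is 8 << 2 = 0x20 (32 decimal).
QBSS_TOS = 0b001000 << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, QBSS_TOS)
sock.connect(("data-sink.example.org", 5001))  # placeholder destination
sock.sendall(b"\0" * 1_000_000)                # this traffic now carries DSCP 001000
sock.close()
```

Routers along the path that implement QBSS will then forward such packets at lower priority; routers that do not simply treat them as best effort.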

Page 12: Experiences and results from implementing the QBone Scavenger

Impact on Others

• Make ping measurements with & without iperf TCP loading (a measurement sketch follows below):
  – Loss, loaded vs. unloaded
  – RTT
• Looking at how to avoid impact: e.g. QBSS/LBE, application pacing, a control loop on RTT, reducing streams; want to avoid scheduling
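A minimal sketch of this kind of measurement, assuming a Linux ping whose summary line reads "rtt min/avg/max/mdev = …"; the target host is a placeholder and the iperf load is started separately:

```python
import re
import subprocess

def avg_rtt_ms(host: str, count: int = 20) -> float:
    """Run ping and return the average round-trip time in milliseconds."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"rtt [^=]*= [\d.]+/([\d.]+)/", out)
    if match is None:
        raise RuntimeError("could not parse ping summary")
    return float(match.group(1))

# Measure once with the path idle, then again while an iperf TCP transfer
# (marked QBSS, BE or Priority) loads the bottleneck, and compare.
print("avg RTT:", avg_rtt_ms("remote-host.example.org"), "ms")
```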

Page 13: Experiences and results from implementing the QBone Scavenger

QBSS test bed with Cisco 7200s

• Set up QBSS testbed
  – Has a 10 Mbps bottleneck
• Configure router interfaces
  – 3 traffic types: QBSS, BE, Priority
  – Define policy, e.g. QBSS > 1%, Priority < 30%
  – Apply policy to router interface queues

[Diagram: testbed topology, Cisco 7200s with host links at 100 Mbps and 1 Gbps and a 10 Mbps bottleneck link]

Page 14: Experiences and results from implementing the QBone Scavenger

Using bbcp to make QBSS measurements

• Run bbcp with source data from /dev/zero and destination /dev/null, reporting throughput at 1 second intervals (a rough stand-in sketch follows below):
  – with TOS = 32 decimal (QBSS)
  – After 20 s, run bbcp with no TOS bits specified (BE)
  – After 20 s, run bbcp with TOS = 40 decimal (Priority)
  – After 20 more seconds, turn off Priority
  – After 20 more seconds, turn off BE
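bbcp is the actual transfer tool used for these tests; as a rough stand-in (not bbcp and not the SLAC harness), the sketch below streams zeros over TCP with a chosen ToS byte and reports throughput once per second. The host and port are placeholders, and setting some ToS values may require extra privileges on some systems.

```python
import socket
import time

def send_marked(host: str, port: int, tos: int, seconds: int = 20) -> None:
    """Stream zeros to host:port with the given ToS byte, printing Mbit/s once a second."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)  # mark before connecting
    s.connect((host, port))
    chunk = b"\0" * 65536
    end = time.time() + seconds
    while time.time() < end:
        sent, start = 0, time.time()
        while time.time() - start < 1.0:
            sent += s.send(chunk)
        print(f"ToS {tos:3d}: {sent * 8 / 1e6:7.1f} Mbit/s")
    s.close()

# First step of the slide's sequence: a QBSS-marked stream (ToS 32); the BE
# and Priority steps are the same call with tos=0 and tos=40, 20 s apart.
# send_marked("data-sink.example.org", 5001, tos=32)
```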

Page 15: Experiences and results from implementing the QBone Scavenger


Example of effects

Also tried: 1 stream for all, and priority at 70%

Page 16: Experiences and results from implementing the QBone Scavenger

QBSS with Cisco 6500

• 6500s + Policy Feature Card (PFC)
  – Routing by PFC2, policing on switch interfaces
  – 2 queues, 2 thresholds each
  – QBSS assigned to its own queue with 5% bandwidth – guarantees QBSS gets something
  – BE & Priority traffic in the 2nd queue with 95% bandwidth
  – Apply an ACL to the switch port to police Priority traffic to < 30%

[Diagram: testbed topology, Cisco 6500s + MSFC/Sup2 with host links at 100 Mbps and 1 Gbps]
[Chart: bandwidth share vs. time, BE up to 100%, Priority policed to 30%, QBSS ~5%]

Page 17: Experiences and results from implementing the QBone Scavenger

Impact on response time (RTT)

• Run ping with iperf loading under various QoS settings, iperf ~ 93 Mbps
  – No iperf: ping avg RTT ~ 300 µs (regardless of QoS)
  – Iperf = QBSS, ping = BE or Priority: RTT ~ 550 µs
    • 70% greater than unloaded
  – Iperf in the same QoS class as ping (except Priority): RTT ~ 5 ms
    • > factor of 10 larger RTT than unloaded

Page 18: Experiences and results from implementing the QBone Scavenger

SC2001 – Our challenge: Bandwidth to the world

• Demonstrate the current data transfer capabilities to several sites worldwide:
  – 26 sites all over the world
  – IPERF servers on each remote side that can accept data coming from the show floor
• Mimic a high energy physics tier 0 or tier 1 site (an accelerator or major computation site) in distributing copies of the raw data to multiple replica sites

Page 19: Experiences and results from implementing the QBone Scavenger

SC2001 Setup

[Diagram: booth Linux PCs feeding two Cisco 6509 switches, connected to the SC2001 NOC and from there "to the world"]

• The configuration of the two 6509 switches defined the baseline for QBSS traffic at 5% of the total bandwidth.
• The Gig lines to the NOC were Ether-Channeled together so as to have an aggregate 2 Gig line.
• The setup at SC2001 had three Linux PCs with a total of 5 Gigabit Ethernet interfaces.

Page 20: Experiences and results from implementing the QBone Scavenger

SC2001 demo 1/2

• Send data from 3 SLAC/FNAL booth computers to over 20 other sites with good connections in about 6 countries
  – Iperf TCP throughputs ranged from 3 Mbps to ~300 Mbps
• Saturate the 2 Gbps connection to the floor network
  – Maximum aggregate throughput averaged over 5 min: ~1.6 Gbps
• Apply QBSS to the highest performance site, and leave the rest as BE

Pings to a host on the show floor: Priority 9±2 ms, BE 18.5±3 ms, QBSS 54±100 ms

[Chart: iperf TCP throughput per GE interface (Mbits/s) vs. time over ~5 minutes, with and without QBSS applied]

Page 21: Experiences and results from implementing the QBone Scavenger

Possible usage

• Apply Priority to lower volume interactive voice/video-conferencing and real time control
• Apply QBSS to high volume data replication
• Leave the rest as Best Effort
• Since 40-65% of bytes to/from SLAC come from a single application, we have modified it to enable setting of the TOS bits
• Need to identify bottlenecks and implement QBSS there
• Bottlenecks tend to be at the edges, so we hope to try this with a few HEP sites

Page 22: Experiences and results from implementing the QBone Scavenger

Acknowledgements & More Information

• Official Internet2 page: http://qbone.internet2.edu/qbss/
• IEPM/PingER home site: www-iepm.slac.stanford.edu/
• Bulk throughput site: www-iepm.slac.stanford.edu/bw
• QBSS measurements: www-iepm.slac.stanford.edu/monitoring/qbss/measure.html
• CENIC Network Applications Magazine, vol 2, April '02: www.cenic.org/InterAct/interactvol2.pdf

• Thanks to Stanislav Shalunov of Internet2 for inspiration and encouragement, and to Paola Grosso, Stefan Luitz, Warren Matthews & Gary Buhrmaster of SLAC for setting up routers and helping with measurements.