Page 1:

Reading Report 8
Yin Chen

24 Mar 2004

Reference: A Network Performance Tool for Grid Environments, Craig A. Lee, Rich Wolski, Carl Kesselman, Ian Foster, 1998

http://www.globus.org/documentation/incoming/gloperf_sc99.pdf.pdf

Page 2:

Overview

Network bandwidth discovery and allocation is a serious issue in Grid environments. How can an application know which hosts have suitable bandwidth? How should it initially configure itself and then adapt to network conditions?

Difficulty: a running application can monitor its own performance but cannot measure other parts of the environment. A tool for performance monitoring and prediction is therefore needed.

Gloperf

Makes simple, end-to-end TCP measurements. No special host permissions are required, and it is independent of link topology. It addresses the problems of:

Accuracy vs. intrusiveness
Portability
Scalability
Fault tolerance
Security
Measurement policy
Data discovery, access and usability

Page 3:

The Gloperf Design

Four functional areas:

Sensors: a set of gloperfd daemons
Collation of data: data is collected into the Globus Metacomputing Directory Service (MDS)
Production of derived data
Access and use of data

Working mechanism: when Globus is started on a host, one gloperfd registers itself with the MDS and then enters its main control loop:

    while (no termination signal) {
        query MDS for all other gloperfds to do peer and group discovery;
        filter the list of gloperfds on version number and groups;
        build a randomized list of tests that does a bandwidth test
            and a latency test to each filtered peer;
        for each element of the list {
            perform the test;
            write the results into the MDS;
            wait n minutes;
        }
    }

Measurement is based on netperf: a netperf “TCP_STREAM” test measures bandwidth, including TCP/IP overhead, and a netperf “TCP_RR” test measures latency. One bandwidth or latency test from the randomized list is performed every n minutes (n = 5). (A sketch of such a netperf invocation follows.)
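To make the measurement step concrete, here is a minimal Python sketch of invoking netperf for the two test types. It assumes netperf is installed locally and a netserver is running on the target host; the -H, -t, and -l options are standard netperf flags, but the output parsing assumes the default human-readable format, and the helper names are illustrative, not Gloperf's actual code.

    import subprocess

    def _netperf(host, test, seconds):
        # Run netperf against `host`; -t selects the test type and
        # -l sets the test length in seconds.
        out = subprocess.run(
            ["netperf", "-H", host, "-t", test, "-l", str(seconds)],
            capture_output=True, text=True, check=True,
        ).stdout
        # In the default output, the main data line is the first line
        # made up entirely of numbers, and ends with the result figure.
        for line in out.splitlines():
            tokens = line.split()
            if tokens:
                try:
                    values = [float(t) for t in tokens]
                except ValueError:
                    continue
                return values[-1]
        raise RuntimeError("could not parse netperf output")

    def bandwidth_mbps(host, seconds=10):
        # TCP_STREAM reports throughput in 10^6 bits/sec.
        return _netperf(host, "TCP_STREAM", seconds)

    def latency_ms(host, seconds=10):
        # TCP_RR reports request/response transactions per second;
        # with small messages, round-trip time is roughly 1/rate.
        return 1000.0 / _netperf(host, "TCP_RR", seconds)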

Page 4:

Performance Results

Measurements should be highly accurate yet minimally intrusive. Frequent measurements can increase accuracy, but they also increase intrusiveness; this can be mitigated by using shorter tests.

The experimental version of Gloperf makes one test per minute between a pair of hosts. A cycle of 5 test times is used: 1, 2, 5, 7, and 10 sec. This yields 5 interleaved time series of measurements, with 5 minutes between tests of the same length.

To evaluate the effect of the test period, the error in this 5-minute data is compared with the error produced when the same data is subsampled to yield a longer test period, e.g., taking every 12th point to effectively produce a one-hour test period (as sketched below).
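A minimal Python sketch of this subsampling, assuming each test length's measurements arrive as a time-ordered list at a 5-minute period (the placeholder data and names are illustrative):

    def subsample(series, every, offset=0):
        # Keep every `every`-th point of `series`, starting at `offset`.
        # With a 5-minute base period, every=12 gives a one-hour test
        # period; offsets 0..every-1 give the possible interleaved series.
        return series[offset::every]

    # Example: 24 hours of 5-minute measurements -> 12 hourly series.
    five_min_series = list(range(288))  # placeholder data
    hourly_series = [subsample(five_min_series, 12, k) for k in range(12)]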

Page 5:

Performance Results

Figure 1 shows the bandwidth measurements made between sleipnir.aero.org (near Los Angeles) and yukon.mcs.anl.gov (near Chicago) over approximately a 48-hour period. The 1-sec. tests are the lowest on average, and longer test times produce a higher measured bandwidth.

Figure 2 shows histograms of the sleipnir-yukon data, reflecting TCP coming out of slow start. The 5-, 7-, and 10-sec. tests all share the same mode. This indicates that tests longer than 5 sec. are NOT necessary on this link, in accordance with the bandwidth-delay product governing TCP slow start.

Page 6:

Performance Results

NOT all links are well-behaved. Figure 5 shows a similar experiment between jupiter.isi.edu (in Los Angeles) and yukon.mcs.anl.gov (near Chicago) that exhibits a volatile mean. Only the 1- and 10-sec. tests are shown.

Figure 6 shows histograms of the jupiter-yukon data.

Page 7:

Performance Results

To assess the “error”, the measurements must be compared with “true” values, which requires a prediction model based on the Gloperf data.

The simplest prediction model assumes that the last measurement predicts the next; the error is the difference between them.

Using this last-measurement predictor, the error can be assessed by computing the Mean Absolute Error (MAE), and the variability by comparing it with the Mean Squared Error (MSE). The MAE and MSE can also be computed for the subsampled data.

Since subsampling every n-th point yields n different possible subsampled time series, the MAE/MSE is computed for every possible series at a given subsampling period (a sketch follows).
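A minimal Python sketch of this evaluation, reusing the subsample helper sketched earlier (the function names are illustrative, not from the paper):

    def last_value_errors(series):
        # MAE and MSE of the predictor "the last measurement
        # predicts the next one".
        diffs = [abs(b - a) for a, b in zip(series, series[1:])]
        mae = sum(diffs) / len(diffs)
        mse = sum(d * d for d in diffs) / len(diffs)
        return mae, mse

    def errors_for_period(series, every):
        # MAE/MSE for each of the `every` possible interleaved
        # subsampled series at this subsampling period.
        return [last_value_errors(subsample(series, every, k))
                for k in range(every)]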

Page 8:

Performance Results

Figure 3 shows the sleipnir-yukon MAE/MSE as a function of test time. The longer test times “average out” some of the variability seen in the shorter test times, resulting in a significantly lower MAE and MSE.

There is an increase in the MAE for the 2-sec. tests, which may be caused by packet loss. At two seconds, TCP is in the process of leaving slow start, which causes the strata to be spread out wider, producing a higher MAE. Once out of slow start with a longer test time, the strata are more narrowly constrained and an individual packet drop has less effect.

Figure 4 shows the same data as a function of test period. The test period has NO effect on the MAE: testing once an hour is just as accurate as testing once every five minutes, since the original data have relatively stationary means. In this situation, infrequent measurement is adequate and preferable.

Page 9:

Performance Results

Figure 7 shows the jupiter-yukon MAE/MSE by test time using the last-measurement predictor and a random predictor. Even in a “noisy” environment, infrequent measurements still produce a BETTER bandwidth estimate than random guessing (one possible baseline is sketched below).
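The slide does not define the random predictor, so the following Python sketch assumes one plausible baseline: guessing uniformly within the range of measurements seen so far. This is an assumption for illustration, not the paper's stated definition.

    import random

    def random_predictor_errors(series, seed=0):
        # Baseline: guess uniformly between the smallest and largest
        # measurement observed so far (assumed definition).
        rng = random.Random(seed)
        lo = hi = series[0]
        diffs = []
        for x in series[1:]:
            diffs.append(abs(x - rng.uniform(lo, hi)))
            lo, hi = min(lo, x), max(hi, x)
        mae = sum(diffs) / len(diffs)
        mse = sum(d * d for d in diffs) / len(diffs)
        return mae, mse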

Page 10:

Performance Results

One drawback of the last-measurement predictor is that an individual measurement may not be representative of the subsequent “true” bandwidth, i.e., the measurement may happen to hit a transient peak or trough in the trace.

In a long-lived operational environment, an extremely large number of samples would be collected, and a new measurement would NOT perceptibly change the mean.

A running exponential average could instead be used to track the current “load”, based on a weighted sum of the current measurement and the previous “load”. This allows current peaks and troughs in the trace to be weighed against recent past measurements with an exponential decay.

Figure 8 shows the jupiter-yukon MAE/MSE by test time for both the last-measurement predictor and an exponential average predictor with a weight of 0.35 on the most recent measurement (sketched below).
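A minimal Python sketch of that exponential-average predictor with the 0.35 weight from the slide (the function name is illustrative):

    def exp_average_errors(series, alpha=0.35):
        # Predict each measurement with the running average
        #   load = alpha * measurement + (1 - alpha) * load,
        # where alpha weights the most recent measurement.
        load = series[0]
        diffs = []
        for x in series[1:]:
            diffs.append(abs(x - load))
            load = alpha * x + (1 - alpha) * load
        mae = sum(diffs) / len(diffs)
        mse = sum(d * d for d in diffs) / len(diffs)
        return mae, mse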

Page 11:

Performance Results

Work with the Network Weather Service (NWS) showed that simple forecasting techniques such as exponential smoothing can improve prediction accuracy dramatically over simple use of the monitor data alone. In this case, the exponential-average data show a clear improvement in prediction.

Figure 9 shows the exponential-average data plotted as a function of test period. Once out of slow start, the MAE rises only approximately 10% when going from testing every 5 minutes to testing once an hour (compare with Figure 4).

Page 12:

Performance Results

To raise the test period to once every 24 hours, a trace generated by the NWS was used, containing 69751 high-frequency tests over a 28.8-day period. This trace was subsampled to produce a trace with 7163 tests at a test period of 300 sec. over the same 28.8 days.

Figure 10 shows the MAE and MSE for test periods of 5, 10, 30, 45, and 60 min and 2, 4, 8, 12, and 24 hours. Beyond one hour, the MAE does start to rise more quickly. But the average worst-case error when measuring once a day amounts to less than 20% of the actual bandwidth observed.

Note that the interleaved time series do not produce the same MAE and MSE readings: two series shifted by only one measurement can generate different overall forecasting errors, yet a frequency analysis does NOT expose such periodicities.

Figure 11 is the histogram of exponential-predictor MAEs from the subsampled NWS trace. Despite the variability, the worst-case sequence with a one-day measurement period produced an error only slightly worse than 25% of the actual bandwidth observed. Also note that the 24-hour period exhibits the widest degree of variability, leading us to conjecture that cyclic frequency components are the cause of the variability.

Page 13:

Performance Discussion

Longer test periods do not erode accuracy. Longer test times tend to average out variation, but they also increase intrusiveness. The overhead of Gloperf, in total bytes transferred per hour, scales inversely with the test period: going from 5 min to 60 min represents a 12x reduction in testing data volume, and going from 1 hour to 24 hours represents a further 24x reduction (see the sketch below). On fast, under-utilized links, Gloperf can push a larger data volume than is necessary.
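A quick Python sketch of the overhead arithmetic behind these ratios, assuming a fixed (illustrative) number of bytes moved per test:

    def bytes_per_hour(period_min, bytes_per_test=1_000_000):
        # Testing once every `period_min` minutes moves
        # (60 / period_min) * bytes_per_test bytes each hour.
        return (60 / period_min) * bytes_per_test

    for a, b in [(5, 60), (60, 24 * 60)]:
        ratio = bytes_per_hour(a) / bytes_per_hour(b)
        print(f"{a} min -> {b} min: {ratio:.0f}x less test traffic")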

What is the best metric for “intrusiveness”?

The number of bytes transferred
The time spent sending additional traffic
The bandwidth actually denied to other flows
The congestion precipitated at routers along the path

But not all of these metrics are easy to capture.

End …