Linking cache performance to user behaviour

Computer Networks and ISDN Systems 30 (1998) 2123–2130

Linking cache performance to user behaviour

Ian Marshall 1, Chris Roadknight *;1

BT Research Laboratories, Martlesham Heath, Ipswich, Suffolk IP5 7RE, UK

Abstract

The performance of HTTP cache servers varies dramatically from server to server. Much of the variation is independentof cache size and network topology and thus appears to be related to differences in the user communities. Analysis of arange of user traces shows that, just like caches, individual users have highly variable hit rates, Zipf locality curves andshow strong signs of long range dependency. In order to predict cache performance we propose a simple model whichtreats a cache as an aggregation of single users, and each user as a small cache. 1998 Elsevier Science B.V. All rightsreserved.

Keywords: Traffic; Cache; Internet; Users

1. Introduction

HTTP caching has been shown to deliver con-siderable benefits in network utilisation and userperceived performance [1,2]. In order to optimise thebenefits it is important to develop accurate models ofcache behaviour. To our knowledge, accurate modelshave yet to be fully defined. The performance ofHTTP cache servers varies dramatically from serverto server and invariants have proved hard to identify.Much of the observed variability is independent ofcache size [3] and network topology [4] and appearsto be related to differences in the user communi-ties of the caches. It therefore seems desirable todevelop a good model of user behaviour. Unfortu-nately studies of the behaviour of individual usersare not readily available. This is probably because itis not possible to analyse the behaviour of anything

Ł Corresponding author.1 E-mail: {marshall,roadknic}@drake.bt.co.uk.

other than large (atypical) groups of users in theanonymised statistics published by some cache op-erators (e.g. ftp://ircache.nlanr.net/Traces/). In orderto build reliable statistics it is necessary to obtainlong-term traces for small and medium users. If userrequest rates follow a single sided distribution, likemany cache parameters, the smallest users will beby far the most numerous and will dominate results.However, if users request rates follow a Poisson dis-tribution (as do telephony call rates [5]), it will besafe to ignore the smallest users.

In this paper we report the first results from astudy of individual user behaviour that we have un-dertaken. In Section 2 we illustrate the analysis thatis possible using the published daily statistics from alarge cache, and show that there is strong evidencethat user request rates are Poisson. In Section 3 wereport some longer-term results for single users fromseveral different caches, and show for the first timethat large and moderate individual users have highlyvariable hit rates, Zipf locality curves and exhibit

0169-7552/98/$ – see front matter 1998 Elsevier Science B.V. All rights reserved.PII: S 0 1 6 9 - 7 5 5 2 ( 9 8 ) 0 0 2 4 4 - X

2124 I. Marshall, C. Roadknight / Computer Networks and ISDN Systems 30 (1998) 2123–2130

long range dependency. The smallest users do notgenerate large enough samples to enable meaning-ful analysis. In Section 4 a simple model is pro-posed which models individual users using only keycache performance parameters. Although this modelis based on the behaviour of large and moderateusers, it should generate reasonable predictions sincethe distribution of user request rates is not singlesided and the smallest users can be safely ignored. InSection 5 we illustrate how this simple model couldbe used to predict cache performance, and proposean empirical constant representing the fact that cacheuser communities are not random selections from thetotal user community.

2. Analysis of daily statistics

The clearest link to user choices and behaviour inpublished cache statistics is the host popularity curve(or locality plot). This curve plots the number ofrequests for files from a given site, against the pop-ularity ranking of the site. Previous studies [6] haveshown that the curve usually approximates to Zipf’slaw. Zipf’s First Law [7] states that if the frequenciesof occurrence of each item are ranked by frequencythen the frequency of the second most popular itemwill be half the frequency of the most common item.The frequency of the third most popular item will beproportional to a third of the frequency of the mostpopular, and so on. That is:

Frequency of rankingN D Frequency of ranking1

N (1)

In order to get a clear picture of user activityhowever, it is necessary to know the target file foreach request, rather than just the target host. Thisis because some users could access a large range offiles on a small range of hosts, and some users couldrepeatedly access a small range of files on the samerange of hosts. This would lead to identical popu-larity curves but very different results for metricssuch as hit rate, and number of cached files. Unfor-tunately the statistics published by most caches donot provide any detail of the files accessed by indi-vidual users. However, we were able to obtain a logwhich reported the target URL for each request, fromthe operators of the Funet (Finnish University and

Research Network) cache in Finland [http://www.funet.fi/funet/]. The Funet cache is a primary cacheserving universities, polytechnics and some researchorganisations in Finland. The cache has about 2500users, 500 000 requests per day, a hit rate of ¾30%and a popularity curve which is approximately Zipf.The cache appears, on these metrics, to be similarto many primary caches run by other universitiesor groups of academic institutions [e.g. http://www.swin.edu.au/proxy/usage/]. The analysis reported inthis section is based on a log file for a single days ac-tivity, obtained from Finland. Analysis over a longerperiod was not possible as the logs are anonymisedon a daily basis. In addition, many of the loggedfile requests are automatically generated by embed-ded objects such as .gif files, rather than by theuser. As we are attempting to analyse user behaviourrather than the composition of Web pages we haveattempted to eliminate the auto-generated requestsfrom the analysis. Therefore all requests which oc-curred within a second of the previous request andwere made to the same host as the previous requesthave been filtered out of the data.

Fig. 1 shows the user community of the cachesorted by number of requests generated in a singleday (20=1=98). The logarithmic distribution of re-quest rates is approximately Poisson with a mean of101 requests (where the mean is defined as the in-verse log of the mean of the logarithmic distribution).This observation means we can build a model withoutrequiring analysis of the smallest users. The bottom20% of users generate less than 1% of the traffic sotheir performance impact will be minimal. We will,however, need to build up a detailed understanding ofusers generating around 100 requests per day.

Fig. 2 shows the host popularity data for thelargest two users of the cache. A line with the

Fig. 1. Distribution of user request rates on the Funet cache.

I. Marshall, C. Roadknight / Computer Networks and ISDN Systems 30 (1998) 2123–2130 2125

Fig. 2. Popularity curve for hosts accessed by two users of theFinland cache.

gradient of Zipf’s law is also shown as an aid to theeye. It is apparent that user 1 (no. of requests) visitsfewer hosts than user 2 (no. of requests), and thatwhilst user 1 approximately follows Zipf’s law user2 does not. In fact user 2 shows weaker locality sincehis most popular sites are less popular than Zipfwould predict. We chose to analyse large users atthis stage because it was more practical and resultingstatistics would be more sound.

Analysis of requests made over one day are notideal, common sense would suggest that data fromonly one day will consistently underestimate thelocality shown by users because daily visits to sites(e.g. news sites) will not be represented. Secondlysmaller users generate samples which are too smallafter only one day, so the selection of users from adaily log can never be fully representative.

In order to illustrate the need for data on filerequests rather than just the usual data on host re-quests, the file request popularity curve for the sametwo users is shown in Fig. 3, again with the Zipfgradient shown as an aid. It is clear that user 1 is vis-iting more pages than user 2, even though he visitedfewer hosts, illustrating the importance of the filedata. It is also clear that page request popularity doesnot follow Zipf’s law, since both curves now showless locality than Zipf would predict. The reason forthis is likely to be that data has only been analysedfor one day and thus the popularity of the mostfrequently visited sites has been underestimated asdiscussed above. It also important to note that therequest gradient for hosts will always be steeper thanfiles because the former is usually an aggregate of

Fig. 3. Popularity curves for files accessed by two users of theFinland cache.

the latter. This means that there is always more filesrequested than hosts visited and the most popularhost is always more popular than the most popularfile.

Even with access to the file request data from theFunet cache we are unable to draw any strong con-clusions. The reasons can be summarised as follows:(1) Only large users generate enough requests to be

worthy of analysis and they are not likely to berepresentative of smaller users.

(2) A user identity cannot be traced for long timeperiods since the data is freshly anonymised on adaily basis. This is particularly important as thereis likely to be significant long range dependencyin user activity based on the users memory whichwill be lost if only the activity of a single day isanalysed.

In an attempt to overcome these issues we haveassembled long term data sets from a range ofcaches. The initial analysis of some of the data isreported in Section 3 below.

3. Analysis of long term statistics

The cache which is the source of the majority ofthe data presented in this section has been used by5 individuals over the course of a year. The cacheis small since only users who are members of ourresearch team, and are happy for their logs to beanalysed in non-anonymised form have been encour-aged to use it. Over a 26-week period (25=8=97–22=2=98) the cache generated 14 000 requests andsustained a hit rate of around 20%. The cache is sitedat the subnet router for the team’s network and is


thus caching requests to sites in other buildings onthe local Intranet. These local requests do not appearin the data for larger caches and should possiblybe eliminated. In this paper these requests have notbeen eliminated as it is difficult to know where tostop eliminating sites from a global Intranet such asthat operated by BT. We do not feel that the localrequests significantly alter the principal findings ofthe work.

Table 1 shows the usage statistics of the mostconsistent user of the cache over the 26-week period.This user was using the WWW through the cachewhilst in the office, primarily as a research tool.

On days when the WWW was used the userspent between one and three hours browsing andgenerated up to 345 page requests in a session.The typical number of initial page requests (autogenerated requests for embedded files have beenfiltered as above) on a working day was 100. Thisplaces the user in the most probable part of theusage distribution observed on the Finland cache.One cannot on this basis alone claim that the user

Table 1Users statistics for biggest user of small scale cache

Client 1

Requests 6545Hosts 389Uri 3535Hit rate (%) 24.04Requests on busiest day 345Number of distinct hosts on busiest day 27Number of distinct URL’s on busiest day 240

Table 2The 10 most popular hosts for 3 users

Finland: user 1 Requests Finland: user 2 Requests BT: client 1 Requests

www.msnbc.com 3070 freespace.virgin.net 83 www.sunday-times.co.uk 1057www.microsoft.com 1487 photogallery.simplenet.com 79 local project management 245ads.msn.com 671 website.yle.fi 56 www.zdnet.co.uk 234investor.msn.com 110 us.imdb.com 55 ad.doubleclick.net 212www.mungopark.com 75 www.unitedmedia.com 48 www.altavista.digital.com 206travel.state.gov 55 altavista.telia.com 48 www.windows95.com 163www.superbowl.com 46 icons.imbd.com 43 local experimental site 157ad.preferences.com 44 www.geocities.com 39 local directory 139www.zdnet.com 40 www.bubblegumtv.com 36 rc5stats.distributed.net 133msnbc.com 39 free.prohosting.com 33 www.microsoft.com 109

is representative, but at least we can be sure we areanalysing a user with significant differences to theusers analysed in Section 2.

On the basis of the amount of time this user spentin WWW based activity it seems unlikely that therequests attributed to the heavy users on the Finlandcache were entirely generated by single human endusers. 14% of the requests generated by our userwere actually automated requests. The user was us-ing Internet Explorer and has experimented with thesubscription mechanisms this browser provides. Inthe 26 week period the user had 6 subscriptions for aperiod of about one month. One of the subscriptionswas checking the currency of bookmarked pages.Had these subscriptions been extended over the en-tire 26 weeks 50% of the traffic would have been‘robot’ generated. It is conceivable that an enthusi-astic subscription user would have a much higherproportion of robot traffic. Table 2 shows the tenmost popular sites for all three users.

As expected from previous reports [8] thefavourite sites for our moderate user include a lo-cal directory, some intranet pages, some news pages,some experimental research activity, and developersites of large software companies. On the other handthe popular sites list for Finland user 1 is heavilybiased towards the sites which are default subscrip-tions in Internet Explorer. This tends to support asupposition that some high request rates are robotassisted. It also leads us to question the benefits ofthe subscription mechanism which is potentially in-creasing Web traffic by up to an order of magnitude.It is possible that Finland user 2 is a small subsidiarycache.


Fig. 4. Comparison of popularity curves over different timespans.

Fig. 4 shows the host popularity data for ourcache analysed over progressively increasing periodsof 1 week, one month and 6 months. As the samplesize and time period increase the results tend towardsfollowing Zipf’s law. It is not clear whether thisis an artifact of small sample sizes or whether itindicates a very long-range dependency of more than6 months. It is to be expected that there will besignificant self-similarity at long timescales as theusers’ memory and interests are quite persistent, butit is also clear that for less popular sites the statisticswill be very poor.

Fig. 5 compares the page popularity curves forthe two Finland users and for our local user (withauto generated requests filtered as before). It is clearfrom Fig. 6 that our local, moderate user has apopularity curve with a similar gradient to that ofFinland user 2, but Finland user 1 has a curve witha steeper gradient. Despite the results being takenover a longer time period, our user still exhibits lesslocality than would be predicted by Zipf’s law. We

Fig. 5. Popularity curves of three users at two sites.

Fig. 6. Decline in hit rate variance with increasing sample size.

suspect that this may be because the lifetime of pagesis significantly shorter than that of sites so the usageof the most popular pages cannot build up over longperiods as it does for the most popular sites. It isequally possible that it is simply because our user isdifferent. To resolve this question we analysed somelong term user traces obtained from other sites.

Fig. 6 shows the variance of the ‘auto hit rate’generated by a range of users plotted against samplesize. We are confident that all the users plotted inthis figure are genuine individual users. One useris the BT user analysed above. One additional useris derived from traces collected in 1995 at BostonUniversity [6]. The remaining users are drawn fromearly analysis of logs provided by Research Ma-chines plc (RMPLC). RMPLC [http://www.rmplc.net] is an internet service and content provider toschools and colleges in the UK, they also have asmall number of modem users who are assigned astatic IP address. We analysed the hit rate statistics oftwo of the modem users, during a period of 6 weeksin early 1998. All of the users are generating around100 requests per day, and are thus near the peakof the request rate distribution found in the Finlanddata.

The ‘auto hit rate’ is the rate of request for filespreviously requested by the user and successfullycached. As observed elsewhere [9], for the aggregatehit rate on much larger caches, the variance doesnot decay rapidly with sample size. The curves arelinear and the Hurst parameters can be estimatedfrom the data and fall in the range 0.55–0.7 (withestimation errors of 0.1 due to the use of smallsamples). All the users we have analysed thus exhibitthe statistical anomalies that have been interpreted


elsewhere as self similarity [6]. The data presenteddoes not prove self similarity in a strict mathematicalsense as the range of the curves is small, due to thesmall sample sizes. However, the data is sufficient todemonstrate, for the first time, that individual usersexhibit the same anomalies as caches. The anomaliescan be safely interpreted as a type of long rangedependency, which we postulate originates in thememory of the users.

It is clear from Fig. 6 that the BT user is notobviously atypical. It is also clear that the hit rates ofthe users are just as variable as those of caches [9]and fall in a remarkably similar range (16%–65%).Use of long term traces has thus enabled us to findthat (at least large and moderate) users have hit rates,locality curves and long range dependency, all ofwhich are indistinguishable from those of caches.

In Fig. 7 we show the build up of the auto hitrate of the BT user with time. The rate plateaus afterjust over 1 working week, indicating that the charac-teristic time of the dominant long-range dependencyshould be around 1 week. The data also indicatesthat the ideal length of user trace is at least 1 weekeven for large users. For moderate users with ¾100requests per day a 25 day sample would be ideal toovercome stochastic effects. Users with 20 requestsper day would ideally require a 6 month long trace.Users with less than 10 requests=day will be hardto analyse accurately as the sample would need tobe 250 days long, i.e. it would exceed the typicallifetime of Web pages. However since these usersrepresent the tail of the distribution and have min-imal impact on cache performance they can almostcertainly be safely ignored.

Based on the results reported in Sections 2 and 3we now move on to propose a simple model of users.

Fig. 7. Change in theoretical hit rate over 10 working days.

4. User model

The simplest conceivable model of a Web userhas one assumption and one parameter. The assump-tion is that page popularity will follow Zipf’s lawas do other popularity curves for human beings [10].The parameter is usage level. The results we havereported indicate that this is not sufficient to explainthe variability of user behaviour — since in practicethe users do not follow Zipf’s law (although theydo appear to follow a heavy tailed Pareto curve),and they have highly variable hit rates. In additionit is clear that users are a significant source of longrange dependency. This leads us to the assertionthat at least 3 parameters are required to model theobserved behaviour:(1) usage,(2) auto hit rate,(3) Hurst parameter derived from auto hit rate.

Of course for modelling situations where theburstiness is not important the third parameter can beneglected and we are left with a simple two-parame-ter model.

Auto hit rate is chosen as the second parameterbecause it is the most obvious single number metricfor the locality of users requests, and it has no depen-dencies on other users or the properties of the WWW.In addition it is plausible to argue that it will show adistribution with a well defined maximum likelihood,as does the usage parameter. In the next section weillustrate how a user model based on the above pa-rameters can be used to predict cache performance.

5. Community model

Based on a two parameter user model as proposedin Section 4 and additional input from the obser-vations we can propose a method for building usercommunity models. This method is intended as an il-lustration of our objectives rather than as a definitiveproposal due to the small amount of data available.

Assumptions:(1) the distribution of request rates is Poisson with a

mean of 101 as observed in the Finland data,(2) the distribution of auto hit rates is Poisson with

a mean of 0.2 (the observed auto hit rates of ouruser) and independent of request rate.


If the community is mostly human the aboveassumptions are likely to be true. However if thereis a substantial proportion of robot traffic there arelikely to be at least two independent distributionswith different means. The impact of robots will bethe subject of a further study.

Since we have assumed Poisson distributions itis sufficient to have studied users enough to char-acterise the means of the above distributions accu-rately. The community model is not dependent onindividual user properties. However, we also needa characterising parameter — the degree of overlapbetween users in a given community. This param-eter is required because communities are not ran-dom samples from the overall user population, butare subject to strong locational bias. Examples ofbias include factors like language, dominant localindustries and demographics (small towns with uni-versities are dominated by single people in theirearly twenties). We propose an overlap parameter, !,which can be derived empirically using

! D1� ðobservedno.ofcachedpagesłworkingdays

ŁŽðno.ofusersð .meanrequestrate�

meanrequestrateðmeanautohitrate/Ł. (2)

We anticipate that caches with similar user com-munities will have very similar values for !, and that! will have a useful probability distribution, whichcan be treated as the sum of the distributions for !for a small set of community types (e.g. researchers,home users, business users, robots, mixed). Eachtype represents a different mixture of user objectives,and should thus represent a defined market sector forWWW products. We are currently attempting to cal-culate ! for a wide range of caches with publisheddata in order to validate this hypothesis.! can be used to make predictions, for example it

can be shown that:no.ofhits D

meanrequestrateðmeanautohitrateð no.ofusersC ! ð .no.ofusers� 1/ð .meanrequestrate� meanrequestrateð autohitrate/.

(3)

Using meanrequestrate D 101, meanautohitrate D0.2 and taking observations from the Finland cachefor observedno.ofcached pages D 161 628, working-days D 1, and no.ofusers D 2284, we derive ! D0:124.

Applying this value of ! to the Finland cache weget no.ofhits D 69 010.

The true value of no.ofhits was 70 999 so ourmodel has predicted a reasonable answer.

6. Conclusions

Based on a preliminary analysis of user behaviourin daily cache logs and in long term traces we haveshown that individual users, with moderate dailyrequest rates, exhibit hit rates, request locality andlong range dependencies similar to those observedin caches. We have also shown that user requestrates follow a Poisson type distribution, so it is notnecessary to understand the behaviour of the smallnumber of low level users as their requests makeup an insignificant proportion of overall requests toa cache. Based on the observations a simple modelof users has been proposed, and we have illustratedhow this model could be used. A parameter wasproposed to account for the systematic biases inthe membership of real cache user communities.This parameter may be sufficient to characterise theobserved differences between real caches.

Acknowledgements

We would like to thank Pekka Jarvelainen forsupplying us with anonymised logs for the Funetproxy cache, and Simon Rainey of RM plc forsupplying us with the logs of their caches in the UK.

References

[1] M. Baentsch, L. Baum, G. Molter, S. Rothkugel, P. Sturm,Enhancing the Web’s infrastructure: from caching to repli-cation, IEEE Internet Comput. (March 1997) 18–27.

[2] M. Abrams, C.R. Standridge, G. Abdulla, S. Williams,E.A. Fox, Caching proxies: limitations and potentials, in:Proc. 4th Int. World-Wide Web Conference, Boston, MA,December 1995.

[3] B.M. Duska, D. Marwood, M.J. Feeley, The measured ac-cess characteristics of WWW proxy caches, 1997, http://www.cs.ubc.ca /spider/marwood/Projects/SPA/

[4] M. Arlitt, C. Williamson, Internet Web servers: Workloadcharacterization and performance implications, IEEE/ACMTrans. Networking 5 (5) (1997).


[5] J. Boucher, Voice Teletraffic Systems Engineering, Ch. 2,ISBN 0-89006-335-4.

[6] C.A. Cunha, A. Bestavros, M.E Crovella, Characteristicsof WWW client-based traces, Technical Report TR-95-010,Boston Univ. Depart. of Computer Science, April 1995.

[7] G.K. Zipf, Human Behavior and the Principle of LeastEffort, Addison-Wesley, Cambridge, MA, 1949.

[8] J. Gwertzman, M I. Seltzer: People, places, and things:the next generation Web, in: Proc. COMPCON, 1996, pp.65–70.

[9] C. Roadknight, I. Marshall, Variations in cache behavior,Comp. Networks ISDN Syst. 30 (1998) 733–735.

[10] J. Beran, Statistical methods for data with long-range de-pendence, Stat. Sci. 7 (4) 404–427.

Ian Marshall has been leader of BT’s information systems re-search team since 1992. Since early 1995 the research hasconcentrated on enhancing the internet and WWW to meet fu-ture quality and scalability requirements. Ian’s current researchincludes internet traffic analysis, building and testing scalableservers, server management, and dynamic adaptation throughapplication layer active networking.

Chris Roadknight received his B.Sc degree in applied biologicalsciences from Manchester Metropolitan University in 1992 andan M.Sc. in computer sciences from the University of Man-chester in 1996. He is in the final stages of submitting for hisPh.D. on pollution impact modelling using neural networks andis currently employed by British Telecom, working on cacheperformance modelling.

Linking cache performance to user behaviour

Documents

Transcript of Linking cache performance to user behaviour