IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. XX, NO. XX,...

IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. XX, NO. XX, XXXX 201X 1

Demystifying the Imperfect Client-Side CachePerformance of Mobile Web Browsing

Xuanzhe Liu, Member, IEEE, Yun Ma, Yunxin Liu Member, IEEE, Tao Xie Senior Member, IEEE, andGang Huang Senior Member, IEEE

Abstract—The Web browser is one of the most significant applications on mobile devices such as smartphones. However, the userexperience of mobile Web browsing is undesirable because of the slow resource loading. To improve the performance of Web resourceloading, client-side cache has been adopted as a key mechanism. However, the existing passive measurement studies cannotcomprehensively characterize the “client-side” cache performance of mobile Web browsing. For example, most of these studies mainlyfocus on client-side implementations but not server-side configurations, suffer from biased user behaviors, and fail to study “miscached”resources. To address these issues, in this article, we present a proactive approach to making a comprehensive measurement study onclient-side cache performance. The key idea of our approach is to proactively crawl resources from hundreds of websites periodicallywith a fine-grained time interval. Thus, we are able to uncover the resource update history and cache configurations at the server side,and analyze the cache performance in various time granularities. Based on our collected data, we build a new cache analysis modeland study the upper bound of how high percentage of resources could potentially be cached and how effectively the caching works inpractice. We report detailed analysis results of different websites and various types of Web resources, and identify the problemscaused by unsatisfactory cache performance. In particular, we identify two major problems- Redundant Transfer and MiscachedResource, which lead to unsatisfactory cache performance. We investigate three main root causes: Same Content , HeuristicExpiration, and Conservative Expiration Time, and discuss what mobile Web developers can do to mitigate those problems.

Index Terms—Mobile Web, Cache, Measurement.

F

1 INTRODUCTION

AMONG the numerous mobile applications, a.k.a, apps,the Web browser is the most frequently used one.

According to a recent study [1], as of 2013, Web browsersoccupy 63% of the window focus time on the subjects’ mo-bile devices. Recently, besides the standard browser, nativeapplications also involve built-in browser facilities, such asWebView on Android, to realize part of its functionalitiesincluding detailed information exhibition, link sharing, etc.

However, the user experience of mobile Web browsingis far from satisfaction. Resource loading is one of the keyfactors influencing Web browsing performance. A webpageconsists of a set of resources, such as HTML, CSS, JavaScript,and images. After users type in a URL and press the “Go”button or click through a hyperlink, the target webpage isloaded into the browser. During the page load, all relatedresources are fetched from the Web server, and then parsedor evaluated to finally render the page. As a result, down-loading resources dominates the resource loading process,and thus affects Web browsing performance. The moreresources need downloading from the network, the moredata traffic it consumes to load the webpage. It is reportedthat 65% of page load time is spent on resource loading [2].Mobile users suffer from slow page load, which leads tolong time of screen light-up and inevitably consumes moreenergy [3]. Such a problem becomes even worse on mobiledevices because of the unreliable network conditions, partic-ularly when the network is congested or the user is movingaround.

• Xuanzhe Liu, Yun Ma, Gang Huang are with the Key Laboratoryof High Confidence Software Technologies (Peking University), Min-istry of Education, Beijing, China, 100871. Email: liuxuanzhe, mayun,[email protected]

• Yunxin Liu is with Microsoft Research. Beijing, China, 100084. Email:[email protected]

• Tao Xie is with the University of Illinois at Urbana-Champaign. Email:[email protected]

Manuscript received June, 2015

In practice, client-side cache1 is particularly helpful forresource loading on mobile devices. By saving the resourcesof visited webpages into the local storage facilities (memoryor disk), these webpages may be served from the localstorage rather than being re-fetched from the Web serverswhen the webpages are accessed again. As a result, clientcache can reduce data transmitted over the last mile, i.e., theradio access network, which is shown to be the bottleneckof a cellular network [4].

Indeed, it is significant to understand how mobile Webbrowsing could benefit from “client-side cache” and howwell “client-side cache” takes effect in practice. To thisend, several measurement studies have been conducted byanalyzing user access traces gathered from Internet ServiceProviders (ISPs) or instrumented client devices [5], [6], [7],[8], [9].

Although the existing studies have already revealedthe performance problems of client-side cache on mobiledevices, they focus on only the imperfect client-side im-plementations. They do not comprehensively characterizethe end-to-end Web cache performance involving not onlyWeb clients, but also Web servers and user behaviors. Morespecifically, the existing measurement studies on mobileWeb cache have the following main limitations.•Existing studies are limited in terms of biased user be-haviors. The traces collected from ISPs or instrumentedclient devices are passive observations on users’ Web accessbehaviors. These traces are just fragmented snapshots ofthe websites and thus cannot cover the full resource updatehistory of these websites. As a result, measurements basedon these traces could imply only how caching works onthe sampled access of the sampled users, but cannot reflecthow the websites themselves can essentially benefit fromcaching.

1. Caching is a widely adopted mechanism on the Web and cache canbe deployed at many places between browsers and Web servers. Thisarticle focuses on client cache. We use caching, cache, and Web cacheinterchangeably to denote client-side cache if not specifically stated.


• Existing studies do not investigate the influences ofserver-side caching configurations on cache performance.Similar to the first issue, as the passively collected datatraces cannot cover the full resource update history of thewebsites, these studies concentrate on client-side implemen-tations but not server-side configurations. Although how thebrowsers support caching on mobile devices is important,Web cache performance is largely determined by the cacheconfigurations at the server side. For example, if resourcesare configured as no-cache (indicating that they should not becached), client-side caching cannot help at all. It is desirableto examine how server-side configurations may impact Webcaching.• Existing studies do not investigate the problem ofmiscached resources. Miscached resources are those thathave already been updated on the server, but the browserwrongly uses the expired resources from the cache ratherthan fetching the updated ones from the server. The passiveapproaches of data collection cannot capture miscachedresources because the passive approaches collect only the re-sources actually fetched by browsers. However, miscachedresources may significantly affect user experience. In par-ticular, if the functional resources (e.g., JavaScript or CSS)are miscached, the corresponding website may not workcorrectly. Therefore, there is a strong need of studyingmiscached resources.

To address the preceding limitations faced by the exist-ing studies, in this article, we present a proactive approachfor comprehensive measurement and analysis of mobileWeb cache performance. We crawl the resources from hun-dreds of websites periodically with a fine-grained timeinterval for a long time duration. In this way, we are ableto uncover the resource update history and cache configura-tions at the server side, and analyze the cache performancein various time granularities, without bias caused by limitedsamples of users’ behaviors. We build an analysis model toanswer three research questions. The first one is how muchof a website can be cached, giving an upper bound of howcaching could help improve mobile Web browsing experi-ence. More cacheable resources indicate more performanceimprovements that the website can benefit from caching.The second question is how well caching has been supportedto reach the upper bound, providing comprehensive charac-terization of the state-of-the-art of mobile Web caching. Thethird question is what causes the gap between the upper boundand the actual performance. For various reasons, resources thatcan be cached may not be actually cached and there existmiscached resources.

To conduct a quantitative study, we measure the per-formance of Web caching in both the resource size andresource number. Larger size of resources loaded from thecache indicates less data traffic that needs downloadingfrom the network. However, the resource size itself cannotfully quantify the performance of Web caching. Given thesame total size of resources, a larger number of smallresources loaded from the cache indicates fewer networkconnections to the server. Therefore, we consider both theresource size and resource number as the evaluation metricson how caching influences mobile Web browsing experiencein terms of page load time and data traffic.

This article2 makes the following main contributions:

• We propose a proactive approach to acquiring de-tailed history of resource update. We choose twosets of websites for a comparative study to uncover

2. This article is an extended version of our previous work publishedat WWW 2015 [10].

both the state-of-the-art and ordinary status of mo-bile Web cache performance. One set (Top) comesfrom Alexa’s Top 100 list. The other set (Random)comes from 100 websites randomly chosen fromAlexa’s Top 1,000,000 list. We periodically crawl theirresource update history for one week, forming thebasis of our analysis.

• We design a measurement model for analyzing cacheperformance. Based on the resource update history,we formally define the metrics to quantitativelymeasure cache performance. We then build a cachebehavior taxonomy to correlate cache configurationsto resource updates. Such a taxonomy maps eachpossible case to one of the four behavior patterns:Positive Hit (PH), Negative Hit (NH), Positive Miss(PM), and Negative Miss (NM).

• We present the empirical results for demonstratingthe gap between ideal and actual mobile Web cacheperformance. On average, more than half of the totalresources can be cached and the Random websitesare more cacheable than the Top ones. However,in practice, only a limited portion of the cacheableresources are cached, not more than 50% for theRandom websites. Different types of resources havedifferent cacheabilities.

• We present the empirical results for highlightingtwo main problems caused by undesirable cache per-formance. The Redundant Transfer problem comesfrom NM resources (the resources that can be servedfrom the cache but are actually downloaded from thenetwork). The median Negative Miss Ratios in termsof the resource size of Top and Random websitesare 10% and 52% for one-day revisiting interval,respectively. The Miscached Resource problem comesfrom NH resources (the resources that have beenupdated on the server but are actually served fromthe cache). NH resources have a similar proportionin both Top and Random websites.

• We demystify the root causes for undesirable mobileWeb cache performance. Three main root causes areidentified: Same Content (the same resources thathave different URLs when requested at differenttimes), Heuristic Expiration (the caching policy is notexplicitly defined by the server and thus it dependson browsers to infer an expiration time), and Conser-vative Expiration Time (the expiration time is set tobe too short).

• We suggest some implications and recommendationson how to improve cache performance by using bettercache policies. For example, the Application Cacheof HTML5 can be used to make the resource updatesvisible to clients.

To the best of our knowledge, our work provides thefirst study that reveals the imperfect cache performance ofmobile Web browsing by proactively exploring the resourceupdate history and cache configurations at the server side.The remainder of this article is organized as follows. Section2 motivates our study by analyzing how caching works andrelevant factors influencing cache performance. Section 3presents three research questions. Section 4 describes ourdataset and data-collection approach. Section 5 presents theanalysis methodology and Section 6 presents the detailedmeasurement results. Based on the findings, we provideimplications and recommendations for Web developers inSection 7. Section 8 presents threats to validity. Section 9surveys related work and Section 10 ends this article withconcluding remarks and future work.


2 BACKGROUNDIn this section, we describe the background on how Webcaching works and possible reasons that may influence thecache performance.

2.1 How Web Caching WorksFigure 1 shows a simplified procedure to illustrate how Webcaching works. A webpage consists of a set of resources,such as HTML, CSS, JavaScript, and images. When userstype in a URL and press the “Go” button or click through ahyperlink, the browser loads the target webpage by acquir-ing the corresponding resources from the Resource Loadermodule.

The Resource Loader first tries to load resources from thelocal cache through the Cache Manager module. Accordingto RFC 2616 [11], HTTP/1.1 defines a cache expirationmodel and a cache validation model. The cache expirationmodel allows a Web server to set an expiration time foreach resource, indicating how long time the resource canbe cached by a client device. Setting expiration time isnot mandatory. If a resource does not have an expirationtime explicitly set by the server, browsers may apply theirown heuristic algorithms to decide the cache expirationtime of the resource. Such a mechanism is called HeuristicExpiration.

The Cache Manager checks whether the requested re-source (identified based on its URL’s protocol, host, port,path, and query) can be found in the Cache Database. If not,the browser sets up a HTTP connection to fetch the resourcefrom the remote Web server. If the resource is found andnot expired according to the cache expiration model, it isreturned directly from the cache. If the resource is expired,the HTTP/1.1 cache validation model requires the browserto check with the server whether the resource is changedor not (via the last modified time or Etag identifier). If theserver determines that the corresponding resource has notchanged, it responds with a “304 Not Modified” messageand the resource is fetched from the cache of the client.Otherwise, the server sends the updated resource back tothe browser.

Note that a website may be much more complex than asingle Web server. A large website (e.g., Google) may havemany backend Web servers. There may also be Content De-livery Network (CDN) servers to enable scalability and ge-ographical distributions. In such a case, a URL’s query partoften contains a hash code to indicate which CDN servershould process the request, and thus the same resourcemay have different URLs served by different CDN servers.Furthermore, proxy servers and reverse proxies often act inthe middle to route the requests to different backend serversfor workload balance. Different backend servers could alsoattach different URLs to the same resource. As discussedin Section 5.3, this problem of “multiple URLs of the sameresource” causes undesirable Web cache performance andshould be investigated.

2.2 Factors Influencing Cache PerformanceBased on the principle of Web cache, we identify five factorsthat could possibly influence the mobile Web cache perfor-mance.•Website Types. Websites of different types have differentkinds of resources, which are updated with different fre-quencies. For example, a news website may update its textsand images in a very short time period to meet the require-ment of timeliness, while resources of a search website couldremain stable for quite some time to provide consistent

Fig. 1. Architecture of Web caching

experience for users. Therefore, cache performance of newswebsites is usually lower than that of search websites.•Web Administrators. Web administrators are responsiblefor deploying websites. During the deployment, they aresupposed to configure the servers to support cache. Forexample, they may configure different cache parameters fordifferent resources that may have different lifetimes. To en-able geographical distribution and load balance of backendservers, Web administrators could also adopt CDNs andreverse proxies to dispatch requests. Such middle pointsbetween browsers and servers can also influence the cacheperformance.•Web Developers. Some web techniques may actually dis-able cache or change its original configuration. For exam-ple, in order to achieve version management, developerswould like to attach a version number after the URL ofresources (e.g., http://foo.com/resource.js?v=1). Sometimesdevelopers choose to randomly generate the number withJavaScript code. Since the HTTP specification regards thewhole URL including the query string as the cache entryindex, resources with different version numbers are treatedas different resources. As a result, even if the resource isconfigured to be cached and does not change at all, it has tobe requested again.•Browser Runtime. Cache is ultimately implemented bybrowsers. How well browsers support caching has a directimpact on the cache performance, especially given the factthat there are various kinds and versions of browsers.The implementation includes the size of the cache space,whether the cache is volatile, and the accordance to specifi-cations.• End User Behaviors. End users have their own habits formobile Web browsing. These habits include which websitesthey usually visit, how often they visit each website, howmuch of a website they would like to visit, etc. End user be-haviors also include how end users interact with a browser,such as whether they clear browsing histories when closingthe browser, whether they often use the refresh button toreload a page etc. Caching performance heavily depends onuser habits and behaviors.

The type of website dominates the cacheability of re-sources and thus gives an upper bound of cache perfor-mance. While browser runtime is responsible for imple-menting the cache logic, Web administrators and developershave the most impact on the utilization of cacheability sincethey have to configure the cache parameters. Otherwise,even if the browser perfectly implements the cache mecha-nisms, cache can hardly make a difference when configuredimperfectly or even wrong. End-user behaviors dominatethe actual cache performance that can be achieved. For

http://foo.com/resource.js?v=1


example, given the same website, the actual cache perfor-mance may vary a lot among different revisiting intervals.

3 MOTIVATION AND PROBLEM STATEMENT

Indeed, cache is one of the most conventional technolo-gies adopted widely in computer science. The traditionalWeb caching research mainly targets at how to reducethe overall network traffic for the Internet. Infrastructuresupports are proposed to improve the global Web’s perfor-mance and reduce server workloads [12], [13], [14], [15]. Formobile computing, cache was mainly investigated in theMANET environment, such as a push-oriented [16], pull-oriented [17], or hybrid-based [18] approach. These studiestry to maintain cache consistency between MANET and theoriginal servers with the lowest cost, i.e., data traffic andenergy consumption. When it comes to mobile Web, studieson Web cache are shifted to the performance for end users.

Compared to previous efforts, our work is to conduct acomprehensive study on the performance of mobile Webcaching. Specifically, we intend to answer the followingthree research questions.

• RQ1: What percentage of Web resources can becached ideally? By answering this question, we studythe upper bound of the benefit of Web caching.We consider not only the redundant transmissionof the same resource (identified based on its uniqueURL), but also the redundancy of the resources withdifferent URLs.

• RQ2: What percentage of the cacheable Web re-sources are actually cached if the client is perfectlyimplemented according to the specification? By an-swering this question, we uncover how well devel-opers leverage the caching mechanism in practice,and reveal the gap between the ideal and actualperformance of mobile Web caching.

• RQ3: What causes the gap between ideal and ac-tual performance? By answering this question, weinvestigate the root causes of the undesirable ac-tual performance of Web caching, and give Webdevelopers recommendations on how to improve theperformance of mobile Web caching.

Answering these questions requires tracking the wholeresource update history of websites. This requirement can-not be met by existing passive approaches of data collection.Therefore, we propose a new proactive approach of datacollection. We next present how it works.

4 DATA COLLECTION

In this section, we describe how we proactively crawl Webresources and what data sets are used to conduct our study.

4.1 Proactive Data Collection ApproachWe propose a proactive approach to capture the resourceupdate history of a specific website. For a given URL, weperiodically visit the corresponding webpage with a shorttime interval and for a long time duration. For each visit,we first clean the cache and then record all the returnedresources. The short revisiting interval not only makes itpossible to obtain the complete resource update history, butalso enables us to make flexible analysis with various timegranularities. The long data-collection time duration ensuresenough samples and lowers the bias. We develop a tool torealize our approach and Figure 2 shows its architecture.

Fig. 2. Data collection tool

Our tool runs on a PC using the Chrome browser. Weenable the Chrome browser’s emulation mode to make it actas a mobile browser (an Android 4.2 native browser in ourcase) so that a website returns its mobile-version webpages(if the website has a mobile version). It is important to use areal browser to crawl data. The reason is that many websitesembed JavaScript code, and more resources may be loadedby the JavaScript-code execution. Only when the webpageis actually rendered by a browser, can we obtain the wholeset of resources. Furthermore, using a PC-version browserhelps us leverage the browser’s programmability, which isnot easy on Android smartphones.

We use the Charles proxy [19], a commercial product,to act as the middle tier for data collection. Charles canrecord each TCP connection passed through it and saveHTTP traces in structured data. The data structures splitHTTP protocol elements into different fields, making iteasy to perform further analysis. In addition, Charles hasa programming interface to ease the data recording andsaving. We install Charles on the same PC and configurethe Chrome browser to use Charles as the proxy. We disablethe caching feature of Charles to ensure the completeness ofcrawled resources.

We build a Chrome extension to automate our datacollection. The extension takes a list of URLs and controlsthe Chrome browser to visit the URLs one by one. Beforeeach visit, the browser cache is cleaned in order to en-sure that all the resources are loaded from the network.When a Chrome TabComplete event is detected, the extensionregards that the website has finished loading. Note thatsome websites utilize lazy loading, and thus there are stillresource requests after TabComplete is fired. To mitigate suchresource incompleteness, the extension is set to wait fortwo seconds after a TabComplete event. Finally, the extensioninvokes the Charles programming interface to save all theresource request records on the disk. We can also configurethe extension to periodically crawl a list of URLs with agiven time interval.

4.2 Data SetsFor a comparative study, we choose two sets of websitesfrom the Alexa [20] ranking list in January 2014. The first set(Top) contains Alexa’s top 100 websites. We expect that theircache performance is highly optimized and regard them asthe state-of-the-art of Web caching. The other set (Random)contains 100 websites randomly selected from Alexa’s top1,000,000 list. We assume that their cache performance isnot fully optimized and regard them as the average level ofcache performance.

We manually checked each website in the two sets andfiltered out the following ones.• Unreachable websites. Some websites were not reachablebecause they were shut down at the time when we visited.


Unreachable websites occurred in only the Random data set.• The same websites with different domain names. Forexample, in the Top data set, Google appears 16 timessuch as google.in, google.de, and google.jp. Since theyhave the same functionality and appearance, we keep onlygoogle.com in our Top data set. However, we treat yahooand yahoo.jp as two different websites because their con-tents are totally different.• HTTPS-only websites. As HTTPS prevents us from cap-turing the Web resources transmitted over the encryptednetwork connection, we exclude those websites that supportonly the HTTPS protocol.

After the filtering, 146 websites remain in total, 55 in theTop data set, and 91 in the Random data set. Note that notall of the 146 websites have a mobile version customizedfor mobile browsing. According to our manual checking, 7of the 55 Top websites and 56 of the 91 Random websitesdo not have their mobile versions. However, the goal ofour measurement is to study how Web caching performson mobile devices for all websites, not only the ones with amobile version. Therefore, we do not distinguish mobile ornon-mobile websites.

For these 146 websites, we used our data-collection toolto periodically crawl their homepages for one week. Thecrawling time interval was 30 minutes, being short enoughto capture all the resource changes. Thus, in this one week,we visited each website for more than 300 times. In total wecrawled 157 GB data, 73 GB for the Top websites, and 84 GBfor the Random websites.

5 MEASUREMENT METHODOLOGYIn this section, we formally define the measurement metricsused to study Web cache performance. We also design a tax-onomy to map the relationship between cache specificationsand actual resource updates.

5.1 Metric DefinitionCache works only when a website is revisited. So we con-sider two visits. At the first visit, the browser cache is emptyand all the resources needed to render the website haveto be loaded from the network. After a given RevisitingInterval (RI) ∆t, the same website is visited for the secondtime, where some of the resources are loaded from thebrowser cache while others are loaded from the network.We examine how the browser cache behaves for the secondvisit.

We define St = rt as the resource set fetched by thebrowser at time t under the condition that the browser cacheis empty. rt ∈< URL,Content > is a resource entry, whereURL (including the domain, port, path, and query) is usedto identify the resource, and Content is its actual entity. Aresource can be updated when time goes on. St representsthe up-to-date resources of the website at time t.

We define

CAt(∆t) = rt ∈ St|∃rt−∆t ∈ St−∆t, rt.Content = rt−∆t.Content

as the resources (at time t) that have already been loadedfrom the network at the previous visit before ∆t. Note thatwe do not enforce whether the resource’s URL is the same ornot. We consider only the resource’s content that is actuallyused by browsers to render websites. In fact, CAt(∆t) arethe resources that can benefit from caching.

Based on the preceding definitions, cacheablity Ψ(∆t)can be formulated as the proportion of CAt(∆t) in St:

Ψ(∆t) =CAt(∆t)

St

Ψ(∆t) measures what percentage of a website can be cachedafter ∆t. Larger Ψ(∆t) means more resources are cacheable.It should be pointed out that cacheablity depends on ∆tbecause resource updates are different when a website isrevisited after different ∆t.

We define

Ct(∆t) = rt−∆t ∈ St−∆t|rt−∆tis configured to be cachedas the resource set in the browser cache at time t. Ct(∆t)represents the resources stored in the cache on the previousvisit at time t−∆t. Note that Ct(∆t) ⊆ St−∆t because someresources in St−∆t may be configured not to be stored in thebrowser cache.

According to the cache mechanism, Ct(∆t) = EXt ∪FRt, indicates that some of the cached resources are expired(denoted as EXt) while others are still fresh (denoted asFRt). The browser first loads a resource from the cache ifits URL is found in FRt, in which case we say the resourceis hit. Define Ht = rt ∈ St|∃r ∈ FRt, rt.URL = r.URLas the resource set loaded from the cache. Otherwise, nomatter when the resource URL is found in EXt or cannotbe found in Ct(∆t), the browser loads it from the net-work, in which case we say the resource is miss. DefineMt = rt ∈ St|(∃r ∈ EXt, rt.URL = r.URL) ∨ (∀r ∈Ct(∆t), rt.URL 6= r.URL) as the resource set loaded fromthe network.

However, the cache policy may not be set properly sothat expired resources could still be up-to-date. In addi-tion, considering only the resource URL can bring mistakesbecause the content is what actually matters for renderingwebsites. Therefore, we further define positive and negativepatterns for Ht and Mt, respectively.

Positive Hit (PHt(∆t)). Given rt ∈ Ht, rt ∈PHt(∆t) ⇐⇒ ∃rt−∆t ∈ St−∆t, rt.URL = rt−∆t.URL ∧rt.Content = rt−∆t.Content, which means that the re-source is loaded from the cache and it is the same as theup-to-date version on the server. Larger PHt(∆t) meansmore resources are correctly served from the cache.

Negative Hit (NHt(∆t)). Given rt ∈ Ht, rt ∈NHt(∆t) ⇐⇒ ∃rt−∆t ∈ St−∆t, rt.URL = rt−∆t.URL ∧rt.Content 6= rt−∆t.Content, which means that the re-source is loaded from the cache but its content has alreadybeen updated on the server. In this case, the browser shouldacquire the updated resource. Larger NHt(∆t) means morestale resources are used to render the website, and wrongbehaviors are more likely to happen.

Positive Miss (PMt(∆t)). Given rt ∈ Mt, rt ∈PMt(∆t) ⇐⇒ ∃rt−∆t ∈ St−∆t, rt.Content 6=rt−∆t.Content, which means that the resource is loadedfrom the network and its content has never occurred inSt−∆t. It is either a new resource or an updated version of acached resource. Larger PMt(∆t) indicates more resourcesare newly involved by the website or changed during thetime.

Negative Miss (NMt(∆t)). Given rt ∈ Mt, rt ∈NMt(∆t) ⇐⇒ ∃rt−∆t ∈ St−∆t, rt.Content =rt−∆t.Content, which means that the resource is loadedfrom the network but its content has ever been loaded by thebrowser before. It is either stored in the cache or configurednot to be cached. In this case, the browser should directlyuse the previously loaded resource. Larger NMt(∆t) indi-cates more redundant transfers.

Based on the preceding definitions, we can formulatethe actual cache performance Φ(∆t) as the proportion ofPHt(∆t) in PHt(∆t) ∪NMt(∆t):

Φ(∆t) =PHt(∆t)

PHt(∆t) ∪NMt(∆t)


TABLE 1Cache performance metrics

Metric DescriptionPHt(∆t) The proportion of resources that are correctly

served by the cache when the website is revis-ited after ∆t.

NHt(∆t) The proportion of resources that should havebeen fetched from the server but are actuallyprovided by the cache when the website isrevisited after ∆t.

PMt(∆t) The proportion of resources that are correctlyfetched from the server when the website isrevisited after ∆t.

NMt(∆t) The proportion of resources that should havebeen provided by the cache but are actuallyfetched from the server when the website isrevisited after ∆t.

Ψ(∆t) Cacheability of a website with revisiting inter-val ∆t, giving an upper bound of how much awebsite can benefit from caching.

Φ(∆t) Actual cache performance of a website withrevisiting interval ∆t, showing how well a web-site has utilized caching.

Ω(∆t) Negative Miss Ratio with revisiting interval ∆t,indicating how much data traffics are wastedby redundant transfers.

In addition, we can measure the problems of redundanttransfer by Negative Miss Ratio Ω(∆t):

Ω(∆t) =NMt(∆t)

PMt(∆t) ∪NMt(∆t)

Larger Ω indicates more resources loaded from the networkcan actually be loaded from the cache.

Table 1 summarizes the metrics and their meanings tomeasure mobile Web cache performance. Meanwhile, as wechoose the resource size and number as the quantitativeparameters, subscripts s and n are attached to the metricsrepresenting which parameter is used to calculate the metricvalues. For example, Ψs and Ψn denote the cacheabilitycalculated by the resource size and number, respectively.

5.2 Cache Behavior TaxonomyIn order to calculate the preceding metrics, we correlate thecache configurations and resource updates by building acache behavior taxonomy (Figure 3).

When a webpage is visited, the browser looks up in itscache table for each resource represented by the URL.

If the URL is found in the cache table (URLHit), the cachemechanism is ready to work. The browser first checks ex-piration to determine the freshness of the cached resources.There are two expiration policies. One is the server-specifiedexpiration relying on the origin server to provide an explicitexpiration time. The other is the heuristic expiration relyingon the browser to assign an estimated value for thosewithout explicit server-specified expiration time.

For the heuristic expiration, no matter whether the re-source content is changed or not, any of the four cachebehavior patterns is possible to happen as its freshnesscannot be explicitly determined. In the worst case, we canassume the ContentChanged state of heuristic expiration asNH and the ContentNotChanged state as NM. In the follow-ing discussion, all the analyses are for the worst case unlessspecially stated.

Fig. 3. Cache behavior taxonomy

For the server-specified expiration, there are three possi-ble policies. The first is NoStore, indicating that the resourcescannot be persistently stored on the hard disk. The second isNoCache, indicating that although the resources should notbe cached by the browser, they can be persistently stored onthe hard disk. The ContentChanged and ContentNotChangedstates of both NoStore and NoCache are classified as PM andNM, respectively. In the third case, according to the expira-tion time, we can determine whether the cached resourcesget expired or stay fresh. If expired, two further branchesValidated and NotValidated can be derived from the cache val-idation model, where the last modified time or Etag is usedfor the freshness check with the origin server. Combinedwith resource update states, we can classify each leaf nodewith the ContentChanged state as PM or ContentNotChangedstate as NM. For the Fresh node, NH happens if the contentis changed, while PH happens if the content is not actuallychanged.

If the URL is not found in the cache table (URLMiss), itdoes not mean that the resource cannot be served by thecache. As aforementioned, resources with different URLsmay have the same content. We use ContentFound to denotethis case and classify its cache behavior pattern as NM. Thelast branch of New indicates completely new resources thathave to be downloaded from the network, and thus weclassify it as PM.

6 MEASUREMENT RESULTSThis section presents our detailed measurement results fromthree aspects: the gap between the ideal and the actualcache performance (Section 6.1), the problems caused byundesirable cache performance (Section 6.2), and the rootcause analysis (Section 6.3).

6.1 Ideal and Actual Cache PerformanceWe first show the cacheability of websites and then presenthow much cacheability is realized in practice, uncoveringthe gap between the ideal and actual cache performance.Figure 4 depicts Ψ and Φ in a bar chart with different RIsand for different resource types. Based on our methodology,each bar represents the cacheability (denoted by Ψ) of corre-sponding resource types or all resources with the given RI,where the dark-color part stands for those actually cached(PH) resources and the light-color part represents resourcesthat have not been cached (NM). Therefore, the actualperformance (Φ) can be easily observed as the proportionwhich the dark-color part occupies in the whole bar.


(a) Ψs and Φs of aggregated websites and resource types (b) Ψn and Φn of aggregated websites and resource types

Fig. 4. Ideal and actual cache performance

6.1.1 Potential Benefits of Caching

To answer the first research question of “What percentage ofWeb resources can be cached ideally”, we calculate Ψ of thewebsites in the two data sets, shown as the height of eachbar in Figure 4. We can make the following observations.

A high percentage of the total resources are cacheableand thus caching may be substantially helpful. For theTop websites, 56.4%-72.7% of the total resource size (Ψs)and 69.5%-83.2% of the total resource number (Ψn) arecacheable, depending on the values of RI. For the Randomwebsites, the percentages are even higher: 63.2%-79.0% inΨs and 72.1%-84.5% in Ψn. These results demonstrate thatcaching has great potential to help improve the performanceof mobile Web browsing if caching is correctly configuredand implemented. For the websites that users visit dailyor even more frequently (e.g., news websites and searchengines), caching can substantially reduce the cost of Webbrowsing on mobile devices, in terms of network band-width, computation, and energy. Furthermore, as RI in-creases, Ψ declines. This phenomenon is reasonable becauseresources are more likely to change when users revisit awebsite after a longer period and thus the probability ofcache hit becomes lower.

Caching is more helpful for the Random websites thanfor the Top websites. No matter what value RI has, Ψs andΨn of the Random websites are always larger than those ofthe Top websites. For example, when RI is 24 hours, Ψs ofthe Random set is 6.8% higher than Ψs of the Top set. Inother words, the Random websites are more cacheable thanthe Top websites. The reason is that resources are less likelyto change for ordinary websites than for popular websites.

Next, we examine how different types of resources canbenefit from caching.

HTML is the least likely to benefit from caching.Interestingly, HTML is the least cacheable type of Webresources. For example, Ψs of HTML for the Top websites isas small as 4.5%-9.1%. This result may be due to that modernwebsites are becoming increasingly dynamic to meet theever-growing user requirements. Rich client technologies,such as Asynchronous JavaScript + XML (AJAX) program-ming, can change HTML DOM trees on-the-fly and makeHTML less cacheable. Ψn of HTML (34.2%-41.1% for the Topwebsites) is larger than Ψs of HTML, because small HTMLpages are less likely dynamic. In addition, Ψ of HTML forthe Random websites is larger than the one for the Topwebsites, indicating that the Random websites are relativelymore stable than the Top websites.

CSS is the most cacheable resources. Compared to othertypes of resources, Ψ of CSS is the highest: 74.6%-89.4% forthe Top websites, and 72.0%-91.4% for the Random websites.This observation can be explained by the nature of CSS:although the content of a webpage may change frequently,the layout and style of the webpage may be more stable.

JS of the Top websites benefits more than that ofthe Random websites. The cacheability of JS for the Topand Random websites is very different from overall resultsand other resource types. Given an RI, Ψ of JS for the Topwebsites is always larger than that of the Random websites.Furthermore, Ψ of JS for the Top websites does not changemuch as RI changes (70.9% to 72.3%), but the variance ofΨ of JS is rather large for the Random websites (54.1%to 72.1%). This observation reflects that the Top websitesremain more stable on functionalities than the Randomwebsites.

Images mostly benefit from caching for short RI. AsRI increases, Ψ of images decreases a lot (about 30% forthe Top websites and 20% for the Random websites). Thisphenomenon is reasonable since new images may take placeof old ones after a longer time.

6.1.2 Actual Cache PerformanceWe further calculate Φ of the two data sets, to explore “Whatpercentage of the cacheable Web resources are actually cached” forthe websites that we captured, shown as the percentage ofthe dark-color part in each bar in Figure 4.

Compared to the Top websites, the Random websitesdo not make best use of caching. Φ of Random is alwayssmaller than Φ of Top. More than 80% of the cacheable bytesfor Top have been actually cached in all cases, but just above50% of the cacheable bytes for Random have been actuallycached in the best case within 0.5 hour RI.

Small-size resources are not taken into considerationvery well. Although Φs of Top is high, Φn is relativelylow. For 24 hours RI, Φs is above 80% while Φn is justnot more than 50%. This observation implies that the Topwebsites pay more attention on larger-size resources butthere are still a number of resources not being considered.They occupy small proportion in the total resource size butconstitute the large proportion of the total resource number.Network connections may be wasted to download thesesmall resources.

The cacheability of HTML is almost not utilized at all.Both Φs and Φn of HTML for Top are smaller than thoseof Random. In the worst case with 24 hours RI, Φs of Top


is just 2.8%. For Random, Φs is not above 50% in the bestcase with 0.5 hours RI. Generally, it seems not possible tomake HTML cached as it can hardly be predicted when theHTML is changed. Therefore, popular websites give up thecacheability of HTML.

The resource type with the highest cache performanceis image for Top (84%-93%) and JS for Random (51%-70%).It is surprising that images of the Random websites do notmake the best use of cache, where both Φs and Φn are evennot above 50% in the best cases.

In summary, there is a big gap between ideal and actualcache performance. For the Top websites, the gap is mostlyoriginated from the large number of small-size resources.For the Random websites, all types of resources’ cacheabilityhave not been properly considered.

6.2 Problems Caused by Undesirable Cache Perfor-manceWe then identify two major problems caused by undesirablecache performance: Redundant Transfer introduced by NMand Miscached Resource introduced by NH. These problemscan negatively affect the user experience of mobile Webbrowsing.

6.2.1 Redundant TransferThe Redundant Transfer problem comes from NM whenresources could be fetched from the cache but are actuallydownloaded from the network. Figure 5 shows the CDF ofΩs and Ωn of the Top and Random websites with differentRIs.

Redundant Transfer is a significant issue for mobileWeb browsing and the issue is worse for the Randomwebsites than the Top ones. Considering 24 hours RI, themedium Ωs of Top is about 10% while it reaches morethan 40% for Random, i.e., 40% of the transferred bytes areredundant. The result indicates that Redundant Transfer isa common problem for mobile Web browsing while the Topwebsites indeed make some efforts to reduce it. However, interms of the resource number, Ωn of both Top and Randomis relatively high where the medium value is about 50% and75%, respectively.

Redundant Transfer is worse for short RI. Ω becomeshigher as RI decreases for both Top and Random, rep-resented by that CDF curves shift to the right when RIdecreases in Figure 5. In the worst case, the medium Ωs

of Random is more than 80% with 0.5 hour RI. As a result,a larger proportion of resources requested from the networkare redundant when users revisit a website after a shorterinterval. However, as discussed earlier, caching is moreeffective when RI is shorter. Therefore, caching is far from itscapacity and it could be improved a lot if NM was reduced.

Some websites have 100% Ω, indicating that all therequested resources are redundant for these websites. Forexample, about 10% of Top and 20% of Random have 100%Ω with 24 hours RI. Table 2 lists the websites of 100% Ωfor Top considering 24 hours RI. It is surprising that somepopular websites are in the list, such as Baidu, Wikipedia,and Apple. Since users may visit these popular websitesmany times every day, the waste of data traffic and energycould be very large.

We then examine the Redundant Transfer problem fordifferent resource types. Figure 6 shows Ωs’s CDF withdifferent RIs for both Top and Random, corresponding tothe resource types of HTML, CSS, JavaScript, and Image.

For all kinds of resources, Ω of Random is larger thanthat of Top, represented by that CDF curves shift to the

TABLE 2Top websites with 100% Ω

Website Rank CategoryBaidu 5 Searchwikipedia 6 News & ReferenceWordpress 20 Misc360 22 CorporationSoso 29 SearchAsk 35 SearchInstagram 37 EntertainmentApple 45 CorporationAlibaba 68 E-commerce

right bottom by comparing the two figures of the sameresource type for Top and Random. This result implies thatordinary websites do not consider cache much for each kindof resources.

For HTML (6(a) and 6(e)), Ω of Top does not varya lot while Ω of Random decreases as RI increases. Thisobservation implies that HTML of the Top websites is moredynamic while HTML of the Random gets updated in afixed period.

For CSS (6(b) and 6(f)), when RI increases, Ω of bothTop and Random increases, implying that more redundanttransfers could occur after a longer RI. The reason may bethat both Top and Random leverage the cache expirationmodel for CSS but the expiration time is too conservative. Asa result, when the CSS is revisited after a shorter interval, itis served by the cache. In contrast, the CSS is loaded from thenetwork when the page is revisited after a longer interval,but the resource is actually not changed. It is also interestingthat CDF of CSS is a traverse line, indicating a polar effectthat some websites’ CSS are no NM while others are all NM.

For JS (6(c) and 6(g)), Ω of both Top and Random doesnot vary so much as RI changes, indicating that there are afixed set of NM in JS.

For image (6(d) and 6(h)), when RI increases, the Ωdecreases. Random decreases in a small margin while Topdecreases larger, implying that images of the Top websiteschange more frequently than the Random websites.

6.2.2 Miscached ResourceDue to the expiration model, it is likely to provide the out-of-date resources from the browser cache. In most cases,developers expect that users would tolerate these miscachedresources such as acquiring information that is not the lat-est. However, in some circumstances, miscached resourcescould lead to incorrect functionalities. For example, duringthe evolution of Baidu, it changes many JS and CSS re-sources to improve its user interaction. However, some pre-vious JS and CSS resources are configured to a considerablelong expiration period. Therefore, users are even not ableto use Baidu’s search functionality until the users clear thebrowser cache, or force to refresh the page, or wait until theexpiration time comes. It is worth exploring the MiscachedResource problem. In fact, NH measures the extent of thisproblem. Table 3 shows the result of NH.

NH takes a small proportion but we cannot ignoreit. NH of Top and Random share a similar proportion.Although NH of the resource number is no more than 3%,NH of the resource size is about 7%. Note that NH of Topand Random share a similar proportion, implying that bothTop and Random have not considered much for NH.

NH of HTML is relatively large. NH introduces a lotof expired content. Developers may think that users could


(a) Ωs of Top (b) Ωs of Random (c) Ωn of Top (d) Ωn of Random

Fig. 5. Overall Negative Miss Ratio

(a) Ωs of Top HTML (b) Ωs of Top CSS (c) Ωs of Top JS (d) Ωs of Top Image

(e) Ωs of Random HTML (f) Ωs of Random CSS (g) Ωs of Random JS (h) Ωs of Random Image

Fig. 6. Ωs of different resource types for Top and Random

TABLE 3NH of aggregated websites and resource types

24h 0.5hNHs NHn NHs NHn

AllTop 6.4% 2.3% 8.1% 3.0%AllRandom 7.7% 2.9% 9.0% 2.8%HTMLTop 35.7% 18.7% 40.0% 19.9%HTMLRandom 24.6% 10.3% 21.5% 9.1%CSSTop 8.3% 5.6% 12.1% 7.7%CSSRandom 13.5% 2.8% 10.5% 2.9%JSTop 4.8% 2.1% 10.4% 4.1%JSRandom 10.5% 4.3% 16.6% 5.3%ImageTop 0.1% 0.3% 0.1% 0.2%ImageRandom 1.0% 1.3% 0.5% 0.8%

bare the expired content so that temporally expired contentis also acceptable.

NH of image is the lowest but not zero because somefunctional images (such as borders and buttons) may changewhen websites get updated.

6.3 Analysis of Root CausesTo study the third research question “What causes the gapbetween ideal and actual performance”, we investigate the cachebehaviors belonging to the NH and NM patterns, finding

misconfigurations that lead to undesirable performance. ForNH, according to the taxonomy in Figure 3, two states,Heuristic and Fresh, are related with NH. By calculating theproportion of these two states in the total NH, we find thatHeuristic accounts for 60% and Fresh is 40%. As a result, ifall resources are assigned a proper expiration time, then theproportion of NH resources could be reduced. However, itis not possible to eliminate NH resources under the currenttime-based expiration policy. Although longer expirationtime results in NH, shorter expiration time can bring NM aswell. Therefore, it could be better if new URLs are assignedto changed resources.

For NM, we break down the cache behaviors of NMand plot the CDF of each case in Figure 7. As explainedin Section 5.2, there are six cache behaviors belonging tothe NM pattern. We calculate the proportion of each cachebehavior in the total NM and plot them on a CDF graph.Figure 7 shows the cases of the resource size for Top andRandom with 24 hours and 0.5 hours RI.

Heuristic and Same Content are the top two causes toNM for both Top and Random. For Top, Same Content isthe most significant issue. For Random, Heuristic is the mostsignificant issue while Same Content is in the second place.The distributions of Heuristic and Same Content are notaffected by RI. For Heuristic, as we consider the worst case,RI is not an influencing factor and thus it occupies the sameproportion in each RI. For Same Content, the distribution is


(a) Proportion (Top, 24h) (b) Proportion (Random, 24h) (c) Proportion (Top, 0.5h) (d) Proportion (Random, 0.5h)

Fig. 7. Breakdown of NM for all resources

(a) NM of Top HTML (b) NM of Top CSS (c) NM of Top JS (d) NM of Top Image

(e) NM of Random HTML (f) NM of Random CSS (g) NM of Random JS (h) NM of Random Image

Fig. 8. Breakdown of NM for different resource types

similar because the same resources have different URLs atevery visit.

Checked is the third reason for NM. Checked happenswhen the resources are expired in the cache (but actuallyare up-to-date on the server) and validated with the originalserver by last-modified time or Etag identifiers. Althoughthe validation is successful to eliminate sending a full re-sponse, the protocol overheads account for some redundantsize. Therefore, validating each resource’s status may beharmful for mobile Web browsing. The reason of Checked isthat the expiration time is configured too short.

No-store and No-cache occupy small proportion, indicat-ing that developers are careful when configuring resourcesto no store or no cache. Therefore it is less important to con-sider such cases to improve mobile Web cache performance.

Figure 8 shows the breakdown of NM for differentresource types with 24 hours RI. For HTML, just like thetotal resources, Heuristic and Same Content are the top twocauses. But the No-check and No-store have higher percent-ages than Check. This observation reflects that developersare less likely to use a validation mechanism for HTMLbecause HTML could always be changing for new contents.For CSS and JS, Checked takes a higher proportion thanthe other resource types as well as comparing to the totalresources. Checked is even the most significant issue of TopCSS and Top JS. For CSS, Same Content is not an importantissue, occupying a rather small proportion. Image has themost similar distribution with the total resources. This result

is reasonable since images dominate the largest share of thetotal resource size.

We next deeply analyze Heuristic Expiration, which isthe main source for both NH and NM, as well as SameContent and Conservative Expiration Time, which are themain sources of NM.

6.3.1 Heuristic ExpirationWe first investigate the Heuristic Expiration issue, whichoften happens when developers are unaware of the im-portance of cache and do not explicitly configure cacheparameters. Heuristic Expiration is an important issue forthe Random websites because their developers may be un-skilled. However, Heuristic Expiration also takes the secondplace in the Top websites.

Table 4 shows the proportion of Heuristic Expirationresources as well as the average updating cycle (the intervalbetween two updates). Totally, 6% and 33.7% of Top andRandom resources are configured as Heuristic Expiration.For Top, HTML is the largest (37.4%). For Random, allproportions are above 20%, meaning that more than 1/5of resources have not explicitly utilized the caching mecha-nism.

Since different Web browsers have different algorithmsto deal with Heuristic Expiration, we just consider the worstcase in the previous analysis. Here we calculate the averageupdating cycle to study what is the proper expiration timefor these resources. We can see that there is a big difference


TABLE 4Resources of Heuristic Expiration

Top RandomPct. Cycle Pct. Cycle

HTML 37.4% 3.7 h 29.9% 29 hCSS 1.9% 1.0 h 37.3% 134 hJS 3.8% 7.2 h 20.4% 143 h

Image 2.6% 1.0 h 45.2% 133 hOthers 2.6% 4.0 h 24.5% 90 hTotal 6.0% 5.9 h 33.7% 107 h

between Top and Random. Considering all the HeuristicExpiration resources, the cycle is 5.9 hours for Top whileit is 107 hours for Random. This observation implies thatresources configured as “Heuristic Expiration” can benefit alot from caching if their expiration time is set to day-level.In contrast, the proper expiration time for Heuristic Expi-ration resources of Top is hour-level. Considering differentresource types, JS has the longest updating cycle for bothTop (7.2 hours) and Random (143 hours). HTML has theshortest cycle for Random (29 hours).

6.3.2 Same ContentNext, we study the Same Content issue, as it has a significantinfluence on both Top and Random. As we described earlier,Same Content represents a case that a resource’s content isfound in the cache but the URL is different. Since the HTTPspecification regards the URL as the index of resources forcaching, there is no means to use the cached resource ifthe URL cannot be found in the cache. By examining theresource request records, we find that there are two majorcauses of Same Content.

(function() var APC_test_rand = Math.floor(Math.random() * 10000);if (APC_test_rand < 10000)

setTimeout (function() var s = document.createElement("script");document.getElementByTagName(’head’)[0].appendChild(s

);s.src = "http://jqmt.qq.com/cdn_dianjiliu.js?a=" +

Math.random();, 5000);

)();

Listing 1. An example of random strings attached to URLs leading toSame Content

Version management of resources. In order to provideend users the up-to-date resources, some best practices ofWeb development suggest that developers attach randomquery strings to URLs. Listing 1 shows an example of suchrandom strings. It is a piece of JS code whose functionis to load another JS from the URL “http://jqmt.qq.com/cdn dianjiliu” (Lines 5-7). A random string generated byMath.random() is attached to the URL. Therefore, every timethe webpage with the resource’s reference is visited, a newquery string is generated and the browser treats the resourceas a new one to download from the server. In other cases,some websites, such as Youku, use different directoriesto manage resource versions. Each time the website getsupdated, a new directory with the new version numberis created. Since an update is not likely to change all theresource files, Same Content happens if those unmodifiedresources are requested from the new directory.

Web hierarchy architecture. Figure 1 has shown theWeb hierarchy architecture that there are middle-boxes toimprove the Web performance such as CDN and reverse

Fig. 9. Expiration time distribution of Top and Random

proxy. To leverage these middle-boxes, Web servers have togenerate different URLs or attach hash code of route pathsto resources. As a result, one resource could have differentURLs when requested at different times or at differentplaces, leading to Same Content.

6.3.3 Conservative Expiration TimeFigure 9 shows the resource expiration time (indicated byMax-age primitive) distribution. 40% and 24% of the re-sources for Top and Random, respectively, are configuredby less than one-day expiration time. This result impliesthat the Top websites are more conservative on expirationbecause their resources change more frequently than theRandom websites. However, we find that as many as 89%and 86% resources with less than one-day expiration timehave no updates for Top and Random, respectively. There-fore, the expiration time can be extended longer. However,longer expiration time may introduce the Miscached Re-source problem. As a result, it is actually a tradeoff betweenwebsite functionality and cache performance.

7 IMPLICATIONS

Table 5 summarizes the findings of our measurement re-sults. Based on these findings, we can draw several implica-tions.• Cache mechanisms. According to Finding 1, caching hassubstantial benefits for mobile Web browsing, especiallyfor ordinary websites. But the cacheability has not beeneffectively exploited. There exists a big gap between theideal and actual cache performance. According to Finding3, data traffic of mobile Web browsing is wasted mostly forfrequently visited websites due to Negative Miss resources.According to Finding 4, miscached resource is a commonproblem for all the websites due to the current time-basedexpiration mechanism. Popular websites may have wronglayout and view while ordinary websites could suffer fromwrong functionalities.• Web administrators. According to Finding 1, administra-tors from popular websites indeed make efforts on cachinglarge-size resources but do not consider small-size resourcesvery well. Administrators from ordinary websites performworse for both cacheable bytes and objects. According toFinding 2, Web administrators should configure differentcaching policies for different types of resources based onthe characteristics of websites. Administrators from pop-ular websites could consider more for caching JS, whileadministrators from ordinary websites should check thecache configurations for all resources especially the images.

http://jqmt.qq.com/cdn_dianjiliu

http://jqmt.qq.com/cdn_dianjiliu


TABLE 5Summary of findings

No. Findings from Measurement Study1 More than half of the total resource bytes and about 70% of the resource objects can be cached after 24 hours. Although

80% of the cacheable bytes are actually cached for Top, the percentage of cached objects is under 50%. For Random, bothcached bytes and objects occupy not more than 50% of the cacheability.

2 The cacheability varies for different types of resources, and the actual performance is also different. The results betweenTop and Random are not the same either.

3 The Redundant Transfer problem is worse when a website is revisited after a shorter period. Some popular websiteseven have 100% Negative Miss Ratio.

4 NH of Top and Random share a similar proportion. HTML accounts for the largest NH. NH of CSS for Top is largerthan for Random, while JS is on the contrary. NH of images is not 0.

5 Same Content accounts for the main source of NM. More than 20% of NM resources are resulted from Same Content forhalf of Top and 25% of Random. Same Content is caused by the version management and Web hierarchy architecture.

6 6% and 33.7% of Top and Random resources are configured as Heuristic Expiration, but the average updating cycle is5.9 hours and 107 hours for Top and Random, respectively.

7 40% and 24% of the resources for Top and Random are configured by less than one-day expiration time. But 89% and86% of them have no updates for Top and Random, respectively.

According to Finding 5, although the Web hierarchy ar-chitecture reduces the overall network traffic and releasesserver workload, it is harmful for user-centric mobile Webbrowsing. End-user based content distribution may helpsolve the problem. According to Finding 6, it is better to setan explicit expiration time than using the Heuristic Expi-ration. Ordinary websites can set day-level expiration timeto their resources while popular websites can set hour-leveltime. According to Finding 7, administrators can extend theexpiration time for some resources that are not likely tochange for a considerably long time. But longer expirationtime could cause the Miscached Resource problem.• Web developers. According to Findings 3 and 5, currenttechniques to maintain website versions, such as randomquery strings, are not appropriate for mobile Web browsing.New techniques have to be adopted such as the HTML5AppCache interface [21]. A manifest file can be attached toeach webpage, specifying what to cache and what to loadfrom the network. Resources in the cache list are alwaysloaded from the cache until the manifest file is updated.

Our results are derived from only the Chrome browserin this article. However, the measurement approach itselfcan be generalizable to other Web browsers such as Safariand FireFox, and hybrid apps that are built atop browserengines such as WebKit. It would be interesting to explorethe “imperfectness of client-side cache mechanisms on otherbrowsers.”

8 THREATS TO VALIDITYConsiderable care and attention should be addressed toensure the rigor of this measurement study. However, aswith any chosen research methodology, it is hardly withoutlimitations. In this section, we present threats to validity inour study.

A threat to validity is the size of our dataset. In ourdataset, we have the 51 websites from Alexa’s Top 100 listand the 91 websites randomly chosen from Alexa’s Top1,000,000 list. The total number of websites may not belarge enough to represent the average situation, leading toconclusions that may not be generalizable. However, sincewe intend to conduct a comparative study, it is essential tochoose representative websites for the two counterparts. Weassume that the more popular a website is, the more effortcould be taken to optimize its cache performance. Therefore,it is reasonable to consider that the websites from Top 100show the state-of-the-art cache performance.

We consider only the home page (or the landing page)for each website. The home page is the first view that userscan see when they visit a website. So Web developers haveto make a lot of efforts to optimize their home page forfast and efficient page load. In addition, resources of thehome page are likely to have a large convergence withthose in subsequent pages, which are mostly designed tobe consistent in page style with the home page. Therefore,the cache performance of the home page can also reflect thestatus of other pages.

Our study excludes the websites transferred via theHTTPS protocol. However, it is a big trend that more andmore websites, especially their mobile versions, are beingmigrated from HTTP to HTTPS for enhancing security. Ex-cluding HTTPS may neglect some valuable findings. At thetime when we collected the data, we filtered out only 4 web-sites from Top (Facebook, Twitter, Linkedin, and Netflix)and 3 websites from Random (Delta, Smartrecruiters, andPolarpersonaltrainer) that are transmitted via HTTPS. Soour current results are not significantly influenced by suchexclusion. One ongoing plan is to compare the differencesof cache performance between HTTP and HTTPS.

Our data collection lasted for only one week so that wecan report the cache performance with only less than 24-hour revisiting interval. The reason is that fewer samplescan be used for longer revisiting interval, given the fixedduration. For example, we collect the traces of each websiteevery 0.5 hour and continue the collection for one week.Therefore, the number of samples used to analyze cacheperformance with 0.5 hour revisiting interval is 2*24*7=336,while there are only 7 samples that could be used for 24hour revisiting interval. Fewer samples could lead to biasof our findings. The limited duration also prevents us fromanalyzing revisiting interval longer than 1 day, which maybe more general for the Random websites.

Our measurement targets at the client-side cache mech-anism, so we assume that the proxy used in our frameworkcan intercept all resources fetched from Web server. It ispossible to obtain different measurement results by usingother proxies than Charles. We plan to compare the impactsintroduced by different proxies in future work.

9 RELATED WORKThe combination between mobile computing and the Webhas raised new research questions and opportunities. Onone hand, measurement studies are conducted to profile


the performance of mobile Web, aiming to understand thelatency, data usage, and energy consumption when surfingthe Web on mobile devices. On the other hand, researchersinvest efforts to improve the performance of mobile Webwith novel techniques from different layers and tiers.

We next discuss related research on general mobile Webperformance and then present related work on mobile Webcache specifically.

9.1 Mobile Web PerformanceMeasurement studies have been conducted to understandmobile Web performance. Huang et al. [22] measured mo-bile Web browsing performance. Wang et al. [23] examinedthe issues specific to mobile Web browsers. Mobile HTTPArchive [24] records mobile Web performance informationof about 5000 mobile websites. But its recording period is15 days, and is too coarse-grained to analyze the cacheperformance. Papapanagiotou et al. [25] compared smart-phone and laptop Web traffic based on a three-week-longwireless communication trace collected in an enterpriseenvironment. Thiagarajan et al. [26] measured the preciseenergy used by a mobile browser to render webpages.

Many techniques were proposed by researchers and em-ployed by mobile developers to improve mobile Web perfor-mance. Jones et al. [27] studied how browser parallelizationcould help mobile browsers reduce latency and improveenergy efficiency. Dong et al. [28] implemented a coloradaptive Web browser Chameleon that renders webpageswith power-optimized color schemes under user-suppliedconstraints to reduce energy consumption of mobile Webbrowsing. Lymberopoulos et al. [29] leveraged prefetch-ing techniques to load resources based on user behaviorswith server support to reduce latency. Some research worktargeted at leveraging proxies or cloud to improve theperformance of mobile Web browsing, such as VMP [30],FlyWheel [31], and Klotski [32]. Some latest commoditybrowsers such as Amazon Silk Browser [33], Google ChromeBeta [34], and OperaMini [35] are also designed for mobilebrowsing by leveraging server or cloud-based offloading.But the cloud-based mobile Web browsing is not alwayshelpful [36]. New protocols, such as HTTP2 [37], SPDY [38],[39] and QUIC [40], are designed and deployed to optimizemobile Web performance from lower levels.

9.2 Mobile Web CacheA lot of efforts have been invested to study the performanceof mobile Web cache. Wang et al. [9] examined three client-only solutions to accelerate mobile browser speed: caching,prefetching, and speculative loading, by using Web usagedata collected from 24 iPhone users over one year. Theyfound that caching has very limited effectiveness: 60% of therequested resources are either expired or not in the cache.Zhang et al. [41] performed a comprehensive measurementstudy on Web caching functionality of 1300 top rankedAndroid apps, not just Web browsers. Results revealed thatimperfect web caching is a common and serious problemfor Android apps generating Web traffic. They also im-plemented a system-wide service called CacheKeeper toeffectively reduce overhead caused by poor Web cachingof mobile apps. Qian et al. [7] conducted a measurementstudy on Web caching in smartphones. By examining a one-day smartphone Web traffic dataset collected from a cellularcarrier and a five-month Web access trace collected from 20smartphone users, the study revealed that about 20% of thetotal Web traffic examined is redundant due to imperfectcache implementations. In their subsequent work [8], they

studied caching efficiency for the most popular 500 web-sites. They found that caching is poorly utilized for manymobile sites. 26% of the top mobile sites mark their mainHTML files as non-storable, leading to potential bandwidthwaste. About 23% of objects in mobile sites are cacheable buthave freshness lifetime of less than 1 hour. Our work differsfrom existing studies in our methodology: we use long-period sampling traces and determine the cache efficiencyby resource evolution history. Wang et al. [42] examined howWeb browsing can benefit from micro-cache that separatelycaches layout, code, and data at a fine granularity. Theystudied how and when these resources are updated andfound that layout and code that block subsequent objectloads are highly cacheable.

10 CONCLUSION

In this article, we have presented a comprehensive measure-ment and analysis of mobile Web cache performance. Byproposing a proactive approach and an analysis model, wefound that a big gap existed between the actual performanceand ideal performance of client-side cache. We summarizedthree main causes of the observations, i.e., Same Content,Heuristic Expiration, and Conservative Expiration Time. Weprovided recommendations for developers to improve mo-bile Web cache performance according to the implications ofour findings.

In future work, we plan to compare the differences ofclient-side cache performance between mobile and desktopwebsite versions when accessing them on mobile devices.We also plan to study how to adaptively configure the cacheby predicting when a resource is going to be updated basedon its changing history and user behaviors.

ACKNOWLEDGMENTThis work was supported by the High-Tech Research and De-velopment Program of China under Grant No.2015AA01A203,the Natural Science Foundation of China (Grant No. 61370020,61421091, 61222203), and the Key Program of Ministry of Ed-ucation, China under Grant No.313004. Tao Xie’s work wassupported in part by National Science Foundation under grantsno. CCF-1349666, CCF-1409423, CNS-1434582, CCF-1434590,CCF-1434596, CNS-1439481, and CNS-1513939.

REFERENCES

[1] Y. Zhu and V. J. Reddi, “Webcore: Architectural support for mobileweb browsing,” in Proc. of ISCA’14, 2014, pp. 541–552.

[2] X. S. Wang, A. Balasubramanian, A. Krishnamurthy, andD. Wetherall, “Demystifying page load performance with WProf,”in Proc. of NSDI’13, 2013, pp. 473–486.

[3] D. H. Bui, Y. Liu, H. Kim, I. Shin, and F. Zhao, “Rethinking energy-performance trade-off in mobile web page loading,” in Proc. ofMobiCom’15, 2015, pp. 14–26.

[4] F. Qian, Z. Wang, A. Gerber, Z. M. Mao, S. Sen, and O. Spatscheck,“Characterizing radio resource allocation for 3G networks,” inProc. of IMC’10, 2010, pp. 137–150.

[5] J. Erman, A. Gerber, M. Hajiaghayi, P. Dan, S. Sen, andO. Spatscheck, “To cache or not to cache: The 3G case,” IEEEInternet Computing, vol. 15, no. 2, pp. 27–34, 2011.

[6] M. Roughan, R. Chang, F. Qian, J. Huang, J. Erman, Z. M. Mao,S. Sen, and O. Spatscheck, “How to reduce smartphone trafficvolume by 30%?” in Passive and Active Measurement, 2013, pp. 42–52.

[7] F. Qian, K. S. Quah, J. Huang, J. Erman, A. Gerber, Z. Mao,S. Sen, and O. Spatscheck, “Web caching on smartphones: idealvs. reality,” in Proc. of MobiSys’12, 2012, pp. 127–140.

[8] F. Qian, S. Sen, and O. Spatscheck, “Characterizing resource usagefor mobile web browsing,” in Proc. of MobiSys’14, 2014, pp. 218–231.

[9] Z. Wang, F. X. Lin, L. Zhong, and M. Chishtie, “How far can client-only solutions go for mobile browser speed?” in Proc. of WWW’12,2012, pp. 31–40.


[10] Y. Ma, X. Liu, S. Zhang, R. Xiang, Y. Liu, and T. Xie, “Measure-ment and analysis of mobile web cache performance,” in Proc. ofWWW’15, 2015, pp. 691–701.

[11] “RFC 2616.” [Online]. Available: http://www.w3.org/Protocols/rfc2616/rfc2616.txt

[12] P. Cao and C. Liu, “Maintaining strong cache consistency in theworld wide web,” IEEE Transactions on Computers, vol. 47, no. 4,pp. 445–457, 1998.

[13] J. Wang, “A survey of web caching schemes for the Internet,” ACMSIGCOMM Computer Communication Review, vol. 29, no. 5, pp. 36–46, 1999.

[14] G. Barish and K. Obraczke, “World wide web caching: trends andtechniques,” IEEE Communications magazine, vol. 38, no. 5, pp. 178–184, 2000.

[15] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. C. Mogul, “Rateof change and other metrics: a live study of the world wide web,”in Proc. of USITS’97, 1997.

[16] G. Cao, “A scalable low-latency cache invalidation strategy formobile environments,” IEEE Transactions on Knowledge and DataEngineering, vol. 15, no. 5, pp. 1251–1265, 2003.

[17] K. Fawaz and H. Artail, “DCIM: distributed cache invalidationmethod for maintaining cache consistency in wireless mobilenetworks,” IEEE Transactions on Mobile Computing, vol. 12, no. 4,pp. 680–693, 2013.

[18] K. Mershad and H. Artail, “SSUM: smart server update mecha-nism for maintaining cache consistency in mobile environments,”IEEE Transactions on Mobile Computing, vol. 9, no. 6, pp. 778–795,2010.

[19] “Charles proxy.” [Online]. Available: http://www.charlesproxy.com/

[20] “Alexa.” [Online]. Available: http://www.alexa.com/[21] “HTML5.” [Online]. Available: http://www.w3.org/TR/html5/[22] J. Huang, Q. Xu, B. Tiwana, Z. M. Mao, M. Zhang, and P. Bahl,

“Anatomizing application performance differences on smart-phones,” in Proc. of MobiSys’10, 2010, pp. 165–178.

[23] Z. Wang, F. X. Lin, L. Zhong, and M. Chishtie, “Why are webbrowsers slow on smartphones?” in Proc. of WWW’11, 2011, pp.91–96.

[24] “Mobile HTTP archive.” [Online]. Available: http://mobile.httparchive.org/

[25] I. Papapanagiotou, E. Nahum, and V. Pappas, “Smartphones vs.laptops: comparing web browsing behavior and the implicationsfor caching,” ACM SIGMETRICS Performance Evaluation Review,vol. 40, no. 1, pp. 423 – 424, 2012.

[26] N. Thiagarajan, G. Aggarwal, A. Nicoara, D. Boneh, and J. P.Singh, “Who killed my battery?: analyzing mobile browser energyconsumption,” in Proc. of WWW’12, 2012, pp. 41–50.

[27] C. G. Jones, R. Liu, L. Meyerovich, K. Asanovic, and R. Bodik,“Parallelizing the web browser,” in Proc. of HotPar’09, 2009, pp.7–7.

[28] M. Dong and L. Zhong, “Chameleon: a color-adaptive webbrowser for mobile OLED displays,” IEEE Transactions on MobileComputing, vol. 11, no. 5, pp. 724–738, 2012.

[29] D. Lymberopoulos, O. Riva, K. Strauss, A. Mittal, and A. Ntoulas,“PocketWeb: instant web browsing for mobile devices,” in Proc. ofASPLOS’12, 2012, pp. 1–12.

[30] B. Zhao, B. Tak, and G. Cao, Mobile Web Browsing Using the Cloud,ser. Springer Briefs in Computer Science. Springer, 2014.

[31] V. Agababov, M. Buettner, V. Chudnovsky, M. Cogan, B. Green-stein, S. McDaniel, M. Piatek, C. Scott, M. Welsh, and B. Yin,“Flywheel: Google’s data compression proxy for the mobile web,”in Proc. of NSDI’15, 2015, pp. 367–380.

[32] M. Butkiewicz, D. Wang, Z. Wu, H. V. Madhyastha, and V. Sekar,“Klotski: Reprioritizing web content to improve user experienceon mobile devices,” in Proc. of NSDI’15, 2015, pp. 439–453.

[33] “Amazon Silk browser.” [Online]. Available: http://aws.amazon.com/cn/documentation/silk/

[34] “Google Chrome beta browser.” [Online]. Available: https://www.google.com/chrome/browser/beta.html

[35] “Opera Mini browser.” [Online]. Available: http://www.opera.com/mobile/mini/iphone

[36] A. Sivakumar, V. Gopalakrishnan, S. Lee, S. Rao, S. Sen, andO. Spatscheck, “Cloud is not a silver bullet: a case study of cloud-based mobile browsing,” in Proc. of HotMobile’14, 2014, pp. 1 – 6.

[37] “HTTP2.” [Online]. Available: https://http2.github.io/[38] “SPDY protocol - draft 3.” [Online]. Available: http://www.

chromium.org/spdy/spdy-protocol/spdy-protocol-draft3[39] X. S. Wang, A. Balasubramanian, A. Krishnamurthy, and

D. Wetherall, “How speedy is SPDY?” in Proc. of NSDI’14, 2014,pp. 387–399.

[40] “Experimenting with QUIC.” [Online]. Available: http://blog.chromium.org/2013/06/experimenting-with-quic.html

[41] Y. Zhang, C. Tan, and L. Qun, “CacheKeeper: a system-wide webcaching service for smartphones,” in Proc. of UbiComp’13, 2013, pp.265–274.

[42] X. S. Wang, A. Krishnamurthy, and D. Wetherall, “How much canwe micro-cache web pages?” in Proc. of IMC’14, 2014, pp. 249–256.

Xuanzhe Liu is an Associate Professor in theSchool of Electronics Engineering and Com-puter Science, Peking University, Beijing, China.He is also with Beida (Binhai) Information Re-search. His research interests are in the areaof services computing, mobile computing, web-based systems, and big data analytic. He is thecorresponding author of this article.

Yun Ma is a Ph.D. student in the School ofElectronics Engineering and Computer Science,Peking University, Beijing, China. His researchinterests include services computing and webengineering.

Yunxin Liu is a Lead Researcher in MicrosoftResearch, Beijing, China. His current researchinterests include mobile systems and network-ing.

Tao Xie is an Associate Professor in the De-partment of Computer Science at the Universityof Illinois at Urbana-Champaign, USA. His re-search interests are software testing, programanalysis, software analytics, software security,and educational software engineering. He is asenior member of IEEE.

Gang Huang is a Full Professor in Instituteof Software, Peking University. He is also withBeida (Binhai) Information Research. His re-search interests are in the area of middlewareof cloud computing and mobile computing. He isa member of IEEE.

http://www.w3.org/Protocols/rfc2616/rfc2616.txt

http://www.w3.org/Protocols/rfc2616/rfc2616.txt

http://www.charlesproxy.com/

http://www.charlesproxy.com/

http://www.alexa.com/

http://www.w3.org/TR/html5/

http://mobile.httparchive.org/

http://mobile.httparchive.org/

http://aws.amazon.com/cn/documentation/silk/

http://aws.amazon.com/cn/documentation/silk/

https://www.google.com/chrome/browser/beta.html

https://www.google.com/chrome/browser/beta.html

http://www.opera.com/mobile/mini/iphone

http://www.opera.com/mobile/mini/iphone

https://http2.github.io/

http://www.chromium.org/spdy/spdy-protocol/spdy-protocol-draft3

http://www.chromium.org/spdy/spdy-protocol/spdy-protocol-draft3

http://blog.chromium.org/2013/06/experimenting-with-quic.html

http://blog.chromium.org/2013/06/experimenting-with-quic.html

IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. XX, NO. XX,...

Documents

Transcript of IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. XX, NO. XX,...