
1378 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 34, NO. 5, MAY 2016

Optimal Web Page Download Scheduling Policies for Green Web Crawling

Vassiliki Hatzi, B. Barla Cambazoglu, and Iordanis Koutsopoulos, Senior Member, IEEE

Abstract—A web crawler is responsible for discovering and downloading new pages on the Web as well as refreshing previously downloaded pages. During these operations, the crawler issues a large number of HTTP requests to web servers. These requests increase the energy consumption and carbon footprint of the web servers, since computational resources are used while serving the requests. In this work, we introduce the problem of green web crawling, where the objective is to devise a page refresh policy that minimizes the total staleness of pages in the repository of a web crawler, subject to a constraint on the amount of carbon emissions due to the processing on web servers. For the case of one web server and one crawling thread, the optimal policy turns out to be a greedy one. At each iteration, the page to be refreshed is selected based on a metric that considers the page's staleness, its size, and the greenness of the energy consumed at the web server premises. We then extend the optimal policy to the cases of 1) many servers; 2) multiple threads; and 3) pages with variable freshness requirements. We conduct simulations on a real data set that involves a large web server collection hosting around two billion pages. We present experimental results for the optimal page refresh policy as well as for various heuristics, in an effort to study the effect of different factors on performance.

Index Terms—Crawling, carbon footprint, greenness, staleness.

I. INTRODUCTION

THE operations of a web search engine can be grouped under three main components: web crawling, indexing, and query processing [1]. Web crawling is responsible for traversing the hyperlink structure among the web pages to discover and download the content in the Web, as well as for refreshing already downloaded pages in the web repository. The indexing component converts downloaded content into compact data structures that can be easily searched. Finally, the query processing component evaluates user queries by processing these data structures and matches each query to a set of pages deemed to be relevant to the query.

The focus of this work is the greenness of the web crawling component (in particular, page refresh operations). In search data centers, thousands of computers are allocated to crawl the Web. Maintaining an infrastructure of this scale results in certain implications in terms of energy consumption and carbon footprint.

Manuscript received March 29, 2015; revised October 21, 2015; accepted December 6, 2015. Date of publication January 21, 2016; date of current version May 19, 2016. Part of the material in the paper was presented at the International Conference on Software, Telecommunications and Computer Networks, Split, Croatia, Sept. 2014 [2].

V. Hatzi is with the University of Thessaly, Volos 384 46, Greece (e-mail: [email protected]).

B. B. Cambazoglu was with Yahoo Labs, Barcelona, Spain (e-mail: [email protected]).

I. Koutsopoulos is with the Athens University of Economics and Business, Athens, Greece (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSAC.2016.2520246

In general, web crawlers lead to carbon emissions in two different ways, due to (i) local operations performed on the crawling nodes in the data centers (e.g., parsing web pages) and (ii) remote operations performed on the web servers while serving HTTP requests (e.g., retrieving a page from disk). In this work, we are interested in the latter case, i.e., the carbon emissions that the web crawler incurs on remote computers that do not belong to the search engine.

In practice, a web crawler may lead to significant energy consumption on web servers while the HTTP requests issued by the crawler are processed on the servers (e.g., during disk accesses, processing in the CPU, and network operations). We motivate this with some back-of-the-envelope calculations. Let us assume that there are five billion pages in the Web [3]. According to a conservative estimate, we can assume an average of 200 J (0.055 Wh) of energy consumption per HTTP request [4]. Let us assume that each page in the web repository is refreshed once per minute, on average. Now, refreshing only one-tenth of the repository requires about 40 GWh of energy per day. Web crawling is also a costly operation in terms of the carbon emissions of web servers. In our example, if the carbon footprint of the fuel used to generate electricity is 0.85 kg/kWh [5], on average, the carbon emissions due to web crawling can be estimated as 34,000 tons per day.
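Spelling out the arithmetic behind these figures (one refresh per minute corresponds to 1440 refreshes per page per day):

$$ 5 \times 10^9 \times \tfrac{1}{10} \times 1440 \times 0.055\ \text{Wh} \approx 3.96 \times 10^{10}\ \text{Wh/day} \approx 40\ \text{GWh/day}, $$

$$ 40\ \text{GWh/day} \times 0.85\ \text{kg CO}_2/\text{kWh} = 3.4 \times 10^{7}\ \text{kg/day} = 34{,}000\ \text{tons/day}. $$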

Unfortunately, there is little a web crawler can do to reduce the energy consumption it incurs on web servers without sacrificing the coverage or freshness of its web repository. This is because the amount of energy consumed on web servers depends only on factors related to the hardware and software resources, which are not managed by the search engine company. Nevertheless, certain optimizations can be employed to reduce the carbon emissions that a crawler incurs on web servers. In this work, our main observation is that the carbon footprint of a web server depends on the greenness of the consumed energy, which varies with the time of day. For example, a web server is more likely to consume green energy in the daytime, while it is more likely to consume brown energy during the night. This intra-day variation creates an opportunity to reduce the carbon footprint of web servers, as HTTP requests may be scheduled such that pages are downloaded from web servers that are more likely to be consuming green energy.

Motivated by the observation above, we devise a web repository refreshing technique that takes into account both the greenness and staleness concepts when scheduling the download of web pages. This technique aims to reduce the total staleness of pages in the web repository while constraining the web servers' total carbon footprint resulting from the activities of the crawler.



We evaluate the performance of our technique using a large, real-life web server data set. Beyond creating an environment-friendly crawler, our work has implications for large-scale web search engines, which should comply with regulations about carbon footprint reduction.

The rest of the paper is organized as follows. Section II presents the related work. In Section III, we describe our system model. In Section IV, we formulate our core optimization problem, devise an optimal policy, and provide some extensions. In Section V, we describe our experimental setup, present some heuristics, and set forth numerical results obtained via simulations. Section VI concludes the paper.

II. RELATED WORK

A. Refreshing Web Repositories

The pages in the web repository of a crawler need to be refreshed to prevent the search engine from presenting stale results to its users. A research problem here is to find an order in which the pages will be refreshed such that some freshness metric is optimized over the entire repository. Early papers on the topic suggest refreshing frequently updated pages more often, relying on the observed update frequency of pages as a proxy [6], [7]. Certain works use the update history of related pages to better capture the actual update likelihood of a page [8], [9]. Works in another line devise page refresh strategies that aim to minimize the negative impact on users due to stale search results [10], [11]. Finally, the work in [12] suggests avoiding refreshing fast-changing content as much as possible. To the best of our knowledge, no prior work has investigated this research problem taking into account the greenness aspect of web servers as a constraint, except for the preliminary version of this work [2], where we presented various heuristics for prioritizing page download requests as a means to study the relative importance of different parameters. Here, we prove that the optimal solution is a greedy download scheduling policy that is easy to compute and applicable in realistic scenarios.

B. Job and Packet Processing

A line of work that is relevant to ours is that of scheduling jobs under deadlines. The work in [13] presents the earliest due date (EDD) scheduling policy for wire-line networks. In this policy, at each time, the job with the earliest due date is scheduled, hence minimizing the expected lateness. From a wireless networking perspective, the work in [14] presents a wireless channel-aware version of EDD, the feasible earliest due date (FEDD) policy. The authors of [15] study the problem of scheduling constant-bit-rate traffic over wireless channels and devise a policy that minimizes the packet loss rate due to packet delivery deadline expiration. The work in [16] studies the design of a downlink packet scheduler for real-time multimedia applications and proposes a channel- and QoS-aware version of EDD, named CA-EDD. We note that none of the above works considers energy efficiency.

A number of works aim to minimize the total energy consumption. In [17], a CPU scheduler is presented for mobile devices. This scheduler integrates dynamic voltage scaling into soft real-time scheduling and decides when, how long, and how fast to execute multimedia applications based on their demand distribution. In [18], the authors propose offline and online packet scheduling algorithms for the uplink and downlink in wireless networks with constraints imposed by packet deadlines and finite buffers. The reader may refer to [19] for a survey of studies on energy-efficient scheduling without deadlines. In contrast to these works, our work tries to minimize the total staleness of a web page repository while keeping the amount of carbon emissions of web servers below a given threshold.

C. Energy Efficiency and Greenness of Data Centers

There is a significant amount of work on reducing the energy consumption of data centers. The work in [20] aims to optimize the workload, power, and cooling system management of a data center with the objective of saving energy. In order to emphasize the role of the communication fabric in energy consumption, the work in [21] presents a scheduling technique which makes decisions based on run-time feedback from data center switches and links. The work in [22] uses delay-tolerant jobs to fill the extra capacity of data centers and designs energy-efficient mechanisms that achieve good delay performance. The survey in [23] analyzes software- and hardware-based techniques and architectures as well as mechanisms to control data center resources for energy-efficient operations.

A different line of works focuses on increasing renewable energy utilization and reducing the carbon footprint of data centers. Specifically, the authors of [24] propose an adaptive data center job scheduler that leverages green energy prediction to reduce the number of jobs canceled due to lack of available green energy. The work in [25] focuses on renewable energy capacity planning and proposes an optimization-based framework to achieve specified carbon footprint goals at minimal cost. The work in [26] proposes a policy for request distribution across data centers with the objective of data center cost minimization, while enabling Internet services to leverage green energy and respect their SLAs. The work in [27] investigates three issues related to the feasibility of powering an Internet-scale system exclusively with renewable energy: the impact of geographical load balancing, the optimal mix of renewable sources, and the role of storage.

Data centers hosting cloud applications consume huge amounts of energy. The works in [28], [29], and [30] use virtualization as a power management and resource allocation technique for energy efficiency in cloud computing environments. In particular, the work in [28] assumes deterministic virtual machine (VM) demands, whereas stochastic VM demands are considered in [29]. The key idea of the approach presented in [30] is to match the VM load with the power provisioned by the renewable energy source (RES), with the goal of minimizing the total cost of power consumption for the cloud provider. The work in [31] proposes a graph-based approach which utilizes Voronoi partitions to control the operation of a cloud system with the goal of minimizing a combination of average request time, electricity cost, and carbon emissions.


D. Our Contribution

The works surveyed above are related either to refreshing the repositories of crawlers, effective processing of jobs, or energy efficiency and greenness of data centers. Our work introduces the green web crawling problem, which combines these research threads. It brings together the scheduling concepts with the problem of reducing the staleness of a web repository while limiting the carbon emissions incurred on web servers. We believe that this combination, along with the observation that the type of energy consumed by a web server varies in time, is unique to our work.

Our objective of minimizing the total staleness of a web repository is related to the problem of minimizing the total waiting time of scheduled jobs [13], [14]. However, our problem differs in that the crawler needs to consider both the staleness and greenness aspects while making its scheduling decisions. The concept of freshness could be related to a hard deadline constraint on the refreshing of pages. However, the nature of the crawling process necessitates a softer version of the deadline, that of staleness. Due to the constraints on the amount of carbon emissions, the opportunities for scheduling each page weigh differently, and the concept of greenness needs to be considered explicitly in scheduling. To accommodate the average greenness constraints in our problem, we weigh them with the Lagrange multiplier λ and include them in the objective function. In that respect, our work could be considered, in abstract terms, as a scheduling problem under both deadline and average energy constraints. In contrast to works that present policies for time- or energy-efficient offline job scheduling under constraints [15], our work proposes online policies, where the crawler makes its decisions as the page download requests arrive. Finally, the concepts of page size and page freshness requirement differentiate our work further from previous works.

The main contributions of our work are the following:

• We introduce the green web crawling problem, where we study a page refresh policy that minimizes the total staleness of pages in the web repository of a crawler, while keeping the amount of carbon emissions on remote web servers low enough.

• For one web server and one crawling thread, we show that the optimal page refresh policy is a greedy one. At each time slot, the page to be refreshed is selected based on a metric that considers the staleness of the page, its size, and the greenness of the energy consumed by the web server.

• We extend the optimal policy to the cases of (i) many web servers, (ii) multiple crawling threads, and (iii) web pages with variable freshness requirements. We also propose various heuristics along the lines of the optimal policy.

• We conduct simulations with a large-scale, real data set obtained from Yahoo.

III. SYSTEM MODEL

A. Web Crawler

In large-scale search engines, web crawling is performed by clusters of computers, where each computer runs multiple crawling threads. A crawling thread either downloads a newly discovered URL or redownloads a previously stored page in the repository by issuing HTTP requests to web servers.

Fig. 1. Our system model for the crawler: m crawling threads concurrently retrieve pages from N web servers at time slot t.

These two operations are known as discovery and refresh, respectively. The discovery operations increase the size of the web repository of the crawler. The refresh operations help to decrease the staleness of pages in the repository. The focus of this work is on the page refresh operations only. Decreasing the staleness of a page repository is important since this has a direct impact on the quality of the search results presented to the users, thus affecting the monetization of the search engine.

B. Web Servers

The system model is depicted in Fig. 1. We assume that there exists a set $\mathcal{K}$ of $N$ web servers. Also, let $\mathcal{W}_i$ be the set of web pages that are hosted on server $i \in \mathcal{K}$, and thus $\mathcal{W} = \bigcup_{i\in\mathcal{K}} \mathcal{W}_i$ is the set of all pages in the system. For each page $j \in \mathcal{W}_i$, let $p_j$ be the size of its content, in bytes. The hardware devices located in the remote servers consume energy for serving the page download requests issued by the crawler. Fetching the requested pages from the disk, processing them in the CPU, and transmitting them through the server transmit circuit all consume energy. To account for the factors above, we assume that the amount of energy consumed to download a page $j$ is a linear function of $p_j$, i.e., $e_j = \alpha p_j + \beta$, with $\alpha, \beta > 0$ known constants. These assumptions are made to better expose the main characteristics of our approach, namely server greenness and web page staleness, to be presented below.

We also assume that each page may have its own freshness requirement, which is mainly determined by two factors: (i) its PageRank and (ii) its likelihood of change. The PageRank of a page measures its relative importance within the set of web pages [32]. On the other hand, the likelihood of change of a page is determined by the estimated frequency of content change of the page. Pages with a high PageRank value and a high likelihood of change may have higher freshness requirements. We define weights $\gamma_j$, $\forall j \in \mathcal{W}_i$ with $i \in \mathcal{K}$, which indicate the different freshness requirements of pages. Web pages with high $\gamma_j$ should be refreshed more often. These weights can be estimated based on crawler statistics and past crawling history.


C. Greenness of Server Energy Consumption

For each server $i$, we define a time-varying value $g_i(t)$, which indicates (on a given scale) the amount of produced carbon emissions per unit of consumed energy (Wh). This value denotes the "greenness" of the energy consumed to run the server. For example, on a $[0, 1]$ scale, the carbon emission value (or greenness) of a very green energy source is close to zero (no carbon emissions, maximum greenness) and, as the value increases, the source becomes more "brown". For instance, a server may be powered entirely by clean nuclear energy or wind power, or entirely by brown energy generated from coal or lignite, or by a mixture of these. The $g_i(t)$ values may vary in time due to the time-varying power output generated by renewable energy sources, or because the local electricity company increases the brownness due to high demand. We assume that there is no a priori knowledge of the $g_i(t)$ values, and that they can be communicated from the web servers to the crawler only at the time a decision needs to be made.

D. Web Page Staleness

We assume that time is divided into slots and that the slot size is large enough to cover the download of the largest page. Hence, page downloads occur on a time slot basis. This assumption is made to simplify the subsequent analysis and better expose the structure of the optimal policy. Nevertheless, it is not restrictive, since it still captures the different energy consumption incurred on remote web servers.

At the beginning of each slot, $m$ threads are directed by the crawler towards $m$ web pages that are selected for download. For each page $j$, we need to define a measure of its staleness. Let $s_j(t)$ be the staleness of page $j$ at the beginning of slot $t$. If this page is selected for download, the page download will finish at the latest by the end of the slot, and at the end of the slot its staleness will be 0. However, if this page is not selected, its staleness will increase, and at the end of the slot it will be $s_j(t) + 1$. We observe that as long as page $j$ is not selected, it becomes more and more stale, which means that the staleness of the page depends on the time elapsed since the last time it was downloaded. It should be noted that the slot assumption above leads to a somewhat conservative consideration in terms of the computed staleness, in the sense that it leads to a larger staleness increase for pages that are not scheduled for download. However, even if this assumption is relaxed, the structure of our analysis is not expected to change.

IV. PROBLEM FORMULATION

We are interested in scheduling the page download threads from the crawler to the web servers with the objective of keeping the web pages as fresh as possible and the carbon emissions due to page download requests low enough. The decision at each time $t$ is to pick a server $i \in \mathcal{K}$ and a page $j \in \mathcal{W}_i$ to download in such a way that the total staleness of the pages is minimized and the amount of carbon emissions is kept below a given threshold. On the one hand, we would like to choose pages with large staleness. On the other hand, we would like to schedule downloads of pages from servers that have low $g_i(t)$ values. Also, out of all pages, it is not clear whether we should schedule for download the ones with smaller size or the ones with larger size. Small pages consume less energy. However, larger pages should also be downloaded at some point in order to reduce their staleness. There may also be pages with high freshness requirements $\gamma_j$, which should be given high priority in the download process.

The joint consideration of all the parameters above and the conflicting objectives of keeping the pages fresh and the carbon emissions at the remote servers low make the thread scheduling problem non-trivial. First, we formulate and solve the basic single-server, single-thread problem, presenting the various intuitions behind it. We then extend it to the cases of many servers, multiple crawling threads, and pages with different freshness requirements.

A. Single Web Server, Single Thread Scheduling Problem

First, we consider the simple case of one server that hosts a set $\mathcal{W}$ of web pages. A horizon of $T+1$ time slots is assumed. As mentioned above, the page download time is one slot, and we assume that $g(t)$ remains stable over the time slot. At each time $t$, $m = 1$ thread is sent to fetch one web page. We define the vector variable $\mathbf{x}(t) = (x_j(t) : j \in \mathcal{W})$, where $x_j(t) = 1$ if at time slot $t$ the web page $j \in \mathcal{W}$ is downloaded, and $x_j(t) = 0$ otherwise. Clearly, $\sum_{j\in\mathcal{W}} x_j(t) = 1$, $\forall t = 0, \ldots, T-1$, since at each slot $t$ only one thread to a web page is allowed to be active at the server. We note that the staleness of a web page $j$ at the beginning of slot $t+1$ is:

$$ s_j(t+1) = \begin{cases} 0 & \text{if } x_j(t) = 1 \\ s_j(t) + 1 & \text{if } x_j(t) = 0. \end{cases} \quad (1) $$

The time evolution of the staleness of a page $j$ can thus be written as $s_j(t+1) = (s_j(t)+1)(1 - x_j(t))$. Our goal is to find a policy $\mathbf{x}^* = (\mathbf{x}(t) : t = 0, \ldots, T-1)$ that minimizes the total (over time and over pages) staleness of the server, i.e.,

$$ \min_{\mathbf{x}^*} \; \sum_{t=0}^{T} \sum_{j\in\mathcal{W}} s_j(t) \quad (2) $$

subject to the constraint that only one thread to a web page is allowed to be active at the server at each time $t$, i.e.,

$$ \text{s.t.} \quad \sum_{j\in\mathcal{W}} x_j(t) = 1, \quad \forall t = 0, \ldots, T-1, \quad (3) $$

and that the total amount of carbon emissions due to page download requests does not exceed a given threshold $G$, i.e.,

$$ \text{s.t.} \quad \sum_{t=0}^{T-1} \sum_{j\in\mathcal{W}} x_j(t)\, e_j\, g(t) \le G, \quad (4) $$

where $G$ is set by the agreement between the crawler and the remote server. Since we estimate the total staleness (over pages) at the beginning of each slot $t$, we assume that the last scheduling decision is made at the beginning of slot $T-1$, the last page download finishes by the end of slot $T-1$, the last total staleness estimation is made at the beginning of slot $T$, and no action is taken during slot $T$.


We observe that the time evolution of the total (over pages) staleness is:

$$ \sum_{j\in\mathcal{W}} s_j(t+1) = \sum_{j\in\mathcal{W}} s_j(t) + (|\mathcal{W}| - 1) - s_{j^*(t)}(t), \quad (5) $$

where $j^*(t)$ is the web page selected at time slot $t$, with staleness $s_{j^*(t)}(t)$ (which becomes zero at the beginning of slot $t+1$).

We observe that for $t = T-1$, and after a number of calculations, equation (5) can be written as:

$$ \sum_{j\in\mathcal{W}} s_j(T) = \sum_{j\in\mathcal{W}} s_j(0) + T(|\mathcal{W}| - 1) - \sum_{t'=0}^{T-1} s_{j^*(t')}(t'). $$

Therefore, for any $t = 0, \ldots, T-1$, we have:

$$ \sum_{j\in\mathcal{W}} s_j(t) = \sum_{j\in\mathcal{W}} s_j(0) + t(|\mathcal{W}| - 1) - \sum_{t'=0}^{t-1} s_{j^*(t')}(t'), \quad (6) $$

and using the decision variable $\mathbf{x}(t)$, (6) can be written as:

$$ \sum_{j\in\mathcal{W}} s_j(t) = \sum_{j\in\mathcal{W}} s_j(0) + t(|\mathcal{W}| - 1) - \sum_{t'=0}^{t-1} \sum_{j\in\mathcal{W}} x_j(t')\, s_j(t'). \quad (7) $$

Using (7), the total staleness $\sum_{t=0}^{T} \sum_{j\in\mathcal{W}} s_j(t)$ can now be written as:

$$ (T+1) \sum_{j\in\mathcal{W}} s_j(0) + \frac{T(T+1)}{2}(|\mathcal{W}| - 1) - \sum_{t=0}^{T} \sum_{t'=0}^{t-1} \sum_{j\in\mathcal{W}} x_j(t')\, s_j(t'). \quad (8) $$

In order to minimize (8), we just have to maximize the quantity $\sum_{t=0}^{T} \sum_{t'=0}^{t-1} \sum_{j\in\mathcal{W}} x_j(t') s_j(t')$. Moreover, in order to accommodate constraint (4) in our problem, we include it in the objective function, parameterized by a Lagrange multiplier $\lambda \in \mathbb{R}_+$. Thus, we end up with the following objective:

$$ \max_{\mathbf{x}^*} \; \sum_{t=0}^{T} \sum_{t'=0}^{t-1} \sum_{j\in\mathcal{W}} x_j(t')\, s_j(t') - \lambda \left( \sum_{t=0}^{T-1} \sum_{j\in\mathcal{W}} x_j(t)\, e_j\, g(t) - G \right). \quad (9) $$

Parameter $\lambda$ denotes the significance of greenness for the server. Its value, which is set in consultation with the crawler, depends on the capability of using green energy at the local server premises. If there is no such capability, then $\lambda = 0$ and the crawler's goal is simply to minimize the total staleness. On the other hand, if there is high potential for energy from renewable sources, the value of $\lambda$ is high. In that case, besides minimizing staleness, the crawler also wishes to keep the carbon footprint as low as possible, as suggested by the values of $\lambda$ and $G$. We observe that as $\lambda$ increases, the amount of carbon emissions becomes more and more important for the server, and the page download scheduling is largely determined by its value.

B. Optimal Web Page Download Scheduling Policy

Now, we will try to understand the structure of objective (9). We write (9) for $T = 2$ and, after some algebra, we get:

$$ \max_{\mathbf{x}^*} \; \sum_{j\in\mathcal{W}} \left( 2 s_j(0) - \lambda e_j g(0) \right) x_j(0) + \sum_{j\in\mathcal{W}} \left( s_j(1) - \lambda e_j g(1) \right) x_j(1) + \lambda G. \quad (10) $$
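For completeness, the algebra behind (10) is the following. For $T = 2$, the staleness term of (9) contributes $\sum_{j\in\mathcal{W}} x_j(0) s_j(0)$ for $t = 1$ and $\sum_{j\in\mathcal{W}} \left( x_j(0) s_j(0) + x_j(1) s_j(1) \right)$ for $t = 2$, i.e.,

$$ \sum_{t=0}^{2} \sum_{t'=0}^{t-1} \sum_{j\in\mathcal{W}} x_j(t')\, s_j(t') = \sum_{j\in\mathcal{W}} 2\, x_j(0)\, s_j(0) + \sum_{j\in\mathcal{W}} x_j(1)\, s_j(1). $$

Subtracting $\lambda \left( \sum_{j\in\mathcal{W}} x_j(0) e_j g(0) + \sum_{j\in\mathcal{W}} x_j(1) e_j g(1) - G \right)$ and grouping the terms of each slot gives (10). The same bookkeeping for general $T$ explains the coefficients $T-t$ in (11) below: each term $x_j(t') s_j(t')$ appears once for every $t > t'$, i.e., $T - t'$ times.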

For $T > 2$, the objective is written as:

$$ \max_{\mathbf{x}^*} \; \sum_{j\in\mathcal{W}} \left( T s_j(0) - \lambda e_j g(0) \right) x_j(0) + \sum_{j\in\mathcal{W}} \left( (T-1) s_j(1) - \lambda e_j g(1) \right) x_j(1) + \ldots + \sum_{j\in\mathcal{W}} \left( s_j(T-1) - \lambda e_j g(T-1) \right) x_j(T-1) + \lambda G. \quad (11) $$

Since $g(t)$ is not known a priori, we study the online version of the problem. Objective (11) can be decomposed into separate terms to be optimized with respect to the scheduling decision only at slot $t$, which means that the crawler's decision at $t$ is independent of the decisions made at the other slots. Thus, the optimal policy involves greedy decision making.

Optimal policy: At the beginning of each slot $t$, the crawler chooses $\mathbf{x}(t)$ that maximizes

$$ \sum_{j\in\mathcal{W}} \left( (T-t)\, s_j(t) - \lambda e_j g(t) \right) x_j(t) $$

such that $\sum_{j\in\mathcal{W}} x_j(t) = 1$. For the selected web page $j^*(t)$, it holds that:

$$ j^*(t) = \arg\max_{j\in\mathcal{W}} \left( (T-t)\, s_j(t) - \lambda e_j g(t) \right), $$

which means that at each time $t$, given $g(t)$, the crawler decides based on the values of $s_j(t)$ and $e_j$, for $j \in \mathcal{W}$.

In the special case where all pages need the same energy to be downloaded, i.e., $e_j = e$, $\forall j \in \mathcal{W}$, the crawler chooses the page $j$ with the maximum staleness $s_j(t)$. On the other hand, if all pages have the same staleness at $t$, i.e., $s_j(t) = s$, $\forall j \in \mathcal{W}$, then the crawler chooses the page $j$ with the minimum required energy $e_j$, i.e., the one with the minimum size $p_j$.
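To make the rule concrete, the following minimal simulation sketch (Python; the function names, energy constants, and greenness profile are illustrative assumptions of ours, not from the paper) implements the single-server, single-thread greedy policy: at each slot it scores every page by $(T-t)s_j(t) - \lambda e_j g(t)$, refreshes the arg-max page, and ages the remaining ones according to (1).

```python
import random

def energy(p, alpha=1e-6, beta=0.05):
    """Energy e_j = alpha * p_j + beta consumed to serve page j (illustrative constants)."""
    return alpha * p + beta

def greedy_refresh(page_sizes, g, T, lam):
    """Single-server, single-thread greedy policy of Section IV-B.

    page_sizes: list of page sizes p_j (bytes)
    g:          function g(t) in [0, 1] giving the server greenness indicator at slot t
    T:          scheduling horizon (decisions are made at slots 0 .. T-1)
    lam:        Lagrange multiplier lambda trading staleness against carbon emissions
    Returns the download schedule and the total staleness accumulated over the horizon.
    """
    staleness = [0.0] * len(page_sizes)            # s_j(0) = 0 for all pages
    schedule, total_staleness = [], 0.0
    for t in range(T):
        total_staleness += sum(staleness)           # staleness is measured at the start of each slot
        # Score (T - t) * s_j(t) - lambda * e_j * g(t) for every page and pick the arg-max.
        best = max(range(len(page_sizes)),
                   key=lambda j: (T - t) * staleness[j] - lam * energy(page_sizes[j]) * g(t))
        schedule.append(best)
        # Staleness update (1): the refreshed page resets to 0, every other page ages by one slot.
        staleness = [0.0 if j == best else s + 1 for j, s in enumerate(staleness)]
    total_staleness += sum(staleness)               # staleness at the beginning of slot T
    return schedule, total_staleness

# Example usage with synthetic data (1000 pages, crude day/night greenness pattern).
if __name__ == "__main__":
    random.seed(0)
    pages = [random.randint(1_000, 500_000) for _ in range(1000)]
    diurnal_g = lambda t: 0.0 if 10 <= (t % 24) <= 14 else 1.0
    sched, total = greedy_refresh(pages, diurnal_g, T=24, lam=100.0)
    print(total)
```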

C. Extensions to the Model

1) Many Web Servers: Consider a set $\mathcal{K}$ of $N > 1$ web servers, where each server is characterized by its own $g_i(t)$ value. Again, we assume that at each slot $t$, only one thread to a web page is allowed to be active at any server ($m = 1$). Our goal is to find a policy $\mathbf{x}^* = (\mathbf{x}(t) : t = 0, \ldots, T-1)$ that minimizes the total staleness (over time and all pages) of all servers, i.e.,

$$ \min_{\mathbf{x}^*} \; \sum_{t=0}^{T} \sum_{i=1}^{N} \sum_{j\in\mathcal{W}_i} s_j(t), \quad (12) $$


subject to the constraint that only one thread to a web page is allowed to be active at any server at each slot $t$, i.e.,

$$ \text{s.t.} \quad \sum_{i\in\mathcal{K}} \sum_{j\in\mathcal{W}_i} x_j(t) = 1, \quad \forall t = 0, \ldots, T-1, \quad (13) $$

and that the total amount of carbon emissions of each remote server $i$ does not exceed a given threshold $G_i$, i.e.,

$$ \text{s.t.} \quad \sum_{t=0}^{T-1} \sum_{j\in\mathcal{W}_i} x_j(t)\, e_j\, g_i(t) \le G_i, \quad \forall i \in \mathcal{K}, \quad (14) $$

where $G_i$ is set by the agreement between the crawler and the remote server $i$. Based on (13), we observe that a policy $\mathbf{x}^*$ is a sequence of vectors $\mathbf{x}(t)$, where for a given $t$, $x_j(t) = 1$ for only one $i$ and only one $j \in \mathcal{W}_i$.

Similar to Section IV-A, objective (12) is converted to:

$$ \max_{\mathbf{x}^*} \; \sum_{t=0}^{T} \sum_{t'=0}^{t-1} \sum_{i=1}^{N} \sum_{j\in\mathcal{W}_i} x_j(t')\, s_j(t'). \quad (15) $$

By bringing (14) into the objective with Lagrange multipliers $\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_N) \in \mathbb{R}_+^N$, our problem becomes:

$$ \max_{\mathbf{x}^*} \left[ \sum_{t=0}^{T} \sum_{t'=0}^{t-1} \sum_{i=1}^{N} \sum_{j\in\mathcal{W}_i} x_j(t')\, s_j(t') - \sum_{i=1}^{N} \lambda_i \left( \sum_{t=0}^{T-1} \sum_{j\in\mathcal{W}_i} x_j(t)\, e_j\, g_i(t) - G_i \right) \right], \quad (16) $$

subject to (13). As in Section IV-B, the objective can be decomposed into separate objectives, where each objective needs to be optimized with respect to the scheduling decision only at time slot $t$. Similarly, the optimal policy in the case of many web servers involves greedy decision making.

Optimal policy for many web servers: At the beginning of each slot $t$, the crawler chooses the server $i$ and the web page $j \in \mathcal{W}_i$ with the maximum value of $\left( (T-t)\, s_j(t) - \lambda_i e_j g_i(t) \right)$.

We observe that at each slot $t$, the crawler makes its decisions based on the values of $s_j(t)$, $e_j$, $\lambda_i$, and $g_i(t)$, for $i = 1, \ldots, N$ and $j \in \mathcal{W}_i$. We identify some special cases here:

• If at slot $t$, $s_j(t) = s$, $\forall i \in \mathcal{K}$, $\forall j \in \mathcal{W}_i$, then the crawler chooses the server $i \in \mathcal{K}$ and the web page $j \in \mathcal{W}_i$ with the minimum product $\lambda_i e_j g_i(t)$. If, in addition, $\lambda_i = \lambda$ and $e_j = e$, $\forall i \in \mathcal{K}$, $\forall j \in \mathcal{W}_i$, then the crawler chooses a page at random from the server with minimum $g_i(t)$.

• If $\lambda_i = \lambda$ and $e_j = e$, $\forall i \in \mathcal{K}$, $\forall j \in \mathcal{W}_i$, then at slot $t$ the crawler chooses the server $i \in \mathcal{K}$ and the web page $j \in \mathcal{W}_i$ with the maximum $s_j(t) - g_i(t)$. If, in addition, $g_i(t) = g(t)$, $\forall i \in \mathcal{K}$ at slot $t$, then the crawler chooses, out of all pages, the web page $j$ with maximum staleness $s_j(t)$.

2) Many Web Servers, Multiple Crawling Threads: Now, we assume that the crawler uses $m > 1$ threads at each slot. The problem is the same as the one above, but now, at each slot $t$, the crawler selects $m$ pages for fetching. This means that, in the above problem, constraint (13) is transformed to:

$$ \sum_{i\in\mathcal{K}} \sum_{j\in\mathcal{W}_i} x_j(t) = m, \quad \forall t = 0, \ldots, T-1. \quad (17) $$

Again, the optimal policy turns out to involve greedy decision making.

Optimal policy for many web servers and multiple simultaneous crawling threads: At the beginning of each time slot $t$, the crawler selects the $m$ server-page pairs $(i, j)$ with the largest values of $(T-t)\, s_j(t) - \lambda_i e_j g_i(t)$.

3) Web Pages With Variable Freshness Requirements: Here, we incorporate into our problem the weights $\gamma_j$ defined in Section III, which indicate the variable freshness requirements of web pages. Our optimization problem becomes:

$$ \min_{\mathbf{x}^*} \; \sum_{t=0}^{T} \sum_{i=1}^{N} \sum_{j\in\mathcal{W}_i} \gamma_j\, s_j(t), \quad (18) $$

subject to constraints (14) and (17).

Optimal policy for many web servers, multiple threads, and pages with variable freshness requirements: At the beginning of each slot $t$, the crawler chooses the $m$ server-page pairs $(i, j)$ with the largest values of $(T-t)\, \gamma_j s_j(t) - \lambda_i e_j g_i(t)$.

The following simple example for $m = 1$ shows the influence of the weights $\gamma_j$ on the crawler's decisions: assume that at slot $t$ the crawler has to decide between two pages $k$ and $l$ hosted on the same server (i.e., the same $g_i(t)$ and $\lambda_i$). If $s_k(t) = s_l(t)$, $e_k = e_l$, and $\gamma_k > \gamma_l$, the crawler gives priority to the page with the larger freshness requirement, i.e., it selects page $k$.
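As a rough illustration of this most general rule, the sketch below (Python; the function name, data layout, and energy constants are ours, not from the paper) ranks all (server, page) pairs by $(T-t)\gamma_j s_j(t) - \lambda_i e_j g_i(t)$ and returns the top-$m$ pairs for the current slot.

```python
import heapq

def select_downloads(staleness, sizes, gamma, g_now, lam, t, T, m,
                     alpha=1e-6, beta=0.05):
    """One scheduling step of the general policy (many servers, m threads, weights gamma_j).

    staleness[i][j], sizes[i][j], gamma[i][j]: per-server lists of s_j(t), p_j, gamma_j
    g_now[i]: current greenness indicator g_i(t) of server i
    lam[i]:   Lagrange multiplier lambda_i of server i
    Returns the m (server, page) index pairs with the largest scores
    (T - t) * gamma_j * s_j(t) - lambda_i * e_j * g_i(t).
    """
    def score(i, j):
        e_j = alpha * sizes[i][j] + beta           # energy model e_j = alpha * p_j + beta
        return (T - t) * gamma[i][j] * staleness[i][j] - lam[i] * e_j * g_now[i]

    candidates = ((i, j) for i in range(len(staleness))
                  for j in range(len(staleness[i])))
    return heapq.nlargest(m, candidates, key=lambda ij: score(*ij))
```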

V. PERFORMANCE EVALUATION

A. Data Set

As a web collection, we used web pages sampled from a large web crawl performed by Yahoo. The web crawling was a continuous process; therefore, our data represents only a snapshot of this crawl (the snapshot was obtained in November 2011). The data contains the most important web servers (about 500,000 servers) and the pages hosted on those servers (about two billion pages). The importance of a web server was estimated by a proprietary link analysis metric. This metric is similar to PageRank in that the importance of a server is computed based on the number of its inbound links and the estimated importance of the servers that provide those links. Hence, our collection is large and also represents high-quality content that is of importance to a web search engine. The scale of the collected data set makes us confident that the numerical investigation of our problem (Section V-D) is realistic and gives a good sense of the performance in a realistic system.

In our simulations, we estimate the greenness value of a web server based on the time zone of the country in which it is physically located and the time at which an HTTP request is issued to the web server. We estimate the country information for each server by means of a proprietary classifier. The classifier assigns each server to a country based on a number of features, including the IP address of the server, some link features (e.g., the country information associated with the servers providing inbound links), and some content features (e.g., the language of pages hosted on the server).


Fig. 2. The variation in (a) the solar irradiance [33] and (b) the $g_i(t)$ values during the day.

For each server, we also compute some information about the number of pages hosted on it, as well as its average page size $\bar{p}_i = \left( \sum_{j\in\mathcal{W}_i} p_j \right) / |\mathcal{W}_i|$.

B. Greenness and Staleness Computation

Estimation of Carbon Emissions: First, we assume that each server can be powered by a mixture of solar (green) energy during the day and energy provided by the grid (brown) during the night. The actual solar insolation/irradiance reaching a solar array is strongly dependent on the solar array's position on the Earth and on local weather conditions [33]. It varies throughout the day from 0 kW/m² at night to a maximum of about 1 kW/m². Fig. 2a shows an example in which the solar irradiance reaches its maximum value at noon, when the sun is at its highest point in the sky. The actual solar irradiance is usually below this value because it depends on the angle of incidence of the sun's rays with the ground. Here, for simplicity, we ignore the case of clouds on the horizon.

Based on the day of the year, the time zone, and the latitude and longitude of each server, we estimate the day length as well as the sunrise and sunset times at the server location. We define $g_i(t) = 1 - {}$normalized solar irradiance (using the maximum solar insolation as the measure of scale). This definition captures the fact that the amount of solar insolation determines the amount of generated solar energy and thus the amount of produced carbon emissions. We observe that $g_i(t)$ takes its minimum value in the middle of the day, whereas during the night it takes its maximum value of 1.

For example, if, for a server $i$ located in a country in the northern hemisphere, the sun in winter rises at 07:30 a.m. and sets at 17:00 p.m., the $g_i(t)$ value is 1 at 07:30 a.m., around 0 at 12:15 p.m., again 1 at 17:00 p.m., and 1 during the night. From Fig. 2a, we observe that during the intermediate hours the values of $g_i(t)$ range between 0 and 1, and that they can be approximately determined by a linear function $\alpha_1 t + \beta_1$ between sunrise and noon, and by a function $\alpha_2 t + \beta_2$ between noon and sunset. For the pair of points (07:30, 1) and (12:15, 0), we get $\alpha_1 = -0.21$ and $\beta_1 = 2.57$, while for the pair (12:15, 0) and (17:00, 1) we get $\alpha_2 = 0.21$ and $\beta_2 = -2.57$. Fig. 2b shows the daily variation of $g_i(t)$ for this example.
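A minimal sketch of this piecewise-linear approximation is shown below (Python; the function name and signature are ours). It returns 1 outside daylight hours, 0 at solar noon, and interpolates linearly in between, reproducing the slopes of the example above.

```python
def greenness(t, sunrise=7.5, sunset=17.0):
    """Piecewise-linear g_i(t) approximation of Section V-B.

    t, sunrise, sunset are hours of the local day (e.g., 7.5 = 07:30).
    Returns 1 at night, 0 at solar noon, and linear interpolation in between,
    mirroring the alpha_1 * t + beta_1 / alpha_2 * t + beta_2 construction in the text.
    """
    if t <= sunrise or t >= sunset:
        return 1.0                               # night: fully brown energy
    noon = (sunrise + sunset) / 2.0              # solar noon, e.g., 12:15 for 07:30-17:00
    if t <= noon:
        return (noon - t) / (noon - sunrise)     # decreases from 1 at sunrise to 0 at noon
    return (t - noon) / (sunset - noon)          # increases from 0 at noon to 1 at sunset

# For sunrise 07:30 and sunset 17:00: greenness(7.5) == 1.0, greenness(12.25) == 0.0,
# greenness(17.0) == 1.0, matching the example's alpha/beta coefficients.
```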

Note that our approach is transparent to such derivations and may be used in conjunction with various regimes for computing the time evolution of $g_i(t)$. For example, in the case of wind energy, $g_i(t)$ could be computed using wind velocity and direction [24]. Also, $g_i(t)$ could be estimated using the expected energy generation pattern of each renewable source installed at the premises of server $i$, which can be approximated by an average time sequence based on historical data [34].

Staleness Computation: For the experimental scenario, we assume that the scheduling of a thread towards a web page $j$ results in a download time of $k_j$ slots, which is proportional to the size $p_j$. We assume that $k_j = \Delta + p_j / b$, where $b$ is the bandwidth and $\Delta$ is the network latency between the crawler machine and the server, which we take, for simplicity, to be the same across servers. Since the available data does not allow us to estimate the staleness of each page separately, we use a more practical method to compute the total staleness $S_i(t)$ of a server $i$. For example, let us consider a server $n$ with $|\mathcal{W}_n|$ pages. We assume that at time $t_0$ a page $j$ from that server is selected to be crawled. After $k_j$ time units, the total staleness of each server $i$ increases by $|\mathcal{W}_i| k_j$, i.e., $S_i(t_0 + k_j) = S_i(t_0) + |\mathcal{W}_i| k_j$, since each page gets $k_j$ time units older. Now, at time $t_0 + k_j + 1$, we assume:

$$ S_n(t_0 + k_j + 1) = S_n(t_0 + k_j)\, \frac{|\mathcal{W}_n| - 1}{|\mathcal{W}_n|}, \quad (19) $$

i.e., after crawling a page from server $n$, we simply decrease its current total staleness by its average staleness.
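A small sketch of this server-level bookkeeping, under the Δ = 100 ms and b = 1 Mbps values used in the experiments (Python; the function name and data layout are ours), is the following.

```python
def crawl_step(S, W_sizes, n, p_j, b=1e6 / 8, delta=0.1):
    """Server-level staleness bookkeeping of Section V-B, ending with eq. (19).

    S:       dict mapping server id -> current total staleness S_i
    W_sizes: dict mapping server id -> number of hosted pages |W_i|
    n:       server from which a page of size p_j (bytes) is crawled
    b:       bandwidth in bytes/s (1 Mbps here), delta: network latency in seconds
    """
    k_j = delta + p_j / b                   # download time of the selected page
    for i in S:                             # every server ages by k_j time units
        S[i] += W_sizes[i] * k_j
    # Eq. (19): reduce server n's total staleness by its average staleness.
    S[n] *= (W_sizes[n] - 1) / W_sizes[n]
    return k_j
```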

C. Heuristic Policies

Here, we devise some heuristic page download scheduling policies, along the lines of the optimal policy, in order to evaluate the relative importance of the different system parameters and study the tradeoff between staleness and greenness. Due to the nature of the collected data described in Section V-A, the proposed policies work at the granularity of servers. Each policy achieves a different objective and relies on different parameters to make its decisions, using less information than the optimal policy. We define two more metrics: the average page staleness (over all pages) at slot $t$,

$$ s(t) = \frac{\sum_{i\in\mathcal{K}} \sum_{j\in\mathcal{W}_i} s_j(t)}{\sum_{i\in\mathcal{K}} |\mathcal{W}_i|}, \quad (20) $$

as well as the average staleness of a server $i$ at slot $t$,

$$ S_i(t) = \frac{\sum_{j\in\mathcal{W}_i} s_j(t)}{|\mathcal{W}_i|}. \quad (21) $$

The proposed heuristic policies are the following:

• Random Server Selection (RS): The crawler picks a server uniformly at random out of all servers in the system. Then, it selects a random page in that server to crawl. We use this policy for benchmarking.

• Maximum Greenness Server Selection (MG): The crawler picks the server with the minimum $g_i(t)$ value (maximum greenness). Then, it selects a random page in that server to crawl. This policy tries to minimize the amount of carbon emissions produced by the crawling process.

• Maximum Average Staleness Server Selection (MS): The crawler selects the server with the maximum average staleness $S_i(t)$. Then, it selects a random page in that server to crawl. This policy does not take greenness into account at all. Its goal is to minimize the average staleness of the servers (and thus, that of the pages).


Fig. 3. The performance of the optimal policy in comparison with that of the EDD-like policy in terms of total staleness as a function of λ.

• Maximum Product of Average Server Staleness and (1 − carbon emissions) (MPSC): The crawler selects the server with the maximum product $S_i(t)(1 - g_i(t))$. Then, it selects a random page in that server to crawl. Among the servers with $g_i(t) = 0$, the crawler chooses the one with the maximum average staleness. The goal is to maintain the amount of carbon emissions and the average staleness of the servers at low levels.

• Maximum Product of Average Server Staleness and (1/average page size) (MPSS): The crawler selects the server with the maximum product $S_i(t)(1/\bar{p}_i)$. Then, it selects a random page in that server to crawl.

• Maximum Product of Average Server Staleness, (1/average page size), and (1 − carbon emissions) (MPSSC): The crawler selects the server with the maximum product $S_i(t)(1/\bar{p}_i)(1 - g_i(t))$. Then, it selects a random page in that server to crawl. Among the servers with $g_i(t) = 0$ (maximum greenness), the crawler prefers servers with large average staleness and small average page size.

By experimenting with the last two policies, we try to understand the impact of average page size on system performance in terms of the reduction in carbon emissions and staleness.
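Viewed side by side, the heuristics differ only in the server-scoring rule applied before the random page pick. The sketch below (Python; the function name and data layout are ours, and the MPSC/MPSSC tie-breaking details mentioned above are folded into the plain maximization) makes this explicit.

```python
import random

def pick_server(policy, servers, rng=random.Random(0)):
    """Select a server according to one of the heuristics of Section V-C.

    servers: list of dicts with keys 'S' (average staleness S_i(t)),
             'g' (greenness indicator g_i(t)), and 'p' (average page size).
    Returns the index of the chosen server; a page is then drawn at random from it.
    """
    idx = range(len(servers))
    if policy == "RS":                        # random server
        return rng.randrange(len(servers))
    if policy == "MG":                        # minimum g_i(t), i.e., maximum greenness
        return min(idx, key=lambda i: servers[i]["g"])
    if policy == "MS":                        # maximum average staleness
        return max(idx, key=lambda i: servers[i]["S"])
    if policy == "MPSC":                      # max S_i(t) * (1 - g_i(t))
        return max(idx, key=lambda i: servers[i]["S"] * (1.0 - servers[i]["g"]))
    if policy == "MPSS":                      # max S_i(t) / average page size
        return max(idx, key=lambda i: servers[i]["S"] / servers[i]["p"])
    if policy == "MPSSC":                     # max S_i(t) * (1 - g_i(t)) / average page size
        return max(idx, key=lambda i: servers[i]["S"] * (1.0 - servers[i]["g"]) / servers[i]["p"])
    raise ValueError(f"unknown policy {policy!r}")
```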

D. Experimental Results

Here, we present our experimental results in order to (i) show the performance of the optimal policy in terms of total staleness and total carbon emissions as a function of the parameter λ, and compare it with the performance of an EDD-like policy, (ii) show the performance of our heuristics in terms of the average amount of carbon emissions and the average staleness reduction, and (iii) compare the performance of all proposed policies. We run our experiments for T = 24 hrs, Δ = 100 ms, b = 1 Mbps, and m = 1, 50, 100, 150, 200.

1) Performance of the Optimal Policy: First, we study the performance of the optimal policy for the case of a single server and a single thread, as a function of λ.

Fig. 4. The performance of the optimal policy in comparison with that of the EDD-like policy in terms of carbon emissions as a function of λ.

Here, we use synthetic data as a way to study the performance at the level of web pages. We use a set of 1000 pages and normalized values between 0 and 1 for the energy required for page download. As mentioned in Section IV-A, the page download time is one slot, and during this slot $g(t)$ remains stable. The page staleness values are calculated using the method described in Sections III and IV-A. The $g(t)$ values are computed according to the method described in Section V-B. We assume that staleness is measured in minutes (min) and that the amount of carbon emissions at each slot $t$, $e_j g(t)$, is measured in grams (g). Since the crawler decides based on the value of $(T-t) s_j(t) - \lambda e_j g(t)$, the measurement unit of λ is minutes/gram (min/g).

Figs. 3 and 4 show the performance of the optimal policy in terms of total (over time and over pages) staleness and total carbon emissions, respectively. As λ increases, the total staleness increases in a non-linear, convex manner, whereas the total carbon emissions decrease almost linearly. There is clearly a tradeoff between page staleness and server greenness, and λ can be set to quantify this tradeoff. These results stem from the fact that, as λ increases, the potential for using green energy increases as well, and the server increasingly wants to exploit this potential. Thus, although the crawler wants to minimize the staleness of pages, it is hindered by the server's desire to keep its carbon footprint low.

Figs. 3 and 4 also depict the performance of a carbon-unaware web crawling technique that places emphasis only on the freshness of the downloaded web pages. This policy is reminiscent of the EDD policy, since at each slot it selects the web page with the maximum staleness. As expected, this EDD-like policy outperforms our optimal policy in terms of total staleness for all values of λ, whereas the reverse occurs in terms of total carbon emissions. The performance values of the two policies coincide for λ = 0.

2) Performance of the Heuristic Policies: Here, we study the performance of our heuristics using the collected real data described in Section V-A. Figs. 5 and 6 show their performance in terms of the average (over time) amount of carbon emissions and the average (over time) staleness reduction, respectively.


Fig. 5. The average amount of carbon emissions (measured in grams (g)) generated by the proposed heuristic policies.

Fig. 6. The performance of all heuristics in terms of staleness reduction.

TABLE I
PERFORMANCE RESULTS AS THE IMPACT OF COLD-START IS REDUCED

As the number of threads increases, the crawler's potential increases, and the differences between the policies become more pronounced. We observe that both carbon emissions and staleness reduction constantly grow due to the increasing number of downloaded pages per unit of time.

In Fig. 5, MG outperforms the other five policies in terms of reduced carbon emissions, as expected. Compared to the case where there is no possibility of using green energy, i.e., $g_i(t) = 1$, ∀i and ∀t, it achieves on average a 99.35% reduction in the amount of carbon emissions. The other two policies that also take the value of $g_i(t)$ into account, MPSC and MPSSC, are characterized by average carbon emission reductions of 97.74% and 63.42%, respectively. Their gain is lower since they also factor in server staleness and/or web page size. The tradeoff between staleness and greenness prevents the amount of carbon emissions from being as small as possible.

In Fig. 6, we observe that MS, which can be considered as an EDD-like heuristic policy, outperforms four of the other five policies in terms of staleness reduction.

Fig. 7. The average staleness of the first four policies at the end of the simulation.

Fig. 8. The average page staleness of all heuristics at the end of the simulation.

The fifth one (i.e., RS) performs almost the same as MS. This happens due to a "cold start" phenomenon: in the first hours of the simulation, the great majority of servers have the same average staleness, since the number of downloaded pages (and hence the number of staleness reduction events) during this period is too small compared to the number of pages that are not downloaded. Thus, each time a decision must be made, RS is more likely to select a page from a server whose staleness is at the maximum. In order to eliminate the impact of this phenomenon on our results, we ran both RS and MS for three more days (m = 1). Table I shows the results. We observe that, as the algorithm runs for more days, the difference between the two policies in terms of staleness reduction builds up.

In Figs. 5 and 6, we observe that MPSC tries to strike a balance between staleness and greenness. Besides its relatively high gain in terms of reduced carbon emissions, its average staleness reduction is only 4.34% lower than that of MS.

Figs. 7 and 8 depict the average page staleness after running the proposed policies for one day. Considering only the first four policies, we observe that RS and MS, as expected, achieve the lowest average page staleness. On the other hand, the performance of MPSS and MPSSC, as shown in Fig. 8, reveals the impact of average page size on page freshness. Although the average staleness reduction of both policies remains low (Fig. 6), the average page staleness at the end of the day is much lower than that of the first four policies.


TABLE II
PERFORMANCE COMPARISON OF THE PROPOSED POLICIES

Their ability to keep pages fresher stems from the following fact: since both policies give priority to pages with relatively small size, the number of pages downloaded during the horizon of $T+1$ slots is much greater (88.98% and 87.44% on average for MPSS and MPSSC, respectively) than that of the other policies. This leads to an increased total staleness reduction, which offsets the low average staleness reduction per unit of time, and which in turn yields a relatively low average page staleness at the end of the day.

In Figs. 5 and 8, we observe that MPSSC could also be used by the crawler according to our objective. Although the carbon emission reduction of MPSSC is lower than that of MPSC, its average page staleness at the end of the simulation is lower than that of MPSC.

3) Performance Comparison: In order to compare the performance of all policies, we use synthetic data. Our system consists of 15 servers, and each server hosts 1000 pages. Again, we use normalized values between 0 and 1 for the energy required for page download. For the optimal policy, we consider the same value of λ for all servers, i.e., $\lambda_i = \lambda$ for $i = 1, \ldots, 15$. As mentioned above, the page download time is one slot, and we assume that during this slot the $g_i(t)$ values, for $i = 1, \ldots, 15$, remain stable. At each slot $t$, $m = 1$ thread is sent to fetch one web page. The decision at each time slot $t$ is to pick 1 out of the 15 servers and a page from the selected server to download.

Table II shows the performance of all proposed policies. We ran our optimal policy both for λ = 100 (Optimal 1) and for λ = 500 (Optimal 2). As we can see, both versions of the optimal policy outperform all the heuristic policies in terms of total staleness reduction and average page staleness at the end of the simulation. This stems from the fact that the optimal policy works at the granularity of web pages. Specifically, for relatively low values of λ, it applies more weight to the staleness factor. Thus, in contrast to all heuristics, which select a web page from the selected server at random, it is able to examine the characteristics of all web pages in more detail and pick a web page with relatively high staleness. As λ increases, the servers' desire to keep their carbon footprint low increases as well, and thus the crawler is increasingly hindered from minimizing the staleness of pages. As we mentioned above, λ is a parameter that can be set to quantify the tradeoff between staleness and greenness. On the other hand, for the specific values of λ, the policy that outperforms all the others in terms of reduced carbon emissions is MG. However, the carbon emissions produced by the optimal policy (in both cases) are less than those of the RS, MS (EDD-like heuristic), and MPSS policies, which do not consider the greenness of the servers at all.
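
The exact selection metric of the optimal policy is derived earlier in the paper; the sketch below is only a hypothetical stand-in with the same qualitative behavior described here: staleness is rewarded, carbon cost is penalized with weight λ, and the choice is made per page rather than per server. The scoring expression and the use of (1 − greenness) as a "brownness" factor are our assumptions, not the paper's formula.

```python
def greedy_select(staleness, energy, greenness, lam=100.0):
    """Pick the (server, page) pair with the best score.
    Hypothetical score: page staleness minus lambda times the carbon cost of the
    download, where carbon cost grows with the page's download energy and with how
    "brown" (1 - greenness) the server's energy currently is."""
    best_pair, best_score = None, float("-inf")
    for i in range(len(staleness)):
        for j in range(len(staleness[i])):
            score = staleness[i][j] - lam * (1.0 - greenness[i]) * energy[i][j]
            if score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair
```

This function matches the `select_page` signature used in the setup sketch above, so it can be plugged directly into `run_slot`. With a small λ the staleness term dominates and the stalest pages are chased; with a large λ the carbon term dominates and downloads from servers running on brown energy are avoided, mirroring the qualitative difference reported between Optimal 1 and Optimal 2.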

VI. CONCLUSION

In this work, we introduced the problem of green web crawling: minimizing the total staleness of pages in the repository of a web crawler while keeping the amount of carbon emissions on web servers, due to HTTP requests issued by the crawler, low enough. We devised an optimal policy, which can be implemented in an online fashion, based only on the instantaneous values of page staleness and the greenness indicators of the servers. We also devised several heuristics along the lines of the optimal policy and studied their performance through experiments with real data. Here, we assumed that the refresh decisions are made centrally at each slot t. In the future, we plan to study a scenario where the decisions are made in a distributed fashion, i.e., where at each time slot t each server would decide autonomously whether a web page would be downloaded from it, based on page staleness and server greenness constraints. Moreover, an enhanced model that would include various aspects of actual user experience, owing to web page download latency and staleness, is also worth studying.

REFERENCES

[1] B. B. Cambazoglu and R. A. Baeza-Yates, "Scalability challenges in web search engines," in Synthesis Lectures on Information Concepts, Retrieval, and Services. San Mateo, CA, USA: Morgan, 2015.

[2] V. Hatzi, B. B. Cambazoglu, and I. Koutsopoulos, "Web page download scheduling policies for green web crawling," in Proc. 22nd Int. Conf. Software Telecommun. Comput. Netw. (SoftCOM), 2014, pp. 56–60.

[3] M. de Kunder. The Size of the World Wide Web (The Internet) [Online]. Available: http://www.worldwidewebsize.com/

[4] Vertatique. (Mar. 25, 2015). Average Power Use Per Server [Online]. Available: http://www.vertatique.com/average-power-use-server

[5] Vertatique. (Oct. 15, 2009). Carbon Footprints of Servers Can Vary By 10X [Online]. Available: http://www.vertatique.com/carbon-footprints-servers-can-vary-10x

[6] J. Cho and H. Garcia-Molina, "Effective page refresh policies for web crawlers," ACM Trans. Database Syst., vol. 28, no. 4, pp. 390–426, Dec. 2003.

[7] J. Edwards, K. McCurley, and J. Tomlin, "An adaptive model for optimizing performance of an incremental web crawler," in Proc. 10th World Wide Web, 2001, pp. 106–113.

[8] Q. Tan and P. Mitra, "Clustering-based incremental web crawling," ACM Trans. Inf. Syst., vol. 28, no. 4, pp. 1–27, Nov. 2010.

[9] K. Radinsky and P. N. Bennett, "Predicting content change on the web," in Proc. 6th ACM Web Search Data Min. (WSDM), 2013, pp. 415–424.

[10] J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen, "Optimal crawling strategies for web search engines," in Proc. 11th World Wide Web, 2002, pp. 136–147.

[11] S. Pandey and C. Olston, "User-centric web crawling," in Proc. 14th World Wide Web, 2005, pp. 401–411.

[12] C. Olston and S. Pandey, "Recrawl scheduling based on information longevity," in Proc. 17th World Wide Web, 2008, pp. 437–446.

[13] S. Panwar, D. Towsley, and J. Wolf, "Optimal scheduling policies for a class of queues with customer deadlines to the beginning of service," J. ACM, vol. 35, no. 4, pp. 832–844, Oct. 1988.

[14] S. Shakkottai and R. Srikant, "Scheduling real-time traffic with deadlines over a wireless channel," in Proc. 2nd ACM Int. Workshop Wireless Mobile Multimedia, 1999, pp. 35–42.


[15] T. Ren, I. Koutsopoulos, and L. Tassiulas, "QoS provisioning for real-time traffic in wireless packet networks," in Proc. IEEE GLOBECOM, 2002, pp. 1673–1677.

[16] A. Dua and N. Bambos, "Downlink wireless packet scheduling with deadlines," IEEE Trans. Mobile Comput., vol. 6, no. 12, pp. 1410–1425, Dec. 2007.

[17] W. Yuan and K. Nahrstedt, "Energy-efficient soft real-time CPU scheduling for mobile multimedia systems," in Proc. ACM Symp. Oper. Syst. Principles, 2003, pp. 149–163.

[18] A. El Gamal, C. Nair, B. Prabhakar, E. Uysal-Biyikoglu, and S. Zahedi, "Energy-efficient scheduling of packet transmissions over wireless networks," in Proc. IEEE INFOCOM, 2002, pp. 1773–1782.

[19] L. Wang and Y. Xiao, "A survey of energy-efficient scheduling mechanisms in sensor networks," Mobile Netw. Appl., vol. 11, no. 5, pp. 723–740, Oct. 2006.

[20] Z. Wang, N. Tolia, and C. Bash, "Opportunities and challenges to unify workload, power, and cooling management in data centers," ACM SIGOPS Oper. Syst. Rev., vol. 44, no. 3, pp. 41–46, Jul. 2010.

[21] D. Kliazovich, P. Bouvry, and S. U. Khan, "DENS: Data center energy-efficient network-aware scheduling," Cluster Comput., vol. 16, no. 1, pp. 65–75, Mar. 2013.

[22] D. Xu and X. Liu, "Geographic trough filling for internet datacenters," in Proc. IEEE INFOCOM, 2012, pp. 2881–2885.

[23] J. Shuja et al., "Survey of techniques and architectures for designing energy-efficient data centers," IEEE Syst. J., pp. 1–13, Jul. 2014, doi: 10.1109/JSYST.2014.2315823.

[24] B. Aksanli, J. Venkatesh, L. Zhang, and T. Rosing, "Utilizing green energy prediction to schedule mixed batch and service jobs in data centers," in Proc. 4th Workshop Power-Aware Comput. Syst. (HotPower'11), 2011, pp. 1–5, Article no. 5.

[25] C. Ren, D. Wang, B. Urgaonkar, and A. Sivasubramaniam, "Carbon-aware energy capacity planning for datacenters," in Proc. IEEE 20th Int. Symp. Model. Anal. Simul. Comput. Telecommun. Syst. (MASCOTS), 2012, pp. 391–400.

[26] K. Le et al., "Managing the cost, energy consumption, and carbon footprint of internet services," in Proc. ACM SIGMETRICS, 2010, pp. 357–358.

[27] Z. Liu et al., "Geographical load balancing with renewables," ACM SIGMETRICS Perform. Eval. Rev., vol. 39, no. 3, pp. 62–66, Dec. 2011.

[28] X. Li, Z. Qian, S. Lu, and J. Wu, "Energy efficient virtual machine placement algorithm with balanced and improved resource utilization in a data center," Math. Comput. Modell., vol. 58, nos. 5–6, pp. 1222–1235, Sep. 2013.

[29] W. Yue and Q. Chen, "Dynamic placement of virtual machines with both deterministic and stochastic demands for green cloud computing," Math. Prob. Eng., vol. 2014, 11 pp., Jul. 2014.

[30] D. Hatzopoulos, I. Koutsopoulos, G. Koutitas, and W. Van Heddeghem, "Dynamic virtual machine allocation in cloud server facility systems with renewable energy sources," in Proc. IEEE Int. Conf. Commun. (ICC), 2013, pp. 4217–4221.

[31] J. Doyle, R. Shorten, and D. O'Mahony, "Stratus: Load balancing the cloud for carbon emissions control," IEEE Trans. Cloud Comput., vol. 1, no. 1, pp. 116–128, Aug. 2013.

[32] L. Page et al., "The PageRank citation ranking: Bringing order to the Web," Stanford InfoLab, Tech. Rep. 1999-66, Nov. 1999.

[33] Woodbank Communications Ltd. (2005). Electropaedia: Solar Power (Technology and Economics) [Online]. Available: http://www.mpoweruk.com/solar_power.htm

[34] A. Sfetsos, "A comparison of various forecasting techniques applied to mean hourly wind speed time series," Renew. Energy, vol. 21, no. 1, pp. 23–35, Sep. 2000.

Vassiliki Hatzi received the Diploma and M.S. degrees in electrical and computer engineering from the University of Thessaly, Volos, Greece, in 2008 and 2010, respectively. Currently, she is pursuing the Ph.D. degree at the University of Thessaly. Her research interests include system and user-centric optimization and control methods for smart power grids and web search engines.

B. Barla Cambazoglu received the B.S., M.S., and Ph.D. degrees, all in computer engineering, from the Department of Computer Engineering, Bilkent University, Ankara, Turkey, in 1997, 2000, and 2006, respectively. After he received the Ph.D. degree, he worked as a Postdoctoral Researcher with Bilkent University for a short period of time. In 2006, he joined the Department of Biomedical Informatics, Ohio State University, Columbus, OH, USA, as a Postdoctoral Researcher. In 2008, he joined Yahoo Labs as a Postdoctoral Researcher. He held Research Scientist and Senior Research Scientist positions at the same institution, in 2010 and 2012, respectively. Between 2013 and 2015, he was a Senior Manager, heading the Web Retrieval Group, Yahoo Labs, Barcelona, Spain. His research interests include distributed information retrieval and web search efficiency. In 2010, 2011, 2014, and 2015, he co-organized the LSDS-IR Workshop. He was the Proceedings Chair for WSDM'09 and the Poster and Proceedings Chair for ECIR'12. He served as an Area Chair for SIGIR'13 and SIGIR'14. He regularly serves on the program committees of the SIGIR, WWW, and KDD conferences. He has many papers published in prestigious journals, including IEEE TPDS, JPDC, JASIST, Inf. Syst., ACM TWEB, and IP&M, as well as papers and tutorials presented at top-tier conferences such as SIGIR, CIKM, WSDM, WWW, and KDD.

Iordanis Koutsopoulos (S'99–M'03–SM'13) received the Diploma degree in electrical and computer engineering from the National Technical University of Athens (NTUA), Athens, Greece, in 1997, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Maryland, College Park, College Park, MD, USA, in 1999 and 2002, respectively. He is now an Associate Professor with the Department of Informatics, Athens University of Economics and Business (AUEB), Athens, Greece. He was an Assistant Professor (2013–2015) with AUEB. Before that, he was an Assistant Professor (2010–2013) and a Lecturer (2005–2010) with the Department of Computer Engineering and Communications, University of Thessaly, Volos, Greece. His research interests include network control and optimization, with applications to wireless networks, social and community networks, crowd-sensing systems, smart grids, and cloud computing. He was the recipient of the single-investigator European Research Council (ERC) Competition Runner-Up Award for the project RECITAL: Resource Management for Self-coordinated Autonomic Wireless Networks (2012–2015).