Clustering Web Content for Efficient Replication


Transcript of Clustering Web Content for Efficient Replication

Page 1: Clustering Web Content for Efficient Replication


Clustering Web Content for Efficient Replication

Yan Chen, Lili Qiu*, Weiyu Chen, Luan Nguyen, Randy H. Katz

EECS Department, UC Berkeley

*Microsoft Research

Page 2: Clustering Web Content for Efficient Replication

Motivation

• Amazing growth in WWW traffic
  – Daily growth of roughly 7M Web pages
  – Annual growth of 200% predicted for the next 4 years
• Content Distribution Networks (CDNs) commercialized to improve Web performance
  – Un-cooperative, pull-based replication
• Paradigm shift: cooperative push is more cost-effective
  – Strategically pushing replicas can achieve close-to-optimal performance [JJKRS01, QPV01]
  – Improves availability during flash crowds and disasters
• Orthogonal issue: granularity of replication
  – Per Website? Per URL? -> Clustering!
  – Clustering based on aggregated clients' access patterns
• Adapt to users' dynamic access patterns
  – Incremental clustering (online and offline)

Page 3: Clustering Web Content for Efficient Replication

Outline

• Motivation
• Simulation methodology
• Architecture
• Problem formulation
• Granularity of replication
• Dynamic replication
  – Static clustering
  – Incremental clustering
• Conclusions

Page 4: Clustering Web Content for Efficient Replication

Simulation Methodology

• Network topology
  – Pure-random, Waxman, and transit-stub models from GT-ITM
  – A real AS-level topology from 7 widely-dispersed BGP peers
• Web workload

  Web Site | Period       | Duration  | # Requests (avg–min–max) | # Clients (avg–min–max) | # Client groups (avg–min–max)
  MSNBC    | Aug–Oct 1999 | 10–11 am  | 1.5M–642K–1.7M           | 129K–69K–150K           | 15.6K–10K–17K
  NASA     | Jul–Aug 1995 | All day   | 79K–61K–101K             | 5,940–4,781–7,671       | 2,378–1,784–3,011

  – Aggregate MSNBC Web clients by BGP prefix
    » BGP tables from a BBNPlanet router
    » 10K groups left; choose the top 10%, covering >70% of requests
  – Aggregate NASA Web clients by domain name
  – Map the client groups onto the topology
• Performance metric: average retrieval cost
  – Sum of edge costs from a client to its closest replica
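To make the metric concrete, here is a minimal sketch of how the average retrieval cost could be computed for one URL; the names (client_groups, replicas, dist, requests) are illustrative, and the per-pair edge costs are assumed to be precomputed from the topology:

```python
# Sketch of the average-retrieval-cost metric: each client group is served
# by its closest replica, and costs are weighted by request counts.
# All names here are illustrative, not taken from the slides.

def average_retrieval_cost(client_groups, replicas, dist, requests):
    """client_groups: iterable of client-group ids
    replicas: set of nodes holding a replica of the URL
    dist[c][s]: sum of edge costs on the path from group c to server s
    requests[c]: number of requests issued by group c"""
    total_cost = 0.0
    total_requests = 0
    for c in client_groups:
        # Each request is served by the closest replica of the URL.
        cost_to_closest = min(dist[c][s] for s in replicas)
        total_cost += requests[c] * cost_to_closest
        total_requests += requests[c]
    return total_cost / total_requests
```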

Page 5: Clustering Web Content for Efficient Replication

Outline

• Motivation
• Simulation methodology
• Architecture
• Problem formulation
• Granularity of replication
• Dynamic replication
  – Static clustering
  – Incremental clustering
• Conclusions

Page 6: Clustering Web Content for Efficient Replication

Conventional CDN: Un-cooperative Pull

[Figure: Clients 1 and 2 in ISP 1 and ISP 2, each with a local DNS server and a local CDN server; a CDN name server; and the Web content server]

1. GET request
2. Request for hostname resolution
3. Reply: local CDN server IP address
4. Local CDN server IP address (returned to the client)
5. GET request
6. GET request if cache miss
7. Response
8. Response

Big waste of replication!

Page 7: Clustering Web Content for Efficient Replication

Cooperative Push-based CDN

[Figure: same setup as the previous slide, with replicas pushed to the local CDN servers ahead of requests]

0. Push replicas
1. GET request
2. Request for hostname resolution
3. Reply: nearby replica server or Web server IP address
4. Redirected server IP address
5. GET request (to the Web server if no replica yet)
6. Response

Significantly reduces the # of replicas and, consequently, the update cost (only 4% of un-cooperative pull)

Page 8: Clustering Web Content for Efficient Replication

Problem Formulation

• Subject to a constraint on the total replication cost,
• find a replication strategy that minimizes the total access cost.
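One way to write this down formally; the notation below is assumed for illustration, not taken from the slides. Let x_{ms} indicate whether object m is replicated on CDN server s, λ_{cm} the request rate of client group c for m, d_{cs} the retrieval cost from c to s, r_m the cost of one replica of m, and B the replication cost budget.

```latex
% Sketch of the constrained placement problem under the assumed notation.
\[
\begin{aligned}
\min_{x}\quad & \sum_{c}\sum_{m} \lambda_{cm}\,\Bigl(\min_{s\,:\,x_{ms}=1} d_{cs}\Bigr)
  && \text{(total access cost)}\\
\text{subject to}\quad & \sum_{m}\sum_{s} r_{m}\,x_{ms} \;\le\; B
  && \text{(total replication cost)}\\
& x_{ms}\in\{0,1\}
\end{aligned}
\]
```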

Page 9: Clustering Web Content for Efficient Replication

Outline

• Motivation
• Simulation methodology
• Architecture
• Problem formulation
• Granularity of replication
• Dynamic replication
  – Static clustering
  – Incremental clustering
• Conclusions

Page 10: Clustering Web Content for Efficient Replication

Replica Placement: Per Website vs. Per URL

Where R = # of replicas per URL, K = # of clusters, M = # of URLs (M >> K), C = # of clients, S = # of CDN servers, f = placement adaptation frequency:

Replication Scheme | States to Maintain | Computation Cost
Per Website        | O(R)               | f × O(R × S × C)
Per Cluster        | O(R × K + M)       | f × O(K × R × (K + S × C))
Per URL            | O(R × M)           | f × O(M × R × (M + S × C))

• Use greedy placement (a sketch follows below)
• 30–70% average retrieval cost reduction for per-URL replication
• Per URL is too expensive for management!
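The slides only say "use greedy placement"; the sketch below shows the standard greedy heuristic that phrase usually refers to, placing R replicas one at a time. Function and variable names are assumptions for illustration, and the exact variant used in the work may differ.

```python
# Greedy replica placement: repeatedly add the CDN server that most reduces
# the total retrieval cost, until the replica budget is spent.

def greedy_placement(servers, client_groups, dist, requests, num_replicas):
    replicas = []

    def total_cost(chosen):
        # Total (request-weighted) retrieval cost if `chosen` hold replicas.
        return sum(requests[c] * min(dist[c][s] for s in chosen)
                   for c in client_groups)

    for _ in range(num_replicas):
        best_server, best_cost = None, float("inf")
        for s in servers:
            if s in replicas:
                continue
            cost = total_cost(replicas + [s])
            if cost < best_cost:
                best_server, best_cost = s, cost
        if best_server is None:
            break
        replicas.append(best_server)
    return replicas
```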

Page 11: Clustering Web Content for Efficient Replication

Clustering Web Content

• General clustering framework
  – Define a correlation distance between URLs
  – Cluster diameter: the max distance between any two members
    » The worst correlation within a cluster
  – Generic clustering: minimize the max diameter of all clusters (see the sketch below)
• Correlation distance defined based on
  – Spatial locality
  – Temporal locality
  – Popularity
  – Semantics (e.g., directory)
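As a rough illustration of "minimize the max diameter of all clusters", the sketch below uses a farthest-point (k-center style) heuristic; this is a stand-in, not necessarily the clustering algorithm used in the work. cor_dist stands for any of the correlation distances defined on the next slides.

```python
# Diameter-oriented clustering sketch: pick K well-separated seed URLs, then
# assign every URL to its closest seed. Classic farthest-point heuristic.

def cluster_urls(urls, cor_dist, k):
    seeds = [urls[0]]
    # Repeatedly pick the URL farthest from all current seeds.
    while len(seeds) < min(k, len(urls)):
        farthest = max(urls, key=lambda u: min(cor_dist(u, s) for s in seeds))
        seeds.append(farthest)
    clusters = {s: [] for s in seeds}
    for u in urls:
        closest = min(seeds, key=lambda s: cor_dist(u, s))
        clusters[closest].append(u)
    return list(clusters.values())
```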

Page 12: Clustering Web Content for Efficient Replication

Spatial Clustering

• URL spatial access vector
  – One entry per client group: the number of accesses to the URL from that group
  [Figure: example topology with four client groups illustrating the access vector of one (blue) URL]
• Correlation distance between two URLs A and B (with access vectors A_i, B_i over k client groups) defined as
  – Euclidean distance:

    cor_dist(A, B) = sqrt( Σ_{i=1..k} (A_i − B_i)² )

  – Vector (cosine) similarity:

    cor_dist(A, B) = 1 − ( Σ_{i=1..k} A_i·B_i ) / ( sqrt(Σ_{i=1..k} A_i²) · sqrt(Σ_{i=1..k} B_i²) )
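A minimal sketch of these two distances in code, assuming each URL's spatial access vector is a plain list of per-client-group access counts:

```python
import math

# Correlation distances over URL spatial access vectors, matching the two
# definitions on the slide.

def euclidean_dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_dist(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a never-accessed URL as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```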

Page 13: Clustering Web Content for Efficient Replication

Clustering Web Content (cont'd)

• Popularity-based clustering:

    cor_dist(A, B) = | access_freq(A) − access_freq(B) |

  – Or, even simpler, sort the URLs by access frequency and put the first N/K elements into the first cluster, and so on
• Temporal clustering (binary correlation)
  – Divide traces into multiple individuals' access sessions [ABQ01]
  – In each session,

    cor_dist(A, B) = 1 − co_occurrence(A, B) / ( occurrence(A) · occurrence(B) )

  – Average over multiple sessions in one day
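The simpler sort-and-split variant of popularity-based clustering mentioned above can be sketched as follows (names are illustrative):

```python
# Popularity-based clustering via the sort-and-split rule on the slide:
# sort URLs by access frequency, then cut the sorted list into K clusters.

def popularity_clusters(access_freq, k):
    """access_freq: dict mapping URL -> access frequency; k: # of clusters."""
    ranked = sorted(access_freq, key=access_freq.get, reverse=True)
    size = -(-len(ranked) // k)          # ceil(N / K) URLs per cluster
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```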

Page 14: Clustering Web Content for Efficient Replication

Performance of Cluster-based Replication

• Tested over various topologies and traces
• Spatial clustering with Euclidean distance and popularity-based clustering perform the best
  – Even a small # of clusters (only 1–2% of the # of URLs) can achieve close to per-URL performance

[Figure: average retrieval cost vs. number of clusters (1–1000), for MSNBC (8/2/1999, 5 replicas/URL) and NASA (7/1/1995, 3 replicas/URL), comparing spatial clustering with Euclidean distance, spatial clustering with cosine similarity, temporal clustering, and access-frequency clustering]

Page 15: Clustering Web Content for Efficient Replication

Outline

• Motivation
• Simulation methodology
• Architecture
• Problem formulation
• Granularity of replication
• Dynamic replication
  – Static clustering
  – Incremental clustering
• Conclusions

Page 16: Clustering Web Content for Efficient Replication

Static clustering and replication

• Two daily traces: an old trace and a new trace

  Methods                     | Static 1 | Static 2 | Optimal
  Traces used for clustering  | Old      | Old      | New
  Traces used for replication | Old      | New      | New
  Traces used for evaluation  | New      | New      | New

• Static clustering performs poorly beyond a week
  – Average retrieval cost almost doubles

[Figure: average retrieval cost on new traces (8/3–10/1) for static clustering 1, static clustering 2, and re-clustering + re-replication (optimal)]

Page 17: Clustering Web Content for Efficient Replication

Incremental Clustering

• Generic framework (sketched in code below)
  1. If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c
  2. Else, create new clusters and replicate them
• Online incremental clustering
  – Push content before it is accessed -> high availability
  – Predict access patterns based on semantics
  – Simplify to popularity prediction
  – Groups of URLs with similar popularity? Use hyperlink structures!
    » Groups of siblings
    » Groups at the same hyperlink depth: the smallest # of links from the root
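A minimal sketch of the generic framework above; match, replicate, and make_clusters are placeholders for whatever a concrete scheme plugs in (e.g., the sibling rule on the later slides, or the offline diameter test):

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    urls: list = field(default_factory=list)
    replica_sites: list = field(default_factory=list)

def incremental_update(new_urls, clusters, match, replicate, make_clusters):
    """match(u, clusters) -> matching Cluster or None (scheme-specific);
    replicate(u, sites) pushes u to the given replica sites;
    make_clusters(urls) -> list of new Clusters for unmatched URLs."""
    unmatched = []
    for u in new_urls:
        c = match(u, clusters)
        if c is not None:
            c.urls.append(u)
            replicate(u, c.replica_sites)   # push u to the cluster's replicas
        else:
            unmatched.append(u)
    # URLs with no matching cluster: cluster and replicate them from scratch.
    clusters.extend(make_clusters(unmatched))
```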

Page 18: Clustering Web Content for Efficient Replication

Online Popularity Prediction

• Experiments
  – Use WebReaper to crawl http://www.msnbc.com on 5/3/2002 with hyperlink depth 4, then group the URLs
  – Use the corresponding access logs to analyze the correlation
  – Groups of siblings have the best correlation
• Measure the divergence of URL popularity within a group:

    access freq span = std_dev(access_frequency) / average(access_frequency)
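The access-frequency span is just the coefficient of variation of the access frequencies within a group; as a sketch:

```python
import statistics

# Access-frequency span of a group of URLs, as defined on the slide.
# The slide does not say which standard deviation is meant; this sketch
# assumes the population version.

def access_freq_span(frequencies):
    """frequencies: list of access counts for the URLs in one group."""
    return statistics.pstdev(frequencies) / statistics.mean(frequencies)
```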

Page 19: Clustering Web Content for Efficient Replication

Online Incremental Clustering

• Semantics-based incremental clustering
  – Put a new URL into the existing cluster containing the largest # of its siblings
  – When there is a tie, choose the cluster with more replicas
• Simulation on the 5/3/2002 MSNBC trace
  – 8–10 am trace: static popularity clustering + replication
  – At 10 am: 16 new URLs emerged -> online incremental clustering + replication
  – Evaluation with the 10–12 am trace: the 16 URLs received 33,262 requests

[Figure: a new URL is assigned to whichever existing cluster contains more of its siblings]
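A sketch of the sibling-based assignment rule above, reusing the Cluster sketch from the incremental-clustering slide; the sibling set is assumed to come from the crawled hyperlink structure:

```python
# Semantics-based online assignment: put a new URL into the existing cluster
# holding the largest number of its siblings; break ties in favor of the
# cluster with more replicas.

def assign_new_url(url, siblings, clusters):
    """siblings: set of URLs that share a hyperlink parent with `url`."""
    def score(c):
        return (len(siblings.intersection(c.urls)), len(c.replica_sites))
    best = max(clusters, key=score)
    if score(best)[0] == 0:
        return None              # no cluster contains any sibling of this URL
    best.urls.append(url)
    return best
```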

Page 20: Clustering Web Content for Efficient Replication

Online Incremental Clustering & Replication Results

Average retrieval cost reduction (16 new URLs):
• Compared with no replication of new URLs: 12.5%
• Compared with random replication of new URLs: 21.7%
• Compared with static clustering + replication (oracle): 200%

Page 21: Clustering Web Content for Efficient Replication

Conclusions

• For CDN operators: cooperative, clustering-based replication
  – Cooperative push: big savings on replica management and update cost
  – Per-URL replication outperforms the per-Website scheme by 60–70%
  – Clustering solves the scalability issues and gives the full spectrum of flexibility
    » Spatial clustering and popularity-based clustering recommended
• To adapt to users' access patterns: incremental clustering
  – Hyperlink-based online incremental clustering for
    » High availability
    » Performance improvement
  – Offline incremental clustering performs close to optimal

Page 22: Clustering Web Content for Efficient Replication

Offline Incremental Clustering

• Study spatial clustering and popularity-based clustering
• Step 1: assign new URLs into existing clusters (sketched in code below)
  – Only when the correlation within that cluster (its diameter) stays unchanged
  – Add the URL to the cluster's existing replicas
• Step 2: apply static clustering and replication to the un-matched URLs
• Performance close to complete re-clustering + re-replication, with only 30–40% of the replication cost
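A sketch of Step 1, reusing cor_dist and the Cluster sketch from earlier slides; whether "unchanged" means exactly equal or simply not larger is an assumption here (the sketch uses "not larger"):

```python
# Offline incremental clustering, Step 1: add a new URL to an existing
# cluster only if doing so does not increase the cluster diameter
# (max pairwise correlation distance among its members).

def diameter(urls, cor_dist):
    return max((cor_dist(a, b) for a in urls for b in urls if a != b),
               default=0.0)

def try_assign_offline(url, clusters, cor_dist, replicate):
    for c in clusters:
        old = diameter(c.urls, cor_dist)
        new = diameter(c.urls + [url], cor_dist)
        if new <= old:                       # diameter unchanged
            c.urls.append(url)
            replicate(url, c.replica_sites)  # push to the cluster's replicas
            return c
    return None   # left for Step 2: static clustering and replication
```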