An Evaluation of Caching Policies for Memento TimeMaps

Post on 28-Nov-2014


JCDL2013 presentation by Justin F. Brunelle


Justin F. Brunelle and Michael L. Nelson
Old Dominion University

{jbrunelle, mln}@cs.odu.edu

JCDL 2013
Indianapolis, Indiana

07/2013

Discovering Archived nasa.gov Pages

Archived Pages => mementos
Mementos identified by URI-M

Live Pages => resources
Resources identified by URI-R


TimeMaps: Lists of mementos

<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",

<http://www.nasa.gov/>;rel="original",

<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",

<http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",

<http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",

<http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",

<http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",

<http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",

<http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",

<http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",

<http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",

<http://api.wayback.archive.org/memento/20080903053412/http://www.nasa.gov/>;rel="memento";datetime="Wed, 03 Sep 2008 05:34:12 GMT",

<http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 00:00:00 GMT",

<http://api.wayback.archive.org/memento/20080904055742/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 05:57:42 GMT",

<http://webarchive.nationalarchives.gov.uk/20080906134025/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 00:00:00 GMT",

<http://api.wayback.archive.org/memento/20080906143204/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 14:32:04 GMT",

<http://webarchive.nationalarchives.gov.uk/20080907124040/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 00:00:00 GMT",

<http://api.wayback.archive.org/memento/20080907160232/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 16:02:32 GMT",

<http://webarchive.nationalarchives.gov.uk/20120809003120/http://www.nasa.gov/>;rel="memento";datetime="Thu, 09 Aug 2012 00:00:00 GMT",

<http://webarchive.nationalarchives.gov.uk/20120814175606/http://www.nasa.gov/>;rel="memento";datetime="Tue, 14 Aug 2012 00:00:00 GMT",

<http://webarchive.nationalarchives.gov.uk/20120819212348/http://www.nasa.gov/>;rel="memento";datetime="Sun, 19 Aug 2012 00:00:00 GMT",

<http://webarchive.nationalarchives.gov.uk/20120826185010/http://www.nasa.gov/>;rel="memento";datetime="Sun, 26 Aug 2012 00:00:00 GMT",

<http://webarchive.nationalarchives.gov.uk/20120909230516/http://www.nasa.gov/>;rel="last memento";datetime="Sun, 09 Sep 2012 00:00:00 GMT"

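A TimeMap in this link format can be split into entries with a small parser. The sketch below is only illustrative: the function names `parse_timemap` and `mementos` are hypothetical, and the regular expressions assume every parameter value is double-quoted, as in the listing above.

```python
import re
from email.utils import parsedate_to_datetime

# One entry looks like: <URI>;rel="...";datetime="..."
ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')

def parse_timemap(text):
    """Return one dict per link entry (hypothetical helper, not part of any Memento library)."""
    entries = []
    for uri, params in ENTRY.findall(text):
        attrs = dict(PARAM.findall(params))
        # rel may hold several space-separated values, e.g. "first memento"
        entry = {"uri": uri, "rel": attrs.get("rel", "").split()}
        if "datetime" in attrs:
            entry["datetime"] = parsedate_to_datetime(attrs["datetime"])
        entries.append(entry)
    return entries

def mementos(entries):
    """Keep only entries whose rel includes "memento"."""
    return [e for e in entries if "memento" in e["rel"]]

sample = (
    '<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate", '
    '<http://www.nasa.gov/>;rel="original", '
    '<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>'
    ';rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT", '
    '<http://webarchive.nationalarchives.gov.uk/20120909230516/http://www.nasa.gov/>'
    ';rel="last memento";datetime="Sun, 09 Sep 2012 00:00:00 GMT"'
)

entries = parse_timemap(sample)
print(len(entries), len(mementos(entries)))  # 4 2
```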


Aggregating TimeMaps

• Multiple archives
• Expensive
• Caching reduces load on archives
• Write-through

[Diagram labels: Cache; Aggregator; Sort; IA TM; AIT TM; HTTP Cache]


Aggregator Cache

• TimeMaps change
• Only want to cache better TimeMaps
  – Bigger is better
• Ideally monotonically increasing
• Two extremes:
  – Never cache (TTL=0)
  – Never update in cache (TTL=92)


Agenda


Cache content measures

• |a| => # of archives
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",

• |m| => # of mementos
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",

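The two cache content measures can be computed directly from a list of URI-Ms. This is a minimal sketch that assumes an archive is identified by the host name of the URI-M; that is an illustrative simplification, and the function name `cardinality` is hypothetical.

```python
from urllib.parse import urlsplit

def cardinality(uri_ms):
    """Return (|a|, |m|): distinct archive hosts and number of mementos.

    Assumption: one host name corresponds to one archive.
    """
    archives = {urlsplit(uri).netloc for uri in uri_ms}
    return len(archives), len(uri_ms)

uri_ms = [
    "http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/",
    "http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/",
    "http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/",
]

print(cardinality(uri_ms))  # (2, 3)
```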

Same TimeMap

• |a| == |a'|
• |m| == |m'|
All archives have reported the same mementos.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 2; |m| = 3

Gained Archives, Gained Mementos

• |a| < |a'|
• |m| < |m'|
A new archive (WebCite) has just indexed and reported a memento for the first time.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 3; |m| = 4

Same Archives, Gained Mementos

• |a| == |a'|
• |m| < |m'|
The Internet Archive has released a set of new mementos.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 2; |m| = 4

Lost Archives, Same Mementos

• |a| > |a'|
• |m| == |m'|
A redaction of 1 memento took place in the Internet Archive, which now does not report mementos for this resource. The UK Web Archive has released 1 new memento for this resource.

TimeMap T: |a| = 3; |m| = 3
TimeMap T': |a| = 2; |m| = 3

Lost Archives, Gained Mementos

• |a| > |a'|
• |m| < |m'|
A redaction of 2 mementos took place in the Internet Archive, which now does not report mementos for this resource. The UK Government Web Archive has released 3 new mementos for this resource.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 1; |m| = 4

Lost Archives, Lost Mementos

• |a| > |a'|
• |m| > |m'|
Archive-It has removed a collection, and no longer reports those mementos. No other archives have new mementos of those resources.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 1; |m| = 1

Gained Archives, Lost Mementos

• |a| < |a'|
• |m| > |m'|
A new archive (WebCite) has just indexed and reported 1 memento for the first time. A server error at the Internet Archive caused an omission of 2 mementos.

TimeMap T: |a| = 2; |m| = 4
TimeMap T': |a| = 3; |m| = 3


Experiment Design

• Eliminate caching from local Memento proxies
• Daily observations of 4,000 TimeMaps for 92 days in 2013
• TimeMaps analyzed for changes & cardinality
• Investigated caching policies
• Outages observed from Memento/archives/department

Observations

Occurrence   Description                        Action
77.4%        Unchanged TimeMap                  Do not update cache
19.7%        Lost archives, lost mementos       Do not update cache
2.4%         Gained archives, gained mementos   Update cache
0.4%         Same archives, gained mementos     Update cache
0.1%         Gained archives, lost mementos     Do not update cache
0.01%        Lost archives, same mementos       Update cache
0.01%        Lost archives, gained mementos     Update cache
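The observation categories and their cache actions can be expressed as a lookup keyed on how |a| and |m| changed between two consecutive TimeMaps. This is an illustrative sketch: the names `compare`, `classify`, and `ACTIONS` are hypothetical, and the actions are copied from the table above. A combination the table does not list returns `None`.

```python
def compare(old, new):
    """Describe how a count changed between TimeMap T and TimeMap T'."""
    if new > old:
        return "gained"
    if new < old:
        return "lost"
    return "same"

# (archive change, memento change) -> action, as listed in the Observations table
ACTIONS = {
    ("same", "same"): "Do not update cache",
    ("lost", "lost"): "Do not update cache",
    ("gained", "gained"): "Update cache",
    ("same", "gained"): "Update cache",
    ("gained", "lost"): "Do not update cache",
    ("lost", "same"): "Update cache",
    ("lost", "gained"): "Update cache",
}

def classify(a_old, m_old, a_new, m_new):
    """Return the change category and the table's action (None if not listed)."""
    key = (compare(a_old, a_new), compare(m_old, m_new))
    return key, ACTIONS.get(key)

# T: |a| = 2, |m| = 3  ->  T': |a| = 1, |m| = 4
print(classify(2, 3, 1, 4))  # (('lost', 'gained'), 'Update cache')
```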


Impact of Change in TimeMaps

• Caching transient errors
  – Not returned or not archived?

Cardinality of TimeMaps

<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
<http://www.nasa.gov/>;rel="original",
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
<http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",
<http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",
<http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",
<http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",
<http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",
<http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",
<http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",
<http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",

|TM| ?

Strict vs. Loose Matching

• Different archive, URI-M, datetime
  – Strict: 2, Loose: 2
<http://api.wayback.archive.org/memento/20080509125659/http://flare.prefuse.org/>;rel="memento";datetime="Fri, 09 May 2008 12:56:59 GMT",
<http://webarchive.nationalarchives.gov.uk/20080908074106/http://flare.prefuse.org/>;rel="memento";datetime="Mon, 08 Sep 2008 00:00:00 GMT",

• Same archive, datetime, different URI-M
  – Strict: 3, Loose: 1
<http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
<http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
<http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",

• Same archive, different URI-M, bad datetime
  – Strict: 2, Loose: 2
<http://wayback.archive-it.org/2342/20110321192906/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
<http://wayback.archive-it.org/2354/20110321035356/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
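The slide gives strict and loose counts but does not spell out the matching rules. The sketch below uses one possible reading that reproduces all three examples: strict counts distinct URI-Ms, and loose counts distinct pairs of archive host and 14-digit capture timestamp taken from the URI-M. Both rules, and the names `strict_count` and `loose_count`, are assumptions for illustration and not the authors' stated definitions.

```python
import re
from urllib.parse import urlsplit

# 14-digit capture timestamp embedded in the URI-M path, e.g. /20101101060204/
TIMESTAMP = re.compile(r"/(\d{14})/")

def strict_count(uri_ms):
    """Strict (assumed): every distinct URI-M is a distinct memento."""
    return len(set(uri_ms))

def loose_count(uri_ms):
    """Loose (assumed): same archive host + same capture timestamp = same memento."""
    keys = set()
    for uri in uri_ms:
        match = TIMESTAMP.search(uri)
        stamp = match.group(1) if match else uri  # fall back to the full URI-M
        keys.add((urlsplit(uri).netloc, stamp))
    return len(keys)

different_everything = [
    "http://api.wayback.archive.org/memento/20080509125659/http://flare.prefuse.org/",
    "http://webarchive.nationalarchives.gov.uk/20080908074106/http://flare.prefuse.org/",
]
same_capture = [
    "http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/",
]
bad_datetime = [
    "http://wayback.archive-it.org/2342/20110321192906/http://www.apple.com/iphone/find-my-iphone-setup/",
    "http://wayback.archive-it.org/2354/20110321035356/http://www.apple.com/iphone/find-my-iphone-setup/",
]

for group in (different_everything, same_capture, bad_datetime):
    print(strict_count(group), loose_count(group))
# 2 2
# 3 1
# 2 2
```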


Strict vs. Loose: translate.google.com



Testing

• TTLs [0, 92]
  – 0: Thrashed cache, best freshness
  – 92: First TimeMap cached, no replacement
• Policies
  – Unconditional
    • Cardinality ignored
  – Conditional
    • Replacements occur when cardinality is better
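The two policies can be compared by replaying a series of daily TimeMap sizes through a simulated cache. This is a minimal sketch under stated assumptions: "better" is taken to mean a strictly larger cardinality, the TTL clock restarts after every check even when the conditional policy keeps the old entry, and the function name `simulate` and the sample sizes are hypothetical.

```python
def simulate(daily_sizes, ttl, conditional):
    """Replay daily TimeMap cardinalities through a cache.

    Returns (cardinality served each day, number of queries to the archives).
    """
    cached = None   # cardinality currently held in the cache
    age = 0         # days since the cache entry was last checked
    queries = 0
    served = []
    for size in daily_sizes:
        if cached is None or age >= ttl:
            queries += 1  # cache miss or TTL expired: ask the archives
            # Unconditional: always replace. Conditional: replace only if bigger.
            if cached is None or not conditional or size > cached:
                cached = size
            age = 0
        served.append(cached)
        age += 1
    return served, queries

sizes = [3, 4, 1, 4, 5]  # made-up daily observations, including a transient drop to 1

print(simulate(sizes, ttl=2, conditional=False))  # ([3, 3, 1, 1, 5], 3)
print(simulate(sizes, ttl=2, conditional=True))   # ([3, 3, 3, 3, 5], 3)
print(simulate(sizes, ttl=0, conditional=False))  # ([3, 4, 1, 4, 5], 5)
```

In this toy run the conditional policy never caches the transient drop, while the unconditional policy serves it until the next expiry.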


Evaluation

• Minimize cost values:
  – Q – Queries to the archives
  – MemDays – number of missed mementos/day
• Calculated MemDays: mementos missed/day

[Figure: trade-off between MemDays and Q, with the extremes labeled TTL: ∞ and TTL: 0]
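One way to read the MemDays cost is as the number of mementos the cache failed to serve on each day, summed over the observation period. The sketch below follows that reading and treats each day's freshly observed TimeMap size as the reference, which is itself an assumption since a fresh TimeMap can be incomplete. The function name `mem_days` and the sample values are hypothetical.

```python
def mem_days(observed, served):
    """Sum, over all days, of mementos present in the fresh TimeMap but missing from the cached one."""
    return sum(max(0, fresh - cached) for fresh, cached in zip(observed, served))

observed = [3, 4, 1, 4, 5]             # made-up fresh TimeMap sizes per day
served_unconditional = [3, 3, 1, 1, 5]  # made-up sizes served by an unconditional cache
served_conditional = [3, 3, 3, 3, 5]    # made-up sizes served by a conditional cache

print(mem_days(observed, served_unconditional))  # 4
print(mem_days(observed, served_conditional))    # 2
```

A lower TTL drives this value toward zero at the price of more queries (Q), which is the trade-off the evaluation minimizes.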


MemDays

[Figure: MemDays example; labels: 6, |TM|=10, MemDay=8]

Optimal TTL

Unconditional

Conditional

Optimal TTL = 9

Optimal TTL = 15


Conclusion & Future Work

• 3-month observation of 4,000 TimeMaps
• Change patterns studied
  – 80.2% of TimeMaps monotonically increase
  – Others decrease
• Optimal TTL = 15 days
• Cache Improvements:
  – Saves requests to the archives
• Worth reinvestigating
  – Changed Memento landscape

Backups


www.nasa.gov 1996 - 2012


Memento

Integrates the past and present web

[Figure labels: Now; Always Current; 2008, 2006, 2001, 2008, 2010]

Cardinality

• Size of a TimeMap
  – # Archives?
  – # Date times?
• TimeMaps:
• Cardinality:
• Monotonic Increase: