Michael L. Nelson, Frank McCown, Joan A. Smith, Martin Klein Old Dominion University
How much preservation do I get if I do absolutely nothing?
Using the Web Infrastructure for Digital Preservation
Michael L. Nelson, Frank McCown, Joan A. Smith, Martin Klein
Old Dominion University
Norfolk VA, USA
{mln,fmccown,jsmit,mklein}@cs.odu.edu
Media Production Berlin 2006
Berlin, Germany
December 8, 2006
Research supported in part by NSF, Library of Congress and Andrew Mellon Foundation
Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look on my archive ye Mighty, and despair!”
image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
Alternate Models of Preservation
• Lazy Preservation
  – Let Google, IA et al. preserve your website
• Just-In-Time Preservation
  – Wait for it to disappear first, then find a “good enough” version
• Shared Infrastructure Preservation
  – Push your content to sites that might preserve it
• Web Server Enhanced Preservation
  – Use Apache modules to create archival-ready resources
image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm
Lazy Preservation
Research Questions
• How much digital preservation of websites is afforded by lazy preservation?
  – Can we reconstruct entire websites from the web infrastructure (WI)?
  – What factors contribute to the success of website reconstruction?
  – Can we predict how much of a lost website can be recovered?
  – How can the WI be utilized to provide preservation of server-side components?
Warrick: Crawling the Crawlers
• Is website reconstruction from WI feasible?
  – Web repositories: Google (G), MSN (M), Yahoo (Y), Internet Archive (IA)
  – Reconstructed 24 websites
• How long do search engines keep cached content after it is removed?
SE Caching Experiment
• Create HTML, PDF, and image files
• Place files on 4 web servers
• Remove files on a regular schedule
• Examine web server logs to determine when each page is crawled and by whom
• Query each search engine daily using a unique identifier to see if they have cached the page or image
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, February 2006, 12(2)
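The log-examination step in this experiment can be illustrated with a minimal Python sketch. It assumes the servers write the Apache "combined" log format and uses a hypothetical table of user-agent substrings to name each crawler; the sample log lines are invented for illustration and are not data from the experiment.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: Apache "combined" log format. The slides do not state the format used.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical mapping of user-agent substrings to crawler labels.
CRAWLERS = {"Googlebot": "Google", "msnbot": "MSN", "Slurp": "Yahoo", "ia_archiver": "IA"}


def crawl_times(lines):
    """Return {path: [(crawler, datetime), ...]} for requests made by known crawlers."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent")
        for token, crawler in CRAWLERS.items():
            if token in agent:
                when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
                visits[match.group("path")].append((crawler, when))
                break
    return dict(visits)


# Invented sample lines: two crawler requests and one ordinary browser request.
sample = [
    '66.249.0.1 - - [10/Jan/2006:04:12:01 -0500] "GET /page1.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '10.0.0.5 - - [10/Jan/2006:05:00:00 -0500] "GET /page1.html HTTP/1.1" 200 512 "-" "Mozilla/4.0"',
    '207.46.0.2 - - [11/Jan/2006:09:30:45 -0500] "GET /img1.png HTTP/1.0" 200 2048 "-" "msnbot/1.0"',
]
result = crawl_times(sample)
```

Grouping by path rather than by crawler makes it easy to line up each file's crawl dates against its removal date.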
Caching of HTML Resources - mln
Reconstructing a Website
[Figure: Warrick workflow. Labels: Starting URL, Web Repo, Original URL, Results page, Cached URL, Cached resource, File system, Retrieved resource]
1. Pull resources from all web repositories
2. Strip off extra header and footer HTML
3. Store most recently cached version or canonical version
4. Parse HTML for links to other resources
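The four steps above can be sketched in Python as follows. This is an illustration only, not Warrick's code: the repositories are hypothetical in-memory dictionaries instead of live services, the `<!--HDR-->` and `<!--FTR-->` markers stand in for whatever header and footer a repository adds, and the rule "prefer a canonical copy, otherwise the most recent one" is one reading of step 3.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical repositories: {repo: {url: (cached_date, html, is_canonical)}}.
REPOS = {
    "search_cache": {
        "http://example.org/": ("2006-01-10", "<!--HDR--><a href='a.html'>A</a><!--FTR-->", False),
        "http://example.org/a.html": (
            "2006-01-12", "<!--HDR--><p>new A</p><a href='b.html'>B</a><!--FTR-->", False),
    },
    "archive": {
        "http://example.org/a.html": ("2005-06-01", "<p>old A</p>", False),
    },
}


class LinkParser(HTMLParser):
    """Collects href and src attribute values."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def strip_wrapper(html):
    """Step 2: remove the assumed header and footer markers."""
    return html.replace("<!--HDR-->", "").replace("<!--FTR-->", "")


def best_copy(url):
    """Steps 1 and 3: ask every repository, prefer canonical, else most recent."""
    copies = [repo[url] for repo in REPOS.values() if url in repo]
    if not copies:
        return None
    canonical = [copy for copy in copies if copy[2]]
    _, html, _ = max(canonical or copies, key=lambda copy: copy[0])
    return strip_wrapper(html)


def reconstruct(start_url):
    """Breadth-first recovery of same-site resources reachable from start_url."""
    site = urlparse(start_url).netloc
    recovered, queue, seen = {}, [start_url], {start_url}
    while queue:
        url = queue.pop(0)
        html = best_copy(url)
        if html is None:
            continue  # no repository holds this resource
        recovered[url] = html
        parser = LinkParser()
        parser.feed(html)  # Step 4: find links to other resources
        for link in parser.links:
            target = urljoin(url, link)
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return recovered


site = reconstruct("http://example.org/")
```

In this toy data `b.html` is linked but held by no repository, so it is simply absent from the result.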
How Much Did We Reconstruct?
[Figure: two resource trees compared]
“Lost” web site: A; B, C; D, E, F
Reconstructed web site: A; B’, C’; G, E; F
• Missing link to D; points to old resource G
• F can’t be found
Reconstruction Diagram
added 20%
identical 50%
changed 33%
missing 17%
Websites to Reconstruct
• Reconstruct 24 sites in 3 categories:
  1. small (1-150 resources)
  2. medium (150-499 resources)
  3. large (500+ resources)
• Use Wget to download current website
• Use Warrick to reconstruct
• Calculate reconstruction vector
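One way to calculate such a vector is sketched below in Python. The slides do not define it, so this sketch makes two assumptions: the vector is (changed, missing, added), and changed and missing are fractions of the downloaded site while added is a fraction of the reconstructed site. The URLs and contents are invented.

```python
def reconstruction_vector(lost, reconstructed):
    """Compare two {url: content} snapshots.

    Returns (changed, missing, added) as fractions. Assumed denominators:
    changed and missing are relative to the lost site, added is relative to
    the reconstructed site.
    """
    if not lost or not reconstructed:
        raise ValueError("both snapshots must contain at least one resource")
    changed = sum(
        1 for url, body in lost.items()
        if url in reconstructed and reconstructed[url] != body
    )
    missing = sum(1 for url in lost if url not in reconstructed)
    added = sum(1 for url in reconstructed if url not in lost)
    return (changed / len(lost), missing / len(lost), added / len(reconstructed))


# Invented example: one changed, one missing, one added resource.
lost_site = {"/": "home", "/a": "A v2", "/b": "B", "/c": "C"}
rebuilt_site = {"/": "home", "/a": "A v1", "/b": "B", "/old": "G"}
vector = reconstruction_vector(lost_site, rebuilt_site)
```

A vector of (0, 0, 0) would mean every resource was recovered unchanged and nothing extra was found.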
Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
Web Repository Contributions
[Figure: contribution (0% to 100%) of each web repository to each of the 24 reconstructed websites; legend entries recovered: Yahoo, IA, MSN]
Warrick Milestones
• www2006.org – first lost website reconstructed (Nov 2005)
• DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
• www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
• Internet Archive officially “blesses” Warrick (mid Mar 2006) [1]
[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html
Shared Infrastructure Preservation
(slightly less lazy)
Shared, Existing Infrastructure
• Can we (re)use existing installed network infrastructure for preservation purposes?
Who has the Bigger Fortress?
Research Objective
• Premise: use common Internet Protocol implementations to replicate repository contents
• Inject the contents of an OAI-PMH repository directly into:
  – Email (SMTP)
  – Usenet News (NNTP)
• Instrument existing email, news servers
• Use mod_oai (www.modoai.org) to do resource harvesting
  – complex object formats (e.g. MPEG-21 DIDL) used to encode the resources as “lumps of XML”
  – results are generalizable to any repository system
• Analyze testbed, simulate very large collections
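As an illustration of the harvesting step, the Python sketch below builds an OAI-PMH `ListRecords` request URL. The `verb`, `metadataPrefix`, and `from` arguments are standard OAI-PMH parameters; the base URL and the `oai_didl` prefix value are assumptions made for this example, not details taken from the slides.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build an OAI-PMH ListRecords request URL (no network access is performed)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical base URL and metadata prefix.
url = list_records_url("http://www.example.org/modoai", "oai_didl", "2006-01-01")
```

The `from` argument is what would let a harvesting script pick up only the daily additions and updates instead of the whole repository.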
complex objects
Prototype Architecture
Test Repository
• Website with 72 files
  – HTML, PDF, PNG, JPEG, GIF
  – 1 KB - 1.5 MB
• Used a script to harvest the MPEG-21 DIDLs, and then:
  – attach to outbound email messages
  – post to a moderated newsgroup (repository.odu.test1)
Email Headers
OAI-PMH & HTTP headers
base64 encoded DIDL
original email message
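The three parts labelled on this slide (extra headers, a base64-encoded DIDL, and the original message) can be assembled with Python's standard `email` package, as in the sketch below. The `X-` header names, addresses, and DIDL content are hypothetical; the slides do not give the actual header names used in the prototype.

```python
from email.message import EmailMessage


def piggyback(original_body, didl_xml, identifier):
    """Attach a DIDL record to an ordinary outbound message."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"
    msg["To"] = "receiver@example.net"
    msg["Subject"] = "Lunch on Friday?"
    # Hypothetical header names carrying OAI-PMH and HTTP information.
    msg["X-OAI-PMH-Identifier"] = identifier
    msg["X-HTTP-Content-Type"] = "application/xml"
    msg.set_content(original_body)  # the original email text stays the main body
    # Bytes attachments are base64 encoded by default.
    msg.add_attachment(
        didl_xml.encode("utf-8"),
        maintype="application",
        subtype="xml",
        filename="record.xml",
    )
    return msg


didl = "<didl:DIDL xmlns:didl='urn:mpeg:mpeg21:2002:02-DIDL-NS'/>"
message = piggyback("See you at noon.", didl, "oai:example.org:doc1")
attachment = next(message.iter_attachments())
```

Because the record travels as an attachment, a mail client shows the original text unchanged.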
News Posting
OAI-PMH & HTTP headers
base64 encoded DIDL
Simulation Parameters
• Repository
  – 100,000 items
  – 1 MB/item
  – 100 daily additions
  – 400 daily updates
• Time
  – 2000 days (5.5 years)
• Email
  – granularity = 1
  – follows ODU power law example
• News
  – servers hold contents for 30 days
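The scale implied by these parameters can be worked out with a few lines of Python. The arithmetic assumes that 100,000 items is the starting size, that additions accumulate over the whole run, and that every added or updated item is 1 MB; the slides do not confirm these readings.

```python
ITEMS_START = 100_000      # assumed to be the initial repository size
ITEM_SIZE_MB = 1
DAILY_ADDITIONS = 100
DAILY_UPDATES = 400
DAYS = 2000

# Items in the repository at the end of the run, if additions accumulate.
items_end = ITEMS_START + DAILY_ADDITIONS * DAYS

# Volume of new or changed content that has to be replicated each day.
daily_changed_mb = (DAILY_ADDITIONS + DAILY_UPDATES) * ITEM_SIZE_MB

# Length of the run in years.
years = DAYS / 365.25
```

Under these assumptions the repository triples in size over the run, while 500 MB of new or changed content needs replicating every day.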
News Policies
NNTP Results
SMTP Policies
• passive, “piggybacking”
• History list of receiver domains
  – not maintained; history pointer off
    » duplicates
  – maintained; history pointer on
    » no duplicates
• Granularity filter for emails
  – every Gth email will be processed
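A minimal Python sketch of these policies follows. The class and method names are hypothetical, and the choice of a random record when no history is kept is an assumption; the slides say only that duplicates occur in that mode.

```python
import random


class Piggybacker:
    """Decides which repository record, if any, rides along on an outbound email."""

    def __init__(self, records, granularity=1, use_history=True, seed=0):
        self.records = list(records)
        self.granularity = granularity   # G: process every Gth email
        self.use_history = use_history   # history pointer on or off
        self.pointer = {}                # receiver domain -> index of next record
        self.seen = 0                    # outbound emails observed so far
        self.rng = random.Random(seed)

    def on_outbound_email(self, recipient):
        """Return the record to attach, or None if this email is left untouched."""
        self.seen += 1
        if self.seen % self.granularity != 0:
            return None  # granularity filter
        if not self.use_history:
            # Assumption: with no history, pick any record, so duplicates can occur.
            return self.rng.choice(self.records)
        domain = recipient.rsplit("@", 1)[-1].lower()
        index = self.pointer.get(domain, 0)
        if index >= len(self.records):
            return None  # this domain has already received every record
        self.pointer[domain] = index + 1
        return self.records[index]


sender = Piggybacker(["r1", "r2", "r3"], granularity=2)
sent = [sender.on_outbound_email("a@odu.edu") for _ in range(8)]
```

With the history pointer on, each receiver domain sees every record at most once, at the cost of one stored index per domain.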
SMTP Results
[Figure: two panels, “no history pointer” and “with history pointer”, for G = 1]
Summary
• Shared Infrastructure Preservation provides a communications channel with unknown, future trading partners
  – SMTP approach is only feasible for “advertising” the existence of the repository
  – NNTP approach is promising for holding content
• Lazy Preservation has been used to restore several dozen websites
  – but is it an archival strategy? depends on your tolerance for risk
  – prediction: search engines will see preservation as a business opportunity