Brass: A Queueing Manager for Warrick
Frank McCown, Amine Benjelloun, and Michael L. Nelson
Old Dominion University
Computer Science Department
Norfolk, Virginia, USA
IWAW 2007
Vancouver, BC
June 23, 2007
Agenda
• Dangers facing websites
• Web-repository crawling
• Comparing web crawling with web-repository crawling
• All about Brass
• Alternate Warrick deployments
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
A couple weeks ago I… accidentally deleted my entire database of about 30 articles. After I finished berating myself for being so stupid, I realized that my hosting company would have a backup, so I sent an email asking them to restore the database. Their reply stated that backups were “coming soon”…OUCH! So right after I signed up with a better hosting company I had to figure out a plan B.
Crawling the Crawlers
[Diagram labels: World Wide Web; Repo1, Repo2, ..., Repon; web crawling; web-repository crawling]
• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.
• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/
Cached Image
Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
[Figure labels: canonical; MSN version; Yahoo version; Google version]
Examples of Lost Websites Recovered with Warrick
Web Crawler
Web-Repository Crawler
Issues
Web crawling
• Limit hit rate per host
• Websites periodically unavailable
• Portions of website off-limits (robots.txt, passwords)
• Deep web
• Spam
• Duplicate content
• Flash and JavaScript interfaces
• Crawler traps

Web-repo crawling
• Limit hit rate per repo
• Limited hits per day (API query quotas)
• Repos periodically unavailable
• Flash and JavaScript interfaces
• Can only recover what repos have stored
• Lossy format conversions (thumbnail images, HTMLized PDFs, etc.)
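
The two limits listed for web-repo crawling (a hit rate per repo and a daily query quota) can be tracked together. The following is only a minimal sketch under stated assumptions: the class name `RepoThrottle`, its fields, and the numeric values are hypothetical and do not come from Warrick or Brass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepoThrottle:
    """Tracks one repository's politeness delay and daily query quota.

    All names and values here are illustrative assumptions.
    """
    min_interval: float            # seconds required between hits to this repo
    daily_quota: int               # maximum queries allowed per day
    used_today: int = 0
    last_hit: float = float("-inf")

    def wait_time(self, now: float) -> Optional[float]:
        """Return seconds to wait before the next query, or None if the quota is spent."""
        if self.used_today >= self.daily_quota:
            return None
        return max(0.0, self.last_hit + self.min_interval - now)

    def record_hit(self, now: float) -> None:
        """Note that a query was issued at time `now`."""
        self.used_today += 1
        self.last_hit = now

    def reset_day(self) -> None:
        """Start a new quota period."""
        self.used_today = 0
```

A crawler built this way would ask `wait_time` before each query, sleep for the returned number of seconds, and move on to another repo (or stop for the day) when it returns `None`.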
Problems with Warrick
• Requires user to download, install, and run from the command line:
  warrick.pl -d -r -o log.txt -c -wr ia http://foo.org/
• Google API keys are no longer available
• Screen-scrapes Google’s web user interface, which can cause Google to blacklist an IP address
Solution: Brass
• Queueing system using ODU nodes, so API query limits can be spread across several machines
• Uses Google API keys that we obtained before Google stopped issuing them
• Easy-to-use web interface utilizing email to notify user when reconstructions are complete
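
The idea of spreading API query limits across several machines can be sketched as a queue that hands each reconstruction job to the node with the most quota left. This is a hedged illustration, not the Brass implementation: the slides do not describe the scheduling policy, and `Node`, `Job`, `assign_jobs`, and the `estimated_queries` field are all hypothetical names introduced here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    daily_quota: int   # hypothetical per-machine API query limit
    used: int = 0

    def remaining(self) -> int:
        return self.daily_quota - self.used


@dataclass
class Job:
    url: str                 # website to reconstruct
    email: str               # address to notify on completion (notification not shown)
    estimated_queries: int   # assumed cost estimate; not something the slides define


def assign_jobs(pending: deque, nodes: list) -> list:
    """Assign queued jobs in FIFO order to the node with the most remaining quota.

    Stops at the first job no node can afford, leaving it and later jobs queued.
    """
    assignments = []
    while pending:
        job = pending[0]
        node = max(nodes, key=Node.remaining)
        if node.remaining() < job.estimated_queries:
            break
        node.used += job.estimated_queries
        assignments.append((job.url, node.name))
        pending.popleft()
    return assignments
```

Jobs left in the queue would simply wait until quotas reset. Strict FIFO order is one design choice among several; a real scheduler might instead let small jobs run ahead of a large one that is blocked.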
[Photos: Warrick Brown; Captain Jim Brass]
http://www.cbs.com/primetime/csi/bios/index.php?cast_member=gary
Brass Architecture
Other Warrick Deployments
• GUI interface for client executable
  – Installation difficulties
  – Lack of Google API keys
• Web interface along with a client application that makes the queries
  – Browser plug-in, Flash, or applet
  – Must manage Google API keys
  – Browser must be left open with continuous Internet access
Conclusions
• Warrick interface is almost ready for the public
• Web interface will likely greatly increase Warrick usage
• Collection of usage data will allow us to better understand what kinds of websites the public is interested in recovering
Frank McCown [email protected]
“And that’s everything there is to know about Brass!”
“Thanks, Dad, but I just wanted to know when you were going to change my diaper…”