Web Archiving Challenges and Opportunities
Presentation for Web archiving Engineering position
Ahmed AlSum, PhD Candidate
Old Dominion University
Outline
• Engineer
– What I did
• Web Archive
– What I know
– What I did
– What I can do for SUL
CCSP Project
• It is an internal IBM support portal that provides client-facing audiences a by-client, holistic view of client situations.
• Technologies: The project is built on IBM technologies (WebSphere Portal and DB2) and deployed on zLinux machines.
• Responsibilities:
– Software engineer.
– Administrator on production and staging.
– Customer support team lead.
– Software engineer team leader.
• Developing Enterprise Applications with J2EE platform technologies for frontend (Servlets, JSP, Portlet APIs), and the support for backend tasks based on EJB.
• Lotus Sametime developer for both plugin and bot development.
• Developing front-end components based on Web 2.0 technologies (AJAX based on Dojo 1.0, and JavaScript).
• Developing and deploying portal solutions on WebSphere Portal.
• WebSphere Portal administration for standalone and clustered environments.
• Administration of Linux and Windows OS.
• DB2 server administration for single and multiple instances with HADR support.
• Leading the customer support activities.
• Supporting some project quality activities.
• Code review and static analysis activities.
• Certifications:
– IBM Certified System Administrator, IBM WebSphere Portal V6.0 (May 2008)
– IBM Certified Solution Developer, XML and Related Technologies (since Mar. 2008)
– IBM Certified Solution Developer, IBM WebSphere Portal V6.0 (since Feb. 2008)
– Sun Certified Web Component Developer for the Java 2 Platform, Enterprise Edition 1.4 (since Jan. 2008)
– Sun Certified Programmer for the Java 2 Platform, SE 5.0 (since Mar. 2007)
– IBM Rational Software Certified, RAD 6.0 Associate Developer (since Apr. 2006)
– Microsoft Certified Professional in Designing and Implementing Desktop Applications with Microsoft Visual C++ 6.0 (since Sep. 2002)
Memento
• Memento is an extension of HTTP that lets the user browse the past web the same way as the current web.
I. Jacobs and N. Walsh. Architecture of the world wide web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/.
[Figure: timeline with points labeled Now, T1, T2, T3]
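As a rough illustration of browsing the past web over HTTP, the sketch below builds the request header a Memento-aware client sends to ask for a resource as it existed at a past time such as T1. The header name `Accept-Datetime` and its date format follow the Memento protocol as later published in RFC 7089. The function name and the chosen datetime are assumptions made for this example, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Return the header asking for the version closest to `when`."""
    # Memento expresses datetimes in GMT using the HTTP-date format.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


# A past moment standing in for T1 (value chosen only for illustration).
t1 = datetime(2007, 5, 31, 20, 35, tzinfo=timezone.utc)
headers = accept_datetime_header(t1)
print(headers["Accept-Datetime"])
```

A client would attach these headers to an ordinary GET request for the original URI's TimeGate; leaving the header out corresponds to asking for "Now".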
Memento
• Memento Aggregator
– Developer and administrator
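The core idea of an aggregator can be sketched as merging the lists of archived copies (TimeMaps) reported by several archives and picking the copy closest to a requested datetime. This is a minimal sketch, not the actual aggregator code: the function names, the in-memory data layout, and the `*.example` URIs are all hypothetical.

```python
from datetime import datetime, timezone


def merge_timemaps(timemaps):
    """Flatten {archive: [(datetime, uri), ...]} into one list sorted by datetime."""
    merged = [
        (when, uri, archive)
        for archive, entries in timemaps.items()
        for when, uri in entries
    ]
    return sorted(merged)


def closest_memento(merged, target):
    """Pick the archived copy whose capture time is nearest to `target`."""
    return min(merged, key=lambda m: abs(m[0] - target))


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical holdings for a single original URI.
timemaps = {
    "IA": [(utc(2005, 1, 1), "http://ia.example/2005/http://example.org/"),
           (utc(2010, 6, 1), "http://ia.example/2010/http://example.org/")],
    "LoC": [(utc(2008, 3, 15), "http://loc.example/2008/http://example.org/")],
}
merged = merge_timemaps(timemaps)
best = closest_memento(merged, utc(2008, 1, 1))
print(best[2], best[1])
```

Keeping the archive name next to each copy lets the caller report which archive served the answer.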
Memento
• Memento Client
– MementoFox: Firefox add-on
– mcurl: command-line client in Perl
• Both were implemented against the Memento Internet-Draft 8.0.
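One task any Memento client has to handle is reading the `Link` header of a response to find the archived copies it points to. The sketch below is a simplified parser written for illustration only. It is not taken from MementoFox or mcurl, it assumes every parameter value is double-quoted, and the `archive.example` URIs are hypothetical. The relation names (`original`, `timemap`, `memento`) are those used by the Memento protocol.

```python
import re

# Matches <uri> followed by any number of ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Return one dict per link with its URI, relation types, and datetime."""
    links = []
    for uri, params in LINK_RE.findall(value):
        attrs = dict(PARAM_RE.findall(params))
        links.append({
            "uri": uri,
            "rel": attrs.get("rel", "").split(),  # rel may hold several types
            "datetime": attrs.get("datetime"),
        })
    return links


# Hypothetical response header.
header = (
    '<http://example.org/>; rel="original", '
    '<http://archive.example/timemap/http://example.org/>; rel="timemap"; '
    'type="application/link-format", '
    '<http://archive.example/20070531203500/http://example.org/>; '
    'rel="first memento"; datetime="Thu, 31 May 2007 20:35:00 GMT"'
)
links = parse_link_header(header)
mementos = [link for link in links if "memento" in link["rel"]]
print(mementos[0]["uri"])
```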
WAT Extraction
• Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls.
• Technologies: Hadoop, Pig Latin, Java.
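To give a feel for the kind of metadata a WAT record carries, the sketch below pulls the target URI and outgoing links from one WAT JSON payload. It uses Python for brevity rather than the Hadoop and Pig Latin stack named above. The nested field names (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`) are an assumption based on commonly published WAT files and should be checked against the specification; the sample record is made up.

```python
import json


def outlinks_from_wat(payload: str):
    """Return (target URI, list of link URLs) from one WAT JSON payload."""
    envelope = json.loads(payload).get("Envelope", {})
    target = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    # Each link entry is assumed to carry its destination under "url".
    urls = [link["url"] for link in html.get("Links", []) if "url" in link]
    return target, urls


# Made-up record following the assumed layout.
sample = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about"},
                        {"path": "IMG@/src", "url": "http://example.org/logo.png"},
                    ]
                }
            }
        },
    }
})
target, urls = outlinks_from_wat(sample)
print(target, urls)
```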
WEB ARCHIVING
Challenges and Opportunities
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of 3rd International Conference on Web Science. pp. 1–8.
Selection
• Decide what to capture
• We studied what is already captured.
How Much of the Web is archived?
• The answer depends on where the sampled URIs come from.
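Percentage of sampled URIs with at least one archived copy (SE = search engine; one row per URI source):
Including SE cache | Excluding SE cache
90% | 79%
97% | 68%
35% | 16%
88% | 19%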
S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. “How much of the Web is Archived?” In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, Ottawa, Canada. 2011.
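The two columns above differ only in whether a search engine cache copy counts as "archived". A minimal sketch of that calculation follows. The sample holdings, the archive labels, and the function name are hypothetical and are not data from the cited study.

```python
# Hypothetical labels for search engine caches.
SEARCH_ENGINE_CACHES = {"google-cache", "bing-cache"}


def coverage(holdings, exclude=frozenset()):
    """Percentage of URIs held by at least one archive not in `exclude`."""
    archived = sum(
        1 for archives in holdings.values() if set(archives) - set(exclude)
    )
    return 100.0 * archived / len(holdings)


# Made-up sample: which archives hold a copy of each URI.
holdings = {
    "http://a.example/": ["IA"],
    "http://b.example/": ["google-cache"],
    "http://c.example/": [],
    "http://d.example/": ["IA", "google-cache"],
}
including = coverage(holdings)
excluding = coverage(holdings, exclude=SEARCH_ENGINE_CACHES)
print(including, excluding)
```

A URI held only by a search engine cache drops out of the second figure, which is why the "excluding" number can never exceed the "including" one.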
Where is it archived?
IA Internet Archive
LoC Library of Congress
IC Icelandic Web Archive
CAN Library and Archives Canada
BL British Library
UK UK National Library
PO Portuguese Web Archive
CAT Web Archive of Catalonia
CR Croatian Web Archive
CZ Archive of the Czech Web
TW National Taiwan University
AIT Archive-It
What is missing?
Selection
• Curator
• Twitter crowdsourcing:
– UK Web Archive: Twittervane.
– Internet Memory: collects URIs from the Twitter APIs.
– VA Tech: CTRnet project.
Web Archive Life Cycle
Harvesting
• Services
– Archive-It
– WAS @ CDLib
• Dedicated server
– Heritrix
Harvesting
• Challenges
– Ajax and Web 2.0/3.0
– Streaming media
– URI challenges (e.g. Twitter hash-bang URIs)
– Mobile
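The hash-bang challenge arises because everything after `#` stays in the browser and is never sent to the server, so a crawler that fetches the URI as written only receives the empty application shell. One convention used at the time (Google's AJAX crawling scheme, since deprecated) rewrote `#!state` into an `_escaped_fragment_` query parameter that a cooperating server could answer. The sketch below illustrates that rewrite. The function name and example URI are hypothetical, and the set of characters left unescaped is a simplification of the scheme's rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_escaped_fragment(uri: str) -> str:
    """Rewrite a hash-bang URI into its crawlable `_escaped_fragment_` form."""
    parts = urlsplit(uri)
    if not parts.fragment.startswith("!"):
        return uri  # not a hash-bang URI, nothing to rewrite
    # Simplified escaping: keep a few common characters readable.
    token = "_escaped_fragment_=" + quote(parts.fragment[1:], safe="/=:,;@")
    query = parts.query + "&" + token if parts.query else token
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(to_escaped_fragment("https://twitter.com/#!/example_user"))
```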
Harvesting
• SiteStory: a transactional web archive
Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013.
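A transactional archive records responses at the moment the web server sends them, instead of crawling the site from outside. The sketch below shows that idea as a small WSGI middleware. It is not SiteStory's implementation: the class name, the in-memory record list, and the demo application are assumptions made for illustration, and details such as calling `close()` on the wrapped response are omitted.

```python
from datetime import datetime, timezone


class TransactionalArchive:
    """WSGI middleware that keeps a copy of every response the server sends."""

    def __init__(self, app):
        self.app = app
        self.records = []  # in-memory store; a real archive would persist these

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # Remember status and headers, then pass them on unchanged.
            captured["status"] = status
            captured["headers"] = list(headers)
            return start_response(status, headers, exc_info)

        # Buffer the whole body so it can be both archived and returned.
        body = b"".join(self.app(environ, recording_start_response))
        self.records.append({
            "uri": environ.get("PATH_INFO", ""),
            "served_at": datetime.now(timezone.utc),
            "status": captured.get("status"),
            "headers": captured.get("headers", []),
            "body": body,
        })
        return [body]


def demo_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello, archive"]


archive = TransactionalArchive(demo_app)
response = archive(
    {"PATH_INFO": "/index.html"},
    lambda status, headers, exc_info=None: None,
)
print(archive.records[0]["uri"], archive.records[0]["status"])
```

Buffering each body adds work to every request, which is the kind of overhead a load-testing tool like ApacheBench would measure.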
Web Archive Life Cycle
Storage
• Flat files:
– WARC files (ISO standard)
• NoSQL database:
– Internet Memory
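To make the flat-file option concrete, the sketch below writes and re-reads a single WARC record: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. It is a minimal sketch under stated assumptions, not a substitute for a WARC library. The helper names and sample payload are hypothetical, and only a handful of header fields are included.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, when: datetime) -> bytes:
    """Serialize one 'resource' record; `when` is expected to be in UTC."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8")
    # Blank line ends the headers; two CRLFs end the record.
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"


def parse_warc_headers(record: bytes):
    """Return the version line and a dict of header fields."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    return lines[0], fields


record = build_warc_record(
    "http://example.org/",
    b"<html></html>",
    datetime(2013, 1, 1, tzinfo=timezone.utc),
)
version, fields = parse_warc_headers(record)
print(version, fields["WARC-Target-URI"], fields["Content-Length"])
```

Because each record carries its own length, many records can be appended to one file and read back sequentially.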
Storage
• Choosing the wrong storage solution could be a disaster.
Access