Web Archiving Challenges and Opportunities Presentation for Web archiving Engineering position
404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web...
Transcript of 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web...
![Page 1: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/1.jpg)
404: Archiving the WebRyder Kouba
Digital Collections ArchivistRare Books and Special Collections Library
The American University in Cairo
![Page 2: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/2.jpg)
What is web archiving?
- Collecting and preserving content from the web for future use
- Began in 1996 with the Internet Archive and various national libraries
- Seeds, crawls, captures, warc
![Page 3: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/3.jpg)
Why archive the web?
- Necessary to document society - Temporary lifespan of content
online- Continuation of current collections
(e.g. newspapers)- More evidentiary value than
printing
![Page 4: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/4.jpg)
Web content is vulnerable
![Page 5: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/5.jpg)
Accountability
![Page 6: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/6.jpg)
Websites as records
![Page 7: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/7.jpg)
![Page 8: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/8.jpg)
Culture
![Page 9: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/9.jpg)
Common tools
- Wayback Machine- Archive-It- Webrecorder.io- Mirrorweb- Heritrix- WARCreate- Archive.is
![Page 10: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/10.jpg)
Wayback Machine
- Easy to preserve and provide access to websites
- Difficult to organize seeds and metadata- Currently free- No control over archived data
![Page 11: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/11.jpg)
Archive-It
- Subscription service AUC currently uses- Ability to organize collections, create metadata, etc.- More technical options- Limits on amount of data that can be saved- Requires more time and resources than other options
![Page 12: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/12.jpg)
Archive-It
- Automated crawls- Flexible scheduling
![Page 13: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/13.jpg)
Other options
- Webrecorder.io - the least mysterious web crawler in history
- Heritrix: Open-source Internet Archive crawler
![Page 14: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/14.jpg)
The harsh reality
![Page 15: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/15.jpg)
Building collections
- How can web archiving compliment existing collections?
- What is the collection scope?
- Who will be responsible for what tasks?
![Page 16: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/16.jpg)
Other workflow challenges
- Constant changes and updates- Managing/organizing seeds- Quality Assurance
![Page 17: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/17.jpg)
Description and access
- Necessary?- Level of description?- OCLC recently
released “Descriptive Metadata for Web Archiving”
![Page 18: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/18.jpg)
Searching a phonebook
- Wayback search is not full-text
- Currently very helpful to know the URL of a (former) site
![Page 19: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/19.jpg)
Access through Archive-It
![Page 20: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/20.jpg)
Ethical issues
- Require permission from creators?
- Privacy by obscurity?- Expectation of privacy?
![Page 21: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/21.jpg)
![Page 22: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/22.jpg)
Digital Humanities
![Page 23: 404: Archiving the Web · What is web archiving? - Collecting and preserving content from the web for future use - Began in 1996 with the Internet Archive and various national libraries](https://reader034.fdocuments.us/reader034/viewer/2022042613/5f8cbb544c110a56bd1c4e76/html5/thumbnails/23.jpg)