Jefferson Bailey, Director, Web Archiving, Internet Archive @jefferson_bail
Abbie Grotke, Web Archiving Team Lead, Library of Congress @agrotke
Mark Phillips, Associate Dean for Digital Libraries, UNT Libraries @vphill
Harvesting Democracy: Archiving Federal Government Web Content at End of Term
AALL | July 17, 2016
it all began a long, long, time ago, in a far away place
https://flic.kr/p/4N2jHU
https://flic.kr/p/4JNkLE
original end of term web archive partners
for 2008/2012 - all IIPC & NDIIPP/NDSA partners
extant gov web archiving efforts
Capture, Preservation, & Access
• LOC: .gov, election, other
• GPO: agency sites, often ephemeral
• NARA: congressional web harvest every 2 years
• IA: global & curated crawls
• Agency-level: NIH/NLM, DOE, DOL, HHS, CMS, others, using AIT or comm tools
• UNT & Others: Topical .gov collecting
Community Efforts
• Federal Web Archiving Group
  • most of those at left plus other feds
• Research Initiatives
  • academic
  • NGO or watchdog
• Citizen Driven
  • grassroots efforts
• End of Term
  • focused but large-scale multi-institutional project
goals of the end of term project
◻ work collaboratively to preserve public U.S. Government websites
◻ document federal agencies’ presence on the web during the end of Presidential terms
◻ enhance the existing research collections of the partner institutions
◻ raise awareness about the need for preservation
◻ engage with researchers and subject experts
eot collaborative distribution of work
• IA: crawling, preservation, access, full-text search
• LC: crawling, preservation, data transfers
• UNT: nomination tool development, crawling, nomination mgmt, preservation, access
• CDL: web portal, metadata
• GPO: URL nomination, outreach
• All: URL contributions, outreach, project management
• Others: URLs, education
Some variance of roles between 2008 & 2012 (and for 2016)
https://flic.kr/p/8uMXjb
major funding brought to you by….
https://flic.kr/p/8uMXjb
no one
defining the “government web presence”
Stanford WebBase Project
2004 crawl list of URLs
and people like you!
.gov websites proliferate like invasive species
and yes, invasivespecies.gov once existed
some are non-public or unlisted
“web waste” & preservation mentalities
end of term web archive
http://eotarchive.cdlib.org/
affiliated efforts
http://www.thinkingprojects.org/rabina_cocciolo_peet_EOT.pdf
https://twitter.com/eotarchive
eot extent
In Internet Archive
• EOT 2008
  • ~3,000 seeds
  • ~102m URLs (~160m total across partners)
  • 17.95 TB (compressed)
  • multiple crawls & duplication
• EOT 2012
  • ~5,500 seeds
  • ~45m URLs (~120m total across partners)
  • 18.60 TB (compressed)
  • more focused crawls & deduped
• Similar data sizes, but 2012 had fewer URLs
• 2012 notable for media richness, uniqueness, density
eot stats 2008 and 2012
http://vphill.com/journal/post/5861/
http://vphill.com/journal/post/5872/
EOT2008-EOT2012 – TLD biggest change
researcher access to .gov
Researchers: PoliSci, Comms, Legal, Informatics, CS
Project: Mining ~100TB of .gov data
Pros: Data w/ services, subsidized cluster, collaborative structure, some R&D
Cons: Low up-take, tech hurdles, resource constraints
Lessons Learned: Researcher use of “big data” of web archives produces challenges of scale, processing, expertise, and familiarity with context and provenance.
researcher access to .gov
Web Archive Datasets (via platform, disk, APIs, whatever)
• WAT Datasets (Web Archive Transformation): Key Metadata from Every Resource
• LGA Datasets (Longitudinal Graph Analysis): What Links to What over Time
• WANE Datasets (Web Archive Named Entities): Names of People, Places, Organizations
http://webarchives.ca/
http://www.websci16.org/hackathon
http://archivesunleashed.com/
https://github.com/vinaygoel/ars-workshop
researcher access to .gov
wbm beta access to .gov
https://web-beta.archive.org
wbm beta access to .gov (ppt/pdf)
https://waybacksearch.archivelab.org:8091
Federal Government Web Archiving Working Group
rough timeframe for 2016 project
2016
◻ July 2016: Recruitment of subject experts/nominators to help identify additional websites for prioritized crawling. Today is the kickoff!
◻ September 2016: Bookend (baseline) crawl of government web domains begins.
◻ Fall 2016: Partners will crawl various aspects of government domains at varying frequencies, depending on selection policies/interests. Team will determine strategy for crawling prioritized websites.
◻ November 2016 - February 2017: Crawl of prioritized websites, continued crawls of bulk lists.
2017
◻ January 2017: Focused crawls will be conducted as needed during this period, particularly around Inauguration Day.
◻ Spring or Summer 2017: Bookend crawl of all seeds, plus additional crawl of prioritized websites as determined by team.
eot 2016 opportunities
• Expand Acquisition
  • distribute crawling
  • deploy new tech
  • build web archiving capacity
• Nomination and Annotation
  • community engagement
  • contributed seed lists
  • educational opportunities
• Researcher Engagement
  • notable longitudinal breadth
  • good periodicity for data-mining
  • growing community of interest
• More Partners!
eot 2016 strategies
• Potential Project Strategies
  • distributed crawling – deduped/replay?
  • coordinated outreach – affiliate communities?
  • more listserv & project interest
  • researcher access – datasets and hosts?
• Access & Preservation
  • updated portal w/ FTS for all 3 eots
  • single replay WB
  • distributed preservation?
eot challenges
• Same ol’ web challenges
  • complexity of content
  • volume & proliferation
  • “you get what you get” w/ little cataloging or QA
• Distribution of work
  • more partners = more project/partner mgmt
  • contributed seed lists
• Resource constraints
  • the “it isn’t anyone’s actual job” problem
  • tech, time limitations & scale of data
  • funding = ☹
eot 2016 content
• Content
  • 7,000+ social media accounts (scrape of gov SM registry API) 44% FB, 37% TW, 10% YT
  • ~6,000 known seeds (via gov data, WB, FOIA)
  • ??? of gov on non-gov domains/seeds
  • more crowdsourced, curatorial nominations
gov,dontserveteens)
gov,dot)
gov,dot,adfs)
gov,dot,fastlane)
gov,dot,fhwa)
gov,dot,fhwa,borderplanning)
gov,dot,fhwa,collaboration)
gov,dot,fhwa,efl)
gov,dot,fhwa,environment)
gov,dot,fhwa,fhwapap04)
gov,dot,fhwa,flh)
gov,dot,fhwa,international)
gov,dot,fhwa,mutcd)
gov,dot,fhwa,nhi)
gov,dot,fhwa,ops)
gov,dot,fhwa,safety)
gov,dot,fhwa,wfl)
gov,dot,fhwa,wwwcf)
gov,dot,fmcsa)
gov,dot,fmcsa,ai)
gov,dot,fmcsa,cms)
gov,dot,fmcsa,csa)
gov,dot,fmcsa,csa2010)
gov,dot,fmcsa,li-public)
gov,dot,fmcsa,mrb)
gov,dot,fmcsa,nrcme)
gov,dot,fmcsa,safer)
gov,dot,fra)
gov,dot,fra,safetydata)
gov,dot,fta)
gov,dot,fta,transit-safety)
gov,dot,isddc)
gov,dot,its)
gov,dot,its,benefitcost)
gov,dot,its,pcb)
gov,dot,its,standards)
gov,dot,marad)
gov,dot,nhtsa)
gov,dot,nhtsa,www-esv)
gov,dot,nhtsa,www-fars)
gov,dot,nhtsa,www-nrd)
gov,dot,nhtsa,www-odi)
gov,dot,oig)
gov,dot,ost,airconsumer)
gov,dot,ost,dotcr)
gov,dot,ost,dothr)
gov,dot,ost,testimony)
gov,dot,phmsa)
gov,dot,phmsa,npms)
gov,dot,phmsa,opsweb)
gov,dot,phmsa,primis)
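The host keys above look like SURT-style entries: domain labels reversed, comma-separated, and closed with `)`, a form commonly seen in web archive indexes. A minimal sketch, assuming that convention holds for this list, of turning such a key back into an ordinary hostname. The function name `surt_to_host` is hypothetical and is not part of any EOT tooling.

```python
def surt_to_host(surt_key: str) -> str:
    """Convert a SURT-style host key such as 'gov,dot,fhwa)' to 'fhwa.dot.gov'."""
    # Keep only the host portion: everything before the first ')'.
    host_part = surt_key.split(")", 1)[0]
    # Labels are stored most-significant first; skip empty labels, then reverse.
    labels = [label for label in host_part.split(",") if label]
    return ".".join(reversed(labels))


if __name__ == "__main__":
    # A few keys taken from the list above.
    for key in ["gov,dot)", "gov,dot,fhwa,mutcd)", "gov,dot,fmcsa,li-public)"]:
        print(key, "->", surt_to_host(key))
```

Because the reversed form sorts every subdomain next to its parent, a list like this gives a quick view of how many distinct hosts sit under one agency domain.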
http://digital2.library.unt.edu/nomination/eth2016/
eot 2016: how you can help
◻ any and all nominations welcome
◻ we need particular help with:
  ⬜ judicial branch websites
  ⬜ government content on non-government domains (.com, .edu, etc.)
  ⬜ important content or subdomains on very large websites (such as NASA.gov) that might be related to current Presidential policies
  ⬜ Social media
further information and the form to submit: http://digital2.library.unt.edu/nomination/eth2016
going forward
• Crawl it All!
  • Community opportunity for more distributed crawling and acquisition methods
• Access it All!
  • Unified portal and search indices
  • New access models, user groups, analytical tools
• Preserve it All!
  • Take our WARCs and datasets, please!
• Join the Fun of it All!
  • Email: [email protected] (or any of us)
THANKS!