Archival Web Research Datasets -...

Post on 11-Aug-2020


Archival Web Research Datasets
www.archive.org

Internet Archive

• Established in 1996
• 501(c)(3) nonprofit organization
• Over twenty petabytes (compressed) of publicly accessible archival material
• Technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
• Currently archiving books, texts, film, video, audio, images, software, educational content and the Internet


IA Web Archive

• Began in 1996
• 415+ billion publicly accessible web instances
• Operate web wide, survey, end of life, selective, & resource specific harvests
• Develop freely available, open source, web archiving & access tools

Approaches to Collecting…

• Thematic / Topical collections
• Resource specific crawls: PDFs, videos, etc.
• Exhaustive: end of life, closure crawls, national domain crawls for .au, .es, .fr, .il, .nz, .se etc.
• Broad survey crawls: domain wide for .org/.net/.edu/.gov/.com
• No more 404’s project

Primary Methods of Web Harvest

• Proprietary web crawlers that harvest and preserve docs using the ARC file standard
– E.g.

• Open source web crawls that harvest and preserve docs using the W/ARC file standard
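As a rough illustration of what a record in the WARC format looks like on disk, the following minimal sketch reads a single uncompressed record using only the Python standard library. It assumes the basic WARC layout (a `WARC/1.0` version line, `Name: value` header fields, a blank line, then `Content-Length` bytes of content followed by two CRLFs). The function name `read_warc_record` and the sample record are invented for illustration; the sample omits fields that real records carry, and real crawl files are usually gzip-compressed, so this is not a substitute for a dedicated WARC library.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record; return (version, headers, body)."""
    version = stream.readline().strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)

    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):  # blank line ends the header block
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    body = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # skip the two CRLFs that terminate the record
    return version, headers, body


# Hypothetical sample record, built by hand for this sketch.
SAMPLE_BODY = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2011-03-09T00:00:00Z\r\n"
    b"Content-Length: " + str(len(SAMPLE_BODY)).encode("ascii") + b"\r\n"
    b"\r\n" + SAMPLE_BODY + b"\r\n\r\n"
)

version, headers, body = read_warc_record(io.BytesIO(SAMPLE))
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

Calling the function repeatedly on the same stream would walk through consecutive records, since each call consumes exactly one record including its trailing separator.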

Public Data Extractions & Crawl Datasets: ArchiveHub

• Hurricane Katrina extraction
• Senate.gov/House.gov extractions
• NARA Congressional Crawls (2006 – 2012)
• Occupy Wall Street extraction
• Superstorm Sandy extraction
• US Media extraction

Public dataset: Wide – 00002 (2011)

• Available for download
• Crawl start: 09 Mar, 2011
• Crawl end: 23 Dec, 2011
• Captures: 2.7 Billion
• Unique URLs: 2.2 Billion
• Hosts: 29 Million
• Size: 80 TB
• Contact: info@archive.org

Public dataset: Wide – 00005 (2012)

• Available soon!
• Crawl start: 30 Apr, 2012
• Crawl end: 11 Sep, 2012
• Captures: 11.6 Billion
• Unique URLs: 4.2 Billion
• Hosts: 31 Million
• Size: 360 TB
• Release paired with hackathons
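To put the two Wide crawls side by side, the sketch below derives a few ratios from the figures quoted on the slides: captures per unique URL, unique URLs per host, and average size per capture. The dictionary layout and the `summarize` helper are made up for this example, and the size calculation assumes decimal terabytes (1 TB = 10^12 bytes); the slides do not say whether the sizes are compressed, so the per-capture figure is only indicative.

```python
TB = 10**12  # assumption: decimal terabytes

# Figures copied from the two "Public dataset: Wide" slides.
CRAWLS = {
    "Wide-00002 (2011)": {
        "captures": 2.7e9, "unique_urls": 2.2e9, "hosts": 29e6, "size_tb": 80,
    },
    "Wide-00005 (2012)": {
        "captures": 11.6e9, "unique_urls": 4.2e9, "hosts": 31e6, "size_tb": 360,
    },
}


def summarize(crawl):
    """Derive simple per-URL, per-host and per-capture ratios for one crawl."""
    return {
        "captures_per_url": crawl["captures"] / crawl["unique_urls"],
        "urls_per_host": crawl["unique_urls"] / crawl["hosts"],
        "avg_kb_per_capture": crawl["size_tb"] * TB / crawl["captures"] / 1000,
    }


for name, crawl in CRAWLS.items():
    stats = summarize(crawl)
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

On these numbers the 2012 crawl revisits URLs more often (about 2.8 captures per unique URL versus about 1.2 in 2011), while the average size per capture stays close to 30 KB in both.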

Public dataset: .gov (1995 – 2013)

• Hosted by Altiscale (https://www.altiscale.com/)
• Date range: 1995 – September 30, 2013
• Captures: 4.1 Billion
• Size: 285 TB
• Deduped and compressed size is ~90 TB, plus indexes
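The effect of deduplication and compression on the .gov dataset can be worked out directly from the two sizes above. This is a back-of-the-envelope sketch: the `reduction` helper is hypothetical, the ~90 TB figure is approximate, and it excludes the indexes, whose size the slide does not give.

```python
def reduction(raw_tb, stored_tb):
    """Return (stored/raw ratio, fraction of space saved)."""
    ratio = stored_tb / raw_tb
    return ratio, 1 - ratio


# 285 TB raw and ~90 TB deduped and compressed, as quoted on the slide.
ratio, saved = reduction(285, 90)
print(f"stored/raw = {ratio:.2f}, space saved = {saved:.0%}")
```

In other words, the stored form is a little under a third of the raw size, before indexes are added back.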

Additional Extracted datasets

Extracted longitudinal web data for

• .uk
• .pt
• .ie
• .il
• .is
• .dk
• .fr
• .de (in process)

Contact respective national libraries for access!

500,000+, 1,643,000+, 2,000,000+, 2,500,000+, 6,300,000+, 415,000,000,000+

Books, Moving Images, Audio Recordings, Hours of TV, Digital Texts, Archived Web Pages