Archival Web Research Datasets -...

Transcript of Archival Web Research Datasets -...

Page 1: Archival Web Research Datasets - wp.comminfo.rutgers.eduwp.comminfo.rutgers.edu/.../2014/07/Carpenter_WebResearchIA_Jun… · Archival Web Research Datasets . Internet Archive •

Archival Web Research Datasets
www.archive.org

Page 2

Internet Archive

• Established in 1996
• 501(c)(3) nonprofit organization
• Over twenty petabytes (compressed) of publicly accessible archival material
• Technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
• Currently archiving books, texts, film, video, audio, images, software, educational content, and the Internet

Page 3

IA Web Archive

• Began in 1996
• 415+ billion publicly accessible web instances
• Operate web-wide, survey, end-of-life, selective, & resource-specific harvests
• Develop freely available, open source web archiving & access tools

Page 4

Approaches to Collecting…

• Thematic / Topical collections
• Resource-specific crawls: PDFs, videos, etc.
• Exhaustive: end-of-life, closure crawls, national domain crawls for .au, .es, .fr, .il, .nz, .se, etc.
• Broad survey crawls: domain-wide for .org/.net/.edu/.gov/.com
• No more 404’s project
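The national domain crawls listed above are scoped by top-level domain. As a rough illustration of that idea only, the sketch below checks whether a URL's host falls under one of the listed TLDs. The function name `in_national_scope` and the constant `NATIONAL_TLDS` are hypothetical, and production crawlers apply much richer scoping rules than a suffix check.

```python
from urllib.parse import urlsplit

# TLDs named on the slide; the set itself is only an illustrative constant.
NATIONAL_TLDS = {"au", "es", "fr", "il", "nz", "se"}


def in_national_scope(url, tlds=NATIONAL_TLDS):
    """Return True if the URL's host ends in one of the given TLDs."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if no host
    if not host:
        return False
    # Take the last dot-separated label, ignoring a trailing root dot.
    return host.rstrip(".").rsplit(".", 1)[-1] in tlds


if __name__ == "__main__":
    for candidate in ("http://www.nla.gov.au/about", "https://archive.org/web/"):
        print(candidate, in_national_scope(candidate))
```

A suffix check like this treats every host under a TLD as in scope; it says nothing about sites of national interest hosted under generic domains such as .com.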

Page 5

Primary Methods of Web Harvest

• Proprietary web crawlers that harvest and preserve docs using the ARC file standard
– E.g.
• Open source web crawls that harvest and preserve docs using the W/ARC file standard
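The W/ARC bullet above refers to a container format in which each record consists of a version line, a block of named header fields, a blank line, and a payload whose length is given by `Content-Length`. The reader below is a minimal sketch of that layout, assuming an uncompressed WARC stream held in memory. Real crawl files are typically gzip-compressed and carry further mandatory fields such as `WARC-Record-ID` and `WARC-Date`, which the sample records here omit for brevity. `iter_warc_records` and `make_record` are hypothetical helper names, not part of any Internet Archive tool.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected WARC version line, got {version!r}")
        headers = {"version": version}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, uri, body):
    """Build a simplified sample record (omits some mandatory WARC fields)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    make_record("response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
    + make_record("resource", "http://example.org/a.pdf", b"%PDF-1.4")
)
records = list(iter_warc_records(io.BytesIO(sample)))

if __name__ == "__main__":
    for headers, payload in records:
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

For real datasets a maintained WARC library is a safer choice than a hand-rolled parser, since it handles per-record compression and the older ARC layout as well.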

Page 6

Public Data Extractions & Crawl Datasets: ArchiveHub

• Hurricane Katrina extraction
• Senate.gov/House.gov extractions
• NARA Congressional Crawls (2006 – 2012)
• Occupy Wall Street extraction
• Superstorm Sandy extraction
• US Media extraction

Page 7

Public dataset: Wide – 00002 (2011)

• Available for download
• Crawl start: 09 Mar, 2011
• Crawl end: 23 Dec, 2011
• Captures: 2.7 Billion
• Unique URLs: 2.2 Billion
• Hosts: 29 Million
• Size: 80 TB
• Contact: [email protected]

Page 8

Public dataset: Wide – 00005 (2012)

• Available soon!
• Crawl start: 30 Apr, 2012
• Crawl end: 11 Sep, 2012
• Captures: 11.6 Billion
• Unique URLs: 4.2 Billion
• Hosts: 31 Million
• Size: 360 TB
• Release paired with hackathons
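The figures on the two Wide crawl slides can be turned into rough per-capture averages, which helps when estimating storage and processing needs. The sketch below does that arithmetic, assuming decimal terabytes (10^12 bytes) and treating the rounded slide figures as exact. The names `CRAWLS` and `summarize` are illustrative only.

```python
TB = 10**12  # assumption: the slides use decimal terabytes

# Rounded figures copied from the Wide 00002 and Wide 00005 slides.
CRAWLS = {
    "Wide 00002 (2011)": {"captures": 2.7e9, "unique_urls": 2.2e9, "size_bytes": 80 * TB},
    "Wide 00005 (2012)": {"captures": 11.6e9, "unique_urls": 4.2e9, "size_bytes": 360 * TB},
}


def summarize(stats):
    """Derive average captures per unique URL and average KB per capture."""
    return {
        "captures_per_url": stats["captures"] / stats["unique_urls"],
        "kb_per_capture": stats["size_bytes"] / stats["captures"] / 1000,
    }


if __name__ == "__main__":
    for name, stats in CRAWLS.items():
        s = summarize(stats)
        print(f"{name}: {s['captures_per_url']:.2f} captures/URL, "
              f"{s['kb_per_capture']:.1f} KB/capture")
```

Under these assumptions the 2011 crawl averages about 1.23 captures per unique URL and the 2012 crawl about 2.76, with both near 30 KB per capture.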

Page 9

Public dataset: .gov (1995 – 2013)

• Hosted by Altiscale (https://www.altiscale.com/)
• Date range: 1995 – September 30, 2013
• Captures: 4.1 Billion
• Size: 285 TB
• Deduped and compressed size is ~90 TB, plus indexes
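As a quick check on the .gov figures, the sketch below computes how much of the original 285 TB remains after deduplication and compression. It assumes decimal terabytes, takes the approximate ~90 TB figure at face value, and excludes the indexes, whose size the slide does not give. The function name `reduction` is illustrative.

```python
TB = 10**12  # assumption: decimal terabytes


def reduction(raw_bytes, reduced_bytes):
    """Return (fraction remaining, fraction saved) after dedup and compression."""
    remaining = reduced_bytes / raw_bytes
    return remaining, 1 - remaining


raw_size = 285 * TB      # size stated on the slide
reduced_size = 90 * TB   # approximate figure from the slide, excluding indexes
captures = 4.1e9

remaining, saved = reduction(raw_size, reduced_size)
avg_kb_per_capture = raw_size / captures / 1000

if __name__ == "__main__":
    print(f"remaining: {remaining:.1%}, saved: {saved:.1%}")
    print(f"average size per capture (before reduction): {avg_kb_per_capture:.1f} KB")
```

With these inputs roughly a third of the original volume remains, and the average capture before reduction is about 70 KB.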

Page 10

Additional Extracted Datasets

Extracted longitudinal web data for

• .uk
• .pt
• .ie
• .il
• .is
• .dk
• .fr
• .de (in process)

Contact respective national libraries for access!

Page 11

• 500,000+ Books
• 1,643,000+ Moving Images
• 2,000,000+ Audio Recordings
• 2,500,000+ Hours of TV
• 6,300,000+ Digital Texts
• 415,000,000,000+ Archived Web Pages