Archival Web Research Datasets -...

Archival Web Research Datasets
www.archive.org

Internet Archive

• Established in 1996
• 501(c)(3) non-profit organization
• Over twenty petabytes (compressed) of publicly accessible archival material
• Technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
• Currently archiving books, texts, film, video, audio, images, software, educational content and the Internet


IA Web Archive

• Began in 1996
• 415+ billion publicly accessible web instances
• Operate web-wide, survey, end-of-life, selective, and resource-specific harvests
• Develop freely available, open source web archiving and access tools

Approaches to Collecting…

• Thematic / topical collections
• Resource-specific crawls: PDFs, videos, etc.
• Exhaustive: end-of-life crawls, closure crawls, national domain crawls for .au, .es, .fr, .il, .nz, .se, etc.
• Broad survey crawls: domain-wide for .org/.net/.edu/.gov/.com
• No More 404s project

Primary Methods of Web Harvest

• Proprietary web crawlers that harvest and preserve docs using the ARC file standard
  – E.g.

• Open source web crawls that harvest and preserve docs using the W/ARC file standard
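To make the W/ARC reference concrete, here is a minimal sketch of how a WARC record is laid out and read: a version line, named header fields, a blank line, then a content block whose size is given by Content-Length. The function name `read_warc_records` and the sample record (URL, record ID, date) are made up for illustration, and the sketch assumes an uncompressed stream; real crawl files are typically gzip-compressed and are better handled with an existing WARC library.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration only; it does no validation beyond
    checking the version line and requires a Content-Length header.
    """
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank lines separate consecutive records
        if not version.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)

        # Header fields are "Name: value" lines, terminated by an empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Illustrative record (all values invented): an HTTP response as the block.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n",
    b"WARC-Date: 2011-03-09T00:00:00Z\r\n",
    b"WARC-Target-URI: http://example.org/\r\n",
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n",
    b"\r\n",
    body,
    b"\r\n\r\n",
])

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Reading by Content-Length rather than scanning for a delimiter is what lets one file hold many captures of arbitrary binary content back to back.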

Public Data Extractions & Crawl Datasets: ArchiveHub

• Hurricane Katrina extraction
• Senate.gov/House.gov extractions
• NARA Congressional Crawls (2006 – 2012)
• Occupy Wall Street extraction
• Superstorm Sandy extraction
• US Media extraction

Public dataset: Wide – 00002 (2011)

• Available for download
• Crawl start: 09 Mar, 2011
• Crawl end: 23 Dec, 2011
• Captures: 2.7 Billion
• Unique URLs: 2.2 Billion
• Hosts: 29 Million
• Size: 80 TB
• Contact: info@archive.org

Public dataset: Wide – 00005 (2012)

• Available soon!
• Crawl start: 30 Apr, 2012
• Crawl end: 11 Sep, 2012
• Captures: 11.6 Billion
• Unique URLs: 4.2 Billion
• Hosts: 31 Million
• Size: 360 TB
• Release paired with hackathons

Public dataset: .gov (1995 – 2013)

• Hosted by Altiscale (https://www.altiscale.com/)
• Date range: 1995 – September 30, 2013
• Captures: 4.1 Billion
• Size: 285 TB
• Deduped and compressed size is ~90 TB, plus indexes
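As a rough reading of the dataset figures on these slides, the sketch below derives a few ratios: captures per unique URL, average size per capture, and the space saved by deduplication and compression on the .gov set. It assumes decimal terabytes (10^12 bytes) and takes the rounded slide figures at face value, so the results are approximate; the variable and function names are illustrative.

```python
TB = 10**12  # assumption: the slides use decimal terabytes

# Figures copied from the dataset slides (rounded as presented).
datasets = {
    "Wide-00002 (2011)": {"captures": 2.7e9, "unique_urls": 2.2e9, "size_tb": 80},
    "Wide-00005 (2012)": {"captures": 11.6e9, "unique_urls": 4.2e9, "size_tb": 360},
    ".gov (1995-2013)": {"captures": 4.1e9, "unique_urls": None, "size_tb": 285},
}


def summarize(stats):
    """Return derived ratios for one dataset; None where a figure is missing."""
    urls = stats["unique_urls"]
    return {
        "captures_per_url": stats["captures"] / urls if urls else None,
        "kb_per_capture": stats["size_tb"] * TB / stats["captures"] / 1000,
    }


summary = {name: summarize(stats) for name, stats in datasets.items()}

# .gov: 285 TB as stated, roughly 90 TB after deduplication and compression.
gov_saving = 1 - 90 / 285

for name, row in summary.items():
    print(name, row)
print("approximate .gov space saving: {:.0%}".format(gov_saving))
```

The higher captures-per-URL figure for the 2012 crawl indicates that URLs were, on average, captured more often in that crawl than in the 2011 one.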

Additional Extracted Datasets

Extracted longitudinal web data for:

• .uk
• .pt
• .ie
• .il
• .is
• .dk
• .fr
• .de (in process)

Contact respective national libraries for access!

500,000+, 1,643,000+, 2,000,000+, 2,500,000+, 6,300,000+, 415,000,000,000+

Books, Moving Images, Audio Recordings, Hours of TV, Digital Texts, Archived Web Pages
