
Slides from Humanities on the Web: Is it working?

Date: Thursday, 19 March 2009, 10-4
Location: Oxford University, Oxford, UK
Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275
Slide URL: http://www.slideshare.net/etmeyer/WWWoH

Afternoon Event: 1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in conjunction with the Internet Archive: The World Wide Web of Humanities

OII: Selecting and analysing the sample WWI and WWII collections (Christine Madsen & Dr. Eric Meyer)
The Internet Archive: Extracting the data (Molly Bragg)
Hanzo Archives Ltd.: Working with the data (Mark Middleton)
Discussion and questions

Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238

Description

Presentations from the Oxford Internet Institute, the Internet Archive, and Hanzo Archives Ltd on the results of a JISC-NEH-funded transatlantic digitisation project.

Transcript of Document

Page 1: Document

Slides from Humanities on the Web: Is it working?

Date: Thursday, 19 March 2009, 10-4
Location: Oxford University, Oxford, UK
Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275
Slide URL: http://www.slideshare.net/etmeyer/WWWoH

Afternoon Event: 1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in conjunction with the Internet Archive: The World Wide Web of Humanities

OII: Selecting and analysing the sample WWI and WWII collections (Christine Madsen & Dr. Eric Meyer)

The Internet Archive: Extracting the data (Molly Bragg)
Hanzo Archives Ltd.: Working with the data (Mark Middleton)
Discussion and questions

Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238

Page 2: Document

Selecting and Analysing the WWI and WWII collections

Christine Madsen
Eric Meyer

19 March 2009

Page 3: Document

Why WWI and WWII?

Many branches of the humanities

History, Journalism, Art, Art history, Advertising, Literature, Poetry, Political science, Military history

Page 4: Document

Why WWI and WWII?

Well-rounded set of materials

Page 5: Document

Why WWI and WWII?

Language

Document types

Top-level domains

Secondary domains

Page 6: Document

Building the Collection

Selected from the live web

Supplemented with keyword searches in the Archive

Page 7: Document

Building the Collection

Seeds are: the website or portion of the website that you plan to include in your collection

[Diagram: Initial Collection built from Seed 1, Seed 2 and Seed 3]

Page 8: Document

Building the Collection

[Diagram: Expanded Collection, with Seeds 1-6 each linking out to further websites]

A seed is also a website from which additional sites can be discovered via the site's hyperlinks

Page 9: Document

Building the Collection

Started with WWI

Too small: under 1,000,000 pages/objects (target was 250 million)

Page 10: Document

Building the Collection

Expanded to WWII

Final collection: 5,362,425 unique URLs

Page 11: Document

Building the Collection

‘World War One’

‘World War I’

‘First world war’

‘World War II’

‘World War Two’

‘the great war’

‘Première Guerre Mondiale’ (French: First World War)

‘zweiter Weltkrieg’ (German: Second World War)

Page 12: Document

Building the Collection

Record links from the first 20 pages of search results (see the sketch below)

Follow links (including dead links)

Return to ‘hub’ sites for further analysis
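As a purely illustrative sketch of this link-recording step, assuming the result pages have already been saved locally as HTML files; the file names and the record_links helper are invented for illustration and are not part of the project's actual tooling.

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def record_links(saved_pages):
    """saved_pages: list of (local_html_path, original_url) pairs,
    e.g. the first 20 saved result pages of a 'World War I' search."""
    seen = set()
    for path, url in saved_pages:
        parser = LinkCollector(url)
        with open(path, encoding="utf-8", errors="replace") as fh:
            parser.feed(fh.read())
        seen.update(parser.links)   # dead links are deliberately kept
    return sorted(seen)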

Page 13: Document

Building the Collection

Expanding scope:

http://www.greatwar.co.uk/westfront/Somme/index.htm => http://www.greatwar.co.uk

Page 14: Document

Building the Collection

Expanding scope:

memory.loc.gov/ammem/collections/maps/wwii/index.html => www.memory.loc.gov/ammem/collections/maps/wwii/

Page 15: Document

Building the Collection

Dealing with illogical or flat directory structures (see the sketch below):

www.eyewitnesstohistory.com/ <= don’t want whole site

www.eyewitnesstohistory.com/blitzkrieg.htm
www.eyewitnesstohistory.com/dday.html
www.eyewitnesstohistory.com/midway.htm
www.eyewitnesstohistory.com/airbattle.htm
www.eyewitnesstohistory.com/dunkirk.htm
www.eyewitnesstohistory.com/francesurrenders.htm
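One hedged way to express this kind of scoping decision in code; the seed sets and the in_scope helper below are invented for illustration and are not the project's actual configuration.

from urllib.parse import urlsplit

# Whole-site seeds: everything under these hosts is in scope.
SITE_SEEDS = {"www.greatwar.co.uk"}

# Page-level seeds: flat sites where only named pages are wanted.
PAGE_SEEDS = {
    "www.eyewitnesstohistory.com/blitzkrieg.htm",
    "www.eyewitnesstohistory.com/dday.html",
    "www.eyewitnesstohistory.com/dunkirk.htm",
}

def in_scope(url):
    """Return True if a candidate URL belongs in the collection."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host_and_path = parts.netloc + parts.path
    if parts.netloc in SITE_SEEDS:
        return True
    return host_and_path in PAGE_SEEDS

print(in_scope("http://www.eyewitnesstohistory.com/dday.html"))    # True
print(in_scope("http://www.eyewitnesstohistory.com/vikings.htm"))  # False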

Page 16: Document

Building the Collection

• Stop when most results are redundant
• Narrow in on more specific topics (WWI, WWII)

Page 17: Document

Building the Collection

• Materials in foreign languages
– Focused on German sites
– Consider local conventions, not just translations

WWII (zweiter Weltkrieg)

the period of National Socialism (Zeit des Nationalsozialismus)

the period in which the Nazis ruled (Nazizeit)

Page 18: Document

• Other foreign languages were included, but not sought after

Belarusian; Catalan/Valencian; Chamorro; Czech; Danish; German; Dzongkha; English; Spanish/Castilian; Finnish; French; Hebrew; Hungarian; Italian; Japanese; Luba-Katanga; Dutch/Flemish; Polish; Portuguese; Russian; Slovenian; Turkish; Ukrainian; Chinese

Page 19: Document

Building the Collection

Page 20: Document

The World Wide Web of Humanities “Extracting The Data”

St Anne's College, Oxford
March 19, 2009

Molly Bragg, Partner Specialist

Web Group

The Internet Archive

Page 21: Document

Agenda

Brief Introduction to IA’s Web Archives

Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study

Recommendations for Future Research and Tools Development Efforts

Page 22: Document

Brief Introduction to IA’s Web Archives

Page 23: Document

The Archive’s combined collections receive over 6 million downloads a day!

www.archive.org

The Internet Archive is…

Web Pages; Educational Courseware; Films & Videos; Music & Spoken Word; Books & Texts; Software; Images

A digital library of ~4 petabytes of information

Page 24: Document

IA Web Archives

1.6+ petabytes of primary data (compressed)

150+ billion URIs, culled from 85+ million sites, harvested from 1996 to the present

Includes captures from every domain

Encompasses content in over 40 languages

As of 2009, IA will add ½ petabyte to 1 petabyte of data to these collections each year.

Page 25: Document

Discipline Specific Data Extraction from Longitudinal Web Archives:

The WWWoH Case Study

Page 26: Document

WWWoH Case Study

http://neh-access.archive.org/neh/

Page 27: Document

WWWoH Case Study

Unique URLs in the collection: 5,362,425

Total number of captures: 23,006,857

Captures span: May, 1996 to Aug, 2008

Total size of compressed data: ~250 GB

Page 28: Document

The Data Extraction Process

Oxford Internet Institute selected relevant sites/URLs

Identified all captures related to the seeds (see the sketch below)

Identified all files embedded in each capture (on & off seed domains) for extraction

Attempted to locate additional candidate seed URLs/domains for inclusion in the collection using outbound link data
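A minimal sketch of the first selection step, matching capture URLs against the seed hosts; the simplified 'timestamp url' index lines below are a hypothetical stand-in, not the Archive's real capture index format.

from urllib.parse import urlsplit

def host_of(url):
    """Host name of a URL, with any leading 'www.' dropped for matching."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def select_captures(index_lines, seed_urls):
    """index_lines: iterable of 'timestamp url' strings (hypothetical format).
    Returns the captures whose host matches a seed host."""
    seeds = {host_of(u) for u in seed_urls}
    selected = []
    for line in index_lines:
        timestamp, url = line.split(None, 1)
        if host_of(url) in seeds:
            selected.append((timestamp, url))
    return selected

captures = select_captures(
    ["20040612093011 http://www.greatwar.co.uk/westfront/somme.htm",
     "20050101120000 http://example.org/unrelated.html"],
    ["http://www.greatwar.co.uk/"])
print(captures)   # only the greatwar.co.uk capture is kept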

Page 29: Document

The Data Extraction Process

Relevant URLs not identified as seeds were not extracted.

Automatically harvesting ALL outbound links can capture relevant non-seed URLs; however, it can also introduce a large amount of extraneous content into the collection.

Manually curating outbound links excludes non-relevant content; however, it can be an overwhelming task due to the volume of links (one way to prioritise that curation is sketched below).
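One hedged way to make that curation more tractable is to aggregate the outbound links by host and review the most frequently linked hosts first; the outbound_links data below is invented for illustration and is not part of the project's workflow.

from collections import Counter
from urllib.parse import urlsplit

# Hypothetical outbound links harvested from captures in the collection.
outbound_links = [
    "http://www.firstworldwar.com/photos/index.htm",
    "http://www.firstworldwar.com/maps/index.htm",
    "http://example.com/unrelated",
]

host_counts = Counter(urlsplit(link).netloc for link in outbound_links)

# Review candidate hosts in order of how often the collection links to them.
for host, count in host_counts.most_common(20):
    print(f"{count:6d}  {host}")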

Page 30: Document

WWWoH Case Study: WWI

Number of Seeds: 2263

Unique Hosts: 906

Number of Links: 143+ million

Page 31: Document

WWWoH Case Study: WWI

Page 32: Document

WWWoH Case Study: WWI

Page 33: Document

WWI: Example

Page 34: Document

WWI: Example

Page 35: Document

WWI: Example

Page 36: Document

WWI: Example

Page 37: Document

WWWoH Case Study: WWII

Number of Seeds: 2592

Unique Hosts: 1475

Number of Links: 252+ million

Page 38: Document

WWWoH Case Study: WWII

Page 39: Document

WWWoH Case Study: WWII

Page 40: Document

WWII: Example

Page 41: Document

WWII: Example

Page 42: Document

WWII: Example

Page 43: Document

Challenges

Identifying subject matter-specific resources of interest for an extraction and then automating those procedures.

Tools are missing from the workflow that might make the initial scoping of an extraction easier to define and revise

Available tools for collection building and access are too technically focused for the average humanities scholar

Page 44: Document

Recommendations for Future Research and Tools Development Efforts

Page 45: Document

Implications for Future Research

Need link and web graphing tools that use inbound and outbound link data to identify further resources of interest

Need to experiment with a more diverse range of UI navigational paradigms that address the dimension of time and curatorial input
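As one illustration of the kind of link-graphing tool being called for, the sketch below uses the networkx library (an assumption for illustration, not something this project used) to rank hosts that the collection links to but does not yet contain; the edge data is invented.

import networkx as nx

# Directed host-level link graph: edge (a, b) means a page on host a links to host b.
graph = nx.DiGraph()
graph.add_edges_from([
    ("www.greatwar.co.uk", "www.firstworldwar.com"),
    ("memory.loc.gov", "www.firstworldwar.com"),
    ("www.greatwar.co.uk", "example.com"),
])

collection_hosts = {"www.greatwar.co.uk", "memory.loc.gov"}

# Hosts outside the collection, ranked by how many in-collection hosts link to them.
candidates = sorted(
    (host for host in graph.nodes if host not in collection_hosts),
    key=lambda host: graph.in_degree(host),
    reverse=True,
)
for host in candidates:
    print(graph.in_degree(host), host)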

Page 46: Document

Ideas/Concepts to Explore: Nomination Tools

Page 47: Document

Ideas/Concepts to Explore: Nomination Tools

Page 48: Document

Opportunities

Extractions make it easier for humanities scholars to locate and assemble source materials of interest.

These collections can accelerate and/or augment discipline-specific research efforts.

Extractions can encourage distributed collaboration and cooperation between entities who might not otherwise be aware of one another.

Page 49: Document

Thank You!

http://neh-access.archive.org/neh/

Molly Bragg, Partner Specialist

The Internet Archive, Web Group

[email protected]

Page 50: Document


Search and Analysis of Data in WWWoH

Mark Middleton

Page 51: Document


Agenda

Brief introduction to Hanzo

Open Source Search-Tools: a toolkit for implementing analytical applications using web archives

WWWoH — working with the data

Recommendations for future research

Recommendations for future tools development

WWWoH Tools Deliverables

Page 52: Document


Introduction to Hanzo

Page 53: Document


Hanzo Archives Limited

Web Archiving Services

Company websites and intranets

Litigation support

E-Discovery

IP protection

Focus on legally defensible web archives of exceptional quality

Very advanced crawlers and access tools: dynamic HTML, video, Flash, Web 2.0

Some public archives

Mainly closed archives

Page 54: Document


Hanzo Archiving Technology

Need advanced capabilities very quickly — continuous product innovation

Rapid development of tools

Create research and open source projects to promote mainstream awareness of web archives and web archiving technology

Open source projects include

WARC Tools

Search Tools

Page 55: Document


WWWoH and Development of Open Source Search-Tools

Page 56: Document


Objectives

Deliver an open source search engine for web archives that is simple to extend, easy to install and deploy

Integrate with WARC Tools, the open source web archive file manipulation tools (Hanzo and IIPC)

Extend the search engine with interesting directives and options

Extend the search engine to provide data to analytical tools; develop an API and exemplar analytical tools

Encourage third party analytical tools to use web archives as their data repository

Migrate WWWoH extraction from ARC to WARC and ingest into Search Tools

Page 57: Document


Full Text Search

Implemented FT search on top of WARC Tools — the toolkit for manipulating ISO-28500 WARC files

Reviewed several options: Java Lucene (and clones), Xapian, DB indexing (Sphinx, OpenFTS), etc.

Criteria: vibrant development community, extensible (searching web archives is different: temporal dimension, duplicate handling, etc.), fast and full-featured (boolean, time queries, ability to index multiple fields, query language)
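To make these criteria concrete, here is a minimal sketch of a multi-field, date-aware index using the Python whoosh library purely as a stand-in; the project itself built on Ferret and WARC Tools, and the directory name and sample document below are invented.

import os
from datetime import datetime

from whoosh.fields import Schema, TEXT, ID, DATETIME
from whoosh.index import create_in
from whoosh.query import And, DateRange, Term

# Each capture is indexed with several fields, including its capture date,
# so that boolean, field and time-range queries are all possible.
schema = Schema(url=ID(stored=True), title=TEXT(stored=True),
                content=TEXT, date=DATETIME(stored=True))

os.makedirs("wwwoh_index", exist_ok=True)
ix = create_in("wwwoh_index", schema)

writer = ix.writer()
writer.add_document(url="http://www.greatwar.co.uk/westfront/somme.htm",
                    title="The Battle of the Somme",
                    content="The Somme offensive began in July 1916 ...",
                    date=datetime(2002, 5, 14))
writer.commit()

# Boolean + field + time-range query: 'somme' in the text, captured 2001-2003.
query = And([Term("content", "somme"),
             DateRange("date", datetime(2001, 1, 1), datetime(2003, 12, 31))])
with ix.searcher() as searcher:
    for hit in searcher.search(query):
        print(hit["date"], hit["url"])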

Page 58: Document


Component Architecture

Full text search engine based on Open Source Ferret

Knowledge Base stores search results

Python application with Django model and Django WUI

Memcache

Plug-in architecture to support multiple analytical applications
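A minimal sketch of the plug-in idea, assuming a simple registry of analytical applications; the registry, decorator and example plug-in below are invented for illustration and do not reflect the actual Search Tools interfaces.

from collections import Counter
from urllib.parse import urlsplit

# Registry mapping a plug-in name to a callable that analyses search results.
ANALYTICS_PLUGINS = {}

def register_plugin(name):
    """Decorator that adds an analytical application to the registry."""
    def wrap(func):
        ANALYTICS_PLUGINS[name] = func
        return func
    return wrap

@register_plugin("domain_frequency")
def domain_frequency(results):
    """Example plug-in: count how many results each domain contributes."""
    return Counter(urlsplit(r["url"]).netloc for r in results)

def run_plugin(name, results):
    return ANALYTICS_PLUGINS[name](results)

hits = [{"url": "http://www.greatwar.co.uk/somme.htm"},
        {"url": "http://memory.loc.gov/ammem/maps/wwii/"}]
print(run_plugin("domain_frequency", hits))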

Page 59: Document


Ferret

Ferret is FAST, both indexing and searching

Highly scalable, up to 100m documents on a single CPU

Supports distributed search

Phrase search, proximity ranking, stemming in several languages, stopwords, multiple document fields

Ferret Query Language

[Benchmark chart: documents/s, Ferret vs Lucene]

http://ferret.davebalmain.com/trac/wiki/FerretVsLucene

Page 60: Document


Advanced Search

url: (+bbc +wwii) -- search for URLs containing both ‘bbc’ and ‘wwii’

date: [2001 2002] -- search within date range

tag: wwwoh -- search content with the tag ‘wwwoh’

title: (+wilfred +owen) -- search for Wilfred and Owen within the title

domain: fr -- restrict search to within .fr domain

Page 61: Document


Working with the Data

Page 62: Document


Migrating ARC to WARC

Data extracted from IA in ARC files

The Hanzo WARC Tools and Search Tools projects combined enabled us to migrate the ARC files to WARC files (WARC is the new ISO standard)

Some challenges: broken ARCs, scale, etc.

3,264 WARC files
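The migration itself was done with Hanzo's WARC Tools and Search Tools; as a rough stand-in, the sketch below shows the same ARC-to-WARC idea using the modern warcio library (an assumption, not what the project used), with hypothetical file names.

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

# Hypothetical input and output file names.
with open("extract-0001.arc.gz", "rb") as arc_in, \
     open("extract-0001.warc.gz", "wb") as warc_out:
    writer = WARCWriter(warc_out, gzip=True)
    # arc2warc=True makes the iterator present each ARC record as a WARC record.
    for record in ArchiveIterator(arc_in, arc2warc=True):
        writer.write_record(record)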

Page 63: Document


Programmable Access to Data

WARC Tools and Search Tools provide a rich collection of programmable tools to enable analytics tools developers to use web archives:

Object-oriented C, REST API, fast iterators

Command lines for manipulating WARCs, indexing, searching

Web applications for browsing, searching, demonstrator analytics

C/C++, Python, Ruby, Perl, … and if you need to, Java, C#

Demonstration: the web applications

Page 64: Document


http://wwwoh.hanzoarchives.com/

Page 65: Document


Page 66: Document


Page 67: Document


Page 68: Document


Analytical Tools

Frequency Tables for:

Domains, MIME Types, Countries

Graphing Tools:

GUESS -- an exploratory data analysis and visualization tool for graphs and networks

Graphviz -- makes diagrams in several formats (images and SVG for web pages, PostScript), or displays them in an interactive graph browser

Hypergraph -- provides visualisation of hyperbolic geometry, handling graphs and laying out hyperbolic trees
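A small sketch of both ideas using only the Python standard library: frequency tables built with collections.Counter and a host-level link graph written out as a Graphviz DOT file; the record data is invented for illustration.

from collections import Counter

# Hypothetical per-capture metadata: (host, mime_type, outbound_host).
records = [
    ("www.greatwar.co.uk", "text/html", "www.firstworldwar.com"),
    ("www.greatwar.co.uk", "image/jpeg", None),
    ("memory.loc.gov", "text/html", "www.firstworldwar.com"),
]

# Frequency tables for domains and MIME types.
domain_freq = Counter(host for host, _, _ in records)
mime_freq = Counter(mime for _, mime, _ in records)
print(domain_freq.most_common())
print(mime_freq.most_common())

# Host-level link graph as a Graphviz DOT file (render with: dot -Tsvg links.dot).
with open("links.dot", "w") as out:
    out.write("digraph links {\n")
    for host, _, target in records:
        if target:
            out.write(f'  "{host}" -> "{target}";\n')
    out.write("}\n")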

Page 69: Document


Page 70: Document


Graphing Tools

Page 71: Document


Recommendations for Future Research and Tools Development

Page 72: Document


Future Research

Faster, richer analytics

Rich API for analytics, to be developed in collaboration with IA, other archives, and IIPC

Temporal analytics and techniques

Link and network graphing and analytics

Enhance outreach/dissemination to the mainstream development community and research community

Page 73: Document


Future Tools Development

Multi-machine indexing and application engine

Tighter integration of graphing tools, with more user parameters and configurations

Temporal analysis (animation of link graphs over time)

Enhance WARC Tools integration and investigate interoperability with other IIPC toolsets

Developer documentation

Analyst/researcher documentation

Installation tools for Linux, Mac OS X and Windows XP/Vista

Page 74: Document


Deliverables at End March 2009

Page 75: Document


Deliverables

The Search Tools project home is http://code.google.com/p/search-tools/

Source code

Documentation

Issue management

Mailing list

The WARC Tools project home is http://code.google.com/p/warc-tools/

The prototype application is http://wwwoh.hanzoarchives.com/

Page 76: Document


Thank You

Hanzo Archives Limited

+44 20 8816 8226

www.hanzoarchives.com