Document

Post on 14-May-2015


Presentations from Oxford Internet Institute, the Internet Archive, and Hanzo Archives Ltd presenting the results of a JISC-NEH funded transatlantic digitisation project.

Transcript of Document

Slides from Humanities on the Web: Is it working?
Date: Thursday, 19 March 2009, 10-4
Location: Oxford University, Oxford, UK
Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275
Slide URL: http://www.slideshare.net/etmeyer/WWWoH

Afternoon Event: 1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in conjunction with the Internet Archive: The World Wide Web of Humanities

OII: Selecting and analysing the sample WWI and WWII collections (Christine Madsen & Dr. Eric Meyer)

The Internet Archive: Extracting the data (Molly Bragg)

Hanzo Archives Ltd.: Working with the data (Mark Middleton)

Discussion and questions

Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238

Selecting and Analysing the WWI and WWII collections

Christine Madsen
Eric Meyer

19 March 2009

Why WWI and WWII?

Many branches of the humanities: history, journalism, art, art history, advertising, literature, poetry, political science, military history

Why WWI and WWII?

Well-rounded set of materials

Why WWI and WWII?

Language

Document types

Top-level domains

Secondary domains

Building the Collection

Supplemented with keyword searches in the Archive

Selected from the live web

Building the Collection

Seeds are:

the website or portion of the website that you plan to include in your collection

Initial Collection

Seed 1, Seed 2, Seed 3

Building the Collection

[Diagram: Seeds 1 to 6, each linked to several further web pages (www)]

Expanded Collection

A seed is also a web site from which additional sites can be discovered via the hyperlinks of the site
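The second sense of "seed", a site from which further sites are discovered through its hyperlinks, can be sketched in a few lines. This is only an illustration using the Python standard library, not the tooling used in the project; the sample HTML and the example.org / example.net hosts are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def discover_hosts(seed_url, html):
    """Return the hosts, other than the seed's own, linked from the page."""
    collector = LinkCollector(seed_url)
    collector.feed(html)
    seed_host = urlsplit(seed_url).hostname
    hosts = {urlsplit(link).hostname for link in collector.links}
    return {h for h in hosts if h and h != seed_host}


# Hypothetical page content, for illustration only.
sample_html = """
<a href="/westfront/index.htm">Western Front</a>
<a href="http://example.org/somme/">The Somme</a>
<a href="http://example.net/archive.html">Archive</a>
"""
found = discover_hosts("http://www.greatwar.co.uk/", sample_html)
print(sorted(found))
```

Each discovered host is only a candidate; whether it joins the collection is still a selection decision.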

Building the Collection

Started with WWI

Too small (under 1,000,000 pages / objects)

Target was 250 million

Building the Collection

Expanded to WWII

Final collection: 5,362,425 unique URLs

Building the Collection

‘World War One’

‘World War I’

‘First world war’

‘World War II’

‘World War Two’

‘the great war’

‘Première Guerre Mondiale’

‘zweiter Weltkrieg’

Building the Collection

Record links from first 20 pages of search

Following links [include dead links]

Returning to ‘hub’ sites for further analysis

Building the Collection

http://www.greatwar.co.uk/westfront/Somme/index.htm

http://www.greatwar.co.uk

Expanding scope

Building the Collection

memory.loc.gov/ammem/collections/maps/wwii/index.html

www.memory.loc.gov/ammem/collections/maps/wwii/

Expanding scope

Building the Collection

www.eyewitnesstohistory.com/ <= don’t want whole site

www.eyewitnesstohistory.com/blitzkrieg.htm
www.eyewitnesstohistory.com/dday.html
www.eyewitnesstohistory.com/midway.htm
www.eyewitnesstohistory.com/airbattle.htm
www.eyewitnesstohistory.com/dunkirk.htm
www.eyewitnesstohistory.com/francesurrenders.htm

Dealing with illogical or flat directory structures
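The three scoping cases on these slides (a whole site, a directory, and individual pages on a site with a flat structure) can be expressed as simple rules. The sketch below is an assumption about how such rules might be encoded, not the project's actual method; the rule kinds 'host', 'dir' and 'page' are invented names, and treating a leading "www." as optional is also an assumption.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Return (host, path) with any scheme and a leading 'www.' dropped."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path or "/"


def in_scope(url, rules):
    """rules is a list of (kind, seed) pairs; kind is 'host', 'dir' or 'page'."""
    host, path = normalise(url)
    for kind, seed in rules:
        seed_host, seed_path = normalise(seed)
        if host != seed_host:
            continue
        if kind == "host":
            return True
        if kind == "dir" and path.startswith(seed_path):
            return True
        if kind == "page" and path == seed_path:
            return True
    return False


rules = [
    ("host", "http://www.greatwar.co.uk"),                   # whole site
    ("dir", "memory.loc.gov/ammem/collections/maps/wwii/"),  # one directory
    ("page", "www.eyewitnesstohistory.com/dday.html"),       # single pages only
    ("page", "www.eyewitnesstohistory.com/dunkirk.htm"),
]

print(in_scope("http://www.greatwar.co.uk/westfront/Somme/index.htm", rules))
print(in_scope("http://www.eyewitnesstohistory.com/index.html", rules))
```

Page-level rules are what make a flat site manageable: every wanted page has to be listed, because no directory prefix separates it from the rest of the site.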

Building the Collection

• Stop when most results are redundant
• Narrow in on more specific topics

WWI

WWII
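"Stop when most results are redundant" can be made concrete by measuring what fraction of each new batch of search results is already in the collection. The sketch below is illustrative only; the 0.8 threshold is an arbitrary placeholder and not a figure from the project, and the URLs are made up.

```python
def redundancy(new_results, collection):
    """Fraction of a batch of search results already in the collection."""
    if not new_results:
        return 1.0
    seen = sum(1 for url in new_results if url in collection)
    return seen / len(new_results)


def add_batch(new_results, collection, threshold=0.8):
    """Add a batch to the collection; return True when searching should stop."""
    ratio = redundancy(new_results, collection)
    collection.update(new_results)
    return ratio >= threshold


collection = {"a.example/1", "a.example/2", "b.example/1"}

# Half of this batch is new, so the search continues.
first = add_batch(["a.example/1", "c.example/1"], collection)

# Four of these five results are already known, so the search stops.
second = add_batch(
    ["a.example/1", "a.example/2", "b.example/1", "c.example/1", "d.example/1"],
    collection,
)
print(first, second)
```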

Building the Collection

• Materials in foreign languages
– Focused on German sites
– Consider local conventions, not just translations

WWII (zweiter Weltkrieg)

the period of National Socialism (Zeit des Nationalsozialismus)

the period in which the Nazis ruled (Nazizeit)

• Other foreign languages were included, but not sought after

Belarusian; Catalan/Valencian; Chamorro; Czech; Danish; German; Dzongkha; English; Spanish/Castilian; Finnish; French; Hebrew; Hungarian; Italian; Japanese; Luba-Katanga; Dutch/Flemish; Polish; Portuguese; Russian; Slovenian; Turkish; Ukrainian; Chinese

Building the Collection

The World Wide Web of Humanities “Extracting The Data”

St Anne's College, Oxford
March 19, 2009

Molly Bragg, Partner Specialist

Web Group

The Internet Archive

Agenda

Brief Introduction to IA’s Web Archives

Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study

Recommendations for Future Research and Tools Development Efforts

Brief Introduction to IA’s Web Archives

The Archive’s combined collections receive over 6 mil downloads a day!

www.archive.org

The Internet Archive is…

Web Pages, Educational Courseware, Films & Videos, Music & Spoken Word, Books & Texts, Software, Images

A digital library of ~4 petabytes of information

IA Web Archives

1.6+ petabytes of primary data (compressed)

150+ billion URIs, culled from 85+ million sites, harvested from 1996 to the present

Includes captures from every domain

Encompasses content in over 40 languages

As of 2009, IA will add ½ petabyte to 1 petabyte of data to these collections each year.

Discipline Specific Data Extraction from Longitudinal Web Archives:

The WWWoH Case Study

WWWoH Case Study

http://neh-access.archive.org/neh/

WWWoH Case Study

Unique URLs in the collection: 5,362,425

Total number of captures: 23,006,857

Captures span: May, 1996 to Aug, 2008

Total size of compressed data: ~250 GB
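A few derived figures help put these totals in proportion. The short calculation below uses only the numbers on this slide; reading "~250 GB" as decimal gigabytes and counting the span inclusively by month are assumptions.

```python
unique_urls = 5_362_425
captures = 23_006_857
compressed_bytes = 250 * 10**9       # "~250 GB", taken here as decimal gigabytes
first, last = (1996, 5), (2008, 8)   # May 1996 to Aug 2008

captures_per_url = captures / unique_urls
bytes_per_capture = compressed_bytes / captures
span_months = (last[0] - first[0]) * 12 + (last[1] - first[1]) + 1

print(f"{captures_per_url:.2f} captures per unique URL")
print(f"{bytes_per_capture / 1000:.1f} kB compressed per capture")
print(f"{span_months} months covered")
```

On average each URL was captured a little over four times across roughly twelve years, which is what makes the collection longitudinal rather than a single snapshot.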

The Data Extraction Process

Oxford Internet Institute selected relevant sites/URLs

Identified all captures related to the seeds

Identified all files embedded in each capture (on & off seed domains) for extraction

Attempted to locate additional candidate seed URLs/domains for inclusion in the collection using outbound link data
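The first steps of this process (match captures to seeds, then gather the files embedded in each capture) can be sketched as follows. The Capture record and its fields are hypothetical simplifications, not the Internet Archive's actual index format, and the URLs are made up.

```python
from collections import namedtuple

# A simplified, hypothetical capture record: the URL, a 14-digit timestamp,
# and the URLs of files embedded in the captured page (images, stylesheets).
Capture = namedtuple("Capture", "url timestamp embedded")


def captures_for_seeds(index, seeds):
    """Keep every capture whose URL starts with one of the seed prefixes."""
    return [c for c in index if any(c.url.startswith(s) for s in seeds)]


def extraction_list(index, seeds):
    """Return the URLs to extract: seed captures plus their embedded files."""
    wanted = set()
    for capture in captures_for_seeds(index, seeds):
        wanted.add(capture.url)
        wanted.update(capture.embedded)  # may sit on or off the seed domain
    return wanted


index = [
    Capture("http://seed.example/wwi/", "19990101000000",
            ["http://seed.example/img/map.gif", "http://cdn.example/style.css"]),
    Capture("http://seed.example/wwi/", "20030615120000",
            ["http://seed.example/img/map.gif"]),
    Capture("http://other.example/", "20010101000000", []),
]
seeds = ["http://seed.example/wwi/"]

print(len(captures_for_seeds(index, seeds)))
print(sorted(extraction_list(index, seeds)))
```

Embedded files are pulled in even when they live on another domain, so that an extracted page can still be rendered with its images and styles.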

The Data Extraction Process

Relevant URLs not identified as seeds were not extracted.

Automatically harvesting ALL outbound links can capture relevant non-seed URLs; however, it can also introduce a large amount of extraneous content into the collection.

Manually curating outbound links excludes non-relevant content; however, it can be an overwhelming task due to the volume of links.
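One possible middle path between harvesting everything and curating every link by hand is to rank candidate hosts so that a curator reviews the most-linked ones first. The sketch below is a suggestion under that assumption, not something the project reports doing; the hosts are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_candidates(outlinks, seed_hosts):
    """Rank non-seed hosts by how many distinct seed hosts link to them.

    outlinks is an iterable of (source_url, target_url) pairs.
    """
    linked_from = defaultdict(set)
    for source, target in outlinks:
        src_host = urlsplit(source).hostname
        dst_host = urlsplit(target).hostname
        if src_host in seed_hosts and dst_host and dst_host not in seed_hosts:
            linked_from[dst_host].add(src_host)
    ranked = [(host, len(sources)) for host, sources in linked_from.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


seed_hosts = {"a.example", "b.example"}
outlinks = [
    ("http://a.example/1", "http://c.example/x"),
    ("http://b.example/1", "http://c.example/y"),
    ("http://a.example/2", "http://d.example/"),
    ("http://a.example/3", "http://a.example/4"),  # internal link, ignored
    ("http://a.example/5", "http://c.example/z"),  # same seed again, not recounted
]
ranked = rank_candidates(outlinks, seed_hosts)
print(ranked)
```

Counting distinct linking seeds rather than raw links keeps one link-heavy site from dominating the list.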

WWWoH Case Study: WWI

Number of Seeds: 2263

Unique Hosts: 906

Number of Links: 143+ mil

WWI: Example

WWWoH Case Study: WWII

Number of Seeds: 2592

Unique Hosts: 1475

Number of Links: 252+ mil

WWII: Example

Challenges

Identifying subject matter-specific resources of interest for an extraction and then automating those procedures.

Tools are missing from the workflow that might make the initial scoping of an extraction easier to define and revise.

Available tools for collection building and access are too technically focused for the average humanities scholar.

Recommendations for Future Research and Tools Development Efforts

Implications for Future Research

Need link and web graphing tools that use inbound and outbound link data to identify further resources of interest

Need to experiment with a more diverse range of UI navigational paradigms that address the dimension of time and curatorial input

Ideas/Concepts to Explore: Nomination Tools

Opportunities

Extractions make it easier for humanities scholars to locate and assemble source materials of interest.

These collections can accelerate and/or augment discipline specific research efforts.

Extractions can encourage distributed collaboration and cooperation between entities who might not otherwise be aware of one another.

Thank You!

http://neh-access.archive.org/neh/

Molly Bragg, Partner Specialist

The Internet Archive, Web Group

mbragg@archive.org

www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.

Search and Analysis of Data in WWWoH

Mark Middleton

Agenda

Brief introduction to Hanzo

Open Source Search-Tools: a toolkit for implementing analytical applications using web archives

WWWoH — working with the data

Recommendations for future research

Recommendations for future tools development

WWWoH Tools Deliverables

Introduction to Hanzo

Hanzo Archives Limited

Web Archiving Services

Company websites and intranets

Litigation support

E-Discovery

IP protection

Focus on legally defensible web archives of exceptional quality

Very advanced crawlers and access tools: dynamic HTML, video, Flash, Web 2.0

Some public archives

Mainly closed archives

Hanzo Archiving Technology

Need advanced capabilities very quickly — continuous product innovation

Rapid development of tools

Create research and open source projects to promote mainstream awareness of web archives and web archiving technology

Open source projects include

WARC Tools

Search Tools

WWWoH and Development of Open Source Search-Tools

Objectives

Deliver an open source search engine for web archives that is simple to extend, easy to install and deploy

Integrate with WARC Tools, the open source web archive file manipulation tools (Hanzo and IIPC)

Extend the search engine with interesting directives and options

Extend the search engine to provide data to analytical tools, develop an API, tools, and exemplar analytical tools

Encourage third party analytical tools to use web archives as their data repository

Migrate WWWoH extraction from ARC to WARC and ingest into Search Tools

Full Text Search

Implemented full-text (FT) search on top of WARC Tools, the toolkit for manipulating ISO 28500 WARC files

Reviewed several options: Java Lucene (and clones), Xapian, DB indexing (Sphinx, OpenFTS), etc.

Criteria: vibrant development community, extensible (searching web archives is different: temporal dimension, duplicate handling, etc.), fast and full-featured (boolean, time queries, ability to index multiple fields, query language)

Component Architecture

Full text search engine based on Open Source Ferret

Knowledge Base stores search results

Python application with Django model and Django WUI

Memcache

Plug-in architecture to support multiple analytical applications

Ferret

Ferret is FAST, both indexing and searching

Highly scalable, up to 100m documents on a single CPU

Supports distributed search

Phrase search, proximity ranking, stemming in several languages, stopwords, multiple document fields

Ferret Query Language

[Chart: Documents/s]

http://ferret.davebalmain.com/trac/wiki/FerretVsLucene

Advanced Search

url: (+bbc +wwii) -- search for URLs containing both ‘bbc’ and ‘wwii’

date: [2001 2002] -- search within date range

tag: wwwoh -- search content with the tag ‘wwwoh’

title: (+wilfred +owen) -- search for Wilfred and Owen within the title

domain: fr -- restrict search to within .fr domain
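The field syntax above is regular enough to assemble programmatically. The helpers below only build strings in the form shown on the slide; they are not part of Search Tools, their names are invented for this sketch, and joining several clauses with a space is an assumption about how the engine combines them.

```python
def field_all(field, *terms):
    """Require every term within a field, e.g. url: (+bbc +wwii)."""
    return f"{field}: (" + " ".join(f"+{t}" for t in terms) + ")"


def field_range(field, start, end):
    """Restrict a field to a range, e.g. date: [2001 2002]."""
    return f"{field}: [{start} {end}]"


def field_is(field, value):
    """Match a single value in a field, e.g. domain: fr."""
    return f"{field}: {value}"


# Assumption: clauses can be combined by writing them side by side.
query = " ".join([
    field_all("title", "wilfred", "owen"),
    field_range("date", 2001, 2002),
    field_is("domain", "fr"),
])
print(query)
```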

Working with the Data

Migrating ARC to WARC

Data extracted from IA in ARC files

Combined, the Hanzo WARC Tools and Search Tools projects enabled us to migrate the ARC files to WARC files (WARC is the new ISO standard):

Some challenges: broken ARCs, scale, etc.

3,264 WARC files
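To give a feel for what the migration involves, the sketch below maps the header line of one ARC version 1 record onto WARC named fields. It is a simplified illustration written for this transcript, not the WARC Tools implementation: it handles the header only, assumes ARC timestamps are in UTC, and the choice to carry the ARC content type over as WARC-Identified-Payload-Type is an assumption.

```python
from datetime import datetime, timezone
import uuid


def arc_header_to_warc(line):
    """Map an ARC v1 URL-record line onto WARC named fields.

    The ARC line has five space-separated fields:
    URL, IP address, archive date (YYYYMMDDhhmmss), content type, length.
    A line with any other number of fields raises ValueError, which is one
    simple way a broken ARC record shows up.
    """
    url, ip, arc_date, content_type, length = line.split(" ")
    when = datetime.strptime(arc_date, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return {
        "WARC-Type": "response",
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        "WARC-Target-URI": url,
        "WARC-Date": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-IP-Address": ip,
        "WARC-Identified-Payload-Type": content_type,
        "Content-Length": length,
    }


# Hypothetical ARC header line, for illustration only.
fields = arc_header_to_warc(
    "http://www.example.org/ 192.0.2.1 19990101120000 text/html 2048"
)
for name, value in fields.items():
    print(f"{name}: {value}")
```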

Programmable Access to Data

WARC Tools and Search Tools provide a rich collection of programmable tools to enable analytics tools developers to use web archives:

Object-oriented C, REST API, fast iterators

Command lines for manipulating WARCs, indexing, searching

Web applications for browsing, searching, demonstrator analytics

C/C++, Python, Ruby, Perl, … and if you need to, Java, C#

Demonstration: the web applications

http://wwwoh.hanzoarchives.com/

Analytical Tools

Frequency Tables for:

Domains, MIME Types, Countries

Graphing Tools:

GUESS -- an exploratory data analysis and visualization tool for graphs and networks

Graphviz -- makes diagrams in several formats: images and SVG for web pages, Postscript; or display in an interactive graph browser

Hypergraph -- provides visualisation of hyperbolic geometry, to handle graphs and to layout hyperbolic trees
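The two kinds of analytical output listed here, frequency tables and link graphs, can be illustrated with the standard library alone. This is a sketch, not the Search Tools code: the (url, mime) record shape is an assumption, the sample data is made up, and the top-level domain is used only as a rough stand-in for country. The second function writes Graphviz's DOT text format, which Graphviz can then render.

```python
from collections import Counter
from urllib.parse import urlsplit


def frequency_tables(records):
    """Count hosts, top-level domains and MIME types over (url, mime) pairs."""
    hosts, tlds, mimes = Counter(), Counter(), Counter()
    for url, mime in records:
        host = urlsplit(url).hostname or ""
        hosts[host] += 1
        tlds[host.rsplit(".", 1)[-1]] += 1
        mimes[mime] += 1
    return hosts, tlds, mimes


def to_dot(edges):
    """Render (source_host, target_host) pairs as a Graphviz DOT digraph."""
    lines = ["digraph links {"]
    for source, target in sorted(set(edges)):
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


records = [
    ("http://www.bbc.co.uk/history/", "text/html"),
    ("http://www.bbc.co.uk/logo.gif", "image/gif"),
    ("http://example.fr/guerre.html", "text/html"),
]
hosts, tlds, mimes = frequency_tables(records)
print(hosts.most_common())
print(tlds.most_common())
print(mimes.most_common())
print(to_dot([("www.bbc.co.uk", "example.fr")]))
```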

Graphing Tools

Recommendations for Future Research and Tools Development

Future Research

Faster, richer analytics

Rich API for analytics, to be developed in collaboration with IA, other archives, and IIPC

Temporal analytics and techniques

Link and network graphing and analytics

Enhance outreach/dissemination to the mainstream development community and research community

Future Tools Development

Multi-machine indexing and application engine

Tighter integration of graphing tools, with more user parameters and configurations

Temporal analysis (animation of link graphs over time)

Enhance WARC Tools integration and investigate interoperability with other IIPC toolsets

Developer documentation

Analyst/researcher documentation

Installation tools for Linux, Mac OS X and Windows XP/Vista
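The temporal analysis item above (animating link graphs over time) presupposes one graph per time slice. A minimal sketch of that preparation step follows; it is an illustration of the idea rather than a planned design, and the (timestamp, source, target) record shape and the hosts are assumptions.

```python
from collections import defaultdict


def edges_by_year(link_captures):
    """Group (timestamp, source, target) link observations by year.

    timestamp is a 14-digit string such as "20030615120000".
    """
    snapshots = defaultdict(set)
    for timestamp, source, target in link_captures:
        snapshots[int(timestamp[:4])].add((source, target))
    return dict(sorted(snapshots.items()))


def changes(snapshots):
    """Yield (year, added_edges, removed_edges) between consecutive snapshots."""
    previous = set()
    for year, edges in snapshots.items():
        yield year, edges - previous, previous - edges
        previous = edges


link_captures = [
    ("20010101000000", "a.example", "b.example"),
    ("20020101000000", "a.example", "b.example"),
    ("20020601000000", "a.example", "c.example"),
    ("20030101000000", "a.example", "c.example"),
]
snapshots = edges_by_year(link_captures)
for year, added, removed in changes(snapshots):
    print(year, sorted(added), sorted(removed))
```

Each yearly edge set is one frame of the animation, and the added and removed edges are what changes between frames.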

Deliverables at End March 2009

Deliverables

The Search Tools project home is http://code.google.com/p/search-tools/

Source code

Documentation

Issue management

Mailing list

The WARC Tools project home is http://code.google.com/p/warc-tools/

The prototype application is http://wwwoh.hanzoarchives.com/

Thank You

Hanzo Archives Limited

+44 20 8816 8226

www.hanzoarchives.com