The Library of Congress
Martha Anderson
Program Officer, NDIIPP
Office of Strategic Initiatives
Library of Congress
April 2005
LC Perspective: Preservation Partnerships
Born Digital “At-Risk” Web Sites
http://www.loc.gov/minerva/collect/elec2000
http://www.loc.gov/minerva/collect/sept11
NDIIPP Strategic Direction
Take actions that are:
• Catalytic
– Invest in existing strengths
• Collaborative
– Engage partners in areas of mutual interest and expertise
• Iterative
– Learn by doing
• Strategic
– Broad spectrum of balanced short-term & investments
Web of projects
[Diagram labels: LC Web Projects, IIPC, NDIIPP, CDL, UIUC, NARA, GPO, IA, AIHT, Preservation Partners, States Initiative]
Library of Congress Web Archiving: Strategy
• Collaborate with partners working on the same preservation issues
• Develop collection strategies to leverage available resources
• Learn by doing
Collaborate with partners working on the same preservation issues
• Membership in the International Internet Preservation Consortium (IIPC)
• Cooperative projects with NDIIPP Preservation Partners
– California Digital Library
– University of Illinois at Urbana-Champaign
• Technical information sharing with other US government agencies
– Government Printing Office
– National Archives and Records Administration
Develop collection strategies to leverage available resources
• Collect thematically both by crawling and by acquiring collections gathered by others
Learn by doing
• Case studies and regular collection of theme-based collections
• Participate in tools development with IIPC
• Archive Ingest & Handling Project
Challenges of collecting from the Web
• Characteristics of the resource: dynamic, deep, linked
• Intellectual property laws and regulations
• Tension of preservation vs access goals
• Degree of alignment with current collection policies for other media
• Curation strategy
• Tools for identification and selection
• Tools for collection, curation, and archiving of large web collections
Average Web Collection
• Begins with a theme or event
• Usually does not include commercial sites
• Starts with a list of about 200 URLs
• Is crawled by vendor
• Yields about 1 TB of data per month
• Has a frequency of once a week
Web Collections to date at LC
• Event-based
– US National Elections: 2000, 2002, 2004
– War in Iraq
– September 11
• Public Policy Topics
– Health Care
– Legislative Branch
– Terrorism
• 26 TB
Archive Ingest & Handling Test
• AIHT is a first test of the proposed NDIIPP preservation architecture.
• The test is conducted with a common data set.
– George Mason University 9/11 Archive
• Phase I tests ingest and data handling in local systems.
• Phase II tests export and import between institutions.
• Phase III explores format migration.
[Diagram labels: GMU 9/11 Archive; Participants demonstrate capabilities; Participants exchange archive]
Participants
• Old Dominion University, Department of Computer Science
• Stanford University Libraries & Academic Information Resources
• The Johns Hopkins University, Sheridan Libraries
• Harvard University Library
George Mason University 9/11 Archive: Breakdown by File Types
TEXT 35%
HTML 29%
IMAGES 27%
AUDIO 4%
PDF 3%
OTHER 2%
VIDEO 0.2%
57,450+ files, 12 GB, originally stored in a Linux environment
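As a rough illustration of the breakdown above, the following Python sketch converts the listed percentages into approximate file counts. It assumes the percentages are shares of the file count (the slide does not say whether they refer to file count or data volume) and treats the "57,450+" figure as an exact total. The names SHARES, TOTAL_FILES, and estimate_counts are hypothetical and not part of the presentation.

```python
# Percent shares as listed on the slide; they sum to 100.2, presumably from rounding.
SHARES = {
    "TEXT": 35.0,
    "HTML": 29.0,
    "IMAGES": 27.0,
    "AUDIO": 4.0,
    "PDF": 3.0,
    "OTHER": 2.0,
    "VIDEO": 0.2,
}

# Assumption: "57,450+" is treated as exactly 57,450 files.
TOTAL_FILES = 57_450


def estimate_counts(shares, total):
    """Return approximate file counts per type, assuming shares are by file count."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


if __name__ == "__main__":
    for name, count in estimate_counts(SHARES, TOTAL_FILES).items():
        print(f"{name:<7}{count:>7,}")
```

Because the total is a lower bound and the basis of the percentages is assumed, the results are estimates only, not figures reported in the presentation.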
Goals of AIHT
• Gain practical experience with multiple institutions
• Document transfer and ingest processes for multiple systems
• Determine next set of tasks for developing interfaces between layers and institutions
Status of AIHT
• All phases completed.
– Imports focused on technical assessment of the archive and on developing tools to examine the archive
– Exports included METS and MPEG-21 DID objects
– Migrations included transforms to JPEG 2000 and TIFF, and some exploration of HTML to XML and AVI to MPG
• Full report expected by early summer.
For more information
• NDIIPP Technical Architecture version 0.2 http://www.digitalpreservation.gov
• International Internet Preservation Consortium http://netpreserve.org/about/index.php
• MINERVA: Mapping the INternet Electronic Resources Virtual Archive http://www.loc.gov/minerva/
Martha Anderson
NDIIPP Program Officer
Office of Strategic Initiatives
The Library of Congress
Washington, DC