Panel: What Changes With Digital? Web Archiving ARL Forum 2009 Tracy Seneca – California Digital...
Panel: What Changes With Digital? Web Archiving
ARL Forum 2009
Tracy Seneca – California Digital Library
Ground to cover:
• Brief background: Web Archiving Service
• What changes about collecting
• 2 Case studies in collaboration
– Across institutions
– Between faculty, librarians
– Across disparate archiving systems
– Between libraries, content owners
Web Archiving Service
• Developed by the University of California Curation Center of the California Digital Library
– Formerly the Digital Preservation Group
• Outcome of the Web-at-Risk grant
– 1st round of NDIIPP grant work
• UC campus libraries, NYU, Stanford, University of North Texas
The Web Archiving Service
http://webarchives.cdlib.org
What Changes About Collecting?
1. The target of collection becomes debatable:
– An archive?
– A site?
– A document?
What Changes About Collecting?
2. There’s a lot we don’t know about what we’re collecting
– How big is it? How much storage will it use?
– What’s in it?
– What are the new publications on the site?
– Is the site linking to valuable, relevant information?
Mining Sites for Documents
What Changes About Collecting?
3. It’s not always clear what to collect
– A national library may have a clear mandate to capture the nation’s web domain
– The Institute of Transportation Studies may have an immediately obvious scope of content to collect
– What does a large research library collect?
What Changes About Collecting?
4. We don’t know how scholars will use this information
• Object of study could be:
– Content of the documents
– Site change
– Acts of citizen journalism
– Blog spam, viruses
“There is an ongoing need for case studies that can illustrate possible approaches to early interventions with digital records creators, institutional collaborations, and partnerships with information technology specialists.”
Case 1: 2003 Recall
2003 California Recall Archive
• 200+ sites selected by UC Librarians, Stanford
• Sites crawled by Stanford Computer Science Dept. as part of WebBase project
• Content captured in an entirely different format from the WARC archival format used by WAS & Archive-It
• Content migrated to WARC format, transferred to CDL in 2008 – public access via WAS in July 2009
Case 1
• Collaborative content selection across campuses, institutions
• Data stewardship across institutions
• Migration of data across formats, archive data models
• Collaboration between Social Science faculty, Computer Science grad students
• A dark archive goes light!
Case 2: California Government
Collaborative Collection: State of California
Government Information librarians across UC campuses manage the archive.
300 sites derived from California State Agency Directory
Source for shared cataloging of key California State Documents
Twice yearly captures of all agency sites; more frequent captures of approximately 30 priority sites
Collaborative Collection: Local California
Seven archives of local California agencies maintained by separate UC campuses
Testing cross-archival search tools to combine all state, local search results
518 sites preserved in local archives
Challenge: varying resources, priorities at UC campuses, some geographic areas missed
Cornell Web Lab study identifies .ca.gov as the third largest U.S. government subdomain.
Need for Collaboration with Content Owners
Robots.txt Patterns in California State Agency Sites
Restricted:
• California State Library
• California State Controller
• Office of State Publishing
• Secretary of State
Not Restricted:
• Office of Information Security and Privacy Protection
• Office of Systems Integration
• Legislative Analyst's Office
• Consistent design
• Strong patterns to restrictions
28 sites read exactly:
User-agent: *
Disallow: /images
Disallow: /classes
Disallow: /cgi-bin
Disallow: /htdig
Disallow: /js
Disallow: /styles
Disallow: /ssi
Disallow: /css
Disallow: /javascript
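To make the effect of this shared robots.txt pattern concrete, the following minimal sketch feeds the rule set quoted on the slide to Python's standard-library `urllib.robotparser` and checks a few sample paths. The host `agency.example`, the user-agent name `example-archiver`, and the sample paths are hypothetical and are not taken from the presentation. The sketch only illustrates how a crawler that honors robots.txt would read these rules; it does not describe how WAS itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The rule set quoted on the slide, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images
Disallow: /classes
Disallow: /cgi-bin
Disallow: /htdig
Disallow: /js
Disallow: /styles
Disallow: /ssi
Disallow: /css
Disallow: /javascript
"""


def allowed_paths(paths, user_agent="example-archiver"):
    """Return {path: True/False} for whether a robots-obeying crawler may fetch it.

    The user agent and the host below are hypothetical placeholders.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return {
        path: parser.can_fetch(user_agent, "http://agency.example" + path)
        for path in paths
    }


if __name__ == "__main__":
    # Hypothetical sample paths of the kind found on an agency site.
    sample = [
        "/index.html",
        "/reports/annual.pdf",
        "/images/seal.gif",
        "/css/main.css",
        "/js/menu.js",
    ]
    for path, ok in allowed_paths(sample).items():
        print(f"{path}: {'fetch' if ok else 'blocked'}")
```

Under these rules, a crawler that obeys robots.txt can fetch page text and documents but not the images, stylesheets, or scripts. An archived copy would therefore likely render incompletely. This suggests the restrictions look more like routine webmaster housekeeping than a statement about copyright.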
Is Robots.txt Really a Copyright Management Tool?
“The conversation used to be between the library and the publisher. Now, it is between the library and a webmaster.”
- Gildas Illien, Bibliothèque nationale de France
A Potentially Fruitful Conversation?
• The National Archives comprehensively archive UK Central Government sites
“Continuity and Preservation: The National Archives approach to maintaining permanent access to the web presence of UK Central Government”
- Amanda Spencer and Alison Heatherington
Case 2
• Collaboration across campuses in selection, resource allocation
• Shared collection of material relevant to all campuses
• Communication underway with state agencies, webmasters
• Potential to provide service directly to agency site
• Potential to begin linking archives together
Parting thoughts
• MANY more examples of collaborative work!
– End of Term Harvest
– International Internet Preservation Consortium: international Olympics archive
– Zepheira: data visualization portal to NDIIPP content
“…longer-term preservation costs for these kinds of materials are not well understood.”
“In the digital world, it is all too easy to acquire materials that a library cannot afford to keep in perpetuity.”