Panel: What Changes With Digital? Web Archiving ARL Forum 2009 Tracy Seneca – California Digital Library

Panel: What Changes With Digital? Web Archiving

ARL Forum 2009

Tracy Seneca – California Digital Library

Ground to cover:

• Brief background: Web Archiving Service

• What changes about collecting

• 2 Case studies in collaboration
  – Across institutions
  – Between faculty, librarians
  – Across disparate archiving systems
  – Between libraries, content owners

Web Archiving Service

• Developed by the University of California Curation Center of the California Digital Library
  – Formerly the Digital Preservation Group

• Outcome of the Web-at-Risk grant
  – 1st round of NDIIPP grant work

• UC campus libraries, NYU, Stanford, University of North Texas

The Web Archiving Service

http://webarchives.cdlib.org

What Changes About Collecting?

1. The target of collection becomes debatable:
  – An archive?
  – A site?
  – A document?

What Changes About Collecting?

2. There’s a lot we don’t know about what we’re collecting
  – How big is it? How much storage will it use?
  – What’s in it?
  – What are the new publications on the site?
  – Is the site linking to valuable, relevant information?

Mining Sites for Documents

What Changes About Collecting?

3. It’s not always clear what to collect
  – A national library may have a clear mandate to capture its nation’s web domain
  – The Institute of Transportation Studies may have an immediately obvious scope of content to collect
  – What does a large research library collect?

What Changes About Collecting?

4. We don’t know how scholars will use this information

• Object of study could be:
  – Content of the documents
  – Site change
  – Acts of citizen journalism
  – Blog spam, viruses

“There is an ongoing need for case studies that can illustrate possible approaches to early interventions with digital records creators, institutional collaborations, and partnerships with information technology specialists.”

Case 1: 2003 Recall

2003 California Recall Archive

• 200+ sites selected by UC Librarians, Stanford

• Sites crawled by Stanford Computer Science Dept. as part of WebBase project

• Content captured in entirely different format from the WARC archival format used by WAS & Archive-It

• Content migrated to WARC format, transferred to CDL in 2008 – public access via WAS in July 2009
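For readers unfamiliar with the target format of that migration, here is a minimal sketch of what a single WARC "response" record looks like when assembled by hand. It follows the general WARC 1.0 record layout (a version line, named header fields, a blank line, the payload, then two CRLFs). The function name and the sample URL are hypothetical, the header set is deliberately minimal, and nothing here reflects how the WebBase content was actually converted, which the slides do not describe.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_payload: bytes,
                               captured_at: datetime) -> bytes:
    """Assemble one WARC/1.0 'response' record (illustrative, minimal headers).

    `captured_at` is assumed to be a timezone-aware datetime.
    """
    warc_date = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # Header block ends with a blank line; the record ends with two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_payload + b"\r\n\r\n"


if __name__ == "__main__":
    # Hypothetical captured HTTP response for a hypothetical site.
    payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
    record = build_warc_response_record(
        "http://recall.example/",
        payload,
        datetime(2003, 10, 7, 12, 0, 0, tzinfo=timezone.utc),
    )
    print(record.decode("utf-8"))
```

A production migration would normally rely on an established WARC library and would also carry over capture metadata from the source system; the sketch only shows the shape of the container the content ended up in.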

Case 1

• Collaborative content selection across campuses, institutions

• Data stewardship across institutions

• Migration of data across formats, archive data models

• Collaboration between Social Science faculty, Computer Science grad students

• A dark archive goes light!

Case 2: California Government

Collaborative Collection: State of California

Government Information librarians across UC campuses manage the archive.

300 sites derived from California State Agency Directory

Source for shared cataloging of key California State Documents

Twice yearly captures of all agency sites; more frequent captures of approximately 30 priority sites

Collaborative Collection: Local California

Seven archives of local California agencies maintained by separate UC campuses

Testing cross-archival search tools to combine all state, local search results

518 sites preserved in local archives

Challenge: varying resources, priorities at UC campuses, some geographic areas missed

Cornell Web Lab study identifies .ca.gov as the third largest U.S. government subdomain.

Need for Collaboration with Content Owners

Robots.txt Patterns in California State Agency Sites

Restricted:
• California State Library
• California State Controller
• Office of State Publishing
• Secretary of State

Not Restricted:
• Office of Information Security and Privacy Protection
• Office of Systems Integration
• Legislative Analyst's Office

• Consistent design

• Strong patterns to restrictions

28 sites read exactly:

User-agent: *
Disallow: /images
Disallow: /classes
Disallow: /cgi-bin
Disallow: /htdig
Disallow: /js
Disallow: /styles
Disallow: /ssi
Disallow: /css
Disallow: /javascript
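To make the effect of that shared robots.txt concrete, the sketch below feeds the rules from this slide to Python's standard `urllib.robotparser` and checks a handful of URLs. The host `agency.example`, the sample paths, and the `example-archiver` user agent are hypothetical; only the rule set comes from the slide. Under these rules the pages and documents remain fetchable, while the supporting images, stylesheets, and scripts do not.

```python
from urllib.robotparser import RobotFileParser

# Rules as listed on the slide (shared verbatim by 28 state agency sites).
ROBOTS_TXT = """\
User-agent: *
Disallow: /images
Disallow: /classes
Disallow: /cgi-bin
Disallow: /htdig
Disallow: /js
Disallow: /styles
Disallow: /ssi
Disallow: /css
Disallow: /javascript
"""


def blocked_urls(robots_txt: str, urls: list[str],
                 user_agent: str = "example-archiver") -> list[str]:
    """Return the URLs a robots.txt-honoring crawler would skip."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical URLs on a hypothetical agency site.
SAMPLE_URLS = [
    "http://agency.example/",
    "http://agency.example/reports/annual.pdf",
    "http://agency.example/images/logo.gif",
    "http://agency.example/css/main.css",
    "http://agency.example/javascript/menu.js",
]

if __name__ == "__main__":
    for url in blocked_urls(ROBOTS_TXT, SAMPLE_URLS):
        print("blocked:", url)
```

A crawler that honors these rules would still capture the documents themselves, but an archived page could lose its images and styling on replay. That is one reason to raise the restrictions with the agencies' webmasters.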

Is Robots.txt Really a Copyright Management Tool?

“The conversation used to be between the library and the publisher. Now, it is between the library and a webmaster.”
- Gildas Illien, Bibliothèque nationale de France

A Potentially Fruitful Conversation?

• The National Archives comprehensively archive UK Central Government sites

“Continuity and Preservation: The National Archives approach to maintaining permanent access to the web presence of UK Central Government”- Amanda Spencer and Alison Heatherington

Case 2

• Collaboration across campuses in selection, resource allocation

• Shared collection of material relevant to all campuses

• Communication underway with state agencies, webmasters

• Potential to provide service directly to agency site

• Potential to begin linking archives together

Parting thoughts

• MANY more examples of collaborative work!
  – End of Term Harvest
  – International Internet Preservation Consortium – international Olympics archive
  – Zepheira: data visualization portal to NDIIPP content

“…longer-term preservation costs for these kinds of materials are not well understood.”

“In the digital world, it is all too easy to acquire materials that a library cannot afford to keep in perpetuity.”