Electronic Mail List Preservation Takes Off: The H-Net Archive

Lisa M. Schmidt, MATRIX: The Center for Humane Arts, Letters & Social Sciences Online, Michigan State University. May 1, 2009. http://www.h-net.org/archive/


Transcript of Electronic Mail List Preservation Takes Off: The H-Net Archive

Page 1: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Electronic Mail List Preservation Takes Off:

The H-Net Archive

Lisa M. Schmidt

http://www.h-net.org/archive/

MATRIX: The Center for Humane Arts, Letters & Social Sciences Online

Michigan State University

May 1, 2009

Page 2: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Preserving the H-Net E-Mail Lists

• H-Net Background

• Original “Preservation” Practices

• Use of the Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)

• Preservation Improvement Plan

Page 3: Electronic Mail List Preservation Takes Off:  The H-Net Archive

H-Net: Humanities and Social Sciences Online

• International consortium of scholars and teachers

• Oldest collection of born-digital and content-moderated arts, humanities, and social science material on the Internet

• Valuable scholarly resource
  – More than 180 networks, or e-mail lists
  – More than 230 “private” lists
• More than 1 million e-mail messages
• Hosted by MATRIX

Page 4: Electronic Mail List Preservation Takes Off:  The H-Net Archive

MATRIX

• Digital humanities research center

• Devoted to the application of new technologies in humanities and social science teaching and research

• Uses Internet technologies to improve education and increase the flow of information

Page 5: Electronic Mail List Preservation Takes Off:  The H-Net Archive

NHPRC Grant

• Conduct assessment of existing H-Net preservation policies and practices

• Apply OCLC/CRL TRAC checklist

• Develop and implement an improved long-term preservation plan

• Useful to those managing large collections of electronic records

• Research semantic clustering search techniques

Page 6: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: Backup & Security

• 3 TB of data, including H-Net
• Server rack kept in climate controlled, physically secured room
• Daily incremental backups, weekly full
• Monthly full, “permanent” tape backups

Page 7: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works

• H-Net runs on LISTSERV software

• Submission policies
  – Users must be list subscribers to post
  – Messages written in plain text
  – No attachments allowed on public lists

Page 8: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: An Archival Perspective

• Appraisal/Acquisition/Accession
  – All approved messages permanently archived
  – Editors approve and post messages
  – Messages post from a few seconds up to several days after approval

Page 9: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: An Archival Perspective

Message Posting Process

Page 10: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: An Archival Perspective

• Arrangement
  – Messages kept in flat text files called “notebooks”
  – Single notebook includes messages posted during seven-day time period, concatenated in original order

Page 11: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: An Archival Perspective

Page 12: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: An Archival Perspective

• Arrangement
  – Notebooks appear to be arranged in original order within each list directory

Page 13: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: An Archival Perspective

• Description
  – Most descriptive metadata for messages automatically generated on creation/posting
  – “Author’s Subject” inserted by creator

Page 14: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: An Archival Perspective

Notebook File Naming

• Notebook description contained in filename
  – Ex. “h-africa.log0802a”

Period  Day of Month
a       1-7
b       8-14
c       15-21
d       22-28
e       29-31
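
The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not H-Net's own code: it assumes the four digits in “h-africa.log0802a” are a two-digit year followed by a two-digit month, and the function names `period_letter` and `notebook_filename` are hypothetical.

```python
from datetime import date


def period_letter(day: int) -> str:
    """Map a day of the month to its notebook period letter (a-e)."""
    if not 1 <= day <= 31:
        raise ValueError(f"day of month out of range: {day}")
    # Days 1-7 -> a, 8-14 -> b, 15-21 -> c, 22-28 -> d, 29-31 -> e.
    return "abcde"[(day - 1) // 7]


def notebook_filename(list_name: str, posted: date) -> str:
    """Build a notebook filename such as 'h-africa.log0802a'.

    Assumption: the digits are YYMM (two-digit year, then month).
    """
    return f"{list_name.lower()}.log{posted:%y%m}{period_letter(posted.day)}"


if __name__ == "__main__":
    print(notebook_filename("H-Africa", date(2008, 2, 5)))
```

Under that assumption, a message posted to H-Africa on any day from February 1 to 7, 2008 would land in the notebook named in the slide's example.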

Page 15: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: Message Retrieval

• BRS Database
  – Newest notebook messages parsed and copied every 24 hours
  – MD5 hashes created for each message
  – Available for full-text search
• MySQL Database Cache
  – Key metadata extracted, MD5 hashes created, written to database cache
  – Enables more efficient browsing
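
As a rough sketch of the metadata-extraction step, the Python below parses one plain-text message, pulls a few headers, and computes an MD5 hash of the message text. The choice of headers, the record layout, and the function name `cache_record` are assumptions made for illustration; the slides do not say which fields H-Net caches or exactly which bytes are hashed.

```python
import hashlib
from email import message_from_string


def cache_record(raw_message: str) -> dict:
    """Return a metadata record for one plain-text e-mail message.

    Hypothetical layout: the real cache schema is not given in the slides.
    """
    msg = message_from_string(raw_message)
    return {
        "from": msg.get("From", ""),
        "date": msg.get("Date", ""),
        "subject": msg.get("Subject", ""),
        # MD5 is used here as a lookup key for message discovery,
        # not as a fixity check.
        "md5": hashlib.md5(raw_message.encode("utf-8")).hexdigest(),
    }


if __name__ == "__main__":
    sample = "From: editor@example.org\nSubject: Call for papers\n\nBody text.\n"
    print(cache_record(sample))
```

A record like this could then be written to a database table keyed on the hash, which is what makes browsing faster than re-parsing the notebooks.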

Page 16: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: Message Retrieval

Message Metadata Stored in MySQL Database

Page 17: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works: Message Retrieval

http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-Albion&month=0808&week=b&msg=w8utW6nKNO1FuY19vSK2mo&user=&pw=
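
The retrieval URL above is built from a small set of query parameters. The sketch below reconstructs it with Python's standard library. The parameter names come from the example URL itself; their meanings (for instance, that `week` is the notebook period letter and `msg` is a hash-derived message key) are inferred from the slides, not documented, and `message_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Script location taken from the example URL in the slide.
BASE_URL = "http://h-net.msu.edu/cgi-bin/logbrowse.pl"


def message_url(list_name: str, month: str, week: str, msg_key: str) -> str:
    """Build a log-browser URL for one archived message."""
    params = {
        "trx": "vx",        # value copied from the example; meaning not documented
        "list": list_name,  # e.g. "H-Albion"
        "month": month,     # e.g. "0808"
        "week": week,       # e.g. "b"
        "msg": msg_key,     # message key as it appears in the example
        "user": "",         # left empty, as in the example
        "pw": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(message_url("H-Albion", "0808", "b", "w8utW6nKNO1FuY19vSK2mo"))
```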

Page 18: Electronic Mail List Preservation Takes Off:  The H-Net Archive

How H-Net Works

Message Ingest, Storage, and Retrieval Processes

Page 19: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Original “Preservation” Practices

• Backup, but only local—and no true archiving

• No normalization or migration strategy
  – Message/notebook content: No need
    • Created and stored in plain text formats
    • XML encoding only required with proprietary e-mail formats
  – Needed for attachments on private lists

Page 20: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Original “Preservation” Practices

• Authenticity
  – Informal check by author and/or editor on posting
  – Broken URL on message retrieval attempt
  – Cached metadata as PDI
    • Reference, Content, Provenance Information
    • MD5 hashes for message discovery, not fixity
    • No Fixity Information for notebook files

Page 21: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

• TRAC 1.0 published in February 2007
• For certification by third party or self assessment
• Three sections
  – A. Organizational Infrastructure
  – B. Digital Object Management
  – C. Technologies, Technical Infrastructure, & Security
• 84 audit criteria

Page 22: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

• Compare core audit criteria to local capabilities—“Gap Analysis,” illuminating areas requiring improvement

• Formulate strategies to narrow the gap and improve trustworthiness of repository

Page 23: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

• Example 1: Repository has formal succession plan
  – H-Net: No succession plan in place
  – Narrow the gap: Identify, negotiate with, and make preliminary plans with potential successor; document intent, describing what’s needed in successor

Page 24: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

• Example 2: Repository functions on well-supported operating systems and other core infrastructural software
  – H-Net: Servers run on Debian distribution of Linux
  – No gap!

Page 25: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

Page 26: Electronic Mail List Preservation Takes Off:  The H-Net Archive

The TRAC Experience

• Thorough yet flexible, leaving room for interpretation, lots of options for supporting documentation/evidence

• Good snapshot of current state of repository

• Clarifies what’s needed to narrow the gap

• Great internal audit tool

• Useful for certification of a trusted digital repository

Page 27: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Preservation Improvement Plan: Backup & Archival Storage

Backup

• Long-term (“permanent”) backup tape sets stored offsite, put on 3-year retention schedule

• Reciprocal backup storage arrangement with ICPSR

Archival Storage

• Annual copying to tape of H-Net data, databases, scripts

• Media refreshment every 5 years

• Future: Copy to alternative storage repository

• Future: Participation in distributed archival storage system

Page 28: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Preservation Improvement Plan: Authenticity

Fixity: Individual Messages (SIPs/AIPs)
• Shorten time window for generation of hashes
• Create database of SHA-256 hashes for fixity checks
• Validate message hashes on notebook completion

Fixity: Notebook Files (AICs)
• Create SHA-256 message digests on completion of notebooks
• Calculate SHA-256 message digests for existing notebooks
• Create database of SHA-256 message digests for fixity checks
• Validate notebook hashes on weekly basis
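
A notebook fixity check of the kind planned above might look like the following Python sketch. It is an illustration under stated assumptions, not the project's implementation: the digest "database" is stood in for by a plain dictionary mapping notebook filenames to stored SHA-256 digests, and the names `sha256_digest` and `failed_notebooks` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_digest(path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_notebooks(stored_digests: dict, notebook_dir) -> list:
    """Return the notebooks that are missing or no longer match their digest.

    stored_digests: {filename: expected SHA-256 hex digest}, a stand-in
    for the digest database described in the plan.
    """
    failures = []
    for name, expected in stored_digests.items():
        path = Path(notebook_dir) / name
        if not path.is_file() or sha256_digest(path) != expected:
            failures.append(name)
    return failures
```

Run on a weekly schedule, a check like this would report any completed notebook whose contents have changed since its digest was recorded.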

Page 29: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Preservation Improvement Plan: Authenticity

Page 30: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Preservation Improvement Plan: Other Technical Improvements

Attachments
• Found with < 0.01% of H-Net messages
  – MS Office, PDF, image files
• Provide constructed URLs, as with public lists
• Provide download links
• No file normalization or migration plan
  – Most files should open in viewers, later versions of applications
  – Will help users if problems arise

Preservation of Links to Original Content
• Redirect URLs within messages to archived websites

Page 31: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Preservation Improvement Plan: Narrowing the Gap

• Technical improvements

• Digital preservation policies

• Lather, rinse, repeat: New TRAC assessment

Page 32: Electronic Mail List Preservation Takes Off:  The H-Net Archive

Conclusions

• Relevant to e-mail preservation discussion

• Applicable to preservation of LISTSERV-based and other e-mail lists

• Testbed for other preservation tools and systems

• Useful foundation for digital preservation planning at Michigan State

Page 33: Electronic Mail List Preservation Takes Off:  The H-Net Archive

References

• H-Net Archives Project, http://www.h-net.org/archive/

• H-Net: Humanities and Social Sciences Online, http://www.h-net.org

• MATRIX: The Center for Humane Arts, Letters, and Social Sciences Online, http://www.matrix.msu.edu

• OAIS Reference Model, http://public.ccsds.org/publications/archive/650x0b1.pdf

• Trusted Digital Repositories: Attributes and Responsibilities, http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

• Trustworthy Repositories Audit & Certification: Criteria and Checklist, http://www.crl.edu/PDF/trac.pdf