Beecher cni fall 2010 v4

Post on 17-Jan-2015

This is a talk from the Coalition for Networked Information Fall 2010 Member Meeting (CNIfall2010). I talked about our project to use Fedora as archival storage for social science research data and documentation.

Preserving Social Science Research Data Using Fedora

Bryan Beecher

Inter-university Consortium for Political and Social Research (ICPSR)

CNI Fall 2010 Membership Meeting

ICPSR

• World’s largest social science research data archive
  – Lots of files (millions)
  – Small files (6TB total)

• Long track record of success
  – 50 yrs
  – Trust us
  – Enormous legacy burden

ICPSR

• Survey data are our core
  – Low volume of new content compared to natural sciences
  – We curate each item extensively (disclosure, quality, format, usability)

• Strong access orientation
  – Talk like an archive
  – Walk like an archive?

Walking the walk

• Good storage container for content and its metadata

• OAIS-compliant
• Generate SIPs and AIPs (and DIPs)
• But…

What should we do?

Where to begin?

Focus areas
• Preservation
• Going forward
• Reusable

Do not try to include
• Access
• Everything we have

A Solution

• Fedora objects
  – Container for stuff we ingest and preserve

• Fedora services
  – To generate AIPs and SIPs

• Tool to generate FOs from existing content and metadata

Ingest

• The Motivated Depositor
  – Eager to describe the research data in great detail
  – Uploads complete, machine-readable metadata

Ingest (continued)

• The Unmotivated Depositor
  – Uploads a variety of proprietary file formats for documentation and data
  – Leaves the baby on the doorstep

Ingest – Nov 2010 deposits

Ingest (continued)

• Typical deposit
  – Research data in one of the common stat packages (SAS, SPSS, etc.)
  – Technical documentation in a proprietary format (Word, PDF)
  – A proto-SIP in quasi-OAIS terms
  – Minimal level of metadata regarding how the survey was conducted

Ingest container – file level

• Vanilla Fedora Object
  – Will never know what sort of content format to expect
  – Use the RELS-EXT to connect related files
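As a rough illustration of connecting a file-level object to its parent through RELS-EXT, the sketch below builds the RDF/XML for a single isPartOf relationship using Python's standard library. The PIDs are hypothetical placeholders, and the choice of isPartOf follows the diagrams later in the talk; the relationships ICPSR actually records at deposit time are not stated in the slides.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Fedora 3 RELS-EXT datastreams.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(file_pid: str, parent_pid: str) -> str:
    """Return RELS-EXT RDF/XML stating that file_pid isPartOf parent_pid."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of the relationship is the file-level object itself.
    desc = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{file_pid}"},
    )
    # The relationship target is the object that groups the related files.
    ET.SubElement(
        desc,
        f"{{{REL_NS}}}isPartOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{parent_pid}"},
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical PIDs, used only for illustration.
    print(build_rels_ext("icpsr:deposit-1-file-1", "icpsr:deposit-1"))
```

Keeping the relationship in RELS-EXT means the file-level object can stay format-agnostic: nothing about the content datastream has to be known in order to find its siblings.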

Ingest container – deposit

• Another plain Fedora Object
  – Points to all of the files stored in the file-level objects
  – Relatively little metadata stored for this level of object

Ingest container – example

Ingest and the OAIS PDI

• Reference – unique Fedora PID
• Fixity – Fedora-generated checksum
• Provenance – identity of depositor recorded in the DC Datastream
• Context – original file name captured in the content Datastream
• Access Rights – terms of deposit
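The mapping above can be pictured as one small record per deposited file. The sketch below is only an illustration of that idea: the class name, field names, and sample values are hypothetical, and the checksum is computed locally with SHA-256 as a stand-in for the checksum Fedora generates itself.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservationDescription:
    """The five OAIS PDI categories for one deposited file (hypothetical layout)."""
    reference: str      # unique Fedora PID
    fixity: str         # checksum of the content
    provenance: str     # identity of the depositor (kept in DC in the talk)
    context: str        # original file name
    access_rights: str  # terms of deposit


def describe_file(pid: str, content: bytes, depositor: str,
                  original_name: str, terms: str) -> PreservationDescription:
    # Assumption: SHA-256 computed here; in the talk, Fedora generates the checksum.
    digest = hashlib.sha256(content).hexdigest()
    return PreservationDescription(
        reference=pid,
        fixity=f"SHA-256:{digest}",
        provenance=depositor,
        context=original_name,
        access_rights=terms,
    )


if __name__ == "__main__":
    # All values below are made up for illustration.
    pdi = describe_file("icpsr:deposit-1-file-1", b"example bytes",
                        "Jane Depositor", "survey.sav", "standard terms of deposit")
    print(pdi)
```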

Generating OAIS SIPs

• Original content
  – Normalized version too, if applicable
  – What’s normalization in this context?

• Preservation Description Information (PDI)
  – As described previously

• Delivered via SDef/SDep combo

Ingest – continued

• Data
  – Disclosure analysis
  – Recoding

• Documentation
  – Corrections
  – Clarifications

• Normalized formats

Ingest – finale

• Packaged into a “study”
  – Data, doc, questionnaire, user guide, etc.
  – Normalized formats for preservation
  – Convenient formats for access

Ingest – finale

[Diagram: Fedora objects and their components]

PID: REPORT (text/plain), objectProperties, DC, RELS-EXT, AUDIT

icpsr:release-28748-file-1: STATA-DICT (text/plain), DATA (text/plain), DDI (text/xml), SAS-SETUPS (text/plain), SPSS-SETUPS (text/plain), STATA-SETUPS (text/plain), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

icpsr:release-28748-file-2: CODEBOOK (application/pdf), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

icpsr:release-28748-file-3: QUESTIONNAIRE (application/pdf), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

Generating OAIS AIPs

• For each object (file)
  – Everything from the SIP plus
    • Preservation events
    • Description of the transformation used
    • Preservation commitment
  – Its post-processed version

• Delivered via SDef/SDep combo
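One way to read the list above is that an AIP wraps the SIP and adds what happened to the file during processing. The sketch below models that with plain dataclasses; every class, field, and sample value is hypothetical and is not taken from ICPSR's Fedora services, which deliver the real packages through an SDef/SDep pair.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SIP:
    pid: str         # Fedora PID of the file-level object
    original: bytes  # content as deposited
    pdi: dict        # Preservation Description Information


@dataclass
class PreservationEvent:
    event_type: str      # e.g. "normalization" (illustrative value)
    transformation: str  # description of the transformation used


@dataclass
class AIP:
    sip: SIP                            # everything from the SIP
    events: List[PreservationEvent]     # what was done to the file
    commitment: str                     # preservation commitment
    processed: Optional[bytes] = None   # post-processed version, if any


def build_aip(sip: SIP, events: List[PreservationEvent],
              commitment: str, processed: Optional[bytes] = None) -> AIP:
    # Copy the event list so later changes by the caller do not alter the AIP.
    return AIP(sip=sip, events=list(events),
               commitment=commitment, processed=processed)


if __name__ == "__main__":
    # Sample values are invented for illustration.
    sip = SIP("icpsr:deposit-1-file-1", b"raw bytes", {"reference": "icpsr:deposit-1-file-1"})
    events = [PreservationEvent("normalization", "SPSS file converted to plain text")]
    print(build_aip(sip, events, "hypothetical commitment level", b"plain text"))
```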

Example AIP

[Diagram: Fedora objects and their components in an example AIP]

PID: REPORT (text/plain), objectProperties, DC, RELS-EXT, AUDIT

icpsr:release-28748-file-1: STATA-DICT (text/plain), DATA (text/plain), DDI (text/xml), SAS-SETUPS (text/plain), SPSS-SETUPS (text/plain), STATA-SETUPS (text/plain), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

icpsr:release-28748-file-2: CODEBOOK (application/pdf), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

icpsr:release-28748-file-3: QUESTIONNAIRE (application/pdf), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

PID: objectProperties, DC, RELS-EXT, AUDIT

Questions we faced

• Datastreams or relationships?
• What about our XML?
• AIPs or DIPs?
• How to build FOXML?

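Datastreams / relationships?

[Diagram: two alternatives. Either two objects, one holding CONTENT X and one holding CONTENT Y, each with its own PID, objectProperties, DC, RELS-EXT, and AUDIT; or a single object holding both CONTENT X and CONTENT Y alongside one PID, objectProperties, DC, RELS-EXT, and AUDIT.]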

Our XML

• DDI v2
  – Contains lots of the information one might expect to find in the DC

• Strategy
  – Duplicate it
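A minimal sketch of the duplication strategy, assuming a simplified DDI 2 codebook without XML namespaces: a few study-level citation elements are copied into an oai_dc record. The element paths (titl, IDNo, AuthEnty) are standard DDI Codebook names, but the mapping table, the sample document, and the function name are illustrative assumptions rather than ICPSR's actual crosswalk.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Hypothetical crosswalk: DDI path (relative to <codeBook>) -> DC element name.
DDI_TO_DC = {
    "stdyDscr/citation/titlStmt/titl": "title",
    "stdyDscr/citation/titlStmt/IDNo": "identifier",
    "stdyDscr/citation/rspStmt/AuthEnty": "creator",
}

# Invented sample record; real DDI files usually declare a namespace.
SAMPLE_DDI = """
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>Example Survey, 2010</titl>
        <IDNo>00000</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty>Doe, Jane</AuthEnty>
      </rspStmt>
    </citation>
  </stdyDscr>
</codeBook>
"""


def ddi_to_dc(ddi_xml: str) -> ET.Element:
    """Copy selected DDI citation fields into a new oai_dc:dc element."""
    codebook = ET.fromstring(ddi_xml)
    dc = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for path, dc_name in DDI_TO_DC.items():
        for node in codebook.findall(path):
            text = (node.text or "").strip()
            if text:  # skip empty DDI elements
                ET.SubElement(dc, f"{{{DC_NS}}}{dc_name}").text = text
    return dc


if __name__ == "__main__":
    print(ET.tostring(ddi_to_dc(SAMPLE_DDI), encoding="unicode"))
```

The DDI stays the fuller record; the DC copy only makes the basics visible to tools that look at the DC datastream.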

AIPs or DIPs

• Lots of copies
• Destination
  – Archival Storage remote location
  – Repository for ingest

Building FOXML

• Source
  – Database
  – DDI XML

• Re-usable tool
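To give a feel for what such a tool has to emit, the sketch below builds a bare-bones FOXML 1.1 object (object properties plus an inline DC datastream) from a dictionary standing in for a database row. The record keys, the PID, and the function name are hypothetical; the output has not been validated against the FOXML schema, and a real object would also carry content datastreams and RELS-EXT.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
MODEL = "info:fedora/fedora-system:def/model#"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_foxml(record: dict) -> str:
    """Build a minimal FOXML document from a record with 'pid' and 'label' keys."""
    for prefix, uri in (("foxml", FOXML_NS), ("oai_dc", OAI_DC_NS), ("dc", DC_NS)):
        ET.register_namespace(prefix, uri)

    obj = ET.Element(f"{{{FOXML_NS}}}digitalObject",
                     {"VERSION": "1.1", "PID": record["pid"]})

    # Object-level properties: state and label.
    props = ET.SubElement(obj, f"{{{FOXML_NS}}}objectProperties")
    ET.SubElement(props, f"{{{FOXML_NS}}}property",
                  {"NAME": MODEL + "state", "VALUE": "Active"})
    ET.SubElement(props, f"{{{FOXML_NS}}}property",
                  {"NAME": MODEL + "label", "VALUE": record["label"]})

    # Inline XML (control group X) Dublin Core datastream.
    ds = ET.SubElement(obj, f"{{{FOXML_NS}}}datastream",
                       {"ID": "DC", "STATE": "A", "CONTROL_GROUP": "X",
                        "VERSIONABLE": "true"})
    version = ET.SubElement(ds, f"{{{FOXML_NS}}}datastreamVersion",
                            {"ID": "DC.0", "MIMETYPE": "text/xml",
                             "LABEL": "Dublin Core"})
    content = ET.SubElement(version, f"{{{FOXML_NS}}}xmlContent")
    dc = ET.SubElement(content, f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(dc, f"{{{DC_NS}}}title").text = record["label"]
    ET.SubElement(dc, f"{{{DC_NS}}}identifier").text = record["pid"]

    return ET.tostring(obj, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record, as it might come back from a database query.
    print(build_foxml({"pid": "icpsr:example-1", "label": "Example Survey, 2010"}))
```

Driving the builder from a plain dictionary is what would make it reusable: the same function could be fed from a database query or from fields pulled out of DDI XML.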

Special Thanks

The Team
• Peggy Overcashier
• Nathan Adams
• Nancy McGovern
• Mary Vardigan

The Funder
• National Science Foundation Award 0958382
• INTEROP EAGER program