Digital Stewardship and Higher Education IT: Lessons from the National Agenda
Prepared for
NERCOMP Annual Conference
March 2014
Presented by:
Micah Altman, <[email protected]>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, Brookings Institution
DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!” -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Preview
• Who are the NDSA?
• Why develop an agenda for digital stewardship?
• What should national stewardship priorities be?
– … research & foundations of stewardship
– … digital content
– … technical infrastructure
– … organizational roles
• Lessons for Higher Ed IT
Collaborators & Co-Conspirators
• The 160+ institutional members of NDSA, and the 10,000+ hours contributed by their representatives to NDSA working groups, meetings, and reports
• National Agenda Authors: Micah Altman, Jefferson Bailey, Karen Cariani, Jim Corridan, Jonathan Crabtree, Blaine Dessy, Michelle Gallinger, Andrea Goethals, Abigail Grotke, Cathy Hartman, Butch Lazorchak, Jane Mandelbaum, Carol Minton Morris, Trevor Owens, Meg Phillips, John Spencer, Helen Tibbo, Tyler Walters, Kate Wittenberg, Kate Zwaard
Who are the NDSA?
About the NDSA
• Founded in 2010, the National Digital Stewardship Alliance (NDSA) is a consortium of institutions that are committed to the long-term preservation of digital information.
• Our mission is to establish, maintain, and advance the capacity to preserve our nation's digital resources for the benefit of present and future generations.
• NDSA member institutions represent all sectors, and include universities, consortia, professional associations, commercial enterprises, and government agencies at the federal, state, and local levels. The Library of Congress provides organizational support and substantive collaboration as Secretariat.
• Based on collaborative community effort -- there are no fees for NDSA membership. Each member institution commits to NDSA principles, and contributes effort to working groups, reports, surveys, meetings, and other NDSA initiatives.
NDSA Initiatives
Working Groups
Recent Outputs
Extending Knowledge
• Preservation Storage Survey
• Web Harvesting Survey
• Preservation Staffing Survey
• Geospatial Selection & Appraisal report
• Content case studies
• NDSA Interview Series
Tools for Practice
• Levels of Preservation
• Digital Preservation in a Box
• Digital Preservation on Wikipedia
Dissemination
• National Agenda for Digital Stewardship
• NDSA Innovation Awards
• NDSA Social Media
NDSA Member Organizations
• 165 Member Organizations
• From all sectors
• Committed to digital stewardship
digitalpreservation.gov/ndsa/memberslist.html
Why develop an agenda for digital stewardship?
Why a national agenda for digital stewardship?
• Effective digital stewardship is vital for:
– maintaining authentic public records
– growing a reliable scientific evidence base
– providing durable access to our cultural heritage
• Knowledge of ongoing research, practice, and organizational collaborations is distributed widely across disciplines, sectors, and communities of practice
How was this accomplished?
• Contributed community effort
– Development: contributions from the (now 150+) institutional members through working group participation, workshop discussion, and commentary
– Writing: LC staff, chairs of NDSA working groups, coordination committee
– Reviewing: expert reviewers in the preservation community
• Integrating diverse perspectives from multiple disciplines & sectors
• The persistence, organization, and commitment of the Library of Congress in its role as Secretariat
Why Now - Climate
Strong trends towards:
• More production of digital content
• More publishing, filtering, and access
• More learners and collaborators
• More attention to public information
Trends in Higher Education Technology will Increase Need for Information Stewardship
• Adoption Trends
– Growing Ubiquity of Social Media
– Integration of Online, Hybrid, and Collaborative Learning
– Rise of Data-Driven Learning and Assessment
– Shift to Students as Creators
– Evolution of Online Learning
• Significant Challenges
– Low Digital Fluency of Faculty
– Scaling Teaching Innovations
• Important Developments
– Learning Analytics
– 3D Printing
– Quantified Self
more information, in new forms, created by more people
need to manage, understand, and retain information for teaching, research, and evaluation
Requires curation at scale
Climate vs Weather
• Climate is what you should expect -- weather is what you get.
• Climate for reproducibility and data management seems favorable… prepare for shifts in the weather.
What Was Accomplished?
The National Agenda for Digital Stewardship identifies high-impact opportunities to advance:
• the state of the art
• the state of practice
• the state of collaboration
Foundations of Content Stewardship: Framework & Research
What is Content Stewardship?
• Stewardship involves taking broad responsibility for preservation and curation
• The goal of preservation is ensuring meaningful long-term access
• Example: If you have 1,000 files (bitstreams) and you’d like a 99.99% chance of accessing them in 20 years, how do you store them?
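To get a feel for how demanding this example is, the sketch below works backwards from the target to the per-file annual loss rate it implies. It reads "accessing them" as accessing all 1,000 files, and it assumes losses are independent across files and across years; both are simplifying assumptions, and the function name is hypothetical.

```python
import math


def required_annual_loss_rate(n_files: int, target: float, years: int) -> float:
    """Largest per-file, per-year loss probability that still gives a
    `target` chance that ALL files survive for `years` years.

    Assumes (unrealistically) that losses are independent across files
    and across years, so survival probabilities simply multiply.
    """
    # target = (1 - rate) ** (n_files * years)  =>  solve for rate.
    # expm1/log keeps precision when the rate is extremely small.
    return -math.expm1(math.log(target) / (n_files * years))


rate = required_annual_loss_rate(n_files=1000, target=0.9999, years=20)
print(f"per-file annual loss rate must stay below about {rate:.2e}")
```

Under these assumptions the tolerable loss rate is on the order of a few in a billion per file per year, which is why the storage question is not trivial.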
Why not store everything with Amazon?
• Amazon claims reliability of 99.999999999% (better odds than winning Powerball®, being struck by lightning, and finding alien life… combined)
What’s left out of the Eleven Nines?
• What are the units? Collection? Object? Bit?
• How was the failure rate calculated? (It’s theoretical)
– MTBF + independence * enough replicas = lots of nines
– But: no details for the estimate provided; no historical reliability statistics provided; no service reliability auditing provided
• What is the empirical evidence for MTBF?
– Storage manufacturers’ hardware MTBF (mean time between failures) figures are inaccurate…
– Failures across hardware replicas are not independent
• What threats are assumed away?
– Software failure (e.g. a bug in the AWS software for its control backplane)
– Legal threats (leading to account lock-out, deletion, or content removal)
– Institutional threats (such as a change in Amazon’s business model)
– Process threats (someone hits the delete button by mistake, forgets to pay the bill, or AWS rejects the payment)
• Do SLAs or audits back up “design” reliability claims?
– No claim to reliability in SLAs (or uptime, availability, response time…)
– Can’t even prove AWS has the content without taking it out!
– Sole recovery for breach is limited to refund of fees for periods the service was unavailable
– No right to inspect Amazon logs, assistance with forensics, etc.
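The "independence * enough replicas = lots of nines" arithmetic can be illustrated with a toy model. The sketch below is not Amazon's calculation (which is not published); the failure probabilities and the single "common-cause" event that destroys every replica at once are illustrative assumptions, and the function names are hypothetical.

```python
import math


def annual_loss_probability(p_replica: float, replicas: int,
                            p_common: float = 0.0) -> float:
    """Probability of losing an object in one year.

    p_replica: chance that any one replica fails during the year.
    p_common:  chance of a common-cause event (software bug, account
               lock-out, operator error) that takes out every replica.
    """
    independent_loss = p_replica ** replicas  # all replicas fail independently
    return p_common + (1.0 - p_common) * independent_loss


def nines(p_loss: float) -> float:
    """Express reliability as a count of nines: 1e-11 loss -> 11 nines."""
    return -math.log10(p_loss)


# Six independent replicas, each with a 1% annual failure chance.
print(nines(annual_loss_probability(0.01, 6)))
# Same hardware, plus a 1-in-10,000 yearly common-cause event.
print(nines(annual_loss_probability(0.01, 6, p_common=1e-4)))
```

In this toy model the headline figure is set almost entirely by the independence assumption: once a small correlated threat is included, adding replicas no longer adds nines.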
And How Much Does it Really Cost?
• Glacier storage is relatively cheap
• Getting your data back is not, if you want it fast
• Creates lock-in and gotchas
Observations
• Digital preservation does not equal “backup”
• Ensuring long-term access requires ongoing evaluation and management of a broad spectrum of risks & costs
• Without attention, the digital evidence base will erode
The Problem - Restated
Keeping risk of object loss fixed -- what choices minimize $?
“Dual problem”: Keeping $ fixed, what choices minimize risk?
Extension
Given a specific cost function for the loss of each object, Loss(object_i), summed over all lost objects, what choices minimize:
Total cost = preservation cost + sum(E(Loss))
[Figure: risk vs. cost]
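A minimal sketch of the total-cost formulation, under a deliberately simple and hypothetical model: the only choice is the number of copies, each copy of each object costs the same to keep, copies are lost independently with a fixed probability over the planning period, and an object is lost only if every copy is lost. None of these parameters come from the Agenda.

```python
from typing import Sequence


def total_cost(copies: int, loss_values: Sequence[float],
               cost_per_copy: float, p_copy_loss: float) -> float:
    """Total cost = preservation cost + sum of expected losses."""
    preservation = cost_per_copy * copies * len(loss_values)
    p_object_lost = p_copy_loss ** copies  # every copy must be lost
    expected_loss = sum(p_object_lost * value for value in loss_values)
    return preservation + expected_loss


def best_copies(loss_values: Sequence[float], cost_per_copy: float,
                p_copy_loss: float, max_copies: int = 10) -> int:
    """Number of copies (1..max_copies) with the lowest total cost."""
    return min(range(1, max_copies + 1),
               key=lambda k: total_cost(k, loss_values, cost_per_copy, p_copy_loss))


# 100 objects, each costing 1000 units if lost; a copy costs 1 unit to keep
# and has a 10% chance of being lost over the period.
values = [1000.0] * 100
print(best_copies(values, cost_per_copy=1.0, p_copy_loss=0.1))
```

Beyond some point, extra copies cost more than the expected loss they avoid, so the minimum sits at an interior number of copies rather than at "as many as possible".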
Are we there yet?
What are some threats?
• Physical & Hardware
• Software
• Insider & External Attacks
• Curatorial Error
• Organizational Failure
Threat Modeling: Bit Corruption
Stages: Corruption, Detection, Repair
• Media characteristics
• Threat characteristics
• Correlations
• Logical Scope of Corruption
• Format Characteristics
• File/encoding Characteristics
• Filesystem Characteristics
• Probability of Successful Repair
• Auditing Frequency
• Auditing Algorithm
• Repair Algorithm
• Repair Frequency
• Repair duration
Methods for Mitigating Bit-Level Risk
• Local Storage
• Physical: Media, Hardware, Environment
• File Systems: transforms, deduplication, redundancy
• Formats
• File Transforms: compression, encoding, encryption
• Replication
• Number of copies
• Diversification of copies
• Verification
• Fixity
• Audit
• Repair
Observations
• Blind replication is rarely a rational long-run strategy – even with lots of copies.
• Without verification/audit and repair strategies, long-term risk often remains high
• There are multiple methods for mitigating threats to access – use these to guide diversification
• Threat / lifecycle modeling is needed in order to make a rational choice
• Better practices, models, and evidence are needed
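The first two observations can be illustrated with a small closed-form comparison. This is a sketch under strong assumptions that are not from the Agenda: copies fail independently with a fixed annual probability, "blind" replication never checks or repairs copies, and the audited strategy checks once a year and always restores every failed copy as long as one good copy remains. Function names are hypothetical.

```python
def p_loss_blind(p_fail: float, copies: int, years: int) -> float:
    """Object is lost once every copy has failed at some point (no repair)."""
    p_copy_dead = 1.0 - (1.0 - p_fail) ** years
    return p_copy_dead ** copies


def p_loss_audited(p_fail: float, copies: int, years: int) -> float:
    """Annual audit and repair: loss requires all copies to fail in the same year."""
    p_year_loss = p_fail ** copies
    return 1.0 - (1.0 - p_year_loss) ** years


p_fail, years = 0.05, 20
print(f"3 copies, blind:    {p_loss_blind(p_fail, 3, years):.4f}")
print(f"10 copies, blind:   {p_loss_blind(p_fail, 10, years):.4f}")
print(f"3 copies, audited:  {p_loss_audited(p_fail, 3, years):.4f}")
```

With these illustrative numbers, three audited copies carry less 20-year risk than ten unaudited ones. Real systems violate the independence and perfect-repair assumptions, which is one reason threat modeling and better evidence are needed.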
Research Priorities
• Applied Research for Cost Modeling and Audit Modeling
• Value of information
• Understanding Information Equivalence & Significance
• Policy Research on Trust Frameworks
• Preservation at Scale
• The Evidence Base for Digital Preservation
What Else do We Need To Know?
• What is the expected future value of a specified collection of digital content?
• What content is already being effectively stewarded by other organizations?
• What is the expected future cost of preserving that content?
• How often do different threats to information manifest? How often do:
– storage hardware or media fail
– software errors cause information loss
– stored information become inaccessible because of obsolete formats, or loss of other contextual knowledge
– human error or malice cause loss of content in an information system
• What is the reliability of current digital preservation networks and services?
• How successful are other proposed strategies for replication, monitoring, certification, and auditing at preventing loss due to these threats?
The Limits of Case Studies
Most current evidence for digital preservation practices and outcomes is based on local case studies and convenience samples
• Case studies are useful for:
– existence proofs
– raising awareness of problems
– process tracing
– hypothesis generation
• Case studies are not enough to:
– advance our scientific knowledge
– create robust predictive models
– test causal hypotheses
– strongly guide decision making
• Systematic evidence is needed to support both:
– general selection of digital preservation practices and methods
– application of selected digital preservation methods in a specific operational context
How will we learn?
• Apply existing research methodologies from other fields -- especially fields involving observational research on humans and human systems
• Some useful methodologies:
– probability-based surveys (e.g. of information management practice and outcomes)
– replicable simulation experiments tied to theoretically grounded models of information management and risk
– creation of testbeds and test corpora that can be used to systematically compare new practices, tools, and methods
– field experiments, in which randomized interventions are applied and evaluated in real operational environments
Observations
• Developing better practices will require going beyond case studies – to formal modeling, computer simulation, statistical analysis, and experiments
National priorities for… Digital Content
Selected Digital Content Areas that Challenge Curation
• Web and Social Media
• Electronic Records
• Moving Image and Recorded Sound
• Research Data
• “Big” Data
Goals of content curation
• Curation involves selection of content for retention, and management for use
• Selection requires predicting future value, in order to build an information portfolio that increases in value
• Management requires capturing and maintaining tacit information that ensures fitness for use
• Content size, uncertain value, rapid change, unstable form, and external context are core challenges to curation
Observations:
• The tacit information needed to understand formats is lost over time. Format migration plans are needed to mitigate risk.
• Information objects are rarely self-documenting. Ensuring fitness for use requires metadata, provenance, “documentation”, rights, and authenticity information.
• To select content for long-term access, we need to develop theoretically grounded and empirically tested models of information valuation and portfolios.
• Cost models for digital stewardship exist, but they are most accurate for collections of small, static digital objects in stable formats. Generally, a few things are clear:
– Raw storage is rarely the limiting cost factor
– Management of objects is cheapest and most effective if tacit information is captured early in the lifecycle
National priorities for… Technical Infrastructure
2014 Technical Infrastructure Priorities
• Interoperability and Portability in Storage Architectures
• Integration of Digital Forensics Tools
• Ensuring Content Integrity
Interoperability and Portability in Storage Architectures
• As stewardship organizations manage increasingly large and complex data sets, interoperability at various levels within the hardware and software stacks that make digital preservation possible becomes increasingly important.
• Interoperability of storage devices, hardware, data tape, and file system software would help alleviate bottlenecks in the interrelationship between distinct functions in workflows.
• There is a need to establish and promote technical means by which lower levels of the technology stack can integrate directly, without requiring extensive computation and processing at higher levels.
Integration of Digital Forensics Tools
• Digital forensics tools are essential for working across the range of heterogeneous kinds of digital materials coming under stewardship
• Projects like BitCurator are pulling together the suite of tools to do this work and developing processes and workflows.
• We are now at the point of implementation: it’s time for organizations to start implementing and sharing information about their work
• The result of this work will be large sets of heterogeneous digital files, which will then push for the development of tools to work with these kinds of data at scale.
Ensuring Content Integrity
• Digital preservation is possible through a chain of migration of current hardware and software systems to yet-to-be-established future infrastructures.
• Maintaining file fixity is a minimum requirement.
• Beyond file fixity, there is a need to ensure that the semantics of the data and the quality of representation remain unchanged when the object is represented in different forms.
• Identifying the significant semantic properties of the digital object, and algorithms to create semantic fingerprints, can help ensure that meaning is preserved over time.
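File-level fixity, the minimum requirement named above, is commonly checked by recording a cryptographic digest for each file and recomputing it later. The sketch below uses SHA-256 from Python's standard library; the manifest layout and the function names are illustrative assumptions, not a prescribed NDSA format, and the check says nothing about semantic integrity.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest at ingest time."""
    return {str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def audit(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the manifest entries that are now missing or altered."""
    failures = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(name)
    return failures
```

A typical use would be to call build_manifest once at ingest, store the result separately from the content, and run audit on a schedule; any name it returns is a candidate for repair from another copy.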
Observations:
• Interoperability and portability across local and cloud storage architectures remains a significant issue – beware economic and technical lock-in
• Curation of objects acquired later in the information lifecycle often requires digital forensics – invest in tools and expertise
• Ensuring integrity of content over time requires assessing fixity at both a file and a semantic level
National priorities for… Organizational Development
State of the curation practice: Trusted Digital Repositories
An organization with a mission to provide reliable, long-term access to managed digital resources to its designated community, coupled with sufficient evidence of practices to ensure the success of this mission.
• Formalized in:
– OAIS Reference Model (standardized in ISO 14721:2012)
– Trustworthy Repositories Audit & Certification (TRAC) (standardized in ISO 16363:2012)
National Priorities for Organizational Roles, Policies, and Practices
The Agenda identifies a need for greater cross-organizational cooperation to increase the impact of, and leverage, investments made by individual institutions.
Potential Nexuses for Preservation Failure
• Technical
– Media failure: storage conditions, media characteristics
– Format obsolescence
– Preservation infrastructure software failure
– Storage infrastructure software failure
– Storage infrastructure hardware failure
• External Threats to Institutions
– Third party attacks
– Institutional funding
– Change in legal regimes
• Quis custodiet ipsos custodes?
– Unintentional curatorial modification
– Loss of institutional knowledge & skills
– Intentional curatorial de-accessioning
– Change in institutional mission
Source: Reich & Rosenthal 2005
Priorities for Organizational Collaboration
1) Provision networked preservation services – a network of preservation service providers with specialized services, rather than every organization performing all aspects of digital preservation (a number of core risks are institutional)
2) Collaborate on shepherding and promotion of standards – digital preservation community representation on the relevant standards bodies, rather than each organization needing to participate in every body
3) Share digital preservation training and staffing resources
Observations
• Trustworthy repository standards provide good abstract models of a single institution’s curatorial responsibilities, and an inventory of accepted practices
• Many threats to content require multi-institutional stewardship
• Certification of trustworthiness and evaluation of the impact of accepted practices are still in early stages
• Both intra- and inter-institutional collaboration is needed to provision preservation services, set standards, and establish and evaluate trustworthiness
What’s next?
A National Stewardship Agenda for 2015 and Beyond
• Drafts and update process starts this winter
• Community review process late spring
• An update will be presented in July at Digital Preservation 2014
Moving Digital Stewardship Forward
NDSA has a commitment to:
• Facilitating broad collaboration
• Promoting dissemination and engagement
• Regular updates and revisions of the National Agenda and core NDSA surveys
Want more information?
Contact NDSA for…
• Briefings, webinars, and consultations on the Agenda or other NDSA work
• Assistance in gathering comments on national policies and programs
• Assistance in recruiting experts for review and discussion panels; grant review
• Referrals to content stewards in specific areas
Observations: Principles
• The core of digital stewardship is taking broad responsibility for preservation and curation
• The goal of preservation is meaningful long-term access
• The principal activities of curation are selection and management for use
Observations: Planning
• Blind replication is rarely a rational long-run strategy – even with lots of copies.
• Without verification and repair strategies, long-term risk often remains high
• There are multiple methods for mitigating threats to access – use these to guide diversification
• Threat / lifecycle modeling is needed in order to make a rational choice
• Developing better practices will require going beyond case studies – to formal modeling, computer simulation, statistical analysis, and experiments
Observations: Curation
• The tacit information needed to understand formats is lost over time. Format migration plans are needed to mitigate risk.
• Information objects are rarely self-documenting. Ensuring fitness for use requires metadata, provenance, “documentation”, rights, and authenticity information.
• To select content for long-term access, we need to develop theoretically grounded and empirically tested models of information valuation and portfolios.
• Cost models for digital stewardship exist, but they are most accurate for collections of small, static digital objects in stable formats. Generally, a few things are clear:
– Raw storage is rarely the limiting cost factor
– Management of objects is cheapest and most effective if tacit information is captured early in the lifecycle
Observations: Infrastructure
• Interoperability and portability across local and cloud storage architectures remains a significant issue – beware economic and technical lock-in
• Curation of objects acquired later in the information lifecycle often requires digital forensics – invest in tools and expertise
• Ensuring integrity of content over time requires assessing fixity at both a file and a semantic level
Observations: Organizations
• Trustworthy repository standards provide good abstract models of a single institution’s curatorial responsibilities, and an inventory of accepted practices
• Many threats to content require multi-institutional stewardship
• Certification of trustworthiness and evaluation of the impact of accepted practices are still in early stages
• Both intra- and inter-institutional collaboration is needed to provision preservation services, set standards, and establish and evaluate trustworthiness
Key Terms
• Audit: an independent evaluation of records and activities to assess a system of controls
• Authenticity: information used to verify the truthfulness of assertions about content or its provenance
• Curation: selection of content for retention, and management for fit use
• Content stewardship: broad responsibility for curation and preservation
• File fixity: information used to verify that a digital object has not been altered or corrupted
• Provenance: the chronology of the ownership, custody, operations on, and/or location of an information object
• Preservation: ensuring meaningful long-term access
• Trusted Digital Repository: an organization with a mission to provide reliable, long-term access to managed digital resources to its designated community, coupled with sufficient evidence of practices to ensure the success of this mission
Bibliography
• Bailey, Charles (2011). Digital Curation and Preservation Bibliography. <digital-scholarship.org/dcpb/>
• CCSDS (2012). Reference Model for an Open Archival Information System (OAIS). <public.ccsds.org/publications/archive/650x0m2.pdf>
• Digital Curation Centre (2010-4). How-to Guides: <dcc.ac.uk/resources/how-guides>. Curation Reference Manual: <dcc.ac.uk/resources/curation-reference-manual>
• Giaretta, David (2011). Advanced Digital Preservation. <amazon.com/Advanced-Digital-Preservation-David-Giaretta>
• ISO (2012). ISO 16363:2012: Audit and certification of trustworthy digital repositories. <iso.org/iso/catalogue_detail.htm?csnumber=56510>
• Johnson, L., Adams Becker, S., Estrada, V., Freeman, A. (2014). NMC Horizon Report: 2014 Higher Education Edition. Austin, Texas: The New Media Consortium.
• NDSA (2013). National Agenda for Digital Stewardship. <digitalpreservation.gov/ndsa/nationalagenda/>
• Rosenthal, David S. H., Thomas S. Robertson, Tom Lipkis, Vicky Reich, and Seth Morabito (2005). “Requirements for digital preservation systems: A bottom-up approach”. D-Lib Magazine 11(11). <dlib.org/dlib/november05/rosenthal/11rosenthal.html>
More Information
digitalpreservation.gov/ndsa/nationalagenda
Questions?
E-mail: [email protected]
Web: informatics.mit.edu