The Other Security: A New and Nimble Approach to Digital Preservation Stephen Abrams Perry Willett...

The Other Security: A New and Nimble Approach to Digital Preservation

Stephen Abrams
Perry Willett

Digital Preservation Program
California Digital Library
University of California

UCCSC 2009: Focus on Security
UC Davis, June 16–17, 2009

Focus on Security

“Traditional” security risks

– Natural disaster
– Infrastructure failure
– Storage failure
– Server failure
– Operating system failure
– Application failure
– Human error
– Malicious attack

Focus on Security

The “other” security risks

– Legal encumbrances
– External dependencies
– Media obsolescence
– Format obsolescence
– Staff competencies
– Institutional commitment
– Financial stability
– Changing user expectations

Focus on Security

The “other” security risks

– Anything that interferes with the usability of managed digital assets now or in the future

Libraries Have a Long Time Horizon

The UC Melvyl union catalog holds over 28 million items; 11,000 are more than 500 years old

[Figure: Melvyl catalog records by date, logarithmic scale from 1 to 100,000,000; series: by century and total; x-axis: <1500 through <2100]

Libraries Have a Long Time Horizon

What can we do to ensure that today’s digital assets are still usable 500 years from now?

[Figure: Melvyl catalog records by date, logarithmic scale from 1 to 100,000,000; series: by century and total; x-axis: <1500 through <2100]

Agenda

What is digital curation?

Redefining the repository: A micro-services approach to curation

Web archiving

CDL/campus curation collaborations

Trusted digital curation services

Summary

Digital Curation

Activities focused on maintaining and adding value to trusted digital content

Encompasses preservation and access, which are complementary, not disparate functions

– Preservation ensures access over time
– Access depends on preservation up to a point in time

How can we make the “Save” button really mean “save”?

[Diagram: Create, Manage, Use, Add value]

Curation Imperatives

Integrated business process

– Robust technological infrastructure and
– Human analysis and decision-making

Programmatic (not project-oriented)

Services (not systems)

Content (not repositories)


D'où venons nous, que sommes nous, où allons nous?

Paul Gauguin, 1897-98, Museum of Fine Arts Boston, 32.270

Where are we from, what are we, where are we going?


Where [is our stuff] from, what [is it], where are we going [with it]?


Where From? What? Where To?

Producer Repository Consumer


Producer: Ingest; Provenance
Repository: Data management / archival storage; Characterization
Consumer: Access / preservation planning; View paths

Information Landscape

Increasing diversity in types and uses of content

Content arising from non-library contexts

Inevitable technological change

Infrastructure Design Goals

Devolve repository function into a set of independent, but interoperable, services

– Since each is small and self-contained, they are more easily developed and maintained

– Since the level of investment is lower, they are more easily replaced

Provide complex function through the flexible combination of atomistic services


Support interaction through procedural APIs, command line applications, and web interfaces

– Let content managers and curators interact with the services without requiring changes to existing work practices

Rather than force content to come to the services, push the services out to the content

– Easy deployment centrally or locally, either independently or in strategic combinations


Defer implementation decision making until needs and outcomes are clearly articulated

– Requirements are first stated as sets of values and strategies that promote those values

– Strategies are then embodied as abstract services, and, finally, instantiated in technical systems

Object-Centric Values and Strategies

Value | Justification | Strategy
Identity | To distinguish an object from others | Persistent naming, actionable resolution
Viability | To recover an object from its medium | Redundancy, heterogeneity, media refresh
Fixity | To ensure that an object is unchanged from its accepted state | Redundancy, error-correcting codes, message digests
Authenticity | To ensure that an object is what it purports to be | Provenance, cryptographically-secure signatures
Ontology | To understand an object’s significant nature | Syntactic, semantic, and pragmatic characterization
Visibility | To enable users to find objects of interest | Public discovery systems and registries, exposure for web harvest
Utility | To expose an object’s underlying content | Behavior-rich delivery
Portability | To facilitate content sharing and succession planning | Self-contained, self-documenting objects, packaging standards
Appraisement | To understand the consequences of time | Analysis and assessment
Timeliness | To know when a preservation value is threatened | Technology watch, stakeholder engagement
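The fixity strategy in the table names message digests. As a minimal sketch of that idea, assuming SHA-256 as the digest algorithm and the hypothetical helper names `digest` and `is_unchanged` (neither comes from the slides), a fixity check might look like this:

```python
import hashlib


def digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex message digest of a byte string."""
    return hashlib.new(algorithm, data).hexdigest()


def is_unchanged(data: bytes, accepted_digest: str, algorithm: str = "sha256") -> bool:
    """Fixity check: does the content still match its accepted state?"""
    return digest(data, algorithm) == accepted_digest


# Record the digest when the object is accepted, then recheck it later.
accepted = digest(b"formatted bit stream")
still_good = is_unchanged(b"formatted bit stream", accepted)
corrupted = is_unchanged(b"formatted bit strean", accepted)
```

A digest only detects change. Recovering the accepted state still depends on the redundancy strategy listed in the same row.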

Service-Centric Values and Strategies

Value | Justification | Strategy
Availability | To provide access at a time of user choosing | Redundancy, automated failover
Responsivity | To provide appropriate throughput | Redundancy, load-balancing
Security | To enforce appropriate use of services and content | Cryptographically-secure identity and role management
Interoperability | To facilitate creative reuse of content and services | Standard interfaces
Extensibility | To enable graceful evolution over time | Granularity, orthogonality, virtualization
Trustworthiness | To promote users’ sense of predictability and reliability | Transparency, audit, certification
Sustainability | To ensure ongoing access and use | Commodity components, institutional commitment, financial cost-recovery, professional development

Micro-Services

[Diagram: service layers; axis labels: Curation, Preservation, Sufficiency, Necessity]

Value: Interoperation (Publication, Annotation): “Lots of uses keeps stuff valuable”
Service: Application (Ingest, Index, Search, Transformation): “Lots of services keeps stuff useful”
Context: Interpretation (Catalog, Characterization): “Lots of description keeps stuff meaningful”
State: Replication (Identity, Storage, Fixity, Replication): “Lots of copies keeps stuff safe”

Design Process

What are the conceptual entities underlying the service?

What are their state properties?

What are their behaviors?

Storage Service

Storage service

– An aggregation of storage nodes

Storage node

– A particular configuration of object storage

Object

– An aggregation of files over time

Version

– A particular configuration of files at a point in time

File

– A formatted bit stream
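The entities above nest naturally: a service holds nodes, a node holds objects, an object holds versions, and a version holds files. The following is only a sketch of that containment using Python dataclasses. The class and field names, and the `v001`-style version identifiers, are illustrative assumptions rather than the CDL implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class File:
    """A formatted bit stream."""
    identifier: str
    content: bytes


@dataclass
class Version:
    """A particular configuration of files at a point in time."""
    identifier: str
    files: Dict[str, File] = field(default_factory=dict)


@dataclass
class StoredObject:
    """An aggregation of files over time, kept as an ordered list of versions."""
    identifier: str
    versions: List[Version] = field(default_factory=list)

    def add_version(self, files: List[File]) -> Version:
        # Assumption: versions are numbered sequentially as v001, v002, ...
        identifier = f"v{len(self.versions) + 1:03d}"
        version = Version(identifier, {f.identifier: f for f in files})
        self.versions.append(version)
        return version

    def current(self) -> Version:
        """Return the most recently added version."""
        return self.versions[-1]


@dataclass
class StorageNode:
    """A particular configuration of object storage."""
    identifier: str
    objects: Dict[str, StoredObject] = field(default_factory=dict)


@dataclass
class StorageService:
    """An aggregation of storage nodes."""
    nodes: Dict[str, StorageNode] = field(default_factory=dict)


# Usage: one node, one object, two versions.
obj = StoredObject("object")
first = obj.add_version([File("a.txt", b"one")])
second = obj.add_version([File("a.txt", b"one"), File("b.txt", b"two")])
service = StorageService({"node": StorageNode("node", {"object": obj})})
```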

Storage Service Methods

Help [idempotent, safe]

Get-state [idempotent, safe]

Get-node-state [idempotent, safe]

Get-object-state [idempotent, safe]

Get-object [idempotent, unsafe]

Get-version-state [idempotent, safe]

Get-version [idempotent, unsafe]

Get-file-state [idempotent, safe]

Get-file [idempotent, unsafe]

Add-version [non-idempotent, unsafe]
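One practical use of the idempotent and safe annotations is deciding whether a client may repeat a call after a timeout. The sketch below simply records the flags exactly as the slide lists them. The `can_retry` helper and the lowercase method keys are assumptions added for illustration.

```python
from typing import NamedTuple


class MethodProperties(NamedTuple):
    idempotent: bool
    safe: bool


# Flags transcribed from the method list on the slide.
METHODS = {
    "help": MethodProperties(True, True),
    "get-state": MethodProperties(True, True),
    "get-node-state": MethodProperties(True, True),
    "get-object-state": MethodProperties(True, True),
    "get-object": MethodProperties(True, False),
    "get-version-state": MethodProperties(True, True),
    "get-version": MethodProperties(True, False),
    "get-file-state": MethodProperties(True, True),
    "get-file": MethodProperties(True, False),
    "add-version": MethodProperties(False, False),
}


def can_retry(method: str) -> bool:
    """An idempotent call can be repeated without changing the outcome."""
    return METHODS[method].idempotent


retry_get_file = can_retry("get-file")
retry_add_version = can_retry("add-version")
safe_count = sum(properties.safe for properties in METHODS.values())
```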

Storage Service Interfaces

METHOD Get-File-State [idempotent, safe]

Parameter | Type | Required | Description
Node | Identifier | Mandatory | Node identifier
Object | Identifier | Mandatory | Object identifier
Version | Identifier | Optional | Version identifier
File | Identifier | Mandatory | File identifier
ResponseForm | Enum | Optional | Response form

RETURN | Response form | Mandatory | File state

GET /node/object/version/file?m=state HTTP/1.1
Accept: application/json

% store -get node/object/version/file -m state -f JSON

File.getState("node/object/version/file", Format.JSON);
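The three forms above (HTTP, command line, procedural) carry the same four identifiers and the same response form. As a hedged sketch, the HTTP form could be assembled from those parameters as follows. The function name, the `Host` header, and the `store.example.org` host are assumptions, and the slides do not say how an omitted optional version is expressed, so this sketch requires one.

```python
from urllib.parse import quote, urlencode


def file_state_request(host: str, node: str, obj: str, version: str, file: str,
                       media_type: str = "application/json") -> str:
    """Build the text of a Get-File-State request in its HTTP form."""
    # Percent-encode each identifier so that a '/' inside one cannot be
    # mistaken for a path separator.
    path = "/".join(quote(part, safe="") for part in (node, obj, version, file))
    query = urlencode({"m": "state"})
    return (
        f"GET /{path}?{query} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Accept: {media_type}\r\n"
        "\r\n"
    )


request = file_state_request("store.example.org", "node", "object", "version", "file")
request_line = request.splitlines()[0]
```

Because Get-File-State is marked idempotent and safe, mapping it to an HTTP GET is consistent with the usual meaning of that method.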

Technological Change and Invariance

Circa 1989
– FTP
– POSIX
– SQL

Circa 2029?
– HTTP
– URI
– XML

Due to their inherent abstracting nature, protocols and interfaces last longer than systems

Storage Service Implementation

Using the file system as the controlling managerial abstraction, what is the thinnest smear of additional functionality that will make it an effective object store?

– Namaste
– CAN
– Pairtree
– Dflat
– ReDD

Name As Text (Namaste) Tags

Directory-level signature files extending Dublin Core Kernel metadata

– [ Tag h0 ] 0=name_version

– Who h1 1=who

– What h2 2=what

– When h3 3=when

– Where h4 4=where
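Because the tag value is carried in the file name itself, a directory listing is enough to read the signature. The sketch below assumes the `number=value` naming shown above. The helper names are hypothetical, the file body (here a copy of the value) is an assumption the slides do not confirm, and no escaping of characters that are illegal in file names is attempted.

```python
import tempfile
from pathlib import Path

# Element numbers follow the slide: 0 tag, 1 who, 2 what, 3 when, 4 where.
ELEMENT_NUMBERS = {"tag": 0, "who": 1, "what": 2, "when": 3, "where": 4}


def write_tag(directory: Path, element: str, value: str) -> Path:
    """Create a signature file named '<number>=<value>' in the directory."""
    path = directory / f"{ELEMENT_NUMBERS[element]}={value}"
    # Assumption: the file body repeats the value.
    path.write_text(value + "\n", encoding="utf-8")
    return path


def read_tags(directory: Path) -> dict:
    """Recover element numbers and values from the file names alone."""
    tags = {}
    for entry in directory.iterdir():
        number, separator, value = entry.name.partition("=")
        if entry.is_file() and separator and number.isdigit():
            tags[int(number)] = value
    return tags


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_tag(root, "tag", "dflat_0.11")
    write_tag(root, "who", "CDL")
    tags = read_tags(root)
```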

Content Access Node (CAN)

File system conventions (structure and reserved names) for an object store

can/
  0=can_0.2
  can-info.txt
  log/
  store/
    pairtree...

Pairtree

Use a bigram decomposition of an object’s identifier to determine its file system path

pairtree/
  0=pairtree_0.1
  pairtree-info.txt
  pairtree_root/
    id/
      en/
        ti/
          fi/
            er/
              dflat...
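The bigram decomposition can be sketched in a few lines. This is a simplified illustration: the function name is an assumption, and it omits the character escaping that a full Pairtree implementation applies to identifiers before splitting them.

```python
def pairtree_path(identifier: str) -> str:
    """Split an identifier into two-character segments joined as a path."""
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join(pairs)


path = pairtree_path("identifier")
odd_path = pairtree_path("abcde")
```

An identifier of odd length ends in a one-character segment. Keeping every directory name short keeps the fan-out at each level bounded, which is what lets an ordinary file system serve as the index.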

Dflat

A “digital flat” for object data and metadata

dflat/
  0=dflat_0.11
  dflat-info.txt
  v001/
    d-manifest.txt
    delta/
      redd...
  v002/
    f-manifest.txt
    full/
      data/
      metadata/
      enrichment/
      annotation/

Reverse Delta Directory (ReDD)

File-level reverse delta compression

redd/
  0=redd_0.1
  add/
  delete.txt
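A reverse delta stores the newest version in full and, for each older version, only what is needed to step back to it. The sketch below is one plausible reading of the `add/` and `delete.txt` names, not the ReDD specification: `add` holds files to restore and `delete` lists files to remove. Versions are modeled as dictionaries from file name to content, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

Files = Dict[str, bytes]


def reverse_delta(previous: Files, current: Files) -> Tuple[Files, List[str]]:
    """Compute what is needed to turn the current version back into the previous one."""
    # Files that the previous version had and that are missing or different now.
    add = {name: data for name, data in previous.items() if current.get(name) != data}
    # Files that exist now but did not exist in the previous version.
    delete = sorted(name for name in current if name not in previous)
    return add, delete


def reconstruct_previous(current: Files, add: Files, delete: List[str]) -> Files:
    """Apply a reverse delta to the current version."""
    restored = {name: data for name, data in current.items() if name not in delete}
    restored.update(add)
    return restored


v1 = {"a.txt": b"one", "b.txt": b"two"}
v2 = {"a.txt": b"one", "b.txt": b"TWO", "c.txt": b"three"}
add, delete = reverse_delta(v1, v2)
restored = reconstruct_previous(v2, add, delete)
```

Under this reading the unchanged file `a.txt` is stored only once, in the full copy of the latest version.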

Performance Scaling

Modern file systems, e.g. ZFS, exhibit good performance characteristics at reasonable scale

[Figures: Average Copy Time, Average MkDir Time, and Traverse Time charts]

2,272,000 files = 28.5 TB

127,058,820 files = 25.7 TB

Status

We are completing development of the foundational Storage and Identity services

– Identity is based on N2T (name-to-thing) and Noid systems

We are planning for the Ingest, Catalog, and Characterization services

– Characterization is based on JHOVE2

As these services become available they will be deployed centrally and locally on campuses


Today’s Web is History’s Source Material

The web is indispensable to science, commerce, education, entertainment, and culture

Yet, it is highly volatile

UC faculty and researchers have their own web publications

Libraries and archives wish to preserve important websites

How can we secure this valuable content into the future?

Web Archiving Service (WAS)

Provides open source tools for curators to select and preserve content from the free web

Allows curators to define the scope of a collection, set the crawl frequency, and work collaboratively

Content is saved in “projects,” grouped by common subject matter or publisher

Crawl operation in WAS

WAS Public Access

Starting in July, curators will be able to provide public access to their projects

Rights based on recommendations of Section 108 Study Group

– 6-month embargo
– Opportunities for content owner to opt out

Libraries will add links in their online catalogs to documents and websites

Advantages: curated collections, persistent access and URLs, full-text searching

WAS Partners

Library of Congress: grant funding for development

– UC campuses, University of North Texas, and others

Internet Archive: software and experience

– Heritrix crawler, Wayback display, Nutch indexing

National Library of France: standards and leadership

– IWAW international web archiving workshop

– IIPC (national libraries consortium) commitment


CDL Curation Collaborations

DataOne

– NSF-funded project to preserve distributed scientific data and develop infrastructure for distributed scientific research on global change

– University of New Mexico, UC Santa Barbara

Media Vault Program

– UC Berkeley

Historical Newspapers

– UC Riverside


Trusted Digital Repositories

Trusted Repositories Audit and Certification (TRAC)

– Criteria for evaluating repository trustworthiness

– Developed by RLG, OCLC, NARA, CRL

– Based on Open Archival Information System (OAIS) reference model (ISO 14721)

TRAC

Basic approach

– TRAC checklist provides framework
– Organization documents planning and policies

Allows organizations to self-audit and identify gaps

Allows other organizations to perform external audit

TRAC

Three sections

– Organization
– Digital Object Management
– Technologies

Audits Aren’t Perfect

Total Transparency Is Not Possible

Budgets, personnel issues

NDAs, competitive environment

Computer security, firewalls

Burden of documentation and maintenance

Trust but Verify


Process requires both trust and willingness to question assumptions

For process to work, the underlying motivation must be a desire to improve service…

– Resulting in greater transparency
– Leading to trust between repositories and clients

Summary

Safety through redundancy

Meaning through description

Utility through service

Value through use

Code to interfaces

Orthogonality, but interoperability

Composition, not addition

Bring services to content, not content to services

Questions?

http://www.cdlib.org/inside/diglib/

[email protected]@ucop.edu
