GRANT AGREEMENT: 601138 | SCHEME FP7 ICT 2011.4.3 Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics [Digital Preservation]
“This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 601138.”
Policy Management & Ontology Supported Preservation
Jean-Yves Vion-Dury (Xerox/PERICLES)
Justin Simpson (Artefactual)
Stratos Kontopoulos (CERTH/PERICLES)
Joel Simpson (Artefactual)
@PericlesFP7 #PERIconf2016
How do preservation policies evolve, and how are they managed over time?
▶ Institutions’ preservation practices (and policies) evolve as the institution learns from its own work or from knowledge and best practices used by the community.
▶ Before implementing changes, it would be helpful to understand the implications or potential impact on the current repository of digital objects (DOs).
▶ We will explore changing needs for email preservation.
Objectives
Who we are:
About Artefactual & Archivematica
Why we are here:
▶ Archivematica already uses policies
▶ We believe that there are many opportunities to improve how policies are used
▶ We are excited by the potential of leveraging technology and learning from the PERICLES project
Artefactual (the company) is the organisational home for two open source projects: Archivematica (a preservation platform) and a digital repository
Archivematica Format Policies
[Diagram: a Rule is applied to a Format, is for a particular preservation Purpose, and executes a Command using a Tool]
➔ Over 850 file formats defined with suggested assessment for preservation & access purposes
➔ Over 1,000 predefined rules provided
➔ 39 predefined commands provided
➔ 18 different tools available to be used
▶Format policies are simple rules applied to digital objects of a particular format, for a particular preservation purpose
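The relationship between rules, formats, purposes, commands and tools can be sketched in a few lines of Python. This is a minimal illustration only, not the actual FPR schema: the class fields, the tool names and the format identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str  # hypothetical tool name, not a real FPR entry


@dataclass(frozen=True)
class Command:
    description: str
    tool: Tool  # a command is run using a tool


@dataclass(frozen=True)
class Rule:
    format: str       # the format the rule is applied to
    purpose: str      # e.g. "preservation" or "access"
    command: Command  # the command the rule executes


def find_rule(rules, fmt, purpose):
    """Return the first rule matching a format and purpose, or None."""
    for rule in rules:
        if rule.format == fmt and rule.purpose == purpose:
            return rule
    return None


# Illustrative data only (assumed names).
video_tool = Tool("example-video-tool")
image_tool = Tool("example-image-tool")

RULES = [
    Rule("video/quicktime", "preservation", Command("Transcode to MKV", video_tool)),
    Rule("image/tiff", "access", Command("Convert to JPEG", image_tool)),
]

rule = find_rule(RULES, "image/tiff", "access")
if rule is not None:
    print(f"{rule.command.description} using {rule.command.tool.name}")
```

Keeping the rule separate from the command means a policy decision (which format, for which purpose) can change without touching how the tool is invoked.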
How the FPR Works Today
▶The Format Policy Registry (FPR) is a significant body of knowledge derived from Artefactual and our users
[Diagram: FPR Server; Local FPR Database; Workflow Engine; Storage Services; Web based User Interface (Preservation Planning, Workflow Dashboards, Access & Admin)]
Artefactual maintains the current knowledge base of formats, rules & commands.
When Archivematica is installed, the latest rules are downloaded so users can start to use them immediately.
The Preservation Planning user interface provides an easy way to manage all of the rules, commands, etc.
The rules are executed by the workflow engine.
Users can review status and perform manual steps using the workflow dashboards.
Policies: Current Benefits
We believe the focus on policies in Archivematica today provides a number of benefits:
▶ Simplification: separating rules (policies) from workflow makes both easier to configure and manage
▶Understandability: abstracting policies from technical implementation enables non-technical users to interact more directly with the system
▶Shareability: enables some level of sharing best practices across the community
Policies: Potential Improvements
We think the PERICLES approach may help us improve upon our existing focus on policies:
▶ Simplification: many important preservation decisions are still deeply embedded in technical implementation
▶ Understandability: using well defined vocabularies & languages (ontologies) to define policy will make it easier to be precise and eliminate ambiguity
▶Shareability: using common standards will make it easier to share policy within a community
Policies: New Benefits
There are a few benefits that may be achieved through the PERICLES approach to policies:
▶ Impact analysis: ability to determine the impact of a change in policy before it is made
▶ Reasoning / change management: once impact analysis is automated, it is possible to automate the management (resolution) of impacts
▶Validation: we can attach ad hoc validation processes (tests)
▶Reuse: making use of existing ontological knowledge bases on formats and preservation policies in general
PERICLES Model-driven Preservation
▶Abstraction of complex systems as models that can be manipulated independently
Model-driven Preservation
Models
Digital ecosystem
◦ Analogy with biological systems
◦ Evolving systems of interdependent entities
Capture and representation of the environment
▶ Understand the wider context around digital objects that impacts their long-term reuse
Continuous change and reuse
Continuum approach
▶ Merging of active-life and archival phases
▶ Non-custodial
“... a formal, explicit specification of a shared conceptualization...” [Studer et al., 1998]
Upper ontology: A model of the common objects that are generally applicable across multiple knowledge domains.
Domain ontology: A model of concepts that belong to a specific domain or part of the world.
What is an Ontology?
formal: machine readable with computational semantics
explicit: unambiguous definition of concepts, properties, functions and axioms
shared: commonly accepted, consensual knowledge
conceptualization: abstract, simplified model of a domain
[Studer et al., 1998] Studer, R., Benjamins, V.R. and Fensel, D. (1998), Knowledge engineering: Principles and methods. Data & Knowledge Engineering, Elsevier Ltd, Vol. 25, Issues 1-2, pp. 161-197
◦ Classes (concepts): superclass/subclass relationship
◦ Properties (relationships): Subject → Predicate → Object
◦ Axioms, restrictions and constraints
◦ Individuals (instances)
OWL - the Web Ontology Language
Key Notions
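The key notions above can be illustrated without any ontology tooling. The following is a rough sketch in plain Python, not OWL itself: triples are tuples, and the class names, property names and individuals are assumptions made up for the example. It shows how a subclass hierarchy lets a simple reasoner infer additional types for an individual.

```python
# Each statement is a (subject, predicate, object) triple.
# All names below are hypothetical and chosen only for illustration.
TRIPLES = {
    ("Email", "subClassOf", "DigitalObject"),
    ("Attachment", "subClassOf", "DigitalObject"),
    ("DigitalObject", "subClassOf", "Resource"),
    ("msg-001", "type", "Email"),           # an individual (instance)
    ("att-001", "type", "Attachment"),
    ("att-001", "attachedTo", "msg-001"),   # a property between individuals
}


def superclasses(cls, triples):
    """Return all direct and inherited superclasses of a class."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def types_of(individual, triples):
    """Return the asserted types of an individual plus the inferred ones."""
    direct = {o for s, p, o in triples if s == individual and p == "type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred


print(sorted(types_of("msg-001", TRIPLES)))
```

A real OWL reasoner also handles axioms, restrictions and constraints, which this sketch deliberately leaves out.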
PERICLES Models
▶ LRM: ontology for modelling linked resources
▶ DEM: formalism for digital ecosystems
▶ Domain ontologies
Aims
▶ Model digital objects, dependencies between them, temporal evolution
▶ Maximise interoperability with other ontologies (existing or future)
▶ Interoperate with environment information and digital ecosystem models
Linked Resource Model (LRM)
▶ Relation between change and dependency
▶ Understanding dependencies between digital objects and resources within their environment is the key to assessing and managing change
▶Given objects A and B, A is dependent on B if changes to B have a significant impact on the state of A, or if changes to B can impact the ability to perform function X on A.
Dependency and Change
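The definition of dependency above suggests a simple way to estimate what a change may touch: walk the dependency relation backwards from the changed resource. The sketch below assumes that impact propagates transitively, which the slides do not state, and uses made-up resource names.

```python
from collections import defaultdict, deque


def impacted_by(changed, depends_on):
    """Return every resource that directly or transitively depends on `changed`.

    `depends_on` maps a resource to the set of resources it depends on.
    """
    # Invert the mapping: for each resource, who depends on it?
    dependents = defaultdict(set)
    for resource, targets in depends_on.items():
        for target in targets:
            dependents[target].add(resource)

    impacted = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


# Hypothetical ecosystem: A depends on B is written as A -> {B}.
DEPENDS_ON = {
    "video_artwork": {"codec", "player"},
    "player": {"operating_system"},
    "catalogue_record": {"video_artwork"},
}

print(sorted(impacted_by("operating_system", DEPENDS_ON)))
```

In practice, whether an impact is "significant" would be decided per dependency rather than assumed for every edge.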
Dependency: the association, relation or interaction among two or more Resources
Plan: presents a set of actions/steps to be executed by an Agent
precondition and impactDescription:
intention: the intended usage of a Resource
specification: the context of the Dependency itself
LRM Dependency
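One way to picture an LRM-style dependency is as a record that carries its intention and specification together with a precondition and an impact. The sketch below is an interpretation, not the LRM ontology: the field names, the use of a Python callable as the precondition and the example values are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Dependency:
    source: str         # resource whose change may activate the dependency
    target: str         # resource that is affected
    intention: str      # the intended usage of the resource
    specification: str  # the context of the dependency itself
    precondition: Callable[[dict], bool]  # assumed: a test on the change event
    impact: str         # assumed: a description of what should then happen


def triggered(dependencies, change):
    """Return impact descriptions for dependencies activated by a change."""
    return [
        d.impact
        for d in dependencies
        if d.source == change["resource"] and d.precondition(change)
    ]


# Hypothetical example data.
DEPS = [
    Dependency(
        source="player",
        target="video_artwork",
        intention="display the artwork",
        specification="playback needs a compatible player version",
        precondition=lambda change: change["kind"] == "version_upgrade",
        impact="re-test playback of video_artwork",
    ),
]

print(triggered(DEPS, {"resource": "player", "kind": "version_upgrade"}))
```

Separating the precondition from the impact lets the same dependency stay silent for changes that do not matter, such as moving the player to another server.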
▶ Digital Ecosystem represents the surrounding environment of a digital object that impacts reuse
▶ Digital ecosystem can include data objects, software, user communities, processes, technical services and their dependencies
▶ Scope depends on the particular use case
Digital Ecosystem Model (DEM)
Domain Ontologies
▶ Modelling digital preservation (DP) related risks in
◦ Digital Video Art (DVA)
◦ Software-based Art (SBA)
◦ Born-digital Archives (BDA)
▶ Help curators model, project & tackle risks throughout the DP process
▶ Extensible for future adopters
▶ Ontology reuse: LRM, DEM, CIDOC-CRM, CRMdig, DC
Key Constructs
▶ Dependencies
◦ HW dependencies: HW requirements for a resource to function properly
◦ SW dependencies: Dependency of a resource or activity on specific SW
◦ Data dependencies: Requirement of knowledge/data/information
◦ Further dependency specialization via intentions & specifications
▶ Activities: Temporal entities representing actions intentionally carried out by actors that generate changes
◦ E.g. creation, acquisition, display etc.
▶ Agents: Resources that may bring change to another resource or participate in an activity
◦ Further specialized in Human & SW Agents
DVA Domain Ontology
HW Dependency (DVA)
PERICLES Design Pattern for Policies
http://ontologydesignpatterns.org/wiki/Submissions:Policy
Email Scenario
▶ Real world examples of email preservation policies.
▶ Changes proposed based on lessons learned.
▶ “Historic” set of processes & policies and “future” set.
▶ Explore how PERICLES can help understand the implications of moving from the historic to the future policies before enacting those changes.
▶ Explore benefits of digital ecosystem approach compared to existing approaches.
Workshop Approach
Simplified Email Process
[Diagram phases: Transfer Email from Source System; Process Submission Information Package (SIP); Process Archival Package (AIP); Process Dissemination Package (DIP)]
[Diagram steps, in the order shown: Export Email Data; Pre-accession review; Transfer (to preservation platform); Virus Scan; Fixity Check; Extract Attachments; Identify & Validate Format; Clean File Names; Normalize Emails; Create AIP; Add Rights Metadata; Identify Sensitive Information; Create DIP]
Historic Email Policies
Process Step: Policy
▶ Export Email Data: Preferred source format for email is maildir (where possible)
▶ Normalize Emails: Preferred preservation format for email is maildir
▶ Extract Attachments: Attachments must be extracted from source emails & stored as discrete objects; converted attachments should retain links to emails to which they were attached
Changing Email Policies
1) Historic Policy: Preferred source format for email is maildir (where possible)
   Future (changed) Policy: Preferred source format for email is IMAP protocol (second preference is mbox)
2) Historic Policy: Preferred preservation format for email is maildir
   Future (changed) Policy: Preferred preservation format for email is mbox
3) Historic Policy: Attachments must be extracted from source emails & stored as discrete objects
   Future (changed) Policy: Attachments must be extracted from source emails for archival processing (but need not be retained as discrete objects)
4) Historic Policy: Converted attachments should retain links to emails to which they were attached
   Future (changed) Policy: No change to policy per se, but implementation will change from using Archivematica UUIDs to native email UUIDs
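The historic and future policy sets can be compared mechanically once they are written down as data. The following sketch encodes the four policies as plain Python dictionaries and lists what changed; the key names and value spellings are assumptions made for illustration and are not taken from Archivematica or PERICLES.

```python
# Hypothetical encoding of the four email policies as key/value pairs.
HISTORIC = {
    "source_format": "maildir",
    "preservation_format": "maildir",
    "discrete_attachment_storage_required": True,
    "attachment_link_identifier": "archivematica_uuid",
}

FUTURE = {
    "source_format": "imap",  # second preference is mbox
    "preservation_format": "mbox",
    "discrete_attachment_storage_required": False,
    "attachment_link_identifier": "native_email_uuid",
}


def diff_policies(old, new):
    """Map each changed policy key to its (old, new) pair of values."""
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


for key, (before, after) in diff_policies(HISTORIC, FUTURE).items():
    print(f"{key}: {before} -> {after}")
```

Policy 4 shows up as a change here only because the identifier used by the implementation is recorded as data; the policy statement itself is unchanged.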
New Email Policies
These policies did not have a precedent or equivalent in our historic process.
5) Email accounts should be characterized, and metadata to be extracted / described should include: total number of attachments, size of mailbox, first email sent date, first email received date, last email sent date, etc.
6) Digital signatures provided within any email should be verified
Modelling Policies in PERICLES
Impact Analysis of Email Changes
How will PERICLES help us identify the impacts of the policy changes?
1. Identify all objects where source format was Maildir (in practice we may not care to attempt to extract source data again)
2. Identify all objects preserved in Maildir format so that we know how many should be re-normalized into Mbox format
3. Identify all extracted attachments we no longer need to store
4. Identify all attachments that need a new reference (the native email UUID instead of the previously generated Archivematica UUID)
5. Identify all email accounts that should be characterized
6. Identify all emails with digital signatures that should be verified
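The six questions above are essentially queries over the repository's holdings. As a rough illustration, the sketch below runs them over a toy in-memory inventory in plain Python; it does not use PERICLES, and the record fields, identifiers and account names are invented for the example.

```python
# Toy inventory; every field name and value here is hypothetical.
OBJECTS = [
    {"id": "e1", "kind": "email", "account": "alice@example.org",
     "source_format": "maildir", "preservation_format": "maildir", "signed": True},
    {"id": "e2", "kind": "email", "account": "bob@example.org",
     "source_format": "mbox", "preservation_format": "mbox", "signed": False},
    {"id": "a1", "kind": "attachment", "stored_discretely": True,
     "link_id_type": "archivematica_uuid"},
]


def impact_report(objects):
    """Answer the six impact questions for a list of object records."""
    emails = [o for o in objects if o["kind"] == "email"]
    attachments = [o for o in objects if o["kind"] == "attachment"]
    return {
        "source_was_maildir":
            [o["id"] for o in emails if o["source_format"] == "maildir"],
        "renormalize_to_mbox":
            [o["id"] for o in emails if o["preservation_format"] == "maildir"],
        "attachments_no_longer_stored":
            [o["id"] for o in attachments if o["stored_discretely"]],
        "attachments_needing_new_reference":
            [o["id"] for o in attachments
             if o["link_id_type"] == "archivematica_uuid"],
        "accounts_to_characterize":
            sorted({o["account"] for o in emails}),
        "signatures_to_verify":
            [o["id"] for o in emails if o["signed"]],
    }


for question, ids in impact_report(OBJECTS).items():
    print(f"{question}: {ids}")
```

The point of a model-driven approach would be to derive such queries from the policy change itself rather than writing each one by hand.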
A Sample Policy in PERICLES
▶ How do preservation policies evolve, and how are they managed over time?
▶ Organizations seeking ways to improve how policies are used
▶ PERICLES model-driven preservation approach
▶ The email policy preservation scenario
Conclusions
Thank You!