DHARMa Project Final Report - Bodleian Libraries...

DHARMa was a one-year project

addressing effective digital research

data preservation in the Humanities. It

was funded by the John Fell Fund and

managed by the Bodleian Digital

Library.

DHARMa Project Final Report November 2014

J. McKnight; C. Madsen; J. Prag

Table of Contents

Introduction
  The projects
  Context
  The problem of preservation
  Strategic positioning
  Focus on the Humanities
Assessment of existing tools and services
  ORA-Data
    Depositing data
    Current and future work
  Other software
    DataStage
    ORDS (Online Research Database Service)
    BEAM Web Deposit
    External repositories
    3rd-party software
  Service providers
    Research Data Management single point of contact
    Research Services / Research Accounts (UAS)
    Bodleian Libraries Digital Library Systems and Services (BDLSS)
    Academic IT Research Support team (IT Services)
    Infodev (IT Services)
    IT Learning Programme (IT Services)
    Subject Librarians
    BEAM (Bodleian Electronic Archives and Manuscripts)
    Faculty and Department ITSS
Methods & Activities
  Semi-structured interviews with researchers
  Conversations with IT Support Staff
  Data acquisition
  Data ingest and metadata creation
Case studies: research projects
  Sphakia Survey
    Overview
    Scope of research
    Research materials
Common findings
  Finding 1: Data v interfaces
  Finding 2: Funding gaps for sustainability and preservation
  Finding 3: Lack of reuse and evolution
  Finding 4: Advice, training, and mentoring
  Finding 5: Repository policy on data and formats
  Finding 6: ORA-Data API
Results
  Datasets in ORA-Data
  Improved digital preservation guidelines
Conclusions
  Recommendations
    1. Preservation and sustainability
    2. Funding gaps for sustainability and preservation
    3. Reuse and evolution
    4. Advice, training, and mentoring
    5. Repository policy on data and formats
    6. ORA-Data API
  Next steps
    1. Information
    2. Innovation
    3. Investment



Introduction

DHARMa (Digital Humanities Archives for Research Materials) was a one-year project (August

2013 to August 2014), funded by the John Fell Fund. The project worked closely with 13 self-

selected Digital Humanities projects to investigate the nature of the research data they are using

and creating; to understand the preservation requirements arising from these; to pilot data

acquisition and ingest in ORA-Data; and to use the findings to inform the central provision of

data preservation services.

The projects

● Ashmolean Cyprus Digitisation Project (Anja Ulbrich)

● Centre for the Study of the Cantigas de Santa Maria (Stephen Parkinson)

● Creative Practice in Contemporary Concert Music (Eric Clarke / Mark Doffman)

● Dictionary of Medieval Latin from British Sources (Tobias Reinhardt / Richard

Ashdowne)

● Digital Miscellanies Index (Abigail Williams)

● Early Modern Festival Books (Helen Watanabe-O’Kelly)

● First World War Poetry Digital Archive & The Great War Archive (Stuart Lee / Katharine

Lindsay)

● Inscriptions of Sicily (Jonathan Prag)

● Last Statues of Antiquity (Bryan Ward-Perkins & Bert Smith)

● Lexicon of Greek Personal Names

● Oxford Archive of Russian Life History (Catriona Kelly)

● Oxford Roman Economy Project (Alan Bowman)

● Sphakia Survey (Lucia Nixon)

Context

Since the early 1970s Oxford has been at the cutting edge of the Digital Humanities (broadly

defined as the application of advanced digital technologies to humanities research). The

Digital.Humanities@Oxford website currently lists over 200 projects in this area, in which Oxford

has secured more grant awards than any other UK institution (see

http://digital.humanities.ox.ac.uk/SubjectAreas/subject_areas.aspx). The projects involve

leading academics across the Humanities disciplines (and beyond, in increasingly

interdisciplinary collaborations) as well as staff in the University’s IT Services, the Oxford e-

Research Centre (OeRC), the Bodleian Libraries, Museums, and Colleges. The data produced

by these projects constitutes a major corpus of research resources with the power to transform

the work of humanities scholars.

http://digital.humanities.ox.ac.uk/ProjectProfile/Project_page.aspx?pid=333

http://csm.mml.ox.ac.uk/

http://www.cmpcp.ac.uk/cpiccm.html

http://www.dmlbs.ox.ac.uk/

http://digitalmiscellaniesindex.org/

http://festivals.mml.ox.ac.uk/

http://www.oucs.ox.ac.uk/ww1lit/

http://www.oucs.ox.ac.uk/ww1lit/gwa/

http://www.ocla.ox.ac.uk/statues/

http://www.lgpn.ox.ac.uk/

http://www.ehrc.ox.ac.uk/html/ehrc/lifehistory/archive.htm

http://oxrep.classics.ox.ac.uk/

http://sphakia.classics.ox.ac.uk/




The problem of preservation

Digital preservation is the active management of digital content over time to ensure ongoing

access.¹ Without appropriate preservation, the digital research data being created within

humanities projects may be lost to the research community. Digital preservation is a particular

issue in the humanities, where the scholarly benefits of research outputs necessarily accrue

over the medium- to long-term. Grant-giving bodies accordingly require firm guarantees for the

sustainability and preservation of the resources that they fund. The AHRC, for example, has

recently reviewed and strengthened its demands in this respect: it now includes a requirement

for applicants producing digital resources to complete a Technical Plan specifying exactly how,

and for how long, data outputs will be sustained and preserved. Other funders are making

similar requirements, with attention also turning toward ensuring maximum ‘value for money.’²

Preservation involves resource commitments beyond the end of any project’s funding period,

and in the humanities these costs are not normally covered by grant-awarding bodies. It is

therefore essential to consider a strategic and wide-ranging approach to how these costs can be

minimised, and the limited resources used in the most effective way possible.

The decentralised nature of Oxford makes preservation a particular challenge compared to

other institutions. In Oxford, some centralised services for research data creation, management,

and preservation are under development, but digital resources may be built and/or hosted by

any number of organisations or individuals including the Bodleian, IT Services, colleges, and

departments, with the latter two particularly struggling to meet long-term preservation goals. To

organise preservation services efficiently – and demonstrate this to funders and donors – will be

a significant strength in Oxford’s capacity to bid for funding for digital projects. However, it

cannot be stressed enough that this is not simply a financial consideration, but rather a

fundamental necessity for the realisation of the potential of digital research in the humanities.

Without efficient and effective preservation, the future of humanities research is at risk. Libraries

and archives are filled with humanities research data of the past, but the raw materials of

research of the future are very much endangered.

Strategic positioning

With more and more digital data being created, and funders increasing their requirements for

effective preservation, Oxford is already moving towards meeting the growing challenges of

research data management (RDM) and preservation. Several strategy documents recognise

RDM as vitally important:

1 See http://www.digitalpreservation.gov/about/

2 EPSRC funding from May 2015 will require researchers to show that data will be available for 10 years from last access.



● The University of Oxford Strategic Plan 2013-18³ names two key priorities, “global reach”

and “networking, communication, and interdisciplinarity”, both of which rely on effective

management and dissemination of research data (the section on IT Infrastructure further

develops this connection).

● One of the key objectives of the IT Strategic Plan⁴ is “To provide the infrastructure and

tools to allow researchers to be compliant with regulatory requirements to preserve and

share electronic research outputs.”

● The Bodleian Libraries’ Strategic Plan emphasises the importance of digital initiatives,

and a more detailed digital strategy (‘The Digital Shift’) is being developed⁵ which

stresses that preservation must be the bedrock of any successful digital strategy.

● Oxford has released a Policy on the Management of Research Data and Records⁶ which

highlights the value of research data “for research, teaching, and for wider exploitation

for the public good” as well as acknowledging the need for compliance with funder

requirements.

Against a background of strategy and policy creation, the infrastructure is being developed to

meet the growing requirements for research data management and preservation; however, this

work is not sufficiently resourced, so development has been incremental and reactive.

Focus on the Humanities

This project focused specifically on the humanities partly because of a perception that the

discourse around research data management and preservation was strongly focused on the

needs of the sciences and social sciences, and that the humanities may have different

requirements which were in danger of being sidelined. In practice, the project has emphasised

that while there are common principles and practices which apply to all research materials

across the disciplines, there are also certain characteristics of humanities data which should be

taken into account when developing strategies for supporting and facilitating its preservation

and management. None of these issues, of course, are found exclusively in humanities

research; the social sciences may be regarded as the ‘nearest neighbour’ of the humanities in

these respects, and shared strategies should be explored, but nonetheless these are strong

trends within the humanities disciplines:

3 http://www.ox.ac.uk/media/global/wwwoxacuk/localsites/gazette/documents/supplements2012-13/University_of_Oxford_Strategic_Plan_2013-2018_(1)_to_No_5025.pdf

4 http://www.it.ox.ac.uk/about/itstrategy/itstrategicplan/

5 http://www.bodleian.ox.ac.uk/bodley/news/2014-mar-3

6 http://researchdata.ox.ac.uk/files/2014/01/Policy_on_the_Management_of_Research_Data_and_Records.pdf



● data is often captured from other sources rather than being created from scratch, and

may be derived from a working subset of a larger corpus; both of these things may have

ramifications for data conversion, determining copyright, and recording provenance.

● data is often qualitative, textual, and/or narrative, which can make it harder to

standardise and structure than purely quantitative data

● research outputs often comprise heterogeneous collections of data types, making it even

more important to a) standardise formats as far as possible and b) include clear and

consistent metadata for describing the relationships between components of datasets

● data which has been collected is often recombined and restructured to generate new

patterns and allow new questions to be posed; it is often the complex structures and

interrelations between the datasets which provide the value, so it is vital that these are

preserved meaningfully

Humanities research projects also have commonalities of structure which will have implications

for data preservation strategies:

● as mentioned above, the value of humanities research tends to lie in the medium- to

long-term; understanding and documenting the context of the original data

capture/creation as early as possible is extremely important if its full value is to be

realised, as this information may be impossible to reconstruct by the time the data

comes to be used

● projects are often long-running (the longest studied in this project celebrated its

hundredth year in 2014), meaning that the technology used may well change several

times throughout the life of the project. If the project responds to this change then the

resulting data may have been through several conversions by the time it comes to be

ingested for preservation; if it does not, the resulting data may be in obsolete formats

which are difficult to understand and convert.

Another consideration is that there are very few national or subject-specific repositories

available for humanities data; one notable exception is the Archaeology Data Service

(http://archaeologydataservice.ac.uk/), and some humanities research materials may be eligible

for the UK Data Archive (http://www.data-archive.ac.uk/). Given this, Oxford’s institutional

repository services may be of much higher significance to researchers in the humanities than to

those in many of the sciences. A more detailed analysis of differences between research

practices across divisions has been conducted by the DaMaRo project

(http://damaro.oucs.ox.ac.uk/outputs.xml).




Assessment of existing tools and services

We investigated and tested (where possible) all existing tools and services that seemed to be

relevant to the processes of research data management and ingest for preservation. Some of

these are described below, but a full Glossary of RDM/Preservation Projects and Services is

also provided in Appendix 2.

ORA-Data

http://databank.ox.ac.uk/

ORA-Data is the institutional research data repository being developed by the Bodleian

Libraries. Formerly known as DataBank, it has been rebranded to show more explicitly that it is

part of the Oxford University Research Archive (ORA), which to date has contained only articles

and other ‘book-like’ research outputs. ORA-Data is live and contains data, but is still in a pilot

phase. It should be noted that all technical development of data services so far has been on

‘soft’ externally funded projects. Current work (summer 2014) is being supported by a

combination of the Bodleian Libraries and RCUK OA Project (where there is overlap with

publications). There has been a small amount of funding (1 FTE) for staff to support a

mainstream service.

ORA-Data is suitable for research data in any discipline and can accept data in any format for

deposit. Each dataset has a metadata record describing the dataset, and Digital Object

Identifiers (DOIs) are assigned to all data packages deposited in ORA-Data.⁷ Datasets can

either be made publicly available or embargoed for as long as necessary. ORA-Data’s other key

purpose is its role as a data catalogue for the University. The aim is to hold and display records

of data held in a location other than ORA-Data, be that within Oxford or in an external subject or

other data archive.

Depositing data

Deposit can currently only be undertaken by Bodleian staff, although an online deposit interface

is in development and is due to be released in Autumn 2014 (see below). Broadly speaking,

data can currently be ingested into ORA-Data in two ways:

7 A digital object identifier (DOI) is a string of characters which uniquely identifies an object (e.g. a dataset or a document). Metadata about the object is stored in association with the DOI and may include a location (e.g. a URL) where the object can be found; the DOI remains stable over the object’s lifetime, while the location and other metadata may change.



1. Manual, via web interface

A data package is created within a silo:

At this stage a minimal set of default metadata is created, following the DataCite⁸ core

(identifier, mediator, licence, embargo, rights, version, publisher, creation date). More metadata

can be added manually to the RDF manifest:

8 http://www.datacite.org/



and a file or files can be uploaded (this can include a zip file, which will then automatically be

unzipped in the package):

There are a number of problems and limitations with this method as it stands:

● manual metadata input increases the likelihood of error and inconsistency

● many users (including ITSS) will not have the resources to construct RDF by hand

● it is not possible to replace or delete metadata

● it is not possible to add separate metadata for individual files within a dataset (and

manually creating a separate package for each component is impractical for all but the

smallest datasets)

● the interface is not user-friendly; very little help or documentation is provided

However, these issues are being addressed (see Current and future work, below).
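Generating the manifest from a small set of named fields, rather than writing RDF by hand, is one way to reduce the errors and inconsistency noted above. The following is a minimal sketch in Python and not the method used by ORA-Data: the function `build_manifest`, the mapping of the default fields to Dublin Core terms, and the `http://example.org/` namespace used for embargo and version are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"
# Hypothetical namespace for fields with no obvious Dublin Core term (assumption).
LOCAL_NS = "http://example.org/ora-data/terms#"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dcterms", DCTERMS_NS)
ET.register_namespace("local", LOCAL_NS)

# Illustrative mapping of the default fields to (namespace, property) pairs.
FIELD_MAP = {
    "identifier": (DCTERMS_NS, "identifier"),
    "mediator": (DCTERMS_NS, "mediator"),
    "licence": (DCTERMS_NS, "license"),
    "rights": (DCTERMS_NS, "rights"),
    "publisher": (DCTERMS_NS, "publisher"),
    "creation_date": (DCTERMS_NS, "created"),
    "embargo": (LOCAL_NS, "embargoedUntil"),
    "version": (LOCAL_NS, "version"),
}


def build_manifest(package_uri, fields):
    """Return an RDF/XML string describing one data package."""
    unknown = set(fields) - set(FIELD_MAP)
    if unknown:
        # Reject misspelt field names instead of silently writing them out.
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": package_uri}
    )
    # Sort field names so the same input always yields the same manifest.
    for name in sorted(fields):
        namespace, prop = FIELD_MAP[name]
        ET.SubElement(description, f"{{{namespace}}}{prop}").text = str(fields[name])
    return ET.tostring(root, encoding="unicode")
```

Because unknown field names raise an error, a typing mistake is caught before anything is uploaded, which a hand-edited manifest cannot guarantee.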

2. Programmatic, scripted

Databank (the software underlying ORA-Data) also has a REST API (documented at

https://databank.ora.ox.ac.uk/api/), and BDLSS developers have written a Python library making

use of this, which will form a useful basis for future software development. It should be noted

that while Databank itself is written in Python, the REST API is language-agnostic by its nature,

so interfaces could be developed using other languages.
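As an illustration of what a scripted client might look like, the sketch below prepares (but does not send) an HTTP request to create a data package. It is not taken from the BDLSS Python library: the `/<silo>/datasets/` path, the `id` form parameter, and the function name `create_package_request` are assumptions, and the real routes should be taken from the API documentation.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def create_package_request(base_url, silo, package_id):
    """Build a POST request asking the repository to create a data package.

    The URL layout and the 'id' parameter are assumed for illustration only.
    """
    url = f"{base_url.rstrip('/')}/{quote(silo)}/datasets/"
    body = urlencode({"id": package_id}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
    )


# Example: prepare a request against a placeholder host; nothing is sent.
request = create_package_request("https://repo.example.org/", "sandbox", "gwa item 1")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would also require whatever authentication the service demands, which is omitted here.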

If complex datasets are to have detailed metadata attached to their components at a level of

granularity lower than the dataset as a whole (and this is strongly recommended for effective

preservation and reusability) then these components will need to be ingested as separate but

linked packages, which will require programmatic ingest. This could work at a variety of levels of

complexity, e.g.

● bespoke scripts interacting directly with the Databank API

● customisable scripts for handling common forms of dataset

● a fully-featured interface allowing users (whether researchers or repository staff) to

organise a dataset as if in a standard filesystem, then export this structure as data

packages with associated metadata (see DataStage)

The first two of these would probably be written and/or configured by BDLSS staff or ITSS; the

third would make it more feasible for users to deposit their own data (though could also be used

by library/IT staff).
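The idea of exporting a filesystem-like structure as separate but linked packages can be sketched as follows. This is a hypothetical planning step only, with no upload: it assumes that each top-level directory of a dataset becomes its own package, and the names `plan_packages`, `has_part` and `is_part_of` are illustrative rather than part of Databank or DataStage.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_packages(dataset_id, file_paths):
    """Split a dataset into a parent package plus one linked package per
    top-level directory. Returns a list of plain dictionaries."""
    components = defaultdict(list)
    for path in file_paths:
        parts = PurePosixPath(path).parts
        # Files at the top level stay with the parent package (key "").
        key = parts[0] if len(parts) > 1 else ""
        components[key].append(path)

    parent = {
        "id": dataset_id,
        "files": sorted(components.pop("", [])),
        "has_part": [],
        "is_part_of": None,
    }
    children = []
    for name in sorted(components):
        child_id = f"{dataset_id}-{name}"
        parent["has_part"].append(child_id)
        children.append(
            {
                "id": child_id,
                "files": sorted(components[name]),
                "has_part": [],
                "is_part_of": dataset_id,
            }
        )
    return [parent] + children


plan = plan_packages(
    "survey", ["readme.txt", "photos/a.jpg", "photos/b.jpg", "db/sites.csv"]
)
for package in plan:
    print(package["id"], package["files"])
```

Each dictionary could then be given its own metadata and passed to an ingest script, with the link fields recorded in the manifests so that the relationships between components survive in the repository.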

At the time of writing there were few real examples of programmatic ingest into ORA-Data. A

script was written to ingest data from the Great War Archive, but the code was written by a

contractor and has since been lost.

Accessing data

The web interface to ORA-Data allows for some browsing and searching of data packages:

The screenshot above shows the data packages from the Great War Archive as they appear in

ORA-Data. Such a view of the data emphasises the difference between data preservation and

the websites that are often built by (or for) researchers to access research data. The data

packages are more suited to machines than to humans: while the metadata is fully indexed,

there are no ‘browse’ features such as those on the project website

(http://www.oucs.ox.ac.uk/ww1lit/gwa).

Current and future work

Work on ORA-Data is ongoing, with current effort directed towards:

● a clear and simple user interface for deposit and access to data, with a new facility for

creating and editing rich metadata

● implementation of single sign-on (SSO) authentication for depositors

● developing the API for accessing the data (see also Common findings: ORA-Data API)

● transitioning to a robust, clearly defined service with a transparent charging model and

procedures to manage payment

● allowing ORA to serve as a catalogue and search interface to Oxford-generated

research, whether it is archived in ORA or elsewhere. This means that metadata-only

records will be created (whether added manually or automatically harvested) for

externally-stored research materials, and these will include a link to the location of the

dataset.

● allowing direct deposit from research data creation/management software, so that

deposit can be more tightly integrated into research workflows

● legal framework including terms and conditions and deposit licence

● policies to underpin the service

● quantifying the staff required to run a mainstream service [recruitment dependent on

adequate resourcing]

● helpdesk services, including an RT ticketing system staffed by BDLSS staff

● supporting documentation and ‘how to’ guides for depositors and users

● an implementation plan to steer development towards ‘EPSRC readiness’ for 1 May

2015

● scalable infrastructure (dependent on adequate resourcing)
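The metadata-only records mentioned in the list above can be pictured with a small data structure. The sketch below is illustrative only and does not reflect the actual ORA-Data schema: the class `CatalogueRecord` and all of its field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogueRecord:
    """One catalogue entry; field names are hypothetical."""

    title: str
    creator: str
    held_in_repository: bool = True
    location_url: Optional[str] = None  # where an externally held dataset lives
    source: str = "manual"  # "manual" or "harvested"

    def __post_init__(self):
        # A metadata-only record is useful only if it links to the data.
        if not self.held_in_repository and not self.location_url:
            raise ValueError("a metadata-only record needs a link to the dataset")

    @property
    def is_metadata_only(self):
        return not self.held_in_repository


record = CatalogueRecord(
    title="Example survey data",
    creator="A. Researcher",
    held_in_repository=False,
    location_url="https://archive.example.org/datasets/123",
    source="harvested",
)
print(record.is_metadata_only)
```

Requiring a location for every externally held dataset keeps the catalogue usable as a search interface, since each record then leads somewhere.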

A workflow has been developed for ingest into ORA-Data⁹; this is intended for Bodleian staff,

but the (possibly optimistic) expectation is that researchers will eventually be able to do their

own ingest (or delegate it to e.g. research assistants).

Work is currently under way to add a more user-friendly form for submitting a record and

uploading data with a choice of either ‘rich’ or ‘simple’ metadata. Wireframes have been created

(an example screenshot is shown below) and these have been trialled with researchers from

two of the selected projects, providing useful feedback in both cases.

9 See Appendix 4: Workflow for uploading to DataBank as of 26/08/2014



Work is also under way to provide researchers with ways to deposit directly to ORA-Data from

the software they use to create and manage their research data, with ORDS (Online Research

Database Service) and DataStage both being steps towards this goal. However, at the time of

writing neither is yet fully operational as an institutional service, and archival data export from

ORDS is still wholly manual. IT Services and the Bodleian Libraries are currently planning a joint

project to work on research data deposit from other systems, of which DataStage is a possible

example.

Other software

DataStage

http://www.dataflow.ox.ac.uk/index.php/about/about-datastage

DataStage is software developed as part of the JISC-funded Dataflow project

(http://www.dataflow.ox.ac.uk/) to help researchers manage their ‘active’ digital research data

prior to publication or archiving. It is a means for researchers to deposit selected data into their

data repository of choice (providing that repository complies with the SWORD2 standard).

DataStage is a secure personalised ‘local’ file management environment for use at the research

group or individual level, appearing as a mapped drive on the researcher’s computer. It can be

deployed on a local server, or on an institutional or commercial cloud.

Users save files to DataStage just as they would to an ordinary desktop drive (e.g. to their C:

drive), but with added extras:

● Private, shared and collaborative directories, with password-controlled access

● Web access – work securely with stored files over the web, anywhere in the world

● Users can add richer metadata via the web interface, using free-text ‘notes’ fields

● All files can be automatically backed up via the usual backup service

● Users can invite colleagues to access files made available to a defined group via

password control

● Repository submission interface makes it easy for researchers to define data packages,

enter minimal metadata, and deposit them in a data archive of choice

● Flexibility to dynamically invoke additional cloud storage as required


IT Services are investigating the feasibility of developing a central multi-tenanted DataStage

instance; a local instance has already been deployed in Zoology, and Classics are considering

another. DataStage is also being used with ORA-Data (DataBank) in DigitalSafe

(http://digitalsafe.wordpress.com/), an electronic archive pilot project at Oxford. None of the

DataStage projects to date have gone beyond a pilot stage, however, and its future is unclear.

ORDS (Online Research Database Service)

http://ords.ox.ac.uk/

ORDS is a hosted database service, accessed over the web, which allows researchers at the

University of Oxford (and their external collaborators) to create/import databases, to

add/edit/search the data, publish datasets, and (eventually) to deposit data into ORA-Data. It

allows multiple editors to work on a single database; access controls allow users to decide who

gets permission to do what.


ORDS is currently being rolled out to early adopters, but work is still needed in the following

areas: improvements to the user interface and documentation; handling complex structures and

different import formats; the ability to deposit directly to ORA-Data (this is currently a wholly

manual process); transitioning to a robust, clearly defined service with a transparent charging

model.

BEAM Web Deposit

http://beamwebdeposit.sourceforge.net/

BEAM (Bodleian Electronic Archives & Manuscripts: http://www.bodleian.ox.ac.uk/beam) has a

web service which allows depositors of electronic archival material to upload their data to a

dedicated silo of ORA-Data for archiving, for datasets of up to 4GB. It offers a simple upload (for

single files, or compound files such as zips and tars) as well as a Java applet which allows

users to edit their upload as they go, for example to exclude certain files or directories. Users

are asked to supply a very small amount of metadata, and both they and the repository receive

an email receipt to confirm what was transferred. Administrative functions provided for BEAM

include: user management; the ability to alter settings for the user receipt issued on deposit; and

the ability to view details for transfers in the BEAM landing zone which are ready for transfer to

the BEAM preservation store.


This service is currently out of action following a server crash in November 2013; there are

plans to restore it, but no timescale or resources have yet been allocated for this.

External repositories

The University actively encourages researchers to deposit their data in external discipline-

specific repositories where appropriate ones exist in their field; metadata records can be created

for these in ORA-Data (either by harvesting or by manual submission) so that the data will

appear in local catalogues of Oxford University research materials and then link to the data

source for access. The initial focus for archiving datasets in ORA-Data is on data that

1. underpins publications, and/or

2. has to be archived as a requirement of a funding grant, and

3. has no other suitable archive for deposit.

It should be noted however that there are very few national subject repositories for humanities

subjects; the main exception is the Archaeology Data Service

(http://archaeologydataservice.ac.uk/), but obviously this only serves one specific area of

humanities research. The UK Data Service (http://ukdataservice.ac.uk/) may also be appropriate

for some humanities data, though none of the projects we worked with would fall into this

category.

3rd-party software

Many departments, faculties, and individual research groups and projects have already

built/commissioned (or are planning) their own systems and services for data management,

creation, and preservation, built on third-party software with varying degrees of local

configuration and customisation. Some examples:

● Archaeology have a rapid prototyping system for database-driven research

websites/applications, built on Filemaker Pro Server;

● IT Services, Bodleian Libraries Digital Library Systems and Services (BDLSS) and

departmental ITSS have each built bespoke research databases and collections for

many projects.

We recommend that central services (the Bodleian, IT Services) should keep lines of

communication with ITSS (and with each other) open, learn from their experience, and

interoperate with existing local systems where it is useful and possible to do so.

Other commonly used tools

Researchers are of course not limited to the software offered by the University, and will often

use the tools they have to hand, particularly in the early stages of a project where the

format/scope of the data may not yet be clearly defined. It is important that we support them in

this freedom while advising on general principles of effective data management with long-term

preservation in mind, including risks they should be aware of and steps they can take to ensure

that their data remain within their control, ‘portable’, accessible, and reusable.

Examples of software used by researchers and ITSS to create, edit, and share research data:

● database software (e.g. Access, MySQL, Postgres, Filemaker Pro)

● spreadsheets (e.g. Excel, OpenOffice)

● word-processing (e.g. Microsoft Word, Google Docs, OpenOffice)

● file-sharing (e.g. Dropbox, Google Docs)

● web content management systems (e.g. Drupal)

None of these are specific to research data, but it is important that our general support of third-

party software is joined up with specific support concerning how that software is used in a

research context. Two risks arise from this situation: first, that without sufficient or suitable

central resource, people will invest money in third party software and/or invent their own

solutions, which may be unsuitable, unscaleable, or incompatible; second, that without sufficient

communication across the various sections of the University, practices and developments will

continue to diverge despite the existence of central resources.

Other projects in development

A full investigation of all relevant third-party software would be impossible, and was outside the

scope of this project; however we did work with IT Services and OSS Watch to conduct an

investigation into the suitability of CKAN (http://ckan.org/: an open-source data management

system which has been adopted by a number of UK HE institutions) for RDM at Oxford. This

identified a number of functions which we do not currently provide to researchers, such as:

● Preview simple data in the research data repository

● Allow simple HTML publishing alongside archived data

● Persistently address data at arbitrary levels of granularity for citation

● Easily link research data with research information

● Provide a personalised and customisable ‘presence’ for individuals and research groups

● Create & visualise explicit links between files in a dataset

● ‘Deep search’ within datasets (not just searching metadata)

● Simple way to generate citation text

Following on from this investigation, IT Services are bidding for 2.5 years’ funding to set up a

CKAN service. Further details of both the investigation and the funding bid are provided in

Appendix 6: investigation of CKAN for RDM at Oxford.

Service providers

Research Data Management single point of contact

http://researchdata.ox.ac.uk/

This year saw the launch of a new website, http://researchdata.ox.ac.uk/, intended to support

researchers in sharing, managing, and preserving their data and research materials. The site

aims to provide a starting point for answers to questions about data storage, organisation, and

preservation; funder requirements; and available training; and to aggregate useful links to

external resources such as sample technical plans, checklists, funder requirements, etc. It also

publicises the new ‘single point of contact’ email address ([email protected]) for an ‘RDM

Enquiries team’ comprising representatives from the Bodleian Libraries, e-Research Centre, IT

Services and Research Services. This ‘team’ is entirely virtual: members are not co-located but

co-ordinate their responses online.

Research Services / Research Accounts (UAS)

http://www.admin.ox.ac.uk/researchsupport/

Research Services is the central research administration support service, providing

comprehensive support to researchers across the research lifecycle. Their services include

supporting the grant process, negotiating research-related contracts and agreements, providing

information on funding opportunities, helping to ensure compliance with regulatory and sponsor

requirements, facilitating technology transfer, and supporting the University’s knowledge exchange

activities.

Bodleian Libraries Digital Library Systems and Services (BDLSS)

http://www.bodleian.ox.ac.uk/bdlss/

BDLSS works in collaboration with scholars and

librarians at Oxford and around the world on the most pressing issues facing researchers in the

digital age, including digital preservation and the capture and delivery of digital research data;

it also has an active digitization programme that aims to make some of the Bodleian’s rare

materials available globally for learning, teaching, and research. As well as managing ORA and



ORA-Data, BDLSS provides hosting, development and maintenance for many bespoke

research websites and collections.

Academic IT Research Support team (IT Services)

http://blogs.it.ox.ac.uk/acit-rs-team/

The Research Support team provides IT-related advice to researchers (e.g. assistance with

technical appendices for funding bids; developing data management plans; advising on suitable

software or storage solutions for research), undertakes technical development work, and

provides a variety of training events in the area of research data management. The Research

Support team offers an initial consultation free of charge; other services are charged at a set

rate.

InfoDev (IT Services)

http://www.it.ox.ac.uk/infodev/

InfoDev is IT Services' academic development team. It provides IT support, development and

consultation to help facilitate and disseminate research, support teaching and learning, and

increase access to museums and collections through the development and hosting (via NSMS)

of web applications; it also partners with academic research projects on funding bids (providing

guaranteed staff time, consultation, or undertaking specific work-packages) and offers help and

assistance in writing technical appendices to such bids. InfoDev offers an initial consultation free

of charge; other services are charged at a set rate.

IT Learning Programme (IT Services)

http://www.it.ox.ac.uk/itlp/

The IT Learning Programme offers training courses on a wide variety of IT topics, including topics

and methods directly relevant to RDM and the digital humanities (e.g. database design,

reference management systems, surveys), as well as co-ordinating the more specific RDM

training offered by Academic IT’s Research Support team.

Subject Librarians

Subject librarians are sometimes approached with questions about RDM and preservation; in

recognition of this, the Bodleian has created the position of Data Librarian in the Social

Sciences (the holder of this post also acts as a subject librarian for Sociology, Economics and

Social Policy & Intervention). The Bodleian Data Priority Group, comprising subject librarians,

members of BDLSS and others, has been convened to ensure subject specialists are informed

of data developments and to act as a point of contact between researchers and BDLSS.

BEAM (Bodleian Electronic Archives and Manuscripts)

BEAM provides a trusted digital repository service for the management of born-digital archives

and manuscripts acquired by the Bodleian Library’s Special Collections department. The

repository allows the Bodleian’s archivists to gather, describe, manage and preserve the digital

components in archive and manuscript collections while maintaining their relationship with more

traditional components of the same collection. While there is considerable overlap here with the

issues involved in archiving research data, BEAM does not currently provide a service to

academic departments or directly to researchers.

BEAM staff have been involved in the Digital Safe project (see Glossary), again looking at

similar issues but in the context of institutional data.

Faculty and Department ITSS

The faculties and departments that make up the Humanities Division have widely varying levels

of IT provision.

Methods & Activities

Semi-structured interviews with researchers

At the time of the funding application, an open call was made for Digital

Humanities projects willing to participate in this project. Fourteen projects volunteered, of which

thirteen were eventually able to take part. Within each of the self-selected projects (see

Introduction: The projects), we conducted interviews with project researchers (generally the PI

on each of the projects) to determine:

● the aims and scope of their research

● their data creation or collection methods

● their data management practices as a whole, including organisation and documentation

of data, hosting, storage, data entry or editing, preservation plans, backups, etc

● the current state of their data in a wider research context, including citations, related

publications, collaboration with other projects, reuse, etc

● their views on whether or how the University was meeting their RDM requirements

● their understanding of digital preservation and sustainability, both generally and in the

context of their project

We took a semi-structured approach to these interviews rather than administering a fixed

questionnaire or survey, and the resulting conversations were invaluable in foregrounding the

real issues and challenges faced by digital humanities academics. A full description of each

project and the data gathered can be found in Appendix 1: project case studies.



Conversations with IT Support Staff

We also conducted discussions with several of the IT Support Staff (ITSS) from the academic

departments involved in the selected projects, who had worked to support these research

projects in some capacity. These were similar to the conversations with the researchers but with

a more technical focus. Again, allowing IT staff to talk freely proved rewarding, and helped to

build up a rich picture of the wide variety of experience and knowledge that local support staff

are currently providing.

The level and type of local (departmental) IT support varies widely between units, but services

provided to researchers by ITSS usually include some or all of the following:

● website/database development

● website/database hosting

● managing backups of desktop computers and project websites

● supplying, configuring, and maintaining third-party software

● advising on technical components of research funding bids

There is considerable overlap here with the services provided centrally by IT Services and the

Bodleian. This is not in itself a problem – local expertise can be valuable, and in any case

central resources are insufficient to meet all demand – but highlights the need for

communication between providers to ensure interoperability and long-term sustainability.

Despite the variety of experience, some common ground emerged:

● Most ITSS were dealing with several research projects at any one time, at various

stages of the research lifecycle (some ‘live’, i.e. current funded projects; some ‘legacy’,

usually only maintained on a best-effort basis).

● Faculty ITSS were generally deeply knowledgeable about the projects for which they

were responsible, with both an understanding of the technologies used (even in legacy

projects which they had only ‘inherited’) and an appreciation of the research aims.

● Despite that sense of investment in the research, ITSS inevitably had different priorities

from researchers, and in many cases we sensed a tension between the desire for

standardisation (on the basis that a reliable service can be more easily and efficiently

provided by supporting only a limited set of technologies) and an appreciation of the

importance of meeting individual researchers’ requirements (often by building and/or

maintaining bespoke systems).

● A frequently recurring issue was the lack of consultation about technical requirements for

research projects; ITSS reported that they were often only informed of requirements

once a project had already been funded (i.e. once they had effectively been committed

to providing hosting and/or development resources). In one extreme case the first ITSS

involvement or indeed knowledge of a project was a request for a firewall exception from

a contractor who was developing the project website.



Data acquisition

One of the initial goals of this project was to gather research materials from the self-selected

projects for ingest into the central repository (ORA-Data). These datasets comprised a variety of

file formats:

● Databases: MySQL; PostgreSQL; Access; Filemaker Pro

● XML texts and metadata: TEI; EpiDoc; bespoke schemas

● Images: TIFF; JPEG

● Video: MPEG; MP4

● Audio: MP3

● Word processing: .docx; .doc; WordPerfect

● PDFs

Several projects had physical media and non-digital assets associated with them, such as:

● CDs, DVDs

● audio recordings on analog cassettes

● paper transcripts of interviews

● slides, microfilm, microfiche

● museum objects

● notebooks

● archaeological finds (pot sherds etc)

Archiving and cataloguing these assets fell outside the scope of this project, but the existence of

these forms of research data has flagged the need for facilities for recording the existence and

location of related physical assets when ingesting digital assets into a repository. It also raised

the question of whether there is need for a clearer policy on whether the humanities equivalent

of ‘lab notebooks’ and other pre-publication working data is in scope for the institutional

repository (and if not, where or whether these should be preserved).

The main obstacles to data acquisition were:

● the simple logistics of liaising with so many different researchers and ITSS

● the availability of data holders – both researchers and ITSS had higher priorities

● technical issues with exporting data in a useful format

The main obstacles to data deposit were:

● the repository software – a limited interface for ingest and creation of metadata

● some datasets turned out to be so complex that formatting and describing them for

preservation would be a project in its own right



As there was no user-facing process or service for uploading data into ORA-Data during the

course of this project, files were transferred to project staff by combinations of the following:

● email

● Oxfile file-sharing service (https://oxfile.ox.ac.uk/)

● portable hard drive

● CD/DVD through internal mail

● standard file compression software (gzip, tar)

Note that none of these tools allow for verifying the integrity of the data, i.e. ensuring that the

data is transferred without truncation or corruption; verification was performed manually by

checking file sizes and number of files against the information supplied by the project PI. This

sufficed for the current proof-of-concept project, but future development of systems for

submitting research data should take this requirement into account.
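
The integrity requirement noted above could be met with checksum manifests. The following is a minimal sketch only, assuming the depositor can run a script before sending and the recipient can run the same script on arrival; the function names and the choice of SHA-256 are illustrative assumptions, not part of ORA-Data or any existing Oxford service.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(sent: dict[str, str], received: dict[str, str]) -> dict[str, list[str]]:
    """Report files that went missing, appeared unexpectedly, or changed in transit."""
    shared = set(sent) & set(received)
    return {
        "missing": sorted(set(sent) - set(received)),
        "unexpected": sorted(set(received) - set(sent)),
        "altered": sorted(name for name in shared if sent[name] != received[name]),
    }
```

Unlike a comparison of file sizes and file counts, a checksum comparison also detects corruption that leaves the size of a file unchanged.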

Despite our efforts, much less data than intended was gathered. The main barrier to data

acquisition was not technical but social and logistical. Where research materials were stored in

databases to which only local IT staff had access, or even on researchers’ personal laptops, the

process of extracting data required considerable pro bono effort from already over-committed IT

staff and researchers. In all cases personal contact was required to gather data, and in most

cases this became a protracted discussion to establish the answers to questions such as what

data were currently held, what needed to be preserved, whether embargoes were required,

where copyright resided, and in what format(s) data could be exported. It is expected that as a

Data Management Plan (DMP) becomes a requirement for funding, some of these questions will

be answered earlier in the project lifecycle; none of the projects consulted had a DMP.

Data ingest and metadata creation

Once gathered, the data was ingested to ORA-Data through the manual browser-based process

described above, creating only the minimal metadata demanded by this method. While manual

creation of further metadata would have been possible, it would have been labour-intensive,

error-prone, and almost certainly inconsistent with the richer metadata which the new input

forms being developed (see ORA-Data above) will facilitate.

We did, however, investigate the possibility of creating additional targeted metadata extracted

from the content of datasets, namely bibliographies and the temporal, prosopographic, linguistic,

and geographic extent of data. This was also, inevitably, a largely manual process. While there

is scope for partial automation, e.g. template scripts for processing different data formats,

manual configuration would probably always be required to determine the correct fields to use

for e.g. person, place, or date information, and the potential gains (cross-searching, comparing

multiple datasets, visualising the extent of data) could as efficiently be addressed by more

standardisation of formats in the repository and better indexing/search within datasets.
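
A ‘template script’ of the kind mentioned above might look like the following minimal sketch. It assumes a dataset has already been exported to CSV and that someone has manually identified which column holds a year (negative for BC); the function name and the column name used in the example call are hypothetical, and this is not the process actually used in the project.

```python
import csv
import io
from typing import Optional


def temporal_extent(csv_text: str, date_field: str) -> Optional[tuple[int, int]]:
    """Return (earliest, latest) year found in the configured column, or None."""
    years = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(date_field) or "").strip()
        try:
            years.append(int(value))
        except ValueError:
            # Skip blank or non-numeric entries rather than guessing at them.
            continue
    if not years:
        return None
    return min(years), max(years)


# Example call with an invented two-row export; "year" is the manually chosen field.
extent = temporal_extent("site,year\nA,-3000\nB,1900\n", "year")
```

The manual step survives as the `date_field` argument: the script can be reused across datasets, but somebody still has to decide which field carries the date.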

Documentation was ingested alongside data where possible, but this was often patchy and

never machine-processable; the latter is not a problem per se (human-readable documentation

will be necessary and useful for as long as humans are involved in the reuse of data for further

research!) but it might be useful to experiment with template systems and ‘toolkits’ for

encouraging more structured documentation of data as research progresses: e.g. for databases,

a pre-populated form where a brief description of each field can be entered; for XML, achieving

a similar goal by automatically inserting comment fields at key points in the schema; for multi-

format datasets, automatically documenting the file types and any folder hierarchy, leaving

space for human explanations of the significance of those properties.
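
For multi-format datasets, the automatic part of such a toolkit could be as simple as the sketch below. It is an illustration under stated assumptions, not an existing tool: the function name and the layout of the generated text are invented here, and the ‘NOTE:’ placeholders mark where a researcher would add the human explanation.

```python
from collections import Counter
from pathlib import Path


def build_inventory(root: Path) -> str:
    """Return a plain-text documentation stub listing folders and file types under root."""
    lines = [f"Dataset inventory for: {root.name}", ""]

    lines.append("Folders (add a note on the purpose of each):")
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        lines.append(f"  {folder.relative_to(root).as_posix()}/  -- NOTE: ")

    lines.append("")
    lines.append("File types (add a note on how each was created):")
    counts = Counter(
        p.suffix.lower() or "(no extension)"
        for p in root.rglob("*")
        if p.is_file()
    )
    for suffix, count in sorted(counts.items()):
        lines.append(f"  {suffix}: {count} file(s)  -- NOTE: ")

    return "\n".join(lines)
```

The output is deliberately human-readable text rather than a machine-processable format, in keeping with the point above that human-readable documentation remains necessary.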

Case studies: research projects

A full description of each of the cases can be found in Appendix 1: project case studies, but in

presenting the findings in this report, we decided to focus on one particular case, which turned

out to exhibit a wide range of characteristic issues for digital humanities research data, and to

use this as a starting point from which to highlight the same issues found in other projects.

These points of overlap are highlighted in the blue boxes.

Sphakia Survey

Overview

● Website: http://sphakia.classics.ox.ac.uk/

● Summary: Databases of finds recorded in an archaeological survey of Crete

● Project lifetime: 1987–

● Formats: Filemaker Pro; JPG; TIFF; .mov; HTML

● Funding: Social Sciences and Humanities Research Council of Canada; Institute for

Aegean Prehistory (New York); Craven Committee, University of Oxford

Scope of research

The Sphakia Survey is an interdisciplinary archaeological project, begun in 1987, whose main

objective was to reconstruct the sequence of human activity in a remote and rugged part of

Crete (Greece) from the time that people arrived in the area, by ca 3000 BC, until the end of

Ottoman rule in AD 1900. The research covers three major epochs, Prehistoric, Graeco-Roman,

and Byzantine-Venetian-Turkish, and has involved the work of many people using

environmental, archaeological, documentary, and local information.

Four main types of information are recorded: the artefacts discovered in the surface survey;

the environmental data recording the context in which they were found; texts and inscriptions;

and oral/ethnographic data (notes from interviews; photos of sites

and surrounding area). The information is categorised into different regions and environmental

zones, and individual artefactual finds are classified by shape, materials, colour and so on.

The project represents nearly 30 years of academic research; the archaeological survey is

inherently unrepeatable and the subsequent analysis would be at best costly and at worst

impossible to replicate.

Research materials

Databases

All of the information collected for the Sphakia Survey was recorded in a set of databases. The

survey databases were created in Filemaker Pro. One set of linked databases (focusing on

Region 8 of the survey, as a case study) is accessible via the website through a bespoke PHP

interface, running on the following setup:

● Hardware: PowerMac G3 server (warranty ended in 1999)

● OS: Mac OS X 10.3

● Webserver: 4D WebSTAR

● Database: Filemaker Pro version 5.5

This version of Filemaker Pro is long since out of support and has no direct upgrade path to a

current version; the hardware is long overdue for upgrade. As a whole the project’s web

presence is in urgent need of intervention to safeguard its long-term survival.

In addition to the online databases, 29 further databases (mostly concerning detailed

macroscopic fabric analysis of the finds from the survey) exist only offline in Filemaker Pro

version 4.1 format on the researchers’ personal computers. This version of Filemaker Pro, still

being used by the project team, requires Mac OS 9 (or ‘Classic’ environment to emulate this) to

run, which in turn requires a PPC Mac running OS X 10.5 or earlier¹⁰: this hardware has not

been supported by Apple since 2009. In order to export these databases into a reusable format

they would first have to be updated (via a two-stage migration process) to a modern version of

Filemaker Pro.

Unsupported software: also found in one other project

Server setup at risk: also found in one other project

There are at least three copies of these databases (on the PI’s laptop; on an external hard drive,

as a backup copy; and on the Co-PI’s computer) and as no version control is in place it is quite

likely that the copies are no longer in sync and would have to be intelligently merged. This

operation is complicated by the fact that Filemaker’s “last modified” timestamp is updated

whenever a database is queried, so programmatically determining the most recent substantive

changes may be difficult or even impossible.
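
Because the timestamp cannot be trusted, any merge would have to compare content record by record. The sketch below illustrates the idea under strong assumptions: that each copy has first been exported to CSV (which, as noted above, would itself require migration), and that every record carries a stable unique identifier. The column names and sample values are invented for illustration and are not taken from the Sphakia databases.

```python
import csv
import io


def load_records(csv_text: str, key: str) -> dict[str, dict[str, str]]:
    """Index the rows of a CSV export by the value of their identifier column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def diff_copies(first: dict[str, dict[str, str]],
                second: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """List identifiers found in only one copy, or whose field values differ."""
    shared = set(first) & set(second)
    return {
        "only_in_first": sorted(set(first) - set(second)),
        "only_in_second": sorted(set(second) - set(first)),
        "changed": sorted(k for k in shared if first[k] != second[k]),
    }


# Invented sample exports standing in for two of the copies.
laptop = "find_id,fabric\n8.01,coarse red\n8.02,fine buff\n"
backup = "find_id,fabric\n8.01,coarse red\n8.02,fine grey\n8.03,coarse brown\n"
report = diff_copies(load_records(laptop, "find_id"), load_records(backup, "find_id"))
```

A report of this kind shows where the copies disagree but not which version is correct; that judgement would still rest with the researchers.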

10 Current Mac OS is 10.9



Version control issues: also found in one other project

Image & video

One database includes scanned black and white drawings of pottery finds (copyright status

uncertain: the originals are owned by the Canadian Institute in Greece but they were not

responsible for the digitization and are considered unlikely to assert their rights); there is

also a large quantity of photographs (low resolution versions are mostly on the project website,

but higher resolution versions exist offline). Photographs have unique IDs and associated

captions/descriptions are recorded in the database. Video clips (.mov/.rm format) are available

through the website; these are taken from a 50-minute film, originally in VHS format, since

digitised.

Copyright status in images: also found in one other project

Different quality versions of images: also found in one other project

Website and documentation

Technical documentation for the website and databases is believed to exist but project staff

were not able to find or obtain it within the duration of the project.

The website provides detailed explanations of the project aims, a summary of results, how to

use the database, and so on. It is regarded by the researchers as a publication in its own right,

to be preserved alongside the data.

The website also hosts a number of scholarly publications (HTML format) and newspaper

articles (reproduced as JPG and PDF), as well as a bibliography of articles published

elsewhere.

Documentation embedded in website: also found in two other projects

Physical media

In addition to the digital data, the PI has custody of the following media associated with the

project:

● Original paper drawings of pottery finds

● 2 copies of VHS video (as described at http://sphakia.classics.ox.ac.uk/video.html)

● Approximately 45 CDs of digitised photos, slides, & drawings (in the form of TIFFs, jpeg

derivatives for the website, and thumbnails of these), plus course materials for

Archaeology for Amateurs: The Mysteries of Crete, a course created for TALL in 2002

(http://crete.classics.ox.ac.uk/)

● Approximately 1000 35mm slides (some of which were digitised for the website)

● 1 reel of microfilm

● 1 box of microfiche

Except for the cataloguing/captioning of photographs for the website, none of these resources

have been catalogued or documented.

Further non-digital resources

The pot sherds from the archaeological survey are currently housed in the Archaeological

Museum in Khania (West Crete), and are owned by the Greek Archaeological Service. The

site/number recorded in the database can be used to identify the physical object.

Handwritten notebooks from the survey also exist; these have not been catalogued or digitised.

A two-volume publication based on the finds of the survey is currently in progress, expected in

2017.

Associated non-digital assets: also found in three other projects

Linked publications: also found in six other projects

Other projects

Summary information only is included here for the remaining projects; more detail is available in

Appendix 1: project case studies.

Project 1: Ashmolean Cyprus Digitisation Project

Website: http://digital.humanities.ox.ac.uk/ProjectProfile/Project_page.aspx?pid=333

Department: Ashmolean Museum

Summary: Database of description/provenance for Ashmolean Cyprus collection

Formats: Zetcom MuseumPlus relational database; TIFF; JPG

Funding: A. G. Leventis Foundation

Project 2: Centre for the Study of the Cantigas de Santa Maria

Website: http://csm.mml.ox.ac.uk/

Department: Faculty of Medieval and Modern Languages

Summary: Metadata database for manuscripts of medieval Galician poems, plus some full texts

Formats: MySQL database, linked PDF and XML texts

Funding: Leverhulme Trust; Research Development Fund of Oxford University; MHRA; British

Academy

Project 3: Creative Practice in Contemporary Concert Music

Website: http://www.music.ox.ac.uk/research/cpccm/

Department: Faculty of Music

Summary: Video/transcripts of rehearsals; database of quantitative data derived from these

Formats: mp4, .docx, relational database

Funding: AHRC

Project 4: Dictionary of Medieval Latin from British Sources

Website: http://www.dmlbs.ox.ac.uk/

Department: Faculty of Classics

Summary: Complete dictionary of medieval Latin digitised as XML

Formats: XML (bespoke schema)

Funding: AHRC; Packard Humanities Institute; British Academy; John Fell Fund

Project 5: Digital Miscellanies Index

Website: http://digitalmiscellaniesindex.org/

Department: Faculty of English

Summary: Bibliographic data about 18th-century poetic miscellanies, stored as XML

Formats: XML (bespoke schema)

Funding: Leverhulme Trust

Project 6: Early Modern Festival Books

Website: http://festivals.mml.ox.ac.uk/


Summary: Database of bibliographic information about early modern festival books

Formats: MySQL relational database

Funding: John Fell Fund

Project 7: First World War Poetry Digital Archive / Great War Archive

Website: http://www.oucs.ox.ac.uk/ww1lit/, http://www.oucs.ox.ac.uk/ww1lit/gwa/

Department: Faculty of English / Oxford University Computing Services

Summary: Partly community-sourced collection of multimedia objects with metadata/provenance

Formats: TIFF, JPEG, MP3, MPEG, TEI XML, txt, csv

Funding: JISC; HEFCE

Project 8: Inscriptions of Sicily


Summary: Epigraphic database of Sicilian inscriptions

Formats: currently Access database, being reworked as EpiDoc XML

Funding: John Fell Fund

Project 9: Last Statues of Antiquity

Website: http://www.ocla.ox.ac.uk/statues/

Department: Faculty of History / School of Archaeology

Summary: Database of information about statues, plus digitised photographs

Formats: FileMaker Pro database; JPG

Funding: AHRC

Project 10: Lexicon of Greek Personal Names Online

Website: http://www.lgpn.ox.ac.uk/online/


Summary: Prosopographical database of ancient Greek names, based on paper publication

Formats: Ingres database, TEI XML, RDF (CIDOC-CRM), PostgreSQL database, PDF

Funding: AHRC

Project 11: Oxford Archive of Russian Life History

Website: http://www.ehrc.ox.ac.uk/lifehistory/archive.htm


Summary: Audio recordings & transcripts of ethnographic interviews, plus photographs

Formats: Microsoft Word docs, WordPerfect, mp3, JPG

Funding: Leverhulme Trust; AHRC

Project 12: Oxford Roman Economy Project

Website: http://oxrep.classics.ox.ac.uk/


Summary: Several linked quantitative databases relating to Roman economics and trade

Formats: PostgreSQL database, CSV, Word docs

Funding: AHRC; Augustus Foundation (Baron Lorne Thyssen)

Common findings

The research in this project identified a number of common themes or concerns which rarely

have clear solutions. The recommendations below are derived directly from the evidence

gathered.

Finding 1: Data v interfaces

Although a distinction is sometimes drawn between preservation (long-term storage of data for

reuse) and sustainability (keeping the data available online, e.g. via a dynamic website), the

distinction proves too subtle for most researchers.

There is a considerable (and wholly understandable) lack of clarity among researchers around

the difference between their research data (for textual data, most commonly held in either a

relational database or increasingly as XML) and the interfaces to it (usually a website with some

kind of search interface). In some cases this confusion is not aided by the construction of

systems and applications which blur the line between data and interface: user documentation

and logic, essential to the understanding and use of the data, are embedded in the interface

software, making it harder to extract the data for preservation.

All of the academics consulted started from the point of view that the optimum method of

preservation would be to “keep the website working”; some took a stronger position that the

data were meaningless without the interface. This strong preference for sustaining applications

over preserving data in isolation stemmed from a firm belief in the current importance of these

datasets for the research community – that is, that if the data are still ‘in use’ then they should

not be ‘archived’. This, in turn, came from a strong perception of ‘archiving’ as involving taking

data out of circulation, mothballing it, making it inaccessible, marking it as less current or

relevant; this prejudice is not irrational but should be borne in mind when communicating with

researchers about preservation.

▶ Recommendations: 1. Preservation and sustainability

Finding 2: Funding gaps for sustainability and preservation

As mentioned above, most research funding is allocated to a fixed-term project and can only be

used to cover costs incurred during that period; this means that it can be difficult to fund the

ongoing maintenance of a website, database, or other digital resource. This is a particular

problem in the humanities where the scholarly benefits of resources generally accrue over the

medium- to long-term, and means there is an increased risk of technical maintenance having to

be done ‘under the radar’, on a ‘best effort’ or pro bono basis, by IT staff (or indeed researchers)

who have no time ring-fenced for the work. This in turn means that:



● staff are responsible for an ever-growing set of projects and research data

● single points of failure are more likely, e.g. where a resource relies heavily on a single

individual’s expertise and goodwill

● maintenance is likely to be more reactive than proactive, concentrating on ‘firefighting’ to

keep a resource alive in the short-term rather than working towards its long-term

sustainability

● the true cost of maintaining the resources is therefore obscured and continues to be

underfunded

However, the research found that it was rare for any research or IT staff to take active steps to

close down or discontinue a digital resource even when it was no longer funded, so this

maintenance debt just increases: this is clearly not sustainable without increasing resources to

meet demand, and it is recommended that more active curation of research outputs is

undertaken. It should be noted that the University RDM policy states that “research data and

records should be retained for as long as they are of continuing value to the researcher and the

wider research community”; at present it is not clear who has the authority to make that

judgement.

The combination of the explicit cost of preservation (and the difficulty of including this in a

funding bid) and the hidden cost of ongoing maintenance may mean that research materials

stay in a grey zone of best-efforts sustainability rather than being consciously and effectively

preserved; this puts them at greater risk of being lost altogether, as a) outdated and unpatched

software may be more vulnerable to deliberate attack or accidental failure, and b) expertise may

be lost as knowledgeable individuals who have been maintaining ‘unofficial’ resources out of

goodwill may move or leave without proper handover. Essentially, it is vital to be explicit about

what resources (data and interfaces) exist; what their current status is (e.g. development, live,

maintenance, archive); and who ‘owns’ them, i.e. who takes responsibility for their maintenance,

preservation or deletion.
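
As a minimal sketch of what such an explicit record might look like, the following Python uses the four status values named above. The class and field names (`DigitalAsset`, `owner`, `funded_until`), the example entries, and the rule for flagging an asset are all hypothetical; they do not describe any existing Oxford system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    """The example statuses listed in the text."""
    DEVELOPMENT = "development"
    LIVE = "live"
    MAINTENANCE = "maintenance"
    ARCHIVE = "archive"


@dataclass
class DigitalAsset:
    name: str
    kind: str                           # "data" or "interface"
    status: Status
    owner: Optional[str]                # who takes responsibility for it, if anyone
    funded_until: Optional[int] = None  # last funded year, if known (hypothetical field)


def needs_attention(asset: DigitalAsset, current_year: int) -> bool:
    """Flag assets with no named owner, or still running without funding."""
    if asset.owner is None:
        return True
    unfunded = asset.funded_until is None or asset.funded_until < current_year
    return unfunded and asset.status is not Status.ARCHIVE


# Invented entries for illustration only.
register = [
    DigitalAsset("Example survey database", "data", Status.LIVE, "Faculty A", 2012),
    DigitalAsset("Example survey website", "interface", Status.MAINTENANCE, None),
    DigitalAsset("Example text corpus", "data", Status.ARCHIVE, "Library", None),
]

flagged = [asset.name for asset in register if needs_attention(asset, current_year=2014)]
```

Even a register this simple makes the 'grey zone' visible: the first two entries are flagged because one is live but unfunded and the other has no owner.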

▶ Recommendations: 2. Funding gaps for sustainability and preservation

Finding 3: Lack of reuse and evolution

The ability to reuse and build on existing data for future research is one of the most common

arguments for preservation; however, good examples of effective data reuse in the digital

humanities are comparatively hard to identify (though some are described below). We believe

there are two simple reasons for this:

● research materials are frequently not preserved in a form where they can easily be

discovered and reused

● digital humanities is still a relatively young field, particularly given the generally long

gestation periods for humanities research



However, there is also a third, more paradoxical effect: because digital humanities projects

tend to evolve over the long-term, their research materials are in a sense continually being

reused by the researchers involved in collecting/creating them. This has problematic

implications for reuse by others because:

● it can be difficult to motivate investment of time and resources in documentation while

projects are still ‘in progress’, as this is seen as an activity to be carried out when the

project is ‘finished’ (and as such may never actually happen)

● projects often grow organically over the course of decades, converting data and

adopting new technologies according to their needs (and as the technology changes

around them, the research may change to make use of new capabilities – the tools

influence the research), resulting in a growing number of technologies to document

If research data are to realise their full potential as re-usable research materials, it is vital that

they are preserved in a way that allows them to be discovered, accessed, and used by future

researchers. Discovery and access rely first and foremost on good metadata, which needs to be

created at the point of preservation. Even the most perfectly preserved and documented

research materials are of little use if nobody is aware of their existence in the first place. The

‘usability’ of an existing dataset depends on recording the original context of the data

capture/creation alongside the data itself. In some cases there may be several different points of

capture and/or conversion and it may be necessary to preserve or document the project’s

history rather than simply a snapshot of the current/finished state of the data. Examples of

existing projects which demonstrate reuse are:

Digital Miscellanies Index: in the course of the original project, data has been optimised for

reuse and preservation; this has enabled phase 2 of the project, which will combine the original

data with that from the Verse Miscellanies Online project

(http://versemiscellaniesonline.bodleian.ox.ac.uk/) and from another ‘orphaned’ database,

creating mappings between the three datasets to allow effective cross-searching and

visualisation of verse miscellanies across a much broader historical period.

Lexicon of Greek Personal Names and SNAP:DRGN: the LGPN has digitised, converted and

normalised data to make it more widely available and reusable, and SNAP:DRGN defines

standards for making prosopographies of the ancient world (including the LGPN) available as

open linked data. While neither of these in itself constitutes reuse of data for novel

research, both make strong contributions to the work of enabling reuse.

It should be noted that all these examples of effective reuse involve strong collaborations with

other institutions; any proposed institutional data management solutions must meet the

requirement for data sharing and collaborative working outside Oxford as well as within it.

▶ Recommendations: 3. Reuse and evolution

Finding 4: Advice, training, and mentoring

Almost every respondent interviewed stressed the value of conversations in person with

research technologists (in IT Services, in the Bodleian, and in faculties/departments) for all

aspects of creating/collecting, managing, developing, and disseminating their data: selecting

tools and designing methods for data collection; data analysis and visualisation; data

conversion; developing user interfaces; and so on. These dialogues offered a chance to explore

possibilities and work through the advantages and disadvantages of possible approaches rather

than presenting a single technical ‘solution’.

At the outset of a project researchers may not have a clear idea of what their eventual data-

management requirements will be; they may not know what technological options are available

to them and what the longer-term consequences of those options may be; and they may not

know what questions they need to ask to fill in these gaps. More open-ended conversation with

data management and research technology experts helps to build up a fuller picture of the

research data and related resources, how the data has been managed to date, and how best to

go forward with long-term sustainability and preservation in mind.

Ideally this sort of conversation should happen at the beginning of a project – expert support

early on can highlight potential problems, suggest possible developments and enhancements,

and identify areas for possible collaboration with other projects. In this last area it is important to

note the advantage of not having a separate humanities-specific research support area: that is,

central support services can identify areas where knowledge and technological development

can be shared between projects in different disciplines. This can result in potentially fruitful

interdisciplinary collaboration, as well as simple efficiency.

While researchers had much praise for the support given to them by individual members of staff

and by departments as a whole, they also expressed frustration at:

● Receiving conflicting advice from different sources

● Different technologies being supported by different departments/services – some insist

on migrating/rewriting when inheriting an application from elsewhere

● Lack of clarity about how the different central services interact

Even within any one central department (IT Services or the Bodleian Libraries) there are several

different sections which might be asked to give advice on research technologies and data

management (see Relevant service providers for more details of these services), and these

sections operate largely independently of each other. While personal conversations are

important, then, they need to be with people who are aware of the available options, and as part

of a joined-up approach.

Some steps have already been taken to address these issues, notably:



● a new RDM website (http://researchdata.ox.ac.uk/) and email address

([email protected]) offering a single point of contact for RDM enquiries (see

Relevant service providers)

● a training programme offered by Research Support at IT Services

(http://blogs.it.ox.ac.uk/acit-rs-team/events/rdmcourses/) promoting consistent data

management principles while tailoring its message to specific disciplines

While these are welcome beginnings and necessary underpinnings for increased harmonisation

of service provision, it is clear from conversations with academics that more passive, static

information (websites, formal training sessions) will not meet all their requirements and that

personal dialogues with experts are still vital. It should also be noted that no ‘single point of

contact’ can prevent people accessing services via other points; what is needed is more of a

clearing-house for projects and information, however and wherever the first contact is made.

Several researchers independently suggested that ‘digital humanities mentoring’ would have

been useful to them, and some of these indicated that they might be willing to act as mentors in

the future. A pilot ‘RDM mentoring’ project is currently under way in the Social Sciences (with

mentoring initially being offered by the Research Support team at IT Services), and has already

garnered considerable interest from academics.

▶ Recommendations: 4. Advice, training, and mentoring

Finding 5: Repository policy on data and formats

In general, researchers interviewed for this project were aware a) of the central publications

repository (though few had used it), and b) that a central data repository (ORA-Data) either

already existed or was imminent. There was, however, much less clarity about the purpose of

ORA-Data, i.e. whether it was intended for:

● discoverability (search or serendipity)

● actively promoting the diversity of Oxford research (marketing our research)

● archiving/preservation (ensuring nothing is lost)

● backups (ensuring data and interfaces can be efficiently restored in event of a crash)

● accountability (e.g. fulfilling the REF)

● compliance (e.g. meeting funder requirements)

● sharing data as a principle (Open Access)

or some combination of the above. There were also questions and concerns expressed by

researchers about what data is desirable and permissible within ORA-Data, e.g. should/could it

include:


● all data produced by Oxford researchers?

● data associated with publications only?

● data which has no other appropriate (e.g. subject-specific) repository? (NB a significant

majority of humanities data is believed to fall into this category, far more than in other

disciplines)

● ‘unfinished’ or unfunded data? (e.g. data which is unpublished, incomplete, not part of a

project)

A deposit policy is currently being drafted11 and these questions are being addressed.

As well as the general question of what categories of research materials are permitted or

invited, there was a more specific technical question of what file formats the repository would

accept. Reasons for selecting formats and technologies are often contextual rather than

functional; there are usually many different options which could meet researchers’ immediate

needs, but they are arguably most likely to use:

● what they have always used

● what is most readily available

● what colleagues in their field use, or

● what their local ITSS prefers/mandates for ongoing support

In many cases these choices may have been based on business considerations and/or

historical accident as much as technical or methodological reasons.

Under Oxford’s devolved structure no-one has the power to prescribe (or proscribe) what

formats are accepted for preservation; however, guidelines on which formats we can most

easily support and preserve should be clearly disseminated, along with advice on which formats

are most appropriate for enabling future reuse. In practice we can always ‘preserve the

bitstream’12 even if the format is suboptimal or the data is poorly understood/documented, but

making such data reusable may not be cost-effective.
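
One common way to confirm that a bitstream has survived unchanged is to record a checksum at deposit and recompute it later. The sketch below is illustrative only: it uses Python's standard `hashlib`, and the function names, the sample file, and the choice of SHA-256 are assumptions rather than a description of how ORA-Data works.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, recorded_checksum: str) -> bool:
    """Compare the file's current checksum with the one recorded at deposit."""
    return sha256_of(path) == recorded_checksum


# Demonstration with a throwaway file standing in for a deposited dataset.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "dataset.csv")
    with open(sample, "wb") as handle:
        handle.write(b"site,count\n1,42\n")

    recorded = sha256_of(sample)             # stored alongside the deposit
    unchanged = is_intact(sample, recorded)  # later fixity check

    with open(sample, "ab") as handle:       # simulate accidental corruption
        handle.write(b"stray bytes")
    changed_detected = not is_intact(sample, recorded)
```

A matching checksum shows only that the bits are unchanged; it says nothing about whether the format can still be opened or the data understood, which is why bitstream preservation alone does not make data reusable.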

File formats are of course not institution-specific, and various organisations already provide

comprehensive guidelines on the effectiveness of different formats for long-term preservation,

e.g.

● the Library of Congress recently released its recommended format specifications:

http://www.loc.gov/preservation/resources/rfs/

11 See Appendix 3: Draft ORA-Data Policy Statement

12 Bitstream preservation refers to the process of storing and maintaining digital objects over time,

ensuring that there is no loss or corruption of the bits (units of information) making up those objects; this

is effectively the lowest-level preservation possible, necessary but not sufficient to preserve meaningful

ongoing access. A more detailed definition is available here:

http://www.paradigm.ac.uk/workbook/preservation-strategies/degree-bitstream.html

● the UK Data Archive has recommendations on file formats and software:

http://www.data-archive.ac.uk/create-manage/format/formats

● the Digital Curation Centre (DCC) offers guidance on selecting formats which, while

slightly older, articulates useful principles for consideration when choosing file formats to

recommend:

http://www.dcc.ac.uk/resources/curation-reference-manual/completed-chapters/file-formats

▶ Recommendations: 5. Repository policy on data and formats

Finding 6: ORA-Data API

Several ITSS expected or wished to be able to use ORA-Data as a ‘back end’ for building a

website or application (e.g. storing the data in ORA-Data but providing an alternative web

interface for searching, browsing, visualising that data). The ability to do this would be a clear

incentive for depositing data in ORA-Data; however, at present it would be difficult for most

datasets because of the structure of objects in the repository and the limitations of the API. The

main barriers to using ORA-Data for this type of development are:

1. Locations of objects in ORA-Data are not persistent. URLs in ORA-Data include the

silo name, which will change if a package is moved from one silo to another. It should be

acknowledged that in practice this is unlikely to happen frequently; however, the lack of

persistent addresses would be considered a risk if using the repository as the back end

for a web-based application.

2. Lack of granularity in addressing data. Once a database or table is ingested into

ORA-Data there is no way to address individual entries (e.g. by linking directly to them,

or targeting them in a search); similarly, once XML documents are ingested there is no

way to address or query them at anything lower than file level. Any application which

wanted to make effective use of data from ORA-Data in these formats would have to

store a local copy of the entire dataset and perform its own indexing, thus losing some of

the advantages of storing the data in the repository.

3. Inability to restrict a search by silo. The Databank API does not provide a way to limit

a search (e.g. for a data package or file name) by silo; at present there are relatively few

silos but as ORA-Data grows this will introduce significant inefficiencies into common

searches.
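
The second barrier can be illustrated with a short sketch of what an application currently has to do for itself: retrieve a whole XML file, then build its own index so that individual entries can be addressed. The XML structure, element names, and identifiers below are invented for illustration; they do not reflect any project's actual schema or the Databank API.

```python
import xml.etree.ElementTree as ET

# Stand-in for a complete XML file retrieved from the repository at file level.
DOWNLOADED_FILE = """
<dictionary>
  <entry id="e1"><headword>abbas</headword><sense>abbot</sense></entry>
  <entry id="e2"><headword>liber</headword><sense>book</sense></entry>
</dictionary>
"""


def build_index(xml_text: str) -> dict:
    """Parse the whole file and map each entry's id to its element."""
    root = ET.fromstring(xml_text.strip())
    return {entry.get("id"): entry for entry in root.iter("entry")}


# The application must hold and index the entire dataset locally
# before it can address a single entry.
index = build_index(DOWNLOADED_FILE)
headword = index["e2"].findtext("headword")
```

The index lives in the application rather than the repository, so it has to be rebuilt whenever the deposited file changes.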

The API in its current state would allow ORA-Data to be used as a back end for simple

applications where the data are more ‘collection-like’, e.g. a set of images with metadata; the

First World War Poetry Digital Archive and Great War Archive produced a report which

addressed the possibility of using ORA-Data (then Databank) in this way13, and concluded that it

would be possible. However, this report also observed that the available search functionality

would be fairly restricted; that Databank was not yet a reliable enough service; and that

adequate support could not be guaranteed.

There is clearly a ‘chicken and egg’ problem here, that is: the limitations of the API restrict

development of interfaces using ORA-Data as a back end; the API could be developed further,

but without use cases, it is not clear what direction that development should take (and resources

for any further development are currently extremely limited).

▶ Recommendations: 6. ORA-Data API

Results

Datasets in ORA-Data

One of the key aims of the project was to acquire and ingest datasets from all the collaborating

projects. This aim has only partially been achieved, with datasets being acquired from the

following projects:

● Ashmolean Cyprus Digitisation Project

● Centre for the Study of the Cantigas de Santa Maria

● Dictionary of Medieval Latin from British Sources

● Early Modern Festival Books

● First World War Poetry Digital Archive / Great War Archive

● Lexicon of Greek Personal Names

● Oxford Roman Economy Project

● Sphakia Survey

Negotiations are still in progress to acquire data from Digital Miscellanies Index and Last

Statues of Antiquity. Only minimal metadata has been gathered for most projects.

Improved digital preservation guidelines

Some improvements have been made to the digital preservation guidelines on the DH@Ox

website14. However, our conversations with researchers strongly suggested that ‘passive’

guidelines like this were rarely consulted, and that the most useful information that a website

could give would be an indication of where to go for direct personalised advice and consultancy.

13 See Appendix 5: Databank and other solutions for archiving, searching and displaying the First World

War Poetry and Great War Archive collections

14 http://digital.humanities.ox.ac.uk/Support/Guidelines.aspx

http://drive.google.com/open?id=0B19js9ggrnsETk9uSHJ4aVFmc1k

Two specific areas where it was acknowledged that passive guidelines could serve a useful

reference purpose were

● accepted or recommended file formats for preservation

● information about licensing

Neither of these is institution-specific, and we would recommend resisting the urge to reinvent

the wheel in either case.

Conclusions

Recommendations

1. Preservation and sustainability

1.1 Ensure that data are discoverable, searchable, viewable, and downloadable while

‘archived’; this requires:

● good metadata

● some form of preview functionality for data in the repository

● ideally, the ability to search within datasets, not just within top-level metadata

1.2 Actively promote Oxford’s digital data as research materials, framing them in a

meaningful research context. We should regard them as assets in a collection, to be curated,

displayed, and exhibited where possible rather than merely catalogued and stored.

1.3 Choose terminology carefully in our communications to emphasise that ‘preserving’ is

ideally about keeping the data alive for ongoing use, not burying it. This message will be more

convincing if concrete examples can be cited which demonstrate active reuse of well-preserved

data.

1.4 While separating ‘data’ from ‘interface’ or ‘application’ may not be possible for existing

research data, future projects should bear this separation in mind and set different levels of

expectation for the longevity of each. Data will last longer than interfaces.

2. Funding gaps for sustainability and preservation

2.1 Maintain full oversight of research data, websites and applications for which

faculties/departments are currently responsible (and what that responsibility entails, e.g.

funding, hosting, maintenance, preservation, right to delete) and what their current status is (e.g.

development, live, maintenance, archive) by means of regular faculty-level audits of digital

research assets



2.2 Develop a clear funding model for the ongoing maintenance of relevant resources which

are judged to be current and useful

2.3 Provide a robust and reliable system of long-term preservation for resources which are

to be archived

2.4 Investigate the feasibility of offering a ‘free at point of service’ preservation facility (by

underwriting at institutional level, top-slicing divisions, or a combination of both); as long as

unofficial maintenance appears cheaper than official preservation, people will choose the former

despite the increased risks of data loss

2.5 Acknowledge that it may not be possible (or practical) to preserve everything. Every new

project should have an ‘end of life’ plan, prioritising what is to be preserved and what is not

3. Reuse and evolution

3.1 Improve the discoverability and searchability of datasets held by the institution by

encouraging the creation of good metadata and supporting the ingest of metadata with

metadata assistants

3.2 Identify, invest in and promote projects which demonstrate effective reuse

3.3 Encourage and promote collaboration with other institutions

3.4 Move towards wider provision of research outputs as linked open data,

enabling more immediate and effective reuse

3.5 Where appropriate and practical, preserve project history and contexts for data capture

as part of the metadata

4. Advice, training, and mentoring

4.1 Invest in high-value consultancy and mentoring

4.2 Extend the pilot Social Sciences Research Data Management (RDM) mentoring scheme

to the Humanities, and where possible facilitate peer-to-peer mentoring rather than relying on

limited central resources

4.3 Build on exemplary training such as the Digital Humanities at Oxford Summer School

(http://digital.humanities.ox.ac.uk/dhoxss/), which combines the teaching of good data

management principles and the development of a strong community of practice with more

hands-on practical training

4.4 Promote technological consensus where appropriate (while recognising that there may

be cases where individual projects need to diverge from this, and considering the knock-on

effect of those decisions) – Oxford’s research support provision can never be ‘one size fits all’

but rather should target a suite of supported and well-understood technologies, chosen to fit

common patterns of research, on which in-house expertise can be focused

4.5 Emphasise that preservation and sustainability need to be understood from the very

beginning of a project and that the technologies selected will have an impact on both; identify

key areas in existing programmes of humanities research training and advice where awareness

could be increased

4.6 Draw on enquiries to IT Services, BDLSS, and the new RDM single point of contact to

develop a ‘knowledge base’ which all digital scholarly support staff can use

4.7 Set up a Digital Scholarly Support office (ideally with a staffed physical location as well

as online resources) to act as a unifying framework for all these initiatives, a knowledge

exchange centre for providers of research data management and preservation support, and a

more visible and approachable ‘front of house’ inviting enquiries from researchers, whether

simple questions or more exploratory, open-ended dialogue about digital research requirements

4.8 Invest in projects that join up and build upon existing technical and social services.

Ensure that the collaboration is fully resourced, funded, and encouraged.

4.9 Ensure all projects have a Data Management Plan (DMP)

5. Repository policy on data and formats

5.1 Establish internal clarity and consensus on repository policy, and communicate this

proactively to potential users

5.2 Handle communications about what data is permitted/invited tactfully, particularly in

terms of a) data which has no other appropriate repository and b) data which is not eligible or

appropriate for ORA-Data.

5.3 Avoid reinventing the wheel when producing guidelines on file formats – build on existing

guidelines from other institutions, always bearing in mind that the methods and formats chosen

for any research project should above all fit the goals of the project



5.4 Investigate and user-test the possibility of categorising data by format and/or availability

of metadata, as a way of signifying how reusable it is in different contexts (by analogy with e.g.

the emerging systems for signalling Open Access status15)

6. ORA-Data API

6.1 Enable persistent URLs for data packages

6.2 Enable the addressing of data at arbitrary levels of granularity, whether directly within

ORA-Data or by introducing an intermediate application layer which can ‘unpack’ data and

handle these requests16

6.3 Improve search options available (restrict search by silo, search within datasets), either

via the API or again via an intermediate layer

6.4 Develop prototype applications and interfaces making use of data in ORA-Data
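
Recommendation 6.1 could be met in several ways. One is an intermediate layer that maps a stable identifier to a package's current location, so that moving a package between silos changes only the mapping and not the address that applications use. The following is a minimal sketch under that assumption: the base URL, identifier scheme, silo names, and URL pattern are all hypothetical and are not the actual ORA-Data URL structure.

```python
BASE_URL = "https://repository.example.org"  # hypothetical host


class Resolver:
    """Maps persistent identifiers to the current (silo, package) location."""

    def __init__(self):
        self._locations = {}

    def register(self, persistent_id: str, silo: str, package: str) -> None:
        self._locations[persistent_id] = (silo, package)

    def move(self, persistent_id: str, new_silo: str) -> None:
        """Record that a package has moved; the identifier itself is unchanged."""
        _, package = self._locations[persistent_id]
        self._locations[persistent_id] = (new_silo, package)

    def resolve(self, persistent_id: str) -> str:
        """Return the package's current silo-based URL (invented pattern)."""
        silo, package = self._locations[persistent_id]
        return f"{BASE_URL}/{silo}/datasets/{package}"


resolver = Resolver()
resolver.register("pid:0001", "silo-a", "example-package")
before = resolver.resolve("pid:0001")

resolver.move("pid:0001", "silo-b")
after = resolver.resolve("pid:0001")
```

An application that stores only `pid:0001` keeps working after the move, which is the property a back end for a web application needs.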

Next steps

1. Information

We propose that a digital humanities data audit be carried out, investigating the extent of

research materials which have been produced and their current preservation status. This will

involve three stages:

● Perspective: survey researchers, research facilitators, and ITSS to assess the extent of data

currently ‘at risk’.

● Priorities: rank data according to its ‘endangeredness’, the effort required to preserve it, and

its value to the research community.

● Projects: identify projects which can share methods, define work packages to preserve

‘at risk’ data, and identify appropriate sources of funding where possible.
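
The 'Priorities' stage could be supported by a simple scoring exercise. The sketch below is one possible approach, not a method proposed in this report: the 1 to 5 scales, the equal default weights, and the names Dataset, priority, and rank are all assumptions made for illustration. Endangeredness and value raise a dataset's priority, while the effort required to preserve it lowers it.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    endangeredness: int  # assumed scale: 1 (safe) to 5 (imminent risk)
    effort: int          # assumed scale: 1 (trivial) to 5 (major work package)
    value: int           # assumed scale: 1 (low) to 5 (high research value)


def priority(d, weights=(1.0, 1.0, 1.0)):
    """Score a dataset: risk and value count for it, effort counts against it."""
    w_risk, w_effort, w_value = weights
    return w_risk * d.endangeredness + w_value * d.value - w_effort * d.effort


def rank(datasets, weights=(1.0, 1.0, 1.0)):
    """Return datasets ordered from highest to lowest preservation priority."""
    return sorted(datasets, key=lambda d: priority(d, weights), reverse=True)
```

In practice the weights would need to be agreed with the researchers and support staff surveyed in the 'Perspective' stage, and the resulting scores treated as a starting point for discussion rather than a final ordering.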

2. Innovation

We recommend more proactive curation and development of research materials to identify

datasets, research questions, applications, and methods which can act as prototypes for active

promotion of the benefits of data reuse to individuals, to the institution, and to the wider

research community. Money should be invested in projects that can directly illustrate the

benefits of preservation of research data, particularly interdisciplinary and cross-institutional

collaborations, and in working towards the goal of making more research outputs available as

open linked data.

[15] http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Open_Access/Signalling_OA-ness

[16] For an outline of one possible way this could work, see Appendix 7: “How to curate an XML resource in ‘live’ and ‘dormant’ modes”.

A set of exemplar projects that highlight reuse and collaboration would make explicit the benefit

of preservation and open data, thereby motivating more investment and involvement in data

preservation and reuse.

3. Investment

We recommend continued investment in user education and in digital preservation

infrastructure, both technical and social, in order to capitalise on the creation of the Digital

Humanities Champion and the DH Network co-ordinator.

User education. A bottom-up approach is essential. We believe that personal ‘digital scholarly

support’ is key to establishing awareness and ‘buy-in’ within the division, not only at the level of

project development, but also in conveying the potential and challenges of DH approaches to

the entire academic community. It is expected that the Digital Humanities Champion will lead

here, but this role alone is not sufficient. By drawing attention to the possibilities of DH, the DH

Champion will emphasise the need for more joined-up advice and support, and identify where

those connections need to be made. We therefore recommend:

● identifying existing successful initiatives and building on these (e.g. developing from the

DH@Ox Summer School to an in-term programme of training in DH methods for Oxford

academics; extending the RDM mentoring scheme, currently being piloted in Social

Sciences, to the Humanities), and

● identifying key areas where awareness of digital methods and RDM issues could be

increased in general humanities research training.

Digital preservation infrastructure. It is vital that we collaborate closely with the process of

investment in digital preservation infrastructure (technology, personnel, and policy framework) to

ensure both that the needs of humanities researchers are being met and that solutions

developed in DH can inform wider development. The proposed data audit (see 1 above) will

give us the information needed to predict more accurately our future infrastructure requirements;

proactive curation and development of DH data (see 2 above) will highlight the positive

contribution to digital preservation which DH research has to offer.
