Rob Walpole, DeveXe Limited, The National Archives

  • The National Archives Digital Records Infrastructure Catalogue

    First Steps to Creating a Semantic Digital Archive

    Rob Walpole, DeveXe Limited, The National Archives

  • Disclaimer: This presentation is in no way intended to express views or opinions of The National Archives and is solely the work of Rob Walpole, an employee of DeveXe Limited, who are currently contracted to assist in the development of the Digital Records Infrastructure at Kew in London. Apart from providing a case study of developing a semantic digital archive, this presentation discusses the opportunities permitted by such development. It should not be assumed that these developments will occur and DeveXe Limited take no responsibility for any perceived inaccuracies.

  • Background

  • Background - The National Archives

    The National Archives (TNA)

      • Over 11 million historical government and public records
      • From the Domesday Book to the Agreement on a Referendum on Independence for Scotland
      • But not births, deaths and marriages; these are held by the General Register Office!

    Photo by Chris Hill

  • Background - The National Archives

    Most of these documents are currently held on paper - or even parchment...

    Photo by Liz West

  • Background - The National Archives

    But soon this will be overtaken by a tsunami of digital files...

    ...including office documents, emails, images, videos and much more.

    Photo by Marco Mazzei

  • Background - Digital Records Infrastructure

    There are many challenges around digital preservation, including:-

      • Format recognition
      • Software preservation
      • Compatibility
      • Degradation of media

    Many of these issues were highlighted by the BBC Domesday Project (1986).

  • Background - Digital Records Infrastructure

    TNA have been at the forefront of meeting this digital preservation challenge:-

      • PRONOM: file format registry
      • DROID: file format identification tool
      • Legislation.gov.uk: all UK legislation on-line
      • UK Government Web Archive: http://www.nationalarchives.gov.uk/webarchive/
      • The London Gazette: published by HMSO (part of TNA)

  • Background - Digital Records Infrastructure

    In 2006 TNA deployed the Digital Repository System (DRS), which provided terabyte-scale long-term storage.

    In 2012 TNA started to build the Digital Records Infrastructure (DRI) on the foundations of DRS to deliver extensible storage to the petabyte scale and beyond.

  • Background - Digital Records Infrastructure

    80,000 digitised Home Guard records from World War 2 were ingested into DRI as a proof of concept...

    ...now many more, including:-

      • LOCOG (2012 Olympic Games)
      • Leveson Inquiry

  • Background - Digital Records Infrastructure

    At its core this massive storage is provided by a robot tape library, with frequently requested and low-resolution copies of data held in a disk cache.

    Photo by Cory Doctorow

  • Background - The DRI Catalogue

    The DRI Catalogue is essentially an inventory of the items held in the archive.

    It is distinct from the TNA Catalogue, which is a comprehensive catalogue system covering both paper and digital documents. Public access to the TNA Catalogue is provided by Discovery.

  • Background - The DRI Catalogue

    Rich XML metadata is stored in the archive itself, alongside the original document, and a copy is sent to Discovery. This comes from a variety of sources:

      • Record provider
      • Archiving process
      • Document transcription
      • Archivists

    And there is a very good reason for using XML...

    ...it's human readable!

  • Requirements

  • Requirements - The DRI Catalogue

    Apart from being an inventory, the DRI Catalogue is needed to help manage:-

      • Closure information
      • Record opening
      • Export lists
      • Export status

  • Requirements - Closure

    Closure can be very fine-grained, e.g. Home Guard records have an open description (individual's name, battalion etc.) but:

      • Service record closed until individual deceased
      • Medical record closed until record is 100 years old
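
    The closure rules in this example can be sketched in a few lines of Python. This is an illustrative sketch only, not the DRI implementation: the `Record` fields, the part names ("description", "service", "medical") and the `is_open` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for one Home Guard record."""
    created: date                  # date the record was created
    date_of_death: Optional[date]  # None if the individual is not known to be deceased


def age_in_years(start: date, today: date) -> int:
    """Whole years elapsed between start and today."""
    before_anniversary = (today.month, today.day) < (start.month, start.day)
    return today.year - start.year - int(before_anniversary)


def is_open(record: Record, part: str, today: date) -> bool:
    """Return True if the named part of the record may be released."""
    if part == "description":
        # Name, battalion etc. are always open.
        return True
    if part == "service":
        # Closed until the individual is deceased.
        return record.date_of_death is not None and record.date_of_death <= today
    if part == "medical":
        # Closed until the record is 100 years old.
        return age_in_years(record.created, today) >= 100
    raise ValueError(f"unknown record part: {part}")


record = Record(created=date(1941, 5, 1), date_of_death=date(1990, 3, 2))
today = date(2013, 1, 1)
status = {part: is_open(record, part, today)
          for part in ("description", "service", "medical")}
print(status)
```

    Because each part of a record carries its own rule, an export of many records has to evaluate these conditions per part and per record, which is the fine-grained behaviour the catalogue needs to support.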

  • Requirements - Record Export

    The export process itself is in the form of a work-flow with many steps. The DRI Catalogue must maintain the status and other information about the export...

  • Requirements - The Problem

    Initially the DRI Catalogue was held in an RDBMS. However, the fine-grained nature of closure meant very slow queries when attempting to export large numbers of records, sometimes taking hours to complete!

    Another approach was needed...!

  • Requirements - Initial Analysis

    Three different proposals were made for modelling the catalogue, so a trial was conducted to establish the best approach. The three models trialled were:-

      • Relational: optimising the existing SQL queries against a modified table structure
      • Graph: running SPARQL queries against an RDF store
      • Hierarchical: running XQuery against an XML database

  • Requirements - Analysis Results

      • Relational: reduced query time from hours to minutes
      • Graph: reduced query time to seconds
      • Hierarchical: approach abandoned

  • Requirements - Analysis Results

    The hierarchical approach was abandoned because:-

      • The graph approach provided a good solution
      • The graph approach offered a path towards Linked Data
      • Cost overheads and deadlines obliged us to move on

    A hierarchical approach may have offered comparable performance and opportunity; we simply don't know...

  • Requirements - Analysis Conclusion

    The issues of closure and export had led to fundamental questions about the nature of the catalogue.

    We don't know exactly what information will need to go into DRI but we know it will be information about people, organisations, their relationships and activities. These things are complex and varied just like the world around us!

    A graph approach not only resolved the issues with closure and export but also provides a powerful and flexible tool for discovering information within the archive.

  • Design

  • Design - Technology

    Technologies used during the trial included:-

      • D2RQ
      • Apache Jena framework (including TDB and Fuseki)
      • Turtle (RDF)
      • SPARQL 1.1 (Query and Update)

    The Jena framework was chosen because:-

      • Excellent Java API
      • Open Source

  • Design - Technology

    UK Government Service Design Manual states...

    ...it remains the policy of the government that, where there is no significant overall cost difference between open and non-open source products that fulfil minimum and essential capabilities, open source will be selected on the basis of its inherent flexibility.

    Use open standards and common Government platforms (e.g. Identity Assurance) where available

  • Design - The Catalogue Services

  • Design - DRI Vocabulary

    W3C recommend re-using vocabularies wherever possible and DRI already does this extensively in the XML metadata.

    But we needed to be able to talk about things very specific to DRI such as Closure and Export.

    So we extended the RDF Schema (RDFS) with a few of our own classes and properties such as:-

    rdf:type rdfs:Class .

    rdf:type rdf:Property .

  • Design - DRI Vocabulary

    This allows us to talk about DRI exports such as:-

    a dri:Export ;
    dri:exportMember ;
    dri:exportMember .
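
    As a rough illustration of what such an export description amounts to, the sketch below builds the same three statements as plain triples and prints them as N-Triples. The IRIs are assumptions: the slides do not give the real `dri:` namespace or any record identifiers, so `http://example.org/...` values are used, and the helper names are hypothetical.

```python
# Hypothetical namespace IRI; the real DRI vocabulary IRI is not given in the slides.
DRI = "http://example.org/dri#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe_export(export_iri, member_iris):
    """Return (subject, predicate, object) triples describing one export."""
    triples = [(export_iri, RDF_TYPE, DRI + "Export")]
    for member in member_iris:
        triples.append((export_iri, DRI + "exportMember", member))
    return triples


def to_ntriples(triples):
    """Serialise IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = describe_export(
    "http://example.org/export/1",
    ["http://example.org/record/a", "http://example.org/record/b"],
)
print(to_ntriples(triples))
```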

  • Design - The Catalogue Services

    The Apache Jena Framework provides a straightforward approach to reading, writing, updating and deleting data using W3C standards...

      • Reading: SPARQL 1.1 Query Language
      • Writing: creating and persisting new RDF triples (e.g. Turtle), SPARQL 1.1 Graph Store Protocol
      • Updating and deleting: SPARQL 1.1 Update Language
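
    Because all three operations are plain HTTP under the W3C standards, they can be sketched without any RDF library. The sketch below only builds the requests and does not send them. The dataset URL is an assumption, and the `query`, `update` and `data` path segments follow Fuseki's usual service names, which a real deployment may configure differently. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical dataset URL, for illustration only.
BASE = "http://localhost:3030/catalogue"


def read_request(sparql_query):
    """Reading: SPARQL 1.1 Query sent directly in a POST body."""
    headers = {"Content-Type": "application/sparql-query",
               "Accept": "application/sparql-results+json"}
    return ("POST", BASE + "/query", headers, sparql_query.encode("utf-8"))


def write_request(turtle, graph_iri):
    """Writing: a Graph Store Protocol POST merges triples into a named graph."""
    url = BASE + "/data?" + urlencode({"graph": graph_iri})
    return ("POST", url, {"Content-Type": "text/turtle"}, turtle.encode("utf-8"))


def update_request(sparql_update):
    """Updating and deleting: SPARQL 1.1 Update sent directly in a POST body."""
    headers = {"Content-Type": "application/sparql-update"}
    return ("POST", BASE + "/update", headers, sparql_update.encode("utf-8"))


for method, url, headers, body in (
    read_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"),
    write_request("<http://example.org/a> a <http://example.org/B> .",
                  "http://example.org/g"),
    update_request("DELETE WHERE { <http://example.org/a> ?p ?o }"),
):
    print(method, url, headers["Content-Type"], len(body))
```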

  • Design - The Catalogue Services

    However...

    ...having to learn SPARQL can be a hurdle to widespread acceptance of this technology!

    The answer...

    ...Elda (Linked Data API implementation) provides RESTful access to pre-configured SPARQL queries:-

    spec:collectionList a apivc:ListEndpoint ;
      apivc:uriTemplate "/collection" ;
      apivc:selector [ apivc:where " ?item a dri:Collection . " ; ] ;
      .

  • Design - Implementation

    So how did we actually do it...?

    1) Create a mapping from RDBMS to vocabulary terms
    2) Export data from RDBMS to N-Quads using D2RQ
    3) Load N-Quads into Jena TDB (embedded version)
    4) Write SPARQL transform (CONSTRUCT) queries to refine RDF
    5) Run queries in Fuseki, download results and reload into clean database instances
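
    To give a feel for the first two steps, the sketch below maps one relational row to N-Quads statements by hand. It is a toy stand-in for what a mapping tool such as D2RQ produces, not the D2RQ mapping language itself; the namespaces, the graph IRI and the `id` and `title` columns are all hypothetical.

```python
# Hypothetical namespaces, graph IRI and column names, for illustration only.
DRI = "http://example.org/dri#"
BASE = "http://example.org/record/"
GRAPH = "http://example.org/graph/catalogue"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape backslashes, quotes and line breaks inside an N-Quads literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def row_to_nquads(row):
    """Map one relational row (a dict) to N-Quads statements."""
    subject = f"<{BASE}{row['id']}>"
    graph = f"<{GRAPH}>"
    title = escape_literal(row["title"])
    return [
        f"{subject} <{RDF_TYPE}> <{DRI}Record> {graph} .",
        f'{subject} <{DRI}title> "{title}" {graph} .',
    ]


rows = [{"id": 1, "title": 'Home Guard "A" Company'}]
for row in rows:
    print("\n".join(row_to_nquads(row)))
```

    Each line carries a fourth term naming the graph, which is what distinguishes N-Quads from N-Triples and lets the loaded data be kept in separate named graphs.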

  • Design - Implementation

  • Design - Catalogue Services API

    RESTful JAX-RS web application providing a very simple API

    e9f3c8e9-e883-4fcf-a9a3-5caf0c808c5d

    Why XML? Why not JSON? The web services are consumed by Java applications. JSON is used in some circumstances, e.g. for a JavaScript tree editor.

  • Design - Insights, Issues and Limitations

    Elda Linked Data API Implementation

  • Design - Insights, Issues and Limitations

    Xturtle

    Shortage of RDF/SPARQL editors and IDEs!

    Xturtle provides a useful syntax highlighting plug-in for Eclipse...

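  • Design - Insights, Issues and Limitations

    Scardf - http://code.google.com/p/scardf/

    // Jena (Java). N, Given and Family are assumed to be the VCARD vocabulary
    // constants (VCARD.N, VCARD.Given, VCARD.Family), statically imported.
    Model model = ModelFactory.createDefaultModel();
    model.createResource( "http://somewhere/JohnSmith" )
         .addProperty( N, model.createResource()
                               .addProperty( Given, "John" )
                               .addProperty( Family, "Smith" ) );

    // Scardf (Scala): the same graph expressed with the Scardf DSL.
    Graph( UriRef( "http://somewhere/JohnSmith" ) -N-> Branch( Given -> "John", Family -> "Smith" ) )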

  • Design - Insights, Issues and Limitations

    Scale and Performance

    Will the DRI Catalogue cope with the tsunami?

    We think it will...

    1) This solution was chosen because of its performance.
    2) We are confident we can scale horizontally. In fact a catalogue for each collection makes some sense. You could then create a catalogue of catalogues to search everything!
    3) If the existing framework fails to scale satisfactorily, the fact that we are using open standards means moving to another framework should be straightforward.

  • The Future

  • The Future

    The story so far:-

      • Remodelling of the DRI Catalogue
      • Solution for Closure and Export

    So what next?

    More metadata into the Catalogue, starting with the rich XML that we already have

  • The Future - Named Entity Recognition

    So what could this mean for members of the public viewing records on Discovery?

  • The Future - Named Entity Recognition

    Records cease to be just text and become machine readable with context and meaning...

  • The Future - Ontology-driven NLP

    Natural Language Processing (NLP) tools can be used in conjunction with RDF to extract meaning...

    From 1 Aug 44 to 20 Oct 44 Bayeux, Rouen and Antwerp. During the period from 1 Aug to date this officer has carried the principal strain of establishing and re-establishing the hospital in three situations. His unrelenting energy, skill and patience have been the mainstay of the unit. His work as a quartermaster is the most outstanding I have met in my service. (A.R.ORAM) Colonel Comdg. No.9 (Br) General Hospital
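
    A very small sketch of the idea, applied to the first sentence of this citation: dates are picked out by pattern and places by lookup in a list of known terms. The `PLACES` set is a toy stand-in for terms that would come from an ontology, and the function name is hypothetical; real NLP tooling does far more than this.

```python
import re

# Toy gazetteer standing in for place names drawn from an ontology.
PLACES = {"Bayeux", "Rouen", "Antwerp"}

# Dates written like "1 Aug 44": day, abbreviated month, two-digit year.
DATE = re.compile(
    r"\b(\d{1,2}) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{2})\b"
)


def extract_entities(text):
    """Return the dates and known places mentioned in the text, in order."""
    dates = [match.group(0) for match in DATE.finditer(text)]
    places = [word for word in re.findall(r"[A-Z][a-z]+", text) if word in PLACES]
    return {"dates": dates, "places": places}


citation = "From 1 Aug 44 to 20 Oct 44 Bayeux, Rouen and Antwerp."
print(extract_entities(citation))
```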

  • The Future - Semantic Search

    Searching for George John Potter in Discovery currently returns 361 hits...

    ...that's 360 irrelevant ones, as there is only one record for a person with that name.

    A semantic search would allow you to search for a person, a soldier or an officer with that name.

    This is known as query string extension.
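
    One possible reading of this, sketched below, is that a search for a "person" is widened to every more specific class in the vocabulary, while entries of unrelated types are excluded. The class hierarchy and catalogue entries are hypothetical, and this is not how Discovery works today.

```python
# Hypothetical class hierarchy: each class maps to its direct subclasses.
SUBCLASSES = {"Person": ["Soldier"], "Soldier": ["Officer"], "Officer": []}

# Hypothetical catalogue entries as (entity type, label).
ENTRIES = [
    ("Officer", "George John Potter"),
    ("Document", "Letter mentioning George, John and Potter"),
    ("Place", "Potter Street"),
]


def expand(cls):
    """Return cls plus all of its direct and indirect subclasses."""
    found = [cls]
    for sub in SUBCLASSES.get(cls, []):
        found.extend(expand(sub))
    return found


def semantic_search(cls, name):
    """Match entries whose type is cls or a subclass and whose label equals name."""
    wanted = set(expand(cls))
    return [label for kind, label in ENTRIES if kind in wanted and label == name]


print(expand("Person"))
print(semantic_search("Person", "George John Potter"))
```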

  • The Future - Semantic Search

    Semantic search also allows you to search for terms closely associated with your matches, known as cross-referencing.

    In this case we would receive information about Colonel A.R. Oram, as he also had an entry in Discovery...

  • The Future - Semantic Search

    Because related concepts are held in a graph it is possible to do exploratory search into a particular area of interest.

    In this case we might discover that Colonel Oram was himself awarded a medal for his work with No.9 British General Hospital...

  • The Future - Semantic Search

    It also becomes possible to do reasoning, whereby rules can be applied creating new statements that are implied rather than explicit. For example we could say Colonel Oram served with Captain Potter...
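
    A minimal sketch of one such rule, assuming hypothetical `servedIn` and `servedWith` properties and illustrative facts: two different people who served in the same unit are inferred to have served with each other. A real system would use an RDF reasoner or rule engine rather than hand-written code.

```python
from itertools import permutations

# Illustrative facts as (subject, predicate, object) triples.
FACTS = {
    ("Colonel Oram", "servedIn", "No.9 British General Hospital"),
    ("Captain Potter", "servedIn", "No.9 British General Hospital"),
}


def infer_served_with(facts):
    """Rule: two different people who served in the same unit served with each other."""
    members = {}
    for person, predicate, unit in facts:
        if predicate == "servedIn":
            members.setdefault(unit, set()).add(person)
    inferred = set()
    for people in members.values():
        # permutations() yields both directions, so the relation is symmetric.
        for first, second in permutations(sorted(people), 2):
            inferred.add((first, "servedWith", second))
    return inferred


for triple in sorted(infer_served_with(FACTS)):
    print(triple)
```

    The inferred statements are not stored anywhere in the original facts; they follow from the rule alone, which is the sense in which they are implied rather than explicit.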

  • The Future - Linked Data

    While TNA is a huge national (and international) source of information it is not an authority on all things.

    Linked Data, the brainchild of WWW inventor Sir Tim Berners-Lee, provides a way of un-siloing and linking datasets using RDF-based machine-readable formats standardised by the W3C.

    TNA data could be linked to Linked Data sources such as:-

      • DBpedia
      • Ordnance Survey
      • British Library
      • Smithsonian

  • The Future - Crowd-sourced linking

    Even the best machine reading will miss key facts and links.

    Digitised documents rely on transcriptions for metadata as OCR still has a long way to go.

    Crowd-sourced linking would allow users to link established vocabulary terms to specific documents.

    Discovery already allows tagging but users tend to create very personal terminology which doesn't necessarily help others...

  • The Future - Open World Assumption

    Using a semantic approach allows for an Open World Assumption. That is to say that it is...

    "implicitly assumed that a knowledge base may always be incomplete" [Hitzler, Krötzsch, Rudolph, Foundations of Semantic Web Technologies]

    This means that TNA can always add new information to the DRI Catalogue as it is discovered without needing to redesign the storage architecture. Exactly what you want for an archive!
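
    The difference can be sketched with a set of triples. Under an open world a missing statement is simply unknown, whereas a closed-world lookup (typical of a relational query) treats it as false. The facts and function names below are illustrative only.

```python
# Illustrative facts; anything not listed is unknown, not false.
FACTS = {("Colonel Oram", "commanded", "No.9 British General Hospital")}


def ask_open_world(facts, triple):
    """Open world: a missing statement is unknown (None) rather than False."""
    return True if triple in facts else None


def ask_closed_world(facts, triple):
    """Closed world: a missing statement is taken to be False."""
    return triple in facts


question = ("Colonel Oram", "awarded", "medal")
print(ask_open_world(FACTS, question))    # unknown before the fact is added
print(ask_closed_world(FACTS, question))  # treated as false

# New information can be added later without redesigning any schema.
FACTS.add(question)
print(ask_open_world(FACTS, question))
```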

  • Thank you