Archive integration with RDF
Archive integration at Mattilsynet
Bouvet Tech Meetup 2014-06-11
Lars Marius Garshol, [email protected], http://twitter.com/larsga
Archive integrations
• A few systems integrated with the archive
  – every integration is expensive and painful
• Need many more integrations
  – to reduce the amount of manual work
  – hesitation because of cost
• Consequences of integrations
  – if the archive is upgraded, must retest all systems
  – archive slows down integrated systems
  – changes to archive structure require rewriting all integrations
[Diagram labels: Archive (Arkiv), Regulations (Regelverk), Line-of-business systems #1 and #2 (Fagsystem), Website (Nettsider), Recruitment (Rekruttering), Quality system (Kvalitetssystemet)]
WebCruiter integration
• Very simple project
  – integrate WebCruiter with ePhorte
• Doing it with RDF because
  – it’s much easier and cheaper
  – want to extend to more integrations later
  – first step toward new architecture
• Good example project
  – because it’s so simple
SESAM principles
• Base everything on RDF and SDShare feeds
  – dynamic flows of structured data
• Extracts from data sources do not map to a common model
  – instead, extract data as they are in the source
  – later translate to representation needed by consumers
  – this way, changes in source or target do not spill over to the other
• No hard bindings from code to data model
  – code should have no knowledge of the data model
  – all data model-specific logic should be configuration
  – makes data changes much easier to handle
RDF?
• W3C standard
  – for interchange of structured data
  – has query language, schema languages, formats, ...
• Essentially a graph database
  – known as a triple store
  – like Neo4j or similar
  – but standardized
  – and with many extra features
• Note that databases are schemaless
  – so this is NoSQL
  – powerful query language with SPARQL
Architecture
[Architecture diagram. Components: WebCruiter WS, XML in files, SDShare, Translation (Oversettelse), RDF, ePhorte adapter, ePhorte, Bus. Connections are labelled HTTP POST, SPARQL Update, and external call.]
Boxes in orange are Sesam components
SDShare
• A protocol for tracking changes in a data source
  – essentially allows clients to keep track of all changes, for replication purposes
  – based on Atom and REST
• Data source can be anything
  – triple store
  – relational database
  – XML files on disk
  – ...
• Data flows as RDF
  – not an absolute must, but it’s how we do things
• A CEN specification
  – http://sdshare.org
Basic workings
• Server publishes fragments representing changes in the data store
• Client pulls these in and updates its local copy of the dataset
[Diagram: fragments flowing from Server to Client]
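The pull-and-update loop described on this slide can be sketched as follows. This is a minimal in-memory model, not the actual protocol: real SDShare serves Atom feeds over HTTP, while here the feed is a plain Python list. The names `apply_fragment` and `sync` and all `ex:` identifiers are hypothetical. The sketch also assumes that a fragment carries the complete current state of one resource, so the client replaces everything it holds about that resource.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def apply_fragment(store: Set[Triple], resource: str, fragment: List[Triple]) -> None:
    """Replace all local statements about `resource` with the fragment's statements."""
    stale = {t for t in store if t[0] == resource}
    store.difference_update(stale)
    store.update(fragment)


def sync(store: Set[Triple], feed: List[Dict], last_seen: str) -> str:
    """Apply every feed entry newer than `last_seen`, oldest first.

    Timestamps are ISO 8601 strings in one fixed format, so plain string
    comparison orders them correctly. Returns the new high-water mark.
    """
    for entry in sorted(feed, key=lambda e: e["updated"]):
        if entry["updated"] > last_seen:
            apply_fragment(store, entry["resource"], entry["triples"])
            last_seen = entry["updated"]
    return last_seen


# Hypothetical data: the local copy is out of date for ex:app1.
local_copy: Set[Triple] = {("ex:app1", "ex:title", "Old title")}
feed = [
    {
        "resource": "ex:app1",
        "updated": "2014-06-10T09:00:00Z",
        "triples": [("ex:app1", "ex:title", "New title")],
    },
]
mark = sync(local_copy, feed, "2014-06-01T00:00:00Z")
```

Keeping only a single high-water mark on the client is what makes repeated polling cheap: a second call to `sync` with the returned mark applies nothing.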
From WebCruiter to triple store
[Diagram: XML adapter (SDShare server) sends fragments to the SDShare client, which writes to the triple store]
On the server:
• XPath queries to map to RDF
On the client:
• Two URLs
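To make "XPath queries to map to RDF" concrete, here is a small sketch using only the Python standard library. It is not the XML adapter from the project. The XML layout, the `http://example.org/...` URIs, and the function name `xml_to_triples` are all assumptions made for illustration; the point is only that the mapping is data (a table of property to path) rather than code.

```python
import xml.etree.ElementTree as ET

# Hypothetical export format; the real WebCruiter XML is not shown in the slides.
SAMPLE_XML = """
<applications>
  <application id="384192">
    <title>Søknad om betalingsutsettelse</title>
    <author>123</author>
  </application>
</applications>
"""

# Configuration: RDF property URI -> path relative to each record element.
MAPPING = {
    "http://example.org/wc/title": "title",
    "http://example.org/wc/author": "author",
}


def xml_to_triples(xml_text, record_path, id_attr, base_uri, mapping):
    """Turn each record element into (subject, property, value) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for record in root.findall(record_path):
        subject = base_uri + record.get(id_attr)
        for prop, path in mapping.items():
            for node in record.findall(path):
                if node.text and node.text.strip():
                    triples.append((subject, prop, node.text.strip()))
    return triples


TRIPLES = xml_to_triples(
    SAMPLE_XML,
    "./application",
    "id",
    "http://example.org/wc/application/",
    MAPPING,
)
```

Note that `xml.etree.ElementTree` supports only a subset of XPath; a full XPath engine would be needed for more complex expressions.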
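Translation of metadata
Application (source):
  Title: Søknad om betalingsutsettelse ("Application for payment deferral")
  Process: 384192
  Author: 123
  Customer: 789
Translator (Oversetter)
Archive (target):
  Tittel (title): Søknad om betalingsutsettelse
  Sak (case): 485283
  Ansvarlig (responsible): 456
  Kontakt (contact): 987
  Doktype (document type): I
  Arkivdel (archive part): 17
[Diagram labels: Application, Archive, ActiveDirectory, with identifiers 123, xyz, 456, 789, 987]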
How the mapping works
• Standard RDF vocabulary
  – mapping between properties
  – traversing properties to add values
  – uses owl:sameAs to map values
• Java implementation
  – called metadata-translator (~500 LOC)
  – uses very simple SDShare push protocol
  – writes translated data to Virtuoso
• Supports multiple mappings
  – configured using graphs so we know which properties and values to translate to
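The two core ideas, mapping between properties and using owl:sameAs to map values, can be sketched in a few lines. This is an illustration in Python, not the Java metadata-translator itself. The `wc:` and `eph:` identifiers and the `MAPS_TO` property are hypothetical; only the owl:sameAs URI is standard. Property traversal and multiple mapping graphs are left out.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
MAPS_TO = "http://example.org/mapping/mapsTo"  # hypothetical mapping property


def translate(data, mapping):
    """Rewrite source triples into the target vocabulary using mapping triples."""
    prop_map = {s: o for s, p, o in mapping if p == MAPS_TO}
    same_as = {s: o for s, p, o in mapping if p == OWL_SAME_AS}
    translated = []
    for s, p, o in data:
        if p not in prop_map:
            continue  # no mapping for this property: the statement is dropped
        # Values with a known equivalent are swapped; others pass through as-is.
        translated.append((s, prop_map[p], same_as.get(o, o)))
    return translated


# Hypothetical mapping, loosely following the example slide.
MAPPING = [
    ("wc:title", MAPS_TO, "eph:Tittel"),
    ("wc:author", MAPS_TO, "eph:Ansvarlig"),
    ("wc:customer", MAPS_TO, "eph:Kontakt"),
    ("wc:user/123", OWL_SAME_AS, "eph:user/456"),
    ("wc:person/789", OWL_SAME_AS, "eph:contact/987"),
]

SOURCE = [
    ("wc:doc1", "wc:title", "Søknad om betalingsutsettelse"),
    ("wc:doc1", "wc:author", "wc:user/123"),
    ("wc:doc1", "wc:customer", "wc:person/789"),
    ("wc:doc1", "wc:unmapped", "ignored"),
]

RESULT = translate(SOURCE, MAPPING)
```

Because the mapping is itself a set of triples, extending the translation means adding statements to the mapping rather than changing the translator.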
What’s to be mapped?
• Department cannot be mapped
  – structure in WebCruiter added manually
• Users cannot be mapped, either
  – no common key
  – solved using Duke
• Department can be defaulted
  – in the cases where we know the user
[Diagram labels: WebCruiter, ePhorte]
Data transfer to translation
• Simply write SPARQL queries to
  – produce fragment feed (based on timestamps)
  – produce a fragment (trivial)
  – produce a snapshot (trivial)
• Then configure SDShare client
  – just requires two URLs
  – translation receives an HTTP POST with the fragment, then does its job
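As a rough idea of what the feed and fragment queries might look like, the sketch below builds two SPARQL query strings in Python. These are not the project's queries. The choice of `dcterms:modified` as the timestamp property and the function names `feed_query` and `fragment_query` are assumptions; the actual data may record change times differently.

```python
from datetime import datetime, timezone

# Assumed timestamp property; the real one depends on the data.
MODIFIED = "http://purl.org/dc/terms/modified"


def feed_query(since: datetime) -> str:
    """SPARQL SELECT listing resources changed after `since`, oldest first."""
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    stamp = since.isoformat()
    return (
        "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
        "SELECT ?resource ?updated WHERE {\n"
        f"  ?resource <{MODIFIED}> ?updated .\n"
        f'  FILTER (?updated > "{stamp}"^^xsd:dateTime)\n'
        "}\n"
        "ORDER BY ?updated"
    )


def fragment_query(resource_uri: str) -> str:
    """SPARQL CONSTRUCT returning every statement about one resource."""
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(c in resource_uri for c in '<>"{}|^`\\ \n\t'):
        raise ValueError("not a valid IRI: " + repr(resource_uri))
    return (
        f"CONSTRUCT {{ <{resource_uri}> ?p ?o }} "
        f"WHERE {{ <{resource_uri}> ?p ?o }}"
    )


FEED_Q = feed_query(datetime(2014, 6, 1, tzinfo=timezone.utc))
FRAGMENT_Q = fragment_query("http://example.org/wc/application/384192")
```

A snapshot query would be the same CONSTRUCT without the fixed subject, which is why the slide calls both of them trivial.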
ePhorte adapter
• Receives RDF
  – introspects the RDF and translates to Java API
  – Java API is stubs calling SOAP services
• Given <foo> rdf:type <.../MyClass>
  – it looks up the Java class “MyClass”, then instantiates it
• Then, given <foo> <.../prop> “value”
  – it looks up method “setProp” on MyClass
  – calls object.setProp(“value”)
• That’s it
  – requires translation to produce RDF exactly aligned with Java API
  – means there’s no code
https://github.com/Mattilsynet/arkivgrensesnitt
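The lookup-by-name approach can be illustrated with a Python analogue. The real adapter is written in Java and uses Java reflection against SOAP stubs; this sketch only mirrors the idea with `getattr`. The class `JournalPost`, the `REGISTRY` dictionary, and the `http://example.org/...` URIs are hypothetical stand-ins.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class JournalPost:
    """Hypothetical stand-in for a generated API stub class."""

    def __init__(self):
        self.title = None

    def setTitle(self, value):
        self.title = value


# Classes the adapter may instantiate, looked up by the URI's local name.
REGISTRY = {"JournalPost": JournalPost}


def local_name(uri):
    """Return the part of a URI after the last '/' or '#'."""
    return uri.rstrip("/#").rsplit("/", 1)[-1].rsplit("#", 1)[-1]


def build_objects(triples, registry):
    """Instantiate one object per typed subject, then call a setter per property."""
    objects = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            objects[s] = registry[local_name(o)]()  # KeyError if class unknown
    for s, p, o in triples:
        if p == RDF_TYPE or s not in objects:
            continue
        name = local_name(p)
        setter = getattr(objects[s], "set" + name[0].upper() + name[1:])
        setter(o)  # AttributeError above if the RDF does not match the API
    return objects


TRIPLES = [
    ("http://example.org/doc/1", RDF_TYPE, "http://example.org/ephorte/JournalPost"),
    ("http://example.org/doc/1", "http://example.org/ephorte/title", "Søknad om betalingsutsettelse"),
]
OBJECTS = build_objects(TRIPLES, REGISTRY)
```

The failure modes show why the slide insists on exact alignment: an unknown class or property name is an error, because the adapter has no model-specific logic to fall back on.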
Configuration
[Architecture diagram repeated, annotated with the configuration used at each step: XPath mapping, RDF mapping, SQL queries, SPARQL queries]
Look, ma, no code! (Well, not much code!)
Properties
• Adding more object types or properties is simple
  – we just extend the mapping (and maybe queries)
• Data quality improves with more data
  – if we don’t have the data to translate employees, that information gets lost
  – if the necessary mapping is added later, translation improves automagically
• Adding more systems is very easy
  – requires more SDShare feeds plus mappings
The public journal problem
[Diagram: Internet, DMZ, and Secure zone. Components: Journal app, Oracle, ePhorte.]
The public journal solution
[Diagram: Internet, DMZ, and Secure zone. Components: Journal app, Oracle, ePhorte, RDF. Connections are labelled SDShare and filtered SDShare.]
Conclusion
• Relatively small project, not that many hours
  – includes writing reusable ephorte-adapter
  – parts of writing the metadata translator, too
  – also the XML adapter
  – system documentation
  – automated deploy system based on Jenkins
• Flexible, simple solution
  – most of it reusable
  – actually captures, as a side-effect, information not available in any other system
Questions?