Making data FAIR using InterMine


Justin Clark-Casey (@justincc), Software Engineer, InterMine

Making data FAIR using InterMine

1. What is FAIR?

2. InterMine and FAIR

What is FAIR?

● A set of data reuse principles first formally published in Scientific Data in 2016

− “Comment: The FAIR Guiding Principles for scientific data management and stewardship”

− By Mark Wilkinson, Michel Dumontier, Carole Goble, Barend Mons and 49 others

● Generated by academia, funding agencies, industry and scholarly publishers

● Spurred on by trends such as big data, machine learning, reproducibility, Linked Data and an increasingly diverse and distributed data ecosystem

● Aims to clarify what good data management means, which has been largely undefined

● Principles apply to data, algorithms, tools and workflows – anything necessary for humans and machines to work with scientific outputs

What are the FAIR principles?

4 Guiding Principles

● Findability

● Accessibility

● Interoperability

● Reusability

Practical entailments

● Expounded in 15 statements, 3 to 4 per principle

● Subject to some interpretation, not explained further in the paper!

● Not all apply to InterMine

● Can be adhered to in any combination and incrementally (a continuum)

InterMine and FAIR

● InterMine already has elements of the FAIR principles

● For example, data in InterMine is

− Findable via search and templates

− Accessible through reports, links, PathQuery and webservices

− Interoperable by cross-references to other mines and sources

− Reusable by giving license information for sources

● But FAIR has crystallized areas where we can improve and get ahead

InterMine and FAIR

● So in 2016, Gos decided to apply for a grant with the BBSRC in the UK to bring InterMine to the vanguard of FAIR

● Formulated a compendium of improvements to meet FAIR principles

● Grant funds 2 developers (Daniela and myself) for 3 years

● Features released on an ongoing basis as part of mainline InterMine

● Time to engage with the community to promote FAIR InterMine

InterMine and FAIR

● In the rest of this talk:

● I will go through the FAIR InterMine improvements planned for the next 3 years and show how they relate to the FAIR principles

● Discussion of these plans during this talk is extremely welcome – please interrupt me at any time, now and in the future

● Nothing is set in stone

● Details will change as we understand the problems better

● Can adapt and pursue emerging opportunities

● But first, we’ll go through the 15 practical FAIR entailments

Findability

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

Accessibility

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

Interoperability

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

Reusability

R1. (meta)data are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards

Elements of FAIR InterMine

[Diagram: InterMine and the planned FAIR elements]

● Stable URIs

● Register URIs externally

● Ontologies in data model

● Embed metadata in webpages

● Add metadata to query results

● Objects available in XML/JSON

● Generate RDF

● Generate bulk RDF

● Objects available in RDF

● SPARQL

● Better linking

● Better licensing metadata

F1 (meta)data are assigned a globally unique and persistent identifier

● At the moment, InterMine objects each have their own URI

− e.g. http://www.humanmine.org/humanmine/report.do?id=20637123 → PPARG_HUMAN protein

● But this incorporates an ID which is not stable over data reloads

● Share option provides a more persistent URI, but it is not perfect

● Current idea is to construct an ID incorporating the class names and IDs that form a primary key for any object

− e.g. http://www.humanmine.org/taxon:9606/protein:PPARG_HUMAN

− (note: this would not be enough for PhytoMine, which has multiple organisms on one taxon)

● Looking to make this the primary URI instead of the one embedding the InterMine ID

● Good enough? What if/when gene IDs change?

Stable URIs

http://www.humanmine.org/humanmine/report.do?id=20637123

http://www.humanmine.org/taxon:9606/protein:PPARG_HUMAN
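
The stable URI idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `stable_uri` and the choice of lower-cased class names joined by `:` and `/` are taken from the example URI above, not from any existing InterMine API.

```python
from urllib.parse import quote


def stable_uri(base_url, key_parts):
    """Build a URI from (class name, identifier) pairs that form a primary key.

    Hypothetical helper: the layout mirrors the example on the slide and is
    not an existing InterMine function.
    """
    segments = []
    for class_name, identifier in key_parts:
        # Percent-encode the identifier so it is safe as a single path segment.
        segments.append(f"{class_name.lower()}:{quote(str(identifier), safe='')}")
    return base_url.rstrip("/") + "/" + "/".join(segments)


uri = stable_uri(
    "http://www.humanmine.org",
    [("Taxon", 9606), ("Protein", "PPARG_HUMAN")],
)
print(uri)  # http://www.humanmine.org/taxon:9606/protein:PPARG_HUMAN
```

Because the URI depends only on the primary key values, it survives a data reload as long as those values do not change, which is exactly the open question raised above for gene IDs.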

● InterMine has internal query capability (search, PathQuery) but limited visibility in external systems without manual action

● Two levels

− At whole mine level, provide 1-click ability to register installations with external coarse-grained repositories such as BioSharing, Elixir bio.tools

● Not so important for established mines that have done this manually

− Provide facilities to register top-level mine objects with fine-grained registries such as identifiers.org

● For example, stable HumanMine gene URIs could be registered in their own namespace or as an alternative information source for IDs in an existing collection (e.g. NCBI taxon namespace)

● Questions of control, avoiding spamming identifiers.org, etc.

Register URIs

externally

● Currently, InterMine:

− is based on the Sequence Ontology

− loads ontologies for integration with other data sources (e.g. GO annotations)

− but does not provide a way to attach terms to object properties or non-SO classes

● Need to address this as ontologies really drive data integration, FAIR or otherwise

● So will extend data model mechanisms to allow ontology terms to be attached to object properties and non-SO classes

● Will also annotate core model with suitable existing ontologies

− e.g. possibly Dublin Core for Publications

− WILL NOT create ontologies ourselves

Ontologies in data model

● There is a general problem with finding datasets related to a particular bioentity (gene, protein, etc.) that are spread out over many different databases (especially the long tail)

● A European initiative called Bioschemas (http://bioschemas.org) is looking to address this

− Of which InterMine is an active part

● Idea is to embed extremely basic metadata in webpages (e.g. with JSON-LD)

− So that general search engines (Google, Bing, etc.) and specialized indexers can make biological data on a particular subject more findable

− Start with basics (DataSet), work up to Gene, Protein, etc.

● InterMine will drive this out of the data model (e.g. create Bioschemas DataSet JSON-LD from InterMine DataSource and DataSet entries).

Embed metadata in webpages
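
As a rough illustration of embedding dataset metadata in a webpage, the sketch below builds a schema.org `Dataset` description as JSON-LD and wraps it in a script tag. The assumptions are stated loudly: the function names, the choice of properties, and the example values are hypothetical, and the actual Bioschemas DataSet profile may require different or additional properties.

```python
import json


def dataset_jsonld(name, description, url, source_name):
    """Describe one dataset using schema.org's Dataset type.

    Hypothetical mapping: 'name', 'description' and 'url' stand in for fields
    of an InterMine DataSet entry, 'source_name' for its DataSource.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "creator": {"@type": "Organization", "name": source_name},
    }


def as_script_tag(doc):
    """Wrap JSON-LD in the script element that indexers look for in a page."""
    # Escape "</" so the JSON payload cannot close the script element early.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Example values are placeholders, not real InterMine data.
doc = dataset_jsonld(
    "Example protein data set",
    "Placeholder description of the data set.",
    "http://example.org/dataset/1",
    "Example data source",
)
print(as_script_tag(doc))
```

Driving this from the data model means the same function could run over every DataSet entry in a mine, so operators would not have to write the metadata by hand.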

● InterMine returns query and template results as tables, as either CSV, XML, JSON or as rows of results in client APIs

● Metadata provided varies by format – CSV is just the data whilst JSON also provides column headers

● We will look to

− Make the provision of existing metadata (column headers) consistent across output formats

− Add more metadata, such as ontology term URIs presented in the data model for objects and their fields

● This is important for making data reusable by other systems

● May require changes/extension to webservices/client APIs

Add metadata to query results
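
One way to picture consistent metadata across output formats is to render every format from the same pair of column headers and rows. The sketch below is an assumption-laden illustration, not InterMine's webservice code: the function names and the JSON key names are hypothetical.

```python
import csv
import io
import json


def to_json(columns, rows):
    """Serialize results with their column headers (key names are hypothetical)."""
    return json.dumps({"columnHeaders": columns, "results": rows})


def to_csv(columns, rows):
    """Serialize the same results as CSV, with the headers as the first line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)  # same headers as in the JSON output
    writer.writerows(rows)
    return buffer.getvalue()


columns = ["Gene.symbol", "Gene.length"]
rows = [["PPARG", 4859]]
print(to_json(columns, rows))
print(to_csv(columns, rows))
```

Further metadata, such as an ontology term URI per column, could travel the same way: as an extra key in the JSON output and as an additional header line in CSV.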

● Information on InterMine entities is currently only retrievable in a selective manner, i.e. by naming specific view columns

● However, this does not mesh well with the idea of machine findability – instead you have to know which fields you want and write a query to retrieve them

● So will make it possible to go to a bioentity’s stable URI and, with HTTP content negotiation or a URL fragment, retrieve all the data (and metadata) for that object in XML or JSON (or RDF...)

− The machine equivalent of a currently human-only report page

● One challenge is determining how much data gets included

− Sections of a report page can have thousands of results and be expensive to generate (for the human report page this is controlled by paging)

● Links to other bioentities will be embedded as their stable URIs

Objects available in XML/JSON
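
HTTP content negotiation means the client states the formats it accepts in the `Accept` header and the server picks one. The following is a minimal sketch of that choice, assuming a hypothetical `pick_format` helper and a hypothetical table of supported media types; it ignores wildcards such as `*/*`, which simply fall back to the default.

```python
# Hypothetical mapping from media type to the representation a mine could serve.
SUPPORTED = {
    "application/json": "json",
    "application/xml": "xml",
    "text/turtle": "rdf",
    "text/html": "html",
}


def pick_format(accept_header, default="html"):
    """Choose a representation from an HTTP Accept header value."""
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0  # a media type without a q parameter has quality 1
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat an unparsable q value as "not acceptable"
        if media_type in SUPPORTED and quality > 0:
            # Sort key: highest quality first, then earliest position in the header.
            candidates.append((-quality, position, SUPPORTED[media_type]))
    if not candidates:
        return default
    return min(candidates)[2]


print(pick_format("application/json"))                  # json
print(pick_format("text/html;q=0.5, application/xml"))  # xml
print(pick_format("*/*"))                               # html
```

With this in place, a browser visiting a stable URI would still get the human report page, while a script sending `Accept: application/json` would get the machine equivalent.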

● A big stream within FAIR (but not an overbearing part) is that of Linked Data and the Semantic Web

● RDF has a simple uniform data model that describes everything in terms of SUBJECT, PREDICATE, OBJECT triples, commonly referencing URIs

− e.g. (http://humanmine.org/Gene:4835, http://so.org/length, “4859”)

● In principle, where different data sources share ontologies and URIs to common objects, RDF triples from each source can be more automatically integrated, better meeting FAIR’s interoperability principle than InterMine-specific formats.

● Maxime Déraspe from Michel Dumontier’s lab has already done work to ‘RDFize’ 6 model organism InterMines in MOLD.

● We will productize this work and provide RDF as another download format.

Generate RDF
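
To make the triple model concrete, here is a small sketch that writes an object's fields as N-Triples, one of the standard RDF text formats. It is not the MOLD code: the function names are hypothetical, the URIs are the illustrative ones from the slide, and every value is written as a plain string literal.

```python
def literal(value):
    """Render a value as an N-Triples string literal with basic escaping."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(subject_uri, fields, predicate_uris):
    """Emit one SUBJECT PREDICATE OBJECT line per field that has a predicate URI."""
    lines = []
    for field, value in fields.items():
        predicate = predicate_uris.get(field)
        if predicate is None:
            continue  # no ontology term known for this field, so skip it
        lines.append(f"<{subject_uri}> <{predicate}> {literal(value)} .")
    return "\n".join(lines)


print(
    to_ntriples(
        "http://humanmine.org/Gene:4835",
        {"length": 4859},
        {"length": "http://so.org/length"},
    )
)
# <http://humanmine.org/Gene:4835> <http://so.org/length> "4859" .
```

The `predicate_uris` lookup is where ontology annotations in the data model would come in: a field can only become a useful triple if it has an agreed term to act as its predicate.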

● Another way for someone to integrate RDF is to download the entire graph of triples for an InterMine installation, or some subset thereof

● They can put this into a triplestore (a specialized database for RDF) with other data and run queries on their local systems rather than relying on federated query (of which more shortly)

● So we will provide a mechanism to generate this bulk RDF that InterMine operators can publish if they choose

● As a separately generated bulk download, this has no impact on runtime InterMine performance.

Generate bulk RDF

● In addition to making entire bioentities available as XML and JSON, we will also make them available as RDF

● This is part of the Linked Data idea, where RDF retrieved from one system will contain stable URIs to objects in other systems which can be navigated to retrieve their RDF (and so on)

Objects available in RDF

● Another way of retrieving RDF is to submit SPARQL queries to a data source

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix flymine-res: <http://flymine.intermine.org/resource/>

SELECT ?gene ?label WHERE {
  ?gene rdf:type flymine-res:flymine_Gene .
  ?gene rdfs:label ?label .
}
LIMIT 20

SPARQL
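
Submitting such a query follows the standard SPARQL protocol: the query text is sent as a `query` parameter over HTTP. The sketch below only builds the request and does not send it; the endpoint URL is a placeholder, since no InterMine SPARQL endpoint address is given here, and `build_sparql_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

QUERY = """
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix flymine-res: <http://flymine.intermine.org/resource/>

SELECT ?gene ?label WHERE {
  ?gene rdf:type flymine-res:flymine_Gene .
  ?gene rdfs:label ?label .
}
LIMIT 20
"""


def build_sparql_request(endpoint, query):
    """Build an HTTP GET request carrying the query, asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Placeholder endpoint: replace with the address of a real SPARQL service.
request = build_sparql_request("http://example.org/sparql", QUERY)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. Because the protocol is standardized, the same few lines work against any SPARQL endpoint, which is the interoperability gain over a system-specific query API.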

● This is much like submitting PathQuery to InterMine with the following differences

− SPARQL is a standardized query language, so more tools and information

● And so better meets FAIR’s interoperability principle.

− SPARQL provides ways to involve data model ontology annotations in queries

● Doing this in PathQuery may require language extensions, needs discussion/thought

− SPARQL is generally much more expressive than PathQuery

● So more is possible

● But more complex to understand and there are performance concerns

SPARQL

● Because of performance concerns, plan to provide SPARQL support as a separate Docker image

− Building on MOLD work

− Explore techniques for better SPARQL reliability

● Such as Linked Data Fragments

● Optimization approach on client-side

SPARQL

● As mentioned previously, linking between data sources is important in FAIR, since it improves Findability (via following links) and Interoperability (by showing connections between different data sources).

● InterMine already provides some cross-referencing by publishing this information where it is provided by an input data source (chiefly as cross-references to other databases).

● However, we don’t save enough information on the input data source itself to construct links back to that data source (as you know, the right-hand sidebar external links are specified separately outside of the InterMine data model).

● We will address this and provide a way to link bioentities to third-party data sources as well, such as URIs in the Bio2RDF project.

● We will get these links from fine-grained URI registries such as identifiers.org

Better linking

● Currently, for most mines information on the licensing of the integrated data sources is generated manually and published in the data sources tab.

● As per FAIR’s reusability principle, we want to collect and publish this information more systematically, and in a consistent machine-readable format.

● Work in this area is young: there are pilot and emerging specifications (e.g. the Creative Commons Rights Expression Language and bioCADDIE’s Machine Actionable Licenses) but nothing generally deployed as far as we know.

● But reviewers highlighted this as a major area of concern, with suggestions such as a traffic light signal on report pages showing whether all the integrated data was suitable for reuse.

● So we will look to pursue this objective vigorously.

Better licensing metadata
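
The traffic light suggestion could work roughly as sketched below, once license information is machine-readable. Everything here is an assumption for illustration: the function name, the three-colour rule, and in particular which license identifiers count as reusable or restricted, which would be a policy decision for each mine operator.

```python
def reuse_signal(source_licenses, reusable, restricted):
    """Return 'green', 'amber' or 'red' for data integrated from several sources.

    Hypothetical rule: red if any source is restricted, amber if any license
    is missing or unrecognized, green only if every source is known reusable.
    """
    signal = "green"
    for license_id in source_licenses.values():
        if license_id in restricted:
            return "red"  # at least one source forbids reuse
        if license_id not in reusable:
            signal = "amber"  # cannot confirm that reuse is allowed
    return signal


# Example sets chosen by a hypothetical operator; the source names are placeholders.
REUSABLE = {"CC0-1.0", "CC-BY-4.0"}
RESTRICTED = {"proprietary"}
sources = {"Source A": "CC-BY-4.0", "Source B": None}
print(reuse_signal(sources, REUSABLE, RESTRICTED))  # amber
```

The sketch shows why consistent collection matters: a single source with missing license information is enough to stop a report page from showing green.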

Providing FAIR capability in InterMine

Advantages (many of these are classic InterMine advantages)

● Much of this comes ‘for free’ for existing InterMine operators on update

● Allows operators to say FAIR if asked

● Tailoring can be done through configuration

● Driven by the model which already lies at the heart of InterMine

● Contributions to FAIR facilities in InterMine from others curated by Cambridge and shared by all

Thank you

Questions?

Ongoing discussion very welcome

@justincc
