Post on 24-May-2018
Validator – Documentation 28. Apr. 2014
0. Validator Tutorial: Introduction - Overview of the Architecture
The PSI Validator is a framework that allows one to validate data against a set of rules. These rules
can define how controlled vocabularies and ontologies are used, but they can also be arbitrary rules that are
defined and implemented by the developer of a specific instance of a validator.
A. Bird's Eye View
B. Technologies and Requirements
The validator framework was written in Java and uses Maven 2 as build system. The configuration of
the framework is mostly done using XML files. Should you wish to write your own validator, the
following requirements apply:
Java 5 or higher
Maven 2 or higher (if you wish to take advantage of existing infrastructure)
A data model written in Java (this is the data you are going to validate). You can also use
our sample data model to try the validator out.
C. Validator's Components
The validator is built in a component-oriented manner. Here is a short description of the major components:
Controlled vocabularies and ontologies access: this module gives unified access to
controlled vocabularies and ontologies (whether they are available locally or remotely) via a
simple API.
The Controlled Vocabulary Mapping Rules define how controlled vocabularies and
ontologies are used in a specific data model. By means of XPath-like expressions, one can define
which ontology terms are allowed in a specific location of a data model.
The User Defined Rules are defined and implemented by the validator's developer when mapping
rules cannot perform the desired validation. These rules have access to the controlled
vocabularies and ontologies, and their complexity can potentially be much higher as you are
coding them yourself.
Below is a simple comparison of the two kinds of rules a validator can be built upon:
D. Flow of a Validation
Once you have a data model and a validator consisting of a set of rules, you can run your first
validation. Here we define step by step how this is done:
1. The data model is submitted to the validator. Usually one would submit specific objects to
validate rather than the whole data model at once. However, every data model is different and the
granularity of its objects varies accordingly; consequently, you have
to define for yourself what your unit of work is going to be (e.g. a car in a car factory, a
molecular interaction in a proteomics experiment, ...).
2. If any CV mapping rules have been defined, the validator runs an internal validation
on them and potentially removes all those that are not valid. The model provided is then checked against
the remaining rules. Should a validation exception occur, messages are returned; each message
includes a description of the issue and a level of severity (one of DEBUG, INFO, WARN, ERROR,
FATAL).
3. If any user defined rules have been defined, the validator runs each of them on the data
model and, here again, messages can be generated upon exceptions.
4. All messages are returned to the user, who is then free to process them.
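The four steps above can be pictured with a minimal Java sketch. All names in it (FlowLevel, FlowMessage, FlowRule, FlowSketch) are hypothetical stand-ins invented for illustration and are not the framework's classes; only the five severity levels come from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not the framework's classes.
enum FlowLevel { DEBUG, INFO, WARN, ERROR, FATAL }

final class FlowMessage {
    final String text;
    final FlowLevel level;

    FlowMessage(String text, FlowLevel level) {
        this.text = text;
        this.level = level;
    }
}

// A rule inspects one unit of work and returns zero or more messages.
interface FlowRule {
    List<FlowMessage> check(Object unitOfWork);
}

final class FlowSketch {
    // Steps 2 and 3: run the CV mapping rules first, then the user defined rules.
    // Step 4: hand every collected message back to the caller.
    static List<FlowMessage> validate(Object unitOfWork,
                                      List<FlowRule> cvMappingRules,
                                      List<FlowRule> userDefinedRules) {
        List<FlowMessage> messages = new ArrayList<FlowMessage>();
        for (FlowRule rule : cvMappingRules) {
            messages.addAll(rule.check(unitOfWork));
        }
        for (FlowRule rule : userDefinedRules) {
            messages.addAll(rule.check(unitOfWork));
        }
        return messages;
    }

    // One way a caller might process the result: keep messages at or above a threshold.
    static List<FlowMessage> atLeast(List<FlowMessage> messages, FlowLevel threshold) {
        List<FlowMessage> kept = new ArrayList<FlowMessage>();
        for (FlowMessage m : messages) {
            if (m.level.compareTo(threshold) >= 0) {
                kept.add(m);
            }
        }
        return kept;
    }
}
```

The sketch leaves out the internal validation of the CV mapping rules themselves (the first half of step 2), since how the framework decides that a mapping rule is invalid is not described at this point.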
This tutorial will now take you through the steps required to build a Validator and is organised as
follows:
1. How to write your own validator
2. Getting access to the needed Ontologies and Controlled vocabularies
3. Building rules to map ontologies and controlled vocabulary terms to your domain model
4. Building your own rules
5. Wiring it together: build your validator and run it on sample data
6. Download Validator's tutorial source code
E. Contact
Should you have any further questions about the Validator, please send an email to skerrien [at] ebi
[dot] ac [dot] uk
1. Validator Tutorial: How to Write Your Own Validator
In this section, we are going to give more information about what you should consider if you are planning to
write your own validator.
a. Requirements
Java 5 or above (http://www.oracle.com/technetwork/java/index.html)
Maven 2 or above (http://maven.apache.org/). This is not a mandatory requirement per se, but as
we have developed the framework using it, there are many advantages to be gained. Should you
choose not to use it, please be aware that we have made available a version of the validator
framework on SourceForge that contains all necessary dependencies.
A Java IDE to ease the development (in this tutorial I will mostly refer to IntelliJ 13.x --
http://www.jetbrains.com/idea)
b. Defining your needs
Here are a few questions you could ask yourself before going any further:
What is to be checked?
Which part of my data model?
Am I using ontologies and controlled vocabularies?
Which ontologies and controlled vocabularies is my model using?
Are these ontologies available in OBO format?
Are these available in the Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup)? On the
Internet? On your local computer?
Is there anything else you need to check?
How would I proceed to validate it, and how can I implement it?
In the following sections we are going to define more precisely how to use the various components of
the validator.
2. Validator Tutorial: Getting access to the needed Ontologies and Controlled vocabularies
In this section, we are going to see in more detail how one can deal with ontologies and controlled
vocabularies in the validator framework. To start with, you have to define which ontologies will be
required to validate your data model. It could be any one available in the Ontology Lookup Service
(http://www.ebi.ac.uk/ontology-lookup/ontologyList.do), or any data in OBO format available on the
network or locally.
A. Configuration file
- Representation of the schema, as well as the location of the XSD
XSD available at: http://www.psidev.info/files/validator/CvSourceList.xsd
1. Description of the attributes
source - Physical source of the CV file or term information. The keywords 'OLS' or 'file' should be used
in this attribute, coupled with the appropriate URI. A fully qualified class name is also allowed when
the class implements the ontology loader interface (i.e.
psidev.psi.tools.ontology_manager.interfaces.OntologyAccess) and has a public default constructor.
name - Name of the CV as in the PSI CV resource.
identifier - Internal identifier for the CV source to be cross-referenced in the CvTerm instances.
uri - Universal identifier of the CV resource.
format - Describes the CvFormat. Consistently use the upper-case acronym of the CV
language, e.g. 'OBO' or 'OWL', or the 'plain text' keyword when applicable.
version - Version of the OBO format used.
2. Sample file
You can download this sample file here: ontologies.xml
B. Different types of access
The framework currently allows several ways to access a controlled vocabulary or ontology resource.
We are going to describe below some of the facilities provided:
source={OLS, FILE, user-defined-class}
1. File
This is essentially any OBO file that can be found via a URL (http, ftp, file...) or in the classpath of the
running application.
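To make the difference between URL-based and classpath-based access concrete, here is a minimal sketch of how a loader could open such a source. The class OboSourceOpener is a hypothetical helper written for this illustration and is not part of the validator framework; it only assumes the classpath: prefix convention described in this section.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

// Hypothetical helper for illustration; not part of the validator framework.
final class OboSourceOpener {
    private static final String CLASSPATH_PREFIX = "classpath:";

    static InputStream open(String uri) throws IOException {
        if (uri.startsWith(CLASSPATH_PREFIX)) {
            String resource = uri.substring(CLASSPATH_PREFIX.length());
            // Class loader resource names carry no leading slash.
            if (resource.startsWith("/")) {
                resource = resource.substring(1);
            }
            InputStream in = OboSourceOpener.class.getClassLoader().getResourceAsStream(resource);
            if (in == null) {
                throw new FileNotFoundException("Not on classpath: " + resource);
            }
            return in;
        }
        // file:, http:, ftp: and similar schemes are handled by the JDK's URL stream handlers.
        return URI.create(uri).toURL().openStream();
    }
}
```

With this helper, a local file would be opened through its file: URI, a remote one through its http: URI, and a bundled one through a URI starting with classpath:.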
a. Using a local file
A local file can be accessed by defining a URL that uses the file protocol, here is an example:
b. Using a URL
Here is an example of access using the HTTP protocol:
c. Using your classpath
If you have made available an OBO file in your classpath, you can access it by prefixing the URI with
classpath:, here is an example
2. Ontology Lookup Service
As of May 23rd 2008, OLS has integrated 61 ontologies and 720,114 terms, among which one can
access GO, PSI-MI, PSI-MS, PSI-MOD... The Ontology Manager module is provided with an
implementation that uses OLS to access ontologies and controlled vocabularies.
Please note that when using OLS, the URI of the source is not mandatory as OLS relies on the
source's identifier to access the data. A complete list of all supported identifiers can be found on
the OLS web site.
3. Writing your own implementation of OntologyAccess
Currently, only the OBO format is supported. Should one of the ontologies or controlled vocabularies you
use not be supported, you can extend the functionality of the Ontology Manager.
You can write your own class that implements
psidev.psi.tools.ontology_manager.interfaces.OntologyAccess.
Now let's say you have implemented an OWL access in the following class:
com.company.ontology.OwlAccess
you can then declare a new CvSource using it as follows:
Obviously, the compiled class OwlAccess would have to be in the classpath when running the validator.
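The requirement for a public default constructor suggests that the class is created by name at run time. The following sketch shows that mechanism with plain Java reflection. OntologyAccessSketch is a hypothetical stand-in for psidev.psi.tools.ontology_manager.interfaces.OntologyAccess (whose real methods are not listed in this tutorial), and OwlAccessSketch plays the role of com.company.ontology.OwlAccess; neither reflects the framework's actual code.

```java
// Stand-in for the real OntologyAccess interface; its single method is hypothetical.
interface OntologyAccessSketch {
    boolean isKnownTerm(String accession);
}

// Plays the role of com.company.ontology.OwlAccess.
class OwlAccessSketch implements OntologyAccessSketch {
    public OwlAccessSketch() {
        // A public no-argument constructor is what allows creation by class name.
    }

    public boolean isKnownTerm(String accession) {
        return accession != null && accession.startsWith("OWL:"); // toy logic
    }
}

final class SourceLoaderSketch {
    // Creates the loader named in the 'source' attribute and checks that it has the right type.
    static OntologyAccessSketch instantiate(String className) throws ReflectiveOperationException {
        Class<?> type = Class.forName(className);
        if (!OntologyAccessSketch.class.isAssignableFrom(type)) {
            throw new IllegalArgumentException(className + " does not implement the access interface");
        }
        return (OntologyAccessSketch) type.getDeclaredConstructor().newInstance();
    }
}
```

If the class is missing from the classpath, Class.forName throws a ClassNotFoundException, which is why the compiled class has to be available when the validator runs.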
3. Validator Tutorial: How to Build CV Mapping Rules?
In this section we are going to see how one can simply define a direct mapping between a data model
and a set of ontologies/controlled vocabularies.
A. Defining how the model is supposed to relate to the ontologies
This is a crucial step in the design of your mapping rules as you are going to define which part of the
data model is going to map to which specific part of the ontologies or controlled vocabularies.
B. Formalizing this binding in a configuration file
b1. Format of the configuration file
XSD available here.
Definition of the attributes of each element:
CvMapping
modelName - Name of the PSI data exchange schema, e.g. mzML, GelML, MIF.
modelURI - URI of the data exchange schema.
modelVersion - Version number of the model supported by the CvMapping file.
CvReference
cvIdentifier - Short label for the CV or namespace, this should correspond to a cvIdentifier
attribute of CvTerm in the CvSourceList configuration file.
cvName - Full descriptive name for the CV.
CvMappingRule
id - Unique identifier for this rule in the scope of the current configuration file. Identifiers are
alphanumerical.
name - A short name for this rule. This may be used for error reporting.
scopePath - Element scope in the schema within which the non-repeatable (isRepeatable = FALSE)
condition applies.
cvElementPath - The full XPath expression that defines the part of the data model we are
mapping.
cvTermsCombinationLogic - Boolean operator describing the combination logic of multiple
CvTerm elements associated with the same CvMappingRule.
requirementLevel - The requirement level indicates, when the XML element exists in the instance
data file, whether the association with CV terms is optional (MAY), recommended (SHOULD) or
mandatory (MUST).
CvTerm
cvIdentifierRef - Internal reference (e.g. namespace abbreviation) to a term source file as defined
in a CvReference element.
termAccession - CV term accession number as in the CV file.
termName - CV term name.
useTermName - Boolean to set whether the check is done on the termName (TRUE) or on the
termAccession (FALSE and default).
useTerm - This attribute indicates whether the term itself can be used to annotate data (TRUE) or
not (FALSE). This latter case may happen when a term, parent of valid terms for annotation, is
mentioned to keep the mapping concise.
allowChildren - This attribute indicates whether the children of the described term are allowed to
annotate data (TRUE) or not (FALSE).
isRepeatable - Value is 'True' when a term can be repeated in the same instance of the associated
XML element.
Sample configuration file
C. Example of rule definition
Now let's define a toy example on which we will be able to build a sample custom Validator:
In a nutshell, this model describes an experiment containing one or more molecules.
Each molecule is characterized by a sequence (if applicable) and a MoleculeType (values taken from an
ontology we have defined in an OBO file: molecule-type.obo) and can have zero or more
post-translational modifications (values taken from the PSI-MOD ontology).
Here is a graphical representation of the molecule type ontology:
Now let's define some rules based on this data model and express them using the CV mapping.
rule 1: all molecules must have a type that is 'protein' or 'nucleic acid' or one of their child terms
rule 2: if a modification is defined on a molecule, it should be a child term of 'protein modification categorized by
amino acid modified' (MOD:01157)
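The way useTerm and allowChildren interact when a value is checked against a CvTerm can be sketched as follows. This is a hypothetical illustration, not the framework's implementation: the class TermMatchSketch and its single-parent map are assumptions, whereas the accessions SPE:0326 (protein) and SPE:0318 (nucleic acid) are those of the sample ontology.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the framework resolves parent/child links through its ontology access layer.
final class TermMatchSketch {
    // child accession -> parent accession (single parent and no cycles, to keep the sketch short)
    private final Map<String, String> parentOf = new HashMap<String, String>();

    void addChild(String child, String parent) {
        parentOf.put(child, parent);
    }

    // True when 'candidate' is the term itself (if useTerm) or one of its descendants (if allowChildren).
    boolean matches(String candidate, String term, boolean useTerm, boolean allowChildren) {
        if (candidate.equals(term)) {
            return useTerm;
        }
        if (!allowChildren) {
            return false;
        }
        String current = parentOf.get(candidate);
        while (current != null) {
            if (current.equals(term)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    // Rule 1 with OR combination logic: the type must match 'protein' or 'nucleic acid'.
    boolean satisfiesRule1(String typeId) {
        return matches(typeId, "SPE:0326", true, true)
            || matches(typeId, "SPE:0318", true, true);
    }
}
```

Setting useTerm to false while keeping allowChildren at true expresses the case where a parent term is listed only to keep the mapping concise and must not itself be used for annotation.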
You can download the complete sample file here: cv-mapping.xml
Note: we have tried to develop this component so that it makes the developer's life a little easier when
it comes to writing XPath expressions. The component automatically verifies that the XPath
expression is valid against the instance of the data model submitted and, if it is not, a
ValidatorMessage is generated to describe the issue and, if possible, to suggest a fix.
Let's take a look at an example.
We define on the model described above the following XPath expression:
/experiment/molecul/modifications/@id
When you run the validator's CV Mapping Rules on an instance of experiment that has at least
one molecule, you get the following error message:
Could not find property 'molecul' of the xpath expression 'molecul/modifications/@id' (element position: 1) in the given object of: net.sf.psi.spe.Experiment - Did you mean 'molecules' ?
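How the framework arrives at its "Did you mean" suggestion is not described here. One common way to produce such hints is to pick the known property name with the smallest edit distance to the misspelled one. The sketch below does this with a Levenshtein distance; the class PropertySuggester and its maxDistance parameter are hypothetical.

```java
import java.util.List;

// Hypothetical sketch of a "did you mean" helper; not the framework's actual strategy.
final class PropertySuggester {
    // Classic dynamic-programming Levenshtein distance, keeping two rows in memory.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest known property name, or null when nothing is within maxDistance.
    static String suggest(String unknown, List<String> known, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : known) {
            int d = distance(unknown, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

Called with 'molecul' and the property names of Experiment, such a helper would select 'molecules', since only two insertions separate the two strings.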
4. Validator Tutorial: How to Build Your Own Rules?
A. What can these user-defined rules do for you ?
Essentially, whenever the CV mapping rules cannot be used to model the validation you want to apply,
the Object Rules are the alternative. There is inherently no limitation to what these rules can do, as
long as you are able to program them using the Java language and the plethora of libraries available on
the internet.
B. Implementing your first rule
The validator API defines a class that one has to extend in order to write a rule:
psidev.psi.tools.validator.rules.codedrule.ObjectRule
The class diagram below illustrates this part of the Validator's data model:
As you can see in this diagram, in order to fulfill the contract of an ObjectRule, you will have to
implement the following methods:
boolean canCheck( Object object );
Collection<ValidatorMessage> check( Object object )
The canCheck method defines which object type (i.e. class) a specific rule is able to validate. The
second method, check, is the one that performs the validation and returns messages if inconsistencies
are detected.
1. Writing a simple rule
So let's define a first very simple rule that only accesses the data available in the provided instance of
the data model. In this example we are still playing with our Simple Proteomics Experiment of which
the class diagram is available here.
In this first simple rule, we are going to look into the Experiment and report an error whenever no
name has been given.
If you wish to run this rule yourself, you can download the source code of this sample validator here.
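As a rough idea of the shape such a rule takes, here is a self-contained sketch. It is not the downloadable source: ObjectRuleSketch, SketchMessage and ExperimentSketch are simplified, hypothetical stand-ins for the framework's ObjectRule and ValidatorMessage and for the SPE Experiment class, whose real signatures are richer. The WARN level is also a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Collection;

// Hypothetical stand-in for ValidatorMessage.
final class SketchMessage {
    final String text;
    final String level;

    SketchMessage(String text, String level) {
        this.text = text;
        this.level = level;
    }
}

// Hypothetical stand-in for ObjectRule, reduced to the two methods named in the text.
abstract class ObjectRuleSketch {
    abstract boolean canCheck(Object object);

    abstract Collection<SketchMessage> check(Object object);
}

// Hypothetical stand-in for the SPE Experiment class.
final class ExperimentSketch {
    final int id;
    final String name;

    ExperimentSketch(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class ExperimentNameRule extends ObjectRuleSketch {
    boolean canCheck(Object object) {
        return object instanceof ExperimentSketch;
    }

    Collection<SketchMessage> check(Object object) {
        Collection<SketchMessage> messages = new ArrayList<SketchMessage>();
        if (!canCheck(object)) {
            return messages; // not our kind of object, nothing to report
        }
        ExperimentSketch experiment = (ExperimentSketch) object;
        if (experiment.name == null || experiment.name.trim().isEmpty()) {
            messages.add(new SketchMessage(
                "Experiment id:" + experiment.id + " doesn't have a name.", "WARN"));
        }
        return messages;
    }
}
```

Returning an empty collection rather than null when nothing is wrong keeps the calling code free of null checks.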
2. Writing a rule that does use Ontologies
Now let's write a rule that reports the following inconsistencies:
If the molecule type is protein (SPE:0326), then if the sequence is defined it has to be composed
of amino acids only.
If the molecule type is nucleic acid or one of its child terms, then if the sequence is defined it
has to be composed of nucleic acids only.
If the molecule type is ribonucleic acid or one of its child terms, then if the sequence is defined
it has to be composed of ribonucleic acids only.
If the molecule doesn't have a sequence (unless it is a small molecule), we report a low severity
(INFO) message.
Here is the rule implementing these constraints:
Please note that in order to keep this code sample concise, we have removed the import section. Please
download the full source code if you want to get the complete version.
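The sequence checks themselves can be illustrated independently of the framework. In the sketch below, which is not the tutorial's rule, the molecule kind is assumed to have been resolved from the ontology beforehand, the three alphabets are deliberately simplified, and the ERROR level for an invalid sequence is an assumption; only the INFO level for a missing sequence comes from the list above. The class SequenceAlphabetSketch is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch. The molecule kind is assumed to be resolved from the ontology already
// (e.g. "nucleic acid or one of its child terms" maps to NUCLEIC_ACID).
final class SequenceAlphabetSketch {
    enum Kind { PROTEIN, NUCLEIC_ACID, RIBONUCLEIC_ACID, SMALL_MOLECULE }

    // Simplified alphabets, assumed for illustration only.
    private static final Pattern AMINO_ACIDS = Pattern.compile("[ACDEFGHIKLMNPQRSTVWY]+");
    private static final Pattern NUCLEOTIDES = Pattern.compile("[ACGTU]+");
    private static final Pattern RIBONUCLEOTIDES = Pattern.compile("[ACGU]+");

    // Returns null when there is nothing to report, otherwise "LEVEL: description".
    static String check(Kind kind, String sequence) {
        if (sequence == null || sequence.isEmpty()) {
            return kind == Kind.SMALL_MOLECULE ? null : "INFO: the molecule has no sequence";
        }
        String s = sequence.toUpperCase(Locale.ROOT);
        switch (kind) {
            case PROTEIN:
                return AMINO_ACIDS.matcher(s).matches()
                    ? null : "ERROR: not an amino acid sequence";
            case NUCLEIC_ACID:
                return NUCLEOTIDES.matcher(s).matches()
                    ? null : "ERROR: not a nucleic acid sequence";
            case RIBONUCLEIC_ACID:
                return RIBONUCLEOTIDES.matcher(s).matches()
                    ? null : "ERROR: not a ribonucleic acid sequence";
            default:
                return null; // small molecules: no alphabet constraint in this sketch
        }
    }
}
```

In a real rule, the kind would be derived by asking the ontology whether the molecule's type term is, for instance, ribonucleic acid or one of its child terms, which is where access to the ontologies comes in.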
C. Configuring Your Set of Object Rules
1. The Object Rules Schema
2. Example of rule set for the two rules defined above
5. Validator Tutorial: Wiring It Together - Bringing All Components Together
Now that you have created your CV Mapping rules and/or your own object rules, the next logical step is
to create your own validator.
Here is a graphical representation of the process of building a validator given the separate
components:
As you can see in the above representation, in order to build your own validator, you will have to bring
together your configuration files in order to define ontologies, CV mapping rules, and object rules (for
which you also have to provide your rule classes). Once you have brought all of this together inside a project,
you can create your own validator as follows:
In this code example, one can see that two methods have been written:
The constructor of the SPE Validator, which essentially passes the 3 configuration files to the
generic validator,
The validate method, which takes an Experiment and runs the CV mapping validation as well as the
object rule validation. Any message generated in this process is stored in a collection and
returned to the calling process.
Now that we have put everything together, it's time to run our validator on some data and display the
result of this validation. Obviously, the aim of this tutorial is not to give a lecture on user interfaces or
even on how to write them in Java, so we are going to aim at a simple, basic user interface that
prints the result of our validation on the command line.
Here is what our little program outputs:
Validation run collected 3 message(s):
ValidatorMessage{message='The result found at: /molecules/modifications/@id for which the values are ''BLA:0000X'' didn't match any of the 1 specified CV term: - MOD:01157 (protein modification categorized by amino acid modified) or any of its children. The term can be repeated. The matching value has to be the identifier of the term, not its name.', level=WARN, context=Context(/molecules/modifications/@id ), rule=}
ValidatorMessage{message='The result found at: /molecules/type/@id for which the values are ''SPE:0328'' didn't match any of the 2 specified CV terms: - The sole term SPE:0326 (protein) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name. - SPE:0318 (nucleic acid) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.', level=ERROR, context=Context(/molecules/type/@id ), rule=}
ValidatorMessage{message='Experiment id:3 doesn't have a name.', level=WARN, context=null, rule=null}
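A report of this shape, a count line followed by one line per message, needs very little code. The following sketch shows one possible way to render it; the class ReportSketch is hypothetical and works on plain strings rather than on the framework's ValidatorMessage objects.

```java
import java.util.List;

// Hypothetical sketch of a command-line report; messages are plain strings here.
final class ReportSketch {
    // Builds the count line followed by one line per message.
    static String render(List<String> messages) {
        StringBuilder out = new StringBuilder();
        out.append("Validation run collected ")
           .append(messages.size())
           .append(" message(s):");
        for (String message : messages) {
            out.append('\n').append(message);
        }
        return out.toString();
    }
}
```

Passing the rendered string to System.out.println is enough to display the result on the command line.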
6. Validator Tutorial: Download Validator's Tutorial Source Code
Here are a few things you can download to get you started with the Validator:
The latest Validator framework can be downloaded from here.
The Simple Proteomics Experiment (SPE) sample project can be downloaded from here
This archive contains 2 projects: the SPE data model and the SPE simple validator.