D15.2 – DCV curation tools and services to
automatically and manually acquire high-quality
curated data
MD-PAEDIGREE - FP7-ICT-2011-9 (600932)
Model Driven Paediatric European Digital Repository
Call identifier: FP7-ICT-2011-9 - Grant agreement no: 600932
Thematic Priority: ICT - ICT-2011.5.2: Virtual Physiological Human
Deliverable 15.2
DCV curation tools and services to
automatically and manually acquire high-quality
curated data
Due date of delivery: 28.02.2015
Actual submission date: 06.03.2015
Start of the project: 01-03-2013
Ending Date: 28.02.2017
Partner responsible for this deliverable: ATHENA
Version: 1.0
Dissemination Level: Public
Document Classification
Title DCV curation tools and services to automatically
and manually acquire high-quality curated data
Deliverable D15.2
Reporting Period Month 24
Authors Anna Gogolou
Work Package WP15
Security Public
Nature Report
Keyword(s) Data cleaning, data curation, data validation,
databases
Document History
Name Remark Version Date
MDP_D15.2_v1.0 First draft 1.0 17/02/15
MDP_D15.2_v1.1 Reviewed draft 1.1 27/02/15
MDP_D15.2_v1.2 Internal edits 1.2 28/02/15
MDP_D15.2_v1.3 Internal edits 1.3 2/03/15
MDP_D15.2_v2 Final 2.0 3/03/15
MDP_D15.2_v2.1 Final with corrections 2.1 5/03/15
List of Contributors
Name Affiliation
Anna Gogolou ATHENA
Eleni Zacharia ATHENA
Harry Dimitropoulos ATHENA
Omiros Metaxas ATHENA
List of reviewers
Name Affiliation
Patrick Ruch HES-SO
Bruno Dallapiccola OPBG
Abbreviations
API Application Programming Interface
CFD Conditional Functional Dependency
CFM Connection and Function Manager
DC Denial Constraint
DCV Data Curation and Validation (tool)
FD Functional Dependency
MD-PAEDIGREE Model-Driven European Paediatric Digital
Repository
UDF User-Defined Function
VPH Virtual Physiological Human
Contents
1. Introduction
2. System design
2.1 Server-side implementation
2.2 Client-side implementation
3. Data cleaning
3.1. Typographical error detection
3.1.a Numeric variables
3.1.b Alphanumeric variables
3.2. Data cleaning rules
3.2.a Functional dependencies
3.2.b Conditional functional dependencies
3.2.c Denial constraints
4. New derived columns
4.1 Discretisation
4.2 Computation of medical scores
5. Data visualisation
5.1. Visualisation for discrete variables
5.1.a Barcharts
5.1.b Piecharts
5.2. Visualisation for continuous variables
5.2.a Scatterplots
5.2.b Linecharts
6. History and workflows
6.1 History
6.2 Workflows
7. Future work
8. References
1. Introduction
This deliverable describes the new web-based Data Curation and Validation (DCV) tool delivered at M24 as
the result of the activities of Task T15.1 “Data curation and validation tool” [M6-24].
Curated and valid data are a prerequisite for the research to be carried out by MD-Paedigree researchers
and by others. More generally, the need for clean data is pervasive in scientific activity and in today's
economy. The DCV tool is suitable for use by both clinicians and information technology personnel. The
former are the experts who supply the tool with precise and accurate data cleaning rules, while the latter
can transform the data through the tool's operations according to their own scientific purposes.
The data cleaning process consists of two major parts. First, general typographical errors are detected for
both numeric and alphanumeric variables. For numeric columns, the user can detect outliers by applying
numeric filters. The filters provide a graphical representation of the distribution of the data, making it
easy to spot erroneous values. For alphanumeric columns, the tool uses string distance metrics to suggest
groups of very similar text values that are possibly typos of the same value. Second, a relational database
may contain errors that only an expert can define. Expert user-defined data cleaning rules over a relation
serve exactly this need.
New derived columns, created either through discretisation criteria or by computing arithmetic
expressions, are another functionality that DCV offers. Clinicians and researchers are expected to benefit
from these operations. Computations over millions of rows of data are executed quickly by the madIS
engine¹.
Data visualisation is another useful component which completes the cleaning process. Interactive barcharts
and piecharts help users identify the distinct values of a column's data, while scatterplots and linecharts
give a graphical representation of correlations between two attributes.
All the above-mentioned actions that affect data values are saved and presented to the user as history
records under the “History” tab. The user can undo or redo one or more actions, and can also save
workflows and re-run or re-use them in other projects or with other data.
1 https://code.google.com/p/madis/. Retrieved: 17.02.2015
2. System design
The new DCV tool uses a client-server architecture. The following subsections describe its architecture.
Figure 1: Architecture of the new DCV tool. It is connected with the MD-Paedigree infostructure through
Gnubila’s API
2.1 Server-side implementation
The web server is based on the Tornado framework². The server operates over madIS, which is an extensible
relational database system built on top of SQLite³.
At the heart of madIS lies the SQLite database. SQLite natively supports extended-SQL functionality through
user-defined functions (UDFs) implemented in C. The UDFs are categorised into row, aggregate, and virtual
table functions. This functionality is inherited by madIS. To make the development, management, and use
of UDFs as simple and efficient as possible, the following layers were integrated with the engine of SQLite:
2 tornadoweb.org. Retrieved: 28.02.2015
3 https://sqlite.org. Retrieved: 28.02.2015
The first layer is APSW⁴, a wrapper of SQLite that makes it possible to control the database engine from
Python. APSW also makes it possible to implement UDFs in Python, enabling SQLite to use them in the same
way as its native UDFs. Both Python and SQLite are executed in the same process, greatly reducing the
communication cost between them.
Another layer is the Connection and Function Manager (CFM), which is the external interface of madIS. It
receives madSQL queries, transforms them into SQL92, and passes them to SQLite for execution. This
component also supports query execution tracing and monitoring. Finally, it automatically finds and loads
all the available UDFs.
In madIS, queries are expressed in madSQL, an SQL-based declarative language extended with UDFs. The
system offers row, aggregate and virtual table UDFs, and it can be extended with new UDFs using a simple
and intuitive interface. All UDFs are written in Python and can use all the pre-existing Python libraries
(NumPy⁵, SciPy⁶, etc.), as well as command line tools.
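The difference between a row UDF and an aggregate UDF can be illustrated with a minimal sketch. It does not use madIS or APSW: it relies on the sqlite3 module of the Python standard library as a stand-in, and the function names normalise_sex and value_range, as well as the patients table, are hypothetical. Virtual table functions are not covered, since the standard sqlite3 module does not expose them.

```python
import sqlite3


def normalise_sex(value):
    """Row function: called once per row and returns one value."""
    if value is None:
        return None
    return value.strip().capitalize()


class ValueRange:
    """Aggregate function: consumes many rows and returns one value."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        # Ignore NULLs, as SQL aggregates usually do.
        if value is None:
            return
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        if self.low is None:
            return None
        return self.high - self.low


conn = sqlite3.connect(":memory:")
# Register the Python callables so that SQL queries can use them by name.
conn.create_function("normalise_sex", 1, normalise_sex)
conn.create_aggregate("value_range", 1, ValueRange)

conn.execute("create table patients (sex text, age_at_onset real)")
conn.executemany(
    "insert into patients values (?, ?)",
    [(" male", 4.5), ("FEMALE", 7.0), ("Male ", 12.5)],
)

row_results = [r[0] for r in conn.execute("select normalise_sex(sex) from patients")]
age_range = conn.execute("select value_range(age_at_onset) from patients").fetchone()[0]
```

As in madIS, the Python code and the SQLite engine run in the same process, so no data is exchanged with an external interpreter.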
The query language of madIS is SQL enhanced with two extensions. The first is the automatic creation of
virtual tables, which permits their direct use within a query without explicitly creating them beforehand.
The second is an inverted syntax which uses UDFs as statements.
For example, the following query reads the data from the file ‘data.tsv’:
create virtual table input_file using file('data.tsv');
select * from input_file;
With the first extension, it can be simplified as:
select * from file('data.tsv');
With the inverted syntax, it can be further simplified as:
file 'data.tsv';
The inverted syntax provides a natural way to compose virtual table functions. In the following example,
the query that uses file is provided as a parameter to countrows:
select * from countrows("select * from file('data.tsv')");
This syntax is error-prone because it uses nested quote levels. By using inversion, the query is written as:
countrows file 'data.tsv';
The ordering is from left to right, i.e., x y z is translated to x(y(z)). Notice that this syntax is very close to the
natural language sentence “count the rows of file ‘data.tsv’”.
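The left-to-right rule can be made concrete with a toy sketch. It only shows how a token sequence nests into x(y(z)); it is not the madIS parser, and the function name nest_inverted is hypothetical. The real translation into SQL92, including quoting, is performed by the Connection and Function Manager and is not reproduced here.

```python
def nest_inverted(tokens):
    """Rewrite ['x', 'y', 'z'] as the string 'x(y(z))'."""
    if not tokens:
        raise ValueError("expected at least one token")
    # The last token is the innermost argument; the others are function names.
    *functions, argument = tokens
    expression = argument
    # Wrap from the inside out, so the leftmost name ends up outermost.
    for name in reversed(functions):
        expression = f"{name}({expression})"
    return expression


nested = nest_inverted(["countrows", "file", "'data.tsv'"])
```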
4 https://github.com/rogerbinns/apsw. Retrieved: 28.02.2015
5 http://www.numpy.org. Retrieved: 28.02.2015
6 http://www.scipy.org. Retrieved: 28.02.2015
2.2 Client-side implementation
The client side of DCV is a graphical user interface (GUI) which has been developed using the following
JavaScript libraries:
1) jQuery⁷, which is the basic library offering the API for event handling, animation and Ajax.
2) SlickGrid⁸, which provides an advanced JavaScript grid that supports virtual scrolling and remains
responsive with hundreds of thousands of rows.
3) D3⁹, which offers visualisation components and brings data to life using HTML, SVG and CSS.
4) w2ui.js¹⁰, which provides widgets such as layout, sidebar, tabs, toolbar, popup, field controls and forms.
3. Data cleaning
A database is said to be dirty if it contains inconsistencies with respect to some set of constraints. The
data cleaning process aims to remove these inconsistencies. It is a crucial activity in many real-life
information systems, as unclean data often lead to erroneous results and decisions. The new web-based
DCV is a tool for working with dirty data and cleaning it up. DCV handles two types of errors:
typographical errors and errors whose detection requires expert-user knowledge (rules).
3.1. Typographical error detection
General typographical errors are detected both for numeric and alphanumeric variables, as described in the
following subsections.
3.1.a Numeric variables
For numeric variables, common types of errors caused by human inattention during typing are numeric
outliers and inappropriate alphanumeric values. The nature of medical data does not allow us to decide in
advance whether a numeric value is indeed an outlier, even if we run outlier detection algorithms. For this
reason, DCV offers the “Numeric filter...” option on each column’s menu. Once it is selected, a frame
appears on the left of the screen, showing a graphical representation of the distribution of the column's
numeric data and listing the number of entries per type (numeric, non-numeric, blank). Users can
easily check and correct erroneous values and find the missing ones. Figures 2-5 depict screenshots
showing examples of using the “Numeric filter…” option.
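The information shown in the filter frame can be approximated with a short sketch. It is not the DCV implementation: the function names are hypothetical, and the treatment of non-finite values and the equal-width binning are assumptions made for illustration.

```python
import math


def summarise_column(values):
    """Split raw cell values into numeric, non-numeric and blank entries."""
    numeric, non_numeric, blank = [], [], 0
    for value in values:
        text = "" if value is None else str(value).strip()
        if text == "":
            blank += 1
            continue
        try:
            number = float(text)
        except ValueError:
            non_numeric.append(text)
            continue
        # Assumption: 'nan' and 'inf' are reported as non-numeric entries.
        if math.isfinite(number):
            numeric.append(number)
        else:
            non_numeric.append(text)
    return numeric, non_numeric, blank


def histogram(numbers, bins=5):
    """Count values per equal-width bin between the minimum and the maximum."""
    if not numbers:
        return []
    low, high = min(numbers), max(numbers)
    width = (high - low) / bins or 1.0  # avoid a zero width for constant columns
    counts = [0] * bins
    for number in numbers:
        index = min(int((number - low) / width), bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, c) for i, c in enumerate(counts)]


numeric, non_numeric, blank = summarise_column(["4.4", "5", "", None, "n/a", "44", " 6.5 "])
bin_counts = [count for _, _, count in histogram(numeric, bins=4)]
```

A value that falls alone in a distant bin, such as 44 here, is only a candidate outlier; as noted above, the decision remains with the user.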
7 http://jquery.com. Retrieved: 28.02.2015
8 https://github.com/mleibman/SlickGrid. Retrieved: 28.02.2015
9 http://d3js.org. Retrieved: 28.02.2015
10 http://w2ui.com/web/. Retrieved: 28.02.2015
Figure 2: “Numeric filter...” applied on column AgeAtOnset. The frame lists 143 numeric values, 0 non-
numeric and 14 blank cells. Observing data distribution and range, the user can detect a possible outlier
near value 45.
Figure 3: “Numeric filter...” applied on column AgeAtOnset. The user can isolate the range of values
between 20.00 – 45.00. On the datatable, data of that range are selected and depicted. Here, the user can
see a possible outlier (AgeAtOnset=44). The user decides if this value is indeed an outlier. If yes, he can edit
the cell and correct it (e.g. the correct AgeAtOnset for this case is 4.4).
Figure 4: “Numeric filter...” applied on column AgeAtOnset. The user can detect the missing values by
filtering the column by blank cells only (see ticked box ‘Blank’). This way, all patients with missing
“AgeAtOnset” values are displayed.
Figure 5: Interaction between multiple filters (in this case 3 filters) and the datatable. 15 records pass the
criteria of filters.
3.1.b Alphanumeric variables
The “String similarities...” option of the menu offers the possibility of typographical error detection in the
alphanumeric values of a column. Based on string distance metrics, the algorithm proposes groups of very
similar text values that are possibly typos of the same value. All the column's values, along with their
frequencies, are presented in a grouped table, appearing on the left of the datatable. The algorithm
indicates as correct value in each group the value with the greatest frequency. It is up to the user to decide
which of the changes are meaningful to apply. See Figure 6 for an example.
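The grouping idea can be sketched as follows. The deliverable does not name the string distance metric used by DCV, so this sketch assumes the Levenshtein edit distance with a threshold of one edit; the function names and the greedy grouping strategy are likewise illustrative assumptions.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest_groups(values, max_distance=1):
    """Group similar values; the most frequent value of a group is proposed as correct."""
    counts = Counter(v for v in values if v)
    groups = []
    # Visit values from most to least frequent, so each group is seeded
    # by the value that is most likely to be the correct spelling.
    for value, _ in counts.most_common():
        for group in groups:
            if edit_distance(value, group[0]) <= max_distance:
                group.append(value)
                break
        else:
            groups.append([value])
    # Only groups with more than one spelling are worth showing to the user.
    return [(g[0], {v: counts[v] for v in g}) for g in groups if len(g) > 1]


column = ["Male"] * 5 + ["Female"] * 4 + ["eMale", "Femaled"]
suggestions = suggest_groups(column)
```

With these sample values, ‘eMale’ is grouped with ‘Male’ and ‘Femaled’ with ‘Female’, while ‘Male’ and ‘Female’ stay apart because they differ by three edits.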
Figure 6: “String similarities...” applied on column Sex. ‘Male’ and ‘eMale’ are grouped together, while
‘Female’ and ‘Femaled’ are also grouped together (common typographical errors). The user can merge the
values of selected groups. The algorithm proposes as correct the value with the greatest frequency in each
group, but the user can edit the cell and apply a preferred value as the merging value.
3.2. Data cleaning rules
Some errors in a relational database can only be detected by examining how values relate to one another
within a relation. DCV can correct these errors in a semi-automatic manner. Experts supply the tool with
precise and accurate user-defined data cleaning rules, and DCV then uses these rules to detect the errors.
DCV supports the following three main categories of data cleaning rules: Functional Dependencies (FDs),
Conditional Functional Dependencies (CFDs) and Denial Constraints (DCs).
3.2.a Functional dependencies
Functional dependencies are constraints between two sets of attributes in a relation. A functional
dependency holds when one set of attributes uniquely determines another set of attributes. In DCV a rule
applies either in the right direction only (the default) or as a one-to-one relationship, in which case it
applies in both directions. The syntax is as follows:
ColumnA, ColumnB | ColumnC, ColumnD...
meaning that columns A and B uniquely determine columns C and D. An example from daily life is country |
capital, which indicates that tuples with the same value on country should also have the same value on
capital. If the rule is declared as one-to-one, it also means that tuples with the same value on capital
should also have the same value on country.
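Detecting violations of such a rule amounts to grouping the tuples by their left-hand values and reporting every group with more than one distinct right-hand value. The sketch below shows this check over a list of dictionaries; it is an illustration under that assumption, not the query that DCV issues to madIS, and the function names are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the left-hand value groups that map to several right-hand values."""
    seen = defaultdict(set)
    for row in rows:
        left = tuple(row[column] for column in lhs)
        right = tuple(row[column] for column in rhs)
        seen[left].add(right)
    return {left: rights for left, rights in seen.items() if len(rights) > 1}


def one_to_one_violations(rows, lhs, rhs):
    """A one-to-one rule is checked as two dependencies, one per direction."""
    return fd_violations(rows, lhs, rhs), fd_violations(rows, rhs, lhs)


rows = [
    {"country": "France", "capital": "Paris"},
    {"country": "France", "capital": "Lyon"},
    {"country": "Italy", "capital": "Rome"},
]
forward, backward = one_to_one_violations(rows, ["country"], ["capital"])
```

Here the forward check reports the group for France, because it is associated with two different capitals, while the backward check reports nothing.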
The user accesses this functionality by pressing the “Data cleaning” button on the upper toolbar and then
selecting the option FD (Functional Dependency rule) from the drop-down list “Category of rule”. As shown
in Figure 7, a functional dependency consists of two fields, the left-hand and the right-hand. Clicking on
a field opens a context menu with the names of the columns, from which the user selects the columns to
place on the left part of the rule and those to place on the right part. If the rule follows a one-to-one
relationship, the user must select the corresponding checkbox, because by default the rule is applied in
the right direction only.
In the right part of the screen, there are three tabs. The first tab, “Info”, contains information about
how the rule is defined. The second tab, “Rule statistics”, provides a piechart showing how many tuples
are dirty (violate the rule) and how many are clean (do not violate the rule) out of the total tuples of
the project. The third tab, “Tuples frequency”, provides a barchart per violation group for FD and CFD
rules, showing the frequency of the different values of the group which violate the rule. The value with
the greatest frequency is possibly the correct one, and the others the dirty ones.
Figure 7: Composing a functional dependency rule.
Let's form an FD rule. We assume that the attribute ‘PhysicianInformation_CityOfHospital’ uniquely
determines the attribute ‘PhysicianInformation_CountryOfHospital’ (in the right direction only). It means
that tuples with the same value on the city column should also have the same value on the country column.
The user selects the column ‘PhysicianInformation_CityOfHospital’ in the left field and the column
‘PhysicianInformation_CountryOfHospital’ in the right field. Subsequently, the user provides a name for the
rule, so that it is saved for later use, and finally runs the rule. If the user doesn't provide a
name, the system automatically asks for one.
Figure 8 shows the results after running the rule. A grouped table with the violations is returned. Each
group contains tuples having the same value on the ‘PhysicianInformation_CityOfHospital’ column, but
different values on ‘PhysicianInformation_CountryOfHospital’. The values which violate the rule are
coloured in red. The user can either edit individual cells and correct the wrong values or apply a preferred
merged value. The statistics offered by the “Rule statistics” and “Tuples frequency” tabs can help the
user decide.
Figure 8: The results after running the rule ‘PhysicianInformation_CityOfHospital’|
‘PhysicianInformation_CountryOfHospital’. The rule is saved under the name rule1. The piechart on the right
informs us about how many tuples are dirty and how many are clean after running the rule. Here, 48 clean
and 246 dirty tuples were found.
3.2.b Conditional functional dependencies
Conditional functional dependencies (CFDs) extend FDs with conditions. The syntax of a CFD rule is as
follows:
ColumnA, ColumnB = (or !=) constant (or attribute) | ColumnC,
ColumnD = (or !=) constant (or attribute)...
An example of that type of rule is country=UK, area-code=131, street | postcode, city=Edinburgh, which
indicates that in the UK, if the area-code is 131, then the city has to be Edinburgh and street uniquely
determines postcode. In a one-to-one relationship, it also means that if the city is Edinburgh, then the
country has to be the UK and the area-code has to be 131, while postcode uniquely determines street.
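A simplified reading of a CFD can be sketched in two steps: the left conditions select the tuples to which the rule applies, and within that subset the right conditions and the embedded dependency are checked. The sketch below supports only equality conditions against constants (not ‘!=’ or comparisons with another attribute) and the right direction only; these restrictions, the function name and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def cfd_violations(rows, lhs, rhs, left_conditions=None, right_conditions=None):
    """Check a CFD with equality conditions given as {column: constant} dictionaries."""
    left_conditions = left_conditions or {}
    right_conditions = right_conditions or {}
    # Only tuples that satisfy every left condition are subject to the rule.
    in_scope = [row for row in rows
                if all(row[c] == v for c, v in left_conditions.items())]
    # Single tuples that break a right condition are violations on their own.
    constant_violations = [row for row in in_scope
                           if any(row[c] != v for c, v in right_conditions.items())]
    # The embedded dependency is checked on the in-scope tuples only.
    groups = defaultdict(set)
    for row in in_scope:
        groups[tuple(row[c] for c in lhs)].add(tuple(row[c] for c in rhs))
    dependency_violations = {k: v for k, v in groups.items() if len(v) > 1}
    return constant_violations, dependency_violations


addresses = [
    {"country": "UK", "area_code": "131", "street": "Mayfield", "postcode": "EH4 8LE", "city": "Edinburgh"},
    {"country": "UK", "area_code": "131", "street": "Mayfield", "postcode": "EH4 9ZZ", "city": "Edinburgh"},
    {"country": "UK", "area_code": "131", "street": "Crichton", "postcode": "EH8 9AB", "city": "Glasgow"},
    {"country": "US", "area_code": "131", "street": "Mayfield", "postcode": "07974", "city": "Murray Hill"},
]
wrong_city, wrong_postcode = cfd_violations(
    addresses,
    lhs=["country", "area_code", "street"],
    rhs=["postcode"],
    left_conditions={"country": "UK", "area_code": "131"},
    right_conditions={"city": "Edinburgh"},
)
```

The US tuple is ignored because it does not satisfy the left conditions, the Glasgow tuple breaks the condition on city, and the two Mayfield tuples disagree on postcode.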
The user can form a CFD rule by selecting the option CFD from the “Category of rule” drop-down list. CFD
also consists of two fields, the left-hand and right-hand. Furthermore, each field may have a condition.
Figure 9 shows the interface for composing a CFD rule. For example, let’s say there is a user who wants to
run the rule ‘Phys.Inf._CountryOfHospital’, ‘Phys.Inf._CountryOfHospital’=UK | ‘Phys.Inf._CityOfHospital’,
which indicates that when the ‘Phys.Inf._CountryOfHospital’ is the UK, then this country determines the
city. The user selects the column ‘PhysicianInformation_CountryOfHospital’ in the left field and the column
‘PhysicianInformation_CityOfHospital’ in the right field. Subsequently, he fills the left condition
(‘Phys.Inf._CountryOfHospital’ = UK) in the fields under the label “Left Conditions”. Figure 10 shows the
results after running this rule with the “Rule statistics” tab selected in the right part of the screen, whereas
Figure 11 shows the same but with the “Tuples frequency” tab selected instead.
Figure 9: Composing a conditional functional dependency rule.
Figure 10: The results after running the rule ‘Phys.Inf._CountryOfHospital’,
‘Phys.Inf._CountryOfHospital’=UK | ‘Phys.Inf._CityOfHospital’. The rule is saved under the name rule2. The
piechart on the right informs us that 217 clean and 77 dirty tuples were found.
Figure 11: Showing the results of rule2 as in Figure 10, but with the “Tuples frequency” tab selected,
showing the frequency of the different values of the group which violate the rule. The value with the
greatest frequency (here “LONDON”) could possibly indicate the correct one, while the others the dirty ones.
3.2.c Denial constraints
Denial constraints (DCs) define rules over one or more tuples in a relation. The syntax requires a
conjunction of predicates, where each predicate compares a tuple.attribute with another tuple.attribute
or a constant using an operator (=, !=, <, <=, >, >=). The syntax is as follows:
not(t1.ColumnA > t1.ColumnB & t1.ColumnC = constant)
meaning that there should be no tuple with value on ColumnA greater than value on ColumnB and value on
ColumnC equal to some constant. For instance, the rule not(t1.DiseaseDuration > t1.AgeAtOnset) indicates
that there should be no tuple with value on DiseaseDuration greater than value on AgeAtOnset, given that
units of measurement are the same (e.g. years).
We can form a denial constraint by selecting the corresponding option from the “Category of rule”
drop-down list. Figure 12 shows the interface that DCV offers for the composition of such a rule. The user
can add as many rows (predicates) as needed by pressing the button “+” at the right, or delete existing
ones by pressing the button “-” at the left. The predicates are combined with a logical “AND”, meaning
that a tuple violates the rule only if it satisfies all the predicates.
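For rules that refer to a single tuple (t1 only), the check can be sketched as below. The predicate representation, the function names and the choice to treat missing values as never satisfying a predicate are assumptions for illustration; rules spanning several tuples are not covered.

```python
import operator

OPERATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def predicate_holds(row, predicate):
    """predicate = (column, operator symbol, operand, operand_is_column)."""
    column, symbol, operand, operand_is_column = predicate
    left = row[column]
    right = row[operand] if operand_is_column else operand
    # Assumption: a missing value never satisfies a predicate.
    if left is None or right is None:
        return False
    return OPERATORS[symbol](left, right)


def dc_violations(rows, predicates):
    """A tuple violates not(p1 & p2 & ...) when every predicate holds for it."""
    return [row for row in rows
            if all(predicate_holds(row, p) for p in predicates)]


patients = [
    {"YearOfBirth": 2005, "YearDiseaseOnset": 2010},
    {"YearOfBirth": 2012, "YearDiseaseOnset": 2009},
    {"YearOfBirth": 2008, "YearDiseaseOnset": None},
]
# Encodes the rule not(t1.YearOfBirth > t1.YearDiseaseOnset).
dirty = dc_violations(patients, [("YearOfBirth", ">", "YearDiseaseOnset", True)])
```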
An example of a denial constraint is the following. There should be no tuple for patients with value on
YearOfBirth greater than value on YearDiseaseOnset. This rule can be written as not(t1.YearOfBirth >
t1.YearDiseaseOnset). Figure 13 shows the results after running this rule. Two violations have occurred.
Values that participate in the schema of the rule are coloured in red. It is not known which one is the dirty
one. The expert user has to discover it.
Figure 12: Composing a denial constraint rule.
Figure 13: The results after running the rule not(t1.YearOfBirth > t1.YearDiseaseOnset). The rule is saved
under the name rule3. The piechart on the right informs us that 292 clean and 2 dirty tuples were found.
4. New derived columns
DCV allows clinicians and researchers to compute medical scores derived from one or more columns and to
discretise their data according to their own criteria. These two options are described in detail in the
next two subsections.
4.1 Discretisation
Discretisation is available through the menu option “Discretise values” and is applied to the values of a
column. When the user clicks on this option, a new div is created at the left under the “Results” tab. It
asks the user to fill in the preferred ranges and their respective discretised values. If the first cell
stays blank, there is no lower limit (minus infinity is assumed). Likewise, if the last cell stays blank,
there is no upper limit (infinity is assumed).
The user can add new ranges by pressing the button “+” at the right, or delete existing ones by pressing
the red button “Del.” located next to each range. The tool guards against input errors: when an invalid
value is entered (e.g. when the left value of a range is greater than the right value), the user cannot
continue filling in the remaining fields until it is corrected. If the completed fields have no errors,
the user can press the button “Compute” to execute the discretisation. A new column with the name
discretised_XXX (where XXX is the name of the column whose values have been discretised) is produced next
to the XXX column (Figure 14).
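The range handling described above can be sketched as follows. The deliverable does not state whether range boundaries are inclusive, so the sketch assumes half-open ranges (lower bound included, upper bound excluded); the function names and the labels in the example are hypothetical.

```python
import math


def build_ranges(raw_ranges):
    """Validate (low, high, label) triples; None stands for a blank cell."""
    ranges = []
    for low, high, label in raw_ranges:
        low = -math.inf if low is None else float(low)
        high = math.inf if high is None else float(high)
        if low > high:
            raise ValueError(f"left value {low} is greater than right value {high}")
        ranges.append((low, high, label))
    return ranges


def discretise(value, ranges):
    """Return the label of the first range containing the value, else None."""
    if value is None:
        return None
    for low, high, label in ranges:
        if low <= value < high:  # assumption: half-open ranges
            return label
    return None


age_ranges = build_ranges([(None, 2, "0-2"), (2, 12, "2-12"), (12, None, "12+")])
ages = [0.5, 2, 11.9, 30, None]
discretised_ages = [discretise(age, age_ranges) for age in ages]
```

Missing values stay missing in the derived column, which keeps the discretised column aligned row by row with the original one.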
Figure 14: Discretisation of columns AgeAtOnset and DiseaseDuration.
4.2 Computation of medical scores
DCV offers its users an environment for the execution of arithmetic operations. madIS is the component
which executes all these operations, from the simplest ones (+, -, *, /) to the most complicated, in
minimal time. A pre-defined list of functions from the math library of Python is provided for more
functionality, which is able to cope with the needs of demanding users. All these operations run as
“row-function” queries in madIS, producing one value for each row. The madIS row function pyfunerrtonul
is used to execute the Python functions. The provided categories of functions are the following:
hyperbolic functions, numerical functions, trigonometric functions, power and logarithmic functions,
angular functions and aggregate functions (zscores, average, etc.). Two constants are also provided for
precision: pi = 3.141592... and e = 2.718281...
All this functionality is available through the “Compute medical scores” button of the toolbar. Figure 15
shows the interface that DCV offers for the computation of these scores. There are two tabs at the right,
the “Columns” tab, which contains the names of the columns of the project, and the “Functions” tab, which
contains the aforementioned functions by category. The user can write the formula for the computation of
his medical score in the textarea field. Double-clicking on a column's name, the column's name appears in
the textarea at the current cursor position. Respectively, double-clicking on a function's name, the
function's name along with the appropriate number of arguments also appear in the textarea. With just
one click, useful information about the function is provided under the “Compute score” and “Delete
function” buttons.
Let's assume that we want to compute the score (ESR + log(CReactiveProtein)) / 2. This score uses the
function log(). The user writes the formula in the textarea as it is, (ESR + log(CReactiveProtein)) / 2. As
mentioned before, the columns and the functions can be selected from the tabs on the right. Once the
formula is written, the user must provide a name for the new derived column before
clicking on the button “Compute score”. If the user does not provide a name, the system automatically asks
for one. There is also the optional field “Save function as”. If the user provides a name in this field, the
formula is saved in the database under that name. The formula then constitutes a user-defined
function, and from then on it appears along with the other functions in the “Functions” tab under the
category “User defined functions”. In this way, any user can save formulas and use them again as
functions. When the user clicks on the button “Compute score”, the new derived column is computed and added at the
end of the datatable. Users also have the option of deleting a selected function by pressing the button
“Delete function”. If no function is selected, a relevant warning message appears. (See Figures 15 & 16).
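The computation of a derived column can be illustrated with a minimal Python sketch. It is not DCV's implementation: the row representation (a list of dictionaries), the helper name compute_score and the sample values are hypothetical, and log() is assumed here to be the natural logarithm.

```python
import math

def compute_score(rows, new_column, formula):
    """Evaluate `formula` on every row and store the result in `new_column`."""
    for row in rows:
        row[new_column] = formula(row)
    return rows

def esr_crp_score(row):
    # ASSUMPTION: log() is taken to be the natural logarithm.
    return (row["ESR"] + math.log(row["CReactiveProtein"])) / 2

# Hypothetical sample records, for illustration only.
rows = [
    {"ESR": 20.0, "CReactiveProtein": 1.0},
    {"ESR": 10.0, "CReactiveProtein": math.e},
]

compute_score(rows, "ESR_CRP_score", esr_crp_score)
for row in rows:
    print(row)
```

Each record gains a new key ESR_CRP_score, which mirrors the derived column being appended at the end of the datatable.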
Figure 15: Computing a medical score
Figure 16: The new derived column ESR_CRP_score is the result of computing
(ESR + log(CReactiveProtein)) / 2.
5. Data visualisation
Data visualisation is a helpful and often supplementary tool in the cleaning process. DCV offers a wide
range of plots and charts that users can use for visualising the data and their derived values. Visualisation
components from the D3 JavaScript library (footnote 11) were used for this purpose. On the client's request, the server sends the
appropriate data for visualisation to the client in JSON format. A further
asset of the new DCV is the interaction among these plots, the datatable and the filters.
5.1. Visualisation for discrete variables
DCV currently provides barcharts and piecharts for the visualisation of discrete variables.
5.1.a Barcharts
Vertical barcharts can be formed for any column. This operation is available through the “Barchart” option
of the “Data visualisation” toolbar's menu. A context menu opens, from which the user selects
the preferred column and immediately sees a barchart of its values. A new div with the
chart is appended at the left under the “Results” tab. Figure 17 shows the barchart of column “Sex”.
Figure 17: Barchart of column Sex.
The barchart serves several purposes, from detecting the distinct values of the column to seeing whether any of the
column’s values is an error. The user can also have a maximised view of the barchart by clicking the button
“Maximise”, or close it by clicking the button “Close”. Figure 18 shows the maximised view of the
barchart of Figure 17.
11 http://d3js.org/. Retrieved 17/02/2015.
Figure 18: Maximised view of the previous barchart in Figure 17.
However, the most important feature of barcharts, and of any other DCV plot, is the interaction with the
datatable. When the user clicks on a bar, e.g. on the bar “Femaled” (detection of a typographical error), the datatable shows
only the records with value “Femaled” in column Sex (Figure 19). The user can then edit the cell and correct the
erroneous value.
Figure 19: The datatable displays only the record with value “Femaled” on column Sex.
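The two steps behind this interaction (counting the values of a discrete column for the chart, then filtering the datatable by the clicked value) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the JSON layout of the payload and the sample records are hypothetical and are not taken from DCV.

```python
import json
from collections import Counter

def barchart_payload(rows, column):
    """Count each distinct value of `column` and serialise the counts as JSON."""
    counts = Counter(row[column] for row in rows)
    # ASSUMPTION: one {"value", "count"} object per bar; DCV's real format is not documented here.
    return json.dumps([{"value": v, "count": c} for v, c in counts.items()])

def filter_by_value(rows, column, value):
    """Return only the records whose `column` equals the clicked bar's value."""
    return [row for row in rows if row[column] == value]

# Hypothetical sample records containing one typographical error.
rows = [
    {"id": 1, "Sex": "Female"},
    {"id": 2, "Sex": "Male"},
    {"id": 3, "Sex": "Femaled"},
    {"id": 4, "Sex": "Female"},
]

payload = barchart_payload(rows, "Sex")
selected = filter_by_value(rows, "Sex", "Femaled")
print(payload)
print(selected)
```

A value that occurs only once, such as “Femaled” above, stands out as a very short bar, and the filtered result contains exactly the record that needs correction.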
5.1.b Piecharts
Piecharts are available through the “Piechart” option of the “Data visualisation” toolbar's menu. These
plots, like the barcharts, are also interactive, i.e. clicking on a ‘piece’ of the pie displays only the relevant
records in the datatable (Figure 20).
Figure 20: Piechart of column Sex.
5.2. Visualisation for continuous variables
DCV currently provides scatterplots and linecharts for the visualisation of continuous variables.
5.2.a Scatterplots
Scatterplots are available through the “Scatterplot” option of the “Data visualisation” toolbar's menu. The
user is asked to select two columns of continuous variables to plot against each other. Figure 21 shows
the scatterplot of CReactiveProtein vs ESR. A “reverse” button is also provided so that the scatterplot
displays the same data but with the x and y axes interchanged. Furthermore, the user can click on a
particular point on the scatterplot and go to the actual cell in the datatable whose value is displayed.
Figure 21: Scatterplot: CReactiveProtein vs ESR.
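As a rough sketch of the data behind such a plot, each point can carry the index of the record it came from, so that a click can be traced back to the datatable, and the “reverse” operation simply swaps the two coordinates. The function names, the point layout and the sample values below are hypothetical and only illustrate the idea.

```python
def scatter_points(rows, x_col, y_col):
    """Build one point per record, keeping the record index for click-through."""
    return [
        {"row": index, "x": row[x_col], "y": row[y_col]}
        for index, row in enumerate(rows)
    ]

def reverse_axes(points):
    """Return the same points with the x and y coordinates interchanged."""
    return [{"row": p["row"], "x": p["y"], "y": p["x"]} for p in points]

# Hypothetical sample records.
rows = [
    {"ESR": 20.0, "CReactiveProtein": 5.0},
    {"ESR": 35.0, "CReactiveProtein": 12.5},
]

points = scatter_points(rows, "ESR", "CReactiveProtein")
reversed_points = reverse_axes(points)

# A click on the second point leads back to the second record of the datatable.
clicked = reversed_points[1]
print(clicked, rows[clicked["row"]])
```

Because the record index travels with the point, reversing the axes does not affect which record a click selects.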
5.2.b Linecharts
Linecharts are available through the “Linechart” option of the “Data visualisation” toolbar's menu. Figure
22 shows the linechart of CReactiveProtein vs ESR. Linecharts provide functionality similar to that of
scatterplots, except that the line itself is not clickable, since individual data points are not displayed.
Figure 22: Linechart: CReactiveProtein vs ESR.
6. History and workflows
In this section, we will describe the “History” and “Workflows” functionalities of DCV.
6.1 History
DCV keeps the history of every action that affects data values during each workflow. The user can see
history records under the “History” tab. History records are expressed in natural language, so any user can
read them. In addition, DCV offers Undo/Redo on the history: the user
can return to any previous step of the history and restore the older data, or move forward again to a more recent
step. Checkboxes to the left of each history record serve exactly this purpose. When the user unticks
one, the table automatically reverts to the state it had before this history record and all more recent ones
were applied. Conversely, when the user ticks an unticked checkbox, the table automatically moves to
the state just after that specific history record was applied. Figure 23 shows the project's table in its current
state, while Figure 24 shows it in the state before history record 2 and the more recent ones were
applied.
Furthermore, the tracking of Do/Undo operations provides a solid basis for data auditing and data
stewardship.
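One simple way to obtain this behaviour is to keep the original data untouched and replay the first k history records whenever a checkbox changes. The Python sketch below illustrates that idea only; it is an assumption about how such a mechanism could work, not a description of DCV's internals, and all names and sample actions are hypothetical.

```python
import copy

def table_at(original_rows, records, applied):
    """Rebuild the table by replaying the first `applied` history records."""
    rows = copy.deepcopy(original_rows)  # never modify the original data
    for _description, action in records[:applied]:
        action(rows)
    return rows

def fix_sex(rows):
    rows[0]["Sex"] = "Female"

def fix_esr(rows):
    rows[0]["ESR"] = 25

# Hypothetical original data and history (description in natural language, action).
original = [{"Sex": "Femaled", "ESR": 20}]
records = [
    ("Edit cell in column Sex: 'Femaled' -> 'Female'", fix_sex),
    ("Edit cell in column ESR: 20 -> 25", fix_esr),
]

current = table_at(original, records, applied=2)      # all checkboxes ticked
before_step_2 = table_at(original, records, applied=1)  # record 2 unticked
print(current, before_step_2)
```

In this sketch, unticking history record k corresponds to applied = k - 1, and ticking it again corresponds to applied = k.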
Figure 23: The project's table at its current condition, with “History” steps appearing on the left div.
Figure 24: The project's table in the state before history record 2 and the more recent ones were applied.
6.2 Workflows
Last but not least, workflows constitute a powerful asset of DCV. While history records have a
predetermined order, namely the order in which they were executed, workflows let the
user select which actions to save; the same sequence of steps (the
workflow) can then be re-run in other projects or with other data.
Saving and execution of workflows are available through the “Workflow” button of the toolbar. The user
selects the actions to include in the workflow by clicking on the “+” button next to each
history record under the “History” tab. When this button is clicked, a JSON object that
contains all the necessary information about the history record's action is appended to the textarea.
The user can remove an action from the workflow by clicking the “-” icon, which replaces the “+” icon after
the history record has been added to the workflow. Once the composition of the
workflow is complete, the user can name it in the field “Workflow”. Clicking on the button “Save workflow”
saves the workflow in the database. A workflow can also be run in the same project without saving it.
Let's assume that the user wants to run a saved workflow. Clicking on the field “Workflow” opens a context menu
with the names of all the workflows that have been saved in the database. The user selects the
preferred one from the menu, and the textarea is automatically filled with the JSON objects of the workflow’s actions.
After checking that this is the intended workflow, the user clicks on the button “Run
workflow” and the workflow is executed. Workflows can be deleted if the user no longer needs them.
This is done by selecting a workflow and clicking on the button “Delete workflow”.
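The idea of storing actions as JSON objects and replaying them on other data can be sketched in Python as follows. The JSON layout, the action names (“trim”, “replace_value”) and the handler functions are hypothetical; the actual structure of DCV's workflow objects is not specified in this deliverable.

```python
import json

def trim(rows, column):
    """Strip leading and trailing whitespace from every value of `column`."""
    for row in rows:
        row[column] = row[column].strip()

def replace_value(rows, column, old, new):
    """Replace every occurrence of `old` in `column` with `new`."""
    for row in rows:
        if row[column] == old:
            row[column] = new

# ASSUMPTION: each action name maps to one handler; these names are illustrative.
HANDLERS = {"trim": trim, "replace_value": replace_value}

def run_workflow(workflow_json, rows):
    """Apply the saved actions, in order, to the given rows."""
    for step in json.loads(workflow_json):
        params = dict(step)
        handler = HANDLERS[params.pop("action")]
        handler(rows, **params)
    return rows

# A workflow composed from two selected history records (hypothetical format).
workflow = json.dumps([
    {"action": "trim", "column": "Sex"},
    {"action": "replace_value", "column": "Sex", "old": "Femaled", "new": "Female"},
])

# The same workflow is re-run on data from another project.
other_project = [{"Sex": " Femaled "}, {"Sex": "Male"}]
run_workflow(workflow, other_project)
print(other_project)
```

Because the workflow is plain JSON text, it can be saved under a name, shown in a textarea for inspection, and executed later on a different table.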
Figure 25: Composing a workflow.
Figure 26: The selected workflow has been executed in the same project, changing its history.
7. Future work
The Data Curation and Validation (DCV) tool offers a semi-automatic cleaning process, as described in this
deliverable. We plan to continue improving and extending the tool with additional
functionality, taking into account the input we receive from users (clinicians, medical
researchers, etc.) during the various training and testing sessions that will take place in MD-Paedigree.
We plan to design cleaning workflows and processes that are more automated. These
enhancements may include, for example, automatic discovery of functional dependencies, conditional
functional dependencies and denial constraints. Schema matching will also be considered, because it is
necessary both for integrating data from different hospitals and for a more advanced
cleaning process. Machine learning algorithms will also be adopted for curation tasks. In addition, we
plan to use DCV’s GUI for interacting with the data mining, analysis and knowledge discovery tools
that will be developed in Task T16.1. In other words, we will incorporate the work produced under T16.1
within DCV, so that it includes some well-established supervised and unsupervised techniques addressing
well-defined research tasks and capturing specific user requirements, such as high-dimensionality reduction
and feature selection, similarity analysis, clustering and classification. The final aim is to develop a fully
integrated platform with a single web-based GUI for data analysis and KDD. In addition, by utilizing EXAREME
(previously named ADP) from T14.3, we could support Big Data analytics based on scalable data analysis
techniques and distributed execution. Our ultimate goal is the encapsulation of a scientific workflow engine
that supports scalable and reproducible scientific research, as well as interdisciplinary collaboration
across different institutions and scientists.
8. References
[1] Ebaid, Amr, et al. “NADEEF: A Generalized Data Cleaning System”. Proceedings of the VLDB
Endowment, Vol. 6, No. 12, pp. 1218-1221.
[2] Dallachiesa, Michele, et al. “NADEEF: A Commodity Data Cleaning System”. SIGMOD’13, June 22–27,
2013, New York, New York, USA.
[3] Stonebraker, Michael, et al. “Data Curation at Scale: The Data Tamer System”. 6th Biennial Conference
on Innovative Data Systems Research (CIDR ’13), January 6-9, 2013, Asilomar, California, USA.
[4] http://openrefine.org. Retrieved 17/02/2015.