D15.2 – DCV curation tools and services to

automatically and manually acquire high-quality

curated data

MD-PAEDIGREE - FP7-ICT-2011-9 (600932)

Model Driven Paediatric European Digital Repository

Call identifier: FP7-ICT-2011-9 - Grant agreement no: 600932

Thematic Priority: ICT - ICT-2011.5.2: Virtual Physiological Human

Deliverable 15.2

DCV curation tools and services to

automatically and manually acquire high-quality

curated data

Due date of delivery: 28.02.2015

Actual submission date: 06.03.2015

Start of the project: 01-03-2013

Ending Date: 28.02.2017

Partner responsible for this deliverable: ATHENA

Version: 1.0

Dissemination Level: Public

Document Classification

Title DCV curation tools and services to automatically and manually acquire high-quality curated data

Deliverable D15.2

Reporting Period Month 24

Authors Anna Gogolou

Work Package WP15

Security Public

Nature Report

Keyword(s) Data cleaning, data curation, data validation, databases

Document History

Name Remark Version Date

MDP_D15.2_v1.0 First draft 1.0 17/02/15

MDP_D15.2_v1.1 Reviewed draft 1.1 27/02/15

MDP_D15.2_v1.2 Internal edits 1.2 28/02/15

MDP_D15.2_v1.3 Internal edits 1.3 2/03/15

MDP_D15.2_v2 Final 2.0 3/03/15

MDP_D15.2_v2.1 Final with corrections 2.1 5/03/15

List of Contributors

Name Affiliation

Anna Gogolou ATHENA

Eleni Zacharia ATHENA

Harry Dimitropoulos ATHENA

Omiros Metaxas ATHENA

List of reviewers

Name Affiliation

Patrick Ruch HES-SO

Bruno Dallapiccola OPBG

Abbreviations

API Application Programming Interface

CFD Conditional Functional Dependency

CFM Connection and Function Manager

DC Denial Constraint

DCV Data Curation and Validation (tool)

FD Functional Dependency

MD-PAEDIGREE Model-Driven European Paediatric Digital Repository

UDF User-Defined Function

VPH Virtual Physiological Human

Contents

1. Introduction
2. System design
2.1 Server-side implementation
2.2 Client-side implementation
3. Data cleaning
3.1. Typographical error detection
3.1.a Numeric variables
3.1.b Alphanumeric variables
3.2. Data cleaning rules
3.2.a Functional dependencies
3.2.b Conditional functional dependencies
3.2.c Denial constraints
4. New derived columns
4.1 Discretisation
4.2 Computation of medical scores
5. Data visualisation
5.1. Visualisation for discrete variables
5.1.a Barcharts
5.1.b Piecharts
5.2. Visualisation for continuous variables
5.2.a Scatterplots
5.2.b Linecharts
6. History and workflows
6.1 History
6.2 Workflows
7. Future work
8. References

1. Introduction

This deliverable describes the new web-based Data Curation and Validation (DCV) tool delivered at M24 as the result of the activities of Task T15.1 “Data curation and validation tool” [M6-24].

Curated and valid data are a prerequisite for the research to be carried out by MD-Paedigree researchers and by others. More generally, the need for clean data is pervasive in scientific activity and in today's economy. The new web-based DCV tool is intended for use by both clinicians and information technology personnel. The former are the experts who supply the tool with precise and accurate data cleaning rules, while the latter can transform the data through the tool's available operations according to their own scientific purposes.

The data cleaning process consists of two major parts. First, general typographical errors are detected for both numeric and alphanumeric variables. For numeric columns, the user can detect outliers by applying numeric filters. The filters provide a graphical representation of the distribution of the data, making it easy for the user to spot erroneous values. For alphanumeric columns, the tool uses string distance metrics to suggest groups of very similar text values that are possibly typos of the same value. Second, a relational database may contain errors that only an expert can define. Expert user-defined data cleaning rules over a relation serve exactly this need.

New derived columns, produced either through discretisation criteria or by executing arithmetic operations, are another functionality that DCV offers. Clinicians and researchers are expected to benefit from these operations. Complicated computations over millions of rows of data are executed by the madIS engine¹.

Data visualisation is another useful component which completes the cleaning process. Interactive barcharts

and piecharts help users identify the distinct values of a column's data, while scatterplots and linecharts

give a graphical representation of correlations between two attributes.

All the above-mentioned actions that affect the values of data are saved and presented to the user as history records under the “History” tab. The user can undo one or more actions or redo them. The user can also save workflows and re-run or re-use them in other projects or with other data.

1 https://code.google.com/p/madis/. Retrieved: 17.02.2015

2. System design

The new DCV tool uses a client-server architecture. The following subsections describe its architecture.

Figure 1: Architecture of the new DCV tool. It is connected with the MD-Paedigree infostructure through

Gnubila’s API

2.1 Server-side implementation

The web server is based on the Tornado framework². The server operates over madIS, which is an extensible relational database system built on top of SQLite³.

At the heart of madIS lies the SQLite database. SQLite natively supports extended-SQL functionality through

user-defined functions (UDFs) implemented in C. The UDFs are categorized into row, aggregate, and virtual

table functions. This functionality is inherited by madIS. To make the development, management, and use

of UDFs as simple and efficient as possible, the following layers were integrated with the engine of SQLite:

2 tornadoweb.org. Retrieved: 28.02.2015
3 https://sqlite.org. Retrieved: 28.02.2015

The first layer is APSW⁴, a wrapper of SQLite that makes it possible to control the database engine from Python. APSW also makes it possible to implement UDFs in Python, enabling SQLite to use them in the same way as its native UDFs. Both Python and SQLite are executed in the same process, greatly reducing the communication cost between them.
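As a minimal sketch of the idea behind this layer, the following snippet registers a Python function as a row UDF and calls it from SQL. It deliberately uses Python's standard-library sqlite3 module instead of APSW, only so that the example runs without extra dependencies. The function bmi and the table patients are hypothetical and are not part of DCV or madIS.

```python
import sqlite3


def bmi(weight_kg, height_m):
    """Row UDF: body mass index, or NULL when an input is missing or invalid."""
    if weight_kg is None or height_m is None or height_m <= 0:
        return None
    return round(weight_kg / (height_m ** 2), 1)


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL can call it like a native function.
conn.create_function("bmi", 2, bmi)

conn.execute("CREATE TABLE patients (id INTEGER, weight_kg REAL, height_m REAL)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, 20.0, 1.0), (2, 32.0, None)],
)

# The UDF is evaluated once per row, inside the same process as SQLite.
rows = conn.execute(
    "SELECT id, bmi(weight_kg, height_m) FROM patients ORDER BY id"
).fetchall()
print(rows)
```

The second row yields NULL because its height is missing, which illustrates how a row function can tolerate incomplete records.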

Another layer is the Connection and Function Manager (CFM), which is the external interface of madIS. It

receives madSQL queries, transforms them into SQL92, and passes them to SQLite for execution. This

component also supports query execution tracing and monitoring. Finally, it automatically finds and loads

all the available UDFs.

In madIS, queries are expressed in madSQL, an SQL-based declarative language extended with UDFs. The

system offers row, aggregate and virtual table UDFs, and it can be extended with new UDFs using a simple

and intuitive interface. All UDFs are written in Python and can use all the pre-existing Python libraries

(NumPy⁵, SciPy⁶, etc.), as well as command line tools.

It is worth pointing out that the query language of madIS is based on SQL enhanced with two extensions. The first one is the automatic creation of virtual tables. This permits their direct usage within a query, without explicitly creating them beforehand.

The following query reads the data from the file ‘data.tsv’:

create virtual table input_file using file('data.tsv');
select * from input_file;

It can be simplified as:

select * from file('data.tsv');

The second extension is an inverted syntax which uses UDFs as statements. Using this syntax, the query can be further simplified as:

file 'data.tsv';

The inverted syntax provides a natural way to compose virtual table functions. In the following example, the query that uses file is provided as a parameter to countrows:

select * from countrows("select * from file('data.tsv')");

The above syntax is error prone because it uses nested quote levels. By using inversion, the query is written as:

countrows file 'data.tsv';

The ordering is from left to right, i.e., x y z is translated to x(y(z)). Notice that this syntax is very close to the natural language sentence “count the rows of file ‘data.tsv’”.
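The left-to-right rule can be illustrated with a few lines of Python. This is only a sketch of the nesting order described above, not the madIS parser: the helper nest_inverted is hypothetical, it takes already-tokenised input, and it emits plain call notation rather than the SQL that madIS actually generates.

```python
def nest_inverted(tokens):
    """Fold a left-to-right token list into nested call notation.

    ["x", "y", "z"] becomes "x(y(z))": the last token is the innermost
    argument and every token before it wraps the expression built so far.
    """
    if not tokens:
        raise ValueError("expected at least one token")
    expr = tokens[-1]
    for name in reversed(tokens[:-1]):
        expr = f"{name}({expr})"
    return expr


print(nest_inverted(["x", "y", "z"]))
print(nest_inverted(["countrows", "file", "'data.tsv'"]))
```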

4 https://github.com/rogerbinns/apsw. Retrieved: 28.02.2015
5 http://www.numpy.org. Retrieved: 28.02.2015
6 http://www.scipy.org. Retrieved: 28.02.2015

2.2 Client-side implementation

The client side of DCV is a graphical user interface (GUI) which has been developed using the following JavaScript libraries:

1) jQuery⁷, which is the basic library offering the API for event handling, animation and Ajax.

2) SlickGrid⁸, which provides an advanced JavaScript grid with virtual scrolling that remains responsive when handling hundreds of thousands of rows.

3) D3⁹, which provides the visualisation components. D3 brings data to life using HTML, SVG and CSS.

4) w2ui.js¹⁰, which provides widgets such as layout, sidebar, tabs, toolbar, popup, field controls and forms.

3. Data cleaning

A database is said to be dirty if it contains inconsistencies with respect to some set of constraints. The data-cleaning process aims to remove these inconsistencies. It is a crucial activity in many real-life information systems, as unclean data often lead to erroneous results and decisions. The new web-based DCV is a tool for working with dirty data and cleaning it up. DCV handles two types of errors: typographical errors, and errors whose detection requires expert-user knowledge expressed as rules.

3.1. Typographical error detection

General typographical errors are detected both for numeric and alphanumeric variables, as described in the

following subsections.

3.1.a Numeric variables

For numeric variables, common types of errors caused by human inattention during typing are numeric outliers and inappropriate alphanumeric values. The nature of medical data does not allow us to decide in advance whether a numeric value is indeed an outlier, even if we run outlier detection algorithms. For this reason, DCV offers the “Numeric filter...” option on each column's menu. Once it is selected, a frame appears on the left of the screen, showing a graphical representation of the distribution of the column's numeric data and listing the number of entries per type (numeric, non-numeric, blank). Users can easily check and correct erroneous values and find the missing ones. Figures 2-5 depict screenshots showing examples of using the “Numeric filter...” option.
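The per-type counts and the range selection described above can be sketched in a few lines of Python. This is an illustration under assumptions, not DCV's implementation: the helper names classify, summarise and in_range are hypothetical, and the sample values are made up.

```python
def classify(value):
    """Return 'blank', 'numeric' or 'non-numeric' for one raw cell value."""
    if value is None or str(value).strip() == "":
        return "blank"
    try:
        float(value)
        return "numeric"
    except (TypeError, ValueError):
        return "non-numeric"


def summarise(column):
    """Count the entries of a column per type, as listed in the filter frame."""
    counts = {"numeric": 0, "non-numeric": 0, "blank": 0}
    for value in column:
        counts[classify(value)] += 1
    return counts


def in_range(column, low, high):
    """Indices of numeric cells whose value lies in the closed range [low, high]."""
    return [
        i for i, value in enumerate(column)
        if classify(value) == "numeric" and low <= float(value) <= high
    ]


age_at_onset = ["4.1", "3", "", "44", "n/a", None, "7.5"]
print(summarise(age_at_onset))
print(in_range(age_at_onset, 20.0, 45.0))
```

Isolating the range 20.00 to 45.00 leaves only the suspicious entry, which the user can then inspect and correct by hand.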

7 http://jquery.com. Retrieved: 28.02.2015
8 https://github.com/mleibman/SlickGrid. Retrieved: 28.02.2015
9 http://d3js.org. Retrieved: 28.02.2015
10 http://w2ui.com/web/. Retrieved: 28.02.2015

Figure 2: “Numeric filter...” applied on column AgeAtOnset. The frame lists 143 numeric values, 0 non-

numeric and 14 blank cells. Observing data distribution and range, the user can detect a possible outlier

near value 45.

Figure 3: “Numeric filter...” applied on column AgeAtOnset. The user can isolate the range of values

between 20.00 – 45.00. On the datatable, data of that range are selected and depicted. Here, the user can

see a possible outlier (AgeAtOnset=44). The user decides if this value is indeed an outlier. If yes, he can edit

the cell and correct it (e.g. the correct AgeAtOnset for this case is 4.4).

Figure 4: “Numeric filter...” applied on column AgeAtOnset. The user can detect the missing values by

filtering the column by blank cells only (see ticked box ‘Blank’). This way, all patients with missing

“AgeAtOnset” values are displayed.

Figure 5: Interaction between multiple filters (in this case 3 filters) and the datatable. 15 records pass the

criteria of filters.

3.1.b Alphanumeric variables

The “String similarities...” option of the menu offers the possibility of typographical error detection in the

alphanumeric values of a column. Based on string distance metrics, the algorithm proposes groups of very

similar text values that are possibly typos of the same value. All the column's values, along with their

frequencies, are presented in a grouped table that appears on the left of the datatable. In each group, the algorithm proposes the value with the greatest frequency as the correct one. It is up to the user to decide which of the changes are meaningful to apply. See Figure 6 for an example.

Figure 6: “String similarities...” applied on column Sex. ‘Male’ and ‘eMale’ are grouped together, while ‘Female’ and ‘Femaled’ are also grouped together (common typographical errors). The user can merge the values of selected groups. The algorithm proposes as correct the value with the greatest frequency in each group, but the user can edit the cell and apply a preferred value as the merging value.
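The grouping idea can be sketched as follows. The document does not name the string distance metric that DCV uses, so this example substitutes the similarity ratio of Python's standard difflib.SequenceMatcher; the 0.8 threshold, the greedy grouping strategy and the function name group_similar are likewise assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def group_similar(values, threshold=0.8):
    """Greedily group similar strings around their most frequent variant.

    Values are visited from most to least frequent, so the key of each
    group is also the value proposed as the correct one.
    """
    freq = Counter(values)
    groups = {}  # proposed correct value -> members of the group
    for value, _count in freq.most_common():
        for representative in groups:
            if SequenceMatcher(None, value, representative).ratio() >= threshold:
                groups[representative].append(value)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups[value] = [value]
    return groups


sex = ["Male"] * 5 + ["Female"] * 4 + ["eMale", "Femaled"]
print(group_similar(sex))
```

With this threshold, ‘Male’ and ‘Female’ stay in separate groups even though they share several characters, which is why the threshold has to be chosen with some care.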

3.2. Data cleaning rules

In a relational database there are errors that occur over a relation. DCV can correct these errors in a semi-

automatic manner. Experts feed the tool with precise and fully accurate user-defined data cleaning rules.

Then DCV uses these rules to detect the errors. DCV supports the following three main categories of data

cleaning rules: Functional Dependencies (FDs), Conditional Functional Dependencies (CFDs) and Denial

Constraints (DCs).

3.2.a Functional dependencies

Functional dependencies are constraints between two sets of attributes in a relation. A functional dependency holds when one set of attributes uniquely determines another set of attributes, either from left to right only or in both directions (a one-to-one relationship). The syntax is as follows:

ColumnA, ColumnB | ColumnC, ColumnD...

meaning that columns A and B uniquely determine columns C and D. An example from daily life is country | capital, which indicates that tuples with the same value on country should also have the same value on capital. Because the relationship is one-to-one, it also means that tuples with the same value on capital should also have the same value on country.
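A check of this kind can be sketched in Python by grouping tuples on the determining columns and reporting every group that maps to more than one value on the determined columns. The function fd_violations and the sample rows are hypothetical; DCV itself evaluates rules through madIS, which is not shown here.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs, one_to_one=False):
    """Report, per determining value, the conflicting determined values.

    The rule lhs | rhs is checked from left to right; when one_to_one is
    set, the reverse direction rhs | lhs is checked as well.
    """
    def check(determinant, dependent):
        seen = defaultdict(set)
        for row in rows:
            key = tuple(row[c] for c in determinant)
            seen[key].add(tuple(row[c] for c in dependent))
        # A key with more than one dependent value violates the dependency.
        return {k: sorted(v) for k, v in seen.items() if len(v) > 1}

    violations = {"left_to_right": check(lhs, rhs)}
    if one_to_one:
        violations["right_to_left"] = check(rhs, lhs)
    return violations


rows = [
    {"country": "Greece", "capital": "Athens"},
    {"country": "Greece", "capital": "Athina"},
    {"country": "Italy", "capital": "Rome"},
    {"country": "Itly", "capital": "Rome"},
]
print(fd_violations(rows, ["country"], ["capital"], one_to_one=True))
```

The misspelt capital is caught by the left-to-right check, while the misspelt country is only caught because the rule is declared one-to-one.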

The user accesses this functionality by pressing the “Data cleaning” button on the upper toolbar and then selecting the option FD (Functional Dependency rule) from the drop-down list “Category of rule”. As shown in Figure 7, a functional dependency consists of two fields, the left-hand and the right-hand. Clicking on a field opens a context menu with the names of the columns, from which the user selects the columns to place on that side of the rule. If the rule follows a one-to-one relationship, the user must tick the corresponding checkbox, because by default the rule is applied from left to right only.

In the right part of the screen, there are three tabs. The first tab, “Info”, contains information about how the rule is defined. The second tab, “Rule statistics”, provides a piechart showing how many tuples are dirty (violate the rule) and how many are clean (do not violate the rule) out of the total tuples of the project. The third tab, “Tuples frequency”, provides a barchart per violation group for FD and CFD rules, showing the frequency of the different values of the group which violate the rule. The value with the greatest frequency is possibly the correct one, and the others the dirty ones.

Figure 7: Composing a functional dependency rule.

Let's form an FD rule. We assume that the attribute ‘PhysicianInformation_CityOfHospital’ uniquely determines the attribute ‘PhysicianInformation_CountryOfHospital’ (from left to right only). It means that tuples with the same value on the city column should also have the same value on the country column. The user selects the column ‘PhysicianInformation_CityOfHospital’ in the left field and the column ‘PhysicianInformation_CountryOfHospital’ in the right field. Subsequently, the user provides a name for the rule, so that it is saved for later use, and finally runs the rule. If the user does not provide a name, the system automatically asks for one.

Figure 8 shows the results after running the rule. A grouped table with the violations is returned. Each group contains tuples having the same value on the ‘PhysicianInformation_CityOfHospital’ column, but different values on ‘PhysicianInformation_CountryOfHospital’. The values which violate the rule are coloured in red. The user can either edit individual cells and correct the wrong values or apply a preferred merged value. The statistics offered by the “Rule statistics” and “Tuples frequency” tabs help the user make this decision.

Figure 8: The results after running the rule ‘PhysicianInformation_CityOfHospital’|

‘PhysicianInformation_CountryOfHospital’. The rule is saved under the name rule1. The piechart on the right

informs us about how many tuples are dirty and how many are clean after running the rule. Here, 48 clean

and 246 dirty tuples were found.

3.2.b Conditional functional dependencies

Conditional functional dependencies (CFDs) extend FDs with conditions. The syntax of a CFD rule is as follows:

ColumnA, ColumnB = (or !=) constant (or attribute) | ColumnC,

ColumnD = (or !=) constant (or attribute)...

An example of that type of rule is country=UK, area-code=131, street | postcode, city=Edinburgh, which

indicates that in the UK, if the area-code is 131, then the city has to be Edinburgh and street uniquely

determines postcode. In a one-to-one relationship, it also means that if the city is Edinburgh, then the

country has to be the UK and the area-code has to be 131, while postcode uniquely determines street.
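The left-to-right reading of this example can be sketched in Python. The sketch is restricted to conditions of the form column = constant, although the syntax above also allows != and comparisons with attributes, and it omits the one-to-one direction. The function cfd_violations is hypothetical and the street names and postcodes are made-up sample data.

```python
from collections import defaultdict


def cfd_violations(rows, lhs_cols, lhs_conds, rhs_cols, rhs_conds):
    """Check a CFD with equality-to-constant conditions, left to right only.

    Returns (constant, variable): tuples in scope that break a right-hand
    condition, and left-hand keys that map to several right-hand values.
    """
    # Only tuples matching every left-hand condition are subject to the rule.
    scope = [r for r in rows if all(r[c] == v for c, v in lhs_conds.items())]
    constant = [r for r in scope if any(r[c] != v for c, v in rhs_conds.items())]
    seen = defaultdict(set)
    for r in scope:
        seen[tuple(r[c] for c in lhs_cols)].add(tuple(r[c] for c in rhs_cols))
    variable = {k: sorted(v) for k, v in seen.items() if len(v) > 1}
    return constant, variable


rows = [
    {"country": "UK", "area_code": "131", "street": "High St",
     "postcode": "EH1 1AA", "city": "Edinburgh"},
    {"country": "UK", "area_code": "131", "street": "High St",
     "postcode": "EH1 9ZZ", "city": "Edinburgh"},
    {"country": "UK", "area_code": "131", "street": "Mill Rd",
     "postcode": "EH2 2BB", "city": "Glasgow"},
    {"country": "UK", "area_code": "20", "street": "High St",
     "postcode": "W1 1AA", "city": "London"},
]
constant, variable = cfd_violations(
    rows,
    lhs_cols=["street"],
    lhs_conds={"country": "UK", "area_code": "131"},
    rhs_cols=["postcode"],
    rhs_conds={"city": "Edinburgh"},
)
print(constant)
print(variable)
```

The conditioned columns country and area-code are not repeated in lhs_cols because they are constant within the scope of the rule, so adding them would not change the grouping.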

The user can form a CFD rule by selecting the option CFD from the “Category of rule” drop-down list. A CFD also consists of two fields, the left-hand and the right-hand. Furthermore, each field may have a condition.

Figure 9 shows the interface for composing a CFD rule. For example, let’s say there is a user who wants to

run the rule ‘Phys.Inf._CountryOfHospital’, ‘Phys.Inf._CountryOfHospital’=UK | ‘Phys.Inf._CityOfHospital’,

which indicates that when the ‘Phys.Inf._CountryOfHospital’ is the UK, then this country determines the

city. The user selects the column ‘PhysicianInformation_CountryOfHospital’ in the left field and the column

‘PhysicianInformation_CityOfHospital’ in the right field. Subsequently, he fills the left condition

(‘Phys.Inf._CountryOfHospital’ = UK) in the fields under the label “Left Conditions”. Figure 10 shows the

results after running this rule with the “Rule statistics” tab selected in the right part of the screen, whereas

Figure 11 shows the same but with the “Tuples frequency” tab selected instead.

Figure 9: Composing a conditional functional dependency rule.

Figure 10: The results after running the rule ‘Phys.Inf._CountryOfHospital’,

‘Phys.Inf._CountryOfHospital’=UK | ‘Phys.Inf._CityOfHospital’. The rule is saved under the name rule2. The

piechart on the right informs us that 217 clean and 77 dirty tuples were found.

Figure 11: Showing the results of rule2 as in Figure 10, but with the “Tuples frequency” tab selected,

showing the frequency of the different values of the group which violate the rule. The value with the

greatest frequency (here “LONDON”) could possibly indicate the correct one, while the others the dirty ones.

3.2.c Denial constraints

Denial constraints (DCs) define rules over one or more tuples in a relation. The syntax requires a conjunction of predicates, where each predicate compares a tuple.attribute with another tuple.attribute or with a constant, using one of the operators =, !=, <, <=, >, >=. The syntax is as follows:

not(t1.ColumnA > t1.ColumnB & t1.ColumnC = constant)

meaning that there should be no tuple with value on ColumnA greater than value on ColumnB and value on

ColumnC equal to some constant. For instance, the rule not(t1.DiseaseDuration > t1.AgeAtOnset) indicates

that there should be no tuple with value on DiseaseDuration greater than value on AgeAtOnset, given that

units of measurement are the same (e.g. years).

We can form a denial constraint by selecting the corresponding option from the “Category of rule” drop-down list. Figure 12 shows the interface that DCV offers for the composition of such a rule. The user can add as many rows (predicates) as needed by pressing the button “+” at the right, or delete existing ones by pressing the button “-” at the left. The predicates are combined with a logical “AND”, meaning that a tuple violates the rule only if it satisfies all the predicates.

An example of a denial constraint is the following. There should be no tuple for patients with value on

YearOfBirth greater than value on YearDiseaseOnset. This rule can be written as not(t1.YearOfBirth >

t1.YearDiseaseOnset). Figure 13 shows the results after running this rule. Two violations have occurred.

Values that participate in the schema of the rule are coloured in red. It is not known which one is the dirty

one. The expert user has to discover it.
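The evaluation of such a rule can be sketched in Python for the single-tuple case (predicates over t1 only). The function dc_violations, the predicate encoding and the sample patients are hypothetical, and the sketch assumes that the compared cells are present and of comparable types.

```python
import operator

# The six comparison operators allowed in a predicate.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def dc_violations(rows, predicates):
    """Return the rows that satisfy every predicate, i.e. violate the DC.

    Each predicate is (column, op, operand), where operand is either
    ("col", name) for another attribute of the same tuple or
    ("const", value) for a constant.
    """
    def holds(row, predicate):
        column, op, (kind, operand) = predicate
        right = row[operand] if kind == "col" else operand
        return OPS[op](row[column], right)

    return [row for row in rows if all(holds(row, p) for p in predicates)]


patients = [
    {"id": 1, "YearOfBirth": 2005, "YearDiseaseOnset": 2009},
    {"id": 2, "YearOfBirth": 2010, "YearDiseaseOnset": 2007},
]
# not(t1.YearOfBirth > t1.YearDiseaseOnset)
rule = [("YearOfBirth", ">", ("col", "YearDiseaseOnset"))]
violating_ids = [row["id"] for row in dc_violations(patients, rule)]
print(violating_ids)
```

As in the tool, the check only reports which tuples break the rule; it cannot tell whether the year of birth or the year of onset is the wrong value.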

Figure 12: Composing a denial constraint rule.

Figure 13: The results after running the rule not(t1.YearOfBirth > t1.YearDiseaseOnset). The rule is saved

under the name rule3. The piechart on the right informs us that 292 clean and 2 dirty tuples were found.

4. New derived columns

DCV allows clinicians and researchers to compute medical scores derived from one or more columns, or to discretise their data according to their own criteria. These two options are described in detail in the next two subsections.

4.1 Discretisation

Discretisation is available through the menu option “Discretise values”. It is applied on the values of a column. When the user clicks on this option, a new panel is created at the left under the “Results” tab. It asks the user to fill the fields with the preferred ranges and their respective discretised values. If the first cell stays blank, it means that there is no lower limit (minus infinity is assumed as the cell's value). Similarly, if the last cell stays blank, it means that there is no upper limit (infinity is assumed here). The user can add new ranges by pressing the button “+” at the right, or delete existing ones by pressing the red button “Del.”, which is located next to each range. The tool prevents errors by not allowing the user to continue filling in fields immediately after an erroneous value has been entered (e.g. when the left value of a range is greater than the right value). If the completed fields have no errors, the user can press the button “Compute” to execute the discretisation. A new column with the name discretised_XXX (where XXX is the name of the column whose values have been discretised) is produced next to the XXX column (Figure 14).

Figure 14: Discretisation of columns AgeAtOnset and DiseaseDuration.
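The range logic described above can be illustrated with a minimal Python sketch. It is not DCV's code. The function names, the sample ranges and the labels are hypothetical, and the sketch assumes half-open ranges (lower bound included, upper bound excluded), which the text does not specify. A blank cell is represented by None.

```python
import math
from typing import List, Optional, Tuple

# (lower bound, upper bound, discretised value); None stands for a blank cell.
Range = Tuple[Optional[float], Optional[float], str]


def bounds(low: Optional[float], high: Optional[float]) -> Tuple[float, float]:
    """Replace blank bounds with minus/plus infinity."""
    return (-math.inf if low is None else low, math.inf if high is None else high)


def validate(ranges: List[Range]) -> None:
    """Reject a range whose left value is greater than its right value."""
    for low, high, label in ranges:
        lo, hi = bounds(low, high)
        if lo > hi:
            raise ValueError(f"invalid range for {label!r}: {low} > {high}")


def discretise(value: float, ranges: List[Range]) -> Optional[str]:
    """Return the label of the first range containing value (assumed [low, high))."""
    for low, high, label in ranges:
        lo, hi = bounds(low, high)
        if lo <= value < hi:
            return label
    return None  # value falls outside every range


# Hypothetical ranges for the column AgeAtOnset.
age_ranges: List[Range] = [(None, 5, "early"), (5, 12, "middle"), (12, None, "late")]
validate(age_ranges)

rows = [{"AgeAtOnset": 2.0}, {"AgeAtOnset": 5.0}, {"AgeAtOnset": 11.5}, {"AgeAtOnset": 30.0}]
for row in rows:
    row["discretised_AgeAtOnset"] = discretise(row["AgeAtOnset"], age_ranges)

discretised = [row["discretised_AgeAtOnset"] for row in rows]
print(discretised)
```

Validating all ranges before computing mirrors the behaviour of the interface, where the “Compute” button is only useful once the fields contain no errors.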

4.2 Computation of medical scores

DCV offers its users an environment for the execution of arithmetic operations. madIS is the component that executes all these operations efficiently, from the simplest ones (+, -, *, /) to the most complicated. A pre-defined list of functions from Python's math library is provided for additional functionality. All these operations run as “row-function” queries in madIS, producing one value for each row. The madIS row function pyfunerrtonul is used for the execution of the Python ones. The provided categories of functions are the following: hyperbolic functions, numerical functions, trigonometric functions, power and logarithmic functions, angular functions and aggregate functions (zscores, average, etc.). Two constants are also provided for precision: pi = 3.141592... and e = 2.718281...

All this functionality is available through the “Compute medical scores” button of the toolbar. Figure 15 shows the interface that DCV offers for the computation of these scores. There are two tabs on the right: the “Columns” tab, which contains the names of the columns of the project, and the “Functions” tab, which contains the aforementioned functions by category. The user writes the formula for the medical score in the textarea field. Double-clicking a column's name inserts it into the textarea at the current cursor position. Similarly, double-clicking a function's name inserts the function's name together with the appropriate number of arguments. A single click on a function displays useful information about it under the “Compute score” and “Delete function” buttons.

Let us assume that we want to compute the score (ESR + log(CReactiveProtein)) / 2, which uses the function log(). The user writes the formula in the textarea exactly as it is: (ESR + log(CReactiveProtein)) / 2. As mentioned before, the columns and the functions can be selected from the tabs on the right. After writing the formula, the user must provide a name for the new derived column before clicking the “Compute score” button; if no name is provided, the system asks for one. There is also the optional field “Save function as”. If the user provides a name in this field, the formula is saved in the database under that name. The formula then constitutes a user-defined function and from then on appears along with the other functions in the “Functions” tab, under the category “User defined functions”. In this way, users can save their own formulas and reuse them as functions. Clicking the “Compute score” button computes the new derived column and adds it at the end of the datatable. Users can also delete a selected function by pressing the “Delete function” button; if no function is selected, a warning message appears (see Figures 15 and 16).
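The row-by-row evaluation of this score can be sketched in plain Python. This is an illustration rather than the madIS query that DCV runs. It assumes that log() is the natural logarithm of Python's math library and that a row whose evaluation fails receives a null value, which is what the name of the pyfunerrtonul row function suggests. The function name esr_crp_score and the sample rows are hypothetical.

```python
import math
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def esr_crp_score(row: Row) -> Optional[float]:
    """Evaluate (ESR + log(CReactiveProtein)) / 2 for one row.

    Returns None when the formula cannot be evaluated, e.g. for a
    missing value or a non-positive argument to log (assumed behaviour).
    """
    try:
        return (row["ESR"] + math.log(row["CReactiveProtein"])) / 2
    except (TypeError, ValueError):
        return None


# Hypothetical sample rows; only the column names come from the text.
table: List[Row] = [
    {"ESR": 20.0, "CReactiveProtein": 1.0},
    {"ESR": 35.0, "CReactiveProtein": 0.0},   # log(0) is undefined
    {"ESR": None, "CReactiveProtein": 4.2},   # missing ESR
]

# One value per row, stored in the new derived column.
for row in table:
    row["ESR_CRP_score"] = esr_crp_score(row)

scores = [row["ESR_CRP_score"] for row in table]
print(scores)
```

Returning a null for a failing row keeps one bad value from aborting the computation of the whole column, at the cost of hiding the reason for the failure.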

Figure 15: Computing a medical score

Figure 16: The new derived column ESR_CRP_score is the result of computing (ESR + log(CReactiveProtein)) / 2.

5. Data visualisation

Data visualisation is a helpful, often supplementary, tool in the cleaning process. DCV offers a wide range of plots and charts that users can use for visualising the data and their derived values. Visualisation components from the D3 JavaScript library (see footnote 11) were used for this purpose. On the client's request, the server sends the data to be visualised to the client in JSON format. A further asset of the new DCV is the interaction among these plots, the datatable and the filters.

5.1. Visualisation for discrete variables

DCV currently provides barcharts and piecharts for the visualisation of discrete variables.

5.1.a Barcharts

Vertical barcharts can be drawn for any column. This operation is available through the “Barchart” option of the “Data visualisation” menu of the toolbar. The user selects the preferred column from the context menu that opens and immediately sees a barchart of its values. A new div with the chart is appended on the left under the “Results” tab. Figure 17 shows the barchart of column “Sex”.

Figure 17: Barchart of column Sex.

The barchart lets the user see the distinct values of the column and check whether any of them is an error. The user can also maximise the barchart by clicking the “Maximise” button, or close it by clicking the “Close” button. Figure 18 shows the maximised view of the barchart of Figure 17.

11 http://d3js.org/. Retrieved 17/02/2015.

Figure 18: Maximised view of the previous barchart in Figure 17.

However, the most important feature of barcharts, and of any other DCV plot, is the interaction with the datatable. When the user clicks a bar, e.g. the bar “Femaled” (a typographical error), the datatable shows only the records with value “Femaled” in column Sex (Figure 19). The user can then edit the cell and correct the erroneous value.

Figure 19: The datatable displays only the record with value “Femaled” on column Sex.
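The two steps behind this interaction, counting the values of a column for the chart and filtering the records by the clicked value, can be sketched as follows. This is a hypothetical illustration, not DCV's server code: the function names, the layout of the JSON payload and the PatientID column are assumptions, while the column Sex and the value “Femaled” come from the text.

```python
import json
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]


def barchart_payload(rows: List[Row], column: str) -> str:
    """Serialise the value counts of a column as JSON (payload layout assumed)."""
    counts = Counter(row[column] for row in rows)
    bars = [{"value": value, "count": count} for value, count in sorted(counts.items())]
    return json.dumps({"column": column, "bars": bars})


def filter_rows(rows: List[Row], column: str, value: str) -> List[Row]:
    """Keep only the records whose column equals the clicked value."""
    return [row for row in rows if row[column] == value]


# Hypothetical sample records.
records = [
    {"PatientID": "p1", "Sex": "Female"},
    {"PatientID": "p2", "Sex": "Male"},
    {"PatientID": "p3", "Sex": "Femaled"},
    {"PatientID": "p4", "Sex": "Female"},
]

payload = json.loads(barchart_payload(records, "Sex"))
selected = filter_rows(records, "Sex", "Femaled")
print(payload["bars"])
print(selected)
```

A value that occurs only once, such as “Femaled” here, shows up as a short bar next to the expected categories, which is what makes typographical errors easy to spot in the chart.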

5.1.b Piecharts

Piecharts are available through the “Piechart” option of the “Data visualisation” menu of the toolbar. Like barcharts, these plots are interactive: clicking on a “piece” of the pie displays only the relevant records in the datatable (Figure 20).

Figure 20: Piechart of column Sex.

5.2. Visualisation for continuous variables

DCV currently provides scatterplots and linecharts for the visualisation of continuous variables.

5.2.a Scatterplots

Scatterplots are available through the “Scatterplot” option of the “Data visualisation” menu of the toolbar. The user is asked to select two columns of continuous variables to plot against each other. Figure 21 shows the scatterplot of CReactiveProtein vs ESR. A “reverse” button is also provided, which redraws the same data with the x and y axes interchanged. Furthermore, the user can click on a particular point of the scatterplot to go to the cell of the datatable whose value is displayed.

Figure 21: Scatterplot: CReactiveProtein vs ESR.

5.2.b Linecharts

Linecharts are available through the “Linechart” option of the “Data visualisation” menu of the toolbar. Figure 22 shows the linechart of CReactiveProtein vs ESR. Linecharts provide functionality similar to that of scatterplots, except that the line itself is not clickable, since individual data points are not displayed.

Figure 22: Linechart: CReactiveProtein vs ESR.

6. History and workflows

In this section, we describe the “History” and “Workflows” functionalities of DCV.

6.1 History

DCV keeps a history of every action that affects the values of the data during each workflow. The user can see the history records under the “History” tab. History records are expressed in natural language, so any user can read them. DCV also offers Undo/Redo on the history: the user can return to any previous step of the history and restore the older data, or move forward again to a more recent step. The checkboxes on the left of each history record serve exactly this purpose. When the user unticks a checkbox, the table automatically returns to the state it had before that history record and all more recent ones were applied. Conversely, when the user ticks an unticked checkbox, the table automatically moves to the state it had just after that history record was applied. Figure 23 shows the project's table in its current state, while Figure 24 shows it in the state before history record 2 and the more recent ones were applied.

Furthermore, the tracking of Do/Undo operations provides a solid basis for data auditing and data stewardship.
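The checkbox behaviour described above amounts to replaying a prefix of the recorded actions on the original data. The following Python sketch illustrates that idea under stated assumptions; it is not DCV's implementation. The History class, its method names, the zero-based record indices and the two sample edits are all hypothetical, and recording a new action simply re-applies every record, a case the text does not describe.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

Table = List[Dict[str, Any]]
Edit = Callable[[Table], None]  # an action that modifies the table in place


class History:
    """Keeps the original table and replays the ticked history records."""

    def __init__(self, initial: Table) -> None:
        self._initial = copy.deepcopy(initial)
        self._records: List[Tuple[str, Edit]] = []
        self._applied = 0  # number of leading records currently ticked

    def record(self, description: str, edit: Edit) -> None:
        self._records.append((description, edit))
        self._applied = len(self._records)

    def _check(self, index: int) -> None:
        if not 0 <= index < len(self._records):
            raise IndexError("no such history record")

    def untick(self, index: int) -> None:
        """Go back to the state before record `index` and all later ones."""
        self._check(index)
        self._applied = index

    def tick(self, index: int) -> None:
        """Go to the state just after record `index` was applied."""
        self._check(index)
        self._applied = index + 1

    def table(self) -> Table:
        table = copy.deepcopy(self._initial)
        for _, edit in self._records[: self._applied]:
            edit(table)
        return table


def fix_sex(table: Table) -> None:
    for row in table:
        if row["Sex"] == "Femaled":
            row["Sex"] = "Female"


def blank_negative_esr(table: Table) -> None:
    for row in table:
        if row["ESR"] is not None and row["ESR"] < 0:
            row["ESR"] = None


history = History([{"Sex": "Femaled", "ESR": -1}])
history.record("Replaced 'Femaled' with 'Female' in column Sex", fix_sex)
history.record("Blanked negative values in column ESR", blank_negative_esr)

current = history.table()
history.untick(1)
before_second = history.table()
history.tick(1)
restored = history.table()
```

Because the original data are never modified, every earlier state can be reconstructed, which is also what makes the history usable as an audit trail.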

Figure 23: The project's table at its current condition, with “History” steps appearing on the left div.

Figure 24: The project's table in the state before history record 2 and the more recent ones were applied.

6.2 Workflows

Last but not least, workflows are a major asset of DCV. While history records have a fixed order that follows the order of their execution, workflows let the user select which actions to save; the same sequence of steps (the workflow) can then be re-run in other projects or with other data.

Saving and running workflows are available through the “Workflow” button of the toolbar. The user selects the actions to include in the workflow by clicking the “+” button next to each history record under the “History” tab. When this button is clicked, a JSON object containing all the necessary information about the history record's action is appended to the textarea. The user can remove an action from the workflow by clicking the “-” icon, which replaces the “+” icon once the history record has been added to the workflow. When the workflow is complete, the user can save it by providing a name in the “Workflow” field and clicking the “Save workflow” button, which stores the workflow in the database. A workflow can also be run in the same project without being saved.

Let us assume that the user wants to run a saved workflow. Clicking the “Workflow” field opens a context menu with the names of all the workflows that have been saved in the database, from which the user selects the preferred one. The textarea is then automatically filled with the JSON objects of the workflow’s actions. After confirming that this is the workflow to run, the user clicks the “Run workflow” button and the workflow is executed. Workflows that are no longer needed can be deleted by selecting a workflow and clicking the “Delete workflow” button.
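A workflow stored as a list of JSON objects can be replayed on another table with a small dispatcher, as in the sketch below. This is only an illustration: the text does not give the structure of DCV's JSON objects, so the keys ("action", "column", "old", "new"), the two action types, the function names and the Notes column are all hypothetical.

```python
import json
from typing import Any, Callable, Dict, List

Table = List[Dict[str, Any]]


def replace_value(table: Table, column: str, old: Any, new: Any) -> None:
    """Replace every occurrence of `old` in `column` with `new`."""
    for row in table:
        if row.get(column) == old:
            row[column] = new


def drop_column(table: Table, column: str) -> None:
    """Remove `column` from every row."""
    for row in table:
        row.pop(column, None)


# Hypothetical mapping from action names to the functions that execute them.
HANDLERS: Dict[str, Callable[..., None]] = {
    "replace_value": replace_value,
    "drop_column": drop_column,
}


def run_workflow(workflow_json: str, table: Table) -> None:
    """Apply the actions of a workflow, in order, to the given table."""
    for step in json.loads(workflow_json):
        params = dict(step)
        handler = HANDLERS[params.pop("action")]
        handler(table, **params)


# A workflow composed from two selected history actions (assumed JSON layout).
workflow = json.dumps([
    {"action": "replace_value", "column": "Sex", "old": "Femaled", "new": "Female"},
    {"action": "drop_column", "column": "Notes"},
])

# The same workflow re-run on the data of another (hypothetical) project.
other_project = [{"Sex": "Femaled", "Notes": "n/a"}, {"Sex": "Male", "Notes": ""}]
run_workflow(workflow, other_project)
print(other_project)
```

Keeping each action as self-describing data rather than code is what allows a workflow to be saved in a database and executed later on different data.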

Figure 25: Composing a workflow.

Figure 26: The selected workflow has been executed in the same project changing its history.

7. Future work

The Data Curation and Validation (DCV) tool offers a semi-automatic cleaning process, as described in this deliverable. We plan to continue improving and evolving the tool with additional functionality, also taking into consideration the input we receive from users (clinicians, medical researchers, etc.) during the various training and testing sessions that will take place in MD-Paedigree.

We plan to design cleaning workflows and processes that are more automated. These enhancements may include, for example, automatic discovery of functional dependencies, conditional functional dependencies and denial constraints. Schema matching is also to be considered, because it is necessary both for integrating data from different hospitals and for a more advanced cleaning process. Machine learning algorithms will also be adopted for curation tasks. In addition, we plan to use DCV’s GUI for interacting with the data mining, analysis and knowledge discovery tools that will be developed in Task T16.1. In other words, we will incorporate the work produced under T16.1 within DCV, so that it includes some well-established supervised and unsupervised techniques addressing well-defined research tasks and capturing specific user requirements, such as dimensionality reduction and feature selection, similarity analysis, clustering, classification, etc. The final aim is to develop a fully integrated platform with a single web-based GUI for data analysis and KDD. In addition, by utilising EXAREME (previously named ADP) from T14.3, we could support Big Data analytics based on scalable data analysis techniques and distributed execution. Our ultimate goal is the encapsulation of a scientific workflow engine that will support scalable and reproducible scientific research, as well as interdisciplinary collaboration across different institutions and scientists.

8. References

[1] Ebaid, Amr, et al. “NADEEF: A Generalized Data Cleaning System”. Proceedings of the VLDB Endowment, Vol. 6, No. 12, pp. 1218-1221, 2013.

[2] Dallachiesa, Michele, et al. “NADEEF: A Commodity Data Cleaning System”. SIGMOD’13, June 22–27,

2013, New York, New York, USA.

[3] Stonebraker, Michael, et al. “Data Curation at Scale: The Data Tamer System”. 6th Biennial Conference

on Innovative Data Systems Research (CIDR ’13), January 6-9, 2013, Asilomar, California, USA.

[4] http://openrefine.org. Retrieved 17/02/2015.