A semantic based methodology to classify and protect sensitive data in medical records Flora Amato,...

A semantic based methodology to classify and protect sensitive data in medical records. Flora Amato, Valentina Casola, Antonino Mazzeo, Sara Romano. Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II, Naples, Italy


A semantic based methodology to classify and protect sensitive data in medical records

Flora Amato, Valentina Casola, Antonino Mazzeo, Sara Romano

Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II

Naples, Italy


Rationale

• Introduction to challenges in e-health;

• Motivation and open challenges;

• Proposal of access control policies;

• Methodology to extract the relevant information to protect and apply the proper security policy;

• A case study;

• Conclusion and future work


The Electronic Health

• E-Health challenges:

– To provide value-added services to the healthcare actors (patients, doctors, etc...);

– To enhance the efficiency and reduce the costs of complex information systems.

• The term e-health covers many meanings; we focus on those aspects of telemedicine that involve procedural issues as well as technological ones;

• In particular, we are witnessing a gradual adoption of innovative IT solutions for e-health but, at present, the major open issue is the coexistence of two different domains:


The coexistence of old and new systems from a security point of view…

1) Modern eHealth systems are designed to enforce fine-grained access control policies, and the medical records are structured a priori so that the different fields can be managed properly, but…

2) eHealth is also applied in contexts where new information systems have not been developed yet but “documental systems” have, in some way, been introduced. This means that today documental systems let users access a digitalized version of a medical record whose critical parts have not previously been classified.


Unstructured Medical record data and actors

Actors are not aware that structuring data is important for data processing and protection.

• Security problem

– private data (the critical part) can be accessed by unauthorized actors;

– it is not possible to enforce fine-grained access control on digitalized unstructured documents.

• Solution

– extract relevant information from the records;

– enforce access control policies.


Motivation and our proposal

• The problem: “documental systems” allow access to the digitalized version of a medical record (unstructured data) whose critical parts have not previously been classified.

• We propose a semantic-based method to locate the resource being accessed and associate it with the proper security rule to apply.

• The access control model is still based on fine-grained data classification.


Semantic method for resource classification

• Knowledge extraction by means of several text analysis methodologies.

[Figure: the four STEPS (1 to 4) of the semantic method]

• Running example:


Step 1 - Text Preprocessing: Tokenization and Normalization

• Goal:

– extraction of relevant units of lexical elements.

• Text tokenization:

– segmentation of a sentence into minimal units of analysis (tokens);

– disambiguation of punctuation marks, aiming at token separation;

– separation of continuous strings (i.e. strings that are not separated by blank spaces) to be considered as independent tokens: for example, in the Italian string “c’era” there are two independent tokens (c’ + era).

This segmentation can be performed by means of special tools, called tokenizers, which include glossaries of well-known expressions to be regarded as medical domain tokens and mini-grammars containing heuristic rules that regulate token combinations.

• Text normalization:

– variations of the same lexical expression should be reported in a unique way:

• (i) words that take on a different meaning depending on whether they are written in lower-case or capital letters;

• (ii) acronyms and abbreviations (“USA” or “U.S.A.”).
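As a rough illustration of Step 1, the sketch below tokenizes a sentence with a small regular-expression "mini-grammar" and then normalizes each token. It is not the tool used by the authors: the glossary entries, the acronym table, the case-sensitive set, and the function names `tokenize` and `normalize` are all assumptions made for this example.

```python
import re

# Hypothetical glossary: multi-word expressions to keep as single domain tokens.
GLOSSARY = ["blood pressure", "heart rate"]

# Hypothetical table mapping acronym variants to one canonical form.
ACRONYMS = {"u.s.a.": "usa", "e.c.g.": "ecg"}

# Hypothetical set of tokens whose meaning depends on capitalization.
CASE_SENSITIVE = {"US"}


def tokenize(text):
    """Split text into tokens, trying the most specific patterns first."""
    glossary = "|".join(
        re.escape(g) for g in sorted(GLOSSARY, key=len, reverse=True)
    )
    # Order: glossary expressions, dotted acronyms, elided forms (c’), words.
    pattern = rf"\b(?:{glossary})\b|(?:[A-Za-z]\.){{2,}}|\w+['’]|\w+"
    return re.findall(pattern, text, flags=re.IGNORECASE)


def normalize(token):
    """Report variants of the same expression in a unique way."""
    if token in CASE_SENSITIVE:
        return token  # keep the case when it carries meaning
    lowered = token.lower()
    return ACRONYMS.get(lowered, lowered)


if __name__ == "__main__":
    print(tokenize("c’era"))
    print([normalize(t) for t in tokenize("Blood pressure in the U.S.A.")])
```

The elision rule (`\w+['’]`) is tuned to the Italian example from the slide; other languages would likely need different heuristic rules.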


Step 2 - Morpho-syntactic analysis: POS tagging and Lemmatization

• Goal:

– extraction of word categories.

• Part-of-speech (POS) tagging:

– assignment of a grammatical category (noun, verb, etc.) to each lexical unit;

– word-category disambiguation: the vocabulary of the documents of interest is compared with an external lexical resource.

• Key-Word In Context (KWIC) analysis.

• Lemmatization:

– reducing the inflected forms to their respective lemmas.
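The following sketch illustrates two ideas from Step 2: a Key-Word In Context listing and a lookup of each token in an external lexical resource. The lexicon here is a hypothetical toy dictionary with made-up entries; a real system would rely on a full lexical resource and a proper POS tagger, which the slides do not name.

```python
# Hypothetical toy lexicon: inflected form -> (lemma, grammatical category).
LEXICON = {
    "patients": ("patient", "NOUN"),
    "reported": ("report", "VERB"),
    "pains": ("pain", "NOUN"),
}


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


def lemmatize(token):
    """Look up the lemma and category; unknown words are tagged 'UNK'."""
    return LEXICON.get(token.lower(), (token.lower(), "UNK"))


if __name__ == "__main__":
    words = "the patient reported chest pain and the pain persisted".split()
    for left, key, right in kwic(words, "pain"):
        print(f"{left:>20} [{key}] {right}")
    print(lemmatize("Patients"))
```

Inspecting the contexts returned by `kwic` is one way an analyst could decide between competing categories for an ambiguous word.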


Step 3 - Relevant Terms Recognition

• Goal:

– identification of terms useful to characterize the sections of interest.

• TF-IDF (Term Frequency - Inverse Document Frequency): relevant lexical items are frequent and concentrated in few documents.

$$W_{t,d} = f_{t,d} \cdot \log\left(\frac{N}{D_t}\right)$$

• term frequency (tf), $f_{t,d}$, corresponds to the number of times a given term occurs in the resource;

• inverse document frequency (idf), $\log(N/D_t)$, concerns the term distribution within all the sections of the medical records: it relies on the principle that term importance is inversely proportional to the number of documents from the corpus in which the given term occurs.
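A minimal sketch of the TF-IDF weighting above, assuming that N is the total number of sections in the corpus and D_t is the number of sections containing term t (the slides do not spell these out). The function name `tf_idf` and the sample sections are illustrative only.

```python
import math
from collections import Counter


def tf_idf(sections):
    """Compute W[t, d] = f[t, d] * log(N / D[t]) for every section d.

    `sections` is a list of token lists; the result is one
    {term: weight} dictionary per section.
    """
    n_sections = len(sections)
    doc_freq = Counter()
    for tokens in sections:
        doc_freq.update(set(tokens))  # count each term once per section

    weights = []
    for tokens in sections:
        counts = Counter(tokens)
        weights.append(
            {t: f * math.log(n_sections / doc_freq[t]) for t, f in counts.items()}
        )
    return weights


if __name__ == "__main__":
    corpus = [
        ["fever", "fever", "patient"],
        ["patient", "allergy"],
        ["patient", "diagnosis"],
    ]
    for section_weights in tf_idf(corpus):
        print(section_weights)
```

With this weighting, a term that occurs in every section (here "patient") gets weight zero, while a term that is frequent in one section only (here "fever") gets the highest weight.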


Step 4 - Identification of Concepts of Interest

• Goal:

– cluster relevant terms into synsets (sets of semantically equivalent terms) in order to associate each cluster with its semantic concept.
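One simple way to picture Step 4 is a lookup of each relevant term in a table of synsets, where each synset is attached to a concept. The synsets and concept names below are hypothetical, and the slides do not say how the authors build or obtain them.

```python
# Hypothetical synsets: concept -> set of semantically equivalent terms.
SYNSETS = {
    "diagnosis": {"diagnosis", "diagnosed", "finding"},
    "therapy": {"therapy", "treatment", "prescription"},
}


def concepts_for(terms):
    """Group relevant terms under the concept whose synset contains them."""
    found = {}
    for term in terms:
        for concept, synset in SYNSETS.items():
            if term in synset:
                found.setdefault(concept, []).append(term)
    return found


if __name__ == "__main__":
    print(concepts_for(["treatment", "finding", "fever", "therapy"]))
```

Terms that belong to no synset (such as "fever" in this toy table) are simply left unassigned.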


Security Policies

• At the end of the semantic analysis process, a medical record can be seen as composed of several sections (resources) that can be properly protected;

• A security policy is a set of rules structured as an ACL:

$(s_j, a_i, r_k)$

where:

– $s_j \in S = \{s_1, \dots, s_m\}$, the set of actors;

– $a_i \in A = \{a_1, \dots, a_h\}$, the set of actions;

– $r_k \in R = \{r_1, \dots, r_h\}$, the set of resources.
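The ACL structure can be sketched as a set of (actor, action, resource) triples with a membership check. The actors, actions, and resources listed here are invented for illustration and are not the policy from the case study.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """One ACL rule: actor s_j may perform action a_i on resource r_k."""
    actor: str
    action: str
    resource: str


# Hypothetical policy; the names are placeholders, not the authors' data.
POLICY = {
    Rule("doctor", "read", "diagnosis"),
    Rule("doctor", "write", "diagnosis"),
    Rule("nurse", "read", "therapy"),
    Rule("administrative staff", "read", "personal data"),
}


def is_allowed(actor, action, resource, policy=POLICY):
    """A request is allowed only if a matching rule exists in the policy."""
    return Rule(actor, action, resource) in policy


if __name__ == "__main__":
    print(is_allowed("doctor", "read", "diagnosis"))
    print(is_allowed("nurse", "read", "diagnosis"))
```

Treating the absence of a rule as a denial is a design assumption of this sketch; the slides only describe the allowed rules.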


Medical Record Policy (Use Case)

[Table: medical record policy relating actors, actions, and resources]


Action-actors identification

Given the policy and a resource $r^* \in R$, it is easy to locate the set of all allowed rules:

$$L_{r^*} = \{(s_j, a_i, r^*) : r^* \in R,\ a_i \in A^* \subseteq A,\ s_j \in S^* \subseteq S\}$$
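Reading the set of allowed rules for a resource as "all policy triples whose resource is r*", the lookup reduces to a filter over the policy. The sketch below assumes that reading; the policy contents and the names `rules_for` and `allowed_pairs` are hypothetical.

```python
# Hypothetical policy as (actor, action, resource) triples.
POLICY = {
    ("doctor", "read", "diagnosis"),
    ("doctor", "write", "diagnosis"),
    ("nurse", "read", "therapy"),
    ("nurse", "read", "diagnosis"),
}


def rules_for(policy, resource):
    """Return the subset of rules that refer to the given resource r*."""
    return {rule for rule in policy if rule[2] == resource}


def allowed_pairs(policy, resource):
    """Return the sorted (actor, action) pairs allowed on the resource."""
    return sorted((s, a) for s, a, r in policy if r == resource)


if __name__ == "__main__":
    print(rules_for(POLICY, "therapy"))
    print(allowed_pairs(POLICY, "diagnosis"))
```

The actors and actions appearing in the result correspond to the subsets S* and A* in the formula.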


System behavior: an example


Conclusions and Future works

• We have proposed a semantic approach to classify document parts (resources) from a security point of view;

• It is useful for associating a set of security rules with the resources;

• It is a promising method that can strongly help in facing the security issues that arise once data are made available for new potential applications.

• Future work:

– to test the methodology in other e-government fields;

– to implement a system that extracts and classifies resources on-line and enforces fine-grained policies with acceptable performance.
