L’Institut National des Sciences Appliquées de Lyon

Order No. 2010-ISAL-0120 Year 2010

THESIS

Presented to

L’Institut National des Sciences Appliquées de Lyon

To obtain THE DEGREE OF DOCTOR

Specialty: COMPUTER SCIENCE

Doctoral school: INFORMATIQUE ET MATHEMATIQUES (Computer Science and Mathematics)

By Hossam Aldin JUMAA

Title

AUTOMATISATION DE LA MEDIATION ENTRE XML ET DES BASES DE DONNEES RELATIONNELLES

“XML AND RELATIONAL DATABASES MEDIATION AUTOMATION”

Defended on 16 December 2010 before a jury composed of:

Bruno DEFUDE, Professor, Institut Telecom, Telecom SudParis (Examiner)

Jocelyne FAYN, Research Engineer, HDR, MTIC, INSERM, Bron (Thesis co-supervisor)

Frank MORVAN, Professor, IRIT, Univ. Paul Sabatier, Toulouse (Reviewer, examiner)

Paul RUBEL, Professor Emeritus, MTIC, INSA de Lyon (Thesis supervisor)

Christine VERDIER, Professor, LIG, Univ. Joseph Fourier, Grenoble (Reviewer, examiner)

MTIC Laboratory: Méthodologies de Traitement de l'Information en Cardiologie, Université Claude Bernard Lyon 1, INSA de Lyon, Bron, France


INSA Research Directorate - Doctoral Schools - Four-Year Period 2007-2010

Each entry gives the acronym, the doctoral school, and the name and contact details of its head.

CHIMIE
Doctoral school: CHIMIE DE LYON, http://sakura.cpe.fr/ED206
INSA: R. GOURDON
Head: M. Jean Marc LANCELIN, Université Claude Bernard Lyon 1, Bât CPE, 43 bd du 11 novembre 1918, 69622 VILLEURBANNE Cedex. Tel: 04.72.43 13 95. Fax: [email protected]

E.E.A.
Doctoral school: ELECTRONIQUE, ELECTROTECHNIQUE, AUTOMATIQUE, http://www.insa-lyon.fr/eea
INSA: C. PLOSSU, [email protected]
Secretariat: M. LABOUNE, AM. 64.43, Fax: 64.54
Head: M. Alain NICOLAS, Ecole Centrale de Lyon, Bâtiment H9, 36 avenue Guy de Collongue, 69134 ECULLY. Tel: 04.72.18 60 97. Fax: 04 78 43 37 17. [email protected]. Secretariat: M.C. HAVGOUDOUKIAN

E2M2
Doctoral school: EVOLUTION, ECOSYSTEME, MICROBIOLOGIE, MODELISATION, http://biomserv.univ-lyon1.fr/E2M2
INSA: H. CHARLES
Head: M. Jean-Pierre FLANDROIS, CNRS UMR 5558, Université Claude Bernard Lyon 1, Bât G. Mendel, 43 bd du 11 novembre 1918, 69622 VILLEURBANNE Cédex. Tel: 04.26 23 59 50. Fax: 04 26 23 59 49. 06 07 53 89 13. [email protected]

EDISS
Doctoral school: INTERDISCIPLINAIRE SCIENCES-SANTE
Sec: Safia Boudjema
INSA: M. LAGARDE
Head: M. Didier REVEL, Hôpital Cardiologique de Lyon, Bâtiment Central, 28 Avenue Doyen Lépine, 69500 BRON. Tel: 04.72.68 49 09. Fax: 04 72 35 49 16. [email protected]

INFOMATHS
Doctoral school: INFORMATIQUE ET MATHEMATIQUES, http://infomaths.univ-lyon1.fr
Head: M. Alain MILLE, Université Claude Bernard Lyon 1, LIRIS - INFOMATHS, Bâtiment Nautibus, 43 bd du 11 novembre 1918, 69622 VILLEURBANNE Cedex. Tel: 04.72. 44 82 94. Fax: 04 72 43 13 10. [email protected] - [email protected]

Matériaux
Doctoral school: MATERIAUX DE LYON
Secretariat: C. BERNAVON, 83.85
Head: M. Jean Marc PELLETIER, INSA de Lyon, MATEIS, Bâtiment Blaise Pascal, 7 avenue Jean Capelle, 69621 VILLEURBANNE Cédex. Tel: 04.72.43 83 18. Fax: 04 72 43 85 28. [email protected]

MEGA
Doctoral school: MECANIQUE, ENERGETIQUE, GENIE CIVIL, ACOUSTIQUE
Secretariat: M. LABOUNE, PM: 71.70, Fax: 87.12
Head: M. Jean Louis GUYADER, INSA de Lyon, Laboratoire de Vibrations et Acoustique, Bâtiment Antoine de Saint Exupéry, 25 bis avenue Jean Capelle, 69621 VILLEURBANNE Cedex. Tel: 04.72.18.71.70. Fax: 04 72 43 72 37. [email protected]

ScSo
Doctoral school: ScSo*
INSA: J.Y. TOUSSAINT
Head: M. OBADIA Lionel, Université Lyon 2, 86 rue Pasteur, 69365 LYON Cedex 07. Tel: 04.78.77.23.88. Fax: 04.37.28.04.48. [email protected]

*ScSo: History, Geography, Planning, Urbanism, Archaeology, Political Science, Sociology, Anthropology


Automating the Mediation between XML and Relational Databases

Summary

XML offers simple and flexible means for exchanging data between applications and has rapidly established itself as the de facto standard for data exchange between information systems. Relational databases, meanwhile, remain the most widely used technology for storing data, notably because of their scalability, reliability and performance. Combining the flexibility of the XML model for data exchange with the performance of the relational model for data archiving and retrieval is therefore a major issue. However, automating data exchange between the two remains a difficult task.

In this thesis, we present a new mediation approach that aims to automate data exchange between XML documents and relational databases independently of the schemas used to represent the source and target data. We first propose a generic mediation architecture model for these exchanges. To ease the configuration of specific interfaces, our architecture is based on the development of generic components that can be adapted to any XML source and any target relational database. These components are independent of any application domain and need to be customized only once for each pair of source data format and target storage format. Our mediator thus enables the automatic and consistent update of any relational database from XML data. It also makes it possible to retrieve data automatically and efficiently from a relational database and to publish them in XML documents (or messages) structured according to the requested exchange format. The transformation of a relational model into an XML Schema is one of the key elements of our mediator. We propose a methodology based on two successive algorithms: the first stratifies the relations into different levels according to the functional dependencies that exist between the relations and the keys of the relations; the second automatically transforms the relational model into an XML Schema, starting from the definition of a set of template fragments for the XML encoding of relations, attributes, keys and referential constraints. The proposed methodology preserves the referential integrity constraints of the relational schema and eliminates any data redundancy. It was designed to preserve the hierarchical representation of the relations, which is particularly important for generating correct SQL queries and for updating the data consistently. Our approach has been applied and successfully tested in the medical domain to automate data exchange between an XML representation of the SCP-ECG standard communications protocol, an ISO standard describing an open format for representing bio-signals and associated metadata, and a European reference relational model that notably includes the archiving of these data. Automating the mediation is particularly relevant in this domain, where electrocardiograms (ECGs) are the main means of investigation for detecting cardiovascular diseases and must be exchanged quickly and transparently between different healthcare systems, particularly in emergencies, given that the SCP-ECG protocol has many implementations because most of its sections and data fields are optional.

Keywords: XML, Relational databases, Mediation, Automated data exchange.
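
The stratification of relations into levels mentioned above can be pictured with a small sketch. It is only an assumed reading of the idea, not the algorithm of the thesis: relations without foreign keys form the first level, and every other relation is placed one level above the highest relation it references. The function name `stratify` and the table names are hypothetical, and every referenced relation is assumed to appear as a key of the input dictionary.

```python
from typing import Dict, List, Set


def stratify(foreign_keys: Dict[str, Set[str]]) -> List[List[str]]:
    """Group relations into levels from their foreign-key references.

    foreign_keys maps each relation name to the set of relations it
    references. Level 0 holds relations that reference nothing; a relation
    joins a level once everything it references sits in a lower level.
    """
    # Self-references are ignored so that they do not block placement.
    remaining = {rel: set(refs) - {rel} for rel, refs in foreign_keys.items()}
    placed: Set[str] = set()
    levels: List[List[str]] = []
    while remaining:
        ready = sorted(rel for rel, refs in remaining.items() if refs <= placed)
        if not ready:
            # Mutual references (or an unknown relation) cannot be layered.
            raise ValueError(f"cannot stratify: {sorted(remaining)}")
        levels.append(ready)
        placed.update(ready)
        for rel in ready:
            del remaining[rel]
    return levels


# Hypothetical schema: table names are illustrative only.
schema = {
    "patient": set(),
    "device": set(),
    "ecg_recording": {"patient", "device"},
    "ecg_lead": {"ecg_recording"},
}
print(stratify(schema))
# [['device', 'patient'], ['ecg_recording'], ['ecg_lead']]
```

Layering the relations this way gives a nesting order for a hierarchical XML representation: relations in lower levels can be emitted before the relations that depend on them.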

XML and Relational Databases Mediation Automation

Abstract

XML has rapidly emerged as a de facto standard for data exchange among modern information systems. It offers simple and flexible means to exchange data among applications. Meanwhile, relational databases are still the most widely used data storage technology in almost all information systems because of their scalability, reliability and performance. Thus, a crucial issue in data management is to bring together the flexibility of the XML model for data exchange and the performance of the relational model for data storage and retrieval. However, automating bi-directional data exchange between the two remains a challenging task.

In this dissertation, we present a novel mediation approach to automate data exchange between XML and relational data sources independently of the data structures adopted in the two data models. We first propose a generic mediation framework for data exchange between any XML document and any existing relational database. The architecture of the proposed framework is based on generic components, which ease the setup of specific interfaces adapted to any XML source and any target relational database. The mediator components are independent of any application domain and need to be customized only once for each pair of source and target data storage formats. Hence, our mediator provides automatic and coherent updates of any relational database from any data embedded in XML documents. It also makes it possible to retrieve data from any relational database and to publish them in XML documents (or messages) structured according to a requested interchange format. The transformation from a relational model to XML is a key component of the proposed mediator. We therefore propose a methodology and two algorithms to efficiently and automatically transform the relational schema of a relational database management system into an XML schema. Our transformation methodology preserves the integrity constraints of the relational schema and avoids any data redundancy. It has been designed to preserve the hierarchical representation of the relational schema, which is particularly important for the generation of correct SQL updates and queries in the proposed mediation framework. Another key component is the automation of SQL generation. We therefore devised a generic methodology and algorithms to automate the generation of the SQL queries required to update relational databases or to retrieve data from them.
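
As a rough illustration of the SQL generation described above, the sketch below builds a SELECT to test whether a row already exists and then an UPDATE or an INSERT, the order in which the table of contents lists the two generation steps. Everything specific here is an assumption made for the example and not the method of the thesis: the function name `store_xml`, the `patient` table, and the naive mapping in which an element name is a table name and its children are columns. It uses only Python's standard `sqlite3` and `xml.etree.ElementTree` modules.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_xml(conn, xml_text, primary_keys):
    """Store each child of the XML root as a row and return the SQL used.

    Assumed mapping (hypothetical): element tag = table name, child tags =
    column names. primary_keys maps a table name to its single key column.
    """
    executed = []
    for row in ET.fromstring(xml_text):
        table = row.tag
        pk = primary_keys[table]  # an unknown table raises KeyError
        values = {col.tag: col.text for col in row}
        # Identifiers are interpolated into SQL, so accept plain names only.
        if not all(name.isidentifier() for name in [table, *values]):
            raise ValueError("unexpected identifier in XML input")
        # Step 1: generated SELECT checks whether the row already exists.
        select_sql = f"SELECT 1 FROM {table} WHERE {pk} = ?"
        exists = conn.execute(select_sql, (values[pk],)).fetchone() is not None
        # Step 2: generated UPDATE or INSERT, depending on the result.
        others = [col for col in values if col != pk]
        if exists:
            if not others:
                continue  # nothing to update besides the key itself
            assignments = ", ".join(f"{col} = ?" for col in others)
            sql = f"UPDATE {table} SET {assignments} WHERE {pk} = ?"
            params = [values[col] for col in others] + [values[pk]]
        else:
            columns = ", ".join(values)
            marks = ", ".join("?" for _ in values)
            sql = f"INSERT INTO {table} ({columns}) VALUES ({marks})"
            params = list(values.values())
        conn.execute(sql, params)
        executed.append(sql)
    return executed


# Hypothetical usage with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id TEXT PRIMARY KEY, last_name TEXT)")
keys = {"patient": "patient_id"}
doc = "<batch><patient><patient_id>P1</patient_id><last_name>{}</last_name></patient></batch>"
print(store_xml(conn, doc.format("Doe"), keys))    # first call inserts
print(store_xml(conn, doc.format("Smith"), keys))  # second call updates
print(conn.execute("SELECT * FROM patient").fetchall())
```

Data values are passed as bound parameters; only table and column names are interpolated, which is why the sketch restricts them to plain identifiers. A real mediator would take those names from the mapped schemas and not from the incoming document.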

Our proposed framework has been successfully applied and tested in the healthcare domain between an XML representation of SCP-ECG, an open format ISO standard communications protocol embedding bio-signals and related metadata, and a European relational reference model including these data. Mediation automation is especially relevant in this field, where electrocardiograms (ECGs) are the main means of investigation for the detection of cardiovascular diseases and need to be exchanged quickly and transparently between health systems, in particular in emergencies, while the SCP-ECG protocol has numerous legacy implementations since most of the sections and data fields are not mandatory.

Keywords: XML, Relational databases, Mediation, Data exchange automation.

Table of Contents

Chapter 1 - Introduction
1.1 Introduction
1.2 Context & Motivation
1.3 Objective of Research
1.4 Scientific Contributions
1.5 Dissertation Organization

Chapter 2 - State of the Art & Related Work
2.1 Data Integration towards XML & Relational Database Integration
2.1.1 Introduction
2.1.2 Data integration: the basic concepts
2.1.2.1 Data integration definition
2.1.2.2 Data integration scenarios
2.1.3 Data integration architectures
2.1.3.1 Mediator based data integration architecture
2.1.3.2 Data exchange architecture
2.1.3.3 Peer-to-peer data integration architecture
2.1.4 Mapping composition in data integration
2.1.5 Data exchange vs. data integration
2.1.6 Main open problems in data integration
2.1.7 XML
2.1.7.1 Introduction to XML
2.1.7.2 XML schema languages (DTD vs. XSD)
2.1.7.2.1 XML schemas advantages
2.1.7.2.2 Document Type Definition language
2.1.7.2.3 XML Schema Definition language
2.1.7.2.4 XSD vs. DTD
2.1.7.3 XML query languages
2.1.7.3.1 XPath
2.1.7.3.2 XSLT
2.1.7.3.3 XQuery
2.1.7.3.4 XQuery update facility
2.1.7.4 XML parsers

Contents

HOSSAM JUMAA ii Thesis in Computer Sciences / 2010 Institut National des Sciences Appliquées de Lyon

2.1.7.4.1 DOM
2.1.7.4.2 SAX
2.1.8 XML data management and relational databases
2.1.8.1 Introduction to XML data management
2.1.8.2 XML vs. relational database
2.1.8.3 XML & relational databases integration
2.1.8.4 Relational publishing in XML
2.1.8.5 XML storage into relational databases
2.1.8.6 Relational databases update from XML
2.1.8.7 Bidirectional data exchange between XML and relational databases
2.1.9 Discussion
2.1.10 Conclusion
2.2 eHealth Challenges & Applications
2.2.1 Introduction
2.2.2 eHealth
2.2.3 European research challenges in eHealth
2.2.3.1 Standard Communications Protocol for Computer-Assisted Electrocardiography (SCP-ECG) project
2.2.3.2 Open European Data Interchange and Processing for Electrocardiography (OEDIPE) Project
2.2.4 Information Retrieval in Healthcare
2.2.5 Interoperability in Healthcare
2.2.5.1 Interoperability definition
2.2.5.2 Interoperability in healthcare definition
2.2.6 Electronic Health Record (EHR)
2.2.6.1 EHR definition
2.2.6.2 EHR benefits
2.2.6.3 Implementation of the Electronic Health Record in France, called DMP
2.2.7 EHR interoperability
2.2.7.1 Standard Architecture for Healthcare Information Systems: HISA
2.2.7.2 EHRcom standard
2.2.7.3 HL7 standard
2.2.7.4 Integrating the Healthcare Enterprise (IHE)
2.2.7.5 Digital Imaging and Communications in Medicine (DICOM)
2.2.7.6 Synthesis of the EHR standards
2.2.8 Conclusion

Chapter 3 - XML and Relational Databases Mediation Automation
3.1 A Generic Mediation Architecture
3.1.1 Introduction
3.1.2 General data exchange scenarios
3.1.3 Mediation framework design
3.1.4 Mediator’s components
3.1.4.1 XSD Generator from XML
3.1.4.2 XSD Converter from R Schema
3.1.4.3 Schemata Manager
3.1.4.4 SQL Generator
3.1.5 Mediator functions

3.1.5.1 XML to DB function
3.1.5.2 DB to XML function
3.1.6 Discussion
3.1.7 Conclusion
3.2 Automation of the Transformation from Relational Models to XML Schemas
3.2.1 Introduction
3.2.2 Transformation methodology
3.2.2.1 Relations stratification - definitions
3.2.2.2 Stratification algorithm
3.2.2.2.1 Algorithm description
3.2.2.3 XML Schema fragments templates
3.2.2.3.1 Root fragment template
3.2.2.3.2 Relation fragment template
3.2.2.3.3 Data type and domain fragment template
3.2.2.3.4 Attribute fragment template
3.2.2.3.5 Primary key constraint fragment template
3.2.2.3.6 Foreign key constraint fragment template
Nesting fragment template
Key reference fragment template
3.2.2.3.7 Uniqueness constraint fragment template
3.2.2.3.8 Non-nullable constraint fragment template
3.2.2.4 Target XML Schema transformation algorithm
3.2.3 Discussion
3.2.4 Conclusion
3.3 Automatic SQL Generation
3.3.1 Introduction
3.3.2 SQL auto-generation methodology
3.3.2.1 SQL auto-generation to update relational DBs from any XML format
3.3.2.1.1 Step1: SQL select generation
3.3.2.1.2 Step2: SQL update/insert generation
3.3.3 SQL auto-generation to query and retrieve data from any RDB and to embed them into any XML format
3.3.4 Conclusion

Chapter 4 - Validation & Application in eHealth
4.1 Introduction
4.2 SCP-ECG protocol
4.3 OEDIPE reference relational model for the storage of ECG data
4.3.1 Data acquisition sub-model
4.4 XML based mediation for data exchange between SCP-ECG files and relational databases
4.5 Meta-model manager
4.5.1 Protocol meta-model
4.5.2 Database meta-model
4.5.3 Mapping meta-model
4.5.4 Rule-base
4.6 Mediator functions
4.6.1 Protocol to database function
4.6.2 Database to protocol function

4.7 Discussion
4.8 Conclusion

Chapter 5 - Conclusion
5.1 Summary
5.2 Directions for future research

Table of figures

Figure 2-1 Mediator based data integration architecture
Figure 2-2 Data exchange architecture
Figure 2-3 Peer to peer data integration architecture
Figure 2-4 XML data management scenarios
Figure 2-5 The interoperability framework of SIS architecture
Figure 2-6 The HISA Standard Architecture [CEN/TC251 1999]
Figure 2-7 Extract from the standard reference model EHRcom around the root class “EHR_EXTRACT” [Kalra 2004]
Figure 2-8 Defining the structure of an HL7 v3 message
Figure 2-9 Preview of Clinical Document Architecture in the HL7 XML format
Figure 3-1 The exchange scenarios between XML files and relational databases
Figure 3-2 Overall architecture of the proposed data interchange XML-based mediator between XML documents and relational database management systems
Figure 3-3 Global functional architecture of the mediation solution between XML documents and relational databases
Figure 3-4 XML to DB function
Figure 3-5 DB to XML function
Figure 3-6 Relational database model to XML Schema transformation
Figure 3-7 A principle example of stratification results into three levels (figure 3-7-a) from a physical relational model (figure 3-7-b)
Figure 3-8 Automatic SQL generation schema (XML to RDB function)
Figure 3-9 Automatic SQL generation schema (DB to XML function)
Figure 4-1 SCP-ECG data structure [CEN/TC251 2005]
Figure 4-2 SCP-ECG section layout overview
Figure 4-3 OEDIPE data acquisition sub-model (physical schema) [Fayn et al. 1994a], [Fayn et al. 1994b]
Figure 4-4 XML based mediator architecture for automating the storage of SCP-ECG data into relational databases
Figure 4-5 Overall view of the XML schema of the SCP-ECG protocol meta-model
Figure 4-6 XMLSpy© Editor snapshot of a part of section 1 (Header Information-patient data/ECG acquisition data) of the SCP-ECG protocol XML schema
Figure 4-7 Sample result of the OEDIPE “ECG acquisition” relational schema stratification according to algorithm 1 described in section 3.2.2.2. We obtained four levels
Figure 4-8 XMLSpy© Editor snapshot of an excerpt of the target XML schema generated by the transformation algorithm 2 described in section 3.2.2.4
Figure 4-9 Snapshot of the XSLT template that maps date of birth from the SCP-ECG protocol to the database meta-model instance
Figure 4-10 The two main functions of the mediator


Chapter 1 - INTRODUCTION

1.1 Introduction

Thanks to the rapid development of Information and Communication Technologies (ICTs), the world is changing from an industrial society to an information society in which information is considered one of the most important assets of any organization. The methods of communication have also changed considerably: the mobile phone, electronic mail and videoconferencing offer new options to share information. This increases the possibility for individuals and organizations to share their experiences and knowledge. Therefore, in most domains, enterprises and organizations seek to share their data and open their systems to collaborate with their partners. This makes the design of more intelligent, pervasive and cost-effective information systems one of the most challenging goals.

In fact, we have recently seen an extraordinary evolution in the design of new digital systems and intelligent communication, from the mobile phone to computing and data grids, via the smart PDA and intelligent clothing incorporating a multitude of sensors. However, modern enterprises and organizations still rely largely on their old legacy applications and systems to carry out their daily business. Therefore, the design of new information systems becomes more challenging, especially when they must integrate numerous heterogeneous and distributed systems equipped with a mix of old and new technologies.

For instance, in the eHealth field, healthcare has become a cooperative process that involves several individuals with different but complementary tasks and skills (e.g. general practitioners, specialists, nurses, pharmacists, social assistants, etc.). Moreover, the mobility of people, long-duration treatments, the multiplicity of stakeholders and the huge volume of data generated and exchanged are factors that require new, more intelligent and pervasive medical information systems able to retrieve patient-related information from any Electronic Health Record (EHR) or any medical information system in order to improve the quality of care and of patient follow-up [Verdier 2006]. Indeed, there are many problems to be faced in the medical informatics and eHealth domain. The main ICT challenge in the healthcare field is to enable “more informed decision-making and more cost-effective use of resources” [Dzenowagis and Kernen 2005] by developing new medical information systems capable of giving the right information, when and where needed. To fulfil these objectives, ICTs must track and provide patient data and ensure that they are correct and complete. In addition, they must ensure effective and timely communication, between patients and professionals as well as between healthcare professionals, to enable efficient sharing of medical knowledge and increase the quality of care. Finally, ICTs must be able to deliver high quality services despite distance and time.

To achieve the goal of “designing more intelligent, pervasive and cost-effective information systems” by adopting and using the new ICTs, we aim to design new generic models and new methodologies that facilitate access to patient-related information, when and where needed, from any electronic health record or medical system, and that assist healthcare providers in taking the right decision about the health of the patient. Therefore, we face different challenging issues: information retrieval, interoperability, data integration from different sources and data exchange.

1.2 Context & Motivation

Data integration and data exchange are major research challenges in the domain of information systems. Nowadays, XML is a key technology in this field. Enterprises increasingly collaborate through XML based data exchange technologies and Web services. However, enterprise information systems mainly store data in relational databases (RDB), which are a well-proven, dominant technology for structured data storage and retrieval. RDBs make it possible to store, query and retrieve huge volumes of data efficiently. They are still the most important type of data source.

This situation has triggered several attempts to bring together the flexibility of the XML model as a data exchange standard and the performance of the relational model as a data storage and retrieval technology. Therefore, many systems and techniques have been developed for publishing relational databases in XML format and for storing XML data into new, compliant relational databases. However, the automation of bi-directional data exchange between the two remains a challenging task. In fact, there is a need not only to publish relational data in XML format, but also to update existing relational databases from XML data. The latter problem remains a challenge and has not received much attention. This operation should be automatic and transparent to the user, and should dynamically support open structures of incoming data. Indeed, it is all the more important to preserve business oriented database models because of their efficiency for data retrieval via legacy human-machine interfaces.

Motivating example

In this context, we were motivated by a real scenario from the cardiology field where the Electrocardiogram (ECG) remains one of the most important non-invasive diagnostic tools for early recognition of heart diseases.

On the one hand, the patient’s ECG data are often recorded by different acquisition devices and stored according to either a proprietary format or the SCP-ECG format. SCP-ECG is an international standard communication protocol to exchange ECG data between health systems [ISO/DIS 11073-91064 2007]. It is an open protocol and may have many implementations since most of the sections and of the data fields are not mandatory. On the other hand, pre-existing relational databases store the ECG data according to heterogeneous schemas. A reference core database model is the OEDIPE model [Fayn et al. 1994a], which has been proposed within the framework of a European project for the storage of serial ECG data and related metadata. The model, composed of 50 tables and 200 attributes, has been designed to support a modular implementation in various sub-models according to different scenarios of use of the ECG data.

Therefore, we need to find an efficient solution for storing any ECG data file created by any device using a proprietary format or compliant with the SCP-ECG protocol into any already existing relational database such as a sub-model of the OEDIPE relational model. A generic interface able to extract data from any SCP-ECG file coming from any manufactured device, and to automatically store these data, transparently for the user, into any relational database of health records, is thus needed.

The outcomes of our research should be generic enough to be applicable to a wide range of other applications and domains. E-commerce, for example, is another key domain where our research should have an important impact. Enterprises in that domain are engaged in business workflows and exchange XML-based data (e.g. Web service based business workflows). However, most of the exchanged data are stored in relational databases inside the enterprise information system. For instance, the enterprise information system may export some information about orders in XML format to its partners. The partners may then change that information (e.g. the quote, billing information, delivery address, etc.) in the XML documents and return them to the enterprise. The changed information in the returned XML documents (e.g. the quote) must then be imported correctly and transparently into the underlying relational databases.
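To make the round trip in this e-commerce scenario concrete, the following minimal Python sketch imports a returned XML order document into a relational table. The table `orders`, its columns and the XML element names are hypothetical and chosen only for illustration; the element-to-column mapping is written by hand here, which is precisely the step that a generic mediator would aim to automate.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical relational table standing in for the enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, quote REAL, delivery_address TEXT)"
)
conn.execute("INSERT INTO orders VALUES (1, 100.0, '1 Old Street')")

# Hypothetical XML document returned by a partner with a changed quote
# and a changed delivery address.
returned_xml = """
<orders>
  <order id="1">
    <quote>95.5</quote>
    <deliveryAddress>2 New Avenue</deliveryAddress>
  </order>
</orders>
"""


def import_orders(connection, xml_text):
    """Apply the values found in the XML document to the orders table."""
    root = ET.fromstring(xml_text)
    updated = 0
    for order in root.findall("order"):
        cursor = connection.execute(
            "UPDATE orders SET quote = ?, delivery_address = ? "
            "WHERE order_id = ?",
            (
                float(order.findtext("quote")),
                order.findtext("deliveryAddress"),
                int(order.get("id")),
            ),
        )
        updated += cursor.rowcount  # number of rows changed by this UPDATE
    connection.commit()
    return updated


updated_rows = import_orders(conn, returned_xml)
row = conn.execute(
    "SELECT quote, delivery_address FROM orders WHERE order_id = 1"
).fetchone()
print(updated_rows, row)
```

In this sketch any change to the XML structure or to the relational schema requires editing `import_orders` by hand, which illustrates why a schema-independent approach is attractive.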

1.3 Objective of Research

We aim to design and develop an intelligent mediator that connects and maps different representations (e.g. XML and RDB) in order to automate data exchange. We thus tackle the problem of interoperability in the data exchange between RDBs and XML, and we focus particularly on the problem of data exchange between collaborating and loosely coupled information systems.

Hence, in this dissertation, we present our research results on designing and developing new generic information models, IT architectures and methodologies to enable automatic and transparent data exchange between the XML data format and relational databases. We are particularly interested in taking advantage of the new ICTs in order to access the right information when and where needed, by ensuring an efficient and transparent data exchange between different representations of data from different heterogeneous systems.

Therefore, we propose a generic XML-based data mediation framework to exchange data between any open structure of XML documents and any relational database of a relational database management system. The proposed framework will then be applied and tested on an application from the cardiology domain to demonstrate the validity of our proposal. However, our solution may be used in other application domains.

1.4 Scientific Contributions

In this dissertation, we present a novel mediation approach to automate data exchange between XML and relational data sources independently of the data structures adopted in the two data models. The contributions of this thesis are as follows:

· We first propose a generic mediation framework for the data exchange between any XML document and any existing relational database. The architecture of our proposed framework is based on the development of generic components, which will ease the setup of specific interfaces adapted to any XML source and any target relational database. The mediator components are independent of any application domain, and need to be customized only once for each pair of generic source and target data storage formats. Hence, our mediator will provide automatic and coherent updates of any relational database from any data embedded in XML documents. It will also allow retrieving data from any relational database and publishing them into XML documents (or messages) structured according to a requested interchange format.

· The transformation from a relational model to XML represents a main key component of our proposed mediator. Thus, we will propose a methodology and devise two algorithms to efficiently and automatically transform the relational schema of a relational database management system into an XML schema. Our transformation methodology preserves the integrity constraints of the relational schema and avoids any data redundancy. It will be designed in order to preserve the hierarchical representation of the relational schema, which is particularly important for the generation of correct SQL updates and queries in the proposed mediation framework.

· Another key component is the automation of the SQL generation. Therefore, we will devise a generic methodology and algorithms to automate the generation of the SQL statements required to update the relational databases or to retrieve data from them.

· Finally, our proposed framework will be applied and tested in the healthcare domain between XML representations of SCP-ECG, an open format ISO standard communications protocol embedding bio-signals and related metadata, and a European relational reference model including these data. The mediation automation is especially relevant in this field where electrocardiograms (ECG) are the main investigation for the detection of cardiovascular diseases, and need to be quickly and transparently exchanged between health systems, in particular in emergency, whereas the SCP-ECG protocol has numerous legacy implementations since most of the sections and of the data fields are not mandatory.

1.5 Dissertation Organization

The remainder of this dissertation is organized as follows:

In chapter 2, we review the different works related to the interoperability challenges in both the data integration and the eHealth domains. This chapter is therefore divided into two main parts. In the first part, we introduce the basic concepts from the data integration and the data exchange fields and introduce XML as a key technology in these fields. Then, we give a detailed literature review of previous works on the integration between XML and relational databases. In the second part, we review the particularity of the interoperability problem in the eHealth field and analyse the previous related work done in this field.

In chapter 3, the core of our contributions, we present our new mediation approach for the automatic bi-directional data exchange between any XML source and any relational database. This chapter is divided into three main parts: In the first part (section 3.1), we describe the architecture of the proposed generic mediation framework to automate the data exchange between any XML file and any relational database. This framework supports the automatic updating of any relational database from any XML file as well as the retrieval of data from any relational database and their storage into any predefined or required XML format. In the second part (section 3.2), we describe a methodology for the automatic transformation of the relational model into XML schema. This transformation methodology has been adopted and used in our proposed mediator to automatically generate an XSD schema from any relational database model. In the third part (section 3.3), we present the automatic SQL generation methodology that has been used through our mediation framework.

In chapter 4, we give a case study from the health domain where we have successfully applied and tested our proposed framework between an XML representation of the SCP-ECG standard communications protocol and the OEDIPE relational reference model including these data, in addition to other data and related meta-data.

In Chapter 5, we provide some concluding remarks and discuss directions for future research.


Chapter 2 - STATE OF THE ART & RELATED WORK

This chapter presents the state of the art of the recent works related to the interoperability challenges in the data integration domain.

In the first part, we introduce the basic concepts from the data integration and the data exchange fields and introduce XML as a key technology in these fields. Then, we give a detailed literature review of previous works about the integration between XML and relational databases.

In the second part, we review the particularity of the interoperability problem in the eHealth domain and present some previous related work done in this field.

2.1 Data Integration towards XML & Relational Database Integration

2.1.1 Introduction

In today’s modern information systems, data integration has become a pervasive challenge, where querying data across multiple autonomous and heterogeneous collaborative information systems and data sources is crucial. Enterprises seek more business opportunities in highly competitive markets. Thus, to meet the business requirements, the paradigm of new dynamic and cost-effective data integration for existing information systems becomes more and more important. Indeed, data integration facilitates information access and reuse, and makes it possible to satisfy the needs of businesses and customers by providing more complete information.


Data integration, as one of the oldest database research fields, has been well studied and has yielded a large number of theoretical works in the literature [Lenzerini 2002], [Ziegler and Dittrich 2004], [Lenzerini 2005], [Halevy et al. 2006], [Ziegler and Dittrich 2007]. It emerged shortly after database systems were introduced into the business world. Many sub-fields and directions have been developed from data integration research. The authors of [Halevy et al. 2006] divide these directions into: generating schema mappings, query answer processing, the role of XML, model management, peer to peer data management and the role of artificial intelligence. In spite of the considerable research carried out in each of these fields, there remain many open problems and challenges to be addressed and solved.

Extensible Markup Language (XML) [W3C 1998] has had an important role in the data integration domain over the last decade. Several integration systems and solutions have been developed using XML as the underlying data model and XML query as the query language. Indeed, XML plays an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. It offers a mechanism for structuring data for large-scale electronic publishing thanks to a common syntactic format. The XML technology and its related languages and derivatives (XML Schema, XSLT, XQuery…) now provide powerful tools for sharing, converting and exchanging information via networked computers, taking into account both the syntax and the semantics of the data.

The remainder of this section is organized as follows. First, we review the main theoretical aspects and works in the data integration field. We introduce the problem of data integration by its definition and its main scenarios. Next, we present the main data integration architectures: the classical mediation architecture, the data exchange architecture, and the peer to peer data integration architecture. Then, we give a formal comparison between the data exchange and data integration systems. After that, we review the different approaches and languages used in the mapping composition of the data integration systems: Global as View (GaV), Local as View (LaV), Global Local as View (GLaV) and BYU-Global-Local-as-View (BGLaV). Next, we summarize some of the remaining open problems in the data integration field.

Secondly, we introduce the XML technology and its related languages and standards: XML schema languages (DTD and XSD), XML query languages (XPath, XSLT, XQuery and the XQuery Update Facility) and the XML parsers (DOM and SAX).


Finally, we review in detail the XML data management role in the data integration domain and focus on the problem of data integration when relational databases are used as the main data storage technology and XML is used as the main data exchange technology.

2.1.2 Data integration: the basic concepts

Data integration is one of the oldest research fields in databases and it has been the focus of extensive theoretical work. However, numerous open problems of data integration remain to be solved [Ziegler and Dittrich 2004], [Halevy et al. 2006], [Ziegler and Dittrich 2007]. Hereafter, we will summarize some of the main recent papers in this research area.

2.1.2.1 Data integration definition

Data integration is the process of combining data residing at different sources and providing the user with a unified and transparent access to these data [Ullman 2000], [Halevy 2001], [Lenzerini 2002]. Data integration is an important issue in business and scientific scenarios. In the business domain, it is often required to integrate the customer data that may be stored in different databases located at different companies. In scientific scenarios, it is often necessary to combine the research results (i.e. experiments, findings, etc.) that may be scattered across different repositories. Data integration aims to facilitate information access and reuse through a single virtual information access point and to provide a more comprehensive view to satisfy the need. This problem becomes more important as the volume of data and the need for sharing increase.

2.1.2.2 Data integration scenarios

We distinguish two main scenarios of data integration:

(1) The tightly coupled approach, commonly known as data warehousing. This scenario of data integration consists in replicating the data of interest from the external sources on a local server in order to work on these data locally.

(2) The loosely coupled approach, known as data mediation. This scenario relies on querying a virtual global representation of data and providing query rewriting mechanisms to fetch just the necessary data from the sources at the query time.

Indeed, in the first approach, data sources are largely dependent on each other. It is often used when latency is an important factor in the integrated system. However, most of the time, this approach is very expensive in terms of the money and effort required to set up the integration system. Its architecture is not very flexible, it does not promote the reusability of code, and it makes changes much more difficult. In contrast, in the second scenario, data sources operate independently of each other. Communication between them is often done via a mediating queuing system. Therefore, the latter approach is more widely used, especially for its cost-effectiveness and flexibility.
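The difference between the two scenarios can be illustrated with a minimal Python sketch. The class names `Warehouse` and `VirtualMediator`, the record layout and the sample data are assumptions introduced only for this illustration: the warehouse copies the source data once at load time, whereas the mediator keeps only references to the sources and reads them at query time.

```python
# Two hypothetical autonomous sources, modelled as in-memory lists of records.
source_a = [{"patient": "p1", "heart_rate": 72}]
source_b = [{"patient": "p2", "heart_rate": 88}]
sources = [source_a, source_b]


class Warehouse:
    """Tightly coupled scenario: data are replicated locally at load time."""

    def __init__(self, data_sources):
        self.local_copy = [dict(r) for src in data_sources for r in src]

    def query(self, predicate):
        return [r for r in self.local_copy if predicate(r)]


class VirtualMediator:
    """Loosely coupled scenario: sources are read only when a query arrives."""

    def __init__(self, data_sources):
        self.data_sources = data_sources

    def query(self, predicate):
        return [r for src in self.data_sources for r in src if predicate(r)]


def is_fast(record):
    return record["heart_rate"] > 80


warehouse = Warehouse(sources)
mediator = VirtualMediator(sources)

# A source changes after the warehouse has been loaded.
source_b.append({"patient": "p3", "heart_rate": 101})

warehouse_result = warehouse.query(is_fast)  # does not see the new record
mediator_result = mediator.query(is_fast)    # sees the new record
print([r["patient"] for r in warehouse_result])
print([r["patient"] for r in mediator_result])
```

The sketch leaves out query rewriting and any network cost; it only shows that a replicated copy can become stale while a virtual view reflects the current state of the sources.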

2.1.3 Data integration architectures

From the literature review we distinguish three types of data integration architectures:
● The mediator based architecture, also called the classical data integration architecture.
● The data exchange architecture.
● The peer-to-peer data integration architecture.

In the following sub-sections we will present each of these three different architectures.

2.1.3.1 Mediator based data integration architecture

The mediator based architecture in [Wiederhold 1992] is one of several architectures that have been proposed to deal with the problem of integrating heterogeneous information systems. This architecture, also known as the classical data integration architecture, is schematically depicted in Figure 2-1. In this architecture the goal is to query heterogeneous data in different sources via a virtual global schema, such that all queries are expressed over this virtual global schema (called mediated schema) [Lenzerini 2005]. The data are stored in a set of heterogeneous and autonomous sources which are wrapped by wrappers or translators to provide a uniform data model view of the data stored in the sources. The mediators combine the answers coming from wrappers and/or other mediators.

Figure 2-1 Mediator based data integration architecture


Indeed, a mediator is a piece of smart software that improves access to the type and quality of information needed, for instance for decision making. Mediators are common in software design, but their architectures and implementations vary. The main advantage of mediation is to manage complexity and ensure the maintainability of the data integration system.
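As an illustration of the wrapper and mediator roles described above, the following Python sketch wraps two heterogeneous sources (one CSV text, one XML text) so that both expose the same record format, and lets a mediator combine their answers. All class names, field names and sample data are hypothetical; the sketch is not taken from [Wiederhold 1992] and ignores query rewriting.

```python
import csv
import io
import xml.etree.ElementTree as ET


class CsvWrapper:
    """Presents a CSV source as uniform {'name', 'city'} records."""

    def __init__(self, text):
        self.text = text

    def records(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield {"name": row["name"], "city": row["city"]}


class XmlWrapper:
    """Presents an XML source with a different vocabulary in the same form."""

    def __init__(self, text):
        self.text = text

    def records(self):
        for person in ET.fromstring(self.text).findall("person"):
            yield {"name": person.get("name"), "city": person.findtext("town")}


class Mediator:
    """Answers queries over the uniform view by combining wrapper answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, **conditions):
        return [
            record
            for wrapper in self.wrappers
            for record in wrapper.records()
            if all(record.get(k) == v for k, v in conditions.items())
        ]


csv_source = "name,city\nAlice,Lyon\nBob,Paris\n"
xml_source = "<people><person name='Carol'><town>Lyon</town></person></people>"

mediator = Mediator([CsvWrapper(csv_source), XmlWrapper(xml_source)])
in_lyon = mediator.query(city="Lyon")
print(in_lyon)
```

The caller of `mediator.query` never sees that one source calls the attribute `city` and the other `town`; hiding this heterogeneity is the job of the wrappers.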

According to [Lenzerini 2002], a data integration system I is formally a triple <G, S, M> where:

· G is the global schema, i.e. a set of global relational symbols, each one with an associated arity, plus a set of integrity constraints expressed over such relational symbols.

· S is the source schema, i.e. a set of relational symbols (disjoint from G), that constitutes a relational representation of the data stored at the sources.

· M is the mapping between G and S, constituted by a set of assertions of the form $q_S \rightsquigarrow q_G$, where $q_S$ is a conjunctive query over the source schema and $q_G$ is a conjunctive query over the global schema (for a review of the different approaches in the mapping compositions see section 2.1.4).

Starting from a source database instance D, the semantics of a data integration system I is defined by Lenzerini as the set of all the global databases B that respect the following conditions:

· B is a legal database instance for G (it respects the relational structure of G, and satisfies the integrity constraints expressed on it).

· B satisfies the mapping assertions with respect to the source instance D.

With respect to the mapping assertions, different assumptions can be made about the quality of the mapping. In particular, if we assume that the mapping is sound, then the data provided by the sources are a subset of the global data (the extension of $q_S$ is contained in the extension of $q_G$). Conversely, if the mapping is considered to be complete, the data provided by the sources are a superset of the global data (the extension of $q_S$ contains the extension of $q_G$). Finally, a mapping is said to be exact when it is both sound and complete. We point out that, due to the general characteristics of the sources, which are distributed, autonomous and independent, the sound mapping assumption is the most reasonable one in a data integration environment.
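The three assumptions reduce to set containment between the answers of the source query over D and the answers of the global query over B. The short Python sketch below makes this explicit; the function name `classify_mapping` and the sample tuples are assumptions used only for illustration, and the query extensions are given directly as sets rather than computed from real conjunctive queries.

```python
def classify_mapping(source_extension, global_extension):
    """Classify one mapping assertion from the two query extensions.

    source_extension: set of tuples returned by the source query over D
    global_extension: set of tuples returned by the global query over B
    """
    sound = source_extension <= global_extension      # subset test
    complete = source_extension >= global_extension   # superset test
    if sound and complete:
        return "exact"
    if sound:
        return "sound"
    if complete:
        return "complete"
    return "neither"


# Hypothetical extensions: the source knows only part of the global data.
from_source = {("alice", "lyon")}
in_global = {("alice", "lyon"), ("bob", "paris")}

print(classify_mapping(from_source, in_global))   # sound
print(classify_mapping(in_global, from_source))   # complete
print(classify_mapping(in_global, in_global))     # exact
```

Under the sound assumption, the global database may therefore contain tuples that no source provides, which matches the situation of autonomous sources that each hold only part of the information.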

2.1.3.2 Data exchange architecture

The previous work on data mediation formed the foundations for new research on data exchange [Fagin et al. 2005], [Kolaitis 2005], [Kolaitis et al. 2006], [Giacomo et al. 2007]. Indeed, data exchange is an old, but recurrent, database problem. Figure 2-2 shows the general architecture of a typical data exchange system. The goal is to transform data structured according to a source schema into data structured according to a different target schema. In other words, the problem here is how to transform or to move data from a source schema to a target schema, according to a given specification.

Figure 2-2 Data exchange architecture

According to [Fagin et al. 2005], a data exchange setting consists of a source schema S, a target schema T, a set of target dependencies $\Sigma_t$ and a set of source-to-target dependencies $\Sigma_{st}$. The target dependencies are Tuple-Generating Dependencies (TGDs) and Equality-Generating Dependencies (EGDs), whereas the source-to-target dependencies are TGDs mapping the source schema to the target schema. The data exchange problem for a given source instance can thus be formalized as follows: is there a target instance that satisfies the constraints of the schema mapping (expressed in terms of source-to-target dependencies) and the target dependencies?

Formally, given a source instance I, is there a target instance J such that $\langle I, J \rangle \models \Sigma_{st}$ and $J \models \Sigma_t$? We call such a J a solution for I.

Furthermore, for a schema mapping specified by embedded implicational dependencies, this problem is solvable in polynomial time by assuming that the schema mapping is kept fixed and the constraints of the schema mapping satisfy a certain structural condition, called “weak acyclicity”.
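To give a feel for what a solution looks like, the following Python sketch builds a target instance for one hypothetical source-to-target TGD, Emp(n, d) → ∃k Works(n, k) ∧ Dept(k, d), together with one hypothetical target EGD stating that a department name determines its key. The relation names, the dependencies and the labelled nulls N1, N2, ... are assumptions made for this example. The construction is a simplification in the spirit of the chase procedure used in [Fagin et al. 2005], not a general implementation of it: it directly reuses one labelled null per department instead of generating fresh nulls and merging them afterwards.

```python
from itertools import count


def build_solution(source_emp):
    """Build Works and Dept from Emp, with one labelled null per department."""
    fresh = count(1)
    key_of = {}
    works, dept = set(), set()
    for name, department in sorted(source_emp):
        if department not in key_of:
            key_of[department] = f"N{next(fresh)}"  # labelled null
        key = key_of[department]
        works.add((name, key))
        dept.add((key, department))
    return works, dept


def satisfies_st_tgd(source_emp, works, dept):
    """Check Emp(n, d) -> exists k: Works(n, k) and Dept(k, d)."""
    return all(
        any((name, key) in works for key, d in dept if d == department)
        for name, department in source_emp
    )


def satisfies_target_egd(dept):
    """Check Dept(k1, d) and Dept(k2, d) -> k1 = k2."""
    seen = {}
    return all(seen.setdefault(d, key) == key for key, d in dept)


emp = {("ann", "cardio"), ("bob", "cardio"), ("eve", "neuro")}
works, dept = build_solution(emp)

is_solution = satisfies_st_tgd(emp, works, dept) and satisfies_target_egd(dept)
print(sorted(works), sorted(dept), is_solution)
```

Removing a tuple from `works` makes `satisfies_st_tgd` return False, so the resulting instance is no longer a solution for the source instance.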

2.1.3.3 Peer-to-peer data integration architecture

The emergence of peer-to-peer sharing systems inspired the data management research community to consider P2P architectures for data sharing. Many systems and works have been proposed for data integration and exchange in the peer to peer environment. These systems are called Peer Data Management Systems (PDMSs) [Bernstein et al. 2002], [Nejdl et al. 2002], [Halevy et al. 2003], [Halevy et al. 2004], [Huebsch et al. 2005], [Fuxman et al. 2006], [Giacomo et al. 2007], [Al King et al. 2009].

The architecture for data integration and exchange in Peer to Peer networks and systems can be schematically represented as shown in Figure 2-3. This architecture of data integration is decentralized, dynamic and presents a data-centric coordination between autonomous organizations.

In P2P data integration systems, each peer has a schema that is mapped into its own local source and/or into other external peer schemas. Queries are posed over one peer. In such a P2P system, each peer models an autonomous information site. Each peer:
● exports its information content in terms of a peer schema, by combining its own data from its local source and/or data coming from other peers,
● relates its local source to its schema by means of local mappings,
● relates to other peers by means of a set of P2P mappings.

For a survey of the related works on the data management issues and challenges in the peer to peer systems we refer the reader to [Blanco et al. 2006].

Figure 2-3 Peer to peer data integration architecture

Peer-to-peer networks offer two additional benefits in the context of data integration. First, organizations often want to share data, but none of them wants to take responsibility for creating a mediated schema, maintaining it and mapping sources to it. The P2P architecture therefore offers a truly distributed mechanism for sharing data: every data source only needs to provide semantic mappings to a set of neighbours it selects, and more complex integrations emerge as the system follows semantic paths in the network. Second, it is not always clear that a single mediated schema can be developed for a data integration scenario. Consider, for example, data sharing in a scientific context, where data may involve scientific findings from multiple disciplines, bibliographic data, drug-related data and clinical trials. The variety of the data and the needs of the parties interested in sharing are too diverse to be represented by a single mediated schema. With a P2P architecture there is never a single global mediated schema, since data sharing occurs in local neighbourhoods of the network.

2.1.4 Mapping composition in data integration

A crucial issue in the design of a data integration system is to specify the method of mapping between the global schema and the sources. Two basic approaches have been proposed in the literature: the LaV (local-as-view) approach [Ullman 2000], [Levy et al. 1996] and the GaV (global-as-view) approach [Garcia-Molina et al. 1997], [Halevy 2001]. More recently, two other approaches have been proposed: the GLaV (global-local-as-view) approach [Friedman et al. 1999], which is a generalization of the LaV approach, and the combined BGLaV (BYU-global-local-as-view) approach [Xu and Embley 2004]. In the following, we outline the advantages and disadvantages of these different approaches.

The GaV approach for data mediation was first used in the TSIMMIS¹ project [Garcia-Molina et al. 1997] at Stanford and IBM Labs, around 1995. In this approach, the global schema is defined from the sources (i.e. the relations in the mediated schema are defined as queries over the source schemas). The main advantage is that query answering is easy and can be done by means of simple unfolding techniques. However, this structure is not well suited to adding or removing sources (i.e. every change in a source schema may require the mapping to be redesigned). Moreover, building a global schema for a set of heterogeneous sources is a tedious and hard task.
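
To make the idea of unfolding concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the relation names (`s1_patient`, `s2_ecg`, `PatientECG`) and the data are hypothetical and do not come from TSIMMIS or any cited system. A global relation is defined as a query (here, a join) over the sources, and a query over the global schema is answered by replacing the global relation with its definition.

```python
# Hypothetical source relations, each represented as a set of tuples.
sources = {
    "s1_patient": {(1, "Alice"), (2, "Bob")},                 # (id, name)
    "s2_ecg": {(1, "2010-01-05"), (2, "2010-02-11"), (1, "2010-03-20")},  # (patient_id, date)
}

def global_patient_ecg(src):
    """GaV definition: the global relation PatientECG(name, date)
    is a view (a join) over the two source relations."""
    return {
        (name, date)
        for (pid, name) in src["s1_patient"]
        for (ecg_pid, date) in src["s2_ecg"]
        if pid == ecg_pid
    }

# The GaV mapping associates each global relation with its defining query.
gav_mapping = {"PatientECG": global_patient_ecg}

def answer(global_relation, predicate, src):
    """Answer a selection over the global schema by unfolding:
    the global relation is replaced by its view definition over the sources."""
    unfolded = gav_mapping[global_relation](src)
    return sorted(t for t in unfolded if predicate(t))

result = answer("PatientECG", lambda t: t[0] == "Alice", sources)
print(result)
```

Adding a new source in this style means editing the view definitions themselves, which reflects the maintenance drawback described above.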

The LaV approach was introduced by the Information Manifold project [Levy et al. 1996], at AT&T Bell Laboratories, around 1996. In this approach, the sources are defined from the global schema (i.e. the relations in the source schemas are defined as queries over the mediated schema). The main advantage is that the integration of new sources becomes easier: no knowledge is needed about the other sources or the relationships between them. In addition, specifying constraints over the sources is easier than over the mediated schema, so the description of sources can be more precise. On the other hand, query answering is more complex than simple query unfolding, because each user query has to be reformulated into sub-queries over the sources. The hard task of building the mediated schema also remains.

The GLaV approach [Friedman et al. 1999] is a generalization of the LaV approach. The sources are described in terms of queries over a mediated schema, but the approach also covers more general cases where designers have to associate a general query over the source relations "qS" with a general query over the global relations "qG". Thus, GLaV mappings are more expressive and well suited to representing complex "data webs" relationships in distributed data integration environments. However, this approach has the same disadvantages as the LaV approach.

¹ TSIMMIS stands for The Stanford-IBM Manager of Multiple Information Sources.

Finally, the BGLaV approach [Xu and Embley 2004] uses source-to-target mappings based on a predefined conceptual target schema, which is specified ontologically and independently of any of the sources. First, every relation of the global "target" schema is specified independently, and the sources are wrapped independently as well. Then a set of mapping elements (direct and indirect elements) is specified to match the source schema with the target schema. Since query reformulation is thereby reduced to unfolding rules, this approach combines the scalability of LaV with the query performance of GaV.

2.1.5 Data exchange vs. data integration

As mentioned earlier, a data integration system is expressed by a triple <G, S, M> where: G is the global schema, S is the source schema and M is a set of mapping assertions relating elements of the global schema with elements of the source schema. In addition, a data exchange system may be defined by <S, T, ∑t, ∑st> where S is the source schema, T is the target schema, ∑t is a set of target dependencies and ∑st is a set of source-to-target dependencies. Therefore, a data exchange system can be seen as a data integration system in which: S is the source schema, T and ∑t form the global schema and ∑st are the mapping assertions M of the data integration system.

In data integration, the systems studied are either LaV or GaV systems. In the LaV approach, the mapping relates each element of the source schema to a query (view) over the global schema. As the global schema is commonly assumed to be a reconciled, virtual view of a heterogeneous collection of sources, it has no constraints of its own (∑t = ∅). In GaV, the mapping relates each element of the global schema to a query (view) over the source schema. In data exchange, ∑st relates a query over the source schema S to a query over the target schema T. A data exchange system is therefore neither a LaV nor a GaV system; instead, it can be seen as a GLaV (global-and-local-as-view) system. Moreover, the target schema is often created independently and comes with its own constraints, and we have to materialize a finite target instance that best reflects the given source instance.
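
The materialization step can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the source relation Emp(name, dept), the target relations Works and Dept, and the two source-to-target dependencies Emp(n, d) → Works(n, d) and Emp(n, d) → ∃m Dept(d, m). Values that the source does not provide (here, the manager m) are represented by labeled nulls. Keeping only null-free tuples of a query result is the usual way of reading answers that hold in every possible target instance.

```python
from itertools import count

class Null:
    """A labeled null: a placeholder for a value unknown in the source."""
    def __init__(self, n):
        self.n = n

    def __repr__(self):
        return f"N{self.n}"

def materialize(emp):
    """Build a target instance from the source relation Emp(name, dept)."""
    fresh = count(1)
    works, dept = set(), {}
    for name, d in sorted(emp):
        works.add((name, d))              # Emp(n, d) -> Works(n, d)
        if d not in dept:                 # Emp(n, d) -> exists m. Dept(d, m)
            dept[d] = Null(next(fresh))   # applied only if not already satisfied
    return {"Works": works, "Dept": set(dept.items())}

def null_free(tuples):
    """Keep only the tuples that contain no labeled null."""
    return sorted(t for t in tuples if not any(isinstance(v, Null) for v in t))

source_emp = {("Alice", "Cardiology"), ("Bob", "Cardiology"), ("Carol", "Radiology")}
target = materialize(source_emp)

# q1(d) :- Dept(d, m): project away the manager, then keep null-free tuples.
departments = null_free({(d,) for d, _ in target["Dept"]})
# q2(d, m) :- Dept(d, m): every tuple carries a null, so no answer is certain.
managers = null_free(target["Dept"])
print(departments, managers)
```

Both queries are evaluated on the materialized target instance only; the source relation is not consulted again after `materialize` has run.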

Finally, in both data integration and data exchange, the “certain answers” mechanism is used as the semantics of queries over the target (global) schema. In data integration, the source instances are used to compute the certain answers of queries over the global schema, whereas in data exchange, queries over the target schema may have to be answered using the materialized target instance alone, without reference to the original source instance. This comparison is summarized in Table 2-1.

                             | Data integration system <G, S, M> | Data exchange system <S, T, ∑t, ∑st>
-----------------------------|-----------------------------------|-------------------------------------
Source                       | S: the source schema | S: the source schema
Target                       | G: virtual global schema with no constraints (“∑t = ∅”) | T: target schema with a set of constraints ∑t
Mapping                      | M: a set of mapping assertions relating elements of G with elements of S | ∑st: a set of source-to-target dependencies
Mapping composition approach | LaV or GaV | GLaV
Query answering              | “certain answers” are computed using the source instances | “certain answers” are computed using the materialized target instance
Query result                 | not materialized | materialized

Table 2-1 Data integration vs. data exchange systems

2.1.6 Main open problems in data integration

Our review of the literature in the data integration field shows that a lot of work has been done. However, much remains to be done to resolve many open problems, such as the following:
● The first main step in realizing any integration system is the design of the virtual global schema. It is therefore crucial to develop new methods and tools that allow a cost-effective design and construction of this schema, especially because there is not yet any silver-bullet approach that fits all scenarios.
● The next step after the construction of the mediated schema is the design of “wrappers” or adapters for each data source. Automating source wrapping is another open challenge in data integration, due to the heterogeneity of the integrated source systems.
● The mapping between the sources and the global schema is a central part of any integration system. Discovering these mappings and automating their generation is still an active research area.
● One of the new challenging problems is data cleaning and reconciliation to ensure the quality of the extracted data.
● The processing of updates expressed on the global schema and/or the sources (“read/write” vs. “read-only” data integration) is one of the most recent challenges in data integration.
● The modelling of the global schema, the sources, and the mappings between the two is also still an open challenge, due to the lack of a winning approach.
● Efficient query answering and optimization over the global schema remains a challenging area of research in the data integration field.

2.1.7 XML

In this section, we review the XML technology and its related languages and standards: XML schema languages (DTD and XSD), XML query languages (XPath, XSLT, XQuery and the XQuery Update Facility) and XML parsers (DOM and SAX).

2.1.7.1 Introduction to XML

EXtensible Markup Language (XML) [W3C 2008a] is a simple, very flexible tagged text format derived from the Standard Generalized Markup Language (SGML) [ISO8879 1986]. It has been developed and maintained since 1996 by the World Wide Web Consortium (W3C) as a tool to standardize the format of the documents used on the Internet and to meet the challenges of large-scale electronic publishing. The first edition was published as a W3C recommendation in February 1998 [W3C 1998].

Indeed, XML is a well-defined set of rules for specifying semantic tags which divide a document into parts and identify the different parts of the document. Thus, it is more a meta-language than a language because it defines the syntax in which other domain-specific markup languages can be written. The tags of a domain-specific language can be documented in a schema written in any of several languages, including the Document Type Definition (DTD) [W3C 2008b] and the XML Schema Definition (XSD) [W3C 2001] languages.

XML differs from the Hyper Text Markup Language (HTML) [W3C 1999c], another widely used markup language on the web that is also derived from SGML, in that XML has no predefined tags (i.e. authors must define their own set of tags). XML and HTML were designed for different purposes: XML was designed for describing data objects called XML documents, while HTML was designed to display data and focuses on how this data looks. Moreover, XML has a much stricter syntax than HTML, which simplifies its processing.

In XML documents, data can be stored in either elements or attributes, both of which can be named to give meaning to the data they contain. Elements are constructed using start and end tags, possibly with some content between these tags. This content can be either text data or other elements. Attributes have no start and end tags of their own: they are defined in the start tag of an element, and their content is limited to text data. The order of elements is significant in an XML document, while the order of attributes is not.
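
The distinction between elements and attributes can be seen by reading a small document with Python's standard `xml.etree.ElementTree` module. This is only an illustrative sketch; the two one-line `<e .../>` documents are hypothetical and serve to show that attribute order carries no meaning.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note id="001">
  <type priority="high"/>
  <to>Hossam</to>
  <from>Joel</from>
  <body>Don't forget me this weekend!</body>
</note>"""

note = ET.fromstring(NOTE_XML)

# Child elements are kept in document order.
child_order = [child.tag for child in note]

# Element content is text (or nested elements); attributes live on the start tag.
to_text = note.find("to").text
note_id = note.get("id")
type_attrs = note.find("type").attrib
type_text = note.find("type").text      # an empty element has no text content

# Attribute order is not significant: both documents carry the same attributes.
a = ET.fromstring('<e x="1" y="2"/>').attrib
b = ET.fromstring('<e y="2" x="1"/>').attrib
same_attributes = (a == b)

print(child_order, to_text, note_id, type_attrs, type_text, same_attributes)
```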

Data vs. Documents: It is important to distinguish between two kinds of XML structures: data-centric and document-centric structures [Bourret 2005]. In the first kind, the XML structure is highly regular and composed of fine-grained data, the order of elements is not very important and there is no mixed content. This approach is mainly used to transport data (and often to represent legacy data) and is designed to produce XML files that are readable by machines. In the second kind, the XML structure is less regular, with coarse-grained data, the order of elements is important and the elements may have mixed content. This approach is used for the design of XML documents that are mainly handled manually. In the data management context, and for the integration between XML and relational databases, the first approach is always adopted.

Here are the main basic rules for well-formed XML documents:
· There must be a single root element: the root element can appear only once and all other elements are nested inside it.
· Elements must be properly terminated: each element has a start-tag <tag-name> that must be matched by an end-tag </tag-name>. The only exceptions are empty elements (with no content), which may be written as <tag-name/>.
· Elements must be properly nested and must not overlap; there is no limit to the nesting level.
· Element and attribute names are case sensitive.
· Attribute values must be quoted: an attribute, which is extra information added to an element start-tag, must have its value enclosed in quotes.
· An XML document should start with a declaration, a special tag that identifies the version of the XML specification.
· It is possible to impose a specific grammar by using an XML schema language (DTD or XSD).
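
The rules above can be checked mechanically by any XML parser. As a minimal sketch, the following Python code uses the standard `xml.etree.ElementTree` module, whose parser raises `ParseError` when a document is not well-formed. The sample strings are hypothetical, one per rule. The parser is non-validating, so it checks well-formedness only and not conformance to a DTD or XSD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One hypothetical sample per rule; only the first one is well-formed.
samples = {
    "ok": '<notes><note id="001"><to>Hossam</to></note></notes>',
    "two roots": "<note/><note/>",
    "unterminated": "<notes><note></notes>",
    "overlapping": "<a><b></a></b>",
    "case mismatch": "<Note></note>",
    "unquoted attribute": "<note id=001/>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'not well-formed'}")
```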

A simple sample XML document may look like this:

<?xml version="1.0"?>                              {declaration}
<!DOCTYPE notes SYSTEM "Notes.dtd">                {type of document}
<!--My notes file-->                               {comment}
<notes>                                            {root element start tag}
  <note id="001">                                  {attribute}
    <type priority="high"/>                        {empty element (no data)}
    <to>Hossam</to>                                {not empty element}
    <from>Joel</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
  </note>                                          {end element tag}
</notes>

When an XML document is received by a server, various functions can be applied to it. The document can be checked for well-formedness (i.e. that it satisfies the XML rules listed above), validated against an XML schema, transformed, stored or forwarded, depending on the application. The well-formedness check verifies that the document's syntax is correct according to the XML specification, while validation checks the document's structure against a DTD or an XML schema in XSD. The transformation process converts an XML document from one structure to another or from one set of tags to another. It makes it possible to reorder elements between the source and destination representations; arbitrary elements and structures can also be added and existing ones removed. The eXtensible Stylesheet Language (XSL) [W3C 2006], and in particular its XSL Transformations (XSLT) part [W3C 1999a], is the common technology for carrying out XML document transformations.

In fact, XML is not one single technology or recommendation: many XML-related technologies contribute to the power of XML. Well-formed XML documents can be created in different applications using the core XML recommendation alone. However, to use such XML documents as a format to store and publish information, on the World Wide Web for instance, other XML-based or XML-related technologies are needed. For example, to define the structure of an XML document, the Document Type Definition (DTD) or the XML Schema Definition (XSD) languages may be used. To display XML in browsers, Cascading Style Sheets (CSS) may be used. To display or transform XML into another format such as PDF or HTML, the Extensible Stylesheet Language (XSL) may be used. To navigate and organize the data in XML documents, technologies such as XPath, XPointer and XLink may be used. To use XML in an application, one of the XML interfacing technologies such as the Document Object Model (DOM) or the Simple API for XML (SAX) may be used. To query or update XML documents, we may use the XQuery language and the XQuery Update Facility. Therefore, XML with its related languages and derivatives now provides powerful tools for sharing, converting and exchanging information via networked computers. Furthermore, there are many proposals to standardize XML document structures for domains as diverse as stock trading, graphic design, healthcare (“HL7”) and Web Services (“SOAP”, “ebXML”), among others.

In the following sections we present in more detail some of these XML-related technologies that we have used in our proposed solution, starting with the technologies for defining an XML document schema (the DTD and XSD languages), followed by the XML query languages (XPath, XSLT, XQuery and the XQuery Update Facility), and finally the technologies for parsing XML data (DOM and SAX).

2.1.7.2 XML schema languages (DTD vs. XSD)

XML schema languages present a way to define the structure of the XML documents and provide an additional level of syntax checking. The constraints provided by the well-formedness rules of XML are very simple. Thus the validity constraints introduced by a schema allow specifying the tree structure of XML documents. Indeed, there exist two standard recommendations from the W3C to define the structure of the XML documents: The Document Type Definition (DTD) and XML Schema Definition (XSD). Hereafter, we present both of these technologies and give a comparison between them. There are other schema languages including RELAX NG [OASIS 2001], [ISO/IEC 19757-2 2003] and Schematron [ISO/IEC 19757-3 2006], but they are not W3C recommendations.

2.1.7.2.1 XML schemas advantages

XML schemas specify contracts between data producers and consumers. They provide a standardized vocabulary and structure for an application domain, and designing them requires a deep understanding of the application and the domain of interest. Schemas may be particularly useful for:
· Validation of XML documents against a schema (for safety).
· Automation of XML exchange and storage into relational databases.
· Query optimization.
· XML binding (mapping to programming languages).

2.1.7.2.2 Document Type Definition language

The Document Type Definition language (DTD) [W3C 2008b] is inherited from the SGML world with an almost completely intact syntax. Thus, a DTD itself is not defined using the XML syntax. A DTD is the blueprint of a document's structure and contains a series of declarations. It is used to describe the valid syntax of a class of XML documents by assigning names and types to the different elements and attributes. A DTD therefore enables each XML document to carry a description of its own format. It may characterize an agreed standard for interchanging data between independent groups of people. In addition, applications can use a standard DTD to verify that data received from the outside world is valid.

The DTD can be a separate file or it can be embedded in the XML file. The DTD contents can also be split across an external file and the XML file. Here is a sample of a separate DTD for validating the previous XML document sample:

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT notes (note*)>                               {element declaration}
<!ELEMENT note (type, to, from, heading, body)>
<!ELEMENT type EMPTY>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
<!ATTLIST type priority (low|normal|high) "normal">
<!ATTLIST note id CDATA #REQUIRED>                     {attribute declaration}

2.1.7.2.3 XML Schema Definition language

The XML Schema Definition language (XSD) [W3C 2001], published as a W3C recommendation in May 2001, is a newer, more flexible and more elaborate schema language than DTD. XSD, as an alternative to DTD, is used for specifying the type of each element and the data types that are associated with the elements. It provides a way of defining strong typing relationships and supports data types and namespaces, which DTDs do not. In addition, XSD schemas are themselves XML documents, so they may be managed by XML authoring tools.

Indeed, XSD introduces new levels of flexibility and of assurance against unauthorized changes, which may accelerate the adoption of XML for significant industrial use. For example, a schema author can build a schema that reuses a previous schema, and even overrides it when new unique features are needed. XSD allows the author to determine which parts of a document may be validated, or to identify parts of a document where a schema may apply. Moreover, XSD provides a way for users to choose which XML Schema they use to validate elements in a given namespace. It can define data types, ranges, enumerations, dates, and more complex data types to strictly specify what constitutes a valid XML document.

XML Schema specifications are divided into three parts. Part 0 [W3C 2004] explains what schemas are, how they differ from DTDs, and how to build a schema. Part 1 [W3C 2009a] proposes methods for describing the structure and constraints of XML documents contents, and defines the rules for documents schema-validation. Part 2 [W3C 2009b] defines a set of simple data types, to be associated with XML element types and attributes, which allows XML applications to better manage dates, numbers, and other types of information.

Here is a sample of an XSD schema for validating the previous XML document sample:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="notes">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="note">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="type">
                <xs:complexType>
                  <xs:attribute name="priority" type="xs:string" use="required"/>
                </xs:complexType>
              </xs:element>
              <xs:element name="to" type="xs:string"/>
              <xs:element name="from" type="xs:string"/>
              <xs:element name="heading" type="xs:string"/>
              <xs:element name="body" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="id" type="xs:int" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

2.1.7.2.4 XSD vs. DTD

XSD has many advantages over DTD, mainly:
(1) XSD supports powerful and rich sets of built-in data types (for elements as well as for attributes) with the possibility of inheritance and derived data types, whereas DTDs only support character strings. For example, XSD can specify that a particular attribute must be a valid date, or a number, or a list of URIs, or a string that is exactly 8 characters long.
(2) XSD can define all the constraints that a DTD can define, and many more. XSD supports identity constraints (key, keyref, unique), which are more powerful than the IDs and IDREFs supported by DTDs. Identity constraints can be specified on any element or attribute, regardless of its type. They can be locally defined for a combination of elements and attributes. In addition, XSD supports fine-grained cardinality constraints, while DTD is mainly based on the Kleene closure operators (*, +, ?).
(3) XSD has the same syntax as XML, and it may be managed by XML editing tools.
(4) Namespaces are supported in XSD, but not in DTD.

A summary of the comparison between XSD and DTD is given in Table 2-2.

                           | XSD | DTD
---------------------------|-----|----
Data types                 | rich sets of built-in data types & customized new data types | only two types of data: PCDATA & CDATA
Constraints: cardinalities | fine grained: minOccurs = 0..*, maxOccurs = 0..* | *, +, ?
Constraints: primary keys  | id, key | ID
Constraints: foreign keys  | idref, keyref | IDREF/IDREFS
Constraints: uniqueness    | unique | not supported
Namespace                  | well-supported | not supported
Syntax                     | XML | not XML (SGML)

Table 2-2 XSD vs. DTD to specify XML schemas

2.1.7.3 XML query languages

2.1.7.3.1 XPath

The XML Path language [W3C 1999b], known as XPath, is a W3C recommendation from November 1999. Its name is derived from its “path expression” feature, which provides a way of hierarchically addressing the nodes in an XML tree. Thus, XPath provides a common method for locating and extracting information from XML documents based on specified criteria. XPath was defined during the development of XSLT and XPointer, and was designed to provide unambiguous navigation of XML documents. XPath's functionality is used by both XPointer and XSLT. XSLT uses only a subset of XPath; XPointer uses additional syntax mechanisms to extend its functionality (XPointer allows forward and backward addressing to specific XML locations internal to a document and to locations in external XML documents). XPath is also used by XQuery, which is an emerging technology that will provide standardized access to data in the RDBMS using XML.

XPath 2.0 [W3C 2007a] is a superset of the old XPath 1.0 version, with new capabilities to support a richer set of data types and to take advantage of the type information that becomes available when documents are validated using XML Schema (XSD) [W3C 2001]. XPath 2.0 allows the processing of values conforming to the XQuery/XPath Data Model (XDM) defined in [W3C 2007b]. This data model provides, in addition to a tree representation of XML documents, the atomic values such as integers, strings and booleans, and sequences that may contain both references to nodes in an XML document and atomic values. A backwards compatibility mode is provided to ensure that nearly all XPath 1.0 expressions continue to deliver the same result with XPath 2.0.

XPath works on the XML document model, which can be represented as a hierarchical tree structure. There are seven node types in this model (root, namespace, processing instruction, element, text, attribute, and comment nodes). XPath uses path expressions to select nodes or node-sets in an XML document. Path expressions are very similar to the paths used in a traditional computer file system. An XPath expression is a series of steps required to identify the desired section of XML. Each of these steps represents a layer (or possibly several layers if a wildcard is used) within the XML tree. During each step of the expression, tests may optionally be performed to narrow the search based on criteria specified by a predicate. The result of an XPath expression may be a selection of nodes from the input documents, an atomic value or, more generally, any sequence allowed by the data model.
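
As an illustration of steps and predicates, the following sketch evaluates a few path expressions with Python's standard `xml.etree.ElementTree` module. This module implements only a limited subset of XPath 1.0, so the example stays within that subset. The bookstore document and its `category` attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """<bookstore>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="web"><title>XQuery Kick Start</title><price>49.99</price></book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE_XML)   # root is the <bookstore> element

# Two steps: child <book> elements, then their <title> children.
all_titles = [t.text for t in root.findall("./book/title")]

# A predicate on an attribute narrows the first step.
web_titles = [t.text for t in root.findall("./book[@category='web']/title")]

# A predicate on the text of a child element.
italian_price = root.find("./book[title='Everyday Italian']/price").text

# A positional predicate (positions start at 1).
first_title = root.find("./book[1]/title").text

# '//' crosses several layers of the tree at once.
all_prices = [p.text for p in root.findall(".//price")]

print(all_titles, web_titles, italian_price, first_title, all_prices)
```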

2.1.7.3.2 XSLT

The eXtensible Stylesheet Language Transformations (XSLT) [W3C 1999a], a W3C recommendation from November 1999, is a common technology for carrying out XML document transformations. It is one of the two main parts of the eXtensible Stylesheet Language (XSL) version 1.1 specification, a W3C recommendation from December 2006. XSL is a language for expressing stylesheets. It makes it possible to perform virtually any transformation on an XML document, which strengthens the position of XML as a general file format for data. The second part of the XSL version 1.1 specification is XSL Formatting Objects (XSL-FO), an XML vocabulary for specifying formatting semantics that allows a wide range of possibilities for print, display or oral presentation.

XSLT version 2.0 [W3C 2007c], a W3C recommendation from January 2007, is a revised version of the older version published in November 1999. It is compatible with Namespaces and XPath 2.0. XSLT 2.0 shares the same data model [W3C 2007b] as XPath 2.0, and it uses the library of functions and operators of XPath 2.0. One of the main uses of XSLT is to translate information from one XML vocabulary to another. XSLT operates on an abstract model that views an XML document as a tree, although an implementation is not required to actually build this tree. It provides means to access the document tree in order to access nodes by name or content, search for specific content or nodes, and manipulate content or nodes. The XML syntax was chosen for XSLT for many reasons, the most important of which were:
· Reuse of the XML parser minimizes footprint.
· Familiarity and ease of understanding.
· Reuse of the lexical apparatus of XML for handling whitespace, Unicode, namespaces, and so forth.

XSLT has many features, such as:
● XSLT stylesheets are XML documents and they follow the XML rules.
● Multiple input sources.
● Ability to select document fragments using XPath expressions.
● Named and/or pattern-based templates.
● Parameterized templates.
● Intermediate transformation states may be managed using variables.
● Stylesheets may be combined using include or import.
● Built-in support for output sorting and numbering.
● Both XML and non-XML output are supported.
● The functions in XSLT have no side effects and can be processed in any order.
● The way a stylesheet is coded affects which parts can be processed independently.
● The parts of the stylesheet can be processed in any order without affecting the order of the output, which depends on the order in the XML file.
● XSLT supports recursion, which makes it a very powerful tool.

An XSLT stylesheet accepts XML from the abstract tree model of the source document, known as the source tree, and processes it to produce a result tree. The XSLT stylesheet defines the rules for transformation, based on the XML elements and attributes in the source tree. It may also contain formatting information, called formatting objects (or FOs), and apply those objects during the transformation. A single stylesheet can apply to multiple XML documents, provided the elements and structure are consistent with those specified by the stylesheet. The source XML document can also invoke multiple XSLT stylesheets. For example, the same XML source could be processed by XSLT to produce HTML, voice markup and a rendering for printing. These transformations could occur as separate parallel processes (each invocation running an XSL processor in a separate memory space) or sequentially (each invocation running after the previous one completes). The advantage of parallel processing is that an XSLT error in one stylesheet will not prevent the others from running, whereas in sequential processing any downstream process will be terminated as well. The disadvantage of parallel processing is its system memory usage.
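
The source tree / result tree model can be imitated in a few lines of Python. This sketch is not XSLT: Python's standard library does not include an XSLT processor, so the rule that an XSLT template would express is hand-coded here with `xml.etree.ElementTree`. The target vocabulary (`messages`, `message`, `sender`, `recipient`, `subject`) is hypothetical.

```python
import xml.etree.ElementTree as ET

def transform(source_root):
    """Build a result tree in a <messages> vocabulary from a <notes> source tree."""
    result = ET.Element("messages")
    # Plays the role of a template rule matching each <note> element.
    for note in source_root.findall("note"):
        msg = ET.SubElement(result, "message", {"ref": note.get("id")})
        # Elements are renamed and reordered; <body> is deliberately dropped.
        ET.SubElement(msg, "sender").text = note.findtext("from")
        ET.SubElement(msg, "recipient").text = note.findtext("to")
        ET.SubElement(msg, "subject").text = note.findtext("heading")
    return result

source = ET.fromstring(
    '<notes><note id="001"><to>Hossam</to><from>Joel</from>'
    "<heading>Reminder</heading><body>Don't forget me this weekend!</body>"
    "</note></notes>"
)
output = ET.tostring(transform(source), encoding="unicode")
print(output)
```

The source tree is left untouched and a separate result tree is produced, which is the same division of roles as in an XSLT transformation.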

2.1.7.3.3 XQuery

XQuery [W3C 2007d], a W3C recommendation since January 2007, is a language for querying XML data (i.e. not only XML files, but anything that can appear as XML, including databases). It uses the structure of XML to express queries across all these kinds of data, whether they are physically stored in XML or viewed as XML via middleware.

XQuery 1.0 is a superset of XPath 2.0; both share the same data model and support the same functions and operators [W3C 2007e]. It is compatible with several W3C standards, such as XML, Namespaces, XSLT and XML Schema, and it is supported by most major database systems. XQuery allows finding and extracting elements and attributes from XML documents. Here is an example of a query that XQuery can express:

“Select the titles of all books with a price higher than $30 from the bookstore collection stored in the XML file called books.xml”.

This query may be expressed by the following FLWOR¹ expression:

for $x in doc("books.xml")/bookstore/book
where $x/price > 30
return $x/title

This FLWOR expression selects exactly the same nodes as the following path expression:

doc("books.xml")/bookstore/book[price > 30]/title

XQuery was designed with the goal of providing flexible query facilities to extract data from real and virtual documents on the Web and to provide the needed interaction between the Web and the database worlds. It was thus designed to be to XML what SQL is to relational databases, so that collections of XML files may be accessed like databases. However, XQuery has no means to make persistent changes or updates to XML documents. Therefore, the XQuery working group published a new candidate recommendation called “XQuery update facility” that extends the XQuery language. We present it in the following section.
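To make the equivalence between the FLWOR form and the path form more concrete, the following minimal Python sketch mimics both on an in-memory document. It is only an analogy written with the standard xml.etree.ElementTree module, not an XQuery engine; the sample data and the function names titles_flwor_style and titles_path_style are assumptions introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for books.xml (not taken from the thesis).
BOOKS_XML = """
<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
</bookstore>
"""

def titles_flwor_style(root, threshold):
    """Mimic: for $x in /bookstore/book where $x/price > threshold return $x/title."""
    result = []
    for book in root.findall("book"):                  # for $x in ...
        if float(book.findtext("price")) > threshold:  # where $x/price > threshold
            result.append(book.findtext("title"))      # return $x/title
    return result

def titles_path_style(root, threshold):
    """Mimic: /bookstore/book[price > threshold]/title as one filtered path."""
    return [
        title.text
        for book in root.findall("book")
        if float(book.findtext("price")) > threshold
        for title in book.findall("title")
    ]

root = ET.fromstring(BOOKS_XML)  # root is the <bookstore> element
flwor_titles = titles_flwor_style(root, 30)
path_titles = titles_path_style(root, 30)
print(flwor_titles)
```

Both functions return the same list of titles; the book priced at exactly 30.00 is excluded because the comparison is strict.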

2.1.7.3.4 XQuery update facility

The XQuery update facility [W3C 2008c] extends the XQuery language with update capabilities. It provides expressions that can be used to make persistent changes to instances of the XQuery 1.0 and XPath 2.0 Data Model (XDM). The XQuery update facility was published as a W3C candidate recommendation in August 2008 and its specification is expected to receive the status of recommendation soon.

Indeed, an XQuery 1.0 expression takes zero or more XDM instances as input and returns an XDM instance as a result. In XQuery 1.0 there is no expression that can modify the state of an existing node; however, constructor expressions may create new nodes. Therefore, the XQuery update facility 1.0 introduces a new category of expression, called an “updating expression”, that may modify the state of an existing node. It provides means to perform the following operations on an XDM instance: node insertion, node deletion, node modification (changing some of a node's properties while preserving its node identity), and creation of a modified copy of a node with a new node identity. The XQuery update facility has five new kinds of expressions: insert, delete, replace, rename, and transform. Hence, it is expected to facilitate the updates of XML documents.

¹ FLWOR stands for: FOR, LET, WHERE, ORDER BY, RETURN.
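The five kinds of updating operations can be illustrated, by analogy only, on an in-memory tree. The sketch below uses Python's standard xml.etree.ElementTree module rather than an XQuery processor, so it shows the intent of each operation and not the XQuery update facility syntax; the sample document and the values used are assumptions.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the thesis).
doc = ET.fromstring(
    "<bookstore>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)
book = doc.find("book")

# insert: add a new child node to an existing node
year = ET.SubElement(book, "year")
year.text = "2003"

# replace: change the value of an existing node, keeping the node itself
book.find("price").text = "29.95"

# rename: change the name of a node; the same object (identity) is kept
title = book.find("title")
title.tag = "name"

# delete: remove a node from its parent
book.remove(year)

# transform: modify a copy and leave the original tree untouched
modified_copy = copy.deepcopy(doc)
modified_copy.find("book/price").text = "0.00"

updated = ET.tostring(doc, encoding="unicode")
transformed = ET.tostring(modified_copy, encoding="unicode")
print(updated)
print(transformed)
```

The first four operations change the original tree in place, whereas the last one only affects the copy, which mirrors the distinction made above between modifying a node and creating a modified copy with a new identity.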

2.1.7.4 XML parsers

XML parsers are software components that provide means to access and manipulate XML documents. They are used for performing document validation and processing within the syntax of the programming language in use. There are two major alternatives for parsing XML documents: the W3C Document Object Model (DOM) and the Simple API for XML (SAX). They are both programming interfaces used for accessing and manipulating XML documents. We present them both hereafter.

2.1.7.4.1 DOM

The W3C standard Document Object Model (DOM) is “a platform and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of a document”.

Indeed, XML has a hierarchical data format and may be represented as a tree. An XML document consists of data elements that may have other nested elements and attributes attached to them. Using the DOM, a hierarchical model can be created that describes the XML document as a recursive list of lists. The DOM therefore views an XML document as a tree structure, and all elements can be accessed through the DOM tree. The content of the elements (data and attributes) can be modified or deleted, and new elements can be created. Elements, attributes, and their content are all considered as nodes.
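As a small illustration of this tree view, the following sketch uses xml.dom.minidom, the minimal DOM implementation of the Python standard library, to access, modify, create and delete nodes. The sample document is an assumption introduced for the example.

```python
from xml.dom import minidom

# Hypothetical sample document (not taken from the thesis).
dom = minidom.parseString(
    "<bookstore><book category='web'><title>Learning XML</title></book></bookstore>"
)

# Every element is reachable by navigating the tree from the document element.
book = dom.documentElement.getElementsByTagName("book")[0]
title = book.getElementsByTagName("title")[0]

# Element content and attributes are nodes too, and can be read or modified.
title_text = title.firstChild.data
book.setAttribute("category", "xml")

# New elements can be created and attached to the tree.
price = dom.createElement("price")
price.appendChild(dom.createTextNode("39.95"))
book.appendChild(price)

# Existing elements can be deleted through their parent.
book.removeChild(title)

result = dom.documentElement.toxml()
print(result)
```

Note that the whole document is parsed into memory before any node can be accessed.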

DOM, however, has two important shortcomings that lead developers to look for alternative models. First, because DOM is language-neutral, processing DOM with a strongly typed programming language (e.g. Java) can result in problems when representing data-type structures by language-specific types. Secondly, DOM parsing requires the whole document to be processed at once to produce the object model hierarchy. As a result, a large amount of system memory and powerful resources are necessary when large documents are handled, because the whole object model is stored in memory.

2.1.7.4.2 SAX

The Simple API for XML (SAX) is a sequential access parser for XML. It is the most popular alternative to DOM. SAX takes a different approach to document parsing that addresses the two previous shortcomings. It has been designed to provide a simpler, event-based programming interface for XML processing: programmers define event handlers that are called when the different elements are encountered in the XML document. Although SAX is simpler and has smaller memory requirements than DOM, it has its own weaknesses. Indeed, SAX requires the programmer to maintain state information, which can make applications using SAX more complex and difficult to maintain.
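The need to maintain state by hand can be seen in the following minimal sketch, written with the xml.sax module of the Python standard library. The handler class TitleCollector, its attributes and the sample data are assumptions introduced for illustration.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collects the titles of books priced above a threshold."""

    def __init__(self, threshold):
        super().__init__()
        self.threshold = threshold
        self.titles = []
        # State that the programmer has to maintain between events:
        self._current_tag = None
        self._title = ""
        self._price = ""

    def startElement(self, name, attrs):
        self._current_tag = name
        if name == "book":
            self._title, self._price = "", ""

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self._current_tag == "title":
            self._title += content
        elif self._current_tag == "price":
            self._price += content

    def endElement(self, name):
        self._current_tag = None
        if name == "book" and float(self._price) > self.threshold:
            self.titles.append(self._title.strip())

handler = TitleCollector(threshold=30)
xml.sax.parseString(
    b"<bookstore>"
    b"<book><title>Everyday Italian</title><price>30.00</price></book>"
    b"<book><title>Learning XML</title><price>39.95</price></book>"
    b"</bookstore>",
    handler,
)
print(handler.titles)
```

The parser never builds a tree: the handler only sees a stream of events, so the relationship between a title and its price has to be tracked in the handler's own fields.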

Therefore, it is important to understand the differences between DOM and SAX in order to make the appropriate choice for the application requirements. For instance, real-time applications such as various messaging protocols are well suited to the event-based SAX, while an application that uses an XML document as a simple database would benefit from the object model of DOM.

2.1.8 XML data management and relational databases

In this section, we review in detail the role of XML data management in the data integration domain, focusing on the problem of data integration when relational databases are the main data storage technology and XML is the main data exchange technology. We first introduce the different XML data management scenarios and problems. Next, we compare XML and relational databases to clarify the challenges of integrating the two technologies. Then, we review the main contributions from the literature, which we classify into three classes of problems or scenarios. First, we review the research works on XML publishing from relational databases. Then, we survey the research works on XML storage into relational databases for efficient storage, querying and retrieval of XML data. Finally, we review the recent research works that focus on the problem of updating databases from XML data.

2.1.8.1 Introduction to XML data management

XML plays an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. In recent years, it has become the de facto standard data exchange format for most Internet-based business applications. Its nested, self-describing structure provides simple and flexible means for applications to model and exchange data. Indeed, XML enables flexible processing and simple information exchange between heterogeneous applications and platforms. In the last decade, XML has become extremely widespread, and its use no longer needs to be justified. In fact, the question has changed from “Why XML?” to “Why not XML?”

XML offers a common syntax for sharing data among enterprise and organization systems. It is more than just a syntax because it is standardised and has an extensible syntax and many powerful related technologies and standards. As a consequence, XML has had a main role in the development of data integration over the last decade. However, XML does not address the semantic integration issues. Thus, sources may share XML files whose tags are completely meaningless outside the application. Since it appears as if data could actually be shared, the need for integration becomes much more significant.

Hence, during the last decade several integration systems have been developed using XML as the underlying data model and XML query languages as the query language. In these data integration systems, every aspect needs to be extended to handle XML. One important issue is to deal with the relational databases which are the main storage technology of the data sources that shall be integrated. Thus, one of the main goals in any XML-based integration system is to make significant use of XML with the database technology and especially the relational databases.

We can distinguish three main related issues or scenarios that appear for the use of XML with the relational database. These scenarios, as depicted in Figure 2-4, are:

· Publishing data from a legacy database to XML.
· Storing and querying XML in persistent relational databases.
· Updating the databases from XML data.

In the following sections we have classified the previous works from the literature into these three scenarios and we review the contributions that have been done in each of them. Before that, in the next section, we discuss the characteristics of XML and relational databases, and the relations between both technologies.

Figure 2-4 XML data management scenarios

2.1.8.2 XML vs. relational database

XML has some advantages: it is self-describing, i.e. the tags describe the structure and types of the data (and, in some cases, some semantics). It is designed to be interpreted by machines. It is portable because it is Unicode-based and platform-independent, which makes it the best candidate medium for data exchange. However, it has some disadvantages: access to data is slow since it involves parsing the XML files and doing some text conversion, and the performance is especially poor when XML files contain large volumes of data. Although XML is surrounded by many tools and related technologies, such as schema languages (DTD, XSD, etc.), query languages (XPath, XQuery, XQuery update facility, etc.) and programming interfaces (DOM, SAX, etc.), it still lacks facilities that exist in relational databases, such as efficient storage, indexes, data integrity and security, transactions, triggers, etc.

The relational model, as proposed by E.F. Codd [Codd 1970], is a database model based on first-order predicate logic. A relational database is a collection of tables; a table consists of a set of records or rows, and a record consists of a set of fields or columns. Relational databases are a well-proven, dominant and efficient technology for structured data storage and retrieval. The relational database technology provides many data management facilities, such as indexes, data integrity and security, transactions, triggers and multi-access management. However, the bidirectional exchange of relational data between heterogeneous schemata is cumbersome and often hand-coded, which makes it expensive, ineffective and error-prone.

In Table 2-3, we compare the relational technologies with their XML counterparts.

                  XML documents                          Relational databases
Data model        XQuery 1.0 and XPath 2.0 data model    relational data model
API               DOM, SAX                               JDBC/ODBC
Query language    XQuery, XSLT, XPath                    SQL

Table 2-4 XML documents vs. relational databases

To summarize, XML is the best standard technology for data exchange, whereas the relational database is the best technology to store, query and retrieve data efficiently, and it seems that this will not change in the near future. Consequently, there is a need to combine the advantages of both technologies. In the next section, we present the problem of integration between XML and relational databases, together with the three main scenarios and issues related to this integration.

2.1.8.3 XML & relational databases integration

Data exchange and integration between XML and relational databases is a challenge due to the heterogeneity of the two models. Therefore, this field has been an active research topic in the last decade, with the goal of making significant use of both technologies. In the literature, we may distinguish two main problems that were considered at the beginning [Krishnamurthy et al. 2003]:

1. Relational publishing in XML (called also XML publishing by Krishnamurthy): The goal is to define an XML view of the relational data set and then XML queries are posed over this view.

2. XML storage into a relational database: The aim is to use the relational database management system (RDBMS) to store the XML data efficiently. In this scenario, there are two main sub-problems: (1) a new relational schema has to be designed for storing the XML data, and (2) XML queries then have to be translated into SQL queries for evaluation.

Recently, a few research works have considered a new challenging problem:

3. Relational databases update from XML: the goal is to update already existing relational databases from XML data. In the relational publishing in XML scenario, a new challenging problem has emerged: updating the relational databases through XML views. The XML views, which originally offered read-only functionality for exporting data from the relational databases (i.e. from the legacy systems) into the XML format, should be extended with update functionality that propagates any update over the XML view into the relational database. In the context of XML storage into relational databases, the need to handle such update queries has also received some attention: these queries should be supported and translated correctly into the relational database. However, the general case of updating any database from any XML format has not been considered yet, in particular the updating of existing legacy relational databases from any newly produced relevant XML data.

Therefore, we may classify the problems related to the interaction between XML and relational databases into the previously cited three main problems: Relational publishing in XML, XML storage into a relational database and Relational databases update from XML. In the following section we review the different works that have been performed to face each of them.

2.1.8.4 Relational publishing in XML

The goal of relational publishing in XML is to transform the existing data in a relational database of a legacy system into an XML format. These data are usually stored in a pre-existing relational database (i.e. in the legacy system), and are updated through their interfaces. However, we need to provide XML wrappers to export the relational data into XML and make it accessible for Web publishing and/or data integration.

Many initiatives have been launched to fulfil this need. DB2XML, proposed in [Turau 1999], was one of the first tools to transform the result of a query or the complete content of a relational database into XML documents. It generates a description of the data in terms of a DTD and allows a transformation step of the XML documents using XSLT stylesheets. The mapping approach used is similar to a straightforward relational-to-XML translation algorithm, called Flat Translation, where the flat relational model maps to a flat XML model in a one-to-one manner. However, DB2XML does not allow importing XML documents, even those that conform to the same specification as the exported documents.
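The idea of a flat, one-to-one translation can be sketched as follows. This is not the DB2XML implementation, only a minimal illustration under stated assumptions: the function name flat_translate, the "row" element name, the patient table and the use of an in-memory SQLite database are all hypothetical choices.

```python
import sqlite3
import xml.etree.ElementTree as ET

def flat_translate(conn, table):
    """Map a table to XML: table -> element, row -> element, column -> child element."""
    cursor = conn.execute(f"SELECT * FROM {table}")  # table name is trusted in this sketch
    columns = [description[0] for description in cursor.description]
    table_element = ET.Element(table)
    for row in cursor:
        row_element = ET.SubElement(table_element, "row")
        for column, value in zip(columns, row):
            child = ET.SubElement(row_element, column)
            child.text = "" if value is None else str(value)
    return table_element

# Hypothetical relational data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

flat_xml = ET.tostring(flat_translate(conn, "patient"), encoding="unicode")
print(flat_xml)
```

The resulting document has no nesting beyond table, row and column, and it carries none of the keys or integrity constraints of the relational schema, which is the limitation of flat mappings discussed in the remainder of this section.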

Relational publishing in XML is mainly achieved by defining an XML view of the relational data and then performing XML queries over this view. Thus, the XML view is a technology used to extract data from a relational database into a specific XML format, i.e. an agreed formatting of elements and attributes based on the design of a new XML schema expressed in the DTD or XSD languages. The concept of XML views (or "XViews") was first introduced in [Baru 1999]. In this paper, the author presents various approaches to select a set of candidate XViews. The final XView is then derived by manual refinement of the candidate XViews. Indeed, an XView over a relational schema may be defined as an aggregation of some or all of the relations in this schema. The approach used to derive the candidate XViews is based on representing the relational schema as a directed graph. Graph processing techniques are used to enumerate the candidate XViews: the nodes with the maximum degree (or alternatively with zero degree) are the possible roots of the candidate XViews. When a schema graph has cycles, these cycles must be broken. Therefore, user guidance is always required to complete the task of enumerating the XViews.

In [Carey et al. 2000a], [Carey et al. 2000b], [Shanmugasundaram et al. 2001a], XPERANTO is proposed as a middleware that allows existing relational data to be viewed and queried in XML. It provides XML views over the relational database so that users can query and structure XML data using an XML query language without dealing with SQL. This middleware uses the XML Query Graph Model (XQGM) as an intermediate representation, which is general enough to capture the semantics of a powerful language such as XQuery and flexible enough for easy translation to SQL. Thus, an XQuery query is first translated into the intermediate XQGM representation. This representation helps the query rewriting step to perform complex XML view composition, and is then translated into SQL statements. Therefore, XPERANTO allows the user to publish relational data as XML using a high-level XML query language, which eliminates the need for application code. However, only the flat relational representation is considered in the XML view composition over the relational database, and update queries are not supported.

In [Shanmugasundaram et al. 2001b], the authors propose an SQL-based language specification that extends SQL with new scalar and aggregate functions for constructing XML documents inside the relational engine, which may significantly improve performance.

SilkRoute [Fernández et al. 2000], [Fernández et al. 2002] is another tool for relational data publishing in XML. The publishing process is accomplished in three steps: (1) relational tables are represented in a canonical XML view; (2) a public, virtual XML view is specified in the XQuery language by the database administrator; (3) an application formulates queries in the XQuery language over the public view. Then, the proposed algorithm translates the XQuery expression into SQL and decomposes the XML view over the relational database into an optimal set of SQL queries. All the previous approaches consider only the flat representation of the relational tables and do not consider the integrity constraints in the XML view, which may make it difficult to write the XML queries properly.

One important issue in relational publishing in XML is the transformation of the relational schema into an XML schema, which is then used to publish the relational data in XML. This issue has been well studied in the literature and several approaches for transforming a relational schema into an XML schema have been proposed. In [Lee et al. 2001], a straightforward method, called Flat Transformation, is presented. This method converts the relations and attributes of a relational schema into elements and attributes of a DTD, respectively. Lee et al. in [Lee et al. 2002], [Lee et al. 2003] give two algorithms called NeT (Nested-based Transformation) and CoT (Constraints-based Transformation). The NeT algorithm derives nested structures from the flat relations by repeatedly applying the nest operator based on the tuple values of each relation. Therefore, the resulting nested structures are not based on the relational semantics of the schema. The CoT transformation considers inclusion dependency constraints to create a more intuitive DTD. An algorithm that transforms a relational schema into a DTD by taking into account the functional dependencies has been proposed in [Lv and Yan 2007]. However, all these works fail to capture all the semantic and integrity constraints of the relational schema and they introduce some redundancy, due to the use of DTD as the target schema language.
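To give an intuition of the nest operator on which NeT relies, the following sketch applies one nesting step to a flat relation: tuples that agree on all other attributes are merged, and the values of the nested attribute are collected. This shows only the generic operator, not the NeT algorithm itself; the function name nest and the sample relation are assumptions.

```python
from collections import defaultdict

def nest(rows, attribute):
    """Merge rows that agree on every other attribute, collecting `attribute` values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple((name, value) for name, value in row.items() if name != attribute)
        groups[key].append(row[attribute])
    nested_rows = []
    for key, values in groups.items():
        nested = dict(key)
        nested[attribute] = values  # the attribute becomes set-valued (repeatable)
        nested_rows.append(nested)
    return nested_rows

# Hypothetical flat relation enrolment(course, student).
enrolment = [
    {"course": "XML", "student": "Ann"},
    {"course": "XML", "student": "Bob"},
    {"course": "SQL", "student": "Ann"},
]
nested_enrolment = nest(enrolment, "student")
print(nested_enrolment)
```

Because the grouping depends only on the values found in the tuples, the nesting obtained reflects the current data rather than the constraints declared in the schema, which is the weakness noted above.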

There are also several works taking XML Schema as the target schema language. In [Fong and Cheung 2005], the authors propose to translate a relational schema into an Extended Entity Relationship (EER) model, which is then transformed into a conceptual XSD graph. The XSD graph is finally mapped into an XML schema. Thus, two steps are introduced, which may result in some level of inefficiency. The method proposed in [Yang and Sun 2008] to automate the transformation of the relational schema into an XML Schema only considers the semantics of inclusion dependencies. Two other recent works about the transformation from a relational schema into XML Schema are presented in [Liu et al. 2006], [Zhou R. et al. 2008]. However, they cannot capture the hierarchical view of the relational model, which is especially useful to provide the complete and consistent mapping between XML and relational schemata that is needed for performing a coherent series of cascading SQL update queries. In addition, they do not fully automate the process, because their transformation algorithm needs the intervention of an expert to decide which relation is the “dominant relation”.

In [Nayak et al. 2010], a relational data publishing approach driven by the customer's request and conditions is proposed. In this approach, the schema of the database is not exposed to the customer, which may be useful for security. The proposed middleware consists of four main components: (1) the XML to SQL schema mapping, which maps the requested XML format to the database schema, (2) the XML to SQL translation, which is designed for a set of statements for a particular business process, (3) the SQL to XML translation, which converts the SQL query result into an XML file, and (4) the XML to XML schema translation, which maps the schema of the resulting XML file into the customer schema.

2.1.8.5 XML storage into relational databases

The goal of XML storage into relational databases is to use the Relational Database Management System (RDBMS) to store and/or query XML data. Indeed, with the large amount of data represented as XML documents, it becomes necessary to store and query these XML documents efficiently. To address this problem, some works have focused on building native XML database systems [Bourret 2007]. These database systems are built from scratch for the specific purpose of storing and querying XML documents. However, this approach suffers from the potential disadvantage that native XML database systems do not exploit the sophisticated storage and query capabilities already supported in existing relational database systems. To address this situation, the database research literature has seen an explosion of publications about storing and querying XML data in relational databases.

In the XML storage problem, we may distinguish two main sub problems: (1) a relational schema has to be designed and created for storing the XML data, (2) XML queries have to be translated to SQL ones.

Several solutions have been proposed to store XML data into relational databases [Deutsch et al. 1999], [Florescu and Kossmann 1999], [Schmidt et al. 2001], [Tatarinov et al. 2002]. They all focus on mapping XML documents into a specifically designed relational database for efficient storage and querying of XML data. Many techniques have been proposed for solving this problem. However, most of them concentrate on storing and querying XML documents without any knowledge of the schema associated with the XML data [Florescu and Kossmann 1999], [Deutsch et al. 1999]. In some techniques, the mapping is inferred from Document Type Definitions (DTDs) for storing and querying XML documents [Shanmugasundaram et al. 1999], [Kappel et al. 2004]. Other techniques use XML Schema [Penna et al. 2006].

These techniques generally follow a set of steps. The first step consists of generating the relational schema, where relational tables are created for the purpose of storing XML documents. Then, XML documents are "stored" by shredding them into tuples of the tables that were created. Finally, the XML queries over the "stored" XML documents are converted into SQL queries over the created tables. The SQL query results are then tagged to produce the desired XML result.
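These steps can be sketched on a very small scale as follows. The sketch assumes a single generic node table that ignores any XML schema, a hand-written SQL translation of one query, and an in-memory SQLite database; the table layout, the function name shred and the sample document are illustrative assumptions and do not reproduce any of the cited systems.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(conn, element, parent_id=None):
    """Store one element as a tuple (id, parent_id, tag, text), then recurse."""
    cursor = conn.execute(
        "INSERT INTO node (parent_id, tag, text) VALUES (?, ?, ?)",
        (parent_id, element.tag, (element.text or "").strip()),
    )
    node_id = cursor.lastrowid
    for child in element:
        shred(conn, child, node_id)

conn = sqlite3.connect(":memory:")
# Step 1: generate a relational schema able to hold any XML tree.
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT, text TEXT)"
)
# Step 2: shred the document into tuples.
shred(conn, ET.fromstring(
    "<bookstore>"
    "<book><title>Everyday Italian</title><price>30.00</price></book>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
))
# Step 3: the XML query /bookstore/book[price > 30]/title, translated by hand into SQL.
rows = conn.execute(
    """
    SELECT title.text
    FROM node AS book
    JOIN node AS title ON title.parent_id = book.id AND title.tag = 'title'
    JOIN node AS price ON price.parent_id = book.id AND price.tag = 'price'
    WHERE book.tag = 'book' AND CAST(price.text AS REAL) > 30
    ORDER BY title.id
    """
).fetchall()
# Step 4: tag the SQL result to produce the XML answer.
tagged = "".join(f"<title>{text}</title>" for (text,) in rows)
print(tagged)
```

Even for this simple path query, the SQL translation needs one self-join per navigation step, which hints at why the design of the relational schema and the query translation are treated as the two main sub-problems.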

2.1.8.6 Relational databases update from XML

An important issue in updating relational databases from XML is to make sure that only valid XML updates are allowed and that they are correctly applied to the relational database. In the literature, the update problem has recently received more attention.

In the context of updating relational databases from XML views, the authors in [Wang L. and Rundensteiner 2004], [Wang L. et al. 2006a], [Wang L. et al. 2006b] deal with the problem of the existence of a correct relational translation for a given update over an XML view. They propose a theoretical foundation for the XML view update problem. In addition, a schema-centric algorithm, called Schema-driven Translatability Reasoning algorithm (STAR), is proposed to classify a given view update into one of three types: "untranslatable", "conditionally translatable", and "unconditionally translatable". The classification is based on several features of the XML view and the update statements, including: (a) the granularity of the update at the view side, (b) the properties of the view construction, and (c) the types of duplication (i.e. content and/or structural duplication) appearing in the view. Indeed, they extend the concept of “clean source” for relational databases into a new theory called "clean extended-source theory" that defines the criteria to determine whether a mapping translation is correct.

Considering the same problem of updating relational databases through XML views, the authors in [Braganholo et al. 2004], [Braganholo et al. 2006] propose a framework, called PATAXO, for supporting the updates of the relational database from the XML views. In PATAXO, XML views are specified as query trees and mapped into relational views. XML view updates are translated to updates of relations only if XML views are well-nested, and if the query tree has no duplication. In fact, techniques from the existing works on updating relational views are used to decide if this relational view is updatable with respect to relational updates or not. If it is, updates are then translated to the relational database.

Other works have studied the XML view update problem [Fan et al. 2007], [Choi et al. 2008], [Cong 2007]. They consider XML views and updates when XML views are defined using recursive DTDs and the XML updates are specified in terms of recursive XPath expressions with complex filters.

In the context of the storage of XML documents into relational databases, the authors in [Barbosa et al. 2004] propose to handle information preservation in the XML to relational database mapping by basing their approach on the design of classes of mapping schemes. These mapping schemes are "lossless", i.e. they preserve the structure and content of the documents, and "validating", i.e. they can be mapped into legal databases. They are especially important to ensure the correctness of the translation of XML queries (including updates) into SQL transactions (i.e. only updates resulting in valid documents should be allowed).

In the context of the so-called XML-Enabled Database [Pardede et al. 2006], the authors focus on XML update management. They propose an update methodology based on preserving the conceptual semantic constraints of the XML data during updates. The different XML conceptual constraints are first transformed into a logical model: the XML semantic constraints are classified and represented in an XML Schema. Then, SQL/XML annotations can be added to the XML Schema to transform its components into SQL components. Finally, different generic update methodologies are proposed for the following operation types: (i) insertion, (ii) deletion and (iii) replacement. For each type, three different maintenance strategies are considered: (i) restrict, (ii) nullify and (iii) cascade. Different kinds of update targets are considered: a key, a keyref, or just simple content. These generic methods are the basic functions that need to be followed for any XML storage regardless of the underlying data model.
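The three maintenance strategies have well-known counterparts in relational referential integrity, which can be used to illustrate their effect when a referenced item is deleted. The sketch below is not the methodology of [Pardede et al. 2006]; it assumes that restrict, nullify and cascade correspond to the SQL actions ON DELETE RESTRICT, SET NULL and CASCADE, and it uses a hypothetical author/book schema in an in-memory SQLite database.

```python
import sqlite3

def delete_parent(strategy):
    """Delete a referenced row under the given ON DELETE action; report the effect."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE book (id INTEGER PRIMARY KEY, "
        f"author_id INTEGER REFERENCES author(id) ON DELETE {strategy})"
    )
    conn.execute("INSERT INTO author VALUES (1)")
    conn.execute("INSERT INTO book VALUES (10, 1)")
    try:
        conn.execute("DELETE FROM author WHERE id = 1")
    except sqlite3.IntegrityError:
        return "rejected"  # restrict: the deletion is refused
    return conn.execute("SELECT id, author_id FROM book").fetchall()

outcomes = {strategy: delete_parent(strategy) for strategy in ("RESTRICT", "SET NULL", "CASCADE")}
print(outcomes)
```

With restrict the deletion is refused, with nullify the referencing value is set to NULL, and with cascade the referencing row is deleted as well.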

In fact, the problem of updating relational databases from XML data remains a challenge. It has not been thoroughly investigated and there is still no widely accepted solution. Furthermore, XML views are increasingly used for exporting and querying relational data and XML data. However, there are still some open problems worth investigating in exporting and querying XML views [Cong 2007]. For instance, the different XML storage techniques are still evolving and are subject to open debates in the database community. These techniques will introduce new problems related to building the XML view, in terms of the update syntax and of updating through XML views (especially when XML data are distributed over different sites or are stored in different relations). Another important issue is the document-centric aspect of XML data, which receives less attention from the database community. Nevertheless, many XML data, such as bioinformatics data, are document-centric. Thus, XML view generation for document-centric data is worth investigating. Finally, concurrency control should be considered in XML updates, which would open up interesting new research avenues.

2.1.8.7 Bidirectional data exchange between XML and relational databases

Some related works have addressed the problem of data integration between XML and relational databases with a focus on bidirectional data exchange between the two technologies. We summarize them hereafter.

In [Bourret et al. 2000], the authors propose a utility for data exchange between XML documents and relational databases. The utility views an XML document as a tree of objects, so that object-relational mapping techniques can be applied. It offers four functionalities: (1) extracting data from an XML document into relational tables of a known schema; (2) creating an XML document from extracted relational data according to a known DTD; (3) generating relational schemas from DTDs; (4) generating DTDs from relational schemas. To define the mapping between XML and relational databases, the authors propose an XML-based language that extends the principles of object-relational mapping to handle XML. This work only takes into account the transfer of XML documents into a known relational schema. It does not consider the update of an existing relational database, where the new XML data must be reconciled with the previously existing data to obtain a coherent and consistent database update.
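
To make the first functionality concrete, the sketch below loads elements of an XML document into a table of a known schema, treating each element as an object whose attributes and child elements map to columns. It is only an illustration under our own assumptions: the mapping dictionary `MAPPING`, the `patient` element and table, and the function `load_xml` are hypothetical and do not reproduce the mapping language of [Bourret et al. 2000].

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping: element name -> (table, {child element or @attribute: column})
MAPPING = {
    "patient": ("patient", {"@id": "id", "name": "name", "birthdate": "birthdate"}),
}

def load_xml(conn, xml_text, mapping=MAPPING):
    """Insert one row per mapped element and return the number of rows inserted."""
    root = ET.fromstring(xml_text)
    inserted = 0
    for tag, (table, fields) in mapping.items():
        for elem in root.iter(tag):
            row = {}
            for source, column in fields.items():
                if source.startswith("@"):
                    row[column] = elem.get(source[1:])    # XML attribute
                else:
                    row[column] = elem.findtext(source)   # child element text, None if absent
            cols = ", ".join(row)
            marks = ", ".join("?" for _ in row)
            # Plain INSERT: no reconciliation with rows that already exist.
            conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})", list(row.values()))
            inserted += 1
    return inserted
```

Because the function issues plain INSERT statements, loading a document whose keys already exist in the table fails on the primary key instead of updating the stored rows, which illustrates the limitation noted above.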

In [Tzvetkov et al. 2005], the authors propose a tool to transfer data between XML and relational databases. The system design comprises three classes: (1) the FormDBXML class, which provides the interface and controls the two other classes; (2) the DBReader class, which converts database tables to XML documents by reading the relational model and producing the corresponding XSD schema; (3) the XMLParser class, which converts XML documents to database tables. However, the authors propose no methodology either for generating the XSD from the relational model or for generating the relational tables from the XSD schema. Furthermore, the proposed tool does not handle data exchange when both the XSD and the database schemas already exist.

The RelaXML approach [Knudsen et al. 2005] has been proposed for automatic, bidirectional, XML-based exchange of relational data. RelaXML supports both the export of relational data to XML and its re-import from XML. The approach separates the selection of the data from the definition of the XML structure. The data to export are defined by a concept, which can be compared to a view and results in one SQL query. RelaXML detects at export time whether re-import is possible and warns the user if it is not. Three kinds of import functions are supported: insert, update and merge.
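
The export and re-import cycle can be sketched as follows. This is not RelaXML's implementation: the functions `export_concept` and `import_rows` are hypothetical, a concept is reduced to a list of columns of a single table, and the re-import check is reduced to the simplifying assumption that the exported rows must carry their key.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_concept(conn, table, columns, key):
    """Export the selected columns as XML and report whether re-import looks possible."""
    reimportable = key in columns  # simplifying assumption: rows must carry their key
    root = ET.Element(table + "s")
    for values in conn.execute(f"SELECT {', '.join(columns)} FROM {table}"):
        row = ET.SubElement(root, table)
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode"), reimportable

def import_rows(conn, table, key, xml_text, mode="merge"):
    """Re-import rows; mode is 'insert', 'update' or 'merge' (update if present, else insert)."""
    for row in ET.fromstring(xml_text):
        data = {child.tag: child.text for child in row}
        exists = conn.execute(
            f"SELECT 1 FROM {table} WHERE {key} = ?", (data[key],)
        ).fetchone() is not None
        if exists and mode in ("update", "merge"):
            others = [c for c in data if c != key]
            assignments = ", ".join(f"{c} = ?" for c in others)
            conn.execute(
                f"UPDATE {table} SET {assignments} WHERE {key} = ?",
                [data[c] for c in others] + [data[key]],
            )
        elif not exists and mode in ("insert", "merge"):
            cols = ", ".join(data)
            marks = ", ".join("?" for _ in data)
            conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})", list(data.values()))
```

In this sketch the merge mode is simply the union of the two others: a row whose key is already stored is updated, and any other row is inserted.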

As a result, we observe that the problem of updating existing relational databases in legacy systems from XML data, where the new XML data should be reconciled with the previously existing data (to have a consistent database update) has not been considered in any of the previous works.

2.1.9 Discussion

The proposed solutions for publishing relational data in XML consist in designing an XML view as a virtual view over the relational data and then defining the mapping from the relational model to this XML view. However, most solutions address data exchange in one direction only. They provide a read-only solution, i.e. they only consider exporting relational data into an XML format through XML views. Some publishing solutions support the translation of XML queries into SQL queries to retrieve the needed data from the relational database. However, they do not consider update queries, as all XML query languages only allow querying XML data and do not yet support INSERT or UPDATE clauses. As a result, only a few works take into consideration the problem of updating the relational database from the published XML views.

The proposed solutions for XML storage in relational databases are based on transforming the XML data model into a corresponding relational model, i.e. on designing a new relational schema derived from the XML representation model. However, relational models based on an XML tree structure are abstract representations and do not provide optimal data retrieval for business processes. Furthermore, these solutions do not take into account the case where new XML data must be stored in existing operational legacy information systems with their well-optimized relational models.

In both scenarios, publishing relational databases in XML and storing XML in relational databases, the problem of updating relational databases from XML has not received much attention. One possible reason is that XQuery only allows querying XML data and does not yet support INSERT or UPDATE query types. Recently, however, the World Wide Web Consortium (W3C) has proposed the XQuery Update Facility as a candidate recommendation. It extends the XQuery language with five new kinds of expressions (insert, delete, replace, rename, and transform). The emergence of the XQuery Update Facility should therefore facilitate the update of XML documents. However, a coherent and correct translation of such update queries into relational database queries is still needed.
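
To give an intuition of the five kinds of update expressions, the sketch below performs an analogous operation for each of them on an in-memory XML tree. It uses Python's standard `xml.etree.ElementTree` module rather than XQuery itself, so it illustrates the intent of each expression and not its syntax or exact semantics; the sample document is invented for the example.

```python
import copy
import xml.etree.ElementTree as ET

doc = ET.fromstring("<patient><name>Doe</name><phone>123</phone><note>old</note></patient>")

# insert: add a new child node
ET.SubElement(doc, "email").text = "doe@example.org"

# delete: remove an existing node
doc.remove(doc.find("phone"))

# replace: here, replace the value of a node
doc.find("note").text = "new"

# rename: change the name of a node while keeping its content
doc.find("name").tag = "fullname"

# transform: copy the tree, modify the copy and leave the original untouched
snapshot = copy.deepcopy(doc)
snapshot.remove(snapshot.find("email"))

result = ET.tostring(doc, encoding="unicode")
```

Translating such tree-level operations into SQL statements over an existing relational schema is the part that remains open in the works reviewed above.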

In spite of the large number of works on the integration of XML and relational databases, we observe throughout our literature review that these works focus either on mapping relational data to XML, with the goal of publishing relational data in XML, or on mapping XML to relational data, for storing and querying XML data efficiently in newly designed relational databases. None of these works took into account the problem of storing XML data into existing relational databases (in other words, the problem of updating existing relational databases from XML documents). Yet existing relational databases nowadays represent critical infrastructures in most enterprises and organisations. Saving XML data meaningfully into the relational tables of a legacy system is challenging and far from straightforward, because the XML and relational approaches are built on different principles. Therefore, tools and solutions are needed to facilitate the update of legacy relational databases from XML documents while preserving the integrity constraints of the target databases, which requires reconciling the new XML data with the pre-existing data to obtain a consistent database update.

2.1.10 Conclusion

Data integration has been extensively studied during the last three decades and a large theoretical foundation for this field can be found in the literature. In this first part of our state of the art, we reviewed the main theoretical aspects and works in the data integration field. We presented the data integration problem and its main scenarios. We distinguished and reviewed three main data integration architectures from the literature: the classical mediation architecture, the data exchange architecture, and the peer-to-peer data integration architecture. We presented a formal comparison between data exchange and data integration systems, as well as the different approaches and languages used for mapping composition in data integration systems: GaV, LaV, GLaV and BGLaV. We discussed some of the remaining open problems in the data integration field. Next, we presented and reviewed the XML technology and its related languages and standards. Finally, we reviewed the role of XML data management in data integration. We classified into three main categories the works in the literature that address data integration when a relational database is the main data storage technology and XML is the main data exchange technology, and we presented and reviewed the works in each category.

Data integration is expected to remain an active area due to its social aspects, as enterprises and organisations need to share their data and integrate their systems when they decide to collaborate and cooperate in all fields. In this context, on the one hand, XML has emerged as a very widely accepted standard for data exchange on the Web and among applications (e.g. Web Services and the Semantic Web). On the other hand, relational databases are the most mature and well-studied technology for efficiently storing, querying, and retrieving data in almost all information systems. Therefore, a huge amount of work has been done to combine both technologies in data integration, data exchange and data management systems. However, many open problems still need to be investigated and solved, such as the bidirectional data exchange between these two models, the update of relational databases from XML data, the update through XML views, concurrency control when updating XML data, and the management of data redundancy in the XML view, especially during update operations.

Through this literature review of the existing works on data integration between XML and relational databases, we noticed that the issue of updating existing relational databases in legacy systems from XML data has not yet received much attention. However, existing relational databases are critical assets in most enterprises and organisations, and XML is the main standard for data interchange among them. Therefore, new tools and solutions are needed to facilitate the update of a legacy relational database from XML data or documents. This process should be automatic and transparent. Our goal in this thesis is to tackle this issue. We therefore propose a generic mediation solution to automate the exchange between XML and relational databases and, in particular, to ensure an automatic and coherent update of any existing legacy database from any XML data or document.

2.2 eHealth Challenges & Applications

2.2.1 Introduction

In the present era, thanks to the rapid development of the so-called Information and Communication Technologies (ICT), the methods of communication are changing considerably in a way that increases the possibility for organizations and individuals to share their experiences and knowledge in a society where information becomes a fundamental asset.

Healthcare is one of the vital domains that should be affected by these changes. According to WHO [Dzenowagis and Kernen 2005], “ICT is changing healthcare delivery and is at the core of responsive health systems”. Indeed, ICTs should considerably improve the delivery of high-quality healthcare services and expand their provision. Therefore, the main ICT challenge is to enable “more informed decision-making and more cost-effective use of resources” by building new medical information systems capable of giving the right information whenever and wherever it is needed. To meet this challenge, ICT must provide access to complete and correct patient information. It must also allow effective and timely communication between the different stakeholders in the health domain, including the patients, and it must ensure the delivery of healthcare whatever the distance.

Healthcare has become a cooperative process that involves several individuals and organizations with different but complementary tasks and skills. In this context, new challenges have emerged that require new medical information systems to be more intelligent and pervasive [Verdier 2006]. In this section, we thus review some of the challenges in the areas of eHealth and medical informatics and examine the current situation, taking advantage of the extensive experience of MTIC in this field, gained through several years of deep involvement in national and European projects in telemedicine and eHealth (such as the SCP-ECG, OEDIPE and EPI-MEDICS projects) [Fayn et al. 1994a], [Fayn et al. 2003], [Télisson et al. 2004], [Rubel et al. 2005], [Fayn and Rubel 2010].

We particularly focus on the content and structure of the various medical records [Eichelberg et al. 2005] which have already been or are being implemented in European countries, the coordination policies for medical information exchange advocated by international standards committees [Loukil 1997], the medical services under development, and the methods for accessing rich, personalized multimedia data [Ghedira 2002]. Another issue is the exploration, through a collection of examples, of different use-case scenarios that illustrate the need to exchange medical information in the areas of pHealth and telemedicine [Atoui 2006], [Telisson 2006].

We first define the meaning of eHealth and review the eHealth challenges addressed by European research over the past 20 years. We then present two of the European projects related to our work: the SCP-ECG and the OEDIPE projects. Next, we review some recent works related to the information retrieval challenge in the healthcare field. Afterwards, we present the interoperability problem in the healthcare field and its categories. Then, we focus on the electronic health record (EHR), its definition and its advantages. Finally, we review the EHR interoperability problems and some of the most advanced standards that have already been or are being implemented in European countries and all over the world.

2.2.2 eHealth

One of the most frequently cited definitions of eHealth was published by the eHealth researcher Gunther Eysenbach in [Eysenbach 2001]:

« e-health is an emerging field in the intersection of medical informatics, public health and business, referring to health services and information delivered or enhanced through the Internet and related technologies. In a broader sense, the term characterizes not only a technical development, but also a state-of-mind, a way of thinking, an attitude, and a commitment for networked, global thinking, to improve health care locally, regionally, and worldwide by using information and communication technology. »

In fact, eHealth, also known as ICT for health, designates all the applications of information and communication technologies across the whole range of functions that affect the health sector, from the doctor to the hospital manager, via nurses, data processing specialists, social security administrators and patients. It offers European citizens important opportunities for improved access to better-quality healthcare. It can empower both patients and healthcare professionals, and it offers governments a means of handling the increasing demand for healthcare services. It can also help to reshape the future of healthcare delivery, making it more citizen-centred. eHealth promises substantial productivity gains today and restructured, citizen-centred health systems tomorrow, while respecting the diversity of Europe’s multi-cultural, multi-lingual healthcare traditions.

Examples of applications include health information networks, electronic health records, telemedicine services, personal wearable and portable communicating systems, health portals, and many other ICT-based tools assisting disease prevention, diagnosis, treatment, health monitoring and lifestyle management.

2.2.3 European research challenges in eHealth

The European Commission (EC) has been supporting eHealth Research and Development (R&D) through framework programs for over 20 years and has contributed to the emergence of new generations of technologies in several fields of healthcare. Over these years, the EC has supported about 450 collaborative R&D projects in eHealth with a Community contribution of over €1 billion.

Historically, we can identify two major periods in European research in the health domain. Starting in 1985, with the goal of developing new tools for health professionals to make health delivery systems more efficient, the EC launched a planning exercise called “Bio-Informatics Collaborative European Program and Strategy”, which was the first to identify what was then referred to as the “health telematics” industry. An exploratory action was adopted in 1988. It was followed by the “Advanced Informatics in Medicine (AIM)” program, with a Community contribution worth €111 million.

The AIM program marked the start of the first major period of European research in the health domain. The goal was to stimulate the development of applications by enabling infrastructures that allow fast access to vital information and the sharing of information among health professionals, in order to improve the access, quality, and efficiency of care. Thus, the EC R&D projects of the 1990s focused mainly on the development of electronic health records and on connectivity among all the points of care of a health delivery system at regional and national levels. Currently, many countries cooperate on making their systems interoperable to support patient mobility and to create a single European eHealth market.

Following the initial focus on infrastructures for health professionals, a second major period started in 1999; it is still active today and is expected to continue for the next ten or twenty years. It focuses on developing tools for patients to improve the prevention and personalization of healthcare. Activities over the last decade have therefore aimed at setting up an intelligent patient-centered environment supporting personalized healthcare (ambient environment for patients). Wearable and portable personal health systems have been developed to provide patients and health professionals with better information on a person's health status. These systems support health monitoring (homecare), chronic disease management and disease prevention.

The next step, as stated by the European Commission during this year's eHealth week [ECIS 2010], will be the promotion of ICT for predictive medicine under the slogan “Towards the Virtual Physiological Human (VPH)”. The VPH concept and its R&D roadmap have been published by the EC. Patient-specific models will be developed to support safer medical operations, personalized treatments and safer drugs. Knowledge of disease and physiology will be integrated from the level of molecules and cells to the level of organs and systems. The VPH is intended to constitute a global “scaffold” of medical knowledge and a “toolbox” for researchers.

Hereafter we present in some detail two projects that were part of the European research in the health field and in which our laboratory was deeply involved for several years: the SCP-ECG and the OEDIPE projects.

2.2.3.1 Standard Communications Protocol for Computer-Assisted Electrocardiography (SCP-ECG) project

The ECG has been the subject of very advanced research, not only on information processing but also on exchanging information in a standard and reusable way across heterogeneous hardware and software environments. This standardization initiative (launched in 1989) was supported by the first framework program of the European Union in telemedicine, “Advanced Informatics in Medicine (AIM)”, which paved the way for the SCP-ECG project “Standard Communications Protocol for Computer-Assisted Electrocardiography” (SCP-ECG AIM #A1015 1991). This project brought together the main European academic teams in the field of quantitative electrocardiology and 90% of the electrocardiograph manufacturers worldwide. The goal was to develop a “protocol” for a universal syntactic and semantic representation of ECG data, in order to overcome the interoperability problem of the existing proprietary formats and to allow digital ECGs to be stored as files readable everywhere and at any time. This goal is of great public health interest for patients as well as for professionals and health institutions, and it is also expected to have a strong economic impact on the industry.

The resulting SCP-ECG protocol is composed of 11 data sections. The first section contains, in the form of tags, all the metadata documenting the ECG. Four sections are devoted to the representation of the encoded, possibly compressed, signals, together with the parameters needed to restore them. The next three sections represent the results of the three preliminary stages of classical ECG analysis: the delimitation of the waves, the calculation of global parameters intrinsic to the cardiac electrical activity, such as the heart rate, and the calculation of amplitude and duration measurements for each ECG lead. The last three sections are devoted to the representation, in text form, of the human and automatic diagnostic interpretations of the ECG, encoded according to a standardized data dictionary [Willems et al. 1992]. This protocol has the advantage of being designed to be fully customizable. Indeed, all data sections are optional except the first one, which contains only identifying data such as the name and date of birth of the patient for whom the ECG was recorded, and the date and time of the ECG recording.
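
The organisation described above, one mandatory identification section followed by optional groups of sections, can be summarised by a small data structure. The sketch below is only a reading aid: the group names, the class `ScpEcgRecord` and its fields are hypothetical labels, the counts follow the description given in this text rather than the section numbering of the standard itself, and no attempt is made to reproduce the binary encoding of the protocol.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical labels for the four groups of sections described in the text;
# the counts (1 + 4 + 3 + 3 = 11) follow the text, not the standard's own numbering.
SECTION_GROUPS: Dict[str, int] = {
    "identification_metadata": 1,
    "encoded_signals": 4,
    "analysis_results": 3,
    "diagnostic_interpretations": 3,
}

@dataclass
class ScpEcgRecord:
    # Mandatory section: tagged identifying data (patient, date and time of recording)
    identification: Dict[str, str]
    # Every other section is optional and kept here as an opaque payload
    optional_sections: Dict[str, bytes] = field(default_factory=dict)

    def __post_init__(self):
        if not self.identification:
            raise ValueError("the identification section is mandatory")
```

A record reduced to its identification section is therefore still valid in this sketch, which mirrors the customizable nature of the protocol.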

Therefore, the SCP-ECG protocol is open to various modes of ECG representation, whatever the capture and recording system used and whatever its ECG analysis and automatic interpretation capabilities. It contains all the metadata needed for future exploitation of the transmitted information. It includes a description of the ECG recording conditions and the information that every doctor needs to interpret the ECG, such as the patient's medication at the time of the examination, which may have significant effects on the ECG signals. All the included information is thus inherent to the ECG representation, which makes this protocol a reference model for representing data for transport.

After the AIM project, the SCP-ECG protocol was submitted in 1993 for approval as a pre-Norm to the European standardization body CEN (Comité Européen de Normalisation), within the Technical Committee TC251, under the project named Team 007. The SCP-ECG protocol was then approved by CEN as a European Norm in 2005 (EN 1064:2005) and was recognized in 2007 as an ISO standard (ISO/DIS 11073-91064).

2.2.3.2 Open European Data Interchange and Processing for Electrocardiography (OEDIPE) Project

One of the three research directions conducted in the SCP-ECG project jointly with the development of the communication protocol was the development of European guidelines and recommendations for storing ECG data, together with a draft storage reference model compatible with the SCP-ECG protocol and with every serial ECG analysis service [Rubel et al. 1991]. This research direction was then further developed within the European project OEDIPE (Open European Data Interchange and Processing for Electrocardiography) in the second phase of the AIM framework (OEDIPE AIM #2026 1991).

The goal of this project was to design a reference model that supports the storage of recorded ECG data independently of the capture system used and of the analysis or interpretation methods applied. The adopted approach was therefore to develop a generic representation of the cardiac electrical information as it is used, manipulated or processed, and of the metadata associated with the conditions and methods of acquisition, analysis and interpretation.

The main difficulty was the semantic heterogeneity of the representation of the calculated data. The main concern was then to develop a complete model supporting the vast diversity of approaches to recording and analyzing cardiac electrical information, compatible with multiple usage scenarios of this information in various contexts of clinical practice and medical research. The requirements for providing ECG information to an emergency department, to general practitioners or to a specialist cardiology care unit are very different. The management of the history of analysis results can also follow different procedures depending on whether the goal is the assessment of clinical activity, a diagnostic or therapeutic purpose, or scientific research.

The modeling approach was to develop a representation of cardiac electrical information divided into sub-models, in a modular fashion, so that only a portion of the overall model can easily be selected without affecting its consistency. The final reference model is a personalized, service-oriented representation for intelligent information management. Indeed, the model is enriched with additional metadata that complete the information and enable the automatic control of serial analysis processes. Thus, one or more ECGs of a patient can be qualified as reference ECGs so that all of them, or only the most recent one, are automatically compared with a newly recorded ECG. However, a protocol of serial ECG analysis can be quite complex in some situations. A meta-model has therefore been designed for the intelligent control of the analysis of the stored ECG data. This meta-model represents the following view of a patient whose ECG data are stored: the patient’s life may be punctuated by events, either scheduled according to a theoretical protocol or intercurrent; some of these events may be qualified as reference events; and patients can be grouped into categories. The reference concept was thus introduced at three levels: the level of all individuals, the level of the history of an individual, and the level of the individual's electrocardiographic examinations. This leads to a three-dimensional reference representation: Patient, Time, Descriptors.
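
The role of reference ECGs in serial analysis can be illustrated by a deliberately tiny relational sketch. The two tables, their columns and the function `comparison_targets` are hypothetical and bear no relation to the actual OEDIPE tables; the sketch only shows how a flag stored with each ECG lets a serial analysis select the reference recordings, or only the most recent one, to compare with a new recording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE ecg (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patient(id),
    recorded_at TEXT,          -- ISO 8601 timestamp, so text order is time order
    is_reference INTEGER       -- 1 if the ECG is qualified as a reference ECG
);
""")

def comparison_targets(conn, patient_id, new_recorded_at, only_latest=True):
    """Return the ids of reference ECGs recorded before the new one, newest first."""
    rows = conn.execute(
        "SELECT id FROM ecg WHERE patient_id = ? AND is_reference = 1 "
        "AND recorded_at < ? ORDER BY recorded_at DESC",
        (patient_id, new_recorded_at),
    ).fetchall()
    ids = [row[0] for row in rows]
    return ids[:1] if only_latest else ids
```

With `only_latest=True` the new ECG is compared with the most recent reference ECG only; with `only_latest=False` it is compared with all the earlier reference ECGs of the patient.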

This reference model for storing ECG data, designed using the relational approach, is composed of about 50 tables and 200 attributes, divided into six sub-models that refer to six major phases: data acquisition, data preprocessing, unary analysis, serial analysis, interpretation, and management of the ECG analyses [Fayn et al. 1994a], [Fayn et al. 1994b]. The model supports the storage not only of resting ECG data but also of sequences of particular recordings from a long-duration electrocardiographic examination. A stress ECG, for example, involves collecting a set of ECG sequences of almost 10 seconds each, every sequence being recorded after a particular, progressively increasing level of stress that the patient reaches through physical exercise such as walking on a treadmill or pedaling on a stationary bicycle. This examination is intended to assess the fitness of individuals and to test their cardiovascular tolerance in a stress situation. The model also allows the representation of such sequences of recorded data and of their associated metadata, such as the power of the effort exerted at each stress level and the intercurrent clinical signs. The stored ECG data may also correspond to representative sequences from ambulatory recordings, continuous monitoring or telemetry.

The OEDIPE project was ranked one of the top two European projects of the AIM program by the departments of the European Community. It was thus selected for presentation at the Inter-Ministerial Conference of the G-7 Information Society, held in Brussels in February 1995.

2.2.4 Information Retrieval in Healthcare

The access to the right information, when and where needed, is especially crucial in the medical domain. Therefore, we review hereafter the challenge of Information Retrieval in the field of health.

The medical information systems can be classified into the following four categories:
1. Information systems for general practitioners or specialists or even for dentists.
2. Information Systems for different units in the hospital, such as imaging, electrocardiography, or biology laboratories.
3. Hospital Information Systems.
4. Knowledge based Information Systems.

In [Horsch and Balbach 1999], the authors distinguish three different techniques used for information retrieval in the medical field:
1. Indexing text files by keywords: information can be found in the form of text documents or hypertext (guidelines, digital libraries).
2. Structured Query Language to a database: the interface to search information is an essential function of databases.
3. Knowledge based Information systems: here the knowledge base is not directly accessible by the user queries. The physician working with such a system expects the system to offer specific diagnostic or therapeutic information. Nevertheless, a “good” system should be able to justify its decisions by indicating the knowledge on which they are based.
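
As an illustration of the first technique, the following minimal sketch builds a keyword index over a few text documents and answers conjunctive keyword queries. It is a generic inverted index written for this example, not the system described in [Horsch and Balbach 1999]; the functions `build_index` and `search` and the tokenization rule are our own assumptions.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lower-cased keyword to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Naive tokenization: split on whitespace and trim surrounding punctuation
            index[word.strip(".,;:()")].add(doc_id)
    return index

def search(index, *keywords):
    """Return the ids of the documents that contain all the given keywords."""
    sets = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*sets) if sets else set()
```

The second technique replaces this index by a database query interface, while in the third the user no longer formulates the retrieval query at all.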

Many works have focused on information retrieval from different viewpoints but with the same objective, “access to the right information when and where needed”, to help users make their decisions. The work of [Ghedira et al. 2002] focuses on the personalization of information retrieval and display during the consultation of hypermedia data, based on the preferences of users and their practice. It proposes a method called HANS to design models that serve as a basis for the development of hypermedia consultation systems, particularly to capitalize on the knowledge of experts and to manage user navigation profiles.

[Chen and Dahanayake 2004] argue that the centralized nature of medical information systems lacks flexibility and does not meet the requirements of a distributed regional or national health system. They thus propose to adopt a Service Oriented Architecture (SOA) and a Component Based Development (CBD) approach, in which designing a system becomes the selection, configuration, assembly and deployment of reusable components. They propose a three-layer architecture: the first layer defines the basic functions, the second allows customization for a specific area, and the third allows customization for a specific user. This architecture is intended to adapt and use the technologies already available in the health domain.

[Rodriguez and Preciado 2004] propose a pervasive hospital information system using multi-agents, to enable health professionals at the hospital to access previous clinical cases for supporting rapid and better decision-making. In their system, the essential components are contextual agents that achieve the tasks of information management such as gathering the links to previous patient records or the retrieval of links to information on medical digital libraries on the web.

2.2.5 Interoperability in Healthcare

Hereafter, we first introduce the interoperability definition in general. Then, we give the definition of the interoperability problem in the healthcare field and present the different issues and challenges related to that problem.

2.2.5.1 Interoperability definition

Since the word interoperability is used with different meanings in the literature, we will begin by recalling its explicit definition. According to [ISO/IEC 2382-01 2003], interoperability is:

« The capability to communicate, execute programs, or transfer data among various functional units in a manner that requires the user to have little or no knowledge of the unique characteristics of those units. »

In addition, according to [ISO/IEC TR 10000-1 1998] it is: « The ability of two or more IT systems to exchange information and to make mutual use of the information that has been exchanged. »

Hence, interoperability means foremost the ability of heterogeneous systems to exchange data, i.e. data transmitted by one system can be recognized, interpreted, used and processed as many times as necessary by other systems in an automatic way.

2.2.5.2 Interoperability in healthcare definition

According to HIMSS1:

« Interoperability means the ability of health information systems to work together within and across organizational boundaries in order to advance the effective delivery of healthcare for individuals and communities. »

This definition at least answers the question of the purpose of medical interoperability.

At present, patient data are stored in different systems and are structured in different ways, which makes the construction of a management system for medical records complex and difficult. This reality was expressed by McDonald in [McDonald 1997]. Data currently encoded in proprietary formats must evolve towards standard medical databases and integrated medical applications, such as clinical decision support systems. But extensive standardization efforts are required before medical databases become the shared product expected by researchers and health professionals.

1 Healthcare Information and Management Systems Society

The interoperability problem is thus divided into four categories:

1. Interoperability of the messages exchanged among the various applications (examples of standards in this category are the HL7 V2 and V3 message versions...).

2. Interoperability of the electronic medical record, also called Electronic Health Record (EHR) (examples of standards are HL7 CDA, CEN EHRcom, openEHR, etc.).

3. Interoperability of encoded technical and medical terms (ICD10, SNOMED...).

4. Interoperability of the identification of patients, healthcare processes and clinical recommendations.

As an example, the Artemis project [Dogac et al. 2006] was launched to address the problem of interoperability in the healthcare field. It aims to establish an interoperability framework among eHealth systems using the Web services. In its architecture the different medical systems are presented as Web services in a peer-to-peer environment. The main objective is to exchange all or part of electronic medical records in an interoperable way. In Artemis there are two types of ontologies to annotate the Web services: the Service Functionality Ontology (which provides the semantics about the service function) and the Service Message Ontology (which defines the structure and semantics of service parameters messages in WSDL).

One remarkable point of Artemis is that it does not require the use of a global ontology: the different health systems are free to define their own ontologies. However, it assumes that these ontologies are built using standards such as HL7, EHRcom and openEHR [Eichelberg et al. 2005], [Garde et al. 2007], because these standards are good ways to represent the knowledge of the medical field. Briefly, in this project, the various medical information systems use modules offered by the standards to develop their own ontologies of clinical concepts, and these local ontologies are then exported to a mediator. The only condition is that the systems provide the specification of the mapping between their local ontologies and the reference ontologies in the mediator. The mediator is then responsible for reconciling the structural and semantic differences of the exchanged data.
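The mediation principle described above can be illustrated with a minimal Python sketch. It is not the Artemis implementation: the system names, the local terms and the reference concept are hypothetical, and a real ontology mapping is far richer than a dictionary lookup. The sketch only shows that each system declares a mapping between its local terms and the reference concepts, and that the mediator translates a term from one system to another by going through the reference concept.

```python
from typing import Dict, Optional


class Mediator:
    """Toy mediator: translates local terms through a shared reference concept."""

    def __init__(self) -> None:
        # system name -> {local term -> reference concept}
        self._to_reference: Dict[str, Dict[str, str]] = {}

    def register(self, system: str, mapping: Dict[str, str]) -> None:
        """Store the local-to-reference mapping exported by one system."""
        self._to_reference[system] = dict(mapping)

    def translate(self, term: str, source: str, target: str) -> Optional[str]:
        """Return the target system's term for `term`, or None if unmapped."""
        concept = self._to_reference[source].get(term)
        if concept is None:
            return None
        for local_term, reference in self._to_reference[target].items():
            if reference == concept:
                return local_term
        return None


# Hypothetical systems and terms, for illustration only.
mediator = Mediator()
mediator.register("hospital_a", {"BP_SYS": "ref:SystolicBloodPressure"})
mediator.register("clinic_b", {"systolic_pressure": "ref:SystolicBloodPressure"})
```

With these mappings, `mediator.translate("BP_SYS", "hospital_a", "clinic_b")` yields the term used by the second system without either system knowing the other's vocabulary.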

2.2.6 Electronic Health Record (EHR)

Faced with increased health spending, medical authorities are seeking to reduce the costs of knowledge management and to increase efficiency in order to improve the quality of care. For that purpose, the computerization of patients' medical records is the ultimate solution. Nevertheless, this computerization is a real challenge because the EHR, partner of health professionals in their mission of care, is also used by many other actors (hospital staff, researchers, pharmacists, patients, etc.) for various tasks. Since the electronic health record is the cornerstone of any information system in the medical field, we have devoted a part of our research to reviewing the different initiatives for standardizing computerized medical records.

2.2.6.1 EHR definition

The Electronic Healthcare Record (EHR), also called Electronic Medical Record, has been defined by [Iakovidis 1998] as:

« Digitally stored health care information about an individual’s lifetime with the purpose of supporting continuity of care, education and research, and ensuring confidentiality at all times »

Indeed, the EHR (in some countries also called Personal Medical Record) is an electronic medical record that shall enable patients to access their health data. It aims to enlarge the role of the patient as an actor in his own health and to facilitate the communication of health data between health professionals. It is aimed at improving both the coordination and the quality of care. Moreover, it shall facilitate access to the patient's history. Thus, it shall result in a large reduction of health costs, improve knowledge management by avoiding unnecessary duplication of tests and help to assess large statistical trends.

The EHR may include information such as observations, laboratory tests, diagnostic imaging reports, treatments, therapies, administered drugs, patient identifying information, legal permissions and allergies.

2.2.6.2 EHR benefits

The EHR shall allow health professionals to share medical information about a patient. With this shared information, the EHR will promote the effectiveness of knowledge management and the quality and continuity of care by facilitating the coordinated care of patients, all this while respecting medical confidentiality and the privacy of the patient. Moreover, the availability and the completeness of the information in the EHR, coming from the different practitioners who treat the same patient, will allow better knowledge and better monitoring of the patient. Furthermore, the EHR will reduce costs by avoiding, for example, unnecessary duplication of examinations. The EHR will act, for all healthcare professionals, as a support for all information related to a given patient’s health.

2.2.6.3 Implementation of the Electronic Health Record in France, called DMP1

In France, over the last years, health authorities have tried to promote the implementation of the so-called DMP, or electronic Personal Medical Record, to help improve the quality of care and at the same time reduce its costs. A project for establishing the DMP was thus set up in August 2004 by a law on health insurance reform. Each person who has health insurance may apply for his DMP, which shall bring together all his medical history. This project reached a new stage on July 27, 2007, when the public interest group responsible for coordinating the national trials on the subject launched its first call for applications. It was looking for hosting providers able to fulfil the criteria laid down by law, particularly in terms of security and project infrastructure. Several projects have been put in place to test, at the regional level, the feasibility and security of these EHRs.

In 2009 a new plan was proposed to restart the DMP project. In May 2009, the Ministry of Health published the “Gagneux” report, with 12 proposals aimed at reforming the policy for the computerization of health systems, and created the Agency of Shared Health Information Systems, called “ASIP Santé”, to coordinate the DMP’s implementation [Gagneux 2008], [Gagneux 2009]. Three major actions were taken: the definition of a National Health Identifier (INS), the development of interoperability frameworks, and the establishment of deployment measures and support for users (patients, health professionals, and industry). The objective is to progressively deploy the DMP over the coming years, through several successive versions enriched with new services. The first version shall include data on the patient’s health history (allergies, prescribed medication, results of biology and radiology tests, reports of consultations and hospitalizations...). It will also include a set of services such as automatic reception of analysis results, vaccine reminders, and therapeutic follow-up programs. Health professionals should have access to the DMP through their own professional applications. The DMP will also be deployed in health facilities. This deployment will be carried out within a national framework for interoperability, security and confidentiality of personal data.

1 Dossier Médical Personnel

The patient will be empowered to manage the access rights of health professionals to his own data. Thus, each citizen in France should select a provider that will host his DMP and the corresponding health data. All steps needed to access a record of his DMP will be performed through a unique, secure DMP portal, which will then grant access to the selected provider. All steps will rely on a set of rules and principles strictly supervised by the law, and based on ethical laws in France. The DMP shall only be accessed and filled in by health professionals who have been authorized by the patient, except for emergency physicians, who will have special access rights to some parts of the DMP.

Every citizen should have a national health identifier, randomly generated and assigned by a central system, the INS-A, to secure the contents of data exchanges. While waiting for the establishment of the INS-A, it is proposed to use a calculated identifier, called INS-C. The INS-C generation algorithm is provided by “ASIP Santé”. Patient authentication will be based on a secret code of his choice and on a temporary password that is automatically generated at each connection and sent by SMS or e-mail, according to the patient’s preferences. The health professional can connect to the DMP Internet portal by using business software compatible with the DMP. Identification is then done with his health professional card (CPS) and the national health identifier of his patient. Access is only granted if the patient has allowed the health professional to access his DMP during the opening phase of the DMP. The patient may nevertheless give his approval to the physician during a consultation by means of a simple declaration. If the professional works in a health institution, access is provided via the information system of the institution. In addition, data will be encrypted during transport and storage. Each piece of data reported in the DMP is dated and signed, and its author is identified. A health professional will be granted access to the data of which he is the author and to the traces relating to his own actions. The reading and writing access rights to the other data will be defined according to the wishes of the patient and depending on the specialty and the nature of the documents. The host, however, will not have access to the contents of the DMP. Proposals for giving patients the possibility of masking some of their medical data were also submitted. The DMP host providers can only transmit the patient’s data to the health professionals or health facilities designated in the contract. The history of all actions performed on the information contained in the DMP will be maintained by the host providers.
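The access rules listed above can be summarized in a short Python sketch. It is only an illustration under simplifying assumptions, not the DMP specification: the role names, the `emergency_visible` and `masked` flags and the identifiers are hypothetical, masking was only a proposal at the time, and the dependence of access rights on the specialty and on the nature of the documents is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Document:
    author_id: str
    # Hypothetical flag: part of the record open to emergency physicians.
    emergency_visible: bool = False
    # Hypothetical flag: document masked by the patient (a proposal only).
    masked: bool = False


@dataclass
class Record:
    # Identifiers of the health professionals authorized by the patient.
    authorized: Set[str] = field(default_factory=set)


def can_read(record: Record, doc: Document, requester_id: str, role: str) -> bool:
    """Decide read access following the rules described in the text."""
    if role == "host":
        # The host never has access to the contents of the record.
        return False
    if doc.author_id == requester_id:
        # A health professional can always access the data he authored.
        return True
    if role == "emergency_physician":
        # Special access, limited to some parts of the record.
        return doc.emergency_visible
    if role == "health_professional":
        # Otherwise the patient's authorization is required.
        return requester_id in record.authorized and not doc.masked
    return False
```

The order of the tests reflects one possible reading of the rules: the exclusion of the host takes precedence over everything else, and authorship takes precedence over the patient's authorization list.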

A preliminary proposal for the content of the DMP was also discussed with various health bodies and organizations. Hereafter, we give an extract. The DMP should include:

1 - Data identifying the holder of the personal medical record: his/her name, first name, date of birth, identifier for the establishment and maintenance of personal medical records, and information identifying his/her physician.

2 - Elements contributing to the coordination, the quality, the continuity of care and the prevention. The DMP should especially contain:

· General medical information: this includes the medical history and the surgical events, the history of specialist consultations, the allergies and recognized intolerances and the vaccinations.

· Healthcare data: this includes the results of laboratory tests, the summary of the diagnostic acts, the summary of the therapeutic procedures, the assessment of loss of autonomy, the physiotherapy diagnosis, the conclusions of telemedicine, the hospitalization reports, the current pathologies, the ongoing treatments, the drug dispensation and the nurses' monitoring data.

· Prevention data: this includes the individual risk factors, the reports of diagnostic procedures for prevention, the reports of therapeutic acts for prevention, and the monitoring elements and acts of care set out in the health book.

· Imaging data

· A space for the owner

The interoperability framework for information systems that has been proposed by ASIP is based on a modular approach composed of three layers: content, service and transport (Figure 2-5). The ASIP agency has selected the solution proposed by the IHE consortium "Integrating the Healthcare Enterprise" (www.ihe.net), created by an American industry group.

Figure 2-5 The interoperability framework of SIS architecture

2.2.7 EHR interoperability

The electronic medical record has been the subject of active research for many years. Indeed, large hospital systems have been developed since the 1970s to lower costs and improve the quality of care. Many autonomous systems addressing the specific needs of different health actors have also been developed. This has progressively led to a patchwork of information systems from different vendors. Meanwhile, management constraints required the centralization of certain information. It therefore became necessary to connect these previously independent systems. However, medical data are today stored in proprietary formats in a multitude of legacy medical information systems available on the market, and there is no single universal standard for the digital representation of the data [Eichelberg et al. 2005]. This results in a serious interoperability problem. In fact, the interoperability of medical records is essential to achieve the exchange and sharing of medical data [McDonald 1997], [Catley and Frize 2002], [Garde et al. 2007].

The most basic way to exchange information is to exchange ASCII (text-only) files on physical media. A slightly more elaborate way is the direct exchange of data between applications through a so-called Application Programming Interface, or API. The disadvantage of these solutions is that they require specific developments for each specific case. Another way is to use e-mail. Its disadvantage is that the transmitted information has no formal structure, which makes it necessary to retype the information manually in the right places before it can be reused.

Exchanging data in a structured form allows the exchange processes to be automated, and this need has led to the development of message standards. The cohabitation of a few information systems in a single structure can be feasible, but when the number of systems grows the situation becomes critical. Indeed, a hospital information system (HIS) in a comprehensive health care facility requires about 30 to 40 different applications, involving from 870 to 1560 interfaces [Chabriais 2001]. We can thus perceive the difficulties of integrating systems that have no structured data exchange standard.
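The figures of 870 and 1560 interfaces are consistent with counting one directed point-to-point interface for every ordered pair of applications, i.e. n(n-1) interfaces for n applications. This reading is our own assumption and is not stated explicitly in the cited source. The short Python sketch below checks the arithmetic; the comparison with a single adapter per application to a shared message standard is likewise a hypothetical illustration.

```python
def point_to_point_interfaces(n_applications: int) -> int:
    """One directed interface per ordered pair of distinct applications."""
    return n_applications * (n_applications - 1)


def standard_adapters(n_applications: int) -> int:
    """Assumption: with a shared message standard, one adapter per application."""
    return n_applications


for n in (30, 40):
    print(n, point_to_point_interfaces(n), standard_adapters(n))
```

Under this assumption the number of interfaces grows quadratically with the number of applications, whereas the number of adapters to a common standard grows only linearly.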

Therefore, it is necessary to model these exchanges in order to achieve a consistent standardization. In the eHealth area, as in many others, specialists have been working to achieve standards specific to their needs. Standardization in the medical data exchange field is extensive. During our review of the literature on this issue, we explored the different EHR standardization initiatives that are being developed. Hereafter, we list the best-known standards for the Electronic Health Record:

· EHRcom: CEN EN 13606 Electronic Health Record Communication
· CDA: The Health Level 7 (HL7) Clinical Document Architecture
· GEHR / openEHR
· DICOM1 SR: DICOM Structured Reporting
· WADO: Web Access to DICOM persistent Objects
· IHE2 RID: IHE Retrieve Information for Display
· IHE XDS: IHE Cross-enterprise Clinical Document Sharing
· MML: The Medical Mark-up Language

Hereafter, we will present only the most promising standards among the previous ones. The first one, EHRcom, has been developed under the aegis of CEN. But CEN has also developed a Standard Architecture for Healthcare Information Systems: HISA. These two standards have been worked out by the so-called technical committee for medical informatics (TC251). CEN TC251 was established in March 1990 with the objective of organizing and coordinating the development of standards for architecture development in medical informatics [Moor et al. 1993]. We first present HISA.

2.2.7.1 Standard Architecture for Healthcare Information Systems: HISA

To facilitate the communication of all or part of the electronic health record in heterogeneous technical environments, CEN TC251 has proposed a reference architecture as a European pre-standard: HISA (Healthcare Information System Architecture). This architecture allows the recognition and understanding of information in electronic medical records. It is constructed in such a way that, during the exchange of clinical data, the data contained in the electronic health record maintain their integrity, the context is preserved, and the needs in terms of legal, ethical and safety issues are respected. HISA is an architecture of reusable components that are organized to facilitate the evolution of information systems (Figure 2-6).

1 Digital Imaging and Communications in Medicine
2 Integrating the Healthcare Enterprise

HISA also aims to identify a set of common services used by information systems in the field of health. Through this reference architecture, it provides an open, modular and scalable integration of the existing health information systems that is appropriate to all health professionals, independently of their levels of responsibility, for the whole life cycle of an information system in medicine. We distinguish three cooperating levels in HISA:

· Healthcare Application Layer, to model the data flow and the functionality required by the implementation process,

· Healthcare Middleware Layer, for modeling shared services to support the application layer,

· Healthcare Bitways Layer, to represent the physical infrastructure, integration and interoperability of technology environments.

HISA provides two classes of services: the Healthcare Common Services (HCS), specific to the health domain, and the Generic Common Services, independent of the application domain. To this end, the architecture is divided into six sub-data models: subjects of care, healthcare characteristics, activities, resources, authorization and concepts. HISA may be seen as an interface between the different medical work environments and the specialized applications of the health domain. It is compatible with the OSI model and with communication architectures such as COM or CORBA.

Figure 2-6 The HISA Standard Architecture [CEN/TC251 1999]

2.2.7.2 EHRcom standard

The European pre-standard ENV 13606: 2000 [CEN/TC251 2000] is a standard based on the exchange of messages for the setting up of electronic health records. This standard defines an information model of the health record, a list of interpretable terms to structure the contents of the health record and a method to specify the distribution rules of extracts of the health record. It does not propose models to specify a complete system of electronic health records management. Some parts of the standard have already been implemented in projects in northern Europe (Sweden, Norway, Denmark and the Netherlands). These experiments showed a number of weaknesses that limit development on a larger scale. The information model, in particular, seems complex to implement, with many options that require designers to work at a very high level of abstraction. [Whittaker 2002] also mentions that “the method used to structure the clinical data is not necessarily appropriate because it focuses on the direct care of patients regardless of all secondary care”.

In 2001, CEN/TC251 decided to advance the status of this pre-standard to a complete European standard, taking into account the ongoing experiences and adopting the architectural methodology of openEHR. The standard was then called EHRcom and divided into five parts: the reference model, the interchange specifications, the architectures and the lists of reference terms, the security devices and the exchange models. CEN/TC251 also plans to submit EHRcom to ISO/TC 215 as the basis for an international standard for the representation of electronic health records.

However, only the reference model is stable; the other four parts are currently in draft form and require much more discussion. This reference model [CEN/TC251 2004] is composed of four sub-models (extracts of the record, demographics, access control and message) that describe the most appropriate aspects for the communication of extracts of health records between different information systems.

The extract sub-model defines the root class of the reference model and the data structures for the record content. The demographics sub-model provides the minimum tools to define the persons, the software agents, the devices and the organizations that may appear in an extract from the health record. The access control sub-model, under development, defines a representation for the policies governing access to the health record. Finally, the message sub-model, also under development, defines the attributes required to communicate the extract of the health record in response to a request. Figure 2-7 [Kalra 2004] illustrates the extract sub-model of the reference model of the EHRcom standard.

A harmonization effort between EHRcom and HL7 v3 has recently been carried out. Thus, the reference model of EHRcom is now compatible with the reference model of HL7 v3. The data types used in the EHRcom reference model have the same representation as the data types defined in HL7. This harmonization should facilitate the eventual exchange of data between applications implementing one of these two standards. Architectures that respect the EHRcom standard should be compatible with the message information models and the Common Message Element Types (CMET) developed by HL7.

The other interesting point in the EHRcom standard is the methodology based on archetypes. An archetype is divided mainly into three parts: descriptive data, constraint rules and ontological definitions.

· The descriptive data contain a unique identifier for the archetype, a code suitable for automatic interpretation and various metadata such as the author, the version, the objective, etc. These data also indicate whether an archetype is a specialization of another archetype.

· The constraint rules are the core of the archetype. They define restrictions on the structure in terms of cardinality between the components and are used to verify the validity of records in accordance with the archetype.

· The ontology part defines the vocabulary that can be used for specific instances of an archetype. It may contain language translations, code meanings and links between elements of locally used codes.
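To make the three parts of an archetype more concrete, the following Python sketch models them as a single data structure. It is a simplified illustration and not the EHRcom or openEHR archetype formalism: the field names, the identifier `example.blood_pressure.v1`, the local codes and the cardinalities are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Archetype:
    # Descriptive data
    archetype_id: str
    author: str
    version: str
    purpose: str
    parent_id: Optional[str] = None  # set when this archetype specializes another
    # Constraint rules: component name -> (min occurrences, max or None for unbounded)
    cardinality: Dict[str, Tuple[int, Optional[int]]] = field(default_factory=dict)
    # Ontology: local code -> {language -> term}
    terms: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def validate(self, record: Dict[str, list]) -> List[str]:
        """Return the list of cardinality violations found in a record."""
        errors = []
        for name, (low, high) in self.cardinality.items():
            count = len(record.get(name, []))
            if count < low or (high is not None and count > high):
                upper = "*" if high is None else str(high)
                errors.append(f"{name}: {count} occurrence(s), expected {low}..{upper}")
        for name in record:
            if name not in self.cardinality:
                errors.append(f"{name}: not allowed by the archetype")
        return errors


blood_pressure = Archetype(
    archetype_id="example.blood_pressure.v1",
    author="(hypothetical)",
    version="1",
    purpose="Illustration only",
    cardinality={"systolic": (1, 1), "diastolic": (1, 1), "comment": (0, 1)},
    terms={"at0001": {"en": "Systolic pressure", "fr": "Pression systolique"}},
)
```

A record such as `{"systolic": [120], "diastolic": [80]}` satisfies these constraints, whereas a record without a diastolic value is reported as invalid.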

It is difficult to predict the development of this standard. It is now accepted by the healthcare market, in spite of the fact that only the reference model is stable and that many of its concepts are still under development. However, the harmonization with HL7 and the improvement of the concepts based on the experiments conducted in recent years should encourage the deployment of electronic health record exchange systems that respect this standard.

Figure 2-7 Extract from the standard reference model EHRcom around the root class “EHR_EXTRACT” [Kalra 2004]

2.2.7.3 HL7 standard

Health Level 7 (HL7), a not-for-profit organization, was founded in 1987 with the goal of developing a standard for the electronic exchange of clinical, financial and administrative information between independent health information systems. In June 1994, HL7 was designated as a standards development organization (SDO) accredited by ANSI (American National Standards Institute). Originally American, the HL7 format is gaining ground and tends to become an international standard for this type of information. There are several versions of the HL7 standard.

Version 1.0 was a prototype and was not widely implemented. HL7 really came into existence and succeeded with Versions 2.x, which are still the most widely deployed EHR standard in the world. These versions define lists of messages without an underlying model. For instance, HL7 v2.4 offers messages for the administrative management of patients, prescriptions, summaries of observations, recording of the results of physiological signals, information on drug interactions, etc. With each new version the number of messages increases, which makes the lack of modeling a challenging problem. HL7 v2 has thus reached its limits in terms of development, and since 1997 the consortium has been developing HL7 version 3.
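To give an idea of what a model-less, position-based message looks like, the Python sketch below parses a small message written in the HL7 v2 pipe-delimited syntax (segments separated by carriage returns, fields by `|`, components by `^`). The sample message, its application and facility names and the patient data are invented for illustration, and the sketch ignores repetitions, escape sequences and repeated segments.

```python
from typing import Dict, List

# Invented sample: an MSH (message header) and a PID (patient identification) segment.
SAMPLE = "\r".join([
    r"MSH|^~\&|LAB|HOSP_A|EHR|HOSP_B|20100101120000||ADT^A01|MSG00001|P|2.4",
    "PID|1||12345^^^HOSP_A^MR||DOE^JOHN||19700101|M",
])


def parse_segments(message: str) -> Dict[str, List[str]]:
    """Split a message into segments keyed by their three-letter identifier."""
    segments: Dict[str, List[str]] = {}
    for line in message.split("\r"):
        if line:
            fields = line.split("|")
            segments[fields[0]] = fields  # a repeated segment would overwrite the previous one
    return segments


segments = parse_segments(SAMPLE)
# In MSH, the field separator itself counts as MSH-1, so MSH-n sits at index n-1.
message_type = segments["MSH"][8]   # MSH-9: message type
version = segments["MSH"][11]       # MSH-12: version identifier
# In other segments, field n sits at index n.
family_name = segments["PID"][5].split("^")[0]  # PID-5: patient name, first component
```

The meaning of each value depends only on its position in the segment, which illustrates why the growing number of message definitions became difficult to manage without an underlying information model.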

Version 3 is based on a formal model that includes the RIM (Reference Information Model) and an architecture of clinical documents called CDA (Clinical Document Architecture).

The RIM is organized into Subjects, Classes, Relationships, Attributes and Data Types. To define the specifications of a message, a Message Information Model (MIM) is extracted. The HL7 Development Committee has also defined the concept of hierarchical message definitions (HMD), which are one of the normative elements of the standard. An HMD defines one or more message types. A message type constitutes a "recipe" for creating a message instance: it specifies the data that can appear in the message and the order in which they should appear. Another critical element is the Common Message Element Type (CMET). A CMET is an element defined by one committee that may be reused in all other messages (for example, the appropriate way to define a new patient). CMETs save other committees time by ensuring that these elements are defined by the appropriate committee, and thus increase the consistency and the relevance of the standard (Figure 2-8) [Chabriais 2001].

The other important element of HL7 v3 is the Clinical Document Architecture (CDA). The CDA is to HL7 what the EHCR (Electronic Health Care Record) is to CEN/TC251 (see next §). HL7 defines the CDA as a tagged document standard that specifies the structure and the semantics of clinical documents for the purpose of exchange. The CDA is a markup standard for medical documents whose goal is to standardize the exchange of medical documents between humans. A document must be considered here as a set of information that is legally and indivisibly authenticated and that is understood as a whole and in its context. Information such as insurance records, date of birth, etc. is derived from original documents by extracting, copying and combining. For now, only the part concerning the representation of documents is validated. An excerpt of CDA is shown in Figure 2-9 [Chabriais 2001].

The number 7 indicates that the HL7 standard operates at Layer 7 (i.e. the application layer) of the OSI model. Thus, HL7 itself does not take into account the security of the data exchange or the transport of messages. These issues are handled by lower-level layers, such as SSL for security or TCP for data transport.

Figure 2-8 Defining the structure of an HL7 v3 message

Technically, HL7 v3 is a multi-level XML architecture that respects the ISO/IEC 10744 architectural forms and whose levels depend on the granularity of the tagging, and not on the information involved. Medical information is organized into three levels of different granularity, as needed:

· Header: includes information such as name, ID, etc. without constraints on the text content.

· Context: organizes the textual content or the documents according to criteria of reasoning or administrative organization.

· Content: allows building the necessary medical information according to the Reference Information Model (RIM).

Figure 2-9 Preview of Clinical Document Architecture in the HL7 XML format


The functions of organization, registration, management and aggregation of documents are required at all levels. Thus, each institution or actor must develop the “schemas/DTD” corresponding to its own use, in accordance with the architectural forms of the relevant levels.

2.2.7.4 Integrating the Healthcare Enterprise (IHE)

The “Integrating the Healthcare Enterprise (IHE)” initiative was founded in 1998 by the Radiological Society of North America (RSNA) and the Healthcare Information and Management Systems Society (HIMSS). It is a global initiative whose objective is to encourage the integration of computing resources and to improve the interoperability of IT systems in the health field.

IHE does not develop standards but selects and recommends appropriate standards for specific use cases and develops application profiles for these standards that allow integration of specific systems. The IHE specifications define many building blocks called “integration profiles”. Each integration profile addresses one problem and specifies a solution based on existing standards such as DICOM and HL7. For each profile, actors and transactions are defined.

An example of an integration profile is IHE Cross-Enterprise Document Sharing (IHE XDS), which aims to provide a document archive for the EHR. The basic idea of this profile is to store documents in an ebXML registry/repository to facilitate their sharing. The profile thus deals with the documents of the EHR, but it specifies only their metadata, which help locate the documents. In this profile, a group of enterprises that agree to work together and to share their clinical documents is called a Clinical Affinity Domain. These enterprises agree on a set of common policies (e.g. how patients are identified, access conditions, etc.).

The characteristics of this profile are:

· Distributed: each organization “publishes” clinical information for others.

· Cross-Enterprise: a Registry provides an index for published information to authorized organizations.

· Document Centric: the published clinical data is organized into “clinical documents” using agreed standard document types.

· Document Content Neutral: the document content is processed only by source and consumer IT systems.

2.2.7.5 Digital Imaging and Communications in Medicine (DICOM)

Digital Imaging and Communications in Medicine (DICOM) is the de facto standard for the communication of medical images. This standard defines the data structures and services for the exchange of medical images and related information. It has been available in its current form since 1993, and was thus defined before the development of web technologies and XML. Therefore DICOM, unlike other standards, uses a complex, binary-encoded, DICOM-specific application-level network protocol with its own tags. Currently, there are two DICOM-based EHR standards:

· Web Access to DICOM Persistent Objects (WADO): An extension of the DICOM standard that enables the retrieval of DICOM objects over HTTP or HTTPS. It thus brings together the DICOM imaging world and Web infrastructures.

· DICOM Structured Reporting (DICOM SR): An extension of the DICOM standard for the exchange of structured data such as medical reports, measurements, results, procedure logs, etc. It can use the existing DICOM infrastructure, i.e. transmission over networks, query and retrieval, encryption and digital signatures.
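To make the WADO retrieval mechanism more concrete, the following minimal Python sketch builds a URI-based WADO request. The query parameter names (requestType, studyUID, seriesUID, objectUID, contentType) are those of the URI-based WADO service of the DICOM standard; the server address and the UID values are purely hypothetical placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_wado_uri(base_url, study_uid, series_uid, object_uid,
                   content_type="application/dicom"):
    """Return the URL that asks a WADO server for one DICOM object."""
    params = {
        "requestType": "WADO",        # fixed value identifying a WADO request
        "studyUID": study_uid,        # Study Instance UID
        "seriesUID": series_uid,      # Series Instance UID
        "objectUID": object_uid,      # SOP Instance UID of the object
        "contentType": content_type,  # desired MIME type of the response
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server address and UIDs, for illustration only.
url = build_wado_uri("https://pacs.example.org/wado",
                     "1.2.3", "1.2.3.4", "1.2.3.4.5")
print(url)
```

Since the request is an ordinary HTTP(S) GET, any Web client can retrieve the object, which is what allows the DICOM world to be combined with Web infrastructures.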

2.2.7.6 Synthesis of the EHR standards

The assessment of these medical record standards shows no clear winner. The content formats are similar in concept and in capabilities, and are based on a two-level modelling approach with a reference model and a set of constraint rules (archetypes, templates) for mapping the clinical data into the reference model. The most advanced standard today is DICOM SR, but it is rather centralized and only accepted in the field of medical imaging. The other standards vary in the progress made in the standardization process. In principle, all seem suitable for implementing the electronic medical record. However, the most likely candidates to offer a comprehensive solution are either the complete EHRcom architecture, whose security features and exchange patterns are still under development, or IHE XDS with CDA as the structured content format and WADO for access to the images. The latter solution has been selected by several electronic medical record projects in the United States, Canada and France.

2.2.8 Conclusion

The last decade has seen the emergence and rapid development of ICT and the explosion of the eHealth and medical informatics fields, with work aimed at providing a better quality of healthcare, increasing the productivity of health services and reducing their costs. Indeed, health professionals recognize the need for more informed decision-making and more cost-effective health information systems to improve healthcare delivery and reduce costs. Medical informatics has therefore established a presence within many academic and industrial research works, and new challenges have emerged.

As we have presented in the state of the art, many initiatives have been taken to promote eHealth solutions around the world. However, interoperability is still one of the major challenges in the eHealth domain. Indeed, a complete interoperability solution between health information systems is still far from being operational, due to the diversity of the actors and of their interactions, the multiplicity of disciplines in medicine and the vast variety of medical practices, even for the same specialty within a given country. Medicine is not yet an exact science; it remains largely an art, where the “truth” cannot be known. In addition, the increasing mobility of patients throughout the world and language differences are other factors that make interoperability for digital medical data exchange more challenging.

More recently, the global Information Society has started to experience the widespread use of pervasive computing and the development of smart Personal Health Systems (PHS) aimed at improving patients’ health support anywhere and anytime by means of ICT. For example, the personal ECG monitor (PEM), an intelligent device for ubiquitous pHealth (personalized health) decision support, allows anyone to self-record an ECG anywhere, at any time, on demand. The PEM was developed within the EPI-MEDICS (Enhanced Personal, Intelligent and Mobile system for Early Detection and Interpretation of Cardiological Syndromes) European project, which was carried out within the framework of the Information Society Technologies Program and was mainly centered on the design and development of an original ECG-based PHS of professional quality [Rubel et al. 2005], [Fayn and Rubel 2010]. The PEM embeds advanced signal processing and decision-making methods, as well as a personalized medical record of the user of the device. When ECG abnormalities are detected and an alarm is generated, this record is transmitted via wireless technology, in XML format and together with the last recorded ECGs, to the appropriate call center or referral physician. Such a medical record, which includes bio-signals and personal health data such as risk factors or biology data, should then be automatically and immediately decoded and integrated into the information system of the recipient to enhance its reuse.

Hence, the need to manage interoperability and data mediation within health information systems is nowadays a key issue, which this thesis addresses. eHealth is thus one of the main application domains that should especially benefit from the original research work that we performed in this thesis and that we detail in the next two chapters.


Chapter 3 - XML AND RELATIONAL DATABASES MEDIATION AUTOMATION

Enterprises increasingly collaborate via web services by means of XML-based data exchange technologies, while their information systems usually store data in relational databases, which remain the most important type of data storage. Indeed, XML is considered a key technology for data exchange, enabling flexible processing and simple information exchange between heterogeneous applications. Meanwhile, relational databases are a well-proven, dominant technology for efficiently storing, querying and retrieving huge volumes of data. Consequently, a crucial issue in data management is to bring together the flexibility of the XML model for communication and the performance of the relational model for data storage and retrieval. This issue raises two main challenges: first, to be able to store new data correctly into any existing relational database from any XML document and, second, to be able to retrieve the required data from any relational database and to transform them into any XML format. This process should be automatic and transparent to the user, and should dynamically support any data structure.

In this chapter, we propose a novel mediation framework for automatic bi-directional data exchange between any XML source or sink and any Relational Database. This chapter presents the core contribution of our dissertation. It is organised into three main parts:

In the first part (section 3.1), we describe the architecture of the proposed generic mediation framework to automate data exchange between any XML file and any relational database.


In the second part (section 3.2), we describe a methodology for the automatic transformation of the relational model into an XML schema. This transformation methodology has been adopted and used in our proposed approach to automatically generate XSD schemas from any relational database model.

In the third part (section 3.3), we describe the automatic SQL generation methodology that has been used in our mediation framework.

3.1 A Generic Mediation Architecture

3.1.1 Introduction

Hereafter, we describe the proposed generic mediation framework which consists in the design of a generic XML-based data mediator for data exchange between any XML source or sink and any relational database. The proposed mediator shall support an automatic, coherent and transparent updating of any relational database from any XML document [Jumaa et al. 2010b]. In the next section, we first illustrate the three general scenarios for data exchange between both models. Next, we present the architecture of the mediator, its components and its functions. Then, we discuss our proposed architecture.

3.1.2 General data exchange scenarios

For exchanging data between an XML model and a relational model, we can identify three main generic scenarios, which are depicted in Figure 3-1.

In the first scenario (Figure 3-1 (a)), the relational data in the legacy systems should be exported or published into an XML format to facilitate the data exchange. In this scenario, the user (or system) receives or retrieves the needed or required data and is able to modify the received data in the XML format. Thus, the changes over the retrieved data should be transported by the mediator to coherently update the original data in the relational database in the legacy system.

In the second scenario (Figure 3-1 (b)), the user of a system issues a query to retrieve some data stored in one or more relational databases. This query is defined in terms of an XML infoset with some conditions. Thus, the user defines the data he needs as well as the XML format he desires. Then, the mediator should generate a coherent and correct series of SQL queries to retrieve the required data and transform it into the required XML format defined by the user. Furthermore, if the user updates or makes changes to the retrieved data, as in the previous scenario, the mediator should support the correct update of the relational database from the modified data.

In the third scenario (Figure 3-1 (c)), the data produced by different source systems and/or devices are exported in an XML format to be exchanged over the network. Thus, the relational database in the legacy systems should be able to import the XML data whatever their structure or representation. Therefore, the mediator should provide means to automatically store any new incoming XML data into the relational database and to update the database from these data, while the data in the relational database continue to be stored and queried efficiently. This helps the legacy system to survive and to stay up-to-date. Note that the update operation of the first scenario is a special case of this import operation, in which both the insertion of new data and the update of existing data must be considered.

Figure 3-1 The exchange scenarios between XML files and relational databases


3.1.3 Mediation framework design

To handle the problem of data exchange as described in the previous three generic scenarios we have proposed an XML based mediation framework. The goal of this framework is to give a generic and automatic solution for the data exchange between any XML document and any existing relational database.

Figure 3-2 shows the overall architecture of the proposed mediation framework. This framework is based on the development of generic components, which will ease the setup of specific interfaces adapted to any XML source and any target relational database. The mediator components shall be independent of any application domain, and thus shall need to be customized only once for each pair of source and target data storage formats. Thus, the role of our XML-based mediator is to automatically store the data embedded in XML documents into a target relational database as well as to retrieve the results of any query over the database and to embed them into XML documents (or messages) structured according to a requested interchange format.

Figure 3-2 Overall architecture of the proposed data interchange XML-based mediator between XML documents and relational database management systems

The incoming data, of course, should be compliant with a global data model that is known in advance, which is usually the case between cooperative organizations. But the proposed solution shall be also open to any implementation of the global model, even if some data are missing or if the data structure may be dynamically changed. On the other hand, the items in the incoming data must correspond to attributes of the target database, even if the database model can store additional data.


To design this generic XML-based mediation framework, we propose a solution that consists in interconnecting the meta-models of the source and target data sources using the XML language. The meta-models will be specified using the XML Schema Definition (XSD) language. Indeed, to transfer data between XML documents and a relational database, it is important to match the XML document schema and the database schema. Since the structure of the XML documents may not exactly match the structure of the existing database, we will generate an XML schema of the relational database before carrying out the mapping process. Specifically, in our framework we will first extract the physical model of the relational database by means of a standard reverse engineering technique. Then, the relational schema will be automatically converted into an equivalent XML schema using an “XSD converter from R schema” component. Next, an “XSD schemata matcher” will establish the correspondence between the XML source schema and the XML schema representing the relational schema of the target database. Finally, the “XSD schemata matcher” will produce the interconnection between the source and the target XML schemas for each pair of exchanged data sources. This interconnection will consist in building a set of XSLT mapping instances. Furthermore, our framework will include a “Rule base” for supporting coherent and secure database updates.

In the next section we detail the components of the proposed mediator architecture.

3.1.4 Mediator’s components

Our proposed architecture consists of four main components, as depicted in Figure 3-2. First, the “XSD Generator from XML” generates the meta-model of each XML source as an XML schema described in the XSD language. Second, the “XSD Converter from R Schema” generates the meta-model of a relational database as an XML schema, also described in XSD. Third, the “Schemata Manager” manages the meta-models of the XML sources and of the relational databases and produces, in the XSLT language, the mapping for each pair of source and target data. Each XSLT instance transforms an XML infoset compliant with a source schema into an XML infoset compliant with the target schema. Fourth, the “SQL generator” generates the correct SQL script to store the XML data into the relational database or to retrieve the needed infoset from the relational database. We detail each of these components hereafter.
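As an illustration of what one mapping instance does, the following Python sketch transforms a source infoset into an infoset shaped like the target database. It is only a stand-in under stated assumptions: the framework expresses mappings in XSLT, whereas this sketch replaces the XSLT stylesheet by a plain dictionary, and all element, table and column names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: source element path -> (target table, target column).
# In the framework this role is played by an XSLT mapping instance.
MAPPING = {
    "subject/familyName": ("Patient", "LastName"),
    "subject/givenName": ("Patient", "FirstName"),
    "recording/date": ("ECG", "AcquisitionDate"),
}


def transform(source_root, mapping):
    """Build a target infoset containing one element per target table."""
    target_root = ET.Element("Database")
    tables = {}
    for source_path, (table, column) in mapping.items():
        node = source_root.find(source_path)
        if node is None:
            continue  # data missing from the source document is tolerated
        if table not in tables:
            tables[table] = ET.SubElement(target_root, table)
        ET.SubElement(tables[table], column).text = node.text
    return target_root


source = ET.fromstring(
    "<record><subject><familyName>Durand</familyName>"
    "<givenName>Anne</givenName></subject></record>"
)
target = transform(source, MAPPING)
print(ET.tostring(target, encoding="unicode"))
```

In this example the source contains no recording, so no ECG element is produced; only the Patient element is created in the target infoset.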


3.1.4.1 XSD Generator from XML

The role of the “XSD Generator from XML” component is to automatically create an XML schema in the XSD language for each XML source, when such a schema is not already provided by the source. In the generation process, it uses the instance of the XML document to be exchanged with the target relational database.
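The idea of deriving a schema from a document instance can be sketched in Python as follows. This is a deliberately naive illustration and not the component itself: it assumes that every leaf element is a string, ignores attributes, and describes a repeated element from its first occurrence only. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def infer_element(instance_node):
    """Build an xs:element declaration mirroring one instance element."""
    decl = ET.Element(f"{{{XS}}}element", {"name": instance_node.tag})
    children = list(instance_node)
    if not children:
        decl.set("type", "xs:string")  # simplification: leaves are strings
        return decl
    complex_type = ET.SubElement(decl, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    seen = {}
    for child in children:
        if child.tag in seen:
            # The element occurs several times: mark it as repeatable.
            seen[child.tag].set("maxOccurs", "unbounded")
            continue
        child_decl = infer_element(child)
        seen[child.tag] = child_decl
        sequence.append(child_decl)
    return decl


def infer_schema(xml_text):
    """Return an xs:schema element inferred from one XML document."""
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    schema.append(infer_element(ET.fromstring(xml_text)))
    return schema


schema = infer_schema("<patient><name>Durand</name><ecg/><ecg/></patient>")
print(ET.tostring(schema, encoding="unicode"))
```

A single instance can only reveal what it contains, so a schema inferred in this way may be narrower than the real format of the source.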

3.1.4.2 XSD Converter from R Schema

The role of the “XSD Converter from R Schema” component is to generate an XML schema from the relational schema of the database. This component is especially important because it maps the database schema into an equivalent XML representation, which must be properly created. We therefore designed a new methodology for automating the transformation of the relational model into an XML schema. It is described in detail in section 3.2.

Briefly, our methodology to generate the target XML schema from a relational database consists of two main steps (see section 3.2). In the first step, a stratification algorithm is applied to classify the different relations of the relational schema. The latter may be obtained by applying a reverse engineering technique on the relational database. The result of the algorithm is a hierarchical representation of the relational schema into different relation levels according to the existing functional dependencies and to the foreign keys constraints. In the second step, a transformation algorithm is applied to generate the target XML schema according to the classification performed in the first step and based on the use of a set of generic XML schema fragments templates. These templates aim to map each component of the relational schema into its corresponding XML schema fragment. Then a highly nested target XML schema is incrementally built while accurately reproducing the database structure as well as preserving the semantics defined by the integrity constraints.

The resulting XML schema offers several major benefits: (1) it can be used to publish an XML view of relational data; (2) it allows the XML document to be checked for anomalies or inconsistencies before the relational database is updated from it; (3) it takes into consideration the hierarchical view of the database tables, which is very useful for generating the SQL scripts that shall be executed on the database.

3.1.4.3 Schemata Manager

The schema mapping between the exchanged source and target data is a prerequisite for a successful data exchange. This mapping is achieved by the main component of the mediator, which we call the “Schemata Manager”. This component includes four sub-components:

● The first sub-component is the “Sources Schemata” that includes the XML sources meta-model, which is specified using the XSD language to describe the layout of the data to be exchanged with each source.

● The second one is the “Target Schemata” that includes the relational database meta-model, which is also specified using XSD to describe the physical model of the relational database.

● The third sub-component is the “XSD schemata matcher”, which is specified using XSLT and describes the interconnections between the fields of the XML document meta-model and the database columns of the database meta-model.

● The last sub-component is a “Rule base”, which contains a set of pre-processing rules that will be used to verify the possible pre-existence of the data to be inserted in each table of the database, and to define the tables’ keys to be reused or created when generating the SQL scripts of the database updates with new data. The rules are mainly associated with the primary keys of the tables. As an example, when the value of a primary key is not known, a set of attributes can be defined to verify the existence of a row in a table. For instance, in the medical domain, the existence of a patient in the Patient table can be verified using the attributes: Last name, First name, Sex and Date of birth.
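The role of such a rule can be sketched in Python with an in-memory SQLite database. This is a minimal sketch under stated assumptions: the rule base is reduced to a dictionary, and the table, column and function names are hypothetical. Before inserting a patient whose primary key is unknown, the rule checks whether a row with the same identifying attributes already exists, in which case its key is reused.

```python
import sqlite3

# Hypothetical rule base: table -> attributes that identify a row
# when the value of its primary key is not known.
RULE_BASE = {"Patient": ("LastName", "FirstName", "Sex", "BirthDate")}


def find_or_create(conn, table, key_column, values):
    """Return (primary key, created?) for the row described by `values`."""
    attrs = RULE_BASE[table]
    where = " AND ".join(f"{attr} = ?" for attr in attrs)
    row = conn.execute(
        f"SELECT {key_column} FROM {table} WHERE {where}",
        [values[attr] for attr in attrs],
    ).fetchone()
    if row is not None:
        return row[0], False  # the row pre-exists: reuse its key
    columns = ", ".join(values)
    marks = ", ".join("?" for _ in values)
    cursor = conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({marks})",
        list(values.values()),
    )
    return cursor.lastrowid, True  # a new key has been created


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Patient (PatientId INTEGER PRIMARY KEY, LastName TEXT, "
    "FirstName TEXT, Sex TEXT, BirthDate TEXT)"
)
patient = {"LastName": "Durand", "FirstName": "Anne",
           "Sex": "F", "BirthDate": "1950-01-31"}
first = find_or_create(conn, "Patient", "PatientId", patient)
second = find_or_create(conn, "Patient", "PatientId", patient)
print(first, second)
```

Table and column names are interpolated into the SQL text because they come from the trusted meta-model, whereas the data values are always passed as parameters.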

3.1.4.4 SQL Generator

The “SQL Generator” has two roles:

● The first is to generate the relevant SQL queries for each table’s row to be updated in the database, using the XML-infoset as input data. The SQL queries are created according to the database meta-model (i.e. the XSD schema converted from the relational database schema), and use the rules specified in the rule base.

● The second role of the SQL generator is to act as an XML publisher of the result of each database query and to generate an XML-infoset that is compliant with the XML schema of the database meta-model.
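The first role can be illustrated by the following Python sketch, which walks through an XML infoset that mirrors the hierarchical view of the database and generates one parameterized INSERT statement per row, parents before children. It is an illustration under strong assumptions and not the generator itself: leaf elements are taken to be columns, nested elements are taken to be rows of child tables, the foreign key metadata is a hypothetical dictionary, and the pre-existence checks of the rule base are omitted.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata: child table -> (foreign key column, parent table).
FOREIGN_KEYS = {"Visit": ("PatientId", "Patient")}


def insert_rows(conn, node, parent_key=None):
    """Insert the row described by `node`, then its nested child rows.

    Returns the list of generated SQL statements, in execution order.
    """
    # Leaf elements are columns of the current table.
    columns = {child.tag: child.text for child in node if len(child) == 0}
    if node.tag in FOREIGN_KEYS and parent_key is not None:
        columns[FOREIGN_KEYS[node.tag][0]] = parent_key
    names = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {node.tag} ({names}) VALUES ({marks})"
    key = conn.execute(sql, list(columns.values())).lastrowid
    statements = [sql]
    for child in node:
        if len(child) > 0:  # nested element with children: child table row
            statements += insert_rows(conn, child, key)
    return statements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patient (PatientId INTEGER PRIMARY KEY, LastName TEXT)")
conn.execute(
    "CREATE TABLE Visit (VisitId INTEGER PRIMARY KEY, VisitDate TEXT, "
    "PatientId INTEGER REFERENCES Patient(PatientId))"
)
document = ET.fromstring(
    "<Patient><LastName>Durand</LastName>"
    "<Visit><VisitDate>2010-12-16</VisitDate></Visit></Patient>"
)
statements = insert_rows(conn, document)
print(statements)
```

Processing the parent row first makes its key available when the child row is inserted, which is why a hierarchical view of the tables is convenient for script generation.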

3.1.5 Mediator functions

Here we present the general functional architecture of our proposed mediation solution between any XML document and any relational database. As depicted in Figure 3-3, we identify two main tasks for our mediator: the XML to DB function, which allows inserting or updating the database from the XML file, and the DB to XML function, which allows the extraction of data from the database into a required XML format.

Figure 3-3 Global functional architecture of the mediation solution between XML documents and relational databases

3.1.5.1 XML to DB function

The goal of this function is to allow the update of the target database from any XML data, whatever its structure, and to automatically and coherently generate the corresponding SQL INSERT and/or UPDATE statements. Figure 3-4 describes the data flow and management inside the mediator that realize this function.

Figure 3-4 XML to DB function

3.1.5.2 DB to XML function

The goal of this function is to allow the retrieval of data from the target database according to any required XML format. Therefore, the mediator shall ensure the automatic generation of the corresponding SQL SELECT statements. Figure 3-5 describes the data flow and management inside the mediator that realize this function.


Figure 3-5 DB to XML function
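The DB to XML direction can be sketched in Python as follows: a SELECT statement is generated for the requested columns and each result row is published as an XML element. This minimal sketch assumes a single table and hypothetical table, column and function names; it leaves out the transformation of the result into the XML format requested by the user.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, table, columns, root_tag="Result"):
    """Run a generated SELECT and publish each row as an XML element."""
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    root = ET.Element(root_tag)
    for row in conn.execute(sql):
        row_node = ET.SubElement(root, table)
        for name, value in zip(columns, row):
            if value is not None:  # NULL columns are simply omitted
                ET.SubElement(row_node, name).text = str(value)
    return root


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Patient (PatientId INTEGER PRIMARY KEY, "
    "LastName TEXT, FirstName TEXT)"
)
conn.executemany(
    "INSERT INTO Patient (LastName, FirstName) VALUES (?, ?)",
    [("Durand", "Anne"), ("Martin", None)],
)
result = rows_to_xml(conn, "Patient", ["LastName", "FirstName"])
print(ET.tostring(result, encoding="unicode"))
```

Omitting NULL values is only one possible convention; an XML schema could also declare such elements as optional or nillable.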

3.1.6 Discussion

As already explained, several attempts have been made to bring together the flexibility of the XML model for data exchange and the performance of the relational model for data storage and retrieval. Many solutions for data integration have thus been proposed over the last decade. Some solutions have addressed the problem of publishing relational databases as XML documents, while others have focused on storing and querying XML data in specifically designed relational databases. The latter solutions, however, are not always cost-effective for operational legacy information systems, and relational models designed on the basis of an XML tree structure are abstract representations which do not provide optimal data retrieval by business processes. Thus, the problem of updating existing relational databases from XML documents remains a challenge which has not received much attention. This process should be automatic and transparent to the user, and should dynamically support any incoming, open data structure.

Our proposed framework was motivated by two main factors. The first is to provide a generic solution that transforms both source and target schemas into XML. The composition of the schemata mapping thus becomes easier because there is no longer any heterogeneity between models. Semantic heterogeneity, if any, can be classically solved by means of an ontology designed by the experts of the application domain. Secondly, the choice of XML as the common representation language in the mediation is motivated by the richness of the existing tools and standards built around the XML family. In particular, the design and composition of the mapping may be carried out using commercial tools which make it possible, through graphical interfaces, to specify the relationships between two XML schemas and to automatically produce the mappings in XSLT. Furthermore, when the source and target data sources are in complete agreement, the mapping may also be generated automatically without the need to apply any transformations.

The W3C recommends two languages to define XML schemas: the Document Type Definition (DTD) language and the XML Schema Definition (XSD) language. We chose the XSD language to represent the meta-models of the XML documents and of the relational databases in the “schemata manager” component because it is a much more powerful and sophisticated schema definition language, which allows a finer level of control than DTD. In addition, unlike DTDs, XSD schemas are themselves XML documents and can thus be processed with the same XML tools, which helps XML to reach its full potential.

Although XSD schemas are quite complicated and difficult to create and study, many graphical XML editors (e.g. XMLSpy, Liquid XML Studio...) have fortunately been released to make XSD schema related tasks (e.g. creation, modification, etc.) easier.

In our approach, there is no need to introduce additional control steps such as coherence and integrity verification when updating the original relational database from XML data. This contrasts with the solutions previously proposed in the literature for updating relational databases from a published XML view, which require checking the correctness of the updates.

3.1.7 Conclusion

In this section, we presented a generic mediation framework. Our proposed framework consists in the design of a generic XML-based data mediator for data exchange between any XML source and any relational database. We introduced three general scenarios for data exchange between XML and Relational databases models. We presented also the architecture of our mediator and described the functions of its components. Finally, we discussed the different choices and the ideas behind the design of our proposed architecture.

In the next section, we will describe the methodology we propose for automatically transforming any relational model into an XML schema. This transformation methodology will be used in our proposed mediator.


3.2 Automation of the Transformation from Relational Models to XML Schemas

3.2.1 Introduction

As already stated, XML has rapidly emerged as a standard for data exchange and data representation on the Web. Thanks to its nested self-describing structure, XML offers simple and flexible means to exchange data between applications. However, most data nowadays are stored and maintained in relational databases and it is very unlikely that this situation will change in the near future, mainly because of some unique advantages offered by relational databases such as scalability, reliability and performance. This situation has triggered many works to bring together the flexibility of the XML model as a data exchange standard and the performance of the relational model for data storage and retrieval. However, the problem of automating the update of existing relational databases from XML data remains a challenge. This operation should be as transparent as possible to the user, and should dynamically support open structures of incoming data.

Indeed, one important step towards solving this problem is the design of an XML schema that can be used as a wrapper of the relational database representation and that preserves the integrity constraints. Such an XML schema offers several major benefits: (1) it can be used to publish an XML view of relational data; (2) it allows the XML document to be checked for anomalies or inconsistencies before the relational database is updated from it; (3) it can take into consideration the hierarchical view of the database tables, which is very useful for generating the SQL scripts to be executed on this database.

Hereafter, we present the second main contribution of this thesis: a methodology to automate the creation of an XML schema from a source relational database, while preserving the integrity constraints and avoiding any data redundancy [Jumaa et al. 2010a]. Our methodology is composed of two main steps, which are described below.

3.2.2 Transformation methodology

The methodology we propose to generate the target XML schema from a relational database while taking into account the integrity constraints consists of two main steps as depicted in Figure 3-6.


The first step consists in a stratification algorithm that classifies all the relations of the data model. The latter may be obtained by applying a reverse engineering technique to the relational database. The result is a hierarchical representation of the relational model in different levels of relations, according to the existing functional dependencies and based on the foreign key constraints.

The second step consists in a transformation algorithm that generates the target XML schema according to the classification performed in the first step and based on the application of a set of generic XML schema fragments templates. These templates aim to map each component of the relational schema into its corresponding XML schema fragment. Thus, the target XML schema is incrementally built by preserving the database structure as well as the semantics defined by the integrity constraints.

Figure 3-6 Relational database model to XML Schema transformation

3.2.2.1 Relations stratification − definitions

The idea behind relations stratification is to classify the different relations of a relational schema into different levels with respect to a hierarchy based on the referential constraints represented by their foreign keys. This hierarchy is aimed at easily representing the relational data as a tree that is then transformed into a highly nested XML schema. This tree view of the relations is based on the referential integrity constraints, i.e. on the foreign keys existing in the relations.

Let us first introduce two new concepts, the relation level and the foreign key degree, which are used by the stratification algorithm presented in the next section.

A zero level relation is a relation that has no foreign key which refers to other relations in the schema.

A zero degree foreign key is a foreign key which refers to a zero level relation. Foreign keys in a self-referenced relation have no degree.

A first level relation is a relation that has foreign keys and all of its foreign keys are zero degree foreign keys.

A first degree foreign key is a foreign key which refers to a first level relation.

More generally:

An N level relation is a relation for which all its foreign keys are M degree foreign keys, where M ≤ N-1, and at least one of them is an N-1 degree foreign key.

An M degree foreign key is a foreign key that refers to a relation of level M.

3.2.2.2 Stratification algorithm

We suppose that the source relational database model is at least in the third normal form so that all transitive dependencies have been removed, and that there are no circularly referenced relations. We think that this is a reasonable assumption since relational database models should be normalized in practice.

We present hereafter the algorithm that incrementally stratifies the relations of a relational schema into the appropriate levels according to the definitions described in the previous section.

First, we define R as the set of all the relations in the data model. Formally: R = {r1, .., rm}.

Then, we define a function Fk that maps each relation r of R to the set of its parent relations: Fk(r) = {r1, .., rk}, where each ri is a parent relation of r and ri ≠ r. The stratification algorithm, called Algorithm 1, is then based on this notation.

3.2.2.2.1 Algorithm description

The input of Algorithm 1 is the set of relations R of the relational schema. Its output is a set of disjoint subsets of R, {L0, L1, ... , Ln}, where each subset corresponds to a distinct level of relations. The first step of Algorithm 1 consists in searching, over all the relations, for the relations which have no foreign key that refers to other relations. These relations are assigned to the zero level set L0. Then, we remove this zero level set from the set of all the relations and search, among the remaining relations DR = R - L0, for the relations that have no parent relation in DR, which means that all their parent relations have already been stratified into a previous level. These relations are assigned to the level set L1, and so on until all the relations have been stratified.

Algorithm 1: Relations Stratification Algorithm

Input: R, the set of all relations of a relational schema

Output: The relations subsets L0, L1, ... , Ln

L0 = ∅; {initialize the zero level set}
DR = ∅; {initialize an empty set}
for all ri ∈ R do
    if Fk(ri) = ∅ then {if ri has no foreign key}
        L0 = L0 ∪ {ri}; {add ri to the zero level set}
DR = R - L0; {subtract the zero level set from R}
k = 1; {initialize the level index}
while DR ≠ ∅ do
    Lk = ∅; {initialize the k level set}
    for all ri ∈ DR do
        if Fk(ri) ∩ DR = ∅ then {if ri has no parent relation in DR}
            Lk = Lk ∪ {ri}; {add ri to the k level set}
    DR = DR - Lk; {subtract the k level set from DR}
    k = k + 1;

To illustrate our stratification method, let us consider three tables A, B and C, with a one-to-many relationship between A and B and a many-to-many relationship between B and C, according to the physical model displayed in Figure 3-7-a. Applying our stratification method to this model results in three levels, as depicted in Figure 3-7-b.


Figure 3-7 A principle example of stratification: a physical relational model (figure 3-7-a) and its stratification into three levels (figure 3-7-b)

Table_A and Table_C, as they have no foreign keys, are in level 0. Table_B is in level 1, as it has only one foreign key (a), which is a zero degree foreign key because it refers to Table_A, which is in level 0. Finally, Table_D, which implements the many-to-many relationship between B and C, is in level 2, as its composite primary key (b,c) is made of a foreign key referring to Table_B, which is in level 1, and a foreign key referring to Table_C, which is in level 0.
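The stratification of this example can be reproduced with a short Python sketch. It is only an illustration of Algorithm 1, not the implementation used in the mediator: the function name `stratify` and the dictionary encoding of the parent relations are assumptions made here, and the zero level is computed by the same loop as the other levels, which gives the same result when every parent relation belongs to the schema.

```python
from typing import Dict, List, Set


def stratify(parents: Dict[str, Set[str]]) -> List[Set[str]]:
    """Return [L0, L1, ...] for a mapping relation -> set of parent relations."""
    # Fk(r) excludes r itself, so self-references are ignored.
    fk = {r: set(ps) - {r} for r, ps in parents.items()}
    remaining = set(fk)  # plays the role of DR
    levels: List[Set[str]] = []
    while remaining:
        # Relations whose parents have all been placed in earlier levels.
        level = {r for r in remaining if not (fk[r] & remaining)}
        if not level:
            # Algorithm 1 assumes no circular references; fail loudly otherwise.
            raise ValueError(f"circular references among {sorted(remaining)}")
        levels.append(level)
        remaining -= level
    return levels


# The example of Figure 3-7: B references A; D references B and C.
schema = {
    "Table_A": set(),
    "Table_B": {"Table_A"},
    "Table_C": set(),
    "Table_D": {"Table_B", "Table_C"},
}
levels = stratify(schema)
for k, level in enumerate(levels):
    print(f"L{k}: {sorted(level)}")
# L0: ['Table_A', 'Table_C']
# L1: ['Table_B']
# L2: ['Table_D']
```

The explicit error on circular references is an addition of this sketch; the algorithm itself relies on the assumption that such references do not exist.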

3.2.2.3 XML Schema fragments templates

Before presenting our algorithm to auto-generate the target XML schema, based on the previous relations stratification, we first introduce a set of principles to map each component of a relational schema into a corresponding XML schema fragment template.

A schema of a relational database is composed of:

● A set of relations or tables.
● A set of attributes or columns in each relation.
● A specified data type “domain” for each attribute.
● A primary key in each relation.
● A set of foreign keys.
● A set of constraints (uniqueness, not null).

For all of these components of a relational schema, we thus specify a set of 9 templates to generate their corresponding XML fragments in the XML Schema language. The fragment templates 1 to 9 correspond respectively to the XML schema parts which are surrounded by rectangles.


3.2.2.3.1 Root fragment template

To generate our target XML schema from a relational schema R, the first step is to create a root element in which all the relations of the schema R shall be placed or nested. This step is required for any XML schema generation from a relational schema. Since an XML schema represents one single structure of XML documents, the name of the corresponding relational schema R shall be used as the name of the schema root element, as follows:

<?xml version="1.0" encoding="utf-8" ?>
<xs:schema elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="R_TargetXMLSchema">
    <xs:complexType>
      <xs:sequence>
        <!-- Here are the generated elements of the target Schema -->
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- Here are the generated new data types of the target Schema -->
</xs:schema>

Fragment 1 : Root transformation template

3.2.2.3.2 Relation fragment template

Each relation shall be mapped into an “element” with a “complexType” in the target XML schema. This element shall be nested under the correct parent, based on the stratification of the relations we presented previously. As each relation in the relational schema is identified by a unique name, it is suitable to name the XML element after the relation. For instance, for a relation r, its XML schema fragment will be generated as follows:

<xs:element name="r" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <!-- Here are the generated attributes of relation r -->
  </xs:complexType>
  <!-- Here are the generated identity constraints of relation r -->
</xs:element>

Fragment 2 : Relation transformation template


3.2.2.3.3 Data type and domain fragment template

For all data types in the relational schema, if the data type of an attribute exists as a primitive data type in the XML Schema Definition Language, then it is used in the “type” attribute of the “attribute” or “element” declaration. If it does not exist, then a “simpleType” element is generated and declared in the global scope (i.e. as a sub-element of the “xs:schema” tag) so that it can be reused for any attribute of the same type.

In fact, XSD has a very expressive power to define new data types over its primitive data types. Using different facets, data types may be restricted to domain values. For instance, for a string attribute a of a maximum length of 40 characters, a new data type can be defined by restriction of the xs:string primitive data type by using the “maxLength” facet, and by using a “simpleType” element named with the data type name “string40” as follows:

<xs:simpleType name="string40">
  <xs:restriction base="xs:string">
    <xs:maxLength value="40" />
  </xs:restriction>
</xs:simpleType>

Fragment 3 : Data type or domain transformation template
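As an illustration of this template, the following Python sketch declares a restricted type once in the global scope and reuses it afterwards. The function `xsd_type`, the small `BUILTIN` mapping and the restriction to VARCHAR(n) columns are assumptions made for this example; a complete mapping would have to cover all the data types of the source DBMS.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical, deliberately small mapping from SQL types to XSD built-in types.
BUILTIN = {"INTEGER": "xs:integer", "DATE": "xs:date", "TEXT": "xs:string"}


def xsd_type(sql_type: str, schema: ET.Element) -> str:
    """Return the XSD type name for sql_type, declaring a simpleType if needed."""
    sql_type = sql_type.strip().upper()
    if sql_type in BUILTIN:
        return BUILTIN[sql_type]
    match = re.fullmatch(r"VARCHAR\((\d+)\)", sql_type)
    if match is None:
        raise ValueError(f"unsupported SQL type: {sql_type}")
    name = f"string{match.group(1)}"
    # Declare the restricted type once, in the global scope, and reuse it.
    declared = schema.findall(f"{{{XS}}}simpleType")
    if not any(st.get("name") == name for st in declared):
        st = ET.SubElement(schema, f"{{{XS}}}simpleType", name=name)
        restriction = ET.SubElement(st, f"{{{XS}}}restriction", base="xs:string")
        ET.SubElement(restriction, f"{{{XS}}}maxLength", value=match.group(1))
    return name


schema = ET.Element(f"{{{XS}}}schema")
print(xsd_type("varchar(40)", schema))   # string40
print(xsd_type("VARCHAR(40)", schema))   # string40, no second declaration
print(ET.tostring(schema, encoding="unicode"))
```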

3.2.2.3.4 Attribute fragment template

There are two possibilities to transform an attribute a in a relation r into its corresponding XML element.

The first possibility is to get what we call attribute oriented XML documents, where each attribute a in a relation r which has a data type DT (where DT is either a primitive data type in XSD or a new generated data type) shall be mapped into an “attribute” element nested into its relation element. Thus, an “attribute” element named with the attribute name from the relational schema and which has the data type DT, shall be created under the corresponding element of the relation r as follows:


<xs:element name="r" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:attribute name="a" type="DT" />
    <!-- Here are specified other mapped attributes of relation r -->
  </xs:complexType>
</xs:element>

Fragment 4 : Attribute transformation template

The second possibility is to get element oriented documents, where each attribute a in a relation r shall be mapped into an “element” nested into its relation element. Thus, an “element” named with the attribute name from the relational schema will be created under the corresponding element of the relation r as follows:

<xs:element name="r" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="a" type="DT" />
      <!-- Here are specified other mapped attributes of relation r -->
    </xs:sequence>
  </xs:complexType>
</xs:element>

A main difference between the two possibilities is that the second one adds an ordering over the attributes of a relation which does not exist in the relational schema. Therefore, we decided to use the first possibility.
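The ordering argument can be observed directly on instance documents. The following Python sketch, given only as an illustration with made-up values, parses the same tuple in both styles: the order of XML attributes is not significant, whereas the order of child elements is part of the document.

```python
import xml.etree.ElementTree as ET

# Attribute oriented: the two documents differ only by the attribute order.
attr_1 = ET.fromstring('<r a="1" b="2"/>')
attr_2 = ET.fromstring('<r b="2" a="1"/>')

# Element oriented: the two documents differ only by the child order.
elem_1 = ET.fromstring('<r><a>1</a><b>2</b></r>')
elem_2 = ET.fromstring('<r><b>2</b><a>1</a></r>')

same_attributes = attr_1.attrib == attr_2.attrib
same_children = [c.tag for c in elem_1] == [c.tag for c in elem_2]

print(same_attributes)  # True: attributes form an unordered set
print(same_children)    # False: child elements are ordered
```

With an “xs:sequence” content model, only one of the two element oriented documents would be accepted, although both carry the same tuple.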

3.2.2.3.5 Primary key constraint fragment template

In each relation, a primary key constraint is specified over one or more attributes that shall be mapped into a “key” element in XSD. The “key” element consists of one and only one “selector” element (an XPath expression to specify a set of elements) and one or more “field” elements. Each “field” element contains an XPath expression that specifies the values that must exist, be unique and be not-null for the set of elements specified by the selector element. Thus, the “key” element is an identity-constraint mechanism that not only asserts that the tuples of values obtained by evaluating the “field” XPath expressions are unique among the elements identified by the “selector” element, but also requires that all these values exist and are not null.

Let pk be a primary key constraint defined over a set of attributes a1, .., an of a relation r. Then the XML schema fragment to represent this pk constraint shall be generated by using the “key” element as follows:


<xs:element name="r" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:attribute name="a1" type="DT" />
    ...
    <xs:attribute name="an" type="DT" />
    <!-- Here are specified other mapped attributes of relation r -->
  </xs:complexType>
  <xs:key name="r_pk">
    <xs:selector xpath=".//r" />
    <xs:field xpath="@a1" />
    ...
    <xs:field xpath="@an" />
  </xs:key>
  <!-- Here are specified other mapped identity constraints of relation r -->
</xs:element>

Fragment 5 : Primary key transformation template

3.2.2.3.6 Foreign key constraint fragment template

In each relation one or more foreign key constraints could have been specified over one or more attributes. To generate an XML schema fragment corresponding to the foreign key, we have two possibilities we present hereafter.

Nesting fragment template

Let fk be a foreign key constraint defined over a set of attributes a1, .., an of a relation r referring to the set ap1, .., apn in the parent relation rp. This foreign key constraint may be represented implicitly by directly nesting the relation r as a child or sub-element in the element representing the parent relation rp. The set of attributes a1, .., an shall then be omitted from the mapped attributes of the relation r, as follows:


<xs:element name="rp">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="r" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <!-- The mapped attributes of r except the set of attributes of fk -->
        </xs:complexType>
        <!-- Here are the generated identity constraints of relation r -->
      </xs:element>
    </xs:sequence>
    <xs:attribute name="ap1" type="DT"/>
    ...
    <xs:attribute name="apn" type="DT"/>
    <!-- The other mapped attributes of rp -->
  </xs:complexType>
  <xs:key name="rp_pk">
    <xs:selector xpath=".//rp" />
    <xs:field xpath="@ap1" />
    ...
    <xs:field xpath="@apn" />
  </xs:key>
  <!-- The other mapped identity constraints of rp -->
</xs:element>

Fragment 6 : Relation nesting transformation template

Key reference fragment template

More than one foreign key constraint can be defined in a relation. Since an element can be nested under only one parent element, only one of these constraints can be mapped using the previous fragment; the other foreign key constraints shall be mapped using a “keyref” element for each foreign key.

The “keyref” element specifies one or more attribute values corresponding to the primary key of the parent relation, which has already been created by the primary key fragment. The “keyref” consists of one “selector” element (an XPath expression that specifies a set of elements) and one or more “field” elements. The definition of a “keyref” element enforces the correspondence between the tuples of values obtained by evaluating the “field” XPath expressions on the elements identified by the “selector” element and the referenced key in the parent relation.

Let fk1 be a foreign key constraint defined over a set of attributes a1, .., an of a relation r that refers to the primary key in the parent relation r1. This primary key is mapped into a “key” element with the name r1_pk by using the primary key fragment generation, and then fk1 is created as a “keyref” element, as follows:


<xs:element name="r" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:attribute name="a1" type="DT"/>
    ...
    <xs:attribute name="an" type="DT"/>
    <!-- The other mapped attributes of r -->
  </xs:complexType>
  <xs:keyref name="fk1" refer="r1_pk">
    <xs:selector xpath=".//r" />
    <xs:field xpath="@a1" />
    ...
    <xs:field xpath="@an" />
  </xs:keyref>
  <!-- The other mapped identity constraints of r -->
</xs:element>

Fragment 7 : Foreign key transformation Template

3.2.2.3.7 Uniqueness constraint fragment template

In each relation, one or more uniqueness constraints can be specified over one or more attributes. To create the XML schema fragment corresponding to each uniqueness constraint, a “unique” element is created. The “unique” element consists of one “selector” element and one or more “field” elements. Each “field” element contains an XPath expression that specifies the values that must be unique for the set of elements specified by the selector element.

Let u be a uniqueness constraint defined over a set of attributes a1, .., an of a relation r. To represent this constraint, a “unique” element is created as follows:

<xs:element name="r" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:attribute name="a1" type="DT"/>
    ...
    <xs:attribute name="an" type="DT"/>
    <!-- The other mapped attributes of r -->
  </xs:complexType>
  <xs:unique name="u">
    <xs:selector xpath=".//r"/>
    <xs:field xpath="@a1" />
    ...
    <xs:field xpath="@an" />
  </xs:unique>
  <!-- The other mapped identity constraints of r -->
</xs:element>

Fragment 8 : Uniqueness constraint transformation template
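The “key”, “keyref” and “unique” fragments share the same structure: one “selector” and one “field” per attribute. A single helper can therefore produce all three, as in the following Python sketch. The function `identity_constraint` and its parameters are hypothetical names introduced for this illustration; the selector and field expressions simply reproduce those of the templates above.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def identity_constraint(kind, name, relation, columns, refer=None):
    """Build an xs:key, xs:unique or xs:keyref element for `relation`."""
    if kind not in ("key", "unique", "keyref"):
        raise ValueError(f"unknown constraint kind: {kind}")
    if (kind == "keyref") != (refer is not None):
        raise ValueError("'refer' is required for keyref and only for keyref")
    attrs = {"name": name}
    if refer is not None:
        attrs["refer"] = refer
    node = ET.Element(f"{{{XS}}}{kind}", attrs)
    # Same selector and field expressions as in fragments 5, 7 and 8.
    ET.SubElement(node, f"{{{XS}}}selector", xpath=f".//{relation}")
    for column in columns:
        ET.SubElement(node, f"{{{XS}}}field", xpath=f"@{column}")
    return node


pk = identity_constraint("key", "r_pk", "r", ["a1", "a2"])
fk = identity_constraint("keyref", "fk1", "r", ["a3"], refer="r1_pk")
print(ET.tostring(pk, encoding="unicode"))
print(ET.tostring(fk, encoding="unicode"))
```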


3.2.2.3.8 Non-nullable constraint fragment template

In the relational schema, non-nullable constraints may have been defined over some attributes. Such constraints must be mapped according to the option adopted to generate the attributes. If the “attribute oriented methodology” is chosen (i.e. XML attributes are used to represent the attributes of the relational schema), then this constraint shall be mapped by adding the attribute “use” with the value “required” in the attribute declaration as follows:

<xs:attribute name="a" type="DT" use="required"/>

Fragment 9 : Non-nullable constraint transformation template

If the “element oriented methodology” is chosen (i.e. XML elements are used to represent the attributes of the relational schema) then the attribute “nillable” with the value “false” shall be used as follows:

<xs:element name="a" type="DT" nillable="false"/>

3.2.2.4 Target XML Schema transformation algorithm

In the previous section we presented the XML schema fragment templates for all the components of a relational schema. We now describe our algorithm to auto-generate the target XML schema, which takes into account the level of each relation given by the stratification algorithm and uses the previous set of fragment templates. The target XML schema is highly nested, based on the hierarchy of the relations; it preserves all the integrity constraints of the original relational schema and is redundancy-free.

Let R be a relational schema where the relations are stratified into the levels {L0, L1, ... , Ln}, which are disjoint subsets of all the relations in R. Each relation ri ∈ R is identified by its name, a subset of attributes Ai = {a1, .., aj} ⊆ A, where A is the set of all the attributes in R, and a subset of constraints Ci = {c1, .., cm} ⊆ C, where C is the set of all constraints (Primary Keys, Foreign Keys, Uniqueness, and Non-nullable). On the basis of this notation, our algorithm, named Algorithm 2, which auto-generates the target XML schema from the relational schema R while preserving all integrity constraints, is presented hereafter:


Algorithm 2: Target XML schema generation algorithm

Input: A relational schema R where the relations are stratified into the levels {L0, L1, ... , Ln}. Each relation is identified by the tuple (ri ∈ R, Ai ⊆ A, Ci ⊆ C).

Output: The target XML schema

Initialize a new XSD file;
Create the schema root element; {Using fragment 1}
Create a new data type for each different domain constraint; {Using for example fragment 3}
for all ri ∈ L0 do
    Create ri element as sub-element of the schema root element; {Using fragment 2}
    for all aj ∈ Ai do
        Create attribute aj for element ri; {Using fragment 4}
    for all cm ∈ Ci do
        Create the corresponding constraint fragment; {Using fragment 5 for primary key constraints, using fragment 7 for foreign key constraints, using fragment 8 for uniqueness constraints, using fragment 9 for non-nullable constraints}
for k = 1 to n do {k is the level index}
    for all ri ∈ Lk do
        Create ri element as sub-element of rp, where rp ∈ FK(ri) ∩ Lk-1; {Using fragment 6} {Note: ∀ ri, ri ∈ Lk ⇒ ∃ rp ∈ Lk-1 such that rp ∈ FK(ri)}
        Let Ap be the attributes set of rp
        for all aj ∈ Ai and aj ∉ Ap do
            Create attribute aj for the element ri; {Using fragment 4}
        Let cp be the foreign key constraint which refers to rp
        for all cm ∈ Ci and cm ≠ cp do
            Create the corresponding constraint fragment; {Using fragment 5 for primary key constraints, using fragment 7 for foreign key constraints, using fragment 8 for uniqueness constraints, using fragment 9 for non-nullable constraints}
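The traversal performed by Algorithm 2 can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the implementation of the mediator: the dictionary encoding of the relations, the names `generate_schema` and `add_constraint`, and the naming of the “keyref” elements are introduced here; only primary and foreign keys are handled; a relation is nested under the first suitable parent in alphabetical order; and the primary key columns omitted by the nesting are dropped from the generated “key”. The result has not been checked against an XSD processor.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def xs(tag):
    """Qualified name of an XML Schema element."""
    return f"{{{XS}}}{tag}"


def add_constraint(owner, kind, name, relation, columns, refer=None):
    """Append an xs:key or xs:keyref element (fragments 5 and 7) to `owner`."""
    attrs = {"name": name} if refer is None else {"name": name, "refer": refer}
    node = ET.SubElement(owner, xs(kind), attrs)
    ET.SubElement(node, xs("selector"), xpath=f".//{relation}")
    for column in columns:
        ET.SubElement(node, xs("field"), xpath=f"@{column}")


def generate_schema(schema_name, relations, levels):
    """relations: name -> {"columns": {col: type}, "pk": [cols], "fks": {parent: [cols]}}."""
    schema = ET.Element(xs("schema"), elementFormDefault="qualified")
    root = ET.SubElement(schema, xs("element"), name=schema_name)  # fragment 1
    root_seq = ET.SubElement(ET.SubElement(root, xs("complexType")), xs("sequence"))
    sequences = {}  # relation name -> xs:sequence that receives its children
    for k, level in enumerate(levels):
        for name in sorted(level):
            rel = relations[name]
            # Nest under one parent of the previous level (fragment 6), if any.
            nesting = next((p for p in sorted(rel["fks"])
                            if k > 0 and p in levels[k - 1]), None)
            container = root_seq if nesting is None else sequences[nesting]
            elem = ET.SubElement(container, xs("element"), name=name,
                                 minOccurs="0", maxOccurs="unbounded")  # fragment 2
            ctype = ET.SubElement(elem, xs("complexType"))
            sequences[name] = ET.SubElement(ctype, xs("sequence"))
            omitted = set(rel["fks"].get(nesting, []))
            for column, col_type in rel["columns"].items():  # fragment 4
                if column not in omitted:
                    ET.SubElement(ctype, xs("attribute"), name=column, type=col_type)
            key_columns = [c for c in rel["pk"] if c not in omitted]
            if key_columns:
                add_constraint(elem, "key", f"{name}_pk", name, key_columns)
            for parent, columns in sorted(rel["fks"].items()):
                if parent != nesting:
                    add_constraint(elem, "keyref", f"{name}_{parent}_fk", name,
                                   columns, refer=f"{parent}_pk")
    return schema


# The four tables of Figure 3-7, with hypothetical integer columns.
relations = {
    "Table_A": {"columns": {"a": "xs:integer"}, "pk": ["a"], "fks": {}},
    "Table_B": {"columns": {"b": "xs:integer", "a": "xs:integer"},
                "pk": ["b"], "fks": {"Table_A": ["a"]}},
    "Table_C": {"columns": {"c": "xs:integer"}, "pk": ["c"], "fks": {}},
    "Table_D": {"columns": {"b": "xs:integer", "c": "xs:integer"},
                "pk": ["b", "c"], "fks": {"Table_B": ["b"], "Table_C": ["c"]}},
}
levels = [{"Table_A", "Table_C"}, {"Table_B"}, {"Table_D"}]
schema = generate_schema("R_TargetXMLSchema", relations, levels)
if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
    ET.indent(schema)
print(ET.tostring(schema, encoding="unicode"))
```

In the generated schema, Table_D is nested under Table_B, which is itself nested under Table_A, while the reference from Table_D to Table_C is kept as a “keyref”.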


3.2.3 Discussion

There are many methodologies to transform relational models into XML in the literature. However, none of them takes into account the hierarchy of the relations in the data model. In contrast, our solution has been designed to obtain a schema that captures the hierarchical view of the relational model, that is taken into account for generating the cascade of SQL queries to be performed for updating the database coherently, whatever the complexity of the relational model. The only restriction we assume is that all relations are at least in the third normal form. In addition, generic encoding templates are defined for the support of an automatic mapping of the relations of the database schema into corresponding fragments of the target XML schema.

To prove the feasibility of our approach, we applied it to an example from the healthcare domain (which will be presented in the next Chapter). The resulting XML schema contains neatly nested fragments without any redundancy, which allowed us to use it to automatically generate the SQL queries that carry out the updates over the underlying relational database.

3.2.4 Conclusion

In this section, a methodology to efficiently automate the generation of a target XML schema from a relational schema is proposed. This methodology has the advantage of producing redundancy free schemas by performing high levels of nesting; it also preserves the integrity constraints in the relational model. The proposed methodology consists of two steps. The first step is the stratification step where the relations of the relational model are stratified into levels based on functional dependencies and referential constraints. The second step is the generation step where the target XML schema is automatically created by applying to each relation, on the basis of its stratification level and foreign keys degrees, the corresponding generation fragment templates.

In the next section, we present the methodology and the algorithms to automate the generation of SQL queries that are required to update and/or retrieve data to/from the relational database in our mediation framework.


3.3 Automatic SQL Generation

3.3.1 Introduction

In this section, we describe a generic methodology to auto-generate the SQL script in our proposed XML based mediation framework for effective and automatic data exchange between any XML format and any relational databases. We first present an algorithm to automatically generate a coherent series of SQL statements to store the XML data into an existing relational database and to update the relational database from XML data. Secondly, we give a methodology to generate coherent SQL query scripts to retrieve from a relational database the data required by a virtual XML view and to return the result in the required XML format.

3.3.2 SQL auto-generation methodology

We focus here on the problem of the SQL auto-generation for the purpose of interchanging data between XML documents and relational databases. In the following, we present our proposed methodology to solve the problem.

We distinguish two cases:

● The first case is the storage and/or the update of the relational database from an XML document, where we need to automatically generate a coherent series of SQL statements taking into account the tables’ hierarchy and the integrity constraints of the database and especially the foreign key constraints.

● The second case is the retrieval of data from a relational database and their publishing into a specific, required XML format. In this case the user only defines the data he/she needs and the XML format.

Hereafter, we describe the proposed methodology for the SQL auto-generation for these two cases in our framework of data exchange between XML and relational databases. The first objective is to automate the generation of a correct series of INSERT and/or UPDATE statements for the storage of any XML document into a relational database, which is very helpful in the context of legacy information systems. The second objective is to automate the generation of correct SELECT query scripts to retrieve the required data from a relational database and then to encode these data into the specific required XML format.


3.3.2.1 SQL auto-generation to update relational DBs from any XML format

In this case, we aim to automate the storage and the update of the relational database from an XML document. Therefore, we need to automatically generate a coherent and correct series of SQL statements. When creating the INSERT/UPDATE statements, the order of insertion into the relational database is very important, as tables are related through foreign key constraints. Consequently, it is crucial to first insert records into the referenced tables, which hold the candidate primary keys, and then into the related tables, where these keys act as foreign keys. Otherwise, a foreign key constraint violation error will be returned, the record will not be inserted and the SQL transaction must be rolled back.
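The importance of the insertion order can be observed with the following Python sketch, which uses an in-memory SQLite database. The tables `patient` and `visit`, their columns and the inserted values are hypothetical and serve only as an illustration; SQLite is chosen here because it ships with Python, not because it is the DBMS targeted by the mediator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE visit (
        id INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patient(id),
        day TEXT NOT NULL
    );
""")


def run(statements):
    """Execute the statements in one transaction; roll back on any violation."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            for sql, params in statements:
                conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


child_first = [
    ("INSERT INTO visit (id, patient_id, day) VALUES (?, ?, ?)", (1, 10, "2010-12-16")),
    ("INSERT INTO patient (id, name) VALUES (?, ?)", (10, "Doe")),
]
parent_first = list(reversed(child_first))

# The referencing table first: the foreign key is violated, nothing is stored.
# The referenced table first: both records are stored.
results = (run(child_first), run(parent_first))
print(results)  # (False, True)
```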

The schema for the adopted solution for this case is depicted in Figure 3-8.

Our algorithm to automatically generate the SQL script for updating the relational database from any XML data is based on the XSD schema which is automatically generated from the database, and on the rule base. This algorithm has two main steps. In the first one, the algorithm generates the SELECT statements over the database and retrieves the ID keys of the data if the data already exist in the database. Then, on the basis of the result of each SELECT statement, the algorithm specifies the SQL operation for the corresponding tuple (INSERT or UPDATE). This step is achieved by the Select generator as depicted in Figure 3-8. In the second step, the algorithm generates the final SQL transaction according to the hierarchy of the tables, which is represented in the XML DB entry infoset that is compliant with the XSD schema of the database. This step is achieved by the SQL generator as depicted in Figure 3-8.


3.3.2.1.1 Step 1: SQL select generation

In this step of our algorithm, the input of the algorithm is an XML DB entry infoset compliant with the database XSD schema, which is produced from the original XML input infoset by an XSLT transformation step of the XML schemata matcher, and the output shall be the modified XML DB entry infoset. Actually, the output infoset is the DB entry infoset updated with the real Keys from the Database and with the correct operation to be performed in the next step of SQL generation.

Figure 3-8 Automatic SQL generation schema (XML to RDB function)


The algorithm to generate the SQL Select script is presented hereafter.

Algorithm 3: SQL select generation

Input: An XML DB entry infoset compliant with the XSD database Schema. An XML file representing the set of rules.

Output: A changed XML DB entry infoset with all the real IDs and/or primary keys retrieved from the database for the existing data.

1: xml=loadXMLDoc("DBentry.xml"); Load and initiate the XML input file

2: rules=loadXMLDoc("rulebase.xml"); Load the XML rule-base file into its own variable, so that the input document loaded in step 1 is not overwritten

3: $Lpath="/*"; Select all the elements in the level zero from the input XML file

4: n:= count ($Lpath); Count the elements in the level

5: While n<>0 do

6: Begin (while)

7: For i = 1 to n do

8: Begin (for)

9: $Rname:=node-name($Lpath[i]); Read the relation name from XML input file

10: [PK]:="//$Rname/pk"; Read the PK attributes of the current relation from the rule-base file

11: [A] :="//$Rname/a" Read attributes specified for the verification of existence of the current relation from the rule-base file

12: For each Ak in [A] do

13: ak:= /$Rname/@Ak Read the value ak of the attribute Ak

14: If {SELECT [PK] from $Rname Where [A] = [a] } not null then

15: Replace the primary keys values in the input XML with values from the retrieved value by the SELECT statement.

16: Else

17: Set the value of the attribute “operation” to "INSERT" for the current element and all its sub-elements

18: End (For)

19: $Lpath:=$Lpath+"/*[@operation='UPDATE']"; Select the elements in the next level from the input XML file when the attribute “operation” is still "UPDATE"

20: n:= count ($Lpath);

21: End (While)

Algorithm description

First, we load the XML input file, which is compliant with the XSD schema of the database, and initialise it by adding to each element the attribute “operation” with the default value "UPDATE". This value may be changed automatically during the execution of the algorithm, depending on the content of the database. We also load the XML file representing the rule-base. In this file, we define, for each relation of the relational model of the database, the set of primary key attributes and the set of attributes used to verify the existence of tuples in the database table.

The algorithm then selects all relation elements at level zero. For each of these elements, we read its name from the XML input file and use it as the table name. We also read the names of the primary keys and of the attributes specified for data existence verification from the rule-base XML file, and we read their values from the XML input file to be used in the WHERE clause of the SELECT statement. All parameters needed to generate the SELECT statement are thus available.

We then check the result of the SELECT statement. If it is not null, i.e. the tuple already exists, we replace the temporary IDs in the primary key attributes of the input XML file with the real primary key values retrieved by the SELECT statement. If the result is null, i.e. the tuple does not exist in the database table, the “operation” attribute of this relation element and of all its sub-elements is set to "INSERT".

Next, when all elements at this level have been processed, we select the relation elements of the next level whose “operation” attribute is still "UPDATE" and repeat the same steps: we generate the SELECT statement for each relation, and either replace the IDs with the real ones from the database or set the “operation” attribute to "INSERT". The algorithm stops when there are no more relations in the next level.
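The first step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the DB entry is assumed to be a tree whose element names are table names and whose attributes are column values, the rule base is assumed to list pk and a children per relation (mirroring the paths //$Rname/pk and //$Rname/a of Algorithm 3), and the patient and ecg_record tables are hypothetical. The propagation of a real key to sub-elements that repeat the temporary ID is also an assumption about how foreign keys are represented.

```python
import sqlite3
import xml.etree.ElementTree as ET


def resolve_keys(conn, entry_root, rules_root):
    """Mark every element of the DB entry tree as UPDATE or INSERT, level by level."""
    for elem in entry_root.iter():
        elem.set("operation", "UPDATE")              # default value at initialisation
    level = [entry_root]                             # level zero
    while level:
        next_level = []
        for elem in level:
            rule = rules_root.find(elem.tag)         # rule for the current relation
            pk_cols = [e.text for e in rule.findall("pk")]
            check_cols = [e.text for e in rule.findall("a")]
            # Identifiers come from the trusted rule base; values are bound parameters.
            sql = (f"SELECT {', '.join(pk_cols)} FROM {elem.tag} WHERE "
                   + " AND ".join(f"{c} = ?" for c in check_cols))
            row = conn.execute(sql, [elem.get(c) for c in check_cols]).fetchone()
            if row is None:
                for sub in elem.iter():              # the element and all its sub-elements
                    sub.set("operation", "INSERT")
                continue
            for col, real in zip(pk_cols, row):
                temporary = elem.get(col)
                elem.set(col, str(real))             # replace the temporary ID
                for sub in elem.iter():              # assumption: children repeat the key
                    if temporary is not None and sub.get(col) == temporary:
                        sub.set(col, str(real))
            next_level.extend(elem)                  # children of existing tuples only
        level = next_level
    return entry_root


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT);
    CREATE TABLE ecg_record (ecg_id INTEGER PRIMARY KEY, patient_id INTEGER, rec_date TEXT);
    INSERT INTO patient VALUES (42, 'Doe', 'Jane');
""")
rules = ET.fromstring("""
<rules>
  <patient><pk>patient_id</pk><a>last_name</a><a>first_name</a></patient>
  <ecg_record><pk>ecg_id</pk><a>patient_id</a><a>rec_date</a></ecg_record>
</rules>""")
entry = ET.fromstring("""
<patient patient_id="tmp1" last_name="Doe" first_name="Jane">
  <ecg_record ecg_id="tmp2" patient_id="tmp1" rec_date="2010-12-16"/>
</patient>""")
resolve_keys(conn, entry, rules)
```

In this example the patient already exists, so its temporary ID is replaced by the stored key and its operation stays "UPDATE", while the ECG record is not found and is marked "INSERT".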

3.3.2.1.2 Step2: SQL update/insert generation

In this step, the input is the modified XML DB entry infoset and the output is the SQL transaction, i.e. a set of SQL INSERT and/or UPDATE statements. The algorithm to generate the SQL Update/Insert script is presented hereafter. It generates the SQL script for each element, level by level. The operation and data values for each element correspond to a tuple in a relation and are obtained from the attribute values of the element.

Algorithm 4: SQL Update/Insert generation

Input: The modified XML DB entry infoset with all the real IDs and/or primary keys retrieved from the database by the previous algorithm.

Output: The SQL transaction.

1: xml=loadXMLDoc("DBentry.xml"); Load the XML input file

2: $Lpath="/*"; Select all the elements in the level zero from the input XML file

3: n:= count ($Lpath); Count the elements in the level

4: While n<>0 do

5: Begin (while)

6: For i = 1 to n do

7: Begin (for)

8: Read the operation attribute value and generate the corresponding UPDATE/INSERT statement for the element with the values of its attributes.

9: End (For)

10: $Lpath:="$Lpath"+"/*"; Select the elements in the next level from the input XML file

11: n:= count ($Lpath);

12: End (While)
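A minimal Python sketch of this second step is given below. It assumes the same hypothetical representation as before (element name = table name, attributes = column values, plus an "operation" attribute) and takes the primary key columns of each table as a separate dictionary. The statements are only generated, not executed, and the table and column names are illustrative.

```python
import xml.etree.ElementTree as ET


def generate_sql(entry_root, pk_columns):
    """Return (sql, params) pairs level by level, so that parents precede children."""
    statements = []
    level = [entry_root]
    while level:
        next_level = []
        for elem in level:
            values = {k: v for k, v in elem.attrib.items() if k != "operation"}
            pks = pk_columns[elem.tag]
            if elem.get("operation") == "INSERT":
                cols = list(values)
                placeholders = ", ".join("?" for _ in cols)
                sql = f"INSERT INTO {elem.tag} ({', '.join(cols)}) VALUES ({placeholders})"
                params = [values[c] for c in cols]
            else:
                cols = [c for c in values if c not in pks]   # non-key columns to update
                assignments = ", ".join(f"{c} = ?" for c in cols)
                condition = " AND ".join(f"{c} = ?" for c in pks)
                sql = f"UPDATE {elem.tag} SET {assignments} WHERE {condition}"
                params = [values[c] for c in cols] + [values[c] for c in pks]
            statements.append((sql, params))
            next_level.extend(elem)                           # children form the next level
        level = next_level
    return statements


# Hypothetical modified DB entry, as it could look after the first step.
entry = ET.fromstring(
    '<patient operation="UPDATE" patient_id="42" last_name="Doe" first_name="Jane">'
    '<ecg_record operation="INSERT" ecg_id="7" patient_id="42" rec_date="2010-12-16"/>'
    '</patient>')
statements = generate_sql(entry, {"patient": ["patient_id"], "ecg_record": ["ecg_id"]})
```

Because the tree is traversed level by level, the statement for the referenced table is emitted before the statements for the tables that reference it, which is the insertion order required by the foreign key constraints.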

3.3.3 SQL auto-generation to query and retrieve data from any RDB and to embed them into any XML format

In this case, we aim to automate the SQL generation to retrieve data from any relational database and embed them into a specified XML format. The user therefore defines the data he/she needs and the XML format to be used. To do so, he/she specifies, in the required XML format, the query conditions as constants and the required data as variables. A first transformation step is then applied to produce a valid XML file, containing these constants and variables, in a format that is compliant with the database schema. Next, a correct selection over the database is generated, and its result is embedded into the XML file that is compliant with the database XSD. Finally, a second transformation step is applied to produce the result in the required format. The schema for the adopted solution for this case is depicted in Figure 3-9.

Figure 3-9 Automatic SQL generation schema (DB to XML function)
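The idea of expressing query conditions as constants and required data as variables can be sketched as follows in Python. The convention that a variable is an attribute value starting with "$" is purely an assumption made for this illustration, as are the table and column names; the thesis text does not specify how constants and variables are encoded.

```python
import xml.etree.ElementTree as ET


def build_select(elem, variable_prefix="$"):
    """Turn one element of a DB-compliant template into a parameterised SELECT."""
    wanted = [c for c, v in elem.attrib.items() if v.startswith(variable_prefix)]
    fixed = [(c, v) for c, v in elem.attrib.items() if not v.startswith(variable_prefix)]
    sql = f"SELECT {', '.join(wanted)} FROM {elem.tag}"
    if fixed:
        sql += " WHERE " + " AND ".join(f"{c} = ?" for c, _ in fixed)
    return sql, [v for _, v in fixed]


# Hypothetical template: last_name is a constant, the two other attributes are variables.
template = ET.fromstring('<patient last_name="Doe" first_name="$first" birth_date="$dob"/>')
query, params = build_select(template)
```

The values retrieved by such a query would then be written back in place of the variables before the second transformation step is applied.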

3.3.4 Conclusion

In this section, we described our methodology to auto-generate the SQL scripts in our XML-based mediation framework. This methodology gives a generic solution for an effective and automatic data exchange between any XML format and any relational database. Specifically, we proposed an algorithm to automatically generate a coherent series of SQL statements to store the XML data into existing relational databases (thereby updating them using the XML data). Then we proposed a methodology to generate coherent SQL query scripts to retrieve data from a relational database as required by a virtual XML view and to return the result in the required XML format.

Chapter 4 - VALIDATION & APPLICATION IN EHEALTH

To demonstrate the feasibility of our approach, we applied and tested it in the healthcare application domain. Specifically, we applied it to exchange data between an XML representation of SCP-ECG, an open-format ISO standard communications protocol embedding bio-signals and related metadata, and a European relational reference model including these data [Jumaa et al. 2008]. Mediation automation is especially relevant in this field, where electrocardiograms (ECG) are considered the main means for the early detection of cardiovascular diseases and need to be quickly and transparently exchanged between healthcare systems, especially in emergency situations.

Our application example from the cardiology domain to automate the storage of SCP-ECG data into relational databases is presented hereafter.

4.1 Introduction

Electrocardiography (ECG) is one of the most important non-invasive diagnostic tools for the early detection of coronary heart diseases. An important issue for improving the quality of decision-making in the field of quantitative electrocardiography is to have rapid access to the patient’s medical history and previous ECGs [Kadish et al. 2001], [Fayn et al. 2007]. Hence, thousands of ECG databases have been set up in the healthcare domain to manage ECG data for patient care as well as for conducting medical research to discover new medical knowledge using data mining techniques [Zhou S. et al. 2003]. Most of these databases and the corresponding ECG management systems rely on relational databases.

A reference core database model is the OEDIPE model [Fayn et al. 1994a]. OEDIPE was proposed within the framework of a European project for the storage of serial ECG data and related metadata. On the other hand, ECG patient data may be recorded by different acquisition devices and stored in the SCP-ECG format [CEN/TC251 2005], [ISO/DIS11073-91064 2007] or in proprietary formats. The SCP-ECG European Standard Communication Protocol is the best way to exchange ECG data, especially when wireless transmission is required. However, as SCP-ECG is an open protocol in which most of the sections and data fields are not mandatory, it has numerous implementations. As a consequence, the extraction of the protocol fields from an SCP-ECG file and their insertion into a database require a costly development of specific interfaces for each pair of source and target representations. Appropriate means should therefore be developed to facilitate the extraction and storage of data from any SCP-ECG or proprietary file, coming from any manufacturer's device, into any relational database of health records.

We thus applied our proposed mediation architecture to support the development of specific interfaces adapted to any source or target SCP-ECG data. The challenge is to automate as much as possible the storage and retrieval of any SCP-ECG protocol message into/from a relational database transparently for the user.

In the following, we will first describe the SCP-ECG protocol and the OEDIPE database reference model. Then, we will present our proposed XML-based mediation architecture for automating the storage of any SCP-ECG file into a relational database implementing the OEDIPE reference model. Afterwards, we will describe in detail the mediator components and functionalities. Finally, we will discuss our solution before concluding.

4.2 SCP-ECG protocol

SCP-ECG is a standard communications protocol for transporting digital ECG records, annotations and related metadata. It is based on different international recommendations, including the OSI model. The protocol is defined in the CEN standard EN 1064:2005 and has recently been approved as an ISO standard (ISO/DIS 11073-91064). It is composed of 11 sections. It is an open protocol in which almost all the data sections are optional and very few data are mandatory. SCP-ECG specifies the syntax and the semantics of the ECG data to be transmitted.

Therefore, this standard specifies a data interchange format for transferring ECG reports and data from any vendor's computerized ECG recorder to any other vendor's or academic central ECG management system. It allows standardized transfer of digitized ECG data and results between various computer systems [Willems et al. 1993]. Four basic functional layers were considered:

a) Data layer, structured in 11 sections (Figure 4-1) including:

● Demographic data of patients as well as the administrative data related to the ECG recording, i.e. the metadata related to the examination itself (section 1),
● ECG raw signal data and the recorded metadata (sections 2 to 6),
● Automatic quantitative analysis results (sections 7 and 10),
● Automatic interpretation results and the physicians' diagnoses (sections 8, 9 and 11).

The protocol is self-documented by a version number. Each section of the protocol is composed of a set of elementary or composite data fields, and has a unique identifier (the section number). Each data field is either mandatory, in which case it shall be present exactly once, or optional. Some data fields, such as drugs or diagnoses, may also be repeated and thus have more than one instance. Each data field is characterized by a unique identifier or tag. The protocol also specifies the allowed data types (integer, varchar, etc.) and lengths. The data representation and transmission follow a tag-based format of the form (Tag, Length, Value).

Figure 4-1 SCP-ECG data structure [CEN/TC251 2005]

Each section consists of a 16-byte Section Identification Header (Section ID Header) and a section data part (Figure 4-2).

Each section has a Section Protocol Version Number (bytes 9 and 10) which may be used to specify different levels of compatibility with the standard when it is updated in the future. For data sections 12 to 1023, the Section Version shall refer to the manufacturer’s version for that section, independently of the Protocol version.

b) Message treatment layer: This level specifies the dataflow that may be exchanged between the different electrocardiographic devices. Table 4-1 shows the main request/response messages that are specified in the SCP-ECG protocol.

| Query type | Requested objects | Data to be selected |
|---|---|---|
| Patient List Request | Patients that have ECG records since “date/time”; patients whose name starts with the letters “XX” | Demographic data (Last name, First name, Date of birth, Patient_ID, and sex) |
| ECG List Request | ECGs of the patient “X” | List of the ECGs of patient “X”: acquisition date/time, acquisition system identifier |
| ECG Report Request | The selected ECG; the last ECG of patient “X” | Demographic data, Signal data, Global measurements |

Table 4-1 The main request/response messages specified in the SCP-ECG protocol

Figure 4-2 SCP-ECG section layout overview

c) Errors treatment layer: This level specifies the error and advice messages in case of failures during the data transmission between the acquisition, analysis or treatment systems.

d) The physical layer: Many solutions for data transmission in synchronous or asynchronous modes may be used.

Finally, this protocol has been designed to allow the transfer of additional data. In practice, systems often use only a subset of the previous sections. The structure of section 1, “patient demographic data”, has also been designed to be very flexible. Only the patient identifier number, the identifier of the ECG device on which the ECG acquisition has been performed, and the date and time of the recording are mandatory. The last name, the first name, the sex and the date of birth of the patient, and the identifier of the processing system (where the analysis and the interpretation of the ECG have been performed), are not mandatory but highly recommended. Indeed, the precise identification of the patient is necessary to later retrieve his/her previous ECGs.
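The (Tag, Length, Value) encoding of the data fields described for the data layer can be illustrated with a small Python sketch. The field widths (a 1-byte tag followed by a 2-byte little-endian length), the terminator tag 255 and the tag numbers used in the sample are assumptions made for this illustration only; the normative layout is the one given in the standard itself.

```python
import struct


def encode_tlv(tag, value):
    """Encode one field as tag (1 byte), length (2 bytes, little-endian), value."""
    return struct.pack("<BH", tag, len(value)) + value


def parse_tlv(data, terminator_tag=255):
    """Decode consecutive (tag, length, value) fields into {tag: [value, ...]}."""
    fields = {}
    pos = 0
    while pos + 3 <= len(data):
        tag, length = struct.unpack_from("<BH", data, pos)
        pos += 3
        if tag == terminator_tag:            # assumed end-of-fields marker
            break
        value = data[pos:pos + length]
        if len(value) != length:
            raise ValueError(f"field {tag} is truncated")
        fields.setdefault(tag, []).append(value)   # repeated fields keep every instance
        pos += length
    return fields


# Hypothetical tags: 0 for a name, 10 for a repeatable field such as drugs.
sample = (encode_tlv(0, b"Doe") + encode_tlv(10, b"drugA")
          + encode_tlv(10, b"drugB") + encode_tlv(255, b""))
decoded = parse_tlv(sample)
```

Storing the values of each tag in a list reflects the fact that some fields, such as drugs or diagnoses, may occur more than once, while optional fields are simply absent from the result.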

4.3 OEDIPE reference relational model for the storage of ECG data

The OEDIPE (Open European Data Interchange and Processing for Computerized Electrocardiography) relational model [Fayn et al. 1994a], [Fayn et al. 1994b], [Fayn and Rubel 1994] is a reference example in the eHealth domain, where it is especially relevant and necessary that health professionals have immediate access to the medical records of their patients from the information system they use in practice. When medical doctors receive new data electronically, these data must be transparently and automatically stored in their usual database, so that they can reuse their usual software to exploit these data and automatically merge the previously stored data with the newly received ones.

This reference relational model, which we used in our example, has been proposed by a European Consortium as a complete reference model for the storage and retrieval of medical data centred on the ECG. It has been designed to be at least compliant with the standard communications protocol SCP-ECG and to store all the metadata related to the ECG examinations for serial ECG analysis support. Various scenarios of use of an electrocardiographic database, from the practitioner's office or an emergency department up to a cardiology hospital and research purposes, have been taken into account. The global reference model is composed of approximately 50 tables and 200 attributes, and it has been designed according to a very modular approach so that it can be easily integrated in various implementation sites. In addition, it allows the setting-up of generic, dynamic and patient-specific management and control strategies of serial analysis processes. It includes the storage of any clinical event which may be relevant for the patients, and consequently it allows an easy access to other distributed databases storing non-ECG data. The OEDIPE core model is composed of six sub-models corresponding to the main processing phases that should be supported by an ECG management system. These six sub-models are: ECG acquisition, ECG Preprocessing, ECG measurement and data reviewing, ECG Interpretation, Serial ECG Analysis, and Serial ECG analysis process control.

Hereafter, we will present only the physical schema of the first sub-model (the data acquisition sub-model) as an example of a relational database implementing the OEDIPE model.

4.3.1 Data acquisition sub-model

The data acquisition sub-model, depicted in Figure 4-3, stores the demographic and administrative data of the patient as well as the contextual raw data related to the acquisition of an ECG. The data correspond to the content of sections 1, 2 and 3 of the SCP-ECG protocol.

When a patient has at least one recorded ECG, his/her demographic data are stored in table PATIENT, whereas the data that are contextual to the ECG examination, such as the age, the weight and the height of the patient, are stored in table HAS_REC.

An electrocardiogram may be a resting ECG of 5 to 30 seconds, an ambulatory ECG recording of the Holter type, usually lasting 24 hours, or a stress ECG, which may be composed of one or more recording sequences of 8 to 10 seconds each. An electrocardiogram may also correspond to a pharmacology test that could extend over several days.

Each of the ECG recording sequences will be stored in table ECG_REC_SEQ. Each sequence, which corresponds to a recording phase that is likely to be analyzed, is characterized by its duration, date, compression parameters, and raw data. The recording is usually performed by a technician (REC_TECHN), with a specific device (PROCESS_DEVICE) according to a particular acquisition protocol which will be documented in table ACQUIRED_BY with, in particular, the data related to the ECG leads that have been recorded.

Figure 4-3 OEDIPE data acquisition sub-model (physical schema) [Fayn et al. 1994a], [Fayn et al. 1994b]

4.4 XML based mediation for data exchange between SCP-ECG files and relational databases

We applied the framework proposed in the previous chapter as a solution to automate data exchange between any SCP-ECG file and any ECG relational database. The adopted methodology consists in designing and interconnecting meta-models of the source and target data representations using the XML language. The XML representation of the SCP-ECG format is easily designed and specified using the XML Schema language. The database meta-model can first be created by using reverse engineering techniques, and then refined by our stratification method (Algorithm 1) according to a tree structure related to the hierarchy of the tables being updated or queried, in order to ensure the database integrity. Our method then includes a rule base to support coherent and secure updates of ECG databases. Finally, the interconnection between the source and target data representations is specified using XSLT. All the concepts of the exchanged data are specified and represented in XML.

Our proposed mediator architecture and its data flow are depicted in Figure 4-4. The protocol and the database models are each wrapped by a converter into their XML representation. Converter α transforms ECG data from the SCP-ECG format into an XML representation and vice versa, by using the protocol meta-model. Converter β has two roles. The first is that of an SQL generator which, from an XML infoset, produces the SQL statements for each table row to be updated in the database. SQL queries are created according to the database meta-model, and by applying the rules (specified in the rule-base) that relate to each updated table of the ECG relational database. The second role of converter β is that of an XML publisher, which converts the result of each database query into an XML infoset that is compliant with the XML schema of the database meta-model.

Figure 4-4 XML based mediator architecture for automating the storage of SCP-ECG data into relational databases

The mapping of schemas between the source and target formats is an essential building block for the interoperability management of data exchange. This mapping is achieved by the main component of the mediator, which we call the Meta-Model Manager. This component includes four sub-components. The first is the protocol meta-model, which is specified using XML Schema to describe the layout of the SCP-ECG protocol. The second is the database meta-model, which is also specified using XML Schema to describe the structure of the relational database and its physical model. The third is the mapping meta-model, which is specified using XSLT to describe the interconnection between the protocol fields in the protocol meta-model and the database columns in the database meta-model. The last is the rule-base, which contains a set of pre-processing rules used to verify the possible pre-existence of the data to be inserted in each table of the database when generating the SQL script that updates the database with new SCP-ECG data. We describe these four sub-components in detail in the next section.

4.5 Meta-model manager

The Meta-Model Manager consists of four sub-components:

4.5.1 Protocol meta-model

This sub-component is the XML representation of the SCP-ECG protocol. The protocol meta-model has two roles: (1) it is used in the specification of the mapping meta-model with the database meta-model; (2) it is used by converter α to ensure the correct transformation of ECG data from an SCP-ECG file into the XML protocol message, compliant with its schema.

The protocol meta-model, specified using the XML Schema language (XSD), describes the physical structure of the protocol fields and their related metadata.

Figure 4-5 shows an overview of the SCP-ECG protocol meta-model schema described in XSD Language. The complete SCP-ECG XML schema is presented in Appendix 1.

Figure 4-5 Overall view of the XML schema of the SCP-ECG protocol meta-model

Figure 4-6 shows a snapshot of a part of section 1 (Header Information-patient data/ECG acquisition data) of the SCP-ECG protocol XML schema. Here, a graphical XML schema Editor has been used to give a clearer view of the structure and to facilitate the task of designing the XML schema of the protocol meta-model.

Figure 4-6 XMLSpy© Editor snapshot of a part of section 1 (Header Information-patient data/ECG acquisition data) of the SCP-ECG protocol XML schema

4.5.2 Database meta-model

The database meta-model is the XML representation of the relational database. It has two roles: (1) first, it is used for specifying the mapping meta-model with the protocol meta-model; (2) secondly, it ensures the validation of any instance of database entry with respect to its schema. The database meta-model, specified using the XSD language, describes the database structure and the information related to the database tables and columns.

To automatically generate the XML schema of the database meta-model, we applied the transformation process of our methodology, presented in section 3.2 of the previous chapter, to the ECG acquisition model (Figure 4-3).

At the end of the first step of our transformation methodology, after the stratification algorithm has been applied, all the relations are stratified into four levels as shown in Figure 4-7.

Figure 4-7 Sample result of the OEDIPE “ECG acquisition” relational schema stratification according to algorithm 1 described in section 3.2.2.2. We obtained four levels

For illustration purposes, Figure 4-7 displays only a sub-model of the ECG data acquisition relational model. This example was selected so that all the possible constraints in different contexts are represented, including a self-referenced relationship that we intentionally added (in relation “Has_Record”) to the original model to demonstrate the global capabilities of our approach. Indeed, a given medical test may refer to a previous one, for instance for analysis or comparison.

In the second step, the XML schema generation algorithm is applied to create the target XML schema. A sample part of the latter is displayed in Figure 4-8, using a typical XML editing tool, the XMLSpy© editor. For the sake of clarity, we preferred this type of graphical presentation of the XML schema to more conventional, textual presentation formats. Figure 4-8 shows the successful transformation of the relational model into an XML Schema. The complete XML schema of the previous database meta-model is presented in Appendix 2.

Figure 4-8 XMLSpy© Editor snapshot of an excerpt of the target XML schema generated by the transformation algorithm 2 described in section 3.2.2.4

4.5.3 Mapping meta-model

The role of this sub-component is to perform the interconnection between the protocol and the database meta-models. It is specified using the XSLT language.

In this sub-component two instances of stylesheets are specified: the protocol to database stylesheet and the database to protocol stylesheet, respectively corresponding to the mapping of one or more elements from the protocol meta-model into one or more elements in the DB meta-model and vice-versa.

Figure 4-9 shows a sample of the XSLT representation of the protocol to DB stylesheet instance, which maps the date of birth, specified in three elements of the protocol meta-model, into one element of the database meta-model instance. To produce the mapping between the protocol and database schemas, we used the commercial tool Altova MapForce©, which allows the mapping to be composed through a graphical interface and then automatically generated in XSLT.

Figure 4-9 Snapshot of the XSLT template that maps date of birth from the SCP-ECG protocol to the database meta-model instance
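The kind of many-to-one mapping performed by this stylesheet can be mimicked with a few lines of Python. The sketch below is only an illustration of the mapping logic, not the XSLT generated with MapForce: the element names Section1, DateOfBirth, Year, Month, Day and birth_date are hypothetical, as is the ISO 8601 output format.

```python
import xml.etree.ElementTree as ET


def map_date_of_birth(protocol_root):
    """Merge three protocol elements (year, month, day) into one database element."""
    dob = protocol_root.find(".//DateOfBirth")
    year, month, day = (int(dob.findtext(tag)) for tag in ("Year", "Month", "Day"))
    target = ET.Element("birth_date")
    target.text = f"{year:04d}-{month:02d}-{day:02d}"   # assumed target format
    return target


protocol_doc = ET.fromstring(
    "<Section1><DateOfBirth><Year>1950</Year><Month>3</Month><Day>7</Day>"
    "</DateOfBirth></Section1>")
db_element = map_date_of_birth(protocol_doc)
```

The reverse stylesheet would split the single database value back into the three protocol elements.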

4.5.4 Rule-base

To ensure the coherence of the database while inserting new ECG data, we introduce this sub-component that contains a set of pre-processing rules that will be used to verify the possible pre-existence of the data to be inserted in each table of the database and to find the tables’ keys to be reused or created when generating the SQL scripts for the updates of the databases with new data.

The rules are mainly associated with the primary keys of the tables. As an example, when the value of a primary key is not known, a set of attributes can be defined to verify the existence of a row in a table. For instance, the existence of a patient in the Patient table can be verified using the attributes: Last name, First name, Sex and Date of birth.
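Such a rule can be sketched in Python as a small dictionary from which the existence query is built. The dictionary layout, the function name existence_query and the column names are assumptions made for this illustration; the thesis stores the rules in an XML file. The sketch looks a row up by its primary key when the key is known, and otherwise falls back on the attributes listed in the rule.

```python
def existence_query(table, rule, row):
    """Build a parameterised SELECT that checks whether a row already exists."""
    pk = rule["pk"]
    if all(row.get(col) is not None for col in pk):
        cols = pk                      # the primary key is known: use it directly
    else:
        cols = rule["check"]           # otherwise use the attributes given by the rule
    condition = " AND ".join(f"{col} = ?" for col in cols)
    sql = f"SELECT {', '.join(pk)} FROM {table} WHERE {condition}"
    return sql, [row[col] for col in cols]


# Hypothetical rule for the Patient table, following the example in the text.
patient_rule = {"pk": ["patient_id"],
                "check": ["last_name", "first_name", "sex", "birth_date"]}
new_patient = {"last_name": "Doe", "first_name": "Jane",
               "sex": "F", "birth_date": "1950-03-07"}
sql, params = existence_query("patient", patient_rule, new_patient)
```

If the query returns a row, the retrieved key is reused for the update; if not, a new key has to be created for the insertion.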

4.6 Mediator functions

Two main tasks are distinguished in the mediator: the protocol to database function and the reverse function, which are depicted in Figure 4-10.

Figure 4-10 The two main functions of the mediator

4.6.1 Protocol to database function

Through this task, as shown in Figure 4-10-a, a new SCP-ECG file is first received from an ECG device or a host. Then, the converter reads the file, extracts the protocol fields and stores them into an XML file that is compliant with the protocol meta-model in the Meta-Model Manager. Then, the protocol to database mapping stylesheet is applied to the XML file by the XSLT processor. This operation results in a new XML file with the corresponding database fields. Finally, the SQL statements for the update of the database are created by the SQL generator module, taking into account the rules stored in the rule-base.

4.6.2 Database to protocol function

During this task, as shown in Figure 4-10-b, an SQL query result is converted into an XML file by the XML publisher module, which gives as output the XML fields from the database fields. Then, the database to protocol mapping stylesheet is applied to this XML file. This transformation is performed by the XSLT processor which results in a new XML file with the corresponding protocol fields. Finally, the converter creates an SCP-ECG file with the protocol fields from the last XML file.
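The first step of this reverse function, publishing an SQL query result as XML, can be sketched as below. The function `publish_query_as_xml`, the `result` root element and the Patient table are assumptions made for the illustration. They do not describe the actual XML publisher module.

```python
import sqlite3
import xml.etree.ElementTree as ET


def publish_query_as_xml(conn, sql, row_tag):
    """Run `sql` and return an XML tree with one `row_tag` element per result row."""
    cursor = conn.execute(sql)
    columns = [desc[0] for desc in cursor.description]
    root = ET.Element("result")
    for row in cursor:
        row_elem = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            # NULL values are published as empty elements in this sketch.
            ET.SubElement(row_elem, column).text = "" if value is None else str(value)
    return root


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Patient (patient_id INTEGER PRIMARY KEY, last_name TEXT, sex TEXT)"
)
conn.execute("INSERT INTO Patient (last_name, sex) VALUES ('Doe', 'F')")
xml_root = publish_query_as_xml(
    conn, "SELECT patient_id, last_name, sex FROM Patient", "Patient"
)
xml_text = ET.tostring(xml_root, encoding="unicode")
```

The resulting tree carries the database field names as element names. It is the input to which the database to protocol mapping stylesheet would then be applied.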

4.7 Discussion

We have proposed an XML based mediator as a generic solution for the exchange of ECG data between any SCP-ECG compliant device and any relational database storing the ECG data. Our mediator is open to the management of the evolution of data representation and to the dynamicity of their schemas.

Converting ECG data into XML has already been investigated in the literature. Many efforts have been made to define a standard XML representation of ECG data, such as XML-ECG [Lu et al. 2007] and ecgML [Wang H. et al. 2003]. These standards vary in complexity and completeness, but share the same purpose of providing ECG data files in a generic standard format for their exchange over the network. However, we preferred to design an XML representation which reflects the general structure of the SCP-ECG protocol, and is thus compliant with all the SCP-ECG instances. Indeed, our SCP-ECG XML representation using the XML Schema language better reflects the flexibility and the openness of the SCP-ECG protocol. This is very important since SCP-ECG has now become an ISO standard.

Many transformation proposals were introduced in the literature to map relational models into XML schemas. However, none of them can be reused for our purpose of database update automation. Indeed, almost all of these proposals use DTD as the target schema language in the transformation process. However, DTD specifications fail to represent most of the integrity constraints that may be defined in a relational schema. For instance, DTDs do not support a consistent description of “many-to-many” relations without introducing redundancy. In addition, the data type definitions provided by DTDs are too limited. For instance, there is no way to specify that an element's text content must be a valid representation of an integer, or even that the content shall not exceed a given number of characters.

The other proposals, which consider XML Schema as the target schema language in their transformation methods, cannot be used for our purpose of database update automation either. For instance, in [Fong and Cheung 2005], the authors propose a transformation method from a relational schema into an Extended Entity Relationship (EER) model, which is then transformed into a conceptual XSD graph, which is itself finally mapped into an XML schema. This methodology thus seems to be inefficient due to the two unnecessary intermediate steps it introduces. Other authors, in [Liu et al. 2006], [Zhou R. et al. 2008], propose a direct transformation methodology from a relational schema to the XML Schema language. However, their transformation algorithm does not take into account the hierarchy of the relations in the data models, as their goals do not include the update. In contrast, our proposed transformation methodology aims to obtain a schema that captures the hierarchical view of the relational model, which is especially helpful in generating a cascade of SQL queries to perform a coherent update of the relational database, whatever the complexity of the relational model.

The data exchange and integration between XML data and relational databases has been an active topic of research in the last decade [Krishnamurthy et al. 2003]. Two main objectives were considered: publishing relational databases into XML documents and storing XML documents into relational databases. However, to our knowledge, the problem of updating existing relational databases from any new XML related data has never been considered before, and no solution has yet been proposed for storing new XML data into existing, legacy relational databases. This update problem is far from trivial because it needs to take into account the hierarchy of the relations in the relational model (i.e. the order of the INSERT/UPDATE clauses is crucial for a successful updating transaction). Our transformation methodology successfully captures that hierarchy and reproduces it automatically in the XML representation of the relational database. The methodology is based on an important step in which the relations are stratified to reflect their hierarchy based on the dependencies between foreign and primary keys; the target schema is then generated based on a set of templates. We have defined these templates to map components in the relational model into their corresponding fragments in the XML Schema language. In addition, to ensure data integrity, we have introduced a rule-base for a coherent management of the ordering of the successive updates to be performed in the database, for a correct and consistent update of the previously existing data with the new data.
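The idea of stratifying relations according to their foreign key dependencies can be illustrated with a generic level-by-level layering. This is a minimal sketch and not the stratification algorithm of this thesis: the function `stratify` and the five-table schema are hypothetical, and every referenced table is assumed to appear as a key of the input dictionary.

```python
def stratify(foreign_keys):
    """Group tables into levels so that each table follows every table it references.

    `foreign_keys` maps each table name to the set of tables its foreign keys
    point to. Self-references are ignored.
    """
    remaining = {
        table: set(parents) - {table} for table, parents in foreign_keys.items()
    }
    levels = []
    while remaining:
        # Tables whose referenced tables have all been placed in earlier levels.
        ready = sorted(t for t, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cyclic foreign-key dependencies")
        levels.append(ready)
        for table in ready:
            del remaining[table]
        for parents in remaining.values():
            parents.difference_update(ready)
    return levels


# Hypothetical schema: ECG references Patient and Device, and so on.
SCHEMA = {
    "Patient": set(),
    "Device": set(),
    "ECG": {"Patient", "Device"},
    "Lead": {"ECG"},
    "Measurement": {"Lead"},
}
levels = stratify(SCHEMA)
insert_order = [table for level in levels for table in level]
```

Inserting rows in `insert_order` guarantees, for this sketch, that every referenced primary key exists before a row that depends on it is written.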

The use of the XML Schema language (XSD) for describing meta-models of both protocol and database guarantees the accuracy, the completeness, and the consistency of the data exchanged. This shall result in a better data quality that will support and enhance the business process, which is especially important in the healthcare domain. In addition, XSD offers powerful and rich data modelling capabilities which are especially relevant within the framework of our application field with respect to the complexity and to the open structure of the SCP-ECG protocol. Thanks to the concepts of XML Schema, an infinite number of combinations of the XML messages may be built and validated by the same schema. Furthermore, the possibility of the potential reuse of the schemas may reduce the work and the time for the meta-model design. For instance, adding a manufacturer specific section only requires the insertion of this new section by means of include or import commands. Furthermore, XSD offers all the capabilities for capturing and modelling the physical structure of a relational database with respect to the hierarchy of the tables according to the relations and functional dependencies between the tables.

XML has proven through this application example to provide a high level of interoperability for heterogeneous IT systems thanks to the separation of data content from their definition or presentation. Therefore, the choice of XML as the common representation language in the mediation no longer needs to be motivated, especially with all the emerging and existing tools and standards of the XML family. In particular, there are many graphical XML editors that make the schema design much easier with more and more powerful features. For instance, the mapping design and composition may be performed by using a commercial tool that allows the user to specify, through a graphical interface, the relationships between two XML schemas, and finally to produce the mapping output in XSLT automatically. However, this manual composition of the mapping between the source and target schemas may be automated using ontology based approaches for automatic schema matching.

Our methodology for both the auto-transformation of the relational model into an XML schema and the SQL generation is efficient thanks to the simplicity of our proposed algorithms, all of which have a small logarithmic complexity.

As a result, we demonstrated that a complete structural interoperability between relational models and XML data can be achieved. Possible semantic heterogeneity of data can then be easily overcome by means of specific information domain mapping models. The complete mediation solution we propose should thus greatly facilitate XML and SQL data integration within information systems.

4.8 Conclusion

In this chapter we demonstrated the feasibility of our approach through an example from the healthcare application domain. Here, we applied our solution to exchange data between an XML representation of SCP-ECG, an open format ISO standard communications protocol embedding bio-signals and related metadata, and a European relational reference model including these data.

The SCP-ECG protocol and the OEDIPE database reference model are first described. The proposed XML-based mediation architecture for automating the storage of any SCP-ECG file into the OEDIPE relational database is then presented, and the mediator components and their functionalities are described in detail. Finally, we discuss the ins and the outs of the proposed solution and its application results in this especially rich and complex application field of quantitative Electrocardiology.

The successful application of our method between an XML representation of the SCP-ECG ISO norm and the OEDIPE data model thus demonstrates the relevance of our mediation approach based on an automated XML schema generation.

Chapter 5 - CONCLUSION

In this chapter, we summarize the results of our dissertation and discuss some possible future research directions.

5.1 Summary

In this dissertation, we presented a novel mediation approach to automate data exchange between XML and relational data sources independently of the adopted data structures in the two data models.

We first proposed a generic mediation framework for the data exchange between any XML document and any existing relational database. The architecture of our proposed framework is based on the development of generic components, which will ease the setup of specific interfaces adapted to any XML source and any target relational database. The mediator components are independent of any application domain, and need to be customized only once for each pair of source and target data storage formats. Hence, our mediator provides automatic and coherent updates of any relational database from any data embedded in XML documents. It also allows data to be retrieved from any relational database and published into XML documents (or messages) structured according to a requested interchange format.

The transformation from a Relational Model to XML is a key component of the proposed mediator. Thus, we proposed a methodology and devised two algorithms to efficiently and automatically transform the relational schema of a relational database management system into an XML schema. Our transformation methodology preserves the integrity constraints of the relational schema and avoids any data redundancy. It has been designed in order to preserve the hierarchical representation of the relational schema, which is particularly important for the generation of correct SQL updates and queries in the proposed mediation framework.

Another key component is the automation of the SQL generation. We therefore devised a generic methodology and algorithms to automate the generation of the SQL queries that are required to update data in, or retrieve data from, relational databases.

Our proposed framework has been successfully applied and tested in the healthcare domain between an XML representation of SCP-ECG, an open format ISO standard communications protocol embedding bio-signals and related metadata, and a European relational reference model including these data. The mediation automation is especially important in this field, where electrocardiograms (ECG) are the main investigation for the early detection of cardiovascular diseases and need to be quickly and transparently exchanged between health systems, particularly in emergency cases, while the SCP-ECG protocol has numerous legacy implementations since most of the sections and data fields are not mandatory.

5.2 Directions for future research

Through our work, several promising directions for future research have emerged, which we present hereafter.

Automation of XML schemas-matching: As described in our application example, we used a commercial tool called Altova MapForce© to produce the mappings between the protocol and the database schemas. This tool allows the user to compose the mappings through a graphical interface and to generate them automatically as XSLT sheets. More research efforts are needed here to automate the schema matching between source and target XML schemas and to produce the mappings in XSLT. The schema matching problem has been extensively studied in the data integration field. However, most of the previous schema matching solutions have focused only on relational database schemas. Moreover, there is no automatic schema matching solution to date; i.e. the matching process requires human intervention to provide exact and correct transformations.

Wrappers generation for relational data sources: The source wrappers are key components in any data integration system. They should allow for an automatic and transparent data exchange between the source and the mediated schema. More research efforts are needed to investigate the usage of our proposed methodology, i.e. transforming a relational schema into an XML schema, as a solution for generating wrappers on the fly for relational sources in the data integration context, where more and more XML based mediators are used. This will facilitate the task of data integration by solving the model heterogeneity problem.

Database query composition: The query composition to any database is also a challenging issue. Thus, more research efforts should be made to develop new tools and interfaces that make the database user-centered. Such tools and interfaces should be intuitive, user-friendly and helpful, allowing any user to specify exactly which data he/she is looking for and needs. Indeed, this issue becomes even more challenging in the healthcare field, where health information systems may be queried by users with various profiles and for multiple purposes. Our stratification method could also be reused for correct query composition.

Moreover, in the XML sphere, user queries are usually specified using the XPath and the XQuery languages. However, these two languages are complex and require a great learning effort from users and a considerable level of knowledge. Therefore, there is a need to provide an easier and more intuitive way for the formulation of XML user queries. A promising approach is the Query-By-Example (QBE) approach, proposed in [Ferreira et al. 2009]. The approach is based on providing a sample document describing the required data and on an intuitive user interface allowing users to define their selections and add restrictions on the content value. Such a solution could be adopted to complement our proposed mediator by a friendly user interface allowing users to formulate their queries in an intuitive and easy way. In addition, query optimization methods could also be considered to take account of the environment and of its constraints [Hameurlain and Morvan 2009], especially in the eHealth domain.

There are certainly plenty of other research works that remain to be performed in order to solve the steadily increasing number of information interoperability problems, especially, but not only, in the eHealth domain, where new, isolated information systems are created every day to locally support and enhance health care, and where plenty of pervasive self-care devices will create huge amounts of data that will be processed by intelligent, increasingly self-improving systems which will evolve with the progress in medicine and science.

Bibliography

[Al King et al. 2009] RADDAD AL KING, ABDELKADER HAMEURLAIN, AND FRANCK MORVAN, “Ontology-Based Method for Schema Matching in a Peer-to-Peer Database System”. Dataspace: The Final Frontier, p. 208–212, 2009.

[Atoui 2006] HUSSEIN ATOUI, “Conception de Systèmes Intelligents pour la Télémédecine Citoyenne”, PhD thesis, Lyon: INSA de Lyon, p. 158, 2006.

[Barbosa et al. 2004] DENILSON BARBOSA, JULIANA FREIRE, AND ALBERTO O. MENDELZON, “Information Preservation in XML-to-Relational Mappings”, In Database and XML Technologies, p. 66-81, 2004.

[Baru 1999] CHAITANYA BARU, “XViews: XML Views of Relational Schemas”, In Proceedings of the 10th International Workshop on Database & Expert Systems Applications, p. 700-705, 1999. IEEE Computer Society.

[Bernstein et al. 2002] PHILIP A. BERNSTEIN, FAUSTO GIUNCHIGLIA, ANASTASIOS KEMENTSIETSIDIS, JOHN MYLOPOULOS, LUCIANO SERAFINI, AND ILYA ZAIHRAYEU, “Data management for peer-to-peer computing: A vision”, In Proceedings of the 5th International Workshop on the Web and Databases (WebDB), 2002.

[Blanco et al. 2006] ROLANDO BLANCO, NABEEL AHMED, DAVID HADALLER, L. G. ALEX SUNG, HERMAN LI, AND MOHAMED ALI SOLIMAN, “A Survey of Data Management in Peer-to-Peer Systems”, Technical Report CS-2006-18. University of Waterloo, 20 June 2006.

[Bourret 2005] RONALD BOURRET, “XML and Databases”, 2005. Available at: http://www.rpbourret.com/xml/XMLAndDatabases.htm.

[Bourret 2007] RONALD BOURRET, “Going native: Use cases for native XML databases”, 2007. Available at: http://www.rpbourret.com/xml/UseCases.htm.

[Bourret et al. 2000] RONALD BOURRET, C. BORNHOVD, AND A. BUCHMANN, “A generic load/extract utility for data transfer between XML documents and relational databases”, Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information System, WECWIS 2000, p. 134-143, 2000.

[Braganholo et al. 2004] VANESSA P. BRAGANHOLO, SUSAN B. DAVIDSON, AND CARLOS A. HEUSER, “From XML view updates to relational view updates: old solutions to a new problem”, In Proceedings of the Thirtieth international conference on Very large data bases, volume 30, p. 276-287, 2004. Toronto, Canada: VLDB Endowment.

[Braganholo et al. 2006] VANESSA P. BRAGANHOLO, SUSAN B. DAVIDSON, AND CARLOS A. HEUSER, “PATAXO: A framework to allow updates through XML views”, ACM Trans. Database Syst., volume 31, issue 3: p. 839-886, 2006.

[Carey et al. 2000a] MICHAEL CAREY, DANIELA FLORESCU, ZACHARY IVES, YING LU, JAYAVEL SHANMUGASUNDARAM, EUGENE SHEKITA, AND SUBBU SUBRAMANIAN, “XPERANTO: Publishing object-relational data as XML”, In WebDB (Informal Proceedings), p. 105–110, 2000.

[Carey et al. 2000b] MICHAEL CAREY, JERRY KIERNAN, JAYAVEL SHANMUGASUNDARAM, EUGENE J. SHEKITA, AND SUBBU N. SUBRAMANIAN, “XPERANTO: Middleware for Publishing Object-Relational Data as XML Documents”, In Proceedings of the 26th International Conference on Very Large Data Bases, p. 646-648, 2000. Morgan Kaufmann Publishers Inc.

[Catley and Frize 2002] CHRISTINA CATLEY AND MONIQUE FRIZE, “Design of a health care architecture for medical data interoperability and application integration”, In the second joint EMBS/BMES Conference, volume 3: p. 1952-1953, 2002. Houston, TX, USA.

[CEN/TC 251 1999] CEN/TC251 HEALTH INFORMATICS, Short Strategic Study, Health Information Infrastructure. CEN/TC251/SSS-HII NR2, p. 50, 1999.

[CEN/TC251 2000] CEN/TC251 HEALTH INFORMATICS, “ENV 13606:2000, Electronic healthcare record communication, European Prestandard”, European Committee for Standardization, Bruxelles, Belgique, 2000.

[CEN/TC251 2004] CEN/TC251 HEALTH INFORMATICS, “prEN 13606-1, Electronic health record communication, Part 1: Reference model”, Draft for CEN Enquiry, European Committee for Standardization, Bruxelles, Belgique, 2004.

[CEN/TC251 2005] CEN/TC251 HEALTH INFORMATICS, “Standard communication protocol-Computer-assisted electrocardiography (SCP-ECG)”, ICS: 35.240.80 IT applications in health care technology; reference number EN 1064:2005+A1:2007, 2005.

[Chabriais 2001] JOËL CHABRIAIS, “Les standards pour les systèmes d’information de santé (Série documents d'initiation) HL7”, 2001. http://www.gmsih.fr/fre.

[Chen and Dahanayake 2004] NONG CHEN AND AJANTHA DAHANAYAKE, “Rethinking of medical information retrieval and access”, In Proceedings of the IDEAS Workshop on Medical Information Systems: The Digital Hospital (IDEAS-DH), Beijing, CHINA, 2004.

[Choi et al. 2008] BYRON CHOI, GAO CONG, WENFEI FAN, AND STRATIS D. VIGLAS, “Updating Recursive XML Views of Relations”, Journal of Computer Science and Technology, volume 23, issue 4 (8): p. 516-537, 2008.

[Codd 1970] EDGAR F. CODD, “A relational model of data for large shared data banks”, Communications of the ACM, volume 13, issue 6: p. 377-387, 1970.

[Cong 2007] GAO CONG, “Query and Update through XML Views”, In Databases in Networked Information Systems, volume 4777: p. 81-95, 2007. Lecture Notes in Computer Science, Springer Berlin/Heidelberg.

[Deutsch et al. 1999] ALIN DEUTSCH, MARY FERNANDEZ, AND DAN SUCIU, “Storing semi-structured data with STORED”, In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p. 431-442, 1999. Philadelphia, Pennsylvania, United States: ACM.

[Dogac et al. 2006] ASUMAN DOGAC, GOKCE B. LALECI, SERKAN KIRBAS, YILDIRAY KABAK, SIYAMED S. SINIR, ALI YILDIZ, AND YAVUZ GURCAN, “Artemis: deploying semantically enriched web services in the healthcare domain”, Information Systems Journal (special issue on Semantic Web and Web Services), volume 31, issue 4: p. 321-339, 2006.

[Dzenowagis and Kernen 2005] JOAN DZENOWAGIS, AND GAEL KERNEN, “Connecting for health. Global vision, local insight”, In World Health Organization Press, Geneva, Report for the World Summit on the Information Society, p. 40, 2005.

[Eichelberg et al. 2005] MARCO EICHELBERG, THOMAS ADEN, JORG RIESMEIER, ASUMAN DOGAC, AND GOKCE B LALECI, “A survey and analysis of Electronic Healthcare Record standards”, ACM Computing Surveys, volume 37, issue 4: p. 277-315, 2005.

[ECIS 2010] EUROPEAN COMMISSION INFORMATION SOCIETY, “20 Years of European Commission's support to the development of eHealth”, eHealth week 2010, March, 2010.

[Eysenbach 2001] GUNTHER EYSENBACH, “What is e-health?”, Journal of Medical Internet Research, volume 3, issue 2, 2001.

[Fagin et al. 2005] RONALD FAGIN, PHOKION G. KOLAITIS, RENÉE J. MILLER, AND LUCIAN POPA, “Data exchange: semantics and query answering”, In Theoretical Computer Science (Special issue for selected papers from the 2003 International Conference on Database Theory) volume 336, issue 1: p. 89-124, 2005.

[Fan et al. 2007] WENFEI FAN, GAO CONG, AND PHILIP BOHANNON, “Querying XML with update syntax”, In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, p. 293-304, 2007. Beijing, China: ACM.

[Fayn and Rubel 1994] JOCELYNE FAYN AND PAUL RUBEL, “OEDIPE Project-Demonstrator for the implemented conceptual reference model for ECG storage”, Public report. Brussels: Commission of the European Communities, 1994.

[Fayn et al. 1994a] JOCELYNE FAYN, PAUL RUBEL, J.L. WILLEMS, L. CONTI, AND R. RENIERS. “Design and implementation strategies of a core database model for the storage and retrieval of serial ECG data”, In IEEE-Computers in Cardiology, p. 189-192, Bethesda, MD, USA, 1994.

[Fayn et al. 1994b] JOCELYNE FAYN, PAUL RUBEL, AND J.L. WILLEMS, “Management of serial ECGs and control strategies for the comparison process”, Methods of information in medicine, volume 33, issue 1 (March): p. 148-152, 1994.

[Fayn et al. 2003] JOCELYNE FAYN, CHIRINE GHEDIRA, DAVID TELISSON, HUSSEIN ATOUI, JOËL PLACIDE, L. SIMON-CHAUTEMPS, PH. CHEVALIER, AND PAUL RUBEL, “Towards new integrated information and communication infrastructures in e-health. Examples from cardiology”. IEEE Computers in Cardiology 2003, volume 30: p. 113-116, 2003.

[Fayn et al. 2007] JOCELYNE FAYN, PAUL RUBEL, OLLE PAHLM, AND GALEN S. WAGNER, “Improvement of the detection of myocardial ischemia thanks to information technologies”, International Journal of Cardiology, volume 120, issue 2, p. 172-180, 2007.

[Fayn and Rubel 2010] JOCELYNE FAYN AND PAUL RUBEL, “Towards a personal health society in cardiology”, IEEE Transactions on Information Technology in Biomedicine, volume 14, issue 2: p. 401-409, 2010.

[Fernández et al. 2000] MARY FERNÁNDEZ, WANG-CHIEW TAN, AND DAN SUCIU, “SilkRoute: trading between relations and XML”, Computer Networks, volume 33, issue 1-6 (June): p. 723-745, 2000.

[Fernández et al. 2002] MARY FERNÁNDEZ, YANA KADIYSKA, DAN SUCIU, ATSUYUKI MORISHIMA, AND WANG-CHIEW TAN, “SilkRoute: A framework for publishing relational data in XML”, ACM Trans. Database Syst., Volume 27, issue 4: p. 438-493, 2002.

[Ferreira et al. 2009] FLAVIO X. FERREIRA, DANIELA DA CRUZ, PEDRO R. HENRIQUES, ALDA L. GANÇARSKI, AND BRUNO DEFUDE, “A Query By Example Approach For Xml Querying”, In Proceedings of the 4th Iberian Conference on Information Systems and Technology, p. 611-614, 2009.

[Florescu and Kossmann 1999] D. FLORESCU AND D. KOSSMANN, “Storing and Querying XML Data using an RDMBS”, IEEE Data Engineering Bulletin, volume 22: p. 27-34, 1999.

[Fong and Cheung 2005] JOSEPH FONG AND SAN KUEN CHEUNG, “Translating relational schema into XML schema definition with data semantic preservation and XSD graph”, Information and Software Technology, volume 47, issue 7 (May 15): p. 437-462, 2005.

[Friedman et al. 1999] MARC FRIEDMAN, ALON LEVY, AND TODD MILLSTEIN, “Navigational plans for data integration”, In Proceedings of the sixteenth national conference on Artificial intelligence, p. 67-73, 1999. Orlando, Florida, United States: American Association for Artificial Intelligence.

[Fuxman et al. 2006] ARIEL FUXMAN, PHOKION G. KOLAITIS, RENEE J. MILLER, AND WANG-CHIEW TAN, “Peer data exchange”, ACM Transactions on Database Systems (TODS), volume 31, issue 4: p. 1454-1498, 2006.

[Gagneux 2008] MICHEL GAGNEUX, “Mission de relance du projet de Dossier Médical Personnel, Recommandations à la ministre de la Santé, de la Jeunesse, des Sports et de la Vie associative”, Rapport de la mission de relance du DMP, p.120, 2008. Available at: http://www.sante-sports.gouv.fr/IMG/pdf/Rapport_DMP_mission_Gagneux.pdf

[Gagneux 2009] MICHEL GAGNEUX, “Refonder la gouvernance de la politique d'informatisation du système de santé, Douze propositions pour renforcer la cohérence et l'efficacité de l'action publique dans le Domaine des Systèmes d'Information de santé”, Public report, 2009. Available at: http://www.ladocumentationfrancaise.fr/rapports-publics/094000345/index.shtml

[Garcia-Molina et al. 1997] HECTOR GARCIA-MOLINA, YANNIS PAPAKONSTANTINOU, DALLAN QUASS, ANAND RAJARAMAN, YEHOSHUA SAGIV, JEFFREY D. ULLMAN, VASILIS VASSALOS, AND JENNIFER WIDOM, “The TSIMMIS Approach to Mediation: Data Models and Languages”, Journal of Intelligent Information Systems, volume 8, issue 2: p. 117-132, 1997.

[Ghedira 2002] CHIRINE GHEDIRA, “Modélisation et Méthode de conception de systèmes de navigation adaptative dans les hypermédias”, PhD thesis, Lyon: INSA de Lyon, p. 141, 2002.

[Ghedira et al. 2002] CHIRINE GHEDIRA, PIERRE MARET, JOCELYNE FAYN, AND PAUL RUBEL, “Adaptive user interface customization through browsing knowledge capitalization”, International Journal of Medical Informatics, volume 68, issue 1-3: p. 219-228, 2002.

[Giacomo et al. 2007] GIUSEPPE DE GIACOMO, DOMENICO LEMBO, MAURIZIO LENZERINI, AND RICCARDO ROSATI, “On reconciling data exchange, data integration, and peer data management”, In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p. 133-142, 2007. Principles of Database Systems, Beijing, China: ACM.

[Grade et al. 2007] SEBASTIAN GARDE, PETRA KNAUP, EVELYN J. S. HOVENGA, AND SAM HEARD, “Towards Semantic Interoperability for Electronic Health Records: Domain Knowledge Governance for openEHR Archetypes”, Methods of Information in Medicine, volume 46, issue 3: p. 332–343, 2007.

[Halevy 2001] ALON HALEVY, “Answering queries using views: A survey”. The VLDB Journal: the International Journal on Very Large Data Bases, volume 10, issue 4: p. 270-294, 2001.

[Halevy et al. 2003] ALON HALEVY, Z.G. IVES, D. SUCIU, AND I. TATARINOV, “Schema mediation in peer data management systems”, In Proceedings of the 19th International Conference on Data Engineering (ICDE'03), p. 505-516, Seattle, WA, USA, March 5, 2003.

[Halevy et al. 2004] ALON HALEVY, Z.G. IVES, JAYANT MADHAVAN, P. MORK, D. SUCIU, AND I. TATARINOV, “The Piazza peer data management system”. Transactions on Knowledge and Data Engineering, volume 16, issue 7: p. 787-798, 2004.

[Halevy et al. 2006] ALON HALEVY, ANAND RAJARAMAN, AND JOANN ORDILLE, “Data integration: the teenage years”, In Proceedings of the 32nd international conference on Very large data bases, p. 9-16. Seoul, Korea: VLDB Endowment, 2006.

[Hameurlain and Morvan 2009] ABDELKADER HAMEURLAIN AND FRANK MORVAN, “Evolution of query optimization methods”, Transactions on Large-Scale Data-and Knowledge-Centered Systems I, p. 211–242, 2009.

[Horsch and Balbach 1999] ALEXANDER HORSCH AND THOMAS BALBACH, “Telemedical information systems”, IEEE Transactions on Information Technology in Biomedicine, volume 3, issue 3: p. 166 -175, 1999.

[Huebsch et al. 2005] RYAN HUEBSCH, BRENT N. CHUN, JOSEPH M. HELLERSTEIN, BOON THAU LOO, PETROS MANIATIS, TIMOTHY ROSCOE, SCOTT SHENKER, ION STOICA, AND AYDAN R. YUMEREFENDI, “The Architecture of PIER: an Internet-Scale Query Processor”. In Conference on Innovative Data Systems Research CIDR, p. 28-43, 2005.

[Iakovidis 1998] I. IAKOVIDIS, “Towards personal health records: Current situation, obstacles and trends in implementation of Electronic Healthcare Records in Europe”, Int. J. Medical Informatics, volume 52, issue 1: p. 105–117, 1998.

[ISO8879 1986] ISO8879, “Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML)”, 1986. http://www.iso.ch/cate/d16387.html.

[ISO/IEC TR 10000-1 1998] ISO/IEC TR 10000-1, “Information technology -- Framework and taxonomy of International Standardized Profiles -- Part 1: General principles and documentation framework”, 1998.

[ISO/IEC 2382-01 2003] ISO/IEC 2382-01, “Information Technology Vocabulary, Fundamental Terms”, 2003.

[ISO/IEC 19757-2 2003] ISO/IEC 19757-2, “Information technology - Document Schema Definition Language (DSDL) - Part 2: Regular-grammar-based validation - RELAX NG”, 2003. Available at: http://standards.iso.org/ittf/PubliclyAvailableStandards/c037605_ISO_IEC_19757-2_2003(E).zip.

[ISO/IEC 19757-3 2006] ISO/IEC 19757-3, “Information technology - Document Schema Definition Language (DSDL) - Part 3: Rule-based validation - Schematron”, 2006.

[ISO/DIS 11073-91064 2007] ISO/DIS 11073-91064 and EN 1064:2005, “Medical Informatics-Standard Communication Protocol-Computer-assisted Electrocardiography (SCP-ECG)”, 2007.

[Jumaa et al. 2008] HOSSAM JUMAA, JOCELYNE FAYN, AND PAUL RUBEL, “XML Based Mediation for Automating the Storage of SCP-ECG Data into Relational Databases”, In Proceedings of the 35th annual IEEE international conference of Computers in Cardiology CinC2008, p. 445-448, 14 September, 2008. Bologna, Italy: IEEE Computer Society Press.

[Jumaa et al. 2010a] HOSSAM JUMAA, JOCELYNE FAYN, AND PAUL RUBEL, “An automatic approach to generate XML schemas from relational models”, In Proceedings of the 12th International Conference on Computer Modeling and Simulation, UKSim2010, p. 509-514, 2010. Cambridge, UK: IEEE Computer Society Press.


[Jumaa et al. 2010b] HOSSAM JUMAA, PAUL RUBEL, AND JOCELYNE FAYN, “An XML-based framework for automating data exchange in healthcare”, In Proceedings of the 12th IEEE International Conference on e-Health Networking, Applications and Services, Healthcom2010, p. 264-269, 2010. Lyon, France: IEEE Computer Society Press.

[Kadish et al. 2001] ALAN H. KADISH, ALFRED E. BUXTON, HAROLD L. KENNEDY, BRADLEY P. KNIGHT, JAY W. MASON, CLAUDIO D. SCHUGER, CYNTHIA M. TRACY, WILLIAM L. WINTERS, ALAN W. BOONE, JOHN W. HIRSHFELD, BEVERLY H. LORELL, GEORGE P. RODGERS, AND HOWARD H. WEITZ, “ACC/AHA clinical competence statement on electrocardiography and ambulatory electrocardiography: A report of the ACC/AHA/ACP–ASIM Task Force on Clinical Competence”, Circulation, volume 104, p. 3169-3178, 2001.

[Kalra 2004] D. KALRA, “CEN prEN 13606, draft standard for Electronic Health Record Communication and its introduction to ISO TC/215, CEN/TC 251 WG I Document N04-52”, 2004.

[Kappel et al. 2004] GERTI KAPPEL, ELISABETH KAPSAMMER, AND WERNER RETSCHITZEGGER, “Integrating XML and Relational Database Systems”, World Wide Web, volume 7, issue 4: p. 343-384, 2004.

[Knudsen et al. 2005] STEFFEN ULSO KNUDSEN, TORBEN BACH PEDERSEN, CHRISTIAN THOMSEN, AND KRISTIAN TORP, “RelaXML: Bidirectional Transfer Between Relational and XML Data”, In Proceedings of the 9th International Database Engineering & Application Symposium, p. 151-162, 2005. Montreal, QC, Canada: IEEE Computer Society.

[Kolaitis 2005] PHOKION G. KOLAITIS, “Schema mappings, data exchange, and metadata management”, In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p. 61-75, 2005. Principles of Database Systems, Baltimore, Maryland: ACM.

[Kolaitis et al. 2006] PHOKION G. KOLAITIS, JONATHAN PANTTAJA, AND WANG-CHIEW TAN, “The complexity of data exchange”, In Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p. 30-39, 2006. Principles of Database Systems, Chicago, IL, USA: ACM.

[Krishnamurthy et al. 2003] RAJASEKAR KRISHNAMURTHY, RAGHAV KAUSHIK, AND JEFFREY F. NAUGHTON, “XML-to-SQL Query Translation Literature: The State of the Art and Open Problems”, In Database and XML Technologies, volume 2824: p. 1-18, 2003. Lecture Notes in Computer Science.

[Lee et al. 2001] DONGWON LEE, MURALI MANI, FRANK CHIU, AND WESLEY W. CHU, “Nesting-based Relational-to-XML Schema Translation”, Int. Workshop on the Web and Databases (WebDB), p. 61-66, 2001.

[Lee et al. 2002] DONGWON LEE, MURALI MANI, FRANK CHIU, AND WESLEY W. CHU, “NeT & CoT: Translating relational schemas to XML schemas using semantic constraints”, In ACM International Conference on Information and Knowledge Management, p. 282-291, 2002.

[Lee et al. 2003] DONGWON LEE, MURALI MANI, AND WESLEY W. CHU, “Schema conversion methods between XML and relational models”, In Knowledge Transformation for the Semantic Web, IOS Press, p. 1-17, 2003.

[Lenzerini 2002] MAURIZIO LENZERINI, “Data integration: a theoretical perspective”, In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p. 233-246, Madison, Wisconsin: ACM, 2002.


[Lenzerini 2005] MAURIZIO LENZERINI, “Logical Foundations for Data Integration”, In Lecture Notes in Computer Science LNCS, SOFSEM 2005: Theory and Practice of Computer Science, volume 3381: p. 38-40, Invited Talks, Springer Berlin / Heidelberg, 2005.

[Levy et al. 1996] ALON Y. LEVY, ANAND RAJARAMAN, AND JOANN J. ORDILLE, “Querying Heterogeneous Information Sources Using Source Descriptions”, In Proceedings of the 22nd International Conference on Very Large Data Bases, p. 251-262, 1996. Bombay, India: Morgan Kaufmann Publishers Inc. San Francisco, CA, USA.

[Liu et al. 2006] CHENGFEI LIU, MILLIST W. VINCENT, AND JIXUE LIU, “Constraint Preserving Transformation from Relational Schema to XML Schema”, World Wide Web, volume 9, issue 1: p. 93-110, 2006.

[Loukil 1997] ADLEN LOUKIL, “Méthodologies, modèles et architectures de référence pour la gestion et l'échange de données médicales multimedia”, PhD thesis, Lyon: INSA de Lyon, p. 224, 1997.

[Lu et al. 2007] XUDONG LU, HUILONG DUAN, AND HUIYING ZHENG, “XML-ECG: An XML-Based ECG Presentation for Data Exchanging”, In Proceedings of the 1st International Conference on Bioinformatics and Biomedical Engineering (ICBBE 2007), p. 1141-1144, 2007. Wuhan, China, 6 July.

[Lv and Yan 2007] TENG LV AND PING YAN, “Schema Conversion from Relation to XML with Semantic Constraints”, In Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), volume 4: p. 619-623, 2007. IEEE Computer Society.

[McDonald 1997] CLEMENT J. MCDONALD, “The Barriers to Electronic Medical Record Systems and How to Overcome Them”, Journal of the American Medical Informatics Association, volume 4, issue 3: p. 213-221, 1997.

[Moor et al. 1993] GEORGES J. E. DE MOOR, CLEMENT MCDONALD, AND JAAP NOOTHOVEN VAN GOOR, “Progress in standardization in health care informatics”, IOS Press, 1993.

[Nayak et al. 2010] GOPAL K. NAYAK, MANAW MODI, G.K. SANTHALIA, AND MILAN DAS, “Middleware Design - A Nobel Approach to Publish Relational Database to XML without Exposing the Schema of RDBMS”, In the fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation (AMS 2010), p. 164-167, 2010.

[Nejdl et al. 2002] WOLFGANG NEJDL, BORIS WOLF, CHANGTAO QU, STEFAN DECKER, MICHAEL SINTEK, AMBJÖRN NAEVE, MIKAEL NILSSON, MATTHIAS PALMÉR, AND TORE RISCH, “EDUTELLA: a P2P networking infrastructure based on RDF”, p. 604-615, 2002. Honolulu, Hawaii, USA: ACM.

[OASIS 2001] OASIS COMMITTEE SPECIFICATION, “RELAX NG Specification”, 3 December 2001. http://www.relaxng.org/spec-20011203.html.

[Pardede et al. 2006] ERIC PARDEDE, J. WENNY RAHAYU, AND DAVID TANIAR, “XML-Enabled Relational Database for XML Document Update”, In Proceedings of the 20th International Conference on Advanced Information Networking and Applications (AINA'06), volume 2: p. 205-212, 2006. IEEE Computer Society.

[Penna et al. 2006] GIUSEPPE DELLA PENNA, ANTINISCA DI MARCO, BENEDETTO INTRIGILA, IGOR MELATTI, AND ALFONSO PIERANTONIO, “Interoperability mapping from XML schemas to ER diagrams”, Data & Knowledge Engineering, volume 59, issue 1: p. 166-188, 2006.

[Rodriguez and Preciado 2004] MARCELA RODRIGUEZ AND ALFREDO PRECIADO, “An Agent Based System for the Contextual Retrieval of Medical Information”, AWIC 2004, LNAI, volume 3034: p. 64-73, 2004.

[Rubel et al. 1991] PAUL RUBEL, JOCELYNE FAYN, P.W. MACFARLANE, AND J.L. WILLEMS, “Development of a conceptual reference model for digital ECG data storage”, In Computers in Cardiology 1991 Proceedings, p. 109-112, 1991.

[Rubel et al. 2005] PAUL RUBEL, JOCELYNE FAYN, GIANDOMENICO NOLLO, DEODATO ASSANELLI, BO LI, LIOARA RESTIER, STEFANO ADAMI, ET AL., “Toward personal eHealth in cardiology. Results from the EPI-MEDICS telemedicine project”, Journal of Electrocardiology, volume 38, issue 4: p. 100-106, 2005.

[Schmidt et al. 2001] ALBRECHT SCHMIDT, MARTIN L. KERSTEN, MENZO WINDHOUWER, AND FLORIAN WAAS, “Efficient Relational Storage and Retrieval of XML Documents”, In Selected papers from the Third International Workshop WebDB 2000 on The World Wide Web and Databases, p. 137-150, 2001. Springer-Verlag.

[Shanmugasundaram et al. 1999] JAYAVEL SHANMUGASUNDARAM, KRISTIN TUFTE, CHUN ZHANG, GANG HE, DAVID J. DEWITT, AND JEFFREY F. NAUGHTON, “Relational Databases for Querying XML Documents: Limitations and Opportunities”, In Proceedings of the 25th International Conference on Very Large Data Bases, p. 302-314, 1999. Morgan Kaufmann Publishers Inc.

[Shanmugasundaram et al. 2001a] JAYAVEL SHANMUGASUNDARAM, JERRY KIERNAN, EUGENE J. SHEKITA, CATALINA FAN, AND JOHN FUNDERBURK, “Querying XML Views of Relational Data”, In Proceedings of the 27th International Conference on Very Large Data Bases, p. 261-270, 2001. Morgan Kaufmann Publishers Inc.

[Shanmugasundaram et al. 2001b] JAYAVEL SHANMUGASUNDARAM, EUGENE SHEKITA, RIMON BARR, MICHAEL CAREY, BRUCE LINDSAY, HAMID PIRAHESH, AND BERTHOLD REINWALD, “Efficiently publishing relational data as XML documents”, The VLDB Journal, volume 10, issue 2: p. 133-154, 2001.

[Tatarinov et al. 2002] IGOR TATARINOV, STRATIS D. VIGLAS, KEVIN BEYER, JAYAVEL SHANMUGASUNDARAM, EUGENE SHEKITA, AND CHUN ZHANG, “Storing and querying ordered XML using a relational database system”, In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, p. 204-215, 2002. Madison, Wisconsin: ACM.

[Télisson 2004] DAVID TÉLISSON, JOCELYNE FAYN, AND PAUL RUBEL, “Design of a tele-expertise architecture adapted to pervasive multi-actor environments-application to ecardiology”, In Computers in Cardiology 2004, Volume 31: p. 749-752, 19 September, 2004.

[Télisson 2006] DAVID TÉLISSON, “Méthodologie pour la conception des systèmes d'information pervasifs. Application à la pSanté”, PhD thesis, Lyon: INSA de Lyon, p. 128, 2006.

[Turau 1999] VOLKER TURAU, “Making Legacy Data Accessible for XML Applications”, 1999.

[Tzvetkov et al. 2005] VENTZISLAV TZVETKOV AND XIONG WANG, “DBXML - Connecting XML with Relational Databases”, In Proceedings of the Fifth International Conference on Computer and Information Technology, p. 130-135, 2005. IEEE Computer Society.

[Ullman 2000] JEFFREY D. ULLMAN, “Information integration using logical views”, In the 6th International Conference on Database Theory (ICDT '97) Theoretical Computer Science, Special issue, volume 239, issue 2: p. 189-210, 2000.

[Verdier 2006] CHRISTINE VERDIER, “Health information systems: from local to pervasive medical data”, Santé et Systémique, 2006. Hermès ed.

[Wang H. et al. 2003] HAIYING WANG, BENJAMIN JUNG, FRANCISCO AZUAJE, AND NORMAN BLACK, “ecgML: Tools and Technologies for Multimedia ECG Presentation”, In Proceedings of XML Europe Conference, London, United Kingdom, May 2003.

[Wang L. and Rundensteiner 2004] LING WANG AND ELKE A. RUNDENSTEINER, “On the updatability of XML views published over relational data”, Conceptual Modeling–ER 2004, p. 795–809, 2004.

[Wang L. et al. 2006a] LING WANG, ELKE A. RUNDENSTEINER, AND MURALI MANI, “Updating XML views published over relational databases: towards the existence of a correct update mapping”, Data & Knowledge Engineering, volume 58, issue 3 (September): p. 263-298, 2006.

[Wang L. et al. 2006b] LING WANG, ELKE A. RUNDENSTEINER, MURALI MANI, AND MING JIANG, “HUX: handling updates in XML”, In Proceedings of the 32nd international conference on Very large data bases, p. 1235-1238, 2006. Seoul, Korea: VLDB Endowment.

[Whittaker 2002] MARTIN WHITTAKER, “EHR wars: the winners and loser in electronic record standards”, Br J Healthcare Comput Info Manage, 2002.

[Wiederhold 1992] GIO WIEDERHOLD, “Mediators in the architecture of future information systems”, IEEE Computer Magazine, volume 25, issue 3: p. 38-49, 1992.

[Willems et al. 1992] J.L. WILLEMS, C. ZYWIETZ, AND PAUL RUBEL, “SCP-ECG: A Standard Communications Protocol for Computerized Electrocardiography”. In Noothoven Van Goor J and Christensen JP, Eds, Advances in Medical Informatics, p. 325-332, 1992. Amsterdam: IOS Press.

[Willems et al. 1993] J.L. WILLEMS, PAUL RUBEL, AND C. ZYWIETZ, “Standard interchange for computerized electrocardiography”, Studies in Health Technology and Informatics, volume 6: p. 185-194, 1993.

[W3C 1998] WORLD WIDE WEB CONSORTIUM (W3C), Extensible Markup Language (XML) 1.0, W3C Recommendation, 10 February 1998. http://www.w3.org/TR/1998/REC-xml-19980210.

[W3C 1999a] WORLD WIDE WEB CONSORTIUM (W3C), Extensible Stylesheet Language Transformations (XSLT), W3C Recommendation, 16 November 1999. http://www.w3.org/TR/xslt.

[W3C 1999b] WORLD WIDE WEB CONSORTIUM (W3C), XML Path Language (XPath), W3C Recommendation, 16 November 1999. http://www.w3.org/TR/xpath/.

[W3C 1999c] WORLD WIDE WEB CONSORTIUM (W3C), HTML 4.01 Specification, W3C Recommendation, December 1999. http://www.w3.org/TR/html4/cover.html.

[W3C 2001] WORLD WIDE WEB CONSORTIUM (W3C), XML Schema (XSD) 1.0, W3C Recommendation, 2 May 2001. http://www.w3.org/XML/Schema.

[W3C 2004] WORLD WIDE WEB CONSORTIUM (W3C), XML Schema Part 0: Primer Second Edition, W3C Recommendation, 28 October 2004. http://www.w3.org/TR/xmlschema-0/.

[W3C 2006] WORLD WIDE WEB CONSORTIUM (W3C), Extensible Stylesheet Language (XSL) Version 1.1, W3C Recommendation, 5 December 2006. http://www.w3.org/TR/xsl11/.

[W3C 2007a] WORLD WIDE WEB CONSORTIUM (W3C), XML Path Language (XPath) 2.0, W3C Recommendation, 23 January 2007. http://www.w3.org/TR/xpath20/.

[W3C 2007b] WORLD WIDE WEB CONSORTIUM (W3C), XQuery 1.0 and XPath 2.0 Data Model (XDM), W3C Recommendation, 23 January 2007. http://www.w3.org/TR/xpath-datamodel/.

[W3C 2007c] WORLD WIDE WEB CONSORTIUM (W3C), XSL Transformations (XSLT) Version 2.0, W3C Recommendation, 23 January 2007. http://www.w3.org/TR/xslt20/.


[W3C 2007d] WORLD WIDE WEB CONSORTIUM (W3C), XQuery 1.0: An XML Query Language, W3C Recommendation, 23 January 2007. http://www.w3.org/TR/xquery/.

[W3C 2007e] WORLD WIDE WEB CONSORTIUM (W3C), XQuery 1.0 and XPath 2.0 Functions and Operators, W3C Recommendation, 23 January 2007. http://www.w3.org/TR/xpath-functions/.

[W3C 2008a] WORLD WIDE WEB CONSORTIUM (W3C), Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation, 26 November 2008. http://www.w3.org/TR/xml/.

[W3C 2008b] WORLD WIDE WEB CONSORTIUM (W3C), Definition of the XML document type declaration from Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation, 26 November 2008. http://www.w3.org/TR/REC-xml/#dt-doctype.

[W3C 2008c] WORLD WIDE WEB CONSORTIUM (W3C), XQuery Update Facility 1.0, W3C Candidate Recommendation, 1 August 2008. http://www.w3.org/TR/2008/CR-xquery-update-10-20080801/.

[W3C 2009a] WORLD WIDE WEB CONSORTIUM (W3C), W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures, W3C Working Draft, 3 December 2009. http://www.w3.org/TR/xmlschema11-1/.

[W3C 2009b] WORLD WIDE WEB CONSORTIUM (W3C), W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes, W3C Working Draft, 3 December 2009. http://www.w3.org/TR/xmlschema11-2/.

[Xu and Embley 2004] LI XU AND DAVID W. EMBLEY, “Combining the Best of Global-as-View and Local-as-View for Data Integration”, In Proceedings of the 3rd International Conference on Information Systems Technology and its Applications (ISTA'2004), volume 48: p. 123-136, 2004. Salt Lake City, Utah, USA: GI, 15 June.

[Yang and Sun 2008] CHENGSEN YANG AND JINGUANG SUN, “The exchange from relational schemas to XML schemas based on semantic constraints”, Wuhan University Journal of Natural Sciences, volume 13, issue 4: p. 485-489, 2008.

[Zhou R. et al. 2008] RUI ZHOU, CHENGFEI LIU, AND JIANXIN LI, “Holistic Constraint-Preserving Transformation from Relational Schema into XML Schema”, In Database Systems for Advanced Applications, volume 4947/2008: p. 4-18, 2008. Lecture Notes in Computer Science, Springer Berlin, Heidelberg.

[Zhou S. et al. 2003] S. ZHOU, G. GUILLEMETTE, R. ANTINORO, AND F. FULTON, “New approaches in Philips ECG database management system design”, In Computers in Cardiology 2003, p. 267-270, 2003.

[Ziegler and Dittrich 2004] PATRICK ZIEGLER AND KLAUS R DITTRICH, “Three Decades of Data Integration - All Problems Solved?”, In the 18th International Federation for Information Processing IFIP World Computer Congress (WCC 2004), ed. René Jacquart, volume 156: p. 3-12, August 22-27, 2004. Toulouse, France: Kluwer.

[Ziegler and Dittrich 2007] PATRICK ZIEGLER AND KLAUS R DITTRICH, “Data Integration --- Problems, Approaches, and Perspectives”. In Conceptual Modelling in Information Systems Engineering, ed. John Krogstie, Andreas Lothe Opdahl, and Sjaak Brinkkemper, p. 39-58, 2007. Berlin: Springer.


Appendix 1

XML schema of the SCP-ECG protocol meta-model


<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ecgRecord">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="sectionPointers" />
        <xs:element ref="headerInfo" />
        <xs:element minOccurs="0" ref="huffmanTables" />
        <xs:element ref="ecgLeadDefinition" />
        <xs:element minOccurs="0" ref="qrsLocation" />
        <xs:element minOccurs="0" ref="refBeatData" />
        <xs:element ref="rhythmData" />
        <xs:element minOccurs="0" ref="globalMeasurements" />
        <xs:element minOccurs="0" ref="textDiagnosis" />
        <xs:element minOccurs="0" ref="manufSpecDiagnostic" />
        <xs:element minOccurs="0" ref="leadMeasResults" />
        <xs:element minOccurs="0" ref="UnivStateCodes" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="sectionPointers">
    <xs:complexType>
      <xs:sequence minOccurs="3" maxOccurs="unbounded">
        <xs:element name="sectionId" type="xs:unsignedShort" />
        <xs:element name="sectionLength" type="xs:unsignedInt" />
        <xs:element name="sectionIndex" type="xs:unsignedInt" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="headerInfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" name="sectionHeader" type="sectionHeaderType" />
        <xs:element minOccurs="0" name="lastName">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="firstName">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="patientID">
          <xs:simpleType>
            <xs:restriction base="xs:NMTOKEN">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="middleName">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="age">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="ageValue" type="xs:positiveInteger">
              </xs:element>
              <xs:element name="ageUnits">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="Unspecified" />
                    <xs:enumeration value="Years" />
                    <xs:enumeration value="Months" />
                    <xs:enumeration value="Weeks" />
                    <xs:enumeration value="Days" />
                    <xs:enumeration value="Hours" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>


        </xs:element>
        <xs:element minOccurs="0" name="dateofbirth" type="dateType" />
        <xs:element minOccurs="0" name="height">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="hValue" type="xs:unsignedShort" />
              <xs:element name="hUnit">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="Unspecified" />
                    <xs:enumeration value="Centimeters" />
                    <xs:enumeration value="Inches" />
                    <xs:enumeration value="Milimeters" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" name="weight">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="wValue" />
              <xs:element name="wUnit">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="Unspecified" />
                    <xs:enumeration value="Kilogram" />
                    <xs:enumeration value="Gram" />
                    <xs:enumeration value="Pound" />
                    <xs:enumeration value="Ounce" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" name="sex">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="Male" />
              <xs:enumeration value="Female" />
              <xs:enumeration value="Unknown" />
              <xs:enumeration value="Unspecified" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="race">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="Unspecified" />
              <xs:enumeration value="Caucasian" />
              <xs:enumeration value="Black" />
              <xs:enumeration value="Oriental" />
              <xs:enumeration value="Reserved" />
              <xs:enumeration value="Other" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="drugs">
          <xs:complexType>
            <xs:sequence maxOccurs="unbounded">
              <xs:element name="codeTableIndicator">
                <xs:simpleType>
                  <xs:restriction base="xs:unsignedByte">
                    <xs:enumeration value="0" />
                    <xs:enumeration value="1" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element minOccurs="0" name="classDrugCode">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="codeId" type="xs:unsignedByte" />
                    <xs:element name="classDrugDes">
                      <xs:simpleType>
                        <xs:restriction base="xs:string">
                          <xs:enumeration value="Unspecified" />
                          <xs:enumeration value="Digitalis" />
                          <xs:enumeration value="Antiarrhythmic" />


                          <xs:enumeration value="Diuretics" />
                          <xs:enumeration value="Antihypotensive" />
                          <xs:enumeration value="Antianginal" />
                          <xs:enumeration value="Antithrombotic" />
                          <xs:enumeration value="Beta Blockers" />
                          <xs:enumeration value="Psychotropic" />
                          <xs:enumeration value="Calcium Blockers" />
                          <xs:enumeration value="Anticholesterol" />
                          <xs:enumeration value="ACE-Inhibitors" />
                          <xs:enumeration value="Not taking drugs" />
                          <xs:enumeration value="Unknwon type" />
                          <xs:enumeration value="Other medication" />
                          <xs:enumeration value="Reserved code" />
                          <xs:enumeration value="Manufacturer code" />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element minOccurs="0" name="specificDrugCode">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="codeId">
                      <xs:simpleType>
                        <xs:restriction base="xs:unsignedByte">
                          <xs:minInclusive value="0" />
                          <xs:maxInclusive value="13" />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                    <xs:element name="specificDrugeDes">
                      <xs:simpleType>
                        <xs:restriction base="xs:string">
                          <xs:enumeration value="Unspecified" />
                          <xs:enumeration value="Digoxin-Lanoxin" />
                          <xs:enumeration value="Digitoxin-Digitalis" />
                          <xs:enumeration value="Dysopyramide" />
                          <xs:enumeration value="Quinidine" />
                          <xs:enumeration value="Procainamide" />
                          <xs:enumeration value="..." />
                          <xs:enumeration value="see page 25" />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element minOccurs="0" name="description">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:maxLength value="40" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" name="systolic_mmHg">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:unsignedShort" />
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" name="diastolic_mmHg" type="xs:unsignedShort" />
        <xs:element minOccurs="0" maxOccurs="unbounded" name="diagRef">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="80" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="acqDevID" type="machineType" />
        <xs:element minOccurs="0" name="anaDevID" type="machineType" />
        <xs:element minOccurs="0" name="acqInstDes">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="40" />
            </xs:restriction>


          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="anaInstDes">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="acqDeptDes">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="anaDeptDes">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="referPhys">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="60" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="latestPhys">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="60" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="techDes">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="roomDes">
          <xs:simpleType>
            <xs:restriction base="xs:token">
              <xs:maxLength value="40" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="statCode" type="xs:unsignedByte"></xs:element>
        <xs:element name="acqDate" type="dateType"></xs:element>
        <xs:element name="acqTime" type="timeType"></xs:element>
        <xs:element minOccurs="0" name="baselineFilter" type="xs:unsignedShort"></xs:element>
        <xs:element minOccurs="0" name="LPassFilter"></xs:element>
        <xs:element minOccurs="0" name="filterBitMap">
          <xs:simpleType>
            <xs:restriction base="xs:unsignedByte">
              <xs:minInclusive value="0" />
              <xs:maxInclusive value="7" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" maxOccurs="unbounded" name="freeText">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:maxLength value="79" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="seqNum">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:maxLength value="11" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element minOccurs="0" name="historyDiagCodes">
          <xs:complexType>


            <xs:sequence>
              <xs:element name="codeTable" type="xs:boolean" />
              <xs:element minOccurs="0" maxOccurs="unbounded" name="histDiagCode">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="unknowCode=0" />
                    <xs:enumeration value="apparentlyHealthy=1" />
                    <xs:enumeration value="acuteMyocardialInfarction=10" />
                    <xs:enumeration value="myocardialInfarction=11" />
                    <xs:enumeration value="PreviousMyocardialInfarction=12" />
                    <xs:enumeration value="ischemicHeartDisease=15" />
                    <xs:enumeration value="perpheralVascularDisease=18" />
                    <xs:enumeration value="cyanoticCongenitalHeartDisease=20" />
                    <xs:enumeration value="acyanoticCongenitalHeartDisease=21" />
                    <xs:enumeration value="valvulHeartDisease=22" />
                    <xs:enumeration value="hypertension=25" />
                    <xs:enumeration value="cerebovascularAccident=27" />
                    <xs:enumeration value="cardiomyopathy=30" />
                    <xs:enumeration value="pericarditis=35" />
                    <xs:enumeration value="myocarditis=36" />
                    <xs:enumeration value="postOperativeCardiacSurgery=40" />
                    <xs:enumeration value="implementedCardiacPacemaker=42" />
                    <xs:enumeration value="pulmonaryEmbolism=45" />
                    <xs:enumeration value="respiratoryDisease=50" />
                    <xs:enumeration value="endocrineDisease=55" />
                    <xs:enumeration value="neurologicalDisease=60" />
                    <xs:enumeration value="alimentaryDisease=65" />
                    <xs:enumeration value="renalDisease=70" />
                    <xs:enumeration value="preOperativeGeneralSurgery=80" />
                    <xs:enumeration value="postOperativeGeneralSurgery=81" />
                    <xs:enumeration value="generalMedical=90" />
                    <xs:enumeration value="otherCode=100" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" name="electrodeConfCode">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Placement12LeadsSys">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="0= Unspecified (carts that do not record the electrode placement information should use 0)." />
                    <xs:enumeration value="1= Standard 12-lead positions: RA, RL, LA, and LL are placed at limb extremities. V1 to V6 at standard positions on the chest. All electrodes are placed individually. " />
                    <xs:enumeration value="2= RA, RL, LA, and LL are placed on the torso (Mason-Likar positions). V1 to V6 are placed at standard positions on the chest. All electrodes are placed individually." />
                    <xs:enumeration value="3= RA, RL, LA, and LL are placed on the torso (Mason-Likar positions). These limb electrodes are individually placed. V1 to V6 on the chest as part of a single electrode pad (V1 to V6 are NOT placed individually)." />
                    <xs:enumeration value="4= RA, RL, LA, LL, and V1 to V6 (all electrodes) are on the chest in a single electrode pad (such as Omnitrode). (None of the electrodes are placed individually)." />
                    <xs:enumeration value="5= 12-lead ECG is derived from Frank XYZ leads." />
                    <xs:enumeration value="6= 12-lead ECG is derived from non-standard leads." />
                    <xs:enumeration value="7 to 255 Undefined now. Reserved for later use." />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="PlacementXYZSys">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="0= Unspecified (carts that do not record the electrode placement information should use 0." />
                    <xs:enumeration value="1= Frank lead system (Frank, 1956; 13:737)." />
                    <xs:enumeration value="2= McFee-Parungao lead system (see Benchimol, Vectorcardiography, Williams &amp; Wilkins, Baltimore, 1973, fig. 1.6 on page 6)." />
                    <xs:enumeration value="3= Cube lead system (Grishman et al, Amer Heart J 1951;41:483)." />
                    <xs:enumeration value="4= Bipolar uncorrected XYZ lead system." />
                    <xs:enumeration value="5= Pseudo-orthogonal XYZ lead system (as used in Holter recording)." />
                    <xs:enumeration value="6= XYZ leads derived from standard 12-lead ECG." />
                    <xs:enumeration value="7 to 255 Undefined now. Reserved for later use." />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>


        <xs:element minOccurs="0" name="dateTimeZone">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="offset" type="xs:short">
              </xs:element>
              <xs:element name="index" type="xs:unsignedShort">
              </xs:element>
              <xs:element name="description" type="xs:string">
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" maxOccurs="unbounded" name="freeTextMedHistory" type="xs:string">
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="huffmanTables">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="tableNum" type="xs:unsignedShort">
        </xs:element>
        <xs:element minOccurs="0" maxOccurs="unbounded" name="huffmanCodeTable">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="codeStructureNum" type="xs:unsignedShort" />
              <xs:element minOccurs="0" maxOccurs="unbounded" name="structures">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="bitPrefixNum" type="xs:unsignedByte">
                    </xs:element>
                    <xs:element name="bitCodeNum" type="xs:unsignedByte">
                    </xs:element>
                    <xs:element name="tableMode">
                      <xs:simpleType>
                        <xs:restriction base="xs:unsignedByte">
                          <xs:enumeration value="0" />
                          <xs:enumeration value="1" />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                    <xs:element name="baseValue" type="xs:unsignedShort" />
                    <xs:element name="baseCode" type="xs:unsignedLong" />
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="ecgLeadDefinition">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="numLeads" type="xs:unsignedByte" />
        <xs:element name="flagsByte">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="refBeatUsed">
                <xs:simpleType>
                  <xs:restriction base="xs:boolean">
                    <xs:pattern value="1" />
                    <xs:pattern value="0" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="reserved" type="xs:boolean" />
              <xs:element name="leadSimRec">
                <xs:simpleType>
                  <xs:restriction base="xs:boolean">
                    <xs:pattern value="1" />
                    <xs:pattern value="0" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="numSimRecLead" type="xs:unsignedByte" />
            </xs:sequence>

          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="unbounded" name="leadInfo">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="startSampleNum" type="xs:unsignedInt" />
              <xs:element name="endSampleNum" type="xs:unsignedInt" />
              <xs:element name="leadId">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="code" type="xs:unsignedByte" />
                    <xs:element name="leadDes">
                      <xs:simpleType>
                        <xs:restriction base="xs:string">
                          <xs:enumeration value="Unspecified Lead" />
                          <xs:enumeration value="I" />
                          <xs:enumeration value="II" />
                          <xs:enumeration value="V1" />
                          <xs:enumeration value="V2" />
                          <xs:enumeration value="V3" />
                          <xs:enumeration value="V4" />
                          <xs:enumeration value="V5" />
                          <xs:enumeration value="V6" />
                          <xs:enumeration value="V7" />
                          <xs:enumeration value="V2R" />
                          <xs:enumeration value="V3R" />
                          <xs:enumeration value="V4R" />
                          <xs:enumeration value="V5R" />
                          <xs:enumeration value="V6R" />
                          <xs:enumeration value="V7R" />
                          <xs:enumeration value="X" />
                          <xs:enumeration value="H" />
                          <xs:enumeration value="I-cal" />
                          <xs:enumeration value="II-cal" />
                          <xs:enumeration value="..." />
                          <xs:enumeration value="see page 38" />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="qrsLocation">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="header">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="refBeatLength" type="xs:unsignedShort" />
              <xs:element name="fcM" type="xs:unsignedShort" />
              <xs:element name="num_QRS" type="xs:unsignedShort" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" maxOccurs="unbounded" name="subtractionBeats">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="beatType" type="xs:unsignedShort" />
              <xs:element name="startSampleNum" type="xs:unsignedInt" />
              <xs:element name="fcSapmpleNum" type="xs:unsignedInt" />
              <xs:element name="endSampleNum" type="xs:unsignedInt" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" maxOccurs="unbounded" name="protoctedArea">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="startSampleNum" type="xs:unsignedInt" />
              <xs:element name="endSampleNum" type="xs:unsignedInt" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>

    </xs:complexType>
  </xs:element>
  <xs:element name="refBeatData">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="header">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="avm" type="xs:unsignedShort">
              </xs:element>
              <xs:element name="sampleInterval" type="xs:unsignedShort">
              </xs:element>
              <xs:element name="diffUsed">
                <xs:simpleType>
                  <xs:restriction base="xs:unsignedByte">
                    <xs:enumeration value="0" />
                    <xs:enumeration value="1" />
                    <xs:enumeration value="2" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="reserved" type="xs:unsignedByte" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="unbounded" name="leadLength" type="xs:unsignedShort" />
        <xs:element maxOccurs="unbounded" name="dataLead" type="xs:hexBinary" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="rhythmData">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="header">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="avm" type="xs:unsignedShort" />
              <xs:element name="sampleInterval" type="xs:unsignedShort" />
              <xs:element name="diffUsed">
                <xs:simpleType>
                  <xs:restriction base="xs:unsignedByte">
                    <xs:enumeration value="0" />
                    <xs:enumeration value="1" />
                    <xs:enumeration value="2" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="biomodal" type="xs:unsignedByte" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="unbounded" name="leadLength" type="xs:unsignedShort" />
        <xs:element maxOccurs="unbounded" name="dataLead" type="xs:hexBinary" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="globalMeasurements">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="header">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="numQRSTypes" type="xs:unsignedByte" />
              <xs:element name="numPacemakerSpikes" type="xs:unsignedByte" />
              <xs:element name="rrInterval" type="xs:unsignedShort" />
              <xs:element name="ppInterval" type="xs:unsignedShort" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="unbounded" name="qrsMeas">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pOnset" type="xs:unsignedShort" />
              <xs:element name="pOffset" type="xs:unsignedShort" />
              <xs:element name="qrsOnset" type="xs:unsignedShort" />
              <xs:element name="qrsOffset" type="xs:unsignedShort" />
              <xs:element name="tOffset" type="xs:unsignedShort" />

              <xs:element name="pAxis" type="xs:unsignedShort" />
              <xs:element name="qrsAxis" type="xs:unsignedShort" />
              <xs:element name="tAxis" type="xs:unsignedShort" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" maxOccurs="unbounded" name="PacemakerSpikeMeas">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="spikeLocation" type="xs:unsignedShort" />
              <xs:element name="spikeAmplitude" type="xs:short" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="0" maxOccurs="unbounded" name="PacemakerSpikeInfo">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="skipeType">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="0 Unknown" />
                    <xs:enumeration value="1 Spice triggers neither P-wave nor QRS" />
                    <xs:enumeration value="2 Spike triggers a QRS" />
                    <xs:enumeration value="3 Spike triggers a P-wave" />
                    <xs:enumeration value="4 to 127 Reserved" />
                    <xs:enumeration value="128 to 254 Manufcturer-specific" />
                    <xs:enumeration value="255 No spike type analysis performed" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="spikeSource">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="0 Unknown" />
                    <xs:enumeration value="1 Internal" />
                    <xs:enumeration value="2 External" />
                    <xs:enumeration value="3 to 255 Reserved" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="triggerIndex" type="xs:unsignedShort" />
              <xs:element name="pulseWidth" type="xs:unsignedShort" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="qrsTypeInfo">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="qrsNum" type="xs:unsignedShort" />
              <xs:element maxOccurs="unbounded" name="qrsType" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="extraMeas">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="ventRate" type="xs:unsignedShort" />
              <xs:element name="atrialRate" type="xs:unsignedShort" />
              <xs:element name="qtc" type="xs:unsignedShort" />
              <xs:element name="formulaType">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="0 Unknown or unspecified" />
                    <xs:enumeration value="1 Bazett" />
                    <xs:enumeration value="2 Hodges" />
                    <xs:enumeration value="3 to 127 Reserved" />
                    <xs:enumeration value="128 to 254 Manufacturer specific" />
                    <xs:enumeration value="255 Measurement not available" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="taggedFieldsSize" type="xs:unsignedShort" />
              <xs:element minOccurs="0" maxOccurs="unbounded" name="taggedField">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="tag">
                      <xs:simpleType>
                        <xs:restriction base="xs:string">
                          <xs:enumeration value="0 QTend All-lead Dispersion" />
                          <xs:enumeration value="1 QTend All-lead Dispersion" />

                          <xs:enumeration value="2 QTend Precordial Dispersion" />
                          <xs:enumeration value="3 QTend Precordial Dispersion" />
                          <xs:enumeration value="4 to 254 Reserved" />
                          <xs:enumeration value="255 None (section terminator)" />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                    <xs:element name="len">
                      <xs:simpleType>
                        <xs:restriction base="xs:string">
                          <xs:enumeration value="5" />
                          <xs:enumeration value="none" />
                          <xs:enumeration value="0" />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                    <xs:element minOccurs="0" name="group">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="maxMinDisper" type="xs:unsignedByte" />
                          <xs:element name="hrCorrMaxMinDisper" type="xs:unsignedByte" />
                          <xs:element name="qtSdevDisper" type="xs:unsignedByte" />
                          <xs:element name="hrCorrQTSdevDisper" type="xs:unsignedByte" />
                          <xs:element name="formulaType">
                            <xs:simpleType>
                              <xs:restriction base="xs:string">
                                <xs:enumeration value="0 Unknown or unspecified" />
                                <xs:enumeration value="1 Bazett" />
                                <xs:enumeration value="2 Hodges" />
                                <xs:enumeration value="3 to 127 Reserved" />
                                <xs:enumeration value="128 to 254 Manufacturer specific" />
                                <xs:enumeration value="255 Measurement not available" />
                              </xs:restriction>
                            </xs:simpleType>
                          </xs:element>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="ManSpecBlock" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="textDiagnosis">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="header">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="confirmed">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="0 Original report" />
                    <xs:enumeration value="1 Confirmed report" />
                    <xs:enumeration value="2 Overread report, but not confirmed" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="date" type="dateType" />
              <xs:element name="time" type="timeType" />
              <xs:element name="statementsNum" type="xs:unsignedByte" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="1" name="statementsData">
          <xs:complexType>
            <xs:sequence maxOccurs="unbounded">
              <xs:element name="sequenceNum" type="xs:unsignedByte" />
              <xs:element name="leangth" type="xs:unsignedShort" />
              <xs:element name="statementField" type="xs:string" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>

    </xs:complexType>
  </xs:element>
  <xs:element name="manufSpecDiagnostic">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="leadMeasResults">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="header">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="leadsNum" type="xs:unsignedShort" />
              <xs:element name="ManufSpecific" type="xs:unsignedShort" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="1" name="LeadsMeas">
          <xs:complexType>
            <xs:sequence maxOccurs="unbounded">
              <xs:element name="leadID" type="xs:unsignedShort">
              </xs:element>
              <xs:element name="leadLeangth" type="xs:unsignedShort">
              </xs:element>
              <xs:element name="mandatoryMeas">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="pDuartion" type="xs:unsignedShort" />
                    <xs:element name="prInterval" type="xs:unsignedShort" />
                    <xs:element name="qrsDuration" type="xs:unsignedShort" />
                    <xs:element name="qtInterval" type="xs:unsignedShort" />
                    <xs:element name="qDuration" type="xs:unsignedShort" />
                    <xs:element name="qDuration" type="xs:unsignedShort" />
                    <xs:element name="rDuration" type="xs:unsignedShort" />
                    <xs:element name="sDuration" type="xs:unsignedShort" />
                    <xs:element name="rQuoteDurat" type="xs:unsignedShort" />
                    <xs:element name="sQuotDurat" type="xs:unsignedShort" />
                    <xs:element name="qAmplitude" type="xs:short" />
                    <xs:element name="rAmplitude" type="xs:short" />
                    <xs:element name="sAmplitude" type="xs:short" />
                    <xs:element name="rQuoteAmpl" type="xs:short" />
                    <xs:element name="sQuoteAmpl" type="xs:short" />
                    <xs:element name="jAmplitude" type="xs:short" />
                    <xs:element name="pPlusAmplit" type="xs:short" />
                    <xs:element name="pMinusAmplit" type="xs:short" />
                    <xs:element name="tPlusAmplit" type="xs:short" />
                    <xs:element name="tMinusAmplit" type="xs:short" />
                    <xs:element name="stSlope" type="xs:short" />
                    <xs:element name="pMorphe">
                      <xs:simpleType>
                        <xs:restriction base="xs:string">
                          <xs:enumeration value="0 unknown" />
                          <xs:enumeration value="1 positive" />
                          <xs:enumeration value="2 negative" />
                          <xs:enumeration value="3 positive/negative" />
                          <xs:enumeration value="4 negative/positive" />
                          <xs:enumeration value="5 positive/negative/positive" />
                          <xs:enumeration value="6 negative/positive/negative" />
                          <xs:enumeration value="7 notched M-shaped" />
                          <xs:enumeration value="8 notched W-shaped" />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                    <xs:element name="tMorphe">
                      <xs:simpleType>
                        <xs:restriction base="xs:string">
                          <xs:enumeration value="0 unknown" />
                          <xs:enumeration value="1 positive" />
                          <xs:enumeration value="2 negative" />
                          <xs:enumeration value="3 positive/negative" />
                          <xs:enumeration value="4 negative/positive" />
                          <xs:enumeration value="5 positive/negative/positive" />
                          <xs:enumeration value="6 negative/positive/negative" />
                          <xs:enumeration value="7 notched M-shaped" />
                          <xs:enumeration value="8 notched W-shaped" />
                        </xs:restriction>

                      </xs:simpleType>
                    </xs:element>
                    <xs:element name="isoSegOnsetQrs" type="xs:short" />
                    <xs:element name="isoSegEndQrs" type="xs:short" />
                    <xs:element name="intrinsiciod" type="xs:short" />
                    <xs:element name="qualCode" type="xs:short" />
                    <xs:element name="stAmplit20" type="xs:short" />
                    <xs:element name="stAmplit60" type="xs:short" />
                    <xs:element name="stAmplit80" type="xs:short" />
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element name="manufSpecMeas" type="xs:string">
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="UnivStateCodes">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="sectionHeader" type="sectionHeaderType" />
        <xs:element name="header">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="confirmed">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="0 Original report" />
                    <xs:enumeration value="1 Confirmed report" />
                    <xs:enumeration value="2 Overread report, but not confirmed" />
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="date" type="dateType" />
              <xs:element name="time" type="timeType" />
              <xs:element name="statementsNum" type="xs:unsignedByte" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="1" name="statementsData">
          <xs:complexType>
            <xs:sequence maxOccurs="unbounded">
              <xs:element name="sequenceNum" type="xs:unsignedByte" />
              <xs:element name="leangth" type="xs:unsignedShort" />
              <xs:element name="statementField">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="statementFieldType">
                      <xs:simpleType>
                        <xs:restriction base="xs:string">
                          <xs:enumeration value="1 Coded statement type, using the Universal Statement Codes." />
                          <xs:enumeration value="2 Full text type, as used in Section 8." />
                          <xs:enumeration value="3 Statement logic type, as described below." />
                        </xs:restriction>
                      </xs:simpleType>
                    </xs:element>
                    <xs:element name="statementData" type="xs:string" />
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="sectionHeaderType">
    <xs:sequence>
      <xs:element name="sectionIdNo" type="xs:unsignedShort" />
      <xs:element name="leangth" type="xs:unsignedInt" />
      <xs:element name="sectionVerNo" type="xs:unsignedByte" />
      <xs:element name="protocolVerNo" type="xs:unsignedByte" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="machineType">
    <xs:sequence>
      <xs:element name="instituationNum" type="xs:unsignedShort" />

      <xs:element name="departementNum" type="xs:unsignedShort" />
      <xs:element name="devId" type="xs:unsignedShort" />
      <xs:element name="devType">
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:enumeration value="Cart" />
            <xs:enumeration value="System (or Host)" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="manufCode">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="0:Unknown" />
            <xs:enumeration value="1:Burdick" />
            <xs:enumeration value="2:Cambridge" />
            <xs:enumeration value="3:Compumed" />
            <xs:enumeration value="4:Datamed " />
            <xs:enumeration value="5:Fukuda" />
            <xs:enumeration value="6:Hewlett-Packard" />
            <xs:enumeration value="7:Marquette Electronics " />
            <xs:enumeration value="8:Mortara Instruments" />
            <xs:enumeration value="9:Nihon Kohden" />
            <xs:enumeration value="10:Okin " />
            <xs:enumeration value="11:Quinton" />
            <xs:enumeration value="12:Siemens" />
            <xs:enumeration value="13:Spacelabs" />
            <xs:enumeration value="14:Telemed" />
            <xs:enumeration value="15:Hellige" />
            <xs:enumeration value="16:ESA-OTE" />
            <xs:enumeration value="17:Schiller" />
            <xs:enumeration value="18:Picker-Schwarzer" />
            <xs:enumeration value="19:Medical Devices (ex. Elettronica-Trentina)" />
            <xs:enumeration value="20:Zwönitz" />
            <xs:enumeration value="21 to 99:Reserved" />
            <xs:enumeration value="100:Other" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="modelDescription">
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:maxLength value="5" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="protRevNum" type="xs:unsignedByte" />
      <xs:element name="protCompLevel" type="xs:unsignedByte" />
      <xs:element name="langSupport" type="xs:unsignedByte" />
      <xs:element name="devCapabilities">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:unsignedByte">
              <xs:attribute name="printEcgReports" type="xs:boolean" />
              <xs:attribute name="interpretEcg" type="xs:boolean" />
              <xs:attribute name="storeEcgRecords" type="xs:boolean" />
              <xs:attribute name="acquireEcgData" type="xs:boolean" />
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
      <xs:element name="acFrequency">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="Unspecified" />
            <xs:enumeration value="50 Hz" />
            <xs:enumeration value="60 Hz" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="anaProgRevNum" type="xs:token" />
      <xs:element name="devSerialNum" type="xs:token" />
      <xs:element name="devSysSoftwareId" type="xs:token" />
      <xs:element name="devScpImpSoftwareId">
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:maxLength value="24" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>

      <xs:element name="devManuf" type="xs:token" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="dateType">
    <xs:sequence>
      <xs:element name="year">
        <xs:simpleType>
          <xs:restriction base="xs:unsignedShort">
            <xs:minInclusive value="1900" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="month">
        <xs:simpleType>
          <xs:restriction base="xs:unsignedByte">
            <xs:minInclusive value="01" />
            <xs:maxInclusive value="12" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="day">
        <xs:simpleType>
          <xs:restriction base="xs:unsignedByte">
            <xs:minInclusive value="01" />
            <xs:maxInclusive value="31" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="timeType">
    <xs:sequence>
      <xs:element minOccurs="0" name="hours">
        <xs:simpleType>
          <xs:restriction base="xs:unsignedByte">
            <xs:minInclusive value="0" />
            <xs:maxInclusive value="23" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element minOccurs="0" name="minutes">
        <xs:simpleType>
          <xs:restriction base="xs:unsignedByte">
            <xs:minInclusive value="0" />
            <xs:maxInclusive value="59" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element minOccurs="0" name="Seconds">
        <xs:simpleType>
          <xs:restriction base="xs:unsignedByte">
            <xs:minInclusive value="0" />
            <xs:maxInclusive value="59" />
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
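
The dateType and timeType definitions that close the schema constrain each component only by a numeric range: the three date components are required, the three time components are optional (minOccurs="0"). The sketch below mirrors those range checks in Python as an illustration. The function and dictionary names are hypothetical and do not come from the thesis, and the upper bound on year is simply the xs:unsignedShort maximum implied by the base type.

```python
# Range facets copied from dateType and timeType in the schema above.
# All names in this sketch are hypothetical helpers, not part of the thesis.
DATE_FACETS = {"year": (1900, 65535), "month": (1, 12), "day": (1, 31)}
TIME_FACETS = {"hours": (0, 23), "minutes": (0, 59), "Seconds": (0, 59)}


def check_components(values, facets, optional=False):
    """Return the names of components that are missing or out of range."""
    errors = []
    for name, (low, high) in facets.items():
        if name not in values:
            # dateType components are required; timeType ones have minOccurs="0".
            if not optional:
                errors.append(name)
            continue
        if not low <= values[name] <= high:
            errors.append(name)
    return errors


def check_date(values):
    """Apply the dateType facets (all three components required)."""
    return check_components(values, DATE_FACETS)


def check_time(values):
    """Apply the timeType facets (every component optional)."""
    return check_components(values, TIME_FACETS, optional=True)
```

Because the facets are independent per component, a value such as 31 February passes these checks just as it would pass schema validation; calendar consistency would have to be enforced elsewhere.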

Appendix 2

XML schema of the OEDIPE data acquisition meta-model

<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="OedipeSubModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Institution" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Department" minOccurs="0" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="Department_Techn" minOccurs="0" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:attribute name="techn_id" type="xs:NMTOKEN"/>
                      </xs:complexType>
                      <xs:key name="department_techn_pk">
                        <xs:selector xpath=".//Department_Techn"/>
                        <xs:field xpath="./@dpt_id"/>
                        <xs:field xpath="@techn_id"/>
                      </xs:key>
                      <xs:keyref name="department_techn_fk1" refer="technician_pk">
                        <xs:selector xpath=".//Department_Techn"/>
                        <xs:field xpath="@techn_id"/>
                      </xs:keyref>
                    </xs:element>
                    <xs:element name="Phy_Department" minOccurs="0" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:attribute name="phy_id" type="xs:NMTOKEN"/>
                      </xs:complexType>
                      <xs:key name="phy_department_pk">
                        <xs:selector xpath=".//Phy_Department"/>
                        <xs:field xpath="./@dpt_id"/>
                        <xs:field xpath="@phy_id"/>
                      </xs:key>
                      <xs:keyref name="phy_departement_fk1" refer="physician_pk">
                        <xs:selector xpath=".//Phy_Department"/>
                        <xs:field xpath="@phy_id"/>
                      </xs:keyref>
                    </xs:element>
                  </xs:sequence>
                  <xs:attribute name="dpt_id" type="xs:NMTOKEN" use="required"/>
                  <xs:attribute name="dpt_nbr" type="xs:integer"/>
                  <xs:attribute name="dpt_des" type="xs:string"/>
                </xs:complexType>
                <xs:key name="department_pk">
                  <xs:selector xpath=".//Department"/>
                  <xs:field xpath="@dpt_id"/>
                </xs:key>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="instit_id" type="xs:NMTOKEN" use="required"/>
            <xs:attribute name="instit_nbr" type="xs:integer"/>
            <xs:attribute name="instit_des" type="xs:string"/>
          </xs:complexType>
          <xs:key name="institution_pk">
            <xs:selector xpath=".//Institution"/>
            <xs:field xpath="@instit_id"/>
          </xs:key>
        </xs:element>
        <xs:element name="Physician" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="phy_id" type="xs:NMTOKEN" use="required"/>
            <xs:attribute name="phy_des" type="xs:string"/>
          </xs:complexType>
          <xs:key name="physician_pk">
            <xs:selector xpath=".//Physician"/>
            <xs:field xpath="@phy_id"/>
          </xs:key>
        </xs:element>
        <xs:element name="Technician" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="techn_id" type="xs:NMTOKEN" use="required"/>
            <xs:attribute name="techn_des" type="xs:string"/>
          </xs:complexType>
          <xs:key name="technician_pk">
            <xs:selector xpath=".//Technician"/>
            <xs:field xpath="@techn_id"/>
          </xs:key>
        </xs:element>
        <xs:element name="Patient" minOccurs="0" maxOccurs="unbounded">

          <xs:complexType>
            <xs:sequence>
              <xs:element name="Has_Record" minOccurs="0" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="Ecg_Rec_Sequence" minOccurs="0" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="Acquired_in" minOccurs="0">
                            <xs:complexType>
                              <xs:attribute name="dpt_id" type="xs:NMTOKEN"/>
                              <xs:attribute name="room_des" type="xs:string"/>
                            </xs:complexType>
                            <xs:keyref name="acquired_in_fk1" refer="department_pk">
                              <xs:selector xpath=".//Acquired_in"/>
                              <xs:field xpath="@dpt_id"/>
                            </xs:keyref>
                            <xs:key name="acquired_in_pk">
                              <xs:selector xpath=".//Acquired_in"/>
                              <xs:field xpath="././@ecg_rec_id"/>
                              <xs:field xpath="./@ecg_seq_id"/>
                            </xs:key>
                          </xs:element>
                          <xs:element name="Referred_by" minOccurs="0" maxOccurs="unbounded">
                            <xs:complexType>
                              <xs:attribute name="refer_id" type="xs:NMTOKEN" use="required"/>
                              <xs:attribute name="refer_date" type="xs:date"/>
                              <xs:attribute name="refer_ind" type="xs:string"/>
                              <xs:attribute name="dpt_id" type="xs:NMTOKEN"/>
                              <xs:attribute name="phy_id" type="xs:NMTOKEN"/>
                            </xs:complexType>
                            <xs:key name="referred_by_pk">
                              <xs:selector xpath=".//Referred_by"/>
                              <xs:field xpath="././@ecg_rec_id"/>
                              <xs:field xpath="./@ecg_seq_id"/>
                              <xs:field xpath="@refer_id"/>
                            </xs:key>
                            <xs:keyref name="referred_by_fk1" refer="department_pk">
                              <xs:selector xpath=".//Referred_by"/>
                              <xs:field xpath="@dpt_id"/>
                            </xs:keyref>
                            <xs:keyref name="referred_by_fk2" refer="physician_pk">
                              <xs:selector xpath=".//Referred_by"/>
                              <xs:field xpath="@phy_id"/>
                            </xs:keyref>
                          </xs:element>
                          <xs:element name="Ecg_Free_Text" minOccurs="0" maxOccurs="unbounded">
                            <xs:complexType>
                              <xs:attribute name="ecg_free_text_id" type="xs:NMTOKEN" use="required"/>
                              <xs:attribute name="free_text" type="xs:string"/>
                            </xs:complexType>
                            <xs:key name="ecg_free_text_pk">
                              <xs:selector xpath=".//Ecg_Free_Text"/>
                              <xs:field xpath="./@ecg_rec_id"/>
                              <xs:field xpath="./@ecg_seq_id"/>
                              <xs:field xpath="@ecg_free_text_id"/>
                            </xs:key>
                          </xs:element>
                          <xs:element name="Add_Section_Data" minOccurs="0" maxOccurs="unbounded">
                            <xs:complexType>
                              <xs:attribute name="add_sec_id" use="required"/>
                              <xs:attribute name="block_offset"/>
                              <xs:attribute name="block_length"/>
                              <xs:attribute name="date_creation"/>
                              <xs:attribute name="date_revision"/>
                            </xs:complexType>
                            <xs:key name="add_section_data_pk">
                              <xs:selector xpath=".//add_sec_data"/>
                              <xs:field xpath="././@ecg_rec_id"/>
                              <xs:field xpath="./@ecg_seq_id"/>
                              <xs:field xpath="@add_sec_id"/>
                            </xs:key>
                            <xs:keyref name="add_section_data_fk1" refer="add_section_pk">
                              <xs:selector xpath=".//add_sec_data"/>
                              <xs:field xpath="@add_sec_id"/>

                            </xs:keyref>
                          </xs:element>
                        </xs:sequence>
                        <xs:attribute name="ecg_seq_id" type="xs:NMTOKEN" use="required"/>
                        <xs:attribute name="date_acquisition" type="xs:dateTime"/>
                        <xs:attribute name="huffman_block_offset" type="xs:base64Binary"/>
                        <xs:attribute name="huffman_block_length"/>
                        <xs:attribute name="signal_data_block_offset"/>
                        <xs:attribute name="signal_data_block_length"/>
                        <xs:attribute name="amplitude_value_multiplier"/>
                        <xs:attribute name="sample_time_interval"/>
                        <xs:attribute name="techn_id" type="xs:NMTOKEN"/>
                      </xs:complexType>
                      <xs:key name="ecg_rec_sequence_pk">
                        <xs:selector xpath=".//Ecg_Rec_Sequence"/>
                        <xs:field xpath="./@ecg_rec_id"/>
                        <xs:field xpath="@ecg_seq_id"/>
                      </xs:key>
                      <xs:keyref name="ecg_rec_seq_fk1" refer="technician_pk">
                        <xs:selector xpath=".//Ecg_Rec_Sequence"/>
                        <xs:field xpath="@techn_id"/>
                      </xs:keyref>
                    </xs:element>
                  </xs:sequence>
                  <xs:attribute name="ecg_rec_id" type="xs:NMTOKEN" use="required"/>
                  <xs:attribute name="ecg_type" type="xs:integer"/>
                  <xs:attribute name="hieght" type="xs:positiveInteger"/>
                  <xs:attribute name="wieght" type="xs:positiveInteger"/>
                  <xs:attribute name="ref_ecg_rec_id" type="xs:NMTOKEN"/>
                </xs:complexType>
                <xs:key name="has_record_pk">
                  <xs:selector xpath=".//Has_Record"/>
                  <xs:field xpath="@ecg_rec_id"/>
                </xs:key>
                <xs:keyref name="has_record_fk1" refer="has_record_pk">
                  <xs:selector xpath=".//Has_Record"/>
                  <xs:field xpath="@ref_ecg_rec_id"/>
                </xs:keyref>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="patient_id" type="xs:NMTOKEN" use="required"/>
            <xs:attribute name="first_name" type="String40" use="optional"/>
            <xs:attribute name="last_name" type="String40" use="optional"/>
            <xs:attribute name="date_of_birth" type="xs:dateTime"/>
            <xs:attribute name="sex" type="sexType"/>
          </xs:complexType>
          <xs:key name="patient_pk">
            <xs:selector xpath=".//Patient"/>
            <xs:field xpath="@patient_id"/>
          </xs:key>
        </xs:element>
        <xs:element name="Add_Section" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="add_sec_id" type="xs:NMTOKEN" use="required"/>
            <xs:attribute name="section_nbr" type="xs:int"/>
            <xs:attribute name="version_nbr" type="xs:int"/>
            <xs:attribute name="format_des" type="xs:string"/>
          </xs:complexType>
          <xs:key name="add_section_pk">
            <xs:selector xpath=".//add_sec"/>
            <xs:field xpath="@add_sec_id"/>
          </xs:key>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="String40">
    <xs:restriction base="xs:string">
      <xs:maxLength value="40"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="sexType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Not specified"/>
      <xs:enumeration value="F"/>
      <xs:enumeration value="M"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
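
In the schema above, each relation's primary key is encoded as an xs:key and each foreign key as an xs:keyref whose refer attribute names the referenced key. As an illustration of how that encoding can be read back, the following Python sketch lists both kinds of constraint from a small excerpt. The excerpt is adapted from the Physician and Phy_Department definitions (flattened to top level so that it is well-formed on its own), and list_constraints is a hypothetical helper, not a component of the mediator described in the thesis.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Excerpt adapted from the schema above (Physician and Phy_Department only),
# flattened to top level so that it forms a small well-formed schema.
XSD_EXCERPT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Physician">
    <xs:complexType>
      <xs:attribute name="phy_id" type="xs:NMTOKEN" use="required"/>
    </xs:complexType>
    <xs:key name="physician_pk">
      <xs:selector xpath=".//Physician"/>
      <xs:field xpath="@phy_id"/>
    </xs:key>
  </xs:element>
  <xs:element name="Phy_Department">
    <xs:complexType>
      <xs:attribute name="phy_id" type="xs:NMTOKEN"/>
    </xs:complexType>
    <xs:key name="phy_department_pk">
      <xs:selector xpath=".//Phy_Department"/>
      <xs:field xpath="./@dpt_id"/>
      <xs:field xpath="@phy_id"/>
    </xs:key>
    <xs:keyref name="phy_departement_fk1" refer="physician_pk">
      <xs:selector xpath=".//Phy_Department"/>
      <xs:field xpath="@phy_id"/>
    </xs:keyref>
  </xs:element>
</xs:schema>
"""


def list_constraints(xsd_text):
    """Return (primary_keys, foreign_keys) declared in an XSD string.

    primary_keys maps key name -> (element name, field xpaths);
    foreign_keys maps keyref name -> (element name, field xpaths, referenced key).
    """
    root = ET.fromstring(xsd_text)
    primary_keys, foreign_keys = {}, {}
    for element in root.iter(XS + "element"):
        # Only constraints declared directly on this element are collected here.
        for key in element.findall(XS + "key"):
            fields = [f.get("xpath") for f in key.findall(XS + "field")]
            primary_keys[key.get("name")] = (element.get("name"), fields)
        for ref in element.findall(XS + "keyref"):
            fields = [f.get("xpath") for f in ref.findall(XS + "field")]
            foreign_keys[ref.get("name")] = (
                element.get("name"), fields, ref.get("refer"))
    return primary_keys, foreign_keys
```

Calling list_constraints(XSD_EXCERPT) reports that phy_departement_fk1 on Phy_Department refers to physician_pk, which is the XML Schema counterpart of a foreign key from the Phy_Department relation to the Physician relation.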

FOLIO ADMINISTRATIF

THESE SOUTENUE DEVANT L'INSTITUT NATIONAL DES SCIENCES APPLIQUEES DE LYON

NOM : JUMAA DATE de SOUTENANCE : 16 Décembre 2010 Prénoms : Hossam Aldin TITRE : Automatisation de la Médiation entre XML et des Bases de Données Relationnelles “XML and Relational Databases Mediation Automation” NATURE : Doctorat Numéro d'ordre : 2010-ISAL-0120 Ecole doctorale : InfoMaths - Informatique et Mathématiques Spécialité : Informatique Cote B.I.U. - Lyon : T 50/210/19 / et bis CLASSE : RESUME : XML offre des moyens simples et flexibles pour l'échange de données entre applications et s'est rapidement imposé comme standard de fait pour l'échange de données entre les systèmes d'information. Par ailleurs, les bases de données relationnelles constituent aujourd’hui encore la technologie la plus utilisée pour stocker les données, du fait notamment de leur capacité de mise à l’échelle, de leur fiabilité et de leur performance. Combiner la souplesse du modèle XML pour l'échange de données et la performance du modèle relationnel pour l’archivage et la recherche de données constitue de ce fait une problématique majeure. Cependant, l'automatisation des échanges de données entre les deux reste une tâche difficile. Dans cette thèse, nous présentons une nouvelle approche de médiation dans le but d’automatiser l'échange de données entre des documents XML et des bases de données relationnelles de manière indépendante des schémas de représentation des données sources et cibles. Nous proposons tout d’abord un modèle d’architecture de médiation générique des échanges. Pour faciliter la configuration d’interfaces spécifiques, notre architecture est basée sur le développement de composants génériques, adaptés à n'importe quelle source XML et n'importe quelle base de données relationnelle cible. Ces composants sont indépendants de tout domaine d'application, et ne seront personnalisés qu’une seule fois pour chaque couple de formats de données sources et de stockage cible. 
Our mediator thus enables the automatic and consistent update of any relational database from XML data. It also makes it possible to retrieve data from a relational database automatically and efficiently and to publish them in XML documents (or messages) structured according to the requested exchange format. The transformation of a relational model into XML Schema is one of the key elements of our mediator. We propose a methodology based on two successive algorithms: the first stratifies the relations into levels according to the functional dependencies that exist between the relations and the keys of the relations; the second automatically transforms the relational model into XML Schema, starting from the definition of a set of typical XML encoding fragments for relations, attributes, keys, and referential constraints. The proposed methodology preserves the referential integrity constraints of the relational schema and eliminates all data redundancy. It was designed to preserve the hierarchical representation of the relations, which is particularly important for generating correct SQL queries and for updating data consistently. Our approach has been successfully applied and tested in the medical domain to automate data exchange between an XML representation of the standard communication protocol SCP-ECG, an ISO standard describing an open format for representing biosignals and associated metadata, and a European reference relational model that notably includes the archiving of these data.
Automating the mediation is particularly relevant in this domain, where electrocardiograms (ECGs) are the main means of investigation for detecting cardiovascular diseases and must be exchanged quickly and transparently between different healthcare systems, particularly in emergencies, given that the SCP-ECG protocol has many implementations because most of its sections and data fields are optional.

KEYWORDS: XML, Relational databases, Mediation, Automated data exchange
Research laboratory: MTIC - Méthodologies de Traitement de l'Information en Cardiologie
Thesis supervisor: Prof. Paul RUBEL
Co-supervisor: Dr. Jocelyne FAYN
Jury president: Prof. Bruno DEFUDE
Jury members: Prof. Bruno DEFUDE, Dr. Jocelyne FAYN, Prof. Frank MORVAN, Prof. Paul RUBEL and Prof. Christine VERDIER
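
Illustrative note on the stratification step mentioned in the abstract: the abstract does not detail the algorithm, so what follows is only a minimal sketch of one plausible reading, not the thesis's own method. It assumes that the dependencies between relations are given as foreign-key references, and it places each relation one level above the relations it references. The function name `stratify`, the dictionary representation of the schema, and the table names are all hypothetical.

```python
from typing import Dict, List, Set


def stratify(references: Dict[str, Set[str]]) -> List[List[str]]:
    """Group relations into levels so each relation only references lower levels.

    `references` maps a relation name to the set of relations it points to
    through foreign keys (an assumed, simplified view of the schema).
    """
    # Ignore self-references and make sure every referenced relation has an entry.
    pending: Dict[str, Set[str]] = {
        rel: {t for t in targets if t != rel} for rel, targets in references.items()
    }
    for targets in list(pending.values()):
        for t in targets:
            pending.setdefault(t, set())

    levels: List[List[str]] = []
    placed: Set[str] = set()
    while pending:
        # A relation is ready once everything it references is already placed.
        ready = sorted(rel for rel, targets in pending.items() if targets <= placed)
        if not ready:
            raise ValueError(f"cyclic references among: {sorted(pending)}")
        levels.append(ready)
        placed.update(ready)
        for rel in ready:
            del pending[rel]
    return levels


# Hypothetical schema, loosely inspired by an ECG archive.
schema = {
    "patient": set(),
    "device": set(),
    "ecg_record": {"patient", "device"},
    "lead_measurement": {"ecg_record"},
}
print(stratify(schema))
# [['device', 'patient'], ['ecg_record'], ['lead_measurement']]
```

In this reading, level 0 holds the relations that reference nothing. A later transformation step could then emit XML Schema elements level by level, so that referenced keys are declared before the elements that refer to them. Cyclic references are rejected here for simplicity; the abstract does not say how the thesis handles them.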