PalGov © 2011

أكاديمية الحكومة الإلكترونية الفلسطينية

The Palestinian eGovernment Academy

www.egovacademy.ps

Tutorial II: Data Integration and Open Information Systems

Session 12.1

The problem of Data Integration

Dr. Mustafa Jarrar

University of Birzeit

www.jarrar.info

About

This tutorial is part of the PalGov project, funded by the TEMPUS IV program of the Commission of the European Communities, grant agreement 511159-TEMPUS-1-2010-1-PS-TEMPUS-JPHES. The project website: www.egovacademy.ps

Project Consortium:

University of Trento, Italy
University of Namur, Belgium
Vrije Universiteit Brussel, Belgium
TrueTrust, UK
Birzeit University, Palestine (Coordinator)
Palestine Polytechnic University, Palestine
Palestine Technical University, Palestine
Université de Savoie, France
Ministry of Local Government, Palestine
Ministry of Telecom and IT, Palestine
Ministry of Interior, Palestine

Coordinator:

Dr. Mustafa Jarrar
Birzeit University, P.O. Box 14, Birzeit, Palestine
Telefax: +972 2 2982935

© Copyright Notes

Everyone is encouraged to use this material, or part of it, but should properly cite the project (logo and website) and the author of that part.

No part of this tutorial may be reproduced or modified in any form or by any means without prior written permission from the project, which holds the full copyright on the material.

Attribution-NonCommercial-ShareAlike (CC-BY-NC-SA)

This license lets others remix, tweak, and build upon your work non-commercially, as long as they credit you and license their new creations under the identical terms.

Tutorial Map

Topic (hours)

Session 1: XML Basics and Namespaces (3 h)
Session 2: XML DTDs (3 h)
Session 3: XML Schemas (3 h)
Session 4: Lab-XML Schemas (3 h)
Session 5: RDF and RDFs (3 h)
Session 6: Lab-RDF and RDFs (3 h)
Session 7: OWL (Web Ontology Language) (3 h)
Session 8: Lab-OWL (3 h)
Session 9: Lab-RDF Stores - Challenges and Solutions (3 h)
Session 10: Lab-SPARQL (3 h)
Session 11: Lab-Oracle Semantic Technology (3 h)
Session 12_1: The problem of Data Integration (1.5 h)
Session 12_2: Architectural Solutions for the Integration Issues (1.5 h)
Session 13_1: Data Schema Integration (1 h)
Session 13_2: GAV and LAV Integration (1 h)
Session 13_3: Data Integration and Fusion using RDF (1 h)
Session 14: Lab-Data Integration and Fusion using RDF (3 h)
Session 15_1: Data Web and Linked Data (1.5 h)
Session 15_2: RDFa (1.5 h)
Session 16: Lab-RDFa (3 h)

Intended Learning Objectives

A: Knowledge and Understanding
2a1: Describe tree and graph data models.
2a2: Understand the notation of XML, RDF, RDFS, and OWL.
2a3: Demonstrate knowledge about querying techniques for data models, such as SPARQL and XPath.
2a4: Explain the concepts of identity management and Linked Data.
2a5: Demonstrate knowledge about integration and fusion of heterogeneous data.

B: Intellectual Skills
2b1: Represent data using tree and graph data models (XML & RDF).
2b2: Describe data semantics using RDFS and OWL.
2b3: Manage and query data represented in RDF, XML, OWL.
2b4: Integrate and fuse heterogeneous data.

C: Professional and Practical Skills
2c1: Using Oracle Semantic Technology and/or Virtuoso to store and query RDF stores.

D: General and Transferable Skills
2d1: Working in a team.
2d2: Presenting and defending ideas.
2d3: Use of creativity and innovation in problem solving.
2d4: Develop communication skills and logical reasoning abilities.

Module ILOs

After completing this module students will be able to:

- Understand the importance of Data Integration.

- Understand the problems and challenges of Data Integration.

Example from the government Domain

• Consider all interactions with government agencies in order to register a new business in Palestine.

• Example: Establishing a new radio station.

[Figure: the agencies involved: Ministry of Telecom, Ministry of Information, Ministry of National Economy, Chamber of Commerce, Ministry of Finance.]

Example from the government Domain

• Consider when the business evolves or changes.

• Example: Changing the address of the radio station.

– Address must be changed in 5 different databases.

[Figure: the five agencies holding the address: Ministry of Telecom, Ministry of Information, Ministry of National Economy, Chamber of Commerce, Ministry of Finance.]

Example from the government Domain

• Consider the data registered about the same radio station in the databases of different ministries and governmental agencies:

Agency 1
ID      Name           Type           Location
R2563I  Radio Al-Amal  Radio Station  Ramallah

Agency 2
B_ID    Business Name      Activity Type       City
LM1847  Al-Amal Broadcast  Radio Broadcasting  Ramallah and Bireh

Agency 3
ID      Company Name       Company Type          Location
182NS3  Broadcast Al-Amal  Broadcasting Station  Al-Balu’

. . .

Example from the government Domain

• This simple example points to several challenges in data integration:
– No agreed-upon naming (Name, Business Name, Company Name).
– No agreed-upon meaning (does 'Activity Type' mean exactly the same as 'Company Type'?).
– Different registered data: Radio Al-Amal, Al-Amal Broadcast, ...

Agency 1
ID      Name           Type           City
R2563I  Radio Al-Amal  Radio Station  Ramallah

Agency 2
B_ID    Business Name      Activity Type       Province
LM1847  Al-Amal Broadcast  Radio Broadcasting  Ramallah and Bireh

Agency 3
ID      Company Name       Company Type          Location
182NS3  Broadcast Al-Amal  Broadcasting Station  Al-Balu’

. . .
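The naming challenge above can be illustrated with a minimal Python sketch that renames each agency's columns to one shared set of field names. The shared field names (`source_id`, `name`, `type`, `location`) and the agency keys are assumptions made for this illustration; the slides do not define a common schema.

```python
# Hypothetical shared schema: the target field names are assumptions.
AGENCY_MAPPINGS = {
    "agency1": {"ID": "source_id", "Name": "name",
                "Type": "type", "City": "location"},
    "agency2": {"B_ID": "source_id", "Business Name": "name",
                "Activity Type": "type", "Province": "location"},
    "agency3": {"ID": "source_id", "Company Name": "name",
                "Company Type": "type", "Location": "location"},
}

# The three records from the example tables.
RECORDS = {
    "agency1": {"ID": "R2563I", "Name": "Radio Al-Amal",
                "Type": "Radio Station", "City": "Ramallah"},
    "agency2": {"B_ID": "LM1847", "Business Name": "Al-Amal Broadcast",
                "Activity Type": "Radio Broadcasting",
                "Province": "Ramallah and Bireh"},
    "agency3": {"ID": "182NS3", "Company Name": "Broadcast Al-Amal",
                "Company Type": "Broadcasting Station",
                "Location": "Al-Balu'"},
}


def to_common(agency, record):
    """Rename one agency record's columns to the shared field names."""
    mapping = AGENCY_MAPPINGS[agency]
    unified = {mapping[column]: value for column, value in record.items()}
    unified["source"] = agency  # remember where the record came from
    return unified


unified_records = [to_common(agency, rec) for agency, rec in RECORDS.items()]

# Renaming columns does not reconcile the values themselves.
distinct_names = {rec["name"] for rec in unified_records}
print(len(distinct_names), "different names for the same station")
```

Renaming only addresses the naming challenge. The three records still disagree on the station's name, type, and location, and whether 'Activity Type' and 'Company Type' really mean the same thing remains an open question that a column mapping cannot answer.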

Problem is in all domains

• The problem is now even more challenging with the Web.

• The Data Web envisions the Web as a global, worldwide database.

• This means that one can query multiple distributed databases on the Web as if querying a local database.

Challenges of Data Integration:

Heterogeneities in Database Schemas

• Several kinds of heterogeneity can be distinguished between different schemas:
– Name heterogeneities (differences in the vocabulary used).
– Meaning heterogeneities (different meanings for the same attribute in two schemas).
– Heterogeneities in the structure and type.
– Heterogeneities in the rules and constraints.
– Data model heterogeneities.

Name and Meaning Heterogeneities

• Synonyms – Different names for the same concept
– employee, clerk
– exam, course
– code, num
– Section, Division (both: a specialized division of a large organization)

• Homonyms – Same name for different concepts (different meanings)
– City as city of birth in one schema, City as city of residence in another schema
– Salary as net salary in one schema, Salary as gross salary in another schema
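As a rough sketch of how the two cases differ in practice: synonyms can be resolved with a simple name-to-term lookup, while homonyms need the schema as part of the lookup key. The schema names (`schema_a`, `schema_b`), the chosen canonical terms, and which schema holds net versus gross salary are all assumptions made for illustration.

```python
# Synonyms: different source names mapped to one shared term.
# The canonical terms ("employee", "code") are picked for illustration.
SYNONYMS = {
    "employee": "employee",
    "clerk": "employee",
    "code": "code",
    "num": "code",
}

# Homonyms: the same name means different things in different schemas,
# so the lookup key must include the schema. Schema names are hypothetical.
HOMONYMS = {
    ("schema_a", "City"): "city_of_birth",
    ("schema_b", "City"): "city_of_residence",
    ("schema_a", "Salary"): "net_salary",
    ("schema_b", "Salary"): "gross_salary",
}


def canonical_name(schema, name):
    """Return the shared term for an attribute name in a given schema."""
    if (schema, name) in HOMONYMS:
        return HOMONYMS[(schema, name)]
    # Unknown names fall through unchanged (lower-cased).
    return SYNONYMS.get(name.lower(), name.lower())


print(canonical_name("schema_a", "clerk"))   # employee
print(canonical_name("schema_a", "City"))    # city_of_birth
print(canonical_name("schema_b", "City"))    # city_of_residence
```

The point of the design is that a synonym table can be shared across all sources, whereas a homonym can only be detected by someone who knows what each schema actually means by the name.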

Heterogeneities in Structure and Type

• The same concepts are represented with different conceptual structures in two schemas:
– Attribute in one schema and derived value in another schema.
– Attribute in one schema and entity in another schema.
– Entity in one schema and relationship in another schema.
– Different abstraction levels for the same concept in two schemas, e.g. two entities with homonym names related by an IS-A hierarchy in two schemas.

Source: Carlo Batini

Heterogeneities in Structure

• EXAMPLES:

[Figure: example schema pairs: BOOK and PUBLISHER; EMPLOYEE, DEPARTMENT, PROJECT versus EMPLOYEE, PROJECT; Person with MAN and WOMAN versus Person with GENDER.]

Source: Carlo Batini
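One case of structural heterogeneity is an attribute in one schema that is an entity in another. The sketch below assumes the BOOK and PUBLISHER labels are an instance of that case: schema A stores the publisher as an attribute of each book, and schema B stores publishers as a separate entity. Both schemas and all sample values are invented for illustration.

```python
# Schema A (assumed): PUBLISHER is an attribute of BOOK.
books_a = [
    {"title": "Book One", "publisher": "Publisher X"},
    {"title": "Book Two", "publisher": "Publisher X"},
]

# Schema B (assumed): PUBLISHER is its own entity, referenced by key.
publishers_b = {1: {"name": "Publisher Y"}}
books_b = [{"title": "Book Three", "publisher_id": 1}]


def a_to_entity_form(books):
    """Lift the publisher attribute of schema A into a separate entity."""
    publishers = {}   # publisher id -> publisher record
    ids = {}          # publisher name -> publisher id
    converted = []
    for book in books:
        name = book["publisher"]
        if name not in ids:
            ids[name] = len(ids) + 1
            publishers[ids[name]] = {"name": name}
        converted.append({"title": book["title"],
                          "publisher_id": ids[name]})
    return publishers, converted


publishers_a, books_a_converted = a_to_entity_form(books_a)
print(publishers_a)
print(books_a_converted)
```

After the conversion both sources have the same shape, but their publisher identifiers are still local to each source, so merging them would need a further matching step.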

Heterogeneities in Type

Examples:

• In a single attribute (e.g., numeric vs. alphanumeric). E.g., the attribute "gender":
– Male/Female
– M/F
– 0/1

• Year has a four-digit domain in one schema and a two-digit domain in another schema.

• Different currencies (euros, US dollars, etc.)

• Different measurement systems (kilos vs. pounds, centigrade vs. Fahrenheit)

• Different granularities (grams, kilos, etc.)
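The type differences listed above are usually handled by normalizing each source value to one agreed representation. The following is a minimal sketch under stated assumptions: the target codes `M`/`F`, the reading of `0`/`1`, and the two-digit-year pivot are all choices made here for illustration, and each would have to be confirmed against the source's documentation. Currencies are left out because they need exchange rates.

```python
GENDER_CODES = {"male": "M", "m": "M", "female": "F", "f": "F"}
# 0/1 is ambiguous without the source's documentation; this mapping is assumed.
NUMERIC_GENDER = {0: "M", 1: "F"}


def normalize_gender(value):
    """Map Male/Female, M/F, or 0/1 to a single code."""
    if isinstance(value, int):
        return NUMERIC_GENDER[value]
    return GENDER_CODES[value.strip().lower()]


def normalize_year(value, pivot=30):
    """Expand a two-digit year; values below `pivot` are read as 20xx."""
    year = int(value)
    if year >= 100:          # already a full year
        return year
    return 2000 + year if year < pivot else 1900 + year


def to_kilograms(amount, unit):
    """Convert a weight to kilograms (units and granularities)."""
    factors = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}
    return amount * factors[unit]


def fahrenheit_to_celsius(degrees_f):
    return (degrees_f - 32) * 5 / 9


print(normalize_gender("Female"), normalize_gender(0))
print(normalize_year("11"), normalize_year("85"))
print(to_kilograms(10, "lb"), fahrenheit_to_celsius(212))
```

Unit and granularity conversions are mechanical once the unit is known. The gender and year cases are harder because the source data alone does not say which convention was used.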

Heterogeneities in the rules and constraints

• EXAMPLES:

– Different cardinalities in the same relationships

– Key conflicts

Source: Carlo Batini

Model Heterogeneities

• Model heterogeneity occurs when different databases adhere to different data models:
– Relational Data Model, XML, RDF, Object-Oriented, OWL, ...

• Solution: Reduce model heterogeneity by using one data model.

• Example: Convert the relational model to the RDF graph model.
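A minimal sketch of what such a conversion can look like: each row becomes a subject, each non-key column becomes a predicate, and each cell value becomes a literal. The `http://example.org/` namespace and the URI layout are placeholders chosen for this illustration, not a vocabulary defined by the tutorial, and the function is a simplified convention rather than a complete mapping.

```python
BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def row_to_triples(table, key_column, row):
    """Map one relational row to (subject, predicate, object) triples.

    The row becomes a subject URI built from its key; every other
    column becomes a predicate with the cell value as a string literal.
    """
    subject = f"<{BASE}{table}/{row[key_column]}>"
    triples = []
    for column, value in row.items():
        if column == key_column:
            continue
        predicate = f"<{BASE}{table}#{column.replace(' ', '_')}>"
        # Escape backslashes and double quotes inside the literal.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, predicate, '"' + escaped + '"'))
    return triples


row = {"ID": "R2563I", "Name": "Radio Al-Amal",
       "Type": "Radio Station", "City": "Ramallah"}
triples = row_to_triples("agency1", "ID", row)

# Print in an N-Triples-like form: subject predicate object .
for s, p, o in triples:
    print(s, p, o, ".")
```

Once every source is expressed as triples, the sources share one data model. The name, meaning, and type heterogeneities still have to be resolved on top of that.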

References

• Carlo Batini: Course on Data Integration. BZU IT Summer School, 2011.

• Stefano Spaccapietra: Information Integration. Presentation at the IFIP Academy, Porto Alegre, 2005.

• Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI International, Artificial Intelligence Center, Menlo Park, USA, 2009.