Research Topics in Computing Data Modelling for Data Schema Integration 1 March 2005 David George.


Research Topics in Computing

Data Modelling for Data Schema Integration

1 March 2005

David George


Modelling & Data Integration

Key Elements of today’s Presentation

Key Drivers for Data Integration

Dimensions and Issues in Integration

Three Integration Approaches


Drivers for Data Integration


Drivers for Data Integration (1)

Organisations evolving as global entities with distributed data.

Systems characterised by mix of legacy and new databases and applications.

Organisational change:
  Organic growth – size and diversity.
  Business re-engineering.
  Corporate mergers and acquisitions.


Drivers for Data Integration (2)

Organisations evolved as collections of distinct, autonomous departments with disconnected systems e.g. in financial services.

Trends in Business Intelligence initiatives:
  Decision-making support.
  Customer segmentation.
  Marketing strategies.

Development of distributed or multidatabase systems.


Dimensions and Issues in Integration


Architecture & Design Issues

Multidatabase systems can be classified in two ways:

Homogeneous systems – local databases having same techniques and language.

Heterogeneous systems – local databases demonstrating diverse data models and language.

Key Dimensions in systems heterogeneity

System heterogeneity – hardware, OS, DBMS.

Semantic heterogeneity – models and data.


Why Heterogeneity/Conflict?

Translating conceptualisations of the real world into database world representations


Research Work Conceptualised

Books Model (a)

The data of interest is about Books, their Publishers and adopting Universities.

Publications Model (b)

The data of interest is about Publications and their Types


[Figure: entity-relationship diagrams of the Books model and the Publications model, separated by a dashed line. Labels in the figure: Publisher, Book, University, Topics, Publication, Keywords; Published by, Adopted by, Refer to, contains; Title, Name, Code, Address, City, Research Area, Word.]


[Figure: two entity-relationship diagrams, labelled A and B, separated by a dashed line. Labels in the figure: Publisher, Book, University, Topics, Publication, Keywords; Published by, Adopted by, Refer to, contains; Title, Name, Code, Address, City, Research Area, Word.]


Books and Publications Integrated

[Figure: integrated entity-relationship diagram. Labels in the figure: Publisher, Book, University, Topics, Publication; Published by, Adopted by, Refer to, contains; Title, Name, Code, Address, City, Research Area.]


Semantic Heterogeneity/Conflict

Structural Conflicts
  Generalisation versus Specialisation conflicts.
  Entity versus attributes.
  Naming conflicts.

Attribute (Domain) Conflicts
  Data Type conflicts.
  Measure and Scale conflicts.
  Integrity, Presence & Absence.
  Data Values.


Semantic Heterogeneity/Conflict

Generalisation/Specialisation conflicts (i.e. structural).

Naming conflicts.
  Synonyms, e.g. Customer vs Client.
  Homonyms, e.g. Market (Products) vs Market (Customers).
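The two naming conflicts can be illustrated with a minimal Python sketch. It assumes a correspondence table is kept between local names and integrated names; the schema names (`sales_db`, `crm_db`, `marketing_db`), the integrated names and the `resolve` helper are hypothetical and not taken from the slides.

```python
# Hypothetical correspondence table: (source schema, local name) -> integrated name.
# Keying on the source schema as well as the name is what separates homonyms.
CORRESPONDENCES = {
    # Synonyms: different local names, same real-world concept.
    ("sales_db", "Customer"): "Customer",
    ("crm_db", "Client"): "Customer",
    # Homonyms: same local name, different real-world concepts.
    ("sales_db", "Market"): "ProductMarket",
    ("marketing_db", "Market"): "CustomerMarket",
}


def resolve(schema: str, local_name: str) -> str:
    """Return the integrated name for a local name, or raise if unmapped."""
    try:
        return CORRESPONDENCES[(schema, local_name)]
    except KeyError:
        raise KeyError(f"no correspondence recorded for {schema}.{local_name}") from None
```

With this table, Customer and Client resolve to one integrated name, while the two Market entries resolve to different ones.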


Semantic Heterogeneity/Conflict

Data Type (representation) conflicts.
  Student - 26254006 (integer or string)
  Student - No vs Name (integer or string)

Measure and Scale etc. conflicts.
  Dimension - volume vs weight
  Measure - light years vs miles
  Scale - miles vs kilometres
  Precision - 1:100 versus A:E
  Date - dd/mm/yyyy vs mm-dd-yy ???
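A short Python sketch of how some of these attribute conflicts might be reconciled before integration. It assumes each source declares its own representation; the function names and the two format constants are illustrative, not part of the presentation.

```python
from datetime import date, datetime

KM_PER_MILE = 1.609344  # international mile, exact by definition


def normalise_student_no(value) -> str:
    """Represent a student number as a string, whether the source stored int or str."""
    return str(value).strip()


def miles_to_km(miles: float) -> float:
    """Resolve a scale conflict by converting to one agreed unit."""
    return miles * KM_PER_MILE


def parse_date(text: str, source_format: str) -> date:
    """Parse with the format declared for the source rather than guessing."""
    return datetime.strptime(text, source_format).date()


# Formats assumed for two hypothetical sources.
UK_FORMAT = "%d/%m/%Y"  # dd/mm/yyyy
US_FORMAT = "%m-%d-%y"  # mm-dd-yy
```

The date case shows why the format has to be recorded per source: the string 01/03/2005 is 1 March under the first format but would be 3 January if read month-first.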


Semantic Heterogeneity/Conflict

Integrity Constraints
  e.g. Age Range <21 vs Age >18
  Referential conflict 1:1 vs 1:M (e.g. 1 invoice for 1/M orders)

Presence/Absence.
  No null, nulls – e.g. optional
  No corresponding attribute

Data Values
  Same items different values
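As a rough illustration of presence/absence and data value conflicts, the Python sketch below compares two records assumed to describe the same item. The `compare_records` helper and its report keys are hypothetical; treating a null and a missing attribute alike is a simplifying assumption of the sketch.

```python
def compare_records(a: dict, b: dict) -> dict:
    """Classify each attribute of two records describing the same item."""
    report = {"agree": [], "value_conflict": [], "missing_in_a": [], "missing_in_b": []}
    for attr in sorted(set(a) | set(b)):
        # Assumption: None (null) and an absent attribute are treated the same.
        in_a = a.get(attr) is not None
        in_b = b.get(attr) is not None
        if in_a and in_b:
            key = "agree" if a[attr] == b[attr] else "value_conflict"
        elif in_a:
            key = "missing_in_b"
        elif in_b:
            key = "missing_in_a"
        else:
            continue  # null or absent on both sides: nothing to reconcile
        report[key].append(attr)
    return report
```

The report only detects conflicts. Deciding which value to trust is a separate policy question.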


Integration Approaches


Integration Approaches

Federated Database (Multidatabase) Systems.

Data Warehouse (Materialised in house) Systems.

Mediators (Virtual integration) Systems.


Federated Database Systems


Federated Databases (1)


Federated Databases (2)

A class of heterogeneous databases that:
  Consist of both new and old systems.
  Previously existed in their own stand-alone (autonomous) environments.

Integration is a consequence of distribution.

Organisation can adopt different architectures, i.e. the way databases are mapped together:
  Loosely Coupled integrations.
  Tightly Coupled integrations.


Federated Databases (3)

Tightly Coupled Federations

Federation administrator determines schema view for all component systems in the federation.

Negotiates export schemas (tables and attributes) from federation participants who control exports of local schemas.

Local schema exports integrated as a federated schema.

Less autonomy at federation user level for view creation.


Federated Databases (4)

Loosely Coupled Federations

The federated component databases have a greater degree of autonomy.

No central schema view is imposed on users.

Federated user is effectively an administrator creating views.

User employs a multidatabase (MDB) query language (versus schema integration in tightly coupled federations).


Federated Databases (5)

Sharing is made explicit by allowing export schemas from the local or component database.

The export schemas are imported to the federation to represent the shareable federated database.

Each source can call on others for information.

FDBMSs differ from homogeneous Distributed DBMSs (DDBMSs), whose sites all use the same data model and DBMS.

Sharing in DDBMSs is therefore implicit.
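The export/import idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names (`branch_db`, `loans_db`), their tables and attributes, and the `export_schema` and `federated_schema` helpers are all hypothetical.

```python
# Hypothetical local schemas: table -> list of attributes.
LOCAL_SCHEMAS = {
    "branch_db": {"account": ["acc_no", "balance", "internal_notes"]},
    "loans_db": {"loan": ["loan_no", "acc_no", "amount"], "audit": ["entry", "user"]},
}

# Each participant controls what it exports (tables and attributes).
EXPORTS = {
    "branch_db": {"account": ["acc_no", "balance"]},
    "loans_db": {"loan": ["loan_no", "acc_no", "amount"]},
}


def export_schema(component: str) -> dict:
    """Return only the tables and attributes the component has agreed to share."""
    local = LOCAL_SCHEMAS[component]
    exported = {}
    for table, attrs in EXPORTS.get(component, {}).items():
        # An export may only expose attributes that exist locally.
        exported[table] = [a for a in attrs if a in local.get(table, [])]
    return exported


def federated_schema(components) -> dict:
    """Import every export schema, qualifying table names by component."""
    return {
        f"{c}.{table}": attrs
        for c in components
        for table, attrs in export_schema(c).items()
    }
```

Anything a component does not export (here the `audit` table and the `internal_notes` attribute) never appears in the federated schema, which is how sharing is made explicit.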


Data Warehousing Systems


Data Warehousing (1)

[Figure: data warehousing architecture. Labels in the figure: User Query; Global Schema; Repository; Data Extraction; Local Schema (two); Wrapper (two); O/RDB; Web Source; Network, Internet. Stage labels: Local Operational; Integration & Storage; Warehouse; Decision Support & Mining. Other labels: R2, R3.]


Data Warehousing (2)

Represents the physical separation of operational and decision support environments.

Operational data provides the raw material for:
  Decision support systems.
  Data-mining (DM), e.g. identifying trends or characteristics.

DM = process of “non-trivial extraction of implicit, previously unknown, and potentially useful information”.


Data Warehousing (3)

Warehouse integrates multiple, heterogeneous data sources - e.g. Relational DBs, flat files.

Data is pre-fetched into a central or intermediate warehouse repository by a mediation process.

Data is “cleaned” and data integration techniques applied e.g. filtered, joined or aggregated.

Data may be transformed to conform to the warehouse schema.

Provides consistency in naming conventions, data structures, attributes, etc.
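The cleaning, transformation and aggregation steps can be sketched as a tiny Python pipeline. The two sources, their field names and units, and the warehouse schema `(customer_id, amount_gbp)` are invented for illustration only.

```python
from collections import defaultdict

# Two hypothetical sources with different naming and units.
RELATIONAL_ROWS = [
    {"cust_id": 1, "amount_gbp": 120.0},
    {"cust_id": 2, "amount_gbp": None},  # dirty row: missing amount
    {"cust_id": 1, "amount_gbp": 30.0},
]
FLAT_FILE_LINES = ["1;4500", "3;990"]  # client;amount in pence


def extract_and_transform():
    """Yield rows conforming to the warehouse schema (customer_id, amount_gbp)."""
    for row in RELATIONAL_ROWS:
        if row["amount_gbp"] is None:  # cleaning: filter incomplete rows
            continue
        yield {"customer_id": row["cust_id"], "amount_gbp": row["amount_gbp"]}
    for line in FLAT_FILE_LINES:
        client, pence = line.split(";")
        # Transformation: rename the key and convert pence to pounds.
        yield {"customer_id": int(client), "amount_gbp": int(pence) / 100}


def load_warehouse() -> dict:
    """Aggregate per customer and return the materialised totals."""
    totals = defaultdict(float)
    for row in extract_and_transform():
        totals[row["customer_id"]] += row["amount_gbp"]
    return dict(totals)
```

After loading, queries run against the aggregated totals rather than against either source.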


Data Warehousing (4)

Data then stored (materialised) in warehouse repository – possibly in separate data marts.

Result is a repository of synthesised data for management decision-making.

Queries are made over the repository’s global schema.

Information is independent from the source data.

Data extraction tends to be periodic.


Mediator (+Wrapper) Systems


Mediator Systems (1)

[Figure: mediator architecture. Labels in the figure: User Query; Integration System; Mediator; Mediated Schema; Query Translation; Query 1, Query 2; Local Schema (two); Wrapper (two); Data Sources: O/RDB, Web Source; Network, Internet.]


Mediator Systems (2)

Global schema created and mapped to the source schemas.

User makes queries over global, mediated schema.

Mappings can be either:
  Global-as-view (GAV).
  Local-as-view (LAV).

Mediator translates global schema query and reformulates it into sub-queries of local schemas.

Wrappers execute the sub-queries and return the results.
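A minimal Python sketch of the query path under a global-as-view mapping, where each global relation is defined as a view over the sources. The wrappers, their rows, the `Publication(title, publisher)` relation and the `query` function are hypothetical, and the selection is applied at the mediator rather than pushed down to the sources.

```python
# Hypothetical wrappers: each hides how its source is accessed and returns
# rows as dicts using the source's own (local) attribute names.
def wrap_relational():
    return [{"title": "Title One", "pub_name": "Publisher A"}]


def wrap_web_source():
    return [{"name": "Title Two", "publisher": "Publisher B"}]


# GAV: the global relation Publication(title, publisher) is defined as the
# union of two sources, each with a global -> local attribute renaming.
GAV_MAPPING = {
    "Publication": [
        (wrap_relational, {"title": "title", "publisher": "pub_name"}),
        (wrap_web_source, {"title": "name", "publisher": "publisher"}),
    ]
}


def query(global_relation: str, predicate=lambda row: True) -> list:
    """Unfold a global query into one sub-query per source and merge the answers."""
    answers = []
    for wrapper, renaming in GAV_MAPPING[global_relation]:
        for local_row in wrapper():  # the wrapper executes against its source
            row = {g: local_row[l] for g, l in renaming.items()}
            if predicate(row):  # selection applied at the mediator
                answers.append(row)
    return answers
```

The user sees only the global attribute names; the renaming in the mapping is what hides the naming conflicts between the sources.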


Mediator Systems (3)

Wrappers standardise how source information is described and accessed (i.e. they translate or adapt).

Query answers are returned to the user on demand – after sources are interrogated.

Thus data is always up-to-date (v. Warehousing).

Mediators integrate the information view, without integrating the source data.


Mediator Systems (4)

Results in a homogeneous information source using views - based on the mediated (global) schema.

Integration is virtual i.e. retrieved by the mediator but not stored in any central repository.

Differs from Warehousing, where queries are made to materialised data.

In short – provides virtual source schema integration via schema mapping and integrated view.


Comparisons


Federation versus Warehousing & Mediation

Federation represents a more “static” approach – using agreed couplings to allow view creation.

Warehousing and Mediation address integration in a more “dynamic” way – using extraction, transformation and integration processes.


Warehousing vs. Mediation

Warehouse:
  Update-driven, i.e. in the warehouse repository.
  Heterogeneous data is integrated in advance and stored in-house for direct query and analysis.

Mediation:
  Wrapper and Mediator layer on top of source DBs.
  Query-driven: a query to the mediated schema is translated into queries appropriate to the sources.
  Results are integrated into a global answer set.
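The practical difference between the two can be shown with a toy Python sketch. The `SOURCE` dictionary and the `Warehouse` and `Mediator` classes are hypothetical stand-ins, not a real system.

```python
SOURCE = {"stock": 10}  # hypothetical operational source


class Warehouse:
    """Update-driven: answers come from a copy refreshed by explicit loads."""

    def __init__(self):
        self.repository = {}

    def refresh(self):
        self.repository = dict(SOURCE)  # periodic extraction into the repository

    def query(self, key):
        return self.repository[key]


class Mediator:
    """Query-driven: every query is forwarded to the source on demand."""

    def query(self, key):
        return SOURCE[key]
```

If the source changes after a refresh, the warehouse keeps returning the value it materialised until the next refresh, while the mediator reflects the change immediately at the cost of contacting the source for every query.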


Summary


Summary

Drivers for Data Integration
  Organisational change.
  Business Intelligence and Strategies.

Integration Issues
  Different Conceptual Model representations.
  Resulting Semantic Heterogeneities.

Integration Approaches
  Federated Systems.
  Data Warehousing and Mediator Systems.


Next step ……


Research Resources

Reference Material
  Journals
  Books
  Presentation slides

UCLAN Website

Internal: http://janus/dgeorge/integration/journals.asp

External: http://www.janus.computing.uclan.ac.uk/dgeorge/integration/journals.asp
