Slide 1
Berendt: Advanced databases, winter term 2007/08, http://www.cs.kuleuven.ac.be/~berendt/teaching/2007w/adb/
Advanced databases
Defining and combining heterogeneous databases: Basics and overview
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Last update: 17 October 2007

Slide 2
Motivation: Price comparison engines search & combine heterogeneous travel-agency DBs, which search & combine heterogeneous airline DBs

Slide 3
Agenda
• Goals and challenges
• Global schema integration (short survey)
• Federated database systems
• An example: IBM's DB2
• User sovereignty & multidatabase language approach

Slide 4
Multi database systems
Multiple databases created for the same functionality
• Different operating systems, data formats, query languages etc.
Typically DBs managed by DBMSs running on heterogeneous computing platforms
Information sharing across dissimilar platforms
• Interconnect previously isolated software systems (DBMS)
• Not only invoke but also coordinate interactions

Slide 5
Interoperating with heterogeneous databases: requirements (1)
• Distribution transparency
  - Users must access a number of different databases in the same way as accessing a single database.
• Heterogeneity transparency
  - Users must access other schemas in the same way they access their local database (using a familiar model and language).
• The existing database systems and applications must not be changed.
Slide 6
Interoperating with heterogeneous databases: requirements (2)
• Addition of new databases must be easily accommodated into the system.
• The databases have to be accessed both for retrievals and updates.
• The performance of heterogeneous systems has to be comparable to that of homogeneous distributed systems.

Slide 7
Autonomy and heterogeneity
Interconnection and cooperation of autonomous and heterogeneous databases must address
• Distribution
• Autonomy
• Heterogeneity

Slide 8
Heterogeneity
• Heterogeneity is independent of location of data
• When is an information system homogeneous?
  - Software that creates and manipulates data is the same
  - All data follows the same structure and data model and is part of a single universe of discourse
• Different levels of heterogeneity
  - Different languages to write applications
  - Different query languages
  - Different models
  - Different DBMSs
  - Different file systems
  - Semantic heterogeneity etc.

Slide 9
Autonomy
Databases usually under separate and independent control
Aspects of autonomy
• Design autonomy: Local DBs choose their own data model, query language, interpretation of data etc.
• Communication autonomy: Local DBs decide when and how to respond to other DB requests
• Execution autonomy: Execution of local/external operations/transactions is not controlled by any external DBMS
• Association autonomy: Local DBs can decide how much of their data/functions/operations to share with other classes of users
• Another kind of autonomy: User autonomy / sovereignty!
Slide 10
Balancing autonomy and heterogeneity
Different degrees of autonomy:
• No/little autonomy (intra-corporate, poor networking infrastructure)
• More autonomy and flexible bridging of heterogeneity (federated approach)
• Autonomy over heterogeneity (multi database language approach)

Slide 11
Interoperability
The ability to request and receive services between the interoperating systems and use each other's functionality.
Systems are considered interoperable if
• They can exchange messages and requests
• They can receive services and operate as a unit in solving a common problem

Slide 12
Heterogeneous Distributed Databases
Information systems that provide interoperation and varying degrees of integration among multiple DBs are called
• Multi database systems or
• Federated (database) systems or
• More generally, heterogeneous distributed database systems (HDDBSs)

Slide 13
Solutions to integrating HDDBSs
• Global schema integration
• Federated database systems
• Multi database language approach

Slide 14
Agenda
• Goals and challenges
• Global schema integration (short survey)
• Federated database systems
• An example: IBM's DB2
• User sovereignty & multidatabase language approach

Slide 15
Definition and advantages
Global database integration:
• Based on complete integration to provide a single view
Advantages:
• Consistent, uniform view of and access to data for users
• Users unaware of the multiple existing DBs

Slide 16
Disadvantages
• Hard to automate creation of a global schema: structural, semantic or behavioral conflicts
• Autonomy, esp. association autonomy, sacrificed: all local data and operations to be revealed
• Loss of semantic information depending on how the schema integration is performed
• Correctness of global schema is hard to prove because of context-dependent meanings
• Error-prone, time-consuming
• Unsuitable for frequent dynamic changes to schemas
• Does not scale well with size of DB networks

Slide 17
Agenda
• Goals and challenges
• Global schema integration (short survey)
• Federated database systems
• An example: IBM's DB2
• User sovereignty & multidatabase language approach

Slide 18
Taxonomy - based on autonomy
DBS either centralized or distributed
• Centralized: a single DBMS managing a single DB
• Distributed: a single distributed DBMS managing multiple DBs
MDBS supports operations on multiple DBs

Slide 19
What is a federated system?
A federated system integrates existing, possibly heterogeneous, databases while preserving their autonomy*. The main difference between federated systems and traditional distributed systems is that in federated systems each component remains autonomous. Autonomy of a component system means that the local administrator maintains some control over his/her system.
* A. P. Sheth and J. A. Larson.
Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3):183-236, 1990.

Slide 20
Definition
• A Federated Database System (FDBS) is a collection of cooperating but autonomous component DBSs.
• Aim: remove the need for static global schema integration
• Allows each local DB to have more control over the shareable information
• Control is decentralized
• Integration need not be complete but depends on needs of users
• More terminology: FDBMS = the software that controls and coordinates the component DBSs of an FDBS

Slide 21
FDBS coupling
Loosely coupled FDBS
• An FDBS is loosely coupled if it is the users' responsibility to create and maintain the federation. No control is enforced by the federation admin.
Tightly coupled FDBS
• An FDBS is tightly coupled if the federation admin is responsible for creating and maintaining the federation and actively controlling access to the component DBSs. Association autonomy of the individual component DBs still exists.

Slide 22
FDBSs as a compromise
Compromise between
• No integration, in which users must explicitly interface between multiple autonomous DBs
AND
• Total integration, in which autonomy of each component DBS is sacrificed so that users can access data through a single global interface but not as a local user
Support local and global (federated) operations

Slide 23
An FDBS and its components: cooperation among independent systems
Figure annotations:
• Can continue local operations and participate in more than 1 federation.
• Can be (de)centralized or another FDBMS

Slide 24
Basic system components of the data management architecture

Slide 25
FDBS schemas
• Local schema
  - Conceptual schema of a component DB
• Component schema
  - Local schema translated to a common data model of the FDBS. Alleviates data model heterogeneity.
• Export schema
  - Specifies the objects shareable with other members or classes of members of the FDBS.
• Federated schema
  - A statically integrated schema or dynamic view of multiple export schemas. There can be multiple federated schemas.
• External schema
  - For customization when the federated schema is large and complicated. Another level of abstraction, for example for a class of users.

Slide 26
The system of schemas needs to be extended
Five-level schema architecture of an FDBS

Slide 27
Loosely coupled FDBSs
• User creates and maintains federation schema
• Creating the schema corresponds to creating a view against relevant export schemas
• Therefore, each user must be aware of information and structure of the export schemas
• Hard to support view updates; therefore, assume highly autonomous read-only DBs

Slide 28
Loosely coupled FDBSs - Advantages
• Flexibility: different interpretations possible for the same federated schema
• Easier to cope with dynamic changes in schemas since it is easier to create views. Detection of changes is however expensive.
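The schema levels of slides 25-27 can be traced in a small sketch. This is only an illustration under stated assumptions: the two component databases, their contents, and all function names (`component_a`, `export`, `federated_view`, ...) are hypothetical and do not come from the slides or from Sheth and Larson; plain Python data structures stand in for real DBMSs.

```python
# Hypothetical component databases with different local data models.
LOCAL_A = [("Ann", "Lee", "555-0100", 52000)]  # local schema: positional rows
LOCAL_B = {"bob": {"name": "Bob Ray", "phone": "555-0199", "salary": 61000}}  # keyed records


def component_a(rows):
    """Component schema: translate A's local rows into the common model (dicts)."""
    return [{"name": f"{first} {last}", "phone": phone, "salary": salary}
            for first, last, phone, salary in rows]


def component_b(records):
    """Component schema: translate B's keyed records into the common model."""
    return [dict(rec) for rec in records.values()]


def export(rows, shareable):
    """Export schema: expose only the attributes the component DB agrees to share."""
    return [{attr: row[attr] for attr in shareable} for row in rows]


def federated_view(*exports):
    """Federated schema (loosely coupled): a user-defined view over export schemas."""
    return [row for exp in exports for row in exp]


def external_view(rows, attrs):
    """External schema: a further, user-specific projection of the federated schema."""
    return [tuple(row[a] for a in attrs) for row in rows]


export_a = export(component_a(LOCAL_A), shareable=("name", "phone"))
export_b = export(component_b(LOCAL_B), shareable=("name", "phone"))
federation = federated_view(export_a, export_b)
names = external_view(federation, ("name",))
```

In this sketch the salary attribute never leaves the component databases, which mirrors association autonomy, and the user who writes `federated_view` has to know the structure of both export schemas, which is the burden slide 27 attributes to loosely coupled federations.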
Slide 29
Loosely coupled FDBSs - Disadvantages
• Duplicated effort in creation of similar federated schemas.
• Difficulty in understanding the semantics of schemas available to the user.
• Due to possible multiple view creations, view updating cannot be supported.

Slide 30
Tightly coupled FDBSs
• Aim: provide location, replication and distribution transparency
• Federation administrators have full control over creation and maintenance of federated schemas and access to other export schemas
• A single federated schema is the same as a global schema, but view updates are possible if administrators understand the mappings.

Slide 31
Tightly coupled FDBSs - Disadvantages
• FDBS administrator and component DBSs negotiate creation of export schemas, during which the administrator has complete read access to component schema and/or data. Violates autonomy.
• Changes in export/component schemas imply redoing federated schema creation.

Slide 32
Basic system components of the data management architecture

Slide 33
Processors in an FDBS
• Transforming processors
  - Use mappings to transform commands from internal command language to local query language etc.
• Filtering processors
  - Use access control specified in export schema to limit allowable operations submitted to corresponding component schemas
• Constructing processors
  - Perform query decomposition and merge data

Slide 34
System architecture of an FDBS: schemas and processors

Slide 35
Data integration approaches
• Multidatabase languages and declarative integration languages
  - Collective identifiers, semantic variables, virtual classes that form a global schema
• Conceptual-level abstraction from data sources
  - Data integration performed on top of this conceptual layer
• Object-oriented virtual integration approaches
  - Enable user to express specific views and ways to compose integrated data objects
• Ontology-based integration approaches
  - Single-ontology (global) or multi-ontologies
• Semantic Web approaches
  - Ontology-based
• Taxonomic database systems
  - Support multiple, overlapping classifications in centralized, non-integrated DB systems

Slide 36
Examples of integration approaches (1)

Slide 37
Examples of integration approaches (2)

Slide 38
Examples of integration approaches (3)

Slide 39
Agenda
• Goals and challenges
• Global schema integration (short survey)
• Federated database systems
• An example: IBM's DB2
• User sovereignty & multidatabase language approach

Slide 40
Background
• Garlic
  - research project
  - wrapper architecture (virtual integration)
  - start from standard relational database, extend language and data model to support some object-oriented features
  - cross-source query optimization
• DB2 DataJoiner
  - commercial system
  - combine multiple heterogeneous relational sources
  - focus on query optimization
• DB2 (see Haas et al. 2002)
  - incorporates ideas of both Garlic and DataJoiner
  - user-defined functions to "federate" simple data sources

Slide 41
DB2 architecture for database federation

Slide 42
Styles of federation
• Scalar UDFs (user-defined functions)
  - Input: data from surrounding SQL statement
  - Output: a single scalar result
  - Can federate function (combine data from one source with a function provided by another, in a single statement)
• Table UDFs
  - Input: as in scalar UDFs
  - Output: a table
  - Can federate data
  - Note: UDFs can also be used to access Web services
• Wrappers: Federate function and data
  - A wrapper transforms an external data source to table form
  - This data source / table is then identified by a nickname (and can be queried like a normal local table)
  - Wrappers for a variety of relational and non-relational sources are supplied (e.g., Oracle, Excel, XML)
  - + a toolkit for developing wrappers for other data sources

Slide 43
Examples: Scalar UDFs
• Send a message to an MQSeries queue: db2mq.mqsend()
  - Built-in function
  - [MQSeries: a middleware that allows the exchange of messages between independent applications; all messages are transferred via this queue]
• Send a message with database content to the client application:
    SELECT db2mq.mqsend(a.headline)
    FROM Articles a
    WHERE a.article_timestamp >= CURRENT TIMESTAMP

Slide 44
Examples: Table UDFs
Data source: address book in a Lotus Notes database
    SELECT a.first, a.last, a.phone, a.email
    FROM TABLE (addressbook()) AS a, Company_Profiles c
    WHERE c.industry = 'FINANCIAL'
      AND c.revenue > 50000000
      AND c.name = a.company_name
Data source: local file system
    SELECT f.filename, f.author, f.last_modified_date
    FROM TABLE (dir('\laura\papers', '.pdf')) AS f
    WHERE f.last_modified_date 07/04/2002

Slide 45
Using wrappers to integrate different relational databases (overview)

Slide 46
Using wrappers to integrate different relational databases (sample queries)
1. Register nicknames for transactions from 2 company branches: sf.Transactions, ny.Transactions
2. Create federated view
    CREATE VIEW National_Transactions (store_id, tran_date, tran_id, item_id) AS
      SELECT store_id, tran_date, tran_id, item_id FROM sf.Transactions
      UNION ALL
      SELECT store_id, tran_date, tran_id, item_id FROM ny.Transactions
3. Generate a national sales report
    SELECT MONTH(tran_date), item_id, COUNT(*)
    FROM National_Transactions
    WHERE YEAR(tran_date) = 2001
    GROUP BY MONTH(tran_date), item_id
NB: Can also generate materialized views (cache information locally): CREATE TABLE... AS...
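The three steps on slide 46 can be tried out locally. The following is a minimal sketch, not DB2: it assumes Python's standard `sqlite3` module, with two attached in-memory databases named `sf` and `ny` standing in for the registered nicknames. The sample rows are invented for illustration, and SQLite's `strftime` replaces DB2's `MONTH()` and `YEAR()`, which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1 (stand-in for registering nicknames): each ATTACH of ':memory:'
# creates a separate in-memory database, addressed as sf.* and ny.*.
conn.execute("ATTACH DATABASE ':memory:' AS sf")
conn.execute("ATTACH DATABASE ':memory:' AS ny")
for branch in ("sf", "ny"):
    conn.execute(
        f"CREATE TABLE {branch}.Transactions "
        "(store_id INTEGER, tran_date TEXT, tran_id INTEGER, item_id INTEGER)"
    )

# Invented sample data, only for this illustration.
conn.executemany("INSERT INTO sf.Transactions VALUES (?, ?, ?, ?)", [
    (1, "2001-01-15", 100, 7),
    (1, "2001-01-20", 101, 7),
    (1, "2002-03-01", 102, 7),
])
conn.executemany("INSERT INTO ny.Transactions VALUES (?, ?, ?, ?)", [
    (2, "2001-01-05", 200, 7),
    (2, "2001-02-11", 201, 9),
])

# Step 2: the federated view. It is a TEMP view because SQLite only allows
# temporary views to reference tables in other attached databases.
conn.execute("""
    CREATE TEMP VIEW National_Transactions AS
        SELECT store_id, tran_date, tran_id, item_id FROM sf.Transactions
        UNION ALL
        SELECT store_id, tran_date, tran_id, item_id FROM ny.Transactions
""")

# Step 3: national sales report for 2001, grouped by month and item.
report = conn.execute("""
    SELECT CAST(strftime('%m', tran_date) AS INTEGER) AS month,
           item_id,
           COUNT(*)
    FROM National_Transactions
    WHERE strftime('%Y', tran_date) = '2001'
    GROUP BY month, item_id
    ORDER BY month, item_id
""").fetchall()

print(report)  # [(1, 7, 3), (2, 9, 1)]
```

The report query never mentions `sf` or `ny`; once the view exists, the branch databases are hidden behind one name, which is the point of the wrapper-plus-nickname style. What the sketch cannot show is the cross-source query optimization that the DB2 slides emphasize, since both "sources" here live in the same SQLite process.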
Slide 47
Federation of nonrelational structured data (overview)
(A single XML document may be mapped to multiple nicknames)

Slide 48
Federation of nonrelational structured data (sample query)
Excel spreadsheets nicknames: Items, Suppliers
    SELECT i.mfg, s.id
    FROM Items i, Suppliers s
    WHERE i.id = s.id
      AND i.id = (SELECT g.id
                  FROM (SELECT g.id, COUNT(*),
                               ROWNUMBER() OVER (ORDER BY COUNT(*) DESC) AS rownum
                        FROM National_Transactions g, Items it
                        WHERE it.cat = 'television'
                          AND g.id = it.id
                          AND YEAR(tran_date) = 2001
                        GROUP BY g.id) AS tv_total_2001
                  WHERE rownum = 1)

Slide 49
Agenda
• Goals and challenges
• Global schema integration (short survey)
• Federated database systems
• An example: IBM's DB2
• User sovereignty & multidatabase language approach

Slide 50
Autonomy of data sources / autonomy and sovereignty of users
• Autonomy of data sources is valued highly
  - Degree to which a local data source can operate independently must not be reduced by the integration system
• But what about the autonomy of data receivers?
  - Human users and applications
  - Autonomous: have different information needs, vary in the ways they perceive their domain of interest
  - Using integrated data should be non-intrusive: users should not be forced to adapt to any standard concerning structure and semantics of data they desire

Slide 51
The ASME criteria for evaluating data integration approaches
• Abstraction
  - Shield users from low-level heterogeneities and underlying data sources
• Selection
  - The possibility of user-specific selection of data and data sources for individual integration
• Modeling
  - The availability of means to incorporate user-specific ways to perceive a domain of interest, for which integrated data is desired, in the process of data integration
• Explicit semantics
  - Means for explicitly representing the real-world semantics of data
Do different approaches realize these (or not)? Can we have it all?

Slide 52
Evaluation results (1)

Slide 53
Evaluation results (2)

Slide 54
Evaluation results (3)

Slide 55
Conclusion: Current state of the multi database language approach: disadvantages and future work needed
Lack of distribution and location transparency for users.
Users responsible for
• finding relevant DBs,
• understanding schemas,
• detecting and resolving semantic conflicts,
• performing view integration
Some support offered by the language constructs
Abstracting the user from technical-level issues and supporting user-specific data selection and modeling are conflicting goals (Ziegler 2004, p. 6)

Slide 56
What about user sovereignty like this?
• Yahoo! Pipes is an interactive data aggregator and manipulator that lets you mashup your favorite online data sources.
• Like Unix pipes, simple commands can be combined together to create output that meets your needs:
  - combine many feeds into one, then sort, filter and translate to create your ultimate custom feed.
  - remix your favorite data sources and use the Pipe to power a new application.
  - ...
(http://pipes.yahoo.com/pipes/)

Slide 57
Next lecture
• Goals and challenges
• Global schema integration (short survey)
• Federated database systems
• An example: IBM's DB2
• User sovereignty & multidatabase language approach
• Schema integration

Slide 58
References / background reading; acknowledgements
• Slides 2-35 are based on
  - Meena Nagarajan (2006). Federated database systems. Part I. http://lsdis.cs.uga.edu/~meena/Spring06/ADB/Federated%20Database%20Systems.ppt
  - which in turn reports the classic survey paper: Amit P. Sheth, James A. Larson: Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Comput. Surv. 22(3): 183-236 (1990) (available for example at http://www.cs.auc.dk/~tbp/Teaching/DAT5E00/sheth.pdf)
• The slides on DB2 are based on the paper
  - Haas, L.M., Lin, E.T., & Roth, M.A. (2002). Data integration through database federation. IBM Systems Journal, 41(4), 578-596. http://researchweb.watson.ibm.com/journal/sj/414/haas.pdf
• The slides on user sovereignty and slides 35-38 are based on the paper
  - Ziegler, P. (2004). User-specific semantic integration of heterogeneous data: What remains to be done? IFI, University of Zurich, Technical Report ifi-2004.01. ftp://ftp.ifi.unizh.ch/pub/techreports/TR-2004/ifi-2004.01.pdf
• p. 40:
  - Garlic: M. Tork Roth, P. Schwarz, and L. Haas, An Architecture for Transparent Access to Diverse Data Sources, Component Database Systems, K. R. Dittrich, A. Geppert, Editors, Morgan-Kaufmann Publishers, San Mateo, CA (2001), pp. 175-206.
  - DataJoiner: IBM Corporation, DataJoiner, http://www.software.ibm.com/data/datajoiner