1 Chapter 13: Distributed Databases. Chapter 13 2 Definitions Distributed Database: A single logical...

Chapter 13: Distributed Databases


Page 1: 1 Chapter 13: Distributed Databases. Chapter 13 2 Definitions Distributed Database: A single logical database that is spread physically across computers.



Definitions

Distributed Database: A single logical database that is spread physically across computers in multiple locations that are connected by a data communications link

Decentralized Database: A collection of independent databases on non-networked computers

They are NOT the same thing!


Reasons for Distributed Database

Business unit autonomy and distribution
Data sharing
Data communication reliability and costs
Multiple application vendors
Database recovery
Transaction and analytic processing


Distributed Database Options

Homogeneous - Same DBMS at each node
• Autonomous - Independent DBMSs
• Non-autonomous - Central, coordinating DBMS
• Easy to manage, difficult to enforce

Heterogeneous - Different DBMSs at different nodes
• Systems – with full or partial DBMS functionality
• Gateways - Simple paths are created to other databases without the benefits of one logical database
• Difficult to manage, preferred by independent organizations


Distributed Database Options

Systems - Supports some or all functionality of one logical database
• Full DBMS Functionality - All dist. DB functions
• Partial-Multi-database - Some dist. DB functions
  Federated - Supports local databases for unique data requests
    • Loose Integration - Local dbs have their own schemas
    • Tight Integration - Local dbs use common schema
  Unfederated - Requires all access to go through a central, coordinating module


Homogeneous, Non-Autonomous Database

Data is distributed across all the nodes
Same DBMS at each node
All data is managed by the distributed DBMS (no exclusively local data)
All access is through one global schema
The global schema is the union of all the local schemas


Typical Heterogeneous Environment

Data distributed across all the nodes
Different DBMSs may be used at each node
Local access is done using the local DBMS and schema
Remote access is done using the global schema


Major Objectives

Location Transparency
• User does not have to know the location of the data
• Data requests automatically forwarded to appropriate sites

Local Autonomy
• Local site can operate with its database when network connections fail
• Each site controls its own data, security, logging, recovery


Significant Trade-Offs

Synchronous Distributed Database
• All copies of the same data are always identical
• Data updates are immediately applied to all copies throughout network
• Good for data integrity
• High overhead → slow response times

Asynchronous Distributed Database
• Some data inconsistency is tolerated
• Data update propagation is delayed
• Lower data integrity
• Less overhead → faster response time

NOTE: all this assumes replicated data (to be discussed later)
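
The trade-off between synchronous and asynchronous updating can be illustrated with a minimal in-memory sketch. The class names (`Replica`, `SynchronousStore`, `AsynchronousStore`) are hypothetical and stand in for sites and a distributed DBMS; no real product works exactly this way.

```python
from collections import deque


class Replica:
    """One copy of the data at one site (hypothetical, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class SynchronousStore:
    """Every write is applied to all copies before it returns."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Overhead grows with the number of copies that must be updated.
        for replica in self.replicas:
            replica.data[key] = value


class AsynchronousStore:
    """A write updates the local copy now and queues the rest for later."""

    def __init__(self, local, remotes):
        self.local = local
        self.remotes = remotes
        self.pending = deque()

    def write(self, key, value):
        self.local.data[key] = value
        self.pending.append((key, value))  # propagation is delayed

    def propagate(self):
        # Apply queued updates to the remote copies in arrival order.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.remotes:
                replica.data[key] = value


a, b = Replica("A"), Replica("B")
sync = SynchronousStore([a, b])
sync.write("price", 10)
sync_identical = a.data == b.data  # copies agree as soon as write returns

c, d = Replica("C"), Replica("D")
lazy = AsynchronousStore(c, [d])
lazy.write("price", 10)
stale_before = d.data.get("price")  # remote copy has not seen the update
lazy.propagate()
fresh_after = d.data.get("price")
```

Between `write` and `propagate`, a reader at site D sees out-of-date data. That window is the inconsistency the asynchronous option tolerates in exchange for a faster write.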


Advantages of Distributed Database over Centralized Databases

Increased reliability/availability
Local control over data
Modular growth
Lower communication costs
Faster response for certain queries


Disadvantages of Distributed Database compared to Centralized Databases

Software cost and complexity
Processing overhead
Data integrity exposure
Slower response for certain queries


Options for Distributing a Database

Data replication
• Copies of data distributed to different sites

Horizontal partitioning
• Different rows of a table distributed to different sites

Vertical partitioning
• Different columns of a table distributed to different sites

Combinations of the above


Data Replication

Advantages -
• Reliability
• Fast response
• May avoid complicated distributed transaction integrity routines (if replicated data is refreshed at scheduled intervals)
• De-couples nodes (transactions proceed even if some nodes are down)
• Reduced network traffic at prime time (if updates can be delayed)


Data Replication

Disadvantages -
• Additional requirements for storage space
• Additional time for update operations
• Complexity and cost of updating
• Integrity exposure of getting incorrect data if replicated data is not updated simultaneously

Therefore, better when used for non-volatile (read-only) data


Types of Data Replication

Push Replication –
• Updating site sends changes to other sites

Pull Replication –
• Receiving sites control when update messages will be processed


Types of Push Replication

Snapshot Replication -
• Changes periodically sent to master site
• Master collects updates in log
• Full or differential (incremental) snapshots
• Dynamic vs. shared update ownership

Near Real-Time Replication -
• Broadcast update orders without requiring confirmation
• Done through use of triggers
• Update messages stored in message queue until processed by receiving site
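
The difference between a full and a differential snapshot can be sketched as follows. This is only an illustration under simplifying assumptions: the `Master` class and the `apply_*` helpers are hypothetical, replicas are plain dictionaries, and deletions are not handled.

```python
class Master:
    """Hypothetical master site that logs updates between snapshots."""

    def __init__(self):
        self.data = {}
        self.log = []  # updates collected since the last snapshot

    def update(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def full_snapshot(self):
        # Ship a copy of everything; the log can then be discarded.
        self.log.clear()
        return dict(self.data)

    def differential_snapshot(self):
        # Ship only what changed; later log entries for a key win.
        changes = dict(self.log)
        self.log.clear()
        return changes


def apply_full(replica, snapshot):
    replica.clear()
    replica.update(snapshot)


def apply_differential(replica, changes):
    replica.update(changes)


master = Master()
master.update("a", 1)
master.update("b", 2)

replica = {}
apply_full(replica, master.full_snapshot())

master.update("b", 3)
changes = master.differential_snapshot()
apply_differential(replica, changes)
```

Here the differential snapshot carries one entry instead of two, which is why incremental refreshes place less demand on the network than complete ones.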


Issues for Data Replication

Data timeliness – high tolerance for out-of-date data may be required
DBMS capabilities – if DBMS cannot support multi-node queries, replication may be necessary
Performance implications – refreshing may cause performance problems for busy nodes
Network heterogeneity – complicates replication
Network communication capabilities – complete refreshes place heavy demand on telecommunications


Horizontal Partitioning

Different rows of a table at different sites

Advantages -
• Data stored close to where it is used → efficiency
• Local access optimization → better performance
• Only relevant data is available → security
• Unions across partitions → ease of query

Disadvantages
• Accessing data across partitions → inconsistent access speed
• No data replication → backup vulnerability


Vertical Partitioning

Different columns of a table at different sites

Advantages and disadvantages are the same as for horizontal partitioning except that combining data across partitions is more difficult because it requires joins (instead of unions)
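
The union-versus-join distinction can be seen in a small sketch that uses Python's built-in `sqlite3` module. Everything lives in one in-memory database, so the "sites" are simulated by separate tables; the table and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Horizontal partitions: same columns, different rows (one table per "site").
cur.execute("CREATE TABLE customer_east (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
cur.execute("CREATE TABLE customer_west (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
cur.execute("INSERT INTO customer_east VALUES (1, 'Ann', 100.0)")
cur.execute("INSERT INTO customer_west VALUES (2, 'Bob', 250.0)")

# Vertical partitions: same rows, different columns, sharing the primary key.
cur.execute("CREATE TABLE customer_contact (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE customer_account (id INTEGER PRIMARY KEY, balance REAL)")
cur.executemany("INSERT INTO customer_contact VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
cur.executemany("INSERT INTO customer_account VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

# Rebuilding the whole table from horizontal partitions needs only a union.
from_horizontal = cur.execute(
    "SELECT id, name, balance FROM customer_east "
    "UNION ALL "
    "SELECT id, name, balance FROM customer_west "
    "ORDER BY id"
).fetchall()

# Rebuilding it from vertical partitions needs a join on the shared key.
from_vertical = cur.execute(
    "SELECT c.id, c.name, a.balance "
    "FROM customer_contact AS c "
    "JOIN customer_account AS a ON a.id = c.id "
    "ORDER BY c.id"
).fetchall()

conn.close()
```

Both queries return the same two customer rows. The vertical case depends on every partition carrying the primary key, which is the extra cost of recombining columns.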


Five Distributed Database Organizations

Centralized database, distributed access
Replication with periodic snapshot update
Replication with near real-time synchronization of updates
Partitioned, one logical database
Partitioned, independent, non-integrated segments


Factors in Choice of Distributed Strategy

Funding, autonomy, security
Site data referencing patterns
Growth and expansion needs
Technological capabilities
Costs of managing complex technologies
Need for reliable service


Distributed DBMS

Distributed database requires distributed DBMS

Functions of a distributed DBMS:
• Locate data with a distributed data dictionary
• Determine location from which to retrieve data and process query components
• DBMS translation between nodes with different local DBMSs (using middleware)
• Data consistency (via multiphase commit protocols)
• Global primary key control
• Scalability
• Security, concurrency, query optimization, failure recovery


Local Transaction Steps

1. Application makes request to distributed DBMS
2. Distributed DBMS checks distributed data repository for location of data. Finds that it is local
3. Distributed DBMS sends request to local DBMS
4. Local DBMS processes request
5. Local DBMS sends results to application
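
The five steps above can be traced in a short sketch. The classes `LocalDBMS` and `DistributedDBMS`, the site names, and the dictionary used as the data repository are all hypothetical; remote requests are deliberately left unimplemented because the slide covers only the local case.

```python
class LocalDBMS:
    """Hypothetical local DBMS holding the tables stored at one site."""

    def __init__(self, site, tables):
        self.site = site
        self.tables = tables

    def process(self, table):
        # Step 4: the local DBMS processes the request.
        return list(self.tables[table])


class DistributedDBMS:
    """Hypothetical distributed DBMS component running at one site."""

    def __init__(self, site, local_dbms, data_repository):
        self.site = site
        self.local_dbms = local_dbms
        self.data_repository = data_repository  # table name -> site name

    def request(self, table):
        # Step 2: look up where the data lives.
        location = self.data_repository[table]
        if location == self.site:
            # Steps 3 to 5: hand off to the local DBMS and return its results.
            return self.local_dbms.process(table)
        raise NotImplementedError("remote requests are outside this sketch")


repository = {"orders": "Denver", "parts": "Tulsa"}
denver_local = LocalDBMS("Denver", {"orders": [(1, "widget"), (2, "gear")]})
ddbms = DistributedDBMS("Denver", denver_local, repository)

# Step 1: the application makes its request to the distributed DBMS.
rows = ddbms.request("orders")

try:
    ddbms.request("parts")
    remote_handled = True
except NotImplementedError:
    remote_handled = False
```

The application never names a site. It asks for `orders`, and the repository lookup decides where the work happens, which is location transparency in miniature.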


Distributed DBMS Transparency Objectives

Location Transparency
• User/application does not need to know where data resides

Replication Transparency
• User/application does not need to know about duplication

Failure Transparency
• Either all or none of the actions of a transaction are committed
• Each site has a transaction manager
    Logs transactions and before and after images
    Concurrency control scheme to ensure data integrity
• Requires special commit protocol


Two-Phase Commit

Prepare Phase
• Coordinator receives a commit request
• Coordinator instructs all resource managers to get ready to “go either way” on the transaction. Each resource manager writes all updates from that transaction to its own physical log
• Coordinator receives replies from all resource managers. If all are ok, it writes commit to its own log; if not then it writes rollback to its log


Two-Phase Commit

Commit Phase
• Coordinator then informs each resource manager of its decision and broadcasts a message to either commit or rollback (abort). If the message is commit, then each resource manager transfers the update from its log to its database
• A failure during the commit phase puts a transaction “in limbo.” This has to be tested for and handled with timeouts or polling
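
The two phases can be walked through in a minimal, single-process sketch. The `ResourceManager` and `Coordinator` classes are hypothetical, the `can_prepare` flag simply simulates a site that votes no, and the "in limbo" failure case (timeouts, polling) is not modeled.

```python
class ResourceManager:
    """Hypothetical resource manager at one site."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare  # False simulates a "not ok" reply
        self.log = []
        self.pending = {}
        self.database = {}

    def prepare(self, updates):
        # Prepare phase: write the updates to the local log and vote.
        if not self.can_prepare:
            return False
        self.pending = dict(updates)
        self.log.append(("prepared", dict(updates)))
        return True

    def commit(self):
        # Commit phase: transfer the logged updates to the database.
        self.database.update(self.pending)
        self.pending = {}
        self.log.append(("committed", None))

    def rollback(self):
        self.pending = {}
        self.log.append(("rolled back", None))


class Coordinator:
    def __init__(self, managers):
        self.managers = managers
        self.log = []

    def run(self, updates_by_manager):
        # Phase 1: ask every resource manager to get ready to go either way.
        votes = [m.prepare(updates_by_manager.get(m.name, {})) for m in self.managers]
        decision = "commit" if all(votes) else "rollback"
        self.log.append(decision)  # the decision is logged before broadcasting
        # Phase 2: broadcast the decision to every resource manager.
        for m in self.managers:
            if decision == "commit":
                m.commit()
            else:
                m.rollback()
        return decision


rm1, rm2 = ResourceManager("site1"), ResourceManager("site2")
ok_decision = Coordinator([rm1, rm2]).run({"site1": {"x": 1}, "site2": {"y": 2}})

rm3, rm4 = ResourceManager("site3"), ResourceManager("site4", can_prepare=False)
bad_decision = Coordinator([rm3, rm4]).run({"site3": {"x": 1}, "site4": {"y": 2}})
```

In the second run a single "no" vote is enough to roll back every site, so `rm3` never applies its update even though it prepared successfully. That is the all-or-nothing property the protocol exists to provide.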


Concurrency Control

Concurrency Transparency
• Design goal for distributed database

Timestamping
• Concurrency control mechanism
• Alternative to locks in distributed databases
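
The slide does not say which timestamping scheme is meant. As an illustration, the sketch below follows the basic timestamp-ordering rule found in standard database texts: each data item remembers the largest timestamps that have read and written it, and an operation that arrives "too late" is rejected, so its transaction would have to restart. The names `Item`, `read`, and `write` are hypothetical.

```python
class Item:
    """A data item plus the timestamps of its latest read and write."""

    def __init__(self, value):
        self.value = value
        self.read_ts = 0
        self.write_ts = 0


def read(item, ts):
    # A transaction may not read a value written by a younger transaction.
    if ts < item.write_ts:
        return False, None
    item.read_ts = max(item.read_ts, ts)
    return True, item.value


def write(item, ts, value):
    # A transaction may not overwrite data already read or written
    # by a younger transaction.
    if ts < item.read_ts or ts < item.write_ts:
        return False
    item.value = value
    item.write_ts = ts
    return True


item = Item(100)
first_read = read(item, 2)       # transaction with timestamp 2 reads
late_write = write(item, 1, 50)  # older transaction 1 arrives too late
good_write = write(item, 3, 70)  # younger transaction 3 may write
stale_read = read(item, 2)       # transaction 2 can no longer read
```

No locks are taken at any point. Conflicts are resolved by comparing timestamps, which avoids distributed lock management at the price of restarting rejected transactions.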


Query Optimization

In a query involving a multi-site join and, possibly, a distributed database with replicated files, the distributed DBMS must decide where to access the data and how to proceed with the join. Three-step process:

1. Query decomposition - rewritten and simplified
2. Data localization - query fragmented so that fragments reference data at only one site
3. Global optimization -
   • Order in which to execute query fragments
   • Data movement between sites
   • Where parts of the query will be executed
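
One part of global optimization, deciding where a join should run, can be illustrated with a deliberately simplified cost model that counts only the bytes shipped between sites. The function names, site names, and sizes below are made up; a real optimizer weighs many more factors.

```python
def shipping_cost(join_site, fragment_bytes, result_bytes, requesting_site):
    """Bytes moved over the network if the join runs at join_site."""
    # Every fragment stored elsewhere must be shipped to the join site.
    moved = sum(size for site, size in fragment_bytes.items() if site != join_site)
    # The result must be shipped back unless the join ran where it was requested.
    if join_site != requesting_site:
        moved += result_bytes
    return moved


def choose_join_site(fragment_bytes, result_bytes, requesting_site):
    """Pick the candidate site with the least data movement."""
    candidates = sorted(set(fragment_bytes) | {requesting_site})
    return min(
        candidates,
        key=lambda site: shipping_cost(site, fragment_bytes, result_bytes, requesting_site),
    )


# Hypothetical sizes: a large table in Chicago, a small one in Boston.
fragments = {"Chicago": 5_000_000, "Boston": 200_000}
result_size = 50_000

best_site = choose_join_site(fragments, result_size, requesting_site="Boston")
best_cost = shipping_cost(best_site, fragments, result_size, "Boston")
cost_at_requester = shipping_cost("Boston", fragments, result_size, "Boston")
```

With these numbers it is cheaper to send the small Boston table to Chicago and ship the result back than to pull the large Chicago table to the requesting site.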


Evolution of Distributed DBMS

“Unit of Work” - All of a transaction’s steps.

Remote Unit of Work
• SQL statements originated at one location can be executed as a single unit of work on a single remote DBMS


Evolution of Distributed DBMS

Distributed Unit of Work
• Different statements in a unit of work may refer to different remote sites
• All databases in a single SQL statement must be at a single site

Distributed Request
• A single SQL statement may refer to tables in more than one remote site
• May not support replication transparency or failure transparency