Distributed Database System. Definition A distributed Database System consists of a collection of...

Distributed Database System

Definition A distributed Database System consists

of a collection of sites, connected together via some kind of communication network, in which: Each site is a full database system site

in its own right, but The sites has agrees to work together so

that a user at any site can access data anywhere in the network exactly as if the data were all stored at the user’s own site.

Workstation

DatabaseDatabase

WorkstationWorkstation

Communicationnetwork

Database ServerDatabase Server

Workstation

WorkstationWorkstation New York London

Each site is a database system site in its own right. Each site has it own local “real” database, Its own local users, Its own local DBMS and transaction Management software

including it own local locking, logging recovery, and etc.), Its own local data communication manager (DC manager)

Overall distributed system can thus be A kind of partnership among the individual local

DBMSs at the individual local site

It has a new S/W component at each site – logical extension of the local DBMS – provides the necessary

partnership functionality, and it is the combination of these new components together

with the existing DBMSs that constitutes what is usually called the “Distributed Database Management System”

Distributed Database (DDB)

DDBDDB is a collection of multiple logically interrelated database distributed over a computer network, and a distributed database management system (DDBMS) as a software system that manages a distributed database while making the distribution transparent to the user.

Type of Distributed Database system

Heterogeneous DDBMS Homogenous DDBMS

Advantage of Distributed Database Why desirable?

The enterprise are usually distributed Data is usually distributed Each organization unit maintain data that is relevant to its

own.

Total information asset of the enterprise is thus splintered into what are sometime called “islands of information”

And what a Distributed system does is provide the necessary “bridge” to connect those islands together.

It enables the structure of database to mirror the structure of the enterprise

Local data can be kept locally, where it most logical belongs While at the same time remote data can be accessed when

necessary

New York

Atlanta

San Francisco

Los Angeles

CommunicationNetwork

Banking systemBanking systemSF accounts keep in SF, NY accounts keep in NY…The advantage are surely obvious: “The distributed arrange combines efficiency of processing (the data kept close to the point where it is most frequency used” with increased accessibility (it is possible to access a LA account from SF, via the communication network)

Function of distributed database

Keeping track of data Distributed query processing Distributed transaction management Replicated data management Distributed data management Distributed data recovery Security Distributed directory (catalog)

management

Fundamental Principle (Rule zero)

The fundamental principle of distributed database“To the user, a distributed system should look exactly

like a non-distributed system”

Local autonomy No reliance on a central site Continuous operation Location independence Fragmentation independence Replication independence Distributed query processing Distributed transaction management Hardware independence Network independence DBMS independence

Local autonomy The sites in a distributed system should be

autonomous. Local autonomous means that all operations at

a given site are controlled by that site; No site X should depend on site Y for its

successful operation Local data is locally owned and managed, with

local accountability; all data “really” belongs to some local database, even if it is accessible from other site

Integrity, security and physical storage representation of local data remain under the control and jurisdiction of the local site

No reliance on a central site All site must be treated as equals. They must not be any reliance on a central

“master” site for some central service – for example

Central query processing Central transaction management Centralized naming services

The entire system is dependent on that central site

Why don’t need First, that central site might be a bottleneck; Second, the system would be vulnerable if the

central site went down, the whole system would be down (“The single point of failure” problem)

Continuous operation The advantage of distributed systems is provide greater

reliability and greater availability Reliability is the probability that the system is up

and running any given moment. Availability is the probability that the system is up

and running continuously (available) during a time interval.

Unplanned shutdowns are undesirable Planned shutdowns should never be required; That is it should never necessary to shut the

system down in order to perform a task such as adding a new site or upgrading the DBMS at an existing site to a new release version.

Location independence/transparency Basic idea

Users should not have to know where data is physically stored, but rather should be able to perform

– at least from a logical standpoint – as if data were all stored at their own local site.

Desirable because It simplifies application programs and user

activities It allows data to migrate from site to site without

invalidating any of those program and activities (migratability is desirable because it allows data

to be move around the network in respond to changing performance requirement)

Fragmentation independence / transparency

Support data fragmentation Base table can be divided into pieces or

fragments for physical storage purposes, and distinct fragments can be stored at different sites.

Fragmentation is desirable for performance reasons: Data can be stored at the location where it is

most frequently used, so that most operations are local and network traffic is reduces

Emp# Dept# Salary

E1 D1 40000E2 D1 42000E3 D2 30000E4 D2 35000E5 D3 48000

Emp# Dept# Salary

E1 D1 40000E2 D1 42000E5 D3 48000

Emp# Dept# Salary

E3 D2 30000E4 D2 35000

Emp

S1_Emp S2_Emp

User perception

Site A Site B

FRAGMENT EMP ASS1_EMP AT SITE ‘SITE_A’ WHERE DEPT# = DEPT#(‘D1’) OR DEPT# = DEPT#(‘D3’) S1_EMP AT SITE ‘SITE_B’ WHERE DEPT# = DEPT#(‘D2’)

Fragmentation Fragmentation type

Horizontal Fragmentation Vertical Fragmentation

Reconstructing the original base relvar from the fragments in done via suitable join (for vertical) and union operations (for horizontal)

Fragmentation independence implies that users will be presented with a view of the data in which the fragments are logically recombines by means of suitable joins and unions. (no fragmentation)

The optimizer responds to determine which fragments need to be physically accessed in order to satisfy any given user request.

Emp where salary > 40000 and dept# = dept#(‘D1’) Optimizer will know from the fragment definitions (in catalog) that

the entire result can be obtained from site_A

Replication independence Support Data replication Desirable because

First, it can mean better performance. Application can operate on local copies instead of

having to communicate with remote sites. Second, it can also mean better availability.

A given replicated object remains available for processing – at least for retrieval as long as at least one copy reminds available

Disadvantage A given replicated object is updates, all copies of that

object must be updated (the update propagation)

Replication transparency Transparency to the user User should be able to behave, at least from a logical

standpoint, as if the data were in fact not replicated at all.

Desirable It simplifies application programs and end-user

activities; It allows replicas to be created and destroy anytime

in response to changing requirements, without invalidating any of those programs or activities.

Replication independence implies that it is the responsibility of the optimizer to determine which replicas physically need to be access in order to satisfy any given user request.

Emp# Dept# Salary

E1 D1 40000E2 D1 42000E5 D3 48000

Emp# Dept# Salary

E3 D2 30000E4 D2 35000

Emp# Dept# Salary

E1 D1 40000E2 D1 42000E5 D3 48000

Emp# Dept# Salary

E3 D2 30000E4 D2 35000

Site_A Site_B

S1_Emp S2_Emp

S12_Emp (S2_EMP replica) S21_Emp (S1_EMP replica)

Distributed query processing

In distributed system data store in many sites and may replicate

Optimization is even more important in a distributed system that it is in a centralized one.

Query that involve several sites

A BCN

Database S{S#,CITY} 10,000 stored at Site A P{P#,Color} 100,000 stored at site B SP{S#,P#} 1,000,000 stored at A Assume every stored tuple is 25 bytes (200bits) Query (Get supplier numbers for LD suppliers of

red parts” ((S JOIN SP JOIN P) where CITY = “LD” and COLOR = (‘Red’)) {S#} Estimated cardinalities of certain intermediate

results: Number of read parts = 10 Number of shipments by LD suppliers = 100,000

Communication assumptions: Data Rate = 50,000 bits per second Access delay = 0.1 second

6 strategies for processing this query and for each i calculate the total communication time Ti from the formula (total access delay) + (total data volume/data rate)

Become in second No of message/10 + No of bits/50000

1. Move parts to Site A and process the query at AT1 = 0.1 + (100000 * 200) /50000

2. Move supplier and shipments to site B and process the query at BT2 = 0.2 + ((10000 + 100000) * 200)/5000

3. Join suppliers and shipments at site A, restrict the result to LD suppliers and then, for each of those supplier in turn, check site B to see whether corresponding part is red. Each of these checks will involved 2 messages – a query and a respond. The transmission time for these messages will be small compared with the access delay

T3 = 20000 seconds approx.

Restrict parts at site B of those the red, and then, for each of those parts in turns, check site A to see whether there exists a shipment relating the part to a LD supplier. Each of these checks will involve 2 messages; transmission time for these message will be small compared with the access delay

T4 = 2 seconds approx.

Join supplier and shipments at site A, restrict the result to LD suppliers, project the result over S# and P#, and move the result to site B. Complete the processing at site B

T5 = 0.1 + (10000 * 200)/50000

Restrict parts at site B to those that are red and move the result to site A. complete the processing at site A

T6 = 0.1 + (10 * 200) / 50000

Distributed transaction management

Related to Transaction management Recovery and concurrency

2 phase commit Prepare phase Commit phase(see the previous slide)

Hardware / Network / DBMS independence

Distributed Database System. Definition A distributed Database System consists of a collection of...

Documents

Transcript of Distributed Database System. Definition A distributed Database System consists of a collection of...