PMIT-6102 Advanced Database Systems
PMIT-6102 Advanced Database Systems
By Jesmin Akhter, Assistant Professor, IIT, Jahangirnagar University
Slide 2
Schedule
Runs from 25.01.2013 to 14.06.2013
Every week on Friday, from 2:00 PM to 5:00 PM
NB: Schedule may change
Slide 3
Grading Policy
Attendance = 10%
Exercise test (instant test, assignment, presentation) = 5%
Class Test (average of three) = 15%
Mid-Term Examination = 30%
Final Examination = 40%
Total = 100%
Slide 4
Course Plan
Introduction (Lecture 01)
Overview of Relational DBMS (Lecture 02, 03)
Distributed Database Design (Lecture 04)
Overview of Query Processing (Lecture 05)
Distributed Query Processing (Lecture 06)
Distributed Transaction Management (Lecture 07)
Distributed Concurrency Control (Lecture 08, 09)
Reliability (Lecture 10, 11)
Parallel Database Systems (Lecture 12, 13)
Distributed Object DBMS (Lecture 14)
Tutorial-1, Tutorial-2, Tutorial-3, Tutorial-4, Mid-Term
Total: 14 lectures + 4 tutorials + Midterm + Final Exam = 20 weeks.
Slide 5
Text Book
Principles of Distributed Database Systems
M. Tamer Özsu & Patrick Valduriez
Slide 6
Exam schedule
Tutorial / Exam          Date
Tutorial-01              22nd February 2013
Tutorial-02              22nd March 2013
Mid-Term Examination     29th March 2013
Tutorial-03              3rd May 2013
Tutorial-04              31st May 2013
Final Examination        14th June 2013
NB: Schedule may change
Lecture 01: Introduction to DDBMS
Slide 8
Outline
Introduction
Distributed Database System
Applications
Distributed DBMS Promises
Problem Areas
Architectural Models for Distributed DBMSs
Slide 9
Database Management
[Figure: application programs 1, 2, and 3 access the database through the DBMS, which handles data description, data manipulation, and control]
Slide 10
Integrate Databases and Communication
[Figure: database technology (integration) combined with computer networks (distribution) gives distributed database systems (integration)]
Slide 11
Distributed Processing (Distributed Computing Systems)
A number of autonomous processing elements that are interconnected by a computer network and that cooperate in performing their assigned tasks.
A “processing element” is a computing device that can execute a program on its own.
Slide 12
What is being distributed?
Processing logic: processing logic or processing elements are distributed.
Functions: various functions of a computer system could be delegated to various pieces of hardware or software.
Data: data used by a number of applications may be distributed to a number of processing sites.
Control: the control of the execution of various tasks might be distributed instead of being performed by one computer system.
Slide 13
What is a Distributed Database System?
“Distributed database system” (DDBS) is used to refer jointly to the distributed database and the distributed DBMS.
A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
Slide 14
What is a Distributed Database System?
Physical distribution does not necessarily imply that the computer systems are geographically far apart; they may be in the same room.
The communication between them is done over a network instead of through shared memory or shared disk (as in multiprocessor systems), with the network as the only shared resource.
Slide 15
What is not a DDBS?
A timesharing computer system.
A loosely or tightly coupled multiprocessor system. This is not a DDBS, because in a DDBS communication between computer systems is done over a network instead of through shared memory or shared disk, with the network as the only shared resource.
A database system which resides at one of the nodes of a network of computers. This is a centralized database on a network node.
Slide 16
Timesharing computer system
The CPU time is shared by different processes
Time slice is defined by the OS, for sharing CPU time between processes.
Slide 17
Shared-Memory Architecture (Tightly Coupled)
[Figure: processors P1 ... Pn sharing memory M and disk D]
Not a DDBS
Slide 18
Shared-Nothing Architecture (Loosely Coupled)
[Figure: processor nodes P1 ... Pn, each with its own memory (M1 ... Mn) and disk (D1 ... Dn)]
Each processor node has its own primary and secondary memory, and may also have its own peripherals.
Such systems are quite similar to the distributed environment, but there are differences. The fundamental difference is the mode of operation.
Database systems that run over multiprocessor systems are called parallel database systems.
Not a DDBS
Slide 19
Centralized DBMS on a Network
[Figure: Sites 1 to 5 connected by a communication network]
Not a DDBS
Slide 20
Distributed DBMS Environment
[Figure: Sites 1 to 5 connected by a communication network]
Slide 21
Distributed DBMS - Reality
[Figure: user queries and user applications are served by several instances of DBMS software, which are linked through a communication subsystem]
Slide 22
Implicit Assumptions
Data is stored at a number of sites; each site logically consists of a single processor.
Processors at different sites are interconnected by a computer network: no multiprocessors (those are parallel database systems).
A distributed database is a database, not a collection of files: the data is logically related, as exhibited in the users’ access patterns (relational data model).
A D-DBMS is a full-fledged DBMS, not a remote file system.
Slide 23
Applications
Manufacturing, especially multi-plant manufacturing
Military command and control
Electronic fund transfers and electronic trading
Corporate MIS
Airline reservations
Hotel chains
Any organization which has a decentralized organization structure
Slide 24
Distributed DBMS Promises
Transparent management of distributed, fragmented, and replicated data
Improved reliability/availability through distributed transactions
Improved performance
Easier and more economical system expansion
Slide 25
Transparent management of distributed, fragmented, and replicated data
Example: four relations:
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET)
SAL(TITLE, AMT)
ASG(ENO, PNO, RESP, DUR)
For a centralized DBMS, find the names and salaries of employees who worked on a project for more than 12 months:
SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE ASG.DUR > 12
AND EMP.ENO = ASG.ENO
AND SAL.TITLE = EMP.TITLE
Fully transparent access means that the users can still pose the query as specified above, without paying any attention to the fragmentation, location, or replication of data, and let the system worry about resolving these issues.
Slide 26
Example

EMP
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E7   R. Davis   Mech. Eng.
E8   J. Jones   Syst. Anal.

ASG
ENO  PNO  RESP        DUR
E1   P1   Manager     12
E2   P1   Analyst     24
E2   P2   Analyst     6
E3   P3   Consultant  10
E3   P4   Engineer    48
E4   P2   Programmer  18
E5   P2   Manager     24
E6   P4   Manager     48
E7   P3   Engineer    36
E7   P5   Engineer    23
E8   P3   Manager     40

PROJ
PNO  PNAME              BUDGET
P1   Instrumentation    150000
P2   Database Develop.  135000
P3   CAD/CAM            250000
P4   Maintenance        310000

SAL
TITLE        AMT
Elect. Eng.  40000
Syst. Anal.  34000
Mech. Eng.   27000
Programmer   24000
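As a quick check of the example query, the sketch below loads the EMP, ASG, and SAL rows from the slide into an in-memory SQLite database and runs the query unchanged. The use of Python's sqlite3 module, the column types, and the helper names build_example_db and long_assignments are assumptions made for illustration; PROJ is left out because the query does not touch it.

```python
import sqlite3

# The query from the slide, unchanged apart from layout.
QUERY = """
SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE ASG.DUR > 12
  AND EMP.ENO = ASG.ENO
  AND SAL.TITLE = EMP.TITLE
"""


def build_example_db():
    """Create an in-memory database holding the slide's sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE EMP (ENO TEXT, ENAME TEXT, TITLE TEXT);
        CREATE TABLE SAL (TITLE TEXT, AMT INTEGER);
        CREATE TABLE ASG (ENO TEXT, PNO TEXT, RESP TEXT, DUR INTEGER);
    """)
    conn.executemany("INSERT INTO EMP VALUES (?, ?, ?)", [
        ("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal."),
        ("E3", "A. Lee", "Mech. Eng."), ("E4", "J. Miller", "Programmer"),
        ("E5", "B. Casey", "Syst. Anal."), ("E6", "L. Chu", "Elect. Eng."),
        ("E7", "R. Davis", "Mech. Eng."), ("E8", "J. Jones", "Syst. Anal."),
    ])
    conn.executemany("INSERT INTO SAL VALUES (?, ?)", [
        ("Elect. Eng.", 40000), ("Syst. Anal.", 34000),
        ("Mech. Eng.", 27000), ("Programmer", 24000),
    ])
    conn.executemany("INSERT INTO ASG VALUES (?, ?, ?, ?)", [
        ("E1", "P1", "Manager", 12), ("E2", "P1", "Analyst", 24),
        ("E2", "P2", "Analyst", 6), ("E3", "P3", "Consultant", 10),
        ("E3", "P4", "Engineer", 48), ("E4", "P2", "Programmer", 18),
        ("E5", "P2", "Manager", 24), ("E6", "P4", "Manager", 48),
        ("E7", "P3", "Engineer", 36), ("E7", "P5", "Engineer", 23),
        ("E8", "P3", "Manager", 40),
    ])
    return conn


def long_assignments(conn):
    """Return (ENAME, AMT) pairs, one per assignment longer than 12 months."""
    return conn.execute(QUERY).fetchall()


rows = long_assignments(build_example_db())
for name, amount in sorted(set(rows)):
    print(name, amount)
```

Because the query has no DISTINCT, an employee with two qualifying assignments appears twice in rows; the loop prints each pair once.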
Slide 27
Transparent management of distributed, fragmented, and replicated data
We may want to localize data so that data about the employees in the Waterloo office are stored in Waterloo, those in the Boston office are stored in Boston, and so forth. The same applies to the project and salary information. That is, data is distributed.
We partition each of the relations and store each partition at a different site. This is known as fragmentation.
Data that are commonly accessed by one user can be placed on that user’s local machine as well as on the machine of another user with the same access requirements. That is, data is replicated.
Slide 28
Transparent Access
SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND SAL.TITLE = EMP.TITLE
[Figure: sites Boston, Montreal, Paris, New York, and Tokyo connected by a communication network. Fragment lists shown at the sites: (Paris projects, Paris employees, Paris assignments, Boston employees); (Montreal projects, Paris projects, New York projects with budget > 200000, Montreal employees, Montreal assignments); (Boston projects, Boston employees, Boston assignments); (Boston projects, New York employees, New York projects, New York assignments)]
Fully transparent access means that the users can still create the query without paying any attention to the fragmentation, location, or replication of data, and let the system worry about resolving these issues.
Slide 29
Transparent management of distributed, fragmented, and replicated data
A transparent system “hides” the implementation details from users.
The fundamental issue is to provide data independence in the distributed environment:
Network (distribution) transparency
Replication transparency
Fragmentation transparency
  horizontal fragmentation: selection
  vertical fragmentation: projection
  hybrid
Slide 30
Data independence
Data independence refers to the immunity of user applications to changes in the definition and organization of data.
Logical data independence refers to the immunity of user applications to changes in the logical structure (i.e., schema) of the database.
Physical data independence deals with hiding the details of the storage structure from user applications.
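One common way to obtain logical data independence is to let applications query a view rather than the base tables. The sketch below is only an illustration under that assumption: the view name EMP_VIEW, the split into EMP_NAME and EMP_TITLE, and the function app_query are all hypothetical and do not come from the slides. The application query is never changed, yet it returns the same rows before and after the schema is restructured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP (ENO TEXT PRIMARY KEY, ENAME TEXT, TITLE TEXT);
    INSERT INTO EMP VALUES ('E1', 'J. Doe', 'Elect. Eng.');
    INSERT INTO EMP VALUES ('E2', 'M. Smith', 'Syst. Anal.');
    CREATE VIEW EMP_VIEW AS SELECT ENO, ENAME, TITLE FROM EMP;
""")


def app_query(connection):
    """The user application: it only ever refers to EMP_VIEW."""
    sql = "SELECT ENAME, TITLE FROM EMP_VIEW ORDER BY ENO"
    return connection.execute(sql).fetchall()


before = app_query(conn)

# Change the logical structure: split EMP into two tables and
# redefine the view on top of the new schema.
conn.executescript("""
    DROP VIEW EMP_VIEW;
    CREATE TABLE EMP_NAME (ENO TEXT PRIMARY KEY, ENAME TEXT);
    CREATE TABLE EMP_TITLE (ENO TEXT PRIMARY KEY, TITLE TEXT);
    INSERT INTO EMP_NAME SELECT ENO, ENAME FROM EMP;
    INSERT INTO EMP_TITLE SELECT ENO, TITLE FROM EMP;
    DROP TABLE EMP;
    CREATE VIEW EMP_VIEW AS
        SELECT n.ENO, n.ENAME, t.TITLE
        FROM EMP_NAME n JOIN EMP_TITLE t ON n.ENO = t.ENO;
""")

after = app_query(conn)
print(before == after)
```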
Slide 31
Network Transparency
In centralized database systems, the only available resource that needs to be shielded from the user is the data.
In a distributed database environment there is a second resource that needs to be managed in much the same manner: the network.
The user should be protected from the operational details of the network, possibly even hiding the existence of the network.
Then there would be no difference between database applications that would run on a centralized database and those that would run on a distributed database.
This type of transparency is referred to as network transparency or distribution transparency.
Slide 32
Network Transparency
From a DBMS perspective, distribution transparency requires that users do not have to specify where data are located.
Sometimes two types of distribution transparency are identified:
Location transparency
Naming transparency
Slide 33
Network Transparency
Location transparency refers to the fact that the command used to perform a task is independent of both the location of the data and the system on which an operation is carried out.
Naming transparency means that a unique name is provided for each object in the database.
In the absence of naming transparency, users are required to embed the location name as part of the object name.
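The difference between embedding the location in an object name and letting the system resolve it can be sketched with a small catalog. Everything here is hypothetical and chosen for illustration: the Catalog class, the site names, the sample rows, and the "site.object" naming convention are assumptions, not part of the slides.

```python
class Catalog:
    """Hypothetical global catalog: one unique name per object, mapped to a site."""

    def __init__(self):
        self._site_of = {}

    def register(self, name, site):
        # Naming transparency relies on every object name being unique.
        if name in self._site_of:
            raise ValueError(f"name {name!r} is already in use")
        self._site_of[name] = site

    def locate(self, name):
        return self._site_of[name]


# Each site stores its own relations (site names are illustrative).
sites = {
    "paris": {"EMP": [("E1", "J. Doe")]},
    "boston": {"PROJ": [("P1", "Instrumentation")]},
}

catalog = Catalog()
catalog.register("EMP", "paris")
catalog.register("PROJ", "boston")


def read_without_transparency(qualified_name):
    # The user embeds the location in the object name, e.g. "paris.EMP".
    site, name = qualified_name.split(".", 1)
    return sites[site][name]


def read_with_transparency(name):
    # The user names only the object; the system resolves the site.
    return sites[catalog.locate(name)][name]


emp_explicit = read_without_transparency("paris.EMP")
emp_transparent = read_with_transparency("EMP")
print(emp_explicit == emp_transparent)
```

If EMP later moves to another site, only the catalog entry changes; calls to read_with_transparency stay the same, whereas every "paris.EMP" reference would have to be edited.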
Slide 34
Replication Transparency
Data are distributed in a replicated fashion across the machines on a network.
If one of the machines fails, a copy of the data is still available on another machine on the network.
Replication increases the reliability and availability of data.
Replication increases the locality of reference.
Slide 35
Replication Transparency
When data are replicated, the transparency issue is:
The users should not be aware of the existence of copies, and the system should handle the management of copies.
The users should not be involved with handling copies or have to specify that a certain action can and/or should be taken on multiple copies.
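A minimal sketch of what "the system handles the copies" could look like is given below. It assumes a simple read-one/write-all scheme, which is chosen only for illustration and is not prescribed by the slides; the ReplicatedItem class and the site names are hypothetical. The user calls read and write without naming any copy.

```python
class ReplicatedItem:
    """One logical data item stored as a copy at each of several sites."""

    def __init__(self, sites, value=None):
        self._copies = {site: value for site in sites}
        self._down = set()

    def fail(self, site):
        self._down.add(site)

    def recover(self, site):
        self._down.discard(site)

    def read(self):
        # Read-one: any reachable copy will do.
        for site, value in self._copies.items():
            if site not in self._down:
                return value
        raise RuntimeError("no copy is reachable")

    def write(self, value):
        # Write-all (assumed scheme): every copy must be reachable,
        # so that all copies always hold the same value.
        if self._down:
            raise RuntimeError("write-all needs every copy to be reachable")
        for site in self._copies:
            self._copies[site] = value


budget = ReplicatedItem(["paris", "boston", "montreal"])
budget.write(150000)
budget.fail("paris")
value_after_failure = budget.read()
print(value_after_failure)
```

The read still succeeds after one site fails, which is the availability benefit the slides describe. The strict write-all rule keeps the sketch short at the cost of blocking updates during a failure.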
Slide 36
Fragmentation Transparency
Fragmentation can increase performance, availability, and reliability.
Fragmentation can reduce the negative effects of replication: each replica is not the full relation but only a subset of it, so less space is required and fewer data items need to be managed.
Slide 37
Fragmentation Transparency
Horizontal fragmentation: a relation is partitioned into a set of sub-relations, each of which has a subset of the tuples (rows) of the original relation.
Vertical fragmentation: each sub-relation is defined on a subset of the attributes (columns) of the original relation.
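The two kinds of fragmentation can be sketched on a few EMP rows. This is an illustration only: the helper function names, the choice of predicate, and keeping the key ENO in every vertical fragment so that the relation can be rebuilt are assumptions, not something the slides specify.

```python
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal."},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Mech. Eng."},
    {"ENO": "E4", "ENAME": "J. Miller", "TITLE": "Programmer"},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep the tuples (rows) that satisfy the predicate."""
    return [dict(t) for t in relation if predicate(t)]


def vertical_fragment(relation, attributes, key="ENO"):
    """Projection: keep the chosen attributes (columns) plus the key."""
    columns = [key] + [a for a in attributes if a != key]
    return [{c: t[c] for c in columns} for t in relation]


def rebuild_from_horizontal(fragments):
    """Union of the horizontal fragments."""
    return [t for fragment in fragments for t in fragment]


def rebuild_from_vertical(left, right, key="ENO"):
    """Join of two vertical fragments on the shared key."""
    by_key = {t[key]: t for t in right}
    return [{**t, **by_key[t[key]]} for t in left]


def is_engineer(t):
    return t["TITLE"].endswith("Eng.")


# Horizontal: engineers in one fragment, everyone else in the other.
engineers = horizontal_fragment(EMP, is_engineer)
others = horizontal_fragment(EMP, lambda t: not is_engineer(t))

# Vertical: names in one fragment, titles in the other.
names = vertical_fragment(EMP, ["ENAME"])
titles = vertical_fragment(EMP, ["TITLE"])

by_eno = lambda t: t["ENO"]
print(sorted(rebuild_from_horizontal([engineers, others]), key=by_eno) == EMP)
print(rebuild_from_vertical(names, titles) == EMP)
```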
Slide 38
Reliability Through Distributed Transactions
Distributed DBMSs improve reliability since they have replicated components and thereby eliminate single points of failure.
The failure of a single site, or the failure of a communication link which makes one or more sites unreachable, is not sufficient to bring down the entire system.
Slide 39
Improved Performance
Data is stored in close proximity to its points of use (also called data localization).
This requires some support for fragmentation and replication.
It has two potential advantages:
Since each site handles only a portion of the database, contention for CPU and I/O services is not as severe as for centralized databases.
Localization reduces remote access delays that are usually involved in wide area networks.
Slide 40
System Expansion
The issue is database scaling.
One aspect of easier system expansion is economics.
It normally costs much less to put together a system of “smaller” computers with the equivalent power of a single big machine.
Slide 41
Problem Areas
First, data may be replicated in a distributed environment. A distributed database can be designed so that the entire database, or portions of it, reside at different sites of a computer network.
Second, if some sites fail (e.g., by either hardware or software malfunction), or if some communication links fail (making some of the sites unreachable) while an update is being executed, the effects will not be reflected on the data residing at the failing or unreachable sites.
Third, since each site cannot have instantaneous information on the actions currently being carried out at the other sites, the synchronization of transactions on multiple sites is considerably harder than for a centralized system.
Slide 42
Architectural Models for Distributed DBMSs
Possible ways in which a distributed DBMS may be architected are classified by:
(1) Autonomy of local systems
(2) Their distribution
(3) Their heterogeneity
Slide 43
Architectural Models for Distributed DBMSs
Autonomy
Autonomy refers to the distribution (or decentralization) of control, not of data.
It indicates the degree to which individual DBMSs can operate independently.
Autonomy is a function of a number of factors, such as whether the component systems (i.e., individual DBMSs) exchange information, whether they can independently execute transactions, and whether one is allowed to modify them.
Slide 44
Architectural Models for Distributed DBMSs
Dimensions of Autonomy
Design autonomy: individual DBMSs are free to use the data models and transaction management techniques that they prefer.
Communication autonomy: each of the individual DBMSs is free to make its own decision as to what type of information it wants to provide to the other DBMSs or to the software that controls their global execution.
Execution autonomy: each DBMS can execute the transactions that are submitted to it in any way that it wants to.
Slide 45
Architectural Models for Distributed DBMSs
Distribution
The distribution dimension of the taxonomy deals with data.
Data is physically distributed over multiple sites; the user sees the data as one logical pool.
There are a number of ways DBMSs have been distributed. Two classes:
client/server distribution
peer-to-peer distribution (or full distribution)
Slide 46
Architectural Models for Distributed DBMSs
Client/server distribution
The client/server distribution concentrates data management duties at servers, while the clients focus on providing the application environment, including the user interface.
The communication duties are shared between the client machines and servers.
Slide 47
Architectural Models for Distributed DBMSs
Peer-to-peer distribution (or full distribution)
In peer-to-peer systems, there is no distinction between client machines and servers.
Each machine has full DBMS functionality and can communicate with other machines to execute queries and transactions.
Slide 48
Architectural Models for Distributed DBMSs
Heterogeneity
Heterogeneity ranges from hardware heterogeneity and differences in networking protocols to variations in data managers.
Heterogeneity in query languages not only involves the use of completely different data access paradigms in different data models, but also covers differences in languages even when the individual systems use the same data model.
Slide 49
Thank You
Slide 50
Exercise
What is the basic difference between database systems and distributed database systems?
What is being distributed?
Define a loosely or tightly coupled multiprocessor system.
Draw the “Distributed Database System - Reality” diagram.
What do you mean by replicated data?
What are the promises of a distributed DBMS?