

Introduction to Database Management System

Importance: Database systems have become an essential component of life in modern society, in that many frequently occurring events trigger the accessing of at least one database: bibliographic library searches, bank transactions, hotel/airline reservations, grocery store purchases, etc.

Traditional vs. more recent applications of databases:

The applications mentioned above are all "traditional" ones for which the use of rigidly-structured textual and numeric data suffices. Recent advances have led to the application of database technology to a wider class of data. Examples include multimedia databases (involving pictures, video clips, and sound messages) and geographic databases (involving maps, satellite images).

Also, database search techniques are applied by some WWW search engines.

Definitions:

Def 1: A shared collection of logically related data, designed to meet the information needs of multiple users in an organization.

Def 2: A data structure that stores metadata, i.e. data about data. More generally we can say an organized collection of information.

Def 3: A collection of related information about a subject organized in a useful manner that provides a base or foundation for procedures such as retrieving information, drawing conclusions, and making decisions.

Def 4: The term database is often used, rather loosely, to refer to just about any collection of related data. Elmasri & Navathe say that, in addition to being a collection of related data, a database must have the following properties:

It represents some aspect of the real (or an imagined) world, called the mini world or universe of discourse. Changes to the mini world are reflected in the database. Imagine, for example, a UNIVERSITY mini world concerned with students, courses, course sections, grades, and course prerequisites.

It is a logically coherent collection of data, to which some meaning can be attached. (Logical coherency requires, in part, that the database not be self-contradictory.)

It has a purpose: there is an intended group of users and some preconceived applications that the users are interested in employing.

To summarize: a database has some source (i.e., the mini world) from which data are derived, some degree of interaction with events in the represented mini world, and an audience that is interested in using it.

Aside: data vs. information vs. knowledge: Data is the representation of "facts" or "observations" whereas information refers to the meaning thereof (according to some interpretation) or processed data. Knowledge, on the other hand, refers to the ability to use information to achieve intended ends.

Computerized vs. manual: Not surprisingly (this being a CS course), our concern will be with computerized database systems, as opposed to manual ones, such as the card catalog-based systems that were used in libraries in ancient (i.e., before the year 2000) times.

Size/Complexity: Databases range from small and simple to huge and complex.

Definition 1: A database management system (DBMS) is general-purpose system software that facilitates each of the following (with respect to a database):

defining: specifying data types, data organization, and constraints to which the data must conform

constructing: the process of storing the data on some medium (e.g., magnetic disk) that is controlled by the DBMS

manipulating: querying, updating, report generation

Sharing: allowing multiple users and programs to access the database concurrently.

Protection & security: protection from hardware and software crashes and security from unauthorized access

Definition 2: A database management system is software, i.e., a collection of programs, that performs operations on data and manages that data.

Two basic operations performed by the DBMS are:

Management of Data in the Database

Management of Users associated with the database.

Management of the data means specifying how data will be stored, structured, and accessed in the database. Management of database users means managing the users so that they can perform the operations they need on the database, while also ensuring that no user can perform an operation for which he or she is not authorized. In general, a DBMS is a collection of programs performing all necessary actions associated with a database.

Differences of the Database Approach and File Processing System:

Database approach vs. File processing approach: Consider an organization/enterprise that is organized as a collection of departments/offices. Each department has certain data processing "needs", many of which are unique to it. In the file processing approach, each department would "own" a collection of relevant data and software applications to manipulate that data.

For example, a university's Registrar's Office would maintain (most likely, with the aid of programmers employed by the university's "computer center") data (and programs) relevant to student grades and course enrollments. The A.O Office would maintain data (and programs) regarding fees owed by students for tuition, room and board, etc.

One result of this approach is, typically, data redundancy, which not only wastes storage space but also makes it more difficult to keep changing data up-to-date, as a change to one copy of some data item must be made to all of them (called duplication-of-effort). Inconsistency results when one (or more) copies of a datum are changed but not others. (E.g., If you change your address, informing the Department's Office should suffice to ensure that your grades are sent to the right place, but does not guarantee that your next bill will be, as the copy of your address "owned" by the A.Os Office might not have been changed.)

In the database approach, a single repository of data is maintained that is used by all the departments in the organization.

Disadvantages of File processing systems: Data are stored in files in all information systems. Files are collections of similar records. Data storage is built around the corresponding application that uses the files.

File Processing Systems: Storing data in individual files is a very old, but still often used, approach to system development.

Each program (system) often had its own unique set of files.

Diagrammatic representation of conventional file systems

Users of file processing systems are almost always at the mercy of the Information Systems department to write programs that manipulate stored data and produce needed information such as printed reports and screen displays.

What is a file, then? A File is a collection of data about a single entity.

Files are typically designed to meet needs of a particular department or user group.

Files are also typically designed to be part of a particular computer application

Advantages: File processing systems are relatively easy to design and implement since they are normally based on a single application.

The processing speed is faster than other ways of storing data.

Disadvantages: Program-data dependence.

Duplication of data.

Limited data sharing.

Lengthy program and system development time.

Excessive program maintenance when the system changed.

Duplication of data items in multiple files. Duplication affects input, maintenance, and storage, and can lead to data integrity problems.

Inflexibility and non-scalability. Since conventional files are designed to support a single application, the original file structure cannot support new requirements.

Today, the trend is in favor of replacing file-based systems and applications with database systems and applications.

Main Characteristics of database approach:

1. Self-Description: A database system not only includes the data stored that is of relevance to the organization but also a complete definition/description of the database's structure and constraints. This meta-data (i.e., data about data) is stored in the so-called system catalog, which contains a description of the structure of each file, the type and storage format of each field, and the various constraints on the data (i.e., conditions that the data must satisfy).
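For example, in SQL DBMSs that expose the standard INFORMATION_SCHEMA views, the catalog can itself be queried like ordinary data; the table name STUDENT below is only an assumed example.

    SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_name = 'STUDENT';
    -- returns one row per field of STUDENT, as recorded in the catalog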

The system catalog is used not only by users (e.g., who need to know the names of tables and attributes, and sometimes data type information and other things), but also by the DBMS software, which certainly needs to "know" how the data is structured/organized in order to interpret it in a manner consistent with that structure. Recall that a DBMS is general purpose, as opposed to being a specific database application. Hence, the structure of the data cannot be "hard-coded" in its programs, but rather must be treated as a "parameter" in some sense.

2. Insulation between Programs and Data (Data Abstraction):

Program-Data Independence: In traditional file processing, the structure of the data files accessed by an application is "hard-coded" in its source code. (E.g., consider a student file in a C program that uses an array of structures: the structure declaration gives a detailed description of the records in the file.)

If, for some reason, we decide to change the structure of the data (e.g., by adding another field Blood Group), every application in which a description of that file's structure is hard-coded must be changed!

In contrast, DBMS access programs, in most cases, do not require such changes, because the structure of the data is described (in the system catalog) separately from the programs that access it and those programs consult the catalog in order to ascertain the structure of the data (i.e., providing a means by which to determine boundaries between records and between fields within records) so that they interpret that data properly.

In other words, the DBMS provides a conceptual or logical view of the data to application programs, so that the underlying implementation may be changed without the programs being modified. (This is referred to as program-data independence.)
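As a rough sketch of this independence in SQL (the STUDENT table and column names here are assumptions): adding the Blood Group field mentioned above is a single change recorded in the catalog, and existing access programs keep working because they consult the catalog rather than a hard-coded record layout.

    ALTER TABLE STUDENT ADD COLUMN BloodGroup VARCHAR(3);
    -- An existing query such as the following still runs unchanged:
    SELECT Name, Address FROM STUDENT;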

Also, the access paths (e.g., indexes) that exist are listed in the catalog, helping the DBMS determine the most efficient way to search for items in response to a query.

3. Multiple Views of Data: Different users (e.g., in different departments of an organization) have different "views" or perspectives on the database. For example, from the point of view of the A.O. Office, student data does not include anything about which courses were taken or which grades were earned. (This is an example of a subset view.)

As another example, a Registrar's Office employee might think that PERCENTAGE is a field of data in each student's record. In reality, the underlying database might calculate that value each time it is called for. This is called virtual (or derived) data.

A view designed for an academic advisor might give the appearance that the data is structured to point out the prerequisites of each course. A good DBMS has facilities for defining multiple views. This is not only convenient for users, but also addresses security issues of data access.
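A minimal sketch of a subset view in SQL, assuming a hypothetical STUDENT table and columns: the A.O. Office sees only billing-related data, and grade data stays hidden (which also helps with access security).

    CREATE VIEW STUDENT_BILLING AS
        SELECT RNO, Name, Address, FeesDue
        FROM STUDENT;
    -- A.O. Office staff query STUDENT_BILLING instead of STUDENT.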

4. Data Sharing and Multi-user Transaction Processing: As you learned about in the OS course, the simultaneous access of computer resources by multiple users/processes is a major source of complexity. The same is true for multi-user DBMS's.

Arising from this is the need for concurrency control, which is supposed to ensure that several users trying to update the same data do so in a "controlled" manner so that the results of the updates are as though they were done in some sequential order (rather than interleaved, which could result in data being incorrect).

This gives rise to the concept of a transaction, which is a process that makes one or more accesses to a database and which must have the appearance of executing in isolation from all other transactions (even ones that access the same data at the "same time") and of being atomic (in the sense that, if the system crashes in the middle of its execution, the database contents must be as though it did not execute at all).

Applications such as airline reservation systems are known as online transaction processing applications.
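A sketch of a transaction in SQL, using an assumed ACCOUNT table: the two updates must appear atomic and isolated, so either both take effect or, if the system crashes before COMMIT, neither does.

    START TRANSACTION;
    UPDATE ACCOUNT SET Balance = Balance - 500 WHERE AccNo = 'A-101';
    UPDATE ACCOUNT SET Balance = Balance + 500 WHERE AccNo = 'A-202';
    COMMIT;
    -- Concurrency control makes interleaved transfers behave as if
    -- they had run one after another (serializability).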

Actors on the Scene

These apply to "large" databases, not "personal" databases that are defined, constructed, and used by a single person via, say, MS Access.

1. Database Administrator (DBA): This is the chief administrator, who oversees and manages the database system (including the data and software). Duties include authorizing users to access the database, coordinating/monitoring its use, acquiring hardware/software for upgrades, etc. In large organizations, the DBA might have a support staff.

2. Database Designers: They are responsible for identifying the data to be stored and for choosing an appropriate way to organize it. They also define views for different categories of users. The final design must be able to support the requirements of all the user sub-groups.

3. End Users: These are persons who access the database for querying, updating, and report generation. They are the main reason for the database's existence!

Casual end users: use database occasionally, needing different information each time; use query language to specify their requests; typically middle- or high-level managers.

Naive/Parametric end users: Typically the biggest group of users; frequently query/update the database using standard canned transactions that have been carefully programmed and tested in advance. Examples:

Bank tellers check account balances and post withdrawals/deposits.

Reservation clerks for airlines, hotels, etc., check availability of seats/rooms and make reservations.

Shipping clerks (e.g., at DTDC) who use buttons, bar code scanners, etc., to update status of in-transit packages.

Sophisticated end users: engineers, scientists, business analysts who implement their own applications to meet their complex needs.

Stand-alone users: Use "personal" databases, possibly employing a special-purpose (e.g., financial) software package.

4. System Analysts, Application Programmers, Software Engineers:

System Analysts: determine needs of end users, especially naive and parametric users, and develop specifications for canned transactions that meet these needs.

Application Programmers: Implement, test, document, and maintain programs that satisfy the specifications mentioned above.

Workers behind the Scene

DBMS system designers/implementers: provide the DBMS software that is at the foundation of all this!

Tool developers: design and implement software tools facilitating database system design, performance monitoring, creation of graphical user interfaces, prototyping, etc.

Operators and maintenance personnel: responsible for the day-to-day operation of the system.

Capabilities of DBMS's

These are additional characteristics of DBMS.

1. Controlling Redundancy: Data redundancy (such as tends to occur in the "file processing" approach) leads to wasted storage space, duplication of effort (when multiple copies of a datum need to be updated), and a higher likelihood of the introduction of inconsistency.

a. On the other hand, redundancy can be used to improve performance of queries. Indexes, for example, are entirely redundant, but help the DBMS in processing queries more quickly.

b. A DBMS should provide the capability to automatically enforce the rule that no inconsistencies are introduced when data is updated.

2. Restricting Unauthorized Access: A DBMS should provide a security and authorization subsystem, which is used for specifying restrictions on user accounts. Common kinds of restrictions are to allow read-only access (no updating), or access only to a subset of the data.
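For instance, read-only access to part of the data can be expressed in SQL roughly as follows; the view from the earlier sketch and the account name clerk1 are assumptions.

    GRANT SELECT ON STUDENT_BILLING TO clerk1;
    -- clerk1 may read the view but cannot INSERT, UPDATE, or DELETE,
    -- and has no access to data outside this grant.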

3. Providing Persistent Storage for Program Objects: Object-oriented database systems make it easier for complex runtime objects (e.g., lists, trees) to be saved in secondary storage so as to survive beyond program termination and to be retrievable at a later time.

4. Providing Storage Structures for Efficient Query Processing: The DBMS maintains indexes (typically in the form of trees and/or hash tables) that are utilized to improve the execution time of queries and updates. (The choice of which indexes to create and maintain is part of physical database design and tuning and is the responsibility of the DBA.)

a. The query processing and optimization module is responsible for choosing an efficient query execution plan for each query submitted to the system.
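Creating such an access path is usually a one-line DDL statement; a sketch, assuming a STUDENT table that is frequently searched by Name:

    CREATE INDEX idx_student_name ON STUDENT (Name);
    -- The query optimizer may now use this index for queries such as
    --   SELECT * FROM STUDENT WHERE Name = 'Keerthy';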

5. Providing Backup and Recovery: The subsystem having this responsibility ensures that recovery is possible in the case of a system crash during execution of one or more transactions.

6. Providing Multiple User Interfaces: For example, query languages for casual users, programming language interfaces for application programmers, forms and/or command codes for parametric users, menu-driven interfaces for stand-alone users.

7. Representing Complex Relationships Among Data: A DBMS should have the capability to represent such relationships and to retrieve related data quickly.

8. Enforcing Integrity Constraints: Most database applications are such that the semantics (i.e., meaning) of the data require that it satisfy certain restrictions in order to make sense. Perhaps the most fundamental constraint on a data item is its data type, which specifies the universe of values from which its value may be drawn. (E.g., a Grade field could be defined to be of type Grade Type, which, say, we have defined as including precisely the values in the set { "A", "A-", "B+", ..., "F" }.)

a. Another kind of constraint is referential integrity, which says that if the database includes an entity that refers to another one, the latter entity must exist in the database.
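Both kinds of constraint can be declared in SQL DDL; the table and column names below are assumptions for illustration only, and the foreign key assumes a STUDENT table whose primary key is RNO.

    CREATE TABLE ENROLLMENT (
        RNO        INT NOT NULL,
        SectionId  INT NOT NULL,
        Grade      VARCHAR(2) CHECK (Grade IN ('A', 'A-', 'B+', 'B', 'F')),
        PRIMARY KEY (RNO, SectionId),
        FOREIGN KEY (RNO) REFERENCES STUDENT (RNO)  -- referential integrity
    );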

9. Permitting Inference and Actions Via Rules: In a deductive database system, one may specify declarative rules that allow the database to infer new data! E.g., Figure out which students are on academic probation. Such capabilities would take the place of application programs that would be used to ascertain such information otherwise.

a. Active database systems go one step further by allowing "active rules" that can be used to initiate actions automatically.

10. Potential for enforcing standards: This is very crucial for the success of database applications in large organizations. Standards refer to data item names, display formats, screens, report structures, meta-data (descriptions of data), etc.

11. Reduced application development time: incremental time to add each new application is reduced.

12. Flexibility to change data structures: database structure may evolve as new requirements are defined.

13. Availability of up-to-date information: very important for on-line transaction systems such as airline, hotel, and car reservation systems.

14. Economies of scale: by consolidating data and applications across departments wasteful overlap of resources and personnel can be avoided.

The database system

The above figure represents the database system, which is a collection of the database, the DBMS, application programs, and users.

When not to use a DBMS:

For the reasons below, it may be better to use a file processing system.

Main inhibitors of using a DBMS:

High initial investment and possible need for additional hardware.

Overhead for providing generality, security, concurrency control, recovery, and integrity functions.

When a DBMS may be unnecessary:

If the database and applications are simple, well defined, and not expected to change.

If there are stringent real-time requirements that may not be met because of DBMS overhead.

If access to data by multiple users is not required.

When no DBMS may suffice: If the database system is not able to handle the complexity of data because of modeling limitations.

If the database users need special operations not supported by the DBMS.

Data Models, Schemas, and Instances

One fundamental characteristic of the database approach is that it provides some level of data abstraction by hiding details of data storage that are irrelevant to database users.

A data model is a set of concepts to describe the structure of a database, the operations for manipulating these structures, and certain constraints that the database should obey.

By structure is meant the data types, relationships, and constraints that should hold for the data. Most data models also include a set of basic operations for specifying retrievals/updates.

Data Model Building Blocks

Entity

Anything about which data are to be collected and stored

Attribute

A characteristic of an entity (e.g. last name)

Relationship

An association among (two or more) entities

One-to-many (1:M) relationship

Many-to-many (M:N or M:M) relationship

One-to-one (1:1) relationship

Data Model Operations:

These operations are used for specifying database retrievals and updates by referring to the constructs of the data model.

Operations on the data model may include basic model operations (e.g. generic insert, delete, update) and user-defined operations (e.g. compute_student_gpa, update_inventory)

The Importance of Data Models

Good database design uses an appropriate data model as its foundation

End-users have different views and needs for data

Data model organizes data for various users.

The Evolution of Data Models

Hierarchical

Network

Relational

Entity relationship

Object oriented

Object-oriented data models include the idea of objects having behavior (i.e., applicable methods) being stored in the database (as opposed to purely "passive" data).

According to C.J. Date (one of the leading database experts), a data model is an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact. The objects allow us to model the structure of data; the operators allow us to model its behavior.

In the relational data model, data is viewed as being organized in two-dimensional tables comprised of tuples of attribute values. This model has operations such as Project, Select, and Join.
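These operations correspond closely to SQL constructs; a rough sketch using assumed STUDENT and ENROLLMENT tables:

    -- Select (choose tuples):    SELECT * FROM STUDENT WHERE RNO = 101;
    -- Project (choose columns):  SELECT Name, RNO FROM STUDENT;
    -- Join (combine relations):
    SELECT s.Name, e.Grade
    FROM STUDENT s JOIN ENROLLMENT e ON s.RNO = e.RNO;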

A data model is not to be confused with its implementation, which is a physical realization on a real machine of the components of the abstract machine that together constitute that model.

Logical vs. physical!!

There are other well-known data models that have been the basis for database systems. The best-known models pre-dating the relational model are the hierarchical (in which the entity types form a tree) and the network (in which the entity types and the relationships between them form a more general graph).

Categories of Data Models (based on degree of abstractness):

High-level/conceptual: (e.g., ER model ) provides a view close to the way users would perceive data; uses concepts such as

entity: real-world object or concept (e.g., student, employee, course, department, event)

attribute: some property of interest describing an entity (e.g., height, age, color)

relationship: an interaction among entities (e.g., works-on relationship between an employee and a project)

Representational / implementational: intermediate level of abstractness; example is relational data model (or network). Also called record-based model.

Low-level / physical: gives details as to how data is stored in computer system, such as record formats, orderings of records, access paths (indexes).

Schemas, Instances, and Database State

One must distinguish between the description of a database and the database itself. The former is called the database schema, which is specified during design and is not expected to change often.

The actual data stored in the database probably changes often. The data in the database at a particular time is called the state of the database, or a snapshot.

Schema is also called intension.

State is also called extension.

Three-Schema Architecture: This idea was first described by the ANSI/SPARC committee in the late 1970s. The goal is to separate (i.e., insert layers of "insulation" between) user applications and the physical database. It is an ideal that few real-life DBMS's achieve fully.

internal/physical schema: describes the physical storage structure (using a low-level data model)

conceptual schema: describes the (logical) structure of the whole database for a community of users. Hides physical storage details, concentrating upon describing entities, data types, relationships, user operations, and constraints. Can be described using either high-level or implementational data model.

external schema (or user views): Each such schema describes part of the database that a particular category of users is interested in, hiding rest of database. Can be described using either high-level or implementational data model. (In practice, usually described using same model as is the conceptual schema.)

Users (including application programs) submit queries that are expressed with respect to the external level. It is the responsibility of the DBMS to transform such a query into one that is expressed with respect to the internal level (and to transform the result, which is at the internal level, into its equivalent at the external level).

Example: Select students with GPA > 3.5.
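At the external level this request might be written in SQL as follows (STUDENT and GPA are assumed names); the DBMS must translate it into operations on the stored files at the internal level.

    SELECT Name, GPA
    FROM STUDENT
    WHERE GPA > 3.5;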

Q: How is this accomplished?

A: By virtue of mappings between the levels:

external/conceptual mapping (providing logical data independence)

conceptual/internal mapping (providing physical data independence)

Data independence is the capacity to change the schema at one level of the architecture without having to change the schema at the next higher level. We distinguish between logical and physical data independence according to which two adjacent levels are involved. The former refers to the ability to change the conceptual schema without changing the external schemas. The latter refers to the ability to change the internal schema without having to change the conceptual schema.

Logical Data Independence:

The capacity to change the conceptual schema without having to change the external schemas and their associated application programs.

Physical Data Independence:

The capacity to change the internal schema without having to change the conceptual schema.

For an example of physical data independence, suppose that the internal schema is modified (because we decide to add a new index, or change the encoding scheme used in representing some field's value, or stipulate that some previously unordered file must be ordered by a particular field). Then we can change the mapping between the conceptual and internal schemas in order to avoid changing the conceptual schema itself.

Not surprisingly, the process of transforming data via mappings can be costly (performance-wise), which is probably one reason that real-life DBMS's don't fully implement this 3-schema architecture.

DBMS Languages

DDL: Data definition (conceptual schema, possibly internal/external); a brief DDL/DML example appears after this list.

Used by the DBA and database designers to specify the conceptual schema of a database.

In many DBMSs, the DDL is also used to define internal and external schemas (views).

In some DBMSs, separate storage definition language (SDL) and view definition language (VDL) are used to define internal and external schemas.

SDL: Storage definition (internal schema)

SDL is typically realized via DBMS commands provided to the DBA and database designers

VDL: View definition (external schema)

DML: Data manipulation (retrieval, update)

High-Level or Non-procedural Languages: These include the relational language SQL

May be used in a standalone way or may be embedded in a programming language

Low Level or Procedural Languages:

These must be embedded in a programming language
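A brief illustration of the DDL/DML distinction in SQL (the table and columns are assumptions): the first statement defines a schema object, while the remaining statements update and retrieve data.

    CREATE TABLE COURSE (                -- DDL
        CourseNo  VARCHAR(10) PRIMARY KEY,
        Title     VARCHAR(60),
        Credits   INT
    );

    INSERT INTO COURSE VALUES ('CS305', 'Database Systems', 4);   -- DML (update)
    SELECT Title FROM COURSE WHERE Credits >= 4;                  -- DML (retrieval)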

DBMS Interfaces

Menu-based: popular for browsing on the web

Forms-based: designed for naive users

GUI-based: (Point and Click, Drag and Drop, etc.)

Natural language: requests in written English

Special purpose for parametric users: Bank clerks use interfaces with a minimal set of function keys.

For DBA: Creating user accounts, granting authorizations

Setting system parameters

Changing schemas or access paths.

Component modules of a DBMS and their interactions

Stored data manager: this module of the DBMS controls access to DBMS information that is stored on disk, whether it is part of the database or the catalog. The dotted lines illustrate the accesses.

DDL compiler: it processes schema definitions, specified in the DDL, and stores descriptions of the schemas in the DBMS catalog.

DBMS catalog: includes information such as the names and data types of data items, storage details of each file, mapping information among schemas, and constraints. These are accessed by various modules of the DBMS as shown by dotted lines.

Run time database processor: handles database accesses at runtime; it receives retrieval or update operations and carries them out on the database.

Query compiler: handles high level queries that are entered interactively. It parses, analyzes and compiles or interprets a query by creating database access code and then generates calls to the runtime database processor.

Precompiler: Extracts DML commands from an application program written in a host language. These are sent to the DML compiler for further processing.

DBMS UTILITIES

To perform certain functions such as:

Loading data stored in files into a database. Includes data conversion tools.

Backing up the database periodically on tape.

Reorganizing database file structures.

Report generation utilities.

Performance monitoring utilities.

Other functions, such as sorting, user monitoring, data compression, etc.

Data dictionary / repository:

Used to store schema descriptions and other information such as design decisions, application program descriptions, user information, and usage standards. It contains all the information stored in the catalog, but it is accessed by users rather than by the DBMS.

Classification of DBMS's

Based upon

underlying data model (e.g., relational, object, object-relational, network)

multi-user vs. single-user

centralized vs. distributed

cost

general-purpose vs. special-purpose

types of access path options

Data Modeling Using the Entity-Relationship Model

Outline of Database Design:

A simplified diagram to illustrate the main phases of database design

The main phases of database design are depicted in Figure:

Requirements Collection and Analysis: purpose is to produce a description of the users' requirements.

Conceptual Design: purpose is to produce a conceptual schema for the database, including detailed descriptions of entity types, relationship types, and constraints. All these are expressed in terms provided by the data model being used.

Implementation: purpose is to transform the conceptual schema (which is at a high/abstract level) into a (lower-level) representational/implementational model supported by whatever DBMS is to be used.

Physical Design: purpose is to decide upon the internal storage structures, access paths (indexes), etc., that will be used in realizing the representational model produced in previous phase

Entity-Relationship (ER) Model

Our focus now is on the second phase, conceptual design, for which The Entity-Relationship (ER) Model is a popular high-level conceptual data model.

In the ER model, the main concepts are entity, attribute, and relationship.

Entities and Attributes

Entity: An entity represents some "thing" (in the miniworld) that is of interest to us, i.e., about which we want to maintain some data. An entity could represent a physical object (e.g., house, person, automobile, widget) or a less tangible concept (e.g., company, job, academic course).

Attribute: An entity is described by its attributes, which are properties characterizing it. Each attribute has a value drawn from some domain (set of meaningful values).

Example: A STUDENT entity might be described by RNO, Name, BirthDate, etc., attributes, each having a particular value.

What distinguishes an entity from an attribute is that the latter is strictly for the purpose of describing the former and is not, in and of itself, of interest to us. It is sometimes said that an entity has an independent existence, whereas an attribute does not. In performing data modeling, however, it is not always clear whether a particular concept deserves to be classified as an entity or "only" as an attribute.

We can classify attributes along these dimensions:

simple/atomic vs. composite

single-valued vs. multi-valued (or set-valued)

stored vs. derived

A composite attribute is one that is composed of smaller parts. An atomic attribute is indivisible or indecomposable.

Example 1: A BirthDate attribute can be viewed as being composed of (sub-)attributes for month, day, and year.

Example 2: An Address attribute can be viewed as being composed of (sub-)attributes for street address, city, state, and zip code. A street address can itself be viewed as being composed of a number, street name, and apartment number. As this suggests, composition can extend to a depth of two (as here) or more.

To describe the structure of a composite attribute, one can draw a tree; in text, it is customary to write its name followed by a parenthesized list of its sub-attributes. For the examples mentioned above, we would write

BirthDate(Month, Day, Year)
Address(StreetAddr(StrNum, StrName, AptNum), City, State, Zip)

A hierarchy of composite attributes

Single- vs. multi-valued attribute: Consider a PERSON entity. The person it represents has (one) SSN, (one) date of birth, (one, although composite) name, etc. But that person may have zero or more academic degrees, dependents, or (if the person is a male living in Tenali) spouses! How can we model this via attributes AcademicDegrees, Dependents, and Spouses? One way is to allow such attributes to be multi-valued, which is to say that we assign to them a (possibly empty) set of values rather than a single value.

To distinguish a multi-valued attribute from a single-valued one, it is customary to enclose the former within curly braces (which makes sense, as such an attribute has a value that is a set, and curly braces are traditionally used to denote sets). Using the PERSON example from above, we would depict its structure in text as

PERSON(SSN, Name, BirthDate(Month, Day, Year), { AcademicDegrees(School, Level, Year) }, { Dependents }, ...)

Here we have taken the liberty to assume that each academic degree is described by a school, level (e.g., B.S., Ph.D.), and year. Thus, AcademicDegrees is not only multi-valued but also composite. We refer to an attribute that involves some combination of multi-valuedness and compositeness as a complex attribute.

A more complicated example of a complex attribute is AddressPhone. This attribute is for recording data regarding addresses and phone numbers of a business. The structure of this attribute allows for the business to have several offices, each described by an address and a set of phone numbers that ring into that office. Its structure is given by

{ AddressPhone( { Phone(AreaCode, Number) }, Address(StrAddr(StrNum, StrName, AptNum), City, State, Zip)) }

Stored vs. derived attribute: Perhaps independent and derivable would be better terms for these (or non-redundant and redundant). In any case, a derived attribute is one whose value can be calculated from the values of other attributes, and hence need not be stored. Example: Age can be calculated from BirthDate, assuming that the current date is accessible.
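A sketch of a derived attribute in SQL: Age is not stored but computed on demand in a view (the PERSON table is assumed, and the year subtraction gives only an approximate age).

    CREATE VIEW PERSON_WITH_AGE AS
        SELECT SSN, Name,
               EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM BirthDate) AS Age
        FROM PERSON;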

The Null value: In some cases a particular entity might not have an applicable value for a particular attribute. Or that value may be unknown. Or, in the case of a multi-valued attribute, the appropriate value might be the empty set.

Example: The attribute DateOfDeath is not applicable to a living person and its correct value may be unknown for some persons who have died.

In such cases, we use a special attribute value, called null. There has been some argument in the database literature about whether a different approach (such as having distinct values for not applicable and unknown) would be superior.

Entity Types, Entity Sets, Keys, and Domains

Above we mentioned the concept of a PERSON entity, i.e., a representation of a particular person via the use of attributes such as Name, Sex, etc. Chances are good that, in a database in which one such entity exists, we will want many others of the same kind to exist also, each of them described by the same collection of attributes. Of course, the values of those attributes will differ from one entity to another (e.g., one person will have the name "keerthy" and another will have the name "sai"). Just as likely is that we will want our database to store information about other kinds of entities, such as business transactions or academic courses, which (of course) will be described by entirely different collections of attributes.

This illustrates the distinction between entity types and entity instances. An entity type serves as a template for a collection of entity instances, all of which are described by the same collection of attributes. That is, an entity type is analogous to a class in object-oriented programming and an entity instance is analogous to a particular object (i.e., instance of a class).

In ER modeling, we deal only with entity types, not with instances or occurrences. In an ER diagram, each entity type is denoted by a rectangular box.

An entity set is the collection of all entities of a particular type that exist, in a database, at some moment in time.

Key Attributes of an Entity Type: A minimal collection of attributes (often only one) that, by design, distinguishes any two (simultaneously-existing) entities of that type. In other words, if attributes A1 through Am together form a key of entity type E, and e and f are two entities of type E existing at the same time, then e and f must differ in the value of at least one of the attributes Ai (1 <= i <= m).

Relationship Types and Structural Constraints: The structural constraints on a binary relationship type are its cardinality ratio (1:1, 1:N, N:1, or M:N), which specifies the maximum number of relationship instances in which an entity can participate, and its participation constraint, described next.

participation: specifies whether or not the existence of an entity depends upon its being related to another entity via the relationship.

total participation (or existence dependency): To say that entity type A is constrained to participate totally in relationship R is to say that if (at some moment in time) R's instance set is

{ (a1, b1), (a2, b2), ... (am, bm) },

then (at that same moment) A's instance set must be { a1, a2, ..., am }. In other words, there can be no member of A's instance set that does not participate in at least one instance of R.

According to our informal description of COMPANY, every employee must be assigned to some department. That is, every employee instance must participate in at least one instance of WORKS_FOR, which is to say that EMPLOYEE satisfies the total participation constraint with respect to the WORKS_FOR relationship.

In an ER diagram, if entity type A must participate totally in relationship type R, the two are connected by a double line.

partial participation: the absence of the total participation constraint! (E.g., not every employee has to participate in MANAGES; hence we say that, with respect to MANAGES, EMPLOYEE participates partially. This is not to say that for all employees to be managers is not allowed; it only says that it need not be the case that all employees are managers.)

Attributes of Relationship Types

Relationship types, like entity types, can have attributes. A good example is WORKS_ON, each instance of which identifies an employee and a project on which (s)he works. In order to record (as the specifications indicate) how many hours are worked by each employee on each project, we include Hours as an attribute of WORKS_ON. In the case of an M:N relationship type (such as WORKS_ON), allowing attributes is vital. In the case of a 1:N (or N:1) relationship type, relationship attributes can instead be migrated to the entity type on the N side; in the case of a 1:1 relationship type, they can be given to either participating entity type. For example, the StartDate attribute of the 1:1 MANAGES relationship type can be given to either the EMPLOYEE or the DEPARTMENT entity type.

Weak Entity Types: An entity type that has no set of attributes that qualify as a key is called weak. (Ones that do are strong.)

An entity of a weak entity type is uniquely identified by the specific entity to which it is related (by a so-called identifying relationship that relates the weak entity type with its so-called identifying or owner entity type) in combination with some set of its own attributes (called a partial key).

Example: A DEPENDENT entity is identified by its first name together with the EMPLOYEE entity to which it is related via DEPENDS_ON. (Note that this wouldn't work for former heavyweight boxing champion George Foreman's sons, as they all have the name "George"!)

Because an entity of a weak entity type cannot be identified otherwise, that type has a total participation constraint (i.e., existence dependency) with respect to the identifying relationship.

This should not be taken to mean that any entity type on which a total participation constraint exists is weak. For example, DEPARTMENT has a total participation constraint with respect to MANAGES, but it is not weak.

In an ER diagram, a weak entity type is depicted with a double rectangle and an identifying relationship type is depicted with a double diamond.

Design Choices for ER Conceptual Design: Sometimes it is not clear whether a particular miniworld concept ought to be modeled as an entity type, an attribute, or a relationship type. Here are some guidelines (given with the understanding that schema design is an iterative process in which an initial design is refined repeatedly until a satisfactory result is achieved):

As happened in our development of the ER model for COMPANY, if an attribute of entity type A serves as a reference to an entity of type B, it may be wise to refine that attribute into a binary relationship involving entity types A and B. It may well be that B has a corresponding attribute referring back to A, in which case it, too, is refined into the aforementioned relationship. In our COMPANY example, this was exemplified by the Projects and ControllingDept attributes of DEPARTMENT and PROJECT, respectively.

An attribute that exists in several entity types may be refined into its own entity type. For example, suppose that in a UNIVERSITY database we have entity types STUDENT, INSTRUCTOR, and COURSE, all of which have a Department attribute. Then it may be wise to introduce a new entity type, DEPARTMENT, and then to follow the preceding guideline by introducing a binary relationship between DEPARTMENT and each of the three aforementioned entity types.

An entity type that is involved in very few relationships (say, zero, one, or possibly two) could be refined into an attribute (of each entity type to which it is related).

Summary of notation for ER diagrams

Enhanced ER Model

The ER model is generally sufficient for "traditional" database applications. But more recent applications of DB technology (e.g., CAD/CAM, telecommunication, images/graphics, multimedia, data mining/warehousing, geographic info systems) cry out for a richer model.

The EER model extends the ER model by, in part, adding the concept of specialization (and its inverse, generalization), which is analogous to the same-named concept (also called extension or subclassing) from object-oriented design/programming. An entity type may be recognized as having one or more subclasses, with respect to some criterion. This represents a specialization of the entity type. A subclass inherits the features of its parent (the entity type), and can be given its own "local" features. By features we mean not only attributes but also relationships. (This is entirely analogous to what we find in OO programming languages, where a subclass inherits its parent's features (instance variables and methods) and can be defined to have new ones specific to it.)

Using E&N's example , in Figure 4.1 to illustrate, suppose that we have an entity type EMPLOYEE with attributes Name, SSN, BirthDate, and Address. Specializing this entity type with respect to job type, we might identify SECRETARY, TECHNICIAN, and ENGINEER as subclasses, with attributes TypingSpeed, TGrade, and EngType, respectively. The idea is that (at any moment in time) every member of SECRETARY's entity set is also a member of EMPLOYEE's entity set. And similarly for members of the entity sets of TECHNICIAN and ENGINEER. Hence, a SECRETARY entity, also being an EMPLOYEE entity, can participate in relationship types involving EMPLOYEE (such as WORKS_FOR, although no such relationship type is shown in the figure).

Specialization is the process of defining a set of subclasses of an entity type, usually on the basis of some distinguishing characteristics (or "dimensions") of the entities in the entity type. Interestingly, you may introduce multiple specializations of a single class, each based upon a different distinguishing characteristic of (the entities of) that class.

For example, in Figure 4.1, EMPLOYEE is specialized according to job type (resulting in subclasses SECRETARY, TECHNICIAN, and ENGINEER) and also on the basis of method of pay (resulting in subclasses SALARIED_EMPLOYEE and HOURLY_EMPLOYEE). This results in a situation (not typically seen in OO programming) in which a single entity can be an instance of two sibling subclasses (or, more generally, of two subclasses neither of which is an ancestor of the other).

A reasonable question to ask is What is the purpose of extending the ER model to include specialization? A few answers:

So as not to carry around null-valued attributes that don't apply

To define relationships in which only subclass entities may participate (e.g., trade union membership is applicable only to hourly employees)

The term generalization refers to the inverse of specialization. That is, generalization refers to recognizing the need/benefit of introducing a superclass to one or more classes that have already been postulated. Hence, generalization builds a class hierarchy in a bottom-up manner whereas specialization builds it in a top-down manner.

Constraints and Characteristics of Specialization/Generalization

Constraints on Specialization and Generalization:

A subclass D of a class C is said to be predicate-defined (or condition-defined) if, for any member of the entity set of C, we can determine whether or not it also is a member of D by examining the values in its attributes (and seeing whether those values satisfy the so-called defining predicate). If all subclasses in a given specialization are predicate-defined, then the specialization is said to be attribute-defined.

For example, Figure 4.4 shows an EER diagram indicating that the value of an EMPLOYEE entity's Job_Type attribute determines in which subclass (if any) that entity has membership.

When there is no algorithm for determining membership in a subclass, we say that the subclass is user-defined. For such subclasses, membership of entities cannot be decided automatically and hence must be specified by a user.

Disjointness vs. overlapping constraint:

Let C be a class and S1, ..., Sk be the (immediate) subclasses of C arising from some specialization thereof. For this specialization to satisfy the disjointness constraint requires that no instance of C be an instance of more than one of the Si's. In other words, for every i and j with 0 < i < j <= k, the entity sets of Si and Sj must be disjoint. If this condition does not hold, the subclasses are said to overlap (the overlapping constraint).

The Relational Data Model

Tuple: a mapping from attributes to corresponding values, e.g., { Name --> "Keerthy", Sex --> Male, IQ --> 786 }.

Relation: A (named) set of tuples all of the same form (i.e., having the same set of attributes). The term table is a loose synonym.

Relational Schema: used for describing (the structure of) a relation. E.g., R(A1, A2, ..., An) says that R is a relation with attributes A1, ... An. The degree of a relation is the number of attributes it has, here n.

Example: STUDENT(Name, SSN, Address)

One would think that a "complete" relational schema would also specify the domain of each attribute.

Relational Database: A collection of relations, each one consistent with its specified relational schema.

Characteristics of Relations

Ordering of Tuples: A relation is a set of tuples; hence, there is no order associated with them. That is, it makes no sense to refer to, for example, the 5th tuple in a relation. When a relation is depicted as a table, the tuples are necessarily listed in some order, of course, but you should attach no significance to that order. Similarly, when tuples are represented on a storage device, they must be organized in some fashion, and it may be advantageous, from a performance standpoint, to organize them in a way that depends upon their content.

Ordering of Attributes: A tuple is best viewed as a mapping from its attributes (i.e., the names we give to the roles played by the values comprising the tuple) to the corresponding values. Hence, the order in which the attributes are listed in a table is irrelevant. (Note that, unfortunately, the set theoretic operations in relational algebra (at least how Elmasri& Navathe define them) make implicit use of the order of the attributes. Hence, E & N view attributes as being arranged as a sequence rather than a set.)

The Null value: used for "value unknown" or "not applicable".

Interpretation of a Relation: Each relation can be viewed as a predicate and each tuple an assertion that that predicate is satisfied (i.e., has value true) for the combination of values in it. In other words, each tuple represents a fact.

Keep in mind that some relations represent facts about entities whereas others represent facts about relationships (between entities).

Relational Model Constraints and Relational Database Schemas

Constraints on databases can be categorized as follows:

inherent model-based: Example: no two tuples in a relation can be duplicates (because a relation is a set of tuples)

schema-based: can be expressed using DDL; this kind is the focus of this section.

application-based: are specific to the "business rules" of the miniworld and typically difficult or impossible to express and enforce within the data model. Hence, it is left to application programs to enforce them.

Elaborating upon schema-based constraints:

Domain Constraints:

All the values that appear in a column of a relation must be taken from the same domain. A domain usually consists of the following components.

1. Domain Name

2. Meaning

3. Data Type

4. Size or length

5. Allowable values or allowable range (if applicable)

6. Data Format

Key Constraints: A relation is a set of tuples, and each tuple's "identity" is given by the values of its attributes. Hence, it makes no sense for two tuples in a relation to be identical (because then the two tuples are actually one and the same tuple). That is, no two tuples may have the same combination of values in their attributes.

Usually the miniworld dictates that there be (proper) subsets of attributes for which no two tuples may have the same combination of values. Such a set of attributes is called a superkey of its relation. From the fact that no two tuples can be identical, it follows that the set of all attributes of a relation constitutes a superkey of that relation.

A key is a minimal superkey, i.e., a superkey such that, if we were to remove any of its attributes, the resulting set of attributes fails to be a superkey.

Example: Suppose that we stipulate that a faculty member is uniquely identified by Name and Address and also by Name and Department, but by no single one of the three attributes mentioned. Then { Name, Address, Department } is a (non-minimal) superkey and each of { Name, Address } and { Name, Department } is a key (i.e., minimal superkey).

Candidate key: any key!

Primary key: a key chosen to act as the means by which to identify tuples in a relation. Typically, one prefers a primary key to be one having as few attributes as possible.
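The faculty example above could be declared in SQL roughly as follows (the table itself is hypothetical): the primary key is the candidate key chosen for identification, and the other candidate key is declared UNIQUE.

    CREATE TABLE FACULTY (
        Name        VARCHAR(40) NOT NULL,
        Address     VARCHAR(80) NOT NULL,
        Department  VARCHAR(30) NOT NULL,
        PRIMARY KEY (Name, Address),        -- chosen candidate key
        UNIQUE (Name, Department)           -- the other candidate key
    );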

Relational Databases and Relational Database Schemas

A relational database schema is a set of schemas for its relations together with a set of integrity constraints. A relational database state/instance/snapshot is a set of states of its relations such that no integrity constraint is violated.

Entity Integrity, Referential Integrity, and Foreign Keys

Entity Integrity Constraint: In a tuple, none of the values of the attributes forming the relation's primary key may have the (non-)value null.

Referential Integrity Constraint: A foreign key of relation R is a set of its attributes intended to be used (by each tuple in R) for identifying/referring to a tuple in some relation S. (R is called the referencing relation and S the referenced relation.) For this to make sense, the set of attributes of R forming the foreign key should "correspond to" some superkey of S. Indeed, by definition we require this superkey to be the primary key of S.

This constraint says that, for every tuple in R, the tuple in S to which it refers must actually be in S. Note that a foreign key may refer to a tuple in the same relation and that a foreign key may be part of a primary key. A foreign key may have the value null, in which case it does not refer to any tuple in the referenced relation. A set of attributes FK in relation schema R1 is a foreign key of R1 that references relation R2 if it satisfies the following two rules:

1. The attributes in FK have the same domain(s) as the primary key attributes PK of R2; the attributes of FK are said to reference or refer to the relation R2.

2. A value of FK in a tuple t1 of the current state r1(R1) either occurs as a value of PK for some tuple t2 in the current state r2(R2) or is null. In the former case, we have t1[FK] = t2[PK], and we say that the tuple t1 references or refers to the tuple t2.
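A sketch of these rules in SQL, using assumed EMPLOYEE and DEPARTMENT tables: Dno in EMPLOYEE is a foreign key referencing the primary key of DEPARTMENT, and it is allowed to be null.

    CREATE TABLE DEPARTMENT (
        Dnumber  INT PRIMARY KEY,
        Dname    VARCHAR(30)
    );

    CREATE TABLE EMPLOYEE (
        SSN   CHAR(9) PRIMARY KEY,   -- entity integrity: never null
        Name  VARCHAR(40),
        Dno   INT,                   -- foreign key; may be null
        FOREIGN KEY (Dno) REFERENCES DEPARTMENT (Dnumber)
    );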

Semantic Integrity Constraints: application-specific restrictions that are unlikely to be expressible in DDL. Examples:

salary of a supervisee cannot be greater than that of her/his supervisor

salary of an employee cannot be lowered

Update Operations and Dealing with Constraint Violations

For each of the update operations (Insert, Delete, and Update), we consider what kinds of constraint violations may result from applying it and how we might choose to react.

Insert:

domain constraint violation: some attribute value is not of correct domain

entity integrity violation: key of new tuple is null

key constraint violation: key of new tuple is same as existing one

referential integrity violation: foreign key of new tuple refers to non-existent tuple

Ways of dealing with it: reject the attempt to insert, or give the user an opportunity to try again with different attribute values.

Delete:

Referential integrity violation: a tuple referring to the deleted one exists.

Three options for dealing with it (a SQL sketch follows this list):

Reject the deletion

Attempt to cascade (or propagate) by deleting any referencing tuples (plus those that reference them, etc., etc.)

modify the foreign key attribute values in referencing tuples to null or to some valid value referencing a different tuple
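The cascade and set-null options are declared together with the foreign key itself; rejecting the deletion is the default behavior. A sketch continuing the EMPLOYEE table assumed earlier (the WORKS_ON columns are also assumptions):

    CREATE TABLE WORKS_ON (
        ESSN   CHAR(9),
        Pno    INT,
        Hours  DECIMAL(4,1),
        PRIMARY KEY (ESSN, Pno),
        FOREIGN KEY (ESSN) REFERENCES EMPLOYEE (SSN) ON DELETE CASCADE
    );
    -- ON DELETE SET NULL would instead null out the foreign key in the
    -- referencing tuples (not possible here, since ESSN is part of the key).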

Update:

Key constraint violation: primary key is changed so as to become same as another tuple's

referential integrity violation:

foreign key is changed and new one refers to nonexistent tuple

primary key is changed and now other tuples that had referred to this one violate the constraint

ER-to-Relational Mapping Algorithm

Step 1: Mapping of Regular Entity Types

Step 2: Mapping of Weak Entity Types

Step 3: Mapping of Binary 1:1 Relationship Types

Step 4: Mapping of Binary 1:N Relationship Types.

Step 5: Mapping of Binary M:N Relationship Types.

Step 6: Mapping of Multivalued attributes.

Step 7: Mapping of N-ary Relationship Types

Step 1: Mapping of Regular Entity Types.

For each regular (strong) entity type E in the ER schema, create a relation R that includes all the simple attributes of E.

Choose one of the key attributes of E as the primary key for R. If the chosen key of E is composite, the set of simple attributes that form it will together form the primary key of R.

Example: We create the relations EMPLOYEE, DEPARTMENT, and PROJECT in the relational schema corresponding to the regular entities in the ER diagram. SSN, DNUMBER, and PNUMBER are the primary keys for the relations EMPLOYEE, DEPARTMENT, and PROJECT as shown.

Step 2: Mapping of Weak Entity Types

For each weak entity type W in the ER schema with owner entity type E, create a relation R and include all simple attributes (or simple components of composite attributes) of W as attributes of R.

In addition, include as foreign key attributes of R the primary key attribute(s) of the relation(s) that correspond to the owner entity type(s).

The primary key of R is the combination of the primary key(s) of the owner(s) and the partial key of the weak entity type W, if any.

Example: Create the relation DEPENDENT in this step to correspond to the weak entity type DEPENDENT. Include the primary key SSN of the EMPLOYEE relation as a foreign key attribute of DEPENDENT (renamed to ESSN).

The primary key of the DEPENDENT relation is the combination {ESSN, DEPENDENT_NAME} because DEPENDENT_NAME is the partial key of DEPENDENT.
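A SQL sketch of the resulting DEPENDENT relation (column types are assumptions); note the composite primary key and the foreign key back to the owner EMPLOYEE:

CREATE TABLE DEPENDENT (
  ESSN            CHAR(9)     NOT NULL,   -- owner's primary key SSN, renamed to ESSN
  DEPENDENT_NAME  VARCHAR(30) NOT NULL,   -- partial key of the weak entity type
  PRIMARY KEY (ESSN, DEPENDENT_NAME),
  FOREIGN KEY (ESSN) REFERENCES EMPLOYEE (SSN)
);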

Step 3: Mapping of Binary 1:1 Relationship Types

For each binary 1:1 relationship type R in the ER schema, identify the relations S and T that correspond to the entity types participating in R. There are three possible approaches:

(1) Foreign Key approach: Choose one of the relations, say S, and include as a foreign key in S the primary key of T. It is better to choose an entity type with total participation in R in the role of S.

Example: The 1:1 relationship type MANAGES is mapped by choosing the participating entity type DEPARTMENT to serve in the role of S, because its participation in the MANAGES relationship type is total (see the SQL sketch after the three options).

(2) Merged relation option: An alternate mapping of a 1:1 relationship type is possible by merging the two entity types and the relationship into a single relation. This may be appropriate when both participations are total.

(3) Cross-reference or relationship relation option: The third alternative is to set up a third relation R for the purpose of cross-referencing the primary keys of the two relations S and T representing the entity types.
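A SQL sketch of the foreign key approach (1), as in the example above: a variant of the DEPARTMENT table from the earlier sketches, now carrying the MANAGES relationship. The UNIQUE constraint enforces the 1:1 cardinality; column types and the NOT NULL choice are assumptions.

CREATE TABLE DEPARTMENT (
  DNUMBER INT PRIMARY KEY,
  MGRSSN  CHAR(9) NOT NULL UNIQUE,        -- NOT NULL: total participation; UNIQUE: 1:1
  FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE (SSN)
);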

Step 4: Mapping of Binary 1:N Relationship Types.

For each regular binary 1:N relationship type R, identify the relation S that represents the participating entity type at the N-side of the relationship type.

Include as foreign key in S the primary key of the relation T that represents the other entity type participating in R.

Include any simple attributes of the 1:N relationship type as attributes of S.

Example: 1:N relationship types WORKS_FOR, CONTROLS, and SUPERVISION in the figure. For WORKS_FOR we include the primary key DNUMBER of the DEPARTMENT relation as foreign key in the EMPLOYEE relation and call it DNO.

Step 5: Mapping of Binary M:N Relationship Types.

For each regular binary M:N relationship type R, create a new relation S to represent R.

Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types; their combination will form the primary key of S.

Also include any simple attributes of the M:N relationship type (or simple components of composite attributes) as attributes of S.

Example: The M:N relationship type WORKS_ON from the ER diagram is mapped by creating a relation WORKS_ON in the relational database schema. The primary keys of the PROJECT and EMPLOYEE relations are included as foreign keys in WORKS_ON and renamed PNO and ESSN, respectively.

Attribute HOURS in WORKS_ON represents the HOURS attribute of the relation type. The primary key of the WORKS_ON relation is the combination of the foreign key attributes {ESSN, PNO}.
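A SQL sketch of WORKS_ON (the HOURS type is an assumption):

CREATE TABLE WORKS_ON (
  ESSN  CHAR(9)      NOT NULL,
  PNO   INT          NOT NULL,
  HOURS DECIMAL(4,1),                        -- simple attribute of the M:N relationship type
  PRIMARY KEY (ESSN, PNO),                   -- combination of the two foreign keys
  FOREIGN KEY (ESSN) REFERENCES EMPLOYEE (SSN),
  FOREIGN KEY (PNO)  REFERENCES PROJECT (PNUMBER)
);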

Step 6: Mapping of Multivalued attributes.

For each multivalued attribute A, create a new relation R. This relation R will include an attribute corresponding to A, plus the primary key attribute K (as a foreign key in R) of the relation that represents the entity type or relationship type that has A as an attribute.

The primary key of R is the combination of A and K. If the multivalued attribute is composite, we include its simple components.

Example: The relation DEPT_LOCATIONS is created. The attribute DLOCATION represents the multivalued attribute LOCATIONS of DEPARTMENT, while DNUMBER (as a foreign key) represents the primary key of the DEPARTMENT relation. The primary key of DEPT_LOCATIONS is the combination {DNUMBER, DLOCATION}.
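A SQL sketch of DEPT_LOCATIONS (the DLOCATION type is an assumption):

CREATE TABLE DEPT_LOCATIONS (
  DNUMBER    INT         NOT NULL,
  DLOCATION  VARCHAR(30) NOT NULL,           -- one row per location of a department
  PRIMARY KEY (DNUMBER, DLOCATION),          -- the owner's key plus the multivalued attribute
  FOREIGN KEY (DNUMBER) REFERENCES DEPARTMENT (DNUMBER)
);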

Step 7: Mapping of N-ary Relationship Types.

For each n-ary relationship type R, where n > 2, create a new relation S to represent R.

Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types.

Also include any simple attributes of the n-ary relationship type (or simple components of composite attributes) as attributes of S.

Example: The ternary relationship type SUPPLY in the ER diagram can be mapped to the relation SUPPLY in the relational schema, whose primary key is the combination of the three foreign keys {SNAME, PARTNO, PROJNAME}.

SUMMARY: Correspondence between ER and Relational Models

ER Model                         Relational Model
Entity type                      Entity relation
1:1 or 1:N relationship type     Foreign key (or relationship relation)
M:N relationship type            Relationship relation and two foreign keys
n-ary relationship type          Relationship relation and n foreign keys
Simple attribute                 Attribute
Composite attribute              Set of simple component attributes
Multivalued attribute            Relation and foreign key
Value set                        Domain
Key attribute                    Primary (or secondary) key

Relational Algebra: A Brief Introduction

Relational algebra and relational calculus are formal languages associated with the relational model.

Informally, relational algebra is a (high-level) procedural language and relational calculus a non-procedural language.

However, formally both are equivalent to one another.

A language that can produce any relation derivable using relational calculus is said to be relationally complete.

Relational algebra operations work on one or more relations to define another relation without changing the original relations.

Both operands and results are relations, so output from one operation can become input to another operation.

Allows expressions to be nested, just as in arithmetic. This property is called closure.

What? Why? Similar to normal algebra (as in 2+3*x-y), except we use relations as values instead of numbers.

Not used as a query language in actual DBMSs. (SQL instead.)

The inner, lower-level operations of a relational DBMS are, or are similar to, relational algebra operations. We need to know about relational algebra to understand query execution and optimization in a relational DBMS.

Some advanced SQL queries require explicit relational algebra operations, most commonly outer join.

SQL is declarative, which means that you tell the DBMS what you want, but not how it is to be calculated. A C++ or Java program is procedural, which means that you have to state, step by step, exactly how the result should be calculated. Relational algebra is (more) procedural than SQL. (Strictly speaking, relational algebra consists of mathematical expressions.)

It provides a formal foundation for operations on relations.

It is used as a basis for implementing and optimizing queries in DBMS software.

DBMS programs add more operations which cannot be expressed in the relational algebra.

Relational calculus (tuple and domain calculus systems) also provides a foundation, but is more difficult to use. We'll skip these for now.

Relational Algebra :

Relational algebra is the basic set of operations for the relational model

These operations enable a user to specify basic retrieval requests (or queries)

The result of an operation is a new relation, which may have been formed from one or more input relations

This property makes the algebra closed (all objects in relational algebra are relations)

The algebra operations thus produce new relations

These can be further manipulated using operations of the same algebra

A sequence of relational algebra operations forms a relational algebra expression

The result of a relational algebra expression is also a relation that represents the result of a database query (or retrieval request)

Relational Algebra consists of several groups of operations

Unary Relational Operations

SELECT (symbol: σ (sigma))

PROJECT (symbol: π (pi))

RENAME (symbol: ρ (rho))

Relational Algebra Operations From Set Theory

UNION ( ∪ ), INTERSECTION ( ∩ ), DIFFERENCE (or MINUS, − )

CARTESIAN PRODUCT ( x )

Binary Relational Operations

JOIN (several variations of JOIN exist)

DIVISION

Additional Relational Operations

OUTER JOINS, OUTER UNION

AGGREGATE FUNCTIONS (these compute summary information: for example, SUM, COUNT, AVG, MIN, MAX)

All examples discussed below refer to the COMPANY database.

Relational Algebra Operations:

Unary Relational Operations: SELECT

The SELECT operation (denoted by σ (sigma)) is used to select a subset of the tuples from a relation based on a selection condition.

The selection condition acts as a filter

Keeps only those tuples that satisfy the qualifying condition

Tuples satisfying the condition are selected whereas the other tuples are discarded (filtered out)

Examples:

Select the EMPLOYEE tuples whose department number is 4:

σ DNO=4 (EMPLOYEE)

Select the employee tuples whose salary is greater than $30,000:

σ SALARY>30000 (EMPLOYEE)
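For comparison, the same two selections written in SQL, where the WHERE clause plays the role of the selection condition:

SELECT * FROM EMPLOYEE WHERE DNO = 4;

SELECT * FROM EMPLOYEE WHERE SALARY > 30000;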

In general, the select operation is denoted by σ<selection condition>(R) where

the symbol σ (sigma) is used to denote the select operator

the selection condition is a Boolean (conditional) expression specified on the attributes of relation R

tuples that make the condition true are selected

appear in the result of the operation

tuples that make the condition false are filtered out

discarded from the result of the operation

SELECT Operation Properties

The SELECT operation σ<selection condition>(R) produces a relation S that has the same schema (same attributes) as R

SELECT σ is commutative:

σ<condition1>(σ<condition2>(R)) = σ<condition2>(σ<condition1>(R))

Because of the commutativity property, a cascade (sequence) of SELECT operations may be applied in any order:

σ<cond1>(σ<cond2>(σ<cond3>(R))) = σ<cond2>(σ<cond3>(σ<cond1>(R)))

A cascade of SELECT operations may be replaced by a single selection with a conjunction of all the conditions:

σ<cond1>(σ<cond2>(σ<cond3>(R))) = σ<cond1> AND <cond2> AND <cond3>(R)

The number of tuples in the result of a SELECT is less than (or equal to) the number of tuples in the input relation R

Unary Relational Operations: PROJECT

PROJECT Operation is denoted by π (pi)

This operation keeps certain columns (attributes) from a relation and discards the other columns.

PROJECT creates a vertical partitioning

The list of specified columns (attributes) is kept in each tuple

The other attributes in each tuple are discarded

Example: To list each employee's first and last name and salary, the following is used:

π LNAME, FNAME, SALARY(EMPLOYEE)

The general form of the project operation is:

π<attribute list>(R)

π (pi) is the symbol used to represent the project operation

<attribute list> is the desired list of attributes from relation R.

The project operation removes any duplicate tuples. This is because the result of the project operation must be a set of tuples.

Mathematical sets do not allow duplicate elements.
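The SQL counterpart needs DISTINCT to match the set semantics of PROJECT, since SQL otherwise keeps duplicate rows:

SELECT DISTINCT LNAME, FNAME, SALARY
FROM   EMPLOYEE;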

PROJECT Operation Properties

The number of tuples in the result of projection π<attribute list>(R) is always less than or equal to the number of tuples in R

If the list of attributes includes a key of R, then the number of tuples in the result of PROJECT is equal to the number of tuples in R

PROJECT is not commutative

π<list1>(π<list2>(R)) = π<list1>(R) as long as <list2> contains the attributes in <list1>

Unary Relational Operations: RENAME

The RENAME operator is denoted by ρ (rho)

In some cases, we may want to rename the attributes of a relation or the relation name or both

Useful when a query requires multiple operations

Necessary in some cases (see the JOIN operation later). The general RENAME operation ρ can be expressed in any of the following forms:

ρS(B1, B2, ..., Bn)(R) changes both:

the relation name to S, and

the column (attribute) names to B1, B2, ..., Bn

ρS(R) changes:

the relation name only to S

ρ(B1, B2, ..., Bn)(R) changes:

the column (attribute) names only to B1, B2, ..., Bn

For convenience, we also use a shorthand for renaming attributes in an intermediate relation:

If we write:

RESULT ← π FNAME, LNAME, SALARY(DEP5_EMPS)

RESULT will have the same attribute names as DEP5_EMPS (same attributes as EMPLOYEE)

If we write:

RESULT(F, M, L, S, B, A, SX, SAL, SU, DNO) ← ρ RESULT(F, M, L, S, B, A, SX, SAL, SU, DNO)(DEP5_EMPS)

The 10 attributes of DEP5_EMPS are renamed to F, M, L, S, B, A, SX, SAL, SU, DNO, respectively
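In SQL, the closest counterpart of RENAME is aliasing with AS, which renames a relation and/or its columns within a query (a sketch; the alias names are arbitrary):

SELECT E.FNAME AS F, E.LNAME AS L, E.SALARY AS SAL
FROM   EMPLOYEE AS E
WHERE  E.DNO = 5;     -- the employees of department 5, with renamed output columns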

Relational Algebra Operations from Set Theory: UNION

UNION Operation

Binary operation, denoted by ∪

The result of R ∪ S is a relation that includes all tuples that are either in R or in S or in both R and S

Duplicate tuples are eliminated

The two operand relations R and S must be type compatible (or UNION compatible)

R and S must have same number of attributes

Each pair of corresponding attributes must be type compatible (have the same or compatible domains)

Example:

To retrieve the social security numbers of all employees who either work in department 5 (RESULT1 below) or directly supervise an employee who works in department 5 (RESULT2 below)

We can use the UNION operation as follows:

DEP5_EMPS ← σ DNO=5 (EMPLOYEE)

RESULT1 ← π SSN(DEP5_EMPS)

RESULT2(SSN) ← π SUPERSSN(DEP5_EMPS)

RESULT ← RESULT1 ∪ RESULT2

The union operation produces the tuples that are in either RESULT1 or RESULT2 or both
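The same query in SQL; UNION eliminates duplicates, matching the set semantics above:

-- SSNs of employees who work in department 5,
-- together with the SSNs of their direct supervisors
SELECT SSN      FROM EMPLOYEE WHERE DNO = 5
UNION
SELECT SUPERSSN FROM EMPLOYEE WHERE DNO = 5;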

Type compatibility of operands is required for the binary set operation UNION ∪ (also for INTERSECTION ∩ and SET DIFFERENCE −, see below)

R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) are type compatible if:

they have the same number of attributes, and

the domains of corresponding attributes are type compatible (i.e. dom(Ai)=dom(Bi) for i=1, 2, ..., n).

The resulting relation for R1 ∪ R2 (also for R1 ∩ R2 and R1 − R2, see below) has the same attribute names as the first operand relation R1 (by convention)

Relational Algebra Operations from Set Theory: INTERSECTION

INTERSECTION is denoted by ∩. The result of the operation R ∩ S is a relation that includes all tuples that are in both R and S

The attribute names in the result will be the same as the attribute names in R

The two operand relations R and S must be type compatible

Relational Algebra Operations from Set Theory: MINUS

SET DIFFERENCE (also called MINUS or EXCEPT) is denoted by −

The result of R − S is a relation that includes all tuples that are in R but not in S

The attribute names in the result will be the same as the attribute names in R

The two operand relations R and S must be type compatible

Some properties of UNION, INTERSECT, and DIFFERENCE

Notice that both union and intersection are commutative operations; that is

R ∪ S = S ∪ R, and R ∩ S = S ∩ R

Both union and intersection can be treated as n-ary operations applicable to any number of relations as both are associative operations; that is

R ∪ (S ∪ T) = (R ∪ S) ∪ T

(R ∩ S) ∩ T = R ∩ (S ∩ T)

The minus operation is not commutative; that is, in general

R − S ≠ S − R

Relational Algebra Operations from Set Theory: CARTESIAN PRODUCT

CARTESIAN (or CROSS) PRODUCT Operation

This operation is used to combine tuples from two relations in a combinatorial fashion.

Denoted by R(A1, A2, . . ., An) x S(B1, B2, . . ., Bm)

Result is a relation Q with degree n + m attributes:

Q(A1, A2, . . ., An, B1, B2, . . ., Bm), in that order.

The resulting relation state has one tuple for each combination of tuples, one from R and one from S.

Hence, if R has nR tuples (denoted as |R| = nR ), and S has nS tuples, then R x S will have nR * nS tuples.

The two operands do NOT have to be "type compatible"

Generally, CROSS PRODUCT is not a meaningful operation

Can become meaningful when followed by other operations

Example (not meaningful):

FEMALE_EMPS ← σ SEX='F'(EMPLOYEE)

EMPNAMES ← π FNAME, LNAME, SSN(FEMALE_EMPS)

EMP_DEPENDENTS ← EMPNAMES x DEPENDENT

EMP_DEPENDENTS will contain every combination of EMPNAMES and DEPENDENT

whether or not they are actually related

To keep only combinations where the DEPENDENT is related to the EMPLOYEE, we add a SELECT operation as follows

Example (meaningful):

FEMALE_EMPS ← σ SEX='F'(EMPLOYEE)

EMPNAMES ← π FNAME, LNAME, SSN(FEMALE_EMPS)

EMP_DEPENDENTS ← EMPNAMES x DEPENDENT

ACTUAL_DEPS ← σ SSN=ESSN(EMP_DEPENDENTS)

RESULT ← π FNAME, LNAME, DEPENDENT_NAME(ACTUAL_DEPS)

RESULT will now contain the names of female employees and their dependents
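Written directly in SQL, the CROSS PRODUCT followed by SELECT and PROJECT collapses into a single query:

SELECT FNAME, LNAME, DEPENDENT_NAME
FROM   EMPLOYEE, DEPENDENT       -- cartesian product of the two relations
WHERE  SEX = 'F'                 -- corresponds to the selection defining FEMALE_EMPS
  AND  SSN = ESSN;               -- keeps only the actually related combinations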

Binary Relational Operations: JOIN

JOIN Operation (denoted by ⋈)

The sequence of CARTESIAN PRODUCT followed by SELECT is used quite commonly to identify and select related tuples from two relations

A special operation, called JOIN combines this sequence into a single operation

This operation is very important for any relational database with more than a single relation, because it allows us to combine related tuples from various relations

The general form of a join operation on two relations R(A1, A2, . . ., An) and S(B1, B2, . . ., Bm) is:

R ⋈<join condition> S

where R and S can be any relations that result from general relational algebra expressions.

Example: Suppose that we want to retrieve the name of the manager of each department.

To get the manager's name, we need to combine each DEPARTMENT tuple with the EMPLOYEE tuple whose SSN value matches the MGRSSN value in the department tuple.

We do this by using the join operation.

DEPT_MGR ← DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE

MGRSSN=SSN is the join condition

Combines each department record with the employee who manages the department

The join condition can also be specified as DEPARTMENT.MGRSSN= EMPLOYEE.SSN
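The same join in SQL (DNAME is assumed to be the department-name attribute of DEPARTMENT):

SELECT D.DNUMBER, D.DNAME, E.FNAME, E.LNAME
FROM   DEPARTMENT AS D
JOIN   EMPLOYEE   AS E ON D.MGRSSN = E.SSN;   -- the join condition MGRSSN = SSN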

Consider the following JOIN operation:

R(A1, A2, . . ., An) ⋈ R.Ai=S.Bj S(B1, B2, . . ., Bm)

Result is a relation Q with degree n + m attributes:

Q(A1, A2, . . ., An, B1, B2, . . ., Bm), in that order.

The resulting relation state has one tuple for each combination of tuples, r from R and s from S, but only if they satisfy the join condition r[Ai]=s[Bj]

Hence, if R has nR tuples, and S has nS tuples, then the join result will generally have fewer than nR * nS tuples.

Only related tuples (based on the join condition) will appear

Some properties of JOIN

The general case of the JOIN operation is called a Theta-join: R ⋈θ S

The join condition is called theta

Theta can be any general boolean expression on the attributes of R and S; for example:

R.Ai < S.Bj AND R.Ak = S.Bl

(Compare the tuple relational calculus used below: a query such as {t | EMPLOYEE(t) and t.SALARY>50000} retrieves only the EMPLOYEE tuples whose SALARY is greater than $50,000, analogous to the SELECTION σ SALARY>50000; only tuples that satisfy the condition will be retrieved.)

The Existential and Universal Quantifiers

Two special symbols called quantifiers can appear in formulas; these are the universal quantifier (∀) and the existential quantifier (∃).

Informally, a tuple variable t is bound if it is quantified, meaning that it appears in an (∀ t) or (∃ t) clause; otherwise, it is free.

If F is a formula, then so are (∃ t)(F) and (∀ t)(F), where t is a tuple variable.

The formula (∃ t)(F) is true if the formula F evaluates to true for some (at least one) tuple assigned to free occurrences of t in F; otherwise (∃ t)(F) is false.

The formula (∀ t)(F) is true if the formula F evaluates to true for every tuple (in the universe) assigned to free occurrences of t in F; otherwise (∀ t)(F) is false.

∀ is called the universal or "for all" quantifier because every tuple in the universe of tuples must make F true to make the quantified formula true.

∃ is called the existential or "there exists" quantifier because any tuple that exists in the universe of tuples may make F true to make the quantified formula true.

Examples

Find the names of employees who work on all the projects controlled by department number 5. The query can be:

{e.LNAME, e.FNAME | EMPLOYEE(e) and ((∀ x)(not(PROJECT(x)) or not(x.DNUM=5) or

((∃ w)(WORKS_ON(w) and w.ESSN=e.SSN and x.PNUMBER=w.PNO))))}

Exclude from the universal quantification all tuples that we are not interested in by making the condition true for all such tuples.

The first tuples to exclude (by making them evaluate automatically to true) are those that are not in the relation R of interest.

In the query above, using the expression not(PROJECT(x)) inside the universally quantified formula evaluates to true all tuples x that are not in the PROJECT relation. Then we exclude the tuples we are not interested in from R itself. The expression not(x.DNUM=5) evaluates to true all tuples x that are in the PROJECT relation but are not controlled by department 5.

Finally, we specify a condition that must hold on all the remaining tuples in R.

(∃ w)(WORKS_ON(w) and w.ESSN=e.SSN and x.PNUMBER=w.PNO)
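In SQL, this universally quantified query is usually written with a double NOT EXISTS ("there is no project of department 5 that this employee does not work on"); a sketch using the COMPANY attribute names above:

SELECT E.LNAME, E.FNAME
FROM   EMPLOYEE AS E
WHERE  NOT EXISTS (                    -- no department-5 project x ...
         SELECT *
         FROM   PROJECT AS X
         WHERE  X.DNUM = 5
           AND  NOT EXISTS (           -- ... for which e has no WORKS_ON tuple
                  SELECT *
                  FROM   WORKS_ON AS W
                  WHERE  W.ESSN = E.SSN
                    AND  W.PNO  = X.PNUMBER));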

The Domain Relational Calculus

The domain-oriented calculus differs from the tuple-oriented relational calculus in that it has domain variables instead of tuple variables; that is, variables that range over domains instead of over relations.

An expression of the domain calculus is of the form

{ x1, x2, . . ., xn |

COND(x1, x2, . . ., xn, xn+1, xn+2, . . ., xn+m)}

where x1, x2, . . ., xn, xn+1, xn+2, . . ., xn+m are domain variables that range over domains (of attributes)

and COND is a condition or formula of the domain relational calculus.

Examples

Retrieve the birthdate and address of the employee whose name is John B. Smith.

Query :

{uv | (∃ q)(∃ r)(∃ s)(∃ t)(∃ w)(∃ x)(∃ y)(∃ z)

(EMPLOYEE(qrstuvwxyz) and q='John' and r='B' and s='Smith')}

Abbreviated notation EMPLOYEE(qrstuvwxyz) uses the variables without the separating commas: EMPLOYEE(q,r,s,t,u,v,w,x,y,z)

Ten variables for the employee relation are needed, one to range over the domain of each attribute in order. Of the ten variables q, r, s, . . ., z, only u and v are free.

Specify the requested attributes, BDATE and ADDRESS, by the free domain variables u for BDATE and v for ADDRESS.

Specify the condition for selecting a tuple following the bar ( | ), namely, that the sequence of values assigned to the variables qrstuvwxyz be a tuple of the EMPLOYEE relation and that the values for q (FNAME), r (MINIT), and s (LNAME) be 'John', 'B', and 'Smith', respectively.
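The equivalent SQL query over the COMPANY EMPLOYEE relation:

SELECT BDATE, ADDRESS
FROM   EMPLOYEE
WHERE  FNAME = 'John' AND MINIT = 'B' AND LNAME = 'Smith';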

Database Design

Requirements Analysis

user needs; what must the database do?

Conceptual Design

high level description (often done with ER model)

Logical Design

translate the ER design into a DBMS data model (the relational model)

(NOW) Schema Refinement

consistency, normalization

Physical Design

- indexes, disk layout

Security Design

- who accesses what

Good Database Design

no redundancy of FACTs (!)

no inconsistency

no insertion, deletion or update anomalies

no information loss

no dependency loss

Informal Design Guidelines for Relational Databases

1. Semantics of the Relation Attributes

2. Redundant Information in Tuples and Update Anomalies

3. Null Values in Tuples

4. Spurious Tuples

1:Semantics of the Relation Attributes

GUIDELINE 1: Informally, each tuple in a relation should represent one entity or relationship instance. (Applies to individual relations and their attributes).

Attributes of different entities (EMPLOYEEs, DEPARTMENTs, PROJECTs) should not be mixed in the same relation

Only foreign keys should be used to refer to other entities

Entity and relationship attributes should be kept apart as much as possible.

Design a schema that can be explained easily relation by relation. The semantics of attributes should be easy to interpret.

2:Redundant Information in Tuples and Update Anomalies

Information is stored redundantly

Wastes storage

Causes problems with update anomalies

Insertion anomalies

Deletion anomalies

Modification anomalies

Consider the relation:

EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)

Insertion anomalies

Cannot insert a project unless an employee is assigned to it.

Deletion anomalies

a. When a project is deleted, it will result in deleting all the employees who work on that project.

b. Alternately, if an employee is the sole employee on a project, deleting that employee would result in deleting the corresponding project.

Modification anomalies

Changing the name of project number P1 from 'Billing' to 'Customer-Accounting' may cause this update to be made for all 100 employees working on project P1.

GUIDELINE 2:

Design a schema that does not suffer from the insertion, deletion and update anomalies.

If there are any anomalies present, then note them so that applications can be made to take them into account.
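One decomposition of EMP_PROJ that avoids these anomalies, as a SQL sketch (column types are assumptions, and EmpNo/ProjNo stand in for Emp#/Proj#, since '#' is not a standard SQL identifier character): each fact about an employee, a project, or an assignment is now stored exactly once.

CREATE TABLE EMP  (EmpNo  INT PRIMARY KEY, Ename VARCHAR(40));
CREATE TABLE PROJ (ProjNo INT PRIMARY KEY, Pname VARCHAR(40));
CREATE TABLE WORKS (
  EmpNo    INT REFERENCES EMP  (EmpNo),
  ProjNo   INT REFERENCES PROJ (ProjNo),
  No_hours DECIMAL(5,1),
  PRIMARY KEY (EmpNo, ProjNo)
);
-- A project can be inserted before any employee is assigned to it,
-- deleting the last assignment no longer deletes the project,
-- and renaming a project touches a single PROJ row.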

3:Null Values in Tuples

GUIDELINE 3:

Relations should be designed such that their tuples will have as few NULL values as possible

Attributes that are NULL frequently could be placed in separate relations (with the primary key)

Reasons for nulls:

Attribute not applicable or invalid

Attribute value unknown (may exist)

Value known to exist, but unavailable

4:Spurious Tuples

Bad designs for a relational database may result in erroneous results for certain JOIN operations

The "lossless join" property is used to guarantee meaningful results for join operations

GUIDELINE 4:

The relations should be designed to satisfy the lossless join condition.

No spurious tuples should be generated by doing a natural-join of any relations.

Normalization:

The process of decomposing unsatisfactory "bad" relations by breaking up their attributes into smaller relations

Normalization is used to design a set of relation schemas that is optimal from the point of view of database updating

Normalization starts from a universal relation schema

1NF

Attributes must be atomic:

they can be chars, ints, strings

they can't be:

1. tuples

2. sets

3. relations

4. composite

5. multivalued

Considered to be part of the definition of a relation.

Unnormalised Relations

Name        PaperList
SWETHA      EENADU, HINDU, DC
PRASANNA    EENADU, VAARTHA, HINDU

This is not ideal. Each person is associated with an unspecified number of papers. The items in the PaperList column do not have a consistent form.

Generally, an RDBMS can't cope with relations like this. Each entry in a table needs to hold a single data item.

This is an unnormalised relation.

All RDBMSs require relations not to be like this, i.e. not to have multiple values in any column (no repeating groups).

Name        PaperList
SWETHA      EENADU
SWETHA      HINDU
SWETHA      DC
PRASANNA    HINDU
PRASANNA    EENADU
PRASANNA    VAARTHA

This clearly contains the same information.

And it has the property that we sought. It is in First Normal Form (1NF).

A relation is in 1NF if no entry consists of more than one value (i.e. it does not have repeating groups).

So this will be the first requirement in designing our databases:

Obtaining 1NF

1NF is obtained by

Splitting composite attributes

splitting the relation and propagating the primary key to remove multivalued attributes

There are three approaches to removing repeating groups from unnormalized tables:

1. Remove the repeating group by entering appropriate data in the empty columns of rows containing the repeating data.

2. Remove the repeating group by placing the repeating data, along with a copy of the original key attribute(s), in a separate relation. A primary key is identified for the new relation.

3. Remove the repeating group by finding the maximum possible number of values for the multivalued attribute and adding that many attributes to the relation.
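Applying the second approach to the newspaper example above, a SQL sketch (table and column names are hypothetical):

CREATE TABLE READER (
  Name VARCHAR(30) PRIMARY KEY
);

CREATE TABLE READER_PAPER (
  Name  VARCHAR(30) REFERENCES READER (Name),
  Paper VARCHAR(30),
  PRIMARY KEY (Name, Paper)        -- one row per (person, paper): no repeating groups
);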

Example:-

The DEPARTMENT schema is not in 1NF because DLOCATION is not a single-valued attribute.

The relation should be split into two relations. A new relation DEPT_LOCATIONS is created and the primary key of DEPARTMENT, DNUMBER, becomes an attribute of the new relation. The primary key of this relation is {DNUMBER, DLOCATION}

Alternative solution: Leave the DLOCATION attribute as it is. Instead, we have one tuple for each location of a DEPARTMENT. Then the relation is in 1NF, but redundancy exists.

A superkey of a relation schema R = {A1, A2, ...., An} is a set of attributes S subset-of R with the property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S]

A key K is a super key with the additional property that removal of any attribute from K will cause K not to be a super key any more.

If a relation schema has more than one key, each is called a candidate key.

One of the candidate keys is arbitrarily designated to be the primary key, and the others are called secondary keys.

A Prime attribute must be a member of some candidate key

A Nonprime attribute is not a prime attribute; that is, it is not a member of any candidate key

Functional Dependencies (FDs)

Definition of FD

Inference Rules for FDs

Equivalence of Sets of FDs

Minimal Sets of FDs

Functional dependency describes the relationship between attributes in a relation.

For example, if A and B are attributes of relation R, B is functionally dependent on A (denoted A → B) if each value of