Master of Computer Application (MCA) – Semester 4

MC0077 – Advanced Database Systems

Question 1 - List and explain various Normal Forms. How does BCNF differ from the Third Normal Form and the Fourth Normal Form?

First Normal Form - First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain. First normal form is an essential property of a relation in a relational database: database normalization is the process of representing a database in terms of relations in standard normal forms, and first normal form is the minimal requirement. First normal form deals with the "shape" of a record type. Under first normal form, all occurrences of a record type must contain the same number of fields, so variable repeating fields and groups are excluded.

Second Normal Form - Second normal form (2NF) is a normal form used in database normalization. A table that is in first normal form (1NF) must meet additional criteria to qualify for second normal form. Specifically, a table is in 2NF if and only if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table. A non-prime attribute of a table is an attribute that is not part of any candidate key of the table. Put simply, a table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is either dependent on the whole of a candidate key or on another non-prime attribute. When a 1NF table has no composite candidate keys (candidate keys consisting of more than one attribute), the table is automatically in 2NF. Second and third normal forms deal with the relationship between non-key and key fields.

Third Normal Form - Third normal form (3NF) is a normal form used in database normalization. A table is in 3NF if and only if both of the following conditions hold: the relation R (table) is in second normal form (2NF), and every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every candidate key of R.
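
To make the first three normal forms concrete, the following sketch uses Python's built-in sqlite3 module; all table and column names are invented for illustration. The OrderItems_1NF table has the composite key (order_id, product_id); product_name and customer_id depend on only part of that key (violating 2NF), and customer_city depends on the non-prime attribute customer_id (a transitive dependency, violating 3NF). The decomposition below removes both problems.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- 1NF but not 2NF/3NF: partial and transitive dependencies remain.
    CREATE TABLE OrderItems_1NF (
        order_id      INTEGER,
        product_id    INTEGER,
        product_name  TEXT,     -- depends only on product_id  (partial dependency)
        customer_id   INTEGER,  -- depends only on order_id    (partial dependency)
        customer_city TEXT,     -- depends on customer_id      (transitive dependency)
        quantity      INTEGER,
        PRIMARY KEY (order_id, product_id)
    );

    -- 2NF/3NF decomposition: every non-prime attribute now depends directly
    -- on the whole key of its own table.
    CREATE TABLE Products   (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE Customers  (customer_id INTEGER PRIMARY KEY, customer_city TEXT);
    CREATE TABLE Orders     (order_id    INTEGER PRIMARY KEY,
                             customer_id INTEGER REFERENCES Customers(customer_id));
    CREATE TABLE OrderItems (order_id    INTEGER REFERENCES Orders(order_id),
                             product_id  INTEGER REFERENCES Products(product_id),
                             quantity    INTEGER,
                             PRIMARY KEY (order_id, product_id));
    """)
    conn.close()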

Fourth Normal Form - Under fourth normal form, a table cannot have more than one independent multivalued column. A multivalued column is one where a single entity can have more than one value for that column.

Fifth Normal Form - Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others. A table is brought into fifth normal form by decomposing it into the smaller tables from which its information can be reconstructed, so that the data can be maintained with less redundancy.

Difference between BCNF and Third Normal Form

Both 3NF and BCNF are normal forms used in relational databases to minimize redundancy in tables. In a table that is in BCNF, for every non-trivial functional dependency of the form A → B, A is a super-key, whereas a table that complies with 3NF must be in 2NF and every non-prime attribute must directly depend on every candidate key of the table. BCNF is considered a stronger normal form than 3NF, and it was developed to capture some of the anomalies that could not be captured by 3NF. Obtaining a table that complies with BCNF may require decomposing a table that is in 3NF. This decomposition results in additional join operations (or Cartesian products) when executing queries, which increases computation time. On the other hand, tables that comply with BCNF have fewer redundancies than tables that only comply with 3NF.
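
The distinction can also be checked mechanically. A functional dependency A → B holds in a relation instance when no two rows agree on A but differ on B, and BCNF requires the left-hand side of every non-trivial dependency to be a super-key. The small Python sketch below (relation and attribute names are invented) shows the classic case of a table that is in 3NF but not in BCNF, because instructor → course holds although instructor is not a super-key.

    def fd_holds(rows, lhs, rhs):
        # True if the functional dependency lhs -> rhs holds in this relation instance.
        seen = {}
        for row in rows:
            determinant = tuple(row[a] for a in lhs)
            dependent = tuple(row[a] for a in rhs)
            if determinant in seen and seen[determinant] != dependent:
                return False
            seen[determinant] = dependent
        return True

    # Candidate keys: {student, course} and {student, instructor}.
    teaches = [
        {"student": "s1", "course": "DBMS", "instructor": "rao"},
        {"student": "s2", "course": "DBMS", "instructor": "iyer"},
        {"student": "s1", "course": "OS",   "instructor": "nair"},
    ]
    print(fd_holds(teaches, ("instructor",), ("course",)))  # True  - but instructor is not a super-key
    print(fd_holds(teaches, ("course",), ("instructor",)))  # False - a course can have several instructors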

Difference between BCNF and 4th Normal Form

● A database must already be in 3NF to take it to BCNF, but it must be in both 3NF and BCNF to reach 4NF.

● In fourth normal form there are no non-trivial multi-valued dependencies in the tables, whereas in BCNF there can still be multi-valued dependencies in the tables.

Question 2 - What are the differences between Centralized and Distributed Database Systems? List the relative advantages of data distribution.

A distributed database is a database that is under the control of a central database management system (DBMS) in which the storage devices are not all attached to a common CPU. It may be stored on multiple computers located in the same physical location, or it may be dispersed over a network of interconnected computers. Collections of data (e.g. in a database) can thus be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites.

To ensure that the distributed databases are up to date and current, there are two processes: replication and duplication. Replication involves using specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. The replication process can be very complex and time consuming depending on the size and number of the distributed databases, and it can require considerable time and computer resources. Duplication, on the other hand, is not as complicated: it identifies one database as a master and then duplicates that database. The duplication process is normally done at a set time after hours, to ensure that each distributed location has the same data. In the duplication process, changes are allowed only to the master database, so that local data will not be overwritten. Both processes can keep the data current in all distributed locations.

Besides replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. The implementation of these technologies can and does depend on the needs of the business and the sensitivity/confidentiality of the data to be stored in the database, and hence on the price the business is willing to pay to ensure data security, consistency and integrity.
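
The difference between replication (propagating detected changes) and duplication (periodically copying a designated master) can be sketched in a few lines of Python; the site names and data values below are invented purely for illustration.

    import copy

    master = {"acct_1001": 5000, "acct_1002": 750}        # designated master database
    replicas = {"branch_a": copy.deepcopy(master),
                "branch_b": copy.deepcopy(master)}

    def replicate(changes):
        # Replication: push only the rows that have changed to every copy.
        for site_db in replicas.values():
            site_db.update(changes)

    def duplicate():
        # Duplication: overwrite each copy with a full copy of the master, e.g. nightly;
        # local changes are discarded, so the master itself is never overwritten.
        for name in replicas:
            replicas[name] = copy.deepcopy(master)

    master["acct_1001"] = 4200        # a change applied at the master
    replicate({"acct_1001": 4200})    # change-driven propagation
    duplicate()                       # or a scheduled full refresh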

A database user accesses the distributed database through:

Local applications: Applications which do not require data from other sites.


Global applications: Applications which do require data from other sites.

A distributed database does not share main memory or disks. A centralized database, by contrast, keeps all of its data in one place, which makes it fundamentally different from a distributed database, whose data is spread across different places. Because all the data in a centralized database resides in one place, it can become a bottleneck, and data availability is not as good as in a distributed database.

Advantages of Data Distribution

The primary advantage of distributed database systems is the ability to share and access data in a reliable and efficient manner.

1. Data Sharing and Distributed Control: If a number of different sites are connected to each other, then a user at one site may be able to access data that is available at another site. For example, in a distributed banking system it is possible for a user in one branch to access data in another branch. Without this capability, a user wishing to transfer funds from one branch to another would have to resort to some external mechanism for the transfer; this external mechanism would, in effect, be a single centralized database. The primary advantage of accomplishing data sharing by means of data distribution is that each site is able to retain a degree of control over data stored locally. In a centralized system, the database administrator of the central site controls the database. In a distributed system, there is a global database administrator responsible for the entire system, and a part of these responsibilities is delegated to the local database administrator of each site. Depending upon the design of the distributed database system, each local administrator may have a different degree of autonomy, which is often a major advantage of distributed databases.

2. Reliability and Availability: If one site fails in a distributed system, the remaining sites may be able to continue operating. In particular, if data are replicated at several sites, a transaction needing a particular data item may find it at any of those sites. Thus, the failure of a site does not necessarily imply the shutdown of the system. The failure of one site must be detected by the system, and appropriate action may be needed to recover from the failure; the system must stop using the services of the failed site. Finally, when the failed site recovers or is repaired, mechanisms must be available to integrate it smoothly back into the system. Although recovery from failure is more complex in distributed systems than in a centralized system, the ability of most of the system to continue operating despite the failure of one site results in increased availability. Availability is crucial for database systems used for real-time applications.

3. Speedup of Query Processing: If a query involves data at several sites, it may be possible to split the query into subqueries that can be executed in parallel at several sites, as sketched below. Such parallel computation allows for faster processing of a user's query. In cases where data is replicated, queries may be directed by the system to the least heavily loaded sites.
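
The speed-up can be illustrated with a toy Python sketch that fans a query out to several sites in parallel and merges the partial results; the site names, the local data and the query_site helper are all hypothetical, not a real DDBMS API.

    from concurrent.futures import ThreadPoolExecutor

    LOCAL_DATA = {                     # stand-in for each site's local database
        "mumbai":  [1200, 800],
        "delhi":   [450],
        "chennai": [3000, 90],
    }

    def query_site(site, predicate):
        # Stand-in for executing a sub-query against one site.
        return [row for row in LOCAL_DATA[site] if predicate(row)]

    def distributed_query(predicate):
        # Run the sub-queries in parallel, then merge the partial results.
        with ThreadPoolExecutor(max_workers=len(LOCAL_DATA)) as pool:
            partials = list(pool.map(lambda site: query_site(site, predicate), LOCAL_DATA))
        return [row for part in partials for row in part]

    print(distributed_query(lambda balance: balance > 500))   # [1200, 800, 3000]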

Question 3 - Describe the concepts of Structural Semantic Data Model (SSM).

A data model in software engineering is an abstract model that describes how data are represented and accessed. Data models formally define data elements and the relationships among data elements for a domain of interest. A data model explicitly determines the structure of data or structured data. Typical applications of data models include database models, design of information systems, and enabling the exchange of data. Usually data models are specified in a data modeling language. Communication and precision are the two key benefits that make a data model important to applications that use and exchange data. A data model is the medium through which project team members from different backgrounds and with different levels of experience can communicate with one another. Precision means that the terms and rules on a data model can be interpreted in only one way and are not ambiguous. A data model is sometimes referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.

A semantic data model in software engineering is a technique to define the meaning of data within the context of its interrelationships with other data; it is an abstraction which defines how the stored symbols relate to the real world. A semantic data model is sometimes called a conceptual data model. The logical data structure of a database management system (DBMS), whether hierarchical, network, or relational, cannot totally satisfy the requirements for a conceptual definition of data because it is limited in scope and biased toward the implementation strategy employed by the DBMS. The need to define data from a conceptual view has therefore led to the development of semantic data modeling techniques. The real world, in terms of resources, ideas, events, etc., is symbolically defined within physical data stores; the semantic data model defines how those stored symbols relate to the real world, and thus the model must be a true representation of the real world.

Data modeling in software engineering is the process of creating a data model by applying formal data model descriptions using data modeling techniques. Data modeling is a technique for defining the business requirements for a database; it is sometimes called database modeling because a data model is eventually implemented in a database. Data architecture is the design of data for use in defining the target state and the subsequent planning needed to reach that target state. It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture. Data architecture describes the data structures used by a business and/or its applications. There are descriptions of data in storage and data in motion; descriptions of data stores, data groups and data items; and mappings of those data artifacts to data qualities, applications, locations, etc. Essential to realizing the target state, data architecture describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also to control the flow of data in the system.

Question 4 - Describe the following with respect to Object Oriented Databases: a) Query Processing in Object-Oriented Database Systems b) Query Processing Architecture

a. Query Processing in Object-Oriented Database Systems

One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities, which led some researchers to liken them to first-generation (network and hierarchical) DBMSs. It was commonly believed that the application domains that OODBMS technology targets do not need querying capabilities. This belief no longer holds, and declarative query capability is accepted as one of the fundamental features of OODBMSs. Indeed, most of the current prototype systems experiment with powerful query languages and investigate their optimization. Commercial products have started to include such languages as well, e.g. O2 and Object-Store.

Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization which is quite different from the algebraic, cost-based optimization techniques employed in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model since the latter defines the access primitives which are used by the query model. These primitives, at least partially, determine the power of the query model. Despite this close relationship, in this unit we do not consider issues related to the design of object models, query models, or query languages in any detail.

Almost all object query processors proposed to date use optimization techniques developed for relational systems. However, there are a number of issues that make query processing more difficult in OODBMSs. The following are some of the more important issues:

Type System - Relational query languages operate on a simple type system consisting of a single aggregate type: relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object languages are closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators.

Encapsulation - Relational query optimization depends on knowledge of the physical storage of data (access paths) which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues. First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. In fact, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written using a general-purpose programming language. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly.

Complex Objects and Inheritance - Objects usually have complex structures where the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages.
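
A path expression such as order.customer.address.city is just a chain of object references, and the optimizer must decide whether to evaluate it by pointer chasing or by turning it into joins. The minimal Python sketch below (class names are invented) shows what naive pointer-chasing evaluation of a path expression looks like.

    from dataclasses import dataclass

    @dataclass
    class Address:
        city: str

    @dataclass
    class Customer:
        name: str
        address: Address

    @dataclass
    class Order:
        number: int
        customer: Customer

    def eval_path(obj, path):
        # Follow each step of the dotted path through the object graph.
        for step in path.split("."):
            obj = getattr(obj, step)
        return obj

    order = Order(1, Customer("Asha", Address("Pune")))
    print(eval_path(order, "customer.address.city"))   # Pune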

Object Models - OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems. As a result, the numerous projects that experiment with object query processing follow quite different paths and are, to a certain degree, incompatible, making it difficult to amortize on the experiences of others.


b. Query Processing Architecture

A query processing methodology similar to the one used in relational DBMSs, but modified to deal with the difficulties described above, can be followed.

The steps of the methodology are as follows.

1. Queries are expressed in a declarative language.
2. It requires no user knowledge of object implementations, access paths or processing strategies.
3. The calculus expression is first
4. Calculus optimization
5. Calculus-algebra transformation
6. Type check
7. Algebra optimization
8. Execution plan generation
9. Execution
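
Read as a pipeline, the steps above simply feed into one another. The skeleton below is only a structural sketch; every stage is a placeholder function with an invented name, not the API of any real OODBMS.

    def _placeholder(stage_input):
        # Each real stage would transform its input; here it is passed through unchanged.
        return stage_input

    parse_to_calculus = optimize_calculus = to_algebra = _placeholder
    type_check = optimize_algebra = generate_plan = execute = _placeholder

    def process_query(declarative_query):
        calculus = parse_to_calculus(declarative_query)   # declarative query -> calculus
        calculus = optimize_calculus(calculus)            # calculus optimization
        algebra  = to_algebra(calculus)                   # calculus-algebra transformation
        algebra  = type_check(algebra)                    # type check
        algebra  = optimize_algebra(algebra)              # algebra optimization
        plan     = generate_plan(algebra)                 # execution plan generation
        return execute(plan)                              # execution

    process_query("select o from Orders o where o.total > 500")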

Question 5 - Describe the Differences between Distributed & Centralized Databases.

1 Centralized Control vs. Decentralized Control - In centralized control, one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole database, along with "local database administrators", who have responsibility for their local databases.

2 Data Independence - In centralized databases, data independence means that the actual organization of data is transparent to the application programmer. Programs are written with a "conceptual" view of the data (the "conceptual schema"), and they are unaffected by the physical organization of the data. In distributed databases, another aspect, "distribution transparency", is added to the notion of data independence as used in centralized databases: programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another, although their speed of execution is affected.

3 Reduction of Redundancy - In centralized databases, redundancy is reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases, some data redundancy is desirable because (a) the locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system is increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

4 Complex Physical Structures and Efficient Access - In centralized databases, complex access structures such as secondary indexes and interfile chains are used to provide efficient access to data. In distributed databases, efficient access requires accessing data from different sites. For this, an efficient distributed data access plan is required, which can be generated either by the programmer or produced automatically by an optimizer. Problems faced in the design of such an optimizer fall into two categories: (a) global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites; (b) local optimization consists of deciding how to perform the local database accesses at each site.

5 Integrity, Recovery and Concurrency Control - A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two main threats to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution therefore requires synchronization amongst the transactions, which is much harder to achieve in a distributed system than in a centralized one.

6 Privacy and Security - In traditional (centralized) databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problems plus two new aspects: (a) security (protection) problems arise because of the communication network, which is intrinsic to distributed systems; (b) in databases with a high degree of "site autonomy", the owners of local data may feel more protected because they can enforce their own protection instead of depending on a central database administrator.

7 Distributed Query Processing - The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as also are variable processing capabilities and loadings for different nodes, and (where data fragments are replicated) trade-offs between cost and currency.

8 Distributed Directory (Catalog) Management - Catalogs for distributed databases contain information like fragmentation description, allocation description, mappings to local names, access method description, statistics on the database, protection and integrity constraints (consistency information) which are more detailed as compared to centralized databases.

Question 6 - Describe the following: a) Data Mining Functions b) Data Mining Techniques

a) Data Mining Functions

Data mining refers to the broadly defined set of techniques for finding meaningful patterns - or information - in large amounts of raw data. At a very high level, data mining is performed in the following stages (note that the terminology and steps taken in the data mining process vary by data mining practitioner):

1. Data collection: gathering the input data you intend to analyze.
2. Data scrubbing: removing missing records, filling in missing values where appropriate.
3. Pre-testing: determining which variables might be important for inclusion during the analysis stage.
4. Analysis/Training: analyzing the input data to look for patterns.
5. Model building: drawing conclusions from the analysis phase and determining a mathematical model to be applied to future sets of input data.
6. Application: applying the model to new data sets to find meaningful patterns.

Data mining can be used to classify or cluster data into groups or to predict likely future outcomes based upon a set of input variables/data.
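
A compressed, purely illustrative version of the stages listed above, assuming scikit-learn is available; the data is synthetic and the variable names are invented. The model here classifies records, matching the first use mentioned above.

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Stages 1-2 (collection, scrubbing): a tiny, already-clean synthetic data set of
    # [age, income] records labelled with whether the customer responded to an offer.
    X = [[25, 40000], [47, 82000], [35, 61000], [52, 95000], [23, 32000], [44, 72000]]
    y = [0, 1, 1, 1, 0, 1]

    # Stages 3-5 (pre-testing, analysis/training, model building).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=0, stratify=y)
    model = GaussianNB().fit(X_train, y_train)

    # Stage 6 (application): apply the model to held-back data and to a new record.
    print(model.score(X_test, y_test))
    print(model.predict([[30, 50000]]))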

b) Data Mining Techniques

Several major data mining techniques have been developed and are used in data mining projects.

Association - Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between a particular item and other items in the same transaction. For example, the association technique is used in market basket analysis to identify which products customers frequently purchase together.
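
A very small, library-free illustration of the idea behind association/market-basket analysis: count how often pairs of items appear together in the same transaction. The transactions below are made up for illustration.

    from itertools import combinations
    from collections import Counter

    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "milk"},
        {"butter", "milk", "eggs"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        # Every unordered pair of items bought together in this basket.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # Pairs with support >= 2 (appearing together in at least two baskets).
    print([pair for pair, count in pair_counts.items() if count >= 2])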

Classification - Classification is a classic data mining technique based on machine learning. Basically, classification is used to classify each item in a set of data into one of a predefined set of classes or groups.

Clustering - Clustering is a data mining technique that makes meaningful or useful clusters of objects that have similar characteristics using an automatic technique. Unlike classification, the clustering technique itself defines the classes and puts objects into them, whereas in classification objects are assigned to predefined classes.
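
A minimal clustering sketch using scikit-learn's KMeans (assuming scikit-learn is installed); the points are synthetic and the number of clusters is chosen by hand.

    from sklearn.cluster import KMeans

    # Synthetic 2-D points that fall into two visually obvious groups.
    points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)           # cluster assignment discovered for each point
    print(kmeans.cluster_centers_)  # the two cluster centres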

Prediction - Prediction, as its name implies, is a data mining technique that discovers the relationship between independent variables and the relationship between dependent and independent variables.

Sequential Patterns - Sequential pattern analysis is a data mining technique that seeks to discover similar patterns in transaction data over a business period. The uncovered patterns are used for further business analysis to recognize relationships among the data.

Artificial neural networks - These are non-linear, predictive models that learn through training. Although they are powerful predictive modeling techniques, some of the power comes at the expense of ease of use and deployment.

Decision trees - These are tree-shaped structures that represent sets of decisions. The decisions generate rules, which are then used to classify data. Decision trees are the favored technique for building understandable models.

The nearest-neighbor method - This method classifies dataset records based on similar data in a historical dataset.
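
Both techniques are available off the shelf; the sketch below (synthetic data, assuming scikit-learn is installed) fits a decision tree and a nearest-neighbour classifier on the same toy records and classifies a new one.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Toy historical records: [age, income] and whether the loan was repaid.
    X = [[22, 25000], [50, 90000], [35, 60000], [28, 30000], [60, 110000]]
    y = [0, 1, 1, 0, 1]
    new_record = [[40, 70000]]

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

    print(tree.predict(new_record))  # rule-based classification
    print(knn.predict(new_record))   # classification by the most similar historical records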
