
Naeem A. Mahoto

Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro

Email: naeemmahoto@gmail.com

Data warehouse and Data Mining

Lecture No. 12

Normalization and De-normalization

Database Design
•  Conceptual
   –  identify important entities and relationships
   –  determine attribute domains and candidate keys
•  Logical
   –  split data into multiple tables, such that:
      •  no information is lost
      •  useful information can be easily reconstituted
   –  draw the E-R diagram
   –  validate the model using normalization
•  Physical
   –  implement on the DBMS

Database Anomalies
•  Database anomalies are unmatched or missing information caused by limitations or flaws within a given database
•  Database anomalies are problems in relations that occur due to redundancy in the relations
•  These anomalies affect the processes of inserting, deleting and modifying data in relations/tables

Types of Anomalies
•  Insertion Anomaly: occurs when a new record is inserted into the relation
   –  the user cannot insert a fact about one entity until he/she has an additional fact about another entity
•  Deletion Anomaly: occurs when a record is deleted from the relation
   –  deleting the facts about one entity automatically deletes facts about another entity
•  Modification Anomaly: occurs when a record is updated in the relation
   –  a change to the value of a specific attribute must be repeated in every record in which that value occurs
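As a concrete illustration of all three anomalies (a hypothetical schema, not from the slides), consider a single table that mixes student and course facts:

-- One wide table holding two kinds of facts
create table enrollment (
    student_id   int,
    student_name varchar(50),
    course_id    varchar(10),
    course_title varchar(50)
);

-- Insertion anomaly: a new course cannot be recorded until some
-- student enrolls in it (no row exists to carry the course fact).
-- Deletion anomaly: deleting the last student enrolled in a course
-- also deletes the only record of that course's title.
-- Modification anomaly: renaming a course requires updating
-- course_title in every enrollment row for that course.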

Normalization
•  Normalization is the process of converting a bad database design into a form that overcomes database anomalies
•  It is the process of organizing the fields and tables of a relational database to minimize redundancy (eliminate redundant data) and dependency (ensure dependencies make sense)
•  Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them
•  The goal is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships

Normalization
•  Edgar F. Codd (inventor of the relational model) introduced normalization in 1970; it has since been elaborated through several normal forms:
   –  First normal form (1NF)
   –  Second normal form (2NF)
   –  Third normal form (3NF)
   –  Boyce-Codd normal form (BCNF)
   –  Fourth normal form (4NF)
   –  Fifth normal form (5NF)
   –  Domain-key normal form (DKNF)

First Normal Form (1NF)
•  A relation/table is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain
•  Example: consider a table that stores Customers and their Telephone Numbers; a customer may have more than one telephone number

First Normal Form (1NF)
[figure: the Customer data redesigned as 1NF tables]
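A minimal sketch of the 1NF redesign the figure shows (table and column names are assumed, not from the slides):

-- Not in 1NF: several phone numbers crammed into one column
-- (e.g. '555-861-2025, 555-403-1659')
create table customer_bad (
    customer_id   int primary key,
    customer_name varchar(50),
    telephones    varchar(200)
);

-- 1NF: one atomic value per column; the repeating group moves
-- to its own table, one row per telephone number
create table customer (
    customer_id   int primary key,
    customer_name varchar(50)
);

create table customer_telephone (
    customer_id int references customer(customer_id),
    telephone   varchar(20),
    primary key (customer_id, telephone)
);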

Second Normal Form (2NF)
•  A relation/table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of a candidate key
•  Equivalently, a table/relation is in 2NF if it is in first normal form and every non-primary-key column is fully functionally dependent on the primary key
•  Full functional dependency means that if A and B are sets of columns of a table, B is fully functionally dependent on A when B depends on all of A and not on any proper subset of A

Second Normal Form (2NF)
•  Consider a table describing employees' skills, with columns Employee, Skill and Current Work Location
•  The candidate key is composite: {Employee, Skill}
   –  an Employee might appear more than once (he/she might have multiple Skills)
   –  Current Work Location is dependent on only part of the candidate key (Employee alone)
   –  therefore the table is not in 2NF

A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}, as sketched below
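A minimal sketch of that decomposition (column types assumed):

-- Not in 2NF: current_work_location depends only on employee,
-- which is part of the composite key {employee, skill}
create table employee_skill_bad (
    employee              varchar(50),
    skill                 varchar(50),
    current_work_location varchar(50),
    primary key (employee, skill)
);

-- 2NF: the partially dependent column moves to its own table
create table employees (
    employee              varchar(50) primary key,
    current_work_location varchar(50)
);

create table employees_skills (
    employee varchar(50) references employees(employee),
    skill    varchar(50),
    primary key (employee, skill)
);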

Progressing to 2NF
•  If a table is not in second normal form:
   –  move each data item that is functionally dependent on only part of the primary key, together with that part of the key, to a new table
   –  add any other data items that are functionally dependent on the same part of the key
   –  make the partial primary key the primary key for the new table


Third Normal Form (3NF)
•  A table is in 3NF if and only if both of the following conditions hold:
   –  the relation R (table) is in second normal form (2NF)
   –  every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R
•  Informally: a table that is in 1NF and 2NF and in which no non-primary-key column is transitively dependent on the primary key

Third Normal Form (3NF)
•  Example: consider a table with columns A, B, and C. If B is functionally dependent on A and C is functionally dependent on B, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C)

Third Normal Form (3NF)
•  A 2NF table that fails to meet the requirements of 3NF:
   [table: Tournament, Year, Winner, Winner Date of Birth — the candidate key is the composite {Tournament, Year}]
•  Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner

Progressing to 3NF
•  Move all items involved in transitive dependencies to a new entity
•  Identify a primary key for the new entity
•  Place the primary key for the new entity as a foreign key on the original entity

Third Normal Form (3NF)
[figure: the tournament data decomposed into two 3NF tables]
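A minimal sketch of that decomposition (table and column names assumed from the example above):

-- Not in 3NF: winner_dob depends on winner, which in turn
-- depends on the key {tournament, year}
create table tournament_winners_bad (
    tournament varchar(50),
    year       int,
    winner     varchar(50),
    winner_dob date,
    primary key (tournament, year)
);

-- 3NF: the transitively dependent column moves to its own table
create table tournament_winners (
    tournament varchar(50),
    year       int,
    winner     varchar(50),
    primary key (tournament, year)
);

create table winner_dates_of_birth (
    winner     varchar(50) primary key,
    winner_dob date
);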

Boyce-Codd Normal Form (BCNF)
•  It is a slightly stronger version of the third normal form (3NF)
•  A relational schema R is in Boyce-Codd normal form if and only if, for every one of its dependencies X → Y, at least one of the following conditions holds:
   –  X → Y is a trivial functional dependency (Y ⊆ X)
   –  X is a superkey for schema R
•  Only in rare cases does a 3NF table not meet the requirements of BCNF
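One standard textbook case of a 3NF table that violates BCNF (an assumed example, not from the slides): enrollments of (student, course, teacher) where each teacher teaches exactly one course.

-- FDs: {student, course} -> teacher  and  teacher -> course
-- Candidate keys: {student, course} and {student, teacher},
-- so every attribute is prime and the table is in 3NF.
-- But the determinant of teacher -> course (teacher) is not
-- a superkey, so the table is not in BCNF.
create table enrollment_3nf_only (
    student varchar(50),
    course  varchar(50),
    teacher varchar(50),
    primary key (student, course)
);

-- A BCNF decomposition: make teacher the key of its own table
create table teacher_course (
    teacher varchar(50) primary key,
    course  varchar(50)
);

create table student_teacher (
    student varchar(50),
    teacher varchar(50) references teacher_course(teacher),
    primary key (student, teacher)
);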

Fourth Normal Form (4NF)
•  A table is in fourth normal form (4NF) if it is in 3NF and there are no multi-valued dependencies
•  Multi-valued Dependency: in a table with columns A, B, and C, there is a multi-valued dependency of column B on column A if each value of A is associated with a specific collection of values of B and, furthermore, this collection is independent of any values of C
   –  e.g. (employee, skill, language): two many-to-many relationships that are independent, because any skill can be paired with any language
•  To remove multi-valued dependencies, create separate tables for the independent repeating groups, as sketched below
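A minimal sketch of that split (column types assumed):

-- Not in 4NF: skill and language are independent of each other,
-- so every skill/language combination must be stored
create table employee_skill_language_bad (
    employee varchar(50),
    skill    varchar(50),
    language varchar(50),
    primary key (employee, skill, language)
);

-- 4NF: one table per independent many-to-many relationship
create table employee_skills (
    employee varchar(50),
    skill    varchar(50),
    primary key (employee, skill)
);

create table employee_languages (
    employee varchar(50),
    language varchar(50),
    primary key (employee, language)
);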

De-normalization
•  De-normalization is the process of combining tables in a careful manner to improve performance
•  It is the process of deliberately breaking the rules of 3NF
•  The primary reasons to do this are:
   –  to reduce the number of joins that must be processed in queries, thereby improving database performance
   –  to map the physical database structure more closely to the user's dimensional business model, structuring tables along the lines of how users will ask questions

De-normalization
•  Normalization is the rule of thumb in DBMS design, but in a Decision Support System (DSS) ease of use is achieved by way of de-normalization
•  De-normalization brings dispersed but related data items "close" together
•  Query performance in a DSS depends significantly on the physical data model
•  De-normalization specifically improves performance by one or more of the following:
   –  reducing the number of tables and hence the reliance on joins, which consequently speeds up performance
   –  reducing the number of joins required during query execution, or
   –  reducing the number of rows (records) to be retrieved from the primary data table

Normalization vs. De-normalization

De-normalization
•  "Depending on whether the modeler is building the model for a data mart or a data warehouse, the data modeler will wish to engage in some degree of de-normalization." [Bill Inmon]
•  "De-normalization of the logical data model serves the purpose of making the data more efficient to access. In the case of a data mart, a high degree of de-normalization can be practiced. In the case of a data warehouse, a low degree of de-normalization is in order." [Bill Inmon]

Issues to consider in De-normalization
•  The effects of de-normalization on database performance are unpredictable: many applications can be affected negatively by de-normalization
•  De-normalize the implementation of the logical model only after thoroughly analyzing the costs and benefits, and only after a normalized logical design has been completed

De-normalization: Effects
•  Consider the following effects of de-normalization before deciding to undertake design changes:
   –  a de-normalized physical implementation can increase hardware costs
   –  while de-normalization benefits the applications it is specifically designed to enhance, it often decreases the performance of other applications
   –  de-normalization introduces update anomalies to the database

De-normalization
•  The following are typical of the de-normalizations that can sometimes be exploited to optimize performance:
   –  Pre-join
   –  Column Replication or Movement
   –  Pre-Aggregation

Pre-join: De-normalization
•  A pre-join de-normalization moves frequently joined attributes into the same base relation in order to eliminate join processing
•  It avoids the performance impact of the frequent joins
•  It typically increases storage requirements

Pre-join: De-normalization
•  Before de-normalization: a sales header table (sales_id, store_id, sales_dt, …) related 1:m to a sales_detail table (tx_id, sales_id, item_id, …, item_qty, sale_amt), so the query must join the two:

select sum(sales_detail.sale_amt)
from sales, sales_detail
where sales.sales_id = sales_detail.sales_id
  and sales.sales_dt between '2006-11-26' and '2006-12-25';

Pre-join: De-normalization
•  After de-normalization: a single d_sales_detail table (tx_id, sales_id, store_id, sales_dt, item_id, …, item_qty, sale_amt) answers the same question without a join:

select sum(d_sales_detail.sale_amt)
from d_sales_detail
where d_sales_detail.sales_dt between '2006-11-26' and '2006-12-25';
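One way the pre-joined table could be built (a sketch; column names are taken from the tables above, and the create-table-as-select form varies by DBMS):

-- Materialize the join once, so queries scan a single table
create table d_sales_detail as
select s.sales_id,
       s.store_id,
       s.sales_dt,
       d.tx_id,
       d.item_id,
       d.item_qty,
       d.sale_amt
from   sales s
join   sales_detail d on d.sales_id = s.sales_id;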

Column Replication: De-normalization
•  Take columns that are frequently accessed via large-scale joins and replicate (or move) them into the detail table(s) to avoid the join operation
•  It avoids the performance impact of the frequent joins
•  It increases the storage requirements of the database

Column Replication: De-normalization
•  A three-table join requires re-distribution of significant amounts of data to answer many important questions related to customer transaction behavior
•  Before de-normalization:
   Customer (Customer_Id, Customer_Nm, Address, SIC, …)
     1:m  Account (Account_Id, Customer_Id, Balance $, Open_Dt, …)
     1:m  Transaction (Tx_Id, Account_Id, Tx$, Tx_Dt, Location_Id, …)
•  After de-normalization: Customer_Id is replicated into the transaction table
   Customer (Customer_Id, Customer_Nm, Address, SIC, …)
     1:m  Account (Account_Id, Customer_Id, Balance $, Open_Dt, …)
     1:m  Transaction (Tx_Id, Account_Id, Customer_Id, Tx$, Tx_Dt, Location_Id, …)
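A sketch of the queries this enables (the amount column is assumed to be tx_amt, and the transaction table is named tx because "transaction" is a reserved word in some DBMSs):

-- Before: customer-level transaction questions need a 3-way join
select c.customer_id, sum(t.tx_amt)
from   customer c
join   account  a on a.customer_id = c.customer_id
join   tx       t on t.account_id  = a.account_id
group by c.customer_id;

-- After: customer_id is replicated into the transaction table,
-- so the same question is answered from one table
select customer_id, sum(tx_amt)
from   tx
group by customer_id;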

Pre-aggregation: De-normalization
•  Take aggregate values that are frequently used in decision-making and pre-compute them into physical tables in the database
•  It can provide a huge performance advantage by avoiding frequent aggregation of detailed data
•  Pre-aggregation adds a significant maintenance burden to the Data Warehouse
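A minimal sketch of a pre-aggregated summary table (a hypothetical daily store summary built from the sales tables used earlier; refresh logic is omitted):

-- Pre-compute a frequently requested aggregate once per load
create table d_store_daily_sales as
select s.store_id,
       s.sales_dt,
       sum(d.sale_amt) as total_sale_amt,
       sum(d.item_qty) as total_item_qty
from   sales s
join   sales_detail d on d.sales_id = s.sales_id
group by s.store_id, s.sales_dt;

-- Decision-support queries read the small summary table instead
-- of re-aggregating the detail rows every time
select store_id, sum(total_sale_amt)
from   d_store_daily_sales
where  sales_dt between '2006-11-26' and '2006-12-25'
group by store_id;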