Normalization Is the gradual and sequential process of efficiently organizing data in a database...


Page 1: Normalization

• Normalization is the gradual, sequential process of efficiently organizing data in a database, following the rules listed in the previous slide
– Normalization commonly involves applying three normal forms, in order: First, Second, and Third Normal Form (1NF, 2NF, 3NF)
– This is commonly done during the early stages, on UML class diagrams
• The goal of normalization is to:
– eliminate the duplication of data (which makes a database large, inefficient, and slow), which in turn prevents data manipulation anomalies and loss of data integrity
• e.g., changes made to the same data in different places may not agree
– This is done by creating tables, assigning a PK to each table, and making sure that each piece of information shows up only once in the database
• It eliminates redundant data (storing the same data in more than one table) and ensures that data dependencies are logical (only related data are stored in a table)
• Normalization reduces the amount of space a database consumes and ensures that data are logically stored
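To make the "changes in different places may not be the same" problem concrete, here is a minimal Python sketch. The rows, the new address, and the helper name `addresses_for` are illustrative assumptions, not part of the slides. The same address is stored in two rows, and updating only one of them leaves the data inconsistent.

```python
# Illustrative rows: one investigator's address is stored twice (duplication).
rows = [
    {"investigator": "John Wayne", "analysis": "XRF", "address": "3500 Pacific View Dr"},
    {"investigator": "John Wayne", "analysis": "Isotopic age", "address": "3500 Pacific View Dr"},
]


def addresses_for(name, table):
    """Return the set of distinct addresses stored for one investigator."""
    return {r["address"] for r in table if r["investigator"] == name}


# Update only the first row (hypothetical new address) ...
rows[0]["address"] = "1 Example Street"

# ... and the table now holds two conflicting addresses for one person.
conflicting = addresses_for("John Wayne", rows)
print(len(conflicting))
```

If the address were stored once, in its own table, a single update could not produce this conflict.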

Page 2: First Normal Form (1NF)

• 1NF deals with duplicative data across multiple columns!
• It sets the very basic rules to make sure that:
– Separate tables are created for each group of related data (e.g., IsotopicAge, Fold, Rock)
• i.e., each table should represent a distinct entity
1. Duplicative (repeating) columns containing the same type of data are removed from the same table
• There should be no repeated columns of the same type: Mineral1, Mineral2, Mineral3 or cellPhone, homePhone, workPhone
• These should go to a new table
2. All columns must contain a single value, i.e.,
• All attributes must be atomic (e.g., XRF), not multi-valued. Each cell must have only one value, e.g., "XRF", not "XRF, REE, Isotope"
3. There should be a set of one or more columns that uniquely identify each row, i.e., there should be a primary key
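The three rules above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name RockMineral, the column names, and the sample minerals are hypothetical, chosen to show how the repeating Mineral1, Mineral2, Mineral3 columns move into a new table with one atomic value per cell and a primary key on every table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Parent table: one row per rock, identified by a primary key.
    CREATE TABLE Rock (
        rockID INTEGER PRIMARY KEY,
        name   TEXT NOT NULL
    );
    -- Child table replaces the repeating columns Mineral1, Mineral2, Mineral3.
    CREATE TABLE RockMineral (
        rockID  INTEGER NOT NULL REFERENCES Rock(rockID),
        mineral TEXT    NOT NULL,
        PRIMARY KEY (rockID, mineral)
    );
""")

conn.execute("INSERT INTO Rock VALUES (1, 'granite')")
# One mineral per row, so a fourth mineral needs no schema change.
conn.executemany(
    "INSERT INTO RockMineral VALUES (?, ?)",
    [(1, "quartz"), (1, "feldspar"), (1, "biotite"), (1, "hornblende")],
)

mineral_count = conn.execute(
    "SELECT COUNT(*) FROM RockMineral WHERE rockID = 1"
).fetchone()[0]
print(mineral_count)
```

With this layout, the number of minerals per rock is limited by the data, not by how many MineralN columns were created in advance.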

Page 3: Another example: Analysis table

Investigator | AnalysisType | Address
Hassan Babaie | XRF | 24 Peachtree Center Ave, Atlanta, GA 30303
John Wayne | XRF, XRD, REE | 3500 Pacific View Dr, Newport Beach, CA
Elizabeth Tucker | Petrography | 1100 Angela Ra, Charlotte, NC,
John Wayne | Isotopic age | 3500 Pacific View Dr, Newport Beach, CA

• Investigators submit their samples to an analyzing company. The company stores the above set of data for its customers
• What are the problems?
– This is not in 1NF
– The AnalysisType column does not represent a distinct entity
• We can't find out how many people ordered an XRF analysis; the analysis types are all mixed together.
– The Address column is compound and needs to move out into another table. City depends on zip code.
– There is no PK
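The query problem named above can be reproduced in a few lines. The sketch below, using Python's sqlite3 module, loads the Investigator and AnalysisType columns from the table on this slide (Address is left out for brevity) and compares a naive exact-match count for XRF with the count obtained after splitting each cell into atomic values. The variable names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Analysis (Investigator TEXT, AnalysisType TEXT)")
conn.executemany(
    "INSERT INTO Analysis VALUES (?, ?)",
    [
        ("Hassan Babaie", "XRF"),
        ("John Wayne", "XRF, XRD, REE"),
        ("Elizabeth Tucker", "Petrography"),
        ("John Wayne", "Isotopic age"),
    ],
)

# An exact match misses John Wayne's row, where XRF is buried in a list.
naive_count = conn.execute(
    "SELECT COUNT(*) FROM Analysis WHERE AnalysisType = 'XRF'"
).fetchone()[0]

# Splitting every cell into single (atomic) values gives the real answer.
atomic_rows = [
    (who, kind.strip())
    for who, kinds in conn.execute("SELECT Investigator, AnalysisType FROM Analysis")
    for kind in kinds.split(",")
]
true_count = sum(1 for _, kind in atomic_rows if kind == "XRF")

print(naive_count, true_count)
```

The two counts differ because the multi-valued cell hides one of the XRF orders from the exact-match query.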

Page 4: Second Normal Form (2NF)

• 2NF deals with redundancy across multiple rows!
• Second normal form (2NF) further addresses the concept of removing duplicative data
• Meet all the requirements of the first normal form (1NF)
• Identify columns whose data repeat in different places
– Remove them to their own table
– Formally, every non-key column must depend on the whole primary key, not on just part of a composite key
• In the next slide, we see that data for Joe Strat is repeated. The solution is to remove the Alum column (with its address and school) into their own tables, called Alum and School
• See next slide for more!

Page 5: An improved Analysis Table

• Now we can query on the type of analysis
• There are still problems with the structure:
• There are still redundancies
• The table can only keep track of three types of analysis per row; a fourth would not fit!
• Address is still compound; it needs to be broken up
• It is difficult to determine the analysis order for each person.
– Order in this case depends on non-PK columns

Investigator | Analysis1 | Analysis2 | Analysis3 | Address
Hassan Babaie | XRF | | | Department of Geosciences, GSU, Atlanta, GA 30303
John Wayne | XRF | XRD | REE | 3500 Pacific View Dr, Newport Beach, CA
Elizabeth Tucker | Petrography | | | 1100 Angela Ra, Charlotte, NC,
John Wayne | Isotopic | | | 3500 Pacific View Dr, Newport Beach, CA

(slide label: "orders")

Page 6: Better solution

• We need to break the table into several tables:
– Investigator, Analysis, Order, OrderItems, and Address

InvestigatorTable
investiID | lastName | firstName | affiliation
1 | Wayne | John | ExHollywood
2 | Babaie | Hassan | GSU

AnalysisTable
AnalysisID | AnalysisType
1 | XRF
2 | XRF

AddressTable
Number | Street | City | State | zipCode | Country
3500 | Pacific View Dr. | Newport Beach | CA | 92662 | USA
24 | Peachtree Center Ave | Atlanta | GA | 30303 |

Page 7:

• Order and OrderItem Tables, partially shown

OrderTable
OrderID | InvestiID | OrderDate | DeliveryDate
1 | 1 | 3/5/1960 | 4/30/1960
2 | 2 | 2/17/2013 | 3/12/2013

OrderItemTable
OrderItemID | OrderID | AnalysisID | Qty
1 | 1 | 1 | 2
2 | 2 | 2 | 1
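As a rough sketch of how the split tables fit back together, the following Python sqlite3 example creates the Investigator, Analysis, Order, and OrderItem tables with the rows shown on the slides and joins them through their keys. The column types, the foreign-key declarations, and the query are assumptions made for illustration; the slides give only the column names and sample rows, and the Address table is omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE InvestigatorTable (
        investiID INTEGER PRIMARY KEY,
        lastName TEXT, firstName TEXT, affiliation TEXT
    );
    CREATE TABLE AnalysisTable (
        AnalysisID INTEGER PRIMARY KEY,
        AnalysisType TEXT
    );
    CREATE TABLE OrderTable (
        OrderID INTEGER PRIMARY KEY,
        InvestiID INTEGER REFERENCES InvestigatorTable(investiID),
        OrderDate TEXT, DeliveryDate TEXT   -- dates kept as text, as on the slide
    );
    CREATE TABLE OrderItemTable (
        OrderItemID INTEGER PRIMARY KEY,
        OrderID INTEGER REFERENCES OrderTable(OrderID),
        AnalysisID INTEGER REFERENCES AnalysisTable(AnalysisID),
        Qty INTEGER
    );
    -- Rows copied from the slides.
    INSERT INTO InvestigatorTable VALUES (1, 'Wayne', 'John', 'ExHollywood');
    INSERT INTO InvestigatorTable VALUES (2, 'Babaie', 'Hassan', 'GSU');
    INSERT INTO AnalysisTable VALUES (1, 'XRF');
    INSERT INTO AnalysisTable VALUES (2, 'XRF');
    INSERT INTO OrderTable VALUES (1, 1, '3/5/1960', '4/30/1960');
    INSERT INTO OrderTable VALUES (2, 2, '2/17/2013', '3/12/2013');
    INSERT INTO OrderItemTable VALUES (1, 1, 1, 2);
    INSERT INTO OrderItemTable VALUES (2, 2, 2, 1);
""")

# Follow the keys: order item -> order -> investigator, and order item -> analysis.
order_lines = conn.execute("""
    SELECT i.lastName, a.AnalysisType, oi.Qty
    FROM OrderItemTable AS oi
    JOIN OrderTable        AS o ON o.OrderID = oi.OrderID
    JOIN InvestigatorTable AS i ON i.investiID = o.InvestiID
    JOIN AnalysisTable     AS a ON a.AnalysisID = oi.AnalysisID
    ORDER BY oi.OrderItemID
""").fetchall()

for line in order_lines:
    print(line)
```

Each fact (a name, an analysis type, an order date) is stored once, and the joins rebuild the combined view only when it is needed.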

Page 8: Some improvement

[Diagram of the tables and their attributes]
Analysis: AnalysisID, AnalysisType
OrderItem: OrderItemID, OrderID, AnalysisID, Qty
Order: OrderID, InvestID, OrderDate, DeliveryDate
Investigator: InvestID, FirstName, FirstName, Address
Address: AddressID, Number, Stree…

Page 9: Third Normal Form (3NF)

• Third normal form goes one large step further
• Meet all the requirements of the 2NF
• No transitive functional dependencies
– Remove columns that are not dependent upon the primary key
• Remove columns whose values depend on columns other than the PK
– This means: remove subkeys

Page 10: 3NF, cont'd

• There should be no transitive functional dependencies
• If x → y, i.e., x functionally determines y (y is functionally dependent on x), then given x, we can find y.
– Example: in the Address table, given the nine-digit zip code, we can find the city and state because they are functionally dependent on the zip code. The opposite is not true: given a city, we cannot find the zip code (Note: some cities have several zip codes)
• By definition, a superkey (primary key) functionally determines all other attributes in the table
• The zip code is a subkey (not a superkey) because it only determines the city and state part of the Address table, not the other attributes
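The idea of a functional dependency can be checked mechanically on sample data. The sketch below is a minimal Python illustration; the function name `determines` and the sample rows are assumptions for demonstration only. Note that a check like this can only refute a dependency on the rows it sees; it cannot prove that the dependency holds for all possible data.

```python
def determines(rows, x, y):
    """Return True if, in these rows, each value of column x maps to one value of y."""
    seen = {}
    for row in rows:
        # setdefault stores the first y seen for this x and returns it afterwards.
        if seen.setdefault(row[x], row[y]) != row[y]:
            return False
    return True


# Illustrative sample: two zip codes share one city.
addresses = [
    {"zipCode": "30303", "city": "Atlanta", "state": "GA"},
    {"zipCode": "30309", "city": "Atlanta", "state": "GA"},
    {"zipCode": "92662", "city": "Newport Beach", "state": "CA"},
]

zip_to_city = determines(addresses, "zipCode", "city")  # zip code -> city holds here
city_to_zip = determines(addresses, "city", "zipCode")  # city -> zip code does not
print(zip_to_city, city_to_zip)
```

The asymmetry matches the example on this slide: a zip code identifies a city, but a city with several zip codes does not identify a zip code.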

Page 11:

• To take care of the transitive functional dependency issue, take 3 steps:
– Remove all the attributes that depend on the subkey from the table (e.g., city and state from the Address table)
– Move them into a new table (e.g., call it ZipLocations, with zipCode, city, and state attributes)
– Keep a copy of the subkey attribute (i.e., zipCode) in the original table as a foreign key
• The Address table now has firstName, lastName, street (these 3 make the PK), and zipCode (as FK to the other table).
• Summary: Subkeys always result in redundant data and must be removed!
• In other words, remove subsets of data that apply to multiple rows of a table and place them in separate tables
– i.e., remove duplicative data
– For example, break address into its independent constituents that do not depend on each other
• Create relationships between these new tables and their predecessors through the use of foreign keys
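The three steps above can be sketched with Python's sqlite3 module. The table and column names follow the slide (ZipLocations, Address, zipCode, city, state); the column types, the sample rows, and the rejected "Jane Doe" row are assumptions added for illustration. SQLite enforces foreign keys only when the foreign_keys pragma is switched on, so the sketch enables it first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    -- Steps 1 and 2: city and state move out into their own table, keyed by zipCode.
    CREATE TABLE ZipLocations (
        zipCode TEXT PRIMARY KEY,
        city    TEXT NOT NULL,
        state   TEXT NOT NULL
    );
    -- Step 3: the original table keeps zipCode as a foreign key.
    CREATE TABLE Address (
        firstName TEXT,
        lastName  TEXT,
        street    TEXT,
        zipCode   TEXT NOT NULL REFERENCES ZipLocations(zipCode),
        PRIMARY KEY (firstName, lastName, street)
    );
    INSERT INTO ZipLocations VALUES ('92662', 'Newport Beach', 'CA');
    INSERT INTO Address VALUES ('John', 'Wayne', '3500 Pacific View Dr', '92662');
""")

# City and state are recovered through the foreign key when needed.
city, state = conn.execute("""
    SELECT z.city, z.state
    FROM Address AS a
    JOIN ZipLocations AS z ON z.zipCode = a.zipCode
    WHERE a.lastName = 'Wayne'
""").fetchone()

# An address pointing at an unknown zip code is rejected (hypothetical row).
try:
    conn.execute("INSERT INTO Address VALUES ('Jane', 'Doe', '1 Example St', '00000')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(city, state, fk_rejected)
```

Because each zip code's city and state are stored once in ZipLocations, correcting a city name is a single-row update no matter how many addresses share that zip code.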

Page 12: Fourth Normal Form (4NF)

• Normalizing a database to the 3NF is usually sufficient
• Finally, fourth normal form (4NF) has one additional requirement
• Meet all the requirements of the third normal form
• A relation is in 4NF if it has no non-trivial multi-valued dependencies