1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

48
1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003

description

3 Relational Schema Design (or Logical Design) Main idea: Start with some relational schema Find out its FD’s Use them to design a better relational schema

Transcript of 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

Page 1: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

1

Lecture 10:Database Design andRelational Algebra

Monday, October 20, 2003

Page 2: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

2

Outline

• Design of a Relational schema (3.6)• Relational Algebra (5.2)• Operations on bags (5.3, 5.4)

– Reading assignment 5.3 and 5.4 (won’t have time to cover in class)

Page 3: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

3

Relational Schema Design(or Logical Design)

Main idea:• Start with some relational schema• Find out its FD’s• Use them to design a better relational

schema

Page 4: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

4

Data Anomalies

When a database is poorly designed we get anomalies:

Redundancy: data is repeated

Updated anomalies: need to change in several places

Delete anomalies: may lose data when we don’t want

Page 5: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

5

Relational Schema Design

Anomalies:• Redundancy = repeat data• Update anomalies = Fred moves to “Bellevue”• Deletion anomalies = Joe deletes his phone number:

what is his city ?

Recall set attributes (persons with several phones):

SSN Name, City

Name SSN PhoneNumber City

Fred 123-45-6789 206-555-1234 Seattle

Fred 123-45-6789 206-555-6543 Seattle

Joe 987-65-4321 908-555-2121 Westfield

but not SSN PhoneNumber

Page 6: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

6

Relation DecompositionBreak the relation into two:

Name SSN City

Fred 123-45-6789 Seattle

Joe 987-65-4321 Westfield

SSN PhoneNumber

123-45-6789 206-555-1234

123-45-6789 206-555-6543

987-65-4321 908-555-2121Anomalies have gone:• No more repeated data• Easy to move Fred to “Bellevue” (how ?)• Easy to delete all Joe’s phone number (how ?)

Name SSN PhoneNumber City

Fred 123-45-6789 206-555-1234 Seattle

Fred 123-45-6789 206-555-6543 Seattle

Joe 987-65-4321 908-555-2121 Westfield

Page 7: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

7

Relational Schema Design

PersonbuysProduct

name

price name ssn

Conceptual Model:

Relational Model:plus FD’s

Normalization:Eliminates anomalies

Page 8: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

8

Decompositions in General

R1 = projection of R on A1, ..., An, B1, ..., Bm

R2 = projection of R on A1, ..., An, C1, ..., Cp

R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)

R1(A1, ..., An, B1, ..., Bm) R2(A1, ..., An, C1, ..., Cp)

Page 9: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

9

Decomposition

• Sometimes it is correct:Name Price Category

Gizmo 19.99 Gadget

OneClick 24.99 Camera

Gizmo 19.99 Camera

Name Price

Gizmo 19.99

OneClick 24.99

Gizmo 19.99

Name Category

Gizmo Gadget

OneClick Camera

Gizmo Camera

Lossless decomposition

Page 10: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

10

Incorrect Decomposition

• Sometimes it is not:

Name Price Category

Gizmo 19.99 Gadget

OneClick 24.99 Camera

Gizmo 19.99 Camera

Name Category

Gizmo Gadget

OneClick Camera

Gizmo Camera

Price Category

19.99 Gadget

24.99 Camera

19.99 Camera

What’sincorrect ??

Lossy decomposition

Page 11: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

11

Decompositions in GeneralR(A1, ..., An, B1, ..., Bm, C1, ..., Cp)

If A1, ..., An B1, ..., Bm

Then the decomposition is lossless

R1(A1, ..., An, B1, ..., Bm) R2(A1, ..., An, C1, ..., Cp)

Example: name price, hence the first decomposition is lossless

Note: don’t need necessarily A1, ..., An C1, ..., Cp

Page 12: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

12

Normal Forms

First Normal Form = all attributes are atomic

Second Normal Form (2NF) = old and obsolete

Third Normal Form (3NF) = this lecture

Boyce Codd Normal Form (BCNF) = this lecture

Others...

Page 13: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

13

Boyce-Codd Normal FormA simple condition for removing anomalies from relations:

In English (though a bit vague):

Whenever a set of attributes of R is determining another attribute, should determine all the attributes of R.

A relation R is in BCNF if:

If A1, ..., An B is a non-trivial dependency

in R , then {A1, ..., An} is a key for R

Page 14: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

14

BCNF Decomposition Algorithm

A’s OthersB’s

R1

Is there a 2-attribute relation that isnot in BCNF ?

Repeat choose A1, …, Am B1, …, Bn that violates the BNCF condition split R into R1(A1, …, Am, B1, …, Bn) and R2(A1, …, Am, [others]) continue with both R1 and R2

Until no more violations

R2

Page 15: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

15

Example

What are the dependencies?SSN Name, City

What are the keys?{SSN, PhoneNumber}

Is it in BCNF?

Name SSN PhoneNumber City

Fred 123-45-6789 206-555-1234 Seattle

Fred 123-45-6789 206-555-6543 Seattle

Joe 987-65-4321 908-555-2121 Westfield

Joe 987-65-4321 908-555-1234 Westfield

Page 16: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

16

Decompose it into BCNF

Name SSN City

Fred 123-45-6789 Seattle

Joe 987-65-4321 Westfield

SSN PhoneNumber

123-45-6789 206-555-1234

123-45-6789 206-555-6543

987-65-4321 908-555-2121

987-65-4321 908-555-1234

SSN Name, City

Let’s check anomalies:• Redundancy ?• Update ?• Delete ?

Page 17: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

17

Summary of BCNF Decomposition

Find a dependency that violates the BCNF condition:

A’sOthers B’s

R1 R2

Heuristics: choose B , B , … B “as large as possible”1 2 m

Decompose:

Is there a 2-attribute relation that isnot in BCNF ?

Continue untilthere are noBCNF violationsleft.

A1, A2, …, An B1, B2, …, Bm

Page 18: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

18

Example Decomposition Person(name, SSN, age, hairColor, phoneNumber)

SSN name, ageage hairColor

Decompose in BCNF (in class):

Step 1: find all keys (How ? Compute S+, for various sets S)

Step 2: now decompose

Page 19: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

19

Other Example

• R(A,B,C,D) A B, B C

• Key: AD• Violations of BCNF: A B, A C,

ABC• Pick A BC: split into R1(A,BC) R2(A,D)• What happens if we pick A B first ?

Page 20: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

20

Lossless Decompositions A decomposition is lossless if we can recover: R(A,B,C)

R1(A,B) R2(A,C)

R’(A,B,C) should be the same as R(A,B,C)

R’ is in general larger than R. Must ensure R’ = R

Decompose

Recover

Page 21: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

21

Lossless Decompositions

• Given R(A,B,C) s.t. AB, the decomposition into R1(A,B), R2(A,C) is lossless

Page 22: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

22

3NF: A Problem with BCNFUnit Company Product

Unit Company

Unit Product

FD’s: Unit Company; Company, Product UnitSo, there is a BCNF violation, and we decompose.

Unit Company

No FDs

Notice: we loose the FD: Company, Product Unit

Page 23: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

23

So What’s the Problem?

Unit Company Product

Unit Company Unit Product

Galaga99 UW Galaga99 databasesBingo UW Bingo databases

No problem so far. All local FD’s are satisfied.Let’s put all the data back into a single table again:

Galaga99 UW databasesBingo UW databases

Violates the dependency: company, product -> unit!

Page 24: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

24

Solution: 3rd Normal Form (3NF)

A simple condition for removing anomalies from relations:

A relation R is in 3rd normal form if :

Whenever there is a nontrivial dependency A1, A2, ..., An Bfor R , then {A1, A2, ..., An } a super-key for R, or B is part of a key.

Tradeoff:BCNF = no anomalies, but may lose some FDs3NF = keeps all FDs, but may have some anomalies

Page 25: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

25

Relational Algebra

• Formalism for creating new relations from existing ones

• Its place in the big picture:

Declartivequery

languageAlgebra Implementation

SQL,relational calculus

Relational algebraRelational bag algebra

Page 26: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

26

Relational Algebra• Five operators:

– Union: – Difference: -– Selection:– Projection: – Cartesian Product:

• Derived or auxiliary operators:– Intersection, complement– Joins (natural,equi-join, theta join, semi-join)– Renaming:

Page 27: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

27

1. Union and 2. Difference

• R1 R2• Example:

– ActiveEmployees RetiredEmployees

• R1 – R2• Example:

– AllEmployees -- RetiredEmployees

Page 28: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

28

What about Intersection ?

• It is a derived operator• R1 R2 = R1 – (R1 – R2)• Also expressed as a join (will see later)• Example

– UnionizedEmployees RetiredEmployees

Page 29: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

29

3. Selection• Returns all tuples which satisfy a condition• Notation: c(R)• Examples

– Salary > 40000 (Employee)– name = “Smithh” (Employee)

• The condition c can be =, <, , >, , <>

Page 30: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

30

Selection Example

EmployeeSSN Name DepartmentID Salary999999999 John 1 30,000777777777 Tony 1 32,000888888888 Alice 2 45,000

SSN Name DepartmentID Salary888888888 Alice 2 45,000

Find all employees with salary more than $40,000.Salary > 40000 (Employee)

Page 31: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

31

4. Projection• Eliminates columns, then removes

duplicates• Notation: A1,…,An (R)• Example: project social-security number

and names:– SSN, Name (Employee)– Output schema: Answer(SSN, Name)

Page 32: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

32

Projection Example

EmployeeSSN Name DepartmentID Salary999999999 John 1 30,000777777777 Tony 1 32,000888888888 Alice 2 45,000

SSN Name999999999 John777777777 Tony888888888 Alice

SSN, Name (Employee)

Page 33: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

33

5. Cartesian Product

• Each tuple in R1 with each tuple in R2• Notation: R1 R2• Example:

– Employee Dependents• Very rare in practice; mainly used to

express joins

Page 34: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

34

Cartesian Product Example Employee Name SSN John 999999999 Tony 777777777 Dependents EmployeeSSN Dname 999999999 Emily 777777777 Joe Employee x Dependents Name SSN EmployeeSSN Dname John 999999999 999999999 Emily John 999999999 777777777 Joe Tony 777777777 999999999 Emily Tony 777777777 777777777 Joe

Page 35: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

35

Relational Algebra• Five operators:

– Union: – Difference: -– Selection:– Projection: – Cartesian Product:

• Derived or auxiliary operators:– Intersection, complement– Joins (natural,equi-join, theta join, semi-join)– Renaming:

Page 36: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

36

Renaming

• Changes the schema, not the instance• Notation: B1,…,Bn (R)• Example:

– LastName, SocSocNo (Employee)– Output schema:

Answer(LastName, SocSocNo)

Page 37: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

37

Renaming Example

EmployeeName SSNJohn 999999999Tony 777777777

LastName SocSocNoJohn 999999999Tony 777777777

LastName, SocSocNo (Employee)

Page 38: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

38

Natural Join• Notation: R1 R2⋈• Meaning: R1 R2 = ⋈ A(C(R1 R2))

• Where:– The selection C checks equality of all common

attributes– The projection eliminates the duplicate common

attributes

Page 39: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

39

Natural Join ExampleEmployeeName SSNJohn 999999999Tony 777777777

DependentsSSN Dname999999999 Emily777777777 Joe

Name SSN DnameJohn 999999999 EmilyTony 777777777 Joe

Employee Dependents = Name, SSN, Dname( SSN=SSN2(Employee x SSN2, Dname(Dependents))

Page 40: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

40

Natural Join

• R= S=

• R ⋈ S=

A B

X Y

X Z

Y Z

Z V

B C

Z U

V W

Z V

A B CX Z UX Z VY Z UY Z VZ V W

Page 41: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

41

Natural Join

• Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⋈ S ?

• Given R(A, B, C), S(D, E), what is R ⋈ S ?

• Given R(A, B), S(A, B), what is R ⋈ S ?

Page 42: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

42

Theta Join

• A join that involves a predicate• R1 ⋈ R2 = (R1 R2)• Here can be any condition

Page 43: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

43

Eq-join

• A theta join where is an equality• R1 ⋈A=B R2 = A=B (R1 R2)• Example:

– Employee ⋈SSN=SSN Dependents

• Most useful join in practice

Page 44: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

44

Semijoin

• R ⋉ S = A1,…,An (R ⋈ S)

• Where A1, …, An are the attributes in R• Example:

– Employee ⋉ Dependents

Page 45: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

45

Semijoins in Distributed Databases

• Semijoins are used in distributed databases

SSN Name

. . . . . .

SSN Dname Age

. . . . . .

EmployeeDependents

network

Employee ⋈ssn=ssn (age>71 (Dependents))

T = SSN age>71 (Dependents)R = Employee T⋉

Answer = R ⋈ Dependents

Page 46: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

46

Complex RA Expressions

Person Purchase Person Product

name=fred name=gizmo

pid ssn

seller-ssn=ssn

pid=pid

buyer-ssn=ssn

name

Page 47: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

47

Operations on Bags

A bag = a set with repeated elementsAll operations need to be defined carefully on bags• {a,b,b,c}{a,b,b,b,e,f,f}={a,a,b,b,b,b,b,c,e,f,f}• {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b,d}• C(R): preserve the number of occurrences

• A(R): no duplicate elimination• Cartesian product, join: no duplicate eliminationImportant ! Relational Engines work on bags, not sets !

Reading assignment: 5.3 – 5.4

Page 48: 1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.

48

Finally: RA has Limitations !

• Cannot compute “transitive closure”

• Find all direct and indirect relatives of Fred• Cannot express in RA !!! Need to write C program

Name1 Name2 Relationship

Fred Mary Father

Mary Joe Cousin

Mary Bill Spouse

Nancy Lou Sister