Julia Stoyanovich
Final exam logistics
• When: June 6, in class
• The same format as the midterm: open book, open notes
• 2 hours in length
• The exam is cumulative: it will include material from the first half of the term. However, it will likely focus more on material from the second half of the term.
• To study: go over lecture notes, midterm, homework / solutions
Normalization: important to know
• Closures, keys
• compute the closure of a set of attributes
• compute candidate keys of a relation
• check whether an FD follows from a set of FDs
• compute a minimal basis of a set of FDs
• Normal forms
• check whether a relation is in BCNF
• check whether a relation is in 3NF
• I will not ask you to produce a BCNF or a 3NF decomposition on the final exam
Closure of a set of attributes
Suppose A = {A1, …, An} is a set of attributes and S is a set of FDs.
The closure of A under the FDs in S is the set of attributes B s.t. every relation that satisfies all the FDs in S also satisfies A → B.
We denote the closure of {A1,A2,…,An} by {A1,A2,…,An}+.
Note that {A1,A2,…,An} ⊆ {A1,A2,…,An}+.
Computing the closure of a set of attributes
Algorithm AttributeClosure
Input: a set of attributes {A1,A2,…,An} and a set of FDs S
Output: the closure {A1,A2,…,An}+
1. Split the FDs of S using the splitting rule, so that each FD has one attribute on the right
2. Initialize {A1,A2,…,An}+ ← {A1,A2,…,An}
3. Repeatedly search for some FD B1,B2,…,Bm → C such that {B1,B2,…,Bm} ⊆ {A1,A2,…,An}+ ∧ C ∉ {A1,A2,…,An}+, and add C to {A1,A2,…,An}+
4. Stop when no more attributes can be added to {A1,A2,…,An}+
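The steps of AttributeClosure can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `attribute_closure` and the encoding of FDs as `(lhs, rhs)` string pairs with single-letter attribute names are assumptions made for this example.

```python
def attribute_closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) strings
    and every attribute is a single letter (an assumption of this sketch)."""
    # Step 1: splitting rule, one attribute on the right of each FD.
    split = [(frozenset(lhs), c) for lhs, rhs in fds for c in rhs]
    # Step 2: initialize the closure with the input attributes.
    closure = set(attrs)
    # Steps 3 and 4: add attributes until a full scan adds nothing new.
    changed = True
    while changed:
        changed = False
        for lhs, c in split:
            if lhs <= closure and c not in closure:
                closure.add(c)
                changed = True
    return closure


fds = [("A", "B"), ("B", "C"), ("C", "D")]
print(sorted(attribute_closure("A", fds)))  # ['A', 'B', 'C', 'D']
print(sorted(attribute_closure("C", fds)))  # ['C', 'D']
```

To check whether an FD X → Y follows from S, it is enough to test whether Y is contained in the closure of X.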
Closures and keys
Q: How can we tell if a set of attributes A1A2…An is a candidate key or a superkey of a relation R?
A: If {A1,A2,...,An}+ = all the attributes in R, then A1A2…An is a superkey; it is a candidate key if, in addition, no proper subset of it is a superkey.
Q: How can we compute the candidate keys for R?
A: Find all sets of attributes that functionally determine all other attributes and make sure these sets are minimal.
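One way to carry out this search is a brute-force enumeration of attribute sets in order of increasing size. The sketch below assumes single-letter attribute names and FDs given as `(lhs, rhs)` strings; the helper names `closure` and `candidate_keys` are hypothetical. It is exponential in the number of attributes, which is fine for exam-sized relations.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure; fds is a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(relation, fds):
    """All minimal attribute sets whose closure is the whole relation."""
    keys = []
    # Smaller sets are tried first, so every key that is kept is minimal.
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key: superkey, not minimal
            if closure(candidate, fds) == set(relation):
                keys.append(candidate)
    return ["".join(sorted(key)) for key in keys]


# R(ABCD) with AB -> C, D -> B, AC -> D
print(candidate_keys("ABCD", [("AB", "C"), ("D", "B"), ("AC", "D")]))
# ['AB', 'AC', 'AD']
```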
Minimal basis of a set of FDs
• For a given relation R, there may exist several sets of FDs that are equivalent:
- they give rise to the same closures of all subsets of R’s attributes
- the same sets of FDs follow from them
- all such equivalent sets of FDs are called bases for S in R
• A minimal basis B is a set of FDs that satisfies 3 conditions
1. All FDs in B have 1 attribute on the right
2. If any FD is removed from B, the result is no longer a basis
3. If for any FD in B we remove 1 attribute on the left, the result is no longer a basis
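The three conditions can be tested mechanically with closures. The sketch below is an illustration under stated assumptions: FDs are `(lhs, rhs)` strings over single-letter attributes, and `is_minimal_basis` is a hypothetical helper that checks whether a set of FDs is minimal with respect to itself. It relies on two facts: removing an FD keeps the set a basis exactly when the removed FD still follows from the rest, and shrinking a left side keeps it a basis exactly when the shortened FD already follows from B.

```python
def closure(attrs, fds):
    """Attribute closure; fds is a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def follows(lhs, rhs, fds):
    """True if the FD lhs -> rhs follows from fds."""
    return set(rhs) <= closure(lhs, fds)


def is_minimal_basis(basis):
    # Condition 1: exactly one attribute on every right side.
    if any(len(rhs) != 1 for _, rhs in basis):
        return False
    # Condition 2: no FD may be implied by the remaining FDs.
    for i, (lhs, rhs) in enumerate(basis):
        if follows(lhs, rhs, basis[:i] + basis[i + 1:]):
            return False
    # Condition 3: no attribute on a left side may be dropped.
    for lhs, rhs in basis:
        for attr in lhs:
            if follows(set(lhs) - {attr}, rhs, basis):
                return False
    return True


print(is_minimal_basis([("A", "B"), ("B", "C"), ("C", "D")]))  # True
print(is_minimal_basis([("A", "B"), ("B", "C"), ("A", "C")]))  # False: A -> C is redundant
print(is_minimal_basis([("AB", "C"), ("A", "B")]))             # False: B can be dropped from AB -> C
```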
Example
Find all candidate keys.
R(ABCD): C → B; BC → A; A → C; BD → A
Check whether the following are minimal bases of the set of FDs.
{AC → D, D → B}
{D → A, D → B, D → C}
Example
Compute a projection of the set of FDs when R(ABCD) is projected onto ACD.
R(ABCD): A → B; B → C; C → D; projection π_{ACD}(R)
Compute closures of all subsets of attributes in the projected relation.
{A}+ = {A,B,C,D}
{C}+ = {C,D} = {C,D}+
{D}+ = {D}
We stop here, since any set that includes A will have the same closure as A alone.
Compute FDs from these closures that involve only A, C, D on either side, and remove redundant FDs (keeping only the minimal basis).
Done! T = {A → C, C → D}
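The same procedure can be sketched in Python. The helper names `closure` and `project_fds` are hypothetical, attributes are assumed to be single letters, and the enumeration of all subsets is exponential, so this is only meant for small examples like the one above.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure; fds is a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def project_fds(fds, onto):
    """FDs that hold in the projection onto `onto`, reduced to a small basis."""
    onto = set(onto)
    projected = []
    for size in range(1, len(onto) + 1):
        for combo in combinations(sorted(onto), size):
            lhs = set(combo)
            for attr in sorted((closure(lhs, fds) & onto) - lhs):
                # Skip the FD if a smaller left side already determines attr.
                if any(set(l) < lhs and r == attr for l, r in projected):
                    continue
                projected.append(("".join(combo), attr))
    # Drop every FD that follows from the FDs that remain.
    for fd in list(projected):
        rest = [other for other in projected if other != fd]
        if fd[1] in closure(fd[0], rest):
            projected = rest
    return projected


fds = [("A", "B"), ("B", "C"), ("C", "D")]
print(project_fds(fds, "ACD"))  # [('A', 'C'), ('C', 'D')]
```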
Boyce-Codd Normal Form (BCNF)
Let R be a relation schema, and let S be the set of FDs given to hold over R.
R is in BCNF if, for every FD A1A2…An → B1B2…Bm, one of the following statements is true:
1. The FD is trivial: {B1,B2,…,Bm} ⊆ {A1,A2,…,An}
2. A1A2…An is a candidate key of R
3. A1A2…An is a superkey of R
In a BCNF relation, the only set of attributes that determines values for other attributes is a superkey!
Third Normal Form (3NF)
Let R be a relation schema, and let S be the set of FDs given to hold over R.
R is in 3NF if, for every FD A1A2…An → B1B2…Bm, one of the following statements is true:
1. The FD is trivial: {B1,B2,…,Bm} ⊆ {A1,A2,…,An}
2. A1A2…An is a candidate key of R
3. A1A2…An is a superkey of R
4. Each Bi is part of some candidate key of R
(Conditions 1-3 are the same as for BCNF.)
In contrast to BCNF, some redundancy is possible with 3NF. This normal form is a compromise, needed when no dependency-preserving decomposition into BCNF exists.
Example
Find all candidate keys of R under the given set of FDs. Check whether R is in BCNF, 3NF.
R(ABCD): ABD → C; A → B; AB → C; B → A
{A}+ = {ABC}
{B}+ = {ABC}
{AD}+ = {ABCD}
{BD}+ = {ABCD}
Example
R(ABCD): AB → C; D → B; AC → D
(a) List candidate keys of R.
AB, AC, AD
(b) Does the FD AD → B follow from the set of FDs above?
Yes. To check this, we must check whether B is in the closure of {AD}. We know that this is the case because, as we saw in (a), {AD} is a candidate key of R, and so all attributes are in the closure of {AD}.
(c) Is R in BCNF? Is it in 3NF?
In 3NF but not in BCNF: D → B violates BCNF because D is not a superkey, but B is part of the candidate key AB, so the FD is allowed in 3NF.
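The BCNF and 3NF checks can be automated by combining closures with candidate keys. The following is a sketch under stated assumptions: single-letter attributes, FDs as `(lhs, rhs)` strings, and hypothetical helper names. It tests only the given FDs, and for condition 4 it looks at the right-side attributes that are not already on the left.

```python
from itertools import combinations


def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(relation, fds):
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # superkey, but not minimal
            if closure(candidate, fds) == set(relation):
                keys.append(candidate)
    return keys


def normal_form_violations(relation, fds):
    """Return (FDs violating BCNF, FDs violating 3NF)."""
    prime = set().union(*candidate_keys(relation, fds))  # attributes in some key
    bcnf, third = [], []
    for lhs, rhs in fds:
        if set(rhs) <= set(lhs):
            continue  # trivial FD
        if closure(lhs, fds) == set(relation):
            continue  # left side is a superkey
        bcnf.append((lhs, rhs))
        if not (set(rhs) - set(lhs)) <= prime:
            third.append((lhs, rhs))
    return bcnf, third


fds = [("AB", "C"), ("D", "B"), ("AC", "D")]
print(normal_form_violations("ABCD", fds))  # ([('D', 'B')], [])
```

An empty second list with a non-empty first list corresponds to "in 3NF but not in BCNF".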
2-way merge-sort
Example with N=7 pages (runs are separated by |):
Input file: 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs): 2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs): 2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs): 1,2 2,3 3,4 4,5 6,6 7,8 9
2-way merge-sort
• What is the cost of this algorithm?
• Suppose the input occupies N = 2^k disk pages
• In each pass, we read each page, process it, and write it out: 2 disk I/Os per page, per pass
• There are k + 1 = log2 N + 1 passes
• The overall cost is 2N (log2 N + 1) I/Os
[Figure: main memory buffers INPUT 1, INPUT 2, and OUTPUT, between the input disk and the output disk]
2-way external merge sort
A file with 10,000 records, each record is 1KB. Size of a page/block is 64KB (i.e., 64 records / block).
What is the number of passes, and what is the cost of 2-way external merge-sort?
In this dataset, there are ceil(10,000 / 64) = 157 pages that must be sorted. In two-way external merge-sort, we use 1 memory block in pass 0 (each 64-record block is sorted), and 3 memory blocks in subsequent passes (pairs of adjacent sorted runs are merged).
To sort 157 pages, we will need 1 + ceil(log2 157) = 9 passes.
Each page is read and written once on each pass (2 I/Os per page per pass). Thus, the total cost of two-way external merge-sort on this dataset is 2 * 157 * 9 = 2,826 I/Os.
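The arithmetic above can be reproduced with a few lines of Python. The function name `two_way_sort_cost` is a hypothetical helper for this example; it applies the formula 1 + ceil(log2 N) passes at 2 I/Os per page per pass.

```python
import math


def two_way_sort_cost(num_records, records_per_page):
    """Return (pages, passes, total I/Os) for 2-way external merge-sort."""
    pages = math.ceil(num_records / records_per_page)
    passes = 1 + math.ceil(math.log2(pages))  # pass 0 plus the merge passes
    return pages, passes, 2 * pages * passes


print(two_way_sort_cost(10_000, 64))  # (157, 9, 2826)
```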
Generalization: external merge-sort
[Figure: the input is split into memory-sized (M) chunks, each chunk is sorted into a run, and the runs are merged into the final sorted result]
N records, divided into NR / M sorted runs of M / R records each
B: block size; M: main memory size; N: input size (blocks); R: size of 1 record
External merge-sort example
A file with 10,000 records, each record is 1KB. Size of a page/block is 64KB (i.e., 64 records / block).
With memory size of 320KB, how many passes for generalized external merge sort? What is the cost?
Memory fits 320/64 = 5 pages. All are used for sorting in pass 0. All but 1 are used for merging in subsequent passes; the remaining page is used for output.
In pass 0 of generalized external merge-sort, we read in and sort 320KB (5 blocks worth) at a time, creating ceil(157/5) = 32 sorted runs of up to 5 blocks each.
Then in subsequent passes we merge 5-1 = 4 neighboring runs at a time. We need ceil(log4 32) = 3 merge passes to complete sorting. That’s a total of 4 passes (pass 0 plus 3 merge passes), with 2 I/Os per page per pass, for a total of 2 * 157 * 4 = 1256 I/Os.
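The pass count can also be obtained by simulating how the number of runs shrinks from pass to pass, which avoids floating-point logarithms. This is a sketch; `external_sort_cost` is a hypothetical helper, and it assumes at least 3 buffer pages so that each merge pass combines 2 or more runs.

```python
def external_sort_cost(pages, buffers):
    """Return (passes, total I/Os) for external merge-sort with `buffers` pages of memory."""
    if buffers < 3:
        raise ValueError("need at least 3 buffer pages to merge")
    runs = -(-pages // buffers)  # pass 0: ceil(pages / buffers) sorted runs
    passes = 1
    while runs > 1:
        runs = -(-runs // (buffers - 1))  # merge (buffers - 1) runs at a time
        passes += 1
    return passes, 2 * pages * passes


print(external_sort_cost(157, 5))  # (4, 1256)
```

With 157 pages and 5 buffers, the run counts are 32, 8, 2, 1: pass 0 followed by three merge passes.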
Basic file organization
• Heap files: good for full file scans or frequent updates
• unordered files
• insert at the end of file
• assumes equality selection on key, exactly one match (why?)
• Sorted files: good for range queries on sort field(s)
• need external sort to keep sorted
• compacted after deletion
• assumes selection on sort field(s)
• Hashed files: good for selection on equality
• collection of buckets with primary & overflow pages
• hashing function h(r) = bucket for record r
• each bucket is a heap file
Cost of operations
                  Heap File     Sorted File                             Hashed File*
Scan all recs     p(T) D        p(T) D                                  1.25 p(T) D
Equality Search   p(T) D / 2    D log2 p(T)                             D
Range Search      p(T) D        D log2 p(T) + (# pages with matches)    1.25 p(T) D
Insert            2D            Search + p(T) D                         2D
Delete            Search + D    Search + p(T) D                         2D
* assuming no overflow bucket, 80% page occupancy
p(T) - number of data pages in table T
r(T) - number of records in table T
D - time to read or write a disk page
Access paths
• An access path is a method of retrieving tuples: file scan, or index that matches a selection in the query
• An index matches a conjunction of terms if it can be used to retrieve all data values that match this conjunction of terms.
• A tree index matches a conjunction of terms that involve only attributes in a prefix of the search key.
• e.g., tree index <a,b,c> matches the selection a=5 AND b=3; it also matches a=5 AND b>4; it does not match b=3.
• A hash index matches a conjunction of terms that has a term attribute=value for every attribute in the search key of the index.
• e.g., hash index on <a,b,c> matches a=5 AND b=3 AND c=5; it does not match b=3; or a=5 AND b=5; or a>5 AND b=3 AND c=5
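The two matching rules can be written as small predicates. This sketch assumes that a selection is given as a list of `(attribute, operator, value)` terms joined by AND; the function names are hypothetical, and the rules are exactly the simplified ones stated above.

```python
def tree_index_matches(search_key, terms):
    """Tree index: the attributes used must form a prefix of the search key."""
    attrs = {attr for attr, _op, _value in terms}
    return bool(attrs) and attrs == set(search_key[:len(attrs)])


def hash_index_matches(search_key, terms):
    """Hash index: every search-key attribute needs an equality term."""
    equality_attrs = {attr for attr, op, _value in terms if op == "="}
    return set(search_key) <= equality_attrs


key = ("a", "b", "c")
print(tree_index_matches(key, [("a", "=", 5), ("b", "=", 3)]))                # True
print(tree_index_matches(key, [("a", "=", 5), ("b", ">", 4)]))                # True
print(tree_index_matches(key, [("b", "=", 3)]))                               # False
print(hash_index_matches(key, [("a", "=", 5), ("b", "=", 3), ("c", "=", 5)])) # True
print(hash_index_matches(key, [("a", ">", 5), ("b", "=", 3), ("c", "=", 5)])) # False
```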
Clustered vs. unclustered index
[Figure: CLUSTERED vs. UNCLUSTERED index; in both, data entries in the index file point to data records in the data file]
Using an index for selection
• Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large)
• Example: assuming uniform distribution of names, about 10% of tuples qualify (100 pages, 10,000 tuples).
• with a clustered index, cost is little more than 100 I/Os
• with an unclustered index, cost is up to 10,000 I/Os!
SELECT * FROM Reserves R WHERE R.rname < 'C%'
Sailors (sid: int, sname: string, rating: int, age: real)
Reserves (sid: int, bid: int, day: date, rname: string)
Reserves (R): each tuple is 40 bytes long, 100 tuples per page, 1000 pages
Sailors (S): each tuple is 50 bytes long, 80 tuples per page, 500 pages
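The 100 vs. 10,000 estimate follows directly from the size of Reserves. The sketch below is a rough model that ignores the cost of traversing the index itself; `selection_cost` is a hypothetical helper, and the unclustered figure is the worst case of one page read per qualifying tuple.

```python
import math


def selection_cost(num_pages, tuples_per_page, selectivity_percent):
    """Return (clustered, unclustered) I/O estimates for fetching the matching records."""
    matching = num_pages * tuples_per_page * selectivity_percent // 100
    clustered = math.ceil(matching / tuples_per_page)  # matches sit on adjacent pages
    unclustered = matching                             # worst case: one I/O per match
    return clustered, unclustered


# Reserves: 1000 pages, 100 tuples per page, about 10% of tuples qualify
print(selection_cost(1000, 100, 10))  # (100, 10000)
```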
Access paths: example
Employees (eid, name, salary, age, did);
Departments (did, budget, floor, manager_eid);
Salaries from $10K to $100K; ages from 20 to 80; 5 employees per department; 10 floors; budgets from $10K to $1M. Uniform, uncorrelated values.
Q1. Print name, age, salary for all employees
A1: clustered hash index on (name, age, salary) of Employees
A2: unclustered hash index on (name, age, salary) of Employees
A3: clustered B+-tree index on (name, age, salary) of Employees
A4: unclustered hash index on (eid, did) of Employees
A5: no index
Access paths: example
Employees (eid, name, salary, age, did);
Departments (did, budget, floor, manager_eid);
Salaries from $10K to $100K; ages from 20 to 80; 5 employees per department; 10 floors; budgets from $10K to $1M. Uniform, uncorrelated values.
Q2. Find dids of departments on the 10th floor with budget < $15K
A1: clustered hash index on (floor) of Departments
A2: clustered hash index on (floor, budget) of Departments
A3: clustered B+-tree index on (floor, budget) of Departments
A4: clustered B+-tree index on (budget) of Departments
A5: no index
Access paths: example
Sailors (sid, name, rating, age);
Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.
Q1. Print name, age, rating of all sailors
A1: sequential scan of sorted file, sorted on (id)
A2: clustered hash index on (rating)
A3: unclustered hash index on (id)
A4: unclustered hash index on (age, rating)
A5: unclustered hash index on (name, age)
A6: clustered B+-tree index on (name, age)
A7: unclustered B+-tree index on (age, rating)
Access paths: example
Sailors (sid, name, rating, age);
Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.
Q2. Print name, age, rating of the sailor with sid 123
A1: sequential scan of sorted file, sorted on (id)
A2: clustered hash index on (rating)
A3: unclustered hash index on (id)
A4: unclustered hash index on (age, rating)
A5: unclustered hash index on (name, age)
A6: clustered B+-tree index on (name, age)
A7: unclustered B+-tree index on (age, rating)
Access paths: example
Sailors (sid, name, rating, age);
Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.
Q3. Count sailors with rating = 5 and age = 40
A1: sequential scan of sorted file, sorted on (id)
A2: clustered hash index on (rating)
A3: unclustered hash index on (id)
A4: unclustered hash index on (age, rating)
A5: unclustered hash index on (name, age)
A6: clustered B+-tree index on (name, age)
A7: unclustered B+-tree index on (age, rating)
Access paths: another example
Employees (ssn, name, salary, age, did);
100,000 employees, 10 employee records per disk page. Stored on disk in a sorted file (alternative 1), with did as the sort key.
Salaries from 0 to $100K; ages from 20 to 80; 50 employees per department. Uniform, uncorrelated values.
For each query: (1) List indexes that would match the query. (2) What index would you build? (3) What is the cost of using that index to answer this query?
Q1. Compute the number of employees whose salary is $35K and who work in department 177.
Q2. List name, age, salary of employee with eid=12357.
Q3. Compute the number of employees who are between 30 and 35 years old.
Relational algebra and SQL
Flights (flno: int, origin: string, destination: string, dist: int, departs: date, arrives: date)
Aircraft (aid: int, aname: string, range: int)
Employees (eid: int, ename: string, salary: int)
Certified (eid: int, aid: int)
(a) List eids of pilots certified to fly Boeing.
(b) List names of pilots certified to fly Boeing.
Answer (a): π_{eid}((σ_{aname='Boeing'} Aircraft) ⋈_{aid} Certified)
Answer (b): π_{ename}(Employees ⋈_{eid} ((σ_{aname='Boeing'} Aircraft) ⋈_{aid} Certified))
(c) List names of aircraft that can be used on non-stop flights from Bonn to Madras.
Answer (c): π_{aname}((σ_{origin='Bonn' ∧ destination='Madras'} Flights) ⋈_{range≥dist} Aircraft)
(d) Find names of pilots who can operate planes with a range greater than 3,000 miles but are not certified on any Boeing aircraft.
Answer (d): π_{ename}(Employees ⋈_{eid} (π_{eid}((σ_{range>3000} Aircraft) ⋈_{aid} Certified) − π_{eid}((σ_{aname='Boeing'} Aircraft) ⋈_{aid} Certified)))
This is not the same as: π_{ename}(Employees ⋈_{eid} (π_{eid}((σ_{range>3000 ∧ aname≠'Boeing'} Aircraft) ⋈_{aid} Certified)))
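A tiny instance makes the difference between the two expressions for (d) concrete. The sketch below runs SQL counterparts of both expressions on SQLite through Python's `sqlite3` module. The rows are made up for illustration, and the column `range` is quoted only as a precaution because RANGE is an SQL keyword. Ann is certified on both a Boeing and an Airbus, so the set-difference query excludes her while the single-selection query does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Aircraft (aid int primary key, aname text, "range" int);
create table Employees (eid int primary key, ename text, salary int);
create table Certified (eid int, aid int);
insert into Aircraft values (1, 'Boeing', 5000), (2, 'Airbus', 4000);
insert into Employees values (10, 'Ann', 100), (20, 'Bob', 90);
insert into Certified values (10, 1), (10, 2), (20, 2);
""")

# Set difference: pilots with a long-range certification minus Boeing pilots.
correct = sorted(row[0] for row in conn.execute("""
select E.ename
from Employees E, (
  select C.eid from Certified C, Aircraft A
  where C.aid = A.aid and A."range" > 3000
  except
  select C.eid from Certified C, Aircraft A
  where C.aid = A.aid and A.aname = 'Boeing'
) P
where E.eid = P.eid
"""))

# Single selection: only requires SOME long-range aircraft that is not a Boeing.
wrong = sorted(row[0] for row in conn.execute("""
select distinct E.ename
from Employees E, Certified C, Aircraft A
where E.eid = C.eid and C.aid = A.aid
  and A."range" > 3000 and A.aname <> 'Boeing'
"""))

print(correct)  # ['Bob']
print(wrong)    # ['Ann', 'Bob']
```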
SQL
(e) List eids of pilots certified to fly exactly 3 aircraft.
select eid
from Certified
group by eid
having count(*) = 3
(f) List aids of aircraft that can be used on flight AF007, along with an average salary of pilots who are certified to operate these aircraft.
select avg(E.salary), A.aid
from Flights F, Aircraft A, Certified C, Employees E
where F.flno = 'AF007'
and F.dist <= A.range
and A.aid = C.aid
and C.eid = E.eid
group by A.aid
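Queries (e) and (f) can be tried out on a small made-up instance using Python's `sqlite3` module. All rows below are hypothetical, `flno` is stored as text so that 'AF007' is a valid value, and `range` is quoted only as a precaution because RANGE is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Flights (flno text, origin text, destination text,
                      dist int, departs text, arrives text);
create table Aircraft (aid int primary key, aname text, "range" int);
create table Employees (eid int primary key, ename text, salary int);
create table Certified (eid int, aid int);
insert into Flights values ('AF007', 'Paris', 'Tokyo', 3500, '2024-01-01', '2024-01-02');
insert into Aircraft values (1, 'Boeing', 5000), (2, 'Airbus', 4000), (3, 'Cessna', 1000);
insert into Employees values (10, 'Ann', 100), (20, 'Bob', 90), (30, 'Cy', 50);
insert into Certified values (10, 1), (10, 2), (10, 3), (20, 2), (30, 3);
""")

# (e) pilots certified to fly exactly 3 aircraft
exactly_three = conn.execute("""
select eid from Certified group by eid having count(*) = 3
""").fetchall()

# (f) aircraft usable on AF007 with the average salary of their certified pilots
avg_salary = conn.execute("""
select avg(E.salary), A.aid
from Flights F, Aircraft A, Certified C, Employees E
where F.flno = 'AF007' and F.dist <= A."range"
  and A.aid = C.aid and C.eid = E.eid
group by A.aid
""").fetchall()

print(exactly_three)                               # [(10,)]
print(sorted(avg_salary, key=lambda row: row[1]))  # [(100.0, 1), (95.0, 2)]
```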
When writing queries
• For relational algebra, do worry about efficiency: avoid Cartesian product whenever possible, push selections
• For SQL, do worry about efficiency and readability:
• avoid nested queries if your query can be expressed with a join
• use group by / having as appropriate, not a subquery and a where clause in the outer
• use standard notation, like we covered in class, e.g., no need to write “inner join”, and do write your queries by hand
• For both SQL and relational algebra: do not join with relations unnecessarily. You should have exactly the right number of tables in the from clause of a SQL query, no more no less
ER modeling
Draw an ER diagram that encodes the following business rules. Clearly mark all key and participation constraints.
Chefs work at restaurants. A chef is uniquely identified by an SSN, and is also described by a name and a cuisine in which she specializes. A restaurant is uniquely identified by a combination of name and city. Each chef works in at least one restaurant, and each restaurant must have at least one chef working at it. Some chefs own restaurants, and if a chef owns a restaurant, she is its sole owner.
[Figure: ER diagram with entity sets CHEFS (ssn, name, cuisine) and RESTAURANTS (name, city), connected by relationship sets work_at and own]
ER to relational
create table Teams (
  name varchar(32) primary key
);
create table Athletes (
  name varchar(32),
  dob date,
  team_name varchar(32),
  primary key (name, dob),
  foreign key (team_name) references Teams (name)
);
create table Sports (
  name varchar(32) primary key,
  olympic char(3)
);
-- Athletes has the composite key (name, dob), so the foreign key must reference both columns
create table Athletes_play_Sports (
  athlete_name varchar(32),
  athlete_dob date,
  sport_name varchar(32),
  primary key (athlete_name, athlete_dob, sport_name),
  foreign key (athlete_name, athlete_dob) references Athletes (name, dob),
  foreign key (sport_name) references Sports (name)
);
[Figure: ER diagram with entity sets PRESIDENTS (name) and VICE_PRESIDENTS (name), connected by relationship set running_mate with attribute party]
create table Presidents_VPs (
  president_name varchar(32) primary key,
  vp_name varchar(32) unique not null,
  party varchar(32)
);
Binary vs. ternary relationship sets
[Figure: two ER diagrams. Ternary: PRESIDENT (name), VICE_PRESIDENT (name), and Party (name) connected by running_mate. Binary: PRESIDENT (name) and VICE_PRESIDENT (name) connected by running_mate with attribute party]
And now with constraints
[Figure: ER diagram with PRESIDENT (name), VICE_PRESIDENT (name), and Party (name) connected by the ternary relationship set running_mate]
create table Parties (
  name varchar(32) primary key
);
create table Presidents_VPs (
  president_name varchar(32) primary key,
  vp_name varchar(32) unique not null,
  party varchar(32) not null,
  foreign key (party) references Parties (name)
);
Candidate keys, superkeys
Consider a relation schema and business rules below.
Dancers (name: string, dob: date, stage_name: string, company: string)
• No two dancers have the same combination of name and date of birth (dob).
• No two dancers have the same combination of stage name and company.
• A name, a dob and a stage name have to be specified for each dancer, but not all dancers belong to a company.
What are the candidate keys? Which of these would be appropriate for a primary key? Which are not appropriate for a primary key?
What are the superkeys?
Write a valid create table statement.
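One possible reading of these rules, offered as a sketch rather than the expected exam answer: (name, dob) and (stage_name, company) are the two unique combinations, and since company may be missing, only (name, dob) is usable as a primary key. The code below tries such a statement on SQLite through Python's `sqlite3` module. The column sizes, sample rows, and the helper `rejected` are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
create table Dancers (
  name       varchar(64) not null,
  dob        date        not null,
  stage_name varchar(64) not null,
  company    varchar(64),
  primary key (name, dob),
  unique (stage_name, company)
)""")
conn.execute("insert into Dancers values ('Ann Lee', '1990-01-01', 'Swan', 'City Ballet')")


def rejected(row):
    """True if inserting row violates a constraint of Dancers."""
    try:
        conn.execute("insert into Dancers values (?, ?, ?, ?)", row)
        return False
    except sqlite3.IntegrityError:
        return True


# Same (name, dob) as an existing dancer: violates the primary key.
dup_name_dob = rejected(("Ann Lee", "1990-01-01", "Lark", None))
# Same (stage_name, company) as an existing dancer: violates the unique constraint.
dup_stage_company = rejected(("Bea Roy", "1991-02-02", "Swan", "City Ballet"))
# A dancer without a company is accepted.
no_company_ok = not rejected(("Cy Poe", "1992-03-03", "Swan", None))

print(dup_name_dob, dup_stage_company, no_company_ok)  # True True True
```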