Introduction to Database Theory · Introduction Relational Model Recursive Queries Query Evaluation...
Transcript of Introduction to Database Theory · Introduction Relational Model Recursive Queries Query Evaluation...
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Introduction to Database Theory
Pierre Senellart
ÉCOLE NORMALE
S U P É R I E U R E
RESEARCH UNIVERSITY PARIS
École de Printemps en Informatique Théorique, 2019
2/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Data management
Numerous applications (standalone software, Web sites, etc.)need to manage data:� Structure data useful to the application� Store them in a persistent manner (data retained even
when the application is not running)� Efficiently query information within large data volumes� Update data without violating some structural constraints� Enable data access and updates by multiple users, possibly
concurrently
Often, desirable to access the same data from several distinctapplications, from distinct computers.
3/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Role of a DBMS
Database Management SystemSoftware that simplifies the design of applications that handledata, by providing a unified access to the functionalitiesrequired for data management, whatever the application.
DatabaseCollection of data (specific to a given application) managed bya DBMS
4/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Features of DBMSs (1/2)Physical independence. The user of a DBMS does not need to
know how data are stored (in a file, on a rawpartition, in a distributed filesystem, etc.); storagecan be modified without impacting data access
Logical independence. It is possible to provide the user with apartial view of the data, corresponding to what heneeds and is allowed to access
Ease of data access. Use of a declarative language describingqueries and updates on the data, specifying theintent of a user rather than the way this will beimplemented
Query optimization. Queries are automatically optimized to beimplemented as efficiently as possible on thedatabase
5/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Features of DBMSs (2/2)Logical integrity. The DBMS imposes constraints on data
structure; every modification violating theseconstraints is denied
Physical integrity. The database remains in a coherent state,and data are durably preserved, even in case ofsoftware or hardware failure
Data sharing. Data are accessible by multiple users,concurrently, and these multiple and concurrentaccesses cannot violate logical or physical dataintegrity
Standardization. The use of a DBMS is standardized, so that itmay be possible to replace a DBMS with anotherwithout changing (in a major way) the code of theapplication
6/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Major types of DBMSsRelational (RDBMS). Tables, complex queries (SQL), rich
featuresXML. Trees, complex queries (XQuery),
features similar to RDBMSGraph/Triples. Graph data, complex queries expressing
graph navigationObjects. Complex data model, inspired by OOP
Documents. Complex data, organized in documents,relatively simple queries and features
Key–Value. Very basic data model, focus onperformance
Column Stores. Data model in between key–value andRDBMS; focus on iteration andaggregation on columns
NoS
QL
6/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Major types of DBMSsRelational (RDBMS). Tables, complex queries (SQL), rich
featuresXML. Trees, complex queries (XQuery),
features similar to RDBMSGraph/Triples. Graph data, complex queries expressing
graph navigationObjects. Complex data model, inspired by OOP
Documents. Complex data, organized in documents,relatively simple queries and features
Key–Value. Very basic data model, focus onperformance
Column Stores. Data model in between key–value andRDBMS; focus on iteration andaggregation on columns
NoS
QL
7/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Classical relational DBMSs
� Based on the relational model: decomposition of data intorelations (i.e., tables)� A standard query language: SQL� Data stored on disk� Relations (tables) stored line after line� Centralized system, with limited distribution possibilities
8/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Database theory
� Well-established research area within computer science,concerned with foundational et theoretical aspects of datamanagement
� Motivations:� provide through theoretical tools advances in the practice of
data management� provide through the scope of data management advances in
theoretical computer science
� The success of current-day DBMSs relies in particular on atheoretical result: Codd’s theorem� This lecture: High-level introduction to database theory,focusing on the relational model
9/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational ModelModelRelational AlgebraRelational calculus
Recursive Queries
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
10/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Relational schemaWe fix countably infinite sets:� L of labels� V of values� T of types, s.t., 8� 2 T ; � � V
DefinitionA relation schema (of arity n) is an n-tuple (A1; : : : ; An) whereeach Ai (called an attribute) is a pair (Li; �i) with Li 2 L,�i 2 T and such that all Li are distinct
DefinitionA database schema is defined by a finite set of labels L � L(relation names), each label of L being mapped to a relationschema.
11/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Example database schema� Universe:
� L the set of alphanumeric character strings starting with aletter
� V the set of finite sequences of bits� T is formed of types such as INTEGER (representation as a
sequence of bits of integers between �231 and 231 � 1), REAL(representation of floating-point numbers following IEEE754), TEXT (UTF-8 representation of character strings),DATE (ISO8601 representation of dates), etc.
� Database schema formed of 2 relation names, Guest andReservation� Guest: ((id; INTEGER); (name; TEXT); (email; TEXT))� Reservation:((id; INTEGER); (guest; INTEGER); (room; INTEGER),(arrival; DATE); (nights; INTEGER))
12/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Database
DefinitionAn instance of a relation schema ((L1; �1); : : : ; (Ln; �n)) (alsocalled a relation on this schema) is a finite set ft1; : : : ; tkg oftuples of the form tj = (vj1; : : : ; vjn) with 8j8i vji 2 �i.
DefinitionAn instance of a database schema (or, simply, a database onthis schema) is a function that maps each relation name to aninstance of the corresponding relation schema.
Note: Relation is used somewhat ambiguously to talk about arelation schema or an instance of a relation schema.
13/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
ExampleGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
14/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Some notation
� If A = (L; � ) is the ith attribute of a relation R, and t ann-tuple of an instance of R, we note t[A] (or t[L]) the valueof the ith component of t.� Similarly, if A is a k-tuple of attributes among the nattributes of R, t[A] is the k-tuple formed from t byconcatenating the t[A] for A in A.� A tuple is an n-tuple for some n.
15/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Variants: named and unnamed perspectivesThe version presented considers the attributes of a relation areordered and have a name. This is what best matches the wayRDBMSs work, but not necessarily the most pleasant to reasonon the relational model.
Named perspective. We forget the position of attributes, andconsider they are uniquely identified by theirnames.
Unnamed perspective. We forget the name of attributes, andconsider they are uniquely identified by theirposition. One uses notation such as t[2] to accessthe value of the second attribute of a tuple.
No major impact, one will use one or the other depending onwhat is convenient.
16/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Variant: bag semantics
� A relation instance is defined as a (finite) set of tuples.One can also consider a bag semantics of the relationalmodel, where a relation instance is a multiset of tuples.� This is what best matches how RDBMSs work. . .� . . . but most of relational database theory has beenestablished for the set semantics, more convenient to workwith� We will mostly discuss the set semantics in this lecture
17/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Variant: untyped version
� In implementations, attributes are always typed� In models and theoretical results, one often abstractsattribute types away and considers each attribute has auniversal type V� We will most often omit attribute types
18/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational ModelModelRelational AlgebraRelational calculus
Recursive Queries
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
19/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
The relational algebra� Algebraic language to express queries� A relational algebra expression produces a new relationfrom the database relations� Each operator takes 0, 1, or 2 subexpressions� Main operators:
Op. Arity Description Condition
R 0 Relation name R 2 L
�A!B 1 Renaming A;B 2 L
�A1:::An 1 Projection A1 : : : An 2 L
�� 1 Selection � formula� 2 Cross product[ 2 Unionn 2 Difference./� 2 Join � formula
20/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Relation nameGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: GuestResult:
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
21/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
RenamingGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: �id!guest(Guest)Result:
guest name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
22/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
ProjectionGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: �email;id(Guest)Result:
email id
23/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
SelectionGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: �arrival>2017-01-12^guest=2(Reservation)Result:
id guest room arrival nights
4 2 504 2017-01-15 25 2 107 2017-01-30 1
The formula used in the selection can be any Booleancombination of comparisons of attributes to attributes orconstants.
24/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Cross productGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: �id(Guest)� �name(Guest)Result:
id name
1 Alice Black2 Alice Black3 Alice Black1 John Smith2 John Smith3 John Smith
25/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
UnionGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: �room(�guest=2(Reservation)) [�room(�arrival=2017-01-15(Reservation))
Result:room
107302504
This simple union could have been written�room(�guest=2_arrival=2017-01-15(Reservation)). Not always possible.
25/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
UnionGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: �room(�guest=2(Reservation)) [�room(�arrival=2017-01-15(Reservation))
Result:room
107302504
This simple union could have been written�room(�guest=2_arrival=2017-01-15(Reservation)). Not always possible.
26/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
DifferenceGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: �room(�guest=2(Reservation)) n�room(�arrival=2017-01-15(Reservation))
Result:room
107
This simple difference could have been written�room(�guest=2^arrival6=2017-01-15(Reservation)). Not alwayspossible.
26/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
DifferenceGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: �room(�guest=2(Reservation)) n�room(�arrival=2017-01-15(Reservation))
Result:room
107
This simple difference could have been written�room(�guest=2^arrival6=2017-01-15(Reservation)). Not alwayspossible.
27/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
JoinGuest
id name email
1 John Smith [email protected] Alice Black [email protected] John Smith [email protected]
Reservation
id guest room arrival nights
1 1 504 2017-01-01 52 2 107 2017-01-10 33 3 302 2017-01-15 64 2 504 2017-01-15 25 2 107 2017-01-30 1
Expression: Reservation ./guest=id GuestResult:
id guest room arrival nights name email
1 1 504 2017-01-01 5 John Smith [email protected] 2 107 2017-01-10 3 Alice Black [email protected] 3 302 2017-01-15 6 John Smith [email protected] 2 504 2017-01-15 2 Alice Black [email protected] 2 107 2017-01-30 1 Alice Black [email protected]
The formula used in the join can be any Boolean combinationof comparisons of attributes of the table on the left toattributes of the table on the right.
28/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Note on the join
� The join is not an elementary operator of the relationalalgebra (but it is very useful)� It can be seen as a combination of renaming, cross product,selection, projection� Thus:
Reservation ./guest=id Guest
��id;guest;room;arrival;nights;name;email(
�guest=temp(Reservation� �id!temp(Guest)))
� If R and S have for attributes A and B, we note R ./ S thenatural join of R and S, where the join formula isVA2A\B A = A.
29/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Bag semantics
In bag semantics (what is actually used by RDBMS):
� All operations return multisets� In particular, projection and union can introduce multisets
even when initial relations are sets
30/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Extension: Aggregation� Various extensions have been proposed to the relationalalgebra to add additional features� In particular, aggregation and grouping [Klug, 1982,Libkin, 2003] of results� With a syntax inspired from [Libkin, 2003]:
�avg>3( avgroom[�x:avg(x)](�room;nights(Reservation)))
computes the average number of nights per reservation foreach room having an average greater than 3
room avg
302 6504 3.5
31/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational ModelModelRelational AlgebraRelational calculus
Recursive Queries
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
32/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Relational calculus
� Logical language to express queries� First-order logic formula, without function symbols, andwith relation symbols the labels of the database schema(plus comparison predicates)� Unnamed, untyped perspective� Fix:
� A set X of variables� A set V of values� A database schema S
33/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Relational calculus: Syntax
� For every relation R 2 S of arity n, for every(�1; : : : ; �n) 2 (X [ V)n: R(�1; : : : ; �n) 2 FO
� Also allow equality predicate, possibly inequality� For every (�1; �2) 2 FO2, for every x 2 X :
� �1 ^ �2 2 FO� �1 _ �2 2 FO� :�1 2 FO� 8x �1 2 FO� 9x �1 2 FO
� Free variables of � 2 FO: variables x appearing in � andnot qualified by a 8x or a 9x� One writes a relational calculus query in the formQ(x1; : : : ; xm) = � where x1; : : : ; xm are free variables of �
34/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Relational calculus: Semantics� A relational calculus query on schema S can be seen as a
function with input a database D over S and producing arelation as output� adom(D): active domain of D, set of values in D
� If Q(x1; : : : ; xn) = � is a calculus query over S and D adatabase over S, then:
Q(D) = f (v1; : : : ; vn) 2 (adom(D))n j D j= �[x1=v1; : : : ; xn=vn] g
where D j= � is defined inductively:� D j= R(u1; : : : ; um) () R(u1; : : : ; um) 2 D� D j= �1 ^ �2 () D j= �1 ^D j= �2� D j= �1 _ �2 () D j= �1 _D j= �2� D j= :�1 () D 6j= �1� D j= 8x �1 () 8v 2 adom(D)D j= �1[x=v]� D j= 9x �1 () 9v 2 adom(D)D j= �1[x=v]
35/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Codd’s theorem
Theorem ([Codd, 1972])The relational algebra and the relational calculus areequivalent:� for every relational algebra query q over a schema S,there exists a relational calculus query Q over S suchthat for every database D over S, q(D) = Q(D)
� for every relational calculus query Q over a schema S,there exists a relational algebra query q over S suchthat for every database D over S, q(D) = Q(D)
Furthermore, translating from one formalism to the othercan be done in polynomial time.
36/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Why is this important?� Allows using a declarative formalism to specify queries:logics. . . or SQL� These queries are then compiled via Codd’s transformationinto an algebraic formalism� Algebraic queries are then optimized, by using theproperties of the relational algebra (transformation rules,e.g., pushing selection within joins, exploiting associativityof joins, etc.)� Optimized queries can then be evaluated, by exploiting thefact that each operator of the relational algebra can easilybe implemented (in several different ways, to be chosenbased on a cost function)� This is RDBMS Implementation 101, a main reason of thesuccess of RDBMSs!
37/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Subclasses of queries
� Conjunctive query (CQ): relational calculus query without_, :, 8� Positive query (PQ): relational calculus query without :, 8� Union of conjunctive queries (UCQ): special case ofpositive query where the _ and ^ form a DNF formula
Expressiveness
� CQs are equivalent to the relational algebra without [and n, and where � does not feature disjunction� UCQs are equivalent to PQs (but exponential blow-up),and equivalent to the relational algebra without n
37/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Subclasses of queries
� Conjunctive query (CQ): relational calculus query without_, :, 8� Positive query (PQ): relational calculus query without :, 8� Union of conjunctive queries (UCQ): special case ofpositive query where the _ and ^ form a DNF formula
Expressiveness
� CQs are equivalent to the relational algebra without [and n, and where � does not feature disjunction� UCQs are equivalent to PQs (but exponential blow-up),and equivalent to the relational algebra without n
38/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive QueriesDatalogAlgebraFixpoint logics
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
39/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Motivation: recursive queries
� Query languages considered so far (relational algebra,calculus) have a limited horizon
� Some data structures (trees, graphs) require arbitrarilydeep navigation, recursion
� How can we build a theory of recursive query languages?� RDBMSs are not always adapted to this type of
data/queries, cf. XML or graph DBMSs� Example application: transitive closure of a graphG(from; to)
40/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive QueriesDatalogAlgebraFixpoint logics
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
41/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Datalog
� Simplest recursive query language: adding recursion toconjunctive queries� Inspired from logic programming� Datalog query (or program): set of rules that produceintensional facts� Schema of a Datalog program: classical database schema(extensional schema) + (disjoint) schema of intensionalfacts (intensional schema)� Fix a distinguished relation Goal of the intensionalschema, whose arity is the arity of the query
42/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Syntax
Finite set of rules r of the form:
S(y)| {z }head
R1(x1); : : : ; Rn(xn)| {z }body
with:� S relation of the intensional schema� R1; : : : ; Rn relations of the intensional or extensional
schemas� x1; : : : ;xn;y: tuples of variables (or possibly constants), of
arity compatible with the relations� Each variable in the head is present in the body
43/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Fix-point semantics� Each rule r of a program P can be seen as a conjunctive
query on the database D:
r(D) ��= fS(y) j 9z1 : : : zk R1(x1) 2 D ^ � � � ^Rn(xn) 2 Dg
where the zi’s are the variables of the rule body� Consequence operator �P defined by:
�P (D) := D [[
r2P
fr(D)g
� We consider the sequence (Dn) defined by:D0 = D, Dn+1 = �P (Dn)
� The semantics of P over D is the set of facts of the relationGoal in D1, the fixpoint of the sequence (Dn)
44/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Example: transitive closure
Goal(x; y) G(x; y)
Goal(x; y) Goal(x; z); G(z; y)
45/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive QueriesDatalogAlgebraFixpoint logics
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
46/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
While operator� Algebra: essentially imperative programming (in contrast
to calculus)� One can add to the algebra:
� the possibility to define intermediary variables and toassign them values (does not affect expressive power)
� a while operator of the form:
while change do
. . . assign value to one or several variables. . .
done
� Semantics: the content of the loop is evaluated while theassignments change the underlying variables
� Infinite loops possible!
47/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Inflationary vs non-inflationary while
� Two variants:Non-inflationary Assignment operator ��=, arbitrary
assignmentInflationary Assignment operator +=, the assigned value
can only grow� Infinite loops impossible with an inflationary while
(because the active domain is finite)� Non-inflationary potentially more expressive
48/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Example: transitive closureWe introduce a relation schema C with attributesfrom and to, similarly to G.Non-inflationary
C ��= G
while change do
C ��= C [ �from;to(�to!int (C) on �from!int (G))
done
CInflationary
C += G
while change do
C += �from;to(�to!int (C) on �from!int (G))
done
C
49/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive QueriesDatalogAlgebraFixpoint logics
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
50/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Non-inflationary fixpoint
� We add to the relational calculus a fixpoint construction� Let �(T ) be a calculus formula, mentioning the schemarelations as well as a new relation T , having for freevariables x1; : : : ; xn� Then �T [�(T )](x1; : : : ; xn) is a formula of the fixpoint
calculus� Semantics: in terms of the least fixpoint (if it exists) of therelation T , obtained by replacing T at each step with theset of facts of the form T (x1; : : : ; xn) for �(T )(x1; : : : ; xn)satisfied (starting with T = ;)� Equivalent to algebra with non-inflationary while!
51/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Inflationary fixpoint
� We add to the relational calculus a fixpoint construction� Let �(T ) be a calculus formula, mentioning the schema
relations as well as a new relation T , having for freevariables x1; : : : ; xn� Then �+T [�(T )](x1; : : : ; xn) is a formula of the fixpointcalculus� Semantics: in terms of the least fixpoint of the relation T ,
obtained by adding to T at each step with the set of factsof the form T (x1; : : : ; xn) for �(T )(x1; : : : ; xn) satisfied(starting with T = ;)� Equivalent to algebra with inflationary while!
52/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Example: transitive closure
Non-inflationary
f (x; y) j
�C [G(x; y) _ C(x; y) _ (9z C(x; z) ^G(z; y))](x; y)g
Inflationary
f (x; y) j
�+C [G(x; y) _ (9z C(x; z) ^G(z; y))](x; y)g
53/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query EvaluationConjunctive QueriesFirst-order logicFixpoint logics
Static Analysis of Queries
Conclusion
54/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Query evaluation
� Query Q in some query language (CQ, FO, FO+�,FO+�+. . . ) – we will use a logical formalism here
� Database D (always finite!)� Query evaluation: Computing Q(D)
� Complexity of this problem?� To simplify the study of complexity, we often assume thatQ is a Boolean query, i.e., it returns > or ?
55/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Data complexity
For some fixed Q, what is the complexity of computing Q(D) interms of the size of the database D?
56/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Combined complexity
For some query language Q, what is the complexity ofcomputing Q(D) in terms of the size of the query Q 2 Q and ofthe database D?
57/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Complexity classes� We restrict to Boolean problems (returning > or ?)� Set of all problems solvable by a resource-constrainedcomputing method:� For example:
PTIME: deterministic Turing machine in polynomialtime
NP: non-deterministic Turing machine inpolynomial time
PSPACE: deterministic Turing machine in polynomialspace
AC0: Boolean circuit of polynomial size andconstant depth
� We know that: AC0 ( PTIME � NP � PSPACE� Open whether PSPACE � PTIME (!)
58/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Membership and hardness for a class
� A problem P belongs to a complexity class C (or in C) if itis solvable by the corresponding resource-constrainedcomputing method� A problem P is hard for a complexity class C (or C-hard) ifthere exists a reduction that transforms whatever problemP 0 2 C into an instance of the problem P
� complete: in C + C-hard� Several ways to define reductions� Here, we assume that there exists a function computable inpolynomial time that transforms one instance I 0 of problemP 0 into an instance I of P such that P (I) = P 0(I 0)
59/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Descriptive complexity
� A query language Q captures a complexity class C if:� For all Q 2 Q, query evaluation of query Q in in C (data
complexity)� For all problem P in C, there exists a query Q 2 Q such
that evaluating Q exactly solves P (without a reduction)!
� If Q captures C and if C has problems that are complete forC, then there exists Q 2 Q such that Q is C-complete, butthe converse is not true
60/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query EvaluationConjunctive QueriesFirst-order logicFixpoint logics
Static Analysis of Queries
Conclusion
61/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Data complexity
TheoremCQ evaluation is PTIME in data complexity.
Proof.By enumerating all valuations of variables of the query in thedatabase. We will see later a much stronger result.
62/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Combined complexity
TheoremCQ evaluation is NP-complete in combined complexity.
Proof.Membership in NP is easy. Hardness for NP can be proved byreduction from graph 3-colorability.
63/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
�-acyclic query
� A CQ can be seen as a hypergraph (vertices are variables,hyperedges the atoms of the CQ, labeled by the relationname)� A hypergraph H has a join tree where one can find a treewhose nodes are labeled by the hyperedges of H and suchthat:� every hyperedge of H appears as the label of one node of
the tree;� for every vertex x of H, the set of tree nodes labeled by a
hyperedge referring to x is a connected subtree
� A query is �-acyclic if its hypergraph has a join tree� Can be obtained in linear time if it exists [Tarjan andYannakakis, 1984]
64/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Yannakakis’s algorithm [Yannakakis, 1981]
� Algorithm to evaluate acyclic queries (non-necessarilyBoolean):1. Construct the join tree2. Eliminate all useless tuples of a relation with the semijoin
operator n: Rn S = �R:�(R on S) by navigating twice inthe join tree: from bottom up, then from top down
3. Evaluate the query bottom up, by computing joins followingthe tree and by projecting useless variables out as you go
� Polynomial complexity in the size of the query, the input,and the output (combined complexity)
65/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query EvaluationConjunctive QueriesFirst-order logicFixpoint logics
Static Analysis of Queries
Conclusion
66/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Data complexity
TheoremFO evaluation is PTIME in data complexity.
Proof.By rewriting in prenex form and naive evaluation.
67/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
FO does not capture the whole of PTIME
TheoremOne cannot compute in FO that a relation containing atotal order has an even number of elements, or that a graphis connected.
Fairly complex to prove, relies on Ehrenfeucht–Fraïssé games(see [Libkin, 2004]).
68/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Data complexity, more precise
TheoremFO evaluation is AC0 in data complexity.
Proof.By rewriting to the relational algebra.
69/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Combined complexity
TheoremFO evaluation is PSPACE-complete in combinedcomplexity.
Proof.Membership in PSPACE easy. Hardness for PSPACE from theQSAT problem.
70/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query EvaluationConjunctive QueriesFirst-order logicFixpoint logics
Static Analysis of Queries
Conclusion
71/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Data complexity
TheoremEvaluation of FO+�+ is PTIME in data complexity.Evaluation of FO+� is PSPACE in data complexity.
Proof.Direct. Only need polynomial space to detect infinite loops.
72/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Connectedness, parity
TheoremConnectedness is expressible in FO+�+.
The problem of determining whether an arbitrary relationhas an even number of elements (or tuples) cannot beexpressed in FO+�.
Proof.First part easy, recall transitive closure query.Second part shown in [Abiteboul et al., 1995], chapter 17.
73/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Parity and FO+�+, with order
TheoremFO+�+ can compute if a relation that contains a totalorder has an even number of elements.
Proof.Explicit construction.
74/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
More general results
TheoremFO+�+ captures PTIME on databases including a totalorder relation of the domain.
FO+� captures PSPACE on databases including a totalorder relation of the domain.
Proof.Simulation of polynomial-time or polynomial-spacedeterministic Turing machines with a fixpoint computation.
75/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Data complexity (bis)
TheoremEvaluating FO+�+ is PTIME-complete in data complexity.Evaluating FO+� is PSPACE-complete in data complexity.
Proof.Hardness comes from descriptive complexity on orderedstructures.
76/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query Evaluation
Static Analysis of QueriesContainment and EquivalenceConjunctive QueriesRelational Calculus
Conclusion
77/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query Evaluation
Static Analysis of QueriesContainment and EquivalenceConjunctive QueriesRelational Calculus
Conclusion
78/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Query optimization
� Goal: Given a query q in some query language Q and adatabase D, find a query equivalent to q on D and faster toevaluate on D
� Here: Q in the relational calculus (or a fragment thereof),and one looks for a query faster on whatever database (wedo not look D, we perform static analysis)� In actual RDBMSs: Q is the set of query execution plans(a specialization of the relational algebra whereimplementations are chosen for each operator) andstatistics on D are used
79/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Global optimization
� We consider global optimization techniques, considering aquery in its entirety (techniques on execution plans aremore local, e.g., local rewritings)� We formally define:Equivalence: q � q0 if for all database D, q(D) = q0(D)
Minimality: q0 is the “best” query equivalent to q in Q
80/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Containment and equivalence
DefinitionA query q is contained in a query q0 (denoted q v q0) if for alldatabase D, q(D) � q0(D)
Propositionq � q0 iff q v q0 and q0 v q.
Proof.Immediate.
80/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Containment and equivalence
DefinitionA query q is contained in a query q0 (denoted q v q0) if for alldatabase D, q(D) � q0(D)
Propositionq � q0 iff q v q0 and q0 v q.
Proof.Immediate.
81/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query Evaluation
Static Analysis of QueriesContainment and EquivalenceConjunctive QueriesRelational Calculus
Conclusion
82/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Case of the conjunctive queries
� We consider conjunctive queries (CQ) of the form:
q(x) 9y : R1(z1) ^ � � � ^Rn(zn)
where each zi is a tuple of variables among x and y, andwhere each xj appears at least in one zj
� Set semantics: for all database D, q(D) is a finite set oftuples
83/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Homomorphism
DefinitionA homomorphism from a CQ q to a CQ q0 is a function � fromthe variables x;y of q fo the variables x0;y0 of q0 such that:� �(x) = x0
� for every atom R(zi) of q, there exists an atom R(z0i0) of q0
such that �(zi) = z0i0
DefinitionA homomorphism is an isomorphism if it is one-to-one and itsconverse is a homomorphism.
84/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Instance associated to a query
DefinitionFor all conjunctive query
q(x) 9y : R1(z1) ^ � � � ^Rn(zn)
one can construct the instance associated to q, denoted Iq,where the active domain is f az j z 2 x [ y g and which isformed of the n tuples R(azi1;:::;zik) for R(zi1; : : : ; zik) atom of q
PropositionFor all CQs q(x), q0(x0), there exists a homomorphismfrom q to q0 iff (ax0
1; : : : ; ax0
j) 2 q(Iq0).
84/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Instance associated to a query
DefinitionFor all conjunctive query
q(x) 9y : R1(z1) ^ � � � ^Rn(zn)
one can construct the instance associated to q, denoted Iq,where the active domain is f az j z 2 x [ y g and which isformed of the n tuples R(azi1;:::;zik) for R(zi1; : : : ; zik) atom of q
PropositionFor all CQs q(x), q0(x0), there exists a homomorphismfrom q to q0 iff (ax0
1; : : : ; ax0
j) 2 q(Iq0).
85/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Homomorphism theorem
Theorem ([Chandra and Merlin, 1977])For all CQs q, q0, q v q0 iff there exists a homomorphismfrom q0 to q.
86/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Minimal query
DefinitionA conjunctive query is minimal if it has a minimal number ofatoms among all equivalent conjunctive queries.
� Translation of a CQ to an algebra query: if there are natoms, we obtain n� 1 joins� Joins are the most costly operations of the relational
algebra (bar cross products)� Finding a minimal query amounts to global optimization
86/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Minimal query
DefinitionA conjunctive query is minimal if it has a minimal number ofatoms among all equivalent conjunctive queries.
� Translation of a CQ to an algebra query: if there are natoms, we obtain n� 1 joins� Joins are the most costly operations of the relationalalgebra (bar cross products)� Finding a minimal query amounts to global optimization
87/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Unicity of minimal queryProposition ([Chandra and Merlin, 1977])Let q be a CQ. Then there exists a CQ q0 obtained byremoving atoms from q which is minimal.
Proof.Consider a minimal query equivalent to q and apply thehomomorphism theorem.
Proposition ([Chandra and Merlin, 1977])Let q, q0 be two equivalent minimal CQs. Then there existsan isomorphism from q to q0.
Proof.Apply the homomorphism theorem. The image by thehomomorphism is an equivalent minimal query.
88/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Minimization algorithm
Apply the following procedure to minimize a query:For every atom of the query, test if there exists anequivalent query not containing this atom, and thusif there exists a homomorphism sending this atom toanother atom of the query. If so, delete it, and con-tinue until obtaining an equivalent minimal query.
89/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Complexity issues
PropositionThe following problems are NP-complete:� given two CQs q, q0, determine whether q v q0
� given two CQs q, q0, determine whether q � q0
� given a CQ q, determine if q is non-minimal
Proof.NP-hardness is by reduction from 3-colorability, as forcombined complexity of query evaluation. Membership in NP isdirect.
NP-hard. . . in the queries. Queries may be small enough sothat an exponential algorithm may not be an issue.
89/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Complexity issues
PropositionThe following problems are NP-complete:� given two CQs q, q0, determine whether q v q0
� given two CQs q, q0, determine whether q � q0
� given a CQ q, determine if q is non-minimal
Proof.NP-hardness is by reduction from 3-colorability, as forcombined complexity of query evaluation. Membership in NP isdirect.
NP-hard. . . in the queries. Queries may be small enough sothat an exponential algorithm may not be an issue.
90/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Bag semantics[Chaudhuri and Vardi, 1993]
� In practice, RDBMSs implement a bag semantics� Two queries in bag semantics are equivalent if and only if
they are isomorphic (intuitively, because two similar butnon isomorphic queries can introduce a different number ofresults)
� Query containment: �P2 -hard claimed (but not proved).
Decidability (and precise complexity if decidable): open!
91/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query Evaluation
Static Analysis of QueriesContainment and EquivalenceConjunctive QueriesRelational Calculus
Conclusion
92/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Satisfiability in the relational calculus
DefinitionA Boolean relational calcululs query q is satisfiable if thereexists a (finite) database D such that D j= q.
Theorem ([Trakhtenbrot, 1963])Satisfiability of the relational calculus (in the finite case) isundecidable.
Proof.Reduction possible from the POST correspondence problem,technical, see [Abiteboul et al., 1995].
92/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Satisfiability in the relational calculus
DefinitionA Boolean relational calcululs query q is satisfiable if thereexists a (finite) database D such that D j= q.
Theorem ([Trakhtenbrot, 1963])Satisfiability of the relational calculus (in the finite case) isundecidable.
Proof.Reduction possible from the POST correspondence problem,technical, see [Abiteboul et al., 1995].
93/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Containment and equivalence of the calculus
TheoremContainment and equivalence of relational calculus queriesare undecidable and co-recursively enumerable.
Proof.Undecidability is by direct reduction from the undecidability ofsatisfiability.Co-recursive enumerability is shown directly, by enumeratingpossible counter-examples.
93/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Containment and equivalence of the calculus
TheoremContainment and equivalence of relational calculus queriesare undecidable and co-recursively enumerable.
Proof.Undecidability is by direct reduction from the undecidability ofsatisfiability.Co-recursive enumerability is shown directly, by enumeratingpossible counter-examples.
94/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
95/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Plan
Introduction
Introduction to The Relational Model
Recursive Queries
Complexity of Query Evaluation
Static Analysis of Queries
Conclusion
96/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Database theory: a rich field of research
� Rich connections with various areas of theoreticalcomputer science and mathematics: logics, algorithms,graph theory, automata theory, typing theory, algebra, etc.� This lecture: very high-level overview, no time for proofs,some quite interesting� Further lectures at EPIT: more on efficient queryevaluation, XML and graph databases, descriptivecomplexity, logical independence, uncertainty in databases,connections with description logics� Plenty of other interesting topics: data exchange andintegration, probabilistic databases, theory of indexingstructures, ranking and top-k queries. . .
97/97
Introduction Relational Model Recursive Queries Query Evaluation Static Analysis Conclusion
Reference
� Main database theory textbook: [Abiteboul et al., 1995]
� In this lecture: (part of) chapters 3, 4, 5, 6, 12, 14, 17
Bibliographie I
Serge Abiteboul, Richard Hull, and Victor Vianu. Foundationsof Databases. Addison-Wesley, 1995.
Ashok K. Chandra and Philip M. Merlin. Optimalimplementation of conjunctive queries in relational databases. In Proceedings of the 9th Annual ACM Symposiumon Theory of Computing, May 4-6, 1977, Boulder,Colorado, USA, pages 77–90, 1977. doi:10.1145/800105.803397. URLhttp://doi.acm.org/10.1145/800105.803397.
Surajit Chaudhuri and Moshe Y. Vardi. Optimization of Realconjunctive queries. In Proceedings of the Twelfth ACMSIGACT-SIGMOD-SIGART Symposium on Principles ofDatabase Systems, May 25-28, 1993, Washington, DC,USA, pages 59–70, 1993. doi: 10.1145/153850.153856. URLhttp://doi.acm.org/10.1145/153850.153856.
Bibliographie II
Edgar F. Codd. Relational completeness of data basesublanguages. In Data Base Systems. Courant ComputerScience Symposium, 1972.
Anthony C. Klug. Equivalence of relational algebra andrelational calculus query languages having aggregatefunctions. J. ACM, 29(3):699–717, 1982.
Leonid Libkin. Expressive power of SQL. Theor. Comput.Sci., 296(3):379–404, 2003.
Leonid Libkin. Elements of Finite Model Theory. Texts inTheoretical Computer Science. An EATCS Series. Springer,2004. doi: 10.1007/978-3-662-07003-1. URLhttp://dx.doi.org/10.1007/978-3-662-07003-1.
Bibliographie III
Robert Endre Tarjan and Mihalis Yannakakis. Simplelinear-time algorithms to test chordality of graphs, testacyclicity of hypergraphs, and selectively reduce acyclichypergraphs. SIAM J. Comput., 13(3):566–579, 1984. doi:10.1137/0213035. URLhttp://dx.doi.org/10.1137/0213035.
Boris A. Trakhtenbrot. Impossibility of an algorithm for thedecision problem in finite classes. American MathematicalSociety Translations Series 2, 23:1–5, 1963.
Mihalis Yannakakis. Algorithms for acyclic database schemes.In VLDB, pages 82–94, 1981.