Transactions, Concluded, and the Future of Data Management
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
December 4, 2003
Slide content courtesy of Susan Davidson, Raghu Ramakrishnan & Johannes Gehrke
2
Final Administrivia
Project demos today and tomorrow
Final exam handed out at the end of today’s class
Finals plus project reports due by 1PM, 12/18/2003
Project reports should be ballpark 10-15 pages
Remember, quality and clarity of presentation matter!
Also, email me a brief message detailing:
Your contributions to the project
Your group members’ contributions and your assessment of “group dynamics”
Turn in at my office, 576 Levine Hall, or to my assistant, Kathy Venit, in 308 Levine Hall
3
Last Time…
We were discussing isolation levels
How to keep transactions from interfering with one another
Or at least, how to minimize this
Recall the strongest version of isolation was serializability
4
Theory of Serializability
A schedule of a set of transactions is a linear ordering of their actions
e.g., for the simultaneous deposits example: R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal)
A serial schedule is one in which all the steps of each transaction occur consecutively
A serializable schedule is one which is equivalent to some serial schedule (i.e., given any initial state, the final state is the same as one produced by some serial schedule)
The example above is neither serial nor serializable
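To see why the simultaneous deposits schedule is not serializable, here is a minimal Python sketch that replays a schedule against a single balance. The deposit amounts, the starting balance, and the helper name run are assumptions made for illustration; only the ordering of reads and writes comes from the slide.

```python
def run(schedule, deposits, initial):
    """Replay a schedule of (txn, op) steps against one account balance."""
    balance = initial
    local = {}  # the value each transaction most recently read
    for txn, op in schedule:
        if op == "R":
            local[txn] = balance
        else:
            # "W": write back the value this transaction read plus its deposit
            balance = local[txn] + deposits[txn]
    return balance

# Hypothetical amounts: T1 deposits 100, T2 deposits 50, balance starts at 1000.
deposits = {1: 100, 2: 50}
interleaved = [(1, "R"), (2, "R"), (1, "W"), (2, "W")]  # R1 R2 W1 W2
serial_t1_t2 = [(1, "R"), (1, "W"), (2, "R"), (2, "W")]
serial_t2_t1 = [(2, "R"), (2, "W"), (1, "R"), (1, "W")]

lost_update = run(interleaved, deposits, 1000)
serial_results = {run(s, deposits, 1000) for s in (serial_t1_t2, serial_t2_t1)}
print(lost_update, serial_results)
```

Under these assumed amounts both serial orders end at 1150, while the interleaved schedule ends at 1050 because T2 overwrites T1's deposit.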
5
Questions of Concern
Given a schedule S, is it serializable?
How can we "restrict" transactions in progress to guarantee that only serializable schedules are produced?
6
Conflicting Actions
Consider a schedule S in which there are two consecutive actions Ii and Ij of transactions Ti and Tj respectively
If Ii and Ij refer to different data items, then swapping Ii and Ij does not matter
If Ii and Ij refer to the same data item Q, then swapping Ii and Ij matters if and only if at least one of the actions is a write
e.g., in Ri(Q) Wj(Q), Ti reads a different value of Q than in Wj(Q) Ri(Q)
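The conflict rule above can be written as a small predicate. This is a sketch only; the Action tuple and the conflicts function are illustrative names, not something from the slides.

```python
from typing import NamedTuple

class Action(NamedTuple):
    txn: int   # transaction number, e.g. 1 for T1
    op: str    # "R" for read, "W" for write
    item: str  # data item name, e.g. "Q"

def conflicts(a: Action, b: Action) -> bool:
    """Two actions conflict if they come from different transactions,
    touch the same data item, and at least one of them is a write."""
    return a.txn != b.txn and a.item == b.item and "W" in (a.op, b.op)

print(conflicts(Action(1, "R", "Q"), Action(2, "W", "Q")))  # read/write on Q
print(conflicts(Action(1, "R", "Q"), Action(2, "R", "Q")))  # two reads
```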
7
Testing for Serializability
Given a schedule S, we can construct a di-graph G=(V,E) called a precedence graph
V: all transactions in S
E: Ti → Tj whenever an action of Ti precedes and conflicts with an action of Tj in S
Theorem: A schedule S is conflict serializable if and only if its precedence graph contains no cycles
Note that testing for a cycle in a digraph can be done in time O(|V|^2)
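A minimal Python sketch of this test follows. It builds the precedence graph from a schedule given as (transaction, operation, item) triples and checks for a cycle with a depth-first search. The function names and the schedule encoding are assumptions for illustration; the cyclic input is the simultaneous deposits schedule R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal) from earlier in the lecture.

```python
def precedence_graph(schedule):
    """schedule: list of (txn, op, item) in time order.
    Returns {txn: set of txns it must precede}."""
    graph = {txn: set() for txn, _, _ in schedule}
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            # Edge Ti -> Tj: an earlier action of Ti conflicts with a later one of Tj
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                graph[ti].add(tj)
    return graph

def has_cycle(graph):
    """Depth-first search; reaching a node still on the current path means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GREY
        for w in graph[v]:
            if color[w] == GREY or (color[w] == WHITE and visit(w)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

deposits = [(1, "R", "X.bal"), (2, "R", "X.bal"), (1, "W", "X.bal"), (2, "W", "X.bal")]
serial = [(1, "R", "X.bal"), (1, "W", "X.bal"), (2, "R", "X.bal"), (2, "W", "X.bal")]
print(precedence_graph(deposits), has_cycle(precedence_graph(deposits)))
print(precedence_graph(serial), has_cycle(precedence_graph(serial)))
```

The depth-first search runs in time proportional to the number of nodes plus edges, which is within the O(|V|^2) bound stated above.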
8
An Example
Schedule (table with columns T1, T2, T3, flattened): R(X,Y,Z) R(X) W(X) R(Y) W(Y) R(Y) R(X) W(Z)
Precedence graph (figure) over T1, T2, T3
Cyclic: Not serializable.
9
Another Example
Schedule (table with columns T1, T2, T3, flattened): R(X) W(X) R(X) W(X) R(Y) W(Y) R(Y) W(Y)
Precedence graph: T1 → T2 → T3
Acyclic: serializable
10
Producing the Equivalent Serial Schedule
If the precedence graph for a schedule is acyclic, then an equivalent serial schedule can be found by a topological sort of the graph
For the second example, the equivalent serial schedule is:
R1(Y)W1(Y) R2(X)W2(X) R2(Y)W2(Y) R3(X)W3(X)
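As a sketch, the topological sort can be done with Python's standard graphlib module (Python 3.9 or later). The per-transaction actions are taken from the serial schedule above; the edge list T1 → T2 (conflict on Y) and T2 → T3 (conflict on X) is an assumption inferred from that serial order, since the original figure is not available here.

```python
from graphlib import TopologicalSorter

# Actions of each transaction, as listed in the serial schedule above.
actions = {
    1: ["R1(Y)", "W1(Y)"],
    2: ["R2(X)", "W2(X)", "R2(Y)", "W2(Y)"],
    3: ["R3(X)", "W3(X)"],
}
# Assumed precedence edges (before, after).
edges = [(1, 2), (2, 3)]

# TopologicalSorter expects each node mapped to the set of its predecessors.
predecessors = {txn: set() for txn in actions}
for before, after in edges:
    predecessors[after].add(before)

order = list(TopologicalSorter(predecessors).static_order())
serial_schedule = [step for txn in order for step in actions[txn]]
print(order)
print(" ".join(serial_schedule))
```

If the graph had a cycle, static_order() would raise graphlib.CycleError, which matches the theorem: no serial order exists in that case.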
11
Locking and Serializability
We said that for a serializable schedule, a transaction must hold all locks until it terminates (a condition called strict locking)
It turns out that this is crucial to guarantee serializability
Note that the first (bad) example could have been produced if transactions acquired and immediately released locks.
12
Well-Formed, Two-Phased Transactions
A transaction is well-formed if it acquires at least a shared lock on Q before reading Q or an exclusive lock on Q before writing Q and doesn’t release the lock until the action is performed
Locks are also released by the end of the transaction
A transaction is two-phased if it never acquires a lock after unlocking one
i.e., there are two phases: a growing phase in which the transaction acquires locks, and a shrinking phase in which locks are released
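Both definitions can be checked mechanically on a single transaction's sequence of operations. The sketch below assumes a made-up encoding of operations as (kind, item) pairs with kinds "slock", "xlock", "unlock", "read" and "write"; none of these names come from the slides.

```python
def is_well_formed(ops):
    """Reads need a shared or exclusive lock, writes need an exclusive lock,
    and every lock must be released by the end of the transaction."""
    held = {}  # item -> "S" or "X"
    for kind, item in ops:
        if kind == "slock":
            held.setdefault(item, "S")  # keep an existing exclusive lock
        elif kind == "xlock":
            held[item] = "X"
        elif kind == "unlock":
            held.pop(item, None)
        elif kind == "read" and item not in held:
            return False
        elif kind == "write" and held.get(item) != "X":
            return False
    return not held

def is_two_phase(ops):
    """No lock may be acquired once any lock has been released."""
    shrinking = False
    for kind, _ in ops:
        if kind == "unlock":
            shrinking = True
        elif kind in ("slock", "xlock") and shrinking:
            return False
    return True

two_phase_txn = [("slock", "A"), ("read", "A"), ("xlock", "B"), ("write", "B"),
                 ("unlock", "A"), ("unlock", "B")]
early_release = [("xlock", "A"), ("write", "A"), ("unlock", "A"),
                 ("xlock", "B"), ("write", "B"), ("unlock", "B")]
print(is_well_formed(two_phase_txn), is_two_phase(two_phase_txn))
print(is_well_formed(early_release), is_two_phase(early_release))
```

The second transaction is well-formed but not two-phased: it acquires the lock on B after releasing the lock on A.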
13
Two-Phased Locking Theorem
If all transactions are well-formed and two-phase, then any schedule in which conflicting locks are never granted is serializable
i.e., there is a very simple scheduler!
However, if some transaction is not well-formed or two-phase, then there is some schedule in which conflicting locks are never granted but which fails to be serializable
i.e., one bad apple spoils the bunch.
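The "very simple scheduler" only has to refuse conflicting lock requests. Here is a minimal sketch of such a lock table, assuming shared ("S") and exclusive ("X") modes; the class and method names are hypothetical, and a real scheduler would also queue or block refused requests and handle deadlock.

```python
class LockTable:
    """Grants a lock only if no other transaction holds a conflicting one."""

    def __init__(self):
        self.holders = {}  # item -> {txn: "S" or "X"}

    def request(self, txn, item, mode):
        """mode is "S" (shared) or "X" (exclusive). Returns True if granted."""
        current = self.holders.get(item, {})
        others = [m for t, m in current.items() if t != txn]
        if mode == "X" and others:
            return False  # exclusive conflicts with any lock held by another txn
        if mode == "S" and "X" in others:
            return False  # shared conflicts with another txn's exclusive lock
        held = self.holders.setdefault(item, {})
        if held.get(txn) != "X":  # never downgrade an exclusive lock
            held[txn] = mode
        return True

    def release_all(self, txn):
        """Release every lock held by txn, e.g. at commit or abort."""
        for held in self.holders.values():
            held.pop(txn, None)

table = LockTable()
print(table.request(1, "Q", "S"))  # T1 gets a shared lock on Q
print(table.request(2, "Q", "S"))  # shared locks are compatible
print(table.request(2, "Q", "X"))  # refused while T1 still holds its shared lock
```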
14
Summary of Transactions
Transactions are all-or-nothing units of work guaranteed despite concurrency or failures in the system
Theoretically, the “correct” execution of transactions is serializable (i.e. equivalent to some serial execution)
Practically, this may adversely affect throughput, which motivates isolation levels
With isolation levels, users can specify the level of “incorrectness” they are willing to tolerate
15
What to Look for Down the Road
… well, no one really knows the answer to this…
… But here are some hints, ideas, and hot directions
Sensors and streaming data
Peer-to-peer meets databases
“The Semantic Web”
Collaborative data sharing
16
Sensors and Streaming Data
No databases at all…
… Instead we have networks of simple sensors
Madden, starting at MIT
Gehrke, Cornell
Widom, Stanford
Queries are in SQL
Data is live and “streaming”
We compute aggregates over “windows”
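As an illustration of an aggregate over a window, here is a minimal Python sketch of a sliding-window average over a stream of readings. The function name, the count-based window, and the sample readings are assumptions; real stream systems express this declaratively and also support time-based windows.

```python
from collections import deque

def windowed_average(readings, window_size):
    """Yield the average of the most recent window_size readings
    as each new reading arrives."""
    window = deque(maxlen=window_size)  # the oldest reading drops out automatically
    for value in readings:
        window.append(value)
        yield sum(window) / len(window)

# Hypothetical sensor readings, window of the last 2 values.
averages = list(windowed_average([10, 20, 30, 40], 2))
print(averages)
```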
17
What’s Interesting Here
We’re not talking about data on disk – we’re talking about queries over “current readings”
Sensors are generally “stupid” and may be battery-operated
A lot of challenges are networking-related: how to aggregate data before it gets sent, etc.
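One way to aggregate data before it gets sent is for each sensor to forward a small partial aggregate rather than every raw reading. The sketch below assumes a routing tree of sensors and computes an average by passing (sum, count) pairs upward; the tree layout, field names, and readings are invented for illustration.

```python
def merge(a, b):
    """Combine two partial aggregates, each a (sum, count) pair."""
    return (a[0] + b[0], a[1] + b[1])

def aggregate(node):
    """Return the (sum, count) for a node's whole subtree.
    Each node sends one pair to its parent instead of all raw readings."""
    partial = (node["reading"], 1)
    for child in node["children"]:
        partial = merge(partial, aggregate(child))
    return partial

# Hypothetical routing tree: the root hears from two sensors, one of which relays a third.
tree = {"reading": 20, "children": [
    {"reading": 22, "children": []},
    {"reading": 18, "children": [{"reading": 24, "children": []}]},
]}
total, count = aggregate(tree)
average = total / count
print(total, count, average)
```

Averages cannot be merged directly, which is why the sketch carries the sum and the count separately and divides only at the root.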
The next step (e.g., work initiated here @ Penn): including sensors that capture images – a very different problem!
This has many more compelling applications – security, monitoring, correlating multiple sensors, rescue operations, military logistics and coordination, etc.
18
Peer-to-Peer Computing
Fundamentally, our model of DBMSs tends to be centralized
Even for data integration: there’s a single mediator
This has many implications: central administration, central coordination, etc.
What can be gained from borrowing a page from peer-to-peer systems like Napster, Kazaa, etc.?
A better architecture?
Solutions to many problems unsolved by distributed DBMSs?
Replication, object location, distributed optimization, resiliency to failure, …
New types of applications, e.g., in integration?
19
P2P Work
As a new architecture for storage and querying
PIER (Berkeley), P-Grid (EPFL), Medusa (MIT)
A better way of thinking about translating and exchanging data
Piazza (Washington), Orchestra (Penn), Hyperion (Toronto), work at Trento
20
The Semantic Web
In some ways, a very “pie-in-the-sky” vision
But some real and concrete problems might be partly solvable
Goal is really very similar to data integration, where somehow we have mappings between the schemas
Currently, most people in the SW community are from the knowledge representation community and use RDF
Focus: very rich ways of describing schemas – “ontologies” – that blend querying with class definitions
“Teachers are people who teach students”
“Tenure-track professors are teachers at universities who can get tenure”; etc.
Implicit take on the problem: if we create better languages for describing ontologies, it’s easier to mediate between schemas
21
Holes in the Semantic Web
What issues and concerns came up in the data integration assignment you had?
Do you think a richer schema language would help for these?
Do you think “better normalization” would help?
Fundamentally, we need:
Languages for not only describing relationships, but transformations between formats (e.g., XML schemas)
Automatic or partly automated ways of discovering mappings and correspondences
These are all database problems, and the solution likely must come from the DB community
This is part of what P2P systems like Piazza, Hyperion try to address
22
My Take on the Future
We’ve evolved from a world where data management is about controlling the data
Instead, data management is about translating and transforming data using declarative languages
It should ultimately become much like TCP or SOAP – a set of standard services for “getting stuff” from one point to another, or from one form to another
It’s the plumbing that connects different applications using different formats
Orchestra project at Penn: focuses on how to build a system for supporting collaborative science
People publish and map data in different schemas
What happens if people start updating it?
How do you propagate, manage, trace, reconcile changes?