Transactions, Concluded, and the Future of Data Management Zachary G. Ives University of...

Transactions, Concluded, and the Future of Data Management

Zachary G. Ives, University of Pennsylvania

CIS 550 – Database & Information Systems

December 4, 2003

Slide content courtesy of Susan Davidson, Raghu Ramakrishnan & Johannes Gehrke

2

Final Administrivia

Project demos today and tomorrow
Final exam handed out at the end of today’s class
Finals plus project reports due by 1PM, 12/18/2003
    Project reports should be ballpark 10-15 pages
    Remember, quality and clarity of presentation matter!
Also, email me a brief message detailing:
    Your contributions to the project
    Your group members’ contributions and your assessment of “group dynamics”
Turn in at my office, 576 Levine Hall, or to my assistant, Kathy Venit, in 308 Levine Hall

3

Last Time…

We were discussing isolation levels
    How to keep transactions from interfering with one another
    Or at least, how to minimize this
Recall the strongest version of isolation was serializability

4

Theory of Serializability

A schedule of a set of transactions is a linear ordering of their actions
    e.g. for the simultaneous deposits example: R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal)
A serial schedule is one in which all the steps of each transaction occur consecutively
A serializable schedule is one which is equivalent to some serial schedule (i.e. given any initial state, the final state is the same as one produced by some serial schedule)
The example above is neither serial nor serializable
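To see why the deposits schedule is not serializable, it helps to execute it. The following is a minimal sketch, assuming each write stores the value the transaction read plus its own deposit; the `run` helper, the starting balance of 1000, and the deposit amounts are hypothetical and not from the slides.

```python
from itertools import permutations

def run(schedule, initial_balance, deposits):
    """Execute a schedule of (op, txn) steps against a single balance.

    "R" copies the balance into the transaction's private copy;
    "W" writes back that private copy plus the transaction's deposit.
    """
    balance = initial_balance
    local = {}  # per-transaction private copy of the balance
    for op, txn in schedule:
        if op == "R":
            local[txn] = balance
        else:  # "W"
            balance = local[txn] + deposits[txn]
    return balance

deposits = {1: 100, 2: 50}  # hypothetical amounts for T1 and T2

# R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal)
interleaved = [("R", 1), ("R", 2), ("W", 1), ("W", 2)]
interleaved_result = run(interleaved, 1000, deposits)

# Final balances produced by every serial order (T1 then T2, T2 then T1).
serial_results = {
    run([(op, t) for t in order for op in ("R", "W")], 1000, deposits)
    for order in permutations(deposits)
}

print(interleaved_result, serial_results)
```

Under these assumptions the interleaved schedule loses T1's deposit, so its final balance matches no serial schedule.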

5

Questions of Concern

Given a schedule S, is it serializable?
How can we "restrict" transactions in progress to guarantee that only serializable schedules are produced?

6

Conflicting Actions

Consider a schedule S in which there are two consecutive actions Ii and Ij of transactions Ti and Tj respectively
If Ii and Ij refer to different data items, then swapping Ii and Ij does not matter
If Ii and Ij refer to the same data item Q, then swapping Ii and Ij matters if and only if at least one of the actions is a write
    e.g., Ri(Q) Wj(Q) and Wj(Q) Ri(Q) differ: in the first, Ti reads the old value of Q; in the second, it reads the value written by Tj
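The conflict rule can be written as a small predicate. This is a sketch only; the `Action` record and its field names are hypothetical, chosen for illustration.

```python
from typing import NamedTuple

class Action(NamedTuple):
    txn: int   # transaction number, e.g. 1 for T1
    op: str    # "R" for read, "W" for write
    item: str  # data item name, e.g. "Q"

def conflicts(a: Action, b: Action) -> bool:
    """Two actions conflict if they come from different transactions,
    touch the same data item, and at least one of them is a write."""
    return a.txn != b.txn and a.item == b.item and "W" in (a.op, b.op)

print(conflicts(Action(1, "R", "Q"), Action(2, "W", "Q")))
print(conflicts(Action(1, "R", "Q"), Action(2, "R", "Q")))
```

Two reads of the same item never conflict, which is why shared locks can coexist.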

7

Testing for Serializability

Given a schedule S, we can construct a digraph G=(V,E) called a precedence graph
    V: all transactions in S
    E: Ti → Tj whenever an action of Ti precedes and conflicts with an action of Tj in S
Theorem: A schedule S is conflict serializable if and only if its precedence graph contains no cycles
Note that testing for a cycle in a digraph can be done in time O(|V|^2)
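The test can be sketched as follows, assuming a schedule is given as a list of (transaction, operation, item) tuples. The function names are illustrative. Cycle detection here uses depth-first search, which runs in O(|V| + |E|) and therefore stays within the O(|V|^2) bound.

```python
def precedence_graph(schedule):
    """Build {txn: set of successor txns} from (txn, op, item) tuples."""
    graph = {txn: set() for txn, _, _ in schedule}
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            # Edge Ti -> Tj: an earlier action of Ti conflicts with a later one of Tj.
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                graph[ti].add(tj)
    return graph

def has_cycle(graph):
    """Depth-first search; reaching a node still on the stack means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GREY
        for w in graph[v]:
            if color[w] == GREY or (color[w] == WHITE and visit(w)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

def is_conflict_serializable(schedule):
    return not has_cycle(precedence_graph(schedule))

# The simultaneous deposits schedule: R1(X) R2(X) W1(X) W2(X)
deposits_schedule = [(1, "R", "X"), (2, "R", "X"), (1, "W", "X"), (2, "W", "X")]
print(precedence_graph(deposits_schedule))
print(is_conflict_serializable(deposits_schedule))
```

For the deposits schedule the graph has edges in both directions between T1 and T2, so the check reports that it is not conflict serializable.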

8

An Example

T1 T2 T3 R(X,Y,Z) R(X) W(X) R(Y) W(Y) R(Y) R(X) W(Z)

[Figure: precedence graph over T1, T2, T3]

Cyclic: Not serializable.

9

Another Example

T1        T2        T3
          R(X)
          W(X)
                    R(X)
                    W(X)
R(Y)
W(Y)
          R(Y)
          W(Y)

[Figure: precedence graph over T1, T2, T3]

Acyclic: serializable

10

Producing the Equivalent Serial Schedule

If the precedence graph for a schedule is acyclic, then an equivalent serial schedule can be found by a topological sort of the graph

For the second example, the equivalent serial schedule is:

R1(Y)W1(Y) R2(X)W2(X) R2(Y)W2(Y) R3(X)W3(X)
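A sketch of the topological-sort step, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The interleaved schedule below is an assumption: it is the ordering of the second example that is consistent with the serial schedule above, with the transaction for each action inferred from that serial schedule.

```python
from graphlib import TopologicalSorter

# Assumed interleaving of the second example (transaction numbers inferred).
schedule = [
    (2, "R", "X"), (2, "W", "X"),
    (3, "R", "X"), (3, "W", "X"),
    (1, "R", "Y"), (1, "W", "Y"),
    (2, "R", "Y"), (2, "W", "Y"),
]

def predecessors(schedule):
    """Map each transaction to the set of transactions that must precede it."""
    preds = {txn: set() for txn, _, _ in schedule}
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                preds[tj].add(ti)
    return preds

def equivalent_serial_schedule(schedule):
    """Order transactions topologically, then emit each one's actions in turn.

    static_order() raises graphlib.CycleError if the graph has a cycle,
    i.e. if the schedule is not conflict serializable.
    """
    order = TopologicalSorter(predecessors(schedule)).static_order()
    return [(t, op, item) for t in order for (u, op, item) in schedule if u == t]

serial = equivalent_serial_schedule(schedule)
print(" ".join(f"{op}{t}({item})" for t, op, item in serial))
```

With the assumed interleaving, the only edges are T1 → T2 and T2 → T3, so the topological order is forced to be T1, T2, T3.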

11

Locking and Serializability

We said that for a serializable schedule, a transaction must hold all locks until it terminates (a condition called strict locking)

It turns out that this is crucial to guarantee serializability
Note that the first (bad) example could have been produced if transactions acquired and immediately released locks.

12

Well-Formed, Two-Phased Transactions

A transaction is well-formed if it acquires at least a shared lock on Q before reading Q or an exclusive lock on Q before writing Q and doesn’t release the lock until the action is performed
    Locks are also released by the end of the transaction
A transaction is two-phased if it never acquires a lock after unlocking one
    i.e., there are two phases: a growing phase in which the transaction acquires locks, and a shrinking phase in which locks are released
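Both properties can be checked mechanically on one transaction's sequence of operations. The following is a minimal sketch; the operation names (`lock_s`, `lock_x`, `unlock`, `read`, `write`) and the function names are hypothetical, chosen for illustration.

```python
def is_well_formed(ops):
    """ops: list of (kind, item) for a single transaction.

    Reads need a shared or exclusive lock, writes need an exclusive lock,
    and every lock must be released by the end of the transaction.
    """
    held = {}  # item -> "S" (shared) or "X" (exclusive)
    for kind, item in ops:
        if kind == "lock_s":
            held.setdefault(item, "S")  # keep an existing exclusive lock
        elif kind == "lock_x":
            held[item] = "X"
        elif kind == "unlock":
            held.pop(item, None)
        elif kind == "read" and item not in held:
            return False
        elif kind == "write" and held.get(item) != "X":
            return False
    return not held

def is_two_phase(ops):
    """True if no lock is acquired after the first unlock."""
    unlocked = False
    for kind, _ in ops:
        if kind == "unlock":
            unlocked = True
        elif kind in ("lock_s", "lock_x") and unlocked:
            return False
    return True

two_phase_txn = [("lock_x", "A"), ("read", "A"), ("write", "A"),
                 ("lock_s", "B"), ("read", "B"),
                 ("unlock", "A"), ("unlock", "B")]
early_release_txn = [("lock_x", "A"), ("write", "A"), ("unlock", "A"),
                     ("lock_s", "B"), ("read", "B"), ("unlock", "B")]

print(is_well_formed(two_phase_txn), is_two_phase(two_phase_txn))
print(is_well_formed(early_release_txn), is_two_phase(early_release_txn))
```

The second transaction is well-formed but not two-phased: it releases its lock on A before acquiring the lock on B.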

13

Two-Phased Locking Theorem

If all transactions are well-formed and two-phase, then any schedule in which conflicting locks are never granted is serializable
    i.e., there is a very simple scheduler!
However, if some transaction is not well-formed or two-phase, then there is some schedule in which conflicting locks are never granted but which fails to be serializable
    i.e., one bad apple spoils the bunch.

14

Summary of Transactions

Transactions are all-or-nothing units of work guaranteed despite concurrency or failures in the system

Theoretically, the “correct” execution of transactions is serializable (i.e. equivalent to some serial execution)

Practically, this may adversely affect throughput, which motivates isolation levels

With isolation levels, users can specify the level of “incorrectness” they are willing to tolerate

15

What to Look for Down the Road

… well, no one really knows the answer to this…
… But here are some hints, ideas, and hot directions
    Sensors and streaming data
    Peer-to-peer meets databases
    “The Semantic Web”
    Collaborative data sharing

16

Sensors and Streaming Data

No databases at all…
… Instead we have networks of simple sensors
    Madden, starting at MIT
    Gehrke, Cornell
    Widom, Stanford
queries are in SQL
data is live and “streaming”
we compute aggregates over “windows”
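As a rough illustration of aggregating over a window, here is a sketch of a sliding-window average over a stream of readings. It is written in Python rather than SQL, and the function name, window size, and sample readings are assumptions, not from any of the systems named above.

```python
from collections import deque

def windowed_average(readings, window_size):
    """Yield the average of the most recent `window_size` readings
    each time a new reading arrives."""
    window = deque(maxlen=window_size)  # old readings fall off automatically
    for value in readings:
        window.append(value)
        yield sum(window) / len(window)

# Hypothetical temperature readings arriving one at a time.
averages = list(windowed_average([20.0, 22.0, 24.0, 26.0], window_size=2))
print(averages)
```

Only the current window is kept in memory, which matters when the stream is unbounded and the sensor has little storage.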

17

What’s Interesting Here

We’re not talking about data on disk; we’re talking about queries over “current readings”
Sensors are generally “stupid” and may be battery-operated
    A lot of challenges are networking-related: how to aggregate data before it gets sent, etc.
The next step (e.g., work initiated here @ Penn): including sensors that capture images, a very different problem!
    This has many more compelling applications: security, monitoring, correlating multiple sensors, rescue operations, military logistics and coordination, etc.

18

Peer-to-Peer Computing

Fundamentally, our model of DBMSs tends to be centralized
    Even for data integration: there’s a single mediator
This has many implications: central administration, central coordination, etc.
What can be gained from borrowing a page from peer-to-peer systems like Napster, Kazaa, etc.?
    A better architecture?
    Solutions to many problems unsolved by distributed DBMSs?
        Replication, object location, distributed optimization, resiliency to failure, …
    New types of applications, e.g., in integration?

19

P2P Work

As a new architecture for storage and querying
    PIER (Berkeley), P-Grid (EPFL), Medusa (MIT)
A better way of thinking about translating and exchanging data
    Piazza (Washington), Orchestra (Penn), Hyperion (Toronto), work at Trento

20

The Semantic Web

In some ways, a very “pie-in-the-sky” vision
    But some real and concrete problems might be partly solvable
Goal is really very similar to data integration, where somehow we have mappings between the schemas
Currently, most people in the SW community are from the knowledge representation community and use RDF
Focus: very rich ways of describing schemas (“ontologies”) that blend querying with class definitions
    “Teachers are people who teach students”
    “Tenure-track professors are teachers at universities who can get tenure”; etc.

Implicit take on the problem: if we create better languages for describing ontologies, it’s easier to mediate between schemas

21

Holes in the Semantic Web

What issues and concerns came up in the data integration assignment you had?
    Do you think a richer schema language would help for these?
    Do you think “better normalization” would help?
Fundamentally, we need:
    Languages for describing not only relationships, but also transformations between formats (e.g., XML schemas)
    Automatic or partly automated ways of discovering mappings and correspondences
These are all database problems, and the solution likely must come from the DB community
    This is part of what P2P systems like Piazza and Hyperion try to address

22

My Take on the Future

We’ve evolved from a world where data management is about controlling the data
Instead, data management is about translating and transforming data using declarative languages
    It should ultimately become much like TCP or SOAP: a set of standard services for “getting stuff” from one point to another, or from one form to another
    It’s the plumbing that connects different applications using different formats
Orchestra project at Penn: focuses on how to build a system for supporting collaborative science
    People publish and map data in different schemas
    What happens if people start updating it?
    How do you propagate, manage, trace, reconcile changes?
