ICS 224: Database Management Systems Spring 2011
Professor Sharad Mehrotra
Information and Computer Science Department
University of California, Irvine
ICS214A Notes 01 2
Course General Info
• URL: http://www.ics.uci.edu/~cs224/
  – All course info will be posted online
• Lecture times: Tue-Thurs 5 – 6.30
• Instructor: Sharad Mehrotra, BH 2082, [email protected]
• Office Hours: on request
Prerequisites
• Basic Data Management Concepts:
  – DB design, relational model, SQL, database programming (CS 122 or equivalent)
  – Database system implementation: indexing, query optimization, query processing, storage management, etc. (ICS 222 or equivalent)
• Basic Computer Science Concepts:
  – Depth-first search, directed/undirected graphs, “big O” notation, computational complexity, NP-completeness, …
Course Requirements
• Class Participation: 50%
  – Attendance, presentations, comments, interaction, enthusiasm, etc.
• Class Projects: 50%
  – Implementation oriented: take an idea/topic, identify a project, get it okayed by the instructor, develop a demonstration
  – Survey of an area: in-depth survey in the style of Computing Surveys articles. Provide your own perspective on a subarea.
  – MUST commit to a project by the end of the 2nd week.
Class Structure
• Each week we will
  – Pick a topic
  – Identify 1 paper per student/group of 2 students
  – Choose 2 papers as lead papers for presentation (one for each class); others are presented as short presentations
• Each week
  – Start with overview
  – Lead paper presentation
  – Short presentation of other papers (main idea)
  – Discussions
This course …
Most important ideas in data management (instructor’s pick)
But with the eye towards an end application …
Sentient spaces
Sentient Spaces …
• Spaces in which sensors are used to capture the dynamic, evolving state, which is then analyzed to implement adaptations.
• Numerous examples …
  – intelligent transportation systems
  – reconnaissance
  – surveillance systems
  – smart buildings
  – smart grid ...
Example: Smart Video Surveillance
[Figure: CS Building at UC Irvine. Labels: Video collection, Surveillance Video Database, Semantic Extraction, Event Database, Query, Query Analysis]
Implications of Sentient Space focus …
• Class focuses on topics that you might need to know if you want to explore applications in sentient spaces …
• Projects should target something about sentient spaces …
  – E.g., data cleaning of sentient data, a data model to represent sentient spaces, …
Data Models (2 weeks)
  – Representing time - TSQL2
  – Representing space
  – Querying streaming data - CQL, ASQL
  – Semi-structured data - OEM, Lore
New Ideas in Storage & Indexing (2 weeks)
• New storage models
  – Key-value store
  – Bigtable
  – Column stores
• New database system architecture
  – Data outsourcing
  – Multitenant databases
• New indexing techniques
  – Correlation maps
Data Quality (2 weeks)
• Data quality issues
  – Inaccuracy, incompleteness, ambiguity, errors, …
• Two aspects:
  – Techniques to improve quality: exploiting contextual knowledge, issues of efficiency
  – Techniques to tolerate poor quality of data in applications
New Computing Architecture (2 weeks)
• MapReduce framework
• Hive
• Pig Latin
• Join processing
• HadoopDB
• Hyrax?
Data Privacy (2 weeks)
• Use cases
  – Data publishing, queries, sharing, data outsourcing
• Diverse criteria
  – Differential privacy, anonymity, l-diversity, …
• Mechanisms to implement
A walk down the history of data models …
Two papers (MUST READ)
• Inclusion of New Types in Relational Data Base Systems, Stonebraker
• The POSTGRES Next-Generation Database Management System, Stonebraker
The Paleolithic Period …
• There were no general-purpose tools for managing large volumes of data …
  – OS provided resource management
  – Data was stored in files
  – Applications performed data management functionalities: fault tolerance, concurrency control, reliability, optimizations, …
  – Such functionalities had to be re-implemented for each application
The Neolithic Period …
• Early file systems evolve into general-purpose data management tools.
• DBMS Goals:
  – Efficiency and scalability (faster than files)
  – Management of large heterogeneous types of structured data
  – High reliability
  – Information sharing (multiple users)
• DBMS Users:
  – E-commerce companies, banks, airlines, transportation companies, corporate databases, government agencies, …
  – Anyone you can think of!
The Dark Ages …
• Network & hierarchical data models
  – Resulted in data spaghetti
  – Applications needed to chase pointers
  – There was little data abstraction or separation of concerns: little difference between physical data representation and logical data representation
  – Optimization was entirely left to application writers
  – There were no clean data management languages (unless you are a Cobol fan!)
The Relational Era …
• Relational model proposed by Codd
  – Everything is a relation
  – A query consists of an algebraic composition of a few powerful operators
  – Equivalent to a first-order relational calculus
• Primary features
  – Simple, clean data representation with a solid mathematical basis
  – Data abstraction: users did not need to be concerned about how data is stored physically
  – Simple declarative query language: users specify what to compute, not how to do it
  – Optimization by the system
Data Wars (1)
• Codasyl versus relational debates began …
  – Heated arguments during early SIGMOD conferences
  – Codasyl: the relational model is too simple; applications built using it will never scale in performance.
  – Relational: network/hierarchical models have no formal basis, are too complex, and become unmanageable as application complexity increases.
• Relational model found many supporters
  – Especially at universities
  – Its simplicity was enticing
Data Wars (2)
• Many projects started off trying to implement a relational DBMS
  – System R @ IBM Almaden
  – Ingres @ Berkeley
  – These early systems led to the technologies that drive modern data management
• Early prototypes became products
  – DB2 & Ingres
• Principal designers from both the System R and Ingres teams left to start companies
  – Oracle, Sybase
• Early relational companies went door to door converting industry to the relational model
  – Industry got hooked on the simplicity of writing complex applications in the relational model
  – Boeing among the first converts
Pointers Strike Back …
• Complex objects in emerging DBMS applications cannot be effectively represented as records in the relational model.
• Representing information in RDBMSs requires complex and inefficient conversion between the relational model and the application programming language.
• ODBMSs provide a direct representation of objects to DBMSs, overcoming the impedance mismatch problem.
[Figure: application data structures reach an RDBMS through copy and translation into a relational representation; they reach an ODBMS through transparent data transfer]
Object Model
• Object:
  – observable entity in the world being modeled
  – similar in concept to an entity in the E/R model
• An object consists of:
  – attributes: properties built from primitive types
  – relationships: properties whose type is a reference to some other object or a collection of references
  – methods: functions that may be applied to the object
Object Oriented Databases
• Evolved as persistent object-oriented programming languages:
• Start with an OO language (e.g., C++, Java, Smalltalk) which has a rich type system
• Add persistence to the objects in the programming language, with persistent objects stored in databases
Persistent Programming Languages
• Single programming language for application and data management
• Update to a persistent variable results in automatic update to the database.
• Persistent data can be of types such as sets, lists, and arrays.
• Application can follow pointers (OIDs) to navigate through data.
[Example: updates to persistent variables i and a[j], and to the benefit_level of an Employee's Spouse]
Persistence
• Objects created may have different lifetimes:
  – transient: allocated memory is managed by the programming language run-time system. E.g., local variables in procedures have the lifetime of a procedure execution; global variables have the lifetime of a program execution.
  – persistent: allocated memory and storage are managed by the ODBMS run-time system.
• Classes are declared to be persistence-capable or transient.
• Different languages have different mechanisms to make objects persistent:
  – creation time: object is declared persistent at creation time (e.g., in the C++ binding); class must be persistence-capable.
  – persistence by reachability: object is persistent if it can be reached from a persistent object (e.g., in the Java binding); class must be persistence-capable.
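Persistence by reachability can be illustrated with a small sketch. The `Obj` class, the `refs` field, and the choice of a single root are hypothetical names made up for illustration; a real ODBMS binding would do this traversal inside its run-time system.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Obj:
    """Hypothetical persistence-capable object that references other objects."""
    name: str
    refs: list = field(default_factory=list)

def reachable_from(roots):
    """Return the names of all objects reachable from the persistent roots."""
    seen, stack, names = set(), list(roots), []
    while stack:
        obj = stack.pop()
        if id(obj) in seen:          # already visited (handles cycles)
            continue
        seen.add(id(obj))
        names.append(obj.name)
        stack.extend(obj.refs)
    return sorted(names)

dept, emp, spouse = Obj("dept"), Obj("emp"), Obj("spouse")
temp = Obj("temp")                   # never linked from a persistent object
dept.refs.append(emp)
emp.refs.append(spouse)
spouse.refs.append(emp)              # cycle between emp and spouse

persistent = reachable_from([dept])
print(persistent)                    # ['dept', 'emp', 'spouse']
```

Only `temp` stays transient here, because nothing reachable from the root refers to it.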
Persistent Object-Oriented Programming Languages
• Persistent objects are stored in the database and accessed from the programming language.
• Single programming language for applications as well as data management.
  – Avoids having to translate data between the application programming language and the DBMS: efficient implementation, less code
  – Programmer does not need to write explicit code to move data to and from the database: persistent objects look exactly the same to the programmer as transient objects. The system automatically moves objects between memory and the storage device (pointer swizzling).
Approaches To Persistent Programming
• Persistent Virtual Memory
  – Disk representation and memory representation of data are identical.
  – No cost to translate data from one representation to another: efficient!
  – DB size limited to address space: a 32-bit processor has 2^32 bytes of addressability (4 GBytes)
  – Differentiating persistent objects and non-persistent objects is difficult.
  – Difficult to optimize disk layout and locality of access.
  – Example system using this approach: ObjectStore.
Approaches To Persistent Programming Languages
• Store persistent objects in files
  – Objects brought to memory on demand.
  – Implementation of OIDs is complex since pointers do not suffice in general: if the object is in memory, a pointer can be used as its OID; if the object is on disk, a disk address is still not a good OID since storage can be reorganized. A separate mechanism is needed.
  – Pointer swizzling for efficiency.
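A minimal sketch of pointer swizzling, under stated assumptions: the `SwizzlingStore` class, the `("oid", n)` encoding of references, and the dict standing in for disk are all hypothetical. On first dereference the OID is looked up and the object loaded; the reference is then overwritten with a direct in-memory pointer.

```python
class SwizzlingStore:
    """Hypothetical store: loads objects on demand and swizzles OID references."""
    def __init__(self, disk):
        self.disk = disk              # OID -> record; references stored as ("oid", n)
        self.resident = {}            # OID -> in-memory object
        self.loads = 0                # number of simulated disk reads

    def fetch(self, oid):
        if oid not in self.resident:
            self.loads += 1
            self.resident[oid] = dict(self.disk[oid])   # copy "from disk"
        return self.resident[oid]

    def deref(self, obj, attr):
        """Follow a reference, replacing the OID by a direct pointer on first use."""
        value = obj[attr]
        if isinstance(value, tuple) and value[0] == "oid":
            value = self.fetch(value[1])
            obj[attr] = value         # swizzled: later accesses skip the OID lookup
        return value

disk = {1: {"name": "Joe", "mgr": ("oid", 2)}, 2: {"name": "Ann", "mgr": None}}
store = SwizzlingStore(disk)
joe = store.fetch(1)
first = store.deref(joe, "mgr")       # loads object 2 and swizzles joe["mgr"]
second = store.deref(joe, "mgr")      # plain pointer access, no lookup
```

The on-disk record keeps the OID form, so storage can still be reorganized without invalidating references.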
Challenges In Building Persistent Languages
• Efficient caching of objects in the client address space.
  – Cache coherence.
• In an OODB, data migrates to clients, unlike relational client-server systems where the query migrates to the server.
• Given a large number of clients, each with its own cache of objects, ensuring consistency of objects across multiple clients is a challenge.
Disadvantages of ODBMS Approach
• Low protection
  – Since persistent objects are manipulated directly from applications, there are more chances that errors in applications can violate data integrity.
• Non-declarative interface:
  – difficult to optimize queries
  – difficult to express queries
• But …
  – Most ODBMSs offer a declarative query language, OQL, to overcome the problem.
  – OQL is very similar to SQL and can be optimized effectively.
  – OQL can be invoked from inside the ODBMS programming language.
  – Objects can be manipulated both within OQL and the programming language without explicitly transferring values between the two languages.
  – OQL embedding maintains the simplicity of the ODBMS programming language interface and yet provides declarative access.
The Return of the Relations … POSTGRES
• Relational model evolved into ORDBMSs that include the “best of” object-oriented concepts
• Among the first ORDBMS prototypes, built @ Berkeley: POSTGRES, commercialized as Illustra, which was bought by Informix (IUS)
• Has had major impact on the major commercial DBMSs, which have all migrated to the ORDBMS model.
• SQL3, supported by modern databases, adopted many of the concepts developed in Postgres
POSTGRES: Combinations
• Introduced object orientation into relational DBMSs.
• Fundamental concepts:
  – Each record has an OID.
  – Access to data through the query language POSTQUEL and through navigation via OIDs.
  – Classes
  – Inheritance
  – Types: rich set of types available for columns.
  – Functions: can be called within POSTQUEL.
Classes And Inheritance
• Class analogous to relation
• User can create new class
create Emp (name = c12, salary = float, age = int)
• Classes can inherit from others
create Salesman (quota = float) inherits Emp
• Multiple inheritance permitted. If new class causes ambiguity it is not
created.
• Classes:
– real: base classes or relations
– derived: views
– version: maintained differentially compared to parent class
Types In POSTGRES
• Standard base types
  – float, int, character strings, etc.
  – Abstract data type (ADT) facility to create new base types, e.g.:
    create type point (x = int, y = int)
    create type polygon
• ADTs can be used in class definitions.
    create Dept (dname = c10,
                 mgr = c12,
                 floorspace = polygon,
                 mailstop = point)
Functions In POSTGRES
• Three types: (1) C functions (2) Operators (3) POSTQUEL functions
• C functions
  – any C function over base types or composite types
    retrieve (Dept.name)
    where area (Dept.floorspace) > 500
    retrieve (Emp.name)
    where overpaid (Emp)
  – overpaid (Emp): function over a class, or method
Operators
• Arbitrary C functions are not optimized by query optimizers.
  – Special functions (operators) can utilize indexes for their evaluation.
• Operator: function with 1 or 2 operands
    retrieve (Dept.name)
    where Dept.floorspace AGT “(0,0), (1,1), (0,2)”
  – AGT: Area Greater Than
• An index (e.g., B-tree) defined properly can be used to speed up evaluation of operators such as AGT.
Other Features Of POSTGRES
• Allowed creation of new indices by user.
• To an extent pioneered the approach of extensible database technology, which is prevalent with vendors today
• Supported transitive closure in queries.
    retrieve* into answer (parent.older)
    from a in answer
    where parent.younger = “John” or
    parent.younger = a.older
• Supported rules
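The retrieve* query above keeps re-running until the answer stops growing. A minimal sketch of that fixpoint in Python follows; the `parent` pairs and the `ancestors` helper are made-up illustrations, not POSTGRES code.

```python
def ancestors(parent, person):
    """parent: set of (older, younger) pairs. Returns all ancestors of person."""
    answer = set()
    while True:
        # One round of the query: parents of the person or of anyone already found.
        new = {older for (older, younger) in parent
               if younger == person or younger in answer}
        if new <= answer:            # nothing new: fixpoint reached
            return answer
        answer |= new

parent = {("Mary", "John"), ("Sue", "Mary"), ("Bob", "Sue"), ("Al", "Zed")}
result = sorted(ancestors(parent, "John"))
print(result)                        # ['Bob', 'Mary', 'Sue']
```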
POSTQUEL Functions
• Any collection of commands in POSTQUEL.
  – query = POSTQUEL function.
    define function high-pay
    returns Emp as
    retrieve (Emp.all)
    where Emp.salary > 50k
• POSTQUEL function with parameters.
    define function Sal-lookup (c12)
    returns float as
    retrieve (Emp.salary)
    where Emp.name = $1
• Usage of POSTQUEL function
    retrieve (Emp.name)
    where Emp.salary = Sal-lookup (“Joe”)
Composite Types In POSTGRES
• POSTQUEL:
  – Composite types accessed via path expressions, using nested dot notation.
    retrieve (Emp.mgr.age)
    where Emp.name = ‘joe’
• Avoids having to specify a join.
Composite Types In POSTGRES
• Attributes can have a class name as a type, resulting in complex objects with structure.
    create Emp (name = c12,
                salary = float[12],
                age = int,
                mgr = Emp,
                coworker = Emp)
  – mgr and coworker refer to 0 or more instances of the Emp class.
• A set type that can hold elements of any class.
    add to Emp (hobbies = set)
  – hobbies could hold elements of any class.
Types In POSTGRES
• Array type (constructor)
    create Emp (name = c12,
                salary = float[12],
                age = int)
  – salary holds the salary for each month.
• POSTQUEL query (array usage in a query)
    retrieve (Emp.name)
    where Emp.salary[4] = 1000
Database Technology Matrix
                      Simple data       Complex data
Query support: YES    RDBMSs            ORDBMSs
Query support: NO     File system       OODBMSs
XML & RDF - the new revolution
• Just when the relational model had driven out object-oriented database technology, the WWW led to the proliferation of semi-structured data.
• 2 approaches to supporting XML/RDF
  – Extend relational technology to support XML/RDF
  – Native XML databases
Summary of Evolution of Data Model
• The Dark Ages: network & hierarchical models
• Victory of simplicity and beauty over data spaghetti: the relational DBMS
• The pointers strike back -- Object-Orientation, OODBMSs
• The return of the relations -- ORDBMS -- took the best of the OO concepts and incorporated them in the relational model.
• The current and near future -- support for XML & RDF
• The final frontier -- anyone’s guess!
Key Data Management Technologies (quick review) …
Key Database Technologies
• File Management
  – provides a file abstraction as a collection of records stored on disk
• Index Management and Access Methods
  – implements techniques for associative access to data
• Query Optimization and Processing
  – given a query and data storage structures, determines an efficient strategy to evaluate the query
• Transaction Management
  – ensures consistency of the database in the presence of concurrent transactions and various types of failures
• Catalog Management
  – maintains database schema information
• Authorization and Integrity Management
  – tests for integrity constraints and user authorization
Database Management System Architecture
[Figure: DBMS architecture. Inputs: application queries, schema changes. Query processor: compilers, optimizer, evaluator. Storage manager: transaction manager, buffer manager, file system. Storage: database and indices, metadata and data dictionary]
Storage Media and their Properties
• Main Memory
  – costs $100/Mbyte (reduces every year)
  – ‘volatile’: does not survive system failures
  – random I/O very fast
  – data can be processed by CPU directly
  – capacity limited to orders of magnitude lower than what a database needs
• Magnetic Disk
  – costs $0.50/Mbyte (reduces each year)
  – non-volatile (except when disk crashes)
  – random I/O not as fast
  – CPU cannot directly process data; it needs to be transferred to main memory
• Tape
  – Cheaper but slower than disks. Sequential I/O devices. Handy for backups, sometimes for archival.
Databases and Storage Devices
• Due to capacity, cost, and volatility factors, databases are traditionally stored on disks.
• Data is brought from disk to main memory for processing
• There are many ways to interface memory with disk-resident data
• E.g., virtual memory:
  – VM size limited to the max address generated by the CPU
  – Existing VM does not support durability
• File system provides a more powerful mapping between memory and disk storage
• A bunch of tricks are used to ensure that the high latency of secondary storage does not impact application response time and system throughput
  – access disks asynchronously with active applications
  – prefetch data before the application needs it
  – intelligent caching techniques
Functional Abstraction of a Simplistic DBMS
[Figure: layered DBMS. Transactions (beginT, SQL, SQL, endT) issue SQL statements. Layers: query processor and optimizer, record-oriented file system, buffer manager, basic file system, hardware. Interfaces between layers: SQL statements; access plan; read/write records, scan relations; get page containing tuples; read/write file pages]
Basic File System
• Provides the abstraction of a file, where a file is an array of fixed-size blocks
• Hides the disk geometry (cylinders, tracks, sectors, slots) and other functional components like arms, heads, etc., so that programs do not need to deal with these complexities
• Operations supported:
  – create a file
  – delete a file
  – open a file
  – close a file
  – extend a file
  – read (set of) file blocks into buffers in memory
  – write (set of) file blocks
Basic File System Design Issues
• File allocation: how to allocate blocks on disk to a file.
  – Contiguous allocation: file stored in contiguous disk blocks. Blocks for storing the file are found using a best-fit, worst-fit, or first-fit policy.
    +ve: provides fast sequential scan of file
    -ve: fragmentation, difficult to enlarge files
  – Linked allocation: file is a linked list of disk blocks
    +ve: prevents fragmentation, easy to enlarge files
    -ve: slow for both sequential and random access
  – Index allocation: file implemented using fixed-size blocks pointed to by an index (e.g., B-tree). Popularized by Unix.
    +ve: good random access, easy enlargement, no fragmentation
    -ve: poor sequential access performance
  – Extent-based allocation: file is a collection of clusters of consecutive disk blocks (extents), where the collection is maintained using linked lists or an index. Most popular approach with vendors.
• Free space management: information about which blocks are free
Buffer Management
• Makes file pages addressable in memory and coordinates writing of pages to disk with other components to guarantee transactional properties
• Acts as a mediator between the basic file system and the record-oriented file system
• Buffer frames are maintained in main memory. When a request for file page access comes, check if the page is in the buffer. Else get a free frame and load the file page into the buffer.
• Operations supported:
  – bufferfix
  – bufferunfix
  – get block
  – flush
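The fix/unfix protocol can be sketched as follows. This is a toy model under stated assumptions: a dict stands in for the basic file system, frames are plain dicts, and the LRU victim choice and the `BufferManager` class itself are illustrative rather than any particular system's design.

```python
from collections import OrderedDict

class BufferManager:
    def __init__(self, disk, num_frames):
        self.disk = disk                   # page_id -> page contents (stand-in for files)
        self.num_frames = num_frames
        self.frames = OrderedDict()        # page_id -> frame, least recently used first

    def bufferfix(self, page_id):
        """Pin the page in a frame (loading it if needed) and return the frame."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)
        else:
            if len(self.frames) >= self.num_frames:
                self._evict()
            self.frames[page_id] = {"data": self.disk[page_id], "pins": 0, "dirty": False}
        frame = self.frames[page_id]
        frame["pins"] += 1
        return frame

    def bufferunfix(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame["pins"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def flush(self, page_id):
        frame = self.frames[page_id]
        if frame["dirty"]:
            self.disk[page_id] = frame["data"]
            frame["dirty"] = False

    def _evict(self):
        # Pinned frames may not be replaced; pick the LRU unpinned frame.
        victim = next((p for p, f in self.frames.items() if f["pins"] == 0), None)
        if victim is None:
            raise RuntimeError("all frames are pinned")
        self.flush(victim)
        del self.frames[victim]

disk = {1: "a", 2: "b", 3: "c"}
bm = BufferManager(disk, num_frames=2)
frame = bm.bufferfix(1)
frame["data"] = "A"                        # update in place through the returned frame
bm.bufferunfix(1, dirty=True)
bm.bufferfix(2)
bm.bufferunfix(2)
bm.bufferfix(3)                            # evicts page 1 and writes "A" back
```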
Database Buffer Management Design Issues
• DBMS buffer manager returns a pointer to the frame containing the data instead of returning a copy of the requested page to the caller.
  – Efficiency: prevents unnecessary copying of data
  – Allows sharing of data at finer granularity than a page: suppose 2 transactions T1 and T2 update records r1 and r2 on the same page. If the buffer manager allowed applications to copy data to their address space and write back updated versions, updates might be lost.
• Database buffer manager participates in protocols to implement transactions (WAL, FL@C, pinning buffer slots)
• Novel page replacement strategies:
  – Traditional LRU strategy used in OSs works well only under the assumption of locality of reference, which may not hold for DBMSs
  – Since DBMS query languages are declarative, the system has much more information about reference patterns, which it can exploit to improve caching performance of the buffer manager
Record-Oriented File System
• Provides the abstraction of a file as a collection of records.
• Records can be:
  – fixed size or variable length
  – short, long, or very long
  – attributes can be fixed length or variable length
  – simple or complex (e.g., containing set-valued attributes)
• Operations supported:
  – create, delete, open, close, alter, drop
  – read, insert, update, delete record
  – scan all records in a file
• Issues involved:
  – mapping records to pages
  – file organization: organization of records in a file (where to insert new records, what mechanism can be used to retrieve records)
Index Management and Associative Access
• Associative access: accessing records based on their attribute values.
• Index Files
  – an index file declared over a (set of) attribute(s) of the data file provides associative access to records in the data file.
  – Index file contains pointers to the disk blocks where the records corresponding to a value appear.
• Types of index (let the indexing attribute be A):
  – primary: A is a key and the data file is stored sorted on A
  – clustered: A is not a key but the data file is stored sorted on A
  – secondary (key): A is a key but the data file is not sorted on A
  – secondary (non-key): A is not a key, nor is the data file sorted on A
Organization of Index File
• B-tree Index: index file is organized as a B-tree
  – Advantages: supports range searches efficiently (e.g., retrieve all employees with salary between 100K and 200K); guaranteed good storage utilization
  – Disadvantages: searching for a given record could take around 3-4 disk I/Os
• Hash Index: index file maintained as a hash file.
  – Advantages: looking for a specific record is very efficient (1 disk I/O)
  – Disadvantages: cannot support range searches
• Multidimensional Access Methods
  – modern databases are beginning to support novel data structures like R-trees, grid files, and inverted lists to better serve emerging application requirements
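The trade-off between ordered and hashed indexes can be sketched in memory. The employee data is made up, a sorted list with `bisect` stands in for the ordered leaf level of a B-tree, and a dict stands in for a hash file; neither models disk pages.

```python
import bisect

employees = [("Ann", 90_000), ("Bob", 120_000), ("Cy", 150_000), ("Di", 210_000)]

# Ordered index on salary (stand-in for the sorted leaf level of a B-tree).
by_salary = sorted((sal, name) for name, sal in employees)
salaries = [sal for sal, _ in by_salary]

def range_search(lo, hi):
    """All names with lo <= salary <= hi, found by two binary searches."""
    start = bisect.bisect_left(salaries, lo)
    end = bisect.bisect_right(salaries, hi)
    return [name for _, name in by_salary[start:end]]

# Hash index on name: a single probe for equality, but keys have no order,
# so a range query would have to examine every entry.
by_name = {name: sal for name, sal in employees}

in_range = range_search(100_000, 200_000)
print(in_range)                 # ['Bob', 'Cy']
print(by_name["Di"])            # 210000
```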
Multidimensional Indexing Motivation
• Many applications of databases are geographical (2-d data). Others involve a large number of dimensions.
• Examples:
  – Location of restaurants in a city.
  – Map data: zones, county lines, rivers, lakes, etc. (data has spatial extent)
  – Sales information described by store, day, item, color, size, etc. Sale = point in multidimensional space.
  – Student described by age, zipcode, marital status.
• Queries:
  – Range query: “find all McDonald's restaurants within a given region”
  – Nearest neighbor query: “find the nearest McDonald's to my house”
  – Partial match queries
Approach: Utilize Single Dimensional Index
• Index on attributes independently
• Project the query range onto each attribute to determine pointers.
• Intersect pointers
• Go to the database and retrieve objects in the intersection.
May result in very high I/O cost
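A small sketch of this approach, with made-up points and hypothetical helper names: each dimension is indexed separately, the query range is projected onto each index, and the resulting pointer sets are intersected. The second return value counts pointers fetched, which hints at why the cost can be high.

```python
import bisect

points = {0: (2, 9), 1: (4, 3), 2: (5, 5), 3: (8, 4), 4: (9, 9)}   # rid -> (x, y)

def build_index(dim):
    """Sorted (value, rid) pairs for one dimension."""
    return sorted((pt[dim], rid) for rid, pt in points.items())

def probe(index, lo, hi):
    """Record ids whose indexed value lies in [lo, hi]."""
    keys = [k for k, _ in index]
    start, end = bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)
    return {rid for _, rid in index[start:end]}

x_index, y_index = build_index(0), build_index(1)

def range_query(x_lo, x_hi, y_lo, y_hi):
    x_hits = probe(x_index, x_lo, x_hi)
    y_hits = probe(y_index, y_lo, y_hi)
    return sorted(x_hits & y_hits), len(x_hits) + len(y_hits)

result, pointers_touched = range_query(4, 5, 3, 9)
print(result, pointers_touched)      # [1, 2] 7
```

Here the y range matches every record, so 7 pointers are fetched to produce 2 answers.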
R-tree Data Structure
• Extension of B-tree to multidimensional space.
• Paginated, balanced, guaranteed storage utilization.
• Can support both point data and data with spatial extent
• Groups objects into possibly overlapping clusters (rectangles in our case)
• Search for a range query proceeds along all paths that overlap with the query.
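The search rule in the last bullet can be sketched on a hand-built two-level tree. The node layout (dicts with `leaf` and `entries`) and the rectangles are hypothetical; insertion and node splitting are not shown.

```python
def overlaps(a, b):
    """Rectangles are (x_lo, y_lo, x_hi, y_hi); touching edges count as overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, query, visited):
    """Return ids of objects whose rectangle overlaps the query rectangle."""
    visited.append(node["name"])
    hits = []
    for rect, item in node["entries"]:
        if overlaps(rect, query):
            if node["leaf"]:
                hits.append(item)                        # item is an object id
            else:
                hits += search(item, query, visited)     # item is a child node
    return hits

leaf1 = {"name": "L1", "leaf": True,
         "entries": [((0, 0, 2, 2), "a"), ((3, 1, 5, 4), "b")]}
leaf2 = {"name": "L2", "leaf": True,
         "entries": [((4, 3, 7, 6), "c"), ((8, 8, 9, 9), "d")]}
# Each root entry holds the bounding rectangle of a child; the two overlap.
root = {"name": "root", "leaf": False,
        "entries": [((0, 0, 5, 4), leaf1), ((4, 3, 9, 9), leaf2)]}

visited = []
hits = search(root, (4, 3, 6, 5), visited)
print(hits, visited)                 # ['b', 'c'] ['root', 'L1', 'L2']
```

Because the two bounding rectangles overlap, the query has to descend into both leaves.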
Split Node
• Given a node, split it into two nodes which are each at least half full
• Multiple objectives:
  – minimize overlap
  – minimize covered area
• R-tree minimizes covered area
• What is an optimal criterion???
[Figure: two alternative splits, one minimizing overlap and one minimizing covered area]
Minimizing Covered Area
• Group objects into 2 parts such that the covered area is minimized
• NP-hard!!
• Hence use heuristics
• Two heuristics explored
  – quadratic and linear
Other Multidimensional Data Structures
• Many generalizations of the R-tree
  – different splitting criteria
  – different shapes of clusters (e.g., d-dimensional spheres)
  – adding redundancy to reduce search cost: store objects in multiple rectangles instead of a single rectangle to reduce the cost of retrieval. But now insert has to store objects in many clusters. This strategy also increases overlap, causing search performance to deteriorate.
• Space Partitioning Data Structures
  – unlike R-trees, which group objects into possibly overlapping clusters, these methods attempt to partition space into non-overlapping regions.
  – E.g., KD-tree, quad tree, grid files, KD-B-tree, hB-tree, hybrid tree.
• Space filling curves
  – superimpose an ordering on multidimensional space that preserves proximity in multidimensional space (Z-ordering, Hilbert ordering)
  – Use a B-tree as an index on that ordering
KD-tree
• A main-memory data structure based on binary search trees
  – can be adapted to the block model of storage (KD-B-tree)
• Levels rotate among the dimensions, partitioning the space based on a value for that dimension
• KD-tree is not necessarily balanced.
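A minimal sketch of a point KD-tree, assuming nodes are plain dicts and points are tuples (both illustrative choices). The split dimension rotates with depth, and inserting already-sorted points shows how the tree can become unbalanced.

```python
def insert(node, point, depth=0):
    """Insert a point; the comparison dimension is depth mod number of dimensions."""
    if node is None:
        return {"point": point, "left": None, "right": None}
    dim = depth % len(point)
    side = "left" if point[dim] < node["point"][dim] else "right"
    node[side] = insert(node[side], point, depth + 1)
    return node

def contains(node, point, depth=0):
    if node is None:
        return False
    if node["point"] == point:
        return True
    dim = depth % len(point)
    side = "left" if point[dim] < node["point"][dim] else "right"
    return contains(node[side], point, depth + 1)

def height(node):
    return 0 if node is None else 1 + max(height(node["left"]), height(node["right"]))

root = None
for p in [(1, 1), (2, 2), (3, 3), (4, 4)]:   # sorted input degenerates into a chain
    root = insert(root, p)

print(height(root))                  # 4
```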
KD-Tree Example
[Figure: KD-tree and the corresponding partition of the plane. Root split at X=5; next level splits at Y=5 and Y=6; lower levels split at X=3, Y=2, X=8, and X=7]
Adapting KD Tree to Block Model
• Similar to B-tree, tree nodes split many ways instead of two ways
  – Risk: insertion becomes quite complex and expensive. No storage utilization guarantee, since when a higher-level node splits, the split has to be propagated all the way to the leaf level, resulting in many empty blocks.
• Pack many interior nodes (forming a subtree) into a block.
  – Risk: it may not be feasible to group nodes at lower levels into a block productively.
  – Many interesting papers on how to optimally pack nodes into blocks have recently been published.
Quad Tree
• Nodes split along all dimensions simultaneously
• Division fixed: by quadrants
• As with the KD-tree, we cannot make quadtree levels uniform
Quad Tree Example
[Figure: quad tree over the same points, with dividers at X=3, X=5, X=7, and X=8 and quadrants labeled NW, NE, SW, SE]
Grid Files
• Space partitioning strategy, but different from a tree.
• Select dividers along each dimension. Partition space into cells.
• Unlike the KD-tree, dividers cut all the way.
• Each cell corresponds to 1 disk page.
• Many cells can point to the same page.
• Cell directory potentially exponential in the number of dimensions
Space Filling Curve
• Assumption
  – finite precision in representing each coordinate.
[Figure: 4 x 4 grid with both axes labeled 00, 01, 10, 11, containing objects A, B, and C]
Z(A) = shuffle(x_A, y_A) = shuffle(00, 11) = 0101 = 5
Z(B) = 11 = 3 (common prefix to all its blocks)
Z(C1) = 0010 = 2
Z(C2) = 1000 = 8
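The shuffle used for Z(A) interleaves the bits of the two coordinates. A minimal sketch, assuming the x bit comes before the y bit at each position, which matches shuffle(00, 11) = 0101 above; the `bits` parameter is the assumed finite precision per coordinate.

```python
def shuffle(x, y, bits):
    """Interleave the bits of x and y, most significant first, x bit before y bit."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

z_a = shuffle(0b00, 0b11, bits=2)
print(format(z_a, "04b"), z_a)       # 0101 5
```

Sorting points by this value and indexing it with a B-tree gives the space-filling-curve access method.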
Deriving Z-Values for a Region
• Obtain a quad-tree decomposition of an object by recursively dividing it into blocks until blocks are homogeneous.
[Figure: quad-tree decomposition with quadrants labeled 00, 01, 10, 11 and sub-blocks 0001 and 0011]
Object's representation is 0001, 0011, 01
Generalized Search Trees
• Motivation:
  – Disparate applications require different data structures and access methods.
  – Each data structure requires separate code to be integrated with the database code: too much effort. Vendors will not spend time and energy unless the application is very important or the data structure has general applicability.
• Generalized search trees abstract the notion of a data structure into a template.
  – Basic observation: most data structures are similar, and a lot of the bookkeeping and implementation details are the same.
  – Different data structures can be seen as refinements of the basic GiST structure. Refinements are specified by registering a set of functions per data structure with the GiST.
GiST supports extensibility both in terms of data types and queries
• GiST is like a “template”: it defines its interface in terms of an ADT rather than physical elements (like nodes, pointers, etc.)
• The access method (AM) implementer can customize GiST by defining his or her own ADT class, i.e., you just define the ADT class and you have your access method implemented!
• No concern about search/insertion/deletion or structural modifications like node splits, etc.
Query Processing in DBMSs
[Figure: query processing pipeline. A Select … From … Where … query goes through parsing and translation into an internal relational-algebra-based representation; the optimizer, using statistics about data, produces an optimized execution plan; the evaluation engine runs the plan against data and indexes and returns query results (e.g., Sally 4000, Dick 9000)]
Query Optimization
• Goal: to find the cheapest evaluation strategy for a query
• Stages of optimization:
  – Algebraic manipulations: heuristics used to convert the query tree into an equivalent but more efficient representation.
    perform selections and projections as early as possible
    combine selections with cartesian products to make a join
    combine sequences of unary operations (selections and projections)
    look for common subexpressions in an expression
  – Cost-based analysis: given the optimized representation produced after algebraic manipulation:
    generate all possible query plans and estimate their costs based on statistical information and the costs of each unary and binary operation
    the best possible query plan is chosen as the execution strategy
    the number of plans considered, even after heuristics are applied, is exponential in the number of operators in the query tree
    it is important to choose a good plan since the cost of generating the plan is amortized over multiple query executions
Cost of Query Execution
• Access to disk: cost of reading, writing, and searching data blocks (I/O cost)
• Storage costs: cost of storing intermediate files generated during query execution (I/O cost)
• Computation cost: cost of in-memory execution of operations (CPU cost)
• Communication cost: cost of shipping the query and results from site to site or to the terminal where the query originated (communication cost)
• Total cost = I/O cost + w1 * CPU cost + w2 * communication cost
• Traditionally I/O cost considered most important
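The total-cost formula can be turned into a tiny plan chooser. All numbers here are invented for illustration: the weights `w1` and `w2`, the plan names, and the per-plan estimates are assumptions, not measurements from any system.

```python
def total_cost(io, cpu, comm, w1=0.01, w2=0.1):
    """Total cost = I/O cost + w1 * CPU cost + w2 * communication cost."""
    return io + w1 * cpu + w2 * comm

# Hypothetical estimates an optimizer might derive from statistics.
plans = {
    "index scan + nested loops": {"io": 120, "cpu": 5_000, "comm": 0},
    "full scan + hash join": {"io": 400, "cpu": 2_000, "comm": 0},
}

costs = {name: total_cost(**est) for name, est in plans.items()}
best = min(costs, key=costs.get)
print(best)                          # index scan + nested loops
```

With small weights on CPU and communication, the plan with the lower I/O estimate wins, which reflects the traditional emphasis on I/O cost.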
Transaction Management
Applications in databases are modeled as transactions, which provide ACID guarantees.
• Atomicity: either all the effects of a transaction appear in the database or none of them do.
• Consistency: each transaction maps the database from one consistent state to another consistent state.
• Isolation: the effects of a transaction are hidden from other concurrently executing transactions.
• Durability: if a transaction completes, its effects are permanent and survive failures.
Transaction Model
• Transactions provide a simple, powerful, and natural programming model for writing database applications.
• Transaction concept supports:
  – simple failure semantics: either all the effects of a transaction appear in the database or none do (all or nothing)
  – isolated view of the world: protection from partial effects of other concurrent applications
• Transactions allow applications to share data without having to explicitly deal with either fault tolerance or synchronization
• Transactions are the enabling technology for large distributed applications.
Isolation
• Isolation is implemented by using the 2-phase locking protocol
• 2-Phase Locking Protocol:
  – Each transaction acquires a lock on a data item before accessing it
  – Locks are released when a transaction commits
[Figure: interleaved execution over time]
  User 1 reads account = 1500
  User 2 reads account = 1500
  User 1 sets account value = 500 (withdraws 1000 dollars)
  User 2 sets account value = 700 (withdraws 800 dollars)
This execution will be prevented by 2-phase locking, since user 1’s transaction will not release the lock on account until it terminates.
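The account example can be replayed against a toy lock table. This is a simplified sketch with exclusive locks only and no real waiting or threads; the `LockManager` class and transaction names are hypothetical. A refused lock request stands for "transaction must wait".

```python
class LockManager:
    def __init__(self):
        self.owner = {}                        # data item -> transaction holding the lock

    def acquire(self, txn, item):
        """Grant the lock if it is free or already held by txn; False means wait."""
        return self.owner.setdefault(item, txn) == txn

    def release_all(self, txn):
        """Called when txn commits or aborts: release every lock it holds."""
        self.owner = {i: t for i, t in self.owner.items() if t != txn}

db = {"account": 1500}
locks = LockManager()

t1_granted = locks.acquire("T1", "account")    # user 1 locks the account
t1_read = db["account"]                        # user 1 reads 1500
t2_granted = locks.acquire("T2", "account")    # user 2 must wait: no stale read of 1500
db["account"] = t1_read - 1000                 # user 1 withdraws 1000
locks.release_all("T1")                        # user 1 commits

t2_granted_later = locks.acquire("T2", "account")
t2_read = db["account"]                        # user 2 now sees 500
```

Because user 2 reads only after user 1 commits, its withdrawal is evaluated against the balance of 500 and user 1's update is not lost.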
Atomicity
• Atomicity is implemented by using a logging strategy.
• A transaction, before updating a data item, writes an undo log record, using which its effects can be undone.
• If the transaction aborts, the undo log records are used to reconstruct the database state before transaction execution.
[Figure: normal processing: DO takes the old state to the new state and produces an undo log record. Transaction rollback (due to user-requested abort, system failure, or consistency violation): UNDO takes the new state and the undo log record back to the old state]
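Undo logging can be sketched with an in-memory log. Assumptions for illustration: the database is a dict, every updated item already exists, and the log is a Python list rather than stable storage.

```python
db = {"x": 10, "y": 20}
undo_log = []                        # (txn, item, old_value), written before each update

def write(txn, item, value):
    undo_log.append((txn, item, db[item]))    # log the old state first
    db[item] = value

def rollback(txn):
    """Undo txn's updates, newest first, then drop its log records."""
    for t, item, old in reversed(undo_log):
        if t == txn:
            db[item] = old
    undo_log[:] = [rec for rec in undo_log if rec[0] != txn]

write("T1", "x", 11)
write("T1", "y", 21)
write("T1", "x", 12)
rollback("T1")
print(db)                            # {'x': 10, 'y': 20}
```

Undoing in reverse order matters when an item is updated more than once: x passes back through 11 before returning to 10.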
Durability
• Durability is implemented using a logging strategy
• A transaction, before updating a data item, writes a redo log record, using which its effects can be redone
• If the system fails before a committed transaction’s effects appear in the database, its effects are redone using the redo log records on recovery.
[Figure: normal processing: DO takes the old state to the new state and produces a redo log record. Redo of a committed transaction: REDO takes the old state and the redo log record to the new state]
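Redo on recovery can be sketched the same way. Assumptions for illustration: the log list stands in for stable storage, `cache` stands in for buffered pages that are lost in a crash, and only committed transactions are replayed.

```python
redo_log = []            # survives the crash (stand-in for stable storage)
disk = {"x": 10}         # database state on disk
cache = dict(disk)       # buffered state, lost on a crash

def write(txn, item, value):
    redo_log.append(("update", txn, item, value))   # log the new state first
    cache[item] = value                             # disk is updated lazily

def commit(txn):
    redo_log.append(("commit", txn))

def recover():
    """Replay, in log order, the updates of transactions that committed."""
    committed = {rec[1] for rec in redo_log if rec[0] == "commit"}
    for rec in redo_log:
        if rec[0] == "update" and rec[1] in committed:
            disk[rec[2]] = rec[3]

write("T1", "x", 11)
commit("T1")
write("T2", "x", 99)     # T2 never commits
cache.clear()            # crash: buffered updates are lost
recover()
print(disk)              # {'x': 11}
```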