Source: homepages.cwi.nl/~mk/onderwijs/adt2007/lectures/lecture... (posted 18-Jul-2020)

Advanced Database Techniques

Martin.Kersten @ cwi.nl
Stefan Manegold @ cwi.nl
Sandor Heman @ cwi.nl
Jennie Zhang @ cwi.nl
Romulo Goncalves @ cwi.nl

Administrative details
• The website evolves during the course
• Exam material is marked explicitly
• Lab work deadlines are strict
• Email is the preferred way to communicate
• Tomorrow the assistants will be available in person between 11:00-12:00, room REC-P.123

Relational systems
• A database system should simplify the organization, validation, sharing, and bookkeeping of information
• Prerequisite knowledge
  – Relational data model and algebra
  – Data structures (B-tree, hash)
  – Operating system concepts
  – Using a SQL database system
• What is your practical experience? [Ruby on Rails expertise needed]

Applications
• Bread-and-butter applications?
  – Web-shop
  – Banking systems
  – Inventory systems
  – Production systems
  – Shopping systems
  – Government systems
  – Health systems
  – Multimedia systems
  – Science systems …

Advanced Applications
• Bread-and-butter applications ???
  – Banking systems
    • What happens if you install a stock trading system which should handle >100K transactions/minute
    • How to derive trading advice using compute-intensive applications
    • How to warn thousands of users about their trading opportunity
  – … Need for parallel, distributed main-memory database technology …

Advanced application requirements
• Bread-and-butter applications
  – Inventory applications
    • How to install a battlefield inventory system
    • How to deliver goods just in time?
    • How to keep track of moving objects/persons?
• … need for sensor-based database support and RFID tags … need for a new DBMS? …

Advanced Applications
• Production systems
  – How to interact with component suppliers
  – How to manage the production workflow
  – How to avoid bad production steps
  – How to maintain a database with 12000 tables (SAP)
• … need for interoperability between autonomous systems … datamining and knowledge discovery …

Advanced Applications
• Health information systems
  – How to monitor your health over 30 years
  – How to enable quick response to a heart attack
• … need for interoperable database systems …

The Ambient Home
[Figure sequence; labels: HELP, MonetDB DataCell, 911 called]

MonetDB DataCell
A Shared Tuple Space using an SQL DBMS
[Figure sequence; components: receptors, nucleus, emitters; labels: HELP, 911 called; steps: Recall, Aggregate, Keep, Forget]

SQL workload
-- SQL queries

insert into hospital select 'John', * from medic where temp > 40.0;

insert into epd select * from medic where temp >= 38.0;

delete from medic;
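The workload above can be tried out on any SQL engine. The following is a minimal sketch using Python's built-in sqlite3 module rather than MonetDB; the table layouts (`sensor`, `temp`, `patient`) and the function name `datacell_step` are assumptions made for illustration, since the slides do not show the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the slides only name the tables, not their columns.
conn.executescript("""
    CREATE TABLE medic    (sensor TEXT, temp REAL);
    CREATE TABLE hospital (patient TEXT, sensor TEXT, temp REAL);
    CREATE TABLE epd      (sensor TEXT, temp REAL);
""")

def datacell_step(conn, readings):
    """Load one batch of readings into medic, then run the three statements."""
    with conn:  # commit the whole step as one transaction
        conn.executemany("INSERT INTO medic VALUES (?, ?)", readings)
        conn.execute(
            "INSERT INTO hospital SELECT 'John', * FROM medic WHERE temp > 40.0")
        conn.execute(
            "INSERT INTO epd SELECT * FROM medic WHERE temp >= 38.0")
        conn.execute("DELETE FROM medic")  # empty the input table for the next batch

datacell_step(conn, [("ear", 36.9), ("ear", 38.5), ("ear", 40.2)])
hospital_rows = conn.execute("SELECT * FROM hospital").fetchall()
epd_rows = conn.execute("SELECT * FROM epd ORDER BY temp").fetchall()
medic_count = conn.execute("SELECT COUNT(*) FROM medic").fetchone()[0]
print(hospital_rows, epd_rows, medic_count)
```

Only the reading above 40.0 reaches `hospital`, both fever readings reach `epd`, and `medic` is empty again after the step, so the operands of the next round stay small.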

[Figure: the same SQL workload, annotated with Start and End; steps: Recall, Aggregate, Keep, Forget]

Query optimization
The queries in a datacell have
- a soft/hard deadline
- strong flow dependency

The operands to the queries are small tables:
- empty
- single value
- a few values

Traditional query optimizers are biased towards large operands.


Query optimization
Challenges:
• How to optimize the individual SQL programs to select the proper QEP?
• How to weave the collection of SQL programs to create an optimal multi-query version?
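One possible reading of "weaving" is to merge queries that scan the same small operand into a single pass. The sketch below is an illustration only, not the DataCell optimizer: it fuses the two filters of the medic workload (temp > 40.0 and temp >= 38.0) into one loop and exploits the fact that the first predicate implies the second. The function name `woven_step` and the tuple layout `(sensor, temp)` are assumptions.

```python
def woven_step(readings, patient="John"):
    """One pass over the batch instead of two separate scans of medic."""
    hospital, epd = [], []
    for sensor, temp in readings:
        if temp >= 38.0:                 # predicate of the epd query
            epd.append((sensor, temp))
            if temp > 40.0:              # hospital predicate implies the epd one
                hospital.append((patient, sensor, temp))
    return hospital, epd

batch = [("ear", 36.9), ("ear", 38.5), ("ear", 40.2)]
hospital, epd = woven_step(batch)
print(hospital, epd)
```

For operands of a few values the saving per step is small, but the step runs continuously and under a deadline, which is why a per-query plan may not be the right unit of optimization here.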


Advanced Applications
• Multimedia systems
  – Narrow/broad casting, selective dissemination of volumetric information
  – Searching in multimedia storage
• … need for P2P infrastructure … search facilities over feature spaces …

Advanced Applications
• Government systems
  – Security
    • Biometric data management issues, finger/image matching
  – Public safety
    • Forensics, manipulate complex objects using proprietary algorithms
• … need for extensible database technology … need to support unstructured data …

Advanced Applications
• Science systems
  – The new accelerator in CERN
    • How to handle >1 PByte files
  – The Sloan Digital Skyserver schema is 200 pages and the catalogued data 2.5 Tb
    • How to query this efficiently
  – … need for P2P and … a novel way to organize data …

LOFAR central processor specs
• Streaming Data
  – Input: 320 Gbit/s
  – Internally within correlator: 20 Tbit/s
  – Into storage: 25 Gbit/s = 250 TByte/day
  – Final products: 1-3 TByte/day
• High Performance Computing
  – Correlation: 15 Tflops
  – Pre-processing and filtering: 5 Tflops
  – Off-line processing (calibration, analysis): 5-10 Tflops
  – Visualisation, control, scheduling etc.: 2 Tflops
• Storage
  – On-line temporal storage: 500 TByte
  – Archive: PByte range of data stored in Grid
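The streaming figures can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 Gbit = 10^9 bit, 1 TByte = 10^12 byte); under that assumption 25 Gbit/s comes to 270 TByte/day, so the 250 TByte/day on the slide appears to be a rounded figure. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def gbit_per_s_to_tbyte_per_day(gbit_per_s):
    """Convert a sustained rate in Gbit/s to a daily volume in TByte (decimal units)."""
    bytes_per_day = gbit_per_s * 1e9 / 8 * SECONDS_PER_DAY
    return bytes_per_day / 1e12

into_storage = gbit_per_s_to_tbyte_per_day(25)    # rate into storage
raw_input = gbit_per_s_to_tbyte_per_day(320)      # rate at the input
days_to_fill = 500 / into_storage                 # 500 TByte on-line storage

print(f"into storage: {into_storage:.0f} TByte/day")
print(f"raw input:    {raw_input:.0f} TByte/day")
print(f"on-line storage fills in about {days_to_fill:.2f} days")
```

At the stated rate the 500 TByte on-line store holds less than two days of data, which gives a feel for why the final products have to be reduced to 1-3 TByte/day.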

Technological challenges
• Data is often not structured as tables
  – XML and XQuery
• Data does not always fit on one system
  – Distributed and parallel databases
• Querying is more like world-wide searching
  – Continuous and streaming queries
• A database tells more than facts
  – Datamining and knowledge discovery

Code bases
• Database management systems are BIG software systems
  – Oracle, SQL-server, DB2: >1 M lines
  – PostgreSQL: 300 K lines
  – MySQL: 500 K lines
  – MonetDB: 200-800 K lines
  – SQLite: 40 K lines
• Programmer teams for DBMS kernels range from a few to a few hundred

Performance components
• Hardware platform
• Data structures
• Algebraic optimizer
• SQL parser
• Application code
  – What is the total cost of execution?
  – How many tasks can be performed/minute?
  – How good is the optimizer?
  – What is the overhead of the data structures?

Not all are equal

Test                          SQLite nosync  SQLite  MySQL  PostgreSQL  MonetDB
1000 inserts transactions              0.22   13.06   0.15        4.30     0.27
25000 inserts 1 transaction            1.42    0.94   2.18        4.91     6.71
100 range selects                      2.52    2.49   2.76        3.62     0.18
100 string range selects               3.37    3.36   4.64       13.40     2.15
5000 range index selects               1.16    1.12   1.27        4.61     5.22
1000 updates                           0.63    0.63   8.41        1.73     0.43
25000 updates with index               3.10    3.52   8.13       18.79     8.33
25000 updates on text                  1.72    2.40   6.98       48.13    10.32
Insert from select                     1.59    2.78   1.53       61.36     0.65
Delete on text index                   0.56    4.00   0.97        1.50     0.32
Delete with index                      0.75    2.06   2.26        1.31     0.22
Big insert after delete                1.48    3.21   1.81       13.16     0.36
Big delete and small insert            0.40    0.61   1.70        4.55     0.93
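Treating a system as a black box, the two insert rows of the table ("1000 inserts transactions" and "25000 inserts 1 transaction") are easy to re-create in spirit. The following is a minimal timing sketch with Python's sqlite3 module; the table `t1`, its columns, and the function `timed_inserts` are assumptions, and it is not the benchmark that produced the numbers above. It uses an in-memory database, so the disk-sync cost that separates the SQLite and "nosync" columns will not show up; pointing `sqlite3.connect` at a file would be needed to observe that.

```python
import sqlite3
import time

def timed_inserts(n, one_transaction):
    """Insert n rows, either each in its own transaction or all in one."""
    # isolation_level=None puts sqlite3 in autocommit mode, so every statement
    # commits on its own unless we issue BEGIN/COMMIT explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE t1 (a INTEGER, b INTEGER, c TEXT)")
    start = time.perf_counter()
    if one_transaction:
        conn.execute("BEGIN")
    for i in range(n):
        conn.execute("INSERT INTO t1 VALUES (?, ?, ?)", (i, i * 7, str(i)))
    if one_transaction:
        conn.execute("COMMIT")
    elapsed = time.perf_counter() - start
    rows = conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0]
    conn.close()
    return elapsed, rows

results = {}
for mode in (False, True):
    results[mode] = timed_inserts(1000, one_transaction=mode)
    elapsed, rows = results[mode]
    print(f"one_transaction={mode}: {rows} rows in {elapsed:.4f} s")
```

The measured times depend on the machine and the storage medium, which is the point of the slide: the same logical task can differ by orders of magnitude across systems and settings.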

Not all are equal

Why does it take so long to build a 10Mx2 table?
How long will it take to do 10Mx32 on SQLserver Beta 2?

Gaining insight
• Study the code base (inspection + profiling)
  – Often not accessible outside development lab
• Study individual techniques (data structures + simulation)
  – Focus of most PhD research in DBMS
• Detailed knowledge becomes available, but ignores the total cost of execution.
• Study as a functional black box
  – Analyse a small application framework

The Jack The Ripper Project
• Study the snippet of the database technology and design an XQuery and SQL application
• What is the schema?
• What are the queries?
• What are unorthodox solutions?

Learning points
• Poor knowledge of relational databases? Read the chapters on SQL and relational algebra. Knowledge of data structures comes in handy.
• Database systems are much more than administrative bookkeeping systems
  – Advanced applications challenge the technology provided by a DBMS
  – Many techniques do not easily scale in size, complexity, functionality
  – Effectiveness of a DBMS is determined by many tightly interlocked components