Db Compare

8/2/2019 Db Compare


Comparison of Oracle, MySQL and PostgreSQL DBMS in the context of ALICE needs

Wiktor Peryt, Warsaw University of Technology, Faculty of Physics


We have taken the following approach: first of all we determined which features of a DBMS are important from the point of view of such a large experiment.

We chose the following features:

Elementary features
- basic data types
- SQL language features
- declarative integrity constraints
- programming abstractions
- automatic generation of identifiers
- national characters support


Transactions and multi-user access
- transactions
- locks
- multi-user access

Programming in database
- stored procedures
- triggers

Elements of database administration
- access control
- backup copies
- data migration


Portability and scalability
- portability of DBMS
- scalability

Performance and VLDB (Very Large Databases)
- query optimization
- structures supporting query optimization
- support for analytical processing
- allocation of disk space
- data size limits
- VLDB implementations

Distributed databases
- access to multiple databases
- heterogeneous systems support


Special data types
- large objects in database
- post-relational extensions
- support for special data types

Application development and interfaces
- embedded SQL
- standard interfaces, additional interfaces
- interoperability with Web technology
- XML, CASE


Reliability
- failure recovery

Commercial issues
- technical support available
- market position

Having completed step one, we carried out subsequent work in 3 subgroups; each of them dealt with only one DBMS.

The members of each subgroup had their own practical experience with the DBMS investigated by their subgroup.

This procedure allowed us to verify the information contained in manuals and other available documentation (for instance on the Internet).

As a result, 3 extended documents devoted to Oracle, MySQL and PostgreSQL were created.


Konrad Bohuszewicz undergraduate student

Maciej Czyzowicz undergraduate student

Michal Janik Ph.D. student

Dawid Jarosz undergraduate student

Piotr Mazan undergraduate student

Marcin Mierzejewski undergraduate student

Mikolaj Olszewski undergraduate student

Wiktor S. Peryt

Sylwester Radomski undergraduate student

Piotr Szarwas Ph.D. student

Tomasz Traczyk

Dominik Tukendorf undergraduate student

Jacek Wojcieszuk undergraduate student

Faculty of Electronics and Information Technology

Faculty of Mathematics and Information Sciences

Faculty of Physics


About Comparison

- discussion by all people involved in this task
- compilation was made by Dr. Tomasz Traczyk
- the compilation circulated within the whole group a few times to make sure we avoided omissions or mistakes
- this version of the document is accepted by all co-authors
- we consider it a quite comprehensive and objective comparison
- it also contains a kind of "weights", which we call "importance", differentiated for the Central database and Lab-participants. The Central database should be a kind of data warehouse at CERN, containing all the data, including data transferred periodically from Lab-participants
- the term "Lab-participants" denotes smaller databases in labs involved in ALICE experiment preparation
- a few explanations of terminology used in the database domain are also included to make this document easy to comprehend for non-specialists


Summary

                                      Importance                 Assessment
Category / Problem                    Central  Lab-participants  MySQL  Oracle8  PostgreSQL

Elementary features
  Basic data types                    C        C                 B      C        A
  SQL                                 B        B                 C      B        B
  Declarative constraints             B        B                 C      A        A
  Programming abstractions            A        C                 D      A        C
  Generation of ids                   C        C                 C      A        A
  National chars                      B        C                 B      A        B

Transactions
  Transactions                        A        C                 D      A        A
  Locks                               A        C                 D      A        A
  Multiuser access                    A        D                 C      A        C

Programming in DB
  Stored procedures and triggers      B        C                 D      A        A

Administration
  Access control                      B        D                 A      A        B
  Backup                              A        C                 C      A        C
  Data migration                      C        C                 A      B        A

Portability and scalability
  Portability                         B        C                 B      A        B
  Scalability                         A        C                 B      A        C

Performance and VLDB
  Query optimization                  A        C                 B      A        B
  Structures supporting optimization  B        D                 D      A        B
  Support for OLAP                    B        D                 D      A        D
  Allocation of the disk space        A        C                 C      A        C
  Size limits                         A        B                 B      A        C
  VLDB implementations                A        C                 D      A        B

Distributed databases
  Access to multiple databases        C        D                 C      A        C
  Heterogeneous systems support       B        D                 D      B        D

Special data types
  Large objects                       B        B                 B      A        C
  Post-relational extensions          C        C                 D      A        B
  Support for special data types      C        C                 D      A        C

Application development and interfaces
  Embedded SQL                        C        C                 D      A        B
  Standard interfaces                 B        C                 B      A        B
  Additional interfaces               A        A                 A      A        A
  Web technology                      A        A                 B      A        B
  XML                                 B        C                 D      A        D
  CASE                                B        C                 D      A        D

Reliability
  Recovery                            A        B                 C      A        C

Commercial issues
  Prices                              C        A                 A      D        A
  Technical support                   A        B                 C      B        D
  Position on the market              A        C                 D      A        D


Our preliminary conclusions:

- for the Central Data Repository for ALICE at CERN: only Oracle can be seriously considered
- for Lab-participants (mainly for production phase databases): Oracle is also the best, but using MySQL or PostgreSQL is possible; the choice between these two is not obvious at the moment

Some extended tests of MySQL and PostgreSQL performance, stability etc. with real data for the STAR SSD are still in progress in Warsaw. They will be published in 1-2 weeks on the website:

http://ITS_DB_ALICE.if.pw.edu.pl

The document "Comparison of Oracle, MySQL and PostgreSQL DBMS" is available at the same address.


Questions for ALICE

How to start with databases for ALICE and how to manage the project?

- General concept of system architecture
- Databases in production phase
- Software technologies recommended
- DBMS platform choice
- How to proceed?


Database types for ALICE

The following main categories of information should go into databases:

- production and assembly phase measurements and descriptive data: ProdPhase database
- calibration data: Calibration database
- configuration data: Configuration database
- detector condition data: Condition database
- run logs data: RunLog database
- geometry data (?): Geometry database or part of CalibrationDB (?)
- some others? ... to be defined later, during "phase one" work


Databases contents (1)

ProdPhase database
- all information coming from test-beds, from manufacturers, assembly processes, object flow between manufacturers and labs, etc.

RunLog database
- to store the summary information describing the contents of an experimental run and to point to the locations where detailed information associated with the run is stored

An example of a Web-based interface developed by Sylwester Radomski (undergraduate student from the Faculty of Physics, WUT) for STAR can be seen at http://www.star.bnl.gov -> Computing, and from the table "New" the first item
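
The RunLog idea (one summary record per run plus pointers to where the detailed data live) could be laid out roughly as follows. This is only a sketch under stated assumptions: the table and column names (run_log, run_detail_location, n_events and so on) are hypothetical and do not come from the ALICE or STAR designs, and SQLite is used only so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: one summary row per run plus pointers to detailed data.
SCHEMA = """
CREATE TABLE run_log (
    run_number  INTEGER PRIMARY KEY,
    start_time  TEXT NOT NULL,
    end_time    TEXT,
    n_events    INTEGER,
    comment     TEXT
);
CREATE TABLE run_detail_location (
    run_number  INTEGER NOT NULL REFERENCES run_log(run_number),
    kind        TEXT NOT NULL,   -- e.g. 'raw' or 'calibration' (assumed labels)
    location    TEXT NOT NULL    -- file path or URL of the detailed data
);
"""


def open_runlog():
    """Create an in-memory database with the hypothetical RunLog schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def register_run(conn, run_number, start_time, n_events, locations):
    """Store the run summary and the places where its detailed data live."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO run_log (run_number, start_time, n_events) VALUES (?, ?, ?)",
            (run_number, start_time, n_events),
        )
        conn.executemany(
            "INSERT INTO run_detail_location VALUES (?, ?, ?)",
            [(run_number, kind, loc) for kind, loc in locations],
        )


def locations_for(conn, run_number):
    """Return (kind, location) pairs for one run, ordered by kind."""
    rows = conn.execute(
        "SELECT kind, location FROM run_detail_location "
        "WHERE run_number = ? ORDER BY kind",
        (run_number,),
    )
    return rows.fetchall()
```

A client would call register_run() once per run and later use locations_for() to find the detailed files, without the RunLog itself holding the bulk data.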


Why database in production phase?

The environment in which the archive facility operates is composed of many sources of information.

We have to deal with data:
- produced by various test-bench systems
- entered manually by operators
- submitted by collaborating institutes and companies

Usually there are a number of distinct data formats.

Files are stored in many locations.

Consequently, without a database it is hard not only to locate the right piece of information but also to ensure the safety and good quality of data.


Goals for production phase database

- secure archiving of all the test results in a repository
- easy availability of information about the location of objects (in the geographic sense: manufacturers, labs), which makes the assembly arrangement easier
- creating the possibility of automatic assignment of quality attributes according to well-defined criteria
- statistical analysis of the quality should be possible easily and at any time
- preparing data for future on-line use by slow-control, DCS and DAQ
- easy access to all data during the production and assembly phase
- in the future: easy access to all data during the experiment run
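
The goal of automatic assignment of quality attributes, followed by statistical analysis, can be illustrated with a small sketch. The grades and numeric limits in CRITERIA are invented for illustration; real criteria would come from the detector groups.

```python
from statistics import mean

# Hypothetical criteria: (grade, lower bound, upper bound) for a measured value.
CRITERIA = [
    ("good", 0.0, 1.0),
    ("acceptable", 1.0, 5.0),
]


def quality(value, criteria=CRITERIA):
    """Return the first grade whose half-open interval [low, high) contains value."""
    for grade, low, high in criteria:
        if low <= value < high:
            return grade
    return "rejected"


def summarize(measurements):
    """Group measured values by grade and report (count, mean) per grade."""
    groups = {}
    for value in measurements:
        groups.setdefault(quality(value), []).append(value)
    return {grade: (len(vals), mean(vals)) for grade, vals in groups.items()}
```

Keeping the criteria in a data table rather than in code means they can be stored in the database itself and changed without touching the software.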


DB production phase

Basic requirements:
- data should be stored in a central repository to make management and maintenance easy and reliable
- access to the data should be assured for everybody who participates in tests during the production phase, i.e. software allowing the use of Web browsers is necessary
- registration of objects should be possible manually (by an operator with suitable privileges) as well as automatically (from a LabVIEW application, for example, or other software)
- the software should allow even inexperienced users to create (SQL) queries to the database
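
One way to let inexperienced users create queries is to have the interface collect simple (column, operator, value) selections and translate them into SQL. The sketch below assumes a hypothetical set of filterable columns (object_id, lab, test_name, value); it is an illustration of the idea, not the software described in these slides.

```python
# Hypothetical whitelist of columns a user may filter on, and allowed operators.
ALLOWED_COLUMNS = {"object_id", "lab", "test_name", "value"}
ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">=", "!="}


def build_query(table, filters):
    """Turn [(column, operator, value), ...] into parameterized SQL.

    Column names and operators are checked against whitelists; values are
    returned as bound parameters and never pasted into the SQL text.
    """
    if not table.isidentifier():
        raise ValueError(f"bad table name: {table!r}")
    clauses, params = [], []
    for column, op, value in filters:
        if column not in ALLOWED_COLUMNS or op not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported filter: {column} {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

The returned pair can be passed to any database driver that supports "?" placeholders; drivers with a different placeholder style would need the clause template adjusted.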


From the point of view of domain experts ... (1)

- there is an ever-increasing demand for centralized storage of data for consistent and easy-to-use search and retrieval facilities
- experts want to be able to retrieve and analyze the information in a user-friendly way, regardless of its origin
- they do not want to be forced to perform several queries just because the data in question were taken by different data acquisition systems
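
The wish to avoid one query per acquisition system can be met with a database view that merges the per-source tables. The sketch below uses SQLite and two hypothetical source tables (labview_results, manual_results) with a hypothetical object identifier; the real sources and columns would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical per-source tables, one for each acquisition system.
CREATE TABLE labview_results (object_id TEXT, measured_at TEXT, value REAL);
CREATE TABLE manual_results  (object_id TEXT, measured_at TEXT, value REAL);

-- One view that hides the origin of the data behind a 'source' column.
CREATE VIEW all_results AS
    SELECT object_id, measured_at, value, 'labview' AS source FROM labview_results
    UNION ALL
    SELECT object_id, measured_at, value, 'manual' AS source FROM manual_results;
""")
conn.execute("INSERT INTO labview_results VALUES ('SSD-001', '2001-03-01', 1.2)")
conn.execute("INSERT INTO manual_results VALUES ('SSD-001', '2001-03-02', 1.4)")


def results_for(object_id):
    """Single query over all sources, ordered by measurement time."""
    return conn.execute(
        "SELECT measured_at, value, source FROM all_results "
        "WHERE object_id = ? ORDER BY measured_at",
        (object_id,),
    ).fetchall()
```

The expert queries all_results only; adding a further acquisition system means extending the view, not changing the expert's queries.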


From the point of view of domain experts ... (2)

- they wish to do statistics on data sets spanning months (and more) without having to browse tens of subdirectories on backup storage devices
- usually they prefer to use industry-standard, versatile software tools to process and analyze data
- they certainly would not mind being able to automate their routine, everyday tasks

Their task is to look at the information, not to look for it


Requirements addressed to software developers

- we should address those issues by providing a modular framework for archiving and for platform-independent retrieval of data in a heterogeneous distributed computing environment
- our database system must be open enough to follow the inevitable evolution of information gathering systems related to the development of the particular detectors
- we should be able to cope with the fast-evolving new Internet technologies in order to take full advantage of the facilities they provide


DB for STAR - software technologies used

Use of:
- PHP4 software running on the server side
- C/C++ for API
- JAVA + SWING & JDBC for applications requiring more interactivity (JDBC = JAVA DataBase Connectivity)

seems to be the right choice of tools for client-side software development


On-the-fly plot creation ...


From SQL query to plot ...

Generation of plots and histograms from the database and putting them on the Web. Attempt made by S. Radomski.

Data chain:
- The http server (Apache - Tomcat) calls a servlet (dbPlot) with a parameter: the SQL query.
- The servlet in the http server connects to a ROOT-based server through a socket and sends the query.
- ROOT server means a ROOT script which handles connections and scripts the dbPlot class.
- dbPlot::Init() reuses the existing connection to the database or creates a new one if the old one does not exist.
- dbPlot::TakeData(): the server sends the query to the DB and takes data using the TSQLServer class.
- dbPlot::TakeData() takes data from TSQLResult and puts them into a TNtuple. This function can recognise and parse the 'private' format of data stored in a BLOB.
- dbPlot::PlotData() calls TTree->Draw() with proper parameters.
- dbPlot::Style() sets colors and labels.
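
The chain "query, take data, fill a container, draw" can be mimicked in a few lines. The sketch below is not the dbPlot code: it swaps in Python's sqlite3 module for TSQLServer, a plain list for TNtuple and a text rendering for TTree->Draw(), and all function names are hypothetical.

```python
import sqlite3


def take_data(conn, query):
    """Run the query and return the first column of every row as floats."""
    return [float(row[0]) for row in conn.execute(query)]


def histogram(values, n_bins, low, high):
    """Count values into n_bins equal-width bins over [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        if low <= v <= high:
            # The upper edge is folded into the last bin.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


def plot_text(counts, low, high):
    """Render one text row per bin (a stand-in for a real plot)."""
    width = (high - low) / len(counts)
    return [f"{low + i * width:8.2f} {'#' * c}" for i, c in enumerate(counts)]
```

In a Web setting, take_data() would run on the server side and only the finished image (here, the text rows) would be sent back to the browser.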


Performance and problems ...

- One histogram takes about 1-2 s.
- The slowest element in the chain is convert. Convert makes use of GhostScript. Creating PostScript and then converting it to PNG is overcomplicated, and a rather simple TGraph with ~758 lines takes ~10 s.
- ROOT cannot generate GIF in -b mode.
- Problems with memory deallocation in ROOT: after about 100 plots ROOT crashes.

Modification of Draw() in TTree:
- For a 1-D histogram Draw() always makes 100 bins. If the data have their own grid (measurement precision), plots look terrible, especially when histogramming integers.
- A small modification in TTreePlayer makes it possible to recognize whether the data are gridded and sets number of ...
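
The slide text is cut off, so the actual TTreePlayer change is not known. One plausible way to handle gridded data, shown here only as an assumption, is to detect the grid step of integer-valued data and place one bin on each grid point, falling back to a fixed number of equal-width bins when the grid would give too many. The function names are hypothetical.

```python
from functools import reduce
from math import gcd


def grid_step(values):
    """Return the common spacing of integer data, or None if all values are equal."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return None
    diffs = [b - a for a, b in zip(distinct, distinct[1:])]
    return reduce(gcd, diffs)


def grid_bins(values, max_bins=100):
    """Return bin edges centred on grid points for a non-empty list of integers.

    Falls back to max_bins equal-width bins when the grid has more points.
    """
    lo, hi = min(values), max(values)
    step = grid_step(values)
    if step is None:
        return [lo - 0.5, lo + 0.5]
    n = (hi - lo) // step + 1
    if n > max_bins:
        width = (hi - lo) / max_bins
        return [lo + i * width for i in range(max_bins + 1)]
    return [lo - step / 2 + i * step for i in range(n + 1)]
```

With edges centred on the grid, every measured value falls in the middle of exactly one bin, which avoids the alternating empty and double-filled bins that a fixed 100-bin layout produces for integers.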


Typical architecture for local site/lab, i.e. Lab-participant

- measurements are performed on a dedicated computer
- data are transferred over the Ethernet local network to the database
- users can access the measurements by means of JAVA applets or PHP applications
- the graphical user interface makes the construction of complex queries easy even for a user with no database experience
- another capability of this applet is the visualisation of selected data
- it is clear that using JAVA, JDBC and PHP allows access to the database over the Internet or local network with the user's favourite browser

[Figure: components shown: DUT, LabVIEW application, DB server (daemon), repository, JAVA applet, PHP applications, ROOT or AliROOT]


Production phase database for ALICE

[Figure: distributed architecture. Lab 1 (somewhere in Europe): DUT, LabVIEW application, MySQL server (daemon), MySQL repository, JAVA applet, PHP applications, ROOT or AliROOT. Lab 2, Lab 3, Lab 4 ... Lab n. CERN: ORACLE server (daemon), DATA repository, Data services, Data Archive Server Library, Application services, DBMS, AliROOT]


Central data warehouse at CERN: three tier architecture

- one can easily distinguish three logical tiers, in line with present tendencies: client layer, application services layer and data services layer
- each layer contains several components (not all shown in the picture)
- the top level is a layer containing client applications, responsible for data transfer into the database and for visualisation
- the middle layer is composed of application services; this layer knows the logical structure and physical locations of data
- the bottom layer contains data and the Database Management System

[Figure: three-tier diagram. Labels: Client layer modules and applications; Interactive software: WWW browsers, JAVA applets, PHP, HTML, command line utilities etc.; Generic data loader; Custom data loader; Filters; AliROOT; Application services; Data Archive Server Library; Data services; ORACLE server; DBMS; DATA repository]
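
The division of responsibilities between the three tiers can be sketched in code. This is a minimal illustration under stated assumptions: SQLite stands in for the ORACLE server, and the class names, the "calibration" dataset and the calib_v1 table are all hypothetical.

```python
import sqlite3


class DataService:
    """Bottom tier: owns the DBMS connection (SQLite stands in for the real server)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE calib_v1 (channel INTEGER, gain REAL)")

    def insert(self, table, rows):
        placeholders = ", ".join("?" * len(rows[0]))
        with self.conn:  # commit on success
            self.conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

    def select_all(self, table):
        return self.conn.execute(f"SELECT * FROM {table}").fetchall()


class ApplicationService:
    """Middle tier: maps logical dataset names to their physical location."""

    CATALOG = {"calibration": "calib_v1"}  # hypothetical logical -> physical mapping

    def __init__(self, data_service):
        self.data = data_service

    def load(self, dataset, rows):
        self.data.insert(self.CATALOG[dataset], rows)

    def fetch(self, dataset):
        return self.data.select_all(self.CATALOG[dataset])


def client_demo():
    """Top tier: a client uses logical names only and never sees table names."""
    app = ApplicationService(DataService())
    app.load("calibration", [(1, 0.98), (2, 1.02)])
    return app.fetch("calibration")
```

Because only the middle tier knows the catalog, data can be moved or tables renamed without changing any client application.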


How to start with databases for ALICE?

The project should be managed in a few phases.

The project is large, so I strongly suggest applying a methodology proven in a "commercial environment".


How to manage the project? (1)

Phase 1: strategy (or planning) for the WHOLE project:
- determination of the scope of the project
- partitioning into subsystems (natural way: subdetectors, but not only)
- formulation of general models which could be applied
- creation of a list of actors/participants
- approximate time schedule for particular tasks
- initial choice of software technologies

How to manage the project? (2)

Successive phases should be performed in a "spiral cycle". It simply means that particular subsystems are elaborated successively.

Each subsystem must go through the following phases:
- analysis/conceptual design
- software design
- development (programming)
- implementation

Improvements/corrections in earlier completed subsystems must be continued during the work on successive subsystems.

Simultaneous work on several subsystems is a good practice.


How to manage the project? (3)

- work on a pilot project in parallel to the main one, with the same software technology; it should contain the most urgent things
- efficiency tests; creation of "simulated data" with volumes similar to the expected ones
- creation of "conceptual models" during the analysis phase is necessary before the design of subsystems; an appropriate formalism and CASE-class tools are needed for that. For Linux, UML (Unified Modelling Language) is an appropriate option
- elaboration, for the whole project, of such standards as: system of keys, terminology, security, access rights etc.



First steps ...

- Start to formally organize a central database group for ALICE
- After that: begin phase 1 of the project, i.e. strategy for the WHOLE project/experiment

Partial list of highest-priority tasks for this group:
- determination of the scope of the project
- formulation of general models which could be applied
- creation of a list of actors/participants (including 1-2 representatives from each subdetector!)
- initial choice of software technologies which could be used
- partitioning into subsystems
- analysis/conceptual design
