Db Compare

  • 8/2/2019 Db Compare

    1/31

    Comparison of Oracle, MySQL and PostgreSQL DBMS
    in the context of ALICE needs

    Wiktor Peryt, Warsaw University of Technology, Faculty of Physics

  • 8/2/2019 Db Compare

    2/31

    We took the following approach: first, we determined which DBMS features are important from the point of view of such a large experiment.

    We chose the following features:

    Elementary features
      - basic data types
      - SQL language features
      - declarative integrity constraints
      - programming abstractions
      - automatic generation of identifiers
      - national characters support

  • 8/2/2019 Db Compare

    3/31

    Transactions and multi-user access
      - transactions
      - locks
      - multi-user access

    Programming in database
      - stored procedures
      - triggers

    Elements of database administration
      - access control
      - backup copies
      - data migration

  • 8/2/2019 Db Compare

    4/31

    Portability and scalability
      - portability of DBMS
      - scalability

    Performance and VLDB (Very Large Databases)
      - query optimization
      - structures supporting query optimization
      - support for analytical processing
      - allocation of disk space
      - data size limits
      - VLDB implementations


  • 8/2/2019 Db Compare

    5/31

    Distributed databases
      - access to multiple databases
      - heterogeneous systems support

    Special data types
      - large objects in database
      - post-relational extensions
      - support for special data types

    Application development and interfaces
      - embedded SQL
      - standard interfaces, additional interfaces
      - interoperability with Web technology
      - XML, CASE

  • 8/2/2019 Db Compare

    6/31

    Reliability
      - failure recovery

    Commercial issues
      - technical support available
      - market position

    Having completed step one, we carried out the subsequent work in 3 subgroups, each of which dealt with only one DBMS.

    The members of each subgroup had their own practical experience with the DBMS investigated by their subgroup.

    This procedure allowed us to verify the information contained in manuals and other available documentation (for instance, on the Internet).

    As a result, 3 extended documents devoted to Oracle, MySQL and PostgreSQL were created.

  • 8/2/2019 Db Compare

    7/31

    Konrad Bohuszewicz undergraduate student

    Maciej Czyzowicz undergraduate student

    Michal Janik Ph.D. student

    Dawid Jarosz undergraduate student

    Piotr Mazan undergraduate student

    Marcin Mierzejewski undergraduate student

    Mikolaj Olszewski undergraduate student

    Wiktor S. Peryt

    Sylwester Radomski undergraduate student

    Piotr Szarwas Ph.D. student

    Tomasz Traczyk

    Dominik Tukendorf undergraduate student

    Jacek Wojcieszuk undergraduate student

    Faculty of Electronics and Information Technology

    Faculty of Mathematics and Information Sciences

    Faculty of Physics

  • 8/2/2019 Db Compare

    8/31

    About the comparison

      - the comparison was discussed by all people involved in this task
      - the compilation was made by Dr. Tomasz Traczyk
      - the compilation circulated within the whole group a few times to make sure we avoided omissions and mistakes
      - this version of the document is accepted by all co-authors
      - we consider it a quite comprehensive and objective comparison
      - it also contains "weights", which we call "importance", differentiated for the Central database and the Lab-participants
      - the Central database should be a kind of data warehouse at CERN, containing all the data, including data transferred periodically from the Lab-participants
      - the term "Lab-participants" denotes the smaller databases in labs involved in the preparation of the ALICE experiment
      - a few explanations of terminology used in the database domain are also included to make this document easy to comprehend for non-specialists

  • 8/2/2019 Db Compare

    9/31

    Summary

                                         Importance               Assessment
    Problem                              Central    Lab-          MySQL  Oracle8  PostgreSQL
                                         database   participants

    Elementary features
      Basic data types                   C          C             B      C        A
      SQL                                B          B             C      B        B
      Declarative constraints            B          B             C      A        A
      Programming abstractions           A          C             D      A        C
      Generation of ids                  C          C             C      A        A
      National chars                     B          C             B      A        B

    Transactions
      Transactions                       A          C             D      A        A
      Locks                              A          C             D      A        A
      Multiuser access                   A          D             C      A        C

    Programming in DB
      Stored procedures and triggers     B          C             D      A        A

    Administration
      Access control                     B          D             A      A        B
      Backup                             A          C             C      A        C
      Data migration                     C          C             A      B        A

    Portability and scalability
      Portability                        B          C             B      A        B
      Scalability                        A          C             B      A        C

    Performance and VLDB
      Query optimization                 A          C             B      A        B
      Structures supporting optimization B          D             D      A        B
      Support for OLAP                   B          D             D      A        D
      Allocation of the disk space       A          C             C      A        C
      Size limits                        A          B             B      A        C
      VLDB implementations               A          C             D      A        B

    Distributed databases
      Access to multiple databases       C          D             C      A        C
      Heterogeneous systems support      B          D             D      B        D

    Special data types
      Large objects                      B          B             B      A        C
      Post-relational extensions         C          C             D      A        B
      Support for special data types     C          C             D      A        C

    Application development and interfaces
      Embedded SQL                       C          C             D      A        B
      Standard interfaces                B          C             B      A        B
      Additional interfaces              A          A             A      A        A
      Web technology                     A          A             B      A        B
      XML                                B          C             D      A        D
      CASE                               B          C             D      A        D

    Reliability
      Recovery                           A          B             C      A        C

    Commercial issues
      Prices                             C          A             A      D        A
      Technical support                  A          B             C      B        D
      Position on the market             A          C             D      A        D
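The summary gives letter grades only. One hypothetical way to condense them into a single number per DBMS is an importance-weighted mean, sketched below. The numeric scale (A=4 down to D=1, used for both importance and assessment) and the names `GRADE_POINTS`, `ROWS` and `weighted_score` are assumptions made for illustration and are not part of the original comparison. Only the three sample rows are copied from the table (importance column: Central database).

```python
# Assumed numeric scale; the slides define only the letters A-D.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1}

# (problem, importance for Central database, MySQL, Oracle8, PostgreSQL)
ROWS = [
    ("Transactions", "A", "D", "A", "A"),
    ("Backup", "A", "C", "A", "C"),
    ("Prices", "C", "A", "D", "A"),
]


def weighted_score(rows, column):
    """Importance-weighted mean grade for one DBMS column (2, 3 or 4)."""
    total = sum(GRADE_POINTS[r[1]] * GRADE_POINTS[r[column]] for r in rows)
    weight = sum(GRADE_POINTS[r[1]] for r in rows)
    return total / weight


for name, column in (("MySQL", 2), ("Oracle8", 3), ("PostgreSQL", 4)):
    print(name, round(weighted_score(ROWS, column), 2))
```

A different numeric scale, or the Lab-participants importance column, would give different numbers, so such a score can only complement the row-by-row reading of the table.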

  • 8/2/2019 Db Compare

    10/31

    Our preliminary conclusions:

      - for the Central Data Repository for ALICE at CERN: only Oracle can be seriously taken into account
      - for the Lab-participants (mainly for production phase databases): Oracle is also the best, but using MySQL or PostgreSQL is possible; the choice between them is not obvious at the moment

    Some extended tests of MySQL and PostgreSQL performance, stability etc. with real data for STAR SSD are still in progress in Warsaw. They will be published in 1-2 weeks on the website:

    http://ITS_DB_ALICE.if.pw.edu.pl

    The document "Comparison of Oracle, MySQL and PostgreSQL DBMS" is available at the same address.

  • 8/2/2019 Db Compare

    11/31

    Questions for ALICE

      - How to start with databases for ALICE and how to manage the project?
      - General concept of system architecture
      - Databases in production phase
      - Software technologies recommended
      - DBMS platform choice
      - How to proceed?

  • 8/2/2019 Db Compare

    12/31

    Database types for ALICE

    The following main categories of information should go into databases:

      - production and assembly phase measurements and descriptive data -> ProdPhase database
      - calibration data -> Calibration database
      - configuration data -> Configuration database
      - detector condition data -> Condition database
      - run log data -> RunLog database
      - geometry data (?) -> Geometry database or part of CalibrationDB (?)
      - some others? ... to be defined later, during "phase one" work

  • 8/2/2019 Db Compare

    13/31

    Database contents (1)

    ProdPhase database
      - all information coming from test-beds, from manufacturers, assembly processes, object flow between manufacturers and labs, etc.

    RunLog database
      - stores the summary information describing the contents of an experimental run and points to the locations where detailed information associated with the run is stored
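A minimal sketch of what such a run log could look like, using SQLite from Python purely for illustration. The table and column names (`run_log`, `run_detail_location` and so on) and the helper functions are hypothetical and do not come from the STAR or ALICE schemas.

```python
import sqlite3

SCHEMA = """
CREATE TABLE run_log (
    run_number  INTEGER PRIMARY KEY,
    start_time  TEXT NOT NULL,
    end_time    TEXT,
    n_events    INTEGER,
    comment     TEXT
);
CREATE TABLE run_detail_location (
    run_number  INTEGER NOT NULL REFERENCES run_log(run_number),
    kind        TEXT NOT NULL,      -- e.g. 'raw', 'calibration'
    uri         TEXT NOT NULL       -- where the detailed data live
);
"""


def create_run_log(conn):
    """Create the two (hypothetical) run log tables."""
    conn.executescript(SCHEMA)


def register_run(conn, run_number, start_time, n_events, locations):
    """Store the run summary and the pointers to its detailed data."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO run_log (run_number, start_time, n_events) "
            "VALUES (?, ?, ?)",
            (run_number, start_time, n_events),
        )
        conn.executemany(
            "INSERT INTO run_detail_location (run_number, kind, uri) "
            "VALUES (?, ?, ?)",
            [(run_number, kind, uri) for kind, uri in locations],
        )


def locations_for_run(conn, run_number):
    """Return (kind, uri) pairs for one run, ordered by kind."""
    rows = conn.execute(
        "SELECT kind, uri FROM run_detail_location "
        "WHERE run_number = ? ORDER BY kind",
        (run_number,),
    )
    return rows.fetchall()
```

The summary table stays small and fast to search, while the bulky detailed data remain wherever the second table points.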

    An example of a Web-based interface developed for STAR by Sylwester Radomski (undergraduate student from the Faculty of Physics, WUT) can be seen at http://www.star.bnl.gov -> Computing, and then the first item in the "New" table.

  • 8/2/2019 Db Compare

    14/31

    Why a database in the production phase?

    The environment in which the archive facility operates is composed of many sources of information.

    We have to deal with data:
      - produced by various test-bench systems
      - entered manually by operators
      - submitted by collaborating institutes and companies

    Usually there are a number of distinct data formats, and files are stored in many locations.

    Consequently, without a database it is hard not only to locate the right piece of information but also to ensure the safety and good quality of the data.

  • 8/2/2019 Db Compare

    15/31

    Goals for the production phase database

      - secure archiving of all the test results in a repository
      - easy availability of information on the location of objects (in the geographic sense: manufacturers, labs), which makes the assembly arrangement easier
      - the possibility of automatic assignment of quality attributes according to well-defined criteria
      - statistical analysis of the quality should be possible easily and at any time
      - preparing data for future on-line use by slow-control, DCS and DAQ
      - easy access to all data during the production and assembly phase
      - in the future: easy access to all data during the experiment run
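Automatic assignment of quality attributes and a simple quality statistic could look like the following sketch. The measured quantity (a leakage current in nA), the limits and the labels are hypothetical. Real criteria would have to be defined for each detector component.

```python
from collections import Counter

# Hypothetical criteria: upper limits on a measured leakage current (in nA),
# listed from the strictest label to the most permissive one.
CRITERIA = [("good", 10.0), ("acceptable", 50.0)]


def quality_attribute(leakage_current_na):
    """Return the first quality label whose limit the measurement satisfies."""
    for label, limit in CRITERIA:
        if leakage_current_na <= limit:
            return label
    return "rejected"


def quality_summary(measurements):
    """Count objects per quality label; `measurements` maps object id to value."""
    return Counter(quality_attribute(value) for value in measurements.values())
```

Keeping the criteria in one table-like structure means they can be changed, or stored in the database itself, without touching the assignment logic.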

  • 8/2/2019 Db Compare

    16/31

    DB production phase

    Basic requirements:
      - data should be stored in a central repository to make management and maintenance easy and reliable
      - access to the data should be assured for everybody who participates in tests during the production phase, i.e. software allowing the use of Web browsers is necessary
      - registration of objects should be possible manually (by an operator with suitable privileges) as well as automatically (for example from a LabVIEW application or other software)
      - the software should allow even inexperienced users to create (SQL) queries to the database
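One way to let inexperienced users create queries is to have them pick a column, an operator and a value in a form, and let the software assemble the SQL. The sketch below assumes a hypothetical `measurement` table and column set. It illustrates the idea and is not the interface actually developed for STAR or ALICE.

```python
import sqlite3

# Hypothetical table and columns that a form would offer to the user.
TABLE = "measurement"
COLUMNS = {"object_id", "lab", "quantity", "value"}
OPERATORS = {"=", "<", ">", "<=", ">="}


def build_query(conditions):
    """Turn form input [(column, operator, value), ...] into SQL plus parameters.

    Column and operator names are checked against whitelists; values are
    passed as parameters, so user input is never pasted into the SQL text.
    """
    clauses, params = [], []
    for column, operator, value in conditions:
        if column not in COLUMNS or operator not in OPERATORS:
            raise ValueError(f"unsupported condition: {column} {operator}")
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    sql = f"SELECT object_id, lab, quantity, value FROM {TABLE}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def run_query(conn, conditions):
    """Build the query from form input and execute it on an open connection."""
    sql, params = build_query(conditions)
    return conn.execute(sql, params).fetchall()
```

For example, `build_query([("lab", "=", "Warsaw"), ("value", ">", 1.5)])` yields a `WHERE lab = ? AND value > ?` clause with the two values as parameters.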

  • 8/2/2019 Db Compare

    17/31

    From the point of view of domain experts ... (1)

      - there is an ever-increasing demand for centralized storage of data with consistent and easy-to-use search and retrieval facilities
      - experts want to be able to retrieve and analyze the information in a user-friendly way, regardless of its origin
      - they do not want to be forced to perform several queries just because the data in question were taken by different data acquisition systems

  • 8/2/2019 Db Compare

    18/31

    From the point of view of domain experts ... (2)

      - they wish to do statistics on data sets spanning months (and more) without having to browse tens of subdirectories on backup storage devices
      - usually they prefer to use industry-standard, versatile software tools to process and analyze data
      - they certainly would not mind being able to automate their routine, everyday tasks

    Their task is to look at the information, not to look for it.

  • 8/2/2019 Db Compare

    19/31

    Requirements addressed to software developers

      - we should address those issues by providing a modular framework for archiving and for platform-independent retrieval of data in a heterogeneous distributed computing environment
      - our database system must be open enough to follow the inevitable evolution of information gathering systems related to the development of the particular detectors
      - we should be able to cope with fast-evolving new Internet technologies in order to take full advantage of the facilities they provide

  • 8/2/2019 Db Compare

    20/31

    DB for STAR - software technologies used

    Use of:
      - PHP4 software running on the server side
      - C/C++ for API
      - JAVA + SWING & JDBC for applications requiring more interactivity (JDBC = JAVA DataBase Connectivity)

    seems to be the right choice of tools for client-side software development.

  • 8/2/2019 Db Compare

    21/31

    On-the-fly plot creation ...

  • 8/2/2019 Db Compare

    22/31

    From SQL query to plot ...

    Generation of plots and histograms from the database and putting them on the Web. Attempt made by S. Radomski.

    Data chain:
      - The HTTP server (Apache - Tomcat) calls a servlet (dbPlot) with a parameter: the SQL query.
      - The servlet in the HTTP server connects to a ROOT-based server through a socket and sends the query.
      - The ROOT server is a ROOT script which handles connections and scripts the dbPlot class.
      - dbPlot::Init() reuses the existing connection to the database or creates a new one if the old one does not exist.
      - dbPlot::TakeData(): the server sends the query to the DB and takes the data using the TSQLServer class.
      - dbPlot::TakeData() takes the data from TSQLResult and puts them into a TNtuple. This function can recognise and parse the 'private' format of data stored in a BLOB.
      - dbPlot::PlotData() calls TTree->Draw() with the proper parameters.
      - dbPlot::Style() sets colors and labels.
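The core of this chain (run the SQL query, collect the values, fill a histogram) can be sketched without ROOT. The example below uses Python with SQLite and a plain list of counts in place of TSQLServer, TNtuple and TTree->Draw(). It is an illustration under those assumptions, not the dbPlot implementation, and the function name `histogram_from_query` is hypothetical.

```python
import sqlite3


def histogram_from_query(conn, sql, n_bins, low, high):
    """Run `sql` (expected to return one numeric column) and fill a histogram.

    Returns a list of `n_bins` counts for equal-width bins on [low, high);
    NULLs and values outside the range are ignored.
    """
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for (value,) in conn.execute(sql):
        if value is None or not (low <= value < high):
            continue
        # min() guards against floating-point rounding at the upper edge.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts
```

Usage: with an open connection `conn` and a table `m` holding a numeric column `v`, `histogram_from_query(conn, "SELECT v FROM m", 4, 0.0, 4.0)` returns four counts that a plotting layer can then draw.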

  • 8/2/2019 Db Compare

    23/31

    Performance and problems ...

      - One histogram takes about 1-2 s.
      - The slowest element in the chain is convert, which makes use of GhostScript. Creating PostScript and then converting it to PNG is overcomplicated, and a rather simple TGraph with ~758 lines takes ~10 s.
      - ROOT cannot generate GIF in -b mode.
      - Problems with memory deallocation in ROOT: after about 100 plots ROOT crashes.

    Modification of Draw() in TTree:
      - For a 1-D histogram, Draw() always makes 100 bins.
      - If the data have their own grid (measurement precision), plots look terrible, especially when histogramming integers.
      - A small modification in TTreePlayer makes it possible to recognize whether the data are gridded and sets the number of ...
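The idea of adapting the binning to gridded data can be illustrated outside ROOT. The following is a sketch of one possible approach, not the actual TTreePlayer modification: if all values lie on a common grid, one bin is centred on each grid point, otherwise the default of 100 bins is kept. The function names, the tolerance and the choice of the smallest gap as the candidate step are assumptions.

```python
def grid_step(values, tolerance=1e-9):
    """Return the common step of gridded data, or None if no grid is detected.

    The candidate step is the smallest gap between distinct sorted values;
    the data count as gridded if every value lies on low + k * step.
    This is conservative: data on a finer grid than the smallest observed
    gap are reported as not gridded.
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return None
    step = min(b - a for a, b in zip(distinct, distinct[1:]))
    low = distinct[0]
    for v in distinct:
        k = (v - low) / step
        if abs(k - round(k)) > tolerance:
            return None
    return step


def choose_binning(values, default_bins=100):
    """Return (n_bins, low_edge, high_edge) for a 1-D histogram."""
    low, high = min(values), max(values)
    step = grid_step(values)
    if step is None:
        return default_bins, low, high
    n_bins = round((high - low) / step) + 1
    # One bin per grid point, with the grid point at the bin centre.
    return n_bins, low - step / 2, high + step / 2
```

For integer data such as 1, 2, 2, 3, 5 this gives 5 bins on [0.5, 5.5], so each integer falls in the middle of its own bin instead of being spread unevenly over 100 bins.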

  • 8/2/2019 Db Compare

    24/31

    Typical architecture for a local site/lab, i.e. a Lab-participant

      - measurements are performed on a dedicated computer
      - data are transferred over the local Ethernet network to the database
      - users can access the measurements by means of JAVA applets or PHP applications
      - the graphical user interface makes the construction of complex queries easy even for a user with no database experience
      - another capability of this applet is the visualisation of selected data
      - using JAVA, JDBC and PHP makes it possible to access the database over the Internet or a local network with the user's favourite browser

    [Figure: lab architecture with the components DB server (daemon), repository, JAVA applet, PHP applications, LabVIEW application, DUT, ROOT or AliROOT]

  • 8/2/2019 Db Compare

    25/31

    Production phase database for ALICE

    [Figure: labs (Lab 1 somewhere in Europe, Lab 2, Lab 3, Lab 4, ..., Lab n) with the components MySQL server (daemon), MySQL repository, JAVA applet, PHP applications, LabVIEW application, DUT, ROOT or AliROOT; CERN with the components DATA repository, Data services, Data Archive Server Library, ORACLE server (daemon), Application services, DBMS, AliROOT]

  • 8/2/2019 Db Compare

    26/31

    Central data warehouse at CERN: three-tier architecture

      - one can easily distinguish three logical tiers, in line with present tendencies: the client layer, the application services layer and the data services layer
      - each layer contains several components (not all shown in the picture)
      - the top level is a layer containing client applications, responsible for data transfer into the database and visualisation
      - the middle layer is composed of application services; this layer knows the logical structure and physical locations of the data
      - the bottom layer contains the data and the Database Management System

    [Figure components: Client layer modules and applications (filters, generic data loader, custom data loader; interactive software: WWW browsers, JAVA applets, PHP, HTML, command line utilities, etc.), AliROOT, Application services (Data Archive Server Library), Data services (ORACLE server, DBMS, DATA repository)]
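The separation of the three tiers can be sketched with one class per layer. Everything in the example is hypothetical: the class and method names, the `measurement` table, and the use of SQLite as a stand-in for the DBMS. The point shown is only that the client never issues SQL and never needs to know where the data are stored.

```python
import sqlite3


class DataServices:
    """Bottom tier: the data and the DBMS (SQLite stands in for it here)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE measurement (object_id TEXT, quantity TEXT, value REAL)"
        )


class ApplicationServices:
    """Middle tier: the only layer that knows table names and data locations."""

    def __init__(self, data_services):
        self._conn = data_services.conn

    def store(self, object_id, quantity, value):
        with self._conn:  # commit on success, roll back on error
            self._conn.execute(
                "INSERT INTO measurement VALUES (?, ?, ?)",
                (object_id, quantity, value),
            )

    def values_for(self, object_id, quantity):
        rows = self._conn.execute(
            "SELECT value FROM measurement "
            "WHERE object_id = ? AND quantity = ? ORDER BY rowid",
            (object_id, quantity),
        )
        return [value for (value,) in rows]


class DataLoaderClient:
    """Top tier: a client that loads data without issuing any SQL itself."""

    def __init__(self, services):
        self._services = services

    def load(self, records):
        for object_id, quantity, value in records:
            self._services.store(object_id, quantity, value)
```

With this split, replacing the DBMS or moving the data affects only the middle and bottom tiers, while client applications keep calling the same services.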

  • 8/2/2019 Db Compare

    27/31

    How to start with databases for ALICE?

      - The project should be managed in a few phases.
      - The project is large, so I strongly suggest applying a methodology proven in a "commercial environment".

  • 8/2/2019 Db Compare

    28/31

    How to manage the project? (1)

    Phase 1: strategy (or planning) for the WHOLE project:
      - determination of the scope of the project
      - partitioning into subsystems (the natural way: by subdetectors, but not only)
      - formulation of general models which could be applied
      - creation of a list of actors/participants
      - approximate time schedule for particular tasks
      - initial choice of software technologies

  • 8/2/2019 Db Compare

    29/31

    How to manage the project? (2)

    Successive phases should be performed in a "spiral cycle". It simply means that particular subsystems are elaborated successively.

    Each subsystem must go through the following phases:
      - analysis/conceptual design
      - software design
      - development (programming)
      - implementation

    Improvements/corrections in earlier completed subsystems must be continued during the work on successive subsystems. Simultaneous work on several subsystems is a good practice.

  • 8/2/2019 Db Compare

    30/31

    How to manage the project? (3)

      - work on a pilot project in parallel to the main one, using the same software technology; it should contain the most urgent things
      - efficiency tests; creation of "simulated data" with volumes similar to the expected ones
      - creation of "conceptual models" during the analysis phase is necessary before the design of subsystems; an appropriate formalism and CASE-class tools are needed for that. For Linux, UML (Unified Modelling Language) is an appropriate option
      - elaboration, for the whole project, of standards such as: system of keys, terminology, security, access rights etc.
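An efficiency test with simulated data could start from something like the sketch below, which fills a table with reproducible random rows and times one query. The table layout, the Gaussian values and the function names are assumptions, and SQLite is used only to keep the example self-contained. A real test would run against the candidate DBMS with row counts close to the expected volumes.

```python
import random
import sqlite3
import time


def simulated_measurements(n_rows, n_objects, seed=0):
    """Yield (object_id, value) rows; a fixed seed makes the data reproducible."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield rng.randrange(n_objects), rng.gauss(0.0, 1.0)


def load_and_time(n_rows, n_objects):
    """Fill an in-memory table with simulated rows and time one count query.

    Returns (number of rows found, elapsed seconds for the query).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurement (object_id INTEGER, value REAL)")
    with conn:
        conn.executemany(
            "INSERT INTO measurement VALUES (?, ?)",
            simulated_measurements(n_rows, n_objects),
        )
    start = time.perf_counter()
    (count,) = conn.execute("SELECT COUNT(*) FROM measurement").fetchone()
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Repeating the measurement for growing `n_rows` shows how query time scales before any real data exist.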

  • 8/2/2019 Db Compare

    31/31

    First steps ...

      - Start to formally organize a central database group for ALICE.
      - After that: begin phase 1 of the project, i.e. the strategy for the WHOLE project/experiment.

    Partial, highest-priority tasks for this group:
      - determination of the scope of the project
      - formulation of general models which could be applied
      - creation of a list of actors/participants (including 1-2 representatives from each subdetector!)
      - initial choice of software technologies which could be used
      - partitioning into subsystems
      - analysis/conceptual design