ITI015En-The evolution of databases (I)


Transcript of ITI015En-The evolution of databases (I)

The evolution of database technology (I)
Huibert Aalbers

Senior Certified Executive IT Architect

IT Insight podcast

• This podcast belongs to the IT Insight series

• You can subscribe to the podcast through iTunes.

• Additional material such as presentations in PDF format or white papers mentioned in the podcast can be downloaded from the IT insight section of my site at http://www.huibert-aalbers.com

• You can send questions or suggestions regarding this podcast to my personal email, [email protected]

Hierarchical databases

• In the 1960s, IBM launched the first computers equipped with a hard disk drive

• This spurred the development of a technology to store, process and retrieve data. IMS, in 1968, became the first commercial database software, developed by IBM to inventory the very large bill of materials (BOM) for the Saturn V moon rocket and Apollo space vehicle.

• IMS was the first hierarchical database

Hierarchical databases

• Hierarchical databases have a serious limitation: they only support 1-to-n relationships, which makes data modeling difficult

• A parent can have multiple children

• A child can only have a single parent

Hierarchical databases

• The best-known hierarchical databases are

• IMS (still popular in large banks)

• Windows registry

• LDAP directories (depending on the implementation)

• Hierarchical databases still have a significant performance edge over more modern relational databases

Relational databases

• In 1970, Ted Codd, a British mathematician who worked at IBM, published a paper titled “A Relational Model of Data for Large Shared Data Banks”

• His groundbreaking work generated much interest in the information management world and spurred the creation of new companies such as Oracle (1977) and Informix (1980) that implemented Codd’s ideas. Meanwhile, IBM developed DB2, which first appeared on mainframes (1981) and later on distributed platforms.

Relational databases

• For over thirty years, relational databases have ruled the database market, based on their undeniable strengths

• During that period, users have shaped the evolution of the technology by demanding new features and increased performance

Strengths of Relational Databases

• Great technology to store large volumes of structured data

• The consistency of the data is guaranteed through the implementation of the ACID properties (illustrated in the sketch after this list)

• Atomicity

• Consistency

• Isolation

• Durability
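
As a minimal illustration, the classic funds-transfer transaction below either applies both updates or neither of them. The table and column names are hypothetical, and the BEGIN WORK/COMMIT WORK syntax follows Informix; other products use START TRANSACTION or rely on autocommit settings.

    -- Atomicity: both updates succeed or neither does
    -- Isolation: other sessions never see the intermediate state
    -- Durability: once committed, the change survives failures
    BEGIN WORK;
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
    COMMIT WORK;
    -- On any error, ROLLBACK WORK undoes everything done since BEGIN WORK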

User requirements that have shaped modern relational databases

• Increased scalability

• Ability to perform complex queries against large data sets (data warehousing)

• Support for new programming languages and types of data

• Requirements inspired by trends in modern programming languages

• Improved administration features to ease management of large numbers of database instances

Increased scalability

• Symmetric Multiprocessing (SMP)

• IBM System 65 (1967)

• UNIX (starting in the mid 80’s)

• Support for multiple processor cores

• POWER4 (2001)

• Data partitioning

• SQL query optimizer improvements

• Data compression

• Increased use of RAM

• Clustering

What are the bottlenecks of relational databases?

• I/O

• SQL joins

• Transactions (Locks), Distributed Transactions (Two phase commit)

• Concurrency

• Hardware

Data partitioning

• Hard disk drives used to be the main bottleneck preventing quick data access. That is why a system was needed to access the data from multiple disks, in parallel.

• A partitioned table has its data spread over multiple disks, based on one of the following schemes (see the DDL sketch after this list):

• Expression

• Range

• Round-Robin
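
The DDL sketch below illustrates the three schemes. The syntax is Informix-style fragmentation for the expression and round-robin cases and DB2-style range partitioning for the range case; all table, column and dbspace names are hypothetical.

    -- Expression: rows are placed according to a predicate (Informix-style)
    CREATE TABLE customer (id SERIAL, region CHAR(4), name VARCHAR(40))
      FRAGMENT BY EXPRESSION
        region = 'EMEA' IN dbspace1,
        region = 'AMER' IN dbspace2,
        REMAINDER IN dbspace3;

    -- Range: rows are placed according to a value range (DB2-style)
    CREATE TABLE sales (sale_id INTEGER, sale_date DATE, amount DECIMAL(9,2))
      PARTITION BY RANGE (sale_date)
      (STARTING '2014-01-01' ENDING '2014-12-31' EVERY 1 MONTH);

    -- Round-robin: rows are spread evenly across the disks (Informix-style)
    CREATE TABLE audit_log (id SERIAL, message VARCHAR(255))
      FRAGMENT BY ROUND ROBIN IN dbspace1, dbspace2, dbspace3;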

Data compression

• Data compression allows for significant storage (and therefore money) savings. In addition, and this may sound counterintuitive, it also increases performance, since data is read much faster (with less I/O), especially when data is stored in columnar form. Administrators can choose to compress the items below (a DDL sketch follows the list):

• Data

• Indices

• Blobs

• Results are spectacular

• Up to 80% less space needed to store the data

• Up to 20% less I/O
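
A minimal DB2-style sketch of enabling compression follows; the table and index names are hypothetical, and the exact clauses and algorithms vary by product and version.

    -- Compress table data with adaptive row compression
    CREATE TABLE sales_history (sale_id INTEGER, sale_date DATE, amount DECIMAL(9,2))
      COMPRESS YES ADAPTIVE;

    -- Compress an index
    CREATE INDEX ix_sales_date ON sales_history (sale_date) COMPRESS YES;

    -- Enable compression on an existing table (existing rows are compressed during a reorg)
    ALTER TABLE sales_history COMPRESS YES ADAPTIVE;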

In-memory databases

These databases store data in memory (RAM) instead of on hard disk drives to scale better and support extremely high transaction volumes

• This technology was originally designed to meet the needs of specific industries (primarily telcos and financial institutions) that required processing unusually high volumes of transactions

Recently, the line that divided in-memory databases from traditional databases has started to blur with the introduction of databases such as DB2 BLU, which automatically try to make the most use of RAM to improve performance without requiring all the data to be loaded in memory

Examples: DB2 BLU, Oracle Exalytics

Data Warehouse

• The need to analyze vast amounts of data was the first application that challenged the dominance of the RDBMS as the only tool required to work with data, as performance became a serious issue.

• In order to avoid impacting the performance of OLTP (OnLine Transaction Processing) databases, common sense dictated that data analysis should be performed on a different data store. As a result, the process is as follows:

• The data is first moved from the OLTP database to an operational data store (ODS), a repository used to transform the data before it can be used

• Then, the data is moved to the database in which the information is analyzed, the Data Warehouse (DW)

• This process is at the origin of the spectacular growth in the use of ETL (Extraction, Transformation and Load) and data quality tools (a simplified sketch of such a load follows)
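
A highly simplified sketch of one such nightly step is shown below. All table and column names are hypothetical, the date arithmetic follows DB2 syntax, and in practice the source tables would live in a different system reached through an ETL tool rather than plain SQL.

    -- 1. Extract: copy yesterday's orders from the OLTP system into an ODS staging table
    INSERT INTO ods_orders (order_id, customer_id, amount, currency, order_date)
      SELECT order_id, customer_id, amount, currency, order_date
      FROM oltp_orders
      WHERE order_date = CURRENT DATE - 1 DAY;

    -- 2. Transform and load: convert amounts to a single currency and resolve dimension keys
    INSERT INTO dw_sales_fact (order_id, customer_key, date_key, amount_usd)
      SELECT o.order_id, c.customer_key, d.date_key, o.amount * x.usd_rate
      FROM ods_orders o
      JOIN dim_customer c ON c.customer_id = o.customer_id
      JOIN dim_date d ON d.calendar_date = o.order_date
      JOIN exchange_rates x ON x.currency = o.currency;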

Extraction-Transformation-Load (ETL) tools

• In Data warehouse environments, it is common to update the data regularly (usually nightly) with the latest information from the transactional systems (OLTP). In general, the data needs to be transformed before it can be loaded into the DWH.

• In addition, it is still very common to exchange data between systems by sending flat files from one computer to another.

• This will probably disappear over time, as we move to a world where systems need to be online at all times.

Data Replication

• In modern environments which are online 24/7, exchanging flat files to share information among systems is not a viable solution.

• As soon as a change happens in one database, it needs to be reflected in the other repositories that require the information.

• The data replication needs to be guaranteed, even if one of the repositories is momentarily off-line

• Most databases include some kind of built-in replication functionality, but it is usually limited in scope, i.e. not allowing replication between databases from different brands.

Data cleansing and enrichment

• Analyzing dirty data is simply not possible. It needs to be cleansed first (a minimal SQL sketch follows this list).

• Standardize data (addresses, names, etc.)

• Eliminate duplicates, erroneous data, etc.

• Further, deeper analysis can be performed when the data is enriched with additional data

• Geocoding (distance to store)

• Demographics (age, sex, marital status, estimated income, house ownership, attended college, political leaning, charitable giving, number of cars owned, etc.)
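
A minimal sketch of these steps in plain SQL follows; the table and column names are hypothetical, and real data quality tools use far more sophisticated matching and external reference data.

    -- Standardize country values
    UPDATE customers SET country = 'MX'
     WHERE country IN ('Mexico', 'México', 'MEX');

    -- Remove duplicates, keeping the oldest row for each e-mail address
    DELETE FROM customers
     WHERE customer_id NOT IN (SELECT MIN(customer_id) FROM customers GROUP BY email);

    -- Enrich: attach a demographic attribute supplied by an external provider
    UPDATE customers
       SET age_bracket = (SELECT d.age_bracket FROM demographics d
                          WHERE d.email = customers.email)
     WHERE email IN (SELECT email FROM demographics);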

Data Warehouse

• Data is usually kept in a star schema, a special case of the snowflake schema, which is effective for handling simpler DWH queries (a sample query follows this list)

• Fact tables are at the centre of the schema, surrounded by the dimension tables.

• These tables are usually not normalized, for performance reasons. Referential integrity is not a concern as the data is usually imported from databases that enforce referential integrity.

• Specialized DWH databases can load data very quickly, run queries very fast by using specialized indices and execute critical operations such as building aggregate tables in optimized ways
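
A typical star-schema query, sketched below with hypothetical names, joins the fact table to its dimension tables and aggregates the measures.

    -- Revenue by year, region and product category
    SELECT d.year, s.region, p.category, SUM(f.amount) AS revenue
      FROM sales_fact f
      JOIN date_dim d ON f.date_key = d.date_key
      JOIN store_dim s ON f.store_key = s.store_key
      JOIN product_dim p ON f.product_key = p.product_key
     WHERE d.year = 2014
     GROUP BY d.year, s.region, p.category;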

Example: DB2 BLU

Database clustering

If web servers can scale horizontally, why can’t relational databases do the same? Couldn’t we share the workload among multiple computer nodes?

In order to achieve that, computer scientists have created two distinct architectures

• Shared disk

• Multiple instances of the database, all pointing to a single copy of the data

• Shared Nothing

• Multiple instances of the database, each one owning part of the data set (the data is partitioned)

Shared Disk

• Pros

• If one database instance or even a computing node fails, the system keeps working

• Good performance when reading data, even though the shared disk can become a bottleneck

• Cons

• Write operations become the main bottleneck (especially when using more than two nodes), because all the nodes need to be coordinated

• This can be mitigated by partitioning the data

• If the shared disk fails, the whole system fails

• Recovery after a node fails is a lengthy operation

Example: DB2 pureScale

Shared Nothing

Examples: Informix XPS, DB2 EEE/DPF

• Pros

• In general, write operations are extremely fast

• Scales linearly

• Cons

• Read operations can be slower when queries execute joins on data residing on different disks

• This also applies, to a lesser extent, to write operations on data residing on multiple disks

• If a computing node or its disk fails, the data it owns becomes unavailable

Data marts

• The first DWHs grew extremely quickly, until they became too hard to manage

• That is why organizations started to build specialized Data marts by function (HR, finance, sales, etc.) or department

• In order to avoid creating information silos, all data marts should use the same dimensions

• This is usually enforced by the ETL tools

OLAP cubes

The data is stored in a repository using a star schema, which in turn is used to build a multidimensional cube to analyze the information through multiple dimensions (sales, regions, time periods, etc.)

MOLAP / ROLAP tools

• MOLAP tools (Multidimensional OLAP) load data into a cube, on which the user can quickly execute complex queries

• ROLAP tools (Relational OLAP) transform user queries into complex SQL queries that are executed on a relational database

• This requires a relational database that has been optimized to handle data warehouse type queries

• In addition, to improve performance, aggregate tables need to be built (see the sketch after this list)
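
One way to build such an aggregate is a DB2-style materialized query table (MQT), sketched below with hypothetical names; other products offer materialized views with similar semantics.

    -- Pre-aggregated monthly sales, refreshed on demand after each load
    CREATE TABLE sales_by_month AS
      (SELECT d.year, d.month, SUM(f.amount) AS total_amount
         FROM sales_fact f
         JOIN date_dim d ON f.date_key = d.date_key
        GROUP BY d.year, d.month)
      DATA INITIALLY DEFERRED REFRESH DEFERRED;

    REFRESH TABLE sales_by_month;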

MOLAP tools

• Pros

• In most cases delivers better performance, due to index optimization, data cache and efficient storage mechanisms

• Lower disk space usage due to efficient data compression techniques

• Aggregate data is built automatically

• Cons

• Loading large data sets can be slow

• Working with models that include large amounts of data and a large number of dimensions is highly inefficient

ROLAP tools

• Pros

• Usually scales better (dimensions and records)

• Loading data with a robust ETL tool is usually much faster

• Cons

• Generally offers worse performance when both MOLAP and ROLAP tools can perform the job. This can however be mitigated by using ad-hoc database extensions (for example DB2 cubes)

• Depends on SQL, which does not translate well to certain use cases (budgeting, financial reporting, etc.)

• Uses much more space on disk

HOLAP tools

HOLAP (Hybrid Online Analytical Processing) is a combination of ROLAP and MOLAP

With this technology it becomes possible to store part of the data in a MOLAP repository and the remaining information in a ROLAP one, in order to choose the best strategy for each case. For example:

• Keep large tables with the detailed data in a relational database

• Keep aggregate data in a MOLAP repository

Hardware (Appliances)

In order to obtain the best performance from the software and simplify database management, some manufacturers have opted for developing integrated hardware and software solutions (a.k.a. appliances)

• It simplifies configuration (loading data, index and schema creation, etc.)

• It simplifies maintenance (standard components and streamlined support)

• It gets the most performance out of the hardware by using specialized chips and optimized storage devices

Hardware (CPU)

• The IBM POWER8 microprocessor was specifically designed to excel at data processing applications

• Large memory caches (512 KB of L2 cache per core, a 96 MB shared L3 cache and a 128 MB L4 cache outside the chip)

• 8 threads per core, 12 cores per chip (96 threads per chip)

• Up to 5GHz

Columnar databases


In Business Analytics environments, it is very unlikely that all the columns of a record will be required as part of the result of a query or in the WHERE clause

• Having the data organized by columns instead of by record (row) significantly improves query times, because usually much less information has to be read from disk

• Modern databases such as DB2 BLU have been designed to excel in both OLTP and OLAP environments. That means that DBAs can choose at database or table creation time how the data will be stored on disk (columns or rows), as in the sketch below
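
A minimal DB2 BLU-style sketch with hypothetical table names; the ORGANIZE BY clause selects the on-disk layout for each table.

    -- Column-organized: suited to analytical scans that touch only a few columns
    CREATE TABLE sales_fact (sale_id INTEGER, date_key INTEGER, amount DECIMAL(9,2))
      ORGANIZE BY COLUMN;

    -- Row-organized: suited to OLTP-style access to whole records
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status CHAR(1))
      ORGANIZE BY ROW;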

Support for new data types

During the 90s, developers started to ask for expanded data type support in relational databases (a brief DDL sketch follows the list below)

• Distinct types based on existing types

• STRUCT-like composite types

• Completely new data types with their own indexing methods (videos, pictures, sound)

• Time series

• Coordinates (2D, 3D)

• Text documents

• XML, Word, PDF, etc.

• Etc.
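
A brief DB2-style sketch of two of these extensions, a distinct type and a native XML column, with hypothetical names:

    -- A distinct type based on an existing type: euro values cannot accidentally be
    -- compared with or added to plain decimals
    CREATE DISTINCT TYPE euro AS DECIMAL(9,2) WITH COMPARISONS;

    CREATE TABLE invoices (
      invoice_id INTEGER NOT NULL PRIMARY KEY,
      total      euro,
      document   XML  -- the full invoice document, queryable with SQL/XML functions
    );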

Requirements inspired by trends in modern programming languages

• Inheritance

• Tables and types that inherit part of their structure from other tables/types (sketched after this list)

• Polymorphism

• More flexibility to define/overload functions, stored procedures and operators

• Stored procedures written in modern programming languages
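
An illustrative sketch of type and table inheritance, loosely following SQL:1999 typed-table syntax; the exact clauses differ between products (Informix, for instance, uses CREATE ROW TYPE ... UNDER).

    -- A child type/table inherits the structure of its parent and adds its own columns
    CREATE TYPE person_t AS (name VARCHAR(40), birthdate DATE) NOT FINAL;
    CREATE TYPE employee_t UNDER person_t AS (salary DECIMAL(9,2)) NOT FINAL;

    CREATE TABLE person OF person_t;
    CREATE TABLE employee OF employee_t UNDER person;

    -- A query against the parent table also returns the rows stored in the child table
    SELECT name FROM person;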

Object-relational databases

• Illustra was a company that developed an object-relational database that pioneered many of these interesting concepts, which came primarily from Java and Smalltalk

• Informix acquired Illustra and integrated these novel ideas into version 9.x of its flagship IDS database

• Later on, DB2 and Oracle also implemented some of those ideas

• Mapping Java objects to a relational database (O/R mapping) is a different issue that can be solved using object persistence libraries

Improved database management

• The more options a DBA has to tune the system, the better the chances of getting the most performance out of it

• However, as we provide more knobs to tune the system, the DBA’s job becomes more and more complex, especially in large datacenters where a single DBA may be responsible for hundreds of database instances

• The solution to this problem is Autonomic Computing, which allows the database to tune itself, based on rules that result from experience

Relational databases have evolved, a lot

Just before the data explosion brought about by the Web 2.0 phenomenon, some large enterprises still used niche databases to cope with the limitations of relational databases in certain edge use cases. In most cases, however, the most advanced database products (such as DB2, Oracle and Informix) were very successful at evolving quickly to solve virtually all emerging information management problems, and thus kept their privileged position from being threatened in any significant way by new products.

Contact information

On Twitter: @huibert (English), @huibert2 (Spanish)

Web site: http://www.huibert-aalbers.com

Blog: http://www.huibert-aalbers.com/blog