Introduction to Data Warehousing


Data Warehousing basics

Transcript of Introduction to Data Warehousing

Page 1: Introduction to Data Warehousing

Introduction to Data Warehousing

Page 2: Introduction to Data Warehousing

Contents

• History
• OLTP vs. OLAP
• Paradigm Shift
• Architecture
• Emerging Technologies
• Questions

Page 3: Introduction to Data Warehousing

History: Hollerith Cards

Once upon a time…
• Reporting was done with data stored on Hollerith cards.
– A card contained one record.
– Maximum length of the record was 80 characters.
– A data file consisted of a stack of cards.
• Data was “loaded” each time a report was run.
• Data had no “modeling” per se. There were just sets of records.
• Programs in languages such as COBOL, RPG, or BASIC (sketched in modern terms after this list) would:
– Read a stack of data cards into memory
– Loop through records and perform a series of steps on each (e.g. increment a counter or add an amount to a total)
– Send a formatted report to a printer
• It was difficult to report from multiple record types.
• Changes to data were implemented by simply adding, removing, or replacing cards.
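
A hypothetical rendering of that card-style program loop in modern Python, not period code; the record layout and values are invented for illustration.

    # Each "card" is one fixed-width, 80-character record; the program reads
    # the stack into memory, loops, and tallies, then prints a report.
    cards = [
        "ACME      000125",   # columns 1-10: customer; columns 11-16: cents
        "ACME      000350",
        "BETAMAX   000075",
    ]

    count, total = 0, 0
    for rec in cards:
        amount = int(rec[10:16])   # parse a fixed-width field (amount in cents)
        count += 1                 # increment a counter
        total += amount            # add an amount to a total

    # "Send a formatted report to a printer" becomes a print statement.
    print(f"records: {count}  total: {total / 100:.2f}")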

Page 4: Introduction to Data Warehousing

History: Hollerith Cards

FACTOIDS
• Card type: IBM 80-column punched card
• A/K/A: “Punched Card”, “IBM Card”
• Size: 7 3⁄8 by 3 1⁄4 inches
• Thickness: .007 inches (143 cards per inch)
• Capacity: 80 columns with 12 punch locations each

Page 5: Introduction to Data Warehousing

History: Magnetic Tape

• Cards were eventually replaced by magnetic tape.
• Tapes made data easier to load and storage more efficient.
• Records were stored sequentially, so individual records could not be quickly accessed.
• Data processing was still very similar to that of cards: load a file into computer memory and loop through the records to process.
• Updating data files was still difficult.

Page 6: Introduction to Data Warehousing

History: Disk Storage

• The arrival of disk storage revolutionized data storage and access.
• It was now possible to have a home base for data: a database.
• Data was always available: online.
• Direct access replaced sequential access, so data could be accessed more quickly.
• Data stored on disk required new methods for adding, deleting, or updating records. This type of processing became known as Online Transaction Processing (OLTP).
• Reporting from a database became known as Online Analytical Processing (OLAP).


Page 7: Introduction to Data Warehousing

History: Disk Storage

FACTOIDS
• Storage device: IBM 350 disk storage unit
• Released: 1956
• Capacity: 5 million 6-bit characters (3.75 megabytes)
• Disk spin speed: 1,200 RPM
• Data transfer rate: 8,800 characters per second


Page 8: Introduction to Data Warehousing

History: Relational Model

• In the late 1960s, E.F. Codd developed the relational model.
• Relational modeling was based on a branch of mathematics called set theory and added rigor to the organization and management of data.
• Relational modeling also improved OLTP (inserting, updating, and deleting data) by making these processes more efficient and reducing data anomalies.
• The relational model also introduced primary keys, foreign keys, referential integrity, relational algebra, and a number of other concepts used in modern database systems. (A minimal sketch of keys and referential integrity follows this list.)
• It soon became apparent that different data models suited OLAP and OLTP.
• Relational data was often denormalized (redundancy deliberately reintroduced) to support OLAP.
• Structured Query Language (SQL) was created to support relational database operations, both OLAP and OLTP.
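
A minimal sketch of primary keys, foreign keys, and referential integrity, using Python's built-in sqlite3 module; the table names and columns are invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")   # sqlite enforces foreign keys only when enabled

    con.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("""CREATE TABLE orders (
                       order_id INTEGER PRIMARY KEY,
                       cust_id  INTEGER REFERENCES customer(cust_id),  -- foreign key
                       amount   REAL)""")

    con.execute("INSERT INTO customer VALUES (1, 'ACME')")
    con.execute("INSERT INTO orders VALUES (100, 1, 49.95)")   # accepted: parent row exists

    try:
        con.execute("INSERT INTO orders VALUES (101, 99, 10.00)")   # no customer 99
    except sqlite3.IntegrityError as e:
        print("rejected by referential integrity:", e)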


Page 9: Introduction to Data Warehousing

History: Relational Model

FACTOIDS
• Although E.F. Codd was employed by IBM when he created the relational model, and IBM originated the SQL language (then “SEQUEL”), IBM was not the first vendor to produce a relational database or to use SQL.
• The first commercial implementation of a relational database and SQL came from Relational Software, Inc., which is now Oracle Corporation.
• SQL has been standardized by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).

Page 10: Introduction to Data Warehousing

History: Extracts 1

• In early relational databases, data was extracted from OLTP systems into denormalized extracts for reporting.

[Diagram: OLTP source → OLAP extract → report]

Page 11: Introduction to Data Warehousing

History: Extracts 2

• And more extracts...

[Diagram: several OLTP sources, each feeding its own OLAP extract and reports; some extracts feed further extracts]

Page 12: Introduction to Data Warehousing

History: Extracts 3

• And more extracts...

Page 13: Introduction to Data Warehousing

History: Extracts 4

• And more extracts...

Page 14: Introduction to Data Warehousing

History: Naturally Evolving Systems 1

• Naturally evolving systems began to emerge.

Page 15: Introduction to Data Warehousing

History: Naturally Evolving Systems 2

• Naturally evolving systems resulted in:
– Poor organization of data
– Extremely complicated processing requirements
– Inconsistencies in extract refresh status
– Inconsistent report results
• This created a need for architected systems for analysis and reporting.
• Instead of multiple extract files, a single source of truth was needed for each data source.

Page 16: Introduction to Data Warehousing

History: Architected Systems

• Developers began to design architected systems for OLAP data.
• In the 1980s and 1990s, organizations began to integrate data from multiple sources such as accounts receivable, accounts payable, HR, inventory, and so on. These integrated OLAP databases became known as Enterprise Data Warehouses (EDWs).
• Over time, methods and techniques for extracting and integrating data into architected systems began to evolve and standardize.
• The term data warehousing is now used to refer to the commonly used architectures, methods, and techniques for transforming and integrating data to be used for analysis.

Page 17: Introduction to Data Warehousing

History: An Architected Data Warehouse

Example of an Architected Data Warehouse

[Diagram: multiple OLTP sources → Staging → History; History feeds data marts (DM) and data sets; reports run from the data marts; an ODS sits alongside]

Page 18: Introduction to Data Warehousing

History: Compare Naturally Evolving System

Compare: Naturally evolving system


Page 19: Introduction to Data Warehousing

History: Compare Architected Data Warehouse

Compare: Architected Data Warehouse

[Diagram: the architected data warehouse again: OLTP sources → Staging → History → data marts (DM), data sets, and reports, with an ODS]

Page 20: Introduction to Data Warehousing

History: Inmon

• In the early 1990s, W.H. Inmon published Building the Data Warehouse (ISBN-10: 0471141615).
• Inmon put together the quickly accumulating knowledge of data warehousing and popularized most of the terminology we use today.
• Data in a data warehouse is extracted from another data source, transformed to make it suitable for analysis, and loaded into the data warehouse. This process is often referred to as Extract, Transform, and Load (ETL); a toy sketch follows this list.
• Since data from multiple sources is integrated in most data warehouses, Inmon also described the process as Transformation and Integration (T&I).
• Data in a data warehouse is stored in history tables, which are modeled for fast query performance.
• The history tables are the source of truth.
• Data from history is usually extracted into data marts, which are used for analysis and reporting.
• Separate data marts are created for each application; there is often redundant data across data marts.
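
A toy end-to-end sketch of the ETL idea in Python, using the built-in sqlite3 module; the source layout, code mappings, and table names are all invented for illustration.

    import sqlite3

    src = sqlite3.connect(":memory:")   # stands in for an OLTP source
    dw  = sqlite3.connect(":memory:")   # stands in for the data warehouse

    src.execute("CREATE TABLE visits (visit_date TEXT, sex_code TEXT, charge TEXT)")
    src.executemany("INSERT INTO visits VALUES (?,?,?)",
                    [("2024-01-03", "M", "100.50"), ("2024-01-04", "2", "75.00")])

    dw.execute("CREATE TABLE visit_history (visit_date TEXT, sex TEXT, charge REAL)")

    # Extract from the source...
    rows = src.execute("SELECT visit_date, sex_code, charge FROM visits").fetchall()

    # ...transform: reconcile inconsistent code values and data types...
    sex_map = {"M": "MALE", "F": "FEMALE", "1": "MALE", "2": "FEMALE"}
    clean = [(d, sex_map.get(s, "UNKNOWN"), float(c)) for (d, s, c) in rows]

    # ...and load into warehouse history.
    dw.executemany("INSERT INTO visit_history VALUES (?,?,?)", clean)
    print(dw.execute("SELECT * FROM visit_history").fetchall())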

Page 21: Introduction to Data Warehousing

History: Inmon

FACTOIDS
• W.H. Inmon coined the term data warehouse.
• W.H. Inmon is recognized by many as the father of data warehousing.
• W.H. Inmon created the first and most commonly accepted definition of a data warehouse: “a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management's decisions.”
• Other firsts of W.H. Inmon:
– Wrote the first book on data warehousing
– Wrote the first magazine column on data warehousing
– Taught the first classes on data warehousing

Page 22: Introduction to Data Warehousing

History: Kimball 1

• Also in the 1990s, Ralph Kimball published The Data Warehouse Toolkit (ISBN-10: 0471153370), which popularized dimensional modeling.
• Dimensional modeling is based on the cube concept, a multi-dimensional view of data.
• The cube metaphor can only illustrate three dimensions; a dimensional model can have any number of dimensions.

[Figure: a cube used to represent multi-dimensional data]

Page 23: Introduction to Data Warehousing

History: Kimball 2

• Kimball implemented cubes as star schemas, which support querying data in multiple dimensions. (A minimal star-schema sketch follows.)
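
A minimal star-schema sketch in Python with sqlite3: a fact table joined to dimension tables and aggregated along two dimensions. The schema and data are invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")

    # Dimension tables: one row per member of each dimension.
    con.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER)")
    con.execute("CREATE TABLE dim_product (prod_key INTEGER PRIMARY KEY, category TEXT)")

    # Fact table: one row per event, with a foreign key to each dimension.
    con.execute("CREATE TABLE fact_sales (date_key INTEGER, prod_key INTEGER, units INTEGER, revenue REAL)")

    con.executemany("INSERT INTO dim_date VALUES (?,?,?)", [(1, 2024, 1), (2, 2024, 2)])
    con.executemany("INSERT INTO dim_product VALUES (?,?)", [(1, "BOOKS"), (2, "TOYS")])
    con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                    [(1, 1, 3, 30.0), (1, 2, 1, 15.0), (2, 1, 2, 20.0)])

    # "Slice and dice": aggregate the facts along any combination of dimensions.
    for row in con.execute("""SELECT d.month, p.category, SUM(f.revenue)
                              FROM fact_sales f
                              JOIN dim_date d    ON f.date_key = d.date_key
                              JOIN dim_product p ON f.prod_key = p.prod_key
                              GROUP BY d.month, p.category"""):
        print(row)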

Page 24: Introduction to Data Warehousing

History: Kimball 3

• Kimball’s books do not discuss the relational model in depth, but his dimensional model can be explained in relational terms.
• A star schema is a useful way to store data for quickly slicing and dicing it on multiple dimensions.
• Dimensional modeling and star schemas are frequently misunderstood and improperly implemented. Queries against incorrectly designed tables in a star schema can skew report results.
• In many marketing materials, the term OLAP has come to refer specifically to dimensional modeling.
• Star schemas are implemented as data marts so that they can be queried by users and applications. However, data marts aren’t necessarily star schemas.

Page 25: Introduction to Data Warehousing

History: Kimball 4

FACTOIDS
• Ralph Kimball holds a Ph.D. in electrical engineering from Stanford University.
• Kimball worked at the Xerox Palo Alto Research Center (PARC). PARC is where laser printing, Ethernet, object-oriented programming, and graphical user interfaces were invented.
• Kimball was a principal designer of the Xerox Star Workstation, the first personal computer to use windows, icons, and mice.

Page 26: Introduction to Data Warehousing

OLTP vs. OLAP

Operational Data (OLTP) vs. Data Warehouse (OLAP)

• OLTP: Data is normalized (3NF).
  OLAP: Data may be normalized, denormalized, or use dimensional models, application-specific data sets, or other designs.

• OLTP: Data is constantly updated.
  OLAP: Data represents a state at a point in time. Existing data does not change; new data can be added to history.

• OLTP: Typical operations are selects on small sets of records and inserts, updates, and deletes of individual records.
  OLAP: Typical operations are selects, sorts, groupings, and aggregations of large numbers of records, and inserts of thousands or millions of records.

• OLTP: All transactions are logged.
  OLAP: Inserts may not be logged at the record level; there are normally no updates or deletes.

• OLTP: B-tree indexes are used for performance.
  OLAP: Partitioning and bitmap indexes are used for performance.

• OLTP: Traditional development life cycle.
  OLAP: Heuristic and agile development.

• OLTP: Data is designed for the application.
  OLAP: Data is taken from some other application.

• OLTP: The date range of records is limited; old transactions are archived.
  OLAP: The date range of history tables can be many years.
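
The contrast in typical operations can be made concrete with a small runnable example using Python's sqlite3 module; the table names and data are invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL)")
    con.execute("CREATE TABLE order_history (region TEXT, amount REAL)")
    con.execute("INSERT INTO account VALUES (12345, 500.0)")
    con.executemany("INSERT INTO order_history VALUES (?,?)",
                    [("EAST", 10.0), ("EAST", 20.0), ("WEST", 5.0)])

    # Typical OLTP operation: update one record, located by primary key.
    con.execute("UPDATE account SET balance = balance - 100 WHERE account_id = 12345")

    # Typical OLAP operation: scan, group, and aggregate a large share of a table
    # (here only three rows; in a warehouse, thousands or millions).
    print(con.execute("SELECT region, SUM(amount) FROM order_history GROUP BY region").fetchall())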

Page 27: Introduction to Data Warehousing

Paradigm Shift: For Management

For Management
• The traditional development life cycle doesn’t work well when building a data warehouse. There is a discovery process; agile development works better.
• OLTP data was designed for a given purpose, but OLAP is created from data that was designed for some other purpose, not reporting. It is important to evaluate data content before designing applications.
• OLAP data may not be complete or precise per the application.
• Data integrated from different sources may be inconsistent:
– Different code values
– Different columns
– Different meanings of column names
• OLAP data tends to be much larger, requiring more resources.
• Storage, storage, storage…

Page 28: Introduction to Data Warehousing

Paradigm Shift: For DBAs

For DBAs
• Different system configurations (in Oracle, different initialization parameters)
• Transaction logging may not be used, and methods for recovery from failure are different.
• Different tuning requirements:
– Selects are high cardinality (a large percentage of rows)
– Massive sorting, grouping, and aggregation
– DML operations can involve thousands or millions of records
• Much more temporary space is needed for caching aggregations, sorts, and temporary tables.
• Different backup strategies are needed; backup frequency is based on ETL scheduling instead of transaction volume.
• DBAs may be required to add new partitions and archive old partitions in history tables.
• Storage, storage, storage…

Page 29: Introduction to Data Warehousing

Paradigm Shift: For Architects & Developers

For Architects and Developers
• Different logical modeling and schema design.
• Indexes are used differently (e.g. bitmap rather than b-tree).
• Extensive use of partitioning for history and other large tables.
• Different tuning requirements:
– Selects are high cardinality (a large percentage of rows)
– Lots of sorting, grouping, and aggregation
– DML operations can involve thousands or millions of records
• ETL processes are different from typical DML processes:
– Use different coding techniques
– Use packages, functions, and stored procedures, but rarely triggers or constraints
– Many steps per process
– Integrate data from multiple sources
• Iterative and incremental development process (agile development)

Page 30: Introduction to Data Warehousing

Paradigm Shift: For Analysts and Data Users

For Analysts and Data Users: All Good News
• A custom schema (data mart) can be created for each application per the user requirements.
• Data marts can be permanent, temporary, generalized, or project-specific.
• New data marts can be created quickly, typically in days instead of weeks or months.
• Data marts can easily be refreshed when new data is added to the data warehouse. Data mart refreshes can be scheduled or on demand.
• In addition to parameterized queries and SQL, there may be additional query tools and dashboards (e.g. Business Intelligence, Self-Service BI, data visualization).
• Several years of history can be maintained in a data warehouse, giving bigger samples.
• There is a consistent single source of truth for any given data set.

Page 31: Introduction to Data Warehousing

Architecture: Main Components

Components of a Data Warehouse

[Diagram: operational data (OLTP sources, ODS) → ETL → data warehouse (Staging, History, REF reference data) → ETL → OLAP data (data marts, data sets, reports)]

Page 32: Introduction to Data Warehousing

Architecture: Staging and ODS

Staging and ODS

[Diagram: the same architecture; this slide focuses on Staging and the ODS]

• New data is initially loaded into staging so that it can be processed into the data warehouse. (A toy staging load is sketched after this list.)
• Many options are available for getting operational data from internal or external sources into the staging area:
– SQL*Loader
– imp/exp/impdp/expdp
– Change Data Capture (CDC)
– Replication via materialized views
– Third-party ETL tools
• Staging contains a snapshot in time of operational data.
• An Operational Data Store (ODS) is an optional component used for near-real-time reporting.
• Transformation and integration of data in an ODS is limited.
• Less history (a shorter time span) is kept in an ODS.
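
A toy sketch of a staging load in Python with sqlite3: rows are copied from a source into a staging table stamped with the snapshot date. Table names and columns are invented for illustration.

    import sqlite3
    from datetime import date

    src     = sqlite3.connect(":memory:")   # operational source
    staging = sqlite3.connect(":memory:")   # staging area

    src.execute("CREATE TABLE patients (patient_id INTEGER, name TEXT)")
    src.executemany("INSERT INTO patients VALUES (?,?)", [(1, "A"), (2, "B")])

    # Staging rows carry the snapshot date, so each load is a point in time.
    staging.execute("CREATE TABLE stg_patients (snapshot_date TEXT, patient_id INTEGER, name TEXT)")

    snap = date.today().isoformat()
    rows = src.execute("SELECT patient_id, name FROM patients").fetchall()
    staging.executemany("INSERT INTO stg_patients VALUES (?,?,?)",
                        [(snap, pid, name) for (pid, name) in rows])

    print(staging.execute("SELECT * FROM stg_patients").fetchall())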

Page 33: Introduction to Data Warehousing

Architecture: History and Reference Data

History and Reference Data

[Diagram: the same architecture; this slide focuses on History and reference data (REF)]

• History includes all source data: no exclusions or integrity constraints.
• Partitioning is used to:
– manage extremely large tables
– improve performance of queries
– facilitate a “rolling window” of history (sketched after this list)
• Denormalization can be used to reduce the number of joins when selecting data from history.
• No surrogate keys: maintain all original code values in history.
• Reference data should also have history (e.g. changing ICD-9 codes over time).
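
A hypothetical sketch of the “rolling window” idea in plain Python: keep the most recent N monthly partitions of history and drop the oldest as each new month arrives. Real warehouses do this with table partitions rather than dictionaries; the months and window size here are invented.

    # month -> rows; stands in for 12 monthly partitions of a history table
    WINDOW_MONTHS = 12
    partitions = {f"2021-{m:02d}": [] for m in range(1, 13)}

    def add_month(month, rows):
        """Add the newest partition; roll the oldest one off the window."""
        partitions[month] = rows
        while len(partitions) > WINDOW_MONTHS:
            oldest = min(partitions)   # a real system would archive it first
            partitions.pop(oldest)

    add_month("2022-01", [("new", "rows")])   # 2021-01 rolls off
    print(min(partitions), "through", max(partitions))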

Page 34: Introduction to Data Warehousing

Architecture: Data Marts

Data Marts

[Diagram: the same architecture; this slide focuses on the data marts (DM)]

• Data marts are built per the requirements of users and applications.
• Selection criteria (conditions in the WHERE clause) are applied when creating data marts. (A sketch follows this list.)
• Logical data modeling is applied at the data mart level (e.g. denormalized tables, star schemas, analytic data sets).
• Integrity constraints can be applied at the data mart level.
• Any surrogate keys can be applied at the data mart level (e.g. patient IDs).
• Data marts can be Oracle, SQL Server, text files, SAS data sets, etc.
• Data marts can be permanent or temporary, for ongoing or one-time applications.
• Data mart refreshes can be scheduled or on demand.
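
A minimal sketch of carving a data mart out of history with selection criteria, using sqlite3's CREATE TABLE ... AS SELECT; the schema and values are invented for illustration.

    import sqlite3

    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE visit_history (visit_date TEXT, clinic TEXT, charge REAL)")
    dw.executemany("INSERT INTO visit_history VALUES (?,?,?)",
                   [("2023-01-05", "EAST", 100.0),
                    ("2024-02-10", "EAST", 75.0),
                    ("2024-03-15", "WEST", 50.0)])

    # The data mart applies WHERE-clause selection criteria to history.
    dw.execute("""CREATE TABLE dm_east_2024 AS
                  SELECT visit_date, charge
                  FROM visit_history
                  WHERE clinic = 'EAST' AND visit_date >= '2024-01-01'""")

    print(dw.execute("SELECT * FROM dm_east_2024").fetchall())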

Page 35: Introduction to Data Warehousing

Emerging Technologies

Emerging technologies that are having an impact on data warehousing:
• Massively Parallel Processing (MPP)
• In-Memory Databases (IMDB)
• Unstructured Databases
• Column-Oriented Databases
• Database Appliances
• Data Access Tools
• Cloud Database Services

Page 36: Introduction to Data Warehousing

Emerging Technologies: MPP

Massively Parallel Processing (MPP)
• Data is partitioned over hundreds or even thousands of server nodes.
• A controller node manages query execution.
• A query is passed to all nodes simultaneously.
• Data is retrieved from all nodes and assembled to produce query results. (A toy scatter-gather sketch follows this list.)
• MPP systems automatically partition and distribute data using their own algorithms. Developers and architects need only be concerned with conventional data modeling and DML operations.
• MPP systems make sense for OLAP and data warehousing, where queries touch very large numbers of records.
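
A toy scatter-gather sketch of MPP query execution in plain Python: data is partitioned across hypothetical "nodes", each node computes a partial aggregate independently, and a controller merges the partials. Real MPP systems do this across physical servers.

    nodes = [
        [("EAST", 10.0), ("WEST", 5.0)],   # partition held by node 1
        [("EAST", 20.0)],                  # partition held by node 2
        [("WEST", 7.5), ("EAST", 2.5)],    # partition held by node 3
    ]

    def node_partial(rows):
        """Runs independently on each node: a partial SUM(amount) by region."""
        out = {}
        for region, amount in rows:
            out[region] = out.get(region, 0.0) + amount
        return out

    # Controller node: gather the partials and merge into the final result.
    final = {}
    for partial in map(node_partial, nodes):
        for region, subtotal in partial.items():
            final[region] = final.get(region, 0.0) + subtotal

    print(final)   # {'EAST': 32.5, 'WEST': 12.5}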

Page 37: Introduction to Data Warehousing

Emerging Technologies: IMDB

In-Memory Database
• Data is stored in random access memory (RAM) rather than on disk or SSD.
• Memory is accessed much more quickly, reducing seek times.
• Traditional RDBMS software often uses a memory cache when processing data, but it is optimized for a limited cache with most data stored on disk.
• IMDB software has algorithms optimized to read data from memory. (A small-scale taste of the idea is sketched after this list.)
• Database replication with failover is typically required because of the volatility of computer memory.
• The cost of RAM has dropped considerably in recent years, making IMDB systems more feasible.
• Microsoft SQL Server has an In-Memory option; tables must be defined as memory-optimized to use this feature.
• Oracle has recently announced the upcoming availability of its In-Memory option.
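
Python's sqlite3 can hold an entire database in RAM, which gives a small-scale taste of the in-memory idea; nothing survives the process, which is why production IMDBs pair memory residency with replication and failover.

    import sqlite3

    con = sqlite3.connect(":memory:")   # the whole database lives in RAM
    con.execute("CREATE TABLE t (x INTEGER)")
    con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

    # Queries never touch disk; when the process exits, the data is gone.
    print(con.execute("SELECT COUNT(*), SUM(x) FROM t").fetchone())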

Page 38: Introduction to Data Warehousing

Emerging Technologies: Unstructured Databases

Unstructured Databases
• Unstructured databases, sometimes referred to as NoSQL databases, support vast amounts of text data and extremely fast text searches.
• Unstructured databases utilize massively parallel processing (MPP) and extensive text indexing.
• Unstructured databases do not fully support relational features such as complex data modeling, join operations, and referential integrity. However, these databases are evolving to incorporate additional relational capabilities.
• Oracle, Microsoft, and other RDBMS vendors are creating hybrid database systems that incorporate unstructured data with relational database systems.
• Unstructured databases are useful for very fast text searches on very large amounts of data; they are generally not useful for complex transaction processing, analyses, and informatics.

Page 39: Introduction to Data Warehousing

Emerging Technologies: Big Data

FACTOIDS
• Big data became an issue as early as 1880 with the U.S. Census, which took several years to tabulate with then-existing methods.
• The term information explosion was first used in the Lawton Constitution, a small-town Oklahoma newspaper, in 1941.
• The term big data was used for the first time in a 1997 article by NASA researchers Michael Cox and David Ellsworth. The article discussed the inability of the computer systems of the day to handle the increasing amounts of data.
• Google was a pioneer in creating modern hardware and software solutions for big data.
• Parkinson’s Law of Data: “Data expands to fill the space available.”
• 1 exabyte = 1000^6 bytes = 10^18 bytes = 1000 petabytes = 1 billion gigabytes.

Page 40: Introduction to Data Warehousing

Emerging Technologies: Column-Oriented

Column-Oriented Databases
• Data in a typical relational database is organized by row. The row paradigm is used for physical storage as well as the logical organization of data.
• Column-oriented databases physically organize data by column while still being able to present data within rows. (The two layouts are sketched after this list.)
• Data is stored on disk in blocks. While row-oriented databases store the contents of a row in a block, column-oriented databases store the contents of a column in a block.
• Each column has row and table identifiers so that columns can be combined to produce rows of data in a table.
• Since most queries select a subset of columns (rather than entire rows), column-oriented databases tend to perform much better for analytical processing (e.g. querying a data mart).
• Microsoft SQL Server and Oracle Exadata have support for column-based data storage.
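
A sketch of row-oriented vs. column-oriented layout for the same (invented) table in plain Python. An analytical query that touches one column reads far less data in the columnar layout, which is the core of the performance argument.

    rows = [                      # row-oriented: whole records stored together
        {"id": 1, "region": "EAST", "amount": 10.0},
        {"id": 2, "region": "WEST", "amount": 5.0},
        {"id": 3, "region": "EAST", "amount": 20.0},
    ]

    columns = {                   # column-oriented: each column stored together
        "id":     [1, 2, 3],
        "region": ["EAST", "WEST", "EAST"],
        "amount": [10.0, 5.0, 20.0],
    }

    # SELECT SUM(amount): the row layout must visit every whole record,
    # while the column layout reads a single column.
    print(sum(r["amount"] for r in rows))
    print(sum(columns["amount"]))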

Page 41: Introduction to Data Warehousing

Emerging Technologies: Appliances

Database Appliances
• A database appliance is an integrated, preconfigured package of RDBMS software and hardware.
• The most common type of database appliance is a data warehouse appliance.
• Most major database vendors, including Microsoft and Oracle, and their hardware partners package and sell database appliances for data warehousing.
• Data warehouse appliances utilize massively parallel processing (MPP).
• Database appliances generally do not scale well outside of the purchased configuration. For example, you generally don’t add storage to a database appliance.
• A database appliance removes the burden of performance tuning. Conversely, database administrators have less flexibility.
• A database appliance can be a cost-effective solution for data warehousing in many situations.

Page 42: Introduction to Data Warehousing

Emerging Technologies: Data Access Tools

Data Access Tools
• Business Intelligence (BI) tools allow users to view and access data, create aggregations and summaries, create reports, and view dashboards with current data.
• BI tools typically sit on top of data marts created by the architects and developers. Data marts that support BI are typically star schemas.
• Newer Self-Service BI tools add capabilities such as allowing users to integrate multiple data sources and do further analysis on result sets from previous analyses.
• Data visualization tools allow users to view data in various graphs.
• Newer tools allow users to access and analyze data from multiple form factors, including smartphones and tablets.
• Data access, BI, and data visualization tools do not always provide the capability to perform complex analyses or fulfill specific requirements of complex reports (e.g. complex statistical analyses or studies submitted to journals). Programming skills are frequently still required.

Page 43: Introduction to Data Warehousing

Emerging Technologies: Cloud Databases

Cloud Database Services
• Oracle, Microsoft, and other database vendors offer cloud database services.
• The cloud service performs all database administrative tasks:
– Replicating data on multiple servers
– Making backups
– Scaling growing databases
– Performance tuning
• Cloud services can be useful for prototyping and heuristic development. A large commitment to hardware purchases and administrative staff can be postponed for later assessment.
• Cloud services could result in considerable cost savings for some organizations.
• A cloud hybrid database has database components both in the cloud and on local servers.
• Cloud services may limit administrative options and flexibility vs. having your own DBAs.
• Cloud services may not meet regulatory requirements for security and storage in some industries (e.g. medical data).

Page 44: Introduction to Data Warehousing

[Diagram (recap): operational data (OLTP sources, ODS) → ETL → data warehouse (Staging, History, REF) → ETL → OLAP data (data marts, data sets, reports)]