2.6(2) Data Warehousing


    Introduction

Over the last 20 years, $1 trillion has been invested in new computer systems to gain competitive advantage. The vast majority of these systems have automated business processes to make them faster, cheaper, and more responsive to the customer. Electronic point of sale (EPOS) at supermarkets, itemized billing at telecommunication companies (telcos), and mass-market mailing at catalog companies are some examples of such operational systems. These systems computerized the day-to-day operations of business organizations. Some characteristics of operational systems are as follows:

    Most organizations have a number of individual operational systems (databases, applications)

    On-Line Transaction Processing (OLTP) systems capture the business transactions that occur.

An operational system is a system that is used daily (perhaps constantly) to perform routine operations - part of the normal business processes.

    Examples: Order Entry, Purchasing, Stock/Bond trading, bank operations.

Users make short-term, localized business decisions based on operational data, e.g., "Can I fill this order based on the current units in inventory?"

Presently almost all businesses have operational systems, and these systems no longer give them any competitive advantage. These systems have gathered a vast amount of data over the years, and companies are now realizing the importance of this hidden treasure of information. Efforts are now on to tap into this information to improve the quality of their decision-making.

A data warehouse is a repository of data collected from the various operational systems of an organization. This data is then comprehensively analyzed to gain competitive advantage; the analysis is used primarily for decision making at the top level.

From being just a passing fad, data warehousing technology has grown much in scale and reputation in the past few years, as evidenced by the increasing number of products, vendors, organizations, and even books devoted to the subject. Enterprises that have successfully implemented data warehouses find them strategic and often wonder how they ever managed to survive without them in the past.

As early as 1995, a Gartner Group survey of Fortune 500 IT managers found that 90% of all organizations had planned to implement data warehouses by 1998.

    Data Warehousing Systems

A data warehousing system can perform advanced analyses of operational data without impacting operational systems. OLTP systems are very fast and efficient at recording business transactions, but not so good at providing answers to high-level strategic questions.

    Component Systems

    Legacy Systems

Any information system currently in use that was built using previous technology generations. Most legacy systems are operational in nature, largely because the automation of transaction-oriented business processes had long been the priority of IT projects.


    Source Systems

Any system from which data is taken for a data warehouse. A source system is often called a legacy system in a mainframe environment.

    Operational Data Stores (ODS)

    An ODS is a collection of integrated databases designed to support the monitoring of operations.

Unlike the databases of OLTP applications (which are function oriented), the ODS contains subject-oriented, volatile, and current enterprise-wide detailed information. It serves as a system of record that provides comprehensive views of data in operational sources. Like data warehouses, ODSs are integrated and subject-oriented. However, an ODS is always current and is constantly updated. The ODS is an ideal data source for a data warehouse, since it already contains integrated operational data as of a given point in time. In short, an ODS is an integrated collection of clean data destined for the data warehouse.

    Definition

Data warehouses are mostly populated with periodic migrations of data from operational systems. The second source is made up of external, frequently purchased, databases. Examples of this data would include lists of income and demographic information. This purchased information is linked with internal data about customers to develop a good customer profile.

    A Data Warehouse is a

    Subject-oriented

    Integrated

    Time-variant

    Non-volatile

    collection of data in support of management decisions.

    Subject Oriented

OLTP databases usually hold information about small subsets of the organization. For example, a retailer might have separate order entry systems and databases for retail, catalog, and outlet sales. Each system will support queries about the information it captures. But if somebody wants to find out details of all sales, then these separate systems are not adequate. To address this type of situation, your data warehouse database should be subject-oriented, organized into subject areas like sales, rather than around OLTP data sources.

A data warehouse is organized around major subjects such as customers, products, and sales. Data are organized according to subject instead of application. For example, an insurance company using a data warehouse would organize its data by customer, premium, and claim instead of by the different products (auto, life, property, etc.).


    Integrated

A data warehouse is usually constructed by integrating multiple, heterogeneous sources, such as relational databases, flat files, and OLTP files. When data resides in many separate applications in the operational environment, the encoding of data is often inconsistent. For example, in the system above, the retail system uses a numeric 7-digit code for products, the outlet system's code consists of 9 alphanumerics, and the catalog system uses 4 alphabetic and 4 numeric characters. To create a useful subject area, the source data must be integrated. There is no need to change the coding in these systems, but there must be some mechanism to modify the data coming into the data warehouse and assign a common coding scheme.
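As a rough sketch of such a mechanism (the table and column names here are illustrative, not from the text), the staging area could keep a cross-reference table that maps each source system's native product code to one common warehouse code:

    -- Hypothetical cross-reference table: one row per product, holding the
    -- retail (7-digit numeric), outlet (9-character alphanumeric), and
    -- catalog (4 letters + 4 digits) codes alongside a single common key.
    CREATE TABLE product_code_map (
        common_product_key INTEGER PRIMARY KEY,
        retail_code        CHAR(7),
        outlet_code        CHAR(9),
        catalog_code       CHAR(8)
    );

    -- Incoming retail records are translated to the common coding scheme on
    -- their way into the warehouse; the source system itself is not changed.
    SELECT m.common_product_key,
           s.sale_date,
           s.sale_amount
    FROM   retail_sales_stage s
    JOIN   product_code_map m
           ON m.retail_code = s.product_code;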

    Nonvolatile

Unlike operational databases, warehouses primarily support reporting, not data capture. A data warehouse is always a physically separate store of data. Due to this separation, data warehouses do not require transaction processing, recovery, concurrency control, etc. The data are not updated or changed in any way once they enter the data warehouse, but are only loaded, refreshed, and accessed for queries.

    Time Variant

Data are stored in a data warehouse to provide historical perspective. Every key structure in the data warehouse contains, implicitly or explicitly, an element of time. A data warehouse generally stores data that is 5-10 years old, to be used for comparisons, trends, and forecasting.

    Operational Systems vs Data Warehousing Systems

Operational                                        | Data Warehouse
Holds current data                                 | Holds historic data
Data is dynamic                                    | Data is largely static
Read/write accesses                                | Read-only accesses
Repetitive processing                              | Ad hoc, complex queries
Transaction driven                                 | Analysis driven
Application oriented                               | Subject oriented
Used by clerical staff for day-to-day operations   | Used by top managers for analysis
Normalized data model (ER model)                   | Denormalized data model (dimensional model)
Must be optimized for writes and small queries     | Must be optimized for queries involving a large portion of the warehouse

    Advantages of Data Warehousing

    Potential high Return on Investment

    Competitive Advantage

    Increased Productivity of Corporate Decision Makers

    Problems with Data Warehousing

Underestimation of resources for data loading

Hidden problems with source systems

    Required data not captured

    Increased end-user demands

    High maintenance

    Long duration projects

    Complexity of integration

    Data Warehouse Architecture*

    A typical data warehousing architecture is illustrated below:


    DATA WAREHOUSE COMPONENTS & ARCHITECTURE

The data in a data warehouse comes from the operational systems of the organization as well as from other external sources. These are collectively referred to as source systems. The data extracted from source systems is stored in an area called the data staging area, where the data is cleaned, transformed, combined, and de-duplicated to prepare it for use in the data warehouse. The data staging area is generally a collection of machines where simple activities like sorting and sequential processing take place. The data staging area does not provide any query or presentation services. As soon as a system provides query or presentation services, it is categorized as a presentation server. A presentation server is the target machine on which the data loaded from the data staging area is organized and stored for direct querying by end users, report writers, and other applications. The three different kinds of systems that are required for a data warehouse are:

1. Source systems
2. Data staging area
3. Presentation servers

The data travels from source systems to presentation servers via the data staging area. The entire process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer). Oracle's ETL tool is called Oracle Warehouse Builder (OWB), and MS SQL Server's ETL tool is called Data Transformation Services (DTS). A typical architecture of a data warehouse is shown below:


    Each component and the tasks performed by them are explained below:

    OPERATIONAL DATA

The data for the data warehouse is supplied from:

Data from mainframe systems in the traditional network and hierarchical formats.

Data from relational DBMSs such as Oracle and Informix.

In addition to these internal data, operational data also includes external data obtained from commercial databases and databases associated with suppliers and customers.

    LOAD MANAGER

The load manager performs all the operations associated with extraction and loading of data into the data warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse. The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom-built programs.

    WAREHOUSE MANAGER

The warehouse manager performs all the operations associated with the management of data in the warehouse. This component is built using vendor data management tools and custom-built programs.

The operations performed by the warehouse manager include:

Analysis of data to ensure consistency

Transformation and merging of source data from temporary storage into data warehouse tables

Creation of indexes and views on the base tables

Denormalization

Generation of aggregations

Backing up and archiving of data


In certain situations, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate.
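For illustration only (the object names are invented, not from the text), the kind of index and aggregate creation carried out by the warehouse manager might look like this in SQL:

    -- Index on a base fact table to speed up common joins and filters.
    CREATE INDEX idx_sales_product ON sales_fact (product_key);

    -- A lightly summarized (aggregated) table generated from the detailed
    -- fact table: sales rolled up by product and day.
    CREATE TABLE sales_by_product_day AS
    SELECT product_key,
           date_key,
           SUM(sale_amount) AS total_amount,
           SUM(sale_units)  AS total_units
    FROM   sales_fact
    GROUP BY product_key, date_key;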

    QUERY MANAGER

The query manager performs all operations associated with the management of user queries. This component is usually constructed using vendor end-user access tools, data warehousing monitoring tools, database facilities, and custom-built programs. The complexity of a query manager is determined by the facilities provided by the end-user access tools and the database.

    DETAILED DATA

This area of the warehouse stores all the detailed data in the database schema. In most cases, detailed data is not stored online but is aggregated to the next level of detail. However, the detailed data is added regularly to the warehouse to supplement the aggregated data.

LIGHTLY AND HIGHLY SUMMARIZED DATA

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient, as it will be subject to change on an ongoing basis in order to respond to changing query profiles. The purpose of the summarized information is to speed up query performance. The summarized data is updated continuously as new data is loaded into the warehouse.

    ARCHIVE AND BACK UP DATA

This area of the warehouse stores detailed and summarized data for the purpose of archiving and backup. The data is transferred to storage archives such as magnetic tapes or optical disks.

    META DATA

The data warehouse also stores all the metadata (data about data) definitions used by all processes in the warehouse. It is used for a variety of purposes, including:

In the extraction and loading process, metadata is used to map data sources to a common view of information within the warehouse.

In the warehouse management process, metadata is used to automate the production of summary tables.

As part of the query management process, metadata is used to direct a query to the most appropriate data source.

The structure of metadata will differ for each process, because the purpose is different. More about metadata will be discussed in later lecture notes.

    END-USER ACCESS TOOLS

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools. Some examples of end-user access tools are:

    Reporting and Query Tools

    Application Development Tools

    Executive Information Systems Tools

    Online Analytical Processing Tools

    Data Mining Tools

THE ETL (EXTRACT, TRANSFORM, LOAD) PROCESS


In this section we will discuss the four major processes of the data warehouse: extract (take data from the operational systems and bring it to the data warehouse), transform (convert the data into the internal format and structure of the data warehouse), cleanse (make sure it is of sufficient quality to be used for decision making), and load (put the cleansed data into the data warehouse). The four processes from extraction through loading are often referred to collectively as data staging.

    EXTRACT

Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such commercial product. The user of one of these tools typically has an easy-to-use windowed interface by which to specify the following:

    Which files and tables are to be accessed in the source database?

Which fields are to be extracted from them? This is often done internally by an SQL SELECT statement (a minimal sketch follows this list).

    What are those to be called in the resulting database?

    What is the target machine and database format of the output?

    On what schedule should the extraction process be repeated?
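A minimal sketch of such an extraction, assuming a hypothetical orders table in the source database and a last_updated column used as the scheduling cutoff, might be:

    -- Extract only the fields relevant to decision making from the source
    -- order-entry table; rows are limited to those changed since the last run.
    SELECT order_id,
           customer_id,
           product_code,
           order_date,
           quantity,
           amount
    FROM   orders
    WHERE  last_updated >= DATE '2002-09-18';   -- cutoff supplied by the schedule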

TRANSFORM

The operational databases can be based on any set of priorities, which keeps changing with the requirements. Therefore, those who develop a data warehouse based on these databases are typically faced with inconsistency among their data sources. The transformation process deals with rectifying any such inconsistency. One of the most common transformation issues is attribute naming inconsistency: it is common for a given data element to be referred to by different names in different databases. Employee Name may be EMP_NAME in one database and ENAME in another. Thus one set of data names is picked and used consistently in the data warehouse. Once all the data elements have the right names, they must be converted to common formats. The conversion may encompass the following:

Characters must be converted from ASCII to EBCDIC or vice versa.

Mixed text may be converted to all uppercase for consistency.

Numerical data must be converted into a common format.

Data formats have to be standardized.

Measurements may have to be converted (e.g., Rs to $).

Coded data (Male/Female, M/F) must be converted into a common format.

All these transformation activities are automated, and many commercial products are available to perform the tasks. DataMAPPER from Applied Database Technologies is one such comprehensive tool.
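As a hedged example of these conversions expressed directly in SQL (the column names, target formats, and exchange rate below are assumptions, not taken from the text), a staging-to-warehouse transformation might standardize names, case, codes, and currency like this:

    -- Pick one agreed column name for EMP_NAME/ENAME, fold mixed text to
    -- upper case, and map the M/F and Male/Female codings to one convention.
    SELECT emp_id,
           UPPER(emp_name)                              AS employee_name,
           CASE
               WHEN UPPER(gender) IN ('M', 'MALE')   THEN 'M'
               WHEN UPPER(gender) IN ('F', 'FEMALE') THEN 'F'
           END                                          AS gender_code,
           salary_rs / 45.0                             AS salary_usd  -- assumed Rs-to-$ rate
    FROM   employee_stage;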

    CLEANSING

Information quality is the key consideration in determining the value of information. The developer of the data warehouse is not usually in a position to change the quality of its underlying historic data, though a data warehousing project can put a spotlight on data quality issues and lead to improvements for the future. It is, therefore, usually necessary to go through the data entered into the data warehouse and make it as error-free as possible. This process is known as data cleansing.

Data cleansing must deal with many types of possible errors. These include missing data and incorrect data at one source, and inconsistent data and conflicting data when two or more sources are involved. There are several algorithms for cleaning the data, which will be discussed in the coming lecture notes.
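As a small illustration of the kinds of checks involved (the table and column names are hypothetical), SQL can flag missing and conflicting data in the staging area before rows are allowed into the warehouse:

    -- Rows with missing mandatory fields.
    SELECT *
    FROM   customer_stage
    WHERE  customer_name IS NULL
       OR  postal_code IS NULL;

    -- Conflicting data: the same customer arriving from two sources
    -- with different birth dates.
    SELECT customer_id
    FROM   customer_stage
    GROUP BY customer_id
    HAVING COUNT(DISTINCT birth_date) > 1;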

    LOADING

Loading often implies physical movement of the data from the computer(s) storing the source database(s) to the computer that will store the data warehouse database, assuming they are different. This takes place immediately after the extraction phase. The most common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is the API from Oracle that provides the features to perform the ETL tasks on an Oracle data warehouse.
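In SQL terms, once the cleansed data sits in a staging table on the warehouse machine, the final load can be sketched as a simple INSERT ... SELECT (the table and column names are illustrative):

    -- Move cleansed, transformed rows from the staging table into the
    -- warehouse fact table.
    INSERT INTO sales_fact (date_key, product_key, store_key, sale_units, sale_amount)
    SELECT date_key,
           product_key,
           store_key,
           sale_units,
           sale_amount
    FROM   sales_stage_clean;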

    Data Warehouse Design

    An introduction to Dimensional Modeling

Data warehouses are not easy to build. Their design requires a way of thinking that is just the opposite of the manner in which traditional computer systems are developed. Their construction requires radical restructuring of vast amounts of data, often of dubious or inconsistent quality, drawn from numerous heterogeneous sources. Their implementation strains the limits of today's IT. Not surprisingly, a large number of data warehouse projects fail. Successful data warehouses are built for just one reason: to answer business questions. The type of questions to be addressed will vary, but the intention is always the same. Projects that deliver new and relevant information succeed. Projects that do not, fail. [6]

To deliver answers to businesspeople, one must understand their questions. The data warehouse design fuses business knowledge and technology know-how. The design of the data warehouse will mean the difference between success and failure.

The design of the data warehouse requires a deep understanding of the business. Yet the task of design is undertaken by IT professionals, not business decision makers. Is it reasonable to expect the project to succeed? The answer is yes. The key is learning to apply technology toward business objectives. Most computer systems are designed to capture data; data warehouses are designed for getting data out. This fundamental difference suggests that the data warehouse should be designed according to a different set of principles. Dimensional modeling is the name of a logical design technique often used for data warehouses. It is different from entity-relationship modeling. ER modeling is very useful for transaction capture in OLTP systems.

Dimensional modeling is the only viable technique for delivering data to the end users in a data warehouse.

Comparison between ER and Dimensional Modeling

The characteristics of the ER model are well understood; its ability to support operational processes is its underlying characteristic. Conventional ER models are constructed to:

    Remove redundancy in the data model

    Facilitate retrieval of individual records having certain critical identifiers and

    Therefore, optimize online transaction processing (OLTP) performance


In contrast, the dimensional model is designed to support the reporting and analytical needs of a data warehouse system.

Why is ER Modeling Not Suitable for Data Warehouses?

End users cannot understand or remember an ER model. End users cannot navigate an ER model. There is no graphical user interface (GUI) that takes a general ER diagram and makes it usable by end users.

ER models are not optimized for complex, ad hoc queries; they are optimized for repetitive, narrow queries.

Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data, because it leads to highly normalized relational tables.

    Introduction to Dimensional Modeling Concepts

The objective of dimensional modeling is to represent a set of business measurements in a standard framework that is easily understandable by end users. A dimensional model contains the same information as an ER model but packages the data in a symmetric format whose design goals are:

    User understandability

Query performance

Resilience to change

The main components of a dimensional model are fact tables and dimension tables. A fact table is the primary table in each dimensional model and is meant to contain measurements of the business. The most useful facts are numeric and additive. Every fact table represents a many-to-many relationship, and every fact table contains a set of two or more foreign keys that join to their respective dimension tables.

A fact depends on many factors. For example, sale_amount, a fact, depends on product, location, and time. These factors are known as dimensions; dimensions are the factors on which a given fact depends. The sale_amount fact can also be thought of as a function of three variables:

sales_amount = f(product, location, time)

Likewise, in a sales fact table we may include other facts like sales_unit and cost. Dimension tables are companion tables to a fact table in a star schema. Each dimension table is defined by its primary key, which serves as the basis for referential integrity with any given fact table to which it is joined. Most dimension tables contain textual information.

To understand the concepts of facts, dimensions, and the star schema, consider the following scenario: imagine standing in the marketplace, watching the products being sold, and writing down the quantity sold and the sales amount each day for each product in each store. Note that a measurement needs to be taken at every intersection of all dimensions (day, product, and store). The information gathered can be stored in the following fact table:
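As a sketch of such a fact table and its dimension tables (the exact names and column types are assumptions, guided by the facts and dimensions named in the next paragraph):

    -- Dimension tables: one row per day, product, and store.
    CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, calendar_date DATE);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name  VARCHAR(50));
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name    VARCHAR(50));

    -- Fact table: one row per day/product/store intersection, holding the
    -- numeric, additive measurements.
    CREATE TABLE sales_fact (
        date_key     INTEGER REFERENCES date_dim (date_key),
        product_key  INTEGER REFERENCES product_dim (product_key),
        store_key    INTEGER REFERENCES store_dim (store_key),
        sale_units   INTEGER,
        sale_amount  DECIMAL(10, 2),
        cost         DECIMAL(10, 2),
        PRIMARY KEY (date_key, product_key, store_key)
    );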


The facts are Sale_Unit, Sale_Amount, and Cost (note that all are numeric and additive), which depend on the dimensions Date, Product, and Store. The details of the dimensions are stored in dimension tables.

    Note the following points about the star schema:

    The most popular schema design for data warehouses is the Star Schema

    Each dimension is stored in a dimension table and each entry is given its own unique identifier.

The dimension tables are related to one or more fact tables.

The fact table contains a composite key made up of the identifiers (primary keys) from the dimension tables.

The fact table also contains facts about the given combination of dimensions, for example a combination of store_key, date_key, and product_key giving the amount of a certain product sold on a given day at a given store.

The fact table has foreign keys to all dimension tables in a star schema. In this example there are three foreign keys (date key, product key, and store key); a query joining across them is sketched after this list.

Fact tables are normalized, whereas dimension tables are not.

Fact tables are very large compared to dimension tables. [7]
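To make the role of the foreign keys concrete, a typical star join against the tables sketched earlier (names remain illustrative) answers a business question by constraining the dimensions and summing the facts:

    -- Total units and amount of a given product sold per store in September 2002.
    SELECT s.store_name,
           SUM(f.sale_units)  AS units_sold,
           SUM(f.sale_amount) AS amount_sold
    FROM   sales_fact  f
    JOIN   date_dim    d ON d.date_key    = f.date_key
    JOIN   product_dim p ON p.product_key = f.product_key
    JOIN   store_dim   s ON s.store_key   = f.store_key
    WHERE  p.product_name = 'Product1'
      AND  d.calendar_date BETWEEN DATE '2002-09-01' AND DATE '2002-09-30'
    GROUP BY s.store_name;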

    The facts in a star schema are of the following three types:

    Fully-additive

    Semi-additive

    Non-additive

    The facts in the above schema are fully-additive.

    Designing a Dimensional Model: Steps Involved

Step 1 - Select the Business Process
The first step in the design is to decide what business process (or processes) to model, by combining an understanding of the business requirements with an understanding of the available data. [8]

Step 2 - Declare the Grain
Once the business process has been identified, the data warehouse team faces a serious decision about the granularity.


What level of detail must be made available in the dimensional model? The grain of a fact table represents the level of detail of the information in that fact table. Declaring the grain means specifying exactly what an individual fact table record represents. It is recommended that the most atomic information captured by a business process be used. Atomic data is the most detailed information collected; the more detailed and atomic the fact measurements are, the more we know and the better we can analyze the data. In the star schema discussed above, the most detailed data would be the transaction line-item detail on the sales receipt:

(date, time, product code, product name, price/unit, number of units, amount)
18-SEP-2002, 11.02, p1, dettol soap, 15, 2, 30

But in the above dimensional model we provide sales data rolled up by product (all records corresponding to the same product are combined) in a store on a day. A typical fact table record would look like this:

18-SEP-2002, Product1, Store1, 150, 600

This record tells us that on 18th September, 150 units of Product1 were sold for Rs. 600 from Store1. [9]
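The roll-up from receipt-line grain to the daily product/store grain is just an aggregation; assuming a hypothetical receipt_lines table holding the atomic data, it might look like:

    -- Roll atomic receipt lines up to one row per date, product, and store,
    -- which is the grain of the fact table described above.
    SELECT sale_date,
           product_code,
           store_code,
           SUM(number_of_units) AS sale_units,
           SUM(amount)          AS sale_amount
    FROM   receipt_lines
    GROUP BY sale_date, product_code, store_code;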

Step 3 - Choose the Dimensions
Once the grain of the fact table has been chosen, the date, product, and store dimensions are readily identified. It is often possible to add more dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only one value under each combination of the primary dimensions. If an additional dimension violates the grain by causing additional fact rows to be generated, then the grain must be revised to accommodate this dimension.

Step 4 - Identify the Facts
The first step in identifying facts is to examine the business and identify the transactions that may be of interest. In our example, the electronic point of sale (EPOS) transactions give us two facts: quantity sold and sale amount.

Strengths of Dimensional Modeling

The dimensional model has a number of important data warehouse advantages that the ER model lacks [9]. Its strengths are:

The dimensional model is a predictable, standard framework. Report writers, query tools, and end-user interfaces can all make strong assumptions about it to make the user interfaces more understandable and to make processing more efficient.

Star schemas can withstand changes in user behavior. All dimensions can be thought of as symmetrically equal entry points into the fact table. The logical design can be done independent of the expected query patterns.

    It is gracefully extensible to accommodate new data elements and new design decisions.

All existing tables can be changed by either adding new data rows or by ALTER TABLE commands. Data should not have to be reloaded.

No query or reporting tool needs to be reprogrammed to accommodate the change.

Old applications continue to run without yielding different results.

The following graceful changes can be made to the design after the data warehouse is up and running:

Adding new facts, as long as they are consistent with the grain of the existing fact table

Adding new dimensions, as long as there is a single value of that dimension defined for each existing fact record

    Adding new, unanticipated dimension attributes


Standard approaches are available for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be easily programmed into report writers, query tools, and other user interfaces. These modeling situations include:

Slowly changing dimensions, where a dimension such as product or customer evolves slowly. Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment and requirements.

Heterogeneous products, where a business like a bank needs to track a number of different lines of business.

    Event handling databases, where the fact table turns out to be factless

Details about the above modeling situations will be provided in a later article.

Support for aggregates. Aggregates are summary records that are logically redundant with base-level data already in the data warehouse but are used to enhance query performance. If you don't aggregate records, then you might be spending lots of money on hardware upgrades to tackle performance problems that could otherwise be addressed by aggregates. All the aggregate management software packages and aggregation navigation utilities depend on a very specific single structure of fact and dimension tables that is absolutely dependent on the dimensional approach. If you are not using the dimensional approach, you can't benefit from these tools. (See chapter 7 of the textbook for details.)

A dimensional model can be implemented in a relational database, a multidimensional database, or even an object-oriented database.

    Snowflake and Starflake Schemas

In dimensional modeling the dimension tables are in denormalized form, whereas fact tables are in normalized form.

Snowflaking is removing low-cardinality textual attributes (attributes with a low ratio of distinct values to table cardinality) from dimension tables and placing them in secondary dimension tables. For instance, a product category can be treated this way and physically removed from the low-level product dimension table by normalizing the dimension table. This is particularly done on large dimension tables. Snowflaking a dimension means normalizing it and making it more manageable by reducing its size, but this may have an adverse effect on performance, as joins need to be performed. If all the dimensions in a star schema are normalized, the resulting schema is called a snowflake schema, and if only a few dimensions are normalized, we call it a starflake schema.
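A minimal sketch of snowflaking a product dimension (the table and column names are assumed, not from the text): the low-cardinality category attribute moves out into its own table and is referenced by a key.

    -- Snowflaked (normalized) design: category text is stored once, and the
    -- product dimension refers to it by key instead of repeating the text.
    CREATE TABLE category_dim (
        category_key  INTEGER PRIMARY KEY,
        category_name VARCHAR(50)
    );

    CREATE TABLE product_dim_snowflaked (
        product_key   INTEGER PRIMARY KEY,
        product_name  VARCHAR(50),
        category_key  INTEGER REFERENCES category_dim (category_key)
    );

    -- Queries that filter or group by category now need an extra join,
    -- which is the performance cost mentioned above.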

    Multidimensional Databases and MOLAP

    Database Evolution

Flat files, hierarchical, and network

Relational

Distributed relational

    Multidimensional

    Multidimensional Databases

Result of research at MIT in the 1960s

    Database engine of choice for data analysis applications (OLAP)

    OLAP using MDDBs is called MOLAP

Business processes are multidimensional in the sense that managers ask questions about product sales in different regions over specific time periods.

Dimensions: Product, Region, Time period
Fact or Measure: Sale

An MDDB is a computer software system designed to allow for the efficient and convenient storage and retrieval of large volumes of data that are [10]

1. intimately related, and
2. stored, viewed, and analyzed from different perspectives.

These perspectives are called dimensions.

    A Motivating Example

An automobile manufacturer wants to increase sales volumes by examining sales data collected throughout the organization. The evaluation would require viewing historical sales volume figures from multiple dimensions, such as:

Sales volume by model

Sales volume by color

    Sales volume by dealership

    Sales volume over time

Analyzing sales volume data from any one or more of the above dimensions can give answers to important queries such as:

What is the trend in sales volumes over a period of time for a specific model and color across a specific group of dealerships?

Consider the relation given below containing the manufacturer's sales data:

SALES VOLUMES FOR GLEASON DEALERSHIP

                 BLUE   RED   WHITE
MINI VAN            6     5       4
SPORTS COUPE        3     5       5
SEDAN               4     3       2


The above matrix is a 2-D array. An array is a fundamental component of MDDBs. In an array, each axis is called a dimension (MODEL and COLOR). Each element in a dimension is called a position. For model, there are 3 positions: van, sedan, and coupe. For color, there are 3 positions: blue, white, and red.

Intersections of dimensions are called cells and are populated with the data of interest, or measure, or fact (sales).
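To see how relational rows map onto the cells of this array, a cross-tab query over a hypothetical sales_volumes relation (model, color, dealership, volume) rebuilds the 2-D Gleason matrix, one column per color position:

    -- Each output row is one MODEL position; each CASE column is one COLOR
    -- position; each resulting value is the cell at that intersection.
    SELECT model,
           SUM(CASE WHEN color = 'BLUE'  THEN volume END) AS blue,
           SUM(CASE WHEN color = 'RED'   THEN volume END) AS red,
           SUM(CASE WHEN color = 'WHITE' THEN volume END) AS white
    FROM   sales_volumes
    WHERE  dealership = 'GLEASON'
    GROUP BY model;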

    Advantages of MDDBs

Direct inspection of an array gives a great deal of information, as opposed to a relational table.

The array conveniently groups like information in columns and rows: sedan sales are all lined up color-wise, so total sedan sales can be added very quickly. Similarly, sales for each color are also lined up.

    Represents a higher level of organization than the relational table

    The relational structure tells us nothing about the possible contents of those fields

Increasingly Complex Relational Tables [11]

If we add a new field, dealership, to the relational table, with three possible values, the relational table becomes even more awkward for presenting data to the end user.

    SALES VOLUMES FOR ALL DEALERSHIPS

MODEL          COLOR   DEALERSHIP   VOLUME
MINI VAN       BLUE    CLYDE        6
MINI VAN       BLUE    GLEASON      6
MINI VAN       BLUE    CARR         2
MINI VAN       RED     CLYDE        3
MINI VAN       RED     GLEASON      5
MINI VAN       RED     CARR         5
MINI VAN       WHITE   CLYDE        2
MINI VAN       WHITE   GLEASON      4
MINI VAN       WHITE   CARR         3
SPORTS COUPE   BLUE    CLYDE        2
SPORTS COUPE   BLUE    GLEASON      3
SPORTS COUPE   BLUE    CARR         2
SPORTS COUPE   RED     CLYDE        7
SPORTS COUPE   RED     GLEASON      5
SPORTS COUPE   RED     CARR         2
SPORTS COUPE   WHITE   CLYDE        4
SPORTS COUPE   WHITE   GLEASON      5
SPORTS COUPE   WHITE   CARR         1
SEDAN          BLUE    CLYDE        6
SEDAN          BLUE    GLEASON      4
SEDAN          BLUE    CARR         2
SEDAN          RED     CLYDE        1
SEDAN          RED     GLEASON      3
SEDAN          RED     CARR         4
SEDAN          WHITE   CLYDE        2
SEDAN          WHITE   GLEASON      2
SEDAN          WHITE   CARR         3

Multidimensional Simplification [9]

We just need to add a third axis or dimension called Dealers. The array now becomes 3-D (3x3x3, with 27 cells); earlier it was 2-D (3x3, with 9 cells). The array can now be thought of as a cube with 3 faces, each face having 9 cells. If we have a 10x10x10 array, with each of the three dimensions having 10 positions, then in relational format we will need 1,000 records to represent this array.


Performance Advantages

Consider a 10x10x10 array. A user wants to find the sales figure for a blue sedan sold by the Gleason dealer. A relational system might have to search through all 1,000 records just to find the qualifying record. The multidimensional system has to search only along three dimensions of 10 positions each to find the matching record. This is a maximum of 30 position searches for the array versus 1,000 record searches for the table.

Adding Dimensions

The 3-D model can easily be extended to four dimensions by adding a time dimension to indicate the month of the year in which the sale was made.


Trade-Offs: MDDB vs. RDBMS [13]

Consider the following factors when choosing between the multidimensional and relational approaches:

Size. MDDBs are generally limited by size, although the size limit has been increasing gradually over the years. Today, MDDBs can handle up to 100 GB of data efficiently. Large data warehouses are still served better by relational front-ends running against high-performance, scalable relational databases.

Volatility of Source Data. Highly volatile data are better handled by relational technology. Multidimensional data in hypercubes generally take long to load and update. Thus, the time required to constantly load and update the multidimensional data structure may prevent the enterprise from loading new data as often as desired.

Aggregate Strategy. Multidimensional hypercubes (multidimensional arrays) support aggregations better, although this advantage will disappear as RDBMSs improve their support for aggregation navigation*.

Investment Protection. Most organizations have already made significant investments in relational technology and skill sets. The continued use of these tools and skills for another purpose provides additional return on investment and lowers the technical risk for the data warehousing effort. Use of MDDBs will need more investment in buying tools and training people to use them.

Ability to Manage Complexity. MDDB adds a layer to the overall system architecture of the warehouse. Sufficient resources must be allocated to administer and maintain the MDDB layer.

Type of Users. Power users generally prefer the range of functionality available in MOLAP tools. Users that require broad views of the enterprise data are better served by ROLAP.

Recently, many of the large database vendors have announced plans to integrate their multidimensional and relational database products. In this situation, end users make use of the multidimensional front-end tools for all their queries. If a query requires data that are not available in the MDDB, the tool will retrieve the required data from the larger relational database. This feature is called drill-through.

    The following table sums up the comparison between MDDBs and RDBMSs


MDDB                                                              | RDBMS
Data is stored in multidimensional arrays                         | Data is stored in relations
Direct inspection of an array gives a great deal of information   | Not so
Can handle limited-size databases (< 100 GB)                      | Proven track record for handling VLDBs
Takes long to load and update                                     | Highly volatile data are better handled
Supports aggregations better                                      | RDBMSs are catching up (aggregate navigators)
New investments need to be made and new skill sets developed      | Most enterprises have already made significant investments in RDBMS technology and skill sets
Adds complexity to the overall system architecture                | No additional complexity
Limited number of fact and dimension tables                       | No such restriction
Examples: Arbor Essbase, Brio Query Enterprise, Dimensional Insight DI Diver, Oracle Express Server | Examples: IBM DB2, Microsoft SQL Server, Oracle RDBMS, Red Brick Systems Red Brick Warehouse

* More about aggregation and aggregate navigation will be covered later in the course.

    Conclusion:

Data warehousing and data mining are two important components of business intelligence. Data warehousing is necessary to analyze (analysis) the business needs, integrate (integration) data from several sources, and model (data modeling) the data in an appropriate manner to present the business information in the form of dashboards and reports (reporting).