Ch 1 intro_dw

Post on 20-May-2015

820 views 0 download

Tags:

description

This gives an idea about Data Warehouse

Transcript of Ch 1 intro_dw

SUSHIL KULKARNISUSHIL KULKARNI

DATA WAREHOUSINGDATA WAREHOUSINGDATA WAREHOUSING

Which are ourlowest/highest margin

customers ?

Which are ourlowest/highest margin

customers ?Who are my customers and what products are they buying?

Who are my customers and what products are they buying?

Which customersare most likely to go to the competition ?

Which customersare most likely to go to the competition ?

What impact will new products/services

have on revenue and margins?

What impact will new products/services

have on revenue and margins?

What product prom--otions have the biggest impact on revenue?

What product prom--otions have the biggest impact on revenue?

What is the most effective distribution

channel?

What is the most effective distribution

channel?

A producer wants to knowA producer wants to know……..

Lot of data everywhereLot of data everywhere

yet ...yet ...• I can’t find the data I need

– data is scattered over the network

– many versions, subtle differences

• I can’t get the data I need

– need an expert to get the data

• I can’t understand the data I found

– available data poorly documented

• I can’t use the data I found

– results are unexpected

– data needs to be transformed from one form to other

What is a Data Warehouse?What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.

[Barry Devlin]

What users says...What users says...

• Data should be integrated across the enterprise

• Summary data has a real value to the organization

• Historical data holds the key to understanding data over time

• What-if capabilities are required

What is Data Warehousing?What is Data Warehousing?

A process of transforming

data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

Data

EvolutionEvolution• 60’s: Batch reports

– hard to find and analyze information

– inflexible and expensive, reprogram every new request

• 70’s: Terminal-based DSS and EIS (executive information systems)

– still inflexible, not integrated with desktop tools

• 80’s: Desktop data access and analysis tools

– query tools, spreadsheets, GUIs

– easier to use, but only access operational databases

• 90’s: Data warehousing with integrated OLAP engines and tools

Warehouses are Very Large Warehouses are Very Large

DatabasesDatabases

35%

30%

25%

20%

15%

10%

5%

0%5GB

5-9GB

10-19GB 50-99GB 250-499GB

20-49GB 100-249GB 500GB-1TB

Initial

Projected 2Q96

Source: META Group, Inc.

Respondents

Very Large Data BasesVery Large Data Bases

• Terabytes -- 10^12 bytes:

• Petabytes -- 10^15 bytes:

• Exabytes -- 10^18 bytes:

• Zettabytes -- 10^21 bytes:

• Zottabytes -- 10^24 bytes:

Walmart -- 24 Terabytes

Geographic Information Systems

National Medical Records

Weather images

Intelligence Agency Videos

Data Warehousing Data Warehousing ----

It is a processIt is a process• Technique for assembling and

managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible

• A decision support database

maintained separately from the organization’s operational database

Data WarehouseData Warehouse

• A data warehouse is a

– subject-oriented

– integrated

– time-varying

– non-volatile

collection of data that is used primarily

in organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996

Customers: Get information of different prices of a beer

Farmers: Harvest information from known access paths

Data Warehouse SubjectData Warehouse Subject--orientedoriented

Students: Get information about various universities in U.K.

Data Warehouse SubjectData Warehouse Subject--orientedoriented

Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

• Focusing on the modelling and

analysis of data for decision makers,

not on daily operations or transaction

processing

• Provide a simple and concise view

around particular subject issues by

excluding data that are not useful in the

decision support process

Data Warehouse SubjectData Warehouse Subject--orientedoriented

Customers

Etc…

Vendors Etc…

Orders

DataWarehouse

Enterprise“Database”

Transactions

Copied, organizedsummarized

Data Mining

Data Miners:

• “Farmers” – they know• “Explorers” - unpredictable

Data Warehouse SubjectData Warehouse Subject--orientedoriented

Use to study trends and changes

Data Warehouse :Data Warehouse :Time Time -- variantvariant

• The time horizon for the data warehouse is

significantly longer than that of operational

systems

– Data warehouse data: provide information from a historical

perspective (e.g., past 5-10 years)

– Operational database: current value data

• Every key structure in the data warehouse

– Contains an element of time explicitly or implicitly, while

the key of operational data may or may not contain “time

element”

Data Warehouse :Data Warehouse :Time Time -- variantvariant

cannot updated by end users

Data Warehouse : Data Warehouse : NonNon--volatilevolatile

Data Warehouse ArchitectureData Warehouse Architecture

Data Warehouse

Engine

Optimized Loader

Extraction

Cleansing

Analyze

Query

Metadata Repository

RelationalDatabases

LegacyData

Purchased Data

ERPSystems

Data MartData Mart

• A Data Mart is a smaller, more focused

Data Warehouse – a mini-warehouse.

• A Data Mart typically reflects the business

rules of a specific business unit within an

enterprise.

Data Warehouse to Data MartData Warehouse to Data Mart

DataWarehouse

Data Mart

Data Mart

Data Mart

DecisionSupport

Information

DecisionSupport

Information

DecisionSupport

Information

DATA MARTSDATA MARTS

• Create many DM’s • Limited scope

Examples:

1. Financial DM2. Marketing DM3. Supply chain DM

Generic Architecture of DataGeneric Architecture of Data

(synonym) Transaction data

Transaction (Operational) Transaction (Operational)

DataData• Operational (production) systems create

(massive number of) transactions, such as sales, purchases, deposits, withdrawals, returns, refunds, phone calls, toll roads, web site “hits”, etc…

• Transactions are the base level of data –the raw material for understanding customer behavior

• Unfortunately, operational systems change due to changing business needs

• Fortunately, operational systems can usually be changed to support changing business needs

• Data warehousing strategies need to be aware of operational system changes

Operational Summary DataOperational Summary Data

Summaries are for a specific time period and utilize the transaction data for that time period

Other Examples???

Decision Support Summary DataDecision Support Summary Data

• The data that are used to help make decisions about the business– Financial Data, such as:

• Income Statements (Profit & Loss)• Balance Sheets (Assets – Liabilities = Net

Worth)

– Sales summaries– Other examples???

• Data warehouses maintain this type of data, however financial data “of record”(for audit purposes) usually comes from databases and not the data warehouse (confusing???)

• Generally, it is a bad idea to use the same system for analytic and operational purposes

Data Warehouse for Decision Data Warehouse for Decision

Support Support

• Putting Information technology to help

the knowledge worker make faster and

better decisions

– Which of my customers are most likely to

go to the competition?

– What product promotions have the biggest

impact on revenue?

– How did the share price of software

companies correlate with profits over last

10 years?

Decision SupportDecision Support

• Used to manage and control business

• Data is historical or point-in-time

• Optimized for inquiry rather than update

• Use of the system is loosely defined and can

be ad-hoc

• Used by managers and end-users to

understand the business and make

judgements

Database SchemaDatabase Schema

• Database schema defines the structure of data, not the values of the data (e.g., first name, last

name = structure; Ron Norman = values of the data)

• In RDBMS:

– Columns = fields = attributes (A,B,C)

– Rows = records = tuples (1-7)

Logical Database SchemaLogical Database Schema• Describes data in a way that is familiar to

business users

Physical Database SchemaPhysical Database Schema• Describes the data the way it will be stored in an

RDBMS which might be different than the way the logical shows it

MetadataMetadata

• General definition: Data about data !!!– Examples:

• A library’s card catalog (metadata) describes publications (data)

• A file system maintains permissions (metadata) about files (data)

• A form of system documentation including:– Values legally allowed in a field (e.g., AZ,

CA, OR, UT, WA, etc.)– Description of the contents of each field

(e.g., start date)– Date when data were loaded– Indication of currency of the data

(last updated)– Mappings between systems

(e.g., A.this = B.that)

• Invaluable, otherwise have to research to find it

Business RulesBusiness Rules

• Highest level of abstraction from

operational (transaction) data

• Describes why relationships exist and

how they are applied

• Examples:

– Need to have 3 forms of ID for credit

– Only allow a maximum daily withdrawal of

$200

– After the 3rd log-in attempt, lock the log-in

screen

– Accept no bills larger than $20

– Others???

General Architecture for Data General Architecture for Data

WarehousingWarehousing

• Source systems

• Extraction, (Clean),

Transformation, &

Load (ETL)

• Central repository

• Metadata repository

• Data marts

• Operational

feedback

• End users

(business)

DATA WAREHOUSE SCOPEDATA WAREHOUSE SCOPE

Broad :

Required for companies, Very costly, May be divided according to Depts.

Narrow:

Required for Personal information

Design of a Data Warehouse: A Design of a Data Warehouse: A

Business Analysis FrameworkBusiness Analysis Framework• Four views regarding the design of a data

warehouse

– Top-down view

• allows selection of the relevant information necessary for

the data warehouse

– Data source view

• exposes the information being captured, stored, and

managed by operational systems

– Data warehouse view

• consists of fact tables and dimension tables

– Business query view

• sees the perspectives of data in the warehouse from the

view of end-user

Data Warehouse Design Process Data Warehouse Design Process

• Top-down, bottom-up approaches or a combination of both

– Top-down: Starts with overall design and planning

– Bottom-up: Starts with experiments and prototypes (rapid)

• From software engineering point of view

– Waterfall: structured and systematic analysis at each step before proceeding to the next

– Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around

• Typical data warehouse design process

– Choose a business process to model, e.g., orders, invoices, etc.

– Choose the grain (atomic level of data) of the business process

– Choose the dimensions that will apply to each fact table record

– Choose the measure that will populate each fact table record

MultiMulti--Tiered ArchitectureTiered Architecture

Data

Warehouse

Extract

Transform

Load

Refresh

OLAP Engine

Analysis

Query

Reports

Data mining

Monitor

&

Integrator

Metadata

Data Sources Front-End Tools

Serve

Data Marts

Operational

DBs

other

sources

Data Storage

OLAP Server

Three Data Warehouse ModelsThree Data Warehouse Models

• Enterprise warehouse

– collects all of the information about subjects spanning

the entire organization

• Data Mart

– a subset of corporate-wide data that is of value to a

specific groups of users. Its scope is confined to

specific, selected groups, such as marketing data mart

• Independent vs. dependent (directly from warehouse) data

mart

• Virtual warehouse

– A set of views over operational databases

– Only some of the possible summary views may be

materialized

Data Mining works with Data Mining works with

Warehouse DataWarehouse Data

• Data Warehousing provides the Enterprise with a memory

• Data Mining provides the Enterprise with intelligence

We want to know ...We want to know ...

• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

• Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?

• If I raise the price of my product by Rs. 2, what is the effect on my ROI?

• If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?

• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?

• Which of my customers are likely to be the most loyal?

Data Mining helps extract such information

Application AreasApplication Areas

Industry Application

Finance Credit Card Analysis

Insurance Claims, Fraud Analysis

Telecommunication Call record analysis

Transport Logistics management

Consumer goods promotion analysis

Data Service providers Value added data

Utilities Power usage analysis

Data Mining in UseData Mining in Use

• Data Mining can be used to track fraud

• A Supermarket becomes an information broker

• Basketball teams use it to track game strategy

• Cross Selling

• Warranty Claims Routing

• Holding on to Good Customers

• Weeding out Bad Customers

Two Systems Two Systems

• Operational System

• Information System

Operational SystemsOperational Systems

• Run the business in real time

• Based on up-to-the-second data

• Optimized to handle large numbers of simple read/write transactions

• Optimized for fast response to predefined transactions

• Used by people who deal with customers, products --clerks, salespeople etc.

• They are increasingly used by customers

It refers to a class of

systems that facilitate

and manage

transaction-oriented

applications, typically for data entry and

retrieval transaction

processing

On Line Transaction Process On Line Transaction Process

(OLTP)(OLTP)

OLTP technology is used in a number of industries, including banking, airlines, mail order, supermarkets, and manufacturing. Applications include electronic banking, order processing, employee time clock systems, e-commerce, and eTrading. The most widely used OLTP system is probably IBM's CICS.

On Line Transaction Process On Line Transaction Process

(OLTP)(OLTP)

What are Operational Systems?What are Operational Systems?

• They are OLTP systems

• Run mission critical

applications

• Need to work with stringent performance requirements for routine tasks

• Used to run a business!

RDBMS used for OLTPRDBMS used for OLTP

• Database Systems have been used traditionally for OLTP– clerical data processing tasks

– detailed, up to date data

– structured repetitive tasks

– read/update a few records

– isolation, recovery and

integrity are critical

Operational Summary DataOperational Summary Data

Summaries are for a specific time period and utilize the transaction data for that time period

Other Examples???

Examples of Operational DataExamples of Operational Data

Data Industry Usage Technology Volumes

CustomerFile

All TrackCustomerDetails

Legacy application, flatfiles, main frames

Small-medium

AccountBalance

Finance Controlaccountactivities

Legacy applications,hierarchical databases,mainframe

Large

Point-of-Sale data

Retail Generatebills, managestock

ERP, Client/Server,relational databases

Very Large

CallRecord

Telecomm-unications

Billing Legacy application,hierarchical database,mainframe

Very Large

ProductionRecord

Manufact-uring

ControlProduction

ERP,relational databases,AS/400

Medium

So, whatSo, what’’s different?s different?

ApplicationApplication--Orientation vs. Orientation vs.

SubjectSubject--OrientationOrientation

Application-Orientation

Operational Database

LoansCredit Card

Trust

Savings

Subject-Orientation

DataWarehouse

Customer

Vendor

Product

Activity

OLTP vs. Data WarehouseOLTP vs. Data Warehouse

• OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse

• Special data organization, access methods and implementation methods are needed to support

data warehouse queries (typically multidimensional queries)

– e.g., average amount spent on phone calls between

9AM-5PM in Pune during the month of December

OLTP OLTP vsvs Data WarehouseData Warehouse

• OLTP

– Application Oriented

– Used to run business

– Detailed data

– Current up to date

– Isolated Data

– Repetitive access

– Clerical User

• Warehouse (DSS)

– Subject Oriented

– Used to analyze

business

– Summarized and refined

– Snapshot data

– Integrated Data

– Ad-hoc access

– Knowledge User

(Manager)

OLTP OLTP vsvs Data WarehouseData Warehouse

• OLTP

– Performance Sensitive

– Few Records accessed at a time (tens)

– Read/Update Access

– No data redundancy

– Database Size 100MB -100 GB

• Data Warehouse

– Performance relaxed

– Large volumes accessed at a time(millions)

– Mostly Read (Batch Update)

– Redundancy present

– Database Size 100 GB - few terabytes

OLTP OLTP vsvs Data WarehouseData Warehouse

• OLTP

– Transaction

throughput is the

performance metric

– Thousands of users

– Managed in entirety

• Data Warehouse

– Query throughput is

the performance

metric

– Hundreds of users

– Managed by subsets

To summarize ...To summarize ...

• OLTP Systems are used to “run” a business

• The Data Warehouse helps to “optimize” the business

Why Separate Data Why Separate Data

Warehouse?Warehouse?• Performance

– Op dbs designed & tuned for known txs & workloads.

– Complex OLAP queries would degrade perf. for op txs.

– Special data organization, access & implementation methods needed for multidimensional views & queries.

• Function

– Missing data: Decision support requires historical data, which op dbs do not typically maintain.

– Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.

– Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

INFORMATION SYSTEMSINFORMATION SYSTEMS

• Designed to support decision-making based on

1. Historical data2. Prediction data.

• Designed for complex queries or data-mining applications.

Examples:

1. Sales trend analysis, 2. Customer segmentation3. Human resources planning

INFORMATION SYSTEMSINFORMATION SYSTEMS

DIFFERENCEDIFFERENCE

Periodical batch updates and queries requiring many or all rows

Many, constant updates and queries on one or a few table rows

Volume

Ease of flexible access and use

Performance throughput, availability

Design goal

Broad, ad hoc, complex queries and analysis

Narrow, planned, and simple updates and queries

Scope of usage

Managers, business analysts, customers

Clerks, sales-persons, administrations

Primary users

Real and analyze historical data.

Real time data entryPurpose

Informational SystemsOperational SystemsCharacteristics

T H A N K S !T H A N K S !