Datawarehousing
DATA WAREHOUSING AND DATA MINING
M. Mageshwari, Lecturer
M.S.P.V.L Polytechnic College
2
Course Overview
The course: what and how
0. Introduction
I. Data Warehousing
II. Decision Support and OLAP
III. Data Mining
IV. Looking Ahead
Demos and Labs
3
0. Introduction
Data Warehousing, OLAP and data mining: what and why (now)?
Relation to OLTP
A case study
demos, labs
4
A producer wants to know...
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
5
Data, data everywhere, yet ...
I can’t find the data I need: data is scattered over the network; many versions, subtle differences
I can’t get the data I need: need an expert to get the data
I can’t understand the data I found: available data is poorly documented
I can’t use the data I found: results are unexpected; data needs to be transformed from one form to another
6
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.
7
What are the users saying...
Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required
8
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
Data
Information
9
Evolution
60’s: Batch reports; hard to find and analyze information; inflexible and expensive, reprogram for every new request
70’s: Terminal-based DSS (decision support systems) and EIS (executive information systems); still inflexible, not integrated with desktop tools
10
Data Warehouse Structure
base customer (1985-87): custid, from date, to date, name, phone, dob
base customer (1988-90): custid, from date, to date, name, credit rating, employer
customer activity (1986-89): monthly summary
customer activity detail (1987-89): custid, activity date, amount, clerk id, order no
customer activity detail (1990-91): custid, activity date, amount, line item no, order no
Time is part of the key of each table
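A minimal sketch of what "time is part of the key" can mean in practice, using Python's built-in sqlite3 module. The table name base_customer, the underscored column names and the sample rows are assumptions made up for illustration; only the column list (custid, from date, to date, name, phone) comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: the validity start date is part of the primary key,
# so several versions of the same customer can coexist.
conn.execute("""
    CREATE TABLE base_customer (
        custid    INTEGER,
        from_date TEXT,
        to_date   TEXT,
        name      TEXT,
        phone     TEXT,
        PRIMARY KEY (custid, from_date)
    )
""")

# Made-up sample rows: one customer whose phone number changed over time.
conn.executemany(
    "INSERT INTO base_customer VALUES (?, ?, ?, ?, ?)",
    [
        (1, "1985-01-01", "1986-06-30", "A. Kumar", "555-0100"),
        (1, "1986-07-01", "1987-12-31", "A. Kumar", "555-0199"),
    ],
)

def customer_as_of(custid, day):
    """Return the (name, phone) version that was valid on the given ISO date."""
    return conn.execute(
        "SELECT name, phone FROM base_customer "
        "WHERE custid = ? AND from_date <= ? AND ? <= to_date",
        (custid, day, day),
    ).fetchone()

print(customer_as_of(1, "1985-09-15"))
print(customer_as_of(1, "1987-03-01"))
```

Because the dates are stored as ISO strings, plain string comparison orders them correctly; a date outside every validity range returns None.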
Definition of DSS
A decision support system (DSS) is a system that helps decision makers at various levels to take decisions.
It uses data, analytical models and user-friendly software to support decision making.
11
Definition of EIS
An executive information system (EIS) is a system that helps high-level executives to take policy decisions.
It uses higher-level data, analytical models and user-friendly software to support decision making.
12
Evolution
80’s: Desktop data access and analysis tools: query tools, spreadsheets, GUIs; easier to use, but only access operational databases
90’s: Data warehousing with integrated OLAP (online analytical processing) engines and tools
13
14
Data Warehousing -- It is a process
A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
A decision support database maintained separately from the organization’s operational database
15
Characteristics of Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
Subject-oriented:
A data warehouse is organized around the major subjects of the organization, such as customer, supplier, product and sales.
A data warehouse provides a simple and concise view of a particular subject by excluding data that are not useful to the decision support process.
16
Integrated:
A data warehouse is constructed by integrating multiple sources of data such as relational databases, flat files and on-line transaction records.
Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attributes, etc.
17
Time Variant
A data warehouse maintains records of both historical and current data,
so it can provide information from a historical perspective.
18
Non Volatile
Once the data warehouse is loaded with data, the stored data is not modified; it is only read.
19
20
Explorers, Farmers and Tourists
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
Farmers: Harvest information from known access paths
Tourists: Browse information harvested by farmers
21
Application-Orientation vs. Subject-Orientation
Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
Functioning of Data warehousing
22
Data Source
Cleaning, Transformation
Data Warehouse
New Update
Collection of data
Data warehousing collects data from various data sources such as relational databases, flat files and on-line records.
The collected data are stored in a database inside the warehouse.
The type of data collection used depends on the architecture of the warehouse.
23
Integration
Each data source uses a different schema.
The data warehouse gets data from different sources with different schemas and converts it into a common integrated schema.
24
25
Star Schema
A single fact table, and one dimension table for each dimension
Does not capture hierarchies directly
Dimension tables: time, prod, cust, city
Fact table: date, custno, prodno, cityname, ...
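A star schema can be sketched with Python's built-in sqlite3 module. The table names (dim_time, dim_prod, dim_cust, dim_city, fact_sales), the extra columns and the sample rows below are assumptions for illustration; the slide only names the four dimensions and the fact columns date, custno, prodno, cityname.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension table per dimension (hypothetical columns).
    CREATE TABLE dim_time (sale_date TEXT PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE dim_prod (prodno INTEGER PRIMARY KEY, prodname TEXT, category TEXT);
    CREATE TABLE dim_cust (custno INTEGER PRIMARY KEY, custname TEXT);
    CREATE TABLE dim_city (cityname TEXT PRIMARY KEY, region TEXT);

    -- A single fact table that references every dimension.
    CREATE TABLE fact_sales (
        sale_date TEXT    REFERENCES dim_time(sale_date),
        custno    INTEGER REFERENCES dim_cust(custno),
        prodno    INTEGER REFERENCES dim_prod(prodno),
        cityname  TEXT    REFERENCES dim_city(cityname),
        amount    REAL
    );

    -- Made-up sample data.
    INSERT INTO dim_time VALUES ('2024-03-01', '2024-03', 2024), ('2024-04-01', '2024-04', 2024);
    INSERT INTO dim_prod VALUES (1, 'Cola', 'Drinks'), (2, 'Soap', 'Toiletries');
    INSERT INTO dim_cust VALUES (1, 'Customer A'), (2, 'Customer B');
    INSERT INTO dim_city VALUES ('Chennai', 'S'), ('Delhi', 'N');
    INSERT INTO fact_sales VALUES
        ('2024-03-01', 1, 1, 'Chennai', 10.0),
        ('2024-03-01', 2, 1, 'Delhi',   20.0),
        ('2024-04-01', 1, 2, 'Chennai',  5.0),
        ('2024-04-01', 1, 1, 'Chennai',  7.0);
""")

# A typical star join: the fact table joined to two of its dimensions.
sales_by_category_region = conn.execute("""
    SELECT p.category, c.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_prod p ON f.prodno = p.prodno
    JOIN dim_city c ON f.cityname = c.cityname
    GROUP BY p.category, c.region
    ORDER BY p.category, c.region
""").fetchall()

print(sales_by_category_region)
```

Note that region is stored as a plain column of dim_city here, which is why the star schema does not represent the city-to-region hierarchy as a table of its own.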
26
Snowflake schema
Represents dimensional hierarchy directly by normalizing tables
Easy to maintain and saves storage
Dimension tables: time, prod, cust, city, region
Fact table: date, custno, prodno, cityname, ...
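The normalization that turns a star into a snowflake can be sketched as follows, again with sqlite3. Here the city dimension points to a separate region table, so the hierarchy city, region, country is represented by tables. All table names, columns and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The hierarchy is normalized: city -> region -> country.
    CREATE TABLE dim_region (regionname TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE dim_city (
        cityname   TEXT PRIMARY KEY,
        regionname TEXT REFERENCES dim_region(regionname)
    );
    CREATE TABLE fact_sales (
        sale_date TEXT,
        custno    INTEGER,
        prodno    INTEGER,
        cityname  TEXT REFERENCES dim_city(cityname),
        amount    REAL
    );

    -- Made-up sample data.
    INSERT INTO dim_region VALUES ('South', 'India'), ('North', 'India'), ('West Coast', 'USA');
    INSERT INTO dim_city VALUES ('Chennai', 'South'), ('Delhi', 'North'), ('SF', 'West Coast');
    INSERT INTO fact_sales VALUES
        ('2024-03-01', 1, 1, 'Chennai', 10.0),
        ('2024-03-01', 2, 1, 'Delhi',   20.0),
        ('2024-03-02', 3, 2, 'SF',       5.0);
""")

# Summarizing along the hierarchy needs one extra join per normalized level.
sales_by_country = conn.execute("""
    SELECT r.country, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_city c   ON f.cityname = c.cityname
    JOIN dim_region r ON c.regionname = r.regionname
    GROUP BY r.country
    ORDER BY r.country
""").fetchall()

print(sales_by_country)
```

The region attributes are stored once per region rather than once per city, which is the storage saving the slide refers to; the price is the additional join.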
Data transformation and cleaning
The task of correcting and preparing the data is called data cleaning.
When a data source delivers data into the database of the data warehouse, the data should be corrected first.
27
Update of data
Updates to tables at the data sources must be sent to the data warehouse.
If the tables in the data warehouse are the same as those at the sources, the update is easy.
28
Summarizing data
The raw data generated by a transaction may be too large to store online.
Therefore, summaries of the transactions can be stored for easy querying.
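As a rough sketch of summarizing, the function below reduces raw transactions to one row per customer and month. The record layout (custid, activity date, amount) follows the customer activity tables shown earlier; the function name monthly_summary and the sample values are made up for illustration.

```python
from collections import defaultdict

# Made-up raw transactions: (custid, activity_date, amount)
transactions = [
    (1, "1987-01-05", 40.0),
    (1, "1987-01-20", 60.0),
    (1, "1987-02-03", 25.0),
    (2, "1987-01-11", 10.0),
]

def monthly_summary(rows):
    """Collapse detail rows into a count and a total per (custid, month)."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for custid, activity_date, amount in rows:
        month = activity_date[:7]  # 'YYYY-MM' prefix of the ISO date
        entry = summary[(custid, month)]
        entry["count"] += 1
        entry["total"] += amount
    return dict(summary)

summary = monthly_summary(transactions)
for key in sorted(summary):
    print(key, summary[key])
```

Four detail rows become three summary rows here; on real transaction volumes the reduction is what makes it practical to keep the summary online.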
29
30
Data Warehouse for Decision Support & OLAP
Putting information technology to help the knowledge worker make faster and better decisions
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over the last 10 years?
31
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgments
OLAP(Online analytical processing)
A data warehouse stores data, but OLAP transforms the data warehouse data into specific, meaningful information.
OLAP therefore provides a user-friendly environment for interactive data analysis.
32
OLAP
33
DATA WAREHOUSE
OLAP SERVER
FRONT END TOOL
User
Result
Result set
Request
SQL
OLAP operations on multidimensional data
Roll-up (group)
Drill-down (less)
Slice and dice (piece)
Pivot (rotate)
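The operations listed above can be sketched on a tiny in-memory cube. This is only an illustration in plain Python, not how an OLAP server implements them; the cube contents and the function names (roll_up, slice_, dice, pivot) are assumptions.

```python
from collections import defaultdict

# Made-up cells of a tiny cube: (product, region, month) -> sales
DIMS = ("product", "region", "month")
cube = {
    ("Juice", "N", 1): 10,
    ("Juice", "S", 1): 12,
    ("Cola", "N", 1): 47,
    ("Cola", "N", 2): 30,
    ("Milk", "W", 2): 8,
}

def roll_up(cells, keep):
    """Roll-up: aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {
        k: v for k, v in cells.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

def pivot(cells, row_dim, col_dim):
    """Pivot: cross-tabulate two dimensions, summing over the rest."""
    table = defaultdict(dict)
    for (r, c), v in roll_up(cells, (row_dim, col_dim)).items():
        table[r][c] = v
    return dict(table)

by_product = roll_up(cube, ("product",))                  # coarse view
by_product_month = roll_up(cube, ("product", "month"))    # drill-down: add a dimension back
north = slice_(cube, "region", "N")
subcube = dice(cube, product={"Juice", "Cola"}, month={1})
crosstab = pivot(cube, "product", "region")
print(by_product, crosstab, sep="\n")
```

Drill-down is shown as the inverse of roll-up: the same function is called with more dimensions kept, which returns the finer-grained cells.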
34
TYPES OF OLAP
MOLAP (MULTIDIMENSIONAL OLAP)
ROLAP (RELATIONAL OLAP)
35
36
Month: 1 2 3 4 5 6 7
Product: Toothpaste, Juice, Cola, Milk, Cream, Soap
Region: W, S, N
Dimensions: Product, Region, Time
Hierarchical summarization paths
Product: Industry, Category, Product
Region: Country, Region, City, Office
Time: Year, Quarter, Month or Week, Day
Multi-dimensional Data
“Hey…I sold $100M worth of goods”
37
Data Warehouse Architecture
Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems
Extraction, Cleansing
Optimized Loader
Data Warehouse Engine
Metadata Repository
Analyze, Query
Architecture of data warehousing
38
Data: External data, Warehouse data
Components: Data Acquisition, Data Manager, Data Dictionary, Information Directory, Middleware, Design, Management, Data Access
39
40
Design Component
The data warehouse designer designs the database of the data warehouse, and the warehouse administrator manages the data warehouse.
The designer and administrator use the design component to design and store data.
Types of design
Bottom-up design: Business value can be returned as soon as the first data marts are created.
Top-down design: Atomic data, that is, data at the lowest level of detail, are stored in the data warehouse.
Hybrid design
41
Hybrid design. Hybrid methodologies have evolved to take advantage of the fast turn-around time of bottom-up design and the enterprise-wide data consistency of top-down design.
42
Data Manager Component
The database in the data warehouse uses the data manager component for managing and accessing the data stored in the data warehouse.
RDBMS
MDBMS
43
Management Component
Administering data acquisition operations
Managing backup copies of the data
Recovering lost data
Providing security to the data stored in the data warehouse
Authorizing access to the data stored in the data warehouse
44
Data Acquisition Component
This component acquires data from various sources by using the data acquisition applications
The data acquisition applications are based on rules that are defined by the data warehouse developers.
45
The operations performed during data cleanup
Restructuring the records and fields of the database tables
Removing irrelevant and redundant data
Obtaining and adding missing data
Verifying the integrity and consistency of the data
46
The operations performed on the data for enhancement are
Decoding and translating the values in fields
Summarizing data
Calculating the derived values
47
Information directory Component
This component helps the end users to know the details of the data stored in the data warehouse.
This is done with the help of data about the data, called metadata.
Technical data
Business data
48
Middleware Component
This component connects to the local databases.
An analytical server is used to analyze multidimensional data.
Intelligent data warehousing middleware controls access to the warehouse database.
49
Data mart
A data mart is a database that contains the data needed by a small group of users for their own department's needs.
Dependent data mart
Independent data mart
50
Difference between data warehouse and data mart
Data warehouse: Data warehousing is suitable to support an entire corporate environment.
Data mart: A data mart is useful for small organizations with very few departments.
Data warehouse: If you listen to some vendors, you may be left thinking that building data warehouses is a waste of time.
Data mart: Data mart vendors that tell you this are looking out for their own best interests.
Data warehouse: This supports the entire information requirement of an organization.
Data mart: This supports the information requirement of a department in an organization.
Data warehouse: This has a large data model, wider implementation, large data and a larger number of users.
Data mart: This has a small data model, shorter implementation, less data and fewer users.
51
Advantages of data mart
Since each department has its own data mart, the departments can summarize, sort, select, structure, etc. their own department's data. This will not be confused with any other department's data.
The department can do whatever DSS processing it wants.
The processing cost and storage are less than for the data warehouse.
The department can select software for its data mart that is powerful enough to fit its needs.
52
Data warehousing life cycle
53
Design
Enhance prototype
Operate
deploy
54
Month: 1 2 3 4 5 6 7
Product: Toothpaste, Juice, Cola, Milk, Cream, Soap
Region: W, S, N
Dimensions: Product, Region, Period
Hierarchical summarization paths
Product: Industry, Category, Product
Region: Country, Region, City, Office
Period: Year, Quarter, Month or Week, Day
Data Modeling(Multi-dimensional Database)
“Hey…I sold $100M worth of goods”
Building of data warehouse
The builder must forecast the usage of the warehouse by the users.
The design should support accessing data with any meaningful values of the attributes.
To build a good data warehouse, the data acquisition process must follow the steps given below:
Extract the data from multiple heterogeneous sources.
Format the data for consistency within the warehouse.
The data must be cleaned to ensure validity.
The data must be converted from a relational, object-oriented or hierarchical model to a multidimensional model.
The data are loaded into the warehouse. Good monitoring tools are necessary to recover from an incorrect load.
55
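The acquisition steps above (extract, format, clean, convert, load) can be sketched as a chain of small functions. This is a toy illustration in plain Python: the two source layouts, all field names and the in-memory "warehouse" dictionary are assumptions, not part of any real tool.

```python
import csv
import io

# Two made-up heterogeneous sources: a CSV extract and a list of records.
csv_source = "cust,prod,city,amt\n1,Cola,Chennai,10\n2,Soap,Delhi,\n"
dict_source = [{"customer": 1, "product": "Cola", "town": "Chennai", "amount": "7.5"}]

def extract():
    """Step 1: read the data from each source."""
    return list(csv.DictReader(io.StringIO(csv_source))), dict_source

def format_rows(csv_rows, dict_rows):
    """Step 2: bring both sources to one common record layout."""
    common = [
        {"custno": int(r["cust"]), "prodname": r["prod"],
         "cityname": r["city"], "amount": r["amt"]}
        for r in csv_rows
    ]
    common += [
        {"custno": int(r["customer"]), "prodname": r["product"],
         "cityname": r["town"], "amount": r["amount"]}
        for r in dict_rows
    ]
    return common

def clean(rows):
    """Step 3: drop records whose amount is missing or not numeric."""
    valid = []
    for r in rows:
        try:
            valid.append({**r, "amount": float(r["amount"])})
        except ValueError:
            continue
    return valid

def to_multidimensional(rows):
    """Step 4: convert flat records into cells keyed by their dimensions."""
    cube = {}
    for r in rows:
        key = (r["custno"], r["prodname"], r["cityname"])
        cube[key] = cube.get(key, 0.0) + r["amount"]
    return cube

def load(warehouse, cube):
    """Step 5: load the cells into the (here: in-memory) warehouse."""
    warehouse.update(cube)
    return warehouse

warehouse = load({}, to_multidimensional(clean(format_rows(*extract()))))
print(warehouse)
```

The record with the empty amount is rejected in the cleaning step, and the two remaining records for the same customer, product and city end up in one cell.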
Data warehouse and views
A data warehouse is a permanent store of data in multidimensional tables.
Views are created temporarily, when needed, from the data warehouse.
They are used for decision support systems.
56
Difference between data warehouse and views
Data warehouse: A data warehouse is a permanent store of data.
Views: Views are created from warehouse data when needed and are not permanent.
Data warehouse: Data warehouses are multidimensional.
Views: Views are relational.
Data warehouse: A data warehouse can be indexed to maximize performance.
Views: Views cannot be indexed.
Data warehouse: A data warehouse provides specific support for a functionality.
Views: Views cannot give specific support for a functionality.
Data warehouse: A data warehouse provides a large amount of data.
Views: Views are created by extracting minimal data from the data warehouse.
57
Data warehouse Future
New techniques must be introduced in data cleaning, indexing and partitioning.
The manual operations involved in data acquisition, data quality management and performance maximization must be automated.
Proper business rules must be developed and incorporated in the warehouse creation and maintenance process.
58
Data Mining
Data mining is sorting through data to identify patterns and establish relationships.
59
60
Data Mining (cont.)
61
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
62
“The key in business is to know something that nobody else knows.”
— Aristotle Onassis
“To understand is to perceive patterns.” — Sir Isaiah Berlin
PHOTO: LUCINDA DOUGLAS-MENZIES
PHOTO: HULTON-DEUTSCH COLL
Data Mining Motivation
63
Application Areas
Industry: Application
Finance: Credit card analysis
Insurance: Claims, fraud analysis
Telecommunication: Call record analysis
Consumer goods: Promotion analysis
Data service providers: Value-added data
Utilities: Power usage analysis
64
Data Mining in Use
The US Government uses Data Mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Warranty Claims Routing
Holding on to Good Customers
Weeding out Bad Customers
65
What is data mining technology
The process of extracting or finding hidden knowledge from a large database is called data mining.
Ex: Age 21 (data) -> we can understand that the person is a major, i.e. an adult (information)
Data Mining Technology
66
Databases, Flat Files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Knowledge
The various steps
Data cleaning: To remove noise and inconsistent data
Data integration: Data from multiple sources are combined
Data selection: Relevant data are retrieved from the database for analysis
67
Data transformation: The selected data are prepared for mining by performing aggregation operations
Data mining: Intelligent methods are applied to extract data patterns
Pattern evaluation: Identify the needed patterns
Knowledge presentation: Present the mined knowledge to the user
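The steps can be chained as in the following toy sketch. The "intelligent method" is replaced here by a very simple stand-in, counting how often two items are bought together; the basket data, the threshold and all function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets from two sources.
store_a = [["milk", "Juice", None], ["cola", "milk"]]
store_b = [["juice", "milk"], ["soap"]]

def clean(baskets):
    """Data cleaning: drop missing items and normalize case."""
    return [[item.lower() for item in b if item] for b in baskets]

def integrate(*sources):
    """Data integration: combine the sources into one collection."""
    return [b for src in sources for b in src]

def select(baskets, min_items=2):
    """Data selection: keep only baskets relevant to pair analysis."""
    return [b for b in baskets if len(b) >= min_items]

def transform(baskets):
    """Data transformation: one sorted, duplicate-free tuple per basket."""
    return [tuple(sorted(set(b))) for b in baskets]

def mine(baskets):
    """Data mining (stand-in): count co-occurring item pairs."""
    counts = Counter()
    for b in baskets:
        counts.update(combinations(b, 2))
    return counts

def evaluate(counts, min_support=2):
    """Pattern evaluation: keep pairs seen at least min_support times."""
    return {pair: n for pair, n in counts.items() if n >= min_support}

patterns = evaluate(
    mine(transform(select(integrate(clean(store_a), clean(store_b)))))
)

# Knowledge presentation: show the surviving patterns to the user.
for pair, n in patterns.items():
    print(f"{pair[0]} and {pair[1]} were bought together {n} times")
```

Only the juice and milk pair reaches the minimum support of 2 in this sample, so it is the only pattern presented.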
68
Loading the Warehouse
Cleaning the data before it is loaded
70
Data Integration Across Sources
Trust, Credit card, Savings, Loans
Same data, different name
Different data, same name
Data found here, nowhere else
Different keys, same data
71
Data Transformation Example
field: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr
unit: appl A - pipeline - cm; appl B - pipeline - in; appl C - pipeline - feet; appl D - pipeline - yds
encoding: appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male, female
Data Warehouse
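The three kinds of transformation in this example (field names, units, encodings) can be sketched with per-application rule tables. The field names, units and source codes come from the slide; the choice of warehouse target format (field "balance", centimetres, codes m/f) and the pairing of 1 and x with male and 0 and y with female are assumptions for illustration.

```python
# Conversion factors to centimetres for the units named on the slide.
TO_CM = {"cm": 1.0, "in": 2.54, "feet": 30.48, "yds": 91.44}

# Per-application source conventions. The gender pairings for appl B and
# appl C are assumed from the order in which the codes are listed.
APPS = {
    "A": {"balance_field": "balance", "unit": "cm",
          "gender": {"m": "m", "f": "f"}},
    "B": {"balance_field": "bal", "unit": "in",
          "gender": {"1": "m", "0": "f"}},
    "C": {"balance_field": "currbal", "unit": "feet",
          "gender": {"x": "m", "y": "f"}},
    "D": {"balance_field": "balcurr", "unit": "yds",
          "gender": {"male": "m", "female": "f"}},
}

def transform(app, record):
    """Map one source record to the common (assumed) warehouse format."""
    rules = APPS[app]
    return {
        "balance": record[rules["balance_field"]],
        "pipeline_cm": round(record["pipeline"] * TO_CM[rules["unit"]], 2),
        "gender": rules["gender"][str(record["gender"])],
    }

row_b = transform("B", {"bal": 100.0, "pipeline": 10, "gender": 1})
row_d = transform("D", {"balcurr": 5.0, "pipeline": 1, "gender": "female"})
print(row_b)
print(row_d)
```

Keeping the rules in data rather than in code means a new source application only needs a new entry in APPS.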
Structuring/Modeling Issues
Data Warehouse vs. Data Marts
74
From the Data Warehouse to Data Marts
Departmentally Structured
Individually Structured
Data Warehouse: Organizationally Structured
Less
More
History, Normalized, Detailed
Data
Information
75
Data Warehouse and Data Marts
OLAP
Data Mart: Lightly summarized, Departmentally structured
Data Warehouse: Organizationally structured, Atomic, Detailed data
76
Characteristics of the Departmental Data Mart
OLAP
Small
Flexible
Customized by Department
Source is departmentally structured data warehouse
77
Techniques for Creating Departmental Data Mart
OLAP
Subset
Summarized
Superset
Indexed
Arrayed
Sales, Mktg., Finance
78
Data Mart Centric
Data Marts
Data Sources
Data Warehouse
79
True Warehouse
Data Marts
Data Sources
Data Warehouse
II. On-Line Analytical Processing (OLAP)
Making Decision Support Possible
81
What Is OLAP?
Online Analytical Processing: term coined by E. F. Codd in a 1993 paper contracted by Arbor Software
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
82
The OLAP Market
Rapid growth in the enterprise market
1995: $700 Million
1997: $2.1 Billion
Significant consolidation activity among major DBMS vendors
10/94: Sybase acquires ExpressWay
7/95: Oracle acquires Express
11/95: Informix acquires Metacube
10/96: Microsoft acquires Panorama
1/97: Arbor partners up with IBM
Result: OLAP shifted from small vertical niche to mainstream DBMS category
83
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools
84
OLAP Is FASMI
Fast Analysis of Shared Multidimensional Information
85
Data Cube Lattice
Cube lattice:
ABC
AB AC BC
A B C
none
Can materialize some group-bys, compute others on demand
Question: which group-bys to materialize?
Question: what indices to create?
Question: how to organize data (chunks, etc.)?
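The lattice for dimensions A, B, C can be enumerated with itertools.combinations. The sketch also shows the basic rule behind "compute others on demand": a group-by can be answered from any materialized group-by that contains all of its dimensions. The function names are assumptions for illustration, and the sketch does not answer the question of which group-bys are best to materialize.

```python
from itertools import combinations

def cube_lattice(dims):
    """Return every group-by of the cube, level by level, from all dimensions down to none."""
    return [
        [frozenset(c) for c in combinations(dims, k)]
        for k in range(len(dims), -1, -1)
    ]

def derivable_from(groupby, materialized):
    """List the materialized group-bys from which `groupby` can be computed."""
    return [m for m in materialized if groupby <= m]

lattice = cube_lattice("ABC")
for level in lattice:
    print(" ".join("".join(sorted(g)) or "none" for g in level))

# If only AB and BC are materialized, group-by A can be computed from AB,
# while group-by AC cannot be computed from either of them.
materialized = [frozenset("AB"), frozenset("BC")]
print(derivable_from(frozenset("A"), materialized))
print(derivable_from(frozenset("AC"), materialized))
```

A cube with n dimensions has 2^n group-bys, which is why materializing all of them quickly becomes impractical.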
86
Visualizing Neighbors is simpler
Figure: sales shown as a grid of Store (1 to 8) by Month (Apr to Mar), compared with a flat table with columns Month, Store, Sales (rows Apr 1, Apr 2, ... Jun 2)
87
A Visual Operation: Pivot (Rotate)
Figure: cube with dimensions Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF) and Date (3/1, 3/2, 3/3, 3/4), with cell values 10, 47, 30, 12; the rotated view shows Month against Region
88
“Slicing and Dicing”
Product: Household, Telecomm, Video, Audio
Sales Channel: Retail, Direct, Special
Regions: India, Far East, Europe
The Telecomm Slice
89
Roll-up and Drill Down
Hierarchy: Sales Channel, Region, Country, State, Location Address, Sales Representative
Roll Up: Higher Level of Aggregation
Drill-Down: Low-level Details
90
Nature of OLAP Analysis
Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization
91
Organizationally Structured Data
Different Departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data
marketing
manufacturing
sales
finance
92
Multidimensional Spreadsheets
Analysts need spreadsheets that support
pivot tables (cross-tabs)
drill-down and roll-up
slice and dice
sort
selections
derived attributes
Popular in retail domain
93
OLAP Operations
Single Cell Multiple Cells Slice Dice
Roll Up
Drill Down
94
Relational OLAP: 3 Tier DSS
Database Layer: Data Warehouse
Application Logic Layer: ROLAP Engine
Presentation Layer: Decision Support Client
Store atomic data in industry standard RDBMS.
Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
Obtain multi-dimensional reports from the DSS Client.
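The idea of a ROLAP engine turning a multidimensional request into SQL against a standard RDBMS can be sketched as below, with sqlite3 standing in for the warehouse. The helper rollup_sql, the sales table and its rows are assumptions for illustration; a real ROLAP engine does far more (query planning, caching, aggregate tables).

```python
import sqlite3

def rollup_sql(table, group_by, measure):
    """Build a GROUP BY query for the requested dimensions (hypothetical helper).

    Identifiers are interpolated directly, so this is only safe with
    trusted, hard-coded names as used in this sketch.
    """
    cols = ", ".join(group_by)
    return (
        f"SELECT {cols}, SUM({measure}) FROM {table} "
        f"GROUP BY {cols} ORDER BY {cols}"
    )

# Database layer: atomic data in a relational table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, region TEXT, month INTEGER, amount REAL);
    INSERT INTO sales VALUES
        ('Juice', 'N', 1, 10), ('Juice', 'S', 1, 12),
        ('Cola',  'N', 1, 47), ('Cola',  'N', 2, 30);
""")

# Application logic layer: translate "sales by product" and
# "sales by product and month" into SQL and run it.
by_product = conn.execute(rollup_sql("sales", ["product"], "amount")).fetchall()
by_product_month = conn.execute(
    rollup_sql("sales", ["product", "month"], "amount")
).fetchall()

# Presentation layer: the client would render these rows as a report.
print(by_product)
print(by_product_month)
```

Nothing is pre-calculated in this approach: each report is computed from the atomic rows when it is requested.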
95
MD-OLAP: 2 Tier DSS
Database Layer and Application Logic Layer: MDDB Engine
Presentation Layer: Decision Support Client
Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
Obtain multi-dimensional reports from the DSS Client.
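"Pre-calculate as many outcomes as possible" can be illustrated by computing every aggregate of a small cube up front, so that any later query is a plain lookup. This is a simplified sketch in plain Python, not a description of any vendor's proprietary structure; the cell data, the "*" marker for "all values" and the function name precompute are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Made-up atomic cells: (product, region) -> sales
cells = {
    ("Juice", "N"): 10.0,
    ("Juice", "S"): 12.0,
    ("Cola", "N"): 47.0,
}

def precompute(cells, n_dims):
    """Pre-calculate totals for every subset of dimensions.

    In the result keys, '*' in a position means 'all values of that dimension'.
    """
    store = defaultdict(float)
    for key, value in cells.items():
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                agg_key = tuple(
                    key[i] if i in kept else "*" for i in range(n_dims)
                )
                store[agg_key] += value
    return dict(store)

mddb = precompute(cells, 2)

# Every query is now a dictionary lookup instead of a scan.
print(mddb[("Juice", "*")])   # all regions for Juice
print(mddb[("*", "N")])       # all products in region N
print(mddb[("*", "*")])       # grand total
```

The trade-off against the ROLAP approach is visible even here: three atomic cells produce eight stored entries, in exchange for constant-time answers.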
MSPVL Polytechnic College, Pavoorchatram
96