Chap 11 Corrected
-
Upload
farooq-shad -
Category
Documents
-
view
226 -
download
0
Transcript of Chap 11 Corrected
-
7/29/2019 Chap 11 Corrected
1/34
Chapter 11
Chapter 11:
Data Warehousing
Modern Database Management
8th Edition
Jeffrey A. Hoffer, Mary B. Prescott,
Fred R. McFadden2
007byPrenticeHall
FAROOQ
1
-
7/29/2019 Chap 11 Corrected
2/34
Chapter 11
Objectives
Definition of terms
Reasons for information gap between informationneeds and availability
Reasons for need of data warehousing
Describe three levels of data warehousearchitectures
List four steps of data reconciliation
Describe two components of star schema
Estimate fact table size
Design a data mart2
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
3/34
Chapter 11
Definition Data Warehouse:
A subject-oriented, integrated, time-variant, non-updatable
collection of data used in support of management decision-making processes
Subject-oriented: e.g. customers, patients, students,products
Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources Time-variant: Can study trends and changes
Nonupdatable: Read-only, periodically refreshed
Data Mart: A data warehouse that is limited in scope
3
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
4/34
Chapter 11
Need for Data Warehousing Integrated, company-wide view of high-quality
information (from disparate databases)
Separation ofoperationaland informationalsystems and
data (for improved performance)
4
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
5/34
Chapter 11
5Source: adapted from Strange (1997).
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
6/34
Chapter 11
Data Warehouse Architectures Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Logical Data Mart and Real-Time Data Warehouse
Three-Layer architecture
6
All involve some form ofextraction,transformation and loadi
ng (ETL)
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
7/34
Chapter 11
7
Figure 11-2: Generic two-level data warehousing architecture
E
T
L
One,company-
wide
warehouse
Periodic extraction data is not completely current in warehouse
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
8/34
Chapter 11
8
Figure 11-3 Independent data mart
data warehousing architectureData marts:Mini-warehouses, limited in scope
E
T
L
Separate ETL for each
independent data mart
Data access complexity
due tomultiple data marts
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
9/34
Chapter 11
9
Figure 11-4 Dependent data mart with
operational data store:a three-level architecture
E
T
L
Single ETL for
enterprise data warehouse
(EDW)
Simpler data access
ODS provides option for
obtainingcurrent data
Dependent data marts
loaded from EDW
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
10/34
Chapter 11
10
E
T
L
Near real-time ETL for
Data Warehouse
ODS and data warehouse
are one and the same
Data marts are NOT separate databases,
but logical views of the data warehouse
Easier to create new data marts
Figure 11-5 Logical data mart and real
time warehouse architecture
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
11/34
Chapter 11
11
Figure 11-6 Three-layer data architecture for a data warehouse
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
12/34
Chapter 11
Data CharacteristicsStatus vs. Event Data
12
Status
Status
Event = a database action(create/update/delete) thatresults from a transaction
Figure 11-7
Example of DBMS
log entry
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
13/34
Chapter 11 13
Data CharacteristicsTransient vs. Periodic Data
Withtransientdata,changesto existingrecords
arewrittenoverpreviousrecords,
thusdestroyingthepreviousdata
content
Figure 11-8
Transient
operational data
-
7/29/2019 Chap 11 Corrected
14/34
Chapter 11 14
Periodicdata are
neverphysicallyaltered or
deletedonce they
havebeen
added tothe store
Data CharacteristicsTransient vs. Periodic Data
Figure 11-9:
Periodic
warehouse data
-
7/29/2019 Chap 11 Corrected
15/34
Chapter 11
Other Data Warehouse
Changes New descriptive attributes
New business activity attributes
New classes of descriptive attributes
Descriptive attributes become more refined
Descriptive data are related to one another
New source of data
15
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
16/34
Chapter 11
The Reconciled Data Layer
Typical operational data is: Transient
not historical
Not normalized (perhaps due to denormalization for performance)
Restricted in scopenot comprehensive
Sometimes poor qualityinconsistencies and errors
After ETL, data should be: Detailednot summarized yet
Historicalperiodic
Normalized3rd normal form or higher
Comprehensiveenterprise-wide perspective
Timely
data should be current enough to assist decision-making
Quality controlledaccurate with full integrity
16
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
17/34
Chapter 11
The ETL Process
Capture/Extract
Scrub or data cleansingTransform
Load and Index
17
ETL = Extract, transform, and load
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
18/34
Chapter 11
18
Static extract = capturinga snapshot of the sourcedata at a point in time
Incremental extract =capturing changes thathave occurred since the laststatic extract
Capture/Extractobtaining a snapshot of a chosen subsetof the source data for loading into the data warehouse
Figure 11-10:
Steps in data
reconciliation
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
19/34
Chapter 11
19
Scrub/Cleanseuses pattern recognition and AItechniques to upgrade data quality
Fixing errors: misspellings,erroneous dates, incorrect fieldusage, mismatched addresses,missing data, duplicate data,inconsistencies
Also: decoding, reformatting,time stamping, conversion, keygeneration, merging, errordetection/logging, locatingmissing data
Figure 11-10:
Steps in data
reconciliation(cont.)
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
20/34
Chapter 11
20
Transform = convert data from format of operationalsystem to format of data warehouse
Record-level:Selectiondata partitioningJoiningdata combiningAggregationdata summarization
Field-level:single-fieldfrom one field to one fieldmulti-fieldfrom many fields to one, orone field to many
Figure 11-10:
Steps in data
reconciliation(cont.)
2
007byPrenticeHall
FAROOQ
-
7/29/2019 Chap 11 Corrected
21/34
Chapter 11
21
Load/Index= place transformed datainto the warehouse and create indexes
Refresh mode: bulk rewritingof target data at periodic intervals
Update mode: only changesin source data are written to datawarehouse
Figure 11-10:
Steps in data
reconciliation(cont.)
2
007byPrenticeHa
ll
FAROOQ
Fi 11 11 Si l fi ld t f ti
-
7/29/2019 Chap 11 Corrected
22/34
Chapter 11
22
Figure 11-11: Single-field transformation
In generalsome transformationfunction translates data from old
form to new form
Algorithmictransformation usesa formula or logical expression
Tablelookupanotherapproach, uses a separatetable keyed by sourcerecord code
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
23/34
Chapter 11
23
Figure 11-12: Multifield transformation
M:1from many sourcefields to one target field
1:Mfrom onesource field tomany target fields
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
24/34
Chapter 11
Derived Data Objectives
Ease of use for decision support applications
Fast response to predefined user queries
Customized data for particular target audiences
Ad-hoc query support
Data mining capabilities
Characteristics
Detailed (mostly periodic) data
Aggregate (for summary)
Distributed (to departmental servers)
24Most common data model = star schema
(also called dimensional model)
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
25/34
Chapter 11
25
Figure 11-13 Components of a star schema
Fact tablescontain factualor quantitative data
Dimension tablescontain descriptionsabout the subjects of the business
1:N relationship between
dimension tables and fact tables
Excellent for ad-hoc queries, but bad for online transaction processing
Dimension tables are denormalized to
maximize performance
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
26/34
Chapter 11
26
Figure 11-14 Star schema example
Fact tableprovides statistics for salesbroken down by product, period andstore dimensions
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
27/34
Chapter 11
27
Figure 11-15 Star schema with sample data
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
28/34
Chapter 11
Issues Regarding Star Schema Dimension table keys must be surrogate (non-intelligent
and non-business related), because: Keys may change over time
Length/format consistency
Granularity of Fact Tablewhat level of detail do youwant? Transactional grain
finest level
Aggregated grainmore summarized
Finer grains better market basket analysis capability
Finer grainmore dimension tables, more rows in fact table
Duration of the databasehow much history should bekept? Natural duration13 months or 5 quarters
Financial institutions may need longer duration
Older data is more difficult to source and cleanse28
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
29/34
Chapter 11
29
Figure 11-16: Modeling dates
Fact tables contain time-period data Date dimensions are important
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
30/34
Chapter 11
The User Interface
Metadata (data catalog)
Identify subjects of the data mart
Identify dimensions and facts
Indicate how data is derived from enterprise datawarehouses, including derivation rules
Indicate how data is derived from operational data store,including derivation rules
Identify available reports and predefined queries Identify data analysis techniques (e.g. drill-down)
Identify responsible people30
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
31/34
Chapter 11
On-Line Analytical Processing (OLAP) Tools
The use of a set of graphical tools that provides
users with multidimensional views of their dataand allows them to analyze the data using simplewindowing techniques
Relational OLAP (ROLAP)
Traditional relational representation Multidimensional OLAP (MOLAP)
Cube structure
OLAP Operations
Cube slicing
come up with 2-D view of data
Drill-downgoing from summary to more detailed views
31
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
32/34
Chapter 11
32
Figure 11-23 Slicing a data cube
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
33/34
Chapter 11
33
Figure 11-24
Example of drill-down
Summary report
Drill-down withcolor added
Starting with summarydata, users can obtaindetails for particular
cells
2
007byPrenticeHa
ll
FAROOQ
-
7/29/2019 Chap 11 Corrected
34/34
Chapter 11
Data Mining and Visualization Knowledge discovery using a blend of statistical, AI, and computer
graphics techniques
Goals: Explain observed events or conditions
Confirm hypotheses
Explore data for new or unexpected relationships
Techniques
Statistical regression Decision tree induction
Clustering and signal processing
Affinity
Sequence association
Case-based reasoning
Rule discovery Neural nets
Fractals
Data visualizationrepresenting data in graphical/multimedia formatsfor analysis
34
2
007byPrenticeHa
ll
FAROOQ