Adv. DBS and Data Warehouse, CSC5301, Ch1 and 2, Hachim Haddouti
Date posted: 21-Dec-2015
Do You Remember?
OLTP (examples?)
OLAP
ROLAP / MOLAP
Star Schema
Fact Table
DSS
Slice/Dice, Drill Down/RollUp
Requirements of OLAP Products
FASMI Definition
Fast (simple queries under 1 s, medium queries under 5 s, complex queries under 20 s)
Analysis (analytical and statistical functionality, ad hoc)
Shared
Multidimensional (essential to give different views)
Information (extraction of information from data)
MD Data model
Asymmetric: one large dominant table in the center of the schema (reached through multiple joins), surrounded by dimension tables
Better visualization Business queries (why, what if)
MD Model
Fact table: where the numerical measurements of the business are stored, plus the keys of the dimension tables
Dimension table: textual description of dimension
Attributes: describing items in Dimension
Basic Concepts of the MDD-Model
Def.: A Dimension is a data type (almost always finite), which is used as a component of a composite (multidimensional) key.
Def.: Dimension-Members are elements of a dimension.
Examples: frequently enumeration types or intervals:
Month = (January, February, …, December)
Day = [1:31]
Further examples:
BMW-engines = {1600, 1800, 2000, 2300, 2800, 3000, 3500, 4000, 5000, 2500D, 3000D}
BMW-bodies = {3rd, 5th, 7th, 7thL, 3rdL, 3rd Cabrio, 5thL, 8th, Z3}
Note: not all combinations of engines and bodies are built
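Data Cube Example
[Figure: data cube with the dimensions Product, Region, and Time. Labelled dimension members include Trekking Bike and Mountain Bike, GB and D, and June. Each cell holds a fact (measure); the values shown are 35, 30, and 25.]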
Def.: Dimension-Attributes: additional attributes for a detailed description of the dimension members.
Examples:
Number of days per month:
dimension Month = ((January, 31), (February, 29), …, (April, 30), …, (December, 31))
Gasoline type and number of cylinders of an engine:
dimension BMW-engine = {(1600, Super, 4), …, (2500D, Diesel, 5), …, (4000, Regular, 8), …}
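As a rough illustration of members versus attributes, the tuples above could be held in plain Python dictionaries, with the member as key and the attributes as value. This is only a sketch: the variable and function names are hypothetical, and only the tuples spelled out above are included (the elided ones stay elided).

```python
# Hypothetical sketch: dimension member -> dimension attributes.
# Only tuples that appear explicitly in the examples are listed.

# Month: member -> number of days
month_days = {"January": 31, "February": 29, "April": 30, "December": 31}

# BMW-engine: member -> (gasoline type, number of cylinders)
bmw_engine = {
    "1600": ("Super", 4),
    "2500D": ("Diesel", 5),
    "4000": ("Regular", 8),
}


def members(dimension):
    """Return the dimension members, i.e. the keys, in sorted order."""
    return sorted(dimension)


def engines_with_fuel(dimension, fuel):
    """Select members by one of their attributes (here: gasoline type)."""
    return [m for m, (f, _cylinders) in dimension.items() if f == fuel]


diesel = engines_with_fuel(bmw_engine, "Diesel")
```

The attributes play no role in identifying a cube cell; they only describe the member and allow selections such as "all diesel engines".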
Relational Modeling
Dimension: finite fixed relation
Dimension-Element: key
Dimension-Attribute: other non-key attributes
Cube: relation; on E/R level a relationship between dimensions
Cube-key: key composed of foreign keys of dimensions
Measures: non-key attributes of cube
Dimensionality: number of foreign keys
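The mapping above can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not a prescribed design: the table and column names (bmw_engines, bmw_bodies, months, facts, revenue, units) are assumptions chosen to match the BMW example, and the cube becomes a relation whose key is composed of foreign keys into the dimension relations.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: finite relations; the key is the dimension element,
-- the remaining columns are dimension attributes.
CREATE TABLE bmw_engines (b_e TEXT PRIMARY KEY, fuel TEXT, cyl INTEGER);
CREATE TABLE bmw_bodies  (b_b TEXT PRIMARY KEY);
CREATE TABLE months      (name TEXT PRIMARY KEY, days INTEGER);

-- Cube: a relation whose key is composed of foreign keys of the dimensions;
-- the measures are its non-key attributes.
CREATE TABLE facts (
    b_e     TEXT NOT NULL REFERENCES bmw_engines(b_e),
    b_b     TEXT NOT NULL REFERENCES bmw_bodies(b_b),
    m       TEXT NOT NULL REFERENCES months(name),
    revenue REAL,
    units   INTEGER,
    PRIMARY KEY (b_e, b_b, m)
);
""")

# Dimensionality = number of foreign keys of the cube relation.
foreign_keys = conn.execute("PRAGMA foreign_key_list(facts)").fetchall()
dimensionality = len(foreign_keys)
referenced_tables = {row[2] for row in foreign_keys}
```

Here the cube is three-dimensional, because the facts relation carries three foreign keys.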
Relational Model
[Figure: relational schema with the tables BMW-engines (B-E, fuel, cyl), Bodies (B-B, …), Month (Name, days), and Facts (B-E, B-B, M, €, #); annotated with the label GRAIN.]
E/R-Model: Star-Schema
[Figure: E/R diagram with Facts connected to BMW-Engines, BMW-bodies, and Months, i.e. a simple star schema.]
Multidimensional Representation of 3-dim Data: Dimensions with Measures or Facts
Density and Sparsity
Def.: Dense data cube: all combinations of (d1, …, dm) occur. A sparse data cube is one that is not dense.
Def.: $\text{sparsity} = 1 - \dfrac{\text{number of occupied cells}}{\text{total number of cells}}$
Note: the logical model assumes dense cubes; the physical storage model deals with dense and sparse cubes.
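A small sketch of the sparsity definition, assuming the cube is given as a list of dimensions (each a list of members) and a set of occupied cells. The member lists are shortened from the BMW example, and the choice of which combinations are "built" is invented purely for illustration.

```python
from itertools import product


def sparsity(dimensions, occupied_cells):
    """sparsity = 1 - (number of occupied cells / total number of cells)."""
    total_cells = 1
    for dimension_members in dimensions:
        total_cells *= len(dimension_members)
    return 1 - len(occupied_cells) / total_cells


# Shortened member lists; the built combinations below are hypothetical.
engines = ["1600", "1800", "2000"]
bodies = ["3rd", "5th"]
built = {("1600", "3rd"), ("1800", "3rd"), ("2000", "5th")}

sparse_value = sparsity([engines, bodies], built)

# A dense cube contains every combination, so its sparsity is 0.
dense_value = sparsity([engines, bodies], set(product(engines, bodies)))
```

With 3 of 6 possible cells occupied, the sparsity is 0.5.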
Data Warehouse Shape
[Figure: internal information sources (Purchase, Storage, Personnel, Financial, Sales) and external information sources (Customer, Supplier, Market, Competition) feed the Data Warehouse, which serves analyses and trends.]
History
[Figure: timeline with the marks 60', 70', 80', Begin 90', and Mid 90', showing MIS (= Management Information System), MAIS (= Marketing Information System), DSS (= Decision Support System), EIS (= Executive Information System), and then Data Warehouse system, EIS (= Enterprise Intelligence System), IDF (= Information Delivery Facility), Information Warehouse, EIS (= Enterprise Information System).]
Unchanged vision: the right information at the right time and place
Ch2: The Grocery Store
Case: 500 large grocery stores in 3 regions, each store with many departments
Steps in the design process
Choose a business process to model (orders, invoices, sales, etc.)
Choose the grain of the business process (daily snapshots, monthly snapshots, ...)
Choose the dimensions (e.g. time, customer, product)
Choose the measured facts (quantity sold, DH sold)
Grocery Store Item Movement
Stock Keeping Unit (SKU): individual products (60,000)
Universal Product Code (UPC): bar codes
Point of sale (POS): the front door, where customers take away their purchases; the back door is where vendors make deliveries
Temporary price reduction (TPR): promotions
Shelf/end-aisle displays: ads such as displays at stores
Design Principles: Identifying the processes to model
1. The first step in the design is to decide what business process(es) to model, by combining an understanding of the business with an understanding of what data is available.
Daily item movement (the GRAIN is SKU by store by promotion by day): what products are selling in which stores, at what prices, on which days
2. The second step in the design is to decide on the grain of the fact table in each business process (daily item movement; why not each transaction?).
Terms: market basket, depleted, cannibalized, syndicated data suppliers (comparing own sales with those of competing stores, e.g. the top 10 items sold by my competitor)
3. A data warehouse almost always demands data expressed at the lowest possible grain of each dimension, not because queries want to see individual low-level records, but because queries need to cut through the database in very precise ways.
Design Principles
4. A careful grain statement determines the primary dimensionality of the fact table. It is then usually possible to add additional dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only a single value under each combination of the primary dimensions. If it is recognized that an additional desired dimension violates the grain by causing additional records to be generated, then the grain statement must be revised to accommodate this additional dimension.
Grocery Store Schema
Sales Fact: Time key, Product key, Store key, Promotion key, Other fact…
Product dim: Product key, Product attributes…
Store dim: Store key, Store attributes…
Time dim: Time key, Time attributes…
Promotion dim: Promotion key, Promotion attributes…
Design Principles
Picking the business measurements for the fact table
The number of base sales transaction line items in a business can be estimated by dividing the gross revenue of the business by the average price of a sales item.
Resisting normalization
5. The fact table in a dimensional schema is naturally highly normalized.
6. Efforts to normalize any of the tables in a dimensional database solely in order to save disk space are a waste of time.
Grocery Store Schema showing measured facts
Sales Fact: Time key, Product key, Store key, Promotion key, DH_sales, Units_sales, DH_cost, Customer_count
Product dim: Product key, Product attributes…
Store dim: Store key, Store attributes…
Time dim: Time key, Time attributes…
Promotion dim: Promotion key, Promotion attributes…
Suppose each attribute in the fact table takes 4 bytes, except Store key, which takes 2 bytes. Each fact row is then only 30 bytes wide (7 x 4 + 2), so a fact table with 1 billion rows takes about 30 GB.
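The row-width estimate can be checked with a few lines of arithmetic. This is a sketch under the stated assumption of 4 bytes per field and 2 bytes for the store key; the lower-case field names are illustrative, and 1 GB is taken as 10^9 bytes.

```python
# Assumed field widths in bytes: 4 for every field except the store key (2).
FIELD_BYTES = {
    "time_key": 4,
    "product_key": 4,
    "store_key": 2,
    "promotion_key": 4,
    "dh_sales": 4,
    "units_sales": 4,
    "dh_cost": 4,
    "customer_count": 4,
}

row_bytes = sum(FIELD_BYTES.values())   # width of one fact row
rows = 1_000_000_000                    # a 1-billion-row fact table
table_bytes = row_bytes * rows
table_gb = table_bytes / 10**9          # decimal gigabytes (assumption)
```

Indexes and storage overhead are deliberately ignored here, as they are in the estimate itself.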
Design Principles
Preserving browsing
7. The dimension tables must not be normalized but should remain as flat tables. Normalized dimension tables destroy the ability to browse. Disk space savings gained by normalizing the dimension tables are typically less than one percent of the total disk space needed for the overall schema.
Time Dimension
The time dimension (time_key, holiday_flag, fiscal_period, season, event, ...)
Most data warehouses need an explicit time dimension table even though the primary time key may be an SQL date-valued object. The explicit time dimension is needed to describe fiscal periods, seasons, holidays, weekends, and other calendar calculations that are difficult to get from the SQL date machinery.
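A minimal sketch of how such an explicit time dimension could be generated, one row per day. The column names beyond time_key, holiday_flag, and fiscal_period are hypothetical, the holiday set is supplied by the caller, and the fiscal period is simply assumed to equal the calendar month, which a real business would replace with its own fiscal calendar.

```python
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def build_time_dimension(start, number_of_days, holidays=frozenset()):
    """Return one dictionary (row) per calendar day, starting at `start`."""
    rows = []
    for offset in range(number_of_days):
        day = start + timedelta(days=offset)
        rows.append({
            "time_key": offset + 1,                   # surrogate key
            "date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "weekend_flag": day.weekday() >= 5,       # Saturday or Sunday
            "holiday_flag": day in holidays,
            # Assumption: fiscal period == calendar month.
            "fiscal_period": f"FY{day.year}-P{day.month:02d}",
        })
    return rows


time_dim = build_time_dimension(
    date(2015, 1, 1), 730, holidays={date(2015, 1, 1)}
)
```

Queries can then constrain on weekend_flag or holiday_flag directly instead of deriving them from a date value at query time.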
The product dimension
Product dimension (product_key, SKU_desc, SKU_no, brand, ...)
Merchandise hierarchy (e.g. SKU to package to brand to subcategories to categories to departments)
Drill up/down: drilling down in a data warehouse is nothing more than adding row headers from the dimension tables. Drilling up is subtracting row headers. An explicit hierarchy is not needed to support drilling down.
The product dimension is one of the two or three primary dimensions in nearly every data warehouse. Great care should be taken to fill this dimension with as many descriptive attributes as possible. Retail product dimension tables should have at least 50 attributes.
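The idea that drilling down is just adding a row header can be sketched in plain Python. The rows, department names, and brand names below are invented for illustration; each row is assumed to be a fact already joined with its product attributes.

```python
from collections import defaultdict

# Hypothetical joined rows: (department, brand, units_sales).
COLUMNS = ("department", "brand", "units_sales")
rows = [
    ("Bakery", "BrandA", 10),
    ("Bakery", "BrandB", 5),
    ("Frozen", "BrandA", 7),
    ("Bakery", "BrandA", 3),
]


def summarize(rows, row_headers):
    """Sum units_sales grouped by the given row headers."""
    positions = [COLUMNS.index(header) for header in row_headers]
    measure = COLUMNS.index("units_sales")
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[p] for p in positions)] += row[measure]
    return dict(totals)


# Drilled-up view: one row header.
by_department = summarize(rows, ["department"])
# Drill down: add the row header "brand"; nothing else changes.
by_department_brand = summarize(rows, ["department", "brand"])
```

Any attribute of the dimension could be added as a row header in the same way, which is why no explicit hierarchy is required.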
Store Dimension
The store dimension (store_key, store_name, store_no, store_address, …)
Pull-down list
SYNONYM (create FIRST_OPEN_TIME as SYNONYM FOR DATE)
The promotion dimension
Promotion dim (promotion_key, promotion_name, price_reduction_type, display_type, ..)
causal dimension (temporary price reduction, coupons,..)
Lift/baseline sales (gain in sales during promotion)
Time shifting, cannibalization, growing the market, profitability
The grocery store facts
Additive attributes; from them we can compute:
gross profit (DH revenue - DH cost)
gross margin (gross profit / DH revenue)
8. A nonadditive calculation, such as a ratio like gross margin, can be calculated for any slice of the fact table by remembering to calculate the ratio of sums, not the sum of the ratios. In other words, the computation must be distributed over the sums, not the other way around.
9. Customer counts are usually semi-additive when they occur in time snapshot fact tables because they double count activity across products during the customer event. In these cases they can be used correctly in user applications only by restricting the keys in the nonadditive dimensions to single values.
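Principle 8 can be illustrated with two invented fact rows. The numbers and the lower-case field names are hypothetical; the point is only that gross margin over a slice must be computed from the summed revenue and cost.

```python
# Two hypothetical fact rows (amounts in DH).
facts = [
    {"dh_sales": 100.0, "dh_cost": 80.0},  # row-level margin 0.20
    {"dh_sales": 10.0, "dh_cost": 5.0},    # row-level margin 0.50
]


def gross_margin(rows):
    """Ratio of sums: (sum of revenue - sum of cost) / sum of revenue."""
    revenue = sum(row["dh_sales"] for row in rows)
    cost = sum(row["dh_cost"] for row in rows)
    return (revenue - cost) / revenue


# Correct for the slice: 25 / 110.
margin_of_slice = gross_margin(facts)

# Incorrect: adding the row-level ratios gives 0.20 + 0.50.
sum_of_ratios = sum(
    (row["dh_sales"] - row["dh_cost"]) / row["dh_sales"] for row in facts
)
```

The sum of ratios (0.70) is not a margin at all, while the ratio of sums (about 0.227) is the margin of the combined slice.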
Database sizing for the grocery chain
time dimension: 2 years x 365 days = 730 days
store dimension: 300 stores, reporting sales each day
product dimension: 30,000 products in each store; 3,000 sell each day/store
promotion dimension: 1 item in no more than 1 promotion/store/day
base fact records: 730 x 300 x 3,000 x 1 = 657M
key fields: 4; fact fields: 4; total fields: 8
base fact table size: 657M x 8 fields x 4 bytes = 21 GB
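The sizing above is plain multiplication and can be reproduced as follows. The variable names are illustrative, and 4 bytes per field is the assumption stated in the estimate.

```python
# Dimension cardinalities that contribute to the number of fact rows.
days = 2 * 365                       # time dimension
stores = 300                         # stores reporting each day
products_sold_per_store_day = 3_000  # of 30,000 products, 3,000 sell
promotions_per_item = 1              # at most 1 promotion per item/store/day

base_fact_records = (
    days * stores * products_sold_per_store_day * promotions_per_item
)

key_fields = 4
fact_fields = 4
bytes_per_field = 4                  # assumed field width

base_fact_table_bytes = (
    base_fact_records * (key_fields + fact_fields) * bytes_per_field
)
```

The result is 657,000,000 rows and roughly 21 GB; note that only the 3,000 products that actually sell per store per day generate rows, not all 30,000.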