1 9 Ch1 and 2, Hachim Haddouti Adv. DBS and Data Warehouse CSC5301 Ch1 and 2 Hachim Haddouti.

  • date post

    21-Dec-2015

1

9

Ch1 and 2, Hachim Haddouti

Adv. DBS and Data Warehouse, CSC5301, Ch1 and 2

Hachim Haddouti

Do You Remember?

OLTP (examples?)

OLAP

ROLAP / MOLAP

Star Schema

Fact Table

DSS

Slice/Dice, Drill Down/RollUp

Requirements of OLAP Products

FASMI Definition

Fast (simple queries in under 1 s, medium queries in under 5 s, complex queries in under 20 s)

Analysis (analytical and statistical functionality, ad hoc)

Shared

Multidimensional (essential to give different views)

Information (extraction of information from data)

MD Data model

Asymmetric: one large dominant table in the center of the schema (multiple joins), surrounded by dimension tables

Better visualization Business queries (why, what if)

MD Model

Fact table: where the numerical measurements of the business are stored, plus the keys of the dimension tables

Dimension table: textual description of dimension

Attributes: describing items in Dimension

Data Warehouse Structure

Basic Concepts of the MDD-Model

Def.: A Dimension is a data type (almost always finite), which is used as a component of a composite (multidimensional) key.

Def.: Dimension-Members are elements of a dimension.

Examples: frequently enumeration types or intervals:

Month = (January, February, …, December)

Day = [1:31]
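The two definitions above can be illustrated with a minimal Python sketch. It assumes an enumeration for Month and an integer range for Day; the cube contents and the helper `is_valid_key` are made up for illustration and are not part of the course material.

```python
from enum import Enum

# A dimension as an enumeration type; its values are the dimension members.
Month = Enum("Month", ["January", "February", "March", "April", "May", "June",
                       "July", "August", "September", "October", "November",
                       "December"])

# A dimension as an interval: Day = [1:31].
Day = range(1, 32)

# A cell is addressed by a composite key with one member per dimension.
# The measure values are hypothetical.
cube = {
    (Month.June, 14): 35,
    (Month.June, 15): 30,
}

def is_valid_key(key):
    """Return True if every key component is a member of its dimension."""
    month, day = key
    return isinstance(month, Month) and day in Day

print(len(Month), len(Day), all(is_valid_key(k) for k in cube))
```

Both dimensions are finite, which is what makes them usable as components of a composite key.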

Further examples:

BMW-engines = {1600, 1800, 2000, 2300, 2800, 3000, 3500, 4000, 5000, 2500D, 3000D}

BMW-bodies = {3rd, 5th, 7th, 7thL, 3rdL, 3rd Cabrio, 5thL, 8th, Z3}

Note: not all combinations of engines and bodies are built.

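Data Cube Example

[Figure: a three-dimensional data cube. The dimensions are Product, Region and Time; dimension members include Trekking Bike and Mountain Bike (Product), GB and D (Region), and June (Time). Each cell holds facts (measures), for example 35, 30 and 25.]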

Def.: Dimension-Attributes: additional attributes for a detailed description of the dimension members.

Examples:

Number of days per month: dimension Month = ((January, 31), (February, 29), …, (April, 30), …, (December, 31))

Gasoline type and number of cylinders of an engine: dimension BMW-engine = {(1600, Super, 4), …, (2500D, Diesel, 5), …, (4000, Regular, 8), …}

Relational Modeling

Dimension: finite fixed relation

Dimension-Element: key

Dimension-Attribute: other non-key attributes

Cube: relation; on E/R level a relationship between dimensions

Cube-key: key composed of foreign keys of dimensions

Measures: non-key attributes of cube

Dimensionality: number of foreign keys
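The mapping above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table and column names (`engine`, `body`, `month`, `facts`, `revenue`, `units`) and all inserted values are hypothetical, loosely following the BMW example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: finite relations; key = dimension element,
-- remaining columns = dimension attributes.
CREATE TABLE engine (engine_id TEXT PRIMARY KEY, fuel TEXT, cylinders INTEGER);
CREATE TABLE body   (body_id TEXT PRIMARY KEY);
CREATE TABLE month  (name TEXT PRIMARY KEY, days INTEGER);

-- Cube: a relation whose key is composed of the dimensions' foreign keys;
-- the measures are its non-key attributes.
CREATE TABLE facts (
    engine_id TEXT REFERENCES engine(engine_id),
    body_id   TEXT REFERENCES body(body_id),
    month     TEXT REFERENCES month(name),
    revenue   REAL,
    units     INTEGER,
    PRIMARY KEY (engine_id, body_id, month)
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO engine VALUES ('1600', 'Super', 4)")
conn.execute("INSERT INTO engine VALUES ('2500D', 'Diesel', 5)")
conn.execute("INSERT INTO body VALUES ('3rd')")
conn.execute("INSERT INTO month VALUES ('January', 31)")
conn.execute("INSERT INTO facts VALUES ('1600', '3rd', 'January', 1000.0, 2)")
conn.execute("INSERT INTO facts VALUES ('2500D', '3rd', 'January', 500.0, 1)")

# Dimensionality = number of foreign keys of the cube relation.
dimensionality = len(conn.execute("PRAGMA foreign_key_list(facts)").fetchall())

# A query that groups a measure by a dimension attribute.
units_by_fuel = conn.execute("""
    SELECT e.fuel, SUM(f.units)
    FROM facts f JOIN engine e ON f.engine_id = e.engine_id
    GROUP BY e.fuel ORDER BY e.fuel
""").fetchall()

print(dimensionality, units_by_fuel)
```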

Relational Model

[Figure: relational tables of the BMW example. The figure also carries the label GRAIN.]

BMW-engines (B-E, fuel, cyl)

Bodies (B-B, …)

Month (Name, days)

Facts (B-E, B-B, M, €, #)

E/R-Model: Star-Schema

[Figure: Facts in the center, related to the dimensions BMW-Engines, BMW-bodies and Months]

i.e. a simple star schema

Relational Representation of Multidimensional Data

Multidimensional Representation of 3-dim Data: Dimensions with Measures or Facts

Density and Sparsity

Def.: Dense data-cube: all combinations of (d1, …, dm) occur. A sparse data-cube is one that is not dense.

Def.: sparsity = 1 - (number of occupied cells) / (total number of cells)

Note: the logical model assumes dense cubes; the physical storage model deals with dense and sparse cubes.
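As a small worked example of the sparsity definition, the following Python sketch uses a reduced, hypothetical version of the engine and body dimensions. The occupied cells and their values are invented, and the function name `sparsity` is an assumption for illustration.

```python
from itertools import product

# Reduced, hypothetical dimensions.
engines = ["1600", "1800", "2000"]
bodies = ["3rd", "5th", "7th"]

# Occupied cells only (not all engine/body combinations are built).
occupied = {("1600", "3rd"): 10, ("1800", "3rd"): 7, ("2000", "5th"): 4}

def sparsity(cells, *dimensions):
    """sparsity = 1 - occupied cells / total cells."""
    total = 1
    for members in dimensions:
        total *= len(members)
    return 1 - len(cells) / total

s = sparsity(occupied, engines, bodies)  # 1 - 3/9
is_dense = all(key in occupied for key in product(engines, bodies))

print(round(s, 3), is_dense)
```

Storing only the occupied cells, as the dictionary does here, is one simple way a physical model can handle a sparse cube.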

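Data Warehouse Shape

[Figure: internal and external information sources feed the data warehouse, which supports analyses and trends.]

Internal information sources: Purchase, Storage, Personnel, Financial, Sales

External information sources: Customer, Supplier, Market competition

Data Warehouse: Analyses, Trends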

History

MIS (= Management Information System)

MAIS (= Marketing Information System)

DSS (= Decision Support System)

EIS (= Executive Information System)

Data warehouse system, EIS (= Enterprise Intelligence System), IDF (= Information Delivery Facility), Information Warehouse, EIS (= Enterprise Information System)

Timeline: 60s, 70s, 80s, beginning of the 90s, mid-90s

Unchanged vision: the right information at the right time and place

Example 1

Example 2

Ch2: The Grocery Store

Case: 500 large grocery stores in 3 regions, each store with many departments

Steps in the design process

Choose a business process to model (order, invoice, sales, etc.)

Choose the grain of the business process (daily snapshots, monthly, ...)

Choose the dimensions (e.g. time, customer, product)

Choose the measured facts (quantity sold, DH sold)

Grocery Store Item Movement

Stock Keeping Unit (SKU): individual products (60 000)

Universal Product Code (UPC): bar codes

Point of sale (POS): front door, where customers buy takeaway; back door, where vendors make deliveries

Temporary price reduction (TPR): promotions

Shelf/end-aisle display: ads like displays at stores

Design Principles: Identifying the processes to model

1. The first step in the design is to decide what business process(es) to model, by combining an understanding of the business with an understanding of what data is available.

Daily item movement (GRAIN is SKU by store by promotion by day): what products are selling in which stores, at what prices, on which days.

2. The second step in the design is to decide on the grain of the fact table in each business process (daily item movement; why not each transaction?).

Market basket, depleted, cannibalized, syndicated data suppliers (comparing own sales with other competitive stores, e.g. top 10 sold by my competitor).

3. A data warehouse almost always demands data expressed at the lowest possible grain of each dimension, not because queries want to see individual low-level records, but because queries need to cut through the database in very precise ways.

Design Principles

4. A careful grain statement determines the primary dimensionality of the fact table. It is then usually possible to add additional dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only a single value under each combination of the primary dimensions. If it is recognized that an additional desired dimension violates the grain by causing additional records to be generated, then the grain statement must be revised to accommodate this additional dimension.
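To make the grain "SKU by store by promotion by day" concrete, here is a minimal Python sketch that rolls individual POS line items up to that grain. The line items, the field layout and the function name `roll_up_to_grain` are hypothetical and only serve as an illustration.

```python
from collections import defaultdict

# Hypothetical POS line items: (day, store, sku, promotion, units, dh_sales).
line_items = [
    ("2015-06-01", "S1", "SKU-1", "none", 2, 20.0),
    ("2015-06-01", "S1", "SKU-1", "none", 1, 10.0),
    ("2015-06-01", "S1", "SKU-1", "TPR", 3, 24.0),
    ("2015-06-02", "S1", "SKU-1", "none", 1, 10.0),
]

def roll_up_to_grain(items):
    """Aggregate line items to one fact row per (day, store, sku, promotion)."""
    facts = defaultdict(lambda: [0, 0.0])
    for day, store, sku, promotion, units, dh_sales in items:
        cell = facts[(day, store, sku, promotion)]
        cell[0] += units
        cell[1] += dh_sales
    return {key: tuple(values) for key, values in facts.items()}

daily_item_movement = roll_up_to_grain(line_items)
for key, measures in sorted(daily_item_movement.items()):
    print(key, measures)
```

Four line items collapse into three fact rows here, because two of them share the same day, store, SKU and promotion.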

Grocery Store Schema

Sales Fact: Time key, Product key, Store key, Promotion key, Other fact…

Product dim: Product key, Product attributes…

Store dim: Store key, Store attributes…

Time dim: Time key, Time attributes…

Promotion dim: Promotion key, Promotion attributes…

Design Principles

Picking the business measurements for the fact table

The number of base sales transaction line items in a business can be estimated by dividing the gross revenue of the business by the average price of a sales item.

Resisting normalization

5. The fact table in a dimensional schema is naturally highly normalized.

6. Efforts to normalize any of the tables in a dimensional database solely in order to save disk space are a waste of time.
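The line-item estimate mentioned above is a single division. The short Python sketch below shows it with invented figures; the revenue, the average item price and the function name `estimate_line_items` are assumptions, not data from the case.

```python
def estimate_line_items(gross_revenue, average_item_price):
    """Estimate the number of base sales transaction line items."""
    return gross_revenue / average_item_price

# Hypothetical figures: 4 billion DH gross revenue, 2 DH average item price.
line_items = estimate_line_items(4_000_000_000, 2.0)
print(f"{line_items:,.0f} line items")
```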

Grocery Store Schema showing measured facts

Sales Fact: Time key, Product key, Store key, Promotion key, DH_sales, Units_sales, DH_cost, Customer_count

Product dim: Product key, Product attributes…

Store dim: Store key, Store attributes…

Time dim: Time key, Time attributes…

Promotion dim: Promotion key, Promotion attributes…

Suppose each attribute in the fact table takes 4 bytes, except the store key, which takes 2 bytes. A fact table row is then only 30 bytes wide, so a fact table with 1 billion rows takes 30 GB.
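The row-width arithmetic can be checked with a few lines of Python. The field names follow the schema on this slide; the lower-case spelling and the use of decimal gigabytes are assumptions of this sketch.

```python
# Width in bytes of each fact table field (store key is 2 bytes, others 4).
FIELD_BYTES = {
    "time_key": 4,
    "product_key": 4,
    "store_key": 2,
    "promotion_key": 4,
    "dh_sales": 4,
    "units_sales": 4,
    "dh_cost": 4,
    "customer_count": 4,
}

row_bytes = sum(FIELD_BYTES.values())
rows = 1_000_000_000                  # a 1-billion-row fact table
table_gb = row_bytes * rows / 10**9   # decimal gigabytes

print(row_bytes, "bytes per row,", table_gb, "GB in total")
```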

Design Principles

Preserving browsing

7. The dimension tables must not be normalized but should remain as flat tables. Normalized dimension tables destroy the ability to browse. Disk space savings gained by normalizing the dimension tables are typically less than one percent of the total disk space needed for the overall schema.

Time Dimension

The time dimension (time_key, holiday_flag, fiscal_period, season, event,..)

Most data warehouses need an explicit time dimension table even though the primary time key may be an SQL date-valued object. The explicit time dimension is needed to describe fiscal periods, seasons, holidays, weekends, and other calendar calculations that are difficult to get from the SQL date machinery.
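A sketch of how such an explicit time dimension could be generated, one row per day, is given below. The holiday list, the fiscal period rule (fiscal year equal to calendar year, one period per month) and the season rule are all assumptions made for illustration; a real calendar would come from the business.

```python
from datetime import date, timedelta

HOLIDAYS = {date(2015, 1, 1)}  # hypothetical holiday calendar
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
# Assumed season per month, January first.
SEASONS = ["winter", "winter", "spring", "spring", "spring", "summer",
           "summer", "summer", "autumn", "autumn", "autumn", "winter"]

def build_time_dimension(start, days):
    """Return one dimension row (a dict) per calendar day."""
    rows = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        rows.append({
            "time_key": offset + 1,
            "date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "weekend_flag": day.weekday() >= 5,
            "holiday_flag": day in HOLIDAYS,
            "fiscal_period": f"FY{day.year}-P{day.month:02d}",  # assumed rule
            "season": SEASONS[day.month - 1],
        })
    return rows

time_dim = build_time_dimension(date(2015, 1, 1), 730)
print(time_dim[0])
```

With the table in place, a question such as "sales on holiday weekends" becomes a plain filter on dimension attributes instead of a date calculation.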

The product dimension

Product dimension (product_key, SKU_desc, SKU_no, brand, ..)

Merchandise hierarchy (e.g. SKU to package to brand to subcategories to categories to departments)

Drill up/down: drilling down in a data warehouse is nothing more than adding row headers from the dimension tables. Drilling up is subtracting row headers. An explicit hierarchy is not needed to support drilling down.

The product dimension is one of the two or three primary dimensions in nearly every data warehouse. Great care should be taken to fill this dimension with as many descriptive attributes as possible. Retail product dimension tables should have at least 50 attributes.
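The statement that drilling down is just adding row headers can be sketched in Python as follows. The product rows, the sales values and the function name `report` are hypothetical.

```python
from collections import defaultdict

# Hypothetical product dimension: product_key -> descriptive attributes.
product_dim = {
    1: {"department": "Bakery", "brand": "BrandA"},
    2: {"department": "Bakery", "brand": "BrandB"},
    3: {"department": "Frozen", "brand": "BrandA"},
}

# Hypothetical fact rows: (product_key, dh_sales).
sales_fact = [(1, 10.0), (2, 5.0), (3, 7.0), (1, 2.5)]

def report(row_headers):
    """Sum dh_sales grouped by the given dimension attributes."""
    totals = defaultdict(float)
    for product_key, dh_sales in sales_fact:
        attributes = product_dim[product_key]
        totals[tuple(attributes[h] for h in row_headers)] += dh_sales
    return dict(totals)

by_department = report(["department"])               # rolled up
by_department_brand = report(["department", "brand"])  # drilled down

print(by_department)
print(by_department_brand)
```

Nothing in `report` knows that brand sits below department; any attribute can be added as a row header, which is why no explicit hierarchy is required.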

Store Dimension

The store dimension (store_key, store_name, store_no, store_address, …)

Pull down list

SYNONYM (create FIRST_OPEN_TIME as SYNONYM FOR DATE)

The promotion dimension

Promotion dim (promotion_key, promotion_name, price_reduction_type, display_type, ..)

Causal dimension (temporary price reduction, coupons, ..)

Lift/baseline sales (gain in sales during promotion)

Time shifting, cannibalization, growing the market, profitability

The grocery store facts

Additive attributes; from them we can compute:

gross profit (DH revenue - DH cost)

gross margin (gross profit / DH revenue)

8. A nonadditive calculation, such as a ratio like gross margin, can be calculated for any slice of the fact table by remembering to calculate the ratio of sums, not the sum of the ratios. In other words, the computation must be distributed over the sums, not the other way around.

9. Customer counts are usually semi-additive when they occur in time snapshot fact tables because they double count activity across products during the customer event. In these cases they can be used correctly in user applications only by restricting the keys in the nonadditive dimensions to single values.
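Principle 8 can be checked numerically. The sketch below uses two invented fact rows and computes the gross margin once as a ratio of sums and once, incorrectly, as a sum of per-row ratios; the values and the function name `gross_margin` are assumptions for illustration.

```python
# Hypothetical fact rows for one slice: (dh_sales, dh_cost).
rows = [
    (100.0, 50.0),  # row margin 0.5
    (10.0, 9.0),    # row margin 0.1
]

def gross_margin(fact_rows):
    """Ratio of sums: (total revenue - total cost) / total revenue."""
    revenue = sum(sales for sales, _ in fact_rows)
    cost = sum(cost for _, cost in fact_rows)
    return (revenue - cost) / revenue

ratio_of_sums = gross_margin(rows)  # (110 - 59) / 110
sum_of_ratios = sum((sales - cost) / sales for sales, cost in rows)  # wrong

print(round(ratio_of_sums, 4), round(sum_of_ratios, 4))
```

The two numbers differ because the small second row counts as much as the large first row when per-row ratios are combined.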

Database sizing for the grocery chain

Time dimension: 2 years x 365 days = 730 days

Store dimension: 300 stores, reporting sales each day

Product dimension: 30,000 products in each store; 3,000 sell each day/store

Promotion dimension: 1 item in no more than 1 promotion/store/day

Base fact records: 730 x 300 x 3,000 = 657M

Key fields: 4; fact fields: 4; total fields: 8

Base fact table size: 657M x 8 fields x 4 bytes = 21 GB
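The sizing figures above can be reproduced with a short Python calculation. The variable names are assumptions of this sketch, and gigabytes are taken as decimal (10^9 bytes).

```python
days = 2 * 365                       # time dimension
stores = 300                         # store dimension
products_sold_per_store_day = 3_000  # of 30,000 products per store

# One promotion per item/store/day, so promotions add no extra rows.
base_fact_records = days * stores * products_sold_per_store_day

key_fields = 4
fact_fields = 4
bytes_per_field = 4
table_bytes = base_fact_records * (key_fields + fact_fields) * bytes_per_field

print(f"{base_fact_records:,} records, {table_bytes / 10**9:.1f} GB")
```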