Data ware house design

54
SOFTWARE/WEB/MOBILE/DATABASE ARCHITECT, ENGINEER, AND DEVELOPER TORONTO, CANADA HTTP://SAYED.JUSTETC.NET HTTP://WWW.JUSTETC.NET Sayed Ahmed gical Design of a Data Warehous

Transcript of Data ware house design

Page 1: Data ware house design

SOFTWARE/WEB/MOBILE/DATABASE ARCHITECT, ENGINEER, AND DEVELOPER TORONTO, CANADAHTTP://SAYED.JUSTETC.NETHTTP://WWW.JUSTETC.NET

Sayed Ahmed

Logical Design of a Data Warehouse

Page 2: Data ware house design

OUR SERVICES Free Training and Educational

Services Training and Education in Bangla: 

Bangla.SaLearningSchool.com Training and Education in English:

www.SaLearningSchool.com English.SaLearningSchool.com

Ask a question and get answers:  Ask.JustEtc.net

Page 3: Data ware house design

TOPICS - KEYWORDS Design a Data Warehouse

Star Schema Snow Flake Schema Dimension Tables Fact Tables Auditing Surrogate Keys Type 1, Type 2, Type 3, and Mixed solutions for slowly

changing dimension data ( SCD management) Pivoting for Analysis

To help with SSAS on data warehouse

Page 4: Data ware house design

TOPICS - KEYWORDS Design a Data Warehouse

Additive measures Semi additive measures

Hierarchies for dimensions Attributes in dimensions Attributes in lookup tables

Long term data warehouse design Usually Star Schema

Short term data warehouse design POC Usually snowflake schema

Page 5: Data ware house design

TOPICS - KEYWORDS Fact Tables

 measures foreign keys and possibly an additional primary key and lineage columns  granularity of fact tables auditing and lineage needs

Measures can be additive non-additive semi-additive

Page 6: Data ware house design

TOPICS - KEYWORDS dimension

keys names attributes member properties translations and lineage

Page 7: Data ware house design

TOPICS - KEYWORDS attributes

natural hierarchies many-to-many fact table relationships

you can introduce an additional intermediate dimension

Page 8: Data ware house design

CONCLUSION Not much – right However, if you understand all the terms and

can implement all these concepts in your data warehouse That will be great Not necessarily you will need to use all of these

concepts; however, you may need to justify based on the situation, will all or any of these will help?

What will help and what will not help Check our sub sequent videos and tutorials

Page 9: Data ware house design

THANK YOU Any Concerns?

http://ask.justetc.net Or comment below...

Page 10: Data ware house design

TOOLS AND SOFTWARE REQUIREMENTS Download the Adventure Works databases

OLTP database (LOB database) Data warehouse Database From

http://msftdbprodsamples.codeplex.com/releases/view/55330 For this tutorial, you can just check our slides Though the following tools will help

And probably check the details in the downloaded databases esp. The AdventureWorksDW2012

You will need help from SQL Server and SQL Server MGMT Studio Tools

Page 11: Data ware house design

REQUIRED TOOLS Useful/Required SQL Server

Components Database Engine Services Documentation Components Management Tools - Basic Management Tools – Complete SQL Server Data Tools

Page 12: Data ware house design

DATA WAREHOUSE DESIGN – THE DETAILS Data Warehouse Logical Design

Topics: Design and Implement a Data Warehouse Design and implement dimensions. Design and implement fact tables Design Auditing

track the source and time for data coming into a DW through auditing i.e lineage information

Why a Data Warehouse? It is hard to

generate reports from OLTP/LOB/Transactional database To do Analysis on OLTP database data (some times) Get useful information/useful summarized and details data to

be used to take business decisions

Page 13: Data ware house design

DATA WAREHOUSE DESIGN – THE DETAILS Why a Data Warehouse?

Data in OLTP are heavily normalized. The goal was to keep one data only in one single place to reduce redundancy and consistency of data

You may end up with many tables 100s, 1000s To generate reports you may need to join many

tables – will be slow Historical data may not be there Data quality is also an issue For reporting or analyzing, you may need data from

multiple databases across many departments

Page 14: Data ware house design

WHY A DATA WAREHOUSE? So you can create a Data Warehouse

By cleaning data With historical data Combining data from multiple sources Denormalizing data Using specific design geared towards Data

Warehouse design Some or many consider DW design is less complex

than relational database design Though it also has some complex areas to address... (by

those some or many)

Page 15: Data ware house design

SO WHAT DOES A DATA WAREHOUSE CONTAIN?

Usually two schemas are used for a DW Star Schema-> looks like a star Snow Flake Schema

Another one called Dimensional Model Includes both Star and Snow Flake in the

same Data Warehouse Both Schemas has tables of two types

Dimension Tables Fact Tables

Page 16: Data ware house design

SO WHAT DOES A DATA WAREHOUSE CONTAIN?

Fact Tables are in the center A Fact table joins/combines all the data required for this

reporting or for the business aspect of this reporting Usually combines the primary keys of different tables that

contain data for this report/business aspect Dimension tables are all the other tables that

contain actual data Dimension tables are the tables that contain data

these can be the actual tables in the OLTP database without any modification (Snow Flake)

Or Dimension tables can be newly created by denormalizing the existing OLTP databases (Star)

Page 17: Data ware house design

SO WHAT DOES A DATA WAREHOUSE CONTAIN?

So, you know now what are dimension tables and what are fact tables Fact tables contain primary keys of all related

tables (here they are foreign keys) Dimension tables contain data

Usually, it’s better that you keep your data warehouse separate from your OLTP database So bring all the tables (dimension) here Or denormalize them and bring them here in the

new database

Page 18: Data ware house design

SIMPLIFIED: WHAT ARE STAR AND SNOWFLAKE SCHEMAS

If you just create Fact tables and take all the related tables from your OLTP/LOB databases You get a Snow Flake Schema Here all Dimension tables are still normalized (as

you just took them from the actual database) This is easy –

so good for short-term, quick, and experimental Data Warehouse

One note, your reporting and analysis services queries (MDX, DMV) will be slow with Snow Flake Schemas

Page 19: Data ware house design

SO WHAT DOES A DATA WAREHOUSE CONTAIN?

Now, when you denormalize the dimension tables You get the start schema The Fact tables remain the same for

example Star Schema is kind of standard and

used a lot Originally was developed in 1980’s

Page 20: Data ware house design

EXAMPLES: WHY REPORTING IN OLTP DATABASE IS NOT A GREAT IDEA

Sales amount for internet sales by different countries and historical years

Page 21: Data ware house design

WHY REPORTING IN OLTP DATABASE IS NOT A GREAT IDEA

issues that I did not mention before If your OLTP database was well designed

(?) It may be hard to find the tables related to the

reporting The table names and the column names can be

tricky – do not follow any conventions – do not have meaning

So it can be hard to find data for the reporting

Page 22: Data ware house design

WHY REPORTING IN OLTP DATABASE IS NOT A GREAT IDEA

Note: Reality: The OLTP may not even be well designed (that makes

reporting hard sometimes) even the relationships as well as normalization

– here we assumed that OLTP is perfect In a long back project

I had to re-write/verify/check/change/optimize/had to deal with (whatever you say) 100s (not really 100s, can be close to 100) of queries for a reporting system

Had to change the interface from one button for one report (easy to get lost)

Into a drop down list of reports The relations among data were arbitrary – actually had only in the mind of

the designer – did not follow any standards – No ER – no standard concepts---

So it was a hard job.. Anyway..

Page 23: Data ware house design

WHY REPORTING IN OLTP DATABASE IS NOT A GREAT IDEA

In such cases Tools such as SQL Profiler might help you could create a test environment,

try to insert some data through an LOB application have SQL Profiler identify where the data was inserted

Another, issue with this particular example No lookup for dates and years

You need to extract The tables may not contain even historical data

No date field So no historical data

Page 24: Data ware house design

WHY REPORTING IN OLTP DATABASE IS NOT A GREAT IDEA

If sales data reside in multiple databases even by multiple departments How do you merge Identify and match Customer data can be in different database with no

common identification Data quality can be low

Data missing Partial data Inconsistent data in multiple databases Data can be represented differenlt in different database

M or F for gender 1 or 0 for gender

Page 25: Data ware house design

STAR SCHEMA/FACT/DIMENSION/CUBE

Page 26: Data ware house design

TOTAL DW: MULTIPLE STAR SCHEMAS You saw one Star Schema for Internet Sales You can see another for Offline Sales Another for Accounting Your DW has many such Star Schemas And these start schemas need to be connected/related They will be connected when you use the same dimensions

for them i.e. If two star schemas have the same dimension they can share

that dimension Called: shared or conformed dimensions

For SSAS, you can use shared dimensions only There is a concept of private dimension

Not a great idea in practical and real life applications You cannot connect/compare/verify the data over the shared dimension

Page 27: Data ware house design

SHARED/CONFORMED DIMENSIONS

Page 28: Data ware house design

DENORMALIZED DIMDATE TABLE

Page 29: Data ware house design

SNOW FLAKES WILL BE MORE AND MORE NORMALIZED

Everything can be normalized Or the first level can be normalized

others are not

Page 30: Data ware house design

NORMALIZED PRODUCT DIMENSIONIn the Star Schema, you could use these normalized product table to get snow flake schema (partially.) Could use all normalized dimensions to get full snow flake

Page 31: Data ware house design

SNOW FLAKE In Snow flake, you may see partial than

full snow flakes in reality Though, in reality, better to go for star

schema Queries will be faster

Page 32: Data ware house design

PARTIAL SNOW FLAKE

Page 33: Data ware house design

GRANULARITY The number of Dimension Tables

connected to a fact table Dimension of a star schema Cube = 3 dimension SSAS operates/analyzes on Cube

Page 34: Data ware house design

AUDITING AND LINEAGE

I will be very short on this In data warehouse, you may want some

auditing tables For every update, you should audit

who made the update, when it was made, and how many rows were transferred

to each dimension and fact table

in your DW

Page 35: Data ware house design

AUDITING AND LINEAGE You will need additional fields/columns

in your dimension and fact tables to track When, and who, and from where the row

data was/were updated Your ETL process needs to be updated If you used SSIS for the ETL

Modify SSIS packages so that you can record these information

Page 36: Data ware house design

THANK YOU Any Concerns?

http://ask.justetc.net Or comment below...

Page 37: Data ware house design

DESIGNING DIMENSIONS Keys . Used to identify entities Name columns . Used for human

names of entities Attributes . Used for pivoting in

analyses Member properties . Used for labels

in a report Lineage columns . Used for auditing,

and never exposed to end users

Page 38: Data ware house design

For analysis Pivot Table Pivot Graph

For Dimensions The fields used as for pivoting are called Attributes Not all columns are attributes Attributes: based on what analysis are done In previous, slide you saw the different types of

columns

Page 39: Data ware house design

Attributes  For pivoting, discrete attributes with a

small number of distinct values is most appropriate

Should not be continuous Keys are not good candidates for pivoting

and analysis To make continous column for pivoting

Concert/utilize it as a small set of discrete values

Page 40: Data ware house design

 SSAS can discretize continuous attributes. Not always great – need business perspecyive as well Age and Income are not good candidates for auto

discretize Naming columns to identify the entity

Not good for pivoting or keys Address such as Columns used in reports as labels only, not for

pivoting, are called member properties. Can include translations

Page 41: Data ware house design

Lineage and auditing columns Used for auditing data Never exposed to the users

Page 42: Data ware house design
Page 43: Data ware house design

Possible Attributes BirthDate (after calculating age and discretizing the age) MaritalStatus Gender YearlyIncome (after discretizing) TotalChildren NumberChildrenAtHome EnglishEducation (other education columns are for translations) EnglishOccupation (other occupation columns are for

translations) HouseOwnerFlag NumberCarsOwned CommuteDistance

Page 44: Data ware house design
Page 45: Data ware house design

FullDateAlternateKey (denotes a date in date format)

EnglishMonthName CalendarQuarter CalendarSemester CalendarYear Drill Down attributes

 CalendarYear →CalendarSemester → CalendarQuarter → EnglishMonthName → FullDateAlternateKey.

Page 46: Data ware house design

why dimension columns used in reports for labels are called member properties.

In a Snowflake schema, lookup tables show you levels of hierarchies. In a Star schema, you need to extract natural hierarchies from the names and content of columns. Nevertheless, because drilling down through natural hierarchies is so useful and welcomed by end users, you should use them as much as possible.

Page 47: Data ware house design

SLOWLY CHANGING DIMENSIONS Type 1

History lost Type 2

Keeps all history Type 3

Keeps partial history You can use a combination

For some columns type1 for others type 2

Page 48: Data ware house design
Page 49: Data ware house design
Page 50: Data ware house design
Page 51: Data ware house design
Page 52: Data ware house design

DESIGNING FACT TABLES

Fact tables include measures, foreign keys, and possibly an additional primary key and lineage columns.

Measures can be additive, non-additive, or semi-additive.

For many-to-many relationships, you can introduce an additional intermediate dimension.

Page 53: Data ware house design

Fact tables Collection of measurements on a specific

aspects of business Measure columns sales amount, order quantity, and discount

amount.

Page 54: Data ware house design