Dwh Concepts
Data Warehousing Concepts 06/01/2010
BFS 4/YORKSHIRE BUILDING SOCIETY MANOJ I BHADIYADRA [email protected]
Agenda
• What is Data Warehouse?
• What is Data Model?
• Data Warehouse Architecture
• ETL
• What is Dimension and Fact?
• What is Star Schema?
• What is Snow Flake Schema?
• What is Galaxy Schema?
• Advantages of using Star, Snow flake and Galaxy Schemas
• What is Primary Key and Surrogate Key?
• What are Roll-Up and Drill-Down operations?
• Design Tips
What is Data Warehouse?
• A single, complete, consistent store of data obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.
• A collection of data designed to support the management decision-making process. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time.
• A Data Warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
What is Data Model?
The logical data structure developed during the logical database design process is a data model or entity model. It is also a description of the structural properties that define all entities represented in a database and all the relationships that exist among them.
OR
A structured way of viewing a set of data: the design of the tables and their corresponding relationships in a relational database.
Data Warehouse Architecture
[Diagram: Operational Databases and External Data Sources are loaded through ETL into the EDW, which feeds the Data Marts (dm1 to dm6), which in turn feed Reports.]
ETL
• Short for Extract, Transform, Load: three database functions combined into one tool that pulls data out of one database/schema and places it into another.
• Extract -- the process of reading data from a database.
• Transform -- the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database/schema.
• Load -- the process of writing the data into the target database/schema.
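The three steps above can be sketched in a few lines. This is only a minimal illustration using Python's built-in sqlite3 module, not a real ETL tool; the table names (src_orders, dw_orders), their columns and the sample rows are hypothetical.

```python
import sqlite3

def extract(source):
    # Extract: read the raw rows from the source database.
    return source.execute(
        "SELECT order_id, customer, amount_cents FROM src_orders").fetchall()

def transform(rows):
    # Transform: trim and upper-case names, convert cents to a decimal amount.
    return [(oid, name.strip().upper(), cents / 100.0)
            for oid, name, cents in rows]

def load(target, rows):
    # Load: write the converted rows into the target schema.
    target.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", rows)
    target.commit()

# Hypothetical source and target databases, both in memory for the example.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE src_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER)")
source.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                   [(1, " joe ", 1200), (2, "sally", 5000)])

target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE dw_orders (order_id INTEGER, customer TEXT, amount REAL)")

load(target, transform(extract(source)))
loaded = target.execute("SELECT * FROM dw_orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the definition: each one can be replaced (a different source, a different cleaning rule) without touching the others.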
What is Dimension and Fact?
• Dimension:
A user typically needs to evaluate or analyze some aspect of the organization’s business. The requirements that have been collected must represent the two key elements of this analysis: what is being analyzed, and the evaluation criteria for what is being analyzed. The evaluation criteria are referred to as measures (numeric attributes of a fact), and what is being analyzed is referred to as dimensions (descriptive attributes of a fact).
• Fact:
The fact table contains IDs that reference the dimension tables, and measures that quantify the change or performance of the dimension members.
What is Star Schema?
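[Diagram: fact table sale linked to the dimension tables customer, product and store]
sale: orderId, date, custId, prodId, storeId, qty, amt
customer: custId, name, address, city
product: prodId, name, price
store: storeId, city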
Star Schema With Data
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
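As a rough sketch, the star schema and sample rows above can be loaded into an in-memory SQLite database and queried by joining the fact table to its dimensions. The column types and the choice of SQLite are assumptions made for the example; the table names, columns and data come from the slides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
INSERT INTO customer VALUES
    (53, 'joe', '10 main', 'sfo'),
    (81, 'fred', '12 main', 'sfo'),
    (111, 'sally', '80 willow', 'la');
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES
    ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
    ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
    ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Total amount per customer and store city: one join per dimension used.
rows = con.execute("""
    SELECT customer.name, store.city, SUM(sale.amt)
    FROM sale
    JOIN customer ON sale.custId = customer.custId
    JOIN store    ON sale.storeId = store.storeId
    GROUP BY customer.name, store.city
    ORDER BY customer.name
""").fetchall()
print(rows)  # [('joe', 'nyc', 23), ('sally', 'la', 50)]
```

Every dimension is exactly one join away from the fact table, which is what keeps star-schema queries simple.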
Dimension Hierarchies
store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

[Diagram: snowflake schema. store links to sType and to city; city links to region.]
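The hierarchy above can be walked with a chain of joins. The sketch below loads the four tables into an in-memory SQLite database (the column types and the use of SQLite are assumptions for the example) and resolves each store to its region and store type.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (
    storeId TEXT PRIMARY KEY,
    cityId  TEXT REFERENCES city(cityId),
    tId     TEXT REFERENCES sType(tId),
    mgr     TEXT
);
INSERT INTO region VALUES ('north', 'cold region'), ('south', 'warm region');
INSERT INTO city   VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
INSERT INTO sType  VALUES ('t1', 'small', 'downtown'), ('t2', 'large', 'suburbs');
INSERT INTO store  VALUES ('s5', 'sfo', 't1', 'joe'), ('s7', 'sfo', 't2', 'fred'),
                          ('s9', 'la', 't1', 'nancy');
""")

# store -> city -> region needs two joins; store -> sType needs one more.
rows = con.execute("""
    SELECT store.storeId, region.name, sType.size
    FROM store
    JOIN city   ON store.cityId = city.cityId
    JOIN region ON city.regId = region.regId
    JOIN sType  ON store.tId = sType.tId
    ORDER BY store.storeId
""").fetchall()
print(rows)
```

The region name is stored once per region rather than once per store, which is the reduced redundancy of a snowflake schema; the price is the extra joins.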
Data Dimension
time
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000

[Diagram: hierarchy levels all, years, quarters, months, days, weeks]
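Rows like the ones in the time table are usually generated rather than typed in. The sketch below is one possible way to do it, under two assumptions that are not stated in the slides: day is counted from the start of the year, and weeks are plain 7-day blocks counted from day 1 (which matches the table, but is not a calendar or ISO week). The function name time_dimension is hypothetical.

```python
from datetime import date, timedelta

def time_dimension(year, days):
    """Build (day, week, month, quarter, year) rows for the first `days` days of `year`."""
    rows = []
    for day in range(1, days + 1):
        d = date(year, 1, 1) + timedelta(days=day - 1)
        week = (day - 1) // 7 + 1         # assumption: 7-day blocks counted from day 1
        quarter = (d.month - 1) // 3 + 1  # months 1-3 are quarter 1, and so on
        rows.append((day, week, d.month, quarter, year))
    return rows

rows = time_dimension(2000, 8)
for row in rows:
    print(row)
```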
What is Galaxy Schema?
The Galaxy Schema, or "Multiple Fact Table Schema", is composed of multiple fact tables that partially share the same dimension tables.
In a Galaxy Schema you have two or more related fact tables surrounded by common dimensions.
Advantages
• The benefit of a star schema is that it is simpler than snowflake and galaxy schemas, making it easier for the ETL processes to load the data into the Dimensional Data Store (DDS).
• The benefit of a snowflake schema is less redundancy, so less disk space is required.
• The benefit of a galaxy schema is the ability to model business events more accurately by using several fact tables.
Galaxy Schema
[Diagram: two fact tables, Sales Fact and Purchase Fact, sharing the Time and Product dimensions]
Customer: Cust_ID, Cust_Name, Cust_State
Area: Area_ID, Area_Name
Time: Time_Id, Day, Week, Month, Year
Product: Product_Id, Name, Type_Name, Prod_Brand, Size, Colour_Name
Supplier: Supplier_Id, Supplier_Name, Supplier_Category
Sales Fact: Sales_Id, Prod_Id, Cust_Id, Sale_Price, Quantity, Time_Id, Area_Id
Purchase Fact: Purchase_Id, Supplier_Id, Prod_Id, Purchase_Price, Quantity, Time_Id
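A shared dimension lets the two fact tables be compared side by side. The sketch below keeps only the Product dimension and a few columns of each fact table; the sample rows are invented for illustration and the use of SQLite is an assumption. Each fact table is aggregated on its own before joining, so rows from one fact table do not multiply rows from the other.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Product (Product_Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Sales_Fact (
    Sales_Id INTEGER PRIMARY KEY, Prod_Id INTEGER, Sale_Price REAL, Quantity INTEGER);
CREATE TABLE Purchase_Fact (
    Purchase_Id INTEGER PRIMARY KEY, Prod_Id INTEGER, Purchase_Price REAL, Quantity INTEGER);
-- Hypothetical sample data.
INSERT INTO Product VALUES (1, 'bolt'), (2, 'nut');
INSERT INTO Purchase_Fact VALUES (1, 1, 6.0, 100), (2, 2, 3.0, 200);
INSERT INTO Sales_Fact VALUES (1, 1, 10.0, 30), (2, 1, 10.0, 20), (3, 2, 5.0, 50);
""")

# Quantity bought versus quantity sold per product, via the shared dimension.
rows = con.execute("""
    SELECT p.Name, b.bought, s.sold
    FROM Product p
    JOIN (SELECT Prod_Id, SUM(Quantity) AS bought
          FROM Purchase_Fact GROUP BY Prod_Id) b ON b.Prod_Id = p.Product_Id
    JOIN (SELECT Prod_Id, SUM(Quantity) AS sold
          FROM Sales_Fact GROUP BY Prod_Id) s ON s.Prod_Id = p.Product_Id
    ORDER BY p.Product_Id
""").fetchall()
print(rows)  # [('bolt', 100, 50), ('nut', 200, 50)]
```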
Basic Structure of Dimension
PRIMARY KEY
• Definition: The primary key of a relational table uniquely identifies each record in the table. It can either be a normal attribute that is guaranteed to be unique (such as Social Security Number in a table with no more than one record per person) or it can be generated by the DBMS.
Primary keys may consist of a single attribute or multiple attributes in combination.
SURROGATE KEY
• A unique primary key generated by the RDBMS that is not derived from any data in the database and whose only significance is to act as the primary key. A surrogate key is frequently a sequential number.
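A small sketch of the difference, using SQLite as an example DBMS: customer_key is a surrogate key generated as a sequential number, while cust_id is the natural key that comes from the source data. The table name customer_dim and its column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, generated by the DBMS
    cust_id      TEXT UNIQUE,                        -- natural key from the source system
    name         TEXT
)""")

# The surrogate key is never supplied by the loader; the DBMS assigns it.
con.executemany("INSERT INTO customer_dim (cust_id, name) VALUES (?, ?)",
                [("53", "joe"), ("81", "fred"), ("111", "sally")])

keys = con.execute(
    "SELECT customer_key, cust_id FROM customer_dim ORDER BY customer_key").fetchall()
print(keys)  # [(1, '53'), (2, '81'), (3, '111')]
```

Fact tables would then reference customer_key, so a change to the source system's cust_id format does not ripple through the warehouse.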
ROLAP VS MOLAP
ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing
Roll Up AND Drill-Down
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM sale GROUP BY date, prodId

Result:
sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
Roll up: summarize data by climbing up a hierarchy or by dimension reduction.
Drill down: the reverse of roll up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
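Both operations amount to changing the GROUP BY list. The sketch below loads the sale rows from the example into an in-memory SQLite database and summarizes them at three levels; the helper name summarize is hypothetical, and it only accepts the fixed column names used here (it should not be fed user input, since the names are placed directly into the SQL text).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
con.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

def summarize(*dims):
    # Sum amt for every combination of the given dimension columns.
    cols = ", ".join(dims)
    sql = f"SELECT {cols}, SUM(amt) FROM sale GROUP BY {cols} ORDER BY {cols}"
    return con.execute(sql).fetchall()

by_date = summarize("date")                      # rolled up to one dimension
by_prod_date = summarize("date", "prodId")       # drill down: introduce prodId
detail = summarize("date", "prodId", "storeId")  # drill down: introduce storeId

print(by_date)       # [(1, 81), (2, 48)]
print(by_prod_date)  # [(1, 'p1', 62), (1, 'p2', 19), (2, 'p1', 48)]
```

Reading the three results from bottom to top is a roll up by dimension reduction; reading them from top to bottom is a drill down.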
Design Tips
What data is needed?
Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?