Dataware House Introduction by QuontraSolutions

INTRODUCTION TO DATA WAREHOUSING BY QUONTRA SOLUTIONS PHONE : (404)-900-9988 EMAIL : [email protected] WEBSITE : WWW.QUONTRASOLUTIONS.COM

description

Quontra Solutions' main motto is to provide the best industry-oriented online training on all IT courses. All our courses are taught by experienced trainers who have extensive field knowledge of the topics they teach. We offer a job-oriented online training program on Informatica, taught by trainers with real-time experience. Quontra Solutions provides training to a wide range of customers, including working professionals, job-seeking candidates, corporate clients and students. To work with Informatica, a learner needs at least intermediate SQL knowledge and at least basic UNIX knowledge (Informatica is installed in a UNIX environment in most cases), along with some analytical skill in writing expressions.

Transcript of Dataware House Introduction by QuontraSolutions

Page 1: Dataware House Introduction by QuontraSolutions

INTRODUCTION TO DATA WAREHOUSING

BY QUONTRA SOLUTIONS

PHONE : (404)-900-9988

EMAIL : [email protected]

WEBSITE : WWW.QUONTRASOLUTIONS.COM

Page 2: Dataware House Introduction by QuontraSolutions

DATA WAREHOUSE
• Maintain historic data
• Analysis to get a better understanding of the business
• Better decision making

Definition: A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996

Page 3: Dataware House Introduction by QuontraSolutions

SUBJECT ORIENTED
• A data warehouse is organized around subjects such as sales, product and customer.
• It focuses on modeling and analysis of data for decision makers.
• It excludes data that is not useful in the decision support process.

Page 4: Dataware House Introduction by QuontraSolutions

INTEGRATED
• A data warehouse is constructed by integrating multiple heterogeneous sources.
• Data preprocessing is applied to ensure consistency.

[Diagram: RDBMS, Legacy System and Flat File sources feed the Data Warehouse through Data Processing / Data Transformation steps]

Page 5: Dataware House Introduction by QuontraSolutions

NON-VOLATILE
• Mostly, data once recorded will not be updated.
• A data warehouse requires two operations in data accessing:
– Incremental loading of data
– Access of data

[Diagram: load, access]

Page 6: Dataware House Introduction by QuontraSolutions

TIME VARIANT
• Provides information from a historical perspective, e.g. the past 5-10 years
• Every key structure contains, either implicitly or explicitly, an element of time

Page 7: Dataware House Introduction by QuontraSolutions

WHY DATA WAREHOUSE?
Problem Statement:
• ABC Pvt Ltd is a company with branches in USA, UK, CANADA and INDIA.
• The Sales Manager wants a quarterly sales report across the branches.
• Each branch has a separate operational system where sales transactions are recorded.

Page 8: Dataware House Introduction by QuontraSolutions

WHY DATA WAREHOUSE?

[Diagram: the Sales Manager collects figures separately from the USA, UK, CANADA and INDIA branch systems]

Get the quarterly sales figure for each branch and manually calculate the sales figure across branches.

What if he needs a daily sales report across the branches?

Page 9: Dataware House Introduction by QuontraSolutions

WHY DATA WAREHOUSE?
Solution:
• Extract sales information from each database.
• Store the information in a common repository at a single site.
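The solution above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-branch rows, the `repository` list and the helper names `quarter_of` and `quarterly_sales` are hypothetical and not part of the original slides.

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-branch extracts: (sale date, amount) rows.
branch_sales = {
    "USA": [(date(2014, 1, 15), 1200.0), (date(2014, 4, 2), 800.0)],
    "UK": [(date(2014, 2, 10), 500.0)],
    "CANADA": [(date(2014, 3, 30), 300.0)],
    "INDIA": [(date(2014, 5, 21), 700.0)],
}


def quarter_of(d):
    """Return a label such as '2014-Q1' for a date."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


# "Common repository at a single site": one list with rows from every branch.
repository = [
    {"branch": branch, "date": d, "amount": amount}
    for branch, rows in branch_sales.items()
    for d, amount in rows
]


def quarterly_sales(rows):
    """Total sales per quarter across all branches."""
    totals = defaultdict(float)
    for row in rows:
        totals[quarter_of(row["date"])] += row["amount"]
    return dict(totals)


report = quarterly_sales(repository)
print(report)
```

Once the rows sit in one place, a daily report is the same loop with a different grouping key, which is the point of the slide.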

Page 10: Dataware House Introduction by QuontraSolutions

WHY DATA WAREHOUSE?

[Diagram: the USA, UK, CANADA and INDIA systems feed the Data Warehouse, which the Sales Manager reaches through Query & Analysis tools]

Page 11: Dataware House Introduction by QuontraSolutions

CHARACTERISTICS OF DATA WAREHOUSE
• Relational / Multidimensional database
• Query and analysis rather than transaction
• Historical data from transactions
• Consolidates multiple data sources
• Separates query load from transactions
• Mostly non-volatile
• Large amount of data, in the order of TBs

Page 12: Dataware House Introduction by QuontraSolutions

WHEN WE SAY LARGE - WE MEAN IT!
• Terabytes -- 10^12 bytes: Yahoo! – 300 Terabytes and growing
• Petabytes -- 10^15 bytes: Geographic Information Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Page 13: Dataware House Introduction by QuontraSolutions

OLTP VS DATA WAREHOUSE (OLAP)

                              OLTP         Data Warehouse (OLAP)
Indexes                       Few          Many
Data                          Normalized   Generally de-normalized
Joins                         Many         Some
Derived data and aggregates   Rare         Common

Page 14: Dataware House Introduction by QuontraSolutions

DATA WAREHOUSE ARCHITECTURE

[Diagram: Operational Systems and Flat Files feed ETL (Extract, Transform and Load), which populates the Data Warehouse; the warehouse feeds the Sales, Inventory and Generic Data Marts, which serve Reporting, Analysis and Data Mining]

Page 15: Dataware House Introduction by QuontraSolutions

ETL
• ETL stands for Extract, Transform and Load
• Data is distributed across different sources
– Flat files, Streaming Data, DB Systems, XML, JSON
• Data can be in different formats
– CSV, Key Value Pairs
• Different units and representations
– Country: IN or India
– Date: 20 Nov 2010 or 20101120
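As a minimal sketch of reconciling such representations, the snippet below maps country spellings to one name and parses two date layouts into ISO format. The lookup table `COUNTRY_CODES`, the list `DATE_FORMATS` and both function names are assumptions made for illustration; note that `%b` month names depend on the locale (English is assumed here).

```python
from datetime import datetime

# Assumed lookup table; a real system would use a full reference list.
COUNTRY_CODES = {"IN": "India", "INDIA": "India"}

# Assumed set of source date layouts, tried in order.
DATE_FORMATS = ("%d %b %Y", "%Y%m%d")


def normalize_country(value):
    """Map known codes and spellings to one canonical country name."""
    cleaned = value.strip()
    return COUNTRY_CODES.get(cleaned.upper(), cleaned)


def normalize_date(value):
    """Parse any of the assumed layouts and return an ISO date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


print(normalize_country("IN"), normalize_date("20 Nov 2010"))
```

Values that match no known layout raise an error rather than being guessed, so bad source data surfaces during cleansing instead of reaching the warehouse.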

Page 16: Dataware House Introduction by QuontraSolutions

ETL FUNCTIONS
• Extract
– Collect data from different sources
– Parse data
– Remove unwanted data
• Transform
– Project
– Generate surrogate keys
– Encode data
– Join data from different sources
– Aggregate
• Load
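One of the Transform functions, generating surrogate keys, can be illustrated as follows. This is a sketch, not a prescribed implementation: the class name `SurrogateKeyGenerator` and the sample natural keys are hypothetical, and production tools usually rely on database sequences or a key lookup table instead.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Assigns a stable integer key to each distinct natural key."""

    def __init__(self):
        self._next = count(1)  # surrogate keys start at 1
        self._keys = {}        # natural key -> surrogate key

    def key_for(self, natural_key):
        """Return the existing surrogate key or allocate the next one."""
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next)
        return self._keys[natural_key]


keys = SurrogateKeyGenerator()
first = keys.key_for("CUST-A")
second = keys.key_for("CUST-B")
repeat = keys.key_for("CUST-A")
print(first, second, repeat)
```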

Page 17: Dataware House Introduction by QuontraSolutions

ETL STEPS
• The first step in the ETL process is mapping the data between the source systems and the target database.
• The second step is cleansing the source data in the staging area.
• The third step is transforming the cleansed source data.
• The fourth step is loading into the target system.

Data before ETL Processing:

Data after ETL Processing:
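The steps can be sketched end to end with the Python standard library. Everything here is an assumption for illustration: the sample CSV, the `COUNTRY_MAP` table, the `sales` target table and the function names. The mapping step is implicit in the way source columns are copied to identically named target columns.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real job would read files or source tables.
SOURCE_CSV = """customer,country,amount
Asha,IN,100
Bob,USA,250
Asha,India,50
,USA,75
"""

COUNTRY_MAP = {"IN": "India", "India": "India", "USA": "USA"}  # assumed mapping


def extract(text):
    """Extract: parse raw CSV text into dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Cleanse: drop rows without a customer and unify country names."""
    return [
        {**row, "country": COUNTRY_MAP[row["country"]]}
        for row in rows
        if row["customer"]
    ]


def transform(rows):
    """Transform: aggregate the amount per (customer, country)."""
    totals = {}
    for row in rows:
        key = (row["customer"], row["country"])
        totals[key] = totals.get(key, 0.0) + float(row["amount"])
    return [(cust, ctry, total) for (cust, ctry), total in sorted(totals.items())]


def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute("CREATE TABLE sales (customer TEXT, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(cleanse(extract(SOURCE_CSV))))
rows = conn.execute("SELECT * FROM sales ORDER BY customer").fetchall()
print(rows)
```

The in-memory Python lists play the role of the staging area here; the data only reaches the target table after cleansing and transformation.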

Page 18: Dataware House Introduction by QuontraSolutions

ETL GLOSSARY

Mapping: Defining the relationship between source and target objects.

Cleansing: The process of resolving inconsistencies in source data.

Transformation: The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include aggregating and integrating data from multiple sources.

Staging Area: A place where data is processed before entering the warehouse.

Page 19: Dataware House Introduction by QuontraSolutions

DIMENSION
• Categorizes the data. For example: time, location, etc.
• A dimension can have one or more attributes. For example, day, week and month are attributes of the time dimension.
• Role of dimensions in data warehousing:
– Slice and dice
– Filter by dimensions

Page 20: Dataware House Introduction by QuontraSolutions

TYPES OF DIMENSIONS
• Conformed Dimension - A dimension that is shared across fact tables.
• Junk Dimension - A convenient grouping of flags and indicators. For example, payment method, shipping method.
• Degenerate Dimension - A dimension key that has no attributes and hence does not have its own dimension table. For example, transaction number, invoice number. Values of this dimension are mostly unique within a fact table.
• Role Playing Dimension - A dimension that plays different roles in fact tables depending on the context. For example, the Date dimension can be used for the order date, shipment date and invoice date.
• Slowly Changing Dimension - A dimension whose data changes slowly, rather than on a regular, time-based schedule.

Page 21: Dataware House Introduction by QuontraSolutions

TYPES OF SLOWLY CHANGING DIMENSION

• Type 1 - The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all.

• Type 2 - The Type 2 method tracks historical data by creating multiple records for a given value in dimension table with separate surrogate keys.

• Type 3 - The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data.

• Type 4 - The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of all changes.

Types 1, 2 and 3 are commonly used.

Some books also discuss Type 0 and Type 6.

http://en.wikipedia.org/wiki/Slowly_changing_dimension
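The difference between Type 1 and Type 2 can be sketched on a small customer dimension. The column names (`sk`, `customer_id`, `city`, `start_date`, `end_date`) and the two helper functions are hypothetical; in particular, the validity-date columns are a common convention assumed here, since the slide only mentions multiple records with separate surrogate keys.

```python
from datetime import date


def apply_type1(dimension, natural_key, new_city):
    """Type 1: overwrite in place; the old value is lost."""
    for row in dimension:
        if row["customer_id"] == natural_key:
            row["city"] = new_city


def apply_type2(dimension, natural_key, new_city, change_date):
    """Type 2: close the current row, then add a row with a new surrogate key."""
    for row in dimension:
        if row["customer_id"] == natural_key and row["end_date"] is None:
            row["end_date"] = change_date
    next_key = max(row["sk"] for row in dimension) + 1
    dimension.append({
        "sk": next_key,
        "customer_id": natural_key,
        "city": new_city,
        "start_date": change_date,
        "end_date": None,  # None marks the current version
    })


dim_t1 = [{"sk": 1, "customer_id": "C1", "city": "Pune"}]
apply_type1(dim_t1, "C1", "Delhi")

dim_t2 = [{"sk": 1, "customer_id": "C1", "city": "Pune",
           "start_date": date(2010, 1, 1), "end_date": None}]
apply_type2(dim_t2, "C1", "Delhi", date(2014, 6, 1))

print(dim_t1)
print(dim_t2)
```

After the Type 1 update only "Delhi" remains, while the Type 2 table keeps the "Pune" row with a closing date next to the new current row.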

Page 22: Dataware House Introduction by QuontraSolutions

FACTS
• Facts are values that can be examined and analyzed. For example: Page Views, Unique Users, Pieces Sold, Profit.
• Fact and measure are synonymous.
• Types of facts:
– Additive - Measures that can be added across all dimensions.
– Non Additive - Measures that cannot be added across any dimension.
– Semi Additive - Measures that can be added across some dimensions but not others.
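A small worked example may help separate additive from semi-additive measures. The account snapshot rows below are invented for illustration: deposits behave as an additive measure, while an account balance is a typical semi-additive one because it can be summed across accounts but not across days.

```python
# Hypothetical daily snapshot rows: (day, account, deposits, balance).
snapshots = [
    ("Mon", "A", 100, 100),
    ("Mon", "B", 50, 50),
    ("Tue", "A", 20, 120),
    ("Tue", "B", 0, 50),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(row[2] for row in snapshots)

# Semi-additive: balances can be summed across accounts for a single day...
tuesday_balance = sum(row[3] for row in snapshots if row[0] == "Tue")

# ...but summing balances across days counts the same money more than once.
misleading_sum = sum(row[3] for row in snapshots)

print(total_deposits, tuesday_balance, misleading_sum)
```

Ratios such as a profit margin are a common example of the non-additive case: they are normally recomputed from additive components at each level rather than summed.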

Page 23: Dataware House Introduction by QuontraSolutions

HOW TO STORE DATA?
Facts and Dimensions:

1. Select the business process to model

2. Declare the grain of the business process

3. Choose the dimensions that apply to each fact table row

4. Identify the numeric facts that will populate each fact table row

Page 24: Dataware House Introduction by QuontraSolutions

DIMENSION TABLE
• Contains attributes of dimensions, e.g. month is an attribute of the Time dimension.
• Can also have foreign keys to another dimension table
• Usually identified by a unique integer primary key called a surrogate key

Page 25: Dataware House Introduction by QuontraSolutions

FACT TABLE
• Contains facts
• Foreign keys to dimension tables
• Primary key: usually a composite key of all FKs

Page 26: Dataware House Introduction by QuontraSolutions

TYPES OF SCHEMA USED IN DATA WAREHOUSE
• Star Schema
• Snowflake Schema
• Fact Constellation Schema

Page 27: Dataware House Introduction by QuontraSolutions

STAR SCHEMA
• Multi-dimensional data
• Dimension and fact tables
• A fact table with pointers to dimension tables
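A star schema can be tried out with Python's built-in `sqlite3` module. The table and column names (`dim_time`, `dim_region`, `fact_sales`) and the sample rows are assumptions for this sketch, not the schema pictured in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, country TEXT);

-- The fact table points to each dimension through a foreign key;
-- its primary key is the composite of those foreign keys.
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    region_key INTEGER REFERENCES dim_region(region_key),
    amount REAL,
    PRIMARY KEY (time_key, region_key)
);

INSERT INTO dim_time VALUES (1, 2014, 'Q1'), (2, 2014, 'Q2');
INSERT INTO dim_region VALUES (1, 'USA'), (2, 'INDIA');
INSERT INTO fact_sales VALUES (1, 1, 1200.0), (1, 2, 300.0), (2, 1, 800.0);
""")

# One join per dimension is enough to group facts by a dimension attribute.
sales_by_quarter = conn.execute("""
    SELECT t.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_key = f.time_key
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()
print(sales_by_quarter)
```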

Page 28: Dataware House Introduction by QuontraSolutions

STAR SCHEMA

Page 29: Dataware House Introduction by QuontraSolutions

SNOWFLAKE SCHEMA
• An extension of the star schema in which the dimension tables are partly or fully normalized.
• Dimension table hierarchies are broken down into simpler tables.
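The normalization can be sketched with `sqlite3` by splitting a product dimension so that the category lives in its own table. Table names, column names and sample rows are hypothetical; "Kettle" is an invented product added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category level of the product hierarchy gets its own table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount REAL
);

INSERT INTO dim_category VALUES (1, 'Home Appliances');
INSERT INTO dim_product VALUES (1, 'Toaster', 1), (2, 'Kettle', 1);
INSERT INTO fact_sales VALUES (1, 40.0), (2, 25.0), (1, 40.0);
""")

# Reaching the category now takes two joins instead of one.
sales_by_category = conn.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category
""").fetchall()
print(sales_by_category)
```

The trade-off shown here is the usual one: less redundancy in the dimension tables in exchange for an extra join at query time.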

Page 30: Dataware House Introduction by QuontraSolutions

SNOWFLAKE SCHEMA

Page 31: Dataware House Introduction by QuontraSolutions

FACT CONSTELLATION SCHEMA
• A fact constellation schema allows dimension tables to be shared between fact tables.
• This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension.

For example, separate fact tables for daily, weekly and monthly reporting requirements.

Page 32: Dataware House Introduction by QuontraSolutions

FACT CONSTELLATION SCHEMA

In this example, the dimension tables for time, item and location are shared between both the sales and shipping fact tables.

Page 33: Dataware House Introduction by QuontraSolutions

OPERATIONS ON DATA WAREHOUSE
• Drill Down
• Roll Up
• Slice & Dice
• Pivoting

Page 34: Dataware House Introduction by QuontraSolutions

DRILL DOWN

[Diagram: data cube with Time, Region and Product axes; the Product axis drills down through the levels below]

Category, e.g. Home Appliances

Sub Category, e.g. Kitchen Appliances

Product, e.g. Toaster
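Drilling down along the product hierarchy amounts to grouping the same facts at a finer level. The sketch below assumes a tiny list of sales rows; the amounts, the extra products ("Kettle", "Iron"), the "Laundry Appliances" sub category and the helper `totals_at` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows: (category, sub_category, product, amount).
sales = [
    ("Home Appliances", "Kitchen Appliances", "Toaster", 40.0),
    ("Home Appliances", "Kitchen Appliances", "Kettle", 25.0),
    ("Home Appliances", "Laundry Appliances", "Iron", 30.0),
]

LEVELS = ("category", "sub_category", "product")


def totals_at(rows, level):
    """Sum amounts grouped by the hierarchy path down to `level`."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(float)
    for row in rows:
        totals[row[:depth]] += row[-1]
    return dict(totals)


# Each call drills one level deeper into the same data.
by_category = totals_at(sales, "category")
by_sub_category = totals_at(sales, "sub_category")
by_product = totals_at(sales, "product")
print(by_category)
print(by_sub_category)
print(by_product)
```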

Page 35: Dataware House Introduction by QuontraSolutions

ROLL UP

[Diagram: time hierarchy levels: Year, Quarter, Month; Fiscal Year, Fiscal Quarter, Fiscal Month, Fiscal Week; Day]
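Rolling up is the reverse of drilling down: daily figures are aggregated to coarser levels of the time hierarchy. The sketch below covers only the calendar levels (month, quarter, year); the daily figures and the function `roll_up` are hypothetical, and fiscal levels would need a fiscal calendar that the slides do not define.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales totals.
daily_sales = {
    date(2014, 1, 5): 10.0,
    date(2014, 1, 20): 15.0,
    date(2014, 2, 3): 5.0,
    date(2014, 7, 9): 20.0,
}


def roll_up(daily, level):
    """Aggregate daily totals to 'month', 'quarter' or 'year'."""
    def bucket(d):
        if level == "month":
            return (d.year, d.month)
        if level == "quarter":
            return (d.year, (d.month - 1) // 3 + 1)
        if level == "year":
            return (d.year,)
        raise ValueError(f"Unknown level: {level}")

    totals = defaultdict(float)
    for d, amount in daily.items():
        totals[bucket(d)] += amount
    return dict(totals)


by_month = roll_up(daily_sales, "month")
by_quarter = roll_up(daily_sales, "quarter")
by_year = roll_up(daily_sales, "year")
print(by_month, by_quarter, by_year)
```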

Page 36: Dataware House Introduction by QuontraSolutions

SLICE & DICE

[Diagram: data cube with Time, Region and Product axes; fixing Product = Toaster leaves a slice with Time and Region axes]
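Slice and dice can be sketched on a cube stored as a dictionary keyed by (time, region, product). The cell values and the helpers `slice_cube` and `dice_cube` are hypothetical; the slice reproduces the "Product = Toaster" case from the slide.

```python
# Hypothetical cube cells keyed by (time, region, product).
cube = {
    ("Q1", "USA", "Toaster"): 40.0,
    ("Q1", "UK", "Toaster"): 10.0,
    ("Q1", "USA", "Kettle"): 25.0,
    ("Q2", "USA", "Toaster"): 30.0,
}

AXES = ("time", "region", "product")


def slice_cube(cells, axis, value):
    """Slice: fix one dimension to a single value and drop that axis."""
    i = AXES.index(axis)
    return {k[:i] + k[i + 1:]: v for k, v in cells.items() if k[i] == value}


def dice_cube(cells, **allowed):
    """Dice: keep cells whose coordinates fall inside the allowed value sets."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[AXES.index(axis)] in values for axis, values in allowed.items())
    }


toaster_slice = slice_cube(cube, "product", "Toaster")
q1_usa_dice = dice_cube(cube, time={"Q1"}, region={"USA"})
print(toaster_slice)
print(q1_usa_dice)
```

The slice loses one dimension (only Time and Region remain), whereas the dice keeps all three axes and simply restricts the values on some of them.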

Page 37: Dataware House Introduction by QuontraSolutions

PIVOTING
• Also called rotation
• Rotate on an axis
• Interchange rows and columns

[Diagram: data cube with Time, Region and Product axes, shown again with the Time and Region axes interchanged]
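Interchanging rows and columns can be sketched on a two-dimensional view stored as nested dictionaries. The `pivot` function and the sample figures are hypothetical.

```python
def pivot(table):
    """Interchange rows and columns of a nested dict {row: {column: value}}."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


# Rows are regions and columns are time periods...
by_region = {"USA": {"Q1": 40.0, "Q2": 30.0}, "UK": {"Q1": 10.0}}

# ...and after pivoting, rows are time periods and columns are regions.
by_time = pivot(by_region)
print(by_time)
```

Pivoting twice returns the original layout, since the operation only changes the orientation of the view and not the underlying values.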

Page 38: Dataware House Introduction by QuontraSolutions

ADVANTAGES OF DATA WAREHOUSE
• One consistent data store for reporting, forecasting and analysis
• Easier and timely access to data
• Scalability
• Trend analysis and detection
• Drill down analysis

Page 39: Dataware House Introduction by QuontraSolutions

DISADVANTAGES OF DATA WAREHOUSE
• Preparation may be time consuming.
• High associated cost

Page 40: Dataware House Introduction by QuontraSolutions

CASE STUDY: WHY DATA WAREHOUSE
• G2G Courier Pvt. Ltd. is an established brand in the courier industry. It has its own network in the main cities and has subcontracted rural areas across the country to various partners.
• The President of the company wants to look deep into the financial health of the company and different performance aspects.

Page 41: Dataware House Introduction by QuontraSolutions

CHALLENGES
• Apart from G2G's own transaction system, each partner has its own system, which makes the data very heterogeneous.
• Granularity of data in the various systems also differs, e.g. minute accuracy versus day accuracy.
• To analyze metrics like Revenue and Timely Delivery across various geographical locations and partners, we need a unified system.

Page 42: Dataware House Introduction by QuontraSolutions

DATA WAREHOUSE MODEL

[Diagram: Sales Fact table linked to the Region, Product and Time dimensions, with Product linked to Product Category]

Page 43: Dataware House Introduction by QuontraSolutions

THANK YOU