(Dwh Fundamentals)

06/10/22 TCS Confidential 1


Transcript of (Dwh Fundamentals)

Page 1: (Dwh Fundamentals)

04/21/23 TCS Confidential 1

Page 2: (Dwh Fundamentals)

Course Roadmap

• Why we use Data warehousing

• Difference between Operational System and Data Warehouse

• Introduction to Data warehousing

• Emergence of Decision Support Systems

• Data Warehousing Approaches

• Data Warehouse Technical Architecture

• Data Modelling concepts

• Operational Data Store

• Schema Design of Data warehouse

• Data Acquisition

Page 3: (Dwh Fundamentals)

Why We Need Data Warehousing?

• Better business intelligence for end-users

• Reduction in time to locate, access, and analyze information

• Consolidation of disparate information sources

• To store large volumes of historical detail data from mission-critical applications

• Strategic advantage over competitors

• Faster time-to-market for products and services

• Replacement of older, less-responsive decision support systems

• Reduction in demand on IS to generate reports

Page 4: (Dwh Fundamentals)

OPERATIONAL DATABASE:

Online Transaction Processing

Designed for running the business. It is not suitable for analyzing the business from the perspective of business executives, because the data is volatile (it keeps on changing).

It does not maintain historical data.

It contains only current data.

If you insert a new value, it overwrites the existing one.

Eg: Acnthno 1072, Acnthsal 13,000 is updated to 20,000
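The overwrite behaviour described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names `oltp_account` and `dw_account_history`, the `load_date` column and the dates are assumptions made for the example, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational (OLTP) table: one row per account, holding only the current value.
conn.execute("CREATE TABLE oltp_account (acnthno INTEGER PRIMARY KEY, acnthsal INTEGER)")
conn.execute("INSERT INTO oltp_account VALUES (1072, 13000)")
# The new value overwrites the old one; 13,000 can no longer be recovered.
conn.execute("UPDATE oltp_account SET acnthsal = 20000 WHERE acnthno = 1072")

# Warehouse-style table (hypothetical layout): every value is kept with a load date.
conn.execute(
    "CREATE TABLE dw_account_history (acnthno INTEGER, acnthsal INTEGER, load_date TEXT)"
)
conn.executemany(
    "INSERT INTO dw_account_history VALUES (?, ?, ?)",
    [(1072, 13000, "2023-01-01"), (1072, 20000, "2023-02-01")],
)

oltp_rows = conn.execute("SELECT acnthsal FROM oltp_account").fetchall()
history_rows = conn.execute(
    "SELECT acnthsal FROM dw_account_history ORDER BY load_date"
).fetchall()
print(oltp_rows)     # only the current value survives
print(history_rows)  # both the old and the new value are kept
```

The operational table answers "what is the value now?", while the history table can also answer "what was it before?".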

Page 5: (Dwh Fundamentals)

OLTP Systems Vs Data Warehouse

users are different

data content is different,

data structures are different

hardware is different

Understanding The Differences Is The Key

Page 6: (Dwh Fundamentals)

OLTP Vs Data Warehouse

Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized Efficient Design for TP | Denormalized Design for Query Processing


Page 7: (Dwh Fundamentals)

OLTP Vs Warehouse

Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data


Page 8: (Dwh Fundamentals)

Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness


Page 9: (Dwh Fundamentals)

What is a Data Warehouse ?

A Data Warehouse is a collection of data that is:

• Subject-Oriented

• Integrated

• Time-Variant

• Non-volatile

W. H. Inmon - Regarded As Father Of Data Warehousing

Page 10: (Dwh Fundamentals)

Subject Oriented Analysis

Transactional Storage (Process Oriented): Entry, Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address

Data Warehouse Storage (Subject Oriented): Sales, Customers, Products

Page 11: (Dwh Fundamentals)

Integration of Data

Aspect | Transactional Storage (Appl. A / Appl. B / Appl. C) | Data Warehouse Storage
Encoding | M, F / 1, 0 / X, Y | M, F
Unit of Attributes | pipeline cm / pipeline inches / pipeline mcf | pipeline cm
Physical Attributes | balance dec(13,2) / balance PIC 9(9)V99 / balance float | balance dec(13, 2)
Naming Conventions | bal-on-hand / current_balance / balance | balance
Data Consistency | date (Julian) / date (yymmdd) / date (absolute) | date (Julian)
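As a rough sketch of what integration means in practice, the snippet below converts the encoding and unit examples into one warehouse standard. Which source code corresponds to M or F in applications B and C is an assumption made for illustration, as are the function and dictionary names. The mcf unit is left out because it is not a length and has no direct conversion to centimetres.

```python
# Hypothetical mapping of each application's gender encoding to the warehouse standard.
# Which source code means "M" or "F" in Appl. B and Appl. C is assumed for this example.
GENDER_MAP = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}

CM_PER_INCH = 2.54


def integrate(app, gender_code, pipeline_length, unit):
    """Return a record in the warehouse's standard form: gender as M/F, length in cm."""
    gender = GENDER_MAP[app][gender_code]
    if unit == "cm":
        length_cm = pipeline_length
    elif unit == "inches":
        length_cm = pipeline_length * CM_PER_INCH
    else:
        raise ValueError(f"no conversion defined for unit {unit!r}")
    return {"gender": gender, "pipeline_cm": round(length_cm, 2)}


record_a = integrate("A", "F", 100.0, "cm")
record_b = integrate("B", "1", 10.0, "inches")
print(record_a)
print(record_b)
```

Whatever the source application, the warehouse ends up with a single encoding and a single unit.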

Page 12: (Dwh Fundamentals)

Volatility of Data

Transactional Storage (Volatile): Record-by-Record Data Manipulation (Insert, Change, Delete, Access)

Data Warehouse Storage (Non-Volatile): Mass Load / Access of Data (Load, Access)

Page 13: (Dwh Fundamentals)

Time Variant Data Analysis

Transactional Storage: Current Data

Data Warehouse Storage: Historical Data

[Chart: Sales (in lakhs) by Region (East, West, North) for January, February and March, Year 97, 1st Qtr]

Page 14: (Dwh Fundamentals)


Decision Support Systems (DSS)

What is DSS?

Need for DSS

Comparison of OLTP & DSS

Transition from Data Processing to Information Processing

Page 15: (Dwh Fundamentals)

What is DSS?

Decision Support Systems (DSS) are interactive computer-based systems intended to help decision makers utilize data and models to identify and solve problems and make decisions. Data Warehouse is the foundation of DSS process. It is a Strategy and a Process for Staging Corporate Data.

Enable users to get a “Business View” of the data

Facilitate Data based Decision Making that would drive and improve the Business

Discover “Hidden Trends”

Page 16: (Dwh Fundamentals)

Why DSS?: How to answer these Business Queries?

What is the sales distribution region wise?

What is Defaulter’s Profile?

What are the slow movers in my product line?

How did my revenue improve in the past 5 years?

Which of my Sales Agents are doing better?

Who are my profitable customers?

Currency Risk, Interest Rate Risk, Liquidity Risk

Strategic Planning / Budgeting

Which channel costs me more and pays less?

Page 17: (Dwh Fundamentals)


OLTP v/s DSS Environment

OLTP Environment

• get data IN

• large volumes of simple transaction queries

• continuous data changes

• low processing time

• mode of processing

• transaction details

• data inconsistency

• mostly current data

DSS Environment

• get information OUT

• small number of diverse queries

• periodic updates only

• high processing time

• mode of discovery

• subject oriented - summaries

• data consistency

• historical data is relevant

Page 18: (Dwh Fundamentals)


OLTP v/s DSS Environment

OLTP Environment

• high concurrent usage

• highly normalized data structure

• static applications

• automates routines

DSS Environment

• low concurrent usage

• fewer tables, but more columns per table

• dynamic applications

• facilitates creativity

Page 19: (Dwh Fundamentals)

DW Implementation Approaches

• Top Down

• Bottom-up

• Combination of both

• Choices depend on:

– current infrastructure

– resources

– architecture

– ROI

– Implementation speed

Page 20: (Dwh Fundamentals)

Top Down Implementation

Page 21: (Dwh Fundamentals)

Bottom Up Implementation

Page 22: (Dwh Fundamentals)

DW Implementation Approaches

Top Down

• More planning and design initially

• Involve people from different work-groups, departments

• Data marts may be built later from Global DW

• Overall data model to be decided up-front

Bottom Up

• Can plan initially without waiting for global infrastructure

• built incrementally

• can be built before or in parallel with Global DW

• Less complexity in design

Page 23: (Dwh Fundamentals)

DW Implementation Approaches

Top Down

• Consistent data definition and enforcement of business rules across enterprise

• High cost, lengthy process, time consuming

• Works well when there is centralized IS department responsible for all H/W and resources

Bottom Up

• Data redundancy and inconsistency between data marts may occur

• Integration requires great planning

• Less cost of H/W and other resources

• Faster pay-back

Page 24: (Dwh Fundamentals)


DW Architectures

Page 25: (Dwh Fundamentals)

Data warehousing Architecture

[Diagram components: Sources (Source 1, Source 2, Source 3, ... Source n); Extract-Push/Pull; Staging Layer (Cleansing, Transformation & Loading); ODS; Data Warehouse (Detail Data, Summaries / Aggregations); Transformation, Summarization, Aggregation; Data Marts (Cubes, Conformed Dimensions); Reporting Layer (Canned Reports, Ad-hoc analysis); Metadata]

Page 26: (Dwh Fundamentals)

Benefits of DWH

To formulate effective business, marketing and sales strategies.

To precisely target promotional activity.

To discover and penetrate new markets.

To successfully compete in the marketplace from a position of informed strength.

To build predictive rather than retrospective models.

Page 27: (Dwh Fundamentals)

Data Modeling

Page 28: (Dwh Fundamentals)

Data Modeling

WHAT IS A DATA MODEL?

A data model is an abstraction of some aspect of the real world (system).

WHY A DATA MODEL?

• Helps to visualize the business

• A model is a means of communication.

• Models help elicit and document requirements.

• Models reduce the cost of change.

• Model is the essence of DW architecture based on which DW will be implemented

Page 29: (Dwh Fundamentals)

STEPS in DATA MODELING

Problem & scope definition

Requirement Gathering

Analysis

Logical Database Design

Deciding Database

Physical Database design

Schema Generation

Page 30: (Dwh Fundamentals)

Levels of modeling

• Conceptual modeling

– Describe data requirements from a business point of view without technical details

• Logical modeling

– Refine conceptual models

– Data structure oriented, platform independent

• Physical modeling

– Detailed specification of what is physically implemented using specific technology

Page 31: (Dwh Fundamentals)

Conceptual Model

• A conceptual model shows data through business eyes.

• All entities which have business meaning.

• Important relationships

• Few significant attributes in the entities.

• Few identifiers or candidate keys.

Page 32: (Dwh Fundamentals)

Logical Model

• Replaces many-to-many relationships with associative entities.

• Defines a full population of entity attributes.

• May use non-physical entities for domains and sub-types.

• Establishes entity identifiers.

• Has no specifics for any RDBMS or configuration.
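The first bullet, replacing a many-to-many relationship with an associative entity, can be sketched as follows. The `book`, `author` and `book_author` tables and their rows are hypothetical, and SQLite is used only as a convenient way to run the example; a logical model itself stays independent of any RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book   (book_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
-- Associative entity: one row per (book, author) pair replaces the M-M relationship.
CREATE TABLE book_author (
    book_id   INTEGER REFERENCES book(book_id),
    author_id INTEGER REFERENCES author(author_id),
    PRIMARY KEY (book_id, author_id)
);
INSERT INTO book VALUES (1, 'DWH Basics'), (2, 'Modeling Notes');
INSERT INTO author VALUES (10, 'Asha'), (11, 'Ravi');
INSERT INTO book_author VALUES (1, 10), (1, 11), (2, 10);
""")

# A book can have many authors...
authors_of_book_1 = [row[0] for row in conn.execute(
    "SELECT a.name FROM author a JOIN book_author ba ON ba.author_id = a.author_id "
    "WHERE ba.book_id = 1 ORDER BY a.name")]
# ...and an author can have many books, both through the same associative table.
books_by_asha = [row[0] for row in conn.execute(
    "SELECT b.description FROM book b JOIN book_author ba ON ba.book_id = b.book_id "
    "WHERE ba.author_id = 10 ORDER BY b.book_id")]
print(authors_of_book_1, books_by_asha)
```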

Page 33: (Dwh Fundamentals)

Physical Model

• A Physical data model may include

– Referential Integrity

– Indexes

– Views

– Alternate keys and other constraints

– Tablespaces and physical storage objects.
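A minimal sketch of several of these physical-model elements, using SQLite through Python's sqlite3 module. The `category` and `book` tables are hypothetical. SQLite has no tablespaces, so that item is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE category (
    category_id INTEGER PRIMARY KEY,
    name        TEXT UNIQUE                -- alternate key
);
CREATE TABLE book (
    book_id     INTEGER PRIMARY KEY,
    description TEXT NOT NULL,             -- constraint
    category_id INTEGER NOT NULL REFERENCES category(category_id)  -- referential integrity
);
CREATE INDEX idx_book_category ON book(category_id);
CREATE VIEW book_with_category AS
    SELECT b.book_id, b.description, c.name AS category
    FROM book b JOIN category c ON c.category_id = b.category_id;
INSERT INTO category VALUES (1, 'Printed Media');
INSERT INTO book VALUES (100, 'DWH Basics', 1);
""")

# A row pointing at a non-existent category is rejected by the foreign key.
try:
    conn.execute("INSERT INTO book VALUES (101, 'Orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

view_rows = conn.execute("SELECT * FROM book_with_category").fetchall()
print(orphan_rejected, view_rows)
```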

Page 34: (Dwh Fundamentals)

Modeling Techniques

• Entity-Relationship Modeling

– Traditional modeling technique

– Technique of choice for OLTP

– Suited for corporate data warehouse

• Dimensional Modeling

– Analyzing business measures in the specific business context

– Helps visualize very abstract business questions

– End users can easily understand and navigate the data structure

Page 35: (Dwh Fundamentals)

Entity-Relationship Modeling - Basic Concepts

• Relationship

– Relationship between entities - structural interaction and association

– described by a verb

– Cardinality

• 1-1

• 1-M

• M-M

– Example : Books belong to Printed Media

Page 36: (Dwh Fundamentals)

Entity-Relationship Modeling - Basic Concepts

• Attributes

– Characteristics and properties of entities

– Example : Book Id, Description, book category are attributes of entity “Book”

– Attribute name should be unique and self-explanatory

– Primary Key, Foreign Key, Constraints are defined on Attributes

Page 37: (Dwh Fundamentals)


Examples: ER Model

Page 38: (Dwh Fundamentals)

Limitations of E-R Modeling

• Poor Performance

• Tend to be very complex and difficult to navigate.

Page 39: (Dwh Fundamentals)


Dimensional Modeling

Page 40: (Dwh Fundamentals)

Dimensional Modeling

• Dimensional modeling uses three basic concepts : measures, facts, dimensions.

• Is powerful in representing the requirements of the business user in the context of database tables.

• Focuses on numeric data, such as values, counts, weights, balances and occurrences.

Page 41: (Dwh Fundamentals)

Dimensional modeling

• Must identify

– Business process to be supported

– Grain (level of detail)

– Dimensions

– Facts

Page 42: (Dwh Fundamentals)

What is a Fact?

• A fact is a collection of related data items, consisting of measures and context data.

• Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.

• Facts are measured, “continuously valued”, rapidly changing information. Can be calculated and/or derived.

Page 43: (Dwh Fundamentals)

Types of Facts

• Additive

– Able to add the facts along all the dimensions

– Discrete numerical measures eg. Retail sales in $

• Semi Additive

– Snapshot, taken at a point in time

– Measures of Intensity

– Not additive along time dimension eg. Account balance, Inventory balance

– Added and divided by the number of time periods to get a time-average

• Non Additive

– Numeric measures that cannot be added across any dimensions

– Intensity measure averaged across all dimensions eg. Room temperature

– Textual facts - AVOID THEM
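The difference between additive and semi-additive facts can be shown with a small calculation. The rows below are made-up snapshot data, and the column layout is an assumption chosen for the example.

```python
# Hypothetical daily snapshot rows: (day, account, sales_amount, balance)
rows = [
    ("day1", "acc1", 100, 500),
    ("day1", "acc2", 50, 300),
    ("day2", "acc1", 70, 520),
    ("day2", "acc2", 30, 300),
]

# Additive fact: sales can be summed across every dimension, including time.
total_sales = sum(r[2] for r in rows)

# Semi-additive fact: balances can be summed across accounts for a single day...
balance_day2 = sum(r[3] for r in rows if r[0] == "day2")

# ...but across time they are added and divided by the number of periods.
days = sorted({r[0] for r in rows})
daily_totals = [sum(r[3] for r in rows if r[0] == d) for d in days]
average_balance = sum(daily_totals) / len(days)

print(total_sales, balance_day2, average_balance)
```

Summing the balance column over both days would double count the money held in the accounts, which is why the time dimension needs an average instead of a sum.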

Page 44: (Dwh Fundamentals)

Dimensions

• A dimension is a collection of members or units of the same type of views.

• Dimensions determine the contextual background for the facts.

• Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how

Page 45: (Dwh Fundamentals)

Dimensional Hierarchy

Geography Dimension

World Level: World

Continent Level: America, Asia, Europe

Country Level: USA, Canada, Argentina

State Level: FL, GA, VA, CA, WA

City Level: Tampa, Miami, Orlando, Naples

Each member (Dimension Member / Business Entity) is linked to the level above it by a Parent Relation.

Attributes: Population, Tourist’s Place
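A hierarchy like this is often stored as a parent relation and used to roll measures up from one level to the next. The sketch below assumes a simple child-to-parent dictionary; the sales figures and the function names are invented for illustration.

```python
# Parent relation for the Geography dimension (child -> parent).
PARENT = {
    "Tampa": "FL", "Miami": "FL", "Orlando": "FL", "Naples": "FL",
    "FL": "USA", "GA": "USA", "VA": "USA", "CA": "USA", "WA": "USA",
    "USA": "America", "Canada": "America", "Argentina": "America",
    "America": "World", "Asia": "World", "Europe": "World",
}


def ancestors(member):
    """Walk the parent relation from a member up to the top of the hierarchy."""
    path = []
    while member in PARENT:
        member = PARENT[member]
        path.append(member)
    return path


def roll_up(city_values, level_member):
    """Sum city-level values for every city that sits at or under level_member."""
    return sum(
        value for city, value in city_values.items()
        if city == level_member or level_member in ancestors(city)
    )


# Hypothetical city-level measure (figures are made up).
sales = {"Tampa": 10, "Miami": 20, "Orlando": 5, "Naples": 1}

print(ancestors("Tampa"))
print(roll_up(sales, "FL"), roll_up(sales, "World"))
```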

Page 46: (Dwh Fundamentals)

Dimensions Types

• Conformed Dimension

• Junk Dimension

• Dirty Dimension

• Monster Dimension

• Slowly Changing Dimension

• Degenerate Dimension


Page 47: (Dwh Fundamentals)

Data marts (DM)

A data mart is a

• Powerful and natural extension of the data warehouse

• Extends information to the departmental environment from an enterprise environment

• Interprets and structures data to suit departments’ specific needs

Several names for DMs:

• departmental DSS DBs

• OLAP Data bases

• multi-dimensional DBs (MDDB)

• lightly summarized tables

Page 48: (Dwh Fundamentals)

Data marts

DM - Types

• Embedded data marts are marts that are stored within the central DW. They can be stored relationally as files or cubes.

• Dependent data marts are marts that are fed directly by the DW, sometimes supplemented with other feeds, such as external data.

• Independent data marts are marts that are fed directly by external sources and do not use the DW.

Page 49: (Dwh Fundamentals)

Operational Data Store (ODS)

An ODS

• pulls together, validates, cleanses and integrates data

• foundation for providing integrated view of enterprise data

• tactical decision support, day-to-day operations and management reporting

Characteristics

• Integrated

• Subject-oriented

• Volatile (including update)

• Current valued

Page 50: (Dwh Fundamentals)

ODS - Types

Class I – Immediate Load

Class II – Delayed Load

Class III – Overnight Load

Class IV – Data warehouse Load

Page 51: (Dwh Fundamentals)

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic

Page 52: (Dwh Fundamentals)

Star Schema Design

– Single fact table surrounded by denormalized dimension tables

– The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)

– Fact table contains transaction type information.

– Many star schemas in a data mart

– Easily understood by end users, more disk storage required
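The star layout described above can be sketched in SQLite as one fact table whose primary key is the composite of the dimension keys. The table names (`fact_sales`, `dim_product`, `dim_region`), columns and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimensions: category is stored directly on the product row.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_region  (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE fact_sales (
    product_id    INTEGER REFERENCES dim_product(product_id),
    region_id     INTEGER REFERENCES dim_region(region_id),
    quantity_sold INTEGER,
    PRIMARY KEY (product_id, region_id)    -- composite of the dimension keys
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Ink', 'Stationery');
INSERT INTO dim_region VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 4), (2, 1, 6);
""")

# A typical star query: join the fact table to one dimension and aggregate.
sales_by_region = conn.execute("""
    SELECT r.region_name, SUM(f.quantity_sold)
    FROM fact_sales f JOIN dim_region r ON r.region_id = f.region_id
    GROUP BY r.region_name ORDER BY r.region_name
""").fetchall()
print(sales_by_region)
```

Each dimension is one join away from the fact table, which is what makes the structure easy for end users to navigate.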

Page 53: (Dwh Fundamentals)

Example of Star Schema

Page 54: (Dwh Fundamentals)

Snowflake Schema

– Single fact table surrounded by normalized dimension tables

– Normalizes dimension table to save data storage space.

– When dimensions become very very large

– Less intuitive, slower performance due to joins

• May want to use both approaches, especially if supporting multiple end-user tools.
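For comparison, here is a sketch of a snowflaked product dimension, where the category is moved into its own table. The table names and rows are hypothetical. Note the extra join needed to report by category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category attribute is normalized out of the product dimension.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id    INTEGER REFERENCES dim_product(product_id),
    quantity_sold INTEGER
);
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Books');
INSERT INTO dim_product VALUES (1, 'Pen', 1), (2, 'Ink', 1), (3, 'Atlas', 2);
INSERT INTO fact_sales VALUES (1, 10), (2, 6), (3, 3);
""")

# Reporting by category now takes two joins instead of one.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.quantity_sold)
    FROM fact_sales f
    JOIN dim_product p  ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()
print(sales_by_category)
```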

Page 55: (Dwh Fundamentals)

Example of Snowflake schema

Page 56: (Dwh Fundamentals)

Snowflake - Disadvantages

• Normalization of dimension makes it difficult for user to understand

• Decreases the query performance because it involves more joins

• Dimension tables are normally smaller than fact tables - space may not be a major issue to warrant snowflaking

Page 57: (Dwh Fundamentals)

On-Line Analytical Processing (OLAP)

OLAP Cubes

OLAP is a category of applications/technology for collecting, managing, processing and presenting multidimensional data for analysis and management purposes

Page 58: (Dwh Fundamentals)

OLAP Cubes

OLAP Features

• Subject oriented approach to Decision Support

• Calculations applied across dimensions, through hierarchies and/or across members

• Trend analysis over sequential time periods, What If scenarios.

• Slicing/Dicing subsets for on-screen viewing

• Drill-down/up along the hierarchy

• Reach-through to underlying detail data

• Rotation to new dimensional comparisons in the viewing area
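Slicing, drill-up and rotation can be illustrated on a tiny in-memory cube. This is a plain-Python sketch with made-up cells and hypothetical helper names, not how an OLAP server is implemented.

```python
from collections import defaultdict

# Hypothetical cells: (region, product, month) -> sales
cube = {
    ("East", "Pen", "Jan"): 10, ("East", "Pen", "Feb"): 12,
    ("East", "Ink", "Jan"): 4,  ("West", "Pen", "Jan"): 7,
    ("West", "Ink", "Feb"): 3,
}
DIMS = ("region", "product", "month")


def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single member."""
    i = DIMS.index(dim)
    return {key: v for key, v in cells.items() if key[i] == value}


def roll_up(cells, keep):
    """Drill-up: aggregate away every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, v in cells.items():
        totals[tuple(key[i] for i in idx)] += v
    return dict(totals)


east = slice_cube(cube, "region", "East")
by_region = roll_up(cube, ["region"])
# Rotation: the same data viewed with month leading and region following.
by_month_region = roll_up(cube, ["month", "region"])
print(by_region)
print(by_month_region)
```

Drill-down is the reverse of `roll_up`: asking for more dimensions in `keep` returns the finer-grained cells again.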

Page 59: (Dwh Fundamentals)

OLAP Cubes

OLAP Categories

• Multi-dimensional OLAP (MOLAP)

• Relational OLAP (ROLAP)

• Hybrid OLAP (HOLAP)

Page 60: (Dwh Fundamentals)

OLAP Cubes

MOLAP

• Use pre-calculated data set – CUBE

• Cube contains all possible answers to given range of questions

Features:

• Very fast response

• Ability to quickly write data into the cube

Downsides:

• Limited Scalability

• Inability to contain detailed data

• Load time
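The idea of a pre-calculated cube can be sketched by computing every aggregate up front, so that a query becomes a simple lookup. The detail rows, the `"*"` marker for "all members" and the `build_cube` helper are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("region", "product")
# Hypothetical detail rows: (region, product, sales)
detail = [("East", "Pen", 10), ("East", "Ink", 4), ("West", "Pen", 7)]


def build_cube(rows):
    """Pre-calculate totals for every combination of dimensions ('*' means all members)."""
    cube = defaultdict(int)
    for size in range(len(DIMS) + 1):
        for kept in combinations(range(len(DIMS)), size):
            for row in rows:
                key = tuple(row[i] if i in kept else "*" for i in range(len(DIMS)))
                cube[key] += row[-1]
    return dict(cube)


cube = build_cube(detail)

# Answering a question is now a dictionary lookup rather than a scan of the detail rows.
print(cube[("East", "*")], cube[("*", "Pen")], cube[("*", "*")])
```

The number of stored cells grows quickly with the number of dimensions and members, which is one way to see the scalability and load-time downsides listed on the slide.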

Page 61: (Dwh Fundamentals)


OLAP Cubes

ROLAP

• Do not use pre-calculated CUBE

• Intercept query & pose it to the Relational DB

Features:

• Ask any question (not limited to the contents of the cube)

• Ability to drill down

Downsides:

• Slow Response

• Some limitations on scalability

Page 62: (Dwh Fundamentals)


OLAP Cubes

HOLAP

• Combines MOLAP & ROLAP

• Utilizes both pre-calculated cubes & relational data sources

Features:

• For summary type info – cube, (Faster response)

• Ability to drill down – relational data sources (drill through detail to underlying data)

• Source of data transparent to end-user

Page 63: (Dwh Fundamentals)

Data Acquisition

• Data Extraction

• Data Transformation

• Data Loading
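The three steps can be sketched as a tiny extract, transform and load pipeline. The inline CSV source, the `sales` target table and the cleaning rules are all hypothetical; a real acquisition process would read from operational systems and apply business-specific rules.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from an operational system.
SOURCE_CSV = "order_id,region,amount\n1, east ,100.50\n2,WEST,20\n3,East,\n"


def extract(text):
    """Extraction: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: standardize region names, convert types, drop rows with no amount."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue
        clean.append(
            (int(row["order_id"]), row["region"].strip().title(), float(row["amount"]))
        )
    return clean


def load(conn, rows):
    """Loading: bulk-insert the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM sales ORDER BY order_id").fetchall()
print(loaded)
```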
