Optimizing the design of your data warehouse 09222010

Optimizing the Design of Your Data Warehouse

Michael Wacey, CSC


Introduction

• Who am I?

– Michael Wacey

– Partner with CSC since 1986

– Architected many large-scale data warehouses

• What are we going to discuss today?

– Motivation

– Tools

– Approach


Motivation

• Data Here, Data There, Data Everywhere

• Solutions

– Architecture – the SAP approach – very hard to sustain, and SAP cannot solve all problems

– Data Integration – requires architecture on the boundaries and infrastructure, lots of infrastructure

– Data Warehouse – Periodically collect the data and bring it all together for one or more purposes – the best bet for the foreseeable future

• Every solution is trying to answer the same question: How do we get this data to fit together?


Motivation

• Making data fit together is difficult

– Local countries report numbers in their local (possibly multiple) currencies and there is no agreed-upon set of conversion rates

– The Trust department would rather not share its data with finance

– The current policy administration system has serious data quality issues; a new system is being built and is scheduled to go online in June 2011, but that date may be in jeopardy

• We need a way to collect and analyze all this knowledge about the data


Motivation

• A high level view:

• May help with scoping

• Each line could represent many files or feeds

• Each box could represent many applications

[Diagram boxes: Accounting, Sales, Marketing; Data Warehouse; Customer Profitability, Sales Forecasts]


Motivation

• A detailed view:

• Too much detail to plan and analyze and understand

• As usual, we have a forest and trees problem

BEGIN
  SELECT ml.sequence, al.sequence, m.msgkey
    INTO mseq, aseq, mkey
    FROM mqseries.levelcodes ml, mqseries.messages m,
         mqseries.appctl a, mqseries.levelcodes al
   WHERE m.msglevel = ml.levelcodekey
     AND m.msgcode = inmsgcode
     AND a.msglevel = al.levelcodekey
     AND a.appctlkey = 1;

  IF SQL%ROWCOUNT = 1 THEN
    IF aseq <= mseq THEN
      SELECT statuscodekey INTO sck
        FROM mqseries.statuscodes
       WHERE statuscode = 'n';

      INSERT INTO mqseries.msglog
        (msglogkey, msgkey, msgdata, msgstatus, msgsqlcode, msgsqlerrm)
      VALUES
        (mqseries.msgseq.nextval, mkey, inmsgdata, sck, inmsgsqlcode,
         SUBSTR(inmsgsqlerrm, 1, 4000));

      IF incommit = true THEN
        COMMIT;
      END IF;
    END IF;
  ELSE


Motivation

• What to do?

– PowerPoint?

– Visio?

– ERwin?

• They all help, but none gives us the right picture

• We need a way to see the problem and the solution at the right level of detail


Motivation

• What is a data warehouse?

• It includes:

– Sources of data

– Processing of data

– Storage of data – probably multiple times in different structures

– Analytics

• Except for Analytics, each of these is either a static view of data or dynamic processing of data

• ERwin DM is great for the static views of data; we just need a way to capture the dynamic processing


Motivation

• I have used many techniques to capture the dynamic processing

• Spreadsheets to capture data mapping (who hasn't?)

• Process flow diagrams in PowerPoint and Visio

• UML Diagrams in the IBM and Sparx tools

• They all worked to an extent but were hard to maintain and did not provide a leveling mechanism


Motivation

• Many years ago, I had used Data Flow Diagrams to describe systems under development

• They provided insight into the flow of data and leveling of those processes

• So, I tried that – first in Visio and later in ERwin PM

• The rest of this talk is an approach to using ERwin DM and ERwin PM together to model a Data Warehouse

• I have used this approach for the past five years and have found it very successful

• It provides information to both the user community and developers


The Tools

• ERwin Data Modeler

– Used to model databases

– Supports both Logical and Physical models

– If needed, I create conceptual models in PowerPoint or Visio

– Each model has to represent one type of database

– But data warehouses use many: flat files, Oracle, SQL Server, cubes, etc.

– I use a UDP (user-defined property) to record the actual type of each Entity/Table

– For example, a table that represents a flat file would have that setting in a UDP
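The UDP idea can be sketched outside of ERwin as plain data. This is a minimal Python illustration, not ERwin's API: the `Entity` class, the `store_type` property name, and the sample table names are all hypothetical. It shows how tagging each entity with its actual storage type lets one model mix flat files, relational tables, and cubes and still be reported by type.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One entity/table in a model, with free-form user-defined properties."""
    name: str
    udps: dict = field(default_factory=dict)


def group_by_store_type(entities, default="Unspecified"):
    """Group entity names by their (hypothetical) 'store_type' UDP."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity.udps.get("store_type", default)].append(entity.name)
    return dict(groups)


# Hypothetical sample entities; the names are illustrative only.
entities = [
    Entity("COMMERCIAL_LOAN_FEED", {"store_type": "Flat File"}),
    Entity("CUSTOMER_STAGING", {"store_type": "Oracle"}),
    Entity("PROFITABILITY_CUBE", {"store_type": "Cube"}),
    Entity("GL_EXTRACT", {"store_type": "Flat File"}),
]

by_type = group_by_store_type(entities)
for store_type, names in by_type.items():
    print(f"{store_type}: {', '.join(names)}")
```

Keeping the type in a property rather than in the table name means the physical names stay clean while reports can still filter on storage type.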


The Tools

• ERwin Process Modeler (ERwin PM)

– Previously called BPwin

– Supports several diagram types

– I have only found the Data Flow diagrams useful for the design of a data warehouse

– The other diagrams could be used in analysis to understand how the data warehouse will be used


The Tools

• ERwin DM and ERwin PM

• There is a connection between the tools

• I have not used it extensively


The Tools

• Other Tools

– These are minor but needed

– PDF Viewer

– Microsoft Excel

– Microsoft Word


The Approach

• So, we have two tools to design a data warehouse

• ERwin DM will be used to design and document static data stores

• ERwin PM will be used to design the processing

• Let's take a look at an example and then discuss how it works


The Approach

• Start in ERwin PM

• Create a new model that is a data flow model

• First we will create a context model

• This will provide a view of the sources and uses of data

• On the left side, the sources of data are listed – using the external entity symbol

– Sources can be Systems, Databases, People, etc.

• On the right hand side, the uses of data are listed – using the external entity symbol

– Uses can be reports, cubes, analytics, data feeds, etc.
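The left/right layout of a context model can be written down as a small data structure before any drawing is done. The Python sketch below is only an illustration: the `ExternalEntity` class and `layout` function are hypothetical, the entity names are taken from the Customer Profitability example in this talk, and the source/use role assigned to each one is an assumption inferred from its name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalEntity:
    ident: str  # identifier shown on the diagram, e.g. "E5"
    name: str
    role: str   # "source" is drawn on the left, "use" on the right


def layout(externals):
    """Split external entities into (left, right) columns of a context diagram."""
    allowed = {"source", "use"}
    bad = [e.ident for e in externals if e.role not in allowed]
    if bad:
        raise ValueError(f"unknown role for: {', '.join(bad)}")
    left = [e for e in externals if e.role == "source"]
    right = [e for e in externals if e.role == "use"]
    return left, right


# Roles are assumed from the names: loans and ledgers feed the warehouse,
# reports and analytics consume it.
externals = [
    ExternalEntity("E5", "Commercial Loans", "source"),
    ExternalEntity("E9", "General Ledger", "source"),
    ExternalEntity("E11", "Exception Report", "use"),
    ExternalEntity("E13", "Commercial Customer Analytics", "use"),
]

left, right = layout(externals)
for entity in left:
    print(f"{entity.ident} {entity.name} -> Customer Profitability")
for entity in right:
    print(f"Customer Profitability -> {entity.ident} {entity.name}")
```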


The Approach

[Context diagram – Node: A-0, Title: Customer Profitability]

• Activity: A0 Customer Profitability

• External entities: E1 Allocation Factors; E2 Demand Deposit Accounts; E3 Consumer Loans; E4 Mortgages; E5 Commercial Loans; E6 Treasury; E7 Trust Accounts; E8 Organization; E9 General Ledger; E11 Exception Report; E12 Balancing Report; E13 Commercial Customer Analytics; E14 Retail Customer Analytics

• Data flows: Allocation Factors; Consumer Loan Data; Exception Report Data; Commercial Customer Data; Commercial Loan Data; Retail Customer Data; General Ledger Data; Demand Deposit Data; Mortgage Data; Balancing Report Data; Treasury Data; Trust Data; Organization Data


The Approach

• The Context Diagram is a good start

• It sets the scope

• But it does not provide any details about what is going to be done

• Those come in the next diagram, which shows the details of the central process


The Approach

[Level one diagram – Node: A0, Title: Customer Profitability]

• Activities: A1 Sourcing; A2 Customer Profitability Calculation; A3 Exception Output; A4 Commercial BI; A5 Retail BI; A6 Balance Input and Output

• Data stores: D1 Customer Profitability Staging; D2 Customer Profitability Data Warehouse; D3 Exceptions; D4 Balancing Values

• Data flows: Treasury Data; Trust Data; Dimension Data for Calculation; Exception Report Data; Balancing Report Data; Validated Dimension Data; Mortgage Data; Demand Deposit Data; Organization Data; Commercial BI Data; Consumer Loan Data; Retail BI Data; Input Balance Values; Calculation Balance Values; Validated Fact Data; Retail Balancing Data; Fact Data for Calculation; Source Exceptions; Calculation Exceptions; Commercial Loan Data; Commercial Balancing Data; General Ledger Data; Allocation Factors; Commercial Customer Data; Retail Customer Data; Customer Profitability Data


The Approach

• This level one diagram shows all the key components of the solution

• There is no magic formula for what should be included here

• There need to be at least some sourcing, processing, and display/output activities

• In this case, there is one sourcing activity, one calculation activity, and four output activities

• Each can be broken down into more detail

• Let's look at the Commercial BI activity


The Approach

[Decomposition diagram – Node: A4, Title: Commercial BI]

• Activities: A4.1 Load Commercial Cube; A4.3 Cube Provider; A4.6 Commercial Profitability Reporting

• Data store: D16 Commercial Profitability Cube

• Data flows: Commercial Balancing Data; Commercial BI Data; Commercial Customer Data; Data for Cube In; Data for Cube Out; Data for Reporting


The Approach

• This decomposition can continue until you are comfortable

• I try to get to the point where one developer can implement it in one module

• At this point, we will have a series of diagrams that show the flow of data through the system

• The diagrams contain:

– Activities

– Data Stores (note that a single data store can be used on multiple diagrams)

– Data Flows

– External Entities
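The four element types can be captured in a few lines of Python and checked mechanically. The sketch below is an illustration under stated assumptions: the element names come from the Commercial BI example, but the endpoints given to each data flow are guessed for demonstration, and the rule that every flow must touch an activity is a common data flow diagram convention rather than something this talk states.

```python
KINDS = {"A": "activity", "D": "data store", "E": "external entity"}


def kind_of(ident):
    """Infer the element kind from the identifier prefix (A, D, or E)."""
    return KINDS[ident[0]]


def flow_problems(nodes, flows):
    """Return a list of human-readable problems found in the data flows."""
    problems = []
    for name, source, target in flows:
        missing = [end for end in (source, target) if end not in nodes]
        for end in missing:
            problems.append(f"{name}: unknown element {end}")
        if not missing and "activity" not in (kind_of(source), kind_of(target)):
            problems.append(f"{name}: no activity at either end")
    return problems


nodes = {
    "A4.1": "Load Commercial Cube",
    "A4.3": "Cube Provider",
    "A4.6": "Commercial Profitability Reporting",
    "D16": "Commercial Profitability Cube",
    "E13": "Commercial Customer Analytics",
}

# (flow name, from, to); the endpoints are assumed for illustration.
flows = [
    ("Data for Cube In", "A4.1", "D16"),
    ("Data for Cube Out", "D16", "A4.3"),
    ("Data for Reporting", "A4.3", "A4.6"),
    ("Commercial Customer Data", "A4.6", "E13"),
]

problems = flow_problems(nodes, flows)
print(problems or "all flows connect known elements through an activity")
```

A check like this is cheap to run whenever a diagram changes and catches flows that point at an element that was renamed or removed.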


The Approach

• Each of the diagram elements, except for the Data Flows, can be further modeled in ERwin DM

• This gives the developer a further level of detail of what is intended

• It also provides the physical names that will be used

• To maintain the mapping between the models, I use a naming convention for ERwin DM Subject Areas

• The convention is:

– A01.01.01 – {Activity Name}

– D01 – {Data Store Name}

– E01 – {External Name}
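The convention is mechanical enough to generate. The following Python sketch assumes, from the examples in this talk, that each number is zero-padded to two digits and that only activity identifiers are hierarchical; the function name `subject_area_name` is hypothetical.

```python
import re

# Prefix letter (A, D, or E) followed by dot-separated numbers.
_IDENT = re.compile(r"^([ADE])(\d+(?:\.\d+)*)$")


def subject_area_name(ident, name):
    """Turn a process-model identifier and name into a subject area name.

    Example: ("A2.7", "Combine Retail Loans") -> "A02.07 – Combine Retail Loans"
    """
    match = _IDENT.match(ident)
    if match is None:
        raise ValueError(f"unrecognized identifier: {ident!r}")
    prefix, numbers = match.groups()
    if prefix != "A" and "." in numbers:
        # Assumption: data stores and external entities have flat numbers.
        raise ValueError(f"only activities are hierarchical: {ident!r}")
    padded = ".".join(f"{int(part):02d}" for part in numbers.split("."))
    return f"{prefix}{padded} \u2013 {name}"  # \u2013 is the en dash used above


print(subject_area_name("D1", "Customer Profitability Staging"))
print(subject_area_name("E5", "Commercial Loans"))
print(subject_area_name("A2.7", "Combine Retail Loans"))
```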


The Approach

• Some examples for External Entities and Data Stores from the model above:

– D01 – Customer Profitability Staging

– E05 – Commercial Loans

• Each of these subject areas should have the portion of the data model relevant to it

• Note that these are just typical ER models

• They can represent more than just tables – for example, an external entity could be a flat file

• Below is an example – the E05 – Commercial Loans external entity

The Approach

[Figure: ERwin DM diagram of the E05 – Commercial Loans subject area]


The Approach

• Next we need to look at the activities

• Because activities have a hierarchical numbering system, we need one for the subject areas

• We simply start with A and separate each level with a period

• For example, Combine Retail Loans is Activity 7 inside Activity 2, so it is called A2.7 Combine Retail Loans in the model

• The associated subject area will be:

– A02.07 – Combine Retail Loans

• The data model will show the input and output entities and how they are processed
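One practical side effect of the padded, period-separated codes is worth a quick sketch. This is an observation, not something the talk claims as its motivation: zero-padded codes sort in hierarchy order as plain strings, and a parent activity can be found by dropping the last level. The `A2.10` identifier below is hypothetical and is included only to show the sorting difference.

```python
def parent_code(code):
    """'A02.07' -> 'A02'; a top-level code such as 'A02' has no parent."""
    head, sep, _ = code.rpartition(".")
    return head if sep else None


def depth(code):
    """Number of levels in a code: 'A02' is 1, 'A02.07' is 2."""
    return code.count(".") + 1


# A2.10 is a hypothetical activity, used only to show the ordering issue.
raw = ["A2.10", "A2.7", "A2"]
padded = ["A02.10", "A02.07", "A02"]

print(sorted(raw))     # unpadded: A2.10 sorts before A2.7
print(sorted(padded))  # padded: codes sort in hierarchy order
print(parent_code("A02.07"), depth("A02.07"))
```

With padded codes, an alphabetical list of subject areas reads in the same order as the activity hierarchy.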

The Approach

[Figure: ERwin DM diagram of the A02.07 – Combine Retail Loans subject area]


The Approach

• With the diagrams from ERwin DM and ERwin PM, plus the narrative in ERwin PM, the developer has all the information needed to implement a portion of the solution

• The diagrams and narratives are also accessible to technical users

• Twice, I have had the user community write papers to explain the details of specific areas of the ERwin PM model


The Approach

• Notes

– Using ERwin DM we can quickly build detailed reports with diagrams and descriptions

– The developers use these reports to track what they have to do

– The Project Managers use these reports as an inventory for project planning

– The ERwin PM reports are like a roadmap that ties everything together

– It takes some effort to keep everything synchronized but it is well worth it
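Part of the synchronization effort can be reduced with a simple cross-check between the two models. The Python sketch below assumes the element list and the subject area names have been typed in or exported by some means not covered here; it does not call either tool. The function names and the sample mismatches are hypothetical, and data flows are left out because they are not modeled as subject areas.

```python
def normalize(code):
    """'A02.07' -> 'A2.7', so subject area codes match process-model identifiers."""
    prefix, numbers = code[0], code[1:]
    return prefix + ".".join(str(int(part)) for part in numbers.split("."))


def sync_report(pm_elements, subject_areas):
    """Compare process-model elements with data-model subject area names."""
    dm = {}
    for area in subject_areas:
        code, _, name = area.partition(" \u2013 ")  # en dash separator
        dm[normalize(code)] = name
    shared = set(pm_elements) & set(dm)
    return {
        "missing": sorted(set(pm_elements) - set(dm)),   # no subject area yet
        "orphaned": sorted(set(dm) - set(pm_elements)),  # no process element
        "renamed": sorted(k for k in shared if pm_elements[k] != dm[k]),
    }


# Hypothetical snapshot of both models, with deliberate mismatches.
pm_elements = {
    "D1": "Customer Profitability Staging",
    "E5": "Commercial Loans",
    "A2.7": "Combine Retail Loans",
    "A4.1": "Load Commercial Cube",
}
subject_areas = [
    "D01 \u2013 Customer Profitability Staging",
    "E05 \u2013 Commercial Loan",
    "A02.07 \u2013 Combine Retail Loans",
    "E06 \u2013 Treasury",
]

report = sync_report(pm_elements, subject_areas)
print(report)
```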


The Approach

• In Summary

– A data warehouse is very much a store of data and a flow of data

– ERwin DM and ERwin PM can model both of these areas

– Use ERwin PM to decompose the solution

• There is no right or best decomposition

• Try it until it works

– Use ERwin DM to model the internals of External Entities, Data Stores, and Activities

• Tie the two models together through an appropriate naming convention

• Do not worry if the entities model more than tables

– The goal is to communicate with users and developers


Questions?