Datawarehouse Tutorials

Case study Insurance domain


Page 1: Datawarehouse Tutorials

Case study

Insurance domain

Page 2: Datawarehouse Tutorials

Agenda

• Introduction to Insurance domain

• What is Health insurance

• Identify Dimensions

• ETL

• Reporting

• Identify Facts

• Historical data maintenance

Page 3: Datawarehouse Tutorials


Introduction to Insurance

• Insurance provides protection against financial losses resulting from a variety of threats. By purchasing insurance policies, individuals and businesses can receive reimbursement for losses due to car accidents, theft of property, fire and storm damage, medical expenses, and loss of income due to disability or death.

• Types of Insurance: Auto Insurance, Home Insurance, Health Insurance, Life Insurance, etc.

Page 4: Datawarehouse Tutorials


Health Insurance

• Health insurance (popularly known as Medical Insurance or Mediclaim) protects one against any financial constraints arising on account of a medical emergency. It sometimes includes disability and long-term medical needs.

• In Mediclaim, you pay a premium and in return the insurer commits to pay a predetermined sum of money to meet the claims.

• Health insurance is available to individuals and groups.

Page 6: Datawarehouse Tutorials

A customer wants to know…

• Which are our lowest/highest margin customers?

• Who are my customers and what products are they buying?

• Which customers are most likely to go to the competition?

• What impact will new products/services have on revenue and margins?

• What product promotions have the biggest impact on revenue?

• What is the most effective distribution channel?

Page 7: Datawarehouse Tutorials

Data, data everywhere yet...

• I can’t find the data I need
  - data is scattered over the network
  - many versions, subtle differences

• I can’t get the data I need
  - need an expert to get the data

• I can’t understand the data I found
  - available data poorly documented

• I can’t use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another

Page 8: Datawarehouse Tutorials

What the knowledge users want is...

• Data should be integrated across the enterprise

• Summary data has a real value to the organization

• Historical data holds the key to understanding data over time

• What-if capabilities are required

Page 9: Datawarehouse Tutorials

But the Present Systems are…

• Run mission critical applications

• Need to work with stringent performance requirements for routine tasks

• Used to run a business

• Online Transaction Processing (OLTP) systems a.k.a. Operational Systems

Page 10: Datawarehouse Tutorials

Operational Systems…

• Run the business in real time

• Optimized to handle large numbers of simple read/write transactions

• Based on up-to-the-second data

• Optimized for fast response to predefined transactions

• Used by people who deal with customers, products - clerks, salespeople etc.

Page 11: Datawarehouse Tutorials

…What they need is a ‘Data Warehouse’

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context

[Barry Devlin]

Page 12: Datawarehouse Tutorials


Evolution

• 60’s: Batch reports
  - hard to find and analyze information
  - inflexible and expensive, reprogram every new request

• 70’s: Terminal-based DSS and EIS (executive information systems)
  - still inflexible, not integrated with desktop tools

• 80’s: Desktop data access and analysis tools
  - query tools, spreadsheets, GUIs
  - easier to use, but only access operational databases

• 90’s: Data warehousing with integrated OLAP engines and tools

Page 13: Datawarehouse Tutorials

Data Warehouse

A data warehouse is a

• subject-oriented

• integrated

• time-variant

• non-volatile

collection of data that is used primarily in organizational decision-making

- Bill Inmon, Building the Data Warehouse 1996

Page 14: Datawarehouse Tutorials

Application-Orientation vs. Subject-Orientation

• Application-orientation (operational database): Loans, Credit Card, Trust, Savings

• Subject-orientation (data warehouse): Customer, Vendor, Product, Activity

Page 15: Datawarehouse Tutorials

OLTP vs. Data Warehouse

OLTP | Warehouse (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated data | Integrated data
Repetitive access | Ad-hoc access
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/update access | Mostly read (batch update)

Page 16: Datawarehouse Tutorials

OLTP vs. Data Warehouse

OLTP | Warehouse (DSS)
Clerical user | Knowledge user (manager)
No data redundancy | Redundancy present
DB size: 100 MB to 100 GB | DB size: 100 GB to a few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
100s to 1000s of users | 10s to 100s of users

Page 17: Datawarehouse Tutorials

To summarize ...

OLTP Systems are used to “run” a business

The Data Warehouse helps to “optimize” the business

Page 18: Datawarehouse Tutorials

Data Warehouse Architecture


Page 19: Datawarehouse Tutorials

Data Warehouse Architecture

[Figure: data warehouse architecture. Labels: Relational Databases, Legacy Data, Purchased Data, ERP Systems; Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query]

Page 20: Datawarehouse Tutorials

Architecture

• Operational database layer: the source data for the data warehouse. An organization's Enterprise Resource Planning systems fall into this layer.

• Data access layer: the interface between the operational and informational access layers. Tools to extract, transform, and load data into the warehouse fall into this layer.

• Metadata layer: the data directory. This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

• Informational access layer: the data accessed for reporting and analysis. Business intelligence tools fall into this layer.
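
As a rough illustration of what the data access layer does, the sketch below extracts a few rows, cleanses them, and loads them into an in-memory SQLite database standing in for the warehouse. The source rows, the `policy_stage` table, and the column names are all hypothetical; a real ETL tool would do far more.

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational (OLTP) source.
source_rows = [
    {"policy_no": "P-001", "holder": "  lynnette groves ", "premium": "1200.50"},
    {"policy_no": "P-002", "holder": "JOHN DOE", "premium": "980"},
]

def transform(row):
    """Cleanse one row: trim and title-case the name, cast premium to float."""
    return (row["policy_no"], row["holder"].strip().title(), float(row["premium"]))

def load(conn, rows):
    """Load cleansed rows into a warehouse table (illustrative name)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS policy_stage "
        "(policy_no TEXT PRIMARY KEY, holder TEXT, premium REAL)"
    )
    conn.executemany("INSERT INTO policy_stage VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, [transform(r) for r in source_rows])

# The informational access layer would query the loaded data.
loaded = warehouse.execute(
    "SELECT policy_no, holder, premium FROM policy_stage ORDER BY policy_no"
).fetchall()
print(loaded)
```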

Page 21: Datawarehouse Tutorials

Dimensional Modeling

Database organization
• must look like business
• must be recognizable by business user
• approachable by business user

Schema Types
• Star Schema
• Snowflake schema

Page 22: Datawarehouse Tutorials

Dimension Tables

• Define business in terms already familiar to users

• Wide rows with lots of descriptive text

• Small tables (generally, a few thousand rows)

• Joined to fact table by a foreign key

• Typical dimensions include time periods, geographic regions (markets, cities), products, customers, salespersons, etc.

A dimension is a set of similar members upon which the user wants to base an analysis.

Page 23: Datawarehouse Tutorials

Dimensions

• Definition: Descriptions of the business. The “which, who, how, where, or when” that describes or explains the fact.

• Characteristics:
  - Discretely valued descriptions of variables which are more or less constant.
  - Enable “slicing and dicing” the facts by different variables.

• Examples: Time, Customer, Product

[Figure: Sales Fact (Revenue, Qty, Cost, Gross margin) surrounded by the Customer, Time, Branch, and Product dimensions]

Page 24: Datawarehouse Tutorials

Fact Table

• Central table

• Mostly raw numeric items

• Narrow rows, a few columns at most

• Large number of rows (millions to billions)

• Accessed via dimensions

Facts are measurements that users will apply arithmetic calculations to.

• What is the average of…

• What is the total of…

Page 25: Datawarehouse Tutorials

Dimensional Model Schemas

• Dimensional data models mainly fall into two types:
  - Star Schema
  - Snowflake Schema

• Several factors influence schema choice:
  - Presentation restrictions
  - Inconsistency of data
  - Complex queries and analysis

Page 26: Datawarehouse Tutorials

Star Schema

• Star schema represents a compromise between the fully normalized model and the denormalized model.

• Descriptive ‘dimension’ information is maintained in a set of denormalized dimension tables.

STAR SCHEMA

A database design that stores a central fact table surrounded by multiple dimension tables.

Page 27: Datawarehouse Tutorials

Star Schema

• A single fact table and for each dimension one dimension table

• Does not capture hierarchies directly

[Figure: star schema. A central fact table (date, custno, prodno, city, ...) joined to the time, prod, cust, and city dimension tables]
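
The star layout above can be sketched with SQLite from the Python standard library. The fact columns follow the figure labels (date, custno, prodno, city); the `_dim` table names, the `amount` measure, the `region` attribute, and all sample rows are assumptions made only for illustration.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE prod_dim (prodno INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cust_dim (custno INTEGER PRIMARY KEY, name TEXT);
-- Region is stored inline: the star schema keeps the hierarchy denormalized.
CREATE TABLE city_dim (city TEXT PRIMARY KEY, region TEXT);
CREATE TABLE fact (
    date   TEXT    REFERENCES time_dim(date),
    custno INTEGER REFERENCES cust_dim(custno),
    prodno INTEGER REFERENCES prod_dim(prodno),
    city   TEXT    REFERENCES city_dim(city),
    amount REAL
);
INSERT INTO time_dim VALUES ('2024-01-15', 'January', 2024),
                            ('2024-02-10', 'February', 2024);
INSERT INTO prod_dim VALUES (10, 'Health'), (11, 'Auto');
INSERT INTO cust_dim VALUES (1, 'Customer A'), (2, 'Customer B'), (3, 'Customer C');
INSERT INTO city_dim VALUES ('Delhi', 'North'), ('Pune', 'West'), ('Mumbai', 'West');
INSERT INTO fact VALUES
    ('2024-01-15', 1, 10, 'Pune',   100.0),
    ('2024-01-15', 2, 11, 'Mumbai', 200.0),
    ('2024-02-10', 3, 10, 'Delhi',   50.0);
""")

# Facts are accessed via dimensions: total amount by region and month.
by_region_month = star.execute("""
    SELECT c.region, t.month, SUM(f.amount)
    FROM fact f
    JOIN city_dim c ON f.city = c.city
    JOIN time_dim t ON f.date = t.date
    GROUP BY c.region, t.month
    ORDER BY c.region, t.month
""").fetchall()
print(by_region_month)
```

Each dimension is reached from the fact table in a single join, which is the main practical appeal of the star layout.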

Page 28: Datawarehouse Tutorials

Star Schema

Page 29: Datawarehouse Tutorials

Snowflake Schema

• Snowflake schemas are most often used when dealing with large hierarchies that are static.

• Snowflaked tables (look-up tables) may increase the speed of queries, depending on the presentation tool.

SNOWFLAKE SCHEMA

A database design that stores a central fact table surrounded by multiple dimension tables decomposed or normalized into one or more hierarchies.

Page 30: Datawarehouse Tutorials

Snowflake schema

• Represents dimensional hierarchy directly by normalizing tables.

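[Figure: snowflake schema. A central fact table (date, custno, prodno, city, ...) joined to the time, prod, cust, and city dimension tables, with city further normalized into region]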
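
A minimal sketch of the snowflaked city dimension, again using SQLite from the Python standard library. Only the city-to-region branch is modeled; the `region_id` key, the `amount` measure, and the sample rows are illustrative assumptions.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- The hierarchy is represented directly: city points to a separate region table.
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE city (
    city      TEXT PRIMARY KEY,
    region_id INTEGER REFERENCES region(region_id)
);
CREATE TABLE fact (
    date TEXT, custno INTEGER, prodno INTEGER,
    city TEXT REFERENCES city(city),
    amount REAL
);
INSERT INTO region VALUES (1, 'North'), (2, 'West');
INSERT INTO city VALUES ('Delhi', 1), ('Pune', 2), ('Mumbai', 2);
INSERT INTO fact VALUES
    ('2024-01-15', 1, 10, 'Pune',   100.0),
    ('2024-01-15', 2, 11, 'Mumbai', 200.0),
    ('2024-02-10', 3, 10, 'Delhi',   50.0);
""")

# Rolling up to region now takes two joins instead of one.
by_region = snow.execute("""
    SELECT r.region_name, SUM(f.amount)
    FROM fact f
    JOIN city c   ON f.city = c.city
    JOIN region r ON c.region_id = r.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(by_region)
```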

Page 31: Datawarehouse Tutorials

Snowflake Schema

Page 32: Datawarehouse Tutorials

Slowly-Changing Dimensions

• Most dimensions change over time.
  - Products change offered coverage or limits and deductibles.
  - Employees are promoted, fired, or change departments.
  - Customers change names and addresses.

• What are our choices for tracking these changes over time?

Page 33: Datawarehouse Tutorials

Slowly-Changing Dimensions

• There are three types of slowly changing dimensions:
  - Type 1: Overwrites the old data for a record with new data. This eliminates the ability to track history over time.
  - Type 2: Creates a new record with the new data at the time of the change. Accurately tracks history, but requires a generalized (surrogate) key.
  - Type 3: Tracks new and original values in separate fields at time of change. Intermediate values are lost.

Page 34: Datawarehouse Tutorials

Type 1 - Overwrite Old Values

• Customer Lynnette Groves is changing her name to ?

• If there is no value in tracking this change, we will overwrite the First Name and Last Name fields with the new values.

• ‘UPDATE’ statement; 1 record is maintained.
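
A Type 1 change might look like the sketch below. The `customer_dim` table and its columns are assumed for illustration, and because the slide does not give the new name, `NEW_LAST_NAME` is only a placeholder.

```python
import sqlite3

db1 = sqlite3.connect(":memory:")
db1.execute(
    "CREATE TABLE customer_dim "
    "(customer_key INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
db1.execute("INSERT INTO customer_dim VALUES (1, 'Lynnette', 'Groves')")

NEW_LAST_NAME = "NewSurname"  # placeholder: the new name is not given in the slides

# Type 1: overwrite in place. The old value is gone after this statement.
db1.execute(
    "UPDATE customer_dim SET last_name = ? WHERE customer_key = ?",
    (NEW_LAST_NAME, 1),
)
rows_type1 = db1.execute("SELECT * FROM customer_dim").fetchall()
print(rows_type1)
```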

Page 35: Datawarehouse Tutorials

Type 2 - Create New Record

• Lynnette Groves is changing her name and we want to track both values

• Add a second record with a new Customer Key and make it the active row

• ‘INSERT’ statement for new, ‘UPDATE’ for active; 2 records are maintained

• New record for each change up to n records
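
One possible shape for a Type 2 change is sketched below. The surrogate `customer_key`, the business key `customer_id`, the `is_active` flag, and the placeholder new name are all assumptions; many designs also add effective-date columns.

```python
import sqlite3

db2 = sqlite3.connect(":memory:")
db2.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id  TEXT,                               -- business key (hypothetical)
        first_name   TEXT,
        last_name    TEXT,
        is_active    INTEGER
    )
""")
db2.execute(
    "INSERT INTO customer_dim (customer_id, first_name, last_name, is_active) "
    "VALUES ('C100', 'Lynnette', 'Groves', 1)"
)

def apply_type2_change(conn, customer_id, first_name, last_name):
    """Retire the active row (UPDATE) and add a new active row (INSERT)."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE customer_dim SET is_active = 0 "
            "WHERE customer_id = ? AND is_active = 1",
            (customer_id,),
        )
        conn.execute(
            "INSERT INTO customer_dim (customer_id, first_name, last_name, is_active) "
            "VALUES (?, ?, ?, 1)",
            (customer_id, first_name, last_name),
        )

# "NewSurname" is a placeholder; the slides do not state the new name.
apply_type2_change(db2, "C100", "Lynnette", "NewSurname")
history = db2.execute(
    "SELECT customer_key, last_name, is_active FROM customer_dim ORDER BY customer_key"
).fetchall()
print(history)
```

Fact rows loaded before the change keep pointing at the old customer key, which is how history stays accurate.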

Page 36: Datawarehouse Tutorials

Type 3 - Original and Current

• We decide that no matter how many times she changes her name, we only want to track the original and the current.

• Before any changes, original and current are the same. Any name change updates ‘current’ fields.

• ‘UPDATE’ statement; 1 record is maintained
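
A Type 3 sketch, assuming hypothetical `original_last_name` and `current_last_name` columns and made-up successive names, shows why intermediate values are lost:

```python
import sqlite3

db3 = sqlite3.connect(":memory:")
db3.execute(
    "CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, "
    "original_last_name TEXT, current_last_name TEXT)"
)
# Before any change, original and current are the same.
db3.execute("INSERT INTO customer_dim VALUES (1, 'Groves', 'Groves')")

# Two successive name changes (placeholder names); only 'current' is updated.
for new_name in ("SecondName", "ThirdName"):
    db3.execute(
        "UPDATE customer_dim SET current_last_name = ? WHERE customer_key = 1",
        (new_name,),
    )

type3_row = db3.execute(
    "SELECT original_last_name, current_last_name FROM customer_dim"
).fetchone()
print(type3_row)  # the intermediate value 'SecondName' is no longer stored
```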

Page 37: Datawarehouse Tutorials

Degenerate Dimensions

• Certain attributes are tracked that don’t necessarily belong in their own dimension - orphan attributes.

• This may occur when fact tables are designed to reflect the actual working document.

DEGENERATE DIMENSIONS

Dimensions that are so small, with no attributes of their own, that they are stored directly in the fact table.

Page 38: Datawarehouse Tutorials

Degenerate Dimensions

• Examples include ‘order_number’, ‘bill_of_lading_num’, and ‘invoice_number’.

• While these fields seem very transaction oriented, they are helpful in grouping things such as all line items on an invoice.

• Including these fields on the fact table amounts to denormalizing the attribute due to the granularity of the fact table being the document itself or a line item of the document.
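
To illustrate the grouping use mentioned above, the sketch below keeps `invoice_number` directly on hypothetical line-item fact rows and totals the line items per invoice. The row layout and values are invented for the example.

```python
from collections import defaultdict

# Fact rows at line-item grain. invoice_number has no dimension table of
# its own; it lives on the fact row as a degenerate dimension.
line_items = [
    {"invoice_number": "INV-1", "prodno": 10, "amount": 40.0},
    {"invoice_number": "INV-1", "prodno": 11, "amount": 60.0},
    {"invoice_number": "INV-2", "prodno": 10, "amount": 25.0},
]

def total_by_invoice(rows):
    """Group line items by their degenerate dimension and sum the amounts."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["invoice_number"]] += row["amount"]
    return dict(totals)

invoice_totals = total_by_invoice(line_items)
print(invoice_totals)
```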

Page 39: Datawarehouse Tutorials

Types of Facts

• Understanding which facts can be added across which dimensions is an important data design issue.

• Three Types of Facts:• Additive• Non-Additive• Semi-Additive

Page 40: Datawarehouse Tutorials

Additive Facts

• Since aggregation is a key element in the usefulness of the dimensional model, it is best utilized for facts that are additive, numeric values.

• We can add revenue, cost, and quantity sold for all products, all stores, and any time period.

ADDITIVE FACTS

Measurements in a fact table that can be added across all dimensions.

Page 41: Datawarehouse Tutorials

Semi-Additive Facts

• Current balance is a semi-additive fact: it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?).

• It does not make sense to add them up through time: adding up all current balances for a given account for each day of the month does not give us any useful information.

SEMI-ADDITIVE FACTS

Measurements in a fact table that can be added across some dimensions but not others.
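
The balance example can be made concrete with a small sketch. The snapshot rows and function names are illustrative assumptions; the point is that summing across accounts is meaningful, while across time one would typically take a closing value or an average instead.

```python
# Daily balance snapshots: (date, account, balance). Sample data only.
daily_balances = [
    ("2024-01-01", "A", 100.0), ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0), ("2024-01-02", "B", 50.0),
]

def total_balance_on(rows, day):
    """Additive across the account dimension: sum all accounts for one day."""
    return sum(bal for d, _, bal in rows if d == day)

def closing_balance(rows, account):
    """Across time, take the latest snapshot rather than a sum."""
    latest = max((d, bal) for d, acct, bal in rows if acct == account)
    return latest[1]

def average_balance(rows, account):
    """Another valid roll-up across time: the average of the snapshots."""
    balances = [bal for _, acct, bal in rows if acct == account]
    return sum(balances) / len(balances)

bank_total = total_balance_on(daily_balances, "2024-01-02")
closing_a = closing_balance(daily_balances, "A")
average_a = average_balance(daily_balances, "A")
print(bank_total, closing_a, average_a)
```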

Page 42: Datawarehouse Tutorials

Non-Additive Facts

• A new value will need to be calculated at each level, for each set of data.

• Examples: ratios, averages, and variances.

NON-ADDITIVE FACTS

Measurements in a fact table that cannot be added across any dimensions, like ratios.
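
A short sketch of why a ratio has to be recalculated at each level: the product names and figures below are invented, and margin is assumed to mean (revenue - cost) / revenue.

```python
# (product, revenue, cost). Sample figures only.
sales = [("Health", 200.0, 150.0), ("Auto", 100.0, 50.0)]

def margin_ratio(revenue, cost):
    """Gross margin as a fraction of revenue."""
    return (revenue - cost) / revenue

# Ratio at the product level.
per_product = {name: margin_ratio(rev, cost) for name, rev, cost in sales}

# Ratio at the all-products level: recompute from the additive components,
# do not add up the product-level ratios.
total_revenue = sum(rev for _, rev, _ in sales)
total_cost = sum(cost for _, _, cost in sales)
overall_margin = margin_ratio(total_revenue, total_cost)

naive_sum = sum(per_product.values())  # not a meaningful number
print(per_product, overall_margin, naive_sum)
```

Storing the additive components (revenue and cost) in the fact table and deriving the ratio at query time is one common way to handle this.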

Page 43: Datawarehouse Tutorials

Dimensional Modeling Process

Step 1: Choose the grain of each fact table.

• Granularity defines the level of detailed data.

• It must be determined prior to going forward in the modeling process.

• Typical grains are individual transactions, time-based aggregation, and/or aggregations along a commonly used dimension.
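
As an illustration of moving between grains, the sketch below rolls hypothetical transaction-grain rows up to a daily, per-product grain. Once only the daily grain is stored, the individual transactions can no longer be recovered, which is why the grain has to be chosen up front.

```python
from collections import defaultdict

# Transaction grain: one row per transaction (date, prodno, amount). Sample data.
transactions = [
    ("2024-01-15", 10, 100.0),
    ("2024-01-15", 10, 50.0),
    ("2024-01-16", 10, 20.0),
]

def to_daily_grain(rows):
    """Aggregate to one row per (date, prodno)."""
    totals = defaultdict(float)
    for day, prodno, amount in rows:
        totals[(day, prodno)] += amount
    return dict(totals)

daily = to_daily_grain(transactions)
print(daily)
```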

Page 44: Datawarehouse Tutorials

Dimensional Modeling Process

Step 2: Choose the dimension attributes.

• For example, what should our time dimension look like? Should it have just ‘January’ for month, or also ‘Jan’ and ‘1’?

• Should we store the code and the description, just the code, or just the description?

• What values will our users need to filter or report on?
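
One way to answer the time-dimension question is to carry all three month forms. The sketch below builds such a row; the attribute names and the extra `quarter` and `year` attributes are assumptions chosen for illustration.

```python
from datetime import date

MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)

def time_dimension_row(d):
    """Build one time-dimension row with full, short, and numeric month."""
    name = MONTH_NAMES[d.month - 1]
    return {
        "date_key": d.isoformat(),
        "month_name": name,        # e.g. 'January'
        "month_abbr": name[:3],    # e.g. 'Jan'
        "month_number": d.month,   # e.g. 1
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }

example_row = time_dimension_row(date(2024, 1, 15))
print(example_row)
```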

Page 45: Datawarehouse Tutorials

Dimensional Modeling Process

Step 3: Identify dimensional hierarchies.

• A dimension such as time may have days rolling into months and then quarters, as well as days rolling into weeks which may cross months and quarters.

• Sales geography may differ from physical geography.

• Zip codes can cross city boundaries and cities are made up of multiple zip codes.
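
The point about weeks crossing months can be checked with a short sketch. It assumes ISO week numbering (weeks start on Monday) and uses `date.fromisocalendar`, which requires Python 3.8 or later; a business may define its weeks differently.

```python
from datetime import date, timedelta

def months_in_iso_week(iso_year, iso_week):
    """Return the sorted (year, month) pairs touched by one ISO week."""
    monday = date.fromisocalendar(iso_year, iso_week, 1)
    days = [monday + timedelta(days=i) for i in range(7)]
    return sorted({(d.year, d.month) for d in days})

# ISO week 5 of 2024 runs from Monday 29 January to Sunday 4 February,
# so it belongs to two months and cannot roll up cleanly into either.
week5 = months_in_iso_week(2024, 5)
week2 = months_in_iso_week(2024, 2)
print(week5, week2)
```

This is why the week hierarchy is usually modeled as a separate roll-up path from the day, month, and quarter path.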

Page 46: Datawarehouse Tutorials

Dimensional Modeling Process

Step 4: Choose the dimensions that apply to each fact table.

• Typical dimensions include time, product, policyholder, agent, and geography.

• Remember to evaluate granularity when applying dimensions to facts.

Page 47: Datawarehouse Tutorials

Dimensional Modeling Process

Step 5: Choose the measured facts, including precalculated facts.

• Each aggregated and derived fact will need to be evaluated for inclusion in the model or calculation in the application.

• Trade-offs include storage and indexing and must be weighed against the access requirements.

Page 48: Datawarehouse Tutorials

Dimensional Modeling Process

Step 6: Determine slowly changing dimensions

• These are the dimensions that change over time.

• If tracking these changes is important, the method must be decided.

• Options: overwrite the existing record, store all records with effective dates, or maintain historical and current value tables.