Data Warehouse Overview -...

112
1 1 Introduction to Data Warehouse (slides in this section are used courtesy of Carrig Emerging Technology Ph: 410-553-6760 www.carriget.com) 2 Introduction to Data Warehousing and Data Mining 1) Data Warehouse Introduction 2) Engineering Conflicts 3) OLTP and DSS 4) Stovepipe vs. Integration 5) Data Warehouse Solution 6) Enterprise Information System 7) Security in a Data Warehouse 8) Moving Data to a Data Warehouse 9) Data Marts 10) Data Mining

Transcript of Data Warehouse Overview -...

Page 1: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

1

1

Introduction toData Warehouse

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

2

Introduction to Data Warehousing and Data Mining

Introduction to Data Warehousing and Data Mining

1) Data Warehouse Introduction2) Engineering Conflicts3) OLTP and DSS4) Stovepipe vs. Integration5) Data Warehouse Solution6) Enterprise Information System7) Security in a Data Warehouse8) Moving Data to a Data Warehouse9) Data Marts10) Data Mining

Page 2: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

2

3

IntroductionIntroduction

• Key topics for this course include:– Data Warehouse– Data Mart

– Data Mining

• Background and review of relational database systems

• Main focus on data warehouse and data mining

4

Data Warehouse IntroductionData Warehouse Introduction

• A data warehouse is a single source for key, corporate information needed to enable business decisions

• A database application is a piece of software that provides a user interface for users to add, delete, query and update data

• Typically, a database management system is used to actually do the work of adding, deleting, querying or updating data

DatabaseSystemApplication Data

Page 3: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

3

5

Engineering Conflicts, Query and UpdateEngineering Conflicts, Query and Update

• It is often an engineering problem when data is updated and long-running queries occur at the same time

• In some cases, the users who are doing updates must wait for queries to complete

• One way to avoid this is to make a read-onlycopy of data

Application

Database System

Datafor update

Datafor query

6

Database System

OLTPApplication

DSSDataDSS

Application

OLTPData

OLTP and DSS DefinedOLTP and DSS Defined

• An application that updates is called an on-line transaction processing (OLTP) application

• An application that issues queries to the read-only database is called a decision support system (DSS)

Page 4: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

4

7

Applications in a Typical EnterpriseApplications in a Typical Enterprise

• Most organizations have several disparate OLTP/DSS applications in several databases

InventoryOLTP

Application

FinanceOLTP

Application

Finance DSS

Application

InventoryDSS

Application

SalesOLTP

Application

SalesDSS

Application

F i n a n c eDSSData

F i n a n c eOLTPData

Inven to ryDSSData

Inven to ryOLTPData

S a l e sDSSData

S a l e sOLTPData

DATABASE SYSTEM

8

Stovepipe vs IntegrationStovepipe vs Integration

• When systems stand by themselves they are often referred to as “stovepipes”

• Systems that easily share data are called “well integrated systems”

Finance DSS

Application

InventoryOLTP

Application

FinanceOLTP

Application

InventoryDSS

Application

Page 5: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

5

9

• Problems:– Users who wish to access data must query several different DSS to

find it– Data may have fundamental conflicts between DSS

– a department code table in one DSS may differ in another DSS– a measurement may be stored in meters in one DSS and yards in

another

• Solution:– Use a data warehouse, where data is integrated

from the several different stovepipe systems– Data warehouse is really sharing-lite -- you

don’t have to co-ordinate as much when applications are built and you still reap the benefits of data sharing

Problems with Stovepipe ArchitectureProblems with Stovepipe Architecture

10

Data Warehouse SolutionData Warehouse Solution

• A data warehouse is an attempt to integrate separate DSS so that users can query one place to find the answers to their questions

• A data warehouse has the key, corporate data in the organization

• A data warehouse tracks historical data

Page 6: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

6

11

Data Warehouse - A Success StoryData Warehouse - A Success Story

• Largest data warehouse is Wal-Mart (9 TB)

• Uses for Wal-Mart data warehouse– Identifies where a new store should be built based on customer demand– Identifies how stores are performing across the nation– Contains every “scan” from every purchase

• Benefits Wal-Mart gained from their data warehouse– Provided competitive advantage over K-Mart– Reduced excess inventory in individual stores– Avoided wasted funds in building stores which would fail

12

Selling the Data WarehouseSelling the Data Warehouse

• A data warehouse project will fail without corporate sponsorship– Preferably, the project should be sponsored by the CEO– The CEO must be sold on the value to the business to

improve competitive advantage by deploying a data warehouse

• If an active, corporate sponsor does not exist, data sources will be very difficult to identify

• Only add data to the warehousethat will answer key, corporate questions asked by the corporate sponsor.

Otherwise, you will have a data dump

Page 7: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

7

13

Building a Useful Data WarehouseBuilding a Useful Data Warehouse

• You really need:– strong executive sponsorship

– good knowledge of the data

– sound software engineering

– stability from source systems

– users who want a success

• A 75 percent failure rate is often cited

• It is WORTH the effort!!!

14

Enterprise Information SystemEnterprise Information System

Data Warehouse

EnterpriseInformation

System

• An EIS (Enterprise Information System) allows users to query data in a data warehouse

• Users can access key, corporate data in the data warehouse

Page 8: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

8

15

Users of an Enterprise Information SystemUsers of an Enterprise Information System

• Frequently, multiple EIS are needed to satisfy different types of users– Some users only want a system that has pre-defined reports so they

only need to “click one button” to see data they need. These users want the system to be no harder to use than a “coffee pot”

– Other users want to delve into the data and build their own queries

• Executives want a high-level, summarydata and a simple tool– Must be VERY easy to use, users want to click a few

buttons and get data they want– Results must be graphs– Users should be able to drill-down into key areas.

16

Users of an Enterprise Information SystemUsers of an Enterprise Information System

• Analysts want a flexible, more detailed tool– Often very knowledgeable about the data– Willing to do more work to learn about the data– Sometimes even learn SQL to issue their own

ad-hoc queries

• General users want a tool that provides detailed data, but is very easy to use– Want access to the data warehouse to do

routine tasks such as “Find me Hank’s phone number”, etc.

– Simple application, but not so focused on large reports

Page 9: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

9

17

Data Warehouse / EISData Warehouse / EIS

Data Warehouse

FinanceOLTP

Application

SalesOLTP

Application

F i n a n c eSub jec t

Area

F i n a n c eOLTPData

Inven to rySub j ec t

Area

Inven to ryOLTPData

Sales Sub j ec t

Area

S a l e sOLTPData

InventoryOLTP

Application

EnterpriseInformation

System

S a l e sOLTPData

18

Need for Data WarehousesNeed for Data Warehouses

• Data warehouses provide a single place to store key corporate data– The idea is that users can go one place to find this key data using an

enterprise information system (EIS)

• Data warehouse is also a place to store and access historical data– Users measure performance goals for their company over a period of

time– Company statistics are available– Data not stored in the same place is difficult to locate

and compare, easily lost– Single query can be used to access key data

Page 10: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

10

19

Security in Data WarehouseSecurity in Data Warehouse

• Building a data warehouse does increase security risk because key, corporate information is all in one place

• To mitigate that risk, database system components can be used to protect the data warehouse. These include– Views – Access control– Security Administration– Encryption– Audit

20

Moving Data into the Data Warehouse Moving Data into the Data Warehouse

• Moving data from source OLTP systems to the data warehouse is the hard part of data warehousing

• Updates to the data warehouse are performed periodically– weekly– nightly– monthly

• Occasionally, real-time data is needed in a data warehouse, but this

is not very common

Page 11: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

11

21

Using Middleware to Move DataUsing Middleware to Move Data

SourceOLTP

System

DataWarehouseMigrationSoftware

“Middleware”

DataWarehouse

• Data can be moved to the warehouse via data migration software

• This is often called “middleware” because it sits between the source OLTP and the data warehouse

22

Need for a Data MartNeed for a Data Mart

• A data mart is a subset of the data warehouse that may make it simpler for users to access key corporate data – Sometimes, users only need a piece of data from the data

warehouse

• The data mart is typically fed from the data warehouse

Data Warehouse

F i n a n c eSub jec t

Area

Inven to rySub j ec t

Area

Sales Sub j ec t

Area

New YorkD a t a M a r t

CaliforniaD a t a M a r t

Page 12: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

12

23

Data Mart in ActionData Mart in Action

New YorkD a t a M a r t

CaliforniaD a t a M a r t

Data Warehouse

FinanceOLTP

Application

SalesOLTP

Application

F i n a n c eSub jec t

Area

F i n a n c eOLTPData

Inven to rySub j ec t

Area

Inven to ryOLTPData

Sales Sub j ec t

Area

S a l e sOLTPData

InventoryOLTP

Application

EnterpriseInformation

System

S a l e sOLTPData

24

Data Mining IntroductionData Mining Introduction

• Data Mining is done by running software that examines a database and looks for patterns in the data

• A data warehouse by itself will respond to queries from users– It will not tell users about patterns in data that users may not have

thought about– To find patterns in data, data mining is

used to try and mine key information from a data warehouse

Page 13: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

13

25

Advantages of Data MiningAdvantages of Data Mining

• Data mining allows companies to collect information and make them more productive and beat their competition

• Data mining helps identify – why customers buy certain products

– ideas for very direct marketing– ideas for shelf placement– training of employees vs. employee retention– employee benefits vs. employee retention

26

Implementing Data MiningImplementing Data Mining

• Apply data mining tools to run data mining algorithms against data

• There are two approaches:– Copy data from the Data Warehouse and mine it– Mine the data in the Data Warehouse

• Popular tools use a variety of different data mining algorithms:– association rules

– genetic algorithms– decision trees– neural networks

Page 14: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

14

27

Data Mining using Separate Data Data Mining using Separate Data

• You can move data from the data warehouse to data mining tools– Advantages

– Data mining tools may organize data so they can run faster– Disadvantages

– Could be very expensive to move largeamounts of data

Data Warehouse

Data Mining ToolCopy of data made

by the Data Mining Tool

28

Data Mining Against the Data WarehouseData Mining Against the Data Warehouse

• Data mining tools can access data directly in the Data Warehouse– Advantages

– No copy of data is needed for data mining– Disadvantages

– Data may not be organized in a way that isefficient for the tool

Data Warehouse

Data Mining Tool

Page 15: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

15

29

Data Mining: SummaryData Mining: Summary

• Data mining attempts to find patterns in data that we did not know about

• Often data mining is just a new buzzword for statistics

• Data mining differs from statistics in that large volumes of data are used

• Many different data mining algorithms exist and we will discuss them in the course

• Examples– identify users who are most likely to commit credit card fraud– identify what attributes about a person most results in them buying

product x.

30

SQL Review

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

Page 16: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

16

31

Introduction to SQLIntroduction to SQL

1) Introduction to SQL2) Data Definition Language (DDL)3) Data Manipulation Language (DML)4) SELECT Construct 5) SELECT Operators6) Wildcard Searches7) Aggregate Operators8) Calculated Attributes9) Sorting Results

32

Introduction to Structured Query LanguageIntroduction to Structured Query Language

• Structured Query Language (SQL) is the language used to communicate with a relational database– Industry standard– Based on set theory

• SQL composed of two types of constructs:– Data Definition Language (DDL)

– Defines the structure of the database– Data Manipulation Language (DML)

– Provides the constructs to input and retrieve data

Page 17: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

17

33

SQL Overview - DDLSQL Overview - DDL

• Data Definition Language (DDL) is used to describe the structure of the database– Create tables, indexes, etc. – Typical Operations are:

– CREATE TABLE defines what columns are in the table and establishes the table

– CREATE INDEX defines an index for the table. Indexes are used to improve database performance

34

SQL Overview - DMLSQL Overview - DML

• Data Manipulation Language (DML) is used for storing, updating, and retrieving data.

• Typical operations include:– SELECT is used to retrieve data.

– Ex: SELECT * FROM PRODUCTS– INSERT is used to add new rows to the database.

– INSERT INTO PRODUCTS VALUES ('food', 'hardware', 'housewares')

– UPDATE is used to change rows that already exist in the database.– UPDATE PRODUCTS SET PRICE = PRICE + 4

– DELETE is used to eliminate rows of data from the database.– DELETE FROM PRODUCTS

Page 18: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

18

35

SELECT OverviewSELECT Overview

• SELECT is used to retrieve records from the database.

• Single table SELECT constructs:– WHERE– IN– BETWEEN– LIKE– Aggregate Operators– DISTINCT– ORDER BY

36

SELECT ExamplesSELECT Examples

• Query Purpose: Retrieve names and prices of all products

SELECT ProductName, PriceFROM TinyProducts

• Query Purpose: Retrieve all information for all employees from the TinyProducts table

SELECT *FROM TinyProducts

Page 19: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

19

37

SELECT with WHERESELECT with WHERE

• The WHERE clause is used to filter which information is returned from a SELECT

• Query Purpose: Retrieve all information only for product type of “food”

SELECT * FROM TinyProductsWHERE ProductType = ‘Food’

38

Use of Boolean OperatorsUse of Boolean Operators

• Conditions can be separated by Boolean operators: – AND, OR, NOT

• Query Purpose: List all information about food products that are either cereal or fruit

SELECT *FROM TinyProductsWHERE (ProductName = 'Cereal')OR (ProductName = 'Fruit')

Page 20: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

20

39

Boolean Operator ExampleBoolean Operator Example

• Query Purpose: List the names of all products that the type is fruit and the price is less than $2.00

SELECT ProductType, ProductNameFROM TinyProducts

WHERE Price < 2AND ProductName = 'Fruit'

40

IN OperatorIN Operator

• The IN operator allows a search for records that match one value in a set of unordered values

• Example questions to use IN:– 'Find all products whose type is Food, Hardware, or Housewares'– 'Find all food whose type is Meat, Fish, Vegetables, or Fruit'

Page 21: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

21

41

IN ExampleIN Example

• Query Purpose: List the name of Housewares that are Cookware, Linens, or Dishes

SELECT ProductName, ProductTypeFROM TinyProductsWHERE ProductName in ('Cookware', 'Linens', 'Dishes')

instead of:

SELECT ProductName, ProductTypeFROM TinyProducts

WHERE (ProductName = ’Cookware') OR (ProductName = 'Linens') OR (ProductName = 'Dishes')

42

BETWEEN OperatorBETWEEN Operator

• The BETWEEN operator allows a search for a range of values

• Example Queries:– 'Find all fruit between Bananas and Grapes'– 'Find all cereals whose price is between $1.50 and $4.00 a box

1.50 4.00

Page 22: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

22

43

BETWEEN ExampleBETWEEN Example

• Query Purpose: Find all products whose price is between $2.00 and $8.00

SELECT ProductName, PriceFROM TinyProductsWHERE Price BETWEEN 2.00 AND 8.00

instead of:

SELECT ProductName, HardwareFROM TinyProductsWHERE (Price >= 2.00) OR (Price <= 8.00)

44

Wildcard Searches of StringsWildcard Searches of Strings

• The LIKE operator is used to search parts of a string

• The following wildcard characters are used:% to match any zero or more characters_ to match exactly one character

Page 23: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

23

45

Wildcard Search ExamplesWildcard Search Examples

• Query Purpose: List all products whose name starts with an ’C'

SELECT * FROM TinyProductsWHERE ProductName LIKE 'C%'

• Query Purpose: List all products that have a SKU number with the last 2 characters of ’23' when you don't know the first character

SELECT * FROM TinyProductsWHERE SKUNumber LIKE '_23'

46

Aggregate OperatorsAggregate Operators

• MIN, MAX, and AVERAGE are used when computing statistics on a range of data

• Query Examples:– 'What is the highest batting average on the team?'– 'What is the average number of hits for all the little league teams in

the National League?'– 'What are the names of the players that had the lowest average on

the little league team?'

Page 24: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

24

47

Aggregate Operators ExampleAggregate Operators Example

• Query Purpose: Find the minimum, maximum, and average batting average of all players in the National League of Little League

SELECT MIN(Average), MAX(Average),AVG(Average)

FROM PLAYERSWHERE League = 'National'

48

SUM and COUNT OperatorsSUM and COUNT Operators

• Use the SUM operator to total the results of a query

• COUNT will count the total number of occurrences of an item in a search

11 + + 22 + + 33 + + 44

Page 25: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

25

49

SUM And COUNT ExamplesSUM And COUNT Examples

• Query Purpose: Find the total number of homeruns hit by all players in the American League?

SELECT SUM(HomeRuns)FROM PLAYERSWHERE League='American'

• Query Purpose: List the names of players that have hit 3 home runs in the National League?

SELECT COUNT(*) FROM PLAYERSWHERE HomeRuns = '3' AND League = 'National'

50

Calculated AttributesCalculated Attributes

• A new attribute can be obtained by using arithmetic operators (+,-, *, /) on other numeric attributes

• All operators follow standard precedence:– Multiplication and division are computed first left to right– Addition and subtraction are computed last left to right– Use parenthesis to override the standard precedence

( ( + + ,, -- ,, * * ,, / / ))

Page 26: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

26

51

Calculated Attributes ExampleCalculated Attributes Example

Query Purpose: List all players with their hits, at bats, and their batting average

SELECT Name, Hits, AtBats, (Hits / AtBats)

FROM PLAYERS

52

DISTINCT OperatorDISTINCT Operator

• DISTINCT is used to exclude duplicate occurrences in the result of a query

• Query Purpose: List all distinct batting averages

SELECT DISTINCT(Average)FROM PLAYERS

Page 27: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

27

53

Sorting Query ResultsSorting Query Results

• The ORDER BY clause is used at the end of the SELECT statement to sort the results of a query

• Use DESC on the end of the ORDER BY clause to sort the data in descending order. Otherwise, the result will be in ascending order

54

Sorting ExampleSorting Example

• Query Purpose: List all players in ascending order of their batting average

SELECT Name, AverageFROM PLAYERSORDER BY Average

• For descending order add the keyword DESC

SELECT Name, AverageFROM PLAYERSORDER BY Name DESC

Page 28: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

28

55

Sorting Calculated AttributesSorting Calculated Attributes

• To refer to a computed attribute in the ORDER BY, use its position in the list of columns following SELECT

• Query Purpose: List all players in descending order of their batting average (here we assume batting average is computed at the time of the query)

SELECT Name, Hits, AtBats, Hits / AtBats

FROM PLAYERSORDER BY 3 DESC

56

More SQLMore SQL

1) GROUP BY Construct

2) HAVING Filter

3) Multiple Tables

4) Joins5) Equijoins6) Cartesian Product7) Nulls8) OUTER JOIN

Page 29: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

29

57

GROUP BY ClauseGROUP BY Clause

• GROUP BY will partition a table into multiple groups of related rows.

• As an example, consider the EMPLOYEE table where Department partitions the EMPLOYEE set into subsets:

Engineering

Marketing Customer

Finance

58

GROUP BY ExampleGROUP BY Example

• Query Purpose: For each department, list the average salary using the EMPLOYEE table

SELECT Department, AVG(Salary)FROM EMPLOYEEGROUP BY Department

Page 30: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

30

59

• To filter data further, we can use the WHEREclause with GROUP BY clause

Query Purpose: For each department, list the highest salary of their administrative assistants.

SELECT Department, MAX(Salary)FROM EMPLOYEEWHERE Title='administrative assistant'GROUP BY Department

GROUP BY With WHEREGROUP BY With WHEREGROUP BY With WHEREGROUP BY With WHERE

60

HAVING ConstructHAVING Construct

• HAVING is used to restrict the output of aggregate functions, such as SUM, MIN, MAX and AVG, to only those groups of rows that meet some condition.

Query Purpose: List the average salary for all departments that have more than three employees.

SELECT Department, AVG(Salary)FROM EMPLOYEE

GROUP BY DepartmentHAVING COUNT(*) > 3

Page 31: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

31

61

EmpID Name Salary1 Fred 200

2 Ethel 300

3 Mike 400

4 David 100

EMPLOYEE

Multi-Table SQLMulti-Table SQL

• It is often necessary to combine data into multiple tables.

ATTENDSEmpID Name

1 Harvard2 GMU2 Yale3 MIT3 Stanford3 GMU

62

JoinsJoins

• Joins are the means by which multiple tables can be combined.

• A join allows us to combine data from different tables. A join operation is done through the SELECT construct.

• Types of Joins: Equijoin, Outer Join, Inner Join

Page 32: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

32

63

EquijoinEquijoin

• Joins only those rows where a foreign key matches the primary key

• Allows information from multiple tables to be linked together in a single query

• Can be used to link as many tables as needed in a single query

64

• Query Purpose: List the names of all colleges attended by Ethel

SELECT b.NameFROM EMPLOYEE a, ATTENDS bWHERE a.EmpID = b.EmpIDAND a.Name = 'Ethel'

Equijoin Query ExampleEquijoin Query Example

Page 33: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

33

65

Equijoin ExampleEquijoin Example

EmpID College GPA

1 Harvard 2.452 GMU 3.792 Nova 3.653 Yale 2.853 Nova 2.653 GMU 4.0

EmpID Name Salary

1 Fred 2002 Ethel 3003 Mike 400

EMPLOYEE

ATTENDS

66

Warning about Joining TablesWarning about Joining Tables

• A join is really just a subset of a cartesianproduct. When no fields are 'joined' in the WHEREclause, a cartesian product is produced

– Restated in English: When the linking condition is omitted from the WHERE clause, you get a lot of excess garbage that you probably do not want.

Sample Query:

SELECT b.NameFROM EMPLOYEE a, ATTENDS bWHERE a.Name = 'Ethel'

Page 34: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

34

67

Cartesian ProductCartesian Product

• Each row in one table with every other row in other table

a.EmpID a.Name a.Salary b.EmpID b.GPA

2 Ethel 300 1 3.42 Ethel 300 2 2.82 Ethel 300 3 3.72 Ethel 300 4 3.5

....

68

NullsNulls

• An attribute may be defined as null.

• This indicates that the value is unknown and avoids the need for user-defined special indicators.

• To prevent a column from having nulls, specify NOT NULL on the column in the CREATE TABLEstatement when setting up the database.

Page 35: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

35

69

Nulls ExamplesNulls Examples

Statement Purpose: Add an employee whose salary is unknown

INSERT INTO EMPLOYEE (3,'Hank', NULL)

Query Purpose: Find all employees whose salary is unknown (or null)

SELECT *FROM EMPLOYEEWHERE Salary IS NULL

70

• An OUTER JOIN is used when the query should return a result row even for rows that do not have corresponding data in one of the tables.

• A LEFT OUTER JOIN returns all rows from the 'left' table.

• Nulls are returned when a row in the 'left' table has no corresponding rows in the right table.

OUTER JOINOUTER JOIN

Page 36: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

36

71

LEFT OUTER JOIN ExampleLEFT OUTER JOIN Example

• Query Purpose: List the college GPAs for each employee. Include employees who have not attended any colleges

SELECT a.Name, b.GPAFROM EMPLOYEE aLEFT OUTER JOIN ATTENDS bon a.EmpID = b.EmpID

72

LEFT OUTER JOIN ExampleLEFT OUTER JOIN Example

• Result of the outer join – All employees are listed. – For an equijoin, only those who attended a college would be listed– Here, employee number 4 did not attend college, but is still

retrieved by the outer join.

Name GPA

---------- -----Fred 2.45Ethel 3.79Ethel 3.65Mike 2.85Mike 2.65Mike 4.00David NULL

Page 37: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

37

73

Advanced SQL

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

74

Advanced SQLAdvanced SQL

1) Finding the nth element in a list2) Finding the median3) Correlated subquery4) Data Definition Language Constructs

Page 38: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

38

75

Find the Nth ElementFind the Nth Element

• It is very common to try to find the nth element in a list.– Examples:

– Who makes the second highest salary in marketing department? – What is the fifth best product in sales?

– This can be done with a program that uses SQL to access the database: SQL is sent to the database and the program keeps retrieving the result set until the threshold is crossed.

• We show another way ofdoing this using standard SQL.

76

Find the Nth Element: Example TableFind the Nth Element: Example Table

• Consider a table, called TEST, with just one column, x, with the following values:

X

4

5

8

Page 39: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

39

77

• First join TEST with itself, this yields each element matched with every other element:

Find the Nth Element: Step 1Find the Nth Element: Step 1

444555888

458458458

78

Find the Nth Element: Step 2Find the Nth Element: Step 2

• Next keep only those rows where the first column is greater than or equal the second column.

Notice the pattern that just developed, each number on the list now has a certain number of values that match on the right. This number matches the position of this value in the list. For example, 4 has only one match as it is the first number in the list, 5 has two matches, 8 has three matches.

455888

445458

444555888

458458458

Page 40: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

40

79

Find the Nth Element: Step 3Find the Nth Element: Step 3

• Now group by the column on the left and identify the size of each group.

• The same ideas can be applied to any SELECTstatement output.

455888

445458

458

123

80

Finding the Nth Element: ExampleFinding the Nth Element: Example

• Query Purpose: Find the information about the product with the second highest price.

SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber

FROM TinyProducts a, TinyProducts bWHERE a.Price >= b.PriceGROUP BY a.ProductName,a.ProductType,

a.Price, a.SKUNumberHAVING COUNT(*) =

(SELECT COUNT(*)-1 FROM TinyProducts)

Page 41: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

41

81

Finding the Top N Elements: ExampleFinding the Top N Elements: Example

• To ask for the top n values instead of the nthvalue, specify a range (>=) instead of just an equality (=) in the HAVING.

• Query Purpose: Find information about the products with the two highest prices.

SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts a, TinyProducts b

WHERE a.Price >= b.PriceGROUP BY a.ProductName,a.ProductType,

a.Price, a.SKUNumberHAVING COUNT(*) >=

(SELECT COUNT(*)-1 FROM TinyProducts)ORDER BY a.Price

82

Finding the MedianFinding the Median

• The median is defined as the element in the middle of the list.

• Query Purpose: Find the median price in TinyProducts.

SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts a, TinyProducts bWHERE a.Price >= b.PriceGROUP BY a.ProductName,a.ProductType, a.Price, a.SKUNumber

HAVING COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)

Page 42: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

42

83

Using SubqueriesUsing Subqueries

• A subquery may be used in the middle of a query.

• Query Purpose: Find the information about the highest priced product, using a simple subquery.

SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts aWHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)

84

Correlated SubqueryCorrelated Subquery

• If the subquery references a data element from outside of the subquery, it is called a correlatedsubquery.– For each row in the outer part of the query, the correlated subquery

is executed.

The following query will indicate who makes more money than ‘Ethel’

SELECT a.Name, a.Salary FROM Employee a WHERE EXISTS

(SELECT b.Salary FROM Employee b

WHERE a.Salary > b.Salary AND b.Name = 'Ethel')

Page 43: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

43

85

• INSERT– Add rows to a single table

• UPDATE– Modify rows in a single table

• DELETE– Remove rows from a single table

Other Data ManipulationOther Data Manipulation

86

INSERT ExamplesINSERT Examples

• Statement Purpose: Add a record for employee #1, ’Fred' with a salary of 200 to the EMPLOYEEtable

INSERT INTO Employee VALUES(1, ’Fred', 200)

• Statement Purpose: Copy all rows in the EMPLOYEE table and place them in NEW_EMPLOYEE

INSERT INTO New_EmployeeSELECT * FROM Employee

Page 44: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

44

87

UPDATE ExampleUPDATE Example

• Statement Purpose: Modify Fred’s salary to 150

UPDATE EmployeeSET Salary = 150.00WHERE Name = 'Fred'

• Statement Purpose: Give all employees a ten percent raise

UPDATE EmployeeSET Salary = Salary * 1.10

88

DELETE ExamplesDELETE Examples

• Statement Purpose: Remove all employees who have a salary higher than 100.

DELETE FROM EmployeeWHERE Salary > 100

• To remove all employees:

DELETE FROM Employee

Page 45: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

45

89

CREATE TABLE ExampleCREATE TABLE Example

• Statement Purpose: Create a table to store employee information

CREATE TABLE EMPLOYEE(EmpId SMALLINT,Name CHAR(10),Salary DECIMAL(5,2))

To drop the EMPLOYEE table

DROP TABLE EMPLOYEE

90

Data WarehouseSecurity

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

Page 46: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

46

91

Data Warehouse SecurityData Warehouse Security

1) Key Security Services2) Views3) Access Control 4) Roles5) Encryption6) Audit Trails7) Security Holes8) Intrusion Detection9) Misuse Detection

92

IntroductionIntroduction

• A key feature provided by database systems is good security services. – In a database system with good security, applications do not have

to worry about problems that arise with security violations.

• A data warehouse also requires good security services because it holds key, corporate data.

EIS

Database System

SecurityServices

Page 47: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

47

93

Key Security ServicesKey Security Services

• Access Control– Controls who accesses what data

• Administration of Access Control– Used to give access to users as well as track who has various

accesses and what kind of accesses are given to a user or group of users

– Audit tracks the usage of the data warehouse

94

Security in a Data WarehouseSecurity in a Data Warehouse

• A data warehouse consolidates organizations key data in one place.– A data warehouse increases the security risk that unauthorized

users will try to obtain this data

• Security aspects of EIS applications must be designed and implemented very thoroughly.

• Access control and audits are two of the critical components of security.

Page 48: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

48

95

Data Warehouse Security ComponentsData Warehouse Security Components

• Database system components that can be used to protect a data warehouse include:– Views

– Allow users to only see certain rows or columns of data– Access control

– Indicate which users have access to what data

– Administration– This component is used to actually give access to groups of users

and to define the accesses given to either an individual or a group. – Encryption– Protect data from access outside of the DBMS

– Audit– Track what users are doing

96

Views in Data WarehouseViews in Data Warehouse

• A view is a logical view into one or more tables. Users may be given access to the view without access to the base table.

• Views provide some security assistance because they can hide data from users.

Name Address Salary

H a n kE s t h e rTomSueDavePeteKathy

1 S o u t h S t r e e t2 N o r t h S t r e e t3 4 M a i n S t r e e t4 5 E a s y S t r e e t5 6 5 t h A v e n u e7 Broadway8 9 W e s t e r n A v e n u e

$50 ,000$80 ,000$90 ,000$28 ,500$35 ,000$60 ,000$85 ,000

EMPLOYEE

Page 49: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

49

97

View ExampleView Example

VIEW (SAFE_EMPLOYEE)“Salary” i s e f fect ive ly h idden

• A view called SAFE_EMPLOYEE may be createdas:CREATE VIEW SAFE_EMPLOYEE AS

(SELECT name, address FROM EMPLOYEE)

Now users of the view SAFE_EMPLOYEE will noteven know that salary exists.

Name Address Salary

H a n kE s t h e rTomSueDavePeteKathy

1 S o u t h S t r e e t2 N o r t h S t r e e t3 4 M a i n S t r e e t4 5 E a s y S t r e e t5 6 5 t h A v e n u e7 Broadway8 9 W e s t e r n A v e n u e

SAFE_EMPLOYEE

98

Updating ViewsUpdating Views

• Restrictions exist on updating views. For the EMPLOYEE table, it is possible to insert into the SAFE_EMPLOYEE view.

– Example:

INSERT INTO SAFE_EMPLOYEE VALUES (‘Hank’, 300)

This will insert a NULL into the SALARY column of the base table EMPLOYEE.

• Other restrictions to view updates exist:– Cannot update a view that is defined with an aggregate – Cannot update a view that is defined with a GROUP BY

Page 50: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

50

99

Data Warehouse Access ControlData Warehouse Access Control

• Access control is implemented in a data warehouse with the SQL Grant and Revoke commands.

• Syntax

– GRANT <ALL|UPDATE|DELETE|INSERT|SELECT> ON<object-name> TO <user name>

– Example: GRANT SELECT ON EMPLOYEE TO MARY

• Access control is done by DBAs and creators of tables.

• To remove access the REVOKE command is used.– Example: REVOKE SELECT ON EMPLOYEE FROM MARY

100

Database RolesDatabase Roles

• Roles provide security administration by allowing users to be grouped into roles. Accesses may then be given to a group of users. – As an example, some roles for a company might be:

– Administrative assistant– Loan officer– Salesperson

• Accesses may be assigned based on roles. – This dramatically simplifies administration.– If new tables are created, it is not necessary to add thousands of

new accesses. – Examples:

CREATE ROLE loan_officer AS (Hank, John, Mike)GRANT SELECT ON LOAN TO LOAN_OFFICER

Page 51: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

51

101

Example of Application-based RolesExample of Application-based Roles

• Consider:

• If the database system controls accesses than it does not matter what the application does, accesses are controlled consistently (same for SALES as MARKETING)

• However, more fine-grained access control can be granted in the application.

DatabaseSystemApplications DataUsers

102

Application RolesApplication Roles

• The application can restrict:– Data entry screens– Reports

• Care must be taken to restrict users in a consistent fashion so that a user cannot jump to a different application and avoid security set up by another application.

Page 52: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

52

103

Role Based Security in a Data WarehouseRole Based Security in a Data Warehouse

• Both application and database level security are useful in a data warehouse.

• Database level security is needed so that users are only allowed to see data they

need to see.

• Application level security canbe used to control access to certain menus so that users do not even know what reports exist.

104

EncryptionEncryption

• Encryption is the process of coding data so that it can only be read by users who have the key that allows them to decrypt the data. – Example:A message “sell 500 shares” would appear as “xyzzy”

without the key. Once the key is paired with the encrypted string “xyzzy”, it can then be decrypted.

– The size of the key is a factor in how difficult it is to attack the encryption scheme.

• Three places where encryption might be used in a data warehouse:– Network– Data

– Tape backups

Page 53: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

53

105

Network EncryptionNetwork Encryption

• In a data warehouse application, data and queries are transmitted through a network. – Attackers might be able to steal network traffic just by breaking

into the network medium.

• One way to reduce the risk of this threat is to encrypt traffic on the network.

User

DatabaseSystem

Application

DataWarehouse

N e t w o r k

Tape Backup

106

• Network encryption is critical because the network connects all of the key components in a data warehouse.

• Encrypting network traffic mitigates the risk that an attacker could succeed with the “man in the middle” attack.

• Without this, it may be possible for the“man in the middle” to masquerade asanother user and circumvent existingapplication and database security.

Network EncryptionNetwork Encryption

Page 54: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

54

107

Data EncryptionData Encryption

• Data encryption refers to encrypting the actual data in the data warehouse.

• If the attackers were to retrieve data fromthe warehouse, they would have to decrypt it in order to read it.

EIS DatabaseSystem

DataWarehouse

108

Backup EncryptionBackup Encryption

• Periodically, databases are copied to some kind of long-term storage (usually tapes).

• If the database is encrypted, but the tapes are not encrypted, the risk exists of someone walking off with the tapes.

Tape Backup

EIS DatabaseSystem

DataWarehouse

Page 55: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

55

109

Audit TrailsAudit Trails

• Audit trails are a means of tracking queries, updates, deletes, and additions of new data to the data warehouse.– Audit trails are turned on when the DBMS is started and all

activity that uses the data warehouse is tracked in the audit trail.

• If a user is suspected of an evil deed, the audit trail can be examined to identify what data has been accessed by users.

110

Details of DW Audit TrailsDetails of DW Audit Trails

• An audit trail of a database system typically includes the following information:– User ID, Date, Time, Object that has been accessed (table or view),

Action that accessed the object (INSERT, UPDATE, DELETE, SELECT)

– For UPDATE, the old value and new value is tracked.

• For data warehouses, the SELECT is often used to track the queries that have

been run against the warehouse.

Page 56: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

56

111

Other Uses for DW Audit TrailsOther Uses for DW Audit Trails

• Audit trails can be used to identify the most popular data in the warehouse.

– This information can be used to optimize queries

• An additional use for audit trails is performance tuning of the data warehouse.

– Administrators know where to focus their efforts– Reduces administrative overhead

112

Dealing with Known Security HolesDealing with Known Security Holes

• Commercial database systems and operating systems are often filled with holes that allow users to obtain unauthorized access.– To reduce the risk of these known holes, vendors often provide

“fixes” to their products as soon as these holes become public.

• It is important to constantly keep up with known security holes and apply the latest fixes as soon as they are released.

• One of the key risks surrounding a data warehouse is that privileged users have the “keys to the kingdom”.

Page 57: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

57

113

The Risk of “Privileged Users”The Risk of “Privileged Users”

• "Privileged users" include:– Data warehouse administrators– Operating system programmers– Operators in the computer center

– These users can:– Modify, delete and query any data in the warehouse– Modify the audit trail to mask their actions– Give other users unauthorized access

• Numbers of "privileged users" could be anywhere from 20 to 30 in some

organizations.

114

Reducing the Risk of Privileged UsersReducing the Risk of Privileged Users

• One way to reduce the risk of privileged users is to separate security administration from database administration. – This would separate the task of giving accesses and managing the

audit trail from the task of making sure the data in the warehouse was correct and properly optimized.

Secur i ty Serv icesAccess Cont ro l

Audit

Database Serv icesDatabase Tun ing

Q u e r y O p t i m i z a t i o nBackups

SecurityServices

Access Cont ro lAudit

DatabaseServices

Database Tun ingQ u e r y O p t i m i z a t i o n

Backups

Page 58: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

58

115

Information Security Attacks Information Security Attacks

• Two types of Information security attacks on data warehouses are:– Intrusion

– An intrusion occurs when an unauthorized user gains access to the data warehouse.

– Misuse– Misuse, often referred to as the insider

problem occurs when a user who has access to the warehouse uses that access for an

unauthorized purpose

• Audit Trails can be used to identify either type of attack, but

identification of misuse is typically MUCH harder to do than intrusion.

116

Intrusion Detection Intrusion Detection

• An intrusion is defined as an unauthorized access to a system. The assumption is the user is external to the environment (e.g.; a hacker).

• To reduce the risk of intrusion, intrusion detection tools are used. – These tools monitor access to the data warehouse and sound an

alarm if unauthorized accesses are detected.

DATAWAREHOUSEUSER

INTRUSION DETECTION SYSTEM

Page 59: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

59

117

Misuse DetectionMisuse Detection

• Unwanted access by a user that has the ability to access data is referred to as misuse.

– This is also known as the insider problem.– Some estimates have shown that 80 % of computer crime is a

result of misuse.

• For data warehouses the threat of misuse is high especially by privileged users.

118

SummarySummary

• DBMS Security is useful for data warehouses to hide data from users with views and to restrict access to data with GRANT and REVOKE.

• Application Level Security assists EIS that access data warehouses by hiding certain reports from users.

• Encryption can be used to further protect against the risk of someone walking off with the data warehouse.

• Audit Trails are useful for:– Catching attackers– Identifying usage trends of the data warehouse

Page 60: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

60

119

Moving Datato the

Data Warehouse

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

120

Moving Data to the Data WarehouseMoving Data to the Data Warehouse

1) Moving Data into the Data Warehouse2) Updating the Data Warehouse3) Full Refresh 4) Copy Only the Changes5) BCP6) Simple Transformations7) Complex Transformations8) Commercial ETL Tools

Page 61: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

61

121

Moving Data into the Data WarehouseMoving Data into the Data Warehouse

• Data must be moved to the data warehouse from source systems.

• Some key issues:– Determine the frequency of data updates -- how often should data

be moved from source systems to the data warehouse. – Various means of updating data in the warehouse exist:

– SQL Commands– Database system load programs (e.g.; SQL Server’s BCP) – Commercial tools

122

Updating the Data WarehouseUpdating the Data Warehouse

• OLTP (On-Line Transaction Processing) Systems have to send their updates to the data warehouse.

InventoryOLTP

Application

FinanceOLTP

Application

SalesOLTP

Application

Data Warehouse

F i n a n c eSub jec t

Area

Inven to rySubjec t

Area

S a l e sSub j ec t

Area

Page 62: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

62

123

Frequency of Updates to the Data Warehouse

Frequency of Updates to the Data Warehouse

• Updates may occur daily, weekly, monthly, or in real-time.

Daily Update

Weekly U

pdate

Monthly U

pdateInventory

OLTPApplication

FinanceOLTP

Application

SalesOLTP

Application

Data Warehouse

F i n a n c eSub jec t

Area

Inven to rySubjec t

Area

S a l e sSub j ec t

Area

124

Determining the Frequency of UpdatesDetermining the Frequency of Updates

• Requirements should drive update frequency

• Range of updates runs from real-time, to quarterly.– Real time update

– Expensive– Requires update of warehouse while users are

querying – Daily update

– Somewhat cheaper than real time, but significantmaintenance required if the warehouse has lots of tables.

– Monthly or weekly update– Much more manageable

Page 63: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

63

125

Updating the WarehouseUpdating the Warehouse

• Full Refresh vs. Only the Changes

InventoryOLTP

ApplicationFinanceOLTP

Application

SalesOLTP

Application

Full Refresh

Changes since last update

Full refresh

of some tables

chan

ges for other tables

Data Warehouse

F i n a n c eSub jec t

Area

Inven to rySubjec t

Area

S a l e sSub j ec t

Area

126

Full RefreshFull Refresh

• Copy the entire source table in the OLTP system to the destination table in the Data Warehouse.

SourceTable

Source OLTP

TargetTable

Target Data Warehouse

Page 64: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

64

127

Copy Only the ChangesCopy Only the Changes

• Copy only the changes to the source table in the OLTP system to the destination table in the data warehouse.

SourceTable

Source OLTP

TargetTable

Target Data Warehouse

Modif ied da ta s i n c e l a s t u p d a t e t o t h e w a r e h o u s e

D a t a f r o m t w o u p d a t e s a g o . H i s to r i ca l da t a no longe r i n source OLTP.

128

Full Refresh vs. Only the ChangesFull Refresh vs. Only the Changes

• Full Refresh– Pros

– Much easier to implement– Less chance of messing up your database (good data integrity)

– Cons– Can take a lot longer to actually do -- may “run out of night”– Can lose out on warehouse ability to track historical data.

• Only the Changes (DELTA)– Pros

– Tracks historical data– Cons

– Can be very hard to implement– Can require changes in source applications (more on this later)

Page 65: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

65

129

• One way to move data from one table to another is via the INSERT-SELECT.– Syntax: INSERT INTO <target_table>

<any sql SELECT statement>

• Example:INSERT INTO DW_EMPLOYEE

SELECT *FROM EMPLOYEE

Full Refresh Using INSERT-SELECTFull Refresh Using INSERT-SELECT

TARGET

130

Updating Changes Using INSERT-SELECTUpdating Changes Using INSERT-SELECT

• Changes may be moved by adding a WHEREclause to the INSERT-SELECT.

• Example:– INSERT INTO DW_EMPLOYEE

SELECT *FROM EMPLOYEE

WHERE DATE-UPDATED = DATEPART(m, CURRENT_TIMESTAMP)

Page 66: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

66

131

Updating Using BCPUpdating Using BCP

• BCP is the bulk copy program that comes with MS SQL Server.– Bulk copy (BCP) moves data to or from a flat file to a SQL table.

• Syntax: bcp <table> [in | out] <data file>

SourceTable

Source OLTP

TargetTable

Target DataWarehouse

TemporaryF l a tFile

Unload Load

132

BCP ExampleBCP Example

• To bulk copy data from the publishers table in the pubs database to the publishers.txt data file in ASCII text format, execute from the command prompt:bcp pubs..publishers out publishers.txt -c

-Sservername -Usa -Ppassword

• To bulk copy data from the publishers.txt file into the pub2 table in the pubs database, execute from the command prompt:bcp pubs..pub2 in publishers.txt -c

-Sservername -Usa -Ppassword

Page 67: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

67

133

Simple TransformationSimple Transformation

• In addition to moving data from OLTP to the warehouse, it is often necessary to transform data. – Example: System A stores TOTAL_CLOTH in meters and system

B stores TOTAL_CLOTH in yards. Before the data is moved from system A, we need to transform the data.

Data Warehouse

Pat tern = 31 , Tota l C lo th = 50 yardsPat tern = 32 , Tota l C lo th = 70 yards

S t o r e 3 2( P a t t e r n = 3 2 ,

To ta l Clo th = 20 meters)

S t o r e 3 1( P a t t e r n = 3 1 ,

TOTAL_CLOTH = 50 yards )TRANSFORMATION

134

Data Warehouse

Complex TransformationComplex Transformation

• More complex transformations occur when a value in a source table must be moved to several locations in a data warehouse.

BLUE3 4 8 4(Color = Blue, 34 Inches, LS)

TABLE 1

COLOR

BLUE

34 in

CONVERT TOCENTIMETERS

TABLE 2

8 6 . 3 6 c m

84

CONVERT TO

CODE 84

(long sleev

es) and

put in two tab

les

TABLE 3

LongS l e e v e s

TABLE 4

LongS l e e v e s

Page 68: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

68

135

• Key tools in the marketplace– Informatica– Ardent– DecisionBase (Platinum)– Microsoft Data Transformation Services

• All provide libraries of common transformations.

• All provide the ability to code complex transformations.

Commercial ETL ToolsCommercial ETL Tools

136

Data Transformation ServicesData Transformation Services

Page 69: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

69

137

Choose a SourceChoose a Source

138

Choose a DestinationChoose a Destination

Page 70: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

70

139

Choose to use a Query for TransferChoose to use a Query for Transfer

140

Enter SQL Query Enter SQL Query

Page 71: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

71

141

Choose Destination TableNameChoose Destination TableName

142

Verify TransformationVerify Transformation

Page 72: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

72

143

Decide When to Run TransformationDecide When to Run Transformation

144

Final VerificationFinal Verification

Page 73: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

73

145

Run TransformationRun Transformation

146

Check ResultsCheck Results

orderid orderdate productid productname quantity unitprice discount

10248 1996-07-04 00:00:00.000 11 Queso Cabrales 12 14.0000 0.010248 1996-07-04 00:00:00.000 42 Singaporean Hokkien Fried 10 9.8000 0.010248 1996-07-04 00:00:00.000 72 Mozzarella di Giovanni 5 34.8000 0.010249 1996-07-05 00:00:00.000 14 Tofu 9 18.6000 0.010249 1996-07-05 00:00:00.000 51 Manjimup Dried Apples 40 42.4000 0.0

select *from orderfact

Page 74: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

74

147

SummarySummary

• ETL is one of the hard parts of building a data warehouse.

• Either full refreshes of data or just the changes may be done.

• Doing full refresh is easy, but historical data is lost and it may take a lot of time.

• Tracking changes is a tough business.

• ETL commercial tools are beginning to mature and can lessen the pain of this task.

148

More Ways of Moving Data to

the Data Warehouse

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

Page 75: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

75

149

More Ways of Moving Datato the Data Warehouse

More Ways of Moving Datato the Data Warehouse

1) Determining What Data Has Changed2) Recovery Logs3) Triggers4) Insert Triggers5) Delete Triggers6) Update Triggers7) Manual Detection

150

• There is a need to move data into the data warehouse from OLTP and DSS applications

• The problem is detecting what data needs to be moved into the data warehouse

• Three methods:

– Recovery Logs

– Triggers

– Manual Techniques

More Ways of Moving Datato the Data Warehouse

More Ways of Moving Datato the Data Warehouse

Page 76: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

76

151

Determining What Data Has ChangedDetermining What Data Has Changed

• Problem: How to get updates made to the source to the same information in the data warehouse?

TABLE A

SOURCE

OLTP

DATA WAREHOUSE

TABLE B??UPDATES

H o w t o g e t u p d a t e s f r o mSource Table A to

Data Warehouse Table B

152

??

Determining What Data Has Changed (cont.)Determining What Data Has Changed (cont.)

• Problem: How to get updates made to multiple sources to the same information in the data warehouse?

TABLE A

SOURCE

OLTP

UPD

ATES

“ROW X”

NAME DEPT. SALARY

Fred Mktg 35000

Hank Sales 60000Sue IT 71000Joe Sales 50000

Employee

RO

W X

UPDATES

Insert intoEmployee Values(‘Joe’,’Sales’,’50000)

DATA WAREHOUSE

TABLE A

“ROW X”

TABLE B

“ROW X”

DEPT COUNTMktg 1Sales 1IT 1HR 0

??EmployeeCount

SalaryInfo

2DEPT AVG SAL TOT SAL

Mktg 35000 35000IT 71000 71000HR 0 0Sales 60000 60000

55000 110000

Page 77: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

77

153

What is the Recovery Log?What is the Recovery Log?

• Recovery log is used for transaction processing– Used to handle errors– Does contain before and after image.

• Recovery log can be used toidentify the data to be updated

in the data warehouse.

– Change Data Capture Utility– This scans the database log and identifies all changes that the user

is interested in and either writes them to a file or stores them in another table.

154

Change Data Capture Utility in ActionChange Data Capture Utility in Action

DBMS

DATA

LOG

A l l c h a n g e sto DBMS

SOURCE

OLTP

CHANGEDATA

CAPTUREUTILITY

DATA WAREHOUSE

READSREADS

WRITESWRITES

RECOVERY LOG

Page 78: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

78

155

Example of Using Recovery LogExample of Using Recovery Log

UPDATE EMPLOYEEWhere SSN=10

SET Salary=Salary*2.0

LOGTABLE=EMPLOYEE

SSN=10

OldSalary=100,

NewSalary=200

CHANGEDATA

CAPTURERECONSTRUCTS

UPDATE

DATA WAREHOUSE

• Consider an update to the Employee table– The information is recorded in the log– The change data capture reconstructs update– Can then be sent to the data warehouse

156

Using the Recovery LogUsing the Recovery Log

• Recovery logs are usually in proprietary format. Use commercial tools to read the log and identify the changes.

• Commercial tools such as CA’s log analyzer can place the results of their work in a table.

Page 79: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

79

157

Summary of Change Data CaptureSummary of Change Data Capture

• Pro

– Log exists anyway, might as well use it to find what has changed

• Con– Some difficult scenarios may occur where it is hard to see what the

new update should be in the Data Warehouse.

– Proprietary format, may not be supported in many DBMS and will always lag behind DBMS development.

– Many tables will be in the source that have nothing to do with the data warehouse, but change data capture will process their changes as well.

158

TriggersTriggers

• Triggers allow DBA’s to specify that when an “event” such as an INSERT, UPDATE, or DELETEoccurs on a table, another event is triggered.– Triggers are used to identify changes that are needed by the

warehouse. – A trigger can be added to a source table and whenever the source

table is updated, an update can be placed either directly in thewarehouse or in a staging table that tracks all updates.

• Triggers can be used to detect the changes and perform data warehouse updates.

– A different trigger might be run on key updates so that the datawarehouse nightly process would know what data has changed.

Page 80: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

80

159

Example of a TriggerExample of a Trigger

TABLE A

INSERT intoTABLE A

VALUES (X, Y)

STEP 1

STEP 2

Values (X, Y) are inser ted

STEP 4

When va lues a r einser ted , se t s o f f

the TRIGGER

Nigh t ly P rocess i n se r t svalues (X, Y) into

t h e D a t a W a r e h o u s e

DATA WAREHOUSE

TABLE A

Values (X, Y)

X, YSTAGING

STEP 3

TRIGGER inser tsvalues (X, Y) intoa “STAGING” area

NightlyProcess

160

Real-Life Trigger ExampleReal-Life Trigger Example

• OLTP/DSS Data - Employee table:

–Employee (ssn, name, salary)

• DW Data - Summary table:

–EmployeeStatistics (total number employees, total salary paid, average salary).

• When a row is inserted in the employee table, we need to do an insert into the EmployeeStatisticstable. – Shown on the next page

Page 81: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

81

161

Insert Trigger ExampleInsert Trigger Example

CREATE TRIGGER EmployeeInsertTriggerON EmployeeFOR INSERT AS

BEGINUPDATE EmployeeStatistics

SET NoEmployee = NoEmployee + (SELECT COUNT(*) FROM INSERTED)

UPDATE EmployeeStatisticsSET TotSalary = TotSalary +

(SELECT SUM(Salary) FROM INSERTED)UPDATE EmployeeStatistics

SET AvgSalary = TotSalary / NoEmployeeEND

162

Insert Trigger in ActionInsert Trigger in Action

INSERT INTO EMPLOYEEVALUES (1, 'John', 300) (1 ROW(S) AFFECTED)

INSERT INTO EMPLOYEEVALUES (2,'Mike', 400) (1 ROW(S) AFFECTED)

SELECT * FROM EMPLOYEEEmployeeEmpId Name Salary ------ --------------------------1 John 300.002 Mike 400.00

SELECT * FROMEMPLOYEESTATISTICS

EmployeeStatisticsNoEmployee TotSalary AvgSalary---------- ---------- ---------2 700.00 350.00

COMMANDS RESULTS

Page 82: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

82

163

Delete Trigger ExampleDelete Trigger Example

CREATE TRIGGER EmployeeDeleteTriggerON EmployeeFOR DELETE AS

BEGINDECLARE @numberEmployee int

UPDATE EmployeeStatistics SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)

UPDATE EmployeeStatisticsSET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)

SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics

IF @numberEmployee > 0 BEGIN

UPDATE EmployeeStatisticsSET AvgSalary = TotSalary / NoEmployee

EndELSE

UPDATE EmployeeStatistics SET AvgSalary = 0.0END

164

Update Trigger ExampleUpdate Trigger Example

CREATE TRIGGER EmployeeUpdateTriggerON Employee

FOR UPDATE ASBEGIN

IF UPDATE (Salary)UPDATE EmployeeStatisticsSET TotSalary = TotSalary -

(SELECT SUM(Salary) FROM DELETED) +(SELECT SUM(Salary) FROM INSERTED)

UPDATE EmployeeStatisticsSET AvgSalary = TotSalary / NoEmployee

END

Page 83: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

83

165

Summary of Using TriggersSummary of Using Triggers

• Pro– Only needed for tables whose data is going to go to the DW

• Con– Additional work needed to create detailed triggers

– Non-trivial to generate a trigger to implement appropriate action

– May not be acceptable for commercial software on source system

166

Other Ways to Determine What Has ChangedOther Ways to Determine What Has Changed

• There are other manual ways of detecting the change and doing DW updates– Look at each row of OLTP and the data in the warehouse– Compare the differences between the two files, if the data is not in

the warehouse, add it!

Hank

J o h n

Mike

S a m

OLTP DATA WAREHOUSE

Hank

J o h n

MikeCOMPARE

ADD THE DIFFERENCESADD THE DIFFERENCES

Page 84: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

84

167

Manually Identifying What Has ChangedManually Identifying What Has Changed

• Pro– Flexible

• Con– Very expensive

– Could take a long time

168

SummarySummary

• Recovery Logs

• Triggers

• Manual Detection

Page 85: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

85

169

Data WarehouseDesign

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

170

Data Warehouse DesignData Warehouse Design

1) Overview2) Describing a Design - ER Diagrams3) Design Normalization4) Star Schema Design

Page 86: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

86

171

OverviewOverview

• How to describe a design

– Entity Relationship (ER) Diagram

• Types of Designs

– Normalized– Star Schema– Snowflake

172

Describing a DesignDescribing a Design

• Different techniques exist, the most prevalent is the ER (Entity-Relationship) Diagram

• Entities– Things that occur in the real world, usually nouns e.g.; employee,

part, product, etc.

• Relationships– How entities interact, example: one employee may attend many

colleges -- usually verbs– Types of relationships

– 1-1– 1-Many– Many-1– Many-Many

Page 87: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

87

173

Examples of Relationships Examples of Relationships

1-MANY

MANY-1

1-1

MANY-MANY

174

Normalized DesignNormalized Design

• Methodology– All 1-1 relationships are placed in a single table.– Many-many relationships require two tables that store the single-

valued relationships and one linking table that indicates how the entities are related. The relationship is represented in the linking table by referencing keys in the two tables that represent each entity in the relationship.

• Checking the design– In a Normalized Design, there are many different normalized

forms. Each normal form (NF) builds on the previous one so that a table in 2NF is, by definition, in 1NF.

– 1NF– 2NF– 3NF

Page 88: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

88

175

Dealing With Many-Many RelationshipsDealing With Many-Many Relationships

• For Many-Many– Two 1-1 Tables (SUPPLIER, PARTS)– One linking table (SP)– Ex: Suppliers, Parts are the 1-1, SP is the linking table that says

who sells what parts.

S# SNAME

1 SEARS2 OFFICE DEPOT

SUPPLIER PARTS

P# PNAME1 HAMMERS

2 NAILS

SP

S# P#

1 11 22 1

2 2

176

Normalized Design: ExampleNormalized Design: Example

• A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.– Entities: Customer, Product, Store– Relationships: Customer buys Product

– Product is located in Store– Product is supplied By a Vendor

CUSTOMER PRODUCT

BUYS

STORE

IS-LOCATED-IN

VENDOR

Page 89: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

89

177

Checking a Normalized DesignChecking a Normalized Design

• Normalization – Used to reduce data insertion, delete, and update anomalies caused

by bad designs. – Enables users to quickly check a design and make sure there are no

glaring holes in the design. – 1NF

– All “cells” are atomic -- i.e. each entry in a column contains only one value

– 2NF– All non-key values are functionally dependent upon the entire

primary key -- i.e. if the primary key changes, all other columns change.

– 3NF– No transitive dependencies -- i.e. all keys are completely

dependent on the primary key. If the primary key changes, all non-key columns are affected.

178

Overview of Normalized DesignOverview of Normalized Design

• Pro– Relatively easy to change

• Con– Queries can involve numerous joins– The massive number of tables and links between tables makes it

hard for customers to build their own queries

Page 90: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

90

179

Star SchemaStar Schema

• Methodology– Single fact table in the middle describing a key event (e.g. sale)

surrounded by dimension tables (i.e. location, time, employee)

FACT

D5

D1

D3

D2

D4

D = DIMENSIONS

180

Star Schema: MethodologyStar Schema: Methodology

• Identify a key fact that occurs.– Usually some event creates a real fact. Selling a product in a store

on Wednesday, patient visiting a hospital, etc.

• Identify all the dimensions of the data being used. Think of a dimension as a way to slice the data.– Ex: by time, by product, by customer, etc.

• Drill down operations are very well supported

Page 91: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

91

181

Star Schema: ExampleStar Schema: Example

• A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.

• Fact– CustomerPurchase

• Dimensions are– Customer– Product– Time

– Vendor

182

Star Schema: Example (cont.)Star Schema: Example (cont.)

Sale

C u s t o m e r

Store

Time

Product

SALE ID

1

CUST. ID

3

STORE ID

7

PROD. ID

4

PRICE

$ 3 . 0 0

TIME

4 / 2 4 / 9 9

CUST. ID

3

NAME

FRED

PHONE

1 2 3 4

Buys App le s

Y

Has Big Car

Y

DAY

24

MONTH

4

QTR

2Q

YEAR

99

Price

SALE

CUSTOMER

TIME

Page 92: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

92

183

Star Schema: OverviewStar Schema: Overview

• Pro– Easy for users to navigate and understand

• Con– Performance

– Can end up with one monster fact table, millions of rows– Flexibility

– Not as easy for customers to change the design

184

MakeChips

Pa r t sManu-

facturing

PRODUCT

Price

Labor

Cost

Snowflake SchemaSnowflake Schema

• Several stars can be connected to form a snowflake

Sale

Price

R e v e n u e

Product

Marketing

Vendor

SALES

Direct MailPr ice

Ad

Location

MARKETING

Distrib-u t i o n

Sales

Page 93: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

93

185

SummarySummary

• Two basic types of design– Star Schema– Normalized

• Many Data Warehouse vendors sell products built specifically for the star schema

• Some data warehouses insist that normalization is the way to build the data warehouse.

186

Building aData Warehouse

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

Page 94: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

94

187

Building a Data WarehouseBuilding a Data Warehouse

1) Top Down Approaches2) Enterprise Data Model Approach3) "Let Data Users Decide" 4) "Let Data Warehouse Builders Decide" 5) "Let Senior Management Decide"6) Bottom Up Approach

188

Building the Data WarehouseBuilding the Data Warehouse

• How to decide what data goes into the data warehouse?

• Methods:

– Top Down– Using Enterprise Data Models– "Let data users decide" approach– "Let data warehouse builders decide" approach– "Let senior management decide" approach

– Bottom Up– Combine data marts into a data warehouse

Page 95: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

95

189

Using Enterprise Data ModelsUsing Enterprise Data Models

• Use the Enterprise Data Model to decide what data goes into the data warehouse.– Model key processes. This approach says let the business decide.– Identify key data used by these processes in an enterprise data

model -- might be a giant Entity-Relationship diagram.

• Put data in the warehouse based on the enterprise data model.

190

CHIPRECIPES

An Enterprise Data Model ExampleAn Enterprise Data Model Example

MAKECHIPS

PUT INBAGS

SELL CHIPS

COUNT$$

BUY MOREPOTATOES

INGREDIANTS

CHIPSUPPLIERS

Page 96: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

96

191

"Enterprise Data Model" Approach"Enterprise Data Model" Approach

• Pro– All inclusive -- no chance of leaving key data out.

• Con

– Very difficult to build an EDM.

– If the business model changes, you may have to rebuild the Enterprise Data Model and the data warehouse.

• Ways of Avoiding the Con

– In some cases you can buy an EDM -- if the business is common enough the packaged EDM might be very close and then you just have to modify it to fit your business.

192

USE

RS

SO

UR

CE

"Let Data Users Decide""Let Data Users Decide"

• Let the users of the data warehouse choose what data will go into the warehouse. – The data users deciding the data warehouse data and design will

pay for it as well.– Also, you can charge users who

query the data as well.

DATA WAREHOUSE

Page 97: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

97

193

"Let Data Users Decide": An Example"Let Data Users Decide": An Example

DATA WAREHOUSE

MARKETING HUMANRESOURCES FINANCE

DATA

d e m o g r a p h i c s

Adver t is ing

?

trends

e d u c a t i o n

Ethnicgroup

Age

?

DATA

budget

spendingR e v e n u e

?

DATA

194

"Let Data Users Decide" Approach"Let Data Users Decide" Approach

• Pro– Reduces budget problems– Users know best!

• Con– Requires marketing– Could end up with data in the warehouse that is meaningless to the

people who run the place. – Users may not place important data in the warehouse because their

budget is small.– Users who need the data may not use the DW because of budget

concerns.

• Ways of Mitigating the Con– Do not just take money -- try to determine if data is really

corporate.

Page 98: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

98

195

Pay As You Go Warehouse AnalogyPay As You Go Warehouse Analogy

I-495

196

"Let Data Warehouse Builders Decide""Let Data Warehouse Builders Decide"

LETS PUTINFORMATION ON

HOW TO BUILDVIRUSES IN THE

DATA WAREHOUSE

DATA WAREHOUSE

• The technical staff who is building the warehouse decides what data gets put in the warehouse.

Page 99: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

99

197

"Let Data Warehouse Builders Decide" Approach

"Let Data Warehouse Builders Decide" Approach

• Pro– Very easy to design – Does not take much time– Do not have to deal with users

• Con

– Could easily result in data DUMP not data warehouse

• Ways to mitigate the con– Talk to lots of users to help you guess what should go in the DW

198

“Let Senior Management Decide”“Let Senior Management Decide”

• The senior management decides what data goes into the warehouse.

• Asking the senior management is the safest way to build a data warehouse.

• Identify the key questions on senior management’s mind and get the data to answer these questions.

Page 100: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

100

199

“Let Senior Management Decide” Approach“Let Senior Management Decide” Approach

• Pro– Ensures executive support for the project

• Con– Senior management does not have much time for this -- you will

have to only get a few questions at a time

– This dramatically increases visibility - if you do not move quickly senior management will become very angry with the DW.

• Ways to mitigate the con– Do your homework before talking to the senior management -- talk

to the aides of senior management to find out what is on their mind.

– Allocate resources so you can plan to move very quickly once you hear from the senior management.

200

Bottom-Up ApproachBottom-Up Approach

• Move data from existing OLTP Applications to data marts.

• Combine data marts into a data warehouse.

DATAMART

25YARDS

DATAMART

50METERS

DATAMART200CM

DATAWAREHOUSE

OLTPAPP

OLTPAPP

OLTPAPP

Page 101: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

101

201

• Pro– Data marts are much easier to build than full-fledged DW.

• Con– Could end up with a bunch of stove pipe data marts.

• Ways to mitigate the con

– Develop standards for data when building the data marts so that you can glue data from different data marts together.

Bottom-Up ApproachBottom-Up Approach

202

Recommendations for an ApproachRecommendations for an Approach

"Let senior management decide"

Page 102: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

102

203

User Interface to the

Data Warehouse

( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)

204

User Interface to the Data WarehouseUser Interface to the Data Warehouse

1) Introduction2) Types of Users3) Functions Users Want to Do 4) Approaches to Building a User Interface5) Hand Built6) Class Libraries7) OLAP Tools8) Types of User Interfaces

Page 103: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

103

205

IntroductionIntroduction

Inven to ryOLTP

Appl ica t ion

F i n a n c eOLTP

Appl ica t ion

S a l e sOLTP

Appl ica t ion

F i n a n c eOLTPData

Inven to ryOLTPData

S a l e sOLTPData

DATA WAREHOUSE

USER INTERFACE

• A User Interface (UI) is a front end application designed for the user that presents information in a simplified manner.– Data in a data warehouse does nothing if users cannot access it

– Users do not want to learn SQL to drive DW applications

206

Building User InterfacesBuilding User Interfaces

• DW applications have different types of users with different functionality requirements.– It is critical to identify the key users.– Once you do this, you need to identify their functional

requirements.

• There are three main approaches to building UI’s– Build your own entirely– Use commercial Class Libraries– Using OLAP Tools

Page 104: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

104

207

Types of UsersTypes of Users

Everyone

Analysts

Marketing

Executive

Everyone

Analysts

Sales

Executive

Everyone

Analysts

Finance

Executive

CEO

208

• Executives– People who run the place– Need answers quickly– May not be very technical– Expect UI to get them what they

want quickly and efficiently without any need for special training

• Analysts– Have time to really analyze data and think about it

– May have strong statistical and IT background(i.e. Power user of Excel)

– Expect UI to have many complex features, and provide the ability to generate new queries and perform statistical

analysis of the data.

Types of Users (cont.)Types of Users (cont.)

Page 105: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

105

209

Types of Users (cont.)Types of Users (cont.)

• Regular User– All other users – Just need some simple answers to simple questions like “What is

Hank’s phone number)– Expect UI to be simplistic, easy to understand, and provide access

to basic information.

210

Subject Matter Experts ExpectSubject Matter Experts Expect

• Query data in the data warehouse• Trend analysis

– “show me how much money we have spent on computers in the last four years”

• Benchmark to competitors– “what are all our competitors charging for product X”

Trend

Sales

1995 1999

Page 106: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

106

211

WAL-MART

Subject Matter Experts Expect (cont.)Subject Matter Experts Expect (cont.)

• Drill Down– “on that chart you just showed me, I noticed that revenue was

down in Region #4. Please drill down and show me the breakdown of each area in Region #4.”

DRILL DOWN

05

101520

1 2 3 4

REGIONS

RE

VE

NU

E

Y Values

X Values

DCVAMD

DRILL DOWN

Reg ion 4

Reven

ue

212

Approaches to Building User InterfacesApproaches to Building User Interfaces

• Hand-Built– Write all of your own code

• Use Class Libraries– Use an object oriented approach and buy the CLASS libraries that

do all the hard work

• OLAP– Use an On-Line Analytical Processing package to build user

interfaces for you.

Page 107: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

107

213

Architecture of User Interfaces (cont.)Architecture of User Interfaces (cont.)

• Hand Built

• Class Libraries

USER INTERFACEDATA

WAREHOUSE

C o m m e r c i a lOff The Shelf

(COTS)

i.e. JAVAi.e. JAVA

USERINTERFACE

DBMSDBMS

GRAPHICSCLASSLIBRARY OLAPCLASS

LIBRARY

USER

INTERFACE

CLASS

LIBRARYHandHandBuiltBuilt

214

Architecture of User Interfaces (cont.)Architecture of User Interfaces (cont.)

• OLAPST

ORE

REVENUE

YEAR

REGION

USER INTERFACE

DBMS

C o m m e r c i a lOff The Shelf

(COTS)

Resu l t Cube

Page 108: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

108

215

Hand-Building User InterfacesHand-Building User Interfaces

• Write all the code yourself– Requires many design documents, coding and testing for all of the

code components.

• Pros

– Very flexible

• Cons

– Could take a long time to develop

– Requires substantial resources– May need lots of testing and debugging

216

Using Class Libraries to Build User InterfacesUsing Class Libraries to Build User Interfaces

• Write initial user dialog yourself and call class libraries for the hard part (graphics and data access functionality).

• Pro– Many class libraries available -- avoid doing a lot of coding

yourself

• Con– Not as flexible -- if the class library does not do what you want it

to do you have to – Find a new class library– Live without the functionality

– Can take a while to find the class library you need and learn how to interface to it

Page 109: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

109

217

Using OLAP Tools to Build User InterfacesUsing OLAP Tools to Build User Interfaces

• Many different OLAP tools– Need to survey an OLAP tool– Buy an OLAP tool– Install it– If it does not match all requirements some code may be needed to

communicate with the OLAP tool.

• Three types multi-dimensional OLAP

– Relational OLAP (ROLAP)– Multi-dimensional (MOLAP)– Hybrid (HOLOP)– Distributed (DOLAP)

218

Summary of Tools for UI Development of DWSummary of Tools for UI Development of DW

• Tools that may be used include:– Development of in-house software

– Do it all yourself– Use Class Libraries

– OLAP– ROLAP– MOLAP– HOLAP– DOLAP

• Different tools or techniques may be useful depending upon what kind of user interface is being developed. – Executive Information Systems– Analytical Systems– Enterprise Information Systems

Page 110: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

110

219

Types of User InterfacesTypes of User Interfaces

• Executive Information System – Developed for the person who runs the place

• Analytical System– Developed for business analysts

• Enterprise Information System– Developed for users throughout the organization

Everyone

Analysts

Marketing

Executive

Everyone

Analysts

Sales

Executive

Everyone

Analysts

Finance

Executive

CEOEXECUTIVEINFORMATION SYSTEM

ANALYTICAL SYSTEM

ENTERPRISEINFORMATION SYSTEM

220

Executive Information SystemExecutive Information System

• The Executive IS is developed specifically for people who run the organization.

• Development process:– No clean life cycle

– Prototype constantly. Usually have to guess atwhat executives will want to see

– Show executives let them come up with ideasfor revisions

– Drill down functionality required

• Tools– Frequently hand-built, but purchasing a class library can help

lower the development cost.– May just want to use tools that allow development of a

subscription service in which users may “Subscribe” to a fewcanned reports.

Page 111: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

111

221

Analytical SystemAnalytical System

• Analytical systems are user interfaces developed for business analysts in an organization.

• Development process:– Allow users to drag-and-drop data around to further the analysis of

this data.– More complex interface is acceptable

– Users may be required to know some SQL knowledge

• Tools:

– OLAP Tools are frequently used to build the interface

222

Enterprise Information SystemEnterprise Information System

• Enterprise IS is written for the general user to retrieve simple, key information.

• Development process:– Frequently developed in-house– So many users around that you really cannot pick a few and ask

what they need.– Simpler than Executive IS as it does not require drill down

functionality.

• Tools

– Place some simple, key informationon a few screens and controlaccess and then deploy.

Page 112: Data Warehouse Overview - dbmanagement.infodbmanagement.info/Books/MIX/DataWarehouseOverview_DW.pdf · Introduction to Data Warehousing and Data Mining ... SQL Overview SQL Overview

112

223

Summary of Types of User InterfacesSummary of Types of User Interfaces

• Executive Information System– For the senior executives– Use in-house development or in -house development augmented by

class libraries

• Analytical System– OLAP may make sense here as the interface is more complicated,

but OLAP has drawbacks due to:– Data sparseness– No well accepted query language

• Enterprise Information System– Much simpler than executive system– Good candidate for in-house development