Data Warehouses for Decision Support
1
Data Warehouses
for Decision Support
Vera Goebel
Department of Informatics, University of Oslo
Fall 2013
2
What and Why of Data Warehousing
• What: A very large database containing materialized views of multiple, independent source databases. The views generally contain aggregation data (aka datacubes).
[Figure: multiple independent source database systems feed, through a data extraction and load process, a data warehouse system storing datacubes; DSS application workstations issue queries against the warehouse.]
• Why: The data warehouse (DW) supports read-only queries for
new applications, e.g., DSS, OLAP & data mining.
3
Data Warehouse (DW) Life Cycle
• The Life Cycle:
• General Problems:
– Heavy user demand
– Problems with source data
• ownership, format, heterogeneity
– Underestimating complexity
& resources for all phases
• Example: Boeing Computing Services – DW for DSS in airplane repair
– DW size: 2-3 terabytes
– Online query services: 24/7 service
– Data life cycle: retain data for 70+ years (until the airplane is retired)
– Data update: no “nighttime”; concurrent refresh is required
– Access paths: support new and old methods for 70+ years
[Figure: the DW life cycle as a loop: Global Schema Definition → Data Extraction and Load → Query Processing → Data Update → (repeat).]
4
Global Schema Design – Base Tables
• Fact Table
– Stores basic facts from the source databases (often denormalized)
– Data about past events (e.g., sales, deliveries, factory outputs, ...)
– Have a time (or time period) associated with them
– Data is very unlikely to change at a data source; no updates
– Very large tables (up to 1 TB)
[Figure: an example fact table (ProductID, SupplierID, PurchaseDate, DeliveryDate, CustYrs) linked to dimension tables: Product (ProductID, ProdName, ProdDesc, ProdStyle, ManufSite), Supplier (SupplierID, SuppName, SuppAddr, SuppPhone, Date1stOrder), and Time (TimeID, Quarter, Year, AuditName), with the Time dimension chained to an Audit table (AuditID, AuditComp, Addr, AcctName, Phone, ContractYr).]
• Dimension Table
– Attributes of one dimension of a fact table (typically denormalized)
– A chain of dimension tables can describe attributes of other dimension tables (normalized or denormalized)
– Data can change at a data source; updates executed occasionally
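As a concrete illustration of these base tables, here is a minimal sketch (Python's stdlib sqlite3) of a star schema using the fact and dimension columns from the slide; the inserted data values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized fact table: one row per past purchase event.
cur.execute("""CREATE TABLE Fact (
    ProductID TEXT, SupplierID TEXT,
    PurchaseDate TEXT, DeliveryDate TEXT, CustYrs INTEGER)""")

# Dimension tables: descriptive attributes that change only occasionally.
cur.execute("""CREATE TABLE Product (
    ProductID TEXT PRIMARY KEY, ProdName TEXT, ProdDesc TEXT,
    ProdStyle TEXT, ManufSite TEXT)""")
cur.execute("""CREATE TABLE Supplier (
    SupplierID TEXT PRIMARY KEY, SuppName TEXT, SuppAddr TEXT,
    SuppPhone TEXT, Date1stOrder TEXT)""")

# Hypothetical sample rows.
cur.execute("INSERT INTO Product VALUES ('P11', 'ATMCard', 'net card', 'std', 'Oslo')")
cur.execute("INSERT INTO Supplier VALUES ('S1', 'Acme', 'Main St', '555-0100', '1998-01-02')")
cur.execute("INSERT INTO Fact VALUES ('P11', 'S1', '2001-11-05', '2001-11-12', 3)")

# A typical warehouse query joins the fact table with its dimensions.
row = cur.execute("""
    SELECT p.ProdName, s.SuppName, f.PurchaseDate
    FROM Fact f
    JOIN Product p ON f.ProductID = p.ProductID
    JOIN Supplier s ON f.SupplierID = s.SupplierID""").fetchone()
print(row)  # ('ATMCard', 'Acme', '2001-11-05')
```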
5
Schema Design Patterns
• Star Schema: fact table F joins directly to dimension tables D1, D2, D3; D1, D2, D3 are denormalized.
• Snowflake Schema: D1, D2, D3 are normalized, each split into a chain of tables (D1.1, D1.2; D2.1, D2.2; D3.1, D3.2).
• Starflake Schema: a mix; e.g., D3 may be normalized (into D3.1, D3.2) or denormalized, while D1 and D2 remain denormalized.
• Constellation Schema: two fact tables F1 and F2 share dimension tables; D1 stores attributes about a relationship between F1 and F2.
6
Summary Tables
• aka datacubes or multidimensional tables
• Store precomputed query results for likely queries
– Reduce on-the-fly join operations
– Reduce on-the-fly aggregation functions, e.g., sum, avg
• Stores denormalized data
[Figure: a summary table is derived from one or more fact tables and dimension tables.]
• Aggregate data from one or more fact tables and/or one or more dimension tables
• Discard and compute new summaries as the set of likely queries changes
7
Summary Tables = Datacubes
[Figure: a three-dimensional datacube, “Total Expenses for Parts by Product, Supplier and Quarter,” with axes Product (P11, P14, P19, P27, P33, All-P), Supplier (S1, S2, All-S), and Fiscal Quarter (Q1, Q2, Q3, Q4, All-Q). Cells hold pre-computed values (e.g., 0.2M £, 1.2M £, 4.6M £). Example cells: total expenses paid to all suppliers of parts for Product P19 in the 1st quarter (GROUP BY product, quarter); total expenses paid to Supplier S1 for parts for Product P33 in the 2nd quarter (GROUP BY supplier, product, quarter); total expenses paid to Supplier S2 for all products in all quarters (GROUP BY supplier); average price paid to all suppliers of parts for Product P11 in the 1st quarter (GROUP BY product, quarter).]
• Typical pre-computed Measures are:
– Sum, percentage, average, std deviation, count, min-value, max-value, percentile
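The cells of such a datacube are just precomputed GROUP BY results. The Python sketch below (expense figures are made up) shows how the All-* planes arise by aggregating a dimension away:

```python
from itertools import product as cartesian

# Fact rows: (product, supplier, quarter, expense in M pounds); hypothetical data.
facts = [
    ("P19", "S1", "Q1", 1.2), ("P19", "S2", "Q1", 1.5),
    ("P33", "S1", "Q2", 0.4), ("P33", "S2", "Q2", 0.6),
]

# Every cell of the cube, including the All-* planes, is a GROUP BY result:
# replacing a dimension value by All-* aggregates that dimension away.
cube = {}
for prod, supp, qtr, expense in facts:
    for p, s, q in cartesian((prod, "All-P"), (supp, "All-S"), (qtr, "All-Q")):
        cube[(p, s, q)] = cube.get((p, s, q), 0.0) + expense

# Total expenses paid to all suppliers of parts for Product P19 in Q1:
print(round(cube[("P19", "All-S", "Q1")], 1))  # 2.7
```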
8
Too Many Summary Tables
• Factors to be considered:
– What queries must the DW support?
– What source data are available?
– What is the time-space trade-off to store versus re-compute joins and measures?
– Cost to acquire and update the data?
• An NP-complete optimization problem – Use heuristics and approximation algorithms
• Benefit Per Unit Space (BPUS)
• Pick By Size (PBS)
• Pick By Size–Use (PBS-U)
• The Schema Design Problem:
– Given a finite amount of disk storage, what views (summaries) will you pre-compute in the data warehouse?
• Use a derivation lattice to analyze possible materialized views.
[Figure: a derivation lattice of materialized views with nodes A–H and ALL/None at the bottom.]
9
A Lattice of Summary Tables
• Derivation Lattice
Nodes: the set of attributes that would appear in the ”group by” clause to construct this view
Edges: connect view V2 to view V1 if V1 can be used to answer queries over V2
Metadata: estimated # of records in each view
[Figure: derivation lattice for parts (P), suppliers (S), & customers (C): PSC at the top; SC, PC, PS; P, C, S; ALL/None at the bottom, annotated with estimated record counts (6M and 0.1M shown).]
• Determine cost and benefit of each view
• Select a subset of the possible views
• Typical simplifying assumptions: – Query cost ≈ # of records scanned to answer the query
– I/O costs are much larger than CPU cost to compute measures
– Ignore cost reductions due to using indexes to access records
– All queries are equally likely to occur
10
Benefit Per Unit Space (BPUS)
• S is the set of views we will materialize
• bf(u, v, S) = min { #w - #v | w ∈ S and w covers u }
• Benefit(v, S) = SUM { bf(u, v, S) | u = v or v covers u }
[Figure: the derivation lattice of materialized views (nodes A–H, ALL/None).]
View sizes (millions of records):
A 100   B 50   C 75   D 20   E 30   F 40   G 1   H 10
Benefit – round #1 (S = { A }):
B: 50 * 5 = 250   C: 25 * 5 = 125   D: 80 * 2 = 160   E: 70 * 3 = 210
F: 60 * 2 = 120   G: 99 * 1 = 99   H: 90 * 1 = 90
→ S = S ∪ {B}
Benefit – round #2 (S = { A, B }):
C: 25 * 2 = 50   D: 30 * 2 = 60   E: 20 * 3 = 60   F: 60 + 10 = 70
G: 49 * 1 = 49   H: 40 * 1 = 40
→ S = S ∪ {F}
Benefit – round #3 (S = { A, B, F }):
C: 25 * 1 = 25   D: 30 * 2 = 60   E: 20 * 2 + 10 = 50
G: 49 * 1 = 49   H: 30 * 1 = 30
→ S = S ∪ {D}
• Savings: read 420M records, not 800M
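The greedy BPUS selection above can be sketched in Python. The edge structure below is inferred from the worked benefit numbers (B and C under A, and so on), so treat it as an assumption; the view sizes are the ones from the table.

```python
# Sketch of the BPUS greedy over the derivation-lattice example.
sizes = {"A": 100, "B": 50, "C": 75, "D": 20, "E": 30, "F": 40, "G": 1, "H": 10}
parents = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B", "C"],
           "F": ["C"], "G": ["D", "E"], "H": ["E", "F"]}

def ancestors(v):
    """v plus every view above it (the views that can answer queries on v)."""
    out = {v}
    for p in parents.get(v, []):
        out |= ancestors(p)
    return out

def cost(u, S):
    """Records scanned for a query on u: the smallest materialized ancestor."""
    return min(sizes[w] for w in ancestors(u) if w in S)

def benefit(v, S):
    """Total query-cost reduction if v were added to the materialized set S."""
    return sum(max(0, cost(u, S) - sizes[v])
               for u in sizes if v in ancestors(u))

S = {"A"}                        # the top view must always be materialized
for _ in range(3):               # greedily add three more views
    v = max((w for w in sizes if w not in S), key=lambda w: benefit(w, S))
    S.add(v)
print(sorted(S))                 # ['A', 'B', 'D', 'F'], as in the rounds above
```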
11
Pick By Size (PBS)
While (space > 0) Do
    v = smallest view in views
    If (space - |v| > 0) Then
        space = space - |v|
        S = S ∪ {v}
        views = views - {v}
    Else
        space = 0
Table sizes (in millions of records):
Parts+Suppls+Custs (6M)   Parts+Custs (6M)   Parts+Suppls (0.8M)   Suppls+Custs (6M)
Parts (0.2M)   Suppls (0.01M)   Custs (0.1M)
[Figure: derivation lattice: PSC at the top; SC, PC, PS; P, C, S; ALL/None at the bottom.]
• Storage Savings: Reduced from 19.2M records to 7.2M records
• Query Savings: Read 19.11M records, not 42M
S = {PSC}
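Pick By Size itself is a simple greedy over view sizes. A Python sketch on the slide's table sizes (the 7.2M-record space budget mirrors the storage figure above; starting from S = {PSC} is an assumption matching the slide):

```python
# Pick By Size: materialize the smallest views first until space runs out.
# Sizes are in millions of records, taken from the slide's table.
sizes = {"PSC": 6.0, "PC": 6.0, "PS": 0.8, "SC": 6.0,
         "P": 0.2, "C": 0.1, "S": 0.01}

def pick_by_size(sizes, space, must_have=("PSC",)):
    S = set(must_have)                      # top view is always materialized
    space -= sum(sizes[v] for v in S)
    for v in sorted(sizes, key=sizes.get):  # smallest views first
        if v not in S and sizes[v] <= space:
            S.add(v)
            space -= sizes[v]
    return S

# With a 7.2M-record budget, PBS keeps PSC plus the small views only.
print(sorted(pick_by_size(sizes, space=7.2)))
```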
12
Pick By Size–Use (PBS-U)
• Extends the Pick By Size algorithm to consider the frequency of queries on each possible view
While (space > 0) Do
    v = view in views minimizing |v| / prob(v)
    If (space - |v| > 0) Then
        space = space - |v|
        S = S ∪ {v}
        views = views - {v}
    Else
        space = 0
Table sizes (#Mrecs) & query frequencies (probabilities):
Parts+Suppls+Custs (6M, 0.05)   Parts+Custs (6M, 0.3)   Parts+Suppls (0.8M, 0.3)   Suppls+Custs (6M, 0.05)
Parts (0.2M, 0.1)   Suppls (0.01M, 0.1)   Custs (0.1M, 0.1)
[Figure: derivation lattice: PSC at the top; SC, PC, PS; P, C, S; ALL/None at the bottom.]
• This query frequency did not change the selected views → same savings
13
Comparing Schema Design Algorithms
• Algorithmic Performance
– Benefit Per Unit Space (BPUS)
– Pick By Size (PBS)
• All Proposed Algorithms
– Produce only a near optimal solution
• Best known is within (0.63 – f ) of optimal, where f is the fraction of space consumed by the largest table
– Make (unrealistic) assumptions
• e.g., ignore indexed data access
– Rely heavily on having good metadata
• e.g., table size and query frequency
– BPUS: O(n³) runtime; PBS: O(n log n) runtime
• Limited applicability for PBS?
– Finds the near optimal solution only for SR-Hypercube lattices
A lattice forms an SR-Hypercube when, for each v in the lattice except v = DBT:
|v| ≥ (# of direct children of v) × (# of records in the child of v)
14
Data Extraction and Load
Step 1: Extract and clean data from all sources
– Select source, remove data inconsistencies, add default values
Step 2: Materialize the views and measures
– Reformat data, recalculate data, merge data from multiple sources, add time elements to the data, compute measures
Step 3: Store data in the DW
– Create metadata and access path data, such as indexes
• Major Issue: Failure during extraction and load
• Approaches:
– UNDO/REDO logging
• Too expensive in time and space
– Incremental Checkpointing
• When to checkpoint? Modularize and divide the long-running tasks
• Must use UNDO/REDO logs also; need high-performance logging
15
Materializing Summary Tables
• Scenario: CompsiQ has factories in 7 cities. Each factory manufactures several of CompsiQ’s 30 hardware products. Each factory has 3 types of manufacturing lines: robotic, hand-assembly, and mixed-line.
• Target summary query:
What is last year’s yield from Factory-A by product type?
• Schema for source data from Factory-A:
YieldInfo(ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
ProductInfo(ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)
16
Materialization using SchemaSQL
select p.ProductType, sum(y.lt)
from Factory-A::YieldInfo→ lt,
     Factory-A::YieldInfo y,
     Factory-A::ProductInfo p
where lt <> ”ProductCode”
  and lt <> ”Week”
  and lt <> ”Year”
  and y.ProductCode = p.ProductCode
  and y.Year = 01
group by p.ProductType
YieldInfo(ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
ProductInfo(ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)
What is last year’s yield from Factory-A by product type?
At execution time, lt ranges over the attribute names in relation YieldInfo.
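In ordinary Python terms (an illustration, not SchemaSQL itself), the query amounts to ranging over YieldInfo's attribute names and summing the yield columns by product type; the rows here are the sample data shown with this scenario.

```python
# Simulating the SchemaSQL query: column names of YieldInfo are treated as
# data, so a new line type becomes a new column without rewriting the query.
yield_info = [
    {"ProductCode": "P11", "RoboticYield": 17, "Hand-AssemYield": 12,
     "MixedLineYield": 5, "Week": 45, "Year": 1},
    {"ProductCode": "P12", "RoboticYield": 9, "Hand-AssemYield": 11,
     "MixedLineYield": 12, "Week": 45, "Year": 1},
]
product_info = {"P11": "Net", "P12": "Video"}  # ProductCode -> ProductType

totals = {}
for y in yield_info:
    if y["Year"] != 1:        # restrict to last year's rows
        continue
    ptype = product_info[y["ProductCode"]]
    # 'lt' ranges over attribute names, excluding the key and time columns.
    for lt in y:
        if lt not in ("ProductCode", "Week", "Year"):
            totals[ptype] = totals.get(ptype, 0) + y[lt]
print(totals)  # {'Net': 34, 'Video': 32}
```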
17
Aggregation Over Irregular Blocks
YieldInfo(ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
ProductInfo(ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)

ProductInfo:
P11  ATMCard   Net    3-8-99   36
P12  SMILCard  Video  1-02-98  18
P13  ATMHub    Net    1-11-99  36
P14  MPEGCard  Video  24-3-00  24
P15  MP3       Audio  17-1-01  36

YieldInfo:
P11  17  12   5  45  01
P12   9  11  12  45  01
P13   5  10   3  45  01
P14  22   8   7  45  01
...
P11  20  15   0  46  01
P12   8   9  10  46  01
P13  31   0   0  46  01
P14  15  15  20  46  01
...
18
User Queries
• Retrieve pre-computed data or formulate new measures not materialized in the DW.
[Figure: the datacube with axes Fiscal Quarter (Q1–Q4, All-Q), Supplier (S1, S2, All-S), and Product (P11, P14, P19, P27, P33).]
• New user operations on logical datacubes:
– Roll-up, Drill-down, Pivot/Rotate
– Slicing and Dicing with a “data blade”
– Sorting
– Selection
– Derived Attributes
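Two of these operations can be sketched over a cell-to-value map; `roll_up` and `slice_` below are illustrative helpers (names and data are assumptions, not a standard API):

```python
# Illustrative datacube operations on a (product, supplier, quarter) -> value map.
cube = {("P11", "S1", "Q1"): 3.0, ("P11", "S2", "Q1"): 2.0,
        ("P14", "S1", "Q2"): 4.0}

def roll_up(cube, dim):
    """Aggregate one dimension away (e.g., per-supplier -> all suppliers)."""
    out = {}
    for cell, v in cube.items():
        key = cell[:dim] + ("All",) + cell[dim + 1:]
        out[key] = out.get(key, 0.0) + v
    return out

def slice_(cube, dim, value):
    """Fix one dimension to a single value (a 'data blade' through the cube)."""
    return {c: v for c, v in cube.items() if c[dim] == value}

print(roll_up(cube, 1)[("P11", "All", "Q1")])  # 5.0
print(slice_(cube, 2, "Q1"))
```

Drill-down is the inverse of roll-up: it returns to the more detailed cells that a rolled-up cell was aggregated from.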
19
Query Processing
• Traditional query transformations
• Index intersection and union
• Advanced join algorithms
• Piggy-backed scans
– Multiple queries with different selection criteria
• SQL extensions => new operators
– Red Brick Systems has proposed 8 extensions, including:
• MovingSum and MovingAvg
• Rank … When
• RatioToReport
• Tertiles
• Create Macro
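As a sketch of what an extension like MovingAvg computes (the Python rendering is an assumption; Red Brick's actual SQL syntax differs):

```python
# Moving average over a window of the current value and its predecessors,
# as a window-function extension like MovingAvg would compute it.
def moving_avg(values, window):
    """Average of each value with up to window-1 preceding values."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out

print(moving_avg([10, 20, 30, 40], 2))  # [10.0, 15.0, 25.0, 35.0]
```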
20
Data Update
• Data sources change over time
• Must “refresh” the DW
– Adds new historical data to the fact tables
– Updates descriptive attributes in the dimension tables
– Forces recalculation of measures in summary tables
• Issues:
1. Monitoring/tracking changes at the data sources
2. Recalculation of aggregated measures
3. Refresh typically forces a shutdown for DW query processing
21
Monitoring Data Sources
Approaches:
1. Value-deltas – Capture before and after values of all tuples changed by normal DB operations and store them in differential relations.
• Issue: must take the DW offline to install the modified values
2. Operation-deltas – Capture SQL updates from the transaction log of each data source and build a new log of all transactions that affect data in the DW.
• Advantage: the DW can remain online for query processing while executing data updates (using traditional concurrency control)
3. Hybrid – Use value-deltas and operation-deltas for different data sources, or for a subset of the relations from a data source.
22
Creating a Differential Relation
Approaches at the Data Source:
1. Execute the update query 3 times
• (1) Select and record the before values; (2) Execute the update; (3) Select and record the after values
• Issues: High cost in time & space; reduces autonomy of the data sources
2. Define and insert DB triggers
• Triggers fire on “insert”, “delete”, and “update” operations; Log the before and after values
• Issues: Not all data sources support triggers; reduces autonomy of the data sources
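Approach 2 can be sketched with Python's sqlite3, whose triggers expose the OLD and NEW row values; the table and trigger names here are illustrative.

```python
import sqlite3

# A trigger logs before/after values of updates into a differential relation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supplier (SupplierID TEXT PRIMARY KEY, SuppPhone TEXT);
CREATE TABLE Supplier_delta (SupplierID TEXT, OldPhone TEXT, NewPhone TEXT);
CREATE TRIGGER supplier_upd AFTER UPDATE ON Supplier
BEGIN
    INSERT INTO Supplier_delta
    VALUES (OLD.SupplierID, OLD.SuppPhone, NEW.SuppPhone);
END;
INSERT INTO Supplier VALUES ('S1', '555-0100');
UPDATE Supplier SET SuppPhone = '555-0199' WHERE SupplierID = 'S1';
""")

# The differential relation now records the before and after values.
print(conn.execute("SELECT * FROM Supplier_delta").fetchall())
# [('S1', '555-0100', '555-0199')]
```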
23
Creating Operation-Deltas
• The process:
– Scan the transaction log at each data source
– Select pertinent transactions and delta-log them
• Advantage:
– Op-delta is much smaller than the value-delta
• Issues:
– Must transform the update operation on the data source schema into an update operation on the DW schema – not always possible, so operation-deltas cannot be used in all cases.
24
Recalculating Aggregated Measures
• Delta Tables
– Assume we have differential relations for the base facts in the data sources (i.e., value deltas)
– Two processing phases (Propagation & Refresh):
[Figure: the propagation process takes the differential relations and the global DW schema as input and produces delta tables.]
1) Propagation – pre-compute all new tuples and all replacement tuples and store them in a delta table
25
Recalculating Aggregated Measures
2) Refresh – Scan the DW tuples, replace existing tuples with the pre-computed tuple values, and insert new tuples from the delta tables
[Figure: the refresh process applies the delta tables to the DW tables, producing updated DW tables.]
Issue:
Cannot pre-compute delta tables for non-commutative measures
Ex: average (without #records), percentiles
Must compute these during the refresh phase.
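The two phases can be sketched for a commutative measure such as SUM; the groups and amounts below are hypothetical, and keeping a record count alongside each sum would likewise allow refreshing averages.

```python
# Materialized SUM-by-ProductType summary table (hypothetical values).
summary = {"Net": 34.0, "Video": 32.0}

# Phase 1 (propagation): turn value-deltas from the sources into per-group
# increments, computed without touching the DW tables.
value_deltas = [("Net", +6.0), ("Audio", +9.0)]
delta_table = {}
for group, change in value_deltas:
    delta_table[group] = delta_table.get(group, 0.0) + change

# Phase 2 (refresh): apply the precomputed deltas to the summary table,
# replacing existing tuples and inserting new ones.
for group, change in delta_table.items():
    summary[group] = summary.get(group, 0.0) + change
print(summary)  # {'Net': 40.0, 'Video': 32.0, 'Audio': 9.0}
```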
26
Data Marting
• What: Stores a second copy of a subset of a DW
• Why build a data mart?
– A user group with special needs (dept.)
– Better performance accessing fewer records
– To support a “different” user access tool
– To enforce access control over different subsets
– To segment data over different hardware platforms
[Figure: the data warehouse system (with datacubes) feeds, via data extraction and load, two data mart systems (each with datacubes); DSS application workstations issue queries against the marts as well as the warehouse.]
27
Costs and Benefits of Data Marting
• System costs:
– More hardware (servers and networks)
– Define a subset of the global data model
– More software to:
• Extract data from the warehouse
• Load data into the mart
• Update the mart (after the warehouse is updated)
• User benefits:
– Define new measures not stored in the DW
– Better performance (mart users and DW users)
– Support a more appropriate user interface
• Ex: a browser with forms versus SQL queries
– Company achieves more reliable access control
28
Commercial DW Products
• Short list of companies with DW products:
– Informix/Red Brick Systems
– Oracle
– Prism Solutions
– Software AG
• Typical Products and Tools
– Specially tuned DB Server
– DW Developer Tools: data extraction, incremental update, index builder
– User Tools: ad hoc query and spreadsheet tools for DSS and post-processing (creating graphs, pie-charts, etc.)
– Application Developer Tools (toolkits for OLAP and DSS): spreadsheet components, statistics packages, trend analysis and forecasting components
29
Ongoing Research Problems
• How to incorporate domain and business rules
into DW creation and maintenance
• Replacing manual tasks with intelligent agents
– Data acquisition, data cleaning, schema design,
DW access paths analysis and index construction
• Separate (but related) research areas:
– Tools for data mining and OLAP
– Providing active database services in the DW