OLAP and Decision Support
Chapter 25, Part A
Introduction
Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; data is fairly static.
Solving modern business problems such as market analysis requires query-centric database schemas that are multidimensional.
Three Complementary Trends
Data Warehousing: Consolidate data from many sources in one large repository. Loading, periodic synchronization of replicas. Semantic integration.
OLAP: Complex SQL queries and views. Queries based on spreadsheet-style operations and “multidimensional” view of data. Interactive and “online” queries.
Data Mining: Exploratory search for interesting trends and anomalies. (Another lecture!)
OLAP versus OLTP
OLTP (Online Transaction Processing) is used to run the business operations of a company, such as inventory management and order processing. OLAP, on the other hand, works with data that is geared towards decision making, especially long-term strategic decisions.

| OLTP | OLAP |
| --- | --- |
| Contains a snapshot of the current data (6 to 24 months) | Requires a history of transactions spread over many years (5 to 20 years) |
| Updated continuously | Static data |
| Can have errors or missing data | Validated and complete data |
| Processes millions of transactions daily | Updated periodically through batch processing, usually once per day |
| Uses entity-relationship diagrams | Uses a multidimensional model |
Multidimensional Data Model
Fact table: a collection of numeric measures, which depend on a set of dimensions. E.g., measure Sales, dimensions Product (key: pid), Location (locid), and Time (timeid).
Fact table Sales (first rows):
pid  timeid  locid  sales
11   1       1      25
11   2       1      8
11   3       1      15
12   1       1      30
12   2       1      20
12   3       1      50
13   1       1      8
13   2       1      10
13   3       1      10
11   1       2      35
Slice locid=1 is shown (rows: pid, columns: timeid):
pid  1   2   3
11   25  8   15
12   30  20  50
13   8   10  10
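As a minimal sketch of the idea, the fact table can be held as a list of (pid, timeid, locid, sales) tuples, and a slice is then just the rows with one dimension fixed. The function name `slice_by_location` and the nested-dict layout are assumptions made for illustration, not part of the slides.

```python
# Fact table rows as listed on the slide: (pid, timeid, locid, sales).
SALES = [
    (11, 1, 1, 25), (11, 2, 1, 8),  (11, 3, 1, 15),
    (12, 1, 1, 30), (12, 2, 1, 20), (12, 3, 1, 50),
    (13, 1, 1, 8),  (13, 2, 1, 10), (13, 3, 1, 10),
    (11, 1, 2, 35),
]


def slice_by_location(rows, locid):
    """Return {pid: {timeid: sales}} for one fixed locid (a 2-D slice of the cube)."""
    grid = {}
    for pid, timeid, loc, sales in rows:
        if loc == locid:
            grid.setdefault(pid, {})[timeid] = sales
    return grid


slice1 = slice_by_location(SALES, locid=1)
for pid in sorted(slice1):
    # One line per product, sales listed in timeid order.
    print(pid, [slice1[pid][t] for t in sorted(slice1[pid])])
```

Fixing a different locid gives a different two-dimensional slice of the same three-dimensional data.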
Why Multidimensional Data Model
Traditional relational databases, as well as spreadsheets, are based on a two-dimensional model, such as number of customers by region.
However, an analyst at a telecommunications company may not be interested in just determining the number of phone customers in a single state such as PA.
Why Multidimensional Data Model
For example, the analyst might like to determine the number of customers who subscribed to both home and wireless service in the past year.
The scenario may become more complicated as more dimensions are added.
The analyst would have to access data in different tables and perform complex table joins, a task that would be beyond the capabilities of an ordinary user.
Why Multidimensional Data Model
[Figure: three-dimensional model, a cube with axes Region, Product, and Time]
What is OLAP?
OLAP is an analytical technique that combines data access tools with an analytical database engine. In contrast to the simple row-and-column structure of relational databases, OLAP uses a multidimensional view of data. OLAP uses calculations and transformations to perform its analytical tasks. There are two types of OLAP system architectures: MOLAP and ROLAP.
MOLAP vs ROLAP
In Multidimensional OLAP (MOLAP), data is extracted from various sources and stored in a special OLAP database server in a pre-aggregated cube format. In contrast, Relational OLAP (ROLAP) does not use an intermediate server because it works directly against the relational database.
MOLAP vs ROLAP
MOLAP performs well with up to about 10 dimensions, while ROLAP can scale considerably higher. ROLAP's advantage is that it works directly against the relational database and is not limited to 10 dimensions. However, it places a heavy load on the server, which makes it much slower, and it is expensive to maintain.
Dimension and Fact tables
SALES (fact table): pid, timeid, locid, sales
PRODUCTS (dimension table): pid, pname, category, price
LOCATIONS (dimension table): locid, city, state, country
TIMES (dimension table): timeid, date, week, month, quarter, year, holiday_flag
• The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
E.g., Products(pid, pname, category, price)
Fact tables are much larger than dimension tables.
Dimension Hierarchies
For each dimension, the set of values can be organized in a hierarchy:
PRODUCT: pname, category (each product name belongs to a category)
TIME: date, month, quarter, year, with week as an alternative grouping of dates
LOCATION: city, state, country
OLAP Queries
Influenced by SQL and by spreadsheets.
A common operation is to aggregate a measure over one or more dimensions.
Find total sales.
Find total sales for each city, or for each state.
Find top five products ranked by total sales.
Roll-up: Aggregating at different levels of a dimension hierarchy. E.g., given total sales by city, we can roll up to get sales by state.
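A roll-up can be sketched as re-aggregating totals through one level of a dimension hierarchy. The city names, the per-city figures, and the `roll_up` helper below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical city-level totals; names and figures are illustrative only.
SALES_BY_CITY = {"Madison": 40, "Milwaukee": 23, "Fresno": 31, "San Jose": 50}

# One level of the LOCATION hierarchy: city -> state.
CITY_TO_STATE = {
    "Madison": "WI", "Milwaukee": "WI",
    "Fresno": "CA", "San Jose": "CA",
}


def roll_up(totals, parent_of):
    """Sum child-level totals into their parent level of the hierarchy."""
    rolled = defaultdict(int)
    for child, value in totals.items():
        rolled[parent_of[child]] += value
    return dict(rolled)


sales_by_state = roll_up(SALES_BY_CITY, CITY_TO_STATE)
# The same helper climbs the next level: state -> country.
sales_by_country = roll_up(sales_by_state, {"WI": "USA", "CA": "USA"})
print(sales_by_state, sales_by_country)
```

Going the other way is not a computation on the rolled-up totals: the finer-grained city figures have to be fetched from the underlying data again.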
OLAP Queries
Drill-down: The inverse of roll-up.
E.g., given total sales by state, we can drill down to get total sales by city.
E.g., we can also drill down on a different dimension to get total sales by product for each state.
Pivoting: Aggregation on selected dimensions. E.g., pivoting on Location and Time yields this cross-tabulation:
       WI   CA   Total
1995   63   81   144
1996   38   107  145
1997   75   35   110
Total  176  223  399
Slicing and Dicing: Equality and range selections on one or more dimensions.
OLAP Queries
[Figure: a slice and a dice of the data cube]
Slicing and dicing restrict the view to selected values, or ranges of values, on one or more dimensions.
Comparison with SQL Queries
The cross-tabulation obtained by pivoting can also be computed using a collection of SQL queries:
SELECT T.year, L.state, SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY T.year, L.state
This query generates the entries in the body of the chart (outlined by the dark lines):
       WI   CA   Total
1995   63   81   144
1996   38   107  145
1997   75   35   110
Total  176  223  399
Comparison with SQL Queries
The summary column on the right is generated by this query:
SELECT T.year, SUM(S.sales)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year
The summary row at the bottom is generated by this query:
SELECT L.state, SUM(S.sales)
FROM Sales S, Locations L
WHERE S.locid=L.locid
GROUP BY L.state
The cumulative sum in the bottom corner of the chart is generated by this query:
SELECT SUM(S.sales)
FROM Sales S, Locations L
WHERE S.locid=L.locid
The CUBE Operator
Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
GROUP BY CUBE:
SELECT T.year, L.state, SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY CUBE (T.year, L.state)
The CUBE Operator
T.Year L.state SUM(S.sales)
1995 WI 63
1995 CA 81
1995 Null 144
1996 WI 38
1996 CA 107
1996 Null 145
1997 WI 75
1997 CA 35
1997 Null 110
Null WI 176
Null CA 223
Null Null 399
The Result of GROUP BY CUBE on Sales
The CUBE Operator
CUBE pid, locid, timeid BY SUM Sales
This query rolls up the table Sales on all eight subsets of the set {pid, locid, timeid}. It is equivalent to eight queries of the form:
SELECT SUM(S.sales)
FROM Sales S
GROUP BY grouping-list
where grouping-list is some subset of the set {pid, locid, timeid}.
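The "one GROUP BY per subset" reading of CUBE can be sketched directly by enumerating the subsets. This is an illustration under stated assumptions rather than how a database evaluates CUBE: the `cube` function is hypothetical, the input is the year/state sales figures from the slides, and `None` plays the role of Null for a rolled-up dimension.

```python
from collections import defaultdict
from itertools import combinations

# (year, state, sales) rows taken from the cross-tabulation in the slides.
ROWS = [
    (1995, "WI", 63), (1995, "CA", 81),
    (1996, "WI", 38), (1996, "CA", 107),
    (1997, "WI", 75), (1997, "CA", 35),
]
DIMENSIONS = ("year", "state")


def cube(rows, dims):
    """SUM(sales) for every subset of dims; None marks a rolled-up dimension."""
    result = defaultdict(int)
    positions = range(len(dims))
    for size in range(len(dims) + 1):
        # Each combination is one grouping list, so 2^k groupings in total.
        for kept in combinations(positions, size):
            for row in rows:
                key = tuple(row[i] if i in kept else None for i in positions)
                result[key] += row[-1]
    return dict(result)


totals = cube(ROWS, DIMENSIONS)
for key in sorted(totals, key=str):
    print(key, totals[key])
```

With two dimensions this makes four passes over the rows, which is the naive cost that precomputation and shared aggregation aim to avoid.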
Implementation Issues
New indexing techniques: bitmap indexes, join indexes, array representations, compression, precomputation of aggregations, etc.
E.g., Bitmap index:
Bit-vector: 1 bit for each possible value. Many queries can be answered using bit-vector ops!
sex (M F)  custid  name  sex  rating  rating (1 2 3 4 5)
10         112     Joe   M    3       00100
10         115     Ram   M    5       00001
01         119     Sue   F    5       00001
10         112     Woo   M    4       00010
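To show why bit-vector operations answer queries cheaply, here is a small sketch of a bitmap index over the customer rows above. It assumes one Python integer per distinct column value, with bit i set when row i holds that value. That is the column-wise view of the per-row bit strings in the table, and the helper names are illustrative.

```python
# Customer rows from the slide: (custid, name, sex, rating).
CUSTOMERS = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (112, "Woo", "M", 4),
]


def build_bitmap_index(rows, column):
    """Map each distinct value in `column` to an int with bit i set for matching row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(rows, bits):
    """Decode a bit-vector back into the rows it selects."""
    return [row for i, row in enumerate(rows) if (bits >> i) & 1]


sex_index = build_bitmap_index(CUSTOMERS, column=2)
rating_index = build_bitmap_index(CUSTOMERS, column=3)

# "Males with rating 5" is a single bitwise AND of two bit-vectors.
males_rated_5 = matching_rows(CUSTOMERS, sex_index["M"] & rating_index[5])
print(males_rated_5)
```

Disjunctions work the same way with bitwise OR, so a predicate over several low-cardinality columns never has to touch the rows until the final decode.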
Join Indexes
Consider the join of Sales, Products, Times, and Locations, possibly with additional selection conditions (e.g., country=“USA”). A join index can be constructed to speed up such joins.
The index contains [s,p,t,l] if there are tuples (with sid) s in Sales, p in Products, t in Times and l in Locations that satisfy the join (and selection) conditions.
Problem: The number of join indexes can grow rapidly.
A variation addresses this problem: For each column with an additional selection (e.g., country), build an index with [c,s] in it if a dimension table tuple with value c in the selection column joins with a Sales tuple with sid s. If the indexes are bitmaps, this is called a bitmapped join index.
Bitmapped Join Index
Consider a query with conditions price=10 and country=“USA”. Suppose tuple (with sid) s in Sales joins with a tuple p with price=10 and a tuple l with country =“USA”. There are two join indexes; one containing [10,s] and the other [USA,s].
Intersecting these indexes tells us which tuples in Sales are in the join and satisfy the given selection.
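The intersection step can be sketched with one small index per selection column. The toy tables below are hypothetical, the position of a tuple in the list stands in for its sid, and Python sets are used in place of real bitmaps to keep the example short.

```python
# Hypothetical toy data; prices, countries and ids are illustrative only.
PRODUCTS = {1: {"price": 10}, 2: {"price": 25}}                # pid -> attributes
LOCATIONS = {1: {"country": "USA"}, 2: {"country": "India"}}   # locid -> attributes
SALES = [  # the position in this list plays the role of sid
    {"pid": 1, "locid": 1},
    {"pid": 1, "locid": 2},
    {"pid": 2, "locid": 1},
    {"pid": 1, "locid": 1},
]


def join_index(sales, dim_table, fk, attr):
    """For each value c of dim_table.attr, collect the sids of Sales tuples that join with it."""
    index = {}
    for sid, sale in enumerate(sales):
        value = dim_table[sale[fk]][attr]
        index.setdefault(value, set()).add(sid)
    return index


price_index = join_index(SALES, PRODUCTS, "pid", "price")          # entries [c, s] for price
country_index = join_index(SALES, LOCATIONS, "locid", "country")   # entries [c, s] for country

# Sales tuples that join with price=10 AND country="USA".
qualifying_sids = price_index[10] & country_index["USA"]
print(sorted(qualifying_sids))
```

Only two single-column indexes are needed here, whereas a full join index would need one index per combination of selection conditions.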
[Figure: star schema with fact table SALES and dimension tables TIMES, PRODUCTS, and LOCATIONS]
Querying Sequences in SQL:1999
Trend analysis is difficult to do in SQL-92:
Find the % change in monthly sales
Find the top 5 products by total sales
Find the trailing n-day moving average of sales
The first two queries can be expressed with difficulty, but the third cannot even be expressed in SQL-92 if n is a parameter of the query.
The WINDOW clause in SQL:1999 allows us to write such queries over a table viewed as a sequence (implicitly, based on user-specified sort keys).
SQL Versions
Year  Name      Comments
1986  SQL-86    First published by ANSI. Ratified by ISO.
1989  SQL-89    Minor revision.
1992  SQL-92    Major revision (ISO 9075).
1999  SQL:1999  Added regular expression matching, recursive queries, triggers, non-scalar types.
2003  SQL:2003  Introduced XML features, window functions, standardized sequences.
2006  SQL:2006  ISO/IEC 9075-14:2006 defines ways in which SQL can be used in conjunction with XML.
The WINDOW Clause
This example shows a moving average of sales over 3 months:
SELECT L.state, T.month, AVG(S.sales) OVER W AS movavg
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
WINDOW W AS (PARTITION BY L.state
             ORDER BY T.month
             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
             AND INTERVAL '1' MONTH FOLLOWING)
Let the result of the FROM and WHERE clauses be “Temp”.
(Conceptually) Temp is partitioned according to the PARTITION BY clause. Similar to GROUP BY, but the answer has one row for each row in a partition, not one row per partition!
The Window Clause
Each partition is sorted according to the ORDER BY clause.
For each row in a partition, the WINDOW clause creates a “window” of nearby (preceding or succeeding) tuples.
Can be value-based, as in the example, using RANGE.
Can be based on the number of rows to include in the window, using the ROWS clause.
The aggregate function is evaluated for each row in the partition using the corresponding window.
The Window Clause
In the example, the window for a row includes the row itself plus all rows whose “month” value is within a month before or after. A row whose month value is June 2002 therefore has a window containing all rows with month equal to May, June, or July 2002.
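The partition, sort, and window steps can be mimicked in a few lines to make the semantics concrete. This is a sketch of the value-based RANGE behaviour only, under simplifying assumptions: the rows are hypothetical, and months are plain integers within a single year rather than SQL date values.

```python
from collections import defaultdict

# Hypothetical (state, month, sales) rows; months are integers within one year.
TEMP = [
    ("WI", 5, 10.0), ("WI", 6, 20.0), ("WI", 7, 30.0), ("WI", 9, 40.0),
    ("CA", 5, 100.0), ("CA", 6, 200.0),
]


def moving_average(rows, preceding=1, following=1):
    """One output row per input row: AVG(sales) over same-state rows whose month
    lies in [month - preceding, month + following] (a value-based window)."""
    partitions = defaultdict(list)          # PARTITION BY state
    for state, month, sales in rows:
        partitions[state].append((month, sales))
    output = []
    for state, members in partitions.items():
        members.sort()                      # ORDER BY month
        for month, _ in members:
            window = [s for m, s in members
                      if month - preceding <= m <= month + following]
            output.append((state, month, sum(window) / len(window)))
    return output


movavg = moving_average(TEMP)
for row in movavg:
    print(row)
```

Because the window is defined on month values, the WI row for month 9 averages only itself: months 8 and 10 have no rows. A ROWS-based window would instead take the physically adjacent rows regardless of the gap.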
Top N Queries
If you want to find the 10 (or so) cheapest cars, it would be nice if the DB could avoid computing the costs of all cars before sorting to determine the 10 cheapest.
Idea: Guess at a cost c such that the 10 cheapest all cost less than c, and that not too many more cost less. Then add the selection cost<c and evaluate the query.
• If the guess is right, great, we avoid computation for cars that cost more than c.
• If the guess is wrong, need to reset the selection and recompute the original query.
Top N Queries
SELECT P.pid, P.pname, S.sales
FROM Sales S, Products P
WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3
ORDER BY S.sales DESC
OPTIMIZE FOR 10 ROWS
The OPTIMIZE FOR construct is not in SQL:1999! The cut-off value c is chosen by the optimizer, which in effect evaluates:
SELECT P.pid, P.pname, S.sales
FROM Sales S, Products P
WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3
AND S.sales > c
ORDER BY S.sales DESC
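The guess-and-fall-back strategy can be sketched as follows. The function is hypothetical: a real optimizer would pick the cut-off c from statistics, whereas here the caller simply passes it in.

```python
def top_n_with_cutoff(sales, n, cutoff):
    """Return the n largest values, trying the cheaper filtered query first.

    `cutoff` stands in for the optimizer's guess c; here the caller supplies it.
    """
    candidates = [s for s in sales if s > cutoff]   # the added selection: sales > c
    if len(candidates) < n:
        # The guess was too high: drop the selection and rerun the original query.
        candidates = list(sales)
    return sorted(candidates, reverse=True)[:n]


SALES = [25, 8, 15, 30, 20, 50, 8, 10, 10, 35]

good_guess = top_n_with_cutoff(SALES, n=3, cutoff=20)   # only 4 values get sorted
bad_guess = top_n_with_cutoff(SALES, n=3, cutoff=40)    # falls back to a full sort
print(good_guess, bad_guess)
```

Both calls return the same answer. The difference is only in how many values reach the sort, which is the saving the cut-off is meant to buy.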
Online Aggregation
Consider an aggregate query, e.g., finding the average sales by state. Can we provide the user with some information before the exact average is computed for all states?
Can show the current “running average” for each state as the computation proceeds.
Even better, if we use statistical techniques and sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average sales for Wisconsin is $2000±102 with 95% probability.”
• Should also use nonblocking algorithms!
Online Aggregation
Status  Prioritize  State    AVG(Sales)  Confidence
90%     Selected    Alabama  5,232.5     97%
20%     Unselected  Alaska   2,832.5     93%
92%     Selected    Arizona  6,432.5     98%
…       …           …        …           …
35%     Unselected  Wyoming  4,243.5     92%
An algorithm is said to block if it does not produce output tuples until it has consumed all its input tuples. For example, the sort-merge join algorithm blocks because sorting requires all input tuples before the first output tuple can be determined. Hash join is therefore preferable to sort-merge join for online aggregation.
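A running average with a confidence bound can be sketched as a generator that reports estimates while it is still consuming tuples, which also makes it nonblocking. Everything below is an assumption made for illustration: the sales figures are synthetic, the interval is a plain normal approximation (z * s / sqrt(n)), and the tuples are assumed to arrive in random order.

```python
import math
import random


def online_average(values, report_every=100, z=1.96):
    """Yield (tuples_seen, running_mean, half_width) while consuming the input.

    half_width is a normal-approximation 95% bound, z * s / sqrt(n), and is only
    meaningful if the tuples arrive in random order.
    """
    n, mean, m2 = 0, 0.0, 0.0      # Welford's running mean and sum of squared deviations
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1 and n % report_every == 0:
            std = math.sqrt(m2 / (n - 1))
            yield n, mean, z * std / math.sqrt(n)


rng = random.Random(0)
sales = [rng.gauss(2000, 500) for _ in range(1000)]   # synthetic data, not from the slides
rng.shuffle(sales)                                    # simulate random-order sampling

estimates = list(online_average(sales))
for seen, mean, half_width in estimates:
    print(f"after {seen} tuples: {mean:.1f} +/- {half_width:.1f}")
```

The bound tightens as more tuples are seen, so a user can stop the query as soon as the estimate is precise enough.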
Summary
Decision support is an emerging, rapidly growing subarea of databases.
Involves the creation of large, consolidated data repositories called data warehouses.
Warehouses are exploited using sophisticated analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).
New techniques for database design, indexing, view maintenance, and interactive querying need to be supported.