OLAP Recap 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation...

OLAP Recap

3 characteristics of OLAP cubes:
1. Large data sets (~ GB, TB)
2. Expected query: aggregation
3. Infrequent updates

Star schema: hierarchical dimensions

Attributes and Measures

Attributes are columns with values from a fixed domain (foreign keys).

Measures are numerical columns.

Imprecision and Uncertainty

Imprecision in a tuple refers to an attribute instantiated by a set of values from the domain, each with an associated probability, instead of a single value.

Uncertainty refers to a measure represented by a pdf over the domain instead of a single value.

Aggregation on Uncertain Data

Several ways of combining PDFs

LinOp: linear combination of PDFs, $P(x) = \sum_i w_i \, p_i(x)$, where the weights $w_i$ are non-negative and sum to 1.
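A minimal sketch of LinOp for discrete PDFs, each stored as a dict from value to probability. The function name `lin_op`, the dict representation, and the equal default weights are assumptions made for illustration only.

```python
from collections import defaultdict


def lin_op(pdfs, weights=None):
    """Combine discrete PDFs (dict: value -> probability) by a weighted sum."""
    if weights is None:
        # Assumption: use equal weights when the caller gives none.
        weights = [1.0 / len(pdfs)] * len(pdfs)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    combined = defaultdict(float)
    for w, pdf in zip(weights, pdfs):
        for value, prob in pdf.items():
            combined[value] += w * prob
    return dict(combined)


# Two hypothetical PDFs for the same uncertain measure.
p1 = {"low": 0.5, "high": 0.5}
p2 = {"low": 0.25, "high": 0.75}
mixed = lin_op([p1, p2])
```

Because the weights sum to 1, the combined result is again a valid PDF.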

Hierarchical Domains : Star Schema

Location
  Maharashtra: Mumbai, Pune
  Madhya Pradesh: Bhopal, Indore

Restriction on Imprecision

We restrict the sets of values in an imprecise fact to either:
1. A singleton set consisting of a leaf-level member of the hierarchy, or
2. The set of all the leaf-level members under some non-leaf-level member of the hierarchy.

Cells and Regions

A region is a vector of attribute values, one from the imprecise domain of each dimension of the cube.

A cell is a region in which all values are leaf-level members.

Let reg(R) represent the set of cells in a region R.
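A small sketch of how reg(R) could be computed for a two-dimensional cube. The Location hierarchy follows the star-schema slide; the Model dimension, its parent "Sedan" and the member "Esteem" are hypothetical, as are the function names. The dimension name itself is used as the root value that covers every leaf.

```python
from itertools import product

# Location hierarchy from the slide; the Model hierarchy is made up.
HIERARCHY = {
    "Location": {
        "Maharashtra": ["Mumbai", "Pune"],
        "Madhya Pradesh": ["Bhopal", "Indore"],
    },
    "Model": {"Sedan": ["Ambassador", "Esteem"]},
}


def leaves(dimension, value):
    """Leaf-level members covered by a value of the given dimension."""
    tree = HIERARCHY[dimension]
    if value == dimension:  # the root covers every leaf
        return [leaf for kids in tree.values() for leaf in kids]
    if value in tree:  # a non-leaf member covers its children
        return list(tree[value])
    return [value]  # a leaf covers only itself


def reg(region):
    """Set of cells in a region given as {dimension: value}."""
    dims = sorted(region)  # fixed order: Location, then Model
    return set(product(*(leaves(d, region[d]) for d in dims)))


cells = reg({"Location": "Maharashtra", "Model": "Ambassador"})
single = reg({"Location": "Pune", "Model": "Ambassador"})
```

A region whose values are all leaves yields exactly one cell, which matches the definition of a cell above.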

Queries on precise data

A query Q = (R, M, A) refers to a region R, a measure M, and an aggregate function A.

Eg: (<Ambassador, Location>, Repairs, Sum)

The result of the query in a precise database is obtained by applying A to the measure M of all cells in R.

For the example above, the result is (P1 + P2).
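A minimal sketch of evaluating Q = (R, M, A) on precise facts. The placement of P1 and P2 in cities, the fact P3, and all Repairs values are hypothetical, since the figure is not part of the transcript; `answer_precise` is an illustrative name.

```python
def answer_precise(facts, region_cells, measure, aggregate):
    """Apply the aggregate to the measure of every fact whose cell lies in the region."""
    values = [f[measure] for f in facts if f["cell"] in region_cells]
    return aggregate(values)


# Hypothetical precise facts; each occupies exactly one (Model, Location) cell.
facts = [
    {"id": "P1", "cell": ("Ambassador", "Mumbai"), "Repairs": 100},
    {"id": "P2", "cell": ("Ambassador", "Pune"), "Repairs": 150},
    {"id": "P3", "cell": ("Esteem", "Bhopal"), "Repairs": 80},
]

# Region <Ambassador, Location>: the Ambassador cell for every city.
region = {("Ambassador", city) for city in ("Mumbai", "Pune", "Bhopal", "Indore")}

total = answer_precise(facts, region, "Repairs", sum)  # P1 + P2
```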

Queries on imprecise data

Consider the query region <Pune, Model> in the figure. It overlaps two imprecise facts, P4 and P5.

Three (naive) options for including a fact in a query:
1. Contains: consider the fact only if it is contained in the query region
2. Overlaps: consider the fact if it overlaps the query region
3. None: ignore all imprecise facts
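The three options can be sketched as a filter over facts, where each fact carries the set of cells it may occupy. The layout of P2, P4 and P5 below is hypothetical (the figure is not in the transcript), and `select_facts` is an illustrative name.

```python
def select_facts(facts, query_cells, option):
    """Return the ids of the facts a query would consider under the given option."""
    chosen = []
    for fact in facts:
        cells = fact["cells"]
        precise = len(cells) == 1
        if option == "none":
            keep = precise and cells <= query_cells
        elif option == "contains":
            keep = cells <= query_cells
        elif option == "overlaps":
            keep = bool(cells & query_cells)
        else:
            raise ValueError(f"unknown option: {option}")
        if keep:
            chosen.append(fact["id"])
    return chosen


# Hypothetical layout over (Location, Model) cells.
facts = [
    {"id": "P2", "cells": frozenset({("Pune", "Ambassador")})},
    # P4: Ambassador somewhere in Maharashtra.
    {"id": "P4", "cells": frozenset({("Mumbai", "Ambassador"), ("Pune", "Ambassador")})},
    # P5: some model in Pune.
    {"id": "P5", "cells": frozenset({("Pune", "Ambassador"), ("Pune", "Esteem")})},
]

query = {("Pune", "Ambassador"), ("Pune", "Esteem")}  # <Pune, Model>
```

With this layout, None keeps only P2, Contains adds P5, and Overlaps adds both P4 and P5.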

Desideratum I: Consistency

Consistency specifies the relationship between answers to related queries on a fixed data set

Generic idea: if a query region is partitioned and the aggregate is applied to each partition, then the aggregate q on the whole region must be consistent in some way with the aggregates q_i on the partitions

General idea: alpha consistency for property alpha

Specific forms of consistency discussed in detail in paper

sum consistency (for count/sum)

Boundedness consistency (for average)

[Figure: facts p1 to p5 placed on a grid over Location (MA, NY; parent East) and Automobile (Sierra, F150; parent Truck).]

Desideratum II: Faithfulness

Faithfulness specifies the relationship between answers to a fixed query on related data sets

[Figure: three panels, Data Set 1, Data Set 2 and Data Set 3, each showing facts p1 to p5 on a grid over Location (MA, NY) and Automobile (Sierra, F150).]

Contains option : Consistency

Intuitively, consistency means that the answer to a query should be consistent with the aggregates from individual partitions of the query.

Using the Contains option could give rise to inconsistent results.

For example, consider the sum aggregate of the query above and that of its individual cells. With the Contains option, will the individual results add up to be the same as the collective?
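A small sketch of how Contains can break sum consistency. The cells, facts and values are hypothetical; the point is only that an imprecise fact spanning two cells is counted for the whole region but for neither cell on its own.

```python
def contains_sum(facts, query_cells):
    """SUM under the Contains option: count a fact only if all its cells are in the query."""
    return sum(f["value"] for f in facts if f["cells"] <= query_cells)


mumbai = ("Mumbai", "Ambassador")
pune = ("Pune", "Ambassador")

facts = [
    {"cells": frozenset({mumbai}), "value": 10},
    {"cells": frozenset({pune}), "value": 20},
    # Imprecise fact: an Ambassador repaired somewhere in Maharashtra.
    {"cells": frozenset({mumbai, pune}), "value": 5},
]

whole = contains_sum(facts, {mumbai, pune})
parts = contains_sum(facts, {mumbai}) + contains_sum(facts, {pune})
```

Here `whole` includes the imprecise fact while `parts` does not, so the partition totals fall short of the total for the whole region.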

None option

Essentially, the None option ignores all imprecise facts, even if a fact is completely inside the query region. This defeats the purpose of having imprecise facts.

Overlaps option : Possible Worlds

Query semantics on Possible worlds

With each possible world i, assign a weight $w_i$ such that the sum of all weights is 1. Intuitively, the weight of a world is the probability that it is the correct underlying data.

Given a query Q, we can calculate the result $v_i$ in each world i. Thus, we can return a pdf over the answer Z as

$$P[Z = v] = \sum_{i \,:\, v_i = v} w_i$$

A neat short answer could be the expected value of Z:

$$E[Z] = \sum_i w_i v_i$$

The problem with this: the number of possible worlds is exponential in the number of imprecise facts!
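A brute-force sketch of the possible-worlds semantics for a SUM query. The two imprecise facts, their candidate cells and values are hypothetical, and all worlds are assumed equally likely purely for illustration. The number of worlds is the product of the candidate-set sizes, which is where the exponential blowup comes from.

```python
from collections import defaultdict
from itertools import product

# Hypothetical imprecise facts: candidate cells and one measure value each.
CANDIDATES = {"P4": ["Mumbai", "Pune"], "P5": ["Pune", "Bhopal", "Indore"]}
VALUES = {"P4": 5, "P5": 8}


def possible_worlds(candidates):
    """Yield one world per way of placing every imprecise fact in a single cell."""
    ids = sorted(candidates)
    for placement in product(*(candidates[i] for i in ids)):
        yield dict(zip(ids, placement))


def answer_pdf(query_cells):
    """PDF of the SUM answer, assuming (for illustration) equally weighted worlds."""
    worlds = list(possible_worlds(CANDIDATES))
    weight = 1.0 / len(worlds)
    pdf = defaultdict(float)
    for world in worlds:
        v = sum(VALUES[f] for f, cell in world.items() if cell in query_cells)
        pdf[v] += weight  # P[Z = v] accumulates the weights of worlds with answer v
    return dict(pdf), len(worlds)


pdf, n_worlds = answer_pdf({"Pune"})
expected = sum(v * p for v, p in pdf.items())  # E[Z]
```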

Solution : Extended data model

With each cell c in a region r, we associate a probability $p_{c,r}$, called the allocation of r to c.

The probability of a possible world becomes the product of the allocations of regions to the cells that they populate in that world.

This leads to a (reasonable) restriction on the kind of probability distributions on possible worlds.

Advantages of EDM

No extra infrastructure required for representing imprecision.

Efficient algorithms for aggregate queries:
1. SUM and COUNT: linear-time algorithm.
2. AVERAGE: slightly more complicated algorithm running in $O(m + n^3)$ for m precise facts and n imprecise facts.
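One way to see why SUM needs only a single pass: by linearity of expectation, each imprecise fact contributes its value times the allocation mass that falls inside the query region. The sketch below follows that reasoning; it is not necessarily the algorithm from the paper, and the facts, allocations and the name `expected_sum` are hypothetical.

```python
def expected_sum(precise, imprecise, query_cells):
    """Expected SUM over a query region in one pass over the facts.

    precise:   list of (cell, value)
    imprecise: list of (value, allocation), allocation being a dict cell -> probability
    """
    total = sum(value for cell, value in precise if cell in query_cells)
    for value, allocation in imprecise:
        # Probability that this fact lands inside the query region.
        mass = sum(p for cell, p in allocation.items() if cell in query_cells)
        total += mass * value
    return total


precise = [("Mumbai", 10), ("Pune", 20)]
imprecise = [
    (5, {"Mumbai": 0.5, "Pune": 0.5}),
    (8, {"Pune": 0.25, "Bhopal": 0.75}),
]

result = expected_sum(precise, imprecise, {"Pune"})  # 20 + 0.5*5 + 0.25*8
```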

Allocation Policies

For every region r in the database, we want to assign an allocation $p_{c,r}$ to each cell c in Reg(r), such that

$$\sum_{c \in Reg(r)} p_{c,r} = 1$$

Three ways of doing so:

1. Uniform: Assign each cell c in a region r an equal probability.

$$p_{c,r} = \frac{1}{|Reg(r)|}$$
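Uniform allocation as a short sketch; the function name and the example region are illustrative.

```python
def uniform_allocation(region_cells):
    """Give every cell in Reg(r) the same probability 1 / |Reg(r)|."""
    cells = list(region_cells)
    return {cell: 1.0 / len(cells) for cell in cells}


# A hypothetical imprecise fact recorded at the state level.
alloc = uniform_allocation(["Mumbai", "Pune"])
```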

However, we can do better. Some cells may naturally deserve more probability than others. Eg: Mumbai will clearly have more repairs than Bhopal. We can capture this automatically by giving more probability to cells with a higher number of precise facts.

2. Count based:

$$p_{c,r} = \frac{N_c}{\sum_{c' \in Reg(r)} N_{c'}}$$

where $N_c$ is the number of precise facts in cell c
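A sketch of count-based allocation, assuming the allocation is proportional to the number of precise facts N_c in each cell of the region. The counts are hypothetical, and falling back to uniform allocation when a region has no precise facts is an assumption of this sketch, not something stated in the slides.

```python
def count_allocation(region_cells, precise_counts):
    """Allocate in proportion to N_c, the number of precise facts in each cell."""
    counts = {cell: precise_counts.get(cell, 0) for cell in region_cells}
    total = sum(counts.values())
    if total == 0:
        # Assumption: with no precise facts to go on, fall back to uniform.
        return {cell: 1.0 / len(counts) for cell in counts}
    return {cell: n / total for cell, n in counts.items()}


# Hypothetical counts of precise facts per city.
n_precise = {"Mumbai": 30, "Pune": 10, "Bhopal": 4}
alloc = count_allocation(["Mumbai", "Pune"], n_precise)
```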

Again, we can arguably get a better result by looking not just at the count but at the actual values of the measure in question.

3. Measure based: next slide.

Measure Based Allocation

Assumes the following model: the given database D with imprecise facts has been generated by randomly injecting imprecision into a precise database D'.

D' assigns value o to a cell c according to some unknown pdf P(o, c).

If we could determine this pdf, the allocation is simply

$$p_{c,r} = \frac{P(c)}{\sum_{c' \in Reg(r)} P(c')}$$

Maximum Likelihood Principle

A reasonable estimate for this function P is the one that maximises the probability of generating the given imprecise data set D.

Example: Suppose the pdf depends only on the cells and is independent of the measure values. Thus, the pdf is a mapping $\Delta : C \to \mathbb{R}$, where C is the set of cells.

This pdf can be found by maximising the likelihood function:

$$\mathcal{L}(\Delta) = \prod_{r \in D} \sum_{c \in Reg(r)} \Delta(c)$$
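A sketch of how EM could maximise a likelihood of this form when the pdf depends only on the cell. Each fact is given as the set of cells in its region (a precise fact is a one-cell region). The E step splits each fact over its cells in proportion to the current estimate; the M step re-estimates the cell probabilities from those fractional assignments. The data, the name `em_allocation` and the fixed iteration count are assumptions; the paper's actual procedure may differ.

```python
def em_allocation(regions, cells, iterations=50):
    """Estimate a per-cell pdf and the resulting allocations by EM."""
    delta = {c: 1.0 / len(cells) for c in cells}  # uniform starting guess
    alloc = []
    for _ in range(iterations):
        # E step: allocate each fact to its cells in proportion to delta.
        alloc = []
        for region in regions:
            norm = sum(delta[c] for c in region)
            alloc.append({c: delta[c] / norm for c in region})
        # M step: new delta is the average allocated mass per cell.
        delta = {c: 0.0 for c in cells}
        for a in alloc:
            for c, p in a.items():
                delta[c] += p / len(regions)
    return delta, alloc


# Hypothetical data: two precise facts in Mumbai, one in Pune, one imprecise fact.
regions = [{"Mumbai"}, {"Mumbai"}, {"Pune"}, {"Mumbai", "Pune"}]
delta, alloc = em_allocation(regions, ["Mumbai", "Pune"])
```

On this data the estimate for Mumbai settles at 2/3, so the imprecise fact is allocated about two thirds to Mumbai and one third to Pune.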

EM Algorithm

The Expectation Maximization algorithm provides a standard way of maximizing the likelihood when we have some unknown variables in the observation set.

Expectation step (compute data): Calculate the expected value of the unknown variables, given the current estimate of the variables.

Maximization step (compute generator): Calculate the distribution that maximizes the probability of the current estimated data set.

EM Algorithm : Example

Initialization Step: Data: [4, 10, ?, ?]. Initial mean value: 0. New Data: [4, 10, 0, 0]

Step 1: New Mean: 3.5. New Data: [4, 10, 3.5, 3.5]

Step 2: New Mean: 5.25. New Data: [4, 10, 5.25, 5.25]

Step 3: New Mean: 6.125. New Data: [4, 10, 6.125, 6.125]

Step 4: New Mean: 6.5625. New Data: [4, 10, 6.5625, 6.5625]

Step 5: New Mean: 6.78125. New Data: [4, 10, 6.78125, 6.78125]

Result: New Mean: 6.890625
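The iteration in this example can be sketched in a few lines: fill the missing entries with the current mean (expectation step), then recompute the mean over the filled data (maximization step). The function name and the fixed number of iterations are assumptions for illustration.

```python
def em_mean(observed, n_missing, initial_mean=0.0, iterations=6):
    """Estimate the mean of a data set with missing entries by simple EM."""
    mean = initial_mean
    history = []
    for _ in range(iterations):
        # E step: replace every missing value by the current mean estimate.
        filled = list(observed) + [mean] * n_missing
        # M step: the mean of the filled data is the new estimate.
        mean = sum(filled) / len(filled)
        history.append(mean)
    return history


history = em_mean([4, 10], n_missing=2)
```

The last entry of `history` is 6.890625, the result quoted above; with more iterations the estimate approaches 7, the mean of the two observed values.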

EM Algorithm : Application

Experiments : Allocation run time

Experiments : Query run time

Experiments : Accuracy

Summary

Model for ambiguity: imprecision, uncertainty

Querying on uncertain data:
  None vs. Contains vs. Overlaps option
  Consistency, faithfulness

Possible worlds interpretation: size blowup

Extended databases: allocation

Aggregation algorithms on extended databases

Allocation policies:
  Uniform
  Count
  Measure: EM algorithm

Experiments: allocation time, query time, accuracy

References :

OLAP over uncertain and imprecise data (Doug Burdick et al.) - The VLDB Journal (2007) 16:123–144

OLAP over uncertain and imprecise data (Doug Burdick et al.) - The VLDB Journal (2005)

http://en.wikipedia.org/wiki/Expectation-maximization_algorithm