
Chapter 1

Introduction

Data warehousing and data mining have both become popular technologies in recent years.
Data warehousing is an information infrastructure that stores and integrates different data
sources into a consistent repository; through OLAP (On-Line Analytical Processing) tools,
business managers can analyze these data from various perspectives to discover valuable
information for strategic decisions. Data mining, on the other hand, is the exploration and
analysis of data, automatically or semi-automatically, to discover meaningful patterns and
rules. From the business viewpoint, integrating these two technologies allows a corporation
to understand its customers' behaviors and to use this information to gain a competitive
advantage in the market. Among the various patterns of interest to the data mining research
community, association rules have attracted great attention recently. An association rule
is a rule of the form A => B (sup = s%, conf = c%), which reveals the co-occurrence of two
itemsets A and B. An example is PC => Laser Printer (sup = 30%, conf = 80%), which means
that 30% of customers buy a PC and a Laser Printer together, and 80% of the customers who
buy a PC also buy a Laser Printer.

Mining association rules from a large database is a data- and computation-intensive
task. To reduce the complexity of association mining, researchers have proposed
integrating data warehousing systems with association mining algorithms.
For example, the DBMiner system [22], developed by J. Han and his research team,
adopts an OLAP-based association mining approach. A similar paradigm was presented
in [22].

The primary problem of the OLAP-based approach is that the OLAP data cube is not
well suited to on-line association mining; considerable effort is still required to
complete the task. For this reason, Lin et al. [15] proposed the concept of the OLAM
(On-Line Association Mining) cube, an extension of the iceberg cube [3] that stores
frequent multidimensional itemsets. They also proposed a framework for an on-line
multidimensional association rule mining system, called OMARS, which provides users
with an environment for executing OLAP-like queries to mine association rules from
data warehouses efficiently.

This thesis is a companion to the implementation of OMARS. In particular, it
addresses the problem of selecting appropriate OLAM cubes to materialize and store in
OMARS. In accordance with the mining algorithms proposed for OMARS, it also develops
a suitable model for evaluating the cost of selecting data cubes to materialize.

1.1 Contributions

The main contributions of this thesis are as follows:

1. We exploit the dependency between OLAM cubes with regard to association
queries, thereby devising the structure of the OLAM lattice.

2. We develop a model for evaluating the cost of answering association
queries using materialized OLAM cubes, which is a preliminary step for
OLAM cube selection.

3. We modify and implement some state-of-the-art heuristic algorithms, and
draw comparisons between these algorithms to evaluate their effectiveness.

1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 describes previous research and related
work on data warehousing and data mining technologies. Chapter 3 briefly describes the
OMARS framework. Chapter 4 formulates our OLAM cube selection problem. The algorithm
analysis and cost model are described in Chapter 5. Chapter 6 explains our algorithms,
and Chapter 7 shows the experimental results of this research. Finally, we conclude our
work and point out some future research directions in Chapter 8.


Chapter 2

Background and Related Work

2.1 Data Warehouse and OLAP

2.1.1 Data Warehouse

As coined by W. H. Inmon, the term “data warehouse” refers to a “subject-oriented,
integrated, time-variant and nonvolatile collection of data in support of
management’s decision-making process” [11]. In this regard, a data warehouse is a
database dedicated to supporting decision making. According to the demands of analysts,
data coming from different databases are extracted and transformed into the data
warehouse. When users execute queries, the system only needs to search the data
warehouse instead of the source databases, which saves considerable query processing
time.

A data warehouse system is composed of three primary parts:

1. The source databases in the back end: The data are collected from various
sources, internal or external, legacy or operational, and any change to these
sources is continually monitored by modules called monitors/wrappers.

2. The data warehouse and data marts in the core: The reconciled data are
stored in the data warehouse and data marts, which form the central repository
of the whole system.

3. The analysis tools in the front end: The analysis tools supported in the front
end are usually OLAP, query/tabulation tools, and data mining software.

The typical structure of a data warehouse is illustrated in Figure 2.1.

Figure 2.1. A typical architecture of data warehouse [11].

2.1.2 On-Line Analytical Processing (OLAP)

Although the data stored in a data warehouse have been cleaned, filtered, and

integrated, it still requires much time to transform the data into useful strategic

information owing to the massive amount of data stored in data warehouse. The

concept of On-Line Analytical Processing (OLAP) [4] refers to the process of creating

and managing multidimensional data for analysis and visualization. To provide fast

and multidimensional analysis of data in a data warehouse, the OLAP tool

precomputes aggregation over data and organizes the result as a data cube composed


of several dimensions, each representing one of the user analysis perspectives.

The typical operations provided by OLAP include roll-up, drill-down, slice and

dice and pivot [8]. Roll-up operation performs aggregation on a data cube, either by

climbing up a concept hierarchy for a dimension or by dimension reduction. Drill-

down is the reverse of roll-up. It navigates from less detailed data to more detailed

data. The slice operation performs a selection on one dimension of the given cube,

resulting in a subcube, while the dice operation defines a subcube by performing a

selection on two or more dimensions. The pivot operation, which is also called rotate,

is a visualization operation that rotates the data axes in view in order to provide an

alternative presentation of the data. These OLAP operations are illustrated in Figure

2.2.
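The operations described above can be sketched on a toy cube held as a Python dictionary. This is only a minimal illustration under our own assumptions, not the way an OLAP server is implemented: the cell values, the helper names (slice_cube, dice_cube, roll_up) and the product-to-category hierarchy are all hypothetical.

```python
from collections import defaultdict

# Toy cube: (product, supplier, customer) -> sales count (made-up values).
DIMS = ("product", "supplier", "customer")
cube = {
    ("P1", "S1", "C1"): 10,
    ("P1", "S2", "C2"): 5,
    ("P2", "S1", "C1"): 7,
    ("P2", "S2", "C3"): 3,
    ("P3", "S1", "C2"): 4,
}

def slice_cube(cube, dim, value):
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice_cube(cube, allowed):
    """Dice: select on two or more dimensions; `allowed` maps dim -> set of values."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

def roll_up(cube, dim, hierarchy=None):
    """Roll-up: climb a concept hierarchy on `dim`, or drop `dim` if none is given."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for k, v in cube.items():
        if hierarchy is None:
            key = k[:i] + k[i + 1:]                      # dimension reduction
        else:
            key = k[:i] + (hierarchy[k[i]],) + k[i + 1:]  # one level up
        out[key] += v
    return dict(out)

category = {"P1": "A", "P2": "A", "P3": "B"}  # hypothetical hierarchy
by_category = roll_up(cube, "product", category)
no_customer = roll_up(cube, "customer")
```

Drill-down is the inverse navigation and needs the more detailed cube to be available; pivot only changes how the same cells are displayed, so neither is shown here.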

2.2 Data Warehouse Data Model

Because data warehouse systems require a concise, subject-oriented schema
that facilitates on-line data analysis, the entity-relationship data model
generally used in relational database systems is not suitable for them. Instead,
the most popular data model for a data warehouse is the multidimensional data
model. Two common relational models that facilitate multidimensional analysis are
the star schema and the snowflake schema.

[Figure 2.2 panels: Slice, Dice, Roll-Up, Drill-Down, Pivot, each shown on a cube with Product (P1-P6), Supplier (S1-S4) and Customer (C1-C4) axes.]

Figure 2.2. The typical operations of OLAP

2.2.1 Star Schema

The star schema, proposed by Kimball [12], is the most popular dimensional model
in the data warehouse community. A star schema consists of a fact table and several
dimension tables. The fact table stores a list of foreign keys that correspond to
the dimension tables, together with the numeric measures of interest to users. Each
dimension table contains a set of attributes. Moreover, the attributes within a
dimension table may form either a hierarchy (total order) or a lattice (partial
order). An example of a star schema is depicted in Figure 2.3, and its schema
hierarchy is illustrated in Figure 2.4.


Figure 2.3. An example of star schema for sales

Figure 2.4. An example of schema hierarchy for sales star


2.2.2 Snowflake Data Model

The snowflake schema is a variant of the star schema model, where some

dimension tables are normalized, thereby further splitting the data into additional

individual and hierarchical tables. An example of snowflake data model is depicted in

Figure 2.5.

Figure 2.5. An example of snowflake schema for sales

The major difference between the snowflake schema and the star schema is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancy. As a result, a snowflake schema is easier to maintain and needs less
storage space than a star schema. On the other hand, the star schema integrates
schema hierarchies into a single dimension table, so no join operation is incurred
when traversing the hierarchy of a dimension. Hence, the star schema data model is
more popular than the snowflake schema data model.

2.3 Association Rule Mining

2.3.1 Association Rules

Association rule mining is one of the prominent activities in the data mining
community. Its aim is to search for interesting relationships among items in a
given data set. For example, the information that customers who purchase diapers
also tend to buy beer at the same time is represented by the association rule below:

Diaper => Beer [sup = 2%, conf = 60%]

Rule support and confidence are two measures of rule interestingness. A support of
2% means that 2% of customers purchase diapers and beer together. A confidence of
60% means that 60% of the customers who purchase diapers also buy beer. Typically,
an association rule is considered interesting if it satisfies a minimum support
threshold and a minimum confidence threshold that are set by users or domain experts.

The process of association rule mining can be divided into two steps:

1. Frequent itemset generation: In this step, all itemsets with support greater
than the minimum support threshold are discovered.

2. Rule construction: From the frequent itemsets generated in the first step, the
rules whose confidence is greater than the minimum confidence threshold are
reported as association rules.


The most popular and influential association mining algorithm is Apriori [2],
which uses the prior knowledge of frequent k-itemsets to generate candidate
(k+1)-itemsets. When the maximum length of frequent itemsets is l, Apriori needs l
passes over the database. Since the Apriori algorithm spends much time generating
candidate itemsets and counting the support of each itemset, many variant algorithms
have been proposed to improve the efficiency of the mining process.
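The two mining steps and the level-wise idea behind Apriori can be sketched as follows. This is a compact illustration written for this discussion rather than the algorithm of [2] in full detail; the function names, the basket data and the thresholds are hypothetical, and support is expressed as an absolute count for simplicity.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # One pass over the data counts every candidate of length k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

def rules(frequent, n_transactions, min_conf):
    """Yield (antecedent, consequent, support, confidence) for rules meeting min_conf."""
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[lhs]
                if conf >= min_conf:
                    yield lhs, itemset - lhs, count / n_transactions, conf

# Hypothetical market baskets.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]
freq = apriori(baskets, min_count=2)
found = list(rules(freq, len(baskets), min_conf=0.7))
```

On these five baskets the only rules that reach a confidence of 0.7 are diaper => beer (3 of the 4 diaper baskets) and beer => diaper (all 3 beer baskets), both with a support of 3/5.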

2.3.2 Multi-dimensional Association Rules

The concept of multi-dimensional association rules was first proposed by H. Zhu
[22] to describe associations between data values in a data warehouse, where the
data schema is composed of multiple dimensions and each dimension may contain many
attributes. Following the work in [22], we can divide multi-dimensional association
rules into three types:

1. Inter-dimensional association rule: This is an association among a set of
dimensions. For example, suppose an OLAP cube is composed of three
dimensions, Product, Supplier and Customer, and its data are listed in Table
2.1. An inter-dimensional association rule is:

Supplier (“Hong Kong”), Product (“Sport Wear”) => Customer (“John”)

2. Intra-dimensional association rule: This is an association among items
coming from one dimension. From Table 2.1, a possible intra-dimensional
association rule is:

Product (“Sport Wear”) => Product (“Tents”)

3. Hybrid association rule: This is an association among a set of dimensions
in which some items come from the same dimension. It can be regarded as a
combination of inter-dimensional and intra-dimensional associations.
According to Table 2.1, a hybrid association rule is:

Product (“Sport Wear”), Supplier (“Hong Kong”) => Product (“Tents”)

Table 2.1. A relational representation of OLAP cube

Supplier    Product           Customer    Count
HongKong    Sport Wear        John        30
HongKong    Sport Wear        Mary        10
HongKong    Water Purifier    John        30
Mexico      Alert Devices     Peter       20
Mexico      Carry Bags        Peter       85
Mexico      Carry Bags        Bill        25
Mexico      Tents             Sue         25
Mexico      Tents             Mary        20
Seattle     Carry Bags        John        100
Seattle     Sport Wear        Peter       20
Seattle     Sport Wear        John        40
Seattle     Water Purifier    Bill        25
Tokyo       Carry Bags        Sue         10
Tokyo       Sport Wear        Bill        20
Tokyo       Tents             Sue         20
Tokyo       Alert Devices     John        20

2.4 Related Work

2.4.1 Data Cube

The concept of the data cube was first proposed by Gray et al. [6]. It allows
analysts to view the data stored in a data warehouse from various aspects and to employ

multidimensional analysis. Each cell in a data cube represents the measured value. For

example, consider a sales data cube with three dimensions, Product, Supplier,

Customer, and one measure value, Sales_total. This cube is depicted in Figure 2.6 and

can be expressed as a SQL query as follows:

SELECT Product, Supplier, Customer, SUM(Sales) AS Sales_total
FROM Sales_Fact
GROUP BY Product, Supplier, Customer;

Figure 2.6 An example of data cube
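A data cube in the sense of Gray et al. [6] comprises one group-by for every subset of the dimensions, of which the query above is the most detailed. The following sketch enumerates all of them in Python. It uses the rows of Table 2.1 as stand-in fact data, which is our assumption (the thesis does not state that Sales_Fact holds these rows), and the helper names group_by and full_cube are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("Supplier", "Product", "Customer")
# Rows of Table 2.1: (Supplier, Product, Customer, Count).
ROWS = [
    ("HongKong", "Sport Wear", "John", 30),
    ("HongKong", "Sport Wear", "Mary", 10),
    ("HongKong", "Water Purifier", "John", 30),
    ("Mexico", "Alert Devices", "Peter", 20),
    ("Mexico", "Carry Bags", "Peter", 85),
    ("Mexico", "Carry Bags", "Bill", 25),
    ("Mexico", "Tents", "Sue", 25),
    ("Mexico", "Tents", "Mary", 20),
    ("Seattle", "Carry Bags", "John", 100),
    ("Seattle", "Sport Wear", "Peter", 20),
    ("Seattle", "Sport Wear", "John", 40),
    ("Seattle", "Water Purifier", "Bill", 25),
    ("Tokyo", "Carry Bags", "Sue", 10),
    ("Tokyo", "Sport Wear", "Bill", 20),
    ("Tokyo", "Tents", "Sue", 20),
    ("Tokyo", "Alert Devices", "John", 20),
]

def group_by(rows, dims):
    """SUM(Count) ... GROUP BY dims, returned as {key tuple: total}."""
    idx = [DIMS.index(d) for d in dims]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[3]
    return dict(out)

def full_cube(rows):
    """All 2^n group-bys, from the most detailed down to the grand total."""
    return {
        dims: group_by(rows, dims)
        for r in range(len(DIMS), -1, -1)
        for dims in combinations(DIMS, r)
    }

cube = full_cube(ROWS)
```

With three dimensions the cube holds eight group-bys; cube[()] contains the single grand-total cell, and cube[("Supplier",)] gives the totals per supplier.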

2.4.2 Cube Selection Problem

In order to accelerate query processing, it is important to select the most
suitable cubes to materialize. In general, there are three options:

1. Materialize all data cubes: This method gives the lowest query time but
needs the largest storage space, because every cube has to be materialized.

2. Materialize nothing: This method needs the least storage space but gives
the longest query time, because no cube is materialized.

3. Materialize a fraction of the data cubes: This method selects a part of the
data cubes to materialize. Selecting the most suitable cubes under a space
constraint is difficult, however; indeed, it has been proved to be an
NP-hard problem [9].

According to the above discussion, materializing all data cubes gives the best
query performance, but the space limit of the data warehouse usually prevents this.
On the other hand, if we materialize nothing, queries cost too much time. Therefore,
we should try to select the most suitable cubes to materialize, even though this is
an NP-hard problem. A substantial body of work in the literature addresses this
problem, and it can be classified into three main categories:

1. Heuristic method: This category is mainly based on the greedy paradigm.
Harinarayan et al. [9] were the first to consider the problem of
materialized view selection for supporting multidimensional analysis in
OLAP. They proposed a lattice model and provided a greedy algorithm to
solve this problem. Gupta et al. [7] further extended their work to include
index selection. Ezeife [5] also considered the same problem but proposed
a uniform approach using a more detailed cost model. Shukla et al. [17]
proposed a modified greedy algorithm that selects cubes according to cube
size only. Their algorithm was shown to have the same quality as
Harinarayan’s greedy method while being more efficient.

2. Exhaustive method: The work in [19] assumed that all queries should be
answered solely by the materialized views, with or without rewriting the
users’ queries. The authors modeled the problem as a state space optimization
problem, and provided exhaustive and heuristic algorithms without concern
for the storage constraint. Soutyrina and Fotouhi [18] proposed a dynamic
programming algorithm to solve the problem, which can yield the optimal
set of cubes.

3. Genetic method: Some work has been devoted to applying genetic
algorithms to the view selection problem [10, 20, 21]. Following the AND-OR
view graph used in [7], Horng et al. [10] proposed a genetic algorithm
to select the appropriate set of views to minimize the query cost and view
maintenance cost. A similar genetic algorithm with a different repair
scheme is proposed in [13], which uses a greedy repair method to correct
infeasible solutions instead of using a penalty function to reduce their
fitness. Research has shown that the repair scheme deals with infeasible
solutions better than the penalty function does [16]. Rather than optimizing
the view selection for a given query processing plan, the work in [20, 21]
focuses on finding an optimal set of processing plans for multiple queries.
A solution in their genetic algorithm thus represents a set of processing
plans for the given queries.
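To make the greedy paradigm of the first category concrete, the sketch below selects views on a small lattice by repeatedly adding the view with the largest benefit. It follows the general idea of benefit-driven greedy selection in the spirit of [9], but it is our own simplified rendering: the cost of answering a query is taken to be the row count of the smallest selected view that can answer it, the view sizes are made up, and the name greedy_select is hypothetical.

```python
def greedy_select(sizes, answers, top, k):
    """Pick up to `k` views besides `top`, maximizing the benefit at each step.

    sizes[v]   : row count of view v (assumed cost of answering a query from v)
    answers[v] : set of views whose queries v can answer (including v itself)
    """
    selected = {top}
    # Current cost of each view: size of its cheapest selected ancestor.
    cost = {v: sizes[top] for v in sizes}
    for _ in range(k):
        best, best_gain = None, 0
        for v in sizes:
            if v in selected:
                continue
            gain = sum(max(0, cost[w] - sizes[v]) for w in answers[v])
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:          # no remaining view reduces the cost
            break
        selected.add(best)
        for w in answers[best]:
            cost[w] = min(cost[w], sizes[best])
    return selected, sum(cost.values())

# Hypothetical lattice over Product (P), Supplier (S), Customer (C);
# "" stands for the fully aggregated view. Sizes are invented.
sizes = {"PSC": 100, "PS": 50, "PC": 75, "SC": 20, "P": 30, "S": 10, "C": 15, "": 1}
# A view can answer every view whose dimensions are a subset of its own.
answers = {v: {w for w in sizes if set(w) <= set(v)} for v in sizes}

chosen, total = greedy_select(sizes, answers, top="PSC", k=2)
```

With these sizes the first pick is SC (benefit 320) and the second is PS (benefit 100), which lowers the total cost of answering all eight views from 800 to 380. This variant bounds the number of views; a space-constrained variant would compare benefit per unit of space instead.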


Chapter 3

The OMARS Framework

In this chapter, we give a brief review of the OMARS framework, because
our research deals with the problem of selecting the most suitable OLAM cubes
to materialize in this system.

The OMARS framework, as illustrated in Figure 3.1, integrates the data warehouse,
on-line analytical processing, and the OLAM cube. Its objective is to provide an
efficient and convenient platform that allows users to perform OLAP-like association
explorations. Through the OMARS system, users can perform multidimensional
association mining queries, interactively change the dimensions that comprise the
associations, and refine constraints such as minimum support and minimum
confidence. The functionality of each component is described in the following sections.

Figure 3.1. The OMARS framework [15].

[Figure 3.1 components: Data Warehouse, OLAP Cube, OLAM Cube, Auxiliary Cube, Cube Manager, OLAM Mediator, OLAM Engine.]

3.1 OLAM Cube and Auxiliary Cube

The OLAM cube is a new concept proposed by Lin et al. [15]. It stores the
frequent itemsets whose supports are greater than or equal to a preset minimum
support, denoted as prims. In this regard, the OLAM cube can be regarded as an
extension of the iceberg cube. The main difference is that the iceberg cube stores
only the frequent itemsets derived from inter-dimensional associations, while the
OLAM cube is applicable to all three types of association. When the minsup of a
user’s query is greater than or equal to prims, the OLAM cube accelerates the
mining of association rules, because it already stores the frequent itemsets with
supports greater than or equal to prims.

Although the OLAM cube can be used to generate association rules efficiently
when minsup is greater than or equal to prims, it cannot handle the situation in
which minsup is lower than prims. To alleviate this problem, the OMARS system
embraces another type of data cube, called the auxiliary cube. The auxiliary cube
stores the infrequent itemsets of length K, where K denotes the cutting level
employed by the mining algorithm CBWon used in OMARS.

3.2 Cube Manager

This component is responsible for three different tasks:

1. Cube selection: This task selects the most suitable cubes to materialize,
in order to minimize the query cost and/or maintenance cost under the
constraint of limited storage space.

2. Cube computation: This task efficiently generates the set of materialized
cubes chosen by the cube selection module.

3. Cube maintenance: This task maintains the materialized cubes when the data
in the data warehouse are updated.

Our research in this thesis deals with the implementation of the cube selection
task of Cube Manager. We will discuss this in the next chapter.

3.3 OLAM Mediator and OLAM Engine

OLAM Engine is an interface between the OMARS system and the users. It

accepts user’s queries and invokes the appropriate algorithm to mine

multidimensional association rules.

When OLAM Engine receives a user’s query, it analyzes the query and forwards
the relevant information to OLAM Mediator, which then looks for the most relevant
cube and returns the result to OLAM Engine. Here the most relevant cube denotes
the materialized OLAM cube that can answer the query at the smallest cost. There
are two possible outcomes of the search performed by OLAM Mediator, and each is
handled in a different way.

1. OLAM Mediator finds the most relevant cube: In this case, OLAM Mediator
further compares the minsup of the user’s query with prims, and proceeds
according to the following two cases:

i. minsup ≥ prims: The discovered OLAM cube is capable of answering
the query. Return this cube to OLAM Engine.

ii. minsup < prims: The discovered OLAM cube cannot answer the query
without the aid of the auxiliary cube. Return the OLAM cube and its
accompanying auxiliary cube to OLAM Engine.

2. OLAM Mediator cannot find such a cube: In this case, OLAM Mediator
searches the OLAP cube repository to determine whether there is an OLAP cube
whose data can be used to answer the query. If so, it returns the discovered
OLAP cube to OLAM Engine; otherwise, it notifies OLAM Engine to execute the
mining procedure from the data warehouse afresh.

We will discuss the above cases in more detail and derive the cost evaluation
of each case in Chapter 5.
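The case analysis above amounts to a small decision procedure, sketched below. The Lookup record, the mediate function, the returned labels and the "aux:" naming of the auxiliary cube are hypothetical names introduced only for this illustration; they are not part of the OMARS interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Lookup:
    """Result of the Mediator's search (hypothetical structure)."""
    olam_cube: Optional[str] = None   # most relevant materialized OLAM cube, if any
    olap_cube: Optional[str] = None   # usable OLAP cube, if any

def mediate(minsup: float, prims: float, found: Lookup):
    """Return (source kind, resources handed to the OLAM Engine)."""
    if found.olam_cube is not None:
        if minsup >= prims:
            # Case 1.i: the OLAM cube alone answers the query.
            return ("olam", [found.olam_cube])
        # Case 1.ii: below the preset threshold the auxiliary cube is also needed.
        return ("olam+auxiliary", [found.olam_cube, "aux:" + found.olam_cube])
    if found.olap_cube is not None:
        # Case 2, first branch: fall back to an OLAP cube.
        return ("olap", [found.olap_cube])
    # Case 2, second branch: mine from the data warehouse afresh.
    return ("warehouse", [])
```

For example, mediate(0.05, 0.02, Lookup(olam_cube="c1")) returns ("olam", ["c1"]), while the same lookup with minsup = 0.01 also returns the auxiliary cube.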


Chapter 4

Problem Formulation

In this chapter, we first elaborate the correspondence between OLAM query and

OLAM cube, and describe the concept of OLAM lattice. After this, we will define the

problem of OLAM cube selection.

4.1 OLAM Cube and OLAM Query

As described in Chapter 3, OLAM cube is used to store frequent itemsets, aiming

at accelerating the process of mining association rules. To clarify the structure of

OLAM cube and its relationship between multidimensional associations, we first

introduce a four-tuple mining meta-pattern to specify the form of multidimensional

association query. The definition is as follows:

Definition 4.1. Suppose a star schema S contains a fact table and m dimension
tables {D1, D2, …, Dm}. Let T be a jointed table from S composed of attributes
a1, a2, …, ar, such that for all ai, aj ∈ Attr(Dk), there is no hierarchical
relation between ai and aj, 1 ≤ i, j ≤ r, 1 ≤ k ≤ m. Here Attr(Dk) denotes the
attribute set of dimension table Dk. A meta-pattern of multidimensional
associations from T is defined as follows:

MP: < tG, tM, ms, mc >,

where ms denotes the minimum support, mc the minimum confidence, tG the group of
transaction attributes, and tM the group of item attributes, with
tG, tM ⊆ {a1, a2, …, ar} and tG ∩ tM = ∅.

The above-mentioned meta-form specification of multidimensional association

queries can present three different multidimensional association rules defined in [22],

intra-association, inter-association, and hybrid association.

For example, consider a jointed table T involving three dimensions from the star

schema in Figure 2.3. The content of T is shown in Table 4.1. If the item attribute set

tM consists of only one attribute, then the meta pattern corresponds to an intra-

association.

Table 4.1. A jointed table T from star schema

Tid    City       Education      Date    Month    Product_ID    Category
1      Taipei     Bachelor       7/12    July     1             A
2      Taipei     High school    7/12    July     2             A
3      N.Y.       Master         7/18    July     1             A
4      Toronto    Master         8/2     Aug.     3             B
5      Seattle    Master         8/3     Aug.     4             B
6      N.Y.       High school    8/2     Aug.     1             A
7      Toronto    High school    7/4     July     1             A
8      Seattle    Bachelor       7/18    July     5             C
9      Taipei     Bachelor       8/2     Aug.     2             A
10     N.Y.       Bachelor       9/1     Sep.     3             B

For instance, let tG = {City}, tM = {Category}. We may have the following
intra-association rule:

(Category, “A”) => (Category, “B”) (sup = 40%, conf = 80%)

Note that to facilitate this mining task, the table T has to be, implicitly or
explicitly, transformed into a transaction table as follows:

City       Category
Taipei     A
N.Y.       A, B
Toronto    A, B
Seattle    B, C

On the other hand, if |tM| ≥ 2, then the resulting associations will be
inter-association or hybrid association. For example, let tG = ∅, tM = {Education, Month}.
We have an inter-association:

(Education, “Master”) => (Month, “July”) (sup = 40%, conf = 80%)

Like intra-association, the table T has to be transformed into the following form:

Tid Education Month

1 Bachelor July

2 High school July

3 Master July

4 Master Aug.

5 Master Aug.

6 High School Aug.

7 High School July

8 Bachelor July

9 Bachelor Aug.

10 Bachelor Sep.

Note that in this case, each tuple of the original table T forms one transaction.
But if tG = {City}, we will have a hybrid association:

(Education, “Master”), (Month, “July”) => (Month, “Aug.”) (sup = 40%, conf = 80%)

For this case, the transformed table will be:

City       Education                        Month
Taipei     Bachelor, High School            July, Aug.
N.Y.       Master, High School, Bachelor    July, Aug., Sep.
Toronto    Master, High School              Aug., July
Seattle    Master, Bachelor                 Aug., July

After explaining the mining patterns, we will clarify the structure of OLAM

Cube.

Definition 4.2. Given a meta-pattern MP with transaction attribute set tG and item
attribute set tM, and a preset minimum support, prims, the corresponding OLAM cube,
MCube(tG, tM), is the set of frequent itemsets with supports greater than or equal
to prims.

The following examples illustrate the corresponding OLAM cube for different

kinds of multidimensional association rules.

Example 4.1. An intra-dimensional OLAM Cube: Let tG = {City}, tM = {Category},

and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table 4.2.

Table 4.2. An example of intra OLAM cube expressed in table

Category    Support
A           3
B           3
A, B        2

Example 4.2. An inter-dimensional OLAM cube: Let tG = ∅, tM = {Education,
Month}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table
4.3.

Table 4.3. An example inter-dimensional OLAM cube expressed in table

Education      Month    Support
Bachelor       -        4
High school    -        3
Master         -        3
-              July     5
-              Aug.     4
Bachelor       July     2
High school    July     2
Master         Aug.     2

Example 4.3. A hybrid-dimensional OLAM cube: Let tG = {City}, tM = {Education,

Month}, and prims = 3. From Table 4.1, the resulting OLAM cube is shown in Table

4.4.

Table 4.4. An example hybrid-dimensional OLAM cube expressed in table

Education      Month         Support
Bachelor       -             3
High school    -             3
Master         -             3
-              July          4
-              Aug.          4
-              July, Aug.    4
Bachelor       July          3
Bachelor       Aug.          3
High school    July          3
High school    Aug.          3
Master         July          3
Master         Aug.          3
Bachelor       July, Aug.    3
High school    July, Aug.    3
Master         July, Aug.    3
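The three example cubes can be reproduced directly from Table 4.1 with the short sketch below. It groups the tuples on tG (or on Tid when tG is empty), treats every (attribute, value) pair of the tM attributes as an item, and counts itemsets by brute force, keeping those whose support count is at least prims. The function names to_transactions and mcube are hypothetical, and the exhaustive enumeration is chosen for brevity; it is not the mining algorithm used in OMARS.

```python
from itertools import combinations

COLS = ("Tid", "City", "Education", "Date", "Month", "Product_ID", "Category")
TABLE_4_1 = [
    (1, "Taipei", "Bachelor", "7/12", "July", 1, "A"),
    (2, "Taipei", "High school", "7/12", "July", 2, "A"),
    (3, "N.Y.", "Master", "7/18", "July", 1, "A"),
    (4, "Toronto", "Master", "8/2", "Aug.", 3, "B"),
    (5, "Seattle", "Master", "8/3", "Aug.", 4, "B"),
    (6, "N.Y.", "High school", "8/2", "Aug.", 1, "A"),
    (7, "Toronto", "High school", "7/4", "July", 1, "A"),
    (8, "Seattle", "Bachelor", "7/18", "July", 5, "C"),
    (9, "Taipei", "Bachelor", "8/2", "Aug.", 2, "A"),
    (10, "N.Y.", "Bachelor", "9/1", "Sep.", 3, "B"),
]

def to_transactions(rows, t_g, t_m):
    """Group rows on t_g (or on Tid when t_g is empty); items are (attribute, value) pairs."""
    key_cols = [COLS.index(a) for a in (t_g or ("Tid",))]
    item_cols = [COLS.index(a) for a in t_m]
    groups = {}
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        groups.setdefault(key, set()).update((COLS[i], row[i]) for i in item_cols)
    return groups

def mcube(rows, t_g, t_m, prims):
    """Frequent itemsets (support count >= prims), found by brute-force enumeration."""
    counts = {}
    for t in to_transactions(rows, t_g, t_m).values():
        for r in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), r):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s: n for s, n in counts.items() if n >= prims}

intra = mcube(TABLE_4_1, ("City",), ("Category",), prims=2)             # Table 4.2
inter = mcube(TABLE_4_1, (), ("Education", "Month"), prims=2)           # Table 4.3
hybrid = mcube(TABLE_4_1, ("City",), ("Education", "Month"), prims=3)   # Table 4.4
```

The resulting dictionaries contain 3, 8 and 15 itemsets respectively, matching the numbers of rows in Tables 4.2, 4.3 and 4.4.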


4.2 OLAM Lattice

In accordance with the definition of OLAM cube, we can generate all possible

OLAM cubes from the star schema, thereby forming an OLAM lattice. In order to

provide hierarchical navigation and multidimensional exploration, the OMARS

system [15] models the OLAM lattice as a three-layer structure. The first layer lattice

expresses the combination of all dimensions. The second layer further exploits inter-

attribute combinations for each dimensional combination in the first layer lattice. The

third layer exploits all OLAM cubes corresponding to the meta-patterns derived from

each subcube in the second layer. Note that the real OLAM cubes are stored in the

third layer.

For example, consider the star schema illustrated in Figure 2.3. The first layer
lattice, shown in Figure 4.1, is composed of eight possible dimensional combinations.
After constructing the first layer lattice, we choose the node composed of the
“customer” and “time” dimensions, and extend it to form the second layer lattice
shown in Figure 4.2. Each node of the second layer lattice is constructed by
attaching attributes chosen from the selected dimensions. Finally, we extend the cube
<(city, education), (date)> to form the third layer lattice shown in Figure 4.3. It
can be observed that there is one OLAM cube corresponding to inter-association,
(city, education, date); three OLAM cubes corresponding to hybrid associations,
(date*, city, education), (education*, city, date) and (city*, education, date); and
three cubes corresponding to intra-associations, (education*, date*, city),
(city*, date*, education) and (city*, education*, date).

Note that (city*, education*, date*) is shown only to complete the lattice
structure; it is useless and will not be materialized.
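The eight nodes of this third layer can be enumerated mechanically, as in the sketch below. It reads a starred attribute as a transaction attribute (a member of tG) and an unstarred one as an item attribute (a member of tM), which is our reading of the notation, and it classifies each node by the rules of Section 4.1; the function name third_layer is hypothetical.

```python
from itertools import combinations

def third_layer(attributes):
    """Enumerate (t_G, t_M, kind) for every split of `attributes` into t_G and t_M."""
    nodes = []
    for r in range(len(attributes) + 1):
        for t_g in combinations(attributes, r):
            t_m = tuple(a for a in attributes if a not in t_g)
            if not t_m:
                kind = "not materialized"   # every attribute is a transaction attribute
            elif len(t_m) == 1:
                kind = "intra"              # a single item attribute
            elif not t_g:
                kind = "inter"              # no grouping; each tuple is a transaction
            else:
                kind = "hybrid"
            nodes.append((t_g, t_m, kind))
    return nodes

nodes = third_layer(("city", "education", "date"))
kinds = [kind for _, _, kind in nodes]
```

For the three attributes of the example this yields one inter-association cube, three hybrid cubes, three intra-association cubes and the single node that is not materialized.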

Figure 4.1. The 1st layer OLAM lattice for the example star schema in Figure 2.3

Figure 4.2. The 2nd layer lattice derived from <customer, time, -> in the 1st layer

Figure 4.3. The 3rd layer lattice derived from the subcube <(city, education), date> in the 2nd layer

Because the real OLAM cubes are stored in the third layer lattice, we can mine
multidimensional association rules efficiently by materializing these OLAM cubes.
From this three-layer lattice, we observe an attribute dependency, stated as
follows:

Proposition 4.1. Consider two OLAM cubes, MCube(tG, tM) and MCube(tG', tM').
If tG = tG' and tM ⊆ tM', then every itemset in MCube(tG, tM) must be a subset of
an itemset in MCube(tG', tM'), and these two itemsets have the same support value.

Example 4.4. Consider the table T in Table 4.1. Let MCube(tG', tM') be the cube
illustrated in Table 4.4 and MCube(tG, tM) that illustrated in Table 4.5. Hence
tG = tG' = {City}, tM = {Education}, tM' = {Education, Month}, and prims = 3. It can
be verified that every frequent itemset stored in MCube(tG, tM) is a subset of a
frequent itemset in MCube(tG', tM'), and both itemsets have the same support value.

Table 4.5. An OLAM cube MCube({City}, {Education})

Education      Support
Bachelor       3
High school    3
Master         3

According to Proposition 4.1, we know there is a dependency between OLAM
cubes in the third layer lattice, which is formalized below.

Definition 4.3. Consider two OLAM cubes, MCube(tG, tM) and MCube(tG', tM').
We say that MCube(tG, tM) is dependent upon MCube(tG', tM') if tG = tG' and
tM ⊆ tM', and this is denoted as MCube(tG, tM) ≼ MCube(tG', tM').

One important aspect of Definition 4.3 is that if MCube(tG, tM) ≼ MCube(tG', tM'),
then all multidimensional queries that can be answered via MCube(tG, tM)
can also be answered via MCube(tG', tM').

Furthermore, it should be noted that not all of the OLAM cubes derived in the
lattice have to be materialized and stored, because the concept hierarchies defined
over the attributes in the star schema make it possible to prune some redundant
cubes.

Consider an OLAM cube, MCube(tG, tM). We observe that there are two
different types of redundancy.

Proposition 4.2. Schema redundancy: Let ai, aj ∈ tG. If ai and aj are in the same
dimension and aj is an ancestor of ai, then MCube(tG, tM) is a redundancy of cube
MCube(tG - {aj}, tM).

Example 4.5. Consider the jointed table in Table 4.1. Let tM = {Category}. The

resulting table by grouping “Date” and “Month” as transaction attributes is shown in

Table 4.6. Note that this table has the same transactions as that obtained by grouping

“Date” as transaction attribute, as shown in Table 4.7. Thus, the resulting cube

MCube({Date, Month}, {Category}) is the same as MCube({Date}, {Category}).

Table 4.6. The resulting table by grouping {Date, Month} as transaction attributes for Table 4.1

Date Month Category

7/4 July A

7/12 July A

7/18 July A, C

8/2 Aug. A, B

8/3 Aug. B

9/1 Sep. B

Table 4.7. The resulting table by grouping {Date} as transaction attribute for Table 4.1

Date Category

7/4 A

7/12 A

7/18 A, C

8/2 A, B

8/3 B

9/1 B

Proposition 4.3. Values redundancy: Let ai, aj ∈ tM. If ai and aj are in the same
dimension and aj is an ancestor of ai, then MCube(tG, tM) is a cube with values
redundancy.

Example 4.6. Consider the joined table in Table 4.1. Let tG = {City}, tM = {Date, Month}, and prims = 2. The resulting OLAM cube is shown in Table 4.8. One can observe that the tuples marked with dotted lines in this table are redundant patterns. Therefore, this cube exhibits values redundancy. Note that when values redundancy holds, the redundant patterns must be pruned during the generation of frequent itemsets.

Table 4.8. The resulting OLAM cube MCube({City}, {Date, Month})

Date   Month        support
7/18   -            2
8/2    -            3
-      July         4
-      Aug.         4
7/18   July         2
7/18   Aug.         2
8/2    July         3
8/2    Aug.         3
-      July, Aug.   4
7/18   July, Aug.   2
8/2    July, Aug.   3

In addition to the above observations, we observe that an OLAM cube is useless if it satisfies the following property.

Proposition 4.4. Useless property: Let ai ∈ tG and tM = {aj}. If ai and aj are in the same dimension and aj is an ancestor of ai, then MCube(tG, tM) is a useless cube.

Example 4.7. Let tG = {City, Date} and tM = {Month}. The table obtained from Table 4.1 by grouping {City, Date} as transaction attributes is shown in Table 4.9. One can observe that the cardinality of every transaction is 1. Therefore, no association rule can be found from this table.

Table 4.9. The resulting table by grouping {City, Date} as transaction attribute for Table 4.1

City      Date   Month
Toronto   7/4    July
Taipei    7/12   July
Taipei    8/2    Aug.
N.Y.      7/18   July
N.Y.      8/2    Aug.
Toronto   8/2    Aug.
Seattle   8/3    Aug.
N.Y.      9/1    Sep.
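The three pruning conditions of Propositions 4.2 to 4.4 can be checked mechanically once the concept hierarchies are known. The following is a minimal sketch, assuming a hypothetical `HIERARCHY` table that maps each attribute to a (dimension, level) pair, where a higher level means an ancestor; the table contents and the function names are illustrative and are not part of OMARS.

```python
# Hypothetical concept hierarchy: attribute -> (dimension, level).
# Within one dimension, a higher level is an ancestor of a lower level.
HIERARCHY = {
    "Date": ("Time", 0), "Month": ("Time", 1),
    "City": ("Customer", 0),
    "Product_ID": ("Product", 0), "Category": ("Product", 1),
}

def is_ancestor(anc, desc):
    """True if `anc` is a strict ancestor of `desc` in the same dimension."""
    dim_a, lvl_a = HIERARCHY[anc]
    dim_d, lvl_d = HIERARCHY[desc]
    return dim_a == dim_d and lvl_a > lvl_d

def has_ancestor_pair(attrs):
    """True if some attribute in `attrs` is an ancestor of another one."""
    return any(is_ancestor(a, b) for a in attrs for b in attrs)

def classify_cube(t_g, t_m):
    """Return which of Propositions 4.2-4.4 apply to MCube(t_g, t_m)."""
    labels = []
    if has_ancestor_pair(t_g):                      # Proposition 4.2
        labels.append("schema redundancy")
    if has_ancestor_pair(t_m):                      # Proposition 4.3
        labels.append("values redundancy")
    if len(t_m) == 1:                               # Proposition 4.4
        (a_j,) = tuple(t_m)
        if any(is_ancestor(a_j, a_i) for a_i in t_g):
            labels.append("useless")
    return labels
```

Applied to the cubes of Examples 4.5 to 4.7, the sketch flags MCube({Date, Month}, {Category}) for schema redundancy, MCube({City}, {Date, Month}) for values redundancy, and MCube({City, Date}, {Month}) as useless.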

4.3 OLAM Cube Selection

We now proceed to give a formal definition of the OLAM cube selection

problem. To this end, we introduce symbols as shown in Table 4.10.

Assume that an OLAM lattice contains n OLAM data cubes

, the set of users queries is , the set of query

frequencies is , and the space constraint is . The OLAM cube

selection problem is denoted as a five-tuple . A solution to is a

subset of D, say M, that can minimize the following cost function subject to constraint

,

.

Table 4.10. The Symbol Table

Symbol Definition

Lattice

Set of data cubes

nth data cube

Set of user queries

mth user query

Set of user query frequencies

Frequency of the ith query

Space constraint

Set of materialized cubes

The total time to respond to the ith query using the materialized cubes
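Based on the entries of the symbol table, the objective of the selection problem is a frequency-weighted sum of query response times, minimized over subsets of cubes that fit in the available space. The sketch below makes that reading concrete with an exhaustive search over a tiny instance. All names (`total_query_cost`, `response_time`, `best_subset`) are hypothetical, and the exhaustive search is only meant to illustrate the objective; it is not one of the selection methods studied in this thesis.

```python
from itertools import combinations

def total_query_cost(freqs, response_time, materialized):
    """Frequency-weighted total response time over all queries."""
    return sum(f * response_time(i, materialized) for i, f in enumerate(freqs))

def is_feasible(materialized, sizes, space_limit):
    """True if the chosen cubes fit within the space constraint."""
    return sum(sizes[d] for d in materialized) <= space_limit

def best_subset(cubes, sizes, freqs, response_time, space_limit):
    """Try every feasible subset of cubes and keep the cheapest one."""
    best = frozenset()
    best_cost = total_query_cost(freqs, response_time, best)
    for r in range(1, len(cubes) + 1):
        for combo in combinations(cubes, r):
            m = frozenset(combo)
            if not is_feasible(m, sizes, space_limit):
                continue
            cost = total_query_cost(freqs, response_time, m)
            if cost < best_cost:
                best, best_cost = m, cost
    return best, best_cost
```

The number of subsets grows exponentially with the number of cubes, which is why heuristic selection methods are of interest.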

Chapter 5

Evaluation of OLAM Query Cost

5.1 Query Evaluation Flow

As stated previously, the primary task of the OLAM Engine is to generate association rules according to users’ queries. After receiving a query, the OLAM Engine analyzes it, transfers the necessary information to the OLAM Mediator, and then waits for the best matching cube from the OLAM Mediator. When the OLAM Mediator receives the query information from the OLAM Engine, it looks for the best matching cube. First, the OLAM Mediator searches for the required OLAM cube. If one is found, it further checks whether minsup ≥ prims. If so, it returns the found OLAM cube to the OLAM Engine; otherwise, it returns the corresponding auxiliary cube of the found OLAM cube and notifies the OLAM Engine to perform association mining from the data warehouse with the aid of this auxiliary cube. On the other hand, if the OLAM Mediator cannot find any qualified OLAM cube to answer the user query, it notifies the OLAM Engine to perform association mining from the data warehouse afresh.

The above described procedure employed by OLAM Mediator is depicted in

Figure 5.1.

Figure 5.1 The flow diagram of OLAM query

One thing worth mentioning is that, for simplicity, we do not consider OLAP cubes in this study, although the OMARS system does take this kind of data cube into account in association mining.

In accordance with the work flow of OLAM Mediator and OLAM Engine, our

paradigm for evaluating OLAM query cost is shown below:

Procedure Evacost_OLAMQ(q)
begin
  Let q = <tG, tM, minsup>;
  found = OLAMQ_search(q, CQ);
  if found = TRUE then
    if prims ≤ minsup then
      cost = the cost for evaluating query q using OLAM cube CQ.MCube; /* case 1 */
    else
      cost = the cost for evaluating query q using CQ.MCube, auxiliary cube CQ.XCube, and data warehouse; /* case 2 */
    end if
  else
    cost = the cost for evaluating query q using data warehouse; /* case 3 */
  end if
  return cost;
end

Figure 5.2. The procedure to compute the cost of user’s query

In summary, there are three different cases to be dealt with:

Case 1: evaluating the cost via the qualified OLAM cube.

Case 2: evaluating the cost via OLAM cube, auxiliary cube, and data

warehouse.

Case 3: evaluating the cost via data warehouse.

The cost complexity evaluation for each case will be elaborated in the following

sections. We end this section with the description of OLAMQ_search.

Procedure OLAMQ_search(q, CQ)
begin
  found = FALSE;
  if MCube(q.tG, q.tM) is materialized then
    CQ.MCube = MCube(q.tG, q.tM);
    CQ.XCube = XCube(q.tG, q.tM);
    found = TRUE;
  end if
  CurQ = ∅;
  for each MCube in the OLAM lattice do
    if MCube is materialized and MCube.tG = q.tG and MCube.tM ⊇ q.tM
       and (MCube.tM ⊂ CurQ.tM or CurQ = ∅) then
      CurQ = MCube;
    end if
  end for
  if CurQ ≠ ∅ then
    CQ.MCube = CurQ;
    CQ.XCube = XCube(q.tG, CurQ.tM);
    found = TRUE;
  end if
  return found;
end

Figure 5.3. Procedure OLAMQ_search
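The search for a matching cube and the resulting three-way case split can be sketched as follows. This is a minimal illustration under two assumptions: a cube qualifies when it has the same transaction attributes as the query and its mining attributes contain those of the query, and among qualifying cubes the one with the smallest mining attribute set is preferred. The `Cube` type and both function names are hypothetical.

```python
from typing import FrozenSet, Iterable, NamedTuple, Optional

class Cube(NamedTuple):
    t_g: FrozenSet[str]   # transaction attributes
    t_m: FrozenSet[str]   # mining attributes

def olamq_search(q_tg, q_tm, materialized: Iterable[Cube]) -> Optional[Cube]:
    """Return the qualifying materialized cube with the smallest t_m, if any."""
    cur = None
    for cube in materialized:
        if cube.t_g == q_tg and cube.t_m >= q_tm:       # superset test
            if cur is None or cube.t_m < cur.t_m:       # proper subset test
                cur = cube
    return cur

def evaluation_case(q_tg, q_tm, minsup, materialized, prims):
    """Return 1, 2, or 3 according to the case split of Evacost_OLAMQ."""
    cube = olamq_search(frozenset(q_tg), frozenset(q_tm), materialized)
    if cube is None:
        return 3          # mine from the data warehouse afresh
    if minsup >= prims:
        return 1          # the OLAM cube alone answers the query
    return 2              # OLAM cube, auxiliary cube, and data warehouse
```

With three materialized cubes MCube({City}, {Education, Date}), MCube({City}, {Education, Date, Category}), and MCube({Date}, {City}) and prims = 3, a query on ({Date}, {City, Education}) finds no qualifying cube and falls into case 3.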

Example 5.1. Suppose the OMARS system stores the following three materialized OLAM cubes: MCube(tG1, tM1), where tG1 = {City} and tM1 = {Education, Date}; MCube(tG2, tM2), where tG2 = {City} and tM2 = {Education, Date, Category}; and MCube(tG3, tM3), where tG3 = {Date} and tM3 = {City}; and prims = 3. We have three user queries q1, q2, and q3, where q1 has tG = {City}, tM = {Education, Date}, and minsup = 4; q2 has tG = {City}, tM = {Education, Date, Category}, and minsup = 2; and q3 has tG = {Date}, tM = {City, Education}, and minsup = 3.

According to the above three queries, we have three conditions listed as follows:

1. When the user’s query is q1, the condition is the same as Case 1 described above. Because the corresponding OLAM cube can be found in the OMARS system, and the minsup of the user’s query is higher than prims, we can use MCube(tG1, tM1) to answer the user’s query immediately.

2. When the user’s query is q2, the condition is the same as Case 2 described above. Because the minsup of the user’s query is lower than prims, we need to utilize the corresponding auxiliary cube of the found OLAM cube MCube(tG2, tM2) and the data warehouse to answer query q2.

3. When the user’s query is q3, the condition is the same as Case 3 described above. Because we cannot find any matching OLAM cube in the OMARS system, we must use the data warehouse to answer query q3.

5.2 Cost Evaluation for Case 1

In this case, the OLAM cube returned from the OLAM Mediator can be used to answer the user’s query. The CBWon algorithm [15] is employed to mine association rules. For convenience and to facilitate the analysis, we reproduce the CBWon algorithm in Figure 5.4. Because the qualified frequent itemsets have been stored in the found OLAM cube and minsup ≥ prims, there is no need to generate the frequent itemsets via an Apriori-like algorithm. All we have to do is scan the frequent itemsets in the OLAM cube and perform the association_gen procedure in Figure 5.7 to generate the qualified association rules.

Algorithm CBWon
Input: relevant cube MCube(tG, tM), minsup and prims;
Output: The set of frequent itemsets F;
1 if minsup < prims then
2   AF = {X | sup(X) ≥ minsup, X ∈ Auxiliary Cube} ∪ {Y | Y ∈ MCube(tG, tM) and |Y| = K};
3   DF = Dwnsearchon(T, AF, K, minsup);
4   UF = Upsearch(AF, minsup);
5   F = DF ∪ UF;
6 else
7   F = {X | X ∈ MCube(tG, tM) and sup(X) ≥ minsup};
8 end if
9 return F;

Figure 5.4. Algorithm CBWon

Procedure Dwnsearchon
1 for i = 1 to |D| do
2   scan the i-th transaction ti;
3   delete those items in ti but not in AF;
4   for each subset X of ti with 2 ≤ |X| ≤ K do
5     sup(X)++;
6 end for
7 DF = {X | sup(X) ≥ minsup} ∪ AF;

Figure 5.5. Procedure Dwnsearchon

Procedure Upsearch
1 transform horizontal data format T into t_id lists;
2 FK = frequent K-itemsets;
3 k = K, UF = ∅;
4 repeat
5   k++;
6   Ck = new candidate k-itemsets generated from Fk-1;
7   for each X ∈ Ck do
8     perform bit-vector intersection on X;
9     count the support of X;
10  end for
11  Fk = {X | sup(X) ≥ prims, X ∈ Ck};
12  UF = UF ∪ Fk;
13 until Fk = ∅

Figure 5.6. Procedure Upsearch
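The vertical counting idea behind Upsearch can be illustrated with a short sketch. It is an assumption-laden simplification: Python sets of transaction ids stand in for the bit vectors, candidates are formed by joining pairs of frequent itemsets that differ in one item, and the names `to_tid_lists` and `upsearch` are hypothetical.

```python
from itertools import combinations

def to_tid_lists(transactions):
    """Vertical format: item -> set of ids of the transactions containing it."""
    tids = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tids.setdefault(item, set()).add(tid)
    return tids

def upsearch(transactions, frequent_k, threshold):
    """Level-wise search for frequent itemsets larger than those in frequent_k."""
    tids = to_tid_lists(transactions)

    def support(itemset):
        # Intersect the tid sets of all items (stands in for bit-vector AND).
        return len(set.intersection(*(tids[i] for i in itemset)))

    found = set()
    level = {frozenset(x) for x in frequent_k}
    while level:
        size = len(next(iter(level))) + 1
        # Join step: unions of two itemsets that differ in exactly one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        level = {c for c in candidates if support(c) >= threshold}
        found |= level
    return found
```

Each support count touches only the tid sets of the items involved, so the transaction table is scanned once, during the transformation.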

Procedure association_gen (F: set of all frequent itemsets; min_conf: minimum confidence threshold)
begin
  for each l ∈ F do
    generate P(l) = 2^l − {∅}; // P(l): power set of l
    for each s ∈ P(l) and s ≠ l do
      if support_count(l) / support_count(s) ≥ min_conf then
        output s ⇒ l − s;
end

Figure 5.7. Procedure association_gen
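A compact sketch of the rule generation step is given below. It assumes the frequent itemsets are supplied as a dictionary from itemset to support count that also contains every non-empty subset of each itemset (which holds for frequent itemsets by downward closure); the function name and the tuple format of the output are illustrative choices.

```python
from itertools import combinations

def association_gen(support_count, min_conf):
    """Generate rules s => l - s whose confidence reaches min_conf.

    support_count maps frozenset itemsets to their support counts and
    must contain every non-empty subset of each itemset it holds.
    """
    rules = []
    for l, count_l in support_count.items():
        # Antecedents are the non-empty proper subsets of l.
        for r in range(1, len(l)):
            for s in combinations(sorted(l), r):
                s = frozenset(s)
                conf = count_l / support_count[s]
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules
```

For a k-itemset the inner loops enumerate every non-empty proper subset as an antecedent, which is the quantity that drives the cost analysis of rule generation.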

The cost can thus be divided into two parts:

1. Frequent itemset discovery: This involves scanning the frequent itemsets stored in the OLAM cube and keeping those whose support is not lower than the minsup of the user’s query, which costs |DM|, where DM denotes the OLAM cube.

2. Rule generation: For each discovered frequent itemset, we construct all possible rules from it, compute their confidence, and keep those that satisfy the minimum confidence.

The key point for the complexity analysis thus lies in the number of candidate

rules to be generated and inspected. Our first step toward this direction is to consider

the number of rules that can be generated from a frequent k-itemset and all of its

subsets.

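Lemma 1. The number of rules that can be constructed from a k-itemset is $2^k - 2$.

Proof. Recall that each rule that can be constructed from an itemset X has the form $A \Rightarrow X - A$ for $A \subset X$ and $A \neq \emptyset$. Thus, the number of different A’s determines the number of rules, which is

\[ \sum_{i=1}^{k-1} \binom{k}{i} = 2^k - 2. \]

Lemma 2. For a k-itemset X, the total number of rules that can be generated from X and its subsets is

\[ 3^k - 2^{k+1} + 1. \]

Proof. From Lemma 1, we can derive

\[ \sum_{i=2}^{k} \binom{k}{i} (2^i - 2) = (3^k - 1 - 2k) - 2(2^k - 1 - k) = 3^k - 2^{k+1} + 1. \]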
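The two counts can be checked by brute-force enumeration for small k. The sketch below counts the rules of a single itemset directly and then sums that count over all subsets with at least two items; the closed form it is compared against, 3^k − 2^(k+1) + 1, follows from the binomial theorem applied to the sum of (2^i − 2) over all i-subsets. The helper names are hypothetical.

```python
from itertools import combinations

def rules_from(itemset):
    """All rules A => X - A with A a non-empty proper subset of X."""
    items = sorted(itemset)
    return [(a, tuple(x for x in items if x not in a))
            for r in range(1, len(items))
            for a in combinations(items, r)]

def rules_from_all_subsets(itemset):
    """Total number of rules from X and every subset with at least 2 items."""
    items = sorted(itemset)
    total = 0
    for r in range(2, len(items) + 1):
        for sub in combinations(items, r):
            total += len(rules_from(sub))
    return total

for k in range(1, 7):
    x = range(k)
    assert len(rules_from(x)) == 2**k - 2
    assert rules_from_all_subsets(x) == 3**k - 2**(k + 1) + 1
```

The exponential growth in k is the reason the analysis focuses on the cardinality of the maximal frequent itemsets.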

Now, if we know the set of maximal frequent itemsets, then we can complete the

analysis. Unfortunately, the exact set is unobtainable without the a priori knowledge

of user’s specified minsup. We thus resort to an estimation that proceeds by taking

prims in place of minsup. Then we apply sampling to obtain a random subset of the

warehouse data, and we can either

1. compute the maximal frequent itemsets for each OLAM cube using any

maximal pattern mining algorithm, or

2. apply the CBWoff algorithm to estimate Kα (cutting level), compute frequent

itemsets with cardinality of Kα, and regard these itemsets as the maximal

frequent itemsets.

Let MF denote the set of maximal patterns. If the first approach is adopted, the

computation spent on rule generation will be

,

or

,

if the second approach is used. Here, for simplicity, we adopted the second approach.

Finally, combining the cost of frequent itemset discovery and rule generation, we have

.


5.3 Cost Evaluation for Case 2

In this case, algorithm CBWon illustrated in Figure 5.4 executes the “minsup < prims” branch of the “if” clause, which comprises three steps.

1. Generate AF, i.e. . This requires scanning the auxiliary cube and the OLAM

cube. The cost is , where denotes auxiliary cube, and

denotes OLAM cube.

2. Execute procedure Dwnsearchon illustrated in Figure 5.5. Note that this procedure presumes the availability of the corresponding joined table, and ignores the preprocessing step that generates it. To account for this task and simplify the discussion, we assume this cost is w and the table is T.

As illustrated in Figure 5.5, the Dwnsearchon procedure needs to scan all the

transactions in the database. The I/O cost is .

Next we estimate the cost of the most expensive step: counting itemset support. Let l denote the average length of each transaction. This step costs

, or in brief.

Finally, the total cost consumed by the Dwnsearchon procedure equals

.

3. Execute procedure Upsearch illustrated in Figure 5.6. To minimize the I/O cost

and avoid combinatorial decomposition, the Upsearch procedure first transforms

the transaction data into vertical data format called transaction-id lists, then

utilizes this structure to count the supports of itemsets. The cost lies in three

main steps.

(1) Data transformation. This requires data scan.

(2) Candidate generation. The dominant operation is itemset join. If the largest itemset cardinality is Kmax, this task consumes at most

.

(3) Counting candidate support. For each k-itemset, counting involves k-1

bit-vector intersections and one bit-vector accumulation. Summing this

cost over all candidate itemsets, we have .

Finally, the total cost for procedure Upsearch is

.

Combining all of the above analysis, we have


5.4 Cost Evaluation for Case 3

In this case, we must first generate table T according to the user’s query, which costs

. After this, the CBWoff algorithm shown in Figure 5.8 is performed. It can

be observed that except step 1, the steps employed by CBWoff are quite similar to

those by CBWon in Case 2. Since step 1 costs

,

this makes the total cost for this case be

.

Algorithm CBWoff(T, prims)
Input: Table T and prims;
Output: The set of frequent itemsets F;
1 scan T to compute K and generate all frequent 1-itemsets F1;
2 DF = Dwnsearch(T, K, F1, prims);
3 UF = Upsearch(DF, prims);
4 return F = DF ∪ UF;

Figure 5.8. Algorithm CBWoff

Procedure Dwnsearch
1 for i = 1 to |D| do
2   scan the i-th transaction ti;
3   delete the items in ti that are not in F1;
4   for each subset X of ti with 2 ≤ |X| ≤ K do
5     sup(X)++;
6 end for
7 store all X in Auxiliary cube for |X| = K and sup(X) ≥ prims;
8 DF = {X | sup(X) ≥ prims};

Figure 5.9. Procedure Dwnsearch
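The subset-counting loop at the heart of Dwnsearch can be sketched in a few lines. The version below is a simplified illustration: it keeps only the counting and thresholding, leaves out the storage of K-itemsets in the auxiliary cube, and uses hypothetical names (`dwnsearch`, `frequent_items`, `k_max`, `threshold`).

```python
from collections import Counter
from itertools import combinations

def dwnsearch(transactions, frequent_items, k_max, threshold):
    """Count all itemsets of size 2..k_max in one pass over the transactions."""
    counts = Counter()
    for t in transactions:
        # Drop infrequent items, then enumerate the subsets of what is left.
        kept = sorted(set(t) & frequent_items)
        for size in range(2, min(k_max, len(kept)) + 1):
            counts.update(combinations(kept, size))
    return {frozenset(x): c for x, c in counts.items() if c >= threshold}
```

A transaction that retains l items contributes on the order of l choose i subsets for each size i up to K, which is why the counting step dominates the cost of this procedure.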

To sum up, we list the cost functions for the three cases below:

Case 1: .

Case 2: .

Case 3:

.

Chapter 6

OLAM Cube Selection Methods

In this chapter, we describe three typical heuristic algorithms proposed for the OLAP cube selection problem, and elaborate how to modify each method and combine it with the cost models described in the previous chapter to select the most suitable OLAM cubes. The methods are the forward greedy selection (FGS) method proposed by Harinarayan et al. [9], the pick by size (PBS) selection method proposed by Shukla et al. [17], and the backward greedy selection (BGS) method proposed by Lin and Kuo [13].

6.1 Forward Greedy Selection Method (FGS)

The forward greedy selection method was proposed by Harinarayan et al. [9]. A greedy algorithm chooses the locally optimal solution at each step, subject to some constraint. For this purpose, we define a benefit function B(di, M) as follows:

(6.1)

We use this benefit function to compute the benefit of every unselected OLAM cube. Starting from the empty set, the forward selection method then chooses the most beneficial OLAM cube to materialize, one at a time, until no further cube can be added. The forward selection method is described below:

Algorithm 1. Forward greedy selection (FGS)

Step 0. Let M = ∅.

Step 1. While the space constraint allows another cube to be added, repeat Step 2 to Step 5.

Step 2. According to equation (6.1), calculate the benefit of all unselected OLAM cubes di, for 1 ≤ i ≤ n and di ∉ M.

Step 3. Select the OLAM cube with the maximal benefit according to the results of Step 2, and set it as dj.

Step 4. M ← M ∪ {dj}.

Step 5. Go to Step 1.

Figure 6.1. Forward Greedy Selection Method
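The control flow of forward greedy selection can be sketched independently of the benefit function. In the minimal version below, the benefit is passed in as a callable `benefit(d, selected)`, and a cube is only considered if it still fits in the remaining space; both the signature and the fit test are assumptions made for illustration.

```python
def forward_greedy(cubes, sizes, space_limit, benefit):
    """Add the most beneficial cube that still fits until none can be added."""
    selected = set()
    used = 0
    while True:
        candidates = [d for d in cubes
                      if d not in selected and used + sizes[d] <= space_limit]
        if not candidates:
            return selected
        best = max(candidates, key=lambda d: benefit(d, selected))
        selected.add(best)
        used += sizes[best]
```

Each round re-evaluates the benefit of every remaining cube against the current selection, so the number of benefit evaluations grows with the number of cubes that end up being selected.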

Example 6.1. Suppose that we select three attributes, city c, education e, and date d, from the sales star schema illustrated in Figure 2.3. Figure 6.2 depicts all possible OLAM cubes formed with these three attributes as well as their dependencies, where all OLAM cubes with the same transaction attributes tG are packed into a metacube. The dotted lines between metacubes are drawn for clarification only; they complete the lattice structure of the metacubes in terms of tG. Note that, according to Proposition 4.1, dependencies exist only between OLAM cubes within the same metacube. For simplicity, let us consider how to select the most suitable cubes to materialize from the three OLAM cubes ced*, cd*, and ed* under a space constraint. The symbols used in this example are shown in Table 6.1, and the required parameter settings are shown in Table 6.2. In addition, we assume that the base relation size is 64 and prims is 3. Table 6.3 shows the first two selection steps using FGS.

Table 6.1. The symbols used in cost model

tG the set of transaction attributes

tM the set of mining attributes

I/O to computation ratio

the cardinality of maximal frequent itemset

Kmax the cardinality of the largest itemset

number of candidate i-itemsets

l average length of each transaction

number of frequent i-itemsets

|DM| size of OLAM cube

|DX| size of auxiliary cube

frequency of OLAM cube

|T| size of the table composed of attributes

|D| size of base relation

Table 6.2. The required parameter settings

subcubes l |DM| |DX| |T| minsup

d*ce 1 2 4 6 4 8 8 5 20 15 30 4 0.3

d*c 1 2 4 5 2 5 8 4 10 10 30 5 0.3

d*e 1 2 4 5 1 6 6 2 15 5 30 3 0.4

Figure 6.2. All possible OLAM cubes formed with city, education, and date

Table 6.3. The benefits of OLAM cubes in the first two selection steps by FGS

Subcube d*ce
  First selection. Influenced subcubes: ced*, cd*, ed*. Benefit: ((64*6+3*1*30+2*30+30*28+1058+8* )-(8* +1*20))*(64-20)*(0.3+0.3+0.4)/20 = 5306.4
  Second selection. Influenced subcubes: ced*, cd*, ed*.

Subcube d*c
  First selection. Influenced subcubes: cd*. Benefit: ((64*6+3*1*30+2*30+30*10+724+8* )-(8* +1*10))*(64-10)*(0.3)/10 = 2507.76
  Second selection. Influenced subcubes: cd*. Benefit: ((8* +1*20)-(8* +1*10))*(20-10)*(0.3)/10 = 3

Subcube d*e
  First selection. Influenced subcubes: ed*. Benefit: ((64*6+3*1*30+2*30+30*15+586+6* )-(6* +1*15))*(64-15)*(0.4)/15 = 2031.87
  Second selection. Influenced subcubes: ed*. Benefit: ((8* +1*20)-(6* +1*15))*(20-15)*(0.4)/15 = 3.067

[Figure 6.2 lattice nodes: ced; ce, cd, ed; ce*d*, c*ed*, c*ed*; c*ed, c*e, c*d; ce*d, ce*, e*d; ced*, cd*, ed*]

6.2 Backward Greedy Selection Method (BGS)

The backward greedy selection method proposed by Lin and Kuo [13] is similar in concept to the forward greedy selection method. The difference is that all OLAM cubes are selected at the beginning, and the selection proceeds by removing, one at a time, the OLAM cube with the lowest detriment value until the total size of the remaining OLAM cubes fits within the storage space. For this purpose, we define a detriment function P(di, M) as follows:

(6.2)

Compared to BGS, the forward greedy selection algorithm can quickly find a set of data cubes when the storage space is noticeably smaller than the total size of the data cubes. Conversely, when the storage space is close to the total cube size, FGS needs more computation than BGS, which then has to remove only a few cubes. The backward greedy selection algorithm is described below:

Algorithm 2. Backward Greedy Selection (BGS)

Step 0. Let M ← D.

Step 1. While the total size of the cubes in M exceeds the space constraint, repeat Step 2 to Step 5.

Step 2. According to equation (6.2), calculate the detriment of all subcubes di, for 1 ≤ i ≤ n and di ∈ M.

Step 3. Select the OLAM cube with the minimum detriment value according to the results in Step 2, and set it as dj.

Step 4. M ← M − {dj}.

Step 5. Go to Step 1.


Figure 6.3. Backward Greedy Selection Method
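Backward greedy selection mirrors the forward method: it starts from the full set and discards cubes. The sketch below assumes the detriment is supplied as a callable `detriment(d, selected)` and that removal stops as soon as the remaining cubes fit within the space limit; the names are hypothetical.

```python
def backward_greedy(cubes, sizes, space_limit, detriment):
    """Remove the least harmful cube until the selection fits in space."""
    selected = set(cubes)
    while selected and sum(sizes[d] for d in selected) > space_limit:
        worst = min(selected, key=lambda d: detriment(d, selected))
        selected.remove(worst)
    return selected
```

When the space limit is close to the total size of all cubes, only a few removals are needed, so this direction of search does less work than starting from the empty set.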

Example 6.2. Consider Example 6.1 again. Suppose that all three cubes have been selected and the space constraint is 20. In addition, we assume that when ced* is not materialized, all queries that would be answered by ced* go back to the base relation in the data warehouse. Table 6.4 shows the first two selection steps performed by the backward greedy selection method.

Table 6.4. The detriments of OLAM cubes in the first two selection steps by BGS

Subcube d*ce
  First selection. Influenced subcubes: ced*. Detriment: ((64*6+3*1*30+2*30+30*28+1058+8* )-(8* +1*20))*(64-20)*(0.3)/20 = 1591.92
  Second selection. Influenced subcubes: ced*, cd*. Detriment: ((64*6+3*1*30+2*30+30*28+1058+8* )-(8* +1*20))*(64-20)*(0.3+0.3)/20 = 3183.84

Subcube d*c
  First selection. Influenced subcubes: cd*. Detriment: ((8* +1*20)-(8* +1*10))*(20-10)*(0.3)/10 = 3
  Second selection. Influenced subcubes: cd*.

Subcube d*e
  First selection. Influenced subcubes: ed*. Detriment: ((8* +1*20)-(6* +1*15))*(20-15)*(0.4)/15 = 3.067
  Second selection. Influenced subcubes: ed*. Detriment: ((8* +1*20)-(6* +1*15))*(20-15)*(0.4)/15 = 3.067

6.3 Pick by Size Selection Method (PBS)

The pick by size selection algorithm is an intuitive method proposed by Shukla et al. [17]. The idea is to compute the size of every OLAM cube and select the smallest cubes one by one until the storage constraint would be exceeded. The pick by size selection algorithm is described below:

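Algorithm 3. Pick by Size Selection (PBS)

Step 0. Sort all OLAM cubes by size.

Step 1. Let M = ∅.

Step 2. While the space constraint allows another cube to be added, repeat Step 3 to Step 5.

Step 3. Select the smallest unselected OLAM cube, and set it as dj.

Step 4. M ← M ∪ {dj}.

Step 5. Go to Step 2.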


Figure 6.4. Pick by Size Greedy Selection Method
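Pick by size needs no cost model at all, as the following sketch shows. It assumes that selection stops at the first cube that would push the total past the space limit; the function name is hypothetical.

```python
def pick_by_size(sizes, space_limit):
    """Select cubes in ascending order of size while they still fit."""
    selected, used = [], 0
    for cube in sorted(sizes, key=sizes.get):
        if used + sizes[cube] > space_limit:
            break
        selected.append(cube)
        used += sizes[cube]
    return selected
```

Because the only work is one sort and one pass, the selection time is small, but the choice ignores query frequencies and minsup settings entirely.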

Chapter 7

Experimental Results

In this chapter, we describe our experiments and analysis. All experiments were performed on a machine with an Intel Celeron 1.2 GHz CPU and 512 MB RAM, running Microsoft Windows 2000 Server. The test data were generated from the Microsoft foodmart2000 database. We chose three dimensions from the foodmart2000 database: Customer, Time, and Product. Each dimension consists of two attributes: city c and education e for the Customer dimension; date d and month m for the Time dimension; and Product_ID p and Category a for the Product dimension. The schema hierarchy of the three dimensions is shown in Figure 2.4. Characteristics of the test data are shown in Table 7.1, where we use the following notation:

g<attribute_list>(R),

where g denotes the “group by” operation and <attribute_list> denotes the grouping attributes.

From this test database, we generated all possible OLAM subcubes, which are detailed in Table 7.2. Note that we have filtered out those subcubes that exhibit schema redundancy and those whose corresponding transaction table, composed of tG ∪ tM, has fewer than 10000 tuples. After filtering, all OLAM cubes can be grouped into seven classes distinguished by tG, as shown in Table 7.2. Each cube is represented by the first letters of its attributes, except that Category is abbreviated as “a”. The symbol “-” means none, and the number is the size of each subcube.

In our experiments, we consider two different settings for the frequencies of subcubes: 1) all frequencies are the same; 2) frequencies are randomly generated numbers between 0 and 1. In addition, we consider three different settings of minsup: 1) all minsups are lower than prims; 2) minsups are randomly generated between 1% and 99%; 3) all minsups are equal to or greater than prims.

Finally, we assume eight different storage constraints, which are 10%, 20%,

30%, 40%, 50%, 60%, 70%, and 80% of the sum of all subcubes. Furthermore if none

of the subcubes that can be used to answer a query is materialized, we assume that the

base relation is invoked to answer this query.

Table 7.1. Data parameters of foodmart 2000

Parameters Value

|D| 86565

|dom(City)| 78

|dom(Education)| 5

|dom(Date)| 323

|dom(Month)| 12

|dom(Product_ID)| 1559

|dom(Category)| 45

|g c, e, d(D)| 10541

|g c, p(D)| 49483

|g m, p(D)| 18492

|g c, e, a(D)| 11576

|g d, a(D)| 12113

|g c, d, a(D)| 49390

|g c, m, a(D)| 22854

Table 7.2. All subcubes of six attributes, City, Education, Date, Month, Product_ID, Category

tG    All possible mining attributes tM (size)
ced   mpa 713, mp- 12, m-a 714, -pa 678, m-- 12, -p- 0, --a 678
cp    edma 72, edm- 60, ed-a 24, e-ma 72, -dma 18, ed-- 12, e-m- 60, e--a 24,
      -dm- 12, -d-a 6, --ma 18, e--- 12, -d-- 0, --m- 12, ---a 6
mp    ceda 407, ced- 377, ce-a 403, c-da 62, -eda 62, ce-- 373, c-d- 54, c--a 59,
      -ed- 34, -e-a 58, --da 9, c--- 51, -e-- 30, --d- 3, ---a 6
cea   dmp 3955, dm- 3955, d-p 14, -mp 3891, d-- 14, -m- 3891, --p 0
da    cemp 860, cem- 860, ce-p 792, c-mp 87, -emp 99, ce-- 792, c-d- 87, c--a 75,
      -ed- 99, -e-a 31, --da 12, c--- 75, -e-- 31, --m- 12, ---p 0
cda   emp 22, em- 22, e-p 9, -mp 12, e-- 9, -m- 12, --p 0
cma   edp 30, ed- 30, e-p 30, -dp 0, e-- 30, -d- 0, --p 0

7.1 Comparison of FGS, BGS, and PBS for minsup ≥ prims

We first compare the query cost of the three selection methods when the frequency of each subcube is random. The results are shown in Figure 7.1. According to Figure 7.1, FGS and BGS are significantly better than PBS, and the most cost-effective space limit is around 50%. We also recorded the results when the frequency of each subcube is uniform, as illustrated in Figure 7.2. The trend is similar to that in Figure 7.1, but the most cost-effective space limit is about 40% for FGS.

Figure 7.1. Comparing the query cost of FGS, BGS, and PBS with random frequency when minsup ≥ prims.

Figure 7.2. Comparing the query cost of FGS, BGS, and PBS with uniform frequency when minsup ≥ prims.

We also compare the efficiency of the forward, backward, and pick by size selection methods. Since the forward and backward greedy selection methods share a similar philosophy but differ in the direction of selection, we use execution time as the criterion. We also compare these two methods with PBS. The results are shown in Figure 7.3 and Figure 7.4. In both figures, the selection time of forward greedy selection is less than that of backward greedy selection at first, but when the space limit is higher than 30%, the situation is reversed. The reason is that the forward greedy selection method adds subcubes starting from the empty set until no more cubes can be added, while the backward greedy selection method works in the opposite direction. The pick by size selection method consumes the least time because it does not need to find the subcube with the best benefit; it only chooses subcubes according to their size.

Figure 7.3. Comparing the selection time of FGS, BGS, and PBS with random frequency when minsup ≥ prims.

Figure 7.4. Comparing the selection time of FGS, BGS, and PBS with uniform frequency when minsup ≥ prims.


7.2 Comparison of FGS, BGS, and PBS for minsup < prims

We first compare the query cost of FGS, BGS, and PBS. The results are shown in Figure 7.5 and Figure 7.6. Recall that when minsup < prims, the OMARS system must utilize the OLAM cube, the auxiliary cube, and the data warehouse to answer the user’s query. This process incurs more cost because no OLAM cube can answer the user’s query immediately. Thus, the query cost of FGS and BGS when minsup < prims is higher than that when minsup ≥ prims, regardless of the frequencies. The query cost of PBS when minsup < prims, however, is similar to that when minsup ≥ prims, because PBS selects subcubes only according to their size.


Figure 7.5. Comparing the query cost of FGS, BGS, and PBS with random frequency when minsup < prims.

Figure 7.6. Comparing the query cost of FGS, BGS, and PBS with uniform frequency

when minsup < prims.

We then compare the efficiency of these three methods. The results are shown in Figure 7.7 and Figure 7.8. It can be observed that the results are similar to those obtained when minsup ≥ prims.


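Figure 7.7. Comparing the selection time of FGS, BGS, and PBS with random frequency when minsup < prims.

Figure 7.8. Comparing the selection time of FGS, BGS, and PBS with uniform frequency when minsup < prims.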


7.3 Comparison of FGS, BGS, and PBS for random minsups

Generally, the support settings of different users’ queries differ. To reflect this situation, we conducted another experiment in which the minsups are set randomly. Figure 7.9 shows the results when the frequencies of all subcubes are random. The forward and backward selection methods reach the most cost-effective point at a space constraint of around 50%; beyond 50%, there is no further saving in query cost for forward greedy selection. In Figure 7.10, the results are similar to those in Figure 7.9, but the most cost-effective point for FGS is around 40%.

Figure 7.9. Comparing the query cost of FGS, BGS, and PBS with random frequency for random minsups.

Figure 7.10. Comparing the query cost of FGS, BGS, and PBS with uniform

frequency for random minsups.

We then compare the selection time of FGS, BGS, and PBS. As shown in Figure 7.11 and Figure 7.12, the trend is similar to that in the above two cases.





Chapter 8

Conclusions and Future Works

8.1 Conclusions

In this thesis, we have considered the OLAM cube selection problem in the OMARS system and proposed a cost model to evaluate the query cost. According to the user’s association queries, we divided query evaluation into three cases and designed a cost model for each. Using these cost models, we modified three well-known heuristic algorithms, FGS, BGS, and PBS, to choose the most suitable OLAM cubes to materialize. We also implemented these algorithms to evaluate their performance.

One point needs to be clarified. Although our cost models are based on the CBWon and CBWoff algorithms proposed by Lin et al. [15], most of the concepts also apply to other algorithms, except that the cost functions need to be modified to conform to the new algorithms.

8.2 Future Works

As we pointed out at the beginning of this thesis, the main focus of this research is on OLAM cube selection in the OMARS system. Some issues remain to be investigated in the near future:

1. Simultaneous selection of OLAM cubes and OLAP cubes

In the OMARS system, there is another cube repository that stores OLAP cubes. In this thesis, we only consider the OLAM cube selection problem. One direction for future work is to consider OLAP cubes and OLAM cubes together, and design a suitable cost model to evaluate the query cost in this situation.

2. Cube maintenance

In real-world applications, data change over time as new data are generated and loaded into the data warehouse. This implies that the materialized cubes have to be updated to reflect the new situation. Another direction for future work is to design a suitable scheme to update the OLAM, OLAP, and auxiliary cubes in the OMARS system.

3. Other nonheuristic algorithms

In this thesis, we only consider heuristic algorithms for selecting the most suitable OLAM cubes to materialize. Besides these methods, we will consider other algorithms, such as genetic algorithms, the A* algorithm, or dynamic programming, to select the most suitable OLAM cubes to materialize.

References

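[1] W.Y. Lin and Y.S. Chang, “A comparative analysis of heuristic data cube selection methods,” in Proceedings of the 2001 National Computer Symposium, pp. 47-58, 2001 (in Chinese).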

[2] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in

Proceedings of the 20th VLDB Conference, pp. 487-499, 1994.

[3] K.S. Beyer and R. Ramakrishnan, “Bottom-up computation of sparse and

iceberg cubes,” in Proceedings of the ACM SIGMOD International Conference

on Management of Data, pp. 359-370, 1999.

[4] S. Chaudhuri and U. Dayal, “An overview of data warehouse and OLAP technology,”

ACM SIGMOD Record, Vol. 26, pp. 65-74, 1997.

[5] C.I. Ezeife, “A uniform approach for selecting views and indexes in a data

warehouse,” in Proceedings of International Database Engineering and

Applications Symposium, pp. 151-160, 1997.

[6] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, “Data cube: a relational aggregation

operator generalizing group-by, cross-tabs and subtotals,” in Proceedings of International

Conference on Data Engineering, pp. 152 -159, 1996.

[7] H. Gupta, “Selection of views to materialize in a data warehouse,” in

Proceedings of International Conference on Database Theory, pp. 98-112, 1997.

[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.

[9] V. Harinarayan, A. Rajaraman, and J.D. Ullman, “Implementing data cubes

efficiently,” in Proceedings of ACM SIGMOD, pp. 205-216, 1996.

[10] J.-T Horng, Y.-J. Chang, B.-J. Liu, and C.-Y. Kao, “Materialized view selection

using genetic algorithms in a data warehouse,” in Proceedings of World

Congress on Evolutionary Computation, pp. 2221-2227, 1999.

[11] W.H. Inmon and C. Kelley, Rdb/VMS: Developing the Data Warehouse, QED Publishing Group, Boston, Massachusetts, 1993.

[12] R. Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, Inc., 1996.

[13] W.Y. Lin and I.C. Kuo, “OLAP data cubes configuration with genetic

algorithms,” in Proceedings of IEEE System, Man and Cybernetics, pp. 1984–

1989, 2000.

[14] W.Y. Lin, I.C. Kuo, and Y.S. Chang, “A Genetic Selection Algorithm for OLAP

Data Cube,” in Proceedings of the 9th National Conference on Fuzzy Theory and

Its Application, Taiwan, November 2001, pp. 624-628, 2001.

[15] W.Y Lin, J.H Su and M.C Tseng, “OMARS: The framework of an online multi-

dimensional association rules mining system,” in Proceedings of 2nd

International Conference on Electronic Business, Taipei, Taiwan, 2002.

[16] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs,

Springer-Verlag, New York, 1994.

[17] A. Shukla, P.M. Deshpande, and J.F. Naughton, “Materialized view selection for multidimensional datasets,” in Proceedings of the 24th VLDB Conference, New York, USA, pp. 488-499, 1998.

[18] E. Soutyrina and F. Fotouhi, “Optimal view selection for multidimensional database systems,” in Proceedings of International Database Engineering and Applications Symposium, pp. 309-318, 1997.

[19] D. Theodoratos and T. Sellis, “Data warehouse configuration,” in Proceedings of

the 23rd VLDB Conference, pp.126-135, 1997.

[20] C. Zhang, X. Yao, and J. Yang, “Evolving materialized views in data

warehouse,” in Proceedings of World Congress on Evolutionary Computation,

pp. 823-829, 1999.

[21] C. Zhang and J. Yang, “Genetic algorithm for materialized view selection in data

warehouse environments,” in Proceedings of International Conference on Data

Warehouse and Knowledge Discovery, pp. 116-125, 1999.

[22] H. Zhu, On-Line Analytical Mining of Association Rules, Simon Fraser

University, December 1998.

I-Shou University

Graduate Institute of Information Management

Master's Thesis

OLAM Cube Selection in OMARS

Graduate Student: Min-Feng Wang

Advisor: Dr. Wen-Yang Lin

July 2003

OLAM Cube Selection in OMARS

Student: Min-Feng Wang

Advisor: Dr. Wen-Yang Lin

A Thesis

Submitted to the Department of Information Management

I-Shou University

in Partial Fulfillment of the Requirements

for the Master's Degree

in

Information Management

July 2003

Kaohsiung, Taiwan, Republic of China

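OLAM Cube Selection in OMARS

Student: Min-Feng Wang Advisor: Dr. Wen-Yang Lin

Graduate Institute of Information Management, I-Shou University

CHINESE ABSTRACT

Mining association rules from large databases is a computation-intensive task. To reduce the complexity of association rule mining, Lin et al. extended the concept of the Ice-berg cube and proposed the OLAM (On-Line Association Mining) cube to store frequent itemsets. They also proposed a framework, called the On-Line Multidimensional Association Rule Mining System (OMARS), that lets users issue OLAP-like queries to mine association rules from a data warehouse quickly.

The purpose of this thesis is to design a cost model based on the association rule mining algorithms proposed for OMARS, and to use it to select the OLAM cubes best suited to answering association rule queries. In addition, this thesis modifies and implements several state-of-the-art heuristic algorithms, combines them with the proposed cost model, and analyzes their performance.

Keywords: data mining, data warehouse, on-line multidimensional association rule mining system, OLAM cube, multidimensional association rules, cube selection problem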


Student: Min-Feng Wang Advisor: Wen-Yang Lin

Dept. of Information Management

I-Shou University

ABSTRACT

Mining association rules from large databases is a data- and computation-intensive task. To reduce the complexity of association mining, Lin et al. proposed the concept of the OLAM (On-Line Association Mining) cube, an extension of the Ice-berg cube used to store frequent multidimensional itemsets. They also proposed a framework for an on-line multidimensional association rule mining system, called OMARS, which gives users an environment for executing OLAP-like queries that mine association rules from data warehouses efficiently.

This thesis is a companion work toward the implementation of OMARS. In particular, it addresses the problem of selecting appropriate OLAM cubes to materialize and store in OMARS. Based on the mining algorithms proposed for OMARS, we develop a model for evaluating the cost of answering association queries using materialized OLAM cubes, which is a preliminary step for OLAM cube selection. In addition, we modify and implement several state-of-the-art heuristic algorithms and compare them to evaluate their effectiveness.

Keywords: data mining, data warehouse, OMARS, OLAM cube, multidimensional

association rules, cube selection problem

Acknowledgement

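For the completion of this thesis, I am most grateful to my advisor, Dr. Wen-Yang Lin, for his patient guidance, which taught me how to do research and let me experience both its rewards and its hardships. I especially thank him for taking time during the summer vacation to advise me in the final stage of writing. Beyond the study of theory during these two years of graduate school, I am all the more grateful that he gave me time to delve into the implementation of data warehouses and databases.

Two years is a short time. I would also like to thank Professors 洪宗貝, 錢炳全, 王學亮, and 林建宏 for their guidance over these two years, which broadened my knowledge in many ways in so short a time. I also thank the classmates who studied and discussed coursework with me during these two years, especially 思博, 文傑, and 欣龍, as well as my seniors, in particular 耀升 and 詠騏, and the junior students who helped me at my oral defense. Finally, I thank my family for their support and encouragement over these two years.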

Contents

CHINESE ABSTRACT

ENGLISH ABSTRACT

ACKNOWLEDGEMENT

CHAPTER 1 INTRODUCTION

1.1 CONTRIBUTIONS

1.2 THESIS ORGANIZATION

CHAPTER 2 BACKGROUND AND RELATED WORK

2.1 DATA WAREHOUSE AND OLAP

2.1.1 Data Warehouse

2.1.2 On-Line Analytical Processing (OLAP)

2.2 DATA WAREHOUSE DATA MODEL

2.2.1 Star Schema

2.2.2 Snowflake Data Model

2.3 ASSOCIATION RULE MINING

2.3.1 Association Rules

2.3.2 Multi-dimensional Association Rules

2.4 RELATED WORK

2.4.1 Data Cube

2.4.2 Cube Selection Problem

CHAPTER 3 THE OMARS FRAMEWORK

3.1 OLAM CUBE AND AUXILIARY CUBE

3.2 CUBE MANAGER

3.3 OLAM MEDIATOR AND OLAM ENGINE

CHAPTER 4 PROBLEM FORMULATION

4.1 OLAM CUBE AND OLAM QUERY

4.2 OLAM LATTICE

4.3 OLAM CUBE SELECTION

CHAPTER 5 EVALUATION OF OLAM QUERY COST

5.1 QUERY EVALUATION FLOW

5.2 COST EVALUATION FOR CASE 1

CHAPTER 6 OLAM CUBE SELECTION METHODS

6.1 FORWARD GREEDY SELECTION METHOD (FGS)

6.2 BACKWARD GREEDY SELECTION METHOD (BGS)

6.3 PICK BY SIZE SELECTION METHOD (PBS)

CHAPTER 7 EXPERIMENTAL RESULTS

7.1 COMPARISON OF FGS, BGS, AND PBS FOR minsup ≥ prims

7.2 COMPARISON OF FGS, BGS, AND PBS FOR minsup < prims

7.3 COMPARISON OF FGS, BGS, AND PBS FOR RANDOM minsups

CHAPTER 8 CONCLUSIONS AND FUTURE WORKS

8.1 CONCLUSIONS

8.2 FUTURE WORKS

REFERENCES

List of Figures

FIGURE 2.1. A TYPICAL ARCHITECTURE OF DATA WAREHOUSE

FIGURE 2.2. THE TYPICAL OPERATIONS OF OLAP

FIGURE 2.3. AN EXAMPLE OF STAR SCHEMA FOR SALES

FIGURE 2.4. AN EXAMPLE OF SCHEMA HIERARCHY FOR SALES STAR

FIGURE 2.5. AN EXAMPLE OF SNOWFLAKE SCHEMA FOR SALES

FIGURE 2.6. AN EXAMPLE OF DATA CUBE

FIGURE 3.1. THE OMARS FRAMEWORK

FIGURE 4.1. THE 1ST LAYER OLAM LATTICE FOR THE EXAMPLE STAR SCHEMA IN FIGURE 2.3

FIGURE 4.2. THE 2ND LAYER LATTICE DERIVED FROM <CUSTOMER, TIME, -> IN THE 1ST LAYER

FIGURE 4.3. THE 3RD LAYER LATTICE DERIVED FROM THE SUBCUBE <(CITY, EDUCATION), DATE> IN THE 2ND LAYER

FIGURE 5.1. THE FLOW DIAGRAM OF OLAM QUERY

FIGURE 5.2. THE PROCEDURE TO COMPUTE THE COST OF USER’S QUERY

FIGURE 5.3. PROCEDURE OLAMQ_SEARCH

FIGURE 5.4. ALGORITHM CBWON

FIGURE 5.5. PROCEDURE DWNSEARCHON

FIGURE 5.6. PROCEDURE UPSEARCH

FIGURE 5.7. PROCEDURE ASSOCIATION_GEN

FIGURE 5.8. ALGORITHM CBWOFF

FIGURE 5.9. PROCEDURE DWNSEARCH

FIGURE 6.1. FORWARD GREEDY SELECTION METHOD

FIGURE 6.2. ALL POSSIBLE OLAM CUBES FORMED WITH CITY, EDUCATION AND DATE

FIGURE 6.3. BACKWARD GREEDY SELECTION METHOD

FIGURE 6.4. PICK BY SIZE GREEDY SELECTION METHOD

FIGURE 7.1. COMPARING THE QUERY COST OF FGS, BGS, AND PBS WITH RANDOM FREQUENCY WHEN minsup ≥ prims

FIGURE 7.2. COMPARING THE QUERY COST OF FGS, BGS, AND PBS WITH UNIFORM FREQUENCY WHEN minsup ≥ prims

FIGURE 7.3. COMPARING THE SELECTION TIME OF FGS, BGS, AND PBS WITH RANDOM FREQUENCY WHEN minsup ≥ prims

FIGURE 7.4. COMPARING THE SELECTION TIME OF FGS, BGS, AND PBS WITH UNIFORM FREQUENCY WHEN minsup ≥ prims

FIGURE 7.5. COMPARING THE QUERY COST OF FGS, BGS, AND PBS WITH RANDOM FREQUENCY WHEN minsup < prims

FIGURE 7.6. COMPARING THE QUERY COST OF FGS, BGS, AND PBS WITH UNIFORM FREQUENCY WHEN minsup < prims

FIGURE 7.7. COMPARING THE SELECTION TIME OF FGS, BGS, AND PBS WITH RANDOM FREQUENCY WHEN minsup < prims

FIGURE 7.8. COMPARING THE SELECTION TIME OF FGS, BGS, AND PBS WITH UNIFORM FREQUENCY WHEN minsup < prims

FIGURE 7.9. COMPARING THE QUERY COST OF FGS, BGS, AND PBS WITH RANDOM FREQUENCY FOR RANDOM minsups

FIGURE 7.10. COMPARING THE QUERY COST OF FGS, BGS, AND PBS WITH UNIFORM FREQUENCY FOR RANDOM minsups

FIGURE 7.11. COMPARING THE SELECTION TIME OF FGS, BGS, AND PBS WITH RANDOM FREQUENCY FOR RANDOM minsups

FIGURE 7.12. COMPARING THE SELECTION TIME OF FGS, BGS, AND PBS WITH UNIFORM FREQUENCY FOR RANDOM minsups

List of Tables

TABLE 2.1. A RELATIONAL REPRESENTATION OF OLAP CUBE

TABLE 4.1. A JOINTED TABLE T FROM STAR SCHEMA

TABLE 4.2. AN EXAMPLE OF INTRA OLAM CUBE EXPRESSED IN TABLE

TABLE 4.3. AN EXAMPLE INTER-DIMENSIONAL OLAM CUBE EXPRESSED IN TABLE

TABLE 4.4. AN EXAMPLE HYBRID-DIMENSIONAL OLAM CUBE EXPRESSED IN TABLE

TABLE 4.5. AN OLAM CUBE

TABLE 4.6. THE RESULTING TABLE BY GROUPING {DATE, MONTH} AS TRANSACTION ATTRIBUTES FOR TABLE 4.1

TABLE 4.7. THE RESULTING TABLE BY GROUPING {DATE} AS TRANSACTION ATTRIBUTE FOR TABLE 4.1

TABLE 4.8. THE RESULTING OLAM CUBE MCUBE({CITY}, {DATE, MONTH})

TABLE 4.9. THE RESULTING TABLE BY GROUPING {CITY, DATE} AS TRANSACTION ATTRIBUTE FOR TABLE 4.1

TABLE 4.10. THE SYMBOL TABLE

TABLE 6.1. THE SYMBOLS USED IN COST MODEL

TABLE 6.2. THE REQUIRED PARAMETER SETTINGS

TABLE 6.3. THE BENEFITS OF OLAM CUBES IN THE FIRST TWO SELECTION STEPS BY FGS

TABLE 6.4. THE BENEFITS OF OLAM CUBES IN THE FIRST TWO SELECTION STEPS BY BGS

TABLE 7.1. DATA PARAMETERS OF FOODMART 2000

TABLE 7.2. ALL SUBCUBES OF SIX ATTRIBUTES, CITY, EDUCATION, DATE, MONTH, PRODUCT_ID, CATEGORY
