[IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management -...

7
Abstract Data Mining is one of the most significant tools for discovering association patterns that are useful for health services, Customer Relationship Management (CRM) etc. Yet, there are some drawbacks in existing mining techniques. Since most of them perform the flat mining based on pre-defined schemata through the data warehouse as a whole, a re-scan must be done whenever new attributes are added. Secondly, an association rule may be true on a certain granularity but fail on a smaller one and vise verse. And, they are used to find either frequent or infrequent rules. With regard to healthcare service management, this research aims at providing a novel data schema and an algorithm to solve the aforementioned problems. A forest of concept taxonomies is applied for representing healthcare knowledge space. On top of this structure, the mining process is formulated as a process of finding the large- itemsets, generating, updating and output the association patterns that represent portfolios of healthcare services. Crucial mechanisms in each step will be clarified in this paper. At last, this paper presents experimental results regarding efficiency, scalability, information loss, etc. of the proposed approach to prove its advantages. Keywords Multidimensional Data Mining, Concept Taxonomy, Healthcare Services, CRM, Association Pattern I. INTRODUCTION In the information economy era, markets offer more variances of services and products and customers become demanding on more intensive information and better quality of services. As a whole, the conventional methods of mass-marketing are being replaced by the customer- oriented approaches. As the second reason for seeking new way of services, healthcare institutions in most countries are facing a tail-off of Healthcare Assurance Payments. Healthcare institutions need thus to target the patients with various proper portfolios of services. Under this condition, hospitals like to provide various new services such as prevention methods with education on patient with changing habits to prevent chronic illness and disease, treatment and physical check-up periodically to assist patients to improve their health quality. Moreover, hospitals like improve the performance and to offer better quality of services. New tools and approaches such as CRM via data mining are needed to address this change. The notion of association rule can provide simple but useful form of knowledge on services and customers [3, 5]. Significant examples are finding new therapies/drugs for cancer cute and new portfolios of rationale services. For instance, “52% patients taking therapy X also take beauty treatment Y”. Given such an association rule, they may reduce the registration fee of the therapy from the health assurance, and raise the service fee of treatments to make more profit from private-paid services. However, most of conventional mining approaches only perform a flat scan over the databank based on a pre- defined schema. Questions may often arise such as: Are the other influence factors like W on treatment Y taken into account? Since most medical associations occur in a context of certain breadth, the knowledge does usually encompass multidimensional content. But, adding attributes to the mining task means changing the schema and leads to a fully re-scan or redundant scan that is extra time consuming. As the second deficit, most of conventional mining approaches assume, the induced rules should be effective throughout a database as a whole, but this obviously does not fit with real-life cases. Different association rules can be found in different parts of database. If a mining tool deals only with the database as a whole, the meaningful rules that are true only in a certain part will be lost. The goal of the research is to provide an approach with novel data structure and efficient mining algorithm for portfolio management on healthcare services. The crucial issue is to explore more efficient, more accurate multidimensional mining of association patterns on different granularities in a flexible and robust manner. II BASELINE OF THE RESEARCH A. Data Mining for healthcare services Data mining technology can contribute to hospital with more understanding on patient illness status, and to improve quality of service (QoS). Hospitals use databases of patients’ records, results of physical check-up, radiology, pathology etc. to analyze patient status. And, data mining and knowledge management tools are applied to select different types of patient categories for different prevention, treatment services etc. In these regards, five constructs can be explored for better service such as patient segmentations w.r.t different type of service, different insurance reimbursement for varies type of patient, chronic illness, self pay and physical check up patient services. Details of the service categories can be summarized as follow: [1] Patient segmentations on different type of service: To analyze different types of patient illness, Multidimensional Data Mining of Association Patterns in various Granularities for Healthcare Service Portfolio Management Johannes K. Chiang Department of MIS, National Chengchi University, Taipei, Taiwan 525 1-4244-1529-2/07/$25.00 ©2007 IEEE

Transcript of [IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management -...

Page 1: [IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management - Singapore (2007.12.2-2007.12.4)] 2007 IEEE International Conference on Industrial Engineering

Abstract – Data Mining is one of the most significant tools

for discovering association patterns that are useful for health services, Customer Relationship Management (CRM) etc. Yet, there are some drawbacks in existing mining techniques. Since most of them perform the flat mining based on pre-defined schemata through the data warehouse as a whole, a re-scan must be done whenever new attributes are added. Secondly, an association rule may be true on a certain granularity but fail on a smaller one and vise verse. And, they are used to find either frequent or infrequent rules.

With regard to healthcare service management, this research aims at providing a novel data schema and an algorithm to solve the aforementioned problems. A forest of concept taxonomies is applied for representing healthcare knowledge space. On top of this structure, the mining process is formulated as a process of finding the large-itemsets, generating, updating and output the association patterns that represent portfolios of healthcare services. Crucial mechanisms in each step will be clarified in this paper. At last, this paper presents experimental results regarding efficiency, scalability, information loss, etc. of the proposed approach to prove its advantages.

Keywords – Multidimensional Data Mining, Concept Taxonomy, Healthcare Services, CRM, Association Pattern

I. INTRODUCTION

In the information economy era, markets offer more variances of services and products and customers become demanding on more intensive information and better quality of services. As a whole, the conventional methods of mass-marketing are being replaced by the customer-oriented approaches. As the second reason for seeking new way of services, healthcare institutions in most countries are facing a tail-off of Healthcare Assurance Payments. Healthcare institutions need thus to target the patients with various proper portfolios of services.

Under this condition, hospitals like to provide various new services such as prevention methods with education on patient with changing habits to prevent chronic illness and disease, treatment and physical check-up periodically to assist patients to improve their health quality. Moreover, hospitals like improve the performance and to offer better quality of services. New tools and approaches such as CRM via data mining are needed to address this change. The notion of association rule can provide simple but useful form of knowledge on services and customers [3, 5]. Significant examples are finding new therapies/drugs for cancer cute and new portfolios of rationale services.

For instance, “52% patients taking therapy X also take beauty treatment Y”. Given such an association rule, they may reduce the registration fee of the therapy from the health assurance, and raise the service fee of treatments to make more profit from private-paid services.

However, most of conventional mining approaches only perform a flat scan over the databank based on a pre-defined schema. Questions may often arise such as: Are the other influence factors like W on treatment Y taken into account? Since most medical associations occur in a context of certain breadth, the knowledge does usually encompass multidimensional content. But, adding attributes to the mining task means changing the schema and leads to a fully re-scan or redundant scan that is extra time consuming.

As the second deficit, most of conventional mining approaches assume, the induced rules should be effective throughout a database as a whole, but this obviously does not fit with real-life cases. Different association rules can be found in different parts of database. If a mining tool deals only with the database as a whole, the meaningful rules that are true only in a certain part will be lost.

The goal of the research is to provide an approach with novel data structure and efficient mining algorithm for portfolio management on healthcare services. The crucial issue is to explore more efficient, more accurate multidimensional mining of association patterns on different granularities in a flexible and robust manner.

II BASELINE OF THE RESEARCH A. Data Mining for healthcare services

Data mining technology can contribute to hospital with more understanding on patient illness status, and to improve quality of service (QoS). Hospitals use databases of patients’ records, results of physical check-up, radiology, pathology etc. to analyze patient status. And, data mining and knowledge management tools are applied to select different types of patient categories for different prevention, treatment services etc. In these regards, five constructs can be explored for better service such as patient segmentations w.r.t different type of service, different insurance reimbursement for varies type of patient, chronic illness, self pay and physical check up patient services. Details of the service categories can be summarized as follow: [1] Patient segmentations on different type of service:

To analyze different types of patient illness,

Multidimensional Data Mining of Association Patterns in various Granularities for Healthcare Service Portfolio Management

Johannes K. Chiang

Department of MIS, National Chengchi University, Taipei, Taiwan

5251-4244-1529-2/07/$25.00 ©2007 IEEE

Page 2: [IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management - Singapore (2007.12.2-2007.12.4)] 2007 IEEE International Conference on Industrial Engineering

hospitals are will providing various services for patients with the customization of treatment service, education and wellness maintenance. Hospitals notify patients to return back to the hospital for planning the best services for patient treatments.

[2] Different insurance reimbursement for varies types of patients: Hospitals analyze patients’ insurance types of reimbursement, and also apply data mining to provide appropriate service to earn the maximum reimbursement. Furthermore, hospitals classify the contributions of different patient types to provide the best services attracting the patients of higher society-levels to generate more revenue.

[3] Chronic illness: Hospitals analyze patients’ check-up results to define and predict different chronic illness types as well as different services for patients. Furthermore, hospitals notify patients back to hospitals for routine check-ups and treatments.

[4] Self- pay services: Hospital is capable of mining the patients’ needs for self-pay services such as tumor/ cancer MRI check-up, skin disease for skin beauty treatments, hypertension for brain stoke check-up, cardiac disease for VCT cardiac service etc.

[5] Physical check up patient services: Hospital applies data mining to retrieve patient illness status to notify patient for physical check up.

B. Finding Association Rules

TABLE 1

An Example of Transaction Database

T_ID Transaction content 001 Diagnosis-2, Therapy1. 002 Check-up-N, Therapy1. 003 MRI-Check, Diagnosis-3, Treatment-3 004 Check-up-N, Therapy1

Until now, we are used to store data in a transaction

database containing simple items identified by the Transaction IDs (TID) as in Table 1. Let I = {i1, i2, , in} is the set of all n different items in D, each transaction in D is a subset of I. An itemset is defined as a subset of I.

TABLE 2

An Example of Multidimensional Transaction Database

T_ID Date POS_No Occupation Sex Age transaction content

001 05/03/01 003 Student F 23 Diagnosis-2, Therapy1.

002 05/03/01 003 Student M 14 Check-up-1, Diagnosis-2

003 05/03/01 003 Manager M 47 MRI-Check, Diagnosis-3, Treatment-3

004 05/03/02 003 Teacher F 34 Check-up-N, Therapy1

Rather than using a one-dimensional transaction

database, healthcare services and related information on patients are usually gathered in a relational database or data warehouse. Apart from keeping track of the item

feed, a relational database may record other attributes associated with the transactions, and another table to record profile of patients. After joining several relational tables, a big table can be obtained to store not only which items saved in the transaction, but also 5W1H information corresponding to the transaction. Table 2 illustrates an example of multidimensional transaction database MD, assuming each attribute is a dimension.

An association rule is usually in the form: A→B where A and B are two disjoint itemsets; A→B implies that among transactions within D, the occurrences of B have high correlation with the occurrences of A. There are two important factors for the association rule, viz. support, and confidence. Support means how often a rule applies, i.e. repeatability. Confidence means how often a rule is true, i.e. reliability. Suppose we have a multidimensional database MD, the support of an itemset X is the fraction of transactions containing X in MD. The confidence of A → B is the fraction of transactions containing A and B simultaneously in transactions containing A. Given a set of transaction MD and a thresholdσ as minimum support, X is a large-itemset in MD if the support of X in MD exceed σ . The task for mining association rules is to generate all large-itemsets that own support and confidence greater than the user-specified minimum support (called minSup) and minimum confidence (called minConf) respectively [5, 10].

Normally, we are likely to find association rules with

high support and/or confidence values, viz. frequent rules. Recently, the importance of vital few association rules is perceived, viz. infrequent rules. Efficient algorithms for finding infrequent rules are also in development. Some data mining approaches allow users to set minimum support/confidence as the threshold for mining [5, 10]. C. Multidimensional Data Mining

Finding association rules involving various attributes

efficiently is an important subject for data mining. Association Rule Clustering System (ARCS) was proposed in [9], where association rule clustering is proposed for a 2-dimensional space. The restriction of ARCS is that it generates one rule in once by clustering. Thus, it takes massive redundant scans to find all rules.

The method proposed in [12] finds all large-itemsets at first and then uses a directed graph to assign attributes according the user given priority of each attribute. Since the method is meant to discover the large-itemsets over a database as a whole, it may loss some rules that hold only in specific segments of the database. Different priorities of the condition attributes will induce different rules so

| Transactions in D containing X |

| Transactions in D |Support(X) =

| Transactions in D containing both A and B |

| Transactions in D containing A |Confidence(A→B) =

526

Proceedings of the 2007 IEEE IEEM

Page 3: [IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management - Singapore (2007.12.2-2007.12.4)] 2007 IEEE International Conference on Industrial Engineering

that user may need to try with all possible priorities to discover all possible rules. D. Apriori Algorithm (1) Apriori Algorithm: The Apriori algorithm is a level-wise iterative search algorithm for mining frequent itemsets w.r.t association rules [1-4, 6]. The drawback of Apriori algorithm is that it needs k passes of database scans when the cardinality of the longest frequent itemsets is k. In addition, the algorithm is computation intensive in generating the candidate itemsets and computing the support values, especially for application with very low support threshold and/or vast amount of items. In this algorithm, if the number of first itemsets element is k, the database will be scanned k times at least. It is not efficient enough and the key for improving the algorithm is to reduce the number of itemsets. (2) AprioriTID Algorithm [8]: The AprioriTID is a variant of Apriori algorithm which reduces the time needed for the frequency counting procedure by replacing every transaction in the database by the set of candidate sets that occurs in that transaction [8]. This is done iteratively. Though the AprioriTID algorithm is much faster in later iterations, it does much slower than the original Apriori in early iterations. This is mainly due to the additional overhead that is caused when the adapted transaction database Ck does not fit into main memory but to disk [5]. If a transaction does not contain any candidate k-sets, then Ck will not have an entry for the transaction. Hence, the number of entries in Ck may be smaller than the number of transactions in the database, especially at later iterations of the algorithm. Other drawbacks of ApririTID are that the database modified by Apriori-gen can be much larger than the initial database and only faster in the later stage of the scans. E. Concept description and Knowledge Schema

In comparison with the works for algorithms, the issue

of data structure and descriptive model for mining is less discussed. The concept description is problematic, since the term “concept description” is used in quite different ways. Researchers argue thus for a de factor standard definition for concept description [7, 13]. At this beginning stage, it is easier to deal with common criterion on higher abstraction levels of concept description, such as comprehension [7] and compatibility [5].

Han & Kamber view concept description as a form of data generalization and define concept description as a task that generates descriptions for the characterization and comparison of the data [7]. A Similar concept appears in the development of ontology for semantic web/grid. Semantic Web can be described as an extension of the existing Web where information is considered with priori defined meaning, enabling computer and people to work in cooperation centric to Internet [11]. The aim of such kind of techniques is to enhance ill-structured content so that it may be machine interpretable or human interactive.

In practical applications, ontology provides a vocabulary for specific domains and defines the meaning of the terms and relationships between them. In this article, ontology refers to the shared understanding (comprehension) on the domains of interests which is often conceived as a set of concepts, axioms etc. The term “Taxonomy” is similar to “Ontology” and both terms can be used to denote classification or categorization of concepts that describe entities and relations among them [5, 6]. This article applies the term Taxonomy rather than Ontology because the former is more flexible and even can cover the cases with no semantic meaning.

The term “portfolio” means a set of invested items in a certain domain, eventually with a weighting or a preference value [5]. Regarding business usages, the term at first comes from financial sectors to denote the set of chosen investment means. Recently, it is applied to industrial sectors for selecting optimal combination of projects, services, teaching materials (e-learning) etc. [5]. The goal of “Portfolio Management” becomes resource allocation and optimization.

III. METHODOLOGY A. Representation schema and data structure

Example itemset =:[Urban=Taipei, Sex=Male, Age= 40-50, Married] Fig. 1: An Example for Forest of Concept Taxonomies

For the sakes of comprehension and compatibility, the

Forest of Concept-Taxonomies is applied to represent the overall searching space, i.e. the set of all the propositions of the concepts. On top of this structure the sets of association patterns can be formed by selecting concepts from individual taxonomies. The notions can be clarified with examples as follows:

Taxonomy: A category consists of domain concepts in

a latticed hierarchical structure, while each member per se can be in turn taxonomy. An Example for patients’ characteristics can be [Age, Sex, Occupation...] and e.g. the taxonomy of occupation can be [manager, labor, teacher, Engineer...].

Forest of concept taxonomies: A hyper-graph for representing the universe of discourse or closed-world of interests encompasses domain taxonomies under consideration. An example for a forest of taxonomies of the patients’ characteristics is shown in Fig. 1.

Association Rule: A portfolio pattern consisting of elements taken from various concept taxonomies such

Any

20 ~ 24 25 ~ 29 30 ~ 34 35 ~ 39 40 ~ 44 45 ~ 49 50 ~ 54 55 ~ 59

Age

Unmarried

Marry

Married

Any

Urbanization

City TownMetropolitan

Any

Female

Sex

Male

Any

527

Proceedings of the 2007 IEEE IEEM

Page 4: [IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management - Singapore (2007.12.2-2007.12.4)] 2007 IEEE International Conference on Industrial Engineering

as [(Resident. = TPE), (Sex=female), (Cares = VCT & CDG)...] owns support and confidence greater than the user-specified minSup and minConf respectively.

By multidimensional data mining of association rules,

the notion of relation often refers to the belonging relationship between elementary patterns and generalized patterns rather than semantics [5]. Other notations that will be used in this article are shown in Table 3.

TABLE 3

Concepts and Notations Notation Meaning CT Concept Taxonomy Ei. The i-th element segment T[Ei] an element segment over Ei in MD Gi The j-th generalized pattern T[Gj] The j-th combined segment over Gj RE i Rules w.r.t the i-th element segment RGj Rules w.r.t the j-th generalized pattern (Gj , r) association rules over Gj w.r.t to match ratio r

B. The data mining algorithm

Fig. 2: The outline of the proposed algorithm.

In general, the mining process can be formulated with

two cascading steps: (1) finding all itemsets in each element segment and (2) updating all combinations of segments by the output of step1. For practical reason, the algorithm proposed hereby in step1 should be capable to be replaced by anyone of the available elsewhere such as the Apriori so that an easy segregation of the two steps enables the flexible mining on a distributed environment. Fig. 2 shows outline of the proposed algorithm extending the mining process into 4 phases.

The input of the mining process involves 5 entities, viz. (1) a multidimensional transaction database MD (optional when a default MD is assigned), (2) a set of concept taxonomies for each dimension (CT), (3) a minimal support, viz. minsup, (4) a minimal confidence, viz. minconf, and (5) a match ratio m for the relaxed match. The output of the algorithm includes all the multi-dimensional associations w.r.t a full/relaxed match in the

MD. The last 3 settings can help with finding frequent or infrequent rules.

The crucial feature of the proposed algorithm is to discover all association rules REi in the element segment T[Ei] for each element pattern Ei, and then uses REi to update RGj, i.e. the set of associations patterns for every generalized pattern Gj which includes Ei. The heuristic w.r.t each element pattern is to find the large-itemsets in itself and acknowledge its super generalized patterns with the result. The task w.r.t each generalized pattern is to decide which rules hold within it according to the acknowledgements from the element patterns. The mining procedure needs only to work on each element segment to determine which rules hold in the compound segments. Thus, there is no need to scan all of the potential segments for finding the rules. C. Generating all patterns and the pattern table

As a pre-processing, the algorithm generates at first all elementary and generalized patterns with the given forest, while a pattern table for recording the belonging relation-ship between the elementary and generalized patterns is built. Given a set of concept taxonomies, a multi-dimensional pattern can be generated by choosing a node from each taxonomy. The compound of different choices represents all the multidimensional patterns. Fig. 3 shows an example of 12 patterns.

Fig. 3 Belonging relationships between multidimensional patterns

Fig.4: The pattern table (for the relations in Fig. 3)

Fig. 3 shows the belonging relationship between

patterns in a lattice structure. The relationships are recorded in the form of bit map (see Fig. 4) which includes element patterns and generalized patterns. In the table, a “1” indicates that the element pattern belongs to the corresponding generalized pattern and “0” indicates the case vise verse.

D. The update process

After all patterns and the pattern table are generated, the procedure reads the transactions of each element segment and then discovers all the association patterns.

1) Input: 2) Multidimensional Transaction Database MD 3) Concept taxonomies for each dimension: CTx (X= 1-n)4) User given threshold: minsup, minconf, match ratio m, ;5) Procedure: 6) Phase0: 7) to generate all Ei and Gj by CTx (x = 1 to n);8) build the pattern table;9) Phase1: 10) For all Ei ⊂ G 11) to discover all association rules r in T[Ei] as REi12) Phase2: 13) for all Ei 14) for all Gj that Ei ⊂ Gj 15) to update RGj using REi; 16) Phase3: 17) for all Gj 18) For all r (which satisfy m) in RGj 19) output (Gj, r);20) Output: 21) all multidimensional association rules (p, r)

,

,

Spring, Female

Spring (Any)

Spring, MaleAprMar May

Mar, Male Mar, Female Apr, Male Apr, Female May, Male May, Female

Spring, Female

Spring (Any)

Spring, MaleAprMar May

Mar, Male Mar, Female Apr, Male Apr, Female May, Male May, Female

111111

(Spring)

101010

(Spring, Female)

010101

(Spring, Male)

110000

(May)

001100

(Apr)

000011

(Mar)

(May, Female)(May, Male)(Apr, Female)(Apr, Male)(Mar, Female)(Mar, Male)

111111

(Spring)

101010

(Spring, Female)

010101

(Spring, Male)

110000

(May)

001100

(Apr)

000011

(Mar)

(May, Female)(May, Male)(Apr, Female)(Apr, Male)(Mar, Female)(Mar, Male)

528

Proceedings of the 2007 IEEE IEEM

Page 5: [IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management - Singapore (2007.12.2-2007.12.4)] 2007 IEEE International Conference on Industrial Engineering

The output of this phase is all REi for each element pattern Ei that will be fed as the input to the next phase for updating each RGj using REi. For the full match illustrated in Fig. 5, the update is done by intersection of the set RGj and the set REi, where Ei belongs to Gj (let RGj = REi if RGj is updated for the first time). After all the intersections, the association patterns r left in RGj holds in all element segments covered by T[Gj].

Fig. 5: The “Update” algorithm for the full match

Fig. 6: The “Update” procedure for the relaxed match

For the relaxed match shown in Fig. 6, a counter is

designated for each rule in RGj. While using REi for updating RGj, the counters of both RGj and REi are incremented by 1 and the rules those appear in REi but not in RGj will be added to RGj, while setting the counter to 1. After all the update process, the association rule r in RGj whose counters exceed m|T[Gj]| holds in at least m *100% of the element segments T[Ei] covered by T[Gj], and thus (Gj, r) is an association rule in MD w.r.t the relaxed match. E. The output function

For a full match, the algorithm outputs all the (Gj, r)

pairs for every r left in each RGj. For a relaxed match, it outputs all the (Gj, r) pairs for every r in each RGj where the count exceed |mT[Gj]|. By means of this approach, loss of the rules that hold only in some meaningful segments can be prevented. And, counting multi-dimensional association rules that do not hold over all the range of the domain as general ones can also be avoided. For example, the full match can guarantee that the rules those hold only in two months of spring but fail in the rest one will never be counted as an association rules for the whole spring.

F. Incremental data mining

With the proposed approach, incremental data mining

can be realized. By keeping the rules mined out in each element segment, we need only mining on the new data. That is, using the proposed approach, we can produce the new association rules by combining the rules discovered

from the new data with the exiting rules to reduce redundant scan on the old data.

F. Design of metrics for measuring data mining

In order to assure the performance of the proposed

approach, we need to design metrics for evaluating the mining tools, at least to measure whether it is better than the previous algorithms. By cascade evaluating the results of a hypothetical measurement, we can evaluate the consequence from any sequence of measurements to determine the optimal next measure. With this concept, a one-step look-ahead strategy based on Shannon’s Entropy Function is adopted and the capacity of ICT systems can be described in the following form [5, 14]:

C=B*[log2 (1+S/N)], where B is the bandwidth, (S/N) is Signal-to-Noise ratio.

Drawing on this equation, the capacity, viz. the performance of data mining can be formulated as follow: C=|D| [log2 (1+information lost ratio) ], where |D| is the number of transactions of whole database [5].

While WSEi denotes each element segment in the measure, the WSEi of an element segment T[Ei] can be generated by a uniform distribution between 0 and SM. Suppose there are N element segments, the number of transactions in the element segment T[Ei] is:

Ein

a

Ea

Ei WSWS

DD

∑=

=

0

Thereafter, the definitions for information loss can be: Definition 1: discrete ratio is the ratio of the number of rules pruned by the improved algorithm to the number of rules discovered by prior mining approaches.

Definition 2: lost ratio is the ratio of the number of rules discovered by the improved algorithm but lost in the previous mining approaches to the number of rules discovered by the improved algorithm.

IV. EXPERIMENT AND EVALUATION

A. The experiment scenario on case of a medical center

A scenario for a medical center and related data were established to evaluate the performance of the proposed approach. The center possesses various departments and a web-site for e-services. The testbench is implemented with standard Java language on a PC Server with AMD

1) for all REi2) for all Gj Ei3) if (RGj never be updated)4) RGj = REi;5) else6) RGj = RGj REi;

1) for all REi2) for all Gj Ei3) if (RGj never be updated)4) RGj = REi;5) else6) RGj = RGj REi;

1) for all REi2) for all Gj Ei3) for all r in REi4) if (r RGj)5) add r to RGj;6) RGj.r.count = 1;7) else8) RGj.r.count++;

| { r | r holds in T[Gj] <Gj, r> doesn’t hold in MD } || {r | r holds in T[Gj] } |discrete ratio =

| {<Gj, r> | <Gj, r> holds in MD r doesn’t hold in T[Gj] } || {<Gj, r> | <Gj, r> holds in MD } |lost ratio =

529

Proceedings of the 2007 IEEE IEEM

Page 6: [IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management - Singapore (2007.12.2-2007.12.4)] 2007 IEEE International Conference on Industrial Engineering

Athlon processor and MS Window NT operation system. Data from different departments and the website are gathered for the experiment.

B. The experiment data

The center provided basic patterns such as <Age over 60, take most of blood transfusions of Aphaeresis Platelet (AP) > resulted from its own tool and there are ca. 397K basic data. There are various attributes in the database of patients’ records that may influence the healthcare behaviors. We take five of them, viz. (Residence, Sex, Occupation, Age, Marriage) as the dimensions for the patients’ characteristics (ref. Fig. 1). Adding with the therapy/service catalog and the cost records, there are 7 dimensions, i.e. 7 concept-taxonomies. With respect to Apriori-Generator, three types of synthetic data sets are generated respectively, as shown in table 4. There are 110 multidimensional patterns w.r.t these taxonomies, 40 of them are element patterns and the other 70 of them are generalized ones. The proposed mining approach should find all large-itemsets for the 70 generalized patterns.

TABLE 4

Three Types of Experimental Data Set

Type1 To generate a single set of maximal candidate large-itemsets andthen generate transactions for each element pattern Ei following apriori-gen [3].

Type2 Beside a set of common maximal candidate large-itemsets, to generate maximal candidate large-itemsets for each element pattern Ei. and then generate transactions for each element pattern Ei and the common maximal candidate large-itemsetsrespectively following the apriori-gen [3].

Type3 generating a set of maximal candidate large item-sets for each element pattern Ei, and then generating transactions for each element pattern Ei from its own maximal candidate large-itemsets following the apriori-gen [3].

The first task for the evaluation is to determine the

size of the transactions, where the size is picked from a Poisson distribution with mean μ equal to the average transaction size |T|. As the second step, each transaction is assigned a series of candidate large-itemsets. If a large-itemset on hand does not fit with the transaction, the itemset is put in the transaction randomly in half of the cases. The number of maximal candidate large-itemsets is set to the maximal size of candidate large-itemsets |L|. A maximal candidate large-itemsets is generated by the size of the itemsets from a Poisson distribution with mean μ equal to its average size |I|. A series of data sets is accordingly generated for the experiments. The default value of MinSup is 0.5%, match ratio m = 1, and the number of element patterns is 40. C. The results of experiment and evaluation

At first, the 70 generalized patterns are successfully found and the execution time needed for processing itemsets in general is linear to the number of records as shown in Fig. 7. The experiment result w.r.t scalability in

Fig. 8 shows that the algorithm takes execution time linear to the number of transactions of all three data types.

Fig. 7: Scalability experiment w.r.t. the no. of records

Fig. 8: Scalability experiment w.r.t. the no. of transactions

The next experiment step is to evaluate the effect of

minimum supports on the algorithm. The experiment results are shown in Fig. 9. All of this kind of algorithms is sensitive to the minimum support; the smaller the minimum support, the longer the execution time. However, the real execution time of the step 2 (the update) in the proposed algorithm is relativity much shorter than the whole process (see Fig. 7). Fig. 10 shows the effect of number of element patterns on the algorithm, i.e. the algorithm takes time linear to the number of element patterns on all three data types.

Fig. 9: Efficiency in relation to the minimum support

Fig. 10: Efficiency in relation to the number of element patterns.

530

Proceedings of the 2007 IEEE IEEM

Page 7: [IEEE 2007 IEEE International Conference on Industrial Engineering and Engineering Management - Singapore (2007.12.2-2007.12.4)] 2007 IEEE International Conference on Industrial Engineering

One of the ultimate goals of this work is to discover association rules those hold in the meaningful element data segments within larger domains, and to mine rules that are lost in existing mining tools. Fig. 11 and Fig. 12 show the results of the measure on discrete ratio w.r.t minSup and match ratio, i.e. with these settings the proposed approach can effectively prune the rule w.r.t a generalized segment when the rule fails in a sub-segment of it. For instant, with a higher setting value only <Female, Age over 60, take most of AP transfusions> will be found but not <Age over 60, take most AP transfusions>. Fig. 13 shows the test result on lost ratio, i.e. the influence of minSup values on the lost rules by other mining tools in comparison to this approach. The above results proved that it is reliable to use the proposed approach to find frequent and infrequent rules on different granularities while users can set the minSup value and match ratio.

Figure 11: Effects of MinSup on discrete large-itemsets ratio

Figure 12: Effects of match ratio on discrete large-itemsets ratio

Figure 13: Effects of MinSup on lost itemsets.

V. SUMMARY

This paper proposes an approach including a novel

data structure and an efficient algorithm for mining association rules on various granularities. It is proved very useful for discovering healthcare service patterns. The advantages of this approach over existing approaches include (1) more comprehensive and easy-to-use (2) more efficient with limited scans and storage I/O operations (3)

more effective with finding rules hold in different granularity levels (4) capable of finding frequent pattern and infrequent patterns while users can choose the full match and the relaxed match (5) capable of incremental mining of association patterns to avoid unnecessary re-scan (6) low information loss rate.

The whole process of the development and evaluation of the multidimensional data mining approach was discussed in this paper. Since there is in our knowledge no metrics serving as base for the measurement, new metrics is derived from Shannon’s Entropy Function. The evaluation results show, the performances of the proposed approach, including efficiency, scalability and information loss rate, are better than the other approaches we know. The effects of perceived issues and potential development of data mining as well as concept description are worthy of further investigation.

REFERENCE

[1] Rakesh Agrawal, John C. Shafer: “Parallel Mining of Association Rules”, IEEE Transactions on Knowledge and Data Engineering, 8(6) (1996) 962-969.

[2] Ramakrishnan Srikant, Rakesh Agrawal: “Mining Generalized Association Rules”, Proceedings of the 21th VLDB Conference, Zurich, Swizerland, 1995.

[3] R. Agrawal, R. Srikant: “Fast algorithms for mining association rules in large databases”, in: Proceedings of the ’94 International Conference on Very Large Data Bases, pp. 487–499, 1994,

[4] R. Agrawal, T. Imielinski, A.N. Swami: “Mining association rules between sets of items in large databases”, in: Proceedings of the ACM-SIGMOD 1993 International Conference on Management of Data, pp. 207–216, 1993.

[5] Johannes K. Chiang: “Developing an Multidimensional Mining Data Mining Approach on various Granularities -On Examples of Financial Portfolio Discovery”, ISIS 2007, Sokcho City, Korea, 2007.

[6] Ronen Feldman, James Sanger: “The Text Mining Handbook – Advanced Approaches in Analyzing Unstructured Data”, Cambridge University Press, 2007.

[7] J. Han and M. Kamber: “Data Mining - Concepts and Techniques”, 2nd ed, Morgan Kaufman, 2006.

[8] Li-jiang He, Li chao Chen, Shuang-ying Liu: “improvement of AprioriTID algorithm for Mining Association Rules”, Journal of Yangtai University, Vol. 16, No. 4, 2003.

[9] Brian Lent, Arun Swami, Jennifer Widom: “Clustering Association Rules”, in: 13th International Conference on Data Engineerin.

[10] Bing Liu, Wynne Hsu, Yiming Ma: “Mining Association Rules with Multiple Minimum Supports”, in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-99)

[11] Maozhen Li, Mark Baker: “The GRID – Core Technologies”, Willy 2005.

[12] Pauray S.M. Tasi, Chien-Ming Chen: “Mining interesting association rules from customer databases and transaction databases “.

[13] The CRISP-DM Consortium, CRISP-DM 1.0, www.crisp-dm.org, 2000.

[14] William Stallings: “Business Data Communications”, 5th ed., Pretice Hall. 2004.

531

Proceedings of the 2007 IEEE IEEM