Indirect Association: Mining Higher Order Dependencies in Data

Technical Report TR 00-037
Department of Computer Science and Engineering
University of Minnesota
4-192 EECS Building
200 Union Street SE
Minneapolis, MN 55455-0159 USA

Pang-ning Tan, Vipin Kumar, and Jaideep Srivastava
June 05, 2000


Indirect Association: Mining Higher Order Dependencies in Data

Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava
Department of Computer Science, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455.
{ptan,kumar,srivasta}@cs.umn.edu

Abstract. This paper introduces the concept of indirect association between items and examines its utility in various application domains. Existing algorithms for mining association rules, such as Apriori, will only discover itemsets that have support above a user-defined threshold. Any itemsets that fall below the minimum support requirement are filtered out. We will show that some of the removed itemsets may provide useful insight into the data. Consider a pair of items (a, b) with a low support value. If there is an itemset Y such that the presence of a and b are highly dependent on items in Y, then (a, b) are said to be indirectly associated via Y. We have identified many potential applications for indirect associations. In market basket scenario, these patterns can be used to perform competitive analysis among products. For text documents, indirect associations can be used to identify synonyms, antonyms or words that appear in the different contexts of another word. We will present a formal framework for describing indirect association and propose an algorithm for mining such patterns. Finally, we will demonstrate the benefits of mining these patterns based on empirical results obtained from retail, textual and stock market data.


1 Introduction

In recent years, there has been considerable interest in extracting association rules [AIS93b,AIS93a] from large databases. Conceptually, an association rule indicates that the presence of a set of items (called itemset) in a transaction often implies the presence of other items in the same transaction. Current research effort has focussed on developing more efficient algorithms for mining these patterns [AS94,PCY95,SON95] as well as extending the concept beyond the single transaction, binary literal paradigm [TLHF99,FLYH99].

The problem of mining association rules is often decomposed into two separate tasks: (1) discover all itemsets having support¹ above a user-defined threshold, and (2) generate rules from these frequent itemsets. Under this formulation, any itemsets that fail the support threshold condition are considered to be uninteresting. However, we believe that some of the infrequent itemsets may provide useful insight about the data. Consider a pair of items, (a, b), that seldom co-occur in the same transaction. If both items are highly dependent on the presence of another itemset, Y, then the pair (a, b) is said to be indirectly associated via Y (Figure 1). In this paper, we will investigate the problem of mining indirect association and examine its utility in various application domains. Note that unlike typical association rules, an indirect association pattern is characterized by the two items participating in this relationship, along with all other items mediating the interaction.

[Figure 1: diagram showing item a, item b, and the mediator items y_1, y_2, ..., y_k of Y.]

Fig. 1. Indirect association between a and b via a mediating itemset Y.

¹ Support of an itemset X denotes the fraction of transactions containing the itemset X.

We motivate the usefulness of indirect association with several examples. For market basket data, this method can be used to perform competitive analysis of products. As an example, a and b may represent competing brands of a footwear product, such as Reebok and Nike.² Suppose Reebok marketing executives are interested in expanding their current market share by attracting customers of their competitor through direct marketing campaigns. However, instead of promoting to every Nike customer, such a campaign can be made more effective, in terms of cost-benefit and lift analysis [LL98,PSM99], by selecting a smaller target group whose buying behavior strongly resembles that of Reebok customers. This is where indirect association can help, by identifying the items that are highly dependent on, and frequently bought together with, Nike and Reebok products.

For text documents, indirect association between a pair of words often corresponds to synonyms, antonyms or words that are present in the different contexts of another word. For example, the words coal and data can be indirectly associated via mining. If a user queries on the word mining, the collection of documents returned often contains a mixture of both mining contexts. Explicit discovery of indirect association provides an opportunity to segment the query results into various contexts of the queried term.

Similarly, for stock market data, indirect association can be used to identify the different sets of events influencing the upward or downward movement of a stock price. For example, Redhat-Up and Intel-Down may be indirectly associated via Microsoft-Down. One can interpret this pattern as saying that the event Microsoft-Down is associated with at least two other disjoint sets of events: one which is causing the stock price of its competitor to go up while the other may correspond to the decline of other technology group stocks.

This paper is organized in the following way. In Section 2, a formal definition of indirect association is presented. Next, an algorithm for mining such patterns is given, followed by empirical results showing its utility in various application domains.
Finally, Section 5 concludes with a summary of our work and suggestions for future research.

² The use of these brand names is for illustrative purpose only.

2 Indirect Association

2.1 Related Work

The importance of indirect association between attributes of a dataset has been recognized by several authors [Mel96,DMR98]. However, to our knowledge, there have not been any direct attempts to explicitly derive such patterns from large datasets.

[Mel96] observed that automated document translation systems tend to produce lexicon translation tables that are full of indirectly-associated words. A lexicon translation table encodes the probability that two words from different languages are semantically equivalent to one another. The presence of indirect association can pollute the resulting tables, thereby reducing the overall precision of the system. An iterative strategy was proposed in [Mel96] to clean up existing translation tables by finding only the most probable translations for a given word.

[DMR98] introduced the notion of internal and external measures of similarity between attributes of a database relation. Internal similarity between two attributes a and b is a measure whose value depends only on the values of the a and b columns. Conversely, an external measure takes into account data from other columns (called the probe attributes). Their notion of probe attributes is similar to our idea of mediators for indirect association. However, their sole purpose of using probe attributes is to perform attribute clustering.

Indirect association is closely related to the notion of negative association rules [SON98]. In both cases, we are dealing with itemsets that do not have sufficiently high support. A negative association rule identifies the items a customer is unlikely to buy given that he/she buys a certain set of other items. Typically, the number of negative association rules can be prohibitively large and the majority of them are not interesting to a data analyst. [SON98] proposed the use of domain knowledge, in the form of item taxonomy, to decide what constitutes an interesting negative association rule. The intuition here is that items belonging to the same parent node in a taxonomy are expected to have similar types of associations with other items. If the observed support is significantly smaller than its expected value, then we conclude that a negative association exists between the items. Again, unlike indirect association, this type of regularity does not specifically look for mediating elements.

Another related area is the study of functional dependencies in relational databases. Functional dependencies are relationships that exist between attributes of a relation.
However, the emphasis of functional dependencies is to find dependent and independent attributes for applications such as semantic query optimization [Wed92] and reverse engineering [TBSH93].

2.2 Problem Formulation

Let $I = \{i_1, i_2, \cdots, i_d\}$ denote a set of binary literals (called items) and let $T$ be the set of all transactions, $T = \{t_j \mid \forall j : t_j \subseteq I\}$. Below, we give a formal definition of indirect association.

Definition 1. An itempair {a, b} is indirectly associated via a mediator Y if the following conditions hold:
1. $\mathrm{sup}(a, b) < t_s$ (Itempair Support Condition)
2. There exists a non-empty set Y such that $\forall y_i \in Y$:
   (a) $\mathrm{sup}(a, y_i) \ge t_f$, $\mathrm{sup}(b, y_i) \ge t_f$ (Mediator Support Condition).
   (b) $d(a, y_i) \ge t_d$, $d(b, y_i) \ge t_d$, where $d(p, q)$ is a measure of the dependence between p and q (Dependence Condition).

The above thresholds are called itempair support threshold ($t_s$), dependence threshold ($t_d$) and frequent itemset threshold ($t_f$) respectively. At first glance, it may seem that there are too many parameters needed in the above formulation. In practice, we only need two threshold conditions, $t_f$ and $t_d$ (similar to support and confidence thresholds for association rules). $t_s$ can always be chosen to be some reasonable constant fraction of $t_f$.³

Condition 1 is needed because an indirect relationship between two items is significant only if both items rarely occur together in the same transaction. Otherwise, it makes more sense to characterize the pair in terms of their direct association. An alternative to this condition is to test for independence between two items. Nevertheless, it is often the case that itempairs consisting of independent (or negatively correlated) items tend to have low support values in many natural datasets.

Condition 2(a) can be used to guarantee the statistical significance of the mediator set. In particular, for market basket data, the support of an itemset affects the amount of revenue generated and justifies the feasibility of a marketing decision [BSVW99]. Moreover, support has a nice downward closure property which allows us to prune the combinatorial search space of the problem.

Condition 2(b) ensures that only items that are highly dependent on the presence of a and b will be used to form the mediator set. Over the years, many measures have been proposed to quantify the degree of dependence between attributes of a dataset. From statistics, the $\chi^2$ test is often used for this purpose. However, the drawback of this approach is that it does not measure the strength of dependencies between items [BMS97,WH75]. Furthermore, the $\chi^2$ statistic depends on the number of transactions in the database. As a result, other statistical measures of association are often used, including Pearson's $\phi$ coefficient, Goodman and Kruskal's $\lambda$, Yule's Q and Y coefficients, etc. [Rey77].

In the original formulation of association rule mining, confidence was chosen to measure the goodness of a rule. However, [BMS97] showed that this measure may produce counter-intuitive results especially when an itemset exhibits a strong negative correlation among its items.
Interest factor is another measure that has been used quite extensively to quantify the strength of dependency among items [BMS97,BSVW99,CC99].

³ In our experiments, we chose $t_s = t_f / 10$.

Definition 2. The interest factor between two items X and Y is:

$$I(X, Y) \equiv \frac{P(X, Y)}{P(X)\,P(Y)} \qquad (1)$$

Even though interest factor is an appropriate measure from a statistical viewpoint, it may not be interesting within the microeconomic framework of a retailer [BSVW99].

Other objective measures of interestingness from data mining and machine learning literature include Piatetsky-Shapiro's rule-interest, Gini index, conviction, Laplace, J-measure, etc. An important criterion for a good objective measure lies in its ability to capture the statistical notion of correlation. For binary variables, the correlation coefficient is given by:

$$\phi_{X,Y} = \frac{E(X, Y) - E(X)\,E(Y)}{\sqrt{E(X)\,(1 - E(X))}\;\sqrt{E(Y)\,(1 - E(Y))}} \qquad (2)$$

where $E(\cdot)$ is the expected value of its argument. It can be shown that for a certain range of support values,⁴

$$\phi_{X,Y} \approx \sqrt{I(x, y) \times \mathrm{sup}(x, y) / |T|}.$$

We will denote the right-hand side of this expression as the IS measure.

Figure 2 shows the relationship between $\phi_{X,Y}$ and I factor for various datasets used in our experiments. The graph was obtained by comparing the correlation coefficient against I factor for all itempairs that pass the minimum support threshold given in Table 2. Figure 3 is the corresponding graph for $\phi_{X,Y}$ and IS measure. Note the high linearity of these graphs, particularly for retail data. This suggests that IS measure is indeed a reasonable estimate of statistical correlation. Moreover, it is a desirable metric since it takes into account both the interestingness and support aspects of a pattern. In this paper, we will use IS as the dependence measure for Condition 2(b).
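To make the relationship between these three measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes that sup(x, y)/|T| is the fraction of transactions containing both items, written here as p_xy, so that IS reduces to P(x, y) / sqrt(P(x) P(y)). The function names and the example probabilities are hypothetical.

```python
import math

def interest(p_xy, p_x, p_y):
    """Interest factor I(X, Y) = P(X, Y) / (P(X) P(Y)), as in Equation (1)."""
    return p_xy / (p_x * p_y)

def phi(p_xy, p_x, p_y):
    """Correlation coefficient for two binary variables, as in Equation (2)."""
    return (p_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

def is_measure(p_xy, p_x, p_y):
    """IS = sqrt(I(X, Y) * P(X, Y)); assumes p_xy is the pair's support fraction."""
    return math.sqrt(interest(p_xy, p_x, p_y) * p_xy)

# Hypothetical probabilities: two rare items that co-occur far more often
# than independence would predict.
p_x, p_y, p_xy = 0.01, 0.02, 0.005
print(interest(p_xy, p_x, p_y), phi(p_xy, p_x, p_y), is_measure(p_xy, p_x, p_y))
```

With these made-up inputs the interest factor is large while the correlation coefficient and IS stay close to each other, which is the regime in which the approximation is stated to hold.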

[Figure 2: four panels plotting the correlation coefficient (y-axis) against interest (x-axis): Reuters-21578 (Finance), Reuters-21578 (Commodity), Retail data, and S&P-500.]

Fig. 2. Comparison between I factor and $\phi_{X,Y}$ for frequent itempairs in various datasets.

⁴ Such as $P(x) \ll 1$, $P(y) \ll 1$ and $\frac{P(x, y)}{P(x)\,P(y)} \gg 1$.

[Figure 3: four panels plotting the correlation coefficient (y-axis) against IS (x-axis): Reuters-21578 (Finance), Reuters-21578 (Commodity), Retail Data, and S&P-500 Data.]

Fig. 3. Comparison between IS and $\phi_{X,Y}$ for frequent itempairs in various datasets.

Mediator of Indirect Association. Another issue we need to address is what constitutes an element of the mediator set. A naive approach is to treat single items as mediating elements. For example, in the market basket scenario, if the purchase of coffee and tea is highly dependent on the purchase of sugar and cream, then Y = {{sugar}, {cream}} may serve as the mediator between coffee and tea. One can extend this definition to include itemsets of larger size. For example, in the text domain, individual words may not have significant dependencies compared to phrases of words. Other forms of representation are also possible. For example, a mediating element can contain both positive and negative literals $(\bar{p}, q, \bar{r})$ as well as disjunctive items $(p \lor q \lor r)$. In this paper, we will focus on conjunction of positive literals as mediating elements.

Indirect Affinity. It is often useful to quantify the overall affinity between two indirectly-associated items. Such a metric can be used to rank the discovered patterns according to the strength of their indirect association. If the mediator has only one element y, then the overall affinity between a and b is:

$$\mathrm{Aff}(a, b \mid y) = \min(d(a, y),\; d(b, y)) \qquad (3)$$

where $d(x, y)$ is the dependence measure between x and y (the IS measure in this paper).⁵ For a mediator set with more than one element, there are several ways to define the overall affinity between a and b. In this paper, we will use the lower bound of this value, i.e.:

$$\mathrm{Aff}(a, b \mid Y) \equiv \max_{y_i \in Y} \left( \min(d(a, y_i),\; d(b, y_i)) \right) \qquad (4)$$

⁵ Other functions based on arithmetic or geometric means can also be used.
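As a small illustration of the max-min affinity in Equations (3) and (4), the sketch below computes it from per-mediator dependence values. The function name and the dictionary layout are assumptions made for this example; the numbers are the d(a, Y_i) and d(b, Y_i) values reported for the first itempair of Table 5.

```python
def affinity(dep_a, dep_b):
    """Overall affinity of (a, b): max over mediators of min(d(a, y), d(b, y)).

    dep_a[y] and dep_b[y] hold d(a, y) and d(b, y) for each mediating element y.
    With a single mediating element this reduces to Equation (3).
    """
    return max(min(dep_a[y], dep_b[y]) for y in dep_a)

# Dependence values for Comforter Queen (a) and Comforter King (b), Table 5, row 1.
d_queen = {"Drapes": 0.5634, "Valance": 0.5771, "Pillow": 0.5840}
d_king = {"Drapes": 0.4698, "Valance": 0.4708, "Pillow": 0.4446}
print(affinity(d_queen, d_king))
```

The weaker of the two dependencies decides each mediator's contribution, and the strongest mediator decides the overall score, so a pair ranks highly only when at least one mediating element is strongly tied to both items.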


3 Algorithm

3.1 Main Algorithm

An algorithm for mining indirect association between itempairs is given in Table 1. Initially, an itempair support matrix S is constructed by scanning the entire database (step 2). The sum of each row in S will give the support count of each individual item. Next, the support matrix will be used to prune the itempair space (step 3) based on the following criteria:

- If the support of a is below $t_f$, it will not appear in any frequent itemsets. Therefore, we can remove all itempairs involving a from our consideration.
- If the support of a is above $t_f$, but a does not belong to any frequent itempairs, then the mediator set for a will always be empty. Again, we can ignore any itempairs involving a from further consideration.
- Any itempairs that violate the itempair support condition can be removed.

Table 1. Basic algorithm for mining indirect association between itempairs.

Indirect Association Algorithm:
1. let S = [sup(a, b)] denote the support matrix for all itempairs (a, b).
2. for each transaction $t_i \in T$: UpdateSupportMatrix($t_i$, S).
3. prune the itempair space.
4. for each remaining itempair (a, b):
   4a. Y ← FindMediator(S, a, b, $t_d$, $t_f$)
   4b. if Y = ∅, go to next itempair.
   4c. else insert (a, b | Y) into IP.
5. Filtering or ranking itempairs.

The key step of this algorithm lies in step 4a. There are two major phases in this step: candidate generation and pruning of the mediator. Basically, it assumes that a lattice of frequent itemsets, FI, has been generated using a standard algorithm such as Apriori. During each pass of candidate generation, it will find all frequent itemsets, $y_i \subseteq I - \{a, b\}$, such that both $\{a\} \cup y_i \in FI$ and $\{b\} \cup y_i \in FI$. Then, it will compute the IS measure for each $y_i$ with respect to both a and b. Finally, during the pruning step, all candidate itemsets that fail the mediator dependence condition will be removed. If the resulting mediator set is empty, then we conclude that the itempair is not indirectly associated (step 4b). Otherwise, we will insert the pair along with its mediator into IP. The indirect itempairs can then be ranked according to their IS measure or filtered using other subjective measures (step 5).
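The steps of Table 1 can be sketched roughly as follows. This is an illustrative brute-force version, not the authors' implementation: it recounts supports directly instead of using a support matrix and an Apriori lattice, and it caps mediator size with a hypothetical max_size parameter. The function names and the toy transactions are also assumptions made for this example.

```python
from itertools import combinations
from math import sqrt

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    s = frozenset(itemset)
    return sum(s <= t for t in transactions) / len(transactions)

def is_measure(x, y, transactions):
    """IS dependence between itemsets x and y: P(x, y) / sqrt(P(x) P(y))."""
    p_x, p_y = support(x, transactions), support(y, transactions)
    if p_x == 0 or p_y == 0:
        return 0.0
    return support(set(x) | set(y), transactions) / sqrt(p_x * p_y)

def find_mediator(a, b, transactions, items, t_f, t_d, max_size=2):
    """Return {y: (d(a, y), d(b, y))} for candidates passing 2(a) and 2(b)."""
    mediators = {}
    others = sorted(items - {a, b})
    for size in range(1, max_size + 1):
        for y in combinations(others, size):
            # Condition 2(a): {a} ∪ y and {b} ∪ y must both be frequent.
            if (support({a, *y}, transactions) < t_f
                    or support({b, *y}, transactions) < t_f):
                continue
            d_a = is_measure({a}, y, transactions)
            d_b = is_measure({b}, y, transactions)
            # Condition 2(b): both dependence values must reach t_d.
            if d_a >= t_d and d_b >= t_d:
                mediators[y] = (d_a, d_b)
    return mediators

def indirect_associations(transactions, t_s, t_f, t_d, max_size=2):
    """Map each indirectly associated pair (a, b) to its mediator set."""
    transactions = [frozenset(t) for t in transactions]
    items = set().union(*transactions)
    # Pruning: items below t_f cannot take part in any frequent itemset.
    frequent = {i for i in items if support({i}, transactions) >= t_f}
    result = {}
    for a, b in combinations(sorted(frequent), 2):
        if support({a, b}, transactions) >= t_s:  # Condition 1 fails
            continue
        y = find_mediator(a, b, transactions, frequent, t_f, t_d, max_size)
        if y:
            result[(a, b)] = y
    return result

# Toy market basket data (hypothetical): coffee and tea never co-occur,
# but both are frequently bought with sugar and with cream.
toy = [
    {"coffee", "sugar", "cream"}, {"coffee", "sugar"}, {"coffee", "cream"},
    {"tea", "sugar", "cream"}, {"tea", "sugar"}, {"tea", "cream"},
    {"bread"}, {"bread", "sugar"},
]
print(indirect_associations(toy, t_s=0.1, t_f=0.2, t_d=0.4))
```

On this toy input the only pair reported is (coffee, tea), mediated by the single-item sets {cream} and {sugar}. Direct counting keeps the sketch short but rescans the data for every candidate, so it is only suitable for small examples.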


3.2 Complexity Analysis

We will now briefly discuss the complexity of our algorithm. The UpdateSupportMatrix function requires a single scan over the entire database. Assuming that the matrix update operation takes O(1) and the maximum width of a transaction is k, this operation takes at most $O(k^2 |T|)$. The pruning step can be executed in $O(n^2)$ where n is the total number of items. Next, for each infrequent itempair (a, b), the most expensive operation in the FindMediator function, $T_m$, is the candidate generation. The complexity of this step depends on the number of frequent itemsets containing a or b. There is also an overall cost of computing the frequent itemset lattice, T.

The ranking step requires us to first compute the indirect affinity between every discovered itempair. Suppose w is the maximum number of mediating itemsets between a pair of indirectly associated items. In the worst-case scenario, this step will take $O(n^2 w)$ plus an additional $O(n^2)$ to sort the itempairs. Hence, the overall worst-case complexity of this algorithm is $O(k^2 |T| + w n^2 + n^2 T_m + T)$.

3.3 Alternative scheme for finding single item mediators

An alternative way of finding indirectly associated itempairs with single item mediators is to use a bottom-up approach. Essentially, the algorithm assumes that all items can be potential mediators. For each such item, y, the goal is to find all neighboring items $x_i$, such that $(x_i, y)$ satisfies both mediator support and dependence conditions. Then, for every pair of such neighbors $(x_i, x_j)$, we will declare them as indirectly associated if the support of the itempair is below $t_s$.
This approach is useful for finding indirect association in text and stock datasets, where the goal is to find the different contexts (or events) associated with the mediating element.

4 Experimental Results

To demonstrate the utility of indirect associations, experiments were carried out using datasets from three real-world domains: text, retail and stock data. Table 2 shows the parameters of each dataset along with the threshold values used. Figure 4 illustrates the effect of using various $t_f$ values on the number of frequent and indirectly associated itempairs generated by our algorithm. These experiments were performed on a 500-MHz Pentium III machine with 256MB main memory.

Table 2. Summary of dataset parameters and results.

Dataset               t_s     t_f    t_d   n       |T|     # Freq pairs   # Indirect pairs
Reuters (Finance)     0.15%   1.5%   0.3   2886    2005    10877          99
Reuters (Commodity)   0.15%   1.5%   0.3   3785    2308    6621           34
Retail Data           0.01%   0.1%   0.1   14462   58565   1174           59
S&P-500               0.25%   2.5%   0.2   976     716     13229          262

[Figure 4: number of itempairs (y-axis) against the frequent itemset support threshold (x-axis) for Reuters-21578 (Commodity), with one curve for frequent itempairs and one for indirect itempairs.]

Fig. 4. Number of frequent and indirect itempairs versus frequent itemset thresholds, $t_f$.

4.1 Reuters-21578 Distribution 1.0 newswire articles

This dataset consists of a collection of news articles that appeared on Reuters newswire in 1987.⁶ Initially, two distinct subsets of the dataset were chosen: financial articles and commodity articles. The finance category contains articles on topics such as trade, retail, money, interest, etc., whereas the commodity category encompasses subject areas such as rubber, sugar, gold, coffee and gas. Articles in both categories are preprocessed by removing stopwords and stemming each word to its root form.

⁶ Available at http://www.research.att.com/~lewis

The finance category contains 546 unique, stemmed words from 2005 news articles. Using our proposed algorithm, 99 indirectly associated itempairs are discovered. Table 3 shows some of the discovered itempairs, ranked according to their IS measure. As expected, most of the word pairs represent the different contexts of a word. For example, the word forecast often appears in two different contexts: forecast of product shortage and forecast of growth in the economy. Other examples include temporary agreement versus negotiate for agreement, trade dealer versus trade partner and supply of money versus shortage of money.

Table 3. Indirect association between stemmed word pairs for Reuters-21578 Finance dataset.

#   a         b           y          d(a, y)   d(b, y)   IS rank
1   suppli    shortag     monei      0.4940    0.4597    2
2   partner   temporari   trad       0.3932    0.3043    8
3   partner   dealer      trad       0.3932    0.3064    77
4   shortag   growth      forecast   0.3429    0.3502    27
5   shortag   inflat      forecast   0.3429    0.3241    50
6   shortag   econom      forecast   0.3429    0.3167    61
7   negoti    temporari   agreem     0.3625    0.3337    37

There are 992 stemmed words and 2308 articles in the Reuters commodity news dataset. We found 34 indirectly associated word pairs satisfying the threshold conditions (Table 4). The interesting pairs include Soviet Union versus worker's union, Soviet Union versus union strike, and OPEC⁷ quota versus ICO⁸ quota. Coincidentally, using WordNet 1.6, we found that two of the distinct meanings for union are "of trade unions" and "a political unit formed from previously independent people or organizations; the Soviet Union".

⁷ Organization of Petroleum Exporting Countries
⁸ International Coffee Organization

4.2 Retail data

The retail data was obtained courtesy of Fingerhut Corp. Table 5 illustrates some of the indirect associations discovered from this dataset. As expected, most of them correspond to pairs of competing items. This provides a great opportunity to do competitive analysis as described in Section 1. A second type of analysis is to discover surprising patterns. The intuition here is that items that belong to the same item category are expected to have similar types of association with other items [SON98].


Table 4. Indirect association between stemmed word pairs for Reuters-21578 Commodity dataset.

#   a         b        y                    d(a, y)   d(b, y)   IS rank
1   corpor    farmer   agricultur program   0.4946    0.4651    1
                       agricultur dlr       0.5160    0.4901
2   soviet    strike   union                0.6248    0.4149    9
3   soviet    worker   union                0.6248    0.3615    17
4   opec      ico      quota                0.4112    0.5539    10
5   opec      coffee   quota                0.4112    0.4515    11
6   iran      tax      oil                  0.3225    0.3312    24
7   germani   texa     west                 0.5173    0.3197    26

Table 5. Indirect association for retail data.

#   a                                 b                              Y_i                                 d(a, Y_i)   d(b, Y_i)
1   Comforter Queen (Madrid)          Comforter King (Madrid)        Drapes (Madrid)                     0.5634      0.4698
                                                                     Valance (Madrid)                    0.5771      0.4708
                                                                     Pillow (Madrid)                     0.5840      0.4446
2   Comforter Twin (Checkered Flag)   Sheets Twin (Nascar Drivers)   Border Wallpaper (Nascar Drivers)   0.3322      0.2517
3   Comforter Twin (Checkered Flag)   Curtains (Nascar Drivers)      Border Wallpaper (Nascar Drivers)   0.3322      0.2481
4   Playstn W/Crash2                  Playstn Controller             Playstn memory card                 0.2784      0.4448


The first itempair relates two products of different sizes (King versus Queen size). The mediator set contains related products from the same manufacturer, but without any size distinction. Itempairs such as this are often uninteresting because they are not actionable nor do they reveal any surprising knowledge.

The second itempair involves two products with different design logos (checkered flag versus Nascar drivers). Each logo has its own matching comforter, sheet, pillow case, etc. Hence, it is not surprising that the support value of this pair is low. However, unlike the previous example, the mediator set is made up of wallpapers that belong to one of the two product groups. Such a pattern is surprising because we do not expect checkered flag comforters and Nascar drivers wallpapers (the mediator) to have a large support value. Therefore, this pattern is potentially interesting to a data analyst. Upon closer examination, we found that their observed support is high because the product catalog does not offer any checkered flag wallpapers. As a result, most of the customers who buy checkered flag comforters end up buying Nascar drivers wallpapers.

The third itempair is also interesting for the same reason as before. Furthermore, this pattern is potentially actionable since the product catalog does not offer any checkered flag curtains. The marketing executive will be interested in knowing why customers who buy checkered flag products are not buying Nascar-related curtains. Is it due to a lack of consumer awareness or simply because Nascar-related curtains are not appealing to checkered-flag customers? If the former is true, this provides an opportunity for cross-selling. On the other hand, if the latter is the case, such information will be useful feedback to the manufacturer of Nascar products.

Summarizing, the above examples show that indirect association can provide meaningful insight into market basket data. Such knowledge cannot be derived from association rules alone because it requires analysis of higher order dependencies and is derived from infrequent itempairs.

4.3 S&P 500 stock market data

The dataset represents the daily fluctuation of share prices for S&P-500 stocks from Jan. 1994 to Oct. 1996. Each stock is represented by two attributes, X-up and X-down. The value of X-up (or X-down) is 1 if the closing price for stock X is significantly higher (lower), by at least 2%, than its previous closing price. Some of the indirect associations found for this dataset are shown in Table 6. Indirect associations can be used for event partitioning, where one can determine the set of events that are causing the price of a stock to move up or down. For example, the first indirect association relates the event IBM-up with YELL-up via LSI-up and MU-up. IBM (International Business Machines) is a company that provides various customer solutions in information technology while YELL (Yellow Corp) is involved in the transportation business. The mediator contains the stocks of two semiconductor companies, LSI Logic and Micron Technology (MU). This pattern indicates that events involving LSI-up and MU-up can be partitioned into three disjoint sets: one involving IBM-up, another associated with YELL-up, and a third set of events related to neither IBM-up nor YELL-up.
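The X-up and X-down encoding described above can be sketched as follows. This is a hypothetical preprocessing step, not the one used to prepare the dataset: it assumes the 2% change is measured relative to the previous closing price, and that a day on which a stock moves less than 2% produces neither event for that stock. The function name, tickers and prices are made up for illustration.

```python
def price_events(prices, threshold=0.02):
    """Turn per-stock daily closing prices into one set of events per day.

    `prices` maps a ticker to a list of closing prices, one per trading day.
    Day t yields "X-up" (or "X-down") when stock X closes at least `threshold`
    higher (or lower) than its previous close.
    """
    n_days = min(len(series) for series in prices.values())
    days = []
    for t in range(1, n_days):
        events = set()
        for ticker, series in prices.items():
            change = (series[t] - series[t - 1]) / series[t - 1]
            if change >= threshold:
                events.add(f"{ticker}-up")
            elif change <= -threshold:
                events.add(f"{ticker}-down")
        days.append(events)
    return days

# Hypothetical closing prices, for illustration only.
closes = {"ibm": [100.0, 103.0, 103.5, 100.0], "lsi": [50.0, 52.0, 50.5, 50.6]}
print(price_events(closes))
```

Each resulting set plays the role of one transaction, so the same itempair and mediator conditions can be evaluated over trading days.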


Table 6. Indirect association between itempairs for S&P 500 data.

#   a           b         Y_i         d(a, Y_i)   d(b, Y_i)
1   ibm-up      yell-up   lsi-up      0.3244      0.2200
                          mu-up       0.2507      0.2005
2   hwp-down    txn-up    gnt-up      0.2002      0.2290
3   amgn-down   gt-down   digi-down   0.2303      0.2313
                          s-down      0.2548      0.2826
4   oxy-down    ph-down   adsk-down   0.2740      0.2334
5   axp-down    nke-up    lsi-down    0.2197      0.2093

5 Conclusions

This paper proposes a framework for describing indirect association between items in a database of transactions. An itempair is said to be indirectly associated if the support of the pair is low, yet there exists a mediator set such that both items are highly dependent on the mediating elements. We showed that our measure of dependence is a good way to estimate the correlation between frequently occurring items. We have also provided a simple way to express the overall affinity between these itempairs. An algorithm for mining these patterns is also presented. Finally, we showed how this novel idea is applicable in various real-world applications.

For future research, there are several unresolved issues we need to address. First of all, it is possible to extend our work to discover indirectly associated itemsets rather than indirect association between a pair of items. In Section 4, we have suggested one such approach by combining the itempairs found into larger itemsets. We hope to explore other possible ways to do this. Further studies are also needed to define a more exact measure of indirect affinity between itemsets with multiple mediating elements. The discontinuous nature of the minimum and maximum functions makes it difficult to combine the itemsets in a natural way. Also, our current implementation assumes that the itemset lattice can be stored in the main memory. Therefore, scalability is a major concern. Threshold selection is another issue that needs further investigation.

6 Acknowledgement

We are grateful to Fingerhut Corp, in particular, Deb Campbell and Nadav Cassuto, for providing us with the retail dataset. We would also like to thank Sam Han for providing us with a pre-processed version of the text and stock market data, and George Karypis and Mahesh Joshi for many wonderful discussions.


References

[AIS93a] R. Agrawal, T. Imielinski, and A. Swami. Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5:914-925, 1993.

[AIS93b] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Intl. Conf. Management of Data, pages 207-216, Washington D.C., USA, 1993.

[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487-499, Santiago, Chile, 1994.

[BMS97] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proc. ACM SIGMOD Intl. Conf. Management of Data, pages 265-276, Tucson, AZ, 1997.

[BSVW99] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Proc. of the Fifth ACM SIGKDD Intl Conf on Knowledge Discovery and Data Mining, pages 254-260, San Diego, Calif, August 1999.

[CC99] Robert Cooley and Chris Clifton. Topcat: Data mining for topic identification in a text corpus. In Proceedings of the 3rd European Conference of Principles and Practice of Knowledge Discovery in Databases, 1999.

[DMR98] G. Das, H. Mannila, and P. Ronkainen. Similarity of attributes by external probes. In Proc. of the Fourth ACM SIGKDD Intl Conf on Knowledge Discovery and Data Mining, pages 23-29, New York, NY, 1998.

[FLYH99] L. Feng, H.J. Lu, J.X. Yu, and J. Han. Mining inter-transaction associations with templates. In Proc. 1999 Int. Conf. on Information and Knowledge Management (CIKM'99), pages 225-233, Kansas City, Missouri, Nov 1999.

[KMR+94] M. Klemettinen, H. Mannila, P. Ronkainen, T. Toivonen, and A. Verkamo. Finding interesting rules from large sets of discovered association rules. In Proc. 3rd Int. Conf. Information and Knowledge Management, pages 401-408, Gaithersburg, Maryland, Nov 1994.

[LL98] C.X. Ling and C. Li. Data mining for direct marketing: Problems and solutions. In Proc. of the Fourth ACM SIGKDD Intl Conf on Knowledge Discovery and Data Mining, pages 73-79, New York, NY, 1998.

[Mel96] D. Melamed. Automatic construction of clean broad-coverage translation lexicons. In 2nd Conference of the Association for Machine Translation in the Americas (AMTA 96), 1996.

[PCY95] J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD Record, 25(2):175-186, 1995.

[PSM99] G. Piatetsky-Shapiro and B. Masand. Estimating campaign benefits and modeling lift. In Proc. of the Fifth ACM SIGKDD Intl Conf on Knowledge Discovery and Data Mining, pages 185-193, San Diego, CA, 1999.

[Rey77] H.T. Reynolds. The Analysis of Cross-Classifications. Macmillan Publishing Co., New York, 1977.

[SON95] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of the 21st Int. Conf. on Very Large Databases (VLDB'95), Zurich, Switzerland, Sept 1995.

[SON98] A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. In Proc. of the 14th International Conference on Data Engineering, pages 494-502, Orlando, Florida, February 1998.


[TBSH93] Z. Tari, O. Bukhres, J. Stokes, and S. Hammoudi. The reengineering of relational databases based on key and data correlations. In S. Spaccapietra and F. Maryanski, editors, Searching for Semantics: Data Mining, Reverse Engineering, etc. Chapman and Hall, 1993.

[TLHF99] Anthony Tung, Hongjun Lu, Jiawei Han, and Ling Feng. Breaking the barrier of transactions: Mining inter-transaction association rules. In Proc. of the Fifth Int'l Conference on Knowledge Discovery in Databases and Data Mining, pages 297-301, San Diego, CA, August 1999.

[Wed92] G.E. Weddell. Reasoning about functional dependencies generalized for semantic data models. ACM Transactions on Database Systems, 17(1):32-64, March 1992.

[WH75] R. Winkler and W. Hays. Statistics: Probability, Inference and Decision. Holt, Rinehart & Winston, New York, second edition, 1975.