A data mining approach to discovering reliable sequential patterns

The Journal of Systems and Software 86 (2013) 2196– 2203

Journal homepage: www.elsevier.com/locate/jss

Huan-Jyh Shyur ∗, Chichang Jou 1, Keng Chang 2

Department of Information Management, Tamkang University, 151 Ying-Chuan Road, Tamsui, New Taipei City, Taiwan, ROC

Article info

Article history: Received 27 April 2012; received in revised form 26 March 2013; accepted 29 March 2013; available online 22 April 2013.

Keywords: Data mining; Sequential patterns; Inter-arrival time probability

Abstract

Sequential pattern mining is a data mining method for obtaining frequent sequential patterns in a sequential database. Conventional sequence data mining methods can be divided into two categories: Apriori-like methods and pattern-growth methods. In a sequential pattern, the probability of the time between two adjacent events could provide valuable information for decision-makers. As far as we know, no methodology has been developed to extract this probability in the sequential pattern mining process. We extend the PrefixSpan algorithm and propose a new sequential pattern mining approach: P-PrefixSpan. Besides the minimum support-count constraint, this approach imposes a minimum time-probability constraint, so that fewer but more reliable patterns are obtained. P-PrefixSpan is compared with PrefixSpan in terms of the number of patterns obtained and execution efficiency. Our experimental results show that P-PrefixSpan is an efficient and scalable method for sequential pattern mining.

1. Introduction

Sequential pattern mining, a well-studied and important issue in the data mining field, discovers frequent subsequences as patterns in a sequence database. Discovering sequential patterns is an important problem for many applications, and much effort has been devoted to developing efficient algorithms for searching frequent sequential patterns. The existing sequential pattern mining algorithms can be separated into two categories: Apriori-like (candidate-generation-and-test) approaches (Agrawal and Srikant, 1994, 1995; Srikant and Agrawal, 1996; Orlando et al., 2004; Toroslu, 2003) and pattern-growth approaches (Han et al., 2000a,b, 2004; Pei et al., 2001, 2004).

The PrefixSpan algorithm (Pei et al., 2001, 2004) is a well-known pattern-growth approach. It divides the database into smaller projected databases and solves them recursively. Since no candidate sequence needs to be generated, the database need not be scanned multiple times, which makes it faster than Apriori-like algorithms. However, it was found that when there are a large number of frequent subsequences, the traditional PrefixSpan algorithm runs slowly (Pei et al., 2004).
Most sequential pattern mining algorithms do not address the intervals between consecutive items, and mine conventional sequential patterns that include only the order of items (Chen et al., 2003). A sequential pattern including probabilities of time between two consecutive items can provide more valuable information than a conventional sequential pattern, and it is required in many real-world applications. For example, a physician would like to know the probability that symptom B will appear within 1 day, 2 days, 3 days, and so on once a patient has symptom A. While previous studies have considered several variations, sequential patterns with probability of time have not been revealed.

∗ Corresponding author. Tel.: +886 2 26215656x2881; fax: +886 2 26209737. E-mail addresses: [email protected] (H.-J. Shyur), [email protected] (C. Jou), [email protected] (K. Chang).
1 Tel.: +886 2 26215656x2645; fax: +886 2 26209737.
2 Tel.: +886 2 28226756.
0164-1212/$ – see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jss.2013.03.105

To remedy the above problems, a new approach is proposed in this paper. Extending the PrefixSpan algorithm, we developed a new sequential pattern mining approach, P-PrefixSpan. Compared with Apriori-like algorithms, PrefixSpan is an efficient method (Pei et al., 2001). The major difference between the two algorithms is that P-PrefixSpan must deal with the time stamp of each item in the data sequences, whereas PrefixSpan is concerned only with the order of items. The new algorithm can discover frequent sequential patterns together with the probability of the inter-arrival time of consecutive items. With the estimated probability of inter-arrival time, the algorithm can be employed to reevaluate the support of searched frequent patterns. A probability-based pattern evaluation metric is introduced. Given a set of transactions, the goal of sequential pattern mining is to find all patterns for which the probability of a predefined time-interval length is larger than the minimum probability threshold. To search for such a sequential pattern, the probability is checked whenever a frequent item is to be appended to a sequential pattern. For example, suppose frequent item β is to be appended to a sequential pattern 〈α〉, T represents the length of time between two consecutive items, and the minimum probability threshold is defined as P(T ≤ 10) = 30%. Let Tβ|〈α〉 express the time interval between the occurrence of β and the last item in 〈α〉. If P(Tβ|〈α〉 ≤ 10) > 30%, then β can be appended to 〈α〉 to form a new sequential pattern 〈α′〉; otherwise, β cannot be introduced to form a new pattern.

To estimate the probability of time, P-PrefixSpan requires more calculation time than PrefixSpan. From this aspect, the execution time for P-PrefixSpan should be several times that for PrefixSpan. However, with the additional probability-based pattern evaluation metric, the number of projected databases created in P-PrefixSpan is much smaller than that in PrefixSpan, and the searched patterns are more reliable.

The rest of the paper is organized as follows. In Section 2, we review the related studies. Section 3 describes our new algorithm. An experimental study is presented in Section 4. The results are then discussed in Section 5, ending with the conclusions.

2. Related works

Agrawal and Srikant (1995) first introduced the problem of mining sequential patterns. The problem is defined as how to find frequent subsequences. Today, many efficient methods have been developed to search for frequent subsequences in a set of data sequences, each of which consists of a series of transactions ordered by transaction time. Agrawal and Srikant proposed an algorithm named AprioriAll (Agrawal and Srikant, 1995) for obtaining frequent sequential patterns. With their approach, candidate frequent sequential patterns are obtained by joining shorter frequent sequential patterns. However, a huge set of candidate sequences will be generated and many repeated database scans will be necessary. An Apriori-like approach to sequential pattern mining therefore incurs a huge cost when the sequences are very long.

Following AprioriAll, Srikant and Agrawal (1996) proposed a new algorithm named GSP (Generalized Sequential Patterns), which uses a breadth-first search and bottom-up method to obtain the frequent subsequences. It also considers mining sequential patterns with timing constraints regarding the minimal time gap, maximal time gap, and sliding window size. The timing constraints are intended to model real customer behavior. Another advantage of including timing constraints is that patterns not satisfying them are filtered out, so the number of candidate sequential patterns can be reduced. Masseglia et al. (2009) considered handling time constraints in an earlier stage of the data mining process in order to provide better performance. GSP met more real-world requirements than AprioriAll. However, when the database or the number of possible items grows, the voluminous candidate sequential patterns and the increased number of database scans have a huge impact on performance. Moreover, the methodology cannot find a pattern whose interval between two consecutive items is not in the range, and the sequential patterns include only the temporal order of the items (Chen et al., 2003).

SPADE (Zaki, 2001) is an algorithm for fast discovery of sequential patterns using a depth-first search and bottom-up method. A vertical id-list database format is employed, and sequential pattern mining is performed by growing the subsequences one item at a time by Apriori candidate generation. The method utilizes combinatorial properties to decompose the original mining problem into small sub-problems. The sub-problems can be solved independently in main memory using efficient lattice search techniques and only simple join operations on id-lists. SPADE significantly reduces the number of database scans required. According to the algorithm, all sequences can be discovered in three database scans.

Han et al. proposed the FP-tree (Han et al., 2000a) to remedy the shortcomings of AprioriAll, especially for longer sequential patterns. The FP-tree data structure makes use of the prefix concept to rearrange the transaction items in a tree structure. However, when the number of distinct queuing orders in sub-sequences is huge, its sequence re-arranging procedure does not perform well. They then proposed FreeSpan (Han et al., 2000b) to tackle the problem. The idea is to generate projected sequential databases of frequent patterns recursively. Frequent patterns grow by concatenating frequent sub-sequences from these smaller projected databases. This algorithm obtains the complete set of frequent patterns. Since projected databases are handled separately, the number of candidate patterns generated is smaller than that obtained by the joining methodology, and the cost of database scanning is much lower.

Pei et al. (2001) later proposed PrefixSpan to deal with a problem of FreeSpan: the length of the original database sequences could not be shortened during the mining process. Their principle is to check the frequency of patterns from the prefix sub-sequences. For those prefixes that pass the frequency threshold, the suffix sub-sequences of each original sequence are inserted into their projected database. Each projected database then generates its frequent patterns recursively. The advantage of partitioning the search space into projected databases is that each projected database contains only the mining information required with respect to its prefix. As the frequent patterns grow, the projected databases shrink, making PrefixSpan perform better than FreeSpan for dense databases. PrefixSpan also outperforms FreeSpan in general. However, since the generated projected databases occupy a lot of memory, the memory may be insufficient to store them when the database or the number of items is huge. The algorithm CFR-PostfixSpan (Chen and Hu, 2006) improved PrefixSpan by adding timing requirements of recency and compactness. The added constraints filter out less important patterns and reduce the memory required to store projected databases.
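The prefix-projection idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the implementation of Pei et al.: every sequence is a list of single items (no itemsets, no timestamps), support is an absolute count, and the function name `prefixspan` is hypothetical.

```python
from collections import defaultdict


def prefixspan(db, min_support):
    """Return {pattern (tuple of items): support} for single-item sequences."""
    results = {}

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = defaultdict(int)
        for seq in projected:
            for item in set(seq):
                counts[item] += 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project: keep only what follows the first occurrence of `item`.
            suffixes = [seq[seq.index(item) + 1:] for seq in projected if item in seq]
            mine(pattern, suffixes)

    mine((), [list(seq) for seq in db])
    return results


# Toy database of three sequences; each character is one item.
patterns = prefixspan(["abc", "acb", "ab"], min_support=2)
print(patterns)
```

Each recursive call sees only the suffixes that follow the current prefix, so the projected databases shrink as the pattern grows, which is the property the paragraph above attributes to PrefixSpan.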

Although consecutive events in the sequential patterns obtained from previous data mining algorithms reveal their timing order, the time between events is not determined. To solve the problem, Chen et al. (2003) presented a new pattern, 'the time-interval sequential pattern', which includes not only the order of the events but also the time intervals between consecutive events. They divided the complete time domain into several fixed, non-overlapping time ranges. A time-interval sequential pattern has a form like (E1, I1, E2, I2, E3), meaning that event E1 happens first, followed by events E2 and E3 after time intervals of I1 and I2, respectively. Both I1 and I2 are predetermined time ranges. Two algorithms, I-Apriori and I-PrefixSpan, were developed to find frequent time-interval sequential patterns in a sequential database. Since the time interval between two consecutive events must belong to one of the predetermined time ranges, this approach may cause the sharp boundary problem. Chen and Huang (2005) used the concept of fuzzy sets to extend the original research so that fuzzy time-interval sequential patterns can be found in a database. Yun (2008) introduced the concept of weighted sequential patterns. Chang and Park (2012) argued that most of the algorithms are unable to obtain weighted sequential patterns that consider the different weights of sequences in a sequence database. They proposed several sequence weighting approaches to obtain the weight of a sequence in mining sequential patterns. In this research, the importance of a sequential pattern is measured by its time-interval probability.
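The sharp boundary problem mentioned above can be made concrete with a small sketch. The range boundaries below are hypothetical and chosen only for illustration; they are not taken from Chen et al. (2003).

```python
from bisect import bisect_right

# Hypothetical range boundaries in days: I0 = [0, 7), I1 = [7, 30), I2 = [30, inf).
BOUNDARIES = [0, 7, 30]


def interval_index(gap: float) -> int:
    """Map a non-negative time gap to the index of the fixed range containing it."""
    if gap < 0:
        raise ValueError("time gap must be non-negative")
    return bisect_right(BOUNDARIES, gap) - 1


# Two nearly identical gaps land in different ranges.
print(interval_index(6.9), interval_index(7.1))
```

A gap of 6.9 days and a gap of 7.1 days are assigned to different ranges even though they are almost equal, so they support different time-interval patterns. A probability over the inter-arrival time, as used in this paper, avoids committing to fixed ranges.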

In recent years, important research in sequential pattern mining has tried to tackle more realistic requirements. For example, Kim et al. (2007) studied the problem of mining sequential patterns with quantities. Many real-world applications have quantitative information recorded in the data. However, the information is neglected by most of the existing algorithms. Extending the existing algorithms, Kim et al. (2007) proposed two approaches, Apriori-QSP and PrefixSpan-QSP, to discover quantitative sequential patterns. Conventional sequential patterns can reveal the order of items, but the time between items is not determined. Peng and Liao (2009) tried to reveal more information from sequence databases across multiple domains.

[Fig. 1 flowchart: Start → Set ξ, ν, and td → 〈α〉 = null → Scan S|〈α〉 to generate frequent items → Estimate the inter-arrival time probability for each frequent item in a time period td → Append the frequent item with inter-arrival time probability greater than or equal to ν to 〈α〉 to form a new reliable pattern 〈α′〉 → Construct S|〈α′〉 → Call P-PrefixSpan(〈α′〉, |〈α′〉|, S|〈α′〉)]

3. Sequential pattern mining with probability of time

In this research, we build an efficient algorithm called P-PrefixSpan for sequential pattern mining. Between two consecutive events in a frequent sequential pattern, not only the time interval but also its probability is revealed in the mining process. The algorithm is developed by extending the well-known PrefixSpan algorithm and assuming that the time between two consecutive items or events follows an exponential distribution with a constant arrival rate.

3.1. Notations

In the paper of Pei et al. (2001), a sequence is defined as an ordered list of itemsets, denoted as 〈I1, I2, ..., In〉. Items in the same itemset are enclosed by the parentheses '(' and ')'. For example, 〈(bc),a,d,(aef),c〉 denotes the sequence of the following itemsets: {b,c}, {a}, {d}, {a,e,f}, and {c}. To simplify the presentation, the parentheses are omitted when there is only one item in an itemset. In this study, each item is attached to a transaction time, and the items in an itemset share the same transaction time. A sequence is represented as 〈[q1,t1], [q2,t2], ..., [qn,tn]〉, where qj is an item and tj stands for the transaction time at which qj occurs, 1 ≤ j ≤ n, and t1 ≤ t2 ≤ ··· ≤ tn. 〈α〉 = 〈α1, α2, ..., αm〉 denotes a sequential pattern. With the above representation, the following definitions are given.

Definition 1. For a sequence Q = 〈[q1, tq1], [q2, tq2], ..., [qn, tqn]〉 and a pattern P = 〈p1, p2, ..., pm〉, P is a time-relaxation prefix of Q if p1 = q1, p2 = q2, ..., pm = qm, where m ≤ n.

Definition 2. A sequence P = 〈[p1, tp1], [p2, tp2], ..., [pm, tpm]〉 is a time-relaxation subsequence of Q = 〈[q1, tq1], [q2, tq2], ..., [qn, tqn]〉 if there exist integers 1 ≤ j1 < j2 < ··· < jm ≤ n such that p1 = qj1, p2 = qj2, ..., pm = qjm. Q is then called a time-relaxation super sequence of P.

Definition 3. A time-relaxation subsequence P′ of sequence P is called a projection of P with respect to the time-relaxation prefix R if (1) P′ has time-relaxation prefix R and (2) there exists no proper time-relaxation super sequence P′′ of P′ such that P′′ is a time-relaxation subsequence of P and has time-relaxation prefix R. If one directly removes the time-relaxation prefix R from the projection P′, the new sequence is called the postfix of P with respect to time-relaxation prefix R. For example, in sequence P = 〈[a,5], [c,6], [d,10], [c,10], [e,12]〉, the time-relaxation subsequences P1 = 〈[ ,6], [d,10], [c,10], [e,12]〉 and P2 = 〈[ ,10], [e,12]〉 are postfixes of P with respect to time-relaxation prefix 〈c〉. Note that [ ,6] and [ ,10] mean that the transaction times of the time-relaxation prefix in P1 and P2 are 6 and 10, respectively.
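The postfix construction in Definition 3 can be illustrated with a short Python sketch. It is a minimal illustration under simplifying assumptions rather than the authors' implementation: the prefix is a single item, a sequence is a list of (item, time) pairs, and each postfix is returned as a pair (anchor time, remaining events), where the anchor time plays the role of the blank entry [ ,t]. The function name `postfixes` is hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (item, transaction time)


def postfixes(sequence: List[Event], item: str) -> List[Tuple[int, List[Event]]]:
    """Return one postfix per occurrence of `item` in `sequence`.

    Each postfix is (anchor_time, events after that occurrence); the
    anchor_time corresponds to the blank entry [ ,t] in the text.
    """
    result = []
    for position, (current_item, time) in enumerate(sequence):
        if current_item == item:
            result.append((time, sequence[position + 1:]))
    return result


# The example sequence from Definition 3, projected on prefix <c>.
P = [("a", 5), ("c", 6), ("d", 10), ("c", 10), ("e", 12)]
for anchor, rest in postfixes(P, "c"):
    print(anchor, rest)
```

The two results correspond to P1 and P2 in the example: one postfix anchored at time 6 and one anchored at time 10.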

Definition 4. Let Tmax denote the occurrence time of the last transaction in a sequence database S, and let Q = 〈[q1, tq1], [q2, tq2], ..., [qn, tqn]〉 be a sequence in S. {Q(t1,t2)} represents all the items that exist in Q between transaction times t1 and t2. Assume that P = 〈[p1, tp1], [p2, tp2], ..., [pm, tpm]〉 is a time-relaxation subsequence of Q, and β is an item which does not belong to {Q(tpm,tqn)}. From the current sequence Q, we do not know the exact transaction time of β. To analyze the probability of the time interval between consecutive items, the potential censoring time for item β with respect to P is defined as Tmax − tpm.

Fig. 1. The steps of P-PrefixSpan.

A sequential pattern 〈α〉 is called a frequent pattern if the percentage of transactions in S containing 〈α〉 is greater than or equal to the user-specified minimum support threshold ξ (Pei et al., 2001). In this paper, a frequent pattern 〈α〉 is called a reliable pattern if the estimated inter-arrival time probability of any two consecutive items in 〈α〉 within a predefined time period td is larger than the user-specified minimum probability threshold ν. The patterns searched for by the proposed algorithm must be both frequent and reliable. Note that if ν is set to 0, all frequent patterns are reliable patterns; in that case, P-PrefixSpan and PrefixSpan obtain the same results.

3.2. The P-PrefixSpan algorithm

The P-PrefixSpan algorithm is developed by modifying the well-known PrefixSpan algorithm. Given a sequence database S and a pattern 〈α〉, we use the projected database S|〈α〉 to denote the collection of postfixes in S with respect to 〈α〉. The length of pattern 〈α〉, denoted as |〈α〉|, is the number of items in 〈α〉. Fig. 1 depicts the steps of the algorithm, and the pseudo code for its implementation is given in Fig. 2. In the first step, pattern 〈α〉 is set to null and the frequent items are generated by scanning S|〈α〉 (= S). Each frequent item in S|〈α〉 is appended to the end of 〈α〉 to form a new sequential pattern 〈α′〉. Once 〈α′〉 is identified, S|〈α′〉 can be constructed. In PrefixSpan, the function PrefixSpan() is recursively called to identify all the frequent patterns. The most important difference is that P-PrefixSpan includes an additional step for identifying the inter-arrival time probability between a frequent item β in S|〈α〉 and the last item in 〈α〉. The pseudo code shown in Fig. 3 estimates the arrival rate of item β with respect to pattern 〈α〉. Once the arrival rate λ is known, one can easily determine the probability that item β will take place within a period of time td, given that 〈α〉 has been observed, by the following equation:

$$P(T_\beta \le t_d) = 1 - P(T_\beta > t_d) = 1 - e^{-\lambda t_d}, \qquad (1)$$

where Tβ is the random variable representing the length of the time interval. If the result is greater than or equal to the predefined minimum probability threshold ν, β is appended to 〈α〉 to form a new reliable pattern 〈α′〉, and the new projected database S|〈α′〉 can be constructed. Recursively, all the reliable patterns can be found.

    Input: S, ξ, td, ν
    Output: A complete set of reliable patterns
    Subroutine: P-PrefixSpan(〈α〉, |〈α〉|, S|〈α〉)
    Parameters:
        〈α〉: a time-relaxation prefix
        |〈α〉|: the length of 〈α〉
        S|〈α〉: the projected database of S with respect to 〈α〉
    Methods:
        Scan S|〈α〉 one time. Find all frequent items in S|〈α〉;
        If |〈α〉| = 0, then
        {
            For each frequent item β, append β to 〈α〉 as 〈α′〉;
        }
        If |〈α〉| > 0, then
        {
            For each frequent item β
            {
                λ = arrivalRate(〈α〉, S|〈α〉, β);
                IATP = 1 − e^(−λ·td);
                If IATP > ν then append β to 〈α〉 as 〈α′〉;
            }
        }
        For each 〈α′〉
        {
            Construct S|〈α′〉, the projected database of 〈α′〉;
            Call P-PrefixSpan(〈α′〉, |〈α′〉|, S|〈α′〉);
        }

Fig. 2. Pseudo code for the implementation of the P-PrefixSpan algorithm.

Table 1. A sequence database.

    ID   Sequence
    0    〈[b,2],[c,2],[a,4],[d,7],[a,8],[f,8],[c,15]〉
    1    〈[a,0],[d,5],[f,5],[b,12]〉
    2    〈[a,1],[b,1],[c,1],[f,4],[a,6],[c,6],[b,8],[c,9]〉
    3    〈[c,5],[a,7],[b,15],[d,18],[f,18]〉

Table 2. Projected database for 〈a〉.

    ID   Projected (postfix) database
    0    〈[ ,4],[d,7],[a,8],[f,8],[c,15]〉
         〈[ ,8],[f,8],[c,15]〉
    1    〈[ ,0],[d,5],[f,5],[b,12]〉
    2    〈[ ,1],[b,1],[c,1],[f,4],[a,6],[c,6],[b,8],[c,9]〉
         〈[ ,6],[c,6],[b,8],[c,9]〉
    3    〈[ ,7],[b,15],[d,18],[f,18]〉

The algorithm arrivalRate statistically models the occurrence of items in a frequent pattern by assuming that the time interval between two items follows an exponential distribution with a constant arrival rate. For the case of a time-relaxation prefix 〈α〉 = 〈α1, α2, ..., αm〉, given the condition that β is a frequent item in S|〈α〉, we use the random variable Tβ|〈α〉 to represent the inter-arrival time between 〈α〉 and β. Let S1 be the collection of sub-sequences in S|〈α〉 that contain item β, and S2 be the sub-sequences that do not contain item β. For each sub-sequence s ∈ S1, the item β may appear more than once. However, in this algorithm, we consider only the influence between pattern 〈α〉 and the first β to appear in s. To estimate the arrival rate, we define Tβ|〈α〉 as

$$T_{\beta|\langle\alpha\rangle} = t_\beta - t_{\alpha_m}, \qquad (2)$$

where tαm and tβ represent the recorded transaction times of αm (the last item in 〈α〉) and the first β in s, respectively. In S2, item β cannot be observed because of censoring. Therefore, it is important to determine Tmax, which is referred to as the sequential database censoring time. In this study, Tmax is the occurrence time of the last transaction in S. Let

$$T^{+}_{\beta|\langle\alpha\rangle} = T_{\max} - t_{\alpha_m}. \qquad (3)$$

For a sub-sequence in S2, T+β|〈α〉 is the censoring time of the censored item β. Maximum likelihood methods are recommended for fitting parametric regression models to censored data. The arrival rate of item β with respect to 〈α〉 can be estimated by the following formula (Elsayed, 1996):

$$\lambda_{\beta|\langle\alpha\rangle} = \frac{r}{\sum_{s \in S_1} T_{\beta|\langle\alpha\rangle} + \sum_{\bar{s} \in S_2} T^{+}_{\beta|\langle\alpha\rangle}}, \qquad (4)$$

where r is the total number of sub-sequences of S1. Then, the probability that item β will occur in time period t given that pattern 〈α〉 is observed can be estimated by Eq. (1). The mean inter-arrival time is 1/λβ|〈α〉.
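Eq. (4) has the form of the standard maximum likelihood estimator for exponentially distributed times with right censoring. The derivation is not spelled out in the text; the following sketch assumes independent observations, writes r for the number of sub-sequences in S1, and lets each observed sub-sequence contribute the exponential density while each censored one contributes the survival probability.

```latex
\begin{align*}
L(\lambda) &= \prod_{s \in S_1} \lambda\, e^{-\lambda T_{\beta|\langle\alpha\rangle}}
              \prod_{\bar{s} \in S_2} e^{-\lambda T^{+}_{\beta|\langle\alpha\rangle}}, \\
\ln L(\lambda) &= r \ln \lambda
              - \lambda \Big( \sum_{s \in S_1} T_{\beta|\langle\alpha\rangle}
              + \sum_{\bar{s} \in S_2} T^{+}_{\beta|\langle\alpha\rangle} \Big), \\
\frac{d \ln L}{d\lambda} &= \frac{r}{\lambda}
              - \Big( \sum_{s \in S_1} T_{\beta|\langle\alpha\rangle}
              + \sum_{\bar{s} \in S_2} T^{+}_{\beta|\langle\alpha\rangle} \Big) = 0
\;\Longrightarrow\;
\hat{\lambda}_{\beta|\langle\alpha\rangle}
  = \frac{r}{\sum_{s \in S_1} T_{\beta|\langle\alpha\rangle}
  + \sum_{\bar{s} \in S_2} T^{+}_{\beta|\langle\alpha\rangle}}.
\end{align*}
```

The estimator is the number of observed occurrences divided by the total time at risk, so censored sub-sequences lower the estimated rate without adding to the numerator.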

3.3. Example

Consider the sequence database S shown in Table 1, where ID denotes the identity of a sequence and Tmax = 18. Four sequences are observed. Suppose ξ (minimum support threshold) = 2, ν (minimum probability threshold) = 0.3, and td (expected time period) = 7. Then, the probability of the occurrence of a frequent item following pattern 〈α〉 within 7 days must be greater than or equal to 30%.

In the beginning, 〈α〉 is set to null. Scan the sequential database once. Five length-1 frequent items (a, b, c, d, and f) are found. Appending each frequent item to 〈α〉 yields five different 〈α′〉, and P-PrefixSpan() must be called again with new parameters for each 〈α′〉. Consider the case 〈α′〉 = 〈a〉. The projected database S|〈a〉 is constructed and is shown in Table 2.

By mining the projected database with the predefined ξ, the frequent items in S|〈a〉 can be found. They are a, b, c, d, and f. In the next step, for each frequent item, the probability of the inter-arrival time with respect to 〈a〉 is estimated. Take frequent item b as an example. Here S1 consists of the four postfixes that contain b, and S2 consists of the two postfixes of sequence 0, which do not. The arrival rate of b can be calculated by the following equation:

$$\lambda_{b|\langle a\rangle} = \frac{r}{\sum_{s \in S_1} T_{b|\langle a\rangle} + \sum_{\bar{s} \in S_2} T^{+}_{b|\langle a\rangle}} = \frac{4}{[(12-0)+(1-1)+(8-6)+(15-7)] + [(18-4)+(18-8)]} = \frac{4}{46} = 0.087.$$

Then the probability of the inter-arrival time being less than or equal to td = 7 can be approximated as

$$P(T_b \le t_d) = 1 - P(T_b > 7) = 1 - e^{-0.087 \times 7} = 0.456 > \nu = 0.3.$$

Since ν < 0.456, frequent item b can be appended to 〈a〉 to form a new reliable pattern 〈α′〉 = 〈a,b〉 of length 2. In the same manner, the arrival rate and the approximated probability can be calculated for the other frequent items in the above projected database; the results are shown in Table 3. Given time-relaxation prefix 〈a〉, three different reliable patterns 〈a,b〉, 〈a,c〉, and 〈a,f〉 are derived. Recursively finding the reliable patterns in S|〈α〉 finally yields all the reliable patterns in S.

Table 3. Arrival rate and inter-arrival time probability.

    Pattern   Arrival rate   Inter-arrival time probability   Greater than or equal to ν
    〈a,a〉     0.033          0.208                            No
    〈a,b〉     0.087          0.456                            Yes
    〈a,c〉     0.072          0.399                            Yes
    〈a,d〉     0.045          0.273                            No
    〈a,f〉     0.142          0.632                            Yes

The application of the inter-arrival time probability in real life can be explained by the following scenario. Suppose that, while promoting item a, a retailer would like to decide which item to promote next within 7 days. By using P-PrefixSpan, it can be found from Table 3 that the probability of purchasing a or d is lower than 30%, while that of f is greater than 50%. Therefore, item f would be a better choice for the next marketing campaign.

    Input: 〈α〉, S|〈α〉, β
    Output: λ, the arrival rate of item β occurrence after the last item in 〈α〉
    Subroutine: arrivalRate(〈α〉, S|〈α〉, β)
    Parameters:
        〈α〉: a time-relaxation prefix
        S|〈α〉: the projected database of S with respect to 〈α〉
        β: frequent item
    Methods:
        r = 0;
        Σ1 = 0;
        Σ2 = 0;
        For each postfix in S|〈α〉
        {
            tl = the transaction time of the last item in 〈α〉 for the current postfix;
            If β ∉ items in current postfix then
            {
                Σ2 = Σ2 + (Tmax − tl);
            }
            Else
            {
                r = r + 1;
                t = the transaction time of the first β in the current postfix;
                Σ1 = Σ1 + (t − tl);
            }
        }
        λ = r/(Σ1 + Σ2);
        return λ;

Fig. 3. Pseudo code for the implementation of estimating arrival rate.
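The arrival-rate estimate of Eq. (4) and the probability of Eq. (1) can be checked against the worked example for item b with a short Python sketch. It is an illustration under stated assumptions, not the authors' C++ implementation: the projected database is transcribed by hand from Table 2, each postfix carries its own anchor time (the time of the last prefix item), and the first occurrence of the item is used even when it shares the anchor's transaction time, as the term (1 − 1) in the worked example does. The names `arrival_rate` and `probability_within` are hypothetical.

```python
import math
from typing import List, Tuple

Event = Tuple[str, int]            # (item, transaction time)
Postfix = Tuple[int, List[Event]]  # (time of the last prefix item, later events)

T_MAX = 18  # occurrence time of the last transaction in Table 1

# Projected database for <a>, transcribed from Table 2.
PROJECTED_A: List[Postfix] = [
    (4, [("d", 7), ("a", 8), ("f", 8), ("c", 15)]),
    (8, [("f", 8), ("c", 15)]),
    (0, [("d", 5), ("f", 5), ("b", 12)]),
    (1, [("b", 1), ("c", 1), ("f", 4), ("a", 6), ("c", 6), ("b", 8), ("c", 9)]),
    (6, [("c", 6), ("b", 8), ("c", 9)]),
    (7, [("b", 15), ("d", 18), ("f", 18)]),
]


def arrival_rate(projected: List[Postfix], item: str, t_max: int) -> float:
    """Censored-data estimate of the arrival rate, following Eq. (4)."""
    observed = 0    # r: postfixes in which `item` occurs
    total_time = 0  # observed gaps plus censoring times
    for anchor, events in projected:
        first = next((t for it, t in events if it == item), None)
        if first is None:
            total_time += t_max - anchor  # censored postfix, Eq. (3)
        else:
            observed += 1
            total_time += first - anchor  # observed gap, Eq. (2)
    return observed / total_time


def probability_within(rate: float, td: float) -> float:
    """P(T <= td) for an exponential inter-arrival time, Eq. (1)."""
    return 1.0 - math.exp(-rate * td)


rate_b = arrival_rate(PROJECTED_A, "b", T_MAX)
print(round(rate_b, 3), round(probability_within(rate_b, 7), 3))  # 0.087 0.456
```

For item b this gives 4/46 and a probability of about 0.456, matching the worked example. Other entries of Table 3 may depend on how items that share a transaction time with the prefix item are treated, which the text does not specify, so the sketch should not be read as reproducing the whole table. The function also assumes the total time is positive; a guard would be needed if every gap were zero.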

4. Experimental results

To evaluate the performance of our proposed method, PrefixSpan and P-PrefixSpan were implemented in C++ and tested on a computer with an AMD Athlon™ 64 3800 2.01 GHz CPU, 1 GB memory, and the MS Windows XP operating system.

4.1. Behavior of P-PrefixSpan algorithm

To compare the patterns discovered by PrefixSpan and P-PrefixSpan, the NorthWind database provided in MS SQL Server2000 was used in the first experiment. There are 830 transactions
for 91 customers in the dataset. Since two customers have no ordersat all, there are totally 89 transaction sequences. The minimumsupport threshold � is set to be 0.3% and the minimum probabil-ity threshold is set to be P(inter - arrival time ≤ 60) > 10%. Table 4

H.-J. Shyur et al. / The Journal of Systems and Software 86 (2013) 2196– 2203 2201

Table 4Comparison of patterns discovered by PrefixSpan and P-PrefixSpan.

Number of patterns Pattern length Total

1 2 3 4 5

PrefixSpan 77 2342 2160 157 2 4738P-PrefixSpan (td = 60) 77 298 547 55 0 977

sboatorlbb6mpw

�amoTbb“apaaot

4

dPscEEiattrs

TI

3.16 times the number of frequent patterns obtained when � is setto be 1.5%; and when � is set to be 1.0%, PrefixSpan will obtain 5.39times the number of frequent patterns obtained when the � is set as0.5%. Under the same setting, P-PrefixSpan obtains only 1.60 times

P-PrefixSpan (td = 100) 77 1236

Reduction rate (td = 60) 100% 12.7%Reduction rate (td = 100) 100% 52.8%

ummarizes the total number of patterns discovered by applyingoth algorithms. Except those patterns with length 1, the numberf discovered patterns will be markedly reduced when the newlgorithm is applied. With the minimum probability threshold inhe new algorithm, the total number of patterns is reduced to 20.6%f that discovered by PrefixSpan. P-PrefixSpan may provide moreeliable patterns for decision-makers. For example, in this case, aength-2 pattern 〈Chai, Chang〉 was derived by PrefixSpan, but noty P-PrefixSpan. This is because the probability that a customerought item “Chai” followed by a further purchase of “Chang” in0 days is only around 8.5%. The value is less than the predefinedinimum probability threshold. Decision-makers may ignore these

atterns since the probabilities of related events that will occurithin a concerned time period are very low.

The second comparison is executed using the same dataset and, but the minimum probability threshold is reset to be P(inter -rrival time ≤ 100) > 10%. Table 4 shows the results. As can be seen,ore reliable patterns are derived when compared with those

btained with the original minimum probability threshold setting.he total number of patterns is reduced to 70% of that discoveredy PrefixSpan. In this case, the pattern 〈Chai, Chang〉 is derived byoth algorithms, since the probability that a customer bought itemChai” followed by a further purchase of “Chang” in 100 days isround 13.8%, which is larger than 10%. The value of minimumrobability threshold � will also affect our mining results. Setting

higher value of � will filter out more unlikely patterns. The reli-ble patterns provided not only can yield knowledge of the orderf items but also of the inter-arrival time probability between anywo consequent items.

.2. Performance evaluation

For mining interesting sequential patterns, we use 10 syntheticatasets to compare the performance between P-PrefixSpan andrefixSpan. These datasets were generated using the syntheticequential data generator developed by IBM Almaden researchenter. Table 5 exhibits the input parameters for the generator.ach generated sequence contains an ordered list of transactions.ach transaction contains a collection of items. The number oftems in all the datasets is set to be 1000. The transaction datare extended such that the transactions are assigned to different
ime values. An exponential random number generator generateshe time intervals between consequent transactions. The arrivalate is set between 0.1 and 0.9, which is randomly assigned to eachequence.
able 5nput parameters for sequential data generator (IBM).

Parameter Description

D Number of data sequence (unit: K)C Average number of transactions per sequenceT Average number of items per transactionS Average length of maximal sequencesI Average size of itemsets in maximal sequencesN Number of items

1851 150 2 331625.3% 35% 0% 20.6%85.7% 95.5% 100% 70.0%

We performed three experiments. The first one is to test theeffects of expected time period td on the execution time, the num-ber of reliable patterns obtained, and the maximal length of reliablepatterns. The second one is to test the effects of minimum supportson the execution time and the number of reliable patterns obtained.The third one is to test the effects of transaction database size onthe execution time. Note that the second and third experimentswill be performed for both PrefixSpan and P-PrefixSpan.

Experiment 1: The experiment is designed using datasetD10C10T2.5S4I1.25. During the test, � (minimum support thresh-old) is set to be 0.3% and � (minimum probability threshold) is setto be 20%. Fig. 4 summarizes the results. We plot both the numberof generated reliable patterns and the maximal length of reliablepatterns observed as we vary the expected time period td. The exe-cution times for all cases are around 46 s. The number of reliablepatterns grows with the increase in expected time period. The max-imum length of the projected databases remains at 8 for expectedtime period greater than or equal to 14. Thus, when the minimumprobability threshold is fixed, the probability that a frequent pat-tern is also a frequent time probability pattern grows with theinterval. In most algorithms, the execution time grows with thenumber of frequent patterns found. However, the execution timeof P-PrefixSpan remains stable.

Experiment 2: We ran both PrefixSpan and P-PrefixSpan with the dataset D30C10T5S4I1.25. During the test, the expected time period is set to 7 and the minimum probability threshold is set to 20%. This test varies the minimum support threshold from 0.5% to 2.5%; Figs. 5 and 6 show the resulting execution times and numbers of patterns found, respectively. The results exhibit the difficulties in selecting the minimum support threshold. When it falls below a certain value, the number of patterns obtained increases dramatically, and the number of projected databases increases in the same fashion. A lot of memory is consumed correspondingly, and the execution time also becomes significantly longer. As seen in Fig. 5, when the minimum support threshold is less than 1.5%, the execution time of PrefixSpan increases dramatically. On the other hand, as seen in Fig. 6, when the minimum support threshold is set to 1.0%, PrefixSpan will obtain

Fig. 4. Number and maximal reliable patterns for dataset D10C10T2.5S4I1.25.


Fig. 7. Performance of P-PrefixSpan with minimum probability threshold varied.

Fig. 5. Comparison of execution time for dataset D30C10T5S4I1.25.

and 4.28 times, respectively. According to the experimental results, P-PrefixSpan performs faster and derives fewer patterns than PrefixSpan. The reason is that although P-PrefixSpan needs to identify the inter-arrival time probability for each frequent item, it reduces the number of candidate patterns and the number of projected databases in the mining process. The added constraints filter out less important patterns and reduce the memory space required to store projected databases. Hence, P-PrefixSpan can outperform PrefixSpan. Moreover, the reliable patterns derived by P-PrefixSpan provide additional information about the time probability of sequential events.

Experiment 3: Fig. 7 shows the execution times and the number of patterns found for the P-PrefixSpan algorithm on dataset D30C10T5S4I1.25 as the minimum probability threshold is decreased from 20% to 0%. When the threshold is set to 0, all frequent patterns are reliable patterns. In that case, P-PrefixSpan and PrefixSpan obtain the same results. During the experiment, the expected time period is set to 7 and the minimum support threshold is set to 1%. The execution time of P-PrefixSpan falls as the minimum probability threshold increases, because a larger threshold leaves fewer candidates: a higher value filters out more unlikely patterns. However, the margin decreases as the threshold increases. As we raise the threshold, the number of candidates and the execution time decrease at first, which shows that the threshold has an immediate impact. Beyond that point there are diminishing returns, and eventually increasing the threshold no longer decreases the execution time.

Experiment 4: The DxC10T2.5S4I1.25 datasets are employed to test scalability with the number of sequences in the sequence database. The sizes of these datasets are 1000, 2000, 5000, 10,000, 25,000, 50,000, 100,000, and 300,000. We plot the execution time for a minimum support threshold of 0.2%, an expected time period of 7, and a minimum probability threshold of 20% in Fig. 8. In the performance test, both PrefixSpan and P-PrefixSpan show linear scalability with the number of sequences from 1000 to 300,000. This is because when the number of sequences increases, the number of projected databases increases and the size of each projected database also increases, so that performance degrades for both algorithms. P-PrefixSpan is a more complicated algorithm. Surprisingly, however, P-PrefixSpan outperforms PrefixSpan, and the performance gap widens as the number of sequences increases. The execution time of PrefixSpan is 2.5 times that of P-PrefixSpan when the number of sequences is 300,000. The reason is the same as before: P-PrefixSpan reduces the number of candidate patterns through the newly defined probability threshold in the mining process.

Fig. 6. Comparison of number of patterns found for dataset D30C10T5S4I1.25.

Fig. 8. Comparison of execution time for datasets DxC10T2.5S4I1.25.

5. Discussion

The original PrefixSpan algorithm discovers the sequential patterns by a divide-and-conquer strategy (Chen et al., 2003). Initially, pattern 〈α〉 is set to be null and the frequent items are generated by scanning S|〈α〉. For each frequent item in S|〈α〉, append it to the end of 〈α〉 to form a new sequential pattern 〈α′〉. Then S|〈α′〉 can be constructed. Recursively finding the sequential patterns in S|〈α′〉 finally yields all the sequential patterns in S. However, applying the PrefixSpan algorithm to discover the probability of the inter-arrival time of consecutive items in a sequential pattern is not straightforward. The most important difference is that the proposed algorithm considers the inter-arrival time probability between a frequent item β in S|〈α〉 and the last item in 〈α〉, whereas the original algorithm includes no such consideration.
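The recursion described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions: every transaction holds a single item, projected databases are kept as (sequence index, offset) pairs, and the function name and data layout are chosen here for illustration and are not taken from the original PrefixSpan implementation.

```python
from collections import defaultdict


def prefixspan(sequences, min_count):
    """Return {pattern: support count} for sequences of single items."""
    patterns = {}

    def mine(prefix, projected):
        # projected lists (sequence index, start offset): the suffix of each
        # sequence that follows the first occurrence of the current prefix.
        counts = defaultdict(int)
        for seq_idx, start in projected:
            # Count each item once per sequence.
            for item in set(sequences[seq_idx][start:]):
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            new_prefix = prefix + (item,)
            patterns[new_prefix] = count
            # Build the projected database for the extended prefix.
            new_projected = []
            for seq_idx, start in projected:
                try:
                    pos = sequences[seq_idx].index(item, start)
                except ValueError:
                    continue
                new_projected.append((seq_idx, pos + 1))
            mine(new_prefix, new_projected)

    # Start with the empty prefix, whose projection is the whole database.
    mine((), [(i, 0) for i in range(len(sequences))])
    return patterns
```

Calling prefixspan([list("abc"), list("acb"), list("ab")], 2) yields the single items a, b and c together with the two-item patterns (a, b) and (a, c). The proposed algorithm would add one more test before each recursive call, namely the inter-arrival time probability between the appended item and the last item of the prefix.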

Compared with the PrefixSpan algorithm, P-PrefixSpan is a more complicated algorithm. An additional step estimates the arrival rate of frequent item β with respect to pattern 〈α〉 and the probability of the inter-arrival time. However, the experimental results show that P-PrefixSpan outperforms the PrefixSpan algorithm. The performance gap increases as the minimum support threshold decreases, because when the minimum support decreases, the number of frequent sequences increases, the number of projected databases increases, and the size of each projected database also increases, so that performance degrades for the PrefixSpan algorithm. P-PrefixSpan focuses on reliable patterns; that is, it reduces the number of candidate patterns and the number of projected databases in the mining process. The reason is that the minimum probability threshold introduces an extra constraint on the patterns. Setting the minimum probability threshold filters out unlikely patterns. In most cases, the new mining approach discovers a much smaller number of patterns than the PrefixSpan algorithm. As we increase the minimum probability threshold, the number of candidates and the execution time decrease. This affects all the steps of the mining process by decreasing the number of candidates and by reducing the sizes of the projected databases. As a result, the memory requirement of sequential pattern mining is also reduced.
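The additional filtering step can be sketched as follows. This is an illustration under assumptions, not the authors' estimator: it assumes exponentially distributed inter-arrival times, estimates the arrival rate as the number of observed gaps divided by their total length, and uses hypothetical function names.

```python
import math


def estimate_rate(gaps):
    """Estimate an arrival rate from observed gaps (assumes exponential gaps)."""
    total = sum(gaps)
    if not gaps or total <= 0:
        raise ValueError("need at least one positive gap")
    return len(gaps) / total


def within_period_probability(rate, period):
    """Probability that the next event arrives within the given period."""
    return 1.0 - math.exp(-rate * period)


def is_reliable(gaps, period, min_prob):
    """Keep a candidate extension only if it meets the probability threshold."""
    return within_period_probability(estimate_rate(gaps), period) >= min_prob
```

With gaps of 2, 4 and 6 time units, an expected time period of 7 and a threshold of 20%, the estimated rate is 0.25 and the candidate is kept. With gaps of 100, 200 and 300 the probability falls to about 3% and the candidate is pruned, so no projected database is built for it.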

The minimum probability threshold and the predefined time period affect the results of the mining process. How to determine both parameters needs further study. The average number of transactions per sequence and the average number of items per transaction may also affect the performance of the proposed algorithm; this needs to be examined in future experiments.

6. Conclusions

Most studies of sequential pattern mining concentrate on symbolic patterns, whereas numerical analysis usually belongs to the scope of trend analysis and forecasting (Han and Kamber, 2006). This paper proposes a new algorithm, P-PrefixSpan, for mining reliable sequential patterns, which addresses symbolic patterns and numerical analysis simultaneously. A reliable sequential pattern yields information not only about the order of frequent items but also about the time probability of arriving items. This information, which provides a more detailed description of the behavior of the derived patterns, is crucial to decision-makers. For example, one can identify “Customers who buy a brand C digital camera will have x% probability to buy a brand H laser printer within y month(s).” When decision-makers specify the time period of concern to them, the inter-arrival time probability for every two consecutive items in a reliable pattern can be estimated. In practice, reliable sequential patterns are useful for analyzing the purchase behavior of customers, the symptom behavior of disease, and many others.

According to the algorithm, users can specify the minimum support threshold and the minimum probability threshold to discover reliable sequential patterns. The minimum probability threshold reduces the number of candidate patterns, which makes the proposed algorithm faster than PrefixSpan in our experiments. Experimental results demonstrate that P-PrefixSpan is an efficient and scalable method for sequential pattern mining. In the future, the algorithm will be programmed in a relational DBMS with aggregate UDFs.

References

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proc. 1994 Int. Conf. Very Large Data Bases Conference, pp. 487–499.

Agrawal, R., Srikant, R., 1995. Mining sequential patterns. In: Proc. 1995 Int. Conf. Data Engineering (ICDE’95), pp. 3–14.

Chen, Y.L., Chiang, M.C., Ko, M.T., 2003. Discovering time-interval sequential patterns in sequence databases. Expert Systems with Applications 25, 343–354.

Chen, Y.L., Huang, C.K., 2005. Discovering fuzzy time-interval sequential patterns in sequence databases. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35 (5), 959–972.

Chen, Y.L., Hu, Y.H., 2006. Constraint-based sequential pattern mining: the consideration of recency and compactness. Decision Support Systems 42, 1203–1215.

Chang, J.H., Park, N.H., 2012. Comparative analysis of sequence weighting approaches for mining time-interval weighted sequential patterns. Expert Systems with Applications 39 (3), 3867–3873.

Elsayed, E.A., 1996. Reliability Engineering. Addison Wesley, New York.

Han, J., Kamber, M., 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco.

Han, J., Pei, J., Yin, Y., 2000a. Mining frequent patterns without candidate generation. In: Proc. ACM-SIGMOD Int. Conf. (SIGMOD’00), pp. 1–12.

Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., Hsu, M.C., 2000b. FreeSpan: frequent pattern-projected sequential pattern mining. In: Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD’00), pp. 355–359.

Han, J., Pei, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C., 2004. Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering 16 (11), 1424–1440.

IBM, Quest Data Mining Project, IBM Almaden Research Center. Available from: http://www.almaden.ibm.com/software/quest/Resources/index.shtml

Kim, C., Lim, J.H., Ng, R.T., Shim, K., 2007. SQUIRE: sequential pattern mining with quantities. Journal of Systems and Software 80, 1726–1745.

Masseglia, F., Poncelet, P., Teisseire, M., 2009. Efficient mining of sequential patterns with time constraints: reducing the combinations. Expert Systems with Applications 36 (2), 2677–2690.

Orlando, S., Perego, R., Silvestri, C., 2004. A new algorithm for gap constrained sequence mining. Symposium on Applied Computing (SAC’04), 540–547.

Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C., 2001. Prefixspan: mining sequential patterns efficiently by prefix-projected pattern growth. International Conference on Knowledge Discovery in Databases and Data Mining, 215–224.

Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C., 2004. Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Transactions on Knowledge and Data Engineering 16, 1424–1440.

Peng, W.C., Liao, Z.X., 2009. Mining sequential patterns across multiple sequence databases. Data and Knowledge Engineering 68, 1014–1033.

Srikant, R., Agrawal, R., 1996. Mining sequential patterns: generalizations and performance improvements. In: Proc. of the 5th International Conference on Extending Database Technology, pp. 3–17.

Toroslu, I.H., 2003. Repetition support and mining cyclic patterns. Expert Systems with Applications 25, 303–311.

Yun, U., 2008. A new framework for detecting weighted sequential patterns in large sequence databases. Knowledge-Based Systems 21, 110–122.

Zaki, M.J., 2001. SPADE: an efficient algorithm for mining frequent sequences. Machine Learning 42, 31–60.

Huan-Jyh Shyur received his Ph.D. in Industrial and System Engineering from Rutgers University. He is a Professor in the Department of Information Management at Tamkang University, Taiwan. Before joining Tamkang University, he was a chief system engineer with the Aviation Safety R&D section of the FAA. His research interests include soft computing, data mining, aviation safety, reliability engineering, and decision theory.

Chichang Jou received the Ph.D. degree in Computer Science from SUNY, Stony Brook. Currently, he is an associate professor of Information Management, Tamkang University. His interests include data mining, distributed computing, and object oriented design.

Keng Chang received his Master degree in the Department of Information Management from Tamkang University. His research interests include data mining and web technology.
