Source: shodhganga.inflibnet.ac.in/bitstream/10603/39523/1/thesis

“Sequential Sequence Mining (SSM) Technique in Large Data base”

A Thesis submitted in partial fulfillment of the requirements for the award of the degree of Doctor of Philosophy in the Faculty of Engineering & Technology

Guide:
Dr. J. S. Shah, M.E, PhD
Principal, Class-I
Govt. Engg. College, Patan
At & Po: Katpur

Research Scholar:
Kiran R. Amin, B.E (Computer), M.E (Computer)
Asso. Prof. & Head (CE)
U.V. Patel College of Engg.
Ganpat Vidyanagar
Reg. No: EN/02/002/07

U. V. PATEL COLLEGE OF ENGINEERING
GANPAT UNIVERSITY
DECEMBER 2012


© Copyright 2012

By

Kiran R. Amin

All Rights Reserved


CERTIFICATE

This is to certify that the thesis entitled “Sequential Sequence Mining (SSM) Technique in Large Data base” submitted by Kirankumar Ramchandra Amin of U. V. Patel College of Engineering is the bona fide work completed under my supervision and guidance for the award of the Degree of Doctor of Philosophy in the Faculty of Engineering & Technology, Ganpat University, Ganpat Vidyanagar. The experimental work included in the thesis was carried out at the Department of Computer Engineering, U. V. Patel College of Engineering under my supervision, and the work is up to my satisfaction.

Research Guide
Prof. (Dr.) J. S. Shah
M.E, PhD

Forwarded Through:
Dr. N. D. Jotwani
PhD
Dean, Faculty of Engineering & Technology

Date :
Place : Ganpat Vidyanagar


CERTIFICATE

This is to certify that Mr. Kirankumar Ramchandra Amin is a research scholar doing his PhD under my supervision. He presented the findings of his research work in the pre-synopsis seminar held before the Doctoral Committee on 30th August 2012 at U. V. Patel College of Engineering, Ganpat Vidyanagar. He has incorporated all the modifications/suggestions made by the oral defense committee (Ref. No. F. No. 89/GNU/PhD/1176/2012, dated 28th September 2012) in the thesis entitled “Sequential Sequence Mining (SSM) Technique in Large Data base”.

Research Guide
Prof. (Dr.) J. S. Shah
M.E, PhD

Date :
Place : Ganpat Vidyanagar


THESIS APPROVAL SHEET

The PhD thesis entitled “Sequential Sequence Mining (SSM) Technique in Large Data base” by Mr. Kirankumar Ramchandra Amin has been approved for the award of the Degree of Doctor of Philosophy under the Faculty of Engineering & Technology, Ganpat University.

External Examiner(s)    Research Guide

Date :
Place : Ganpat Vidyanagar


TABLE OF CONTENTS

CHAPTER  PAGE
Declaration by Author  i
Acknowledgement  iv
Abstract  vii
List of Figures  viii
List of Tables  x
Abbreviation  xi
Chapter 1  Introduction  1
  1.1 Background  1
  1.2 Thesis organization  4
  1.3 Aim of the Research  5
Chapter 2  Related work  6
  2.1 Literature Survey and Critical Assessment  6
  2.2 Sequential Sequence Mining Techniques  7
    2.2.1 Apriori-based Techniques  7
    2.2.2 Tree-based Techniques  8
    2.2.3 Lattice-based Techniques  9
    2.2.4 Regular Expression based Techniques  10
    2.2.5 Prefix-based Techniques  11
    2.2.6 Closed Sequential Sequences Techniques  12
    2.2.7 Time interval Sequence Mining Techniques  12
  2.3 State-of-the-art techniques in Sequential Sequence Mining  13
  2.4 Categories of sequential sequence mining techniques  17
  2.5 Empirical Analysis of State-of-Art techniques  21
    2.5.1 Apriori Algorithm - Formal Description  21
      2.5.1.1 Support  22
      2.5.1.2 Formal Definition: Apriori property  22
      2.5.1.3 Algorithm: Apriori  22
    2.5.2 Algorithm - Apriori-gen  24
      2.5.2.1 The join procedure - Apriori-gen algorithm  24
      2.5.2.2 The prune procedure of the Apriori-gen algorithm  25
    2.5.3 DHP Algorithm  25
    2.5.4 Partitioning Algorithm - Formal Description  27
      2.5.4.1 Algorithm - Partition  28
        2.5.4.1.1 Phase I  28
        2.5.4.1.2 Merge Phase  28
        2.5.4.1.3 Phase II  28
    2.5.5 Sampling Algorithm  30
    2.5.6 DIC Algorithm  31
    2.5.7 Improved Apriori Algorithm  31
    2.5.8 AprioriAll - Formal Description  32
      2.5.8.1 Sort Phase  32
      2.5.8.2 Litemset Phase  32
      2.5.8.3 Transformation Phase  34
      2.5.8.4 Sequence Phase  34
      2.5.8.5 Maximal Phase  35
    2.5.9 Algorithm - AprioriAll  35
    2.5.10 AprioriSome Algorithm  36
      2.5.10.1 Algorithm - AprioriSome: Forward Phase  36
      2.5.10.2 AprioriSome: Backward Phase  36
    2.5.11 Relative performance - AprioriAll & AprioriSome  37
    2.5.12 DynamicSome - Formal Description  37
    2.5.13 Algorithm - DynamicSome  38
      2.5.13.1 Initialization Phase  38
      2.5.13.2 Forward Phase  38
      2.5.13.3 Intermediate Phase  39
    2.5.14 GSP  39
      2.5.14.1 Formal Description  39
      2.5.14.2 Join Phase  40
      2.5.14.3 Prune Phase  40
      2.5.14.4 Relative Performance  40
    2.5.15 FreeSpan  44
    2.5.16 SPADE  44
    2.5.17 PrefixSpan  48
    2.5.18 SPAM  51
    2.5.19 Allen's Algorithm  59
      2.5.19.1 Generalization of temporal events - Formal Description  60
      2.5.19.2 Algorithm: Generalization of temporal events  60
      2.5.19.3 Algorithm: Temporal interval relation rule discovery  61
Chapter 3  Motivation  63
Chapter 4  Scope of Work  65
Chapter 5  Proposed Algorithms  67
  5.1 Sequential Sequence Mining  67
    5.1.1 Support  69
    5.1.2 Super sequences and sub sequences  70
  5.3 Formal Notations & New Equations: MySSM  70
    5.3.1 Customer  70
    5.3.2 Item  70
    5.3.3 Transaction  70
    5.3.4 SequenceID  70
    5.3.5 Equation for time interval  71
    5.3.6 Equation for same time interval items  71
    5.3.7 Equation for support  71
  5.4 Algorithms of MySSM  72
    5.4.1 Algorithm 1 SYNTIM  72
    5.4.2 Algorithm 2 GCON  73
    5.4.3 Algorithm 3 FS & GSGT  74
    5.4.4 Algorithm 4 GAS  75
    5.4.5 Algorithm 5 CMEM  75
    5.4.6 Algorithm 6 OUTR  76
    5.4.7 Algorithm 7 MySSM  77
Chapter 6  Empirical Analysis & Comparative Results  82
Chapter 7  Conclusion & Future Scope  92
Bibliography  94
Own Publication List  101
My other research publications  102


GANPAT UNIVERSITY

DECLARATION BY THE AUTHOR OF THE THESIS

I, Kiran R. Amin, Reg. No: EN/02/002/07, a registered research scholar of the PhD programme in the Faculty of Engineering & Technology, Ganpat University, do hereby submit my thesis entitled “Sequential Sequence Mining (SSM) Technique in Large Data base” (herein referred to as my thesis) in printed as well as in electronic form for holding in the library of records of the University.

I hereby declare that:

1. The electronic version of my thesis submitted herewith on CD-ROM is in PDF format.

2. My thesis is my original work, the copyright of which vests in me, and my thesis does not infringe or violate the rights of anyone else.

3. The contents of the electronic version of my thesis submitted herewith are the same as those submitted as the final hard copy of my thesis after my viva-voce and adjudication of my thesis.

4. I agree to abide by the terms and conditions of the Ganpat University policy on intellectual property (hereafter "policy") currently in effect, as approved by the competent authority of the University.

5. I agree to allow the University to make available the abstract of my thesis to any user in both hard copy (printed) and electronic forms.

6. For the University's own non-commercial, academic use, I grant to the University the non-exclusive license to make limited copies of my thesis, in whole or in part, and to loan such copies at the University's discretion to academic persons and bodies approved from time to time by the University for non-commercial academic use. All usage under this clause will be governed by the relevant fair-use provisions in the policy and by the Indian Copyright Act in force at the time of submission of the thesis.

7. I agree to allow the University to place such copies of the electronic version of my format.

8. I agree to allow the University to place such copies of the electronic version of my thesis on the private intranet maintained by the University for its own academic use.

9. If, in the opinion of the University, my thesis contains patentable or copyrightable material and if the University decides to proceed with the process of securing copyrights and/or patents, I expressly authorize the University to do so. I also undertake not to disclose any of the patentable intellectual properties before being permitted by the University to do so, or for a period of one year from the date of the final thesis examination, whichever is earlier.

10. In accordance with the intellectual property policy of the University, I accept that any commercialized intellectual property contained in my thesis is the joint property of me, my co-workers, my supervisors and the Institute. I authorize the University to proceed with the protection of the intellectual property rights in accordance with prevailing laws. I agree to abide by the provisions of the University intellectual property rights policy to facilitate protection of the intellectual property contained in my thesis.

11. If I intend to file a patent based on my thesis when the University does not wish to do so, I shall notify my intention to the University. In such a case, my thesis should be marked as patentable intellectual property and access to my thesis restricted. No part of my thesis should be disclosed by the University to any person(s) without my written authorization for one year after my information to the University to protect the IP on my own, within 2 years after the date of submission of the thesis, or the period necessary for sealing the patent, whichever is earliest.

Name of Research Student: Kirankumar Ramchandra Amin, M.E (Computer)
Name of Guide: Prof. (Dr.) J. S. Shah, M.E, PhD

Signature of Research Student    Signature of Guide

Date : 26th December 2012
Place : Ganpat Vidyanagar


Acknowledgement

The humble accomplishment of this thesis would not have been possible without the contribution of many individuals, to whom I express my appreciation and gratitude.

Firstly, I am deeply grateful to my supervisor Dr. J. S. Shah, Professor & Principal of Government Engineering College, Katpur (Patan), who guided me every step of the way and was a source of inspiration. I am thankful to him for his constant guidance and support, and for sparing valuable time throughout the course of this thesis; with his help I overcame many difficulties and learned a lot. Despite his ill health, he reviewed my thesis progress, gave me valuable suggestions and made corrections. His unflinching courage and conviction will always inspire me, and I hope to continue to work with his noble thoughts.

I would like to thank Dr. L. N. Patel, Vice Chancellor of Ganpat University. I gratefully acknowledge his encouragement and personal attention, which provided a good and smooth basis for my Ph.D. tenure.

I am extremely indebted to Dr. N. D. Jotwani, Principal of U. V. Patel College of Engineering & Dean of the Faculty of Engineering & Technology, for providing me the required infrastructure. I am also thankful to him for his constant support and encouragement in carrying out my research work.

I am thankful to the Doctoral Committee Members, Dr. D.C. Jinwala, Professor, SVNIT, Surat and Dr. M.V. Joshi, Professor, DAIICT, Gandhinagar, for their helpful suggestions, valuable advice, constructive criticism and helpful comments during my pre-synopsis seminar.

I am also thankful to Dr. Ketan Kotecha, Director, NIT, Ahmedabad, Dr. Y. P. Kosta, Director, Marwadi Education Foundations, Rajkot and Dr. N. D. Jotwani for giving their useful comments during my pre-synopsis seminar.

I thank Shri Darshit Khambholja, Director, Bhavi Technolsoft, for providing me the necessary resources to accomplish my research work.

I would like to express my appreciation to the Registrar, Deputy Registrar and other staff members of Ganpat University and U. V. Patel College of Engineering for their unlimited support.

At this moment of accomplishment, I express my thanks to my well-wishers, my friends, colleagues and all those who contributed in many ways to the success of this study and made it an unforgettable experience for me.

Last but not least, I am greatly indebted for all the support of my wife and children, who have lost a lot due to my research work.

Kiran Amin


Dedicated To

My wife Dr. Falguni

&

My Children Dhvani & Nisarg


Abstract

Sequential sequence mining is an important task in data mining. It produces useful sequences that occur frequently in a database. These sequences are used to find users' purchasing behavior in retail industries, users' access sequences to web pages, and sequences that occur repeatedly and are responsible for a particular disease, etc. The current state-of-the-art methods have not succeeded in producing sequences for large databases with a Time Gap interval; they are found to be Memory- and Time-consuming. This motivated us to produce the sequences in large databases by reducing Memory and Time while including the Time Gap between successive items of transactions.

We have proposed a sequential sequence mining technique which produces the sequences for large databases while reducing a considerable amount of Memory and Time. Our algorithms outperform current state-of-the-art techniques in sequential sequence mining not only in Computing Time and Memory but also in scalability with respect to various parameters.

The thesis focuses on sequential sequence mining techniques in large databases.


LIST OF FIGURES

FIGURE  PAGE
2.1 Apriori Algorithm  23
2.2 Apriori-gen Algorithm  24
2.3 Apriori-gen Algorithm: Join Procedure  25
2.4 Apriori-gen Algorithm: Prune Procedure  25
2.5 Mining Frequent itemsets using Partition algorithm  28
2.6 Partition Algorithm  28
2.7 Algorithm AprioriAll  35
2.8 Algorithm AprioriSome  38
2.9 Algorithm DynamicSome  39
2.10 Relative Performance  42
2.11 Comparison - GSP, FreeSpan, SPADE  47
2.12 Comparison - FreeSpan, SPADE, PrefixSpan  51
2.13 Comparison - PrefixSpan with SPAM  54
2.14 No. of Customers vs Memory  55
2.15 No. of Transactions vs Memory  55
2.16 Memory: PrefixSpan vs SPAM  56
2.17 Support vs Memory  57
2.18 Allen's Algorithm  59
2.19 Generalization of events  60
2.20 Temporal interval relation rule discovery  61
5.1 Algorithm SYNTIM  72
5.2 Algorithm GCON  73
5.3 Algorithm FS & GSGT  74
5.4 Algorithm GAS  75
5.5 Algorithm CMEM  75
5.6 Algorithm OUTR  76
5.7 Algorithm MySSM  77
6.1 Number of Customers v/s Time (Milliseconds) for support = 0.4  84
6.2 Number of Customers v/s Memory (MB) for support = 0.4  84
6.3 Number of Customers v/s Time (Milliseconds) for support = 0.02  85
6.4 Number of Customers v/s Memory (MB) for support = 0.02  85
6.5 Number of Customers v/s Time (Milliseconds) for support = 0.3  86
6.6 Number of Customers v/s Memory (MB) for support = 0.3  87
6.7 Number of Customers v/s Time (Milliseconds)  88
6.8 Number of Customers v/s Memory (MB)  88
6.9 Support v/s Time in Milliseconds  89
6.10 Support v/s Memory in MB  90
6.11 Support v/s Time in Milliseconds  90
6.12 No. of different items v/s Total sequences  91
6.13 No. of different sequences for number of different items = 100  91
6.14 No. of different sequences for number of different items = 10  92
6.15 No. of different sequences for number of different items = 6  92

LIST OF TABLES

TABLE  PAGE
2.1 Sample Database  23
2.2 Mining Frequent itemsets using AprioriAll  33
2.3 Mapping of sequence  33
2.4 Transformed sequence  34
2.5 Data set Example  46
2.6 Vertical Data format  46
2.7 Vertical Data format  46
2.8 Sequence Database  49
2.9 Projected Database  50
2.10 S-Matrix  50
2.11 Data set  52
2.12 Vertical format  52
2.13 S-step process  53
2.14 I-step process  53
5.1 Data set 1  68
5.2 Data set 2  69
5.3 Sequence Generator Table  78
5.4 Sequence Generator Table with Time stamp  78
5.5 Sequence Generator Table  79
5.6 Table of time interval sequence for 'p'  81


ABBREVIATION

SPAM Sequential Pattern Mining

PREFIXSPAN Prefix-projected Sequential pattern mining

SPADE Sequential Pattern Discovery using Equivalent Class

SPIRIT Sequential pattern mining with regular expression constraints

BIDE Bi-Directional Extension

CloSpan Closed sequential patterns

FTAPs Frequent temporal association pattern

CTMSP-Mine Cluster-based Temporal Mobile Sequential Pattern Mine

CTMSPs Cluster-based Temporal Mobile Sequential Patterns

CO-Smart-CAST Cluster-Object-based Smart Cluster Affinity Search Technique

DIC Dynamic Itemset Counting

GSP Generalised Sequential Pattern

SID Sequence Id

CID Customer ID

I-APRIORI Improved Apriori

I-PREFIXSPAN Improved PrefixSpan

SYNTIM Synthetic Time Date

MySSM My Sequential Sequence Mining

GCON Get Configuration

FS Find Sequence 0 items

GSGT Generate Sequence Generator Table

GAS Generate All Sequences

CMEM Check Memory

OUTR Output Result


Chapter 1

Introduction

1.1 Background

Data mining extracts implicit, potentially useful knowledge from large amounts of data. It is also called knowledge mining, knowledge extraction, data/sequence/pattern analysis, data archaeology and data dredging from databases. In other words, data mining is the act of drilling through huge volumes of data to discover relationships or to answer queries too generalized for traditional query tools.

In general, data mining tasks can be classified into two categories:

Descriptive mining: the process of drawing out the essential characteristics or general properties of the data in the database. Clustering, association and sequential mining are descriptive mining techniques.


Predictive mining: the process of inferring sequences from data to make predictions. Classification, regression and deviation detection are predictive mining techniques.

Data mining techniques are useful in various areas, such as market basket analysis, decision support, fraud detection, business management, telecommunications, etc. Data mining draws on Database Technology, Machine Learning, Artificial Intelligence, Neural Networks, Statistics, Pattern Recognition, Knowledge-based Systems, Knowledge Acquisition, Information Retrieval, High-performance Computation and Data Visualization.

Many methods have been developed to extract such information. Sequential sequence mining is one of the most important techniques that facilitate decision making in various applications. The mining problem was first proposed by Agrawal and Srikant [10]. It discovers sequential sequences which occur frequently in a sequence database.

In medicine, time interval sequences of diseases can be found from medical records: diseases, treatments, durations of hospital stay, etc. are recorded in hospital databases. However, events such as suffering from and being cured of diseases, or the occurrence of symptoms, are interval-based. Conventional sequential sequence mining is not appropriate for discovering sequences in these events. On the other hand, time interval sequences are more useful to identify whether or not a patient suffers from a certain disease, and to predict the symptoms of a patient who has a certain disease.

In investment, whether a certain stock will rise or fall is one of the important things stock investors want to know. Further, owners are concerned about the stock trend of their own businesses. Stockholders and industry analysts also like to know the rise/fall of certain stocks, which is one of the useful pieces of information extracted from the time interval sequences of stock prices. Stock prices are recorded in every transaction, which serves as historical data, and we may find the time interval stock sequences from the stock interval event database.


In E-marketing, some Internet vendors provide new selling methods such as group buying offers. These occur when vendors want to sell products at lower prices once someone collects a crowd of people to buy the product. The duration from when an individual joins a group buying session for a certain product until the closing of the session is considered an interval-based event. Since many group buying customers may join buying sessions for a number of products concurrently or later, these interval-based events form a set of sequences, which may include some interesting time-oriented sequences. Discovering time-oriented sequences from group buying records will help understand the purchasing behaviors of customers and make effective marketing strategies.

Traditional association rule mining [10] works on transactional data. It considers the various items purchased in a single transaction of a particular customer; it does not account for the same customer purchasing items in different transactions. The concept of sequential sequence mining arrived to consider items purchased across different transactions: it covers the case of the same customer purchasing items in more than one transaction and at more than one time. However, the current state-of-the-art techniques have limitations in Memory and Time performance, which are the focus of our work.

Sequential sequence mining mines sequential sequences from a database with efficient support counting. It is used to find frequent subsequences that occur with a minimum support value. Unlike simple association rule mining, sequential sequence mining focuses on sequences of events that occur frequently in a given dataset. For example, a customer in an electronics retail shop purchases a Computer System and then, after some amount of time, purchases a Scanner; that is, the purchase of the Scanner is made after the purchase of the Computer System. The order of the items plays a major role, so we use an ordered dataset where all events are stored in a particular order. Traditional sequential sequence mining does not care about the timing between the purchases of items.
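The purchase example above can be sketched as an order-preserving subsequence check. This is a minimal illustration only; the item names are hypothetical, and this is not the thesis's MySSM algorithm:

```python
def is_subsequence(pattern, sequence):
    """Return True if `pattern` (a list of itemsets) occurs in `sequence`
    (a list of transactions) in order: each pattern itemset must be a
    subset of a transaction that comes after the previous match."""
    transactions = iter(sequence)  # shared iterator enforces the ordering
    return all(any(p <= t for t in transactions) for p in pattern)

history = [{"computer"}, {"mouse"}, {"scanner"}]
print(is_subsequence([{"computer"}, {"scanner"}], history))  # True
print(is_subsequence([{"scanner"}, {"computer"}], history))  # False: wrong order
```

The shared iterator is the key design choice: once an itemset is matched, later pattern itemsets can only match later transactions, which is exactly the ordering constraint described above.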


The goal of our research work is to develop and evaluate the new algorithms of MySSM, which efficiently produce sequential sequences in a large database with significant improvement in execution Time and Memory.

1.2 Thesis organization

We discussed the introductory part of our thesis in Chapter 1, where we also describe the organization of the thesis and the aim of our research work.

Chapter 2 focuses on the work related to our research. The first part of this chapter is a literature survey. In the second section, we discuss various sequential sequence mining techniques. The third section focuses on state-of-the-art techniques for sequential sequence mining; these techniques are compared with techniques in close proximity. The results of an empirical analysis of the state-of-the-art methods are discussed in the fourth section. This chapter helped us strengthen our technique by considering various parameters of the evaluation metrics in the area of sequential sequence mining.

Chapter 3 provides the motivation of our research work. It describes our inspiration to do research in sequential sequence mining: the deficiencies in the state-of-the-art methods motivated us to develop a new sequential sequence mining technique.

Chapter 4 describes the scope of work of our algorithm MySSM. The proposed algorithms are discussed in Chapter 5, which includes the steps of our algorithm MySSM. We have proposed seven algorithms, named SYNTIM, GCON, FS & GSGT, GAS, CMEM, OUTR and MySSM, which are all discussed in this chapter.

Chapter 6 serves to experimentally validate the claims of efficiency in terms of Time and Memory. In addition, we have empirically analyzed the technique on large databases with various parameters, such as various support values, number of items per transaction, number of transactions per customer and number of customers per database.

Chapter 7 summarizes the thesis and discusses the future scope of the work. This chapter is followed by the references used in our thesis.

1.3 Aim of the Research

The fundamental aim of our thesis is to study and develop a new sequential sequence mining technique that produces sequential sequences from large databases. It considers the time gap between successive items purchased by customers, and produces the sequential sequences within a reasonable amount of Time and Memory.


Chapter 2

Related work

Sequential sequence mining is one of the important techniques in data mining. From our literature review, spanning association rule mining through sequential sequence mining, we found that considerable effort has been devoted to discovering sequential sequences. To design a new algorithm for resolving these mining problems, we studied the well-known sequential sequence mining techniques. These techniques, with brief critiques, are presented here.

2.1 Literature Survey and Critical Assessment

We studied the important literature in the area of sequential sequence mining and various techniques related to our work. These techniques are elaborated here, with a brief critique of the gradual improvement across them.

The state-of-the-art sequential sequence mining algorithms are classified into different classes with respect to the following:


(1) The methods and data structures used for candidate sequence generation.
(2) The pruning techniques used to accelerate the mining process.
(3) The final output set that the algorithms target.

These classes yield seven different techniques, which are described in section 2.2, and the state-of-the-art techniques are discussed in section 2.3. Based on our literature survey, various state-of-the-art methods are covered in sections 2.2 to 2.4; the empirically tested results are compared in section 2.5.

2.2 Sequential Sequence Mining Techniques

A sequential sequence [7] is defined as follows. The data set is a set of sequences, called data-sequences. Each data-sequence is a list of transactions, and each transaction is a set of literals, called items or events. Typically, a transaction time is associated with each transaction. Sequential sequence mining finds all sequential sequences with a user-defined minimum support.

2.2.1 Apriori-based Techniques

The first and simplest family of sequential sequence mining algorithms is the Apriori-based family, whose main characteristic is the use of the Apriori principle [10]. The problem of sequential sequence mining was introduced along with three Apriori-based algorithms (AprioriAll, AprioriSome and DynamicSome) [7]. At each step k, a set Ck of candidate frequent sequences of size k is generated by performing a self-join on Lk−1; Lk then consists of all those sequences in Ck that satisfy the minimum support threshold. The efficiency of support counting was improved by using a hash-tree structure.
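The generate-and-test step just described can be sketched in Python. This is an illustrative toy (sequences restricted to one item per element, with names of our own choosing), not the original AprioriAll implementation:

```python
def self_join(L_prev):
    """Self-join on L(k-1): sequences s1 and s2 (tuples of single items)
    are joined when s1 minus its first item equals s2 minus its last item."""
    return {s1 + (s2[-1],) for s1 in L_prev for s2 in L_prev
            if s1[1:] == s2[:-1]}

def prune(candidates, L_prev):
    """Subset-infrequency pruning (Apriori principle): drop any candidate
    with a (k-1)-subsequence that is not frequent."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in L_prev for i in range(len(c)))}

L2 = {('a', 'b'), ('b', 'c'), ('a', 'c')}
C3 = prune(self_join(L2), L2)
print(C3)  # {('a', 'b', 'c')}
```

The prune step applies the Apriori principle before any support counting is done, so only candidates whose every subsequence is already known to be frequent reach the counting phase.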

A similar approach, GSP (Generalized Sequential Patterns), was developed [6]; it uses time constraints as well as window constraints and proved more efficient than its predecessors. Mannila et al. introduced the idea of mining frequent episodes [17], i.e. frequent sequential sequences in a single long input sequence. They used a sliding window to cut the input sequence into smaller segments and employed a mining algorithm similar to that of AprioriAll.

Discovering all frequent sequential sequences in large databases was a very challenging task, since the search space was large.

For a database with m attributes and frequent sequences of length k, there are O(m^k) potentially frequent sequences, and increasing the number of objects may lead to a high computational cost. Apriori-based algorithms utilize a bottom-up search that enumerates every single frequent sequence: to produce a frequent sequence of length l, all 2^l of its subsequences have to be generated. This exponential complexity restricts Apriori-based algorithms to discovering only short sequences, since they implement only subset-infrequency pruning, removing any candidate sequence that has a subsequence not belonging to the set of frequent sequences.

2.2.2 Tree-based Techniques

A faster and more efficient candidate generation can be attained by using a tree-like structure [18]. The traversal is made in a depth-first manner, applying both subset-infrequency and superset-frequency pruning to the candidate sequences. Initially this idea was introduced for mining frequent itemsets, but it was later extended to sequential sequences. Ayres employed an efficient approach in SPAM [3]. SPAM generates a sequence enumeration tree to produce all the candidate frequent sequences. Level k of the tree contains the complete set of sequences of size k (with each node representing one sequence) that occur in the database. The nodes of each level are generated from the nodes of the previous level using two types of extensions:

(1) Itemset extension (the last itemset in the sequence is extended by adding one more item to the set),

(2) Sequence extension (the sequence is extended by adding a new itemset at the end of the sequence).

The candidate sequences are specified by traversing the tree in depth-first search order. If a sequence is found to be infrequent, the subtree of the node representing that sequence is pruned. If a sequence is found to be frequent, then all its subsequences must be frequent, so the tree nodes representing those subsequences are skipped. For efficient support counting, the database is represented by a bitmap, which further improves performance over the lattice-based approaches [4] discussed in the next method.
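The bitmap idea can be illustrated with a small sketch. For simplicity we assume one bitmap per item per customer sequence (the actual SPAM layout packs all sequences into one vertical bitmap), so all names and data here are illustrative:

```python
def item_bitmap(seq, item):
    """Bit j is set when `item` appears in the j-th itemset of the sequence."""
    b = 0
    for j, itemset in enumerate(seq):
        if item in itemset:
            b |= 1 << j
    return b

def s_step(b, n):
    """SPAM-style sequence-extension transform: keep only positions strictly
    after the first occurrence (the lowest set bit), within n positions."""
    if b == 0:
        return 0
    lowest = b & -b
    return ~((lowest << 1) - 1) & ((1 << n) - 1)

# One customer sequence: <(a)(a,b)(b)>
seq = [{'a'}, {'a', 'b'}, {'b'}]
n = len(seq)
ba, bb = item_bitmap(seq, 'a'), item_bitmap(seq, 'b')
# Does the 2-sequence <a, b> occur? AND the transformed 'a' bitmap with 'b'.
occurs = (s_step(ba, n) & bb) != 0
print(occurs)  # True
```

An itemset extension would simply AND the two untransformed bitmaps; the S-step transform is what encodes "strictly later in time".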

2.2.3 Lattice-based Techniques

Another class of sequential sequence mining algorithms uses a lattice structure to enumerate the candidate sequences efficiently. A lattice is a "tree-like" structure in which each node may have more than one parent node. A node on the lattice, representing a sequence s, is connected to all the pairs of nodes on the previous level that can be joined to form s. For example, let s = {d, (bc), a}; then the following nodes are connected to s on the lattice: {(bc), a}, {d, b, a}, {d, (bc)}, {d, c, a}, since pairs of these subsequences can be joined to form s.

SPADE [4] used the above structure to specify the candidate sequences efficiently. The basic characteristics of SPADE were:

(1) A vertical representation of the database using id-lists, where each sequence is associated with a list of the database sequences in which it occurs.

(2) A lattice-based approach to decompose the original search space into smaller subspaces.

(3) Within each sub-lattice, two different search strategies (breadth-first and depth-first search) for finding the frequent sequences.
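Characteristic (1), the vertical id-list representation, can be sketched as follows. This is an illustrative simplification of SPADE's id-lists and temporal join, with invented sample data:

```python
from collections import defaultdict

# Horizontal database: sid -> list of (eid, itemset); eid acts as the timestamp.
db = {
    1: [(10, {'a'}), (20, {'b'}), (30, {'a', 'b'})],
    2: [(15, {'a'}), (25, {'b'})],
    3: [(12, {'b'})],
}

# Vertical representation: item -> id-list of (sid, eid) pairs.
idlists = defaultdict(list)
for sid, events in db.items():
    for eid, itemset in events:
        for item in itemset:
            idlists[item].append((sid, eid))

def temporal_join(l1, l2):
    """Id-list join for the 2-sequence <x, y>: y must occur after x
    within the same database sequence."""
    return [(sid2, eid2) for (sid1, eid1) in l1
            for (sid2, eid2) in l2
            if sid1 == sid2 and eid2 > eid1]

ab = temporal_join(idlists['a'], idlists['b'])
support = len({sid for sid, _ in ab})  # distinct sequences containing <a, b>
print(support)  # 2
```

The support of any k-sequence is obtained by joining the id-lists of its generating (k−1)-subsequences, so the horizontal database never needs to be rescanned.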


cSPADE, an extension of SPADE, was proposed in [4]; it allows a set of constraints to be placed on the mined sequences. These constraints are:

(1) Length and width constraints

(2) Gap and window constraints

(3) Item constraints

(4) Class constraints

GO-SPADE [19] is a similar algorithm proposed later, which introduced the idea of generalized occurrences. The motivation behind GO-SPADE is that in a sequence database certain items may appear consecutively. To reduce the cost of the mining process, GO-SPADE compacts all such consecutive occurrences by defining a generalized occurrence of a sequence p as a tuple (sid, [min, max]), where sid is the sequence id and [min, max] is the interval of the consecutive occurrences of the last event of p.

2.2.4 Regular Expression based Techniques

The vast majority of the earlier algorithms focused on discovering frequent sequential sequences based only on a support threshold, which limits the results to the most common ones. There is thus a lack of user-controlled focus in the sequence mining process, which can lead to a great volume of useless sequences. A solution to this problem was proposed in [20], where the mining process is restricted by both a support threshold and user-specified constraints modeled by regular expressions. Later, the series of SPIRIT [20] algorithms was introduced, in which a set of constraints C is pushed into the mining process along with the sequence database. The minimum support requirement and the additional user-specified constraints are therefore applied simultaneously, restricting the set of candidate sequences produced during the mining process. To accomplish this, two different types [20] of pruning techniques were used.

The first was constraint-based and the second was support-based. The first technique uses a relaxation C′ of C, ensuring that during each pass of candidate generation all candidate sequences satisfy C′. The second technique tries to ensure that all subsequences of a candidate sequence that satisfy C′ are present in the current set of discovered frequent sequences.

Another characteristic of the SPIRIT [20] algorithms relates to anti-monotonicity. Consider a given constraint C and a relaxation C′ of C; C′ is a weaker, less restrictive constraint. When C′ is anti-monotone, support-based pruning is maximized, since support information for every subsequence of a candidate sequence satisfying C′ can be used for pruning. If C′ is not anti-monotone, the efficiency of both support-based and constraint-based pruning depends on the choice of the relaxation C′.
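The constraint-pushing idea can be illustrated with a minimal sketch. SPIRIT actually pushes an automaton for a relaxation of the constraint into candidate generation; here, purely for illustration, we filter candidate sequences of single-character items with Python's re module:

```python
import re

# Illustrative user-specified constraint: sequences matching a* b c.
constraint = re.compile(r'a*bc')

def satisfies(seq):
    """Treat a candidate sequence of single-character items as a string
    and test it against the regular-expression constraint."""
    return constraint.fullmatch(''.join(seq)) is not None

candidates = [('a', 'b', 'c'), ('b', 'c'), ('a', 'c', 'b')]
kept = [s for s in candidates if satisfies(s)]
print(kept)  # [('a', 'b', 'c'), ('b', 'c')]
```

Filtering after candidate generation, as done here, shows only the effect of the constraint; SPIRIT's contribution is applying it during generation so that disallowed candidates are never produced.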

2.2.5 Prefix-based Techniques

Another family of sequential sequence mining algorithms is the prefix-based one [21]. In this method, the database is projected with respect to a frequent prefix sequence; based on the outcome of the projection, new frequent prefixes are identified and used for further projections, until the projected database becomes smaller than the support threshold.

The main steps of a prefix-based algorithm are the following:

(1) Scan the database for the frequent 1-sequences.

(2) For each frequent 1-sequence s found in the previous step, project the database with respect to s.

(3) Scan the projected database for locally frequent items.

(4) Add each new frequent item to the end of the prefix and project the database with respect to the new prefix.

(5) Repeat steps 3-4 for each new prefix, until the projected database is of size less than the support threshold.
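Steps 2-3 above can be sketched as follows, assuming for simplicity that each sequence element is a single item. This is a toy version of database projection, not the authors' implementation:

```python
def project(db, prefix_item):
    """Project each sequence onto the suffix that follows the first
    occurrence of prefix_item (single-item elements, for simplicity)."""
    projected = []
    for seq in db:
        if prefix_item in seq:
            projected.append(seq[seq.index(prefix_item) + 1:])
    return projected

def local_frequent(projected_db, min_sup):
    """Items frequent in the projected database (step 3)."""
    counts = {}
    for seq in projected_db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c >= min_sup}

db = [['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], ['a', 'b']]
p = project(db, 'a')         # [['b', 'c'], ['c'], ['b']]
print(local_frequent(p, 2))  # items that extend the prefix to <a,b> and <a,c>
```

Recursing on each new prefix (step 4) reproduces the overall prefix-growth scheme: the search space shrinks with every projection instead of growing with every candidate level.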


2.2.6 Closed Sequential Sequences Techniques

Rather than mining the complete set of frequent sequences together with all their subsequences, closed frequent sequence techniques were proposed, with algorithms by Zaki [4] and Pei [5]. Two of the most efficient algorithms for mining frequent closed sequences are BIDE [23] and CloSpan [48]. Both are based on the notion of the projected database and use special techniques to limit the number of frequent sequences, finally keeping only the closed ones.

CloSpan [48] uses the candidate maintenance-and-test approach: it first generates a set of closed sequence candidates, stored in a hash-indexed tree structure, and then prunes the search space using Common Prefix and Backward Sub-sequence pruning. The drawback of CloSpan is that it consumes much memory when there are many closed frequent sequences, since sequence closure checking leads to a vast search space; therefore it does not scale well with respect to the number of closed sequences. To overcome this limitation, BIDE employs a BIDirectional Extension paradigm for mining closed sequences: a forward directional extension grows the prefix sequences and checks their closure, while a backward directional extension checks the closure of a prefix sequence and prunes the search space. Overall, BIDE [23] shows high efficiency in terms of speed (an order of magnitude faster than CloSpan [48]) and scalability with respect to database size.
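The closure condition both algorithms compute can be stated in a few lines. This naive post-filter over an already-mined result (invented supports) is only for illustration; the whole point of CloSpan and BIDE is to avoid materializing all frequent sequences first:

```python
def is_subseq(s, t):
    """True when s is a (not necessarily contiguous) subsequence of t."""
    it = iter(t)
    return all(x in it for x in s)

def closed_sequences(freq):
    """Keep only closed sequences: those with no proper supersequence
    having exactly the same support. `freq` maps sequence -> support."""
    return {s: sup for s, sup in freq.items()
            if not any(s != t and sup == sup_t and is_subseq(s, t)
                       for t, sup_t in freq.items())}

freq = {('a',): 3, ('b',): 3, ('a', 'b'): 3, ('a', 'c'): 2, ('c',): 2}
print(closed_sequences(freq))
# {('a', 'b'): 3, ('a', 'c'): 2}
```

Here ('a',) and ('b',) are absorbed by ('a', 'b') with the same support, while ('a', 'c') survives because no supersequence matches its support, which is exactly the compaction the closed-sequence algorithms deliver.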

2.2.7 Time-interval Sequence Mining Techniques

Up to this point, events have been considered instantaneous. Several techniques exist for discovering intervals that occur frequently in a transactional database [24]; in most cases the intervals were not labelled and no relations between them were considered. Vill. [25] extended the sequential sequence techniques by also including the relations introduced previously. In time-interval sequential mining, the time between events is taken into account.


2.3 State-of-the-art techniques in Sequential Sequence Mining

Here we depict some of the existing and past research in the field of sequential sequence mining, followed by the innovation of our research. Chen [8] proposed a method for discovering time-interval sequential sequences in sequence databases. Dhany Saputra [1] proposed an improved version of PrefixSpan named i-PrefixSpan.

W. Li [28] proposed the novel concept of frequent time-interval association sequences, working with multiple gene sequences; their algorithm has several advantages over traditional methods. A set of genes may simultaneously show complex time-interval expression sequences recurrently across multiple microarray datasets. Such time-interval signals are hard to recognize in individual microarray datasets but become significant through their frequent occurrence across multiple datasets. They designed an efficient two-stage algorithm to identify FTAPs [28]: first, for each gene, they recognized expression trends that occur frequently across multiple datasets; second, they searched for sets of genes that simultaneously exhibit their respective trends recurrently in multiple datasets. They applied this algorithm to 18 yeast time-series microarray datasets. The majority of the FTAPs identified by the algorithm were associated with specific biological functions; moreover, a significant number of sequences included genes that are functionally related but do not exhibit co-expression. Their approach offers two advantages: (1) it can identify complex associations of time-interval trends in gene expression, an important step towards understanding the complex mechanisms governing cellular systems; and (2) it is capable of integrating time-series data with different time scales and intervals.

Tsai [26] proposed a sequential sequence method to explore consumer behavior in purchasing items. They concentrated on how to improve the accuracy and efficiency of their methods and discussed how to detect sequential sequence changes between two time periods. To help business managers understand the changing behavior of their customers, they proposed a three-phase sequential sequence change detection framework. In phase I [26], two sequential sequence sets are generated, one from each time-period database. In phase II, the dissimilarities between all pairs of sequential sequences are evaluated using the proposed sequential sequence matching algorithm; based on a set of judgment criteria, a sequential sequence is classified as one of three change types: an emerging sequential sequence, an unexpected sequence change, or an added sequential sequence. In phase III, significantly changed sequences are returned to managers if the degree of change for a sequence is large enough.

Mirko B. [27] proposed a method for recognizing customer segments and tracking their change over time. This is important for businesses operating in dynamic markets, whose customers look for new innovations and competing products and have highly changing demands and attitudes. They presented a system for customer segmentation which accounts for the dynamics of today's markets. Their approach [27] was based on the discovery of frequent itemsets and the analysis of their change over time, which finally resulted in a change-based notion of segment interestingness. The approach allowed them to detect arbitrary segments and analyze their temporal development; it was assumption-free, pro-active, and could be run continuously.

Fabian Moerchen [22] surveyed temporal pattern mining for both time-point-based and time-interval-based methods, distinguishing time-point-based from interval-based methods as well as univariate from multivariate methods. They presented the symbolic temporal data models and temporal operators used for pattern discovery in data mining research, dividing temporal data models into time point vs. time interval data, univariate vs. multivariate data, and numeric vs. symbolic data.


They categorized the time-point-based sequence methods into mining subsequences with suffix tries [29], mining sequential sequences [30], mining episodes [31], and mining partial orders [32].

J. Kang and H. Yong [33] proposed mining spatio-temporal patterns in trajectory data. The spatio-temporal sequences extracted from the historical trajectories of moving objects expose important knowledge about movement behavior for location-based services. Existing approaches transform trajectories into sequences of location symbols and derive frequent subsequences by applying conventional sequential pattern mining algorithms; however, inappropriate approximations of spatial and temporal properties cause a loss of spatio-temporal correlation. Kang and Yong addressed the problem of mining spatio-temporal [33] sequences from trajectory data, observing that an inefficient description of temporal information decreases both the mining efficiency and the interpretability of the sequences. They provided an efficient representation of spatio-temporal movements and proposed a new approach to discover spatio-temporal sequences in trajectory data: their method first finds spatio-temporal regions using prefix-projection methods and then extracts frequent spatio-temporal sequences.

With the advances in mobile communication [33] and positioning technology, large amounts of moving-object data are collected from various types of devices, such as GPS-equipped mobile phones or vehicles with navigational equipment. From these devices, the movements of objects are collected in the form of trajectories. Spatio-temporal sequences in trajectories, which represent the movements of objects, can provide useful information for high-quality Location-Based Services (LBS).

They addressed the problem of inefficient representation of spatio-temporal properties and proposed new algorithms for mining spatio-temporal sequences. First, they introduced two compact representations of object movements, which abstract the original trajectories into sequences of the regions that objects mostly visited. This spatio-temporal abstraction of the data improves the mining efficiency and the interpretability of the extracted sequences.

Yan H. [34] proposed a framework for mining sequential sequences from spatio-temporal event data sets.

In a large spatio-temporal database of events, where each event consists of fields such as event ID, time, location, and event type, mining spatio-temporal sequential sequences recognizes significant event-type sequences. Such spatio-temporal sequential sequences are critical for investigating spatial and temporal evolutions in many applications. Earlier research explored sequential sequences on transaction data and trajectory analysis on moving objects; however, these methods cannot be directly applied to mining sequential sequences from a large number of spatio-temporal events. Two major research challenges remained: (1) the definition of significance measures for spatio-temporal sequential sequences that avoid spurious ones, and (2) the algorithmic design under significance measures that do not guarantee the downward closure property. In this work [34], they proposed a sequence index as the significance measure for spatio-temporal sequential sequences, which is meaningful due to its interpretability in terms of spatial statistics. They proposed slicing-STS-Miner to tackle the algorithmic design challenge under the spatial sequence index, which does not preserve the downward closure property.

Damian F. and Zhang Chen [35] proposed sequential pattern mining of multimodal data streams in dyadic interactions. Finding sequential sequences in multimodal data is an important topic in various research fields, such as human-human communication, human-agent or human-robot interaction, and human development and learning. Using a multimodal human-robot interaction dataset, they showed that their ESM data mining algorithm is able to detect and validate various kinds of reliable temporal sequences from multi-streaming, multimodal data. They [35] proposed a sequential sequence mining method that analyzes multimodal data streams using a quantitative temporal approach, presenting a new temporal data mining method focused on extracting the exact timings and durations of sequential patterns from multiple temporal event streams, whereas other related algorithms can only find the sequential order of temporal events. They demonstrated their method [35] through its application to the detection and extraction of human sequential behavioral sequences over multiple multimodal data streams in human-robot interactions.

Eric Lu [36] proposed mining cluster-based temporal mobile sequential sequences in location-based service environments. Due to a wide range of potential applications, research on Location-Based Services (LBS) has been emerging in recent years. Earlier studies focused on discovering mobile sequences from the whole logs; however, such sequences may not be precise enough for prediction, since the differences in mobile behavior among users and temporal periods are not considered. They proposed an algorithm, Cluster-based Temporal Mobile Sequential Pattern Mine (CTMSP-Mine), which discovers Cluster-based Temporal Mobile Sequential Patterns (CTMSPs). Moreover, a prediction strategy was proposed to predict subsequent mobile behaviors.

In CTMSP-Mine, user clusters are constructed by a novel algorithm named Cluster-Object-based Smart Cluster Affinity Search Technique (CO-Smart-CAST), and similarities between users are evaluated by the proposed measure, Location-Based Service Alignment. In addition, a time segmentation approach is presented to find segmenting time intervals in which mobile characteristics are similar. They worked on the mining and prediction of mobile behaviors while considering user relations and temporal properties simultaneously. Through experimental evaluation under various simulated conditions, their proposed methods were shown to deliver excellent performance.

2.4 Categories of sequential sequence mining techniques

Sequential sequence mining is categorized into two methods.


1. Point based methods.

2. Interval based methods.

Events (or items) in a data-sequence that occur at a single time point are called point-based events. Most existing sequential sequence mining methods find sequences in data-sequences of point-based events.

The point-based state-of-the-art methods can be categorized into the following classes:

1. Performance enhancing algorithms

2. Constraint-based sequential sequence mining

3. Incremental sequential sequence mining

4. Mining variants of sequential sequences

The variants of sequential sequences are

1. Maximum sequences

2. Similar sequences

3. Fuzzy sequential sequences

4. Closed sequences

5. Multidimensional sequences

The interval-based methods include time-interval sequences. The above methods are elaborated below.

1. Performance enhancing sequential sequence mining algorithms

These algorithms improve performance with respect to various evaluation metrics. Many efforts have been devoted to improving the performance of discovering sequential sequences by proposing new mining algorithms. The performance analysis is discussed in section 2.5.


2. Constraint-based Sequential Sequence Mining

In many applications, the requirements on the discovered sequences may differ. SPIRIT [20] allows a user to discover user-specified sequential sequences by giving regular-expression constraints. Pei [39] proposed mining sequential sequences with constraints, which improves the efficiency and effectiveness of the mining results.

3. Incremental Sequential Sequence Mining

In a dynamic environment, the databases are updated continually. Mining the whole database every time it changes is inefficient; therefore, many incremental mining methods have been developed to solve this problem [40], [41].

4. Mining variants of sequential sequences

When applying sequential sequence mining methods in real applications, users may require variants of the revealed sequences. The following are some typical variations.

(1) Maximal sequences

A sequential sequence is called maximal if it is not contained in any other sequence in the set. Agrawal and Srikant found the maximal sequences [44]. Discovering maximal sequences may reduce the number of output sequences.

(2) Similar sequences

Similar sequences, found by similar sequence mining methods, occur frequently in data sequences and are discovered by processing similarity queries. The difference between similar sequences and sequential sequences is that a similar sequence need not occur exactly in the data sequences: a similarity query is satisfied if the similarity between the query sequence and a data-sequence is high enough.


(3) Periodic Sequences

Sequences recurring in the database are found by periodic sequence mining methods [44], [45], [46], [47]. For example, events behaving cyclically in time series are interesting in the marketing and biology domains.

(4) Closed sequences

A closed sequential sequence is a sequential sequence included in no other sequential sequence having exactly the same support [48], [30]. Discovering closed sequential sequences may generate more compact results and perform more efficiently.

(5) Episode

An episode is a collection of events following a specified structure and occurring repeatedly in a time series [50], [51]. Episodes are useful and efficient for analyzing time series data.

(6) Multidimensional sequences

While traditional sequential sequence mining considers only the time dimension of items, multidimensional sequences consider more than one dimension, such as region, time, customer group, etc. [52], [11]. Multidimensional sequences give more information than traditional methods.

(7) Fuzzy sequential sequences

Sequential sequences can be extended by using fuzzy sets. Chen and Ko discovered fuzzy time-interval sequential sequences [50]. Moreover, Hong and Kuo proposed fuzzy sequential sequences with quantitative data.

Yen Chen [8] used sequential sequence mining, which finds frequent subsequences as sequences in a sequence database, and considered the time between items to be purchased. They addressed sequential sequences that include time intervals, called time-interval sequential sequences, and developed two efficient algorithms for mining them: the first was based on the conventional Apriori algorithm, while the second was based on the PrefixSpan algorithm.

2.5 Empirical Analysis of State-of-the-art techniques

Here we discuss the experimental evaluation of various state-of-the-art techniques. Association rule mining [10], introduced earlier by Agrawal and Srikant, is described as follows.

2.5.1 Apriori Algorithm-Formal Description

The Apriori algorithm was the first algorithm developed by R. Agrawal and R. Srikant for Association Rule Mining [10]; it generates candidate itemsets in order to find the frequent itemsets. The basics of the algorithm are as follows. Let I = {i1, i2, i3, ..., im} be the set of items. Let D be the set of database transactions, where each transaction T is a set of items such that T ⊆ I. A TID is associated with each transaction. Let A be a set of items; a transaction T is said to contain A if A ⊆ T.

An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I and A ∩ B = Ø. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e. both A and B). This is taken as the probability P(A ∪ B).

2.5.1.1 Support and Confidence [10]

Support(A ⇒ B) = P(A ∪ B) = (# tuples containing both A and B) / (total # of tuples)

Confidence(A ⇒ B) = P(B | A) = (# tuples containing both A and B) / (# tuples containing A)
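These two measures can be computed directly from a transaction database; a minimal illustrative sketch in Python:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """P(B | A): among transactions containing A, the fraction also containing B."""
    contains_a = [t for t in transactions if a <= t]
    return sum(b <= t for t in contains_a) / len(contains_a)

# Sample database of Table 2.1, one set per customer transaction.
transactions = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
print(support(transactions, {1, 2}))       # 0.75
print(confidence(transactions, {1}, {2}))  # 0.75
```

Note that `itemset <= t` is Python's subset test, matching the "transaction T contains A" condition A ⊆ T above.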


Rules satisfying both a minimum support threshold (min-sup) and a minimum confidence threshold (min-conf) are called strong. For simplicity, we write support and confidence values as percentages between 0% and 100% rather than as numbers between 0 and 1.0. A set of items is referred to as an itemset; an itemset that contains k items is a k-itemset. For example, the set {computer, financial_management_software} is a 2-itemset. The set of frequent k-itemsets is commonly denoted by Lk.

2.5.1.2 Formal Definition: Apriori property [10]

Every nonempty subset of a frequent itemset must also be frequent. Apriori is an important algorithm for mining frequent itemsets for association rules. Its name reflects the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach in which k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The Apriori property presented below is used to reduce the search space.

The Apriori property says that all subsets of a frequent itemset must also be frequent. This property belongs to a special category of properties called anti-monotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called anti-monotone because the property is monotonic in the context of failing a test.

2.5.1.3 Algorithm: Apriori

Algorithm Apriori
// Join step: Ck is generated by joining Lk−1 with itself.
// Prune step: any candidate containing a (k−1)-itemset that is not frequent is removed.
Begin
    Ck : candidate itemsets of size k
    Lk : frequent itemsets of size k
    L1 ← {frequent 1-itemsets}
    for (k = 1; Lk ≠ Ø; k++) do
    Begin
        Ck+1 ← candidates generated from Lk
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 ← candidates in Ck+1 with support ≥ min_support
    End
    Return ∪k Lk
End

Figure 2.1 Apriori Algorithm

It is tedious to repeatedly scan the database and check a large set of candidates. The Apriori algorithm scans the database many times and generates a large number of candidate itemsets, which is inefficient.

Let us consider sample database as shown in Table 2.1.

Customer_Id    Item_Id
1              {1, 2, 3, 4, 5}
2              {1, 3}
3              {1, 2}
4              {1, 2, 3, 4}
Table 2.1 : Sample Database

For example, there are five different items (1 to 5) and four different transactions. If we set the minimum support to 0.5, then the frequent itemsets are {1}, {2}, {3}, {4}, {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}, {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}, and {1, 2, 3, 4}. Each of these occurs in at least half of the transactions. An itemset is called frequent if its support is at least the minimum support; otherwise it is called infrequent.


For instance, itemset {1, 2} is frequent: three out of four transactions (transactions 1, 3, and 4) contain both items 1 and 2, so its support is 0.75, which is more than 0.5. On the other hand, itemset {2, 5} is infrequent, since only one of the four transactions contains both items 2 and 5; its support is 0.25, which is less than the minimum support of 0.5.

2.5.2 Algorithm - Apriori-gen

Algorithm Apriori-gen
Input : A database and a user-defined minimum support
Output : All frequent itemsets
Begin
L0 ← Ø ; k ← 1
C1 ← { {i} | i ∈ I }
Answer ← Ø
while Ck ≠ Ø do
Read database and count supports for Ck
Lk ← { frequent itemsets in Ck }
Ck+1 ← Apriori-gen(Lk)
k ← k + 1
Answer ← Answer ∪ Lk
Return Answer
End
Figure 2.2 Apriori-gen Algorithm

2.5.2.1 The join procedure - Apriori-gen algorithm

Input : Lk, the set of frequent itemsets found in pass k
Output : Preliminary candidate set Ck+1
Begin
for i from 1 to |Lk| - 1
for j from i + 1 to |Lk|
if Lk.itemseti and Lk.itemsetj have the same (k-1)-prefix
Ck+1 := Ck+1 ∪ { Lk.itemseti ∪ Lk.itemsetj }
else
break
End
Figure 2.3 Apriorigen Algorithm : Join Procedure

2.5.2.2 The prune procedure of the Apriori-gen algorithm

Input : Preliminary candidate set Ck+1 generated from the join procedure
Output : Final candidate set Ck+1, which does not contain any infrequent subset
Begin
for all itemsets c in Ck+1
for all k-subsets s of c
if s ∉ Lk
Delete c from Ck+1
End
Figure 2.4: Apriori-gen Algorithm : Prune Procedure

The Apriori-gen [37] was developed later, using the Apriori property. The candidate generation process is divided into two steps. First, the preliminary candidate set is computed as C'k = { X ∪ X' | X, X' ∈ Lk-1 and |X ∩ X'| = k-2 }, whereas the actual candidates are obtained as Ck = { X ∈ C'k | X contains k members of Lk-1 }.
Apriori-gen improves on Apriori by reducing the number of candidates; it is used in the Partition, DHP, and Sampling algorithms.
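A minimal sketch of this two-step candidate generation follows: join two frequent (k-1)-itemsets that share a common (k-2)-prefix, then prune any candidate that has an infrequent (k-1)-subset. The function name and representation are illustrative:

```python
from itertools import combinations

def apriori_gen(Lk_minus_1, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = sorted(sorted(s) for s in Lk_minus_1)      # lexicographic order
    frequent = {frozenset(s) for s in prev}
    candidates = set()
    # join step: two (k-1)-itemsets sharing the same (k-2)-prefix
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:k - 2] == prev[j][:k - 2]:
                candidates.add(frozenset(prev[i]) | frozenset(prev[j]))
            else:
                break  # sorted order: no later itemset shares this prefix
    # prune step: every (k-1)-subset of a candidate must be frequent
    return {c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))}

L2 = [{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}]
C3 = apriori_gen(L2, 3)
```

On the L2 of the earlier example, this produces the four 3-candidates {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}.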

2.5.3 DHP Algorithm [14]

DHP [14] improves Apriori by using a hash filter when counting supports for the next pass. Reducing the number of candidate items is one of the most important tasks for increasing efficiency, and the support value is used to eliminate candidates. DHP reduces the number of candidates in the second pass, which is very large in Apriori. The DHP technique [14] was therefore proposed to reduce the number of candidates Ck in the early passes (k > 1), which also reduces the size of the database. In this method, support is counted by hashing the itemsets from the candidate list into the buckets of a hash table. When an itemset is encountered, if it hashes to an existing bucket, the bucket count is incremented; otherwise a new bucket is created. At the end, buckets whose support count is less than the minimum support are removed from the candidate set.

Here is an example of how the hash filter works. Suppose {1}, {2}, {3}, {5} are the frequent 1-itemsets in a database over five items 1, 2, 3, 4, and 5. In the first pass, while each transaction is examined, DHP [14] not only updates the supports of the 1-itemsets in the transaction but also updates the counts in a hash table for 2-itemsets, using a hash function.
Suppose the hash function is defined as h({x, y}) = (10x + y) mod 7. The transaction {1, 3, 5} increments the supports for the 1-itemsets {1}, {3}, and {5}. DHP also updates the counts at index h({1, 3}), index h({1, 5}), and index h({3, 5}) of the hash table, that is, at index 6, index 1, and index 0. When the database is read again, if the count in a bucket is less than the minimum support, the 2-itemsets in this bucket are considered infrequent and the value 0 is set in the filter; otherwise, the value 1 is set in the filter.
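The bucket counting of this example can be sketched as follows, using the hash function h({x, y}) = (10x + y) mod 7 from the text; the helper names are our own:

```python
from itertools import combinations

def dhp_pass1(transactions, min_count, num_buckets=7):
    """First DHP pass: count 1-itemsets and hash every 2-itemset of each
    transaction into a bucket; then build the bit filter for pass 2."""
    def h(x, y):
        # hash function from the example: h({x, y}) = (10x + y) mod 7
        x, y = sorted((x, y))
        return (10 * x + y) % num_buckets

    item_count = {}
    buckets = [0] * num_buckets
    for t in transactions:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for x, y in combinations(sorted(t), 2):
            buckets[h(x, y)] += 1
    # buckets below the minimum count get 0 in the filter, others get 1
    bit_filter = [1 if c >= min_count else 0 for c in buckets]
    return item_count, buckets, bit_filter

counts, buckets, flt = dhp_pass1([{1, 3, 5}], min_count=2)
```

For the single transaction {1, 3, 5}, buckets 6, 1, and 0 each receive one count, matching the worked example.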

The candidates are pruned using the filter before reading the database in the next pass. However, according to the experiments in [43], this optimization may not be as good as using a two-dimensional array as discussed in [44]. Like Apriori, DHP considers every frequent itemset.
The limitation is that DHP reduces candidate generation in the earlier stages, but as the level increases the bucket sizes also increase, and it becomes difficult to manage the hash table as well as the candidate set.


2.5.4 Partition Algorithm-Formal Description [12]

Apriori and DHP have two limitations. First, they scan the database multiple times, as many times as the length of the longest frequent itemset. Second, most of the records in the database are not useful in the later passes, since many records may not even contain the items in the candidates. In other words, a record that does not contain any item of any candidate can be removed without affecting the support counting process.

The Partition algorithm [12] finds the frequent elements by partitioning the database into n parts. It overcomes the memory problem for a large database that may not fit into main memory, because small parts of the database fit easily. The algorithm works in two passes, as shown in Figure 2.5.

Step 1 : In the first pass, whole database is divided into n number of parts based

on the size of database.

Step 2 : Each partitioned database is loaded into main memory one by one and

local frequent elements are found.

Step 3 : Combine all the locally frequent elements to form the global candidate set.

Step 4 : Find the globally frequent elements from this candidate set.

Figure 2.5: Mining Frequent itemsets using Partition algorithm [12]


The Partition algorithm is given in Figure 2.6.

2.5.4.1 Algorithm-Partition

P ← partition_database(D)
n ← number of partitions
2.5.4.1.1 Phase I
for i = 1 to n do
Begin
Read_in_partition(pi ∈ P)
Li ← gen_large_itemsets(pi)
End
2.5.4.1.2 Merge Phase
for ( i = 2; Li_j ≠ Ø for some j = 1, 2, ..., n; i++ ) do
Begin
Ci_G ← ∪ j=1,2,...,n Li_j
End
2.5.4.1.3 Phase II
for i = 1 to n do
Begin
Read_in_partition(pi ∈ P)
for all candidates c ∈ CG do gen_count(c, pi)
End
LG ← { c ∈ CG | c.count ≥ minsup }
Figure 2.6: Partition Algorithm

To resolve the first issue, the database is divided horizontally into equal-sized partitions that can fit in main memory. In the first pass, each partition is processed independently to produce a local frequent set for that partition. This process uses a bottom-up approach similar to Apriori, but with a different data structure. After all local frequent sets are discovered, their union forms a superset of the actual frequent set, called the global candidate set. This relies on the fact that if an itemset is globally frequent, then it must be locally frequent in at least one partition; conversely, if an itemset is not frequent in any partition, then it must be globally infrequent.

During the second pass, the actual supports for the global candidate set are produced by reading the database again; therefore, the entire process finishes within two passes. It uses a bottom-up approach, extending the length of the candidates by one in every loop until no more candidates are generated.
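The two-pass Partition scheme can be sketched as follows. For brevity the local miner is a brute-force enumerator rather than the bottom-up miner of [12], and all names are illustrative:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force local miner: enumerate itemsets over the items present."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(1 for t in transactions if set(c) <= t) / n >= min_support:
                freq.add(frozenset(c))
    return freq

def partition_mine(transactions, min_support, n_parts):
    size = -(-len(transactions) // n_parts)   # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Phase I: local frequent sets; Merge: union them into global candidates
    global_candidates = set()
    for p in parts:
        global_candidates |= frequent_itemsets(p, min_support)
    # Phase II: one more full scan counts the actual (global) supports
    n = len(transactions)
    result = {}
    for c in global_candidates:
        sup = sum(1 for t in transactions if c <= t) / n
        if sup >= min_support:
            result[c] = sup
    return result

db = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
res = partition_mine(db, 0.5, 2)
```

On the Table 2.1 data, two partitions produce a global candidate set that, after the Phase II scan, shrinks to the same fifteen frequent itemsets found earlier; locally frequent but globally infrequent candidates such as {5} are discarded.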

To prevent reading the database each time the length of the candidate is incremented, the database is transformed into TID-lists: each candidate stores a list of the IDs of the transactions that support it. The database is partitioned into a size that fits into main memory. The TID-lists solve the second issue, since only the transactions that support current candidates appear in them. However, the TID-lists add overhead: the ID of a transaction containing m items may appear, in the worst case, in C(m, k) TID-lists in the k-th pass. The partition approach also has three major limitations. First, it requires choosing a good partition size to get good performance. If a partition is too big, the TID-lists may grow too fast to fit in main memory; if a partition is too small, the global candidate set becomes large and many of its members may turn out to be infrequent.

The second limitation is that the algorithm is negatively affected by data skew, which causes the local frequent sets to be very different from each other; the global candidate set then becomes very large.


The third limitation is that the algorithm considers more candidates than Apriori, so it is infeasible for long maximal frequent itemsets.

2.5.5 Sampling Algorithm [13]

The partition approach uses the whole database and hence increases the I/O overhead. To reduce this, the Sampling algorithm was proposed by Toivonen [13]. It considers only a sample of the database and discovers an approximate frequent set using a bottom-up approach. This random sampling also overcomes the data-skew problem of the Partition algorithm.

The algorithm picks a random sample R from the database D instead of using the whole database. The sample is picked so that it fits entirely in main memory. Since it finds the frequent elements for the sample only, there is a chance of missing globally frequent elements; therefore a lowered support threshold is used instead of the actual minimum support to find the elements that are frequent locally in the sample.

This is a guess-and-correct algorithm [42]: it estimates an answer in the first pass and corrects it in subsequent passes. Unlike the Partition algorithm, which looks at the entire database, this algorithm looks only at a part of the database in the first pass. Therefore, a frequent itemset found in the sample database may not actually be frequent (a false positive), and an infrequent itemset in the sample may turn out to be frequent (a false negative). Using the support values, the false positives are removed after reading the entire database; recovering the missing frequent itemsets (the false negatives) is more difficult. The performance of the Sampling algorithm depends on the sample database. It considers at least the same candidates as Apriori, so it still has limitations when the frequent itemsets are long.
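A sketch of the guess-and-correct idea follows, assuming a simple brute-force miner on the memory-resident sample and a lowered threshold (the 0.8 factor and all names are illustrative); recovery of false negatives is omitted:

```python
import random
from itertools import combinations

def sample_mine(transactions, min_support, sample_frac=0.5,
                lowered_factor=0.8, seed=1):
    """Mine a random sample at a lowered threshold (guess), then verify the
    candidates against the full database to drop false positives (correct)."""
    rng = random.Random(seed)
    k = max(1, int(len(transactions) * sample_frac))
    sample = rng.sample(transactions, k)
    lowered = min_support * lowered_factor   # reduces the false-negative risk
    # brute-force miner on the (memory-resident) sample
    items = sorted({i for t in sample for i in t})
    candidates = set()
    for r in range(1, len(items) + 1):
        for c in combinations(items, r):
            if sum(1 for t in sample if set(c) <= t) / len(sample) >= lowered:
                candidates.add(frozenset(c))
    # one scan of the full database removes the false positives
    n = len(transactions)
    result = {}
    for c in candidates:
        s = sum(1 for t in transactions if c <= t) / n
        if s >= min_support:
            result[c] = s
    return result

db = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
res = sample_mine(db, 0.5)
```

Whatever the sample drawn, the verification scan guarantees that every reported itemset meets the true minimum support; what can still be lost are frequent itemsets that never became candidates.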


2.5.6 Dynamic Itemset Counting (DIC) [15]

This algorithm also reduces the number of database scans. It is based on the downward closure property and adds candidate itemsets at different points during the scan. Dynamic blocks, marked by start points, are formed from the database and, unlike in the earlier Apriori-style techniques, the set of candidates changes dynamically during the database scan.

2.5.7 Improved versions of Apriori [16]

The improved version of the Apriori algorithm [16] is based on a combination of forward and reverse scans of a given database. If certain conditions are satisfied, the improved algorithm can greatly reduce the iterations and the scanning time required for the discovery of candidate itemsets.

If an itemset is frequent, all of its nonempty subsets are frequent. Based on this idea, an improved Apriori method combining forward and reverse thinking was proposed. First, it finds the maximum frequent itemsets starting from the maximum itemset; all nonempty subsets of these frequent itemsets are then known to be frequent by the Apriori property. Then it scans the database again from the smallest itemsets and counts the frequent itemsets. During this scan, if an item is found that is excluded from the frequent set, it is processed to judge whether the itemsets associated with it are frequent; if they are, they are added to the barrel structure. In this way all the frequent itemsets are obtained. The key of this algorithm is to find the maximum frequent itemset quickly.

R. Srikant and R. Agrawal introduced the problem of mining sequential sequences over such databases. They proposed the algorithms AprioriAll and AprioriSome [7] to solve this problem and evaluated their performance using synthetic data. Both have comparable performance, although AprioriSome performs slightly better when the minimum number of customers supporting a sequential sequence is low. Scale-up


experiments show that both AprioriSome and AprioriAll[7] scale linearly with the

number of customer transactions. They also have excellent scale-up properties with

respect to the number of transactions per customer and the number of items in a

transaction. Let us see in detail.

2.5.8 AprioriAll -Formal Description [7]

AprioriAll finds frequent subsequences. The frequent subsequences are generated with the help of candidate itemsets, and the dataset is scanned in every pass to find the k-large sequences. The algorithm uses five phases:

i) Sort phase

ii) Litemset phase

iii) Transformation phase

iv) Sequence phase

v) Maximal phase

2.5.8.1 Sort Phase :

The database is sorted with customer-id as the major key and transaction-time as a

minor key. This step converts the dataset into sequential order.

2.5.8.2 Litemset Phase :

In this phase the set of all litemsets L is found. It simultaneously finds the set of all large 1-sequences, since finding this set is just the problem of finding large itemsets in a given set of customer transactions, which has been considered in [7], although with a slightly different definition of support. The support of an itemset is defined there as the fraction of transactions in which the itemset is present.


The main difference is that the support count is incremented only once per customer, even if the customer buys the same set of items in two different transactions. The set of litemsets is mapped to a set of contiguous integers. For the data of Table 2.2 with min_sup_count = 2, the large itemsets are (20), (30), (60), (30 60), and (80), as shown in Table 2.3.

Customer_Id   Transaction Time   Items Bought
1             August 24, '12     20
1             August 29, '12     80
2             August 9, '12      5, 10
2             August 14, '12     20
2             August 19, '12     30, 50, 60
3             August 24, '12     20, 40, 60
4             August 24, '12     20
4             August 29, '12     30, 60
4             August 24, '12     80
5             August 11, '12     80
Table 2.2: Mining Frequent itemsets using AprioriAll

Customer_Id   Customer Sequence          |   Large Itemsets   Mapped To
1             <(20) (80)>                |   (20)             1
2             <(5 10) (20) (30 50 60)>   |   (30)             2
3             <(20 40 60)>               |   (60)             3
4             <(20) (30 60) (80)>        |   (30 60)          4
5             <(80)>                     |   (80)             5
Table 2.3 : Mapping of sequences (min_sup_count = 2)
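The once-per-customer support counting of the Litemset phase can be sketched on the data of Table 2.2. To keep the sketch short, only 1- and 2-itemsets are enumerated; all names are illustrative:

```python
from itertools import combinations

def litemset_supports(customer_transactions, min_sup_count):
    """Count an itemset's support once per customer, even if the customer
    buys it in several transactions (Litemset phase of AprioriAll)."""
    support = {}
    for cid, transactions in customer_transactions.items():
        seen = set()
        for t in transactions:
            for k in (1, 2):
                for c in combinations(sorted(t), k):
                    seen.add(frozenset(c))
        for s in seen:            # each customer contributes at most one count
            support[s] = support.get(s, 0) + 1
    return {s: c for s, c in support.items() if c >= min_sup_count}

# Table 2.2 data: customer id -> list of transaction itemsets
db = {
    1: [{20}, {80}],
    2: [{5, 10}, {20}, {30, 50, 60}],
    3: [{20, 40, 60}],
    4: [{20}, {30, 60}, {80}],
    5: [{80}],
}
lits = litemset_supports(db, min_sup_count=2)
```

With min_sup_count = 2 this yields exactly the litemsets (20), (30), (60), (30 60), and (80) of Table 2.3.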


2.5.8.3 Transformation Phase:

This phase repeatedly determines which of a given set of large sequences are

contained in a customer sequence. To make this test fast, each customer sequence is transformed into an alternative representation, as shown in Tables 2.3 and 2.4. In a

transformed customer sequence, each transaction is replaced by the set of all litemsets

contained in that transaction. If a transaction does not contain any litemset, it is not

retained in the transformed sequence. If a customer sequence does not contain any

litemset, this sequence is dropped from the transformed database. However, it still

contributes to the count of total number of customers. A customer sequence is now

represented by a list of sets of litemsets.

Customer_Id   Original Customer Sequence   Transformed Customer Sequence           After Mapping
1             <(20) (80)>                  <{(20)} {(80)}>                         <{1} {5}>
2             <(5 10) (20) (30 50 60)>     <{(20)} {(30), (60), (30 60)}>          <{1} {2, 3, 4}>
3             <(20 40 60)>                 <{(20), (60)}>                          <{1, 3}>
4             <(20) (30 60) (80)>          <{(20)} {(30), (60), (30 60)} {(80)}>   <{1} {2, 3, 4} {5}>
5             <(80)>                       <{(80)}>                                <{5}>
Table 2.4 : Transformed sequences
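The transformation of Table 2.4 can be sketched as follows, using the litemset-to-integer mapping of Table 2.3; the function name is our own:

```python
def transform(customer_sequence, litemset_ids):
    """Transformation phase sketch: replace each transaction by the set of
    litemset ids it contains; transactions with no litemset are dropped."""
    out = []
    for t in customer_sequence:
        ids = {i for lit, i in litemset_ids.items() if lit <= t}
        if ids:
            out.append(ids)
    return out   # [] means the whole sequence is dropped (still counted in the total)

# mapping from Table 2.3
ids = {frozenset({20}): 1, frozenset({30}): 2, frozenset({60}): 3,
       frozenset({30, 60}): 4, frozenset({80}): 5}
seq = transform([{5, 10}, {20}, {30, 50, 60}], ids)   # customer 2
```

Customer 2's first transaction {5, 10} contains no litemset and is dropped; the rest map to <{1} {2, 3, 4}> as in Table 2.4.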

2.5.8.4 Sequence Phase:

The algorithm scans the dataset multiple times. In each scan, it starts with a seed

set of large sequences. The seed set is used to generate new potentially large sequences, called candidate sequences. It finds the support for these candidate


sequences during the scan over the data. At the end of the scan, it determines which candidate sequences are actually frequent; these frequent candidates become the seed for the next scan. There are two families of algorithms, called count-all and count-some. A count-all algorithm counts all the large sequences, including non-maximal ones, which must then be pruned out in the maximal phase. The authors presented one count-all algorithm, called AprioriAll, and two count-some algorithms, AprioriSome and DynamicSome.

2.5.8.5 Maximal Phase:

This phase finds the maximal sequences among the set of all large sequences S found in the sequence phase. Let the length of the longest sequence be n.
The AprioriAll algorithm is given in Figure 2.7.

2.5.9 Algorithm-AprioriAll [7]

L1 ← {large 1-sequences}
for ( k = 2; Lk-1 ≠ Ø; k++ ) do
Begin
Ck ← new candidates generated from Lk-1
for each customer-sequence c in the database do
Increment the count of all candidates in Ck that are contained in c
Lk ← candidates in Ck with minimum support
End
Answer ← maximal sequences in ∪k Lk
Figure 2.7 : Algorithm AprioriAll


2.5.10 AprioriSome-Formal Description [7]

The AprioriSome algorithm [7] runs in a forward and a backward pass. In the forward pass it counts sequences of only certain lengths; for example, it may count sequences of lengths 1, 2, 4, and 6 in the forward pass and sequences of lengths 3 and 5 in the backward pass. It saves time by not counting sub-sequences that are not maximal. When only the maximal frequent sub-sequences are required, rather than all frequent sub-sequences, this saves both time and memory. The detailed algorithm is given in Figure 2.8.

2.5.10.1 Algorithm- AprioriSome : Forward Phase

L1 ← {large 1-sequences}
C1 ← L1
last ← 1
for ( k = 2; Ck-1 ≠ Ø and Llast ≠ Ø; k++ ) do
Begin
if ( Lk-1 known ) then
Ck ← new candidates generated from Lk-1
else
Ck ← new candidates generated from Ck-1
if ( k == next(last) ) then
Begin
for each customer-sequence c in the database do
Increment the count of all candidates in Ck that are contained in c
Lk ← candidates in Ck with minimum support
last ← k
End
End

2.5.10.2 AprioriSome : Backward Phase

Begin
for ( k--; k >= 1; k-- ) do
if ( Lk not found in forward phase ) then
Begin
Delete all sequences in Ck contained in some Li, i > k
for each customer-sequence c in DT do
Increment the count of all candidates in Ck that are contained in c
Lk ← candidates in Ck with minimum support
End
else
Delete all sequences in Lk contained in some Li, i > k
Answer ← ∪k Lk
Here DT is the transformed database
End
Figure 2.8 : Algorithm AprioriSome

2.5.11 Relative performance - AprioriAll & AprioriSome

The major advantage of AprioriSome over AprioriAll is that it avoids counting many non-maximal sequences. However, this advantage is reduced for two reasons. First, in AprioriAll the candidates Ck are always generated from Lk-1, whereas AprioriSome may generate them from Ck-1, so the number of candidates generated by AprioriSome can be larger. Second, although AprioriSome skips counting candidates of some lengths, those candidates are still generated and stay memory resident. If memory fills up, AprioriSome is forced to count the last set of candidates generated even if the heuristic suggests skipping some more candidate sets. This effect decreases the skipping distance between the two candidate sets that are actually counted, and AprioriSome starts behaving more like AprioriAll. For lower supports there are longer large sequences, hence more non-maximal sequences, and AprioriSome does better.

2.5.12 DynamicSome-Formal Description [7]

DynamicSome generates candidates on the fly using the large sequences found in the previous passes and the customer sequences read from the database. The algorithm has four phases, shown in Figure 2.9. The initialization phase counts all the large sequences of length up to step. The forward phase counts all the sequences whose length is a multiple of step. The intermediate phase counts the candidate sequences not counted in the first two phases; unlike in AprioriSome, these candidates were not generated in the forward phase, so the intermediate phase generates them. The backward phase is identical to that of AprioriSome.

The limitation of this algorithm is the main memory capacity. It fails when there

is little main memory, or many potentially large sequences.

2.5.13 Algorithm-DynamicSome

2.5.13.1 Initialization Phase

L1 ← {large 1-sequences}
for ( k = 2; k <= step and Lk-1 ≠ Ø; k++ ) do
Begin
Ck ← new candidates generated from Lk-1
for each customer-sequence c in DT do
Increment the count of all candidates in Ck that are contained in c
Lk ← candidates in Ck with minimum support
End
2.5.13.2 Forward Phase
for ( k = step; Lk ≠ Ø; k += step ) do
Begin
( find Lk+step from Lk and Lstep )
Ck+step ← Ø
for each customer-sequence c in DT do
Begin
X ← otf-generate(c, Lk, Lstep)
for each sequence x ∈ X, increment its count in Ck+step
End
Lk+step ← candidates in Ck+step with minimum support
End
2.5.13.3 Intermediate Phase
for ( k--; k > 1; k-- ) do
if ( Lk not yet determined ) then
if ( Lk-1 known ) then
Ck ← new candidates generated from Lk-1
else
Ck ← new candidates generated from Ck-1
Figure 2.9 : Algorithm DynamicSome

2.5.14 GSP [6]

R. Srikant and R. Agrawal introduced the GSP algorithm. It uses the downward-closure property of sequential sequences and a multiple-pass, candidate generate-and-test approach over a horizontal data format. During the first scan it finds all the frequent items with minimum support; each such item gives a 1-event frequent sequence. Candidate 2-sequences are formed from these frequent sequences, and longer candidates are generated from the candidates of the previous step. This process is repeated until no more frequent sequences are found.

2.5.14.1 Formal Description

The algorithm [6] makes multiple passes over the data. The first pass determines the support of each item; at the end of it, the algorithm knows which items are frequent, and each such item yields a 1-element frequent sequence. Each subsequent pass starts with the frequent sequences found in the previous pass, called the seed set. The seed set is used to generate new potentially frequent sequences, called candidate sequences. Each candidate sequence contains one more item than a seed sequence, so within one pass all candidate sequences have the same number of items. The supports of these candidate sequences are found during the pass, and the algorithm then determines which of them are actually frequent. These frequent candidates become the seed for the next pass. The algorithm terminates when no frequent sequences are found at the end of a pass, or when no candidates are generated.

Each pass has two key steps:
1. Candidate generation: candidate sequences are generated when the pass begins.
2. Counting candidates: the support of each candidate sequence is counted.
The candidates are generated in two steps:

2.5.14.2 Join Phase: It generates candidate sequences Ck+1 by joining Lk with Lk. The candidate sequence is generated by joining s1 with s2, that is, the sequence s1 extended with the last item of s2. The added item becomes a separate element if it was a separate element in s2, and part of the last element of s1 otherwise.
2.5.14.3 Prune Phase: It deletes candidate sequences that have a contiguous subsequence whose support value is less than the minimum support.

During the counting of candidates, a hash-tree data structure is used to reduce the number of candidates in C that are tested against a data-sequence. The representation of the data-sequence d is transformed so that the algorithm can efficiently find whether a specific candidate is a subsequence of d.
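The basic containment test used when counting candidate supports (whether a candidate sequence of itemsets is contained in a data-sequence) can be sketched as a greedy scan; this is an illustrative helper, not the hash-tree traversal itself:

```python
def is_subsequence(candidate, data_sequence):
    """Check whether candidate (a list of itemsets) is contained in
    data_sequence: each candidate element must be a subset of some
    data-sequence element, in order."""
    i = 0
    for element in data_sequence:
        if i < len(candidate) and candidate[i] <= element:
            i += 1          # greedy match: take the earliest fitting element
    return i == len(candidate)

d = [{1, 2}, {3}, {4, 5}]   # a data-sequence of three transactions
```

For d above, <(1)(4 5)> is contained, while <(3)(1)> (wrong order) and <(1 3)> (items split across transactions) are not.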

2.5.14.4 Relative Performance:

Figure 2.10 shows the relative performance with respect to execution time. The synthetic datasets were generated using the synthetic data generator of the IBM Quest data mining project. The datasets were parameterized by the following symbols with various values.


D denotes the number of customers in the dataset, C the average number of transactions per customer, T the average number of items per transaction, S the average length of maximal sequences, and I the average length of transactions within the maximal sequences. The datasets are compared under various parameter values. The values taken in the empirical analysis were D10000-C10-T2.5-S4-I1.25: 10,000 customers with an average of 10 transactions per customer and 2.5 items per transaction, an average maximal-sequence length of 4, and an average length of 1.25 for transactions within the maximal sequences.

Here, for the three algorithms and the given synthetic datasets, the minimum support is decreased from 1% to 0.2%. The graph for DynamicSome is not plotted because it generates too many candidates and runs out of memory at low minimum support. Even if DynamicSome had more memory, the cost of finding the support of that many candidates would make its execution time much larger than that of AprioriAll or AprioriSome.
The execution time of all the algorithms increases as the support is decreased, because of a large increase in the number of large sequences in the result. DynamicSome performs worse than the other two algorithms mainly because it generates and counts a much larger number of candidates in the forward phase. Execution time also increases as the number of customers or the number of transactions per customer increases. As the support value increases, execution time falls, because fewer sequences qualify under the minimum support criterion.

Scale-up experiments show that both AprioriSome and AprioriAll scale linearly

with the number of customer transactions. AprioriSome and AprioriAll have similar performance, although AprioriSome performs a little better for lower values of minimum support.


The major advantage of AprioriSome over AprioriAll is that it avoids counting many non-maximal sequences; this is why AprioriSome performs a little better for lower values of minimum support.
The comparative performance is shown in Figure 2.10. As the minimum support decreases, the timing difference between AprioriAll and AprioriSome increases, so AprioriSome performs better than AprioriAll.

Figure 2.10: Relative Performance

GSP [6] has some limitations. A huge set of candidate sequences is generated: 1,000 frequent length-1 sequences generate 1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates. Especially for 2-item candidate sequences, multiple scans of the database are needed. The length of each candidate grows by one at each database scan, so GSP is inefficient for mining long sequential sequences.
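The arithmetic behind this candidate blow-up can be checked directly. A small illustrative sketch (the function name is ours):

```python
# Length-2 candidates GSP builds from f frequent length-1 sequences:
#   f * f candidates of the form <a b>   (ordered pairs, repetition allowed)
# + f * (f - 1) / 2 of the form <(a b)>  (2-item itemsets, unordered)
def gsp_length2_candidates(f: int) -> int:
    return f * f + f * (f - 1) // 2

print(gsp_length2_candidates(1000))  # 1499500
```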

GSP and DynamicSome generate too many candidate sequences for low values of minimum support. The execution time of all the algorithms increases as the support


decreases, because of a large increase in the number of large sequences in the result. GSP and DynamicSome perform worse; DynamicSome generates and counts a much larger number of candidates in the forward phase and intermediate stages.

The efficiency [38] of all frequent sequence mining algorithms can be characterized as follows. Let the minimum support threshold be given, and let there be n = |C| different items in the item collection C. There are |I| different possible itemsets, where I is the powerset of C (excluding the empty set); its value is given by equation 2.1.

|I| = Σ_{j=1}^{n} C(n, j) = 2^n − 1                …Equation 2.1

Let the database contain sequences with at most m itemsets, each itemset having at most one item. In this case there are n^m possible different sequences with exactly m itemsets, and the number of different sequences of arbitrary length up to m is given in equation 2.2.

Σ_{k=1}^{m} n^k = (n^{m+1} − n) / (n − 1)                …Equation 2.2

Similarly, if each itemset may contain an arbitrary number of items, there exist S_m possible frequent sequences with exactly m itemsets, where S_m is given by equation 2.3.

S_m = |I|^m = (2^n − 1)^m                …Equation 2.3

The total number of sequences S of arbitrary length up to m is given in equation 2.4.

S = Σ_{k=1}^{m} (2^n − 1)^k = ((2^n − 1)^{m+1} − (2^n − 1)) / (2^n − 2)                …Equation 2.4
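Equations 2.1 through 2.4 can be checked numerically with a short sketch (function names are ours; the sums and closed forms are compared with exact integer arithmetic):

```python
def itemset_count(n):
    # Equation 2.1: |I| = sum of C(n, j) for j = 1..n = 2**n - 1
    return 2 ** n - 1

def single_item_sequences(n, m):
    # Equation 2.2: sum_{k=1}^{m} n**k
    return sum(n ** k for k in range(1, m + 1))

def sequences_with_m_itemsets(n, m):
    # Equation 2.3: S_m = |I|**m = (2**n - 1)**m
    return itemset_count(n) ** m

def all_sequences(n, m):
    # Equation 2.4: sum_{k=1}^{m} (2**n - 1)**k
    return sum(itemset_count(n) ** k for k in range(1, m + 1))

# The closed forms agree with the sums, e.g. for n = 3, m = 4:
n, m = 3, 4
assert single_item_sequences(n, m) == (n ** (m + 1) - n) // (n - 1)
assert all_sequences(n, m) == ((2**n - 1) ** (m + 1) - (2**n - 1)) // (2**n - 2)
```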


2.5.15 FreeSpan [5]

The FreeSpan algorithm [5] was introduced by Jiawei Han and Jian Pei. FreeSpan uses projected sequential databases to confine the search and the growth of subsequence fragments. It first scans the database and finds the frequent items, i.e. the frequent length-1 sequences. The complete set of sequential sequences is then divided into a number of subsets according to these frequent items, generated without overlap. FreeSpan was the first to use a bi-level projection technique for finding frequent sub-sequences; this technique reduces the number of projected databases.

FreeSpan offers advantages over Apriori-based algorithms. The alternative-level projection in FreeSpan [5] reduces the cost of scanning multiple projected databases and takes advantage of Apriori-style candidate pruning. It works faster than Apriori because it examines substantially fewer combinations of subsequences.

FreeSpan [5] has several bottlenecks. Its major overhead is that it generates many nontrivial projected databases: if a subsequence appears in every sequence of the database, its projected database does not shrink and remains nearly as large as the original database. Moreover, since the growth of a subsequence is explored at every possible split point in a candidate sequence, projection is very expensive, and some unnecessary sequences are generated.

2.5.16 SPADE [4]

SPADE (Sequential PAttern Discovery using Equivalence classes) [4] was developed by Zaki. SPADE outperforms GSP (Generalized Sequential Patterns) [6] by a factor of two, and by an order of magnitude when the supports of 2-sequences are precomputed.


SPADE [4] uses only simple temporal join operations on id-lists. As the length of a frequent sequence increases, the size of its id-list decreases, resulting in very fast joins. No complicated hash-tree structure is used, and no overhead of generating and searching subsequences is incurred. SPADE [4] has excellent locality, since a join requires only a linear scan of two lists.

As the minimum support is lowered, more and larger frequent sequences are found. GSP makes a complete dataset scan in each iteration; SPADE [4], on the other hand, usually restricts itself to only three scans. This algorithm uses the vertical data format. A sample data set is shown in Table 2.5.

Seq ID Sequences

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>

Table 2.5 : Data set Example

SPADE converts the dataset into the vertical format using SID and CID: SID identifies the sequence an item belongs to, and CID identifies the transaction within that sequence, as shown in Table 2.6.

SID CID Items
1   1   a
1   2   abc
1   3   ac
1   4   d
1   5   cf
2   1   ad
2   2   c
2   3   bc
2   4   ae
3   1   ef
3   2   ab
3   3   df
3   4   c
3   5   b
4   1   e
4   2   g
4   3   af
4   4   c
4   5   b
4   6   c

Table 2.6 : Vertical Data format
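The conversion from Table 2.5 to this vertical format can be sketched as follows. This is a simplified illustration in which each itemset is written as a string of single-character items:

```python
from collections import defaultdict

# Table 2.5, with each itemset written as a string of single-character items.
dataset = {
    1: ["a", "abc", "ac", "d", "cf"],
    2: ["ad", "c", "bc", "ae"],
    3: ["ef", "ab", "df", "c", "b"],
    4: ["e", "g", "af", "c", "b", "c"],
}

# Build item -> id-list of (SID, CID) pairs.
id_lists = defaultdict(list)
for sid, sequence in dataset.items():
    for cid, itemset in enumerate(sequence, start=1):
        for item in itemset:
            id_lists[item].append((sid, cid))

print(id_lists["a"])  # [(1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 2), (4, 3)]
```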

For example, the first occurrence of 'a' has SID = 1 (it appears in sequence 1) and CID = 1 (it occurs in the first transaction of that sequence). In this vertical data format every item carries an SID and a CID. Scanning the dataset for frequent items with a minimum support of 2 yields a, b, c, d, e and f, as shown in Table 2.7.

'a':  (SID, CID)        = (1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 2)
'b':  (SID, CID)        = (1, 2), (2, 3), (3, 2), (3, 5), (4, 5), (4, 3)

'ab': (SID, CID(a), CID(b)) = (1, 1, 2), (2, 1, 3), (3, 2, 5), (4, 3, 5)
'ba': (SID, CID(b), CID(a)) = (1, 2, 3), (2, 3, 4)

Table 2.7: Vertical Data format

Next, the 2-length sequences are found. Candidate sequences are generated from the 1-length sequences and then checked for frequency. To build the table for the sequence 'ab', SPADE joins those occurrences of 'a' and 'b' that share the same SID, where the CID of 'a' must be lower than the CID of 'b'. For SID 1, the sequence 'ab' is generated: the CIDs of 'a' and 'b' are 1 and 2 respectively, indicating that 'a' occurs before 'b'. The 3-length sequences are generated in the same way, and so on.
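This temporal join on id-lists can be sketched in a few lines (helper names are ours; id-lists are lists of (SID, CID) pairs):

```python
def temporal_join(idlist_a, idlist_b):
    """Id-list for the 2-sequence <a b>: same SID, and 'a' in an earlier CID."""
    joined = []
    for sid_a, cid_a in idlist_a:
        for sid_b, cid_b in idlist_b:
            if sid_a == sid_b and cid_a < cid_b:
                joined.append((sid_a, cid_a, cid_b))
    return joined

def support(id_list):
    """Support = number of distinct sequences (SIDs) in the id-list."""
    return len({entry[0] for entry in id_list})

a = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 2), (4, 3)]
b = [(1, 2), (2, 3), (3, 2), (3, 5), (4, 3), (4, 5)]
ab = temporal_join(a, b)
print(support(ab))  # 4: <a b> occurs in all four sequences
```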


SPADE uses these index lists to find frequent items and works faster than GSP, although multiple scans still waste time. A limitation of SPADE is that it may need an exponential number of short candidates.

About 10^30 candidate sequences are needed to generate a length-100 sequential sequence, since Σ_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30.

The C10-T2.5-S4-I1.25 dataset is used in experiments with different minimum support levels, ranging from 0.25% to 1%. The comparative results are shown in Figure 2.11. It is observed that SPADE outperforms GSP and FreeSpan. FreeSpan in turn performs better than GSP in time, owing to the reduced cost of scanning multiple projected databases several times.

When the minimum support decreases, the execution time increases, as can be seen in Figure 2.11.

Figure 2.11: Comparison – GSP, Freespan, SPADE


2.5.17 Prefixspan [9]

This algorithm uses a pattern-growth approach and never generates candidates that do not appear in the database. It employs several optimization methods, and it is easy to extend with other constraints, such as mining closed sequences. Using a divide-and-conquer technique, it first generates the projected databases and then finds the frequent sequences.

To overcome the bottlenecks of FreeSpan, Jiawei Han and Jian Pei developed a new algorithm called PrefixSpan [2]. It outperforms both the Apriori-based and FreeSpan algorithms in almost all settings, such as large numbers of sequences and low support. Different projection methods are used in PrefixSpan [2]: level-by-level projection, bi-level projection, etc. The PrefixSpan algorithm is given in Figure 2.12.

Algorithm PrefixSpan

Input: A sequence database S, the minimum support threshold
Output: The complete set of sequential patterns

Begin
  Call PrefixSpan(<>, 0, S)

procedure PrefixSpan(α, L, S|α)
  Scan S|α once; find each frequent item b such that
    b or <b> can be appended to α to form a sequential sequence
  for each frequent item b, append it to α to form a
    sequential sequence α’ and output α’
  for each α’, construct the α’-projected database S|α’
    Call PrefixSpan(α’, L+1, S|α’)

Figure 2.12: Algorithm - Prefixspan

The first step of PrefixSpan is to scan the sequential database to obtain the length-1 sequences, which are in fact the large 1-itemsets. The sequential database is then divided into partitions according to these length-1 sequences. Each partition is the projection of the sequential database that takes the corresponding length-1 sequence as prefix. The projected database contains only the suffixes of these


sequences. All the length-2 sequential sequences are generated from the projected databases, using the length-1 sequential sequences as prefixes. The projected database is then partitioned again by those length-2 sequences. The same process is repeated until the projected database is empty or no more frequent length-k sequences are generated.
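The recursion described above can be sketched compactly. This is a simplified illustration, assuming each itemset contains a single item (so the itemset-assembling extension is omitted):

```python
from collections import Counter

def prefixspan(db, min_sup, prefix=()):
    """db: list of item sequences (suffix databases); returns (pattern, support) pairs."""
    counts = Counter()
    for seq in db:
        counts.update(set(seq))              # count each item once per sequence
    results = []
    for item in sorted(counts):
        sup = counts[item]
        if sup < min_sup:
            continue
        pattern = prefix + (item,)
        results.append((pattern, sup))
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [s[s.index(item) + 1:] for s in db if item in s]
        results.extend(prefixspan(projected, min_sup, pattern))
    return results

db = [list("acd"), list("abc"), list("abd")]
print(prefixspan(db, min_sup=2))
```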

Let the sequence database be as shown in Table 2.8. The item set is {a, b, c, d}. The sequence <ac(bc)d(abc)ad> has 7 elements: (a), (c), (bc), (d), (abc), (a) and (d).

Customer ID   Customer Sequence
1             <ac(bc)d(abc)ad>
2             <b(cd)ac(bd)>
3             <d(bc)(ac)(cd)>

Table 2.8: Sequence Database

The main cost of the above method is the time and space used to construct and scan the projected databases, as shown in Table 2.9. This is called level-by-level projection. Another projection method, called bi-level projection, is used to reduce the number of projected databases. The first step is the same: by scanning the sequential database we obtain the frequent 1-sequences.

In the second step, instead of constructing projected databases, an n×n triangular matrix is constructed, as shown in Table 2.10. It represents the supports of all length-2 sequences. For example, [<d>, <a>] = (3, 3, 0) means that the supports of <d a>, <a d> and <(ad)> are 3, 3 and 0 respectively. Projected databases are then created only for those length-2 sequences that are frequent, i.e. that pass the minimum support threshold.

The pseudo projection technique reduces the number and size of projected databases. The idea is as follows: instead of performing a physical projection, one can register the index (or identifier) of the corresponding sequence and the starting


position of the projected suffix in the sequence. A physical projection of a sequence is thus replaced by a pair <sid, offset>, where sid identifies the sequence and offset points to the start of the projected suffix within it; e.g. <a, 2> means that the projection for prefix 'a' starts at the 2nd position of that sequence. Pseudo projection reduces the cost of projection substantially when the projected database fits in main memory, but it may not be efficient for disk-based access, since random disk access is expensive. Based on this observation, if the original sequence database or the projected databases are too big to fit into main memory, physical projection should be applied.
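Pseudo projection can be sketched with <sid, offset> pairs. This illustration keeps the single-item-itemset simplification, and the helper names are ours:

```python
def pseudo_project(db, projections, item):
    """Extend each <sid, offset> projection by `item`; offset points at the
    start of the current suffix within the original sequence."""
    extended = []
    for sid, offset in projections:
        seq = db[sid]
        try:
            pos = seq.index(item, offset)    # first occurrence in the suffix
        except ValueError:
            continue                         # item absent: sequence drops out
        extended.append((sid, pos + 1))      # suffix now starts after `item`
    return extended

db = {1: list("acbd"), 2: list("abcd")}
start = [(1, 0), (2, 0)]                     # empty prefix: whole sequences
print(pseudo_project(db, start, "b"))        # [(1, 3), (2, 2)]
```

No suffix text is copied; only the pointers move, which is why pseudo projection is cheap while everything fits in memory.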

Large Itemsets   Projected Database (suffix dataset or postfix)
A                <c(bc)d(abc)ad>, <c(bd)>, <(_c)(cd)>
B                <(_c)d(abc)ad>, <(cd)ac(bd)>, <(_c)(ac)(cd)>
C                <(bc)d(abc)ad>, <(_d)ac(bd)>, <(ac)(cd)>
D                <(abc)ad>, <ac(bd)>, <(bc)(ac)(cd)>

Table 2.9: Projected Database

<a>   0
<b>   (3,2,1)   0
<c>   (3,3,2)   (2,3,2)   0
<d>   (3,3,0)   (3,3,1)   (3,3,2)   0
      <a>       <b>       <c>       <d>

Table 2.10 : S-Matrix


The main cost of PrefixSpan [2] is that it takes much time when the dataset is huge and the support is low, because of repeated scanning of the projected databases. This is improved by using the bi-level projection and pseudo projection techniques. PrefixSpan does not use a vertical representation, so it may need to scan the database several times; the database has to be stored in memory, and projection tables are generated for every sequence.

When the support threshold is high, there is a limited number of sequential sequences and their length is short, so these methods are very close in terms of runtime. As the support threshold decreases, the time to generate the sequences grows. It is clear that FreeSpan and PrefixSpan outperform GSP, and that the PrefixSpan methods are more efficient than FreeSpan.

Figure 2.12: Comparison – Freespan, SPADE, PrefixSpan

2.5.18 SPAM [3]

Jay Ayres developed one more algorithm for sequential mining, SPAM (Sequential PAttern Mining) [3]. It uses a bitmap representation. There are several


optimizations possible with bitmaps. SPAM uses a vertical representation of the database, so the database needs to be scanned only once to create it. It is very fast to compute the intersection of two SID sets (sets of sequence ids) by performing a logical AND of the two bitmaps.

SPAM is basically based on SPADE's [4] vertical representation. The author introduced a novel depth-first search strategy [3] that integrates a depth-first traversal of the search space with effective pruning mechanisms. The implementation combines a vertical bitmap representation of the database with efficient support counting. Corresponding to the two extension processes in SPAM [3], it uses two pruning techniques, S-step pruning and I-step pruning, based on the Apriori heuristic, to minimize the number of candidates. The S-step appends a new itemset at the end of the sequence (e.g. <(a,b)(d)>), while the I-step adds an item to the last itemset of the current sequence (e.g. <(a,b,d)>). The dataset is shown in Table 2.11.

SID Sequences

1 <(a,b,d),(b,c,d),(b,c,d)>

2 <b,(a,b,c)>

3 <(a,b),(b,c,d)>

Table 2.11: Data set

SPAM uses the vertical data format, but differs from SPADE in that it works with a bitmap representation. The vertical bitmap representation of the example dataset is shown in Table 2.12.

SID TID {a} {b} {c} {d}

1 1 1 1 0 1

1 2 0 1 1 1

1 3 0 1 1 1

2 1 0 1 0 0

2 2 1 1 1 0

3 1 1 1 0 0

3 2 0 1 1 1

Table 2.12: Vertical format


This table shows the bitmaps for the various SID and TID values: a 1 indicates that the item is present in that SID and TID, while a 0 indicates that it is absent, so the data is in binary format. Sequences are generated by the S-step and the I-step; the S-step process is shown in Table 2.13.

{a}   ({a})s   {b}   ({a},{b})
1     0        1     0
0     1        1     1
0     1        1     1
0     0        1     0
1     0        1     0
1     0        1     0
0     1        1     1

(S-step process: ({a})s AND {b} = Result)

Table 2.13 : S-step process

The first column of Table 2.13 is the bitmap of {a}; the second column, ({a})s, is derived from it and captures that item 'a' occurs in the same sequence before item 'b'. Since item 'a' is present in the 1st TID of SID 1, that bit is set to zero and all subsequent TIDs of the sequence are set to 1, indicating that a sequence beginning with 'a' has appeared; within ({a})s, all bits remain 1 until the SID changes. The third column, for {b}, is used as-is. The AND operation is then performed, and the result shows where <(a)(b)> can be formed as one sequence (wherever the AND result is 1). In this way the S-type sequences are created; the I-type sequences are generated next, as shown in Table 2.14.

({a},{b})   {d}   ({a},{b,d})
0           1     0
1           1     1
1           1     1
0           0     0
0           0     0
0           0     0
1           1     1

(I-step process: ({a},{b}) AND {d} = Result)

Table 2.14 : I-step process
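The S-step and I-step can be sketched on per-sequence bit lists (a minimal illustration, with one list entry per TID of a single sequence):

```python
def s_transform(bits):
    """S-step transform: clear bits up to and including the first 1,
    then set every later position (where the sequence may continue)."""
    out = [0] * len(bits)
    if 1 in bits:
        for i in range(bits.index(1) + 1, len(bits)):
            out[i] = 1
    return out

def bit_and(x, y):
    return [p & q for p, q in zip(x, y)]

# Bitmaps of items a, b, d within a single 3-transaction sequence.
a, b, d = [1, 0, 0], [0, 1, 1], [0, 1, 0]

ab = bit_and(s_transform(a), b)   # S-step: extend by a new itemset -> <a b>
abd = bit_and(ab, d)              # I-step: extend the last itemset -> <a (b d)>
print(ab, abd)                    # [0, 1, 1] [0, 1, 0]
```

The S-step needs the transform because 'b' must occur strictly after 'a'; the I-step is a plain AND because 'b' and 'd' must occur in the same transaction.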


SPAM [3] performs better for large datasets. PrefixSpan [2] sometimes outperforms SPAM; however, with a huge number of items, SPAM works better than PrefixSpan. For small datasets SPAM consumes more memory than SPADE.

In Figure 2.13, the dataset D10000-D50000C10T5S3.5I1.25 was taken, with 100 different items and 10,000 to 50,000 customers. The frequent sequences were generated and compared in terms of time (sec). It is seen that PrefixSpan [2] runs faster than SPAM [3]. SPAM generates candidates and then computes the SID set of each candidate to calculate its support. It may generate many candidates that are not frequent, and can even generate candidates that do not appear in the database, so it wastes time.

If the sequences are very long, the memory usage goes up, because each bitmap takes more memory space; for more frequent items, more bitmaps need to be stored in memory.

Figure 2.13: Comparison – PrefixSpan with SPAM


In Figure 2.13, the dataset has support 3.5% and 100 different items, with the number of customers varied from 10,000 to 50,000. The frequent sequences are found and compared in terms of time (sec). We can see that PrefixSpan runs faster than SPAM.

Figure 2.14: No of Customer v/s Memory

In Figure 2.14, the dataset has support 3.5% and 100 different items, with the number of customers varied from 10,000 to 50,000. The frequent sequences are found and compared in terms of memory used. It is seen that SPAM uses less memory than PrefixSpan [2], which means that SPAM can deal with large datasets.

Figure 2.15: No of Transaction v/s Memory


Figure 2.15 shows the comparison of number of transactions versus memory, with the support fixed at 2.5% and 100 different items. It is noticed that as the transactions per customer increase, the memory used by PrefixSpan [2] increases steadily, whereas the memory used by SPAM is not much affected. This indicates that SPAM manages memory well.

In Figure 2.16, the support is fixed at 2.5% with 100 different items, and the number of transactions per customer is varied. The frequent sequences are found and compared with respect to time (seconds). We can see that as the number of transactions increases, the time to generate the sequences also increases.

Figure 2.16: Memory Prefixspan v/s SPAM

In Figure 2.17, the dataset has 100 different items and the support is varied from 0.04 to 0.025. It is found that as the support decreases, the memory size increases, because more sequences qualify at lower support values.


Figure 2.17: Support v/s Memory

In 2009, Y.J. Lee [49] proposed a new algorithm for time-interval sequential mining based on Allen's theory. Their basic idea was a preprocessing algorithm that derives time-interval data from data with time points. They worked on a medical database. For example, if a patient showed a symptom B daily between March and April, there would be several transactions recording symptom B at different time points. These transactions, executed uniformly during that period, could be summarized as a single transaction with an interval from March to April. Through this generalization process, they could produce time-interval data and reduce the size of the search space for time-interval sequences. The time-interval relation discovery algorithm could then discover time-interval relation rules among the summarized transactions.

They focused on the sequences of events of customers and proposed algorithms related to time-interval sequence mining.

Let u denote a time granularity. If a transaction is issued once a month, the time granule is one month; likewise, a sequence has a one-month time granule if each event of the sequence represents information about one month.


Given a time granularity u ∈ U and a base time point v ∈ TS [49], an event sequence S is converted into a sequence S’:

S’ = <(E1, [vs1, ve1]), (E2, [vs2, ve2]), . . ., (En, [vsn, ven])>,

where vei ≤ vs(i+1) for i = 1, . . ., n − 1, and vsi, vei are positive numbers. The time interval of S’, [vs1, ven], is converted into [1, m], where m is a positive number.

Each event pair (x, y) belongs to a set of event pairs Ω = {(x, y) | x, y ∈ IE, x ≠ y}, with a binary time-interval relation R(x, y) between the two events x and y. A temporal interval relation is defined as R(x, y) = {P(x, y) | (x, y) ∈ Ω, P ∈ IO} [49], where the set of temporal interval operators is IO = {before, equals, meets, overlaps, during} and P(x, y) is a binary predicate expressing the temporal interval relationship P between x and y. The relations are defined as follows:

before(x, y) means that event x occurs prior to the event period of y: before(x, y) ⇔ x.ve < y.vs.

equals(x, y) denotes that x and y occur in the same period: equals(x, y) ⇔ (x.vs = y.vs) ∧ (x.ve = y.ve).

meets(x, y) means that y happens immediately after the event period of x: meets(x, y) ⇔ x.ve = y.vs.

overlaps(x, y) expresses that y starts before the end point of x: overlaps(x, y) ⇔ x.vs < y.vs ∧ x.ve > y.vs.


during(x, y) represents that x occurs during the event period of y: during(x, y) ⇔ x.vs > y.vs ∧ x.ve < y.ve.
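The five operators can be written directly as predicates. This is a sketch in which events are modelled as (vs, ve) tuples of start and end times:

```python
def before(x, y):   return x[1] < y[0]            # x ends before y starts
def equals(x, y):   return x[0] == y[0] and x[1] == y[1]
def meets(x, y):    return x[1] == y[0]            # y starts when x ends
def overlaps(x, y): return x[0] < y[0] and x[1] > y[0]
def during(x, y):   return x[0] > y[0] and x[1] < y[1]

symptom_b = (3, 4)   # e.g. observed from March to April (month numbers)
symptom_c = (4, 6)   # observed from April to June
print(meets(symptom_b, symptom_c))   # True: C starts exactly when B ends
```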

They proved the theorem following Allen's equations. However, for large data the effort to discover temporal interval relations is too high: for all (x, y) ∈ Ω, the total number of possible temporal interval relations in R(x, y) is n(n − 1)m, where m is constant, and the time complexity of Allen's algorithm is O(n²). Hence Allen's algorithm cannot be extended to large databases.

To solve this problem, Lee [49] presented a new algorithm for mining

temporal interval relation rules which is shown in Figure 2.18.

2.5.19 Allen’s Algorithm

Let IE be the event set
R(x, y) is the time interval relation
RS ← Φ
for each event x in IE
    for each event y in IE
        RS ← RS ∪ R(x, y)
Return RS

Figure 2.18 : Allen’s Algorithm [49]

Lee [49] proposed two sub-algorithms. The first is an event generalization algorithm designed for summarizing time-interval sequences; it reduces the size of the input database. The second is a time-interval relation rule discovery algorithm, which discovers time-interval relation rules from time-interval data that satisfy a given minimum support.


2.5.19.1 Generalization of temporal events-Formal Description [49]

Each transaction of a given database DB consists of a customer-id, a transaction time stamped with a time point, and a set of event types. A customer can issue several transactions; for example, a patient can periodically take a medical examination. Each medical examination is a transaction and can show multiple symptoms; the symptoms are the events in the transaction. No customer has more than one transaction with the same timestamp, and all events in a transaction share that timestamp.

2.5.19.2 Algorithm : Generalization of temporal

Input: The transactions of a given database
Output: Generalized events with time intervals

Begin
  Sort the transactions in the database DB by customer ID (Cid)
    and timestamp
  Calculate the frequent event types per customer ID
  Remove non-frequent event types from the transactions
  Calculate a set of event sequences per customer:
    SS(Cid) = {ES(Cid, Ei) | Ei ∈ ETS(Cid)},
    where ETS(Cid) contains only frequent event types
  Calculate the set of all event sequence sets S(Cust)
  Calculate the set of uniform event types and the set of
    sequences having a uniform event type
  Delete non-uniform event types from S(Cust)
  Generalize each event sequence in S(Cust) into a generalized
    event with a time interval
End

Figure 2.19 : Generalization of events [49]
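The generalization step above can be sketched as follows. This is an illustrative simplification that collapses all of a customer's time-point observations of one event type into a single interval; the month-number timestamps are hypothetical:

```python
from itertools import groupby

def generalize(observations):
    """observations: (event, timestamp) pairs for one customer.
    Returns one (event, (start, end)) interval per event type."""
    intervals = []
    for event, group in groupby(sorted(observations), key=lambda p: p[0]):
        times = [t for _, t in group]
        intervals.append((event, (min(times), max(times))))
    return intervals

# Symptom B seen in March (3) and April (4), symptom C once in May (5).
obs = [("B", 3), ("B", 4), ("C", 5)]
print(generalize(obs))   # [('B', (3, 4)), ('C', (5, 5))]
```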


2.5.19.3 Algorithm : Temporal interval relation rule discovery

Input: A database GD with generalized events and time intervals
Output: A set of time interval relation rules {TR1, TR2, . . ., TRn}

Begin
  Find the set of all candidate time interval relations,
    CR = ∪_{i=1}^{k} CR(Cid_i)
  Find the set of frequent time interval relations,
    FR = {Ri(x, y) | Supp(Ri(x, y)) / Ncust ≥ Suppmin, Ri(x, y) ∈ CR}
  Discover the time interval relation rules {TR1, TR2, . . ., TRn}
    from FR
End

Figure 2.20 : Temporal interval relation rule discovery [49]

Lee [49] proposed a new data mining technique to efficiently discover useful time-interval relation rules from time-interval data on the basis of Allen's interval operators. The technique is a combination of an event generalization algorithm and a time-interval relation rule discovery algorithm. The event generalization algorithm summarizes events recorded at time points and generalizes them into time-interval data. The time-interval relation rule discovery algorithm then generates time-interval relation rules by discovering frequent time-interval relations from the data produced by the event generalization algorithm.

This technique has significant advantages compared with existing methods. First, it discovers useful time-interval rules from time-interval data. Secondly, it enables the extraction of time-interval relation rules from a time-interval database. To prove the effectiveness of the technique proposed by Lee [49], they performed several experiments while scaling up the datasets. First, the execution time of the algorithm increases slowly as the number of records increases, so it has significant performance benefits in comparison to Allen's algorithm. Second, the time-interval relationship step and the event


generalization step require the greatest amount of time among the different steps of the algorithm proposed by Lee [49]. These algorithms employ the concept of time-interval sequences; nevertheless, our proposed technique remains effective compared to all the techniques discussed here.

The algorithm proposed by Dhany Saputra [1] uses the Seq-Tree framework and a separator table [1]. The separator database proposed by them stores the list of separator indices for each customer. Checking all items one by one from the original database is time-consuming; hence I-PrefixSpan [1] avoids doing so.

Chen [8] and his team proposed two efficient algorithms for mining time-interval sequential sequences. The first algorithm [8] is based on the conventional Apriori algorithm, while the second is based on the PrefixSpan algorithm. The second algorithm outperforms the first in computing time and scalability under various parameters.


Chapter 3

Motivation

Our literature survey and critique of various state-of-the-art methods motivated and directed us to propose a sequential sequence mining technique that overcomes their limitations and adds value to the state of the art.

Various sequential sequence mining techniques, critically evaluated and discussed in Chapter 2, have been applied to data. These techniques can find sequential sequences in the desired manner. We found that every technique tried to overcome the deficits of earlier techniques and to improve performance along different parameters. The relevant techniques, with their limitations and merits, are elaborated in Chapter 2.

By analyzing these techniques, we learned that very few have concentrated on memory usage and execution time when finding sequential sequences, and very few researchers have considered the time interval between events/items. It was also observed that most state-of-the-art methods use sequential sequence mining in various applications, yet they have less


focused on large databases. This directed us to focus on this issue, which was a great challenge for us. We therefore planned to propose a new sequential sequence mining technique: initially for small datasets, and then for large databases by proposing further algorithms. With much effort, we achieved notable improvement in our technique.

Our proposed technique leads all the state-of-the-art techniques, and we trust that our new approach will be useful to researchers in the area of sequential sequence mining. The proposed technique is theoretically discussed in Chapter 4 and empirically evaluated in Chapter 5, where the improved results are compared with other state-of-the-art methods.


Chapter 4

Scope of Work

Sequential sequence mining, which finds the frequent sequences in a sequential database, is a significant data mining problem with extensive applications, including the analysis of customers' purchase sequences or Web access sequences, the analysis of time-related processes such as scientific experiments, natural disasters and disease treatments, the analysis of DNA sequences, and so on. In the world of E-commerce, the purchasing behavior of customers can be extracted from log files, so web managers can actively send desired information to their customers. Thus customers not only experience the convenience of obtaining information quickly, but the likelihood that they purchase products from the company also increases. Manufacturers can analyze market demand, plan production schedules, and determine inventory levels so that they can react to market changes correctly and quickly.

The scope of our algorithm is to mine sequential sequences more efficiently, measured against the various parameters of the evaluation matrix. The details are discussed in Chapters 5 and 6.


Our algorithms improve performance and efficiency compared with various algorithms developed for sequential sequences, such as DynamicSome, GSP, AprioriSome, AprioriAll, SPAM, PrefixSpan [2] and I-PrefixSpan [8][1]. Our technique generates time-interval sequences using a sequence generator table. We have analyzed and compared various sequential mining techniques; our algorithm outperforms the other sequential sequence mining algorithms. Moreover, our algorithms have excellent scale-up properties.

Typical PrefixSpan [2] fails to provide sequences with the time-interval gap [8] between items; our algorithm produces the sequences while taking care of the time interval between them.

In typical I-PrefixSpan [1], a projection table is created during the generation of every sequence, so it requires more memory and time while generating sequences. The database is also kept in memory after use, which makes the algorithm less effective because of memory consumption. Our algorithm instead creates a sequence generator table from the original database, and the frequent sequences are generated from this table. It therefore requires less memory and time, and it is very efficient compared with the latest algorithms developed to date.


Chapter 5

Proposed Algorithms

5.1 Sequential Sequence Mining

A sequence is defined by an order of events; sometimes events occur in one particular order. Sequential sequence mining is used to find all the frequent sequences, i.e. those that occur in the maximum number of transactions. For example, a customer who purchases a laser printer may come back to buy another printer in two months and then a scanner in three months. Let us discuss sequential sequence mining in detail.

Let two sequences α = &lt;a1, a2, …, an&gt; and β = &lt;b1, b2, …, bm&gt; be given. α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 &lt; j2 &lt; … &lt; jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn; β is then a super sequence of α. For example, &lt;a(cd)f&gt; is a subsequence of &lt;a(bcd)(ef)ad&gt;.
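The containment test above can be sketched as a short routine. The following is an illustrative Python sketch of ours (not from the thesis), using greedy left-to-right matching of itemsets:

```python
def is_subsequence(alpha, beta):
    """Return True if sequence alpha is contained in sequence beta.

    Sequences are lists of itemsets; alpha is a subsequence of beta
    when each itemset of alpha is a subset of a distinct,
    order-preserving itemset of beta (greedy leftmost matching
    suffices for this containment test).
    """
    j = 0
    for a in alpha:
        while j < len(beta) and not set(a) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next itemset of alpha must match strictly later
    return True

# <a(cd)f> is a subsequence of <a(bcd)(ef)ad>
alpha = [{'a'}, {'c', 'd'}, {'f'}]
beta = [{'a'}, {'b', 'c', 'd'}, {'e', 'f'}, {'a'}, {'d'}]
```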

The length of a sequence is the number of itemsets in the sequence, and a sequence of length k is called a k-sequence. For example:

Candidate 1-subsequences:

&lt;i1&gt;, &lt;i2&gt;, &lt;i3&gt;, …, &lt;in&gt;


Candidate 2-subsequences:

<i1, i2>, <i1, i3>, …, <(i1 i1)>, <(i1 i2)>, …, <(in-1 in)>

Let I = {i1, i2, …, in} be a set of items for transaction data. We call a subset X ⊆ I an itemset, and |X| the size of X. A sequence K = (K1, K2, …, Km) is an ordered list of itemsets, where Ki ⊆ I for each i ∈ {1, …, m}. The size m of a sequence is the number of itemsets in the sequence, i.e. |K|. The length l of a sequence K = (K1, K2, …, Km) is defined as

l = Σ(i=1..m) |Ki|

For example, suppose K = (K1, K2, K3, K4), where K1 = {p}, K2 = {p, q}, K3 = {p, q, r} and K4 = {p, q, r, s}. Then

l = Σ(i=1..4) |Ki| = |K1| + |K2| + |K3| + |K4| = 1 + 2 + 3 + 4 = 10
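The length computation above amounts to summing itemset sizes; a minimal Python sketch of ours (not from the thesis):

```python
def sequence_length(K):
    """Length l of a sequence K = (K1, ..., Km): the total number
    of items, i.e. the sum of the itemset sizes |Ki|."""
    return sum(len(itemset) for itemset in K)

# The worked example: K1={p}, K2={p,q}, K3={p,q,r}, K4={p,q,r,s}
K = [{'p'}, {'p', 'q'}, {'p', 'q', 'r'}, {'p', 'q', 'r', 's'}]
```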

Now let us see the following transactional dataset.

SID   Sequences
1     &lt;a(bc)(ef)ad&gt;
2     &lt;bcd&gt;
3     &lt;adb&gt;

Table 5.1: Data set 1

Various transactions are shown in Data set 1. SID is the sequence ID of the customer, and each sequence represents the transactions made by the respective customer. Sequences are written in &lt;…&gt; brackets; for SID 1, the sequence is &lt;a(bc)(ef)ad&gt;. Note that a, b, c, d, e, f are the item codes. Items inside (…) brackets were purchased by the customer at the same time, i.e. in a single transaction; if the customer purchases a single item in a transaction, the (…) brackets are not required. For SID 1, we have 5 transactions: in the 1st transaction item 'a' is purchased; in the 2nd, items b and c are purchased at the same time; in the 3rd, items e and f are purchased together; in the 4th, item 'a' is purchased; and in the last, item 'd' is purchased. In sequence mining, "ab" and "ba" have different meanings.

5.1.1 Support

The absolute support of a sequence Kp in the sequence representation of a database D is defined as the number of sequences k ∈ D that contain Kp, and the relative support is defined as the percentage of sequences k ∈ D that contain Kp.

suppD(Kp) gives the support of Kp in the database, and minSup is the minimum support threshold. The sequence Kp is frequent if suppD(Kp) ≥ minSup. The problem of mining sequential sequences is to find all frequent sequential sequences for a database D and a given support threshold.
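As a hedged illustration (our sketch, not the thesis code), absolute and relative support can be computed over Data set 1 like this:

```python
def contains(seq, pattern):
    """Greedy check that `pattern` (a list of itemsets) is a
    subsequence of `seq`."""
    j = 0
    for p in pattern:
        while j < len(seq) and not set(p) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True

def supp(pattern, database):
    """Absolute and relative support of `pattern` in `database`."""
    absolute = sum(contains(s, pattern) for s in database)
    return absolute, absolute / len(database)

# Data set 1 (Table 5.1)
D = [
    [{'a'}, {'b', 'c'}, {'e', 'f'}, {'a'}, {'d'}],  # SID 1
    [{'b'}, {'c'}, {'d'}],                          # SID 2
    [{'a'}, {'d'}, {'b'}],                          # SID 3
]
```

Here <a, d> is contained in SIDs 1 and 3, so its relative support is 2/3.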

The support indicates the occurrence of sequences in the database. PrefixSpan [2] gives only the frequent sequences but does not give the time interval between successive items; our new method produces time-interval sequences between successive items. Data set 2, shown in Table 5.2, gives the detail including the time interval between two successive items. Here (a, 2) means that item 'a' occurs at time stamp 2.

SID   Sequences
1     &lt;(a,2)(bc,4)(ef,7)(a,8)(d,9)&gt;
2     &lt;(b,4)(c,6)(d,7)&gt;
3     &lt;(a,2)(d,3)(b,6)&gt;

Table 5.2: Data set 2


5.1.2 Super Sequence and Subsequence:

A sequence with length l is called an l-sequence. A sequence Kp = &lt;p1, p2, …, pn&gt; is contained in another sequence Kq = &lt;q1, q2, …, qm&gt; if there exist integers 1 ≤ i1 &lt; i2 &lt; … &lt; in ≤ m such that p1 ⊆ qi1, p2 ⊆ qi2, …, pn ⊆ qin.

For example, let Kq = &lt;q1, q2, q3&gt;, where q1 = {p1, p2, p3}, q2 = {p4} and q3 = {p5, p6}, so that Kq = &lt;{p1, p2, p3}, {p4}, {p5, p6}&gt;. Then the sequence Kp = &lt;{p1, p2, p3}&gt; is contained in Kq, since {p1, p2, p3} ⊆ q1.

If sequence Kp is contained in sequence Kq, then Kp is called a subsequence of Kq and Kq is called a super sequence of Kp. In the above example, Kp is a subsequence of Kq and Kq is a super sequence of Kp.

5.3 Formal Notations & New Equations

5.3.1 Customer: A customer is the sequence of transactions T1, T2, …, Tn in the database D such that C = &lt;(T1), (T2), …, (Tn)&gt;, where Ti &lt; Tj for i &lt; j. The customer ID, denoted CID, represents the identity of the customer.

5.3.2 Item: An event (item) I is defined as I = (E, t), where E is an item or event type and t ∈ T, with T the set of time stamps.

5.3.3 Transaction: A transaction is a set of items or events, T = (Cid, I, t), where Cid is a customer identifier, I is an item type and t is the time at which the event occurred.

5.3.4 SequenceID: The sequence of transactions T1, T2, …, Tn such that C = &lt;(T1), (T2), …, (Tn)&gt;, where Ti &lt; Tj for i &lt; j. All transactions of the same customer are denoted by the same SID value.


5.3.5 Equation for time interval

For sequential items P1 … Pn and Q1 … Qm, consider a sequence of the form

&lt;(P1 Q1 Q2 … Qm, t1), (P2 Q1 Q2 … Qm, t2), …, (Pn Q1 Q2 … Qm, tn)&gt;

The time-interval equation is

Iαβ = tβ − tα, where α and β index time stamps and β ≥ α    …Equation 5.1

The time interval given by Equation 5.1 applies to items occurring at different times.

5.3.6 Equation for same-time-interval items

In Equation 5.1, if α = β, then Iαβ = tβ − tα = 0 and the items are said to have occurred in the same time interval, i.e. in the same transaction:

Iαβ = tβ − tα = 0, where α = β    …Equation 5.2
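Equations 5.1 and 5.2 reduce to a time-stamp difference; a small sketch of ours makes the two cases explicit:

```python
def time_interval(t_alpha, t_beta):
    """Equation 5.1: I = t_beta - t_alpha for t_beta >= t_alpha.

    A result of 0 is the Equation 5.2 case: the two items occurred
    at the same time stamp, i.e. in the same transaction."""
    if t_beta < t_alpha:
        raise ValueError("t_beta must not precede t_alpha")
    return t_beta - t_alpha

# From Data set 2: (a, 2) followed by (bc, 4) gives an interval of 2
```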

5.3.7 Equation for support

The support of a sequence s is its occurrence in the database D relative to all sequences of the database:

Support = P(s) / P(S)    …Equation 5.3

where s = &lt;(P1 Q1 Q2 … Qm, t1), (P2 Q1 Q2 … Qm, t2), …, (Pn Q1 Q2 … Qm, tn)&gt; and S is the total number of SIDs.

For a sequence with a given SID, the sequence of items is represented by &lt;i1, i2, …, in&gt;, where ii = (I, ti), ii ∈ Ti and ti ≤ ti+1 for each i = 1, …, n − 1. The time interval between the first item i1 and the last item in is denoted [t1, tn].


5.4 Algorithms of MySSM

We have proposed a series of MySSM algorithms. The first, SYNTIM, generates synthetic data with different time intervals, transactions and items; it is given in Figure 5.1. Algorithm 2, called GCON, reads the "config.dat" file. Algorithm 3, FS & GSGT, finds the 0-sequences and generates the sequence generator table. Algorithm 4, GAS, generates all frequent sequences. Algorithm 5, CMEM, checks the memory. The 6th algorithm, OUTR, writes the sequences to "output.dat" and also generates the "analysis.dat" file. The 7th algorithm, MySSM, is the sequential sequence generation algorithm; it is the main algorithm, which invokes all the others. These algorithms are shown in Figures 5.1 to 5.7.

5.4.1 Algorithm 1 : SYNTIM

Algorithm SYNTIM
Input:  Number of Customers, Number of Items
Output: dataset.dat, datasetdetail.dat
Begin
    Open dataset.dat file for writing
    for i ← 0 to Last customer do
        for j ← 0 to No of Transactions do
            time ← random value
            item ← random value
        end for
    end for
    Close dataset.dat file
    Open datasetdetail.dat file for writing
    Average items per transaction ←
        Total no of items / No. of transactions
    Average number of transactions per customer ←
        Total number of transactions / Total no of customers
    Close datasetdetail.dat file
End

Figure 5.1 : Algorithm SYNTIM

The SYNTIM algorithm generates the customers' transactions with various time intervals and the items purchased. It generates the items based on the number of transactions and the number of items available. This detail is stored in the "dataset.dat" file, which is later used by the MySSM algorithm for finding sequential sequences. SYNTIM also computes the average items per transaction and the average transactions per customer, which are stored in the "datasetdetail.dat" file.
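A minimal Python analogue of SYNTIM can be sketched as follows. It is written under our own assumptions about the dataset layout (the pseudocode does not fully fix the file format), and it returns the generated rows instead of writing dataset.dat, to stay self-contained:

```python
import random

def syntim(num_customers, num_transactions, num_items, max_time=50, seed=7):
    """Generate one row per customer: the customer ID followed by
    (time, item) pairs with non-decreasing random time stamps."""
    random.seed(seed)
    rows = []
    for cid in range(1, num_customers + 1):
        times = sorted(random.randint(1, max_time)
                       for _ in range(num_transactions))
        row = [cid]
        for t in times:
            row += [t, random.randint(1, num_items)]
        rows.append(row)
    return rows
```

Each row mirrors the "SID time item …" lines of the dataset excerpt shown in Chapter 6.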

5.4.2 Algorithm 2: GCON

Algorithm GCON
Input:  config.dat
Output: Time interval, range, items, support
Begin
    Initialize line, data
    Initialize interval, range, item, customer, minsup
    Open config.dat file for reading
    for line ← 1 to end of data do
        if (line == 1) then interval ← data
        else if (line == 2) then range ← data
        else if (line == 3) then item ← data
        else if (line == 4) then customer ← data
        else if (line == 5) then minsup ← data
        end if
    end for
    Close file
End

Figure 5.2 : Algorithm GCON


The GCON algorithm reads all the data from the "config.dat" file. It first reads the interval of the time unit, the range of the time interval, the items to be purchased, the number of customers and the minimum support. These values are used by the MySSM algorithm.

5.4.3 Algorithm 3: FS & GSGT

Algorithm FS & GSGT
Input:  dataset.dat
Output: sequence generator table
Begin
    Initialize datanum, indexno, i, item, time, count
    Open dataset.dat
    Repeat until end of file encountered
        Read time index and item index
        Initialize counter, indexno
        Repeat until length of customer sequence
            Store the item index and the time where the sequence occurs
            Generate sequence generator table
            Store using array index and time interval for each SID
            Read item occurred in all SIDs
            Increment the counter for the particular item
            If the counter value is more than the minimum support then
                add this item to the large item list
            else ignore it
        end repeat
    end repeat
    Close file
End

Figure 5.3 : Algorithm FS & GSGT

The FS & GSGT algorithm reads the "dataset.dat" file and generates the sequence generator table, which stores the item-index and time values. Using the sequence generator table, the algorithm finds the sequential sequences that occur frequently.


5.4.4 Algorithm 4: GAS

Algorithm GAS
Input:  sequence generator table
Output: frequent sequential sequences
Begin
    Declare the variables
    Scan the sequence generator table
    Repeat until end of file encountered
        Scan the sequence generator table by sequence ID
        Scan the sequence generator table by item ID
        Measure the repeated sequences with time ID
        If occurrence >= minimum support then
            keep it
        else ignore it
        Check other combinations
        If found then keep it
        else ignore it
    end repeat
End

Figure 5.4 : Algorithm GAS

The GAS algorithm scans the sequence generator table using the sequence ID, item ID and time ID. It generates all frequent sequences occurring in the database whose support count is more than the minimum support.
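The pruning step of GAS can be illustrated with a simplified count of ordered item pairs. This is our sketch, not the thesis code: it ignores same-transaction pairs and time IDs for brevity, and keeps only pairs whose per-sequence support count meets the threshold:

```python
def frequent_pairs(database, min_sup):
    """Count, per sequence, whether the ordered pair <x, y> occurs,
    and keep pairs whose support count meets min_sup."""
    counts = {}
    for seq in database:
        items = [i for itemset in seq for i in itemset]  # flatten, order kept
        seen = set()
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                seen.add((items[a], items[b]))
        for pair in seen:                 # count each pair once per sequence
            counts[pair] = counts.get(pair, 0) + 1
    return {p for p, c in counts.items() if c >= min_sup}

D = [[{'a'}, {'b'}, {'c'}],
     [{'a'}, {'c'}],
     [{'b'}, {'a'}, {'c'}]]
```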

5.4.5 Algorithm 5: CMEM

Algorithm CMEM
Input:  dataset.dat, config.dat
Output: maximum memory used
Begin
    Initialize maxMemory ← 0
    Get total memory during runtime
    Get total free memory during runtime
    currentMemory ← Total Memory - Free Memory
    If currentMemory >= maxMemory then
        maxMemory ← currentMemory
    Return maxMemory in MB
End

Figure 5.5 : Algorithm CMEM

The CMEM algorithm finds the maximum memory used during run time. First it obtains the total memory during execution; the memory used is then the difference between the total memory and the free memory.
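The thesis measures JVM memory (total minus free). The closest Python stdlib analogue, offered here only as an illustrative stand-in, tracks the peak allocation with tracemalloc:

```python
import tracemalloc

def peak_memory_mb(work):
    """Run `work` and return the peak memory allocated during the
    call, in MB (CMEM's role: remember the maximum usage seen)."""
    tracemalloc.start()
    try:
        work()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)
```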

5.4.6 Algorithm 6: OUTR

Algorithm OUTR
Input:  Sequences generated by GAS
Output: output.dat, analysis.dat
Begin
    Open the output.dat file for writing
    Write minimum support
    Do while sequences exist
        Write 0-sequences
        Write all desired sequences generated by GAS algorithm
    End do
    Close file
    Open analysis.dat file for writing
    Write number of time intervals, gap between time intervals,
        minimum support
    Write summary of all sequences generated by GAS algorithm
    Write total number of sequences generated
    Write execution time in milliseconds & maxMemory in MB
    Close file
End

Figure 5.6 : Algorithm OUTR


The OUTR algorithm uses the sequences generated by the GAS algorithm. It writes the minimum support, the 0-sequences and all frequent sequences generated by GAS into the "output.dat" file. OUTR also reports the status of the execution process: it creates the "analysis.dat" file, in which it writes a summary of the run, such as the number of time intervals, the gap between time intervals, the minimum support, the total number of sequences generated, the execution time in milliseconds and the maximum memory in MB. This algorithm is essential for the empirical analysis of our proposed algorithms.

5.4.7 Algorithm 7: MySSM

Algorithm MySSM
Input:  dataset.dat, config.dat
Output: sequential sequences, execution time, memory used
Begin
    Initialize time, range, item, support
    Initialize t1, t2, maxMemory
    Open dataset.dat and config.dat files
    Initialize customer's sequence, counter
    Initialize arraylist for finding index and time
    Call procedure GCON()
        Read the parameters from config.dat
    t1 ← System.currentTimeMillis()
    Call procedure FS&GSGT()
        Generate all sequences onwards sequence-0
        Generate sequence generator table
        Return large sequences
    Call procedure CMEM()
        Return memory used
    Call procedure OUTR()
        Return time interval, gap, min support, sequences
    t2 ← System.currentTimeMillis() - t1
    Return sequential sequences
    Close files
End

Figure 5.7 : Algorithm MySSM


The MySSM algorithm reads the data from the config.dat and dataset.dat files. It generates the large sequential sequences whose support count is greater than the minimum support, and it finds the time and memory used during execution. MySSM is the main algorithm; it executes all the other algorithms we propose. The running-time complexity of the MySSM algorithm is O(log n), which indicates the improved performance of our algorithms compared with other algorithms available at present.

Let us take one dataset and find the time-interval sequential sequences.

Sequence ID Sequence

1 <(p,2),(r,4),(p,5),(q,5),(p,7),(t,7),(r,11)>

2 <(s,4),(p,6),(q,6),(t,6),(s,8),(t,8),(r,13),(s,13)>

3 <(p,9),(q,9),(t,12),(s,14),(q,17),(r,17),(t,21)>

4 <(q,14),(f,16),(t,17),(q,21),(t,21)>

Table 5.3: Sequence Generator Table

First we transform the dataset so that items sharing the same time stamp are grouped together; the result is shown in Table 5.4.

Sequence ID Sequence

1 <(p,2),(r,4),(p,q,5),(p,t,7),(r,11)>

2 <(s,4),(p,q,t,6),(s,t,8),(r,s,13)>

3 &lt;(p,q,9),(t,12),(s,14),(q,r,17),(t,21)&gt;

4 <(q,14),(f,16),(t,17),(q,t,21)>

Table 5.4: Sequence Generator Table with Time stamp
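The transformation from Table 5.3 to Table 5.4 is a group-by on equal time stamps; a sketch of ours:

```python
from itertools import groupby

def merge_by_timestamp(seq):
    """Merge consecutive (item, time) events that share a time stamp
    into a single transaction, as in Table 5.4."""
    merged = []
    for t, group in groupby(seq, key=lambda event: event[1]):
        merged.append((tuple(e[0] for e in group), t))
    return merged

# SID 1 of Table 5.3
sid1 = [('p', 2), ('r', 4), ('p', 5), ('q', 5), ('p', 7), ('t', 7), ('r', 11)]
```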


The items in the same '( )' bracket have the same time stamp. The sequence generator table is then scanned. Suppose the minimum support is 50% and the number of time intervals is 4, with the gaps defined as follows:

I0 : t = 0
I1 : 0 &lt; t ≤ 5
I2 : 5 &lt; t ≤ 10
I3 : 10 &lt; t ≤ ∞

The 1st step of this algorithm is the same as in typical PrefixSpan. In the first scan of the dataset, we find the frequent items, which are called 1-sequences. In this example, &lt;p&gt;, &lt;q&gt;, &lt;r&gt;, &lt;s&gt; and &lt;t&gt; are the frequent items that satisfy the minimum support threshold. During this step, the algorithm generates the sequence generator table shown in Table 5.5, which is used to find the time-interval sequential sequences.

SID   &lt;p&gt;                  &lt;q&gt;              &lt;r&gt;             &lt;s&gt;                    &lt;t&gt;
1     (1,2),(5,5),(8,7)    (6,5)            (3,4),(11,11)   Ø                      (9,7)
2     (3,6)                (4,6)            (10,13)         (1,4),(7,8),(11,13)    (5,6),(8,8)
3     (1,9)                (2,9),(8,17)     (9,17)          (6,14)                 (4,12),(11,21)
4     Ø                    (1,14),(7,21)    Ø               Ø                      (5,17),(8,21)

Table 5.5: Sequence generator Table

The sequence generator table looks like the pseudo-projection table used in typical PrefixSpan [2], but here we also include the time along with the item index. The 1st column of the sequence generator table indicates the sequence ID and the 1st row shows the frequent items. The table is extended as more and more sequences are found. For sequence ID 1, &lt;p&gt; generates 3 pairs, (1, 2), (5, 5), (8, 7), which indicates that item &lt;p&gt; occurs 3 times in this sequence (in different transactions). In (1, 2), 1 represents the index of &lt;p&gt; and 2 indicates the time when that &lt;p&gt; occurs; the same notation applies to all other cells. The symbol 'Ø' indicates that the item does not occur in the sequence. To see how the sequence generator table is built, consider the 1st sequence, &lt;(p,2),(r,4),(p,q,5),(p,t,7),(r,11)&gt;: we scan the sequence and record, for each frequent item, the index at which it occurs together with its time stamp. The algorithm then generates the sequential sequences using both tables.

Sequences can be generated in two ways: &lt;p, q&gt; or &lt;(p, q)&gt;. The first indicates that 'p' and 'q' occur in different transactions; the second indicates that 'p' and 'q' occur in the same transaction. Suppose we look for the &lt;p, q&gt; sequence. First we find the indexes of 'p'; there are 3 in the first sequence, (1, 2), (5, 5) and (8, 7), so the 1st index is 1. We then look for an index of 'q' that is greater than the index of 'p'; here the index of 'q' is 6, which is greater. This is also how we decide which form occurs: if the index of 'q' is greater than the index of 'p', we get the sequence &lt;p, q&gt;; otherwise we get the sequence &lt;(p, q)&gt;.

As per our example, the index of 'p' is 1 and the index of 'q' is 6. After scanning the sequence generator table, since the index of 'q' is greater than the index of 'p', we find that &lt;p, q&gt; occurs across different transactions. To find the time interval between event 'p' and event 'q', the algorithm takes the difference of the time stamps: the time stamp of 'p' is 2 and that of 'q' is 5, so the interval is 5 − 2 = 3. The value 3 falls in the range of I1, so the algorithm finds the sequence &lt;p, I1, q&gt;. If the algorithm cannot find any index that is greater than the index of 'p' and less than the index of 'q', it assumes that both events occur in the same transaction, giving the sequence &lt;p, I0, q&gt;.
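The index-and-time-stamp reasoning above can be sketched as follows (our illustration; the interval boundaries follow the I0–I3 ranges of this example):

```python
def interval_label(gap):
    """Map a time gap to the interval labels of the worked example."""
    if gap == 0:
        return 'I0'
    if gap <= 5:
        return 'I1'
    if gap <= 10:
        return 'I2'
    return 'I3'

def pattern_with_interval(p_pairs, q_pairs):
    """Scan the (index, time) pairs of the sequence generator table
    and label the first occurrence of 'p' followed later by 'q'."""
    for p_idx, p_time in p_pairs:
        for q_idx, q_time in q_pairs:
            if q_idx > p_idx:
                return interval_label(q_time - p_time)
    return None

# SID 1 of Table 5.5: p -> (1,2),(5,5),(8,7);  q -> (6,5)
```

For SID 1 this yields the interval label I1, matching the &lt;p, I1, q&gt; sequence derived above.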


Interval   p   q   r   s   t
I0         0   3   0   0   2
I1         2   1   1   2   3
I2         0   1   4   1   0
I3         0   0   0   0   1

Table 5.6: Table of time interval sequences for 'p'

Table 5.6 shows the time-interval sequences for 'p'. In the table, the 1st column gives the time interval and the 1st row the frequent items. Each cell holds the count of the particular sequence; the counter is incremented on each occurrence of the corresponding sequence beginning with 'p', and the same applies for the other items. Suppose we have &lt;p I0 q&gt;: the index of this sequence is added to the sequence generator table, which helps in finding the 3-sequences. Here we obtained (p I0 q), (p I0 t), (p I1 t) and (p I2 r), and in this way the algorithm finds the various frequent sequences.


Chapter 6

Empirical Analysis & Comparative Results

To evaluate the performance of the algorithms over a large range of data characteristics, we generated a synthetic data set of customers' transactions; this is the basic step of evaluating the algorithms. We built a synthetic dataset generator similar to the IBM synthetic dataset generator. We tested on a large database, meaning 100 items and the transactions of 50,000 customers or more. The generator produces everything from 0-sequences up to the longest frequent sequences possible under the minimum support. All experiments ran with 2048 MB of RAM allocated to the Java virtual machine on an Intel i5 processor with 8 GB DDR3 RAM and a 500 GB HDD, and we compared the results with the state-of-the-art methods.

A few lines of the large dataset are given below.

Data Set

1 9 95 99 161 9 277 9 324 9 337 9 363 9 399 11 101
11 280 19 60 27 99 27 209 27 236 27 318 27 358 27 393
2 8 14 8 33 8 215 8 285 8 300 8 317 8 345 8
3 3 41 3 68 3 72 3 154 3 352 3 384 12 7 12 27 12 115
12 220 17 160
4 8 26 19 91 19 333 26 5 26 15 26

Here all the data are generated randomly, and the algorithm works well on the synthetic large database; the comparative results are discussed in this section. The 1st number indicates the sequence ID, followed by the time and item codes. In "2 8 14 8 33", 2 indicates the customer ID, 8 indicates the time, and 14 and 33 indicate the item codes.

We generated the synthetic dataset and tested the scalability of MySSM in both runtime and memory usage using different parameters of the evaluation matrix, such as support, items per transaction and transactions per customer. MySSM shows linear scalability in both runtime and memory usage. We compared our results with I-PrefixSpan [1][8]; the scale-up properties with respect to these parameters are shown in Figures 6.1 to 6.11. The empirical analysis shows that our algorithm MySSM performs better than I-PrefixSpan.

Figure 6.1 shows the empirical analysis of the number of customers v/s time in milliseconds, with 3 time intervals, a time-interval gap of 8, a support value of 0.4000, 10 different items, 11 transactions per customer and 3 items per transaction, for 10,000 to 1,00,000 customers. As the number of customers increases, the time increases for both algorithms.


Figure 6.1 : Number of Customers v/s Time(Milliseconds) for support =0.4

The same parameters as in Figure 6.1 are used to test memory. In both cases, the memory usage during run time increases with the number of customers, as shown in Figure 6.2.

Figure 6.2 : No of Customers v/s Memory(MB) for support =0.4

The experimental results are expanded in Figure 6.3 with a support value of 0.0200, 3 time intervals, a time-interval gap of 8, 100 different items, 11 transactions per customer and 3 items per transaction, for 10,000 to 1,00,000 customers. The runtime increases as the number of customers increases; there is a sudden rise between 50,000 and 1,00,000 customers because of the larger gap between the customer counts.

Figure 6.3 : Number of Customers v/s Time(Milliseconds) for support=0.02

Figure 6.4 : Number of Customers v/s Memory(MB) for support=0.02


The memory analysis, with the same parameters as in Figure 6.3, is shown in Figure 6.4. The storage space required increases with the number of customers.

Figure 6.5 shows the analysis graph of the number of customers v/s time in milliseconds with 3 time intervals, a time-interval gap of 8, a support value of 0.3, 10 different items, 11 transactions per customer and 3 items per transaction, for 500 to 1,20,000 customers.

Figure 6.5 : Number of Customers v/s Time(Milliseconds) for support=0.3

Figure 6.6 shows the analysis graph of the number of customers v/s memory in MB with the same parameters: 3 time intervals, a time-interval gap of 8, a support value of 0.3, 10 different items, 11 transactions per customer and 3 items per transaction, for 500 to 1,20,000 customers. The graph increases linearly as the number of customers increases.


Figure 6.6 : Number of Customers v/s Memory(MB) for support=0.3

The time and memory analysis for a support value of 0.0008 is shown in Figures 6.7 and 6.8 respectively, with 3 time intervals, a time-interval gap of 5, 100 different items, 11 transactions per customer and 3 items per transaction, for 1,000 to 1,00,000 customers.


Figure 6.7 : Number of Customers v/s Time(Milliseconds)

In both cases, as shown in Figures 6.7 and 6.8, the time and memory scale up linearly with the increase in the number of customers.

Figure 6.8 : Number of Customers v/s Memory(MB)


Figure 6.9 shows the analysis graph of support v/s time in milliseconds for 10,000 customers, 100 different items, 11 transactions per customer and 3 items per transaction, for support values ranging from 0.03 to 0.0008.

Figure 6.9 : Support v/s Time in Milliseconds

The graphs are expanded in Figure 6.10 and Figure 6.11 for the various support values. Figure 6.10 shows support v/s memory in MB for 10,000 customers, while Figure 6.11 shows support v/s time for 50,000 customers, with 100 different items, 11 transactions per customer, and 3 items per transaction, for support values ranging from 0.03 to 0.0008.

As the support decreases, the time and memory increase, because a lower support threshold produces a larger number of frequent sequences.
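This effect of the support threshold can be illustrated with a minimal sketch (the toy database and the function `frequent_sequences` below are our own illustration, not the MySSM implementation): an item qualifies as frequent when it appears in at least min_support times the number of customer sequences, so a lower threshold admits more sequences.

```python
from collections import defaultdict

def frequent_sequences(db, min_support):
    """Return the frequent 1-sequences (items) in a sequence database.

    `db` is a list of customer sequences; an item is frequent when it
    occurs in at least min_support * len(db) customer sequences.
    """
    counts = defaultdict(int)
    for sequence in db:
        for item in set(sequence):  # count each item once per customer
            counts[item] += 1
    threshold = min_support * len(db)
    return {item for item, c in counts.items() if c >= threshold}

# Toy database of 4 customer sequences.
db = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]
print(len(frequent_sequences(db, 0.75)))  # only "a" and "c" qualify -> 2
print(len(frequent_sequences(db, 0.5)))   # "a", "b", "c" qualify -> 3
```

Lowering the support from 0.75 to 0.5 grows the result set, which is why both running time and memory rise at the lower support values in Figures 6.10 and 6.11.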


Figure 6.10 : Support v/s Memory in MB

Figure 6.11 : Support v/s Time in Milliseconds

The empirical analysis shown in Figure 6.1 to Figure 6.11, over the various evaluation parameters, shows that MySSM outperforms I-PrefixSpan: MySSM takes less time and uses less memory during execution.


Our test results for 30,000 customers, with 3 time intervals and a time-interval range of 8, show that as the number of different items decreases, the total number of sequences also decreases, as can be seen in Figure 6.12.

Figure 6.12 : No of different items v/s Total sequences

It is also observed, for 30,000 customers with 3 time intervals and a time-interval range of 8, that as the number of different items decreases, the number of different independent sequences increases, as can be seen in Figures 6.13, 6.14 and 6.15.

Figure 6.13 : No of different sequences for number of different items=100


Figure 6.14 : No of different sequences for number of different items=10

Figure 6.15 : No of different sequences for number of different items=6


Chapter 7.0

Conclusion & Future Scope

The MySSM algorithm generates sequential sequences in a very efficient way. From our observations and experiments we conclude that MySSM provides better performance than the earlier algorithms proposed for sequential sequences. Our empirical analysis and test results show that MySSM outperforms the state-of-the-art methods because of its sequence generator table, which saves time and decreases memory usage during execution.

Further, the performance of the algorithm may be improved in future, as an extension of this work, by using a multi-threaded architecture with parallel execution of threads. This may make it more efficient and effective in terms of both time and memory.
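One way such a multi-threaded extension could work is sketched below, assuming the customer database can be split into partitions whose partial support counts are merged at the end. This is a hypothetical illustration of the idea: the function names (`count_partition`, `parallel_support_counts`) and the partitioning scheme are our own, not part of MySSM.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    """Count, per item, how many customer sequences in this partition contain it."""
    counts = Counter()
    for sequence in partition:
        counts.update(set(sequence))  # one vote per customer sequence
    return counts

def parallel_support_counts(db, workers=4):
    """Split the customer database into chunks, count each chunk in a
    separate thread, and merge the partial counts into global supports."""
    chunk = max(1, len(db) // workers)
    parts = [db[i:i + chunk] for i in range(0, len(db), chunk)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_partition, parts):
            total.update(partial)
    return total

# Toy database of 4 customer sequences.
db = [["a", "b"], ["a", "c"], ["b", "c"], ["a"]]
print(parallel_support_counts(db))  # item -> number of customer sequences containing it
```

Because each partition is independent, the counting phase can proceed in parallel; only the final merge is sequential, which is what could make the approach attractive for very large customer databases.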


Bibliography

[1]. D. Saputra, D. R. A. Rambli, and O. M. Foong, "Mining Sequential Patterns Using I-PrefixSpan", World Academy of Science, Engineering and Technology, Dec. 2008.

[2]. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, Pages 1424-1440, 2004.

[3]. J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, "Sequential Pattern Mining Using a Bitmap Representation", Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD '02), Pages 429-435, July 2002.

[4]. M. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences", Machine Learning, Vol. 40, Pages 31-60, 2001.

[5]. J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", In Proc. 2000 Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), Pages 355-359, Aug. 2000.

[6]. R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", Proc. Fifth Int'l Conf. Extending Database Technology (EDBT '96), Pages 3-17, Mar. 1996.

[7]. R. Agrawal and R. Srikant, "Mining Sequential Patterns", Proc. 1995 Int'l Conf. Data Eng. (ICDE '95), Pages 3-14, Mar. 1995.


[8]. Y. L. Chen, M. C. Chiang, and M. T. Ko, "Discovering Time-Interval Sequential Patterns in Sequence Databases", Expert Systems with Applications, Vol. 25, No. 3, Pages 343-354, 2003.

[9]. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. Int'l Conf. Data Eng. (ICDE '01), Pages 215-224, 2001.

[10]. R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), Pages 487-499, 1994.

[11]. C. C. Yu and Y.-L. Chen, "Mining Sequential Patterns from Multi-Dimensional Sequence Data", IEEE Transactions on Knowledge and Data Engineering, 17(1), Pages 136-140, 2005.

[12]. A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases", In Proc. Int'l Conf. Very Large Data Bases (VLDB), Pages 432-443, Sept. 1995.

[13]. H. Toivonen, "Sampling Large Databases for Association Rules", In Proc. Int'l Conf. Very Large Data Bases (VLDB), Pages 134-145, 1996.

[14]. J. S. Park, M. S. Chen, and P. S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules", In Proc. ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD), San Jose, CA, Pages 175-186, 1995.

[15]. S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Analysis", In Proc. ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD), Pages 255-264, 1997.


[16]. D. Sun, S. Teng, W. Zhang, and H. Zhu, "An Algorithm to Improve the Effectiveness of Apriori", In Proc. 6th IEEE Int'l Conf. on Cognitive Informatics (ICCI '07), 2007.

[17]. H. Mannila and H. Toivonen, "Discovering Generalized Episodes Using Minimal Occurrences", In Proc. of ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), Pages 146-151, 1996.

[18]. R. Bayardo, R. Agrawal, and D. Gunopulos, "Constraint-Based Rule Mining in Large, Dense Databases", In Proc. of IEEE Int'l Conf. on Data Engineering (ICDE), Pages 188-197, 1999.

[19]. M. Leleu, C. Rigotti, J. Boulicaut, and G. Euvrard, "GO-SPADE: Mining Sequential Patterns over Databases with Consecutive Repetitions", In Proc. of Int'l Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), Pages 293-306, 2003.

[20]. M. Garofalakis, R. Rastogi, and K. Shim, "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints", In Proc. of Int'l Conf. on Very Large Databases (VLDB), Pages 223-234, 1999.

[21]. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", In Proc. of IEEE Int'l Conf. on Data Engineering (ICDE), Pages 215-224, 2001.

[22]. F. Moerchen, "Temporal Pattern Mining for Time Points, Time Intervals, and Semi-Intervals", Siemens Corporate Research, January 2011.

[23]. J. Wang and J. Han, "BIDE: Efficient Mining of Frequent Closed Sequences", In Proc. of IEEE Int'l Conf. on Data Engineering (ICDE), Pages 79-90, 2004.


[24]. J. L. Lin, "Mining Maximal Frequent Intervals", In Proc. of Annual ACM Symposium on Applied Computing (SAC), Pages 624-629, 2002.

[25]. R. Villafane, K. A. Hua, D. Tran, and B. Maulik, "Knowledge Discovery from Series of Interval Events", Intelligent Information Systems, 15(1), Pages 71-89, 2000.

[26]. C.-Y. Tsai and Y.-C. Shieh, "A Change Detection Method for Sequential Patterns", Decision Support Systems, Vol. 46, Pages 501-511, Elsevier B.V., 2009.

[27]. Mirko B., Martin S., Detlef N., and Rudolf K., "Mining Changing Customer Segments in Dynamic Markets", Expert Systems with Applications, Vol. 36, Pages 155-164, 2009.

[28]. W. Li, M. Xu, and X. J. Zhou, "Unraveling Complex Temporal Associations in Cellular Systems across Multiple Time-Series Microarray Datasets", Journal of BI, Vol. 43, Pages 550-559, 2010.

[29]. A. Apostolico, M. E. Bock, S. Lonardi, and X. Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, 7(1-2), Pages 71-94, 2000.

[30]. J. Wang and J. Han, "BIDE: Efficient Mining of Frequent Closed Sequences", In Proc. of the 20th Int'l Conf. on Data Engineering (ICDE '04), Pages 79-90, IEEE Press, 2004.

[31]. S. Laxman, P. S. Sastry, and K. P. Unnikrishnan, "A Fast Algorithm for Finding Frequent Episodes in Event Streams", In Proc. of the 13th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD '07), Pages 410-419, 2007.


[32]. J. Pei, H. Wang, J. Liu, K. Wang, J. Wang, and P. S. Yu, "Discovering Frequent Closed Partial Orders from Strings", IEEE Transactions on Knowledge and Data Engineering, 18(11), Pages 1467-1481, 2006.

[33]. Juyoung Kang and Hwan-Seung, "Mining Spatio-Temporal Patterns in Trajectory Data", Journal of Information Processing Systems, Vol. 6, No. 4, 2010.

[34]. Y. Huang, L. Zhang, and P. Zhang, "A Framework for Mining Sequential Patterns from Spatio-Temporal Event Data Sets", IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 4, 2008.

[35]. D. Fricker, H. Zhang, and C. Yu, "Sequential Pattern Mining of Multimodal Data Streams in Dyadic Interactions", ICDL, 978-1-61284-990-4/11, IEEE, 2011.

[36]. E. H.-C. Lu, V. S. Tseng, and P. S. Yu, "Mining Cluster-Based Temporal Mobile Sequential Patterns in Location-Based Service Environments", IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 6, 2011.

[37]. H. Mannila, H. Toivonen, and A. Verkamo, "Improved Methods for Finding Association Rules", In Proc. AAAI Workshop on Knowledge Discovery, 1994.

[38]. C. Antunes and A. L. Oliveira, "Sequential Pattern Mining Algorithms: Trade-offs between Speed and Memory", In 2nd Workshop on Mining Graphs, Trees and Sequences, 2004.

[39]. J. Pei, J. Han, and W. Wang, "Mining Sequential Patterns with Constraints in Large Databases", Proc. of the Eleventh Int'l Conf. on Information and Knowledge Management, McLean, Virginia, USA, 2002.

[40]. S. Parthasarathy, et al., "Incremental and Interactive Sequence Mining", Proc. of the Eighth Int'l Conf. on Information and Knowledge Management, 1999.


[41]. M. Zhang, et al., "Efficient Algorithms for Incremental Update of Frequent Sequences", Proc. of Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2002), Taipei, Taiwan, 2002.

[42]. H. Mannila and H. Toivonen, "On an Algorithm for Finding All Interesting Sentences", In 13th European Meeting on Cybernetics and Systems Research, 1996.

[43]. R. Agrawal and J. Shafer, "Parallel Mining of Association Rules", IEEE Trans. on Knowledge and Data Engineering, 1996.

[44]. R. Agrawal and R. Srikant, "Mining Sequential Patterns", In Proc. 11th Int'l Conf. on Data Engineering (ICDE), 1995.

[45]. J. Yang, W. Wang, and P. S. Yu, "Mining Asynchronous Periodic Patterns in Time Series Data", IEEE Transactions on Knowledge and Data Engineering, 15(3), Pages 613-628, 2003.

[46]. J. Han, W. Gong, and Y. Yin, "Mining Segment-Wise Periodic Patterns in Time-Related Databases", Proc. Int'l Conf. on Knowledge Discovery and Data Mining, 1998.

[47]. S. Ma, et al., "Mining Partially Periodic Event Patterns with Unknown Periods", In Proc. 17th Int'l Conf. on Data Engineering (ICDE), 2001.

[48]. X. Yan, J. Han, and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets", Proc. of the Int'l Conf. SIAM Data Mining, 2003.

[49]. Y. J. Lee, J. W. Lee, D. J. Chai, B. H. Hwang, and K. H. Ryu, "Mining Temporal Interval Relational Rules from Temporal Data", The Journal of Systems and Software, Vol. 82, Pages 155-167, 2009.


[50]. Y. L. Chen and T. C. K. Huang, "Discovering Fuzzy Time-Interval Sequential Patterns in Sequence Databases", IEEE Transactions on Systems, Man and Cybernetics, Part B, 35(5), Pages 959-972, 2005.

[51]. H. Mannila, H. Toivonen, and A. I. Verkamo, "Discovery of Frequent Episodes in Event Sequences", Data Mining and Knowledge Discovery, 1(3), Pages 259-289, 1997.

[52]. H. Pinto, et al., "Multi-Dimensional Sequential Pattern Mining", Proc. of the 10th Int'l Conf. on Information and Knowledge Management, 2001.


Own Publication List

Publications related to my research work

[International Journals/Conferences]

[1]. Kiran Amin and Dr. J. S. Shah, "Sequential Sequence Mining Technique in Mammographic Information Analysis Database", International Journal of Emerging Technology and Advanced Engineering, ISSN 2250-2459, Vol. 2, Issue 5, May 2012.

[2]. Kiran Amin and Dr. J. S. Shah, "Improved Technique in Sequential Sequence Mining in Large Database of Transaction", International Journal of Engineering Research and Technology, ISSN 2278-0181, Vol. 1, Issue 4, June 2012.

[3]. Kiran Amin and Dr. J. S. Shah, "Gradual Evolution of Sequential Sequence Mining for Customer Relation Database", International Journal on Computer Science and Engineering, ISSN 2229-5631, Vol. 4, Issue 7, July 2012.

[4]. Kiran Amin and Dr. J. S. Shah, "Sequential Sequence Mining Technique in Large Information Analysis Database", 6th Int'l Conf. on Next Generation Web Services Practices (NWeSP 2010), November 2010, Gwalior, India, available on IEEE Xplore.

[5]. Kiran Amin and Dr. J. S. Shah, "Sequential Sequence Mining Technique in Large Database of Gene Sequence", Int'l Conf. on Computational Intelligence and Communication Networks (CICN 2010), November 2010, Bhopal, India, available on IEEE Xplore.


My other research publications

[International Journals/Conferences]

[1]. Kiran Amin, "Web Search Result Rank Optimization Using Search Engine Query Log Mining", Int'l Conf. on Recent Advances in Engineering and Technology, ISBN 978-81-923541-0-2, April 2012.

[2]. Kiran Amin, "Survey on Web Log Data in Terms of Web Usage Mining", International Journal of Engineering Research and Applications, ISSN 2248-9622.

[3]. Kiran Amin, "Attribute Based Routing for Query Processing to Minimize Power Consumption in Wireless Sensor Networks", in Innovations in Embedded Systems, Mobile Communication and Computing Technologies, MACMILLAN PUBLISHERS INDIA LTD., ISBN 13: 978-0230-63910-2, MACMILLAN Advanced Research Series; proceedings by the Mobile Communication and Networking Center of Excellence (MCNC) and PES School of Engineering, Bangalore, India, July 2009.

[4]. Kiran Amin, "Utilization of SIP Contact Header for Reducing the Load on Proxy Servers in FoIP Application", Intelligence, Communication Systems and Networks (CICSyN 2009), published in IEEE Computer Society proceedings (copyright transferred to IEEE), jointly organized by the UK Simulation Society and the Asia Modelling and Simulation Society, 2009.