8/11/2019 Ch5 Sequential
1/20
August 19, 2014 Data Mining: Concepts and Techniques 1
Chap 5.1: Mining Sequential Patterns
A kind of association rule mining. Its algorithms are closely related to association
rule mining algorithms.
Sequence Databases and Sequential Pattern Analysis
Transaction databases, time-series databases vs. sequence databases
A time-series database stores sequences of values that change with time, such as
data collected regarding the stock exchange.
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
First buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
Medical treatment, natural disasters (e.g., earthquakes),
science & engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences
A sequence database
A sequence: <(ef) (ab) (df) c b>
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
is a subsequence of
Given support threshold min_sup = 2, is a sequential pattern
SID sequence
10
20
30
40
Mining Sequential Patterns
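A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is contained in another sequence $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $1 \le i_1 < i_2 < \ldots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$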
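The containment test can be sketched in a few lines. This is a minimal illustration, not code from the slides: the names `is_contained` and `make_seq` are hypothetical, and the query sequences are made up; only the sequence <(ef) (ab) (df) c b> comes from the earlier slide.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def is_contained(sub: Sequence[Itemset], sequence: Sequence[Itemset]) -> bool:
    """Return True if `sub` is contained in `sequence`.

    Greedy left-to-right matching: each element of `sub` is matched to the
    earliest remaining element of `sequence` that is a superset of it.
    """
    pos = 0
    for element in sub:
        # Skip elements of `sequence` that do not include `element`.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next element must match a strictly later position
    return True


def make_seq(*elements: str) -> List[Itemset]:
    """Hypothetical helper: make_seq("ef", "c") builds <(ef) c>."""
    return [frozenset(e) for e in elements]


s = make_seq("ef", "ab", "df", "c", "b")  # <(ef) (ab) (df) c b>
print(is_contained(make_seq("e", "a", "c"), s))   # True
print(is_contained(make_seq("ab", "f", "b"), s))  # True
print(is_contained(make_seq("c", "a"), s))        # False: no a after c
```

Greedy matching is sufficient here because matching an element as early as possible never rules out a later match.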
Mining Sequential Patterns: An example
CustomerId  TransactionTime  ItemsBought
1           June 25 93       30
1           June 30 93       90
2           June 10 93       10, 20
2           June 15 93       30
2           June 20 93       40, 60, 70
3           June 25 93       30, 50, 70
4           June 25 93       30
4           June 30 93       40, 70
4           July 25 93       90
5           June 12 93       90
Database sorted by CustomerId and TransactionTime
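CustomerId  Customer Sequence
1           <(30) (90)>
2           <(10 20) (30) (40 60 70)>
3           <(30 50 70)>
4           <(30) (40 70) (90)>
5           <(90)>
Customer sequence version of the DB
Sequential patterns with support > 25%: <(30) (90)> and <(30) (40 70)>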
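A minimal sketch of how the transaction table turns into customer sequences. The tuple layout, the function name `customer_sequences`, and reading the two-digit year "93" as 1993 are assumptions made for this illustration.

```python
from collections import defaultdict
from datetime import date

# (customer id, transaction date, items bought), taken from the example table.
# The year 1993 is an assumed reading of "93".
transactions = [
    (4, date(1993, 7, 25), {90}),
    (1, date(1993, 6, 25), {30}),
    (2, date(1993, 6, 10), {10, 20}),
    (1, date(1993, 6, 30), {90}),
    (2, date(1993, 6, 15), {30}),
    (2, date(1993, 6, 20), {40, 60, 70}),
    (3, date(1993, 6, 25), {30, 50, 70}),
    (4, date(1993, 6, 25), {30}),
    (4, date(1993, 6, 30), {40, 70}),
    (5, date(1993, 6, 12), {90}),
]


def customer_sequences(rows):
    """Sort by (customer id, time) and collect each customer's itemsets in order."""
    sequences = defaultdict(list)
    for customer, _, items in sorted(rows, key=lambda r: (r[0], r[1])):
        sequences[customer].append(frozenset(items))
    return dict(sequences)


db = customer_sequences(transactions)
for customer, sequence in db.items():
    print(customer, [sorted(itemset) for itemset in sequence])
```

The input rows are deliberately shuffled so that the sort on customer id (major key) and transaction time (minor key) does real work.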
Mining Sequential Patterns
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support
Mining Sequential Patterns
Involves 5 phases
Sort phase
Litemset phase
Transformation phase
Sequence phase
Maximal phase
Terminology
Length of a sequence: the number of itemsets in the sequence (a sequence of length k is called a k-sequence)
Support of an itemset i: the fraction of customers who bought the items in i in a single transaction
An itemset with minimum support is called a large itemset, or litemset
Mining Sequential Patterns
Involves 5 phases
Sort phase
The database (D) is sorted, with customer-id as the major key and
transaction time as the minor key
This converts the original transaction database into a database of
customer sequences
Litemset phase
Find the set of all litemsets L
Simultaneously find the set of all large 1-sequences, {<l> | l ∈ L}
The set of litemsets is mapped to a set of contiguous integers
Reason for mapping: by treating litemsets as single entities, two litemsets can be compared for equality in constant time, which reduces the time required to check whether a sequence is contained in a customer sequence
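The litemset phase can be illustrated on the example database. This sketch counts support by brute-force enumeration of the subsets of each transaction, which is a simplification and not an Apriori-style search. The function name `large_itemsets`, the threshold of 2 customers, and the sort key used to number the litemsets are assumptions; the key was picked only so that the numbering matches the mapping table in these slides.

```python
from itertools import combinations

# Customer sequences of the example database (one list of transactions per customer).
customer_db = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
MIN_CUSTOMERS = 2  # assumed threshold: at least 2 of the 5 customers


def large_itemsets(db, min_customers):
    """Count the customers who bought each itemset in a single transaction."""
    support = {}
    for sequence in db.values():
        seen = set()  # a customer counts at most once per itemset
        for transaction in sequence:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            support[itemset] = support.get(itemset, 0) + 1
    return {i: s for i, s in support.items() if s >= min_customers}


litemsets = large_itemsets(customer_db, MIN_CUSTOMERS)

# Map litemsets to contiguous integers (ordering is an arbitrary choice).
ordered = sorted(litemsets, key=lambda s: (max(s), len(s)))
mapping = {itemset: n for n, itemset in enumerate(ordered, start=1)}
for itemset, n in mapping.items():
    print(itemset, "->", n, "support:", litemsets[itemset])
```

After the mapping, comparing two litemsets is a single integer comparison.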
Litemset phase
CustomerId  TransactionTime  ItemsBought
1           June 25 93       30
1           June 30 93       90
2           June 10 93       10, 20
2           June 15 93       30
2           June 20 93       40, 60, 70
3           June 25 93       30, 50, 70
4           June 25 93       30
4           June 30 93       40, 70
4           July 25 93       90
5           June 12 93       90
Fig. 1 Database sorted by CustomerId and TransactionTime
Large itemsets are (30), (40), (70), (40 70) and (90)
Large Itemset  Mapped To
(30)           1
(40)           2
(70)           3
(40 70)        4
(90)           5
Transformation Phase
We need to repeatedly determine which of a given set of large sequences are contained in a customer sequence
To make this test fast, each customer sequence is transformed into an alternative representation
Each transaction is replaced by the set of all litemsets contained in the transaction
If a transaction does not contain any litemset, it is not retained in the transformed sequence
If a customer sequence does not contain any litemsets, the sequence is dropped from the transformed database
CustomerId  Customer Sequence
1           <(30) (90)>
2           <(10 20) (30) (40 60 70)>
3           <(30 50 70)>
4           <(30) (40 70) (90)>
5           <(90)>
Customer sequence version of the DB
Large Itemset  Mapped To
(30)           1
(40)           2
(70)           3
(40 70)        4
(90)           5
Large itemsets
Transformation Phase
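CustomerId  Customer Sequence          Transformed Customer Sequence          After Mapping
1           <(30) (90)>                <{(30)} {(90)}>                        <{1} {5}>
2           <(10 20) (30) (40 60 70)>  <{(30)} {(40), (70), (40 70)}>         <{1} {2, 3, 4}>
3           <(30 50 70)>               <{(30), (70)}>                         <{1, 3}>
4           <(30) (40 70) (90)>        <{(30)} {(40), (70), (40 70)} {(90)}>  <{1} {2, 3, 4} {5}>
5           <(90)>                     <{(90)}>                               <{5}>
Transformed Database
Large Itemset  Mapped To
(30)           1
(40)           2
(70)           3
(40 70)        4
(90)           5
Large itemsets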
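The transformation rules can be sketched as follows, using the example database and the litemset mapping from these slides. The function name `transform` and the dictionary layout are assumptions made for illustration.

```python
# Customer sequences of the example database.
customer_db = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

# Litemset -> integer mapping from the slides.
mapping = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(db, mapping):
    """Replace each transaction by the ids of all litemsets it contains."""
    transformed = {}
    for customer, sequence in db.items():
        new_sequence = []
        for transaction in sequence:
            ids = {n for litemset, n in mapping.items() if litemset <= transaction}
            if ids:  # transactions without any litemset are not retained
                new_sequence.append(ids)
        if new_sequence:  # customers without any litemset are dropped
            transformed[customer] = new_sequence
    return transformed


transformed_db = transform(customer_db, mapping)
for customer, sequence in transformed_db.items():
    print(customer, [sorted(ids) for ids in sequence])
```

For customer 2, the transaction (10 20) contains no litemset and disappears, while (40 60 70) becomes {2, 3, 4}.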
Sequence Phase
Make multiple passes over the data
In each pass, start with a seed set of large sequences
Use the seed set to generate new potentially large sequences, called candidate sequences
Count the support of the candidates during the pass over the data
At the end of the pass, determine which candidate sequences are large
These large candidates become the seed for the next pass
Two families of algorithms: count-all and count-some
The count-all algorithm, based on the Apriori algorithm, is called AprioriAll
Sequence Phase
AprioriAll: Apriori candidate generation
Large 3-sequences    Candidate 4-sequences (after join)    Candidate 4-sequences (after pruning)
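The join and prune steps of Apriori-style candidate generation can be sketched as below. Sequences are tuples of litemset ids. The function name `generate_candidates` and the input set are illustrative choices for this sketch, and joining only distinct pairs of sequences is an assumption of the sketch, not something the slide states.

```python
def generate_candidates(large_prev):
    """Generate candidate k-sequences from the large (k-1)-sequences.

    Returns (candidates after join, candidates after pruning).
    """
    large_prev = set(large_prev)
    # Join: p and q agree on everything but their last id; extend p with q's last id.
    # Assumption of this sketch: a sequence is not joined with itself.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p != q and p[:-1] == q[:-1]
    }
    # Prune: drop a candidate if any subsequence obtained by deleting one id
    # is not itself large.
    pruned = {
        c
        for c in joined
        if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))
    }
    return joined, pruned


# Illustrative input: a set of large 3-sequences chosen for this sketch.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
after_join, after_prune = generate_candidates(large_3)
print(sorted(after_join))   # [(1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5), (1, 3, 5, 4)]
print(sorted(after_prune))  # [(1, 2, 3, 4)]
```

For example, (1, 2, 4, 3) is pruned because its subsequence (2, 4, 3) is not among the large 3-sequences.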
Maximal phase
Find the maximal sequences among the set of large sequences
Sequence Phase
Customer sequences
<{1 5} {2} {3} {4}>
<{1} {3} {4} {3 5}>
<{1} {2} {3} {4}>
<{1} {3} {5}>
<{4} {5}>
Min_support = 40% (2 customer sequences)
Large sequences
L1 (1-sequences), supports: 4, 2, 4, 4, 4
L2 (2-sequences), supports: 2, 4, 3, 3, 2, 2, 3, 2, 2
L3 (3-sequences), supports: 2, 2, 3, 2, 2
L4 (4-sequences), supports: 2
Maximal large sequences: <1 2 3 4>, <1 3 5> and <4 5>
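The sequence and maximal phases can be traced on the five customer sequences above. This is a simplified sketch, not AprioriAll itself: candidates are produced by extending every large sequence with every litemset id rather than by the Apriori join and prune, and maximality is checked on id tuples only. All function names are hypothetical.

```python
customer_sequences = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
MIN_SUPPORT = 2  # customer sequences (40% of 5)


def contains(customer_seq, candidate):
    """True if the ids of `candidate` occur in order in distinct elements."""
    pos = 0
    for litemset_id in candidate:
        while pos < len(customer_seq) and litemset_id not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def support(candidate):
    return sum(contains(s, candidate) for s in customer_sequences)


def large_sequences():
    """Level-wise search: one pass over the data per sequence length."""
    ids = sorted(set().union(*(e for s in customer_sequences for e in s)))
    level = [(i,) for i in ids if support((i,)) >= MIN_SUPPORT]
    result = []
    while level:
        result.extend(level)
        # Simplified candidate generation (stand-in for join + prune).
        candidates = {p + (i,) for p in level for i in ids}
        level = sorted(c for c in candidates if support(c) >= MIN_SUPPORT)
    return result


def is_subsequence(a, b):
    """True if tuple `a` can be obtained from `b` by deleting entries."""
    it = iter(b)
    return all(x in it for x in a)


large = large_sequences()
maximal = [s for s in large if not any(s != t and is_subsequence(s, t) for t in large)]
print(maximal)
```

The search finds 5, 9, 5, and 1 large sequences of lengths 1 to 4, which matches the number of support values listed for L1 to L4.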
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient and scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints
Studies on Sequential Pattern Mining
Concept introduction and an initial Apriori-like algorithm
R. Agrawal & R. Srikant. Mining sequential patterns. ICDE'95
GSP: an Apriori-based, influential mining method (developed at IBM Almaden)
R. Srikant & R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96
From sequential patterns to episodes (Apriori-like + constraints)
H. Mannila, H. Toivonen & A.I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1997
Mining sequential patterns with constraints
M.N. Garofalakis, R. Rastogi & K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999
A Basic Property of Sequential Patterns: Apriori
A basic property: Apriori (Agrawal & Srikant '94)
If a sequence S is not frequent,
then none of the super-sequences of S is frequent
E.g., is infrequent, so are and
Seq. ID  Sequence
10
20
30
40
50
Given support threshold min_sup = 2
GSP: A Generalized Sequential Pattern Mining Algorithm
GSP (Generalized Sequential Pattern) mining algorithm
proposed by Agrawal and Srikant, EDBT'96
Outline of the method
Initially, every item in the DB is a candidate of length 1
for each level (i.e., sequences of length k) do
  scan the database to collect the support count for each candidate sequence
  generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate can be found
Major strength: candidate pruning by Apriori
Finding Length-1 Sequential Patterns
Examine GSP using an example
Initial candidates: all 8 singleton sequences
Scan the database once, counting support for the candidates
Sequence database with Seq. IDs 10, 20, 30, 40, 50; min_sup = 2
Candidate supports (Cand / Sup table): 3, 5, 4, 3, 3, 2, 1, 1
Generating Length-2 Candidates
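51 length-2 candidates (6*6 + 6*5/2, built from the 6 length-1 candidates with support of at least 2)
Without the Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates (41 of 92)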
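The candidate counts can be checked with a short sketch. The support values are the eight values from the length-1 slide; the item names `i1` to `i8` are placeholders invented for this example, and `length2_candidates` is a hypothetical helper.

```python
from itertools import combinations, product

supports = [3, 5, 4, 3, 3, 2, 1, 1]  # length-1 supports, in slide order
items = [f"i{n}" for n in range(1, len(supports) + 1)]  # placeholder names
MIN_SUP = 2


def length2_candidates(items):
    """All length-2 candidates over `items`.

    Two shapes exist: <x y> (two elements, order matters, x may equal y)
    and <(x y)> (one element holding two distinct items).
    """
    two_elements = [((x,), (y,)) for x, y in product(items, repeat=2)]
    one_element = [((x, y),) for x, y in combinations(items, 2)]
    return two_elements + one_element


frequent = [item for item, sup in zip(items, supports) if sup >= MIN_SUP]
with_apriori = len(length2_candidates(frequent))  # 6*6 + 6*5/2
without_apriori = len(length2_candidates(items))  # 8*8 + 8*7/2
pruned_share = 1 - with_apriori / without_apriori

print(with_apriori, without_apriori, f"{pruned_share:.2%}")
```

Only the six items that reach min_sup = 2 are used to build candidates, which is where the reduction from 92 to 51 comes from.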