
Page1/30

Course on Data Mining (581550-4): Seminar Meetings

[Course schedule figure. Topics: Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Home Exam. Dates: 02.11., 09.11., 16.11., 23.11., 30.11. Legend: M = Seminar by Mika, P = Seminar by Pirjo]


Page2/30

Today 09.11.2001

• Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns. Int'l Conference on Data Engineering, 1995.

• F. Masseglia, P. Poncelet and M. Teisseire: Incremental Mining of Sequential Patterns in Large Databases. 16èmes Journées Bases de Données Avancées, 2000.



Page3/30

Mining Sequential Patterns

Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center, USA

Published in ICDE'95 (Int'l Conf. on Data Engineering)

Data Mining course Autumn 2001/University of Helsinki

Summary by Mika Klemettinen


Page4/30

Mining Sequential Patterns

• Problem statement:
  • Database D with customer transactions
  • Each transaction has a customer-id, a transaction time and the items purchased
  • Quantities of items purchased are NOT considered

• Definitions:
  • Itemset: a non-empty set of items, (i1 i2 i3 …)
  • Sequence: an ordered list of itemsets, <s1 s2 s3 …>
  • A sequence <a1 a2 … an> is contained in <b1 b2 … bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin
  • E.g., <(3)(4 5)(8)> is contained in <(7)(3 8)(9)(4 5 6)(8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8)
  • However, note that sequence <(3)(5)> is not contained in <(3 5)> (and vice versa)
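The containment test defined above can be sketched in a few lines of Python. This is only an illustration: the names `is_contained` and `seq` are made up here, and itemsets are modelled as frozensets of integers.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[int]


def is_contained(a: Sequence[Itemset], b: Sequence[Itemset]) -> bool:
    """Return True if sequence a is contained in sequence b."""
    pos = 0
    for itemset in a:
        # Greedily match each itemset of a against the earliest
        # not-yet-used itemset of b that is a superset of it.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match strictly later in b
    return True


def seq(*itemsets) -> List[Itemset]:
    """Helper (hypothetical): build a sequence from tuples of items."""
    return [frozenset(s) for s in itemsets]


# The examples from the slide:
print(is_contained(seq((3,), (4, 5), (8,)),
                   seq((7,), (3, 8), (9,), (4, 5, 6), (8,))))
print(is_contained(seq((3,), (5,)), seq((3, 5))))
```

Matching each itemset as early as possible is sufficient, because an early match never rules out a later match for the remaining itemsets.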


Page5/30

Mining Sequential Patterns

• Customer sequence: the sequence of transactions ("shopping baskets") of a customer, ordered by transaction times Ti: <itemset(T1) itemset(T2) … itemset(Tn)>

• A customer supports a sequence s if s is contained in the customer sequence for this customer

• The support for a sequence is defined as the fraction of total customers who support this sequence

• Task: Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern


Page6/30

Mining Sequential Patterns

Customer Id   Transaction time   Items bought
1             June 25, 1993      30
1             June 30, 1993      90
2             June 10, 1993      10, 20
2             June 15, 1993      30
2             June 20, 1993      40, 60, 70
...           ...                ...

Customer Id   Customer sequence
1             <(30)(90)>
2             <(10 20)(30)(40 60 70)>
3             <(30 50 70)>
4             <(30)(40 70)(90)>
5             <(90)>

Min. support 25% => 2 customers: <(30)(90)> (customers 1 & 4) and <(30)(40 70)> (customers 2 & 4) are maximal
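As a small check of the example above, support can be computed directly from the five customer sequences. This is a sketch only; `CUSTOMERS`, `is_contained` and `support` are names chosen for this illustration.

```python
def is_contained(a, b):
    """Return True if sequence a (list of sets) is contained in sequence b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


# Customer sequences from the table above.
CUSTOMERS = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def support(sequence, customers):
    """Fraction of customers whose customer sequence contains `sequence`."""
    supporters = sum(1 for cs in customers.values() if is_contained(sequence, cs))
    return supporters / len(customers)


print(support([{30}, {90}], CUSTOMERS))      # supported by customers 1 and 4
print(support([{30}, {40, 70}], CUSTOMERS))  # supported by customers 2 and 4
```

Note that <(30)> alone is supported by four customers, but it is not maximal because it is contained in both patterns above.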


Page7/30

Mining Sequential Patterns

• Definitions:
  • The length of a sequence is the number of itemsets in the sequence
  • A sequence of length k is called a k-sequence
  • A sequence concatenated from sequences x and y is denoted by x.y
  • The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
  • An itemset with minimum support is called a large itemset or litemset
  • Each itemset in a large sequence must have minimum support, i.e., any large sequence must be a list of litemsets (Apriori trick!)

• Three algorithms, all for sequential patterns:
  • AprioriSome
  • AprioriAll
  • DynamicSome


Page8/30

Mining Sequential Patterns

• Mining of sequential patterns:

  • 1. Sort Phase
    • Sort the database by customer Id and transaction time

  • 2. Litemset Phase
    • Find large itemsets in an Apriori fashion, but, like in MaxFreq, the support count is incremented only once per customer even if the customer buys the same set of items in two different transactions
    • The large itemsets are mapped to a set of contiguous integers (e.g., (30), (40), (70), (40 70) and (90) become 1, 2, 3, 4 and 5); checking equality is then fast (constant time)!
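The litemset phase can be illustrated as follows. To stay short, this sketch enumerates all subsets of each transaction by brute force instead of doing a level-wise Apriori search, so it only suits tiny examples. The function names are hypothetical, and the integer numbering it produces is arbitrary (it need not match the numbering used on the slide).

```python
from itertools import combinations

CUSTOMERS = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def large_itemsets(customers, min_count):
    """Itemsets bought in a single transaction by at least min_count customers."""
    counts = {}
    for transactions in customers.values():
        seen = set()  # each customer contributes at most once per itemset
        for transaction in transactions:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return {itemset for itemset, n in counts.items() if n >= min_count}


def map_to_integers(litemsets):
    """Assign contiguous integers 1..n to the large itemsets."""
    ordered = sorted(litemsets, key=lambda s: (len(s), s))
    return {itemset: n for n, itemset in enumerate(ordered, start=1)}


litemsets = large_itemsets(CUSTOMERS, min_count=2)
print(map_to_integers(litemsets))
```

With a minimum of 2 customers, the large itemsets found are (30), (40), (70), (90) and (40 70), the same five itemsets as in the slide's example.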


Page9/30

Mining Sequential Patterns

• 3. Transformation Phase
  • There is a need to repeatedly check which large itemsets are contained in customer sequences
  • To make this fast, each customer sequence is transformed to a list of sets of large itemsets
  • Then the large itemsets are mapped to integers

CId   Original seq.               Transformed                              Mapping
1     <(30)(90)>                  <{(30)}{(90)}>                           <{1}{5}>
2     <(10 20)(30)(40 60 70)>     <{(30)}{(40),(70),(40 70)}>              <{1}{2,3,4}>
3     <(30 50 70)>                <{(30),(70)}>                            <{1,3}>
4     <(30)(40 70)(90)>           <{(30)}{(40),(70),(40 70)}{(90)}>        <{1}{2,3,4}{5}>
5     <(90)>                      <{(90)}>                                 <{5}>
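The transformation in the table can be reproduced with a short sketch. `MAPPING` uses the integer numbering from the slide; `transform` is a name introduced here for illustration.

```python
# Large itemsets and their integer ids, as on the slide.
MAPPING = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(customer_sequence, mapping):
    """Replace each transaction by the set of ids of the litemsets it contains."""
    transformed = []
    for transaction in customer_sequence:
        ids = {n for litemset, n in mapping.items() if litemset <= transaction}
        if ids:  # a transaction containing no large itemset is dropped
            transformed.append(ids)
    return transformed


print(transform([{10, 20}, {30}, {40, 60, 70}], MAPPING))  # customer 2
print(transform([{30}, {40, 70}, {90}], MAPPING))          # customer 4
```

For customer 2 the transaction (10 20) disappears because it contains no large itemset, which is why the transformed sequence has only two elements.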


Page10/30

Mining Sequential Patterns

• 4. Sequence Phase
  • The large itemsets are used to find the desired sequences
  • AprioriAll:
    – Based on the normal Apriori algorithm
    – Counts all the large sequences
    – Prunes the non-maximal ones in the "Maximal phase"
  • *Some (AprioriSome and DynamicSome):
    – Avoid counting sequences that are contained in longer sequences by counting the longer ones first; at the same time, avoid counting many longer sequences that turn out not to be large


Page11/30

Mining Sequential Patterns

    – Forward phase: find all large sequences of certain lengths
    – Backward phase: find all remaining large sequences
    – AprioriSome: use only large sequences from the previous pass to generate candidates and validate their supports (i.e., whether they are frequent or not)
    – DynamicSome: generate candidates on-the-fly based on large sequences found from the previous passes and the customer sequences read from the database

• 5. Maximal Phase
  • Find the maximal sequences among the large sequences
  • In practice, starting from the longest sequences, delete all their subsequences
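The Apriori-style candidate generation used in the sequence phase can be sketched as a join step followed by a prune step. This is a simplified illustration, not the paper's exact procedure: sequences are tuples of litemset ids, only distinct pairs are joined, and the input set `L3` is illustrative data that does not appear on the slides.

```python
def generate_candidates(large_prev):
    """Build candidate k-sequences from the set of large (k-1)-sequences."""
    # Join: two sequences that agree on everything but their last element
    # are combined, in both orders (order matters in a sequence).
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p != q and p[:-1] == q[:-1]
    }

    # Prune: every (k-1)-subsequence of a candidate must itself be large.
    def all_subsequences_large(candidate):
        return all(
            candidate[:i] + candidate[i + 1:] in large_prev
            for i in range(len(candidate))
        )

    return {c for c in joined if all_subsequences_large(c)}


# Illustrative large 3-sequences (assumed input).
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(generate_candidates(L3))
```

Here the join produces <1 2 3 4>, <1 2 4 3>, <1 3 4 5> and <1 3 5 4>, and the prune step keeps only <1 2 3 4>, since each of the others has a 3-subsequence that is not large.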


Page12/30

Mining Sequential Patterns

• AprioriAll:

  • Find all large sequences "normally"

  • Prune the non-maximal ones away starting from <1 2 3 4> by deleting all its subsequences (<1 2 3>, <1 2 4>, <1 3 4>, <2 3 4>, <1 2>, <1 3>, …, <4>), then take the remaining <1 3 5> and prune all its subsequences, …

  • The maximal large sequences are <1 2 3 4>, <1 3 5> and <4 5>
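The pruning described above can be sketched as follows. The set `LARGE` is an assumed input chosen to be consistent with the maximal sequences named on the slide (the slide's own table of large sequences is not reproduced in this text), and sequences are simplified to tuples of litemset ids compared by equality.

```python
def is_subsequence(short, long_):
    """True if `short` can be obtained from `long_` by deleting elements."""
    remaining = iter(long_)
    return all(x in remaining for x in short)


def maximal_sequences(large):
    """Keep only the sequences not contained in another large sequence."""
    maximal = []
    # Longest first: anything contained in a longer sequence is dropped.
    for s in sorted(large, key=len, reverse=True):
        if not any(is_subsequence(s, m) for m in maximal):
            maximal.append(s)
    return set(maximal)


# Assumed set of large sequences over litemset ids 1..5.
LARGE = {
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
}
print(maximal_sequences(LARGE))
```

<4 5> survives because neither <1 2 3 4> nor <1 3 5> contains both 4 and 5 in that order.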


Page13/30

Mining Sequential Patterns

• AprioriSome:

  • Count only sequences of, e.g., length 1, 2, 4 and 6 in the "forward phase" and count sequences of length 3 and 5 in the "backward phase"

  • Note: in the forward phase, candidates are generated for all lengths, even for the lengths that are not counted:

    • If the large sequences L(k-1) were counted, then generate the new candidates Ck based on L(k-1)

    • If the large sequences L(k-1) were NOT counted, then generate the new candidates Ck based on the candidates C(k-1)

  • In the backward phase: delete all sequences of length k from the candidate collection Ck if they are contained in some longer large sequence in Li (i > k)


Page14/30

Mining Sequential Patterns

• Function "next" determines the next sequence length to be counted: this is based on the assumption that if, e.g., almost all candidate sequences of length k are large (frequent), then many of the sequences of length k+1 are also large (frequent). E.g.,

  • Most of the sequences are large (at least 85%) => next round is k+5

  • ...

  • Not many of the sequences are large (less than 67%) => next round is k+1 (as in AprioriAll)
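A function of this kind might look like the sketch below. Only the two outer cut-offs (85% and 67%) come from the slide; the intermediate thresholds, the name `next_length`, and the definition of `hit_ratio` as the fraction of counted candidates of length k that turned out to be large are assumptions made for illustration.

```python
def next_length(k: int, hit_ratio: float) -> int:
    """Choose the next sequence length to count in the forward phase.

    hit_ratio: assumed to be |Lk| / |Ck|, the share of counted
    candidates of length k that were large.
    """
    if hit_ratio < 0.67:   # from the slide: few large sequences, go step by step
        return k + 1
    if hit_ratio < 0.75:   # assumed intermediate threshold
        return k + 2
    if hit_ratio < 0.80:   # assumed intermediate threshold
        return k + 3
    if hit_ratio < 0.85:   # assumed intermediate threshold
        return k + 4
    return k + 5           # from the slide: most sequences large, skip ahead


print(next_length(2, 0.50))
print(next_length(2, 0.90))
```

The more aggressive the skipping, the more counting is saved on non-maximal sequences, at the risk of counting long candidates that turn out not to be large.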


Page15/30

Mining Sequential Patterns

• DynamicSome:

  • In the initialization phase, count only sequences of length up to and including the value of the step variable
    • E.g., if step is 3, count sequences of length 1, 2 and 3

  • In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on-the-fly based on previous passes and customer sequences in the database
    • E.g., while generating sequences of length 9 with a step size 3: while passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c at hand, and they do not overlap in c, then s6.s3 is a candidate 9-sequence (in general, sk.sj with sk ∈ Lk and sj ∈ Lj is a candidate (k+j)-sequence)
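The on-the-fly generation can be sketched for one customer sequence as below. This is a simplified reading of the idea, under stated assumptions: the customer sequence is in transformed form (a list of sets of litemset ids), "do not overlap" is taken to mean that the first sequence can end strictly before the second one starts, and all names and example data are hypothetical.

```python
def earliest_end(sequence, customer):
    """Index where `sequence` ends when matched as early as possible, else None."""
    pos = 0
    for x in sequence:
        while pos < len(customer) and x not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return None
        pos += 1
    return pos - 1


def latest_start(sequence, customer):
    """Index where `sequence` starts when matched as late as possible, else None."""
    pos = len(customer) - 1
    for x in reversed(sequence):
        while pos >= 0 and x not in customer[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 1


def otf_generate(large_k, large_j, customer):
    """Candidate (k+j)-sequences x.y supported by this customer sequence."""
    candidates = set()
    for x in large_k:
        end = earliest_end(x, customer)
        if end is None:
            continue
        for y in large_j:
            start = latest_start(y, customer)
            if start is not None and end < start:  # x finishes before y begins
                candidates.add(x + y)
    return candidates


customer = [{1}, {2, 3, 4}, {5}]
print(otf_generate({(1, 2), (1, 5)}, {(5,), (3,)}, customer))
```

In the example only <1 2> followed by <5> fits without overlap, so the single candidate is <1 2 5>.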


Page16/30

Mining Sequential Patterns

• In the intermediate phase, generate the candidate sequences for the skipped lengths
  • E.g., if we have counted L6 and L3, and L9 turns out to be empty: we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5

• The backward phase is identical to AprioriSome

• Then we go on and spare a few thoughts on incremental mining of sequential patterns


Page17/30

Incremental Mining of Sequential Patterns in Large Databases

F. Masseglia, P. Poncelet and M. Teisseire
Laboratoire PRiSM & LIRMM UMR CNRS, France

Published in BDA'00 (Bases de Données Avancées)

Data Mining course Autumn 2001/University of Helsinki

Summary by Mika Klemettinen


Page18/30

Incremental Mining of Sequential Patterns

• Problem setting:
  • Let us consider an original and an incremental customer transaction database
  • For the original database, the frequent patterns have already been computed
  • The incremental database may contain new customers and new transactions for both old and new customers
  • To compute the set of sequential patterns in the updated database, we want to avoid counting everything from scratch

• Some main things one has to consider:
  • Discover all sequential patterns that are NOT frequent in the original database but become frequent with the increment
  • Examine all sequences in the original database which can be extended to become frequent
  • Old frequent sequences may become invalid when adding a customer or customers


Page19/30


• Definitions are basically the same as in the "Mining Sequential Patterns" paper

• Again, the problem is to find all (maximal) sequences whose support is at least a specified threshold (minimum support)

• Additional definitions:
  • DB is the original database, minSupp is the minimum support
  • db is the increment database
  • U = DB ∪ db is the updated database containing all sequences from DB and db
  • LDB is the set of frequent sequences in DB
  • Task is to find the frequent sequences in U, denoted LU, with respect to minSupp

• An example database is presented on the next slide
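To make the definition of U concrete, the sketch below merges an original database and an increment, both represented as dictionaries from customer id to customer sequence. The representation, the function name and the tiny data set are all hypothetical, and the sketch assumes that the transactions in db are later in time than those already stored for the same customer.

```python
def update_database(original, increment):
    """Return U = DB ∪ db without modifying either input."""
    updated = {cid: list(sequence) for cid, sequence in original.items()}
    for cid, new_transactions in increment.items():
        # Existing customers get their new transactions appended;
        # unknown customer ids are added as new customers.
        updated.setdefault(cid, []).extend(new_transactions)
    return updated


DB = {"C1": [{10, 20}, {30}], "C2": [{10, 20}]}
db = {"C2": [{50}], "C3": [{60}]}
U = update_database(DB, db)
print(U)
```

Here C2 is an old customer with a new transaction, while C3 is a new customer, which corresponds to the two situations discussed in the problem setting.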


Page20/30



Page21/30


• First problem (Figure 1): Append new transactions to customers already existing in the original database

  • Suppose that we have a minSupp threshold of 50%
  • In the original database, the frequent (maximal) sequences LDB are { <(10 20)(30)>, <(10 20)(40)> }

  • New transactions are appended to customers C2 and C3
  • Sequences <(60)(90)> and <(10 20)(50 70)> become frequent
    • Customers C3 and C4 contain the first one, thus support is 50%
    • Customers C1, C2, and C3 contain <(10 20)>, thus the increments for C2 and C3 make the second one frequent, since customers C1 and C2 contain it; thus support is 50%

  • Sequences <(10 20)(30)(50 60)(80)> and <(10 20)(40)(50 60)(80)> become frequent, since <(50 60)(80)> is frequent in db and was added to the rows already containing frequent sequences <(10 20)(30)> and <(10 20)(40)>


Page22/30


• Second problem (Figure 2): Append new customers and new transactions to the original database

  • Suppose again that we have a minSupp threshold of 50%
  • When one new customer is added to the database, a frequent sequence must be observed for 3 customers (previously 2)
  • In the original database, the frequent (maximal) sequences LDB used to be { <(10 20)(30)>, <(10 20)(40)> }, but is now just { <(10 20)> }
    • Sequences <(10 20)(30)> and <(10 20)(40)> occur only for customers C2 and C3
    • Sequence <(10 20)> occurs for C1, C2, and C3
  • By introducing the increment database db, LU becomes { <(10 20)(50)>, <(10)(70)>, <(10)(80)>, <(40)(80)>, <(60)> }
    • E.g., sequence <(10 20)(50)> is in the original database only for C1, and is not frequent; as the item 50 becomes frequent with the increment database, the sequence also matches C2 and C3
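The change from 2 to 3 required customers is simple arithmetic: the required count is the minimum support fraction times the number of customers, rounded up. A minimal sketch, with a hypothetical function name and the threshold given as an integer percentage to avoid floating-point rounding:

```python
def min_support_count(min_supp_percent: int, n_customers: int) -> int:
    """Smallest number of customers that reaches the minimum support."""
    # Ceiling division using integers only.
    return -(-min_supp_percent * n_customers // 100)


print(min_support_count(50, 4))  # original database with 4 customers
print(min_support_count(50, 5))  # after one new customer is added
```

With four customers a sequence needs 2 supporters, with five it needs 3, so a sequence supported by exactly two old customers stops being frequent unless the increment adds a supporter.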


Page23/30


• Algorithm (ISE): The incremental mining is decomposed into two subproblems (k = length of the longest frequent sequences in DB)

  • Find all new frequent sequences of size j ≤ (k+1). During this phase, three kinds of frequent sequences are considered:
    • Sequences in DB can become frequent since they have sufficient support with the increment
    • There can be new frequent sequences appearing in the increment db but not in the original DB
    • Sequences in DB can become frequent when adding items of db

  • Find all new frequent sequences of size j > (k+1)
    • This is a straightforward application of an Apriori-like algorithm, since we have all frequent (k+1)-sequences discovered in the previous phase


Page24/30


• First iteration (1):
  • Make a pass on db, count support for individual items of db
  • Provide 1-candExt, the sequences occurring in db
  • Determine which items of db are frequent in U => L1db
  • Prune out the sequences that used to be frequent in LDB but are no longer frequent in U


Page25/30


• First iteration (2):
  • Create candidate sequences of length 2 by joining L1db with L1db => 2-candExt
  • Generate from LDB the set of frequent sub-sequences
  • Scan U to find the frequent 2-sequences from 2-candExt and the frequent sub-sequences occurring before items of L1db


Page26/30


• First iteration (3):
  • freqSeed <= frequent sub-sequences occurring before items of L1db and appended with the item
  • 2-freqExt <= frequent 2-sequences from 2-candExt


Page27/30


• j-th iteration with j ≤ (k+1)

While (j-freqExt != ∅ AND j ≤ (k+1)) do
    candInc <= Generate candidates from freqSeed and j-freqExt;
    j++;
    j-candExt <= Generate candidate j-sequences from (j-1)-freqExt;
    Scan db for j-candExt;
    if (j-candExt != ∅ AND candInc != ∅) then
        Scan U for j-candExt and candInc;
    endif
    j-freqExt <= frequent j-sequences;
    freqInc <= freqInc + candidates from candInc verifying the support on U;
enddo
LU <= LDB ∪ { max. freq. sequences in freqSeed ∪ freqInc ∪ freqExt };


Page28/30


• j-th iteration with j > (k+1)

Apply an Apriori-style algorithm until all frequent sequences are discovered
LU <= LU ∪ { max. freq. sequences obtained from the previous step };

• On the next slide, the processes in the first iteration and in the j-th iteration with j > (k+1) are summarized

• Optimization in "candInc <= Generate candidates from freqSeed and j-freqExt":

  Consider two sequences (s ∈ freqSeed, s' ∈ freqExt) such that an item i ∈ L1db is the last item of s and the first item of s'

  Do not append s' ∈ freqExt to s ∈ freqSeed if there exists an item j ∈ L1db such that j is in s' and j is not preceded by s


Page29/30



Page30/30

Unofficial Evaluation (Personal Views…)

• Mining Sequential Patterns:
  • Paper comes from one of the top research groups in the data mining area (IBM Almaden Data Mining group led by Rakesh Agrawal)
  • Quite well-written paper: good language, clear examples and presentation => rather "easy to read"
  • Simple ideas, not very "break-through" ideas (at least this is the interpretation now); quite good international conference
  • One has to remember: this was written already in 1995

• Incremental Mining of Sequential Patterns in Large Databases
  • Paper comes from a not so well-known French research group
  • Good: Lots of examples
  • Bad: Language is not always as good as it could be & definitions are sometimes somewhat "blurry", maybe too many abbreviations used
  • Probably not very "break-through" ideas, national DB conference
  • Remember: this is from year 2000 - rather new!
