Course on Data Mining (581550-4): Seminar Meetings
[Course schedule diagram. Topics: Association Rules, Episodes, Text Mining, Clustering, KDD Process, Home Exam. Meeting dates: 02.11., 09.11., 16.11., 23.11., 30.11. M = seminar by Mika, P = seminar by Pirjo]
Today 09.11.2001
• Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns. Int'l Conference on Data Engineering, 1995.
• F. Masseglia, P. Poncelet and M. Teisseire: Incremental Mining of Sequential Patterns in Large Databases. 16èmes Journées Bases de Données Avancées, 2000.
Mining Sequential Patterns
Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center, USA
Published in ICDE'95 (Int'l Conf. on Data Engineering)
Data Mining course Autumn 2001/University of Helsinki
Summary by Mika Klemettinen
• Problem statement:
  • Database D with customer transactions
  • Customer-id, transaction time, items purchased
  • Quantities of items purchased are NOT considered
• Definitions:
  • Itemset: a non-empty set of items, (i1 i2 i3 …)
  • Sequence: an ordered list of itemsets, <s1 s2 s3 …>
  • A sequence <a1 a2 … an> is contained in <b1 b2 … bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin
  • E.g., <(3)(4 5)(8)> is contained in <(7)(3 8)(9)(4 5 6)(8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8)
  • However, note that sequence <(3)(5)> is not contained in <(3 5)> (and vice versa)
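The containment definition can be checked mechanically. The following is a minimal sketch, assuming a sequence is represented as a Python list of sets; the function name is_contained is chosen here for illustration and does not come from the paper.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both arguments are lists of itemsets (Python sets). Greedy matching
    is enough: each itemset of a is matched to the earliest itemset of b,
    after the previous match, that is a superset of it.
    """
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match strictly later in b
    return True


a = [{3}, {4, 5}, {8}]
b = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
print(is_contained(a, b))                   # True
print(is_contained([{3}, {5}], [{3, 5}]))   # False
print(is_contained([{3, 5}], [{3}, {5}]))   # False
```

The last two calls reproduce the remark that <(3)(5)> and <(3 5)> do not contain each other.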
• Customer sequence: a sequence of transactions ("shopping baskets") of a customer, ordered by transaction times Ti: <itemset(T1) itemset(T2) … itemset(Tn)>
• A customer supports a sequence s if s is contained in the customer sequence for this customer
• The support for a sequence is defined as the fraction of total customers who support this sequence
• Task: Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern
Customer Id   Transaction time   Items bought
1             June 25, 1993      30
1             June 30, 1993      90
2             June 10, 1993      10, 20
2             June 15, 1993      30
2             June 20, 1993      40, 60, 70
...           ...                ...

Customer Id   Customer sequence
1             (30)(90)
2             (10 20)(30)(40 60 70)
3             (30 50 70)
4             (30)(40 70)(90)
5             (90)

Min. support 25% => 2 customers: <(30)(90)> (customers 1 and 4) and <(30)(40 70)> (customers 2 and 4) are maximal
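The support values in this example can be recomputed with a few lines of code. This is a sketch only: the customer sequences are the five listed above, while the helper names is_contained and support are illustrative assumptions.

```python
# Customer sequences from the example database (customer id -> sequence).
customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def is_contained(a, b):
    """True if sequence a is contained in sequence b (lists of sets)."""
    it = iter(b)  # shared iterator, so matches must occur in order
    return all(any(x <= y for y in it) for x in a)


def support(seq):
    """Fraction of customers whose customer sequence contains seq."""
    hits = sum(is_contained(seq, cs) for cs in customers.values())
    return hits / len(customers)


print(support([{30}, {90}]))       # 0.4, customers 1 and 4
print(support([{30}, {40, 70}]))   # 0.4, customers 2 and 4
print(support([{10, 20}, {30}]))   # 0.2, customer 2 only
```

Both <(30)(90)> and <(30)(40 70)> reach 40%, which is above the 25% threshold, whereas <(10 20)(30)> is supported by one customer only.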
• Definitions:
  • The length of a sequence is the number of itemsets in the sequence
  • A sequence of length k is called a k-sequence
  • A sequence concatenated from sequences x and y is denoted by x.y
  • The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
  • An itemset with minimum support is called a large itemset or litemset
  • Each itemset in a large sequence must have minimum support, i.e., any large sequence must be a list of litemsets (Apriori trick!)
• Three algorithms, all for sequential patterns:
  • AprioriSome
  • AprioriAll
  • DynamicSome
• Mining of sequential patterns:
  • 1. Sort Phase
    • Sort according to customer Id and transaction time
  • 2. Litemset Phase
    • Find large itemsets in an Apriori fashion, but like in MaxFreq, the support count is incremented only once per customer even if the customer buys the same set of items in two different transactions
    • The large itemsets are mapped to a set of contiguous integers (e.g., (30), (40), (70), (40 70) and (90) become 1, 2, 3, 4 and 5); checking of equality is then fast (constant time)!
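A rough sketch of the litemset phase on the five example customer sequences follows. It is not the Apriori-style procedure itself: for brevity it enumerates every subset of every transaction by brute force, which is only feasible for tiny data. The function names, and the sort key used to reproduce the numbering 1 to 5, are assumptions made for this illustration; the numbering order is arbitrary in principle.

```python
from itertools import combinations

customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def find_litemsets(customers, min_count):
    """Itemsets (as sorted tuples) supported by at least min_count customers."""
    counts = {}
    for sequence in customers.values():
        supported = set()  # a customer counts at most once per itemset
        for transaction in sequence:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                supported.update(combinations(items, size))
        for itemset in supported:
            counts[itemset] = counts.get(itemset, 0) + 1
    return [s for s, n in counts.items() if n >= min_count]


def map_to_integers(litemsets):
    """Assign contiguous integers; this ordering is an arbitrary choice."""
    ordered = sorted(litemsets, key=lambda s: (s[-1], len(s), s))
    return {itemset: i for i, itemset in enumerate(ordered, start=1)}


mapping = map_to_integers(find_litemsets(customers, min_count=2))
print(mapping)
# {(30,): 1, (40,): 2, (70,): 3, (40, 70): 4, (90,): 5}
```

After the mapping, two litemsets are equal exactly when their integers are equal, which is the constant-time comparison mentioned above.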
• 3. Transformation Phase
  • There is a need to repeatedly check which large itemsets are contained in customer sequences
  • To make this fast, each customer sequence is transformed to a list of large itemsets
  • Then the large itemsets are mapped to integers

CId   Original sequence       Transformed                        After mapping
1     (30)(90)                {(30)}{(90)}                       {1}{5}
2     (10 20)(30)(40 60 70)   {(30)}{(40),(70),(40 70)}          {1}{2,3,4}
3     (30 50 70)              {(30),(70)}                        {1,3}
4     (30)(40 70)(90)         {(30)}{(40),(70),(40 70)}{(90)}    {1}{2,3,4}{5}
5     (90)                    {(90)}                             {5}
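The transformation can be sketched as follows, assuming the litemset-to-integer mapping of the example is already known. The function name transform and the data layout are illustrative choices, not taken from the paper.

```python
# Litemset -> integer mapping from the example.
mapping = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def transform(sequence):
    """Replace each transaction by the ids of the litemsets it contains.

    Transactions that contain no litemset are dropped.
    """
    transformed = []
    for transaction in sequence:
        ids = {i for litemset, i in mapping.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


for cid, sequence in customers.items():
    print(cid, transform(sequence))
```

For customer 2 the transaction (10 20) disappears because it contains no large itemset, which matches the table row {1}{2,3,4}.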
• 4. Sequence Phase
  • The large itemsets are used to find the desired sequences
  • AprioriAll:
    – Based on the normal Apriori algorithm
    – Counts all the large sequences
    – Prunes the non-maximal ones in the "Maximal Phase"
  • *Some (AprioriSome and DynamicSome):
    – Avoid counting sequences that are contained in longer sequences by counting the longer ones first; at the same time, avoid counting many longer sequences that turn out not to be large, since their subsequences then still have to be counted
    – Forward phase: find all large sequences of certain lengths
    – Backward phase: find all remaining large sequences
    – AprioriSome: use only large sequences from the previous pass to generate candidates and validate their supports (i.e., whether they are frequent or not)
    – DynamicSome: generate candidates on-the-fly based on large sequences found in the previous passes and the customer sequences read from the database
  • 5. Maximal Phase
    • Find the maximal sequences among the large sequences
    • In practice, starting from the longest sequences, delete all their subsequences
• AprioriAll:
  • Find all large sequences "normally"
  • Prune the non-maximal ones away starting from <1 2 3 4> by deleting all its subsequences (<1 2 3>, <1 2 4>, <1 3 4>, <2 3 4>, <1 2>, <1 3>, …, <4>), then take the remaining <1 3 5> and prune all its subsequences, …
  • The maximal large sequences are <1 2 3 4>, <1 3 5> and <4 5>
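The pruning step can be sketched in a few lines. Two assumptions are made here: the set of large sequences is not listed above, so it is rebuilt as all subsequences of the three maximal sequences, and every element of a sequence is a single litemset id, so containment between different litemsets (e.g., (40) inside (40 70)) is ignored.

```python
from itertools import combinations


def subsequences(seq):
    """All non-empty subsequences of a tuple, order preserved (incl. seq)."""
    return {c for r in range(1, len(seq) + 1) for c in combinations(seq, r)}


# Assumed input: everything contained in the three maximal sequences.
large = set().union(*(subsequences(s)
                      for s in [(1, 2, 3, 4), (1, 3, 5), (4, 5)]))


def maximal(large):
    """Keep the longest sequences first and delete their subsequences."""
    remaining = set(large)
    result = []
    for seq in sorted(large, key=len, reverse=True):
        if seq in remaining:
            result.append(seq)
            remaining -= subsequences(seq)
    return result


print(len(large))      # 20
print(maximal(large))  # [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```

Processing by decreasing length guarantees that a sequence still present when it is reached cannot be contained in any longer large sequence.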
• AprioriSome:
  • Count only sequences of, e.g., lengths 1, 2, 4 and 6 in the "forward phase" and count sequences of lengths 3 and 5 in the "backward phase"
  • Note: in the forward phase, candidates are generated for all lengths, even for the lengths that are not counted:
    • If the large sequences L_{k-1} were counted, generate the new candidates C_k from L_{k-1}
    • If the large sequences L_{k-1} were NOT counted, generate the new candidates C_k from the candidates C_{k-1}
  • In the backward phase: delete all sequences of length k from the candidate collection if they are contained in some longer large sequence L_i (i > k); the remaining candidates are then counted
• Function "next" determines the next sequence length to be counted: it is based on the assumption that if, e.g., almost all sequences of length k are large (frequent), then many of the sequences of length k+1 are also large (frequent). E.g.,
  • Most of the sequences are large (at least 85%) => next round is k+5
  • ...
  • Not many of the sequences are large (less than 67%) => next round is k+1 (as in AprioriAll)
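A possible shape of such a function is sketched below. Only the two end points (below 67% => k+1, at least 85% => k+5) come from the slide; the intermediate rows of the table stand in for the elided "..." cases and are hypothetical, as is the name next_length and the reading of the percentage as the share of counted candidates of length k that turned out to be large.

```python
# (upper bound on the hit ratio, number of lengths to advance).
# The 0.67 and 0.85 bounds are from the slide; the middle rows are
# hypothetical fill-ins for the cases the slide leaves out.
STEPS = [(0.67, 1), (0.75, 2), (0.80, 3), (0.85, 4)]


def next_length(k, hit_ratio):
    """Next sequence length to count after length k.

    hit_ratio is assumed to be |L_k| / |C_k|, the fraction of the
    counted candidates of length k that were large.
    """
    for bound, jump in STEPS:
        if hit_ratio < bound:
            return k + jump
    return k + 5


print(next_length(2, 0.50))  # 3: few candidates were large, no skipping
print(next_length(2, 0.90))  # 7: most candidates were large, skip ahead
```

With a hit ratio that is always below the first bound, every length is counted and the behaviour degenerates to that of AprioriAll.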
• DynamicSome:
  • In the initialization phase, count only sequences up to and including the length given by the variable step
    • E.g., if step is 3, count sequences of length 1, 2 and 3
  • In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on-the-fly based on previous passes and customer sequences in the database
    • E.g., while generating sequences of length 9 with step size 3: while passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c at hand, and they do not overlap in c, then s6.s3 is a candidate 9-sequence (in general, sk.sj is a candidate (k+j)-sequence)
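The on-the-fly generation can be illustrated with a simplified sketch. The assumptions are: sequences are tuples of litemset ids (one id per element), the customer sequence c is in transformed form (a list of sets of ids), and "do not overlap" is read as "the earliest possible end of the first sequence lies strictly before the latest possible start of the second". The function names and the sample data are made up for this illustration.

```python
def earliest_end(seq, c):
    """Index in c where the earliest match of seq ends, or None."""
    pos = 0
    for litemset in seq:
        while pos < len(c) and litemset not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos - 1


def latest_start(seq, c):
    """Index in c where the latest match of seq starts, or None."""
    pos = len(c) - 1
    for litemset in reversed(seq):
        while pos >= 0 and litemset not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 1


def otf_candidates(Lk, Lj, c):
    """Candidates x.y with x in Lk, y in Lj, x ending before y starts in c."""
    candidates = set()
    for x in Lk:
        end = earliest_end(x, c)
        if end is None:
            continue
        for y in Lj:
            start = latest_start(y, c)
            if start is not None and end < start:
                candidates.add(x + y)
    return candidates


c = [{1}, {2}, {3, 7}, {4}]            # hypothetical customer sequence
L2 = [(1, 2), (1, 3), (1, 4), (3, 4)]  # hypothetical large 2-sequences
print(otf_candidates(L2, L2, c))       # {(1, 2, 3, 4)}
```

Only (1 2) followed by (3 4) fits without overlap in c, so a single candidate 4-sequence is produced for this customer.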
  • In the intermediate phase, generate the candidate sequences for the skipped lengths
    • E.g., if we have counted L6 and L3, and L9 turns out to be empty: we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5
  • The backward phase is identical to AprioriSome
• Then we go on and spare a few thoughts on incremental mining of sequential patterns
Incremental Mining of Sequential Patterns in Large Databases
F. Masseglia, P. Poncelet and M. Teisseire
Laboratoire PRiSM & LIRMM UMR CNRS, France
Published in BDA'00 (Bases de Données Avancées)
Data Mining course Autumn 2001/University of Helsinki
Summary by Mika Klemettinen
• Problem setting:
  • Let us consider an original and an incremental customer transaction database
  • For the original database, the frequent patterns have already been discovered
  • The incremental database may contain new customers and new transactions for both old and new customers
  • To compute the set of sequential patterns in the updated database, we want to avoid counting everything from scratch
• Some main things one has to consider:
  • Discover all sequential patterns that are NOT frequent in the original database but become frequent with the increment
  • Examine all transactions in the original database which can be extended to become frequent
  • Old frequent sequences may become invalid when adding a customer or customers
• Definitions are basically the same as in the "Mining Sequential Patterns" paper
• Again, the problem is to find all (maximal) sequences whose support is greater than a specified threshold (minimum support)
• Additional definitions:
  • DB is the original database, minSupp is the minimum support
  • db is the increment database
  • U = DB ∪ db is the updated database containing all sequences from DB and db
  • LDB is the set of frequent sequences in DB
  • The task is to find the frequent sequences in U, denoted LU, with respect to minSupp
• An example database is presented on the next slide
[Figure: example database (original database DB and increment database db)]
• First problem (Figure 1): Append new transactions to customers already existing in the original database
  • Suppose that we have a minSupp threshold of 50%
  • In the original database, the frequent (maximal) sequences LDB are { <(10 20) (30)>, <(10 20) (40)> }
  • New transactions are appended to customers C2 and C3
  • Sequences <(60) (90)> and <(10 20) (50 70)> become frequent
    • Customers C3 and C4 contain the first one, thus support is 50%
    • Customers C1, C2, and C3 contain <(10 20)>, thus the increments for C2 and C3 make the second one frequent, since customers C1 and C2 contain it; thus support is 50%
  • Sequences <(10 20) (30) (50 60) (80)> and <(10 20) (40) (50 60) (80)> become frequent, since <(50 60) (80)> is frequent in db and was added to the rows already containing the frequent sequences <(10 20) (30)> and <(10 20) (40)>
• Second problem (Figure 2): Append new customers and new transactions to the original database
  • Suppose again that we have a minSupp threshold of 50%
  • When one new customer is added to the database, a frequent sequence must be observed for 3 customers (previously 2)
  • In the original database, the frequent (maximal) sequences LDB used to be { <(10 20) (30)>, <(10 20) (40)> }, but LDB is now just { <(10 20)> }
    • Sequences <(10 20) (30)> and <(10 20) (40)> occur only for customers C2 and C3
    • Sequence <(10 20)> occurs for C1, C2, and C3
  • By introducing the increment database db, LU becomes { <(10 20) (50)>, <(10) (70)>, <(10) (80)>, <(40) (80)>, <(60)> }
    • E.g., sequence <(10 20) (50)> is in the original database only for C1, and is not frequent; as the item 50 becomes frequent with the increment database, the sequence also matches C2 and C3
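The jump from 2 to 3 required customers is plain arithmetic on the support threshold, as this small sketch shows. It assumes that a sequence is frequent when its support is at least minSupp (which is how the 50% cases above are treated) and that the original database holds four customers, C1 to C4; the function name is illustrative.

```python
def min_customers(n_customers, min_supp_percent):
    """Smallest number of supporting customers that reaches minSupp."""
    # Integer ceiling of n_customers * min_supp_percent / 100.
    return -(-n_customers * min_supp_percent // 100)


print(min_customers(4, 50))  # 2 customers in the original database
print(min_customers(5, 50))  # 3 customers after adding one new customer
```

This is why <(10 20) (30)> and <(10 20) (40)>, each supported by two customers, stop being frequent as soon as a fifth customer is added.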
• Algorithm (ISE): The incremental mining is decomposed into two subproblems (k = length of the longest frequent sequences in DB)
  • Find all new frequent sequences of size j ≤ (k+1). During this phase, three kinds of frequent sequences are considered:
    • Sequences in DB can become frequent since they have sufficient support with the increment
    • There can be new frequent sequences appearing in the increment db but not in the original DB
    • Sequences in DB can become frequent when adding items of db
  • Find all new frequent sequences of size j > (k+1)
    • This is a straightforward application of an Apriori-like algorithm, since we have all frequent (k+1)-sequences discovered in the previous phase
• First iteration (1):
  • Make a pass on db, count support for individual items of db
  • Provide 1-candExt, the sequences occurring in db
  • Determine which items of db are frequent in U => L1db
  • Prune out the sequences that used to be frequent in LDB but are no longer frequent in U
• First iteration (2):
  • Create candidate sequences of length 2 by joining L1db with L1db => 2-candExt
  • Generate from LDB the set of frequent sub-sequences
  • Scan U to find the frequent 2-sequences from 2-candExt and the frequent sub-sequences occurring before items of L1db
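The join of L1db with itself can be pictured as follows. The slide does not spell out the join, so this sketch assumes the usual scheme for 2-candidates in sequence mining: every ordered pair of frequent items gives a sequence <(x)(y)>, and every unordered pair gives a single-itemset sequence <(x y)>. The function name and the sample items are hypothetical.

```python
from itertools import combinations, product


def candidates_2(frequent_items):
    """Candidate 2-sequences built from a set of frequent items.

    A sequence is a tuple of itemsets; an itemset is a sorted tuple.
    """
    items = sorted(frequent_items)
    # <(x)(y)>: two transactions, order matters, x may equal y
    two_steps = [((x,), (y,)) for x, y in product(items, repeat=2)]
    # <(x y)>: both items in the same transaction
    one_step = [((x, y),) for x, y in combinations(items, 2)]
    return two_steps + one_step


for candidate in candidates_2({50, 60}):   # hypothetical L1db
    print(candidate)
```

For n frequent items this yields n² + n(n-1)/2 candidates, i.e., 5 candidates for the two items used here.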
• First iteration (3):
  • freqSeed <= frequent sub-sequences occurring before items of L1db and appended with the item
  • 2-freqExt <= frequent 2-sequences from 2-candExt
• j-th iteration with j ≤ (k+1)
    while (j-freqExt != ∅ AND j ≤ (k+1)) do
        candInc <= Generate candidates from freqSeed and j-freqExt;
        j++;
        j-candExt <= Generate candidate j-sequences from (j-1)-freqExt;
        Scan db for j-candExt;
        if (j-candExt != ∅ AND candInc != ∅) then
            Scan U for j-candExt and candInc;
        endif
        j-freqExt <= frequent j-sequences;
        freqInc <= freqInc + candidates from candInc verifying the support on U;
    enddo
    LU <= LDB ∪ { max. freq. sequences in freqSeed ∪ freqInc ∪ freqExt };
• j-th iteration with j > (k+1)
    Apply an Apriori-style algorithm until all frequent sequences are discovered
    LU <= LU ∪ { max. freq. sequences obtained from the previous step };
• On the next slide, the processes in the first and j-th iteration with j > (k+1) are summarized
• Optimization in "candInc <= Generate candidates from freqSeed and j-freqExt":
  • Consider two sequences (s ∈ freqSeed, s' ∈ freqExt) such that an item i ∈ L1db is the last item of s and the first item of s'
  • Do not append s' ∈ freqExt to s ∈ freqSeed if there exists an item j ∈ L1db such that j is in s' and j is not preceded by s
[Figure: summary of the processes in the first and j-th iterations]
Unofficial Evaluation (Personal Views…)
• Mining Sequential Patterns:
  • The paper comes from one of the top research groups in the data mining area (IBM Almaden Data Mining group led by Rakesh Agrawal)
  • Quite well-written paper: good language, clear examples and presentation => rather "easy to read"
  • Simple ideas, not very "break-through" ideas (at least this is the interpretation now); quite good international conference
  • One has to remember: this was written already in 1995
• Incremental Mining of Sequential Patterns in Large Databases:
  • The paper comes from a not so well-known French research group
  • Good: lots of examples
  • Bad: the language is not always as good as it could be, the definitions are sometimes somewhat "blurry", and maybe too many abbreviations are used
  • Probably not very "break-through" ideas; national DB conference
  • Remember: this is from the year 2000 - rather new!