SPADE Sequence mining algorithm Monica Dăgădiţă ISI

Page 1: SPADE -

SPADE
Sequence mining algorithm

Monica Dăgădiţă
ISI

Page 2: SPADE -

04/12/2023Data Mining 2

OUTLINE
Introduction to sequence mining
Why sequence mining?
Sequence mining algorithms
SPADE
  Motivation
  Definitions and examples
  Algorithm
  Implementation

Page 3: SPADE -

INTRODUCTION TO SEQUENCE MINING
Aim: finding statistically relevant patterns between data examples where the values are delivered in a sequence
Originally introduced for market basket analysis (customer behaviour predictions)
2 types of sequence mining:
  string mining: biology (gene/protein sequences)
  itemset mining: marketing and CRM applications

Page 4: SPADE -

WHY SEQUENCE MINING?
Discovering patterns:
  Bookstore: 70% of the people who buy Jane Austen’s “Pride and Prejudice” also buy “Emma” within a month
  Website: finding sequences of most frequently accessed pages
Usage:
  Promotions
  Shelf placement
  Restructure the website
  Recommender systems

Page 5: SPADE -

SEQUENCE MINING ALGORITHMS
Apriori
GSP (Generalized Sequential Pattern)
FreeSpan (Frequent pattern-projected Sequential pattern mining)
PrefixSpan (Prefix-projected Sequential pattern mining)
SPADE (Sequential PAttern Discovery using Equivalence classes)

Page 6: SPADE -

MOTIVATION
Problems of existing solutions:
  Repeated database scans
  Complex internal data structures
Key features of SPADE:
  Fixed number of database scans
  Vertical id-list database format
  Decomposition of the search space into smaller pieces that are processed independently

Page 7: SPADE -

DEFINITIONS AND EXAMPLES
Itemset: set of m distinct items
  I = {i1, i2, …, im}
Event: non-empty collection of items
  (i1, i2, …, ik)
Sequence: ordered list of events
  <e1 -> e2 -> … -> en>
k-sequence: sequence with k items
  (B->AC) is a 3-sequence
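The definitions above can be made concrete with a small sketch. The representation used here (a sequence as a list of events, each event a frozenset of single-letter items) and the helper names `parse` and `length` are assumptions made for illustration; they do not come from the slides.

```python
def parse(text):
    """Parse the slide notation, e.g. 'B->AC', into a list of events.

    Assumes every item is a single character (an assumption of this sketch).
    """
    return [frozenset(event) for event in text.split("->")]


def length(sequence):
    """Number of items k in the sequence, i.e. the k of 'k-sequence'."""
    return sum(len(event) for event in sequence)


seq = parse("B->AC")
print(len(seq), "events,", length(seq), "items")  # 2 events, 3 items
```

Note that k counts items, not events: (B->AC) has two events but is a 3-sequence.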

Page 8: SPADE -

DEFINITIONS AND EXAMPLES (2)
Subsequence: given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>, α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn
Examples:
  1. (B->AC) is a subsequence of (AB->E->ACD)
  2. (AB->E) is not a subsequence of (ABE)
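The subsequence test can be sketched as a greedy scan: each event of α is matched to the earliest not-yet-used event of β that contains it. The function names and the frozenset representation are assumptions of this sketch, not part of the slides.

```python
def parse(text):
    """Parse 'AB->E' into a list of events (frozensets of one-letter items)."""
    return [frozenset(event) for event in text.split("->")]


def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Greedy matching: for each event of alpha, advance through beta until an
    event that contains it is found, then continue after that position.
    """
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False  # ran out of events in beta
        j += 1            # the next event of alpha must match strictly later
    return True


print(is_subsequence(parse("B->AC"), parse("AB->E->ACD")))  # True
print(is_subsequence(parse("AB->E"), parse("ABE")))         # False
```

The second example fails because (ABE) is a single event, so there is no later event left to hold E.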

Page 9: SPADE -

DEFINITIONS AND EXAMPLES (3)

Page 10: SPADE -

DEFINITIONS AND EXAMPLES (4)

Id-lists of the most frequent items (1-sequences)
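A minimal sketch of how vertical id-lists could be built from a horizontal database. The slide's own table did not survive in this transcript, so the toy database `DB` below is an assumption, chosen to be consistent with the id-lists that appear later in the transcript; the helper names and the minimum support of 2 are also assumptions.

```python
from collections import defaultdict

# Assumed toy database: sid -> list of (eid, items in that event).
DB = {
    1: [(10, "CD"), (15, "ABC"), (20, "ABF"), (25, "ACDF")],
    2: [(15, "ABF"), (20, "E")],
    3: [(10, "ABF")],
    4: [(10, "DGH"), (20, "BF"), (25, "AGH")],
}


def vertical_idlists(db):
    """Map each item to the list of (sid, eid) pairs where it occurs."""
    idlists = defaultdict(list)
    for sid in sorted(db):
        for eid, items in sorted(db[sid]):
            for item in items:
                idlists[item].append((sid, eid))
    return dict(idlists)


def support(idlist):
    """Support = number of distinct sequences (sids) containing the item."""
    return len({sid for sid, _ in idlist})


idlists = vertical_idlists(DB)
frequent = {item: l for item, l in idlists.items() if support(l) >= 2}
print(sorted(frequent))  # ['A', 'B', 'D', 'F']
print(idlists["D"])      # [(1, 10), (1, 25), (4, 10)]
```

Support counts distinct sids, not id-list entries: D has three entries but occurs in only two sequences.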

Page 11: SPADE -

DEFINITIONS AND EXAMPLES (5)
D->BF->A
Step 1: D->B
Step 2: D->BF

Page 12: SPADE -

DEFINITIONS AND EXAMPLES (6)
D->BF->A
Step 3: D->BF->A
Not space-efficient
Solution: 2 columns, (sid, eid), for each sequence
  eid: id of the event that contains the sequence’s last item

Page 13: SPADE -

DEFINITIONS AND EXAMPLES (6)
D->BF->A (space-efficient id-list joins)

D->B
SID  EID
1    15
1    20
4    20

D->BF
SID  EID
1    20
4    20

D->BF->A
SID  EID
1    25
4    25
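The three tables above can be reproduced with two kinds of id-list join, sketched here. The id-lists of the single items D, B, F and A are assumptions (the slide that showed them did not survive in this transcript), and the function names are hypothetical. As a simplification, the last step joins D->BF directly with the id-list of A; SPADE itself joins two sequences taken from the same equivalence class.

```python
# Assumed id-lists of the frequent items, as (sid, eid) pairs.
D = [(1, 10), (1, 25), (4, 10)]
B = [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)]
F = [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)]
A = [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)]


def temporal_join(left, right):
    """Keep entries of `right` that occur strictly after some entry of
    `left` within the same sid (pattern: left -> right)."""
    earliest = {}
    for sid, eid in left:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in right
            if sid in earliest and eid > earliest[sid]]


def equality_join(left, right):
    """Keep (sid, eid) pairs present in both lists (items in the same event)."""
    return sorted(set(left) & set(right))


d_b = temporal_join(D, B)        # D->B
d_f = temporal_join(D, F)        # D->F
d_bf = equality_join(d_b, d_f)   # D->BF
d_bf_a = temporal_join(d_bf, A)  # D->BF->A

print(d_b)     # [(1, 15), (1, 20), (4, 20)]
print(d_bf)    # [(1, 20), (4, 20)]
print(d_bf_a)  # [(1, 25), (4, 25)]
```

Each result keeps only the eid of the last event, which is all the next join needs.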

Page 14: SPADE -

DEFINITIONS AND EXAMPLES (7)
Complete lattice representation

Page 15: SPADE -

Page 16: SPADE -

DEFINITIONS AND EXAMPLES (8)
Decomposing the lattice => smaller pieces that can be solved independently
Equivalence classes:
  Two sequences are in the same class (Θk) if they share a common prefix of length k
Example:
  k=1: Θ1 -> {[A], [B], [D], [F]}
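Grouping by a common k-length prefix can be sketched as follows. The list of sequences is an assumed sample, and the helpers `parse`, `prefix`, `show` and `equivalence_classes` are hypothetical names. The prefix length is counted in items, matching the definition of a k-sequence.

```python
from collections import defaultdict


def parse(text):
    """'D->BF' -> (('D',), ('B', 'F')); items inside an event are sorted."""
    return tuple(tuple(sorted(event)) for event in text.split("->"))


def show(sequence):
    """Inverse of parse, for readable output."""
    return "->".join("".join(event) for event in sequence)


def prefix(sequence, k):
    """First k items of the sequence, keeping the event structure."""
    out, remaining = [], k
    for event in sequence:
        if remaining == 0:
            break
        out.append(event[:remaining])
        remaining -= len(out[-1])
    return tuple(out)


def equivalence_classes(sequences, k):
    """Group sequences that share the same k-item prefix."""
    classes = defaultdict(list)
    for s in sequences:
        classes[show(prefix(s, k))].append(show(s))
    return dict(classes)


sample = [parse(t) for t in
          ["AB", "AF", "B->A", "BF", "D->A", "D->B", "D->BF", "F->A"]]
print(equivalence_classes(sample, 1))
# {'A': ['AB', 'AF'], 'B': ['B->A', 'BF'], 'D': ['D->A', 'D->B', 'D->BF'], 'F': ['F->A']}
```

Each class can then be mined on its own, since every sequence it contains is built only from members of the same class.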

Page 17: SPADE -

DEFINITIONS AND EXAMPLES (9)

Page 18: SPADE -

DEFINITIONS AND EXAMPLES (10)

Page 19: SPADE -

ALGORITHM
SPADE(min_sup, D)
  // min_sup: minimum support
  // D: initial dataset
  F1 <- {frequent items or 1-sequences}
  F2 <- {frequent 2-sequences}
  E <- {equivalence classes [X] of Θ1}
  for all [X] in E
    enumerate_frequent_seq([X], min_sup)

Page 20: SPADE -

ALGORITHM (2)
Enumerate_frequent_seq(S, min_sup)
  for all Ai in S
    Ti <- {}
    for all Aj in S, with j ≥ i
      R <- Ai v Aj  // join
      if R satisfies min_sup
        Ti <- Ti U {R}
    end
    Enumerate_frequent_seq(Ti, min_sup)  // only in depth-first (DFS) mode
  end
  for all non-empty Ti
    Enumerate_frequent_seq(Ti, min_sup)  // only in breadth-first (BFS) mode
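The pseudocode can be turned into a compact runnable sketch. This is an illustration under stated assumptions, not the reference implementation: the toy database `DB` is assumed, only the depth-first variant is shown, the 2-sequences are produced by the same joins instead of a separate step, there is no candidate pruning, and the inner loop visits every Aj while keeping only the results that extend Ai (rather than using j ≥ i). Whether an atom added its last item to the previous event or as a new event is read off the pattern itself.

```python
from collections import defaultdict

# Assumed toy database: sid -> list of (eid, items in that event).
DB = {
    1: [(10, "CD"), (15, "ABC"), (20, "ABF"), (25, "ACDF")],
    2: [(15, "ABF"), (20, "E")],
    3: [(10, "ABF")],
    4: [(10, "DGH"), (20, "BF"), (25, "AGH")],
}


def support(idlist):
    """Number of distinct sequences (sids) in which a pattern occurs."""
    return len({sid for sid, _ in idlist})


def temporal_join(left, right):
    """Entries of `right` strictly after some entry of `left` in the same sid."""
    earliest = {}
    for sid, eid in left:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return {(sid, eid) for sid, eid in right
            if sid in earliest and eid > earliest[sid]}


def enumerate_frequent_seq(atoms, min_sup, found):
    """Depth-first enumeration of one equivalence class.

    `atoms` maps each pattern of the class to its id-list; the patterns
    share a prefix and differ only in their last item.
    """
    for pi, li in atoms.items():
        xi, i_is_event = pi[-1][-1], len(pi[-1]) > 1
        new_class = {}
        for pj, lj in atoms.items():
            xj, j_is_event = pj[-1][-1], len(pj[-1]) > 1
            candidates = []
            if i_is_event == j_is_event and xi < xj:
                # Same kind of atom: put Aj's item into Ai's last event.
                candidates.append((pi[:-1] + (pi[-1] + (xj,),), li & lj))
            if not j_is_event:
                # Aj added its item as a new event: append it after Ai.
                candidates.append((pi + ((xj,),), temporal_join(li, lj)))
            for pattern, idlist in candidates:
                if support(idlist) >= min_sup:
                    new_class[pattern] = idlist
        found.update(new_class)
        if new_class:
            enumerate_frequent_seq(new_class, min_sup, found)


def spade(db, min_sup):
    """Return {pattern: id-list} for every frequent sequence in db."""
    idlists = defaultdict(set)
    for sid, events in db.items():
        for eid, items in events:
            for item in items:
                idlists[((item,),)].add((sid, eid))
    f1 = {p: idlists[p] for p in sorted(idlists)
          if support(idlists[p]) >= min_sup}
    found = dict(f1)
    enumerate_frequent_seq(f1, min_sup, found)
    return found


def show(pattern):
    return "->".join("".join(event) for event in pattern)


result = spade(DB, min_sup=2)
for pattern in sorted(result, key=lambda p: (sum(map(len, p)), show(p))):
    print(show(pattern), support(result[pattern]))
```

Patterns are tuples of events, each event a tuple of sorted items, so D->BF->A is the key `(("D",), ("B", "F"), ("A",))`. Every join reads only the id-lists of the current class, which is what lets the classes be processed independently.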

Page 21: SPADE -

IMPLEMENTATION

The R Project for Statistical Computing
  A different implementation of the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues
arulesSequences package

Page 22: SPADE -

QUESTIONS

?