SPADE -

SPADE: Sequence mining algorithm

Monica Dăgădiţă, ISI

04/12/2023 Data Mining 2

OUTLINE
- Introduction to sequence mining
- Why sequence mining?
- Sequence mining algorithms
- SPADE
  - Motivation
  - Definitions and examples
  - Algorithm
  - Implementation


INTRODUCTION TO SEQUENCE MINING
- Aim: finding statistically relevant patterns between data examples where the values are delivered in a sequence
- Originally introduced for market basket analysis (customer behaviour predictions)
- 2 types of sequence mining:
  - string mining: biology (gene/protein sequences)
  - itemset mining: marketing and CRM applications


WHY SEQUENCE MINING?
- Discovering patterns:
  - Bookstore: 70% of the people who buy Jane Austen’s “Pride and Prejudice” also buy “Emma” within a month
  - Website: finding sequences of most frequently accessed pages
- Usage:
  - Promotions
  - Shelf placement
  - Restructuring the website
  - Recommender systems


SEQUENCE MINING ALGORITHMS
- Apriori
- GSP (Generalized Sequential Pattern)
- FreeSpan (Frequent pattern-projected Sequential pattern mining)
- PrefixSpan (Prefix-projected Sequential pattern mining)
- SPADE (Sequential PAttern Discovery using Equivalence classes)


MOTIVATION
- Problems of existing solutions:
  - Repeated database scans
  - Complex internal data structures
- Key features of SPADE:
  - Fixed number of database scans
  - Vertical id-list database format
  - Decomposition of the search space into smaller pieces, processed independently


DEFINITIONS AND EXAMPLES
- Itemset: set of m distinct items, I = {i1, i2, …, im}
- Event: non-empty collection of items, (i1 i2 … ik)
- Sequence: ordered list of events, <e1 -> e2 -> … -> en>
- K-sequence: sequence with k items, e.g. (B->AC) is a 3-sequence


DEFINITIONS AND EXAMPLES (2)
- Subsequence: given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>, α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn
- Examples:
  1. (B->AC) is a subsequence of (AB->E->ACD)
  2. (AB->E) is not a subsequence of (ABE)
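The subsequence definition above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; it assumes a sequence is a list of events and each event is a set of items (the function name `is_subsequence` is hypothetical).

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Each sequence is a list of events; each event is a set of items.
    Greedy matching: every event of alpha is mapped to the earliest
    remaining event of beta that contains it.
    """
    j = 0
    for a in alpha:
        # Skip events of beta that do not contain the current event of alpha.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next event of alpha must map to a strictly later event
    return True


# The two examples from the slide.
b_ac = [{"B"}, {"A", "C"}]
ab_e_acd = [{"A", "B"}, {"E"}, {"A", "C", "D"}]
ab_e = [{"A", "B"}, {"E"}]
abe = [{"A", "B", "E"}]

print(is_subsequence(b_ac, ab_e_acd))  # True
print(is_subsequence(ab_e, abe))       # False
```

The second example fails because (ABE) has a single event, so there is no later event left to hold E after AB has been matched.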


DEFINITIONS AND EXAMPLES (3)



Id-lists of the most frequent items (1-sequences)


DEFINITIONS AND EXAMPLES (5)
D->BF->A
- Step 1: D->B
- Step 2: D->BF


DEFINITIONS AND EXAMPLES (6)
D->BF->A
- Step 3: D->BF->A
- Not space-efficient
- Solution: 2 columns, (sid, eid), for each sequence
  - eid: id of the event that contains the sequence’s last item


DEFINITIONS AND EXAMPLES (6)
D->BF->A (space-efficient id-list joins)

D->B
SID  EID
1    15
1    20
4    20

D->BF
SID  EID
1    20
4    20

D->BF->A
SID  EID
1    25
4    25
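The three tables above can be reproduced with two kinds of id-list join. This is a minimal Python sketch under stated assumptions: the single-item id-lists for A, B, D and F are illustrative values chosen to be consistent with the tables (the slide's original figure did not survive), and the function names are hypothetical.

```python
def support(idlist):
    """Number of distinct sequences (sids) in which the pattern occurs."""
    return len({sid for sid, _ in idlist})


def temporal_join(prefix_list, item_list):
    """Id-list of 'prefix -> item': the item occurs strictly after the prefix."""
    first_end = {}
    for sid, eid in prefix_list:
        first_end[sid] = min(eid, first_end.get(sid, eid))
    return sorted((sid, eid) for sid, eid in item_list
                  if sid in first_end and eid > first_end[sid])


def equality_join(left, right):
    """Id-list of the pattern whose last event holds both last items."""
    return sorted(set(left) & set(right))


# Illustrative (assumed) id-lists of the single items, as (sid, eid) pairs.
A = [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)]
B = [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)]
D = [(1, 10), (1, 25), (4, 10)]
F = [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)]

d_b = temporal_join(D, B)        # D->B
d_f = temporal_join(D, F)        # D->F
d_bf = equality_join(d_b, d_f)   # D->BF
d_bf_a = temporal_join(d_bf, A)  # D->BF->A

print(d_b)     # [(1, 15), (1, 20), (4, 20)]
print(d_bf)    # [(1, 20), (4, 20)]
print(d_bf_a)  # [(1, 25), (4, 25)]
```

Only the (sid, eid) of the last event is stored, which is what keeps each id-list at two columns regardless of the pattern length.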


DEFINITIONS AND EXAMPLES (7)
Complete lattice representation


DEFINITIONS AND EXAMPLES (8)
- Decomposing the lattice => smaller pieces that can be solved independently
- Equivalence classes: 2 sequences are in the same class (Θk) if they share a common k-length prefix
- Example: k=1: Θ1 -> {[A], [B], [D], [F]}
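Grouping sequences by a shared k-length prefix can be sketched in a few lines of Python. This is an illustration only: sequences are assumed to be tuples of events, each event a tuple of items in sorted order, and the list of 2-sequences is a hypothetical input rather than data from the slides.

```python
from collections import defaultdict


def prefix(seq, k):
    """First k items of a sequence, keeping the event boundaries."""
    out, remaining = [], k
    for event in seq:
        if remaining == 0:
            break
        out.append(event[:remaining])
        remaining -= len(out[-1])
    return tuple(out)


def equivalence_classes(sequences, k):
    """Group sequences that share the same k-length prefix."""
    classes = defaultdict(list)
    for seq in sequences:
        classes[prefix(seq, k)].append(seq)
    return dict(classes)


# Hypothetical 2-sequences: AB, AF, B->A, BF, D->A, D->B, D->F, F->A
two_sequences = [
    (("A", "B"),), (("A", "F"),),
    (("B",), ("A",)), (("B", "F"),),
    (("D",), ("A",)), (("D",), ("B",)), (("D",), ("F",)),
    (("F",), ("A",)),
]

classes = equivalence_classes(two_sequences, k=1)
for class_prefix, members in classes.items():
    print(class_prefix, len(members))
```

With k=1 the eight sequences fall into four classes, one per first item, and each class can then be processed without looking at the others.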


ALGORITHM
SPADE(min_sup, D)
  // min_sup: minimum support
  // D: initial dataset
  F1 <- {frequent items or 1-sequences}
  F2 <- {frequent 2-sequences}
  E <- {equivalence classes [X]Θ1}
  for all [X] in E
    enumerate_frequent_seq([X], min_sup)
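The first step, computing F1, amounts to converting the database to the vertical id-list format and counting distinct sequence ids per item. The sketch below is a hedged illustration: `toy_db` is a small hypothetical database (events written as strings of single-letter items), and `to_vertical` and `frequent_items` are names invented for this example.

```python
from collections import defaultdict


def to_vertical(db):
    """Turn {sid: {eid: items}} into {item: [(sid, eid), ...]}."""
    idlists = defaultdict(list)
    for sid in sorted(db):
        for eid in sorted(db[sid]):
            for item in sorted(db[sid][eid]):
                idlists[item].append((sid, eid))
    return dict(idlists)


def frequent_items(idlists, min_sup):
    """F1: items whose id-list covers at least min_sup distinct sids."""
    return {item: idlist for item, idlist in idlists.items()
            if len({sid for sid, _ in idlist}) >= min_sup}


# Hypothetical toy database: sid -> eid -> items of that event.
toy_db = {
    1: {10: "CD", 15: "ABC", 20: "ABF", 25: "ACDF"},
    2: {15: "ABF", 20: "E"},
    3: {10: "ABF"},
    4: {10: "DGH", 20: "BF", 25: "AGH"},
}

f1 = frequent_items(to_vertical(toy_db), min_sup=2)
print(sorted(f1))  # ['A', 'B', 'D', 'F']
print(f1["D"])     # [(1, 10), (1, 25), (4, 10)]
```

Support is counted per sequence, not per occurrence, so an item that appears several times in one sequence still contributes only once.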


ALGORITHM (2)
Enumerate_frequent_seq(S, min_sup)
  for all Ai in S
    Ti <- {}
    for all Aj in S, with j ≥ i
      R <- Ai v Aj  // join
      if R satisfies min_sup
        Ti <- Ti U {R}
    end
    Enumerate_frequent_seq(Ti, min_sup)  // only when using DFS
  end
  for all non-empty Ti  // only when using BFS
    Enumerate_frequent_seq(Ti, min_sup)
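The depth-first variant of the enumeration can be approximated in Python as follows. This is a simplified sketch, not the algorithm exactly as stated above: instead of joining pairs of atoms Ai v Aj inside a class, it extends each frequent pattern with one frequent item at a time, using a temporal join (new event) or an equality join (same event) on (sid, eid) id-lists. The id-lists in `ITEMS` are illustrative assumptions, and all function names are hypothetical.

```python
def support(idlist):
    """Number of distinct sids in an id-list."""
    return len({sid for sid, _ in idlist})


def temporal_join(prefix_list, item_list):
    """Occurrences of the item strictly after the pattern's first occurrence."""
    first_end = {}
    for sid, eid in prefix_list:
        first_end[sid] = min(eid, first_end.get(sid, eid))
    return {(sid, eid) for sid, eid in item_list
            if sid in first_end and eid > first_end[sid]}


def enumerate_frequent(pattern, idlist, items, min_sup, found):
    """Depth-first extension of a frequent pattern; records supports in found."""
    found[pattern] = support(idlist)
    for item, item_list in items.items():
        # Sequence extension: pattern -> item (item starts a new event).
        extended = temporal_join(idlist, item_list)
        if support(extended) >= min_sup:
            enumerate_frequent(pattern + ((item,),), extended,
                               items, min_sup, found)
        # Event extension: add the item to the last event; items are kept in
        # sorted order so that each pattern is generated only once.
        if item > pattern[-1][-1]:
            extended = idlist & item_list
            if support(extended) >= min_sup:
                enumerate_frequent(pattern[:-1] + (pattern[-1] + (item,),),
                                   extended, items, min_sup, found)


def mine(items, min_sup):
    """Enumerate all frequent sequences from single-item id-lists."""
    found = {}
    for item, idlist in items.items():
        if support(idlist) >= min_sup:
            enumerate_frequent(((item,),), idlist, items, min_sup, found)
    return found


# Illustrative (assumed) id-lists of the frequent items, as sets of (sid, eid).
ITEMS = {
    "A": {(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)},
    "B": {(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)},
    "D": {(1, 10), (1, 25), (4, 10)},
    "F": {(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)},
}

found = mine(ITEMS, min_sup=2)
print(found[(("D",), ("B", "F"), ("A",))])  # 2
```

Each recursive call works only on the id-list of its own pattern, which mirrors the idea that every prefix class can be solved independently of the others.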


IMPLEMENTATION
- The R Project for Statistical Computing
  - R is a different implementation of the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues
- arulesSequences package


QUESTIONS

?
