The study on mining temporal patterns and related applications in dynamic social...

Post on 29-Jun-2015


Yi-Cheng Chen 陳以錚

1

Mining Temporal Pattern and Related Applications

Curriculum Vitae

Basic Information: Birthday – Aug. 31, 1978

Education: Dept. of CSE, YZU (B.S., 2000); Dept. of CS, NTUST (M.S., 2002); Dept. of CSIE, NCTU (Ph.D., 2012)

Advisors: Prof. Suh-Yin Lee (李素瑛) and Prof. Wen-Chih Peng (彭文志)

Ph. D. Dissertation: A Study on Time Interval-based Sequential Patterns Mining

2

Outline

Current Research

Temporal Pattern Mining

Social Network Analysis

Smart Home Application

Cloud Computing

3

Why Data Mining? Commercial Viewpoint

Lots of data is being collected: Web data, e-commerce, purchases at department stores, bank/credit card transactions

Computers have become cheaper and more powerful

Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management)

4

Why Data Mining? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)

remote sensors on a satellite

telescopes scanning the skies

microarrays generating gene expression data

scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data

Data mining may help scientists in classifying and analyzing data and in hypothesis formation

Data Mining

We are buried in data, but looking for knowledge

Data mining (knowledge discovery in databases): extraction of interesting knowledge (rules, regularities, patterns) from data in large databases

6

7

Temporal Pattern Mining

8

Sequential Pattern Mining

Point-based sequential pattern mining

Applications: customer analysis, network intrusion detection, finding tandem repeats in DNA sequences, ...

Simple relations between time points: three relations (before, equal, after)

Example items on the time-point axis (figure): diaper, milk, beer

With min_sup = 2, (ab)dc is a frequent sequential pattern
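As a minimal sketch of what "frequent with min_sup = 2" means for a point-based pattern such as (ab)dc, the snippet below counts how many sequences contain the pattern. The three-sequence database and the helper names `contains` and `support` are hypothetical; the slide's own database is not reproduced in the transcript.

```python
def contains(sequence, pattern):
    """True if every itemset of `pattern` is a subset of a distinct element of
    `sequence`, with the matched elements appearing in the same order."""
    pos = 0
    for itemset in pattern:
        # Greedily advance to the earliest element that covers this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db)


# Hypothetical database: each sequence is a list of itemsets ordered by time.
db = [
    [{"a", "b"}, {"d"}, {"c"}],
    [{"a", "b", "e"}, {"d"}, {"c", "f"}],
    [{"a"}, {"b"}, {"c"}],
]
pattern = [{"a", "b"}, {"d"}, {"c"}]  # the pattern (ab)dc

min_sup = 2
is_frequent = support(db, pattern) >= min_sup
```

In this toy database the third sequence never has a and b at the same time point, so only two sequences support (ab)dc, which just meets min_sup = 2.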

Interval Data Everywhere!

Interval data: data with a duration

Examples: clinical data, library data, appliance usage data

Applications: diagnosis systems, recommendation systems, smart homes

[Figure: a database (DB) serving a diagnosis system, a smart home, and recommendation]

10

Temporal Pattern Mining

Interval-based sequential pattern mining

Applications: library reader analysis, patient disease analysis, stock fluctuation, ...

Complex relations between intervals: Allen's 13 temporal relations

Example events on the time-interval axis (figure): chest pain, fever, cough

With min_sup = 4, the pattern shown in the figure is a frequent temporal pattern

11

Allen Relationship

Allen's 13 temporal relations describe the relationship between any two event intervals (a binary relation) [ACM 1983]
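To make the 13 relations concrete, here is a small sketch that classifies the relation between two intervals. The function name `allen_relation` and the relation labels are illustrative choices, and the code assumes every interval satisfies start < finish.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a = (start, finish) relative to b.

    Assumes proper intervals (start < finish). Exactly one of the 13
    relations holds for any pair.
    """
    (s1, f1), (s2, f2) = a, b
    if f1 < s2:
        return "before"
    if f2 < s1:
        return "after"
    if f1 == s2:
        return "meets"
    if f2 == s1:
        return "met-by"
    if s1 == s2 and f1 == f2:
        return "equal"
    if s1 == s2:
        return "starts" if f1 < f2 else "started-by"
    if f1 == f2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and f1 < f2:
        return "during"
    if s1 < s2 and f2 < f1:
        return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"


# Intervals written as (start, finish); the values are illustrative.
A, B, C, D, E = (1, 4), (2, 5), (2, 8), (3, 5), (5, 7)
relations = {
    "A,B": allen_relation(A, B),
    "B,C": allen_relation(B, C),
    "D,B": allen_relation(D, B),
    "B,E": allen_relation(B, E),
    "C,E": allen_relation(C, E),
}
```

Because the relation is binary, describing k intervals this way needs one relation per pair, which is the representation issue the following slides address.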

12

Real example: some temporal patterns generated from the NCTU library

13

Motivation

Representation: Allen's relations are binary relations; expressing the relations among more than 3 intervals raises an ambiguity problem and a space usage problem

Efficient algorithms: mining temporal patterns *; mining closed temporal patterns; incrementally maintaining discovered temporal patterns and closed temporal patterns

Related applications: social networks, smart homes

14

Proposed Method

Coincidence representation: segment intervals into disjoint slices; nonambiguous and compact representation

Endpoint representation: global information of a sequence; nonambiguous and compact representation

TPMiner (Temporal Pattern Miner): pattern-growth approach, without candidate generation and test

Two components: RPrefixSpan and pruning strategies

15

Coincidence representation

Segment intervals into disjoint slices

Four kinds of event slice: start slice (+), intermediate slice (*), finish slice (-) and intact slice (no mark)

Coincidence: slices occurring simultaneously

Space usage (for a k-pattern): best k, worst 2k

Example (figure: event intervals A, B, C, D, E and their coincidences)

coincidence representation: (A+) (A- B+) (B-) (C+) (C* D) (C-) (E)

16

Incision strategy

A data structure, endtime_list

Sort and merge

Trace endtime_list one by one

Event intervals (symbol, start time, finish time): (A, 1, 4), (B, 2, 5), (C, 2, 8), (D, 3, 5), (E, 5, 7)

endtime_list before sorting (type, symbol, time): s C 2, s A 1, s B 2, f C 8, s D 3, s E 5, f E 7, f D 5, f A 4, f B 5

endtime_list after sort and merge (type, symbol, time): s A 1, s BC 2, s D 3, f A 4, f BD 5, s E 5, f E 7, f C 8

coincidence representation: (A+) (B+ C+) (A- D+) (B- D-) @ (E) (C-)
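The sort-and-merge step on endtime_list can be sketched as follows. The function name `build_endtime_list` and the tuple layout (type, symbols, time) are assumptions made for illustration; the tie-breaking rule (finish entries before start entries at the same time) is inferred from the merged list on the slide, where fBD 5 precedes sE 5.

```python
from itertools import groupby


def build_endtime_list(intervals):
    """Build a sorted, merged endpoint list from (symbol, start, finish) triples.

    Each interval contributes an "s" (start) and an "f" (finish) entry.
    Entries are sorted by time, with "f" before "s" on ties, and entries
    sharing the same type and time are merged into one.
    """
    entries = []
    for symbol, start, finish in intervals:
        entries.append(("s", symbol, start))
        entries.append(("f", symbol, finish))
    # "f" sorts before "s" alphabetically, which gives finish-before-start on ties.
    entries.sort(key=lambda e: (e[2], e[0], e[1]))
    merged = []
    for (time, kind), group in groupby(entries, key=lambda e: (e[2], e[0])):
        symbols = "".join(e[1] for e in group)
        merged.append((kind, symbols, time))
    return merged


intervals = [("A", 1, 4), ("B", 2, 5), ("C", 2, 8), ("D", 3, 5), ("E", 5, 7)]
endtime_list = build_endtime_list(intervals)
```

For the five intervals on the slide this yields sA 1, sBC 2, sD 3, fA 4, fBD 5, sE 5, fE 7, fC 8, which can then be traced entry by entry to emit the coincidences.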

17

Endpoint Representation

Sequence of ordered time points of events; +: start time, -: finish time

Nonambiguous

Space usage (for a k-pattern): 2k

Example (figure: event intervals A, B, C, D): A+ (B+ C+) A- (B- C- D+) D-
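A minimal sketch of building an endpoint representation from intervals is shown below. The concrete interval times are hypothetical, chosen only so that A, B, C and D start and finish in the order listed on the slide; grouping simultaneous endpoints in parentheses and ordering them alphabetically inside a group are likewise assumptions.

```python
from collections import defaultdict


def endpoint_representation(intervals):
    """Turn (symbol, start, finish) triples into an ordered endpoint sequence.

    Start points are written symbol+ and finish points symbol-. Endpoints that
    share a time are grouped in parentheses.
    """
    by_time = defaultdict(list)
    for symbol, start, finish in intervals:
        by_time[start].append(symbol + "+")
        by_time[finish].append(symbol + "-")
    parts = []
    for time in sorted(by_time):
        points = sorted(by_time[time])
        if len(points) == 1:
            parts.append(points[0])
        else:
            parts.append("(" + " ".join(points) + ")")
    return " ".join(parts)


# Hypothetical times for the four intervals.
example = [("A", 1, 3), ("B", 2, 4), ("C", 2, 4), ("D", 4, 5)]
sequence = endpoint_representation(example)
```

Every interval contributes exactly two endpoints, which is where the 2k space usage for a k-pattern comes from.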

18

Example Database

19

TPMiner – RPrefixSpan (1/2)

Every item is disjoint, so the relations among slices are simple: before, equal and after (like time-point data)

RPrefixSpan: borrows the idea of PrefixSpan; scan the local database to find frequent slices; append and extend the pattern; project the database

Pruning strategy: reduce the search space with pre-pruning and post-pruning

20

TPMiner – RPrefixSpan (2/2)

[Figure: RPrefixSpan flow] Scan database D to find the frequent items e1, e2, ..., ei, ..., en; transform the sequences and project the database into D|e1, D|e2, ..., D|ei, ..., D|en; recursively project the database and append and extend the pattern (D|e1..., D|e2..., ..., D|en...); collect all mined patterns as the frequent temporal patterns
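The scan, extend and project loop can be illustrated with a plain PrefixSpan sketch over sequences of single items. This is not RPrefixSpan itself: it ignores slices, coincidences and the pruning strategies, and the function name and toy database are hypothetical.

```python
def prefixspan(db, min_sup):
    """Return {pattern: support} for all frequent sequential patterns.

    Simplified setting: each sequence is a list of single items, and a
    pattern is a tuple of items that appear in that order.
    """
    results = {}

    def mine(prefix, projected):
        # Scan the (projected) database to count local items once per sequence.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_sup:
                continue
            # Append the frequent item to extend the pattern.
            pattern = prefix + (item,)
            results[pattern] = counts[item]
            # Project: keep what follows the first occurrence of the item.
            suffixes = [seq[seq.index(item) + 1:] for seq in projected if item in seq]
            mine(pattern, suffixes)

    mine((), [list(seq) for seq in db])
    return results


toy_db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
patterns = prefixspan(toy_db, min_sup=2)
```

No candidate patterns are generated and tested against the whole database; each recursion only looks at the projected suffixes, which is the pattern-growth idea the slides borrow.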

21

Pruning Strategy – Pre-pruning

Utilize the concept of slice and coincidence: start slices and finish slices occur in pairs

Only the frequent finish slices that have the corresponding start slices in their prefixes need to be projected

Non-promising projections can be pre-pruned

Example (figure): scanning the projected database D|A+ finds the frequent local slices A-, B+, B-, C. The extensions A+ A-, A+ B+ and A+ C are projected (D|A+ A-, D|A+ B+, D|A+ C); A+ B- is a non-qualified pattern because B+ is not in the prefix, so D|A+ B- is not projected
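A minimal sketch of the pre-pruning check follows, assuming slices are written as strings such as "A+" (start), "A-" (finish) and "C" (no mark). The function name `pre_prune` and this string encoding are illustrative assumptions, not the authors' implementation.

```python
def pre_prune(prefix, local_slices):
    """Keep only the frequent local slices that are worth projecting.

    A finish slice X- is kept only if its start slice X+ already appears in
    the prefix and has not been finished there. All other slices are kept.
    """
    started = {s[:-1] for s in prefix if s.endswith("+")}
    finished = {s[:-1] for s in prefix if s.endswith("-")}
    open_events = started - finished
    kept = []
    for s in local_slices:
        if s.endswith("-") and s[:-1] not in open_events:
            continue  # non-qualified: finish slice without a pending start
        kept.append(s)
    return kept


to_project = pre_prune(["A+"], ["A-", "B+", "B-", "C"])
```

With the prefix A+, the finish slice B- is skipped because no B+ precedes it, so the projection for that extension is never built.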

22

Pruning Strategy – Post-pruning

Utilize the concept of slice and coincidence: a start slice always appears before its finish slice

Only collect the significant postfixes: with respect to a prefix, all finish slices in a significant postfix have corresponding start slices in the prefix

The projected database can be post-pruned

Example (figure): a coincidence database D

S1: (B+)(D+)(E)(D-)(B-)

S2: (B+)(B- D+)(E)(D-)

S3: (B)(A)(D+)(E)(D-)

...

Projected database D|E

S1: (D-)(B-)

S2: (D-)

S3: (D-)

These postfixes are insignificant sequences
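The post-pruning rule can be sketched as below, using the three sequences from the slide. The projection here is deliberately simplified (each prefix slice is matched to the earliest later coincidence and the remainder of that coincidence is ignored), and the function names are hypothetical.

```python
def project(sequence, prefix):
    """Return the postfix of `sequence` after matching `prefix`, or None.

    Simplification: each prefix slice is matched greedily to the earliest
    coincidence after the previous match.
    """
    pos = 0
    for slice_ in prefix:
        while pos < len(sequence) and slice_ not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return None
        pos += 1
    return sequence[pos:]


def is_significant(postfix, prefix):
    """True if every finish slice in the postfix has its start slice in the prefix."""
    started = {s[:-1] for s in prefix if s.endswith("+")}
    return all(
        s[:-1] in started
        for coincidence in postfix
        for s in coincidence
        if s.endswith("-")
    )


def post_prune(db, prefix):
    """Project every sequence and keep only the significant postfixes."""
    kept = {}
    for sid, seq in db.items():
        postfix = project(seq, prefix)
        if postfix is not None and is_significant(postfix, prefix):
            kept[sid] = postfix
    return kept


coincidence_db = {
    "S1": [["B+"], ["D+"], ["E"], ["D-"], ["B-"]],
    "S2": [["B+"], ["B-", "D+"], ["E"], ["D-"]],
    "S3": [["B"], ["A"], ["D+"], ["E"], ["D-"]],
}
after_e = post_prune(coincidence_db, ["E"])
after_d_e = post_prune(coincidence_db, ["D+", "E"])
```

For the prefix E, every postfix contains D- without a D+ in the prefix, so the whole projected database is discarded, as in the slide's example.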

23

Experimental Results (1/2)

[Figure: comparison of H-DFS, ARMADA, TPrefixSpan, IEMiner, TPMiner-CR and TPMiner-ER for minimum support (%) from 1 down to 0.5]

(a) The performance of six algorithms: execution time (sec), D200k – C40 – N10k

(b) The number of temporal patterns: number of generated patterns, D200k – C40 – N10k

Memory usage (MB), N10k – C20 – N10k

24

Experimental Results (2/2)

[Figure: execution time (sec) of TPMiner-CR for minimum support (%) from 1 down to 0.5, with and without each pruning strategy]

(a) Influence of the pre-pruning strategy on performance: TPMiner-CR vs. TPMiner-CR without pre-pruning strategy

(b) Influence of the post-pruning strategy on performance: TPMiner-CR vs. TPMiner-CR without post-pruning strategy

(c) Influence of the subset-pruning strategy on performance: TPMiner-CR vs. TPMiner-CR without subset-pruning strategy

(d) Influence of all proposed pruning strategies on performance: TPMiner-CR vs. TPMiner-CR without any pruning strategy

25

Related Applications

26

Smart Home Application

[Figure: smart home framework] (1) Sensor data log; (2) Pattern mining; (3) Behavior detection; (4) Abnormal detection; (5) System alarm and remote control

Figure labels: home, cloud database, usage patterns (P1, P2, P3, ...), current behavior and usage pattern (air conditioner, light), alarm, home server, remote control, D-Link controller (on/off switches per device ID), light

27

Dynamic Social Network (1/2)

Dynamic social network: a sequence of interaction graphs; nodes and edges vary with time

A lossless transformation: from a graph sequence to an interval sequence

Reduce the complexity of the graph

Avoid isomorphism testing

[Figure: a graph sequence G1, G2, G3, G4, ... over nodes A, B, C, D, E at times t1, t2, t3, ..., transformed into an interval table with columns SID, event symbol, start time, finish time, and event sequence]

Dynamic social network analysis: pattern mining, classification, recommending systems, network sampling, clustering
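One way to picture the graph-sequence-to-interval-sequence transformation is to treat each edge as an event and each maximal run of consecutive snapshots containing that edge as an interval. This encoding is an assumption made for illustration (the slides do not spell out the exact mapping), it covers edges only, and the snapshot data and function name are hypothetical.

```python
def graphs_to_intervals(snapshots):
    """Convert a list of edge sets (one per time step, starting at time 1)
    into (event symbol, start time, finish time) triples.

    An edge present in consecutive snapshots t1..t2 becomes one interval
    with start t1 and finish t2 (both inclusive).
    """
    intervals = []
    open_since = {}
    for t, edges in enumerate(snapshots, start=1):
        edges = {tuple(sorted(e)) for e in edges}
        # Close intervals of edges that disappeared in this snapshot.
        for edge in list(open_since):
            if edge not in edges:
                intervals.append(("".join(edge), open_since.pop(edge), t - 1))
        # Open intervals for edges that just appeared.
        for edge in edges:
            open_since.setdefault(edge, t)
    last = len(snapshots)
    for edge, start in open_since.items():
        intervals.append(("".join(edge), start, last))
    return sorted(intervals, key=lambda iv: (iv[1], iv[2], iv[0]))


# Hypothetical graph sequence G1..G4 given as sets of undirected edges.
snapshots = [
    {("A", "B"), ("A", "D")},
    {("A", "B"), ("C", "D")},
    {("A", "B")},
    {("C", "D")},
]
interval_sequence = graphs_to_intervals(snapshots)
```

Once the network is an interval sequence, interval-based pattern mining can be applied directly, without comparing graphs for isomorphism.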

28

Dynamic Social Network (2/2)

29

Social Network Analysis

30

Social Network Analysis

A graph representation: nodes and edges

31

Influence Maximization

32

Advertisement Budget

Advertisement spending on worldwide social networking sites: 2008, $23.3 million; 2010, $23.6 billion; 2011, almost $25.5 billion

33

Influence Maximization

Word-of-mouth effect in social networks

Influence maximization problem: select initial users (seeds) so that the number of users that adopt the product or innovation is maximized

[Figure: seed selection in a social network]

34

Motivation

Characteristic of social networks: community structure

Community and degree heuristic (CDH): utilize community information and avoid influence overlapping

[Figure: example social networks with numbered nodes grouped into communities]
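The intuition behind combining community and degree can be sketched by comparing two seed-selection rules on a toy graph: plain top-k degree, and one highest-degree node from each of the k largest communities. This is only an illustration of "avoid influence overlapping", not the CDH algorithm itself; the graph, the community assignment, the function names and the one-hop coverage measure are all hypothetical.

```python
def degree_seeds(adj, k):
    """Top-k nodes by degree (ties broken by node id)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:k]


def community_degree_seeds(adj, communities, k):
    """One highest-degree node from each of the k largest communities."""
    largest = sorted(communities, key=lambda c: (-len(c), min(c)))[:k]
    return [min(c, key=lambda v: (-len(adj[v]), v)) for c in largest]


def covered(adj, seeds):
    """Seeds plus their direct neighbors, a crude stand-in for influence spread."""
    nodes = set(seeds)
    for s in seeds:
        nodes |= adj[s]
    return nodes


# Toy undirected graph with two communities.
adj = {
    1: {2, 3, 4, 5}, 2: {1, 3, 4}, 3: {1, 2}, 4: {1, 2}, 5: {1},
    6: {7, 8}, 7: {6}, 8: {6},
}
communities = [{1, 2, 3, 4, 5}, {6, 7, 8}]

by_degree = degree_seeds(adj, 2)
by_community = community_degree_seeds(adj, communities, 2)
```

Here the two highest-degree nodes sit in the same community and reach the same five nodes, while picking one seed per community reaches all eight.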

35

Proposed Algorithm – CDH

Framework of CDH

36

CDH – Adjust Step

Adjust the selected fundamental nodes

Seeds selected from a large community may activate more inactive nodes than seeds from a small community

Replace the fundamental node in a small community if doing so activates more inactive nodes

Finally, output the result as the selected seed nodes

[Figure: communities C1, C2, C3, ..., Ck; the largest-degree node in Ck is deleted and replaced by the second-largest-degree node in C1]

37

Experimental Results - Facebook

38

Dynamic Recommendation

Recommendation System

Predict ratings or preferences using a model built from the characteristics

39

(a) amazon.com (b) youtube.com

Collaborative Filtering (CF)

1. Calculate the similarity between the active user and the other users (Pearson's correlation, cosine similarity, conditional probability, etc.)

2. Predict the rating of items that have not been rated by the active user

3. Output the top-k items according to the predicted ratings

40

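Example (figure: user-item rating table over items i1, i2, i3, i4 with the average rating of each user)

A: ratings 4, 1, 4; avg. of user 3

B: ratings 2, 4; avg. of user 3

C: ratings 3, 3, 2; avg. of user 2

$w_{a,b} = \frac{(1-3)(2-3)}{\text{normalize}}$

$w_{a,c} = \frac{(1-3)(3-2) + (4-3)(3-2)}{\text{normalize}}$

$p_{a,i_4} = 3 + \frac{(4-3)\,w_{a,b} + (2-2)\,w_{a,c}}{\text{normalize}}$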
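The similarity and prediction steps of user-based CF can be sketched as follows. The rating dictionary is a hypothetical layout loosely modeled on the small example (the exact cell positions and the normalization used on the slide are not recoverable from the transcript), the Pearson weight is computed over co-rated items using each user's overall mean, and the prediction is normalized by the sum of absolute weights.

```python
from math import sqrt


def mean(ratings):
    """Average of one user's ratings."""
    return sum(ratings.values()) / len(ratings)


def pearson(ra, rb):
    """Pearson-style weight between two users over their co-rated items."""
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    ma, mb = mean(ra), mean(rb)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = sqrt(sum((ra[i] - ma) ** 2 for i in common)
               * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0


def predict(ratings, active, item):
    """Predict the active user's rating of `item` from the other users."""
    base = mean(ratings[active])
    num = den = 0.0
    for user, r in ratings.items():
        if user == active or item not in r:
            continue
        w = pearson(ratings[active], r)
        num += w * (r[item] - mean(r))
        den += abs(w)
    return base + num / den if den else base


# Hypothetical ratings: user A has not rated i4 yet.
ratings = {
    "A": {"i1": 4, "i2": 1, "i3": 4},
    "B": {"i2": 2, "i4": 4},
    "C": {"i2": 3, "i3": 3, "i4": 2},
}
w_ab = pearson(ratings["A"], ratings["B"])
p_a_i4 = predict(ratings, "A", "i4")
```

Ranking the unrated items of the active user by their predicted ratings and keeping the first k gives the top-k output of step 3.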

41

Motivation

Dynamic! Dynamic! Dynamic!

Why do we need dynamics? All things vary with time

Dynamic collaborative filtering considers the influence of time in the calculation

Without considering time, the prediction results might be out of date

42

Dynamic Similarity based on Collaborative Filtering (DSCF)

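[Figure: rating logs in the form user -> item : rating (time)]

Ratings collected from t0 to t1: 1 -> 1193 : 5 (2012.5.18); 5 -> 661 : 3 (2012.3.5); 3 -> 914 : 3 (2012.6.27); 1 -> 3408 : 4 (2012.3.18); ...

Ratings collected from t1 to t: 9 -> 6610 : 5 (2012.7.8); 2 -> 6610 : 3 (2012.7.15); ...

Each rating database yields a similarity matrix for its period; the accumulated similarity matrix is decayed by (1-α) each time a new period is added:

$Msim_{t_1} = (1-\alpha)\,Msim_{t_0} + Msim_{t_0 \to t_1}$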
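A minimal sketch of a time-decayed similarity update is given below. It assumes the accumulated similarity is multiplied by (1 - α) and the similarity computed from the newest period is then added; this form is an assumption read off the (1-α) labels in the figure, and the function name, the pair-keyed dictionaries and the numbers are hypothetical.

```python
def update_msim(accumulated, period, alpha):
    """Decay the accumulated similarities by (1 - alpha) and add the
    similarities computed from the newest period.

    Both arguments map a user pair to a similarity value; a missing pair
    counts as 0.
    """
    pairs = set(accumulated) | set(period)
    return {
        pair: (1 - alpha) * accumulated.get(pair, 0.0) + period.get(pair, 0.0)
        for pair in pairs
    }


alpha = 0.25  # similarity decay value (hypothetical setting)
msim_t0 = {("A", "B"): 1.0}
period_t0_t1 = {("A", "B"): 0.2, ("A", "C"): 0.5}
msim_t1 = update_msim(msim_t0, period_t0_t1, alpha)

# Two updates with no new ratings decay an old similarity by (1 - alpha) twice.
decayed_twice = update_msim(update_msim({("A", "B"): 1.0}, {}, 0.5), {}, 0.5)
```

Under this assumed form, a similarity observed n periods ago contributes with weight (1 - α)^n, so older behavior fades instead of being dropped abruptly.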

43

Advanced DSCF

α (similarity decay value, SDV) might not be consistent for all time

Each user might have his/her own SDV at different time points

Feedback on predicted values from actual values

44

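$A_{a,i} = \bar{r}_a + \frac{\sum_{j=1}^{k} (r_{j,i} - \bar{r}_j)\,[(1-\alpha)\,sim_{a,j} + \alpha\,msim_{a,j}]}{\sum_{j=1}^{k} [(1-\alpha)\,sim_{a,j} + \alpha\,msim_{a,j}]}$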

45

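[Figure: feedback loop] The active user's unknown rating is predicted and used to recommend items; the actual rating $A_{a,i}$ is fed back

$p_{a,i} = \bar{r}_a + \frac{\sum_{j=1}^{k} (r_{j,i} - \bar{r}_j)\,[(1-\alpha)\,sim_{a,j} + \alpha\,msim_{a,j}]}{\sum_{j=1}^{k} [(1-\alpha)\,sim_{a,j} + \alpha\,msim_{a,j}]}$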

Experimental Results

46

47