Tutorial 2 Finding Repeated Structure in Time Series ...mueen/Tutorial/SDM2015Tutorial2.pdfWe have a...

Abdullah MueenUniversity of New Mexico, USA

Eamonn KeoghUniversity of California Riverside, USA

Finding Repeated Structure in Time Series: Algorithms and Applications

Tutorial 2

Funding by NSFIIS-1161997 II

Slides available athttp://www.cs.unm.edu/~mueen/Tutorial/SDM2015Tutorial2.pdf

(we will start 5 min late to allow folks to find the room)

1

Tutorial Structure

• I will start with applications and talk about algorithms after that.

• There will be four Q&A segments. Please hold your question till the next segment.

• There is a feedback form available. Negative/positive, anonymous/known feedbacks are welcome.

• There will be a break at 5:00PM for 10 minutes.

2

What are Time Series?

A time series is a collection of observations made sequentially in time.

25.175025.225025.250025.250025.275025.325025.350025.350025.400025.400025.325025.225025.200025.1750

..

..24.625024.675024.675024.625024.625024.625024.675024.7500

0 50 100 150 200 250 300 350 400 450 50023

24

25

26

27

28

29

3

Repeated Pattern (Motif)

Find the subsequences having very high similarity to each other.

0 2000 4000 6000 8000 100000

102030

0 100 200 300 400 500 600 700 800

Finding Motifs in Time Series, Jessica Lin, Eamonn Keogh, Stefano Lonardi, Pranav Patel, KDD 2002

时间序列中的重复模式

4

General Outline

• Applications (50 minutes)

– As Subroutines in Data Mining

– In Other Scientific Research

• Algorithms (100 minutes)

– Uni-dimensional

– Multi-dimensional

– Open Problems

5

Applications Outline

• Applications– As Subroutines in Data Mining

• Never Ending Learning• Time Series Clustering• Rule Discovery• Dictionary Building

– In Other Scientific Research• Data center chiller management• Worm locomotion analysis• Physiological Prediction• Activity recognition

– Motifs in Other Data-types• Audio• Shapes• Motion

6

…Y dijo Dios: Sea la luz; y fue la luz. Y vio Dios que la luz era buena . Y llamó Dios a …

…And God said, “Let there be light”; and there was light. And God saw the light, that it was good. And God …

If you have parallel texts, then over time you can learn a dictionary with high accuracy.

Motifs allow us to learn, forever, without an explicit teacher…

7

…Y dijo Dios: Sea la luz; y fue la luz. Y vio Dios que la luz era buena . Y llamó Dios a …

…And God said, “Let there be light”; and there was light. And God saw the light, that it was good. And God …

Suppose however that the unknown “language” is not discrete, but real-valued time series? In this case, repeated pattern discovery can help*…

*Yuan Hao, Yanping Chen et al (2013). Towards Never-Ending Learning from Time Series Streams. SIGKDD 2013

Note the mapping is non-linear, the learning algorithms in this domain are non-trivial.

If you have parallel texts, then over time you can learn a dictionary with high accuracy.


8


This dataset contains standard IADL housekeeping activities (vacuuming, ironing, dusting, brooming,, watering plants etc). We have a discrete (binary) “text” that notes if the hand is near a cleaning instrument, and a real-valued accelerometer “text”

P (X-axis acceleration of wrist)

0 1000 2000 3000

offon

Glove

B (RFID tag)

offon

Iron(omitted)

5 seconds

Binary “text”

Real-valued “text”

9

We can run motif discovery on the time series stream. If we find motifs, we can see if they correlate with the discrete streams…

P (X-axis acceleration of right wrist)

0 1000 2000 3000

offon

Glove

B (RFID tag)

C1 observed C1 observed C1 observed

Glove 1/1

Iron 0/1

(omitted)

offon

Iron

(omitted)

Glove 2/2

Iron 0/2

(omitted)

Glove 3/3

Iron 0/3

(omitted)

5 seconds

In this snippet, the motifs seem to correlate with the presence of a glove…

10

How well does this work?

Over a hour of activity, we learn to recognize a behavior in the time series that indicates the user is putting on a glove.

0 200 4000 200 400

Learned concept C1

True positives

False positives

Discovered pattern

0 minutes

0

0.5

1

iron P(C1=fan) = 0.07

P(C1=iron) = 0.23

P(C1=glove) = 0.70

fan

P(X-axis acceleration of right wrist)Events shown in last slide take place here

55 minutes

glove

Note: There are false positives, but we considered only a single axis for simplicity. 11

200 400 600 800 1000 12000

And how would we evaluate our answer?

Motifs allow us to cluster subsequences of a time series…

12Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, and Scott Evans (2011). Time Series Epenthesis: Clustering Time Series Streams

Requires Ignoring Some Data. ICDM 2011

== Poem ==In a sort of Runic rhyme,To the throbbing of the bells--Of the bells, bells, bells,To the sobbing of the bells;Keeping time, time, time,As he knells, knells, knells,In a happy Runic rhyme,To the rolling of the bells,--Of the bells, bells, bells--To the tolling of the bells,Of the bells, bells, bells, bells,Bells, bells, bells,--To the moaning and the groan-ing of the bells.

Data: last 30 seconds from 4-min poem, The Bells by Edgar Allan PoeEdgar Allan Poe

200 400 600 800 1000 12000



13

== Poem ==In a sort of Runic rhyme,To the throbbing of the bells--Of the bells, bells, bells,To the sobbing of the bells;Keeping time, time, time,As he knells, knells, knells,In a happy Runic rhyme,To the rolling of the bells,--Of the bells, bells, bells--To the tolling of the bells,Of the bells, bells, bells, bells,Bells, bells, bells,--To the moaning and the groan-ing of the bells.

== Text in each clusters ==bells, bells, bells,Bells, bells, bells,

Of the bells, bells, bells,Of the bells, bells, bells--

To the throbbing of the bells--To the sobbing of the bells;To the tolling of the bells,

To the rolling of the bells,--To the moaning and the groan-

time, time, time,knells, knells, knells,

sort of Runic rhyme,groaning of the bells.


200 400 600 800 1000 12000


14

Key observations that make this possible:

• Time Series Motifs!

• We are willing to allow some data to be unexplained by the clustering

• We score the possible clustering's with MDL, this is parameter-free!

• Allowing the clusters to be of different lengths/sizes

Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, and Scott Evans (2011). Time Series Epenthesis: Clustering Time Series Streams

Requires Ignoring Some Data. ICDM 2011

200 400 600 800 1000 12000


15

Motifs are useful, but can we predict the future?

0 1000 2000 3000 4000 5000 6000 7000 8000-30

-25

-20

-15

-10

What happens next?

Prediction vs. Forecasting (informal definitions)

Forecasting is “always on”, it constantly predicts a value say, two minutes out (we are not doing this)

Prediction only make a prediction occasionally, when it is sure what will happen next

(unpublished work, email Dr. Keogh for preprint)

16

Why Predict the (short-term) Future?If a robot can predict that is it about to fall, it may be able to..

• Prevent the fall• Mitigate the damage of the fall

More importantly, if the robot can predict a human’s actions

• The robot could catch the human!• This would allow more natural human/robot interaction.• Real time is not fast enough for interaction! We need to be a half second before real time.•

• Other examples:• Predict a car crash, tighten seatbelts, apply brakes• Predict the next spoken word after ‘data’ is ‘mining’, then begin prefetching WebPages..• etc

17

Previous attempts at this have largely failed…

However, we can do this, and time series motifs are the key tool

The rule discovery technique will use:• Time Series Motifs• MDL (minimum description length)• Admissible speed-up techniques (not discussed here)

18

Let us start by finding motifs

0 1000 2000 3000 4000 5000 6000 7000 8000 9000-30

-25

-20

-15

-10

0 20 40 60 80 100

First occurrence

Second occurrence

19

We can convert the motifs to a rule

0 1000 2000 3000 4000 5000 6000 7000 8000 9000-30

-25

-20

-15

-10

0 20 40 60 80 100

0 20 40 60 0 20 40

t1 = 7.58

maxlag = 0

We can use the motif to make a rule…

IF we see thisshape, (antecedent)

THEN we see thatshape, (consequent)

within maxlag time

The Euclidean distance between thisshape and the observed window must be within a threshold t1 = 7.58

20

We can monitor streaming data with our rule..

8000 9000

0 20 40 60 0 20 40

t1 = 7.58

maxlag = 0

21

The rule gets invoked…

0 20 40 60 0 20 40

t1 = 7.58

maxlag = 0

5685 5735 5785 5835 5885

rule fires here predicting this shape’s occurrence

22

It seems to work!

0 20 40 60 0 20 40

t1 = 7.58

maxlag = 0

5685 5735 5785 5835 5885

rule fires here predicting this shape’s occurrence

23

What is the ground truth?

0 1000 2000 3000 4000 5000 6000 7000 8000 9000-30

-25

-20

-15

-10

0 20 40 60 80 100

at door chambermy

The first verse of The Raven by Poe in MFCC space

Once upon a midnight dreary, while I pondered weak and weary.. ..rapping at my chamber door……

The phrase “at my chamber door” does appear 6 more times, and we do fire our rule correctly each time, and have no false positives.

What are we invariant to? • Who is speaking? Somewhat, we can handle other males, but females are tricky.• Rate of speech? To a large extent, yes.• Foreign accents? Sore throat? etc

0 20 40 60 0 20 40

t1 = 7.58

maxlag = 0

24

Why we need the Maxlagparameter

Here the maxlag depends on the number of floors we have in our building.

We can hand-edit this rule to generalize for short buildings to tall buildings

Can physicians edit medical rules to generalize from male to female…

t1 = 0.4

maxlag = 4 sec 0 40 80 120-11

-10

-9 waiting for elevator

stepping inside

elevator moves up

elevator stops

walking away

y-

acce

lera

tio

n

t1 = 0.4

maxlag = 4 sec 0 40 80 120-11

-10

-9 waiting for elevator

stepping inside

elevator moves up

elevator stops

walking away

y-

acce

lera

tio

n

25

maxlag = 20 minutes

0 120 180015500 16000 16500 17000

IF we see a Clothes Washer used

THEN we will see Clothes Dryer used within 20 minutes27

This works, really!

Insect Behavior AnalysisBeet Leafhopper (Circulifer tenellus)

plant membraneStylet

voltage source

input resistor

V

0 50 100 150 2000

1

2

to insectconductive glue

voltage reading

to soil near plant

0 1 2 3 4 5 6 7 8x 104

50010001500200025003000350040004500

Additional examples

of the motif

0 50 100 150 200 250 300 350 400-3

-2

-1

0

1

2

3

4

5

6

Instance at 20,925

Instance at 25,473

The electrical penetration graph or EPG is a system used by biologists to study the interaction of insects with plants.

15 minutes of EPG recorded on Beet Leafhopper

As a bead of sticky secretion, which is by-product of sap feeding, is ejected, it temporarily forms a highly conductive bridge between the insect and the plant.

Exact Discovery of Time Series Motifs. A Mueen, et al. SDM, 2009.29

900 1000 1100 1200 900 1000 1100 1200-5

0

5

Length :412600 800 1000 600 800 1000

-5

0

5

Length :565

7000 7200 7400 7600 7000 7200 7400 7600-5

0

5

Length :7995200 5400 5600 5800 5200 5400 5600 5800

-5

0

5

Length :799

3900 4000 4100 4200 3900 4000 4100-5

0

5

Length :3344800 4900 5000 5100 4800 4900 5000 5100

-5

0

5

Length :384

N

S

N

Insect Behavior Analysis

More motifs reveal different feeding patterns of Beet Leafhopper. 30






31

Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining

HP Labs with Virginia Tech

“Our primary goal is to link the time series temperature data gathered from chiller units to high level sustainability characterizations… thus using time series motifs as a crucial intermediate representation to aid in data reduction.”

“switching from motif 8 to motif 5 gives us a nearly $40,000 in annual savings!” Patnaik et al. SIGKDD09

32

A dictionary of behavioral motifs reveals clusters of genes affecting C. elegans locomotion

Brown et al. PNAS 2013; 110(2): 791–796.

Laboratory of Molecular Biology, Cambridge, United Kingdom

Goal: Detect genotype by using the locomotion only.Convert postures to four dimensional time series.

33

Brown A E X et al. PNAS 2013;110:791-796

A dictionary of behavioral motifs reveals clusters of genes affecting C. eleganslocomotion

34

Variability in motif structure is lower in juvenile Directed than in Undirected and similar to that in adult song.

35Social performance reveals unexpected vocal competency in young songbirdsSatoshi Kojima and Allison J. Doupe, PNAS 2011, 108 (4) 1687-1692.

Amplitude Envelop

Motif discovery in physiological datasets: A methodology for inferring predictive elements

University of Michigan and MIT

We evaluated our solution on a population of patients who experienced sudden cardiac death and attempted to discover electrocardiographic activity that may be associated with the endpoint of death. To assess the predictive patterns discovered, we compared likelihood scores for time series motifsin the sudden death population…

Zeeshan Syed et al. TKDD 201036

Constrained Motif Discovery in Time SeriesToyoaki Nishida, Kyoto University

“we use time series motifs to find gesture patterns with applications to robot-human interactions” Okada, Izukura and Nishida 2011

37

Discovering Characteristic Actions from On-Body Sensor Data

David Minnen, Thad Starner, Irfan Essa, and Charles Isbell, Georgia Tech

Our algorithm successfully discovers motifs that correspond to the real exercises with a recall rate of 96.3% and overall accuracy of 86.7% over six exercises and 864 occurrences.

Minnen et al. Symp. Wearable Computers 200638

Motifs can Spot Dance Moves…

x 104

AB

C

EF

A AA BCC

DDDDDDEE

FF

0 0.5 1 1.5 2 2.5 3

0/20/22/4

1/41/21/2

0/30/40/2

0/40/40/4

Hip

Hand

Arm

Legxyz

xyz

xyz

xyz

Data: H. Pohl et al. SMC 2010Algorithm: Mueen, ICDM 2013

Step Action

A Side steps with no arm movement

B Rock steps sideways without arm movement

C Rock steps sideways with arm movement

D Side steps with arm movement

E Side steps with arms up in the air

F Standing still with head bopping

Motifs are from the same dance steps or the same transitions 86% of the time.

Choreography

Time

39






42

Motifs in AudioMice calls are inaudible and have significant noise

Manual inspection over temporal signal is impossible

Features like MFCC are not good for animal song

Just use the raw spectrogram to find repeated calls

43Yuan Hao et al. Parameter-Free Audio Motif Discovery in Large Data Archives. ICDM 2013: 261-270

Projectile shapes

Petroglyphs

Lampasas River Cornertang

Castroville Cornertang

0 50 100 150 200 250 300 350

Algorithm detects a rare cornertang segment – an object that has long intrigued anthropologists.

Algorithm detects similar petroglyphs drawn across continents and centuries

Motifs in Shapes

44

Motifs in Motion

Two motion can be stitched together by transiting from one motif to the other, a very useful technique for motion synthesis.

45Yankov et al. Detecting time series motifs under uniform scaling. KDD 2007

Mueen et al. A disk-aware algorithm for time series motif discovery. Data

Min. Knowl. Discov. 2011.

Time Series Motifs have 1,000 of Uses

• ..for discovering motifs in the music data is called the Mueen-Keogh (MK) algorithm.. Cabredo et al. 2011

• we apply the MK motif algorithm to time series retrieved from seismicsignals… Cassisi et al 2012

• we take motif developed by Keogh in order to support a medical expert in discovering interesting knowledge. Kitaguchi.

• for the problem of estimation of Micro-drilled Hole Wall of PWBs we take the Motif method developed by Keogh... Toshiki et al. (fabrication)

• the most efficient motif provided a power savings of 41 This translates to an annual reduction of 287 tons of CO2. Watson InterPACK09.

• We use Keogh's Motifs for unsupervised discovery of abnormal humanbehavior in multi-dimensional time series data... Vahdatpour SDM 2010.

• variability of behavior, using motifs, provides more consistent groupings of households across different clustering algorithms… Ian Dent 2014

46

Questions and Comments

47

Algorithms Outline

• Algorithms– Definition, Distance Measures and Invariances– Exact Algorithms

• Fixed Length• Enumeration of All length• K-motif Discovery• Online Maintenance

– Approximate Algorithms• Random Projection Algorithm

– Multi-dimensional Motif Discovery– Open Problems

48

Definition of Time Series Motifs

1. Length of the motif

2. Support of the motif

3. Similarity of the Pattern

4. Relative Position of the Pattern

20 40 60 80 100 120 140 160 180 200-2

-1

0

1

2

49

Distance Measures

• The choices are

– Euclidean Distance

– Correlation

– Dynamic Time Warping

– Longest Common Subsequences

– Uniformly Scaled Euclidean Distance

– Sliding Nearest Neighbor Distance

50

Euclidean Distance Metric

y

x

d(x,y)

Given two time series

x = x1…xn

and

y = y1…yn

their z-Normalized Euclidean distance is

defined as:

n

i

ii

y

yi

i

x

xii

)yx(yxd

yy

xx

1

2ˆˆ),(

ˆ,ˆ

Early abandoning reduces number of operations when minimizing

51

0 10 20 30 40 50 60 70 80 90-2

0

2sum of squared error exceeded best2

Pearson’s Correlation Coefficient

• Given two time series x and y of length m.

• Sufficient Statistics:

• Correlation Coefficient:

𝑐𝑜𝑟𝑟 𝒙, 𝒚 = 𝑖=1𝑚 𝑥𝑖𝑦𝑖 −𝑚𝜇𝑥𝜇𝑦

𝑚𝜎𝑥𝜎𝑦

Where 𝜇𝑥 = 𝑖=1𝑚 𝑥𝑖𝑚

and 𝜎𝑥2 = 𝑖=1𝑚 𝑥𝑖

2

𝑚− 𝜇𝑥

2

• Early abandoning is possible when maximizing

• Correlation is not a metric, therefore, use of triangular inequality needs special attention

𝑖=1𝑚 𝑥𝑖𝑦𝑖 𝑖=1

𝑚 𝑥𝑖 𝑖=1𝑚 𝑦𝑖 𝑖=1

𝑚 𝑥𝑖2 𝑖=1

𝑚 𝑦𝑖2

52

Relationship with Euclidean Distance

𝑑 𝒙, 𝒚 = 2𝑚(1 − 𝑐𝑜𝑟𝑟(𝒙, 𝒚))

𝑥𝑖 =𝑥𝑖−𝜇𝑥

𝜎𝑥and 𝑦𝑖 =

𝑦𝑖−𝜇𝑦

𝜎𝑦

𝑑2 𝒙, 𝒚 =

𝑖=1

𝑚

𝑥𝑖 − 𝑦𝑖2

Minimizing z-normalized Euclidean distance and Maximizing Pearson’s correlation coefficient are identical in effect for motif discovery. 53

Euclidean Vs Dynamic Time Warping

Euclidean DistanceSequences are aligned “one to one”.

“Warped” Time AxisNonlinear alignments are possible.

54

C

QC

Q

How is DTW Calculated?

),(),( nmDCQDTW

Warping path w• Quadratic time complexity• DTW is not a metric

)1,1(),,1(),1,(min)(),( 2 jiDjiDjiDcqjiD ji

55

A four-slide digression, to make sure you understand what invariances are, and why they are important

56

Suppose we are walking in a cemetery in Japan.

We see an interesting grave marker, and we want to learn more about it.

We can take a photo of it and search a database….

57

Campana and Keogh (2010). A Compression Based Distance Measure for Texture. SDM 2010.

58

In order to do this, we must have a distance measure with the right invariances

Color invariance

Occlusion invariance

Size invariance

Rotation invariance59

Time Series Data has Unique Invariances

• These invariances are domain/problem dependent

• They include

– Complexity invariance

– Warping invariance

– Uniform scaling invariance

– Occlusion invariance

– Rotation/phase invariance

– Offset invariance

– Amplitude invariance

• Sometimes you achieve the invariance in the distance measure, sometimes by preprocessing the data.

• In this work, we will just assume offset/amplitude invariance. See [a] for a visual tour of time series invariances.

[a] Batista, et al (2011) A Complexity-Invariant Distance Measure for Time Series. SDM 2011

Z-normalization of each subsequence removes these

60

-2

-1

0

1

2

Z-Normalization ensures scale and offset invariances

x

xi

xx

ˆ

61

-5

-3

-1

1

3

0 100-5

-3

-1

1

3 Query

Distance to the query

0 500 1000 1500

0

40

80

120

Threshold = 30

Wandering Baseline

Without Normalization only 75% of the beats are missed

Algorithms Outline





62

Simplest Definition of Time Series Motifs

1. Length of the motif = Given

2. Support of the motif = 2

3. Similarity of the Pattern = Euclidean distance

4. Relative Position of the Pattern = non-overlapping

20 40 60 80 100 120 140 160 180 200-2

-1

0

1

2

Given a length, the most similar/least distant pair of non-overlapping subsequences

63

Problem Formulation

The most similar pair of non-overlapping

subsequences

100 200 300 400 500 600 700 800 900 1000

-8000

-7500

-7000

12345678...

873

time:1000

The closest pair of points in high dimensional

space

Optimal algorithm in two dimension : Θ(n log n)For large dimensionality d, optimum algorithm is

effectively Θ(n2d) 64

Lower Bound

If P, Q and R are three points in a d-space

d(P,Q)+d(Q,R) ≥ d(P,R)

d(P,Q) ≥ |d(Q,R) - d(P,R)|

A third point R provides a very inexpensive lower bound on the true distance

If the lower bound is larger than the existing best, skip d(P, Q)

d(P,Q) ≥ |d(Q,R) - d(P,R)| ≥ BestPairDistance

P Q

R

65

Circular Projection

r

Pick a reference point r

Circularly Project all points on a line passing through the reference point

Equivalent to computing

distance from r and then sorting the points according

to distance

1

5

3

716

10

12

20

11

6

24

21

18

2

22

17

15

23

13

148

49

19 r

66

The Order Line

r

P Q

r

|d(Q, r) - d(P, r)|

d(Q, r)

d(P, r)

k = 1k = 2k = 3

k=1:n-1• Compare every pair

having k-1 points in between

• Do k scans of the order line, starting with the 1st to kth point

BestPairDistance

0

1

5

3

716

10

12

20

11

6

24

21

18

2

22

17

15

23

13

148

49

19 r

67

Correctness

• If we search for all offset=1,2,…,n-1 then all possible pairs are considered.

– n(n-1)/2 pairs

• if for any offset=k, none of the k scans needs an actual distance computation then for the rest of the offsets=k+1,…,n-1 no

distance computation will be needed.

r

68

Performance

• Orders of Magnitude faster• Exact in execution• No sacrifice of the quality of the results

-0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

10 30 50 70 90 110

X103

X103

Brute Force Brute Force with Optimized Distance Computation Our Algorithm

69

(MK)

• Use multiple reference points for tighter lower bounds.

Multiple References

70

r1 r2

yx

105 106 107 108 109 1010105

106

107

108

109

1010

Pruning by Triangular Inequality

Pru

nin

g b

y P

tole

mai

c In

equ

alit

y Ptolemaic bound wins

Linear bound wins

𝑥𝑦 ≥|𝑥𝑟1. 𝑦𝑟2 + 𝑥𝑟2. 𝑦𝑟1|

𝑟1𝑟2

Ptolemaic bound

Pruning by Multiple References

71

P

Q

R1

R3

R4

R2max( | d(P,Ri) - d(Q,Ri) | )

0 10 20 30 40 50 60

5

6

7

8

Lo

gar

ithm

of

tim

e

(sec

on

ds)

Number of Reference Time

Series Used

Time to find the motif in a

database of 64,000

random walk time series

of length 1,024

5 10 15 20 25 30 35

500

1000

1500

2000

2500

3000

0 5 10 15 20 25 30 350

200

400

600

800

1000

1200

1400

0 5 10 15 20 25 30 350

200

400

600

800

1000

1200

1400

1600

1800

Algorithms Outline





72


73


1. Length of the motif = Given All

2. Support of the motif = 2



20 40 60 80 100 120 140 160 180 200-2

-1

0

1

2

The most similar/least distant pairs of non-overlapping subsequences at all lengths.

74

Goals: Enumerating Motifs

1. Remove the length parameter

2. Search for motifs in a range of lengths and report

– ALL of the motifs of all of the lengths

3. Retain Scalability

75

Bound on Extension

1. Two time series 𝒙 and 𝒚 of length 𝑚

2. Their normalized Euclidean distance 𝑑 𝒙, 𝒚

3. Find 𝑑𝐿𝐵 𝒙+1, 𝒚+1 if we increase the length of 𝒙and 𝒚 by appending the next two numbers.

Values Changed

1 2 3 4 5

345678910

Without Normalization

1 2 3 4 5-4-3-2-101234

With Normalization

76

Intuition

1 2 3 4 5-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

1 2 3 4 5-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

1 2 3 4 5-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

Length 4 Length 5 Length 5

Append 10 and re-normalize Append 20 and re-normalizeNormalized

Area shrinks

Area between blue and red is the distance between the signals

Area shrinks further

If infinity is appended to both the signals, they will have zero area/distance. 77

Bounding Euclidean Distance

𝑑𝐿𝐵2 𝒙+1, 𝒚+1 =

1

𝜎𝑚2 𝑑𝑚

2 𝒙, 𝒚 < 𝑑𝑚2 ( 𝒙, 𝒚)

Variances of 𝒙+1and 𝒚+1, 𝜎𝑚2 =

𝑚

𝑚+1+

𝑚

𝑚+1 2 𝑧2

𝑧 = maximum normalized value in the databaseA safe approximation 𝑧 = max(𝑎𝑏𝑠 𝒙 , 𝑎𝑏𝑠 𝒚 )

78

Experimental Validation of the Bounds

2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3x 105

20

20.5

21

21.5

22

22.5

23

23.5

24

24.5

25

0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 105

0

5

10

15

20

25

30

35

Pairs in ascending order of distances

Norm

aliz

ed D

ista

nce

Blue: Distances of random signals of length 255Gray: Distances of the signals when they are extended by one random sample Red: The upper and lower bounds before observing the extensions 79

Intuition

3.94 =4.57

1.16

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000-8.5

-8

-7.5

-7 x 103

145, 5410, 1.26

8345, 4211, 2.63

1655, 9461, 2.96

6531, 2501, 3.17

851, 1440, 3.73

2512, 3110, 3.98

1685, 9260, 4.57

Length 𝑚=7

𝑂 𝑛 pairs in order of

the distances for

length 𝑚

145, 5410, 1.79

8345, 4211, 1.63

1655, 9461, 3.61

6531, 2501, 2.71

851, 1440, 3.83

2512, 3110, 4.18

1685, 9260, 4.27

Length 𝑚=8

8345, 4211, 1.63

145, 5410, 1.79

6531, 2501, 2.71

1655, 9461, 3.61

851, 1440, 3.83

2512, 3110, 4.18

1685, 9260, 4.27

𝜎72=1.16

Location 1Location 2

Distance

80

----, -----, -----

----, -----, -----

n = 10000

2.77 =3.68

1.32

Intuition

8345, 4211, 1.63

145, 5410, 1.79

6531, 2501, 2.71

1655, 9461, 3.61

851, 1440, 3.83

Length 𝑚=8

8345, 4211, 1.23

145, 5410, 1.98

6531, 2501, 1.71

1655, 9461, 3.68

851, 1440, 3.61

Length 𝑚=9

8345, 4211, 1.23

6531, 2501, 1.71

145, 5410, 1.98

851, 1440, 3.61

1655, 9461, 3.68

𝜎82=1.32

• Once in every 10 lengths, the exact ordered list is required to be populated.

• This yields a 10x speed-up from running fixed-length motif discovery for all lengths.

81

Sanity Check

0 1000 2000 3000 4000 5000 6000-2

0

2

4

1380 1400 1420 1440 1460-5

0

5

Length :871320 1340 1360 1380 1400 1420

-5

0

5

Length :105600 700 800

-5

0

5

Length :2992200 2300 2400

-5

0

5

Length :299

(1) (2) (3) (4)

White Noise

http://www.cs.unm.edu/~mueen/Projects/MOEN/index.html

• Three Patterns planted in a random signal with different scaling.

• The algorithm finds them appropriately.82

Experimental Results: Scalability

0 2 4 6 8 10 12 14 16x 104

0

1

2

3

4

5

6

7x 105

Data Length (n)

Exec

uti

on

Tim

e in

Sec

on

ds

Smart Brute ForceEEGEOGRandom WalkIterative MK

1 2 3 4 5 6 7 8 9 10

0

2

4

6

8

10

12

14

16

18x 104

Smart Brute ForceEEGRandom WalkEOG

Exec

uti

on

Tim

e in

Sec

on

ds

Range of Lengths (maxLen-minLen+1)

x 102

83

MOENMOEN

Algorithms Outline





84


1. Length of the motif = Given All

2. Support of the motif = 2 k and τ



20 40 60 80 100 120 140 160 180 200-2

-1

0

1

2

The non-overlapping subsequences at all lengths having k or more τ-matches.

85

Optimal algorithm is hard

• Search for locations of the τ-balls that contain k subsequences

• NP-Hard

• Instead we search for a motif representative that has k subsequences within τ

86

How do we find the motif representative?

• Simply take one of the two occurrences as the representative

• Take the average of the two

• Find all occurrences within a threshold of pair and train a HMM to capture the concept (Minnen’07)

87

Using each one in the pair as the representative (k=4, τ=0.9)

88

1

2

3

4

0 50 100 150-4

-2

0

2

0 5000 10000

1

2

4

5

3

0 50 100 150-4

-2

0

2

4

0 5000 10000

EEG Data

EPG Data

It takes only k similarity searches to find other occurrences. The overall complexity remains the same.

Finding top-K motif

89

1000 2000 3000 4000 5000 6000

1000

2000

3000

4000

A A B A x A x A B A x A A x B x A A B A x B A x B A C AxC A A x C

Length :155

Length :255

Length :157

Length :251

A

C

B

• Run MK for K times• Replace occurrences by random noise

between iterations

Algorithms Outline





90

• Streaming time series

• Sliding window of the recent history– What minute long trace repeated in the last hour?

Online Time Series Motifs

91

Discovery

100 200 300 400 500 600 700 800 900 1000

-8000

-7500

-7000

12345678...

873

The most similar pair of non-overlapping

subsequences

Maintenance

12345678...

873874

time:100112345678...

873874875

time:100212345678...

873874875876

time:100312345678...

873874875876877

time:100412345678...

873874875876877878

time:1005

Update motif pair afterevery time tick

time

Problem Formulation

92

• A subsequence is a high dimensional point

– The dynamic closest pair of points problem

• Closest pair may change upon every update

• Naïve approach: Do quadratic comparisons.

4

5

2

7

3

6

89

4

5

2

7

3

6

8

1

4

5

2

7

3

6

8

1

Deletion Insertion

Challenges

93

• Goal: Algorithm with Linear update time

• Previous method for dynamic closest pair (Eppstein,00)

– A matrix of all-pair distances is maintained

• O(w2) space required

– Quad-tree is used to update the matrix

• Maintain a set of neighbors and reverse neighbors for all points

• We do it in O( ) spaceww

Related Work

94

• Smallest nearest neighbor → Closest pair

• Upon insertion

– Find the nearest neighbor; Needs O(w) comparisons.

• Upon deletion

– Find the next NN of all the reverse NN

4

5

2

7

3

6

8

1

4

5

2

7

3

6

8

1

Insertion

9

4

5

2

7

3

6

8

1

Deletion

9

Maintaining Motif

95

1 2 3 4 5 6 7 8FIFO Sliding window

N-listsNeighbors in order

of distances

8 7 21 8 1 24

5 1 17 3 3 67

7 6 75 2 4 55

1 4 83 7 5 72

4 8 42 5 2 13

6 1 38 1 8 38

3 5 66 4 6 46

R-listsPoints that should be updated

Total number of nodes in R-lists is w

5 1 3 24

8 67

4

5

2

7

3

6

8

1

4

7

(ptr, distance)

Data Structure

96

While inserting Updating NN of old points is not necessary A point can be removed from the neighbor list if it

violates the temporal order

Averagelength O( )w

Observations

97

4

5

2

7

3

6

8

1

1 1 21 3 1 2

52

83

2 43 5 3 6

4 7

5

6

4

7

6

1 2 3 4 5 6 7 8

2 3 32 3 2 6

3 7 9

5

54 4 6 8

5 7

6

68

4

2 3 4 5 6 7 8 91

4

5

2

7

3

6

8

1

9

4

5

2

7

3

6

89

10

3 34 3 6 6

57 9

54 4 7 8

5

6

4

6 8

5

7

8

9

3 4 5 6 7 8 9 1021

10

Example

98

• Up to from general dynamic closest pairper point with increasing window size

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

Insect

EEG

EOG

RW

Av

erag

e U

pd

ate

Tim

e (S

ec)

0 0.5 1 1.5 2 2.5 3 3.5 4x 104

Window Size (w)0 200 400 600 800 1000

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Insect

EEG

EOG

RW

Av

erag

e U

pd

ate

Tim

e (S

ec)

Motif Length (m)

0 0.5 1 1.5 2 2.5 3 3.5 4x 104

102030405060708090100110

Insect

EEG

EOG

RW

Av

erag

e L

eng

th o

f N

-lis

t

Window Size (w)0 200 400 600 800 1000

0

50

100

150

200

250

300

350

Insect

EEG

EOG

RW

Motif Length (m)

Av

erag

e L

eng

th o

f N

-lis

t

Results

99

Algorithms Outline





100


101

Abdullah MueenUniversity of New Mexico, USA

Eamonn KeoghUniversity of California Riverside, USA

Finding Repeated Structure in Time Series: Algorithms and Applications

Break; We meet back in this room at 5:15PM

Slides available at: http://www.cs.unm.edu/~mueen/Tutorial/SDM2015Tutorial2.pdf102

General Outline

• Applications (50 minutes)

– As Subroutines in Data Mining

– In Other Scientific Research

• Algorithms (100 minutes)

– Uni-dimensional

– Multi-dimensional

103

Algorithms Outline





104

How do we find approximate motif in a time series?

The obvious brute force search algorithm is just too slow…

Our algorithm is based on a hot idea from bioinformatics, random projection* and the fact that SAX allows us to lower bound discrete representations of time series.

* J Buhler and M Tompa. Finding motifs using random projections. In RECOMB'01. 2001.

Symbolic Aggregate ApproXimation

106

How do we obtain SAX?

First convert the time series to PAA representation, then convert the PAA to symbols

It take linear time

0 20 40 60 80 100 120

C

C

0

--

0 20 40 60 80 100 120

bbb

a

cc

c

a

baabccbc 107

Time series subsequences tend to have a highly Gaussian distribution

-10 0 10

0.001

0.003

0.01 0.02

0.05

0.10

0.25

0.50

0.75

0.90

0.95

0.98 0.99

0.997

0.999

Pro

babi

lity

A normal probability plot of the (cumulative) distribution of

values from subsequences of length 128.

0

--

0 20 40 60 80 100 120

bbb

a

cc

c

a

Why a Gaussian?

108

Visual Comparison

A raw time series of length 128 is transformed into the word “ffffffeeeddcbaabceedcbaaaaacddee.”

– We can use more symbols to represent the time series since each symbol requires fewer bits than real-numbers (float, double)

-3

-2 -1 0 1 2 3

DFT

PLA

Haar

APCA

a b c d e f

109

ab:

:

a:

b

cc:

:

c:

c

ba:

:

c:

c

ab:

:

a:

c

1

2

:

:

58

:

985

0 500 1000

a c b a

T (m= 1000 )

a = 3 { a,b,c}n = 16w = 4

S^

C 1

C 1^

Assume that we have a time series T

of length 1,000, and a motif of length

16, which occurs twice, at time T1

and time T58.

A simple worked example of approximate motif discovery algorithm

The next 3 slides

110

A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project matrix into buckets.

Collisions are recorded by incrementing the appropriate location in the collision matrix


111

A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project matrix into buckets.

Once again, collisions are recorded by incrementing the appropriate location in the collision matrix


112

0

1 1 E( , , , , ) 1-

2

t i w id

i

k wi ak a w d t

iw a a

We can calculate the expected values in the matrix, assuming there are NO patterns…

two randomly-generated words of size w over an alphabet of size a, the probability that they match with up to d errors

t is the length of the projected string. We conclude that if we have k random strings of size w, an entry of the similarity matrix will be hit on average

113

Motif significance involves several independent dimensions

Similarity

Support

Length/size

Assessing significance requires estimating a function S:Rd→R over these dimensions so we can rank the motifs 114

More invariances mean more independent dimensions

Similarity

Support

Length/size

Complexity

Dimensionality

115

Multi-dimensional Motif

• Synchronous

– Treat it as an even higher dimensional problem

– Simple extensions of uni-dimensional algorithms work

– To find sub-dimensional motifs, all possible sub-spaces have to be considered

116

Multi-dimensional Motif

• Non-Synchronous

– Lags among motifs are common

– Subsets of dimensions can possibly construct a motif

David Minnen, Charles Isbell, Irfan Essa, and Thad Starner. Detecting Subdimensional Motifs: An Efficient Algorithm for Generalized Multivariate Pattern Discovery. ICDM '07

Alireza Vahdatpour, Navid Amini, Majid Sarrafzadeh: Toward Unsupervised Activity Discovery Using Multi-Dimensional Motif Detection in Time Series. IJCAI 2009: 1261-1266

117

Coincidence table

coincident(ri, rj) is the number of overlapping occurrences of motif iand motif j

118

Single-dimensional motifs to graph• Produce a co-occurrence graph

• Nodes are single dimensional motifs

• An edge between x and y denotes, x and y always co-occur within a time lag

• Cluster the graph using min-cut algorithms to find multi-dimensional motifs

x 104

0 0.5 1 1.5 2 2.5 3

0/20/22/4

1/41/21/2

0/30/40/2

0/40/40/4

Hip

Hand

Arm

Legxyz

xyz

xyz

xyz

119

Open Problems

• New Invariances:– P1: Find repeated patterns under warping

distance.

– P2: Finding motifs under Complexity invariance.• Uniform scaling (Yankov 06)

• Significance:– P3: Assessing significance of motifs without

discretization.• Parameter-free

• Data adaptive

120

Open Problems

• Algorithmic:– P4: Optimal k-motif for a given threshold– P5: Exact multi-dimensional motif discovery

• Application:– P6: Finding hidden state machine from motifs

• States == clustering• Rules between patterns only• State machine is for rules among clusters• Systems:

– A suite with all the techniques added

• Parallel motif discovery using GPU

121

References1. Abdullah Mueen, Eamonn J. Keogh: Online discovery and maintenance of time series motifs. KDD 2010: 1089-10982. Abdullah Mueen, Eamonn J. Keogh, Qiang Zhu, Sydney Cash, M. Brandon Westover: Exact Discovery of Time Series

Motifs. SDM 2009: 473-4843. Abdullah Mueen, Eamonn J. Keogh, Nima Bigdely Shamlo: Finding Time Series Motifs in Disk-Resident Data. ICDM 2009:

367-3764. Abdullah Mueen: Enumeration of Time Series Motifs of All Lengths. ICDM 2013: 547-5565. Yuan Hao, Mohammad Shokoohi-Yekta, George Papageorgiou, Eamonn J. Keogh: Parameter-Free Audio Motif Discovery

in Large Data Archives. ICDM 2013: 261-2706. Dragomir Yankov, Eamonn J. Keogh, Jose Medina, Bill Yuan-chi Chiu, Victor B. Zordan: Detecting time series motifs under

uniform scaling. KDD 2007: 844-8537. Xiaopeng Xi, Eamonn J. Keogh, Li Wei, Agenor Mafra-Neto: Finding Motifs in a Database of Shapes. SDM 2007: 249-2608. Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-4989. Pranav Patel, Eamonn J. Keogh, Jessica Lin, Stefano Lonardi: Mining Motifs in Massive Time Series Databases. ICDM

2002: 370-37710. Alireza Vahdatpour, Navid Amini, Majid Sarrafzadeh: Toward Unsupervised Activity Discovery Using Multi-Dimensional

Motif Detection in Time Series. IJCAI 2009: 1261-126611. Debprakash Patnaik, Manish Marwah, Ratnesh Sharma, and Naren Ramakrishnan. 2009. Sustainable operation and

management of data center chillers using temporal data mining. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '09)

12. Yuan Li, Jessica Lin, Tim Oates: Visualizing Variable-Length Time Series Motifs. SDM 2012: 895-90613. Tim Oates, Arnold P. Boedihardjo, Jessica Lin, Crystal Chen, Susan Frankenstein, and Sunil Gandhi. 2013. Motif discovery

in spatial trajectories using grammar inference. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management(CIKM '13). ACM, New York, NY, USA, 1465-1468.

14. Sorrachai Yingchareonthawornchai, Haemwaan Sivaraks, Thanawin Rakthanmanon, Chotirat Ann Ratanamahatana: Efficient Proper Length Time Series Motif Discovery. ICDM 2013: 1265-1270

15. David Minnen, Charles Lee Isbell Jr., Irfan A. Essa, Thad Starner: Discovering Multivariate Motifs using Subsequence Density Estimation and Greedy Mixture Learning. AAAI 2007: 615-620

123

References

16. Philippe Beaudoin, Stelian Coros, Michiel van de Panne, and Pierre Poulin. 2008. Motion-motif graphs. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '08)

17. David Minnen, Charles L. Isbell, Irfan A. Essa, Thad Starner: Detecting Subdimensional Motifs: An Efficient Algorithm for Generalized Multivariate Pattern Discovery. ICDM 2007: 601-606

18. Yuan Hao, Yanping Chen, Jesin Zakaria, Bing Hu, Thanawin Rakthanmanon, Eamonn J. Keogh: Towards never-ending learning from time series streams. KDD 2013: 874-882

19. Gustavo E. A. P. A. Batista, Xiaoyue Wang, Eamonn J. Keogh: A Complexity-Invariant Distance Measure for Time Series. SDM 2011: 699-710

20. Thanawin Rakthanmanon, Eamonn J. Keogh, Stefano Lonardi, Scott Evans: Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data. ICDM 2011: 547-556

21. Yasser F. O. Mohammad, Toyoaki Nishida:Constrained Motif Discovery in Time Series. New Generation Comput. 27(4): 319-346 (2009)

22. Nuno Castro, Paulo J. Azevedo:Time Series Motifs Statistical Significance. SDM 2011: 687-69823. Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-49824. Zeeshan Syed, Collin Stultz, Manolis Kellis, Piotr Indyk, John V. Guttag: Motif discovery in physiological datasets: A

methodology for inferring predictive elements. TKDD 4(1) (2010)25. Yasser F. O. Mohammad, Toyoaki Nishida: Exact Discovery of Length-Range Motifs. ACIIDS (2) 2014: 23-32

124

THANK YOUQuestions and Comments

125

Tutorial 2 Finding Repeated Structure in Time Series ...mueen/Tutorial/SDM2015Tutorial2.pdfWe have a...

Documents

Transcript of Tutorial 2 Finding Repeated Structure in Time Series ...mueen/Tutorial/SDM2015Tutorial2.pdfWe have a...