Tutorial 2 Finding Repeated Structure in Time Series ...mueen/Tutorial/SDM2015Tutorial2.pdfWe have a...
Transcript of Tutorial 2 Finding Repeated Structure in Time Series ...mueen/Tutorial/SDM2015Tutorial2.pdfWe have a...
Abdullah MueenUniversity of New Mexico, USA
Eamonn KeoghUniversity of California Riverside, USA
Finding Repeated Structure in Time Series: Algorithms and Applications
Tutorial 2
Funding by NSFIIS-1161997 II
Slides available athttp://www.cs.unm.edu/~mueen/Tutorial/SDM2015Tutorial2.pdf
(we will start 5 min late to allow folks to find the room)
1
Tutorial Structure
• I will start with applications and talk about algorithms after that.
• There will be four Q&A segments. Please hold your question till the next segment.
• There is a feedback form available. Negative/positive, anonymous/known feedbacks are welcome.
• There will be a break at 5:00PM for 10 minutes.
2
What are Time Series?
A time series is a collection of observations made sequentially in time.
25.175025.225025.250025.250025.275025.325025.350025.350025.400025.400025.325025.225025.200025.1750
..
..24.625024.675024.675024.625024.625024.625024.675024.7500
0 50 100 150 200 250 300 350 400 450 50023
24
25
26
27
28
29
3
Repeated Pattern (Motif)
Find the subsequences having very high similarity to each other.
0 2000 4000 6000 8000 100000
102030
0 100 200 300 400 500 600 700 800
Finding Motifs in Time Series, Jessica Lin, Eamonn Keogh, Stefano Lonardi, Pranav Patel, KDD 2002
时间序列中的重复模式
4
General Outline
• Applications (50 minutes)
– As Subroutines in Data Mining
– In Other Scientific Research
• Algorithms (100 minutes)
– Uni-dimensional
– Multi-dimensional
– Open Problems
5
Applications Outline
• Applications– As Subroutines in Data Mining
• Never Ending Learning• Time Series Clustering• Rule Discovery• Dictionary Building
– In Other Scientific Research• Data center chiller management• Worm locomotion analysis• Physiological Prediction• Activity recognition
– Motifs in Other Data-types• Audio• Shapes• Motion
6
…Y dijo Dios: Sea la luz; y fue la luz. Y vio Dios que la luz era buena . Y llamó Dios a …
…And God said, “Let there be light”; and there was light. And God saw the light, that it was good. And God …
If you have parallel texts, then over time you can learn a dictionary with high accuracy.
Motifs allow us to learn, forever, without an explicit teacher…
7
…Y dijo Dios: Sea la luz; y fue la luz. Y vio Dios que la luz era buena . Y llamó Dios a …
…And God said, “Let there be light”; and there was light. And God saw the light, that it was good. And God …
Suppose however that the unknown “language” is not discrete, but real-valued time series? In this case, repeated pattern discovery can help*…
*Yuan Hao, Yanping Chen et al (2013). Towards Never-Ending Learning from Time Series Streams. SIGKDD 2013
Note the mapping is non-linear, the learning algorithms in this domain are non-trivial.
If you have parallel texts, then over time you can learn a dictionary with high accuracy.
Motifs allow us to learn, forever, without an explicit teacher…
8
Motifs allow us to learn, forever, without an explicit teacher…
This dataset contains standard IADL housekeeping activities (vacuuming, ironing, dusting, brooming,, watering plants etc). We have a discrete (binary) “text” that notes if the hand is near a cleaning instrument, and a real-valued accelerometer “text”
P (X-axis acceleration of wrist)
0 1000 2000 3000
offon
Glove
B (RFID tag)
offon
Iron(omitted)
5 seconds
Binary “text”
Real-valued “text”
9
We can run motif discovery on the time series stream. If we find motifs, we can see if they correlate with the discrete streams…
P (X-axis acceleration of right wrist)
0 1000 2000 3000
offon
Glove
B (RFID tag)
C1 observed C1 observed C1 observed
Glove 1/1
Iron 0/1
(omitted)
offon
Iron
(omitted)
Glove 2/2
Iron 0/2
(omitted)
Glove 3/3
Iron 0/3
(omitted)
5 seconds
In this snippet, the motifs seem to correlate with the presence of a glove…
10
How well does this work?
Over a hour of activity, we learn to recognize a behavior in the time series that indicates the user is putting on a glove.
0 200 4000 200 400
Learned concept C1
True positives
False positives
Discovered pattern
0 minutes
0
0.5
1
iron P(C1=fan) = 0.07
P(C1=iron) = 0.23
P(C1=glove) = 0.70
fan
P(X-axis acceleration of right wrist)Events shown in last slide take place here
55 minutes
glove
Note: There are false positives, but we considered only a single axis for simplicity. 11
200 400 600 800 1000 12000
And how would we evaluate our answer?
Motifs allow us to cluster subsequences of a time series…
12Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, and Scott Evans (2011). Time Series Epenthesis: Clustering Time Series Streams
Requires Ignoring Some Data. ICDM 2011
== Poem ==In a sort of Runic rhyme,To the throbbing of the bells--Of the bells, bells, bells,To the sobbing of the bells;Keeping time, time, time,As he knells, knells, knells,In a happy Runic rhyme,To the rolling of the bells,--Of the bells, bells, bells--To the tolling of the bells,Of the bells, bells, bells, bells,Bells, bells, bells,--To the moaning and the groan-ing of the bells.
Data: last 30 seconds from 4-min poem, The Bells by Edgar Allan PoeEdgar Allan Poe
200 400 600 800 1000 12000
And how would we evaluate our answer?
Motifs allow us to cluster subsequences of a time series…
13
== Poem ==In a sort of Runic rhyme,To the throbbing of the bells--Of the bells, bells, bells,To the sobbing of the bells;Keeping time, time, time,As he knells, knells, knells,In a happy Runic rhyme,To the rolling of the bells,--Of the bells, bells, bells--To the tolling of the bells,Of the bells, bells, bells, bells,Bells, bells, bells,--To the moaning and the groan-ing of the bells.
== Text in each clusters ==bells, bells, bells,Bells, bells, bells,
Of the bells, bells, bells,Of the bells, bells, bells--
To the throbbing of the bells--To the sobbing of the bells;To the tolling of the bells,
To the rolling of the bells,--To the moaning and the groan-
time, time, time,knells, knells, knells,
sort of Runic rhyme,groaning of the bells.
And how would we evaluate our answer?
200 400 600 800 1000 12000
Motifs allow us to cluster subsequences of a time series…
14
Key observations that make this possible:
• Time Series Motifs!
• We are willing to allow some data to be unexplained by the clustering
• We score the possible clustering's with MDL, this is parameter-free!
• Allowing the clusters to be of different lengths/sizes
Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, and Scott Evans (2011). Time Series Epenthesis: Clustering Time Series Streams
Requires Ignoring Some Data. ICDM 2011
200 400 600 800 1000 12000
Motifs allow us to cluster subsequences of a time series…
15
Motifs are useful, but can we predict the future?
0 1000 2000 3000 4000 5000 6000 7000 8000-30
-25
-20
-15
-10
What happens next?
Prediction vs. Forecasting (informal definitions)
Forecasting is “always on”, it constantly predicts a value say, two minutes out (we are not doing this)
Prediction only make a prediction occasionally, when it is sure what will happen next
(unpublished work, email Dr. Keogh for preprint)
16
Why Predict the (short-term) Future?If a robot can predict that is it about to fall, it may be able to..
• Prevent the fall• Mitigate the damage of the fall
More importantly, if the robot can predict a human’s actions
• The robot could catch the human!• This would allow more natural human/robot interaction.• Real time is not fast enough for interaction! We need to be a half second before real time.•
• Other examples:• Predict a car crash, tighten seatbelts, apply brakes• Predict the next spoken word after ‘data’ is ‘mining’, then begin prefetching WebPages..• etc
17
Previous attempts at this have largely failed…
However, we can do this, and time series motifs are the key tool
The rule discovery technique will use:• Time Series Motifs• MDL (minimum description length)• Admissible speed-up techniques (not discussed here)
18
Let us start by finding motifs
0 1000 2000 3000 4000 5000 6000 7000 8000 9000-30
-25
-20
-15
-10
0 20 40 60 80 100
First occurrence
Second occurrence
19
We can convert the motifs to a rule
0 1000 2000 3000 4000 5000 6000 7000 8000 9000-30
-25
-20
-15
-10
0 20 40 60 80 100
0 20 40 60 0 20 40
t1 = 7.58
maxlag = 0
We can use the motif to make a rule…
IF we see thisshape, (antecedent)
THEN we see thatshape, (consequent)
within maxlag time
The Euclidean distance between thisshape and the observed window must be within a threshold t1 = 7.58
20
We can monitor streaming data with our rule..
8000 9000
0 20 40 60 0 20 40
t1 = 7.58
maxlag = 0
21
The rule gets invoked…
0 20 40 60 0 20 40
t1 = 7.58
maxlag = 0
5685 5735 5785 5835 5885
rule fires here predicting this shape’s occurrence
22
It seems to work!
0 20 40 60 0 20 40
t1 = 7.58
maxlag = 0
5685 5735 5785 5835 5885
rule fires here predicting this shape’s occurrence
23
What is the ground truth?
0 1000 2000 3000 4000 5000 6000 7000 8000 9000-30
-25
-20
-15
-10
0 20 40 60 80 100
at door chambermy
The first verse of The Raven by Poe in MFCC space
Once upon a midnight dreary, while I pondered weak and weary.. ..rapping at my chamber door……
The phrase “at my chamber door” does appear 6 more times, and we do fire our rule correctly each time, and have no false positives.
What are we invariant to? • Who is speaking? Somewhat, we can handle other males, but females are tricky.• Rate of speech? To a large extent, yes.• Foreign accents? Sore throat? etc
0 20 40 60 0 20 40
t1 = 7.58
maxlag = 0
24
Why we need the Maxlagparameter
Here the maxlag depends on the number of floors we have in our building.
We can hand-edit this rule to generalize for short buildings to tall buildings
Can physicians edit medical rules to generalize from male to female…
t1 = 0.4
maxlag = 4 sec 0 40 80 120-11
-10
-9 waiting for elevator
stepping inside
elevator moves up
elevator stops
walking away
y-
acce
lera
tio
n
t1 = 0.4
maxlag = 4 sec 0 40 80 120-11
-10
-9 waiting for elevator
stepping inside
elevator moves up
elevator stops
walking away
y-
acce
lera
tio
n
25
maxlag = 20 minutes
0 120 180015500 16000 16500 17000
IF we see a Clothes Washer used
THEN we will see Clothes Dryer used within 20 minutes27
This works, really!
Insect Behavior AnalysisBeet Leafhopper (Circulifer tenellus)
plant membraneStylet
voltage source
input resistor
V
0 50 100 150 2000
1
2
to insectconductive glue
voltage reading
to soil near plant
0 1 2 3 4 5 6 7 8x 104
50010001500200025003000350040004500
Additional examples
of the motif
0 50 100 150 200 250 300 350 400-3
-2
-1
0
1
2
3
4
5
6
Instance at 20,925
Instance at 25,473
The electrical penetration graph or EPG is a system used by biologists to study the interaction of insects with plants.
15 minutes of EPG recorded on Beet Leafhopper
As a bead of sticky secretion, which is by-product of sap feeding, is ejected, it temporarily forms a highly conductive bridge between the insect and the plant.
Exact Discovery of Time Series Motifs. A Mueen, et al. SDM, 2009.29
900 1000 1100 1200 900 1000 1100 1200-5
0
5
Length :412600 800 1000 600 800 1000
-5
0
5
Length :565
7000 7200 7400 7600 7000 7200 7400 7600-5
0
5
Length :7995200 5400 5600 5800 5200 5400 5600 5800
-5
0
5
Length :799
3900 4000 4100 4200 3900 4000 4100-5
0
5
Length :3344800 4900 5000 5100 4800 4900 5000 5100
-5
0
5
Length :384
N
S
N
Insect Behavior Analysis
More motifs reveal different feeding patterns of Beet Leafhopper. 30
Applications Outline
• Applications– As Subroutines in Data Mining
• Never Ending Learning• Time Series Clustering• Rule Discovery• Dictionary Building
– In Other Scientific Research• Data center chiller management• Worm locomotion analysis• Physiological Prediction• Activity recognition
– Motifs in Other Data-types• Audio• Shapes• Motion
31
Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining
HP Labs with Virginia Tech
“Our primary goal is to link the time series temperature data gathered from chiller units to high level sustainability characterizations… thus using time series motifs as a crucial intermediate representation to aid in data reduction.”
“switching from motif 8 to motif 5 gives us a nearly $40,000 in annual savings!” Patnaik et al. SIGKDD09
32
A dictionary of behavioral motifs reveals clusters of genes affecting C. elegans locomotion
Brown et al. PNAS 2013; 110(2): 791–796.
Laboratory of Molecular Biology, Cambridge, United Kingdom
Goal: Detect genotype by using the locomotion only.Convert postures to four dimensional time series.
33
Brown A E X et al. PNAS 2013;110:791-796
A dictionary of behavioral motifs reveals clusters of genes affecting C. eleganslocomotion
34
Variability in motif structure is lower in juvenile Directed than in Undirected and similar to that in adult song.
35Social performance reveals unexpected vocal competency in young songbirdsSatoshi Kojima and Allison J. Doupe, PNAS 2011, 108 (4) 1687-1692.
Amplitude Envelop
Motif discovery in physiological datasets: A methodology for inferring predictive elements
University of Michigan and MIT
We evaluated our solution on a population of patients who experienced sudden cardiac death and attempted to discover electrocardiographic activity that may be associated with the endpoint of death. To assess the predictive patterns discovered, we compared likelihood scores for time series motifsin the sudden death population…
Zeeshan Syed et al. TKDD 201036
Constrained Motif Discovery in Time SeriesToyoaki Nishida, Kyoto University
“we use time series motifs to find gesture patterns with applications to robot-human interactions” Okada, Izukura and Nishida 2011
37
Discovering Characteristic Actions from On-Body Sensor Data
David Minnen, Thad Starner, Irfan Essa, and Charles Isbell, Georgia Tech
Our algorithm successfully discovers motifs that correspond to the real exercises with a recall rate of 96.3% and overall accuracy of 86.7% over six exercises and 864 occurrences.
Minnen et al. Symp. Wearable Computers 200638
Motifs can Spot Dance Moves…
x 104
AB
C
EF
A AA BCC
DDDDDDEE
FF
0 0.5 1 1.5 2 2.5 3
0/20/22/4
1/41/21/2
0/30/40/2
0/40/40/4
Hip
Hand
Arm
Legxyz
xyz
xyz
xyz
Data: H. Pohl et al. SMC 2010Algorithm: Mueen, ICDM 2013
Step Action
A Side steps with no arm movement
B Rock steps sideways without arm movement
C Rock steps sideways with arm movement
D Side steps with arm movement
E Side steps with arms up in the air
F Standing still with head bopping
Motifs are from the same dance steps or the same transitions 86% of the time.
Choreography
Time
39
Applications Outline
• Applications– As Subroutines in Data Mining
• Never Ending Learning• Time Series Clustering• Rule Discovery• Dictionary Building
– In Other Scientific Research• Data center chiller management• Worm locomotion analysis• Physiological Prediction• Activity recognition
– Motifs in Other Data-types• Audio• Shapes• Motion
42
Motifs in AudioMice calls are inaudible and have significant noise
Manual inspection over temporal signal is impossible
Features like MFCC are not good for animal song
Just use the raw spectrogram to find repeated calls
43Yuan Hao et al. Parameter-Free Audio Motif Discovery in Large Data Archives. ICDM 2013: 261-270
Projectile shapes
Petroglyphs
Lampasas River Cornertang
Castroville Cornertang
0 50 100 150 200 250 300 350
Algorithm detects a rare cornertang segment – an object that has long intrigued anthropologists.
Algorithm detects similar petroglyphs drawn across continents and centuries
Motifs in Shapes
44
Motifs in Motion
Two motion can be stitched together by transiting from one motif to the other, a very useful technique for motion synthesis.
45Yankov et al. Detecting time series motifs under uniform scaling. KDD 2007
Mueen et al. A disk-aware algorithm for time series motif discovery. Data
Min. Knowl. Discov. 2011.
Time Series Motifs have 1,000 of Uses
• ..for discovering motifs in the music data is called the Mueen-Keogh (MK) algorithm.. Cabredo et al. 2011
• we apply the MK motif algorithm to time series retrieved from seismicsignals… Cassisi et al 2012
• we take motif developed by Keogh in order to support a medical expert in discovering interesting knowledge. Kitaguchi.
• for the problem of estimation of Micro-drilled Hole Wall of PWBs we take the Motif method developed by Keogh... Toshiki et al. (fabrication)
• the most efficient motif provided a power savings of 41 This translates to an annual reduction of 287 tons of CO2. Watson InterPACK09.
• We use Keogh's Motifs for unsupervised discovery of abnormal humanbehavior in multi-dimensional time series data... Vahdatpour SDM 2010.
• variability of behavior, using motifs, provides more consistent groupings of households across different clustering algorithms… Ian Dent 2014
46
Questions and Comments
47
Algorithms Outline
• Algorithms– Definition, Distance Measures and Invariances– Exact Algorithms
• Fixed Length• Enumeration of All length• K-motif Discovery• Online Maintenance
– Approximate Algorithms• Random Projection Algorithm
– Multi-dimensional Motif Discovery– Open Problems
48
Definition of Time Series Motifs
1. Length of the motif
2. Support of the motif
3. Similarity of the Pattern
4. Relative Position of the Pattern
20 40 60 80 100 120 140 160 180 200-2
-1
0
1
2
49
Distance Measures
• The choices are
– Euclidean Distance
– Correlation
– Dynamic Time Warping
– Longest Common Subsequences
– Uniformly Scaled Euclidean Distance
– Sliding Nearest Neighbor Distance
50
Euclidean Distance Metric
y
x
d(x,y)
Given two time series
x = x1…xn
and
y = y1…yn
their z-Normalized Euclidean distance is
defined as:
n
i
ii
y
yi
i
x
xii
)yx(yxd
yy
xx
1
2ˆˆ),(
ˆ,ˆ
Early abandoning reduces number of operations when minimizing
51
0 10 20 30 40 50 60 70 80 90-2
0
2sum of squared error exceeded best2
Pearson’s Correlation Coefficient
• Given two time series x and y of length m.
• Sufficient Statistics:
• Correlation Coefficient:
𝑐𝑜𝑟𝑟 𝒙, 𝒚 = 𝑖=1𝑚 𝑥𝑖𝑦𝑖 −𝑚𝜇𝑥𝜇𝑦
𝑚𝜎𝑥𝜎𝑦
Where 𝜇𝑥 = 𝑖=1𝑚 𝑥𝑖𝑚
and 𝜎𝑥2 = 𝑖=1𝑚 𝑥𝑖
2
𝑚− 𝜇𝑥
2
• Early abandoning is possible when maximizing
• Correlation is not a metric, therefore, use of triangular inequality needs special attention
𝑖=1𝑚 𝑥𝑖𝑦𝑖 𝑖=1
𝑚 𝑥𝑖 𝑖=1𝑚 𝑦𝑖 𝑖=1
𝑚 𝑥𝑖2 𝑖=1
𝑚 𝑦𝑖2
52
Relationship with Euclidean Distance
𝑑 𝒙, 𝒚 = 2𝑚(1 − 𝑐𝑜𝑟𝑟(𝒙, 𝒚))
𝑥𝑖 =𝑥𝑖−𝜇𝑥
𝜎𝑥and 𝑦𝑖 =
𝑦𝑖−𝜇𝑦
𝜎𝑦
𝑑2 𝒙, 𝒚 =
𝑖=1
𝑚
𝑥𝑖 − 𝑦𝑖2
Minimizing z-normalized Euclidean distance and Maximizing Pearson’s correlation coefficient are identical in effect for motif discovery. 53
Euclidean Vs Dynamic Time Warping
Euclidean DistanceSequences are aligned “one to one”.
“Warped” Time AxisNonlinear alignments are possible.
54
C
QC
Q
How is DTW Calculated?
),(),( nmDCQDTW
Warping path w• Quadratic time complexity• DTW is not a metric
)1,1(),,1(),1,(min)(),( 2 jiDjiDjiDcqjiD ji
55
A four-slide digression, to make sure you understand what invariances are, and why they are important
56
Suppose we are walking in a cemetery in Japan.
We see an interesting grave marker, and we want to learn more about it.
We can take a photo of it and search a database….
57
Campana and Keogh (2010). A Compression Based Distance Measure for Texture. SDM 2010.
58
In order to do this, we must have a distance measure with the right invariances
Color invariance
Occlusion invariance
Size invariance
Rotation invariance59
Time Series Data has Unique Invariances
• These invariances are domain/problem dependent
• They include
– Complexity invariance
– Warping invariance
– Uniform scaling invariance
– Occlusion invariance
– Rotation/phase invariance
– Offset invariance
– Amplitude invariance
• Sometimes you achieve the invariance in the distance measure, sometimes by preprocessing the data.
• In this work, we will just assume offset/amplitude invariance. See [a] for a visual tour of time series invariances.
[a] Batista, et al (2011) A Complexity-Invariant Distance Measure for Time Series. SDM 2011
Z-normalization of each subsequence removes these
60
-2
-1
0
1
2
Z-Normalization ensures scale and offset invariances
x
xi
xx
ˆ
61
-5
-3
-1
1
3
0 100-5
-3
-1
1
3 Query
Distance to the query
0 500 1000 1500
0
40
80
120
Threshold = 30
Wandering Baseline
Without Normalization only 75% of the beats are missed
Algorithms Outline
• Algorithms– Definition, Distance Measures and Invariances– Exact Algorithms
• Fixed Length• Enumeration of All length• K-motif Discovery• Online Maintenance
– Approximate Algorithms• Random Projection Algorithm
– Multi-dimensional Motif Discovery– Open Problems
62
Simplest Definition of Time Series Motifs
1. Length of the motif = Given
2. Support of the motif = 2
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping
20 40 60 80 100 120 140 160 180 200-2
-1
0
1
2
Given a length, the most similar/least distant pair of non-overlapping subsequences
63
Problem Formulation
The most similar pair of non-overlapping
subsequences
100 200 300 400 500 600 700 800 900 1000
-8000
-7500
-7000
12345678...
873
time:1000
The closest pair of points in high dimensional
space
Optimal algorithm in two dimension : Θ(n log n)For large dimensionality d, optimum algorithm is
effectively Θ(n2d) 64
Lower Bound
If P, Q and R are three points in a d-space
d(P,Q)+d(Q,R) ≥ d(P,R)
d(P,Q) ≥ |d(Q,R) - d(P,R)|
A third point R provides a very inexpensive lower bound on the true distance
If the lower bound is larger than the existing best, skip d(P, Q)
d(P,Q) ≥ |d(Q,R) - d(P,R)| ≥ BestPairDistance
P Q
R
65
Circular Projection
r
Pick a reference point r
Circularly Project all points on a line passing through the reference point
Equivalent to computing
distance from r and then sorting the points according
to distance
1
5
3
716
10
12
20
11
6
24
21
18
2
22
17
15
23
13
148
49
19 r
66
The Order Line
r
P Q
r
|d(Q, r) - d(P, r)|
d(Q, r)
d(P, r)
k = 1k = 2k = 3
k=1:n-1• Compare every pair
having k-1 points in between
• Do k scans of the order line, starting with the 1st to kth point
BestPairDistance
0
1
5
3
716
10
12
20
11
6
24
21
18
2
22
17
15
23
13
148
49
19 r
67
Correctness
• If we search for all offset=1,2,…,n-1 then all possible pairs are considered.
– n(n-1)/2 pairs
• if for any offset=k, none of the k scans needs an actual distance computation then for the rest of the offsets=k+1,…,n-1 no
distance computation will be needed.
r
68
Performance
• Orders of Magnitude faster• Exact in execution• No sacrifice of the quality of the results
-0.5
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
10 30 50 70 90 110
X103
X103
Brute Force Brute Force with Optimized Distance Computation Our Algorithm
69
(MK)
• Use multiple reference points for tighter lower bounds.
Multiple References
70
r1 r2
yx
105 106 107 108 109 1010105
106
107
108
109
1010
Pruning by Triangular Inequality
Pru
nin
g b
y P
tole
mai
c In
equ
alit
y Ptolemaic bound wins
Linear bound wins
𝑥𝑦 ≥|𝑥𝑟1. 𝑦𝑟2 + 𝑥𝑟2. 𝑦𝑟1|
𝑟1𝑟2
Ptolemaic bound
Pruning by Multiple References
71
P
Q
R1
R3
R4
R2max( | d(P,Ri) - d(Q,Ri) | )
0 10 20 30 40 50 60
5
6
7
8
Lo
gar
ithm
of
tim
e
(sec
on
ds)
Number of Reference Time
Series Used
Time to find the motif in a
database of 64,000
random walk time series
of length 1,024
5 10 15 20 25 30 35
500
1000
1500
2000
2500
3000
0 5 10 15 20 25 30 350
200
400
600
800
1000
1200
1400
0 5 10 15 20 25 30 350
200
400
600
800
1000
1200
1400
1600
1800
Algorithms Outline
• Algorithms– Definition, Distance Measures and Invariances– Exact Algorithms
• Fixed Length• Enumeration of All length• K-motif Discovery• Online Maintenance
– Approximate Algorithms• Random Projection Algorithm
– Multi-dimensional Motif Discovery– Open Problems
72
Questions and Comments
73
Simplest Definition of Time Series Motifs
1. Length of the motif = Given All
2. Support of the motif = 2
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping
20 40 60 80 100 120 140 160 180 200-2
-1
0
1
2
The most similar/least distant pairs of non-overlapping subsequences at all lengths.
74
Goals: Enumerating Motifs
1. Remove the length parameter
2. Search for motifs in a range of lengths and report
– ALL of the motifs of all of the lengths
3. Retain Scalability
75
Bound on Extension
1. Two time series 𝒙 and 𝒚 of length 𝑚
2. Their normalized Euclidean distance 𝑑 𝒙, 𝒚
3. Find 𝑑𝐿𝐵 𝒙+1, 𝒚+1 if we increase the length of 𝒙and 𝒚 by appending the next two numbers.
Values Changed
1 2 3 4 5
345678910
Without Normalization
1 2 3 4 5-4-3-2-101234
With Normalization
76
Intuition
1 2 3 4 5-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
1 2 3 4 5-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
1 2 3 4 5-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
Length 4 Length 5 Length 5
Append 10 and re-normalize Append 20 and re-normalizeNormalized
Area shrinks
Area between blue and red is the distance between the signals
Area shrinks further
If infinity is appended to both the signals, they will have zero area/distance. 77
Bounding Euclidean Distance
𝑑𝐿𝐵2 𝒙+1, 𝒚+1 =
1
𝜎𝑚2 𝑑𝑚
2 𝒙, 𝒚 < 𝑑𝑚2 ( 𝒙, 𝒚)
Variances of 𝒙+1and 𝒚+1, 𝜎𝑚2 =
𝑚
𝑚+1+
𝑚
𝑚+1 2 𝑧2
𝑧 = maximum normalized value in the databaseA safe approximation 𝑧 = max(𝑎𝑏𝑠 𝒙 , 𝑎𝑏𝑠 𝒚 )
78
Experimental Validation of the Bounds
2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3x 105
20
20.5
21
21.5
22
22.5
23
23.5
24
24.5
25
0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
x 105
0
5
10
15
20
25
30
35
Pairs in ascending order of distances
Norm
aliz
ed D
ista
nce
Blue: Distances of random signals of length 255Gray: Distances of the signals when they are extended by one random sample Red: The upper and lower bounds before observing the extensions 79
Intuition
3.94 =4.57
1.16
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000-8.5
-8
-7.5
-7 x 103
145, 5410, 1.26
8345, 4211, 2.63
1655, 9461, 2.96
6531, 2501, 3.17
851, 1440, 3.73
2512, 3110, 3.98
1685, 9260, 4.57
Length 𝑚=7
𝑂 𝑛 pairs in order of
the distances for
length 𝑚
145, 5410, 1.79
8345, 4211, 1.63
1655, 9461, 3.61
6531, 2501, 2.71
851, 1440, 3.83
2512, 3110, 4.18
1685, 9260, 4.27
Length 𝑚=8
8345, 4211, 1.63
145, 5410, 1.79
6531, 2501, 2.71
1655, 9461, 3.61
851, 1440, 3.83
2512, 3110, 4.18
1685, 9260, 4.27
𝜎72=1.16
Location 1Location 2
Distance
80
----, -----, -----
----, -----, -----
n = 10000
2.77 =3.68
1.32
Intuition
8345, 4211, 1.63
145, 5410, 1.79
6531, 2501, 2.71
1655, 9461, 3.61
851, 1440, 3.83
Length 𝑚=8
8345, 4211, 1.23
145, 5410, 1.98
6531, 2501, 1.71
1655, 9461, 3.68
851, 1440, 3.61
Length 𝑚=9
8345, 4211, 1.23
6531, 2501, 1.71
145, 5410, 1.98
851, 1440, 3.61
1655, 9461, 3.68
𝜎82=1.32
• Once in every 10 lengths, the exact ordered list is required to be populated.
• This yields a 10x speed-up from running fixed-length motif discovery for all lengths.
81
Sanity Check
0 1000 2000 3000 4000 5000 6000-2
0
2
4
1380 1400 1420 1440 1460-5
0
5
Length :871320 1340 1360 1380 1400 1420
-5
0
5
Length :105600 700 800
-5
0
5
Length :2992200 2300 2400
-5
0
5
Length :299
(1) (2) (3) (4)
White Noise
http://www.cs.unm.edu/~mueen/Projects/MOEN/index.html
• Three Patterns planted in a random signal with different scaling.
• The algorithm finds them appropriately.82
Experimental Results: Scalability
0 2 4 6 8 10 12 14 16x 104
0
1
2
3
4
5
6
7x 105
Data Length (n)
Exec
uti
on
Tim
e in
Sec
on
ds
Smart Brute ForceEEGEOGRandom WalkIterative MK
1 2 3 4 5 6 7 8 9 10
0
2
4
6
8
10
12
14
16
18x 104
Smart Brute ForceEEGRandom WalkEOG
Exec
uti
on
Tim
e in
Sec
on
ds
Range of Lengths (maxLen-minLen+1)
x 102
83
MOENMOEN
Algorithms Outline
• Algorithms– Definition, Distance Measures and Invariances– Exact Algorithms
• Fixed Length• Enumeration of All length• K-motif Discovery• Online Maintenance
– Approximate Algorithms• Random Projection Algorithm
– Multi-dimensional Motif Discovery– Open Problems
84
Simplest Definition of Time Series Motifs
1. Length of the motif = Given All
2. Support of the motif = 2 k and τ
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping
20 40 60 80 100 120 140 160 180 200-2
-1
0
1
2
The non-overlapping subsequences at all lengths having k or more τ-matches.
85
Optimal algorithm is hard
• Search for locations of the τ-balls that contain k subsequences
• NP-Hard
• Instead we search for a motif representative that has k subsequences within τ
86
How do we find the motif representative?
• Simply take one of the two occurrences as the representative
• Take the average of the two
• Find all occurrences within a threshold of pair and train a HMM to capture the concept (Minnen’07)
87
Using each one in the pair as the representative (k=4, τ=0.9)
88
1
2
3
4
0 50 100 150-4
-2
0
2
0 5000 10000
1
2
4
5
3
0 50 100 150-4
-2
0
2
4
0 5000 10000
EEG Data
EPG Data
It takes only k similarity searches to find other occurrences. The overall complexity remains the same.
Finding top-K motif
89
1000 2000 3000 4000 5000 6000
1000
2000
3000
4000
A A B A x A x A B A x A A x B x A A B A x B A x B A C AxC A A x C
Length :155
Length :255
Length :157
Length :251
A
C
B
• Run MK for K times• Replace occurrences by random noise
between iterations
Algorithms Outline
• Algorithms– Definition, Distance Measures and Invariances– Exact Algorithms
• Fixed Length• Enumeration of All length• K-motif Discovery• Online Maintenance
– Approximate Algorithms• Random Projection Algorithm
– Multi-dimensional Motif Discovery– Open Problems
90
• Streaming time series
• Sliding window of the recent history– What minute long trace repeated in the last hour?
Online Time Series Motifs
91
Discovery
100 200 300 400 500 600 700 800 900 1000
-8000
-7500
-7000
12345678...
873
The most similar pair of non-overlapping
subsequences
Maintenance
12345678...
873874
time:100112345678...
873874875
time:100212345678...
873874875876
time:100312345678...
873874875876877
time:100412345678...
873874875876877878
time:1005
Update motif pair afterevery time tick
time
Problem Formulation
92
• A subsequence is a high dimensional point
– The dynamic closest pair of points problem
• Closest pair may change upon every update
• Naïve approach: Do quadratic comparisons.
4
5
2
7
3
6
89
4
5
2
7
3
6
8
1
4
5
2
7
3
6
8
1
Deletion Insertion
Challenges
93
• Goal: Algorithm with Linear update time
• Previous method for dynamic closest pair (Eppstein,00)
– A matrix of all-pair distances is maintained
• O(w2) space required
– Quad-tree is used to update the matrix
• Maintain a set of neighbors and reverse neighbors for all points
• We do it in O( ) spaceww
Related Work
94
• Smallest nearest neighbor → Closest pair
• Upon insertion
– Find the nearest neighbor; Needs O(w) comparisons.
• Upon deletion
– Find the next NN of all the reverse NN
4
5
2
7
3
6
8
1
4
5
2
7
3
6
8
1
Insertion
9
4
5
2
7
3
6
8
1
Deletion
9
Maintaining Motif
95
1 2 3 4 5 6 7 8FIFO Sliding window
N-listsNeighbors in order
of distances
8 7 21 8 1 24
5 1 17 3 3 67
7 6 75 2 4 55
1 4 83 7 5 72
4 8 42 5 2 13
6 1 38 1 8 38
3 5 66 4 6 46
R-listsPoints that should be updated
Total number of nodes in R-lists is w
5 1 3 24
8 67
4
5
2
7
3
6
8
1
4
7
(ptr, distance)
Data Structure
96
While inserting Updating NN of old points is not necessary A point can be removed from the neighbor list if it
violates the temporal order
Averagelength O( )w
Observations
97
4
5
2
7
3
6
8
1
1 1 21 3 1 2
52
83
2 43 5 3 6
4 7
5
6
4
7
6
1 2 3 4 5 6 7 8
2 3 32 3 2 6
3 7 9
5
54 4 6 8
5 7
6
68
4
2 3 4 5 6 7 8 91
4
5
2
7
3
6
8
1
9
4
5
2
7
3
6
89
10
3 34 3 6 6
57 9
54 4 7 8
5
6
4
6 8
5
7
8
9
3 4 5 6 7 8 9 1021
10
Example
98
• Up to from general dynamic closest pairper point with increasing window size
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
Insect
EEG
EOG
RW
Av
erag
e U
pd
ate
Tim
e (S
ec)
0 0.5 1 1.5 2 2.5 3 3.5 4x 104
Window Size (w)0 200 400 600 800 1000
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Insect
EEG
EOG
RW
Av
erag
e U
pd
ate
Tim
e (S
ec)
Motif Length (m)
0 0.5 1 1.5 2 2.5 3 3.5 4x 104
102030405060708090100110
Insect
EEG
EOG
RW
Av
erag
e L
eng
th o
f N
-lis
t
Window Size (w)0 200 400 600 800 1000
0
50
100
150
200
250
300
350
Insect
EEG
EOG
RW
Motif Length (m)
Av
erag
e L
eng
th o
f N
-lis
t
Results
99
Algorithms Outline
• Algorithms– Definition, Distance Measures and Invariances– Exact Algorithms
• Fixed Length• Enumeration of All length• K-motif Discovery• Online Maintenance
– Approximate Algorithms• Random Projection Algorithm
– Multi-dimensional Motif Discovery– Open Problems
100
Questions and Comments
101
Abdullah MueenUniversity of New Mexico, USA
Eamonn KeoghUniversity of California Riverside, USA
Finding Repeated Structure in Time Series: Algorithms and Applications
Break; We meet back in this room at 5:15PM
Slides available at: http://www.cs.unm.edu/~mueen/Tutorial/SDM2015Tutorial2.pdf102
General Outline
• Applications (50 minutes)
– As Subroutines in Data Mining
– In Other Scientific Research
• Algorithms (100 minutes)
– Uni-dimensional
– Multi-dimensional
103
Algorithms Outline
• Algorithms– Definition, Distance Measures and Invariances– Exact Algorithms
• Fixed Length• Enumeration of All length• K-motif Discovery• Online Maintenance
– Approximate Algorithms• Random Projection Algorithm
– Multi-dimensional Motif Discovery– Open Problems
104
How do we find approximate motif in a time series?
The obvious brute force search algorithm is just too slow…
Our algorithm is based on a hot idea from bioinformatics, random projection* and the fact that SAX allows us to lower bound discrete representations of time series.
* J Buhler and M Tompa. Finding motifs using random projections. In RECOMB'01. 2001.
Symbolic Aggregate ApproXimation
106
How do we obtain SAX?
First convert the time series to PAA representation, then convert the PAA to symbols
It take linear time
0 20 40 60 80 100 120
C
C
0
--
0 20 40 60 80 100 120
bbb
a
cc
c
a
baabccbc 107
Time series subsequences tend to have a highly Gaussian distribution
-10 0 10
0.001
0.003
0.01 0.02
0.05
0.10
0.25
0.50
0.75
0.90
0.95
0.98 0.99
0.997
0.999
Pro
babi
lity
A normal probability plot of the (cumulative) distribution of
values from subsequences of length 128.
0
--
0 20 40 60 80 100 120
bbb
a
cc
c
a
Why a Gaussian?
108
Visual Comparison
A raw time series of length 128 is transformed into the word “ffffffeeeddcbaabceedcbaaaaacddee.”
– We can use more symbols to represent the time series since each symbol requires fewer bits than real-numbers (float, double)
-3
-2 -1 0 1 2 3
DFT
PLA
Haar
APCA
a b c d e f
109
ab:
:
a:
b
cc:
:
c:
c
ba:
:
c:
c
ab:
:
a:
c
1
2
:
:
58
:
985
0 500 1000
a c b a
T (m= 1000 )
a = 3 { a,b,c}n = 16w = 4
S^
C 1
C 1^
Assume that we have a time series T
of length 1,000, and a motif of length
16, which occurs twice, at time T1
and time T58.
A simple worked example of approximate motif discovery algorithm
The next 3 slides
110
A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project matrix into buckets.
Collisions are recorded by incrementing the appropriate location in the collision matrix
A simple worked example of approximate motif discovery algorithm
111
A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project matrix into buckets.
Once again, collisions are recorded by incrementing the appropriate location in the collision matrix
A simple worked example of approximate motif discovery algorithm
112
0
1 1 E( , , , , ) 1-
2
t i w id
i
k wi ak a w d t
iw a a
We can calculate the expected values in the matrix, assuming there are NO patterns…
two randomly-generated words of size w over an alphabet of size a, the probability that they match with up to d errors
t is the length of the projected string. We conclude that if we have k random strings of size w, an entry of the similarity matrix will be hit on average
113
Motif significance involves several independent dimensions
Similarity
Support
Length/size
Assessing significance requires estimating a function S:Rd→R over these dimensions so we can rank the motifs 114
More invariances mean more independent dimensions
Similarity
Support
Length/size
Complexity
Dimensionality
115
Multi-dimensional Motif
• Synchronous
– Treat it as an even higher dimensional problem
– Simple extensions of uni-dimensional algorithms work
– To find sub-dimensional motifs, all possible sub-spaces have to be considered
116
Multi-dimensional Motif
• Non-Synchronous
– Lags among motifs are common
– Subsets of dimensions can possibly construct a motif
David Minnen, Charles Isbell, Irfan Essa, and Thad Starner. Detecting Subdimensional Motifs: An Efficient Algorithm for Generalized Multivariate Pattern Discovery. ICDM '07
Alireza Vahdatpour, Navid Amini, Majid Sarrafzadeh: Toward Unsupervised Activity Discovery Using Multi-Dimensional Motif Detection in Time Series. IJCAI 2009: 1261-1266
117
Coincidence table
coincident(ri, rj) is the number of overlapping occurrences of motif iand motif j
118
Single-dimensional motifs to graph• Produce a co-occurrence graph
• Nodes are single dimensional motifs
• An edge between x and y denotes, x and y always co-occur within a time lag
• Cluster the graph using min-cut algorithms to find multi-dimensional motifs
x 104
0 0.5 1 1.5 2 2.5 3
0/20/22/4
1/41/21/2
0/30/40/2
0/40/40/4
Hip
Hand
Arm
Legxyz
xyz
xyz
xyz
119
Open Problems
• New Invariances:– P1: Find repeated patterns under warping
distance.
– P2: Finding motifs under Complexity invariance.• Uniform scaling (Yankov 06)
• Significance:– P3: Assessing significance of motifs without
discretization.• Parameter-free
• Data adaptive
120
Open Problems
• Algorithmic:– P4: Optimal k-motif for a given threshold– P5: Exact multi-dimensional motif discovery
• Application:– P6: Finding hidden state machine from motifs
• States == clustering• Rules between patterns only• State machine is for rules among clusters• Systems:
– A suite with all the techniques added
• Parallel motif discovery using GPU
121
References1. Abdullah Mueen, Eamonn J. Keogh: Online discovery and maintenance of time series motifs. KDD 2010: 1089-10982. Abdullah Mueen, Eamonn J. Keogh, Qiang Zhu, Sydney Cash, M. Brandon Westover: Exact Discovery of Time Series
Motifs. SDM 2009: 473-4843. Abdullah Mueen, Eamonn J. Keogh, Nima Bigdely Shamlo: Finding Time Series Motifs in Disk-Resident Data. ICDM 2009:
367-3764. Abdullah Mueen: Enumeration of Time Series Motifs of All Lengths. ICDM 2013: 547-5565. Yuan Hao, Mohammad Shokoohi-Yekta, George Papageorgiou, Eamonn J. Keogh: Parameter-Free Audio Motif Discovery
in Large Data Archives. ICDM 2013: 261-2706. Dragomir Yankov, Eamonn J. Keogh, Jose Medina, Bill Yuan-chi Chiu, Victor B. Zordan: Detecting time series motifs under
uniform scaling. KDD 2007: 844-8537. Xiaopeng Xi, Eamonn J. Keogh, Li Wei, Agenor Mafra-Neto: Finding Motifs in a Database of Shapes. SDM 2007: 249-2608. Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-4989. Pranav Patel, Eamonn J. Keogh, Jessica Lin, Stefano Lonardi: Mining Motifs in Massive Time Series Databases. ICDM
2002: 370-37710. Alireza Vahdatpour, Navid Amini, Majid Sarrafzadeh: Toward Unsupervised Activity Discovery Using Multi-Dimensional
Motif Detection in Time Series. IJCAI 2009: 1261-126611. Debprakash Patnaik, Manish Marwah, Ratnesh Sharma, and Naren Ramakrishnan. 2009. Sustainable operation and
management of data center chillers using temporal data mining. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '09)
12. Yuan Li, Jessica Lin, Tim Oates: Visualizing Variable-Length Time Series Motifs. SDM 2012: 895-90613. Tim Oates, Arnold P. Boedihardjo, Jessica Lin, Crystal Chen, Susan Frankenstein, and Sunil Gandhi. 2013. Motif discovery
in spatial trajectories using grammar inference. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management(CIKM '13). ACM, New York, NY, USA, 1465-1468.
14. Sorrachai Yingchareonthawornchai, Haemwaan Sivaraks, Thanawin Rakthanmanon, Chotirat Ann Ratanamahatana: Efficient Proper Length Time Series Motif Discovery. ICDM 2013: 1265-1270
15. David Minnen, Charles Lee Isbell Jr., Irfan A. Essa, Thad Starner: Discovering Multivariate Motifs using Subsequence Density Estimation and Greedy Mixture Learning. AAAI 2007: 615-620
123
References
16. Philippe Beaudoin, Stelian Coros, Michiel van de Panne, and Pierre Poulin. 2008. Motion-motif graphs. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '08)
17. David Minnen, Charles L. Isbell, Irfan A. Essa, Thad Starner: Detecting Subdimensional Motifs: An Efficient Algorithm for Generalized Multivariate Pattern Discovery. ICDM 2007: 601-606
18. Yuan Hao, Yanping Chen, Jesin Zakaria, Bing Hu, Thanawin Rakthanmanon, Eamonn J. Keogh: Towards never-ending learning from time series streams. KDD 2013: 874-882
19. Gustavo E. A. P. A. Batista, Xiaoyue Wang, Eamonn J. Keogh: A Complexity-Invariant Distance Measure for Time Series. SDM 2011: 699-710
20. Thanawin Rakthanmanon, Eamonn J. Keogh, Stefano Lonardi, Scott Evans: Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data. ICDM 2011: 547-556
21. Yasser F. O. Mohammad, Toyoaki Nishida:Constrained Motif Discovery in Time Series. New Generation Comput. 27(4): 319-346 (2009)
22. Nuno Castro, Paulo J. Azevedo:Time Series Motifs Statistical Significance. SDM 2011: 687-69823. Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-49824. Zeeshan Syed, Collin Stultz, Manolis Kellis, Piotr Indyk, John V. Guttag: Motif discovery in physiological datasets: A
methodology for inferring predictive elements. TKDD 4(1) (2010)25. Yasser F. O. Mohammad, Toyoaki Nishida: Exact Discovery of Length-Range Motifs. ACIIDS (2) 2014: 23-32
124
THANK YOUQuestions and Comments
125