Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping
![Page 1: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/1.jpg)
Searching and Mining Trillions of Time Series Subsequences under Dynamic
Time Warping
Thanawin (Art) Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Qiang Zhu, Brandon Westover, Jesin Zakaria, Eamonn Keogh
![Page 2: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/2.jpg)
2
What is a Trillion?
• A trillion is simply one million million.
• Up to 2011 there have been 1,709 SIGKDD papers. If every such paper was on time series, and each had looked at five hundred million objects, this would still not add up to the size of the data we consider here.
• However, the largest time series dataset considered in a SIGKDD paper was a "mere" one hundred million objects.
![Page 3: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/3.jpg)
3
Dynamic Time Warping
[Figure: query Q and candidate C with similar but out-of-phase peaks, and the alignment of Q and C constrained by the warping window R.]
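The warping window R on this slide can be illustrated with a short sketch. It assumes equal-length sequences, a squared point-to-point cost, and a band constraint |i - j| ≤ r; the function name `dtw_distance` is illustrative and not taken from the UCR Suite code.

```python
import math

def dtw_distance(q, c, r):
    """DTW distance between equal-length sequences q and c.

    The warping path may only visit cells with |i - j| <= r,
    where r plays the role of the warping window R.
    """
    n = len(q)
    if len(c) != n:
        raise ValueError("this sketch assumes equal-length sequences")
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # empty prefixes align at zero cost
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor cells
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return math.sqrt(prev[n])

q = [0, 1, 2, 1, 0, 0]
c = [0, 0, 1, 2, 1, 0]  # same peak, shifted by one step
no_warping = dtw_distance(q, c, 0)    # r = 0 reduces to Euclidean distance
with_warping = dtw_distance(q, c, 1)  # one step of warping absorbs the shift
```

With r = 0 the measure is the Euclidean distance; a small window is enough to align the out-of-phase peaks in this toy example.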
![Page 4: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/4.jpg)
4
Motivation
• Similarity search is the bottleneck for most time series data mining algorithms.
• The difficulty of scaling search to large datasets explains why most academic work has considered at most a few million time series objects.
![Page 5: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/5.jpg)
5
Objective
• Search and mine really big time series.
• Allow us to solve higher-level time series data mining problems, such as motif discovery and clustering, at scales that would otherwise be untenable.
![Page 6: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/6.jpg)
6
Assumptions (1)
• Time Series Subsequences must be Z-Normalized
– In order to make meaningful comparisons between two time series, both must be normalized.
– Offset invariance.
– Scale/Amplitude invariance.
• Dynamic Time Warping is the Best Measure (for almost everything)
– Recent empirical evidence strongly suggests that none of the published alternatives routinely beats DTW.
![Page 7: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/7.jpg)
7
Assumptions (2)
• Arbitrary Query Lengths cannot be Indexed
– If we are interested in tackling a trillion data objects, we clearly cannot fit even a small-footprint index in main memory, much less the much larger index suggested for arbitrary-length queries.
• There Exist Data Mining Problems that we are Willing to Wait Some Hours to Answer
– A team of entomologists has spent three years gathering 0.2 trillion datapoints.
– Astronomers have spent billions of dollars to launch a satellite to collect one trillion datapoints of star-light curve data per day.
– A hospital charges $34,000 for a daylong EEG session to collect 0.3 trillion datapoints.
![Page 8: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/8.jpg)
8
Proposed Method: UCR Suite
• An algorithm for nearest neighbor search
• Supports both ED and DTW search
• Combination of various optimizations
– Known Optimizations
– New Optimizations
![Page 9: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/9.jpg)
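Known Optimizations (1)
• Using the Squared Distance
– The square root is monotonic, so it can be dropped without changing the nearest neighbor ranking.
$$ED(Q,C)=\sqrt{\sum_{i=1}^{n}(q_i-c_i)^2}$$
• Exploiting Multicores
– More cores, more speed
• Lower Bounding
– LB_Yi
– LB_Kim
– LB_Keogh
[Figure: LB_Keogh, showing candidate C against the upper envelope U and lower envelope L built around query Q.]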
![Page 10: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/10.jpg)
10
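Known Optimizations (2)
• Early Abandoning of ED
[Figure: Q and C compared point by point, annotated "We can early abandon at this point".]
$$ED(Q,C)=\sqrt{\sum_{i=1}^{n}(q_i-c_i)^2}$$
bsf: the best-so-far distance that the running sum is compared against.
• Early Abandoning of LB_Keogh
[Figure: C against the envelope U, L. U, L is an envelope of Q.]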
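Early abandoning of ED can be sketched as follows. This is a minimal illustration that assumes `bsf` holds the best-so-far Euclidean distance; the function name `early_abandon_ed` is hypothetical and not part of the UCR Suite code.

```python
import math

def early_abandon_ed(q, c, bsf):
    """Euclidean distance between q and c, or math.inf once the
    running sum of squared differences exceeds bsf squared."""
    bsf_sq = bsf * bsf  # compare in squared space, no sqrt in the loop
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total > bsf_sq:
            # the remaining terms can only increase the sum
            return math.inf
    return math.sqrt(total)

kept = early_abandon_ed([0, 0, 0], [3, 4, 0], bsf=10.0)
abandoned = early_abandon_ed([0, 0, 0], [3, 4, 0], bsf=2.0)
```

The same pattern applies to LB_Keogh, because its sum is also built from non-negative terms.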
![Page 11: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/11.jpg)
11
Known Optimizations (3)
• Early Abandoning of DTW
– Stop if dtw_dist ≥ bsf
• Earlier Early Abandoning of DTW using LB_Keogh
– Stop if dtw_dist (partial) + lb_keogh (partial) ≥ bsf
[Figure: C and Q aligned within the warping window R, with the envelope U, L. At K = 0, LB_Keogh is fully calculated and the calculation of DTW is about to begin. At K = 11, a partial calculation of DTW is combined with a partial truncation of LB_Keogh.]
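The stopping rule "dtw_dist + lb_keogh ≥ bsf" can be sketched as below. The sketch works in squared distances, assumes equal-length sequences and an envelope built around Q, and uses illustrative names (`keogh_contributions`, `early_abandon_dtw`) that are not from the UCR Suite code.

```python
import math

def keogh_contributions(q, c, r):
    """Per-point squared LB_Keogh terms of c against the envelope of q."""
    terms = []
    for j, cj in enumerate(c):
        window = q[max(0, j - r): j + r + 1]
        upper, lower = max(window), min(window)
        if cj > upper:
            terms.append((cj - upper) ** 2)
        elif cj < lower:
            terms.append((cj - lower) ** 2)
        else:
            terms.append(0.0)
    return terms

def early_abandon_dtw(q, c, r, bsf_sq):
    """Squared DTW cost, or math.inf once partial DTW plus the
    LB_Keogh terms of the untouched points reaches bsf_sq."""
    n = len(q)
    terms = keogh_contributions(q, c, r)
    tail = [0.0] * (n + 1)  # tail[k] = sum of terms[k:]
    for k in range(n - 1, -1, -1):
        tail[k] = tail[k + 1] + terms[k]
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        hi = min(n, i + r)
        row_min = math.inf
        for j in range(max(1, i - r), hi + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
            row_min = min(row_min, curr[j])
        # points of c beyond column hi have not been visited yet
        if row_min + tail[hi] >= bsf_sq:
            return math.inf
        prev = curr
    return prev[n]
```

Every warping path passes through each row, so the row minimum bounds the cost so far, and each unvisited point of C must still pay at least its LB_Keogh term.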
![Page 12: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/12.jpg)
12
UCR Suite
New Optimizations
Known Optimizations
– Early Abandoning of ED
– Early Abandoning of LB_Keogh
– Early Abandoning of DTW
– Multicores
![Page 13: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/13.jpg)
13
UCR Suite: New Optimizations (1)
• Early Abandoning Z-Normalization
– Do normalization only when needed (just in time).
– Small but non-trivial.
– This step can break O(n) time complexity for ED (and, as we shall see, DTW).
– Online mean and std calculation is needed.
$$z_i=\frac{x_i-\mu}{\sigma}$$
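Just-in-time z-normalization can be sketched as follows. The sketch assumes a non-constant query, keeps running sums of the values and their squares to obtain the mean and std of each window online, and normalizes a data point only when the distance loop reaches it. The function name and the guard for near-constant windows are illustrative assumptions.

```python
import math

def best_match_znorm_ed(data, query):
    """Index and distance of the subsequence of data closest to query
    under z-normalized Euclidean distance."""
    m = len(query)
    q_mean = sum(query) / m
    q_std = math.sqrt(sum(x * x for x in query) / m - q_mean * q_mean)
    q = [(x - q_mean) / q_std for x in query]

    best_idx, bsf_sq = -1, math.inf
    s = sum(data[:m])                  # running sum of the window
    ss = sum(x * x for x in data[:m])  # running sum of squares
    for start in range(len(data) - m + 1):
        if start > 0:
            old, new = data[start - 1], data[start + m - 1]
            s += new - old
            ss += new * new - old * old
        mean = s / m
        var = ss / m - mean * mean
        std = math.sqrt(var) if var > 1e-12 else 1.0  # assumed guard
        total = 0.0
        for i in range(m):
            z = (data[start + i] - mean) / std  # normalize just in time
            total += (z - q[i]) ** 2
            if total >= bsf_sq:
                break  # the rest of the window is never normalized
        else:
            best_idx, bsf_sq = start, total
    return best_idx, math.sqrt(bsf_sq)

idx, dist = best_match_znorm_ed([0, 0, 1, 3, 1, 0, 0, 5, 4, 0], [10, 30, 10])
```

In the usage line, the query differs from the matching window only in offset and scale, which z-normalization removes.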
![Page 14: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/14.jpg)
14
UCR Suite: New Optimizations (2)
• Reordering Early Abandoning
– We don't have to compute ED or LB from left to right.
– Order points by expected contribution.
[Figure: Q and C under the standard early abandon ordering (points visited left to right) and under the optimized early abandon ordering (points visited by expected contribution).]
Idea
– Order by the absolute height of the query point.
– This step alone can save about 30%-50% of calculations.
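The reordering idea can be sketched as below, assuming both sequences are already z-normalized and distances are kept squared. The names `query_order` and `ordered_early_abandon` are illustrative.

```python
import math

def query_order(q_norm):
    """Indices of the z-normalized query, largest absolute height first."""
    return sorted(range(len(q_norm)), key=lambda i: abs(q_norm[i]), reverse=True)

def ordered_early_abandon(q_norm, c_norm, bsf_sq, order):
    """Squared ED accumulated in the given order.

    Returns (distance, points_examined); distance is math.inf
    when the candidate is abandoned.
    """
    total = 0.0
    for steps, i in enumerate(order, start=1):
        total += (q_norm[i] - c_norm[i]) ** 2
        if total >= bsf_sq:
            return math.inf, steps
    return total, len(order)

q = [0.1, -0.2, 2.0, 0.0, -1.5]
c = [0.0, 0.0, 0.0, 0.0, 0.0]
left_to_right = ordered_early_abandon(q, c, 3.0, range(len(q)))
reordered = ordered_early_abandon(q, c, 3.0, query_order(q))
```

In this toy case the left-to-right scan examines three points before abandoning, while the reordered scan abandons after one. The order depends only on the query, so it is computed once.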
![Page 15: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/15.jpg)
15
UCR Suite: New Optimizations (3)
• Reversing the Query/Data Role in LB_Keogh
– Make LB_Keogh tighter.
– Much cheaper than DTW.
– Triple the data.
– Online envelope calculation.
[Figure: left, envelope on Q (U and L built around Q, compared with C); right, envelope on C (U and L built around C, compared with Q).]
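A sketch of the two LB_Keogh variants follows. It assumes equal-length sequences and squared distances, and computes each envelope in a single pass with monotonic deques, which is one common way to do an online sliding-window min/max; whether the UCR Suite uses exactly this routine is not stated on the slide. All function names are illustrative.

```python
from collections import deque

def envelope(x, r):
    """U[i] = max(x[i-r..i+r]) and L[i] = min(x[i-r..i+r]) in one pass."""
    n = len(x)
    upper, lower = [0.0] * n, [0.0] * n
    maxq, minq = deque(), deque()  # indices; values stay monotone
    for j in range(n + r):
        if j < n:
            while maxq and x[maxq[-1]] <= x[j]:
                maxq.pop()
            maxq.append(j)
            while minq and x[minq[-1]] >= x[j]:
                minq.pop()
            minq.append(j)
        i = j - r  # the window centred on i is complete once j is seen
        if i >= 0:
            while maxq[0] < i - r:
                maxq.popleft()
            while minq[0] < i - r:
                minq.popleft()
            upper[i], lower[i] = x[maxq[0]], x[minq[0]]
    return upper, lower

def lb_keogh_sq(upper, lower, x):
    """Squared LB_Keogh of x against a precomputed envelope."""
    total = 0.0
    for u, l, v in zip(upper, lower, x):
        if v > u:
            total += (v - u) ** 2
        elif v < l:
            total += (v - l) ** 2
    return total

def lb_keogh_both_sq(q, c, r):
    """(LB_KeoghEQ, LB_KeoghEC): envelope on Q, then roles reversed."""
    uq, lq = envelope(q, r)
    uc, lc = envelope(c, r)
    return lb_keogh_sq(uq, lq, c), lb_keogh_sq(uc, lc, q)

lb_eq, lb_ec = lb_keogh_both_sq([0, 0, 3, 0, 0], [0, 0, 0, 0, 0], 1)
```

In this example the envelope on Q gives a bound of 0 while the envelope on C gives 9, so the maximum of the two is the tighter bound. In a search, the second bound would only need to be computed when the first fails to prune.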
![Page 16: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/16.jpg)
16
UCR Suite: New Optimizations (4)
• Cascading Lower Bounds
– At least 18 lower bounds of DTW have been proposed.
– Use only the lower bounds on the skyline.
[Figure: tightness of lower bound (LB/DTW, from 0 to 1) versus computation cost (O(1), O(n), O(nR)). Plotted: LB_Kim, LB_KimFL, LB_Yi, LB_PAA, LB_Ecorner, LB_FTW, LB_KeoghEQ, max(LB_KeoghEQ, LB_KeoghEC), Early_abandoning_DTW, and DTW.]
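The cascade can be sketched as a sequence of increasingly expensive tests, each applied only if the previous one failed to prune. The sketch below assumes squared distances, sequences of equal length n ≥ 2, and a simplified LB_KimFL that uses only the first and last points; function names are illustrative, not from the UCR Suite code.

```python
import math

def lb_kim_fl_sq(q, c):
    # Every warping path pairs first with first and last with last.
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def lb_keogh_sq(env_src, other, r):
    """Squared LB_Keogh of `other` against the envelope of `env_src`."""
    total = 0.0
    for i, v in enumerate(other):
        window = env_src[max(0, i - r): i + r + 1]
        upper, lower = max(window), min(window)
        if v > upper:
            total += (v - upper) ** 2
        elif v < lower:
            total += (v - lower) ** 2
    return total

def dtw_sq(q, c, r):
    n = len(q)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]

def cascade(q, c, r, bsf_sq):
    """Return (stage, value): the stage that pruned the candidate and
    its bound, or ("DTW", squared distance) if nothing pruned it."""
    lb = lb_kim_fl_sq(q, c)          # O(1)
    if lb >= bsf_sq:
        return "LB_KimFL", lb
    lb = lb_keogh_sq(q, c, r)        # O(n), envelope on Q
    if lb >= bsf_sq:
        return "LB_KeoghEQ", lb
    lb = lb_keogh_sq(c, q, r)        # O(n), envelope on C
    if lb >= bsf_sq:
        return "LB_KeoghEC", lb
    return "DTW", dtw_sq(q, c, r)    # O(nR)
```

For example, `cascade([0, 0, 3, 0, 0], [0, 0, 0, 0, 0], 1, 5.0)` is pruned at the LB_KeoghEC stage, so the O(nR) DTW computation is never reached for that candidate.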
![Page 17: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/17.jpg)
17
UCR Suite
New Optimizations
– Just-in-time Z-normalizations
– Reordering Early Abandoning
– Reversing LB_Keogh
– Cascading Lower Bounds
Known Optimizations
– Early Abandoning of ED
– Early Abandoning of LB_Keogh
– Early Abandoning of DTW
– Multicores
![Page 18: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/18.jpg)
18
UCR Suite
New Optimizations
– Just-in-time Z-normalizations
– Reordering Early Abandoning
– Reversing LB_Keogh
– Cascading Lower Bounds
Known Optimizations (State-of-the-art*)
– Early Abandoning of ED
– Early Abandoning of LB_Keogh
– Early Abandoning of DTW
– Multicores
*We implemented the State-of-the-art (SOTA) as well as we could. SOTA is simply the UCR Suite without new optimizations.
![Page 19: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/19.jpg)
19
Experimental Result: Random Walk
• Random Walk: Varying size of the data

| Algorithm | Million (Seconds) | Billion (Minutes) | Trillion (Hours) |
|---|---|---|---|
| UCR-ED | 0.034 | 0.22 | 3.16 |
| SOTA-ED | 0.243 | 2.40 | 39.80 |
| UCR-DTW | 0.159 | 1.83 | 34.09 |
| SOTA-DTW | 2.447 | 38.14 | 472.80 |

Code and data are available at: www.cs.ucr.edu/~eamonn/UCRsuite.html
![Page 20: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/20.jpg)
20
Experimental Result: Random Walk
• Random Walk: Varying size of the query
[Figure: time in seconds (axis ticks at 100, 1000, 10000) versus query length, for Naïve DTW, SOTA DTW, SOTA ED, and UCR DTW (labeled OPT DTW).]
For query lengths of 4,096 (the rightmost part of the graph), the times in seconds are:
– Naïve DTW: 24,286
– SOTA DTW: 5,078
– SOTA ED: 1,850
– OPT DTW: 567
![Page 21: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/21.jpg)
21
Experimental Result: DNA
• Query: Human Chromosome 2 of length 72,500 bps
• Data: Chimp Genome 2.9 billion bps
• Time: UCR Suite 14.6 hours, SOTA 34.6 days (830 hours)
[Figure: Chromosome 2, BP 5709500:5782000, with a primate phylogeny of Human, Chimp, Gorilla, Orangutan, Gibbon, and Rhesus macaque; clades labeled Hominini, Homininae, Hominidae, Hominoidea, and Catarrhines.]
![Page 22: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/22.jpg)
22
Experimental Result: EEG
• Data: 0.3 trillion points of brain wave
• Query: Prototypical Epileptic Spike of 7,000 points (0.23 seconds at 30kHz)
• Time: UCR-ED 3.4 hours, SOTA-ED 20.6 days (~500 hours)
[Figure: Continuous Intracranial EEG with the query Q, x-axis from 0 to 7000 points.]
Recorded with platinum-tipped silicon micro-electrode probes inserted 1.0 mm into the cerebral cortex.
Recordings made from 96 active electrodes, with data sampled at 30kHz per electrode.
![Page 23: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/23.jpg)
23
Experimental Result: ECG
• Data: One year of Electrocardiograms, 8.5 billion data points.
• Query: Idealized Premature Ventricular Contraction (PVC, aka skipped beat) of length 421 (R=21=5%).

| Dataset | UCR-ED | SOTA-ED | UCR-DTW | SOTA-DTW |
|---|---|---|---|---|
| ECG | 4.1 minutes | 66.6 minutes | 18.0 minutes | 49.2 hours |

~30,000X faster than real time!
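The real-time factor can be checked with a line of arithmetic. The sketch assumes a 365-day year and assumes the claim refers to the UCR-DTW time of 18.0 minutes; the function name is illustrative.

```python
def real_time_factor(recorded_minutes, search_minutes):
    """How many times faster than the recording duration the search ran."""
    return recorded_minutes / search_minutes

minutes_per_year = 365 * 24 * 60  # 525,600 minutes of ECG
factor = real_time_factor(minutes_per_year, 18.0)
```

Under these assumptions the factor is 29,200, which rounds to the quoted ~30,000X.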
![Page 24: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/24.jpg)
24
Speeding Up Existing Algorithms
• Time Series Shapelets:
– SOTA 18.9 minutes, UCR Suite 12.5 minutes
• Online Time Series Motifs:
– SOTA 436 seconds, UCR Suite 156 seconds
• Classification of Historical Musical Scores:
– SOTA 142.4 hours, UCR Suite 720 minutes (12 hours)
• Classification of Ancient Coins:
– SOTA 12.8 seconds, UCR Suite 0.8 seconds
• Clustering of Star Light Curves:
– SOTA 24.8 hours, UCR Suite 2.2 hours
![Page 25: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/25.jpg)
25
Conclusion
UCR Suite …
• is an ultra-fast algorithm for finding the nearest neighbor.
• is the first algorithm that exactly mines a trillion real-valued objects in a day or two with an "off-the-shelf machine".
• uses a combination of various optimizations.
• can be used as a subroutine to speed up other algorithms.
• Probably close to optimal ;-)
![Page 26: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/26.jpg)
Authors’ Photo
Bilson Campana, Abdullah Mueen, Gustavo Batista, Qiang Zhu, Brandon Westover, Jesin Zakaria, Eamonn Keogh, Thanawin Rakthanmanon
![Page 27: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/27.jpg)
Acknowledgements
• NSF grants 0803410 and 0808770
• FAPESP award 2009/06349-0
• Royal Thai Government Scholarship
As an aside: Cool Insect Contest!
• Classify insects from wing beat sounds
[Figure: wing beat recording, y-axis from -0.2 to 0.2, x-axis from 0 to 4.5 × 10^4, annotated: background noise; bee begins to cross laser; bee has passed through the laser.]
http://www.cs.ucr.edu/~eamonn/CE
![Page 28: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/28.jpg)
28
Thank you for your attention
Questions?
Register Today: Cool Insect Contest!
http://www.cs.ucr.edu/~eamonn/CE
![Page 29: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/29.jpg)
29
Backup Slides
![Page 30: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/30.jpg)
30
LB_Keogh
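[Figure: candidate C against the envelope U, L built around query Q; R is the warping window.]
$$LB\_Keogh(Q,C)=\sqrt{\sum_{i=1}^{n}\begin{cases}(c_i-U_i)^2 & \text{if } c_i>U_i\\ (c_i-L_i)^2 & \text{if } c_i<L_i\\ 0 & \text{otherwise}\end{cases}}$$
$$U_i=\max(q_{i-r}:q_{i+r}),\qquad L_i=\min(q_{i-r}:q_{i+r})$$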
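The LB_Keogh definition on this slide can be sketched directly. The version below assumes equal-length sequences, truncates the envelope window at the sequence boundaries, and rebuilds the envelope naively for clarity; the function name `lb_keogh` is illustrative.

```python
import math

def lb_keogh(q, c, r):
    """LB_Keogh(Q, C) with envelope U_i = max(q[i-r..i+r]),
    L_i = min(q[i-r..i+r])."""
    total = 0.0
    for i, ci in enumerate(c):
        window = q[max(0, i - r): i + r + 1]
        upper, lower = max(window), min(window)
        if ci > upper:
            total += (ci - upper) ** 2
        elif ci < lower:
            total += (ci - lower) ** 2
        # points inside the envelope contribute 0
    return math.sqrt(total)

bound = lb_keogh([0, 0, 0, 0, 0], [0, 0, 3, 0, -4], 1)
```

Only the points of C that fall outside the envelope contribute, which is why the bound is cheap and never exceeds the true DTW distance for the same window.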
![Page 31: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/31.jpg)
31
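Known Optimizations
• Lower Bounding
– LB_Yi
– LB_Kim
– LB_Keogh
[Figures: LB_Yi, using max(Q) and min(Q); LB_Kim, using the feature points A, B, C, D; LB_Keogh, using the envelope U, L around Q compared with C.]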
![Page 32: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/32.jpg)
32
Ordering
[Figure: Average Number of Point-to-point Distance Calculations (y-axis, 0 to 35) versus Data in Progress (x-axis, 0 to 10 × 10^7), for SOTA-ED and UCR-ED, annotated "When good candidate is found".]
[Figure: Q and C under the standard early abandon ordering and under the optimized early abandon ordering.]
This step alone can save about 50% of calculations.
![Page 33: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping](https://reader037.fdocuments.us/reader037/viewer/2022102910/568143a9550346895db02ffa/html5/thumbnails/33.jpg)
33
UCR Suite
• New Optimizations
– Just-in-time Z-normalizations
– Reordering Early Abandoning
– Reversing LB_Keogh
– Cascading Lower Bounds
• Known Optimizations
– Early Abandoning of ED/LB_Keogh/DTW
– Use Square Distance
– Multicores