SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano...
![Page 1: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/1.jpg)
SAX: a Novel Symbolic Representation of Time Series
Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi
Presenter: Arif Bin Hossain
Slides incorporate materials kindly provided by Prof. Eamonn Keogh
![Page 2: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/2.jpg)
Time Series
A time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. [Wiki]
Examples: economic, sales, and stock market forecasting; EEG, ECG, and BCI analysis
[Figure: example time series plot; x-axis 0 to 8000, y-axis 0 to 30]
![Page 3: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/3.jpg)
Problems
Join: Given two data collections, link items occurring in each
Annotation: Obtain additional information from given data
Query by content: Given a large data collection, find the k most similar objects to an object of interest.
Clustering: Given an unlabeled dataset, arrange its items into groups by their mutual similarity
![Page 4: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/4.jpg)
Problems (Cont.)
Classification: Given a labeled training set, classify future unlabeled examples
Anomaly Detection: Given a large collection of objects, find the one that is most different from all the rest.
Motif Finding: Given a large collection of objects, find the pair that is most similar.
![Page 5: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/5.jpg)
Data Mining Constraints
For example, suppose you have one gig of main memory and want to do K-means clustering…
Clustering ¼ gig of data: 100 sec
Clustering ½ gig of data: 200 sec
Clustering 1 gig of data: 400 sec
Clustering 1.1 gigs of data: a few hours
Bradley, M. Fayyad, & Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15
![Page 6: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/6.jpg)
Generic Data Mining
• Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
• Approximately solve the problem at hand in main memory
• Make (hopefully very few) accesses to the original data on disk to confirm the solution
![Page 7: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/7.jpg)
Some Common Approximations
![Page 8: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/8.jpg)
Why Symbolic Representation?
• Reduce dimension
• Numerosity reduction
• Hashing
• Suffix Trees
• Markov Models
• Stealing ideas from the text processing/bioinformatics community
![Page 9: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/9.jpg)
Symbolic Aggregate ApproXimation (SAX)
• Lower bounding of Euclidean distance
• Lower bounding of the DTW distance
• Dimensionality Reduction
• Numerosity Reduction
Example SAX word: baabccbc
![Page 10: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/10.jpg)
SAX
Allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w<<n)
Notation
C: a time series, C = c1, …, cn
Ć: a Piecewise Aggregate Approximation of a time series, Ć = ć1, …, ćw
Ĉ: a symbolic representation of a time series, Ĉ = ĉ1, …, ĉw
w: the number of PAA segments representing C
a: alphabet size
![Page 11: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/11.jpg)
How to obtain SAX?
Step 1: Reduce dimension by PAA
A time series C of length n can be represented in a w-dimensional space by a vector Ć = ć1, …, ćw
The ith element is calculated by
$$\acute{c}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} c_j$$
Example: reduce dimension from 20 to 5. The 2nd element will be
$$\acute{c}_2 = \frac{5}{20} \sum_{j=5}^{8} c_j$$
![Page 12: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/12.jpg)
How to obtain SAX?
Data is divided into w equal-sized frames.
The mean value of the data falling within each frame is calculated.
The vector of these values becomes the PAA.
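The frame-averaging step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `paa` is made up here, and the sketch assumes the series length n is an exact multiple of w, as in the 20-to-5 example.

```python
from typing import List, Sequence


def paa(series: Sequence[float], w: int) -> List[float]:
    """Piecewise Aggregate Approximation: the mean of each of w equal-sized frames.

    Assumption of this sketch: len(series) is an exact multiple of w.
    """
    n = len(series)
    if w <= 0 or n % w != 0:
        raise ValueError("this sketch requires n to be a positive multiple of w")
    frame = n // w  # number of points per frame (n / w)
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]


# Reduce a series of length 20 to 5 segments; each segment averages 4 points.
c = list(range(1, 21))   # c_1 .. c_20 = 1 .. 20
approx = paa(c, 5)
print(approx[1])         # 2nd element: mean of c_5 .. c_8 = 6.5
```

Series whose length is not a multiple of w need a weighting scheme for points that straddle two frames, which this sketch leaves out.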
[Figure: a time series C and its PAA approximation Ć; x-axis 0 to 120]
![Page 13: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/13.jpg)
How to obtain SAX?
Step 2: Discretization
Normalize Ć so that its values follow a Gaussian distribution
Determine the breakpoints that produce a equal-sized areas under the Gaussian curve, where a is the alphabet size
[Figure: PAA segments of the series mapped to the symbols a, b, c by the Gaussian breakpoints, giving the word baabccbc; x-axis 0 to 120]
Word size: 8
Alphabet size: 3
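The discretization step might look like the following Python sketch. It assumes z-normalization (zero mean, unit standard deviation) as the normalization and takes the breakpoints from the standard normal quantiles; the helper names `znormalize`, `breakpoints`, and `to_symbols`, and the PAA values in the example, are hypothetical and chosen only to reproduce the word on the slide.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev
from typing import List, Sequence


def znormalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # a constant series carries no shape information
    return [(x - mu) / sigma for x in series]


def breakpoints(a: int) -> List[float]:
    """The a - 1 cut points that split N(0, 1) into a equal-probability regions."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]


def to_symbols(paa_values: Sequence[float], a: int) -> str:
    """Map each PAA value to a letter: 'a' for the lowest region, then 'b', and so on."""
    cuts = breakpoints(a)
    return "".join(chr(ord("a") + bisect_right(cuts, v)) for v in paa_values)


# Hypothetical PAA values of an already normalized series, alphabet size 3.
paa_values = [-0.2, -1.1, -0.9, 0.1, 0.8, 1.2, 0.3, 0.7]
print(to_symbols(paa_values, 3))  # baabccbc
```

With a = 3 the two breakpoints are roughly -0.43 and 0.43, so each letter is expected to occur about equally often on Gaussian data.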
![Page 14: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/14.jpg)
Distance Measure
Given two time series Q and C:
Euclidean distance between Q and C
Distance after transforming the subsequences to PAA
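The slide's equations did not survive in this transcript. The sketch below uses the usual definitions: the Euclidean distance is the square root of the summed squared differences, and the PAA distance is the same quantity on the w segment means, scaled by sqrt(n/w). The function names are illustrative, and the series are assumed to have equal length n that is a multiple of w.

```python
import math
from typing import List, Sequence


def euclidean(q: Sequence[float], c: Sequence[float]) -> float:
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))


def paa(series: Sequence[float], w: int) -> List[float]:
    """Mean of each of w equal-sized frames (n assumed to be a multiple of w)."""
    frame = len(series) // w
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]


def paa_distance(q: Sequence[float], c: Sequence[float], w: int) -> float:
    """Distance between the PAA representations, scaled by sqrt(n / w)."""
    n = len(q)
    diffs = sum((qi - ci) ** 2 for qi, ci in zip(paa(q, w), paa(c, w)))
    return math.sqrt(n / w) * math.sqrt(diffs)


q = [0, 1, 2, 3, 4, 5, 6, 7]
c = [1, 1, 1, 1, 5, 5, 5, 5]
# The PAA distance never exceeds the Euclidean distance on the raw series.
print(paa_distance(q, c, 2) <= euclidean(q, c))  # True
```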
![Page 15: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/15.jpg)
Distance Measure
Define MINDIST after transforming to symbolic representation
MINDIST lower bounds the true distance between the original time series
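A minimal sketch of MINDIST follows, using the definition from the SAX paper as far as I recall it: adjacent or identical symbols contribute zero, and other pairs contribute the gap between the breakpoints that separate them. The function names are illustrative, and both words are assumed to share the same alphabet size a and the same original length n.

```python
import math
from statistics import NormalDist
from typing import List


def breakpoints(a: int) -> List[float]:
    """The a - 1 cut points that split N(0, 1) into a equal-probability regions."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]


def symbol_dist(r: str, c: str, cuts: List[float]) -> float:
    """Per-symbol distance: 0 for equal or adjacent letters, else the breakpoint gap."""
    i, j = ord(r) - ord("a"), ord(c) - ord("a")
    if abs(i - j) <= 1:
        return 0.0
    return cuts[max(i, j) - 1] - cuts[min(i, j)]


def mindist(q_word: str, c_word: str, n: int, a: int) -> float:
    """MINDIST between two SAX words of equal length w, for original series length n."""
    if len(q_word) != len(c_word):
        raise ValueError("words must have the same length")
    w = len(q_word)
    cuts = breakpoints(a)
    total = sum(symbol_dist(r, c, cuts) ** 2 for r, c in zip(q_word, c_word))
    return math.sqrt(n / w) * math.sqrt(total)


print(mindist("baabccbc", "baabccbc", 128, 3))  # identical words: 0.0
print(mindist("aaaa", "cccc", 16, 3) > 0)       # non-adjacent letters: True
```

Because the result never exceeds the true Euclidean distance, a candidate whose MINDIST already exceeds the best distance found so far can be discarded without reading the raw series from disk.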
![Page 16: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/16.jpg)
Numerosity Reduction
Subsequences are extracted by a sliding window
Consecutive subsequences are mostly repetitive
Sliding window finds aabbcc
If the next subsequence is also aabbcc, just store the position
This optimization depends on the data, but typically yields a reduction factor of 2 or 3
[Figure: space shuttle telemetry with subsequence length 32]
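One way to read this step is as a run-length style filter over the stream of SAX words produced by the sliding window. In the sketch below, a word is recorded with its starting position only when it differs from the word of the previous window. The function name and the sample words are made up for illustration.

```python
from typing import Iterable, List, Tuple


def numerosity_reduce(words: Iterable[str]) -> List[Tuple[int, str]]:
    """Keep (position, word) only where the word differs from the previous window's word."""
    kept: List[Tuple[int, str]] = []
    previous = None
    for position, word in enumerate(words):
        if word != previous:
            kept.append((position, word))
            previous = word
    return kept


# Hypothetical words from five consecutive sliding-window positions.
windows = ["aabbcc", "aabbcc", "aabbcc", "aabbca", "aabbcc"]
print(numerosity_reduce(windows))
# [(0, 'aabbcc'), (3, 'aabbca'), (4, 'aabbcc')]
```

The positions of the skipped windows can be recovered from the gaps between recorded positions, so nothing is lost at the word level.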
![Page 17: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/17.jpg)
Experimental Validation
Clustering: hierarchical, partitional
Classification: nearest neighbor, decision tree
Motif discovery
![Page 18: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/18.jpg)
Hierarchical Clustering
The sample dataset consists of 3 time series from each of three classes: decreasing trend, upward shift, and normal
![Page 19: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/19.jpg)
Partitional Clustering (k-means)
Assign each point to the one of k clusters whose center is nearest
Each iteration tries to minimize the sum of squared intra-cluster error
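The two steps above can be sketched as a plain k-means loop. This is a generic illustration on small numeric tuples, not the experimental setup from the slides: the initial centers are passed in explicitly so the run is deterministic, and the function name `kmeans` is illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], centers: List[Point], iterations: int = 10):
    """Alternate nearest-center assignment and center recomputation."""
    clusters: List[List[Point]] = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[k]
            for k, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```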
![Page 20: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/20.jpg)
Nearest Neighbor Classification
SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction
![Page 21: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/21.jpg)
Decision Tree Classification
Since decision trees are expensive to use with high-dimensional datasets, the Regression Tree [Geurts, 2001] is a better approach for data mining on time series
![Page 22: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/22.jpg)
Motif Discovery
Implemented the random projection algorithm of Tompa and Buhler [ICMB2001]
Hashes subsequences into buckets using a random subset of their features as a key
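A single hashing pass of this idea might look like the sketch below, which applies it to SAX words: a subset of symbol positions is chosen, and words that agree on those positions land in the same bucket. The function names are made up here, and the full algorithm, as I understand it, repeats the pass with different random subsets and counts how often pairs collide; that part is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def random_positions(word_length: int, k: int, seed: Optional[int] = None) -> List[int]:
    """Pick k distinct symbol positions at random to serve as the hash key."""
    return sorted(random.Random(seed).sample(range(word_length), k))


def project_to_buckets(words: Sequence[str], positions: Sequence[int]) -> Dict[str, List[int]]:
    """Group word indices by the symbols found at the chosen positions."""
    buckets: Dict[str, List[int]] = defaultdict(list)
    for index, word in enumerate(words):
        key = "".join(word[p] for p in positions)
        buckets[key].append(index)
    return dict(buckets)


# Hypothetical SAX words; positions fixed here so the outcome is reproducible.
words = ["abcab", "abcbb", "ccaba"]
print(project_to_buckets(words, [0, 2]))
# {'ac': [0, 1], 'ca': [2]}
```

Words 0 and 1 differ only at position 3, so any key that skips that position puts them in the same bucket, which flags them as a candidate motif pair.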
![Page 23: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/23.jpg)
New Version: iSAX
Use binary numbers for labeling the words
Different alphabet sizes (cardinalities) within a word
Comparison of words with different cardinalities
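The binary labels make symbols of different cardinalities comparable because, with equal-probability Gaussian breakpoints and cardinalities that are powers of two, the coarser breakpoints are a subset of the finer ones. The sketch below illustrates only that prefix property by reducing a symbol to a lower cardinality. It is not the comparison rule from the iSAX paper, and the helper names are hypothetical.

```python
def to_binary(symbol_index: int, cardinality: int) -> str:
    """Binary label of a symbol; cardinality is assumed to be a power of two >= 2."""
    bits = cardinality.bit_length() - 1
    return format(symbol_index, f"0{bits}b")


def demote(symbol_index: int, from_cardinality: int, to_cardinality: int) -> int:
    """Re-express a symbol at a lower cardinality by dropping its low-order bits."""
    shift = from_cardinality.bit_length() - to_cardinality.bit_length()
    if shift < 0:
        raise ValueError("can only reduce cardinality in this sketch")
    return symbol_index >> shift


# Symbol 6 of 8 is '110'; at cardinality 4 it becomes '11', at cardinality 2 it becomes '1'.
print(to_binary(6, 8))                  # 110
print(to_binary(demote(6, 8, 4), 4))    # 11
print(to_binary(demote(6, 8, 2), 2))    # 1
```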
![Page 24: SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649c3a5503460f948e4a97/html5/thumbnails/24.jpg)
Thank you
Questions?