Searching and mining sequential data - KTH
Searching and mining sequential data
Panagiotis Papapetrou, Professor, Stockholm University
Adjunct Professor, Aalto University
Disclaimer: some of the images in slides 62-69 have been taken from UCR and Prof. Eamonn Keogh.
Outline
• Sequences
▫ Examples and analysis tasks
• Searching large sequences
▫ EBSM: Embedding-based time series subsequence matching
▫ DRESS: Dimensionality reduction for event subsequence matching
• Learning from time series
▫ Shapelets and shapelet trees
▫ Random shapelet forests
• Current challenges
Sequences
• Distinct events: e.g., text, DNA (TCTAGGGCA)
• Variables over time: e.g., time series
• Interval-based events: e.g., sign language
[Figure: an example time series plotted along a time axis and a value axis]
Analysis Tasks
Similarity Search
Classification
Clustering
Outlier Detection
Frequent Pattern Mining
Similarity search
• Time series matching / stream monitoring: ECG of a patient → ischemia
• DNA sequence alignment: human genome, TCTAGGGCA → possibility of cancer
• Query-by-humming: music piece → sequence of notes
Sequence classification
• Time series classification: ECG of a patient → ischemia
• Gene classification: TCTAGGGCA → possibility of cancer
Two fundamental questions
• How to define an appropriate distance measure for the task at hand?
• How to speed up the search under that distance measure?
Time Series Similarity
• Given two time series
  X = (x1, x2, …, xn)   Y = (y1, y2, …, yn)
• Define and compute D(X, Y)
• Or better…
[Figure: a query X matched against a database via D(X, Y), 1-NN]
Time Series Similarity
• Given a time series database and a query X
• Find the best match of X in the database
• Why is that useful?

Examples
• Find companies with similar stock prices over a time interval
• Find products with similar sell cycles
• Find songs with similar music patterns
• Cluster users with similar credit card utilization
• Find similar subsequences in DNA sequences
• Find scenes in video streams
Elastic distance measures (images taken from Eamonn Keogh)
• Euclidean
▫ rigid
▫ D(X, Y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )
• Dynamic Time Warping (DTW)
▫ allows local scaling
• Longest Common SubSequence (LCSS)
▫ allows local scaling
▫ ignores outliers
• Warping path W:
▫ set of grid cells in the time warping matrix
• DTW finds an optimal warping path W:
▫ the path with the smallest matching score
Optimum warping path W (the best alignment): properties of a DTW legal path
I. Boundary conditions: W1 = (1, 1) and WK = (n, m)
II. Continuity: given Wk = (a, b), then Wk-1 = (c, d), where a − c ≤ 1 and b − d ≤ 1
III. Monotonicity: given Wk = (a, b), then Wk-1 = (c, d), where a − c ≥ 0 and b − d ≥ 0
Properties of DTW
[Figure: DTW alignment between time series X and Y]
Global Constraints
• Sakoe-Chiba Band and Itakura Parallelogram
• Slightly speed up the calculations and prevent pathological warpings
• A global constraint limits the indices of the warping path:
  wk = (i, j)k such that j − r ≤ i ≤ j + r
  where r is a term defining the allowed range of warping for a given point in a sequence
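The constrained DTW recurrence above can be sketched as follows; this is a minimal illustrative implementation (the function name and signature are not from the slides), with the Sakoe-Chiba band applied as a simple |i − j| ≤ r restriction on the warping matrix.

```python
import math

def dtw(x, y, band=None):
    """DTW matching score between sequences x and y.

    band: Sakoe-Chiba radius r; warping-path cells (i, j) are restricted
    to |i - j| <= r. band=None means no global constraint.
    """
    n, m = len(x), len(y)
    # cost[i][j] = best cumulative cost aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # boundary condition: path starts at (1, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell lies outside the global constraint
            d = (x[i - 1] - y[j - 1]) ** 2
            # continuity + monotonicity: the only legal predecessors are
            # (i-1, j), (i, j-1), and (i-1, j-1)
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]  # boundary condition: path ends at (n, m)
```

With band=0 only the diagonal is allowed, which degenerates to the rigid Euclidean alignment.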
Speeding up search under DTW
EBSM++: subsequence and full sequence matching for time series, ACM TODS 2011
EBESM: full matching of temporal interval sequences, DAMI 2017
…using embeddings
RBSM: alignment of event sequences, VLDB 2009
EBSM: subsequence matching for time series, SIGMOD 2008
Strategy: identify candidate matches
• Database X is organized in an indexing structure
• Given a query Q, the index returns candidates
• candidate: possible match
• Main principle: use the costly distance computation only to evaluate the candidates
Vector embedding
• Database X1, X2, …, X15 is mapped to a vector set
• The query Q1, Q2, …, Q5 is mapped to a query vector
• Embedding should be such that:
▫ the query vector is similar to the vector of the candidate match
• Using vectors we identify candidates
▫ much faster than brute-force search
Using reference sequences
• Define a set of reference sequences
• Apply your favorite distance measure (e.g., DTW) to compute the cost of the best match of a reference sequence R and a database sequence X
• Define FR(X) to be that cost
• FR is a 1D embedding
• Each FR(X) → single real number
• Apply the same matching computation against the query:
▫ matching cost of R and Q
• Define FR(Q) to be that cost
Intuition about this embedding
• Suppose Q appears exactly in X
▫ Then the warping paths are the same, so FR(Q) = FR(X)
• Suppose Q appears inexactly in X
▫ Then we expect FR(Q) to be similar to FR(X)
▫ Why? Little tweaks should affect FR(X) little
▫ No proof, but intuitive, and lots of empirical evidence
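The 1D embedding can be sketched in a few lines. Note an assumption: the slides compute the cost of the best subsequence match of R in X, whereas this sketch uses full-sequence DTW as the matching cost for brevity; the function names are illustrative.

```python
import math

def dtw_cost(a, b):
    # full-sequence DTW, used here as a stand-in for the matching cost
    n, m = len(a), len(b)
    c = [[math.inf] * (m + 1) for _ in range(n + 1)]
    c[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c[i][j] = (a[i - 1] - b[j - 1]) ** 2 + min(
                c[i - 1][j], c[i][j - 1], c[i - 1][j - 1])
    return c[n][m]

def embed(R, X):
    """F_R(X): map the whole sequence X to a single real number."""
    return dtw_cost(R, X)

# If Q appears exactly in X (modeled here as Q == X), F_R(Q) == F_R(X)
R = [0.0, 1.0, 0.0]
X = [0.0, 1.0, 0.5, 0.0]
Q = list(X)
```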
Multi-dimensional embedding
• one reference sequence → 1D embedding
• two reference sequences → 2D embedding (R1 and R2 are matched against both the database sequence X and the query Q)
• d reference sequences → d-dimensional embedding F
• Basic principle:
▫ if position j of X is the best match of Q, then F(Q) should (for most Q) be more similar to F(X, j) than to most F(X, t)
Filter-and-refine retrieval
Offline step:
• Compute FR(X) for each sequence X in the database
Online steps, given a query Q:
• Embedding step:
▫ compute FR(Q)
• Filter step:
▫ compare FR(Q) to all FR(X)
▫ select the p best matches → p candidate endpoints
• Refine step:
▫ use dynamic programming to evaluate each candidate
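The filter and refine steps can be sketched as below: rank everything cheaply in the embedded vector space, then run the expensive exact distance only on the top-p candidates. All names here are illustrative, and the exact distance is passed in as a callable rather than fixed to DTW.

```python
def filter_and_refine(query_vec, db_vecs, exact_dist, query, db, p=2):
    """Return the index of the best database match for `query`.

    query_vec / db_vecs: embedded representations F(Q) and F(X).
    exact_dist: the costly distance (e.g., DTW), used only on candidates.
    """
    # Filter: compare F(Q) to all F(X) with a cheap vector distance
    cheap = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    ranked = sorted(range(len(db)), key=lambda i: cheap(query_vec, db_vecs[i]))
    candidates = ranked[:p]  # p best matches -> p candidates
    # Refine: run the costly distance only on the candidates
    return min(candidates, key=lambda i: exact_dist(query, db[i]))
```

Accuracy depends on the correct match surviving the filter step, which is exactly the trade-off on p discussed next.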
Filter-and-refine performance
• Accuracy:
▫ the correct match must be among the p candidates, for most queries
• Trade-off:
▫ larger p → higher accuracy, lower efficiency
Embedding Optimization
• How many reference sequences to choose?
• Which ones to choose?
• Solution:
▫ extract the set of reference sequences with the highest distance variance in the dataset
▫ choose those top K that can guarantee the best trade-off between accuracy and p
▫ cross-validation (one-time pre-processing step)
▫ sample of queries (input to the algorithm)
▫ large pool of reference sequences
Performance
• 20 datasets: UCR Time Series Data Mining Repository
• Queries: 5,397
• Query lengths between 60 and 637
• 50% were used for embedding optimization
• EBSM can achieve over an order of magnitude speedup compared to brute-force search for ALL datasets (2% to 5% of the database is examined), with accuracy of 99%
• And twice as fast as LB_PAA+LB_Keogh / UCR_Suite
Application: Query-by-humming
• Competitive performance on real queries
• Demo available: implementation in Matlab
Similarity search: event sequences
• Similarity search in large sequence databases → a frequently occurring problem
• Plethora of string matching methods
• Still a high demand for new robust and scalable similarity search methods that can handle
▫ large query lengths
▫ large similarity ranges
• Our focus: Whole sequence matching
Motivation: large query lengths
TCTAGGGCA
• Expressed Sequence Tag (EST) databases:
▫ common for representing large genomes
▫ portions of genes expressed as mRNA
▫ sequence lengths of at least 500-800
Motivation: large query lengths
• Large-scale searches need to be performed against other genomic databases to determine locations of genes
• Searches can also target whole chromosomes, where the goal is to find chromosome similarities across different organisms
▫ High evolutionary divergence → identifying distantly related gene or protein domains by sequence search techniques is not always trivial
▫ Mutation process: intermediate sequences may possess features of many proteins and facilitate detection of remotely related proteins
▫ Large range queries, aka remote homology search, in bioinformatics can be highly beneficial for searching proteins and genomes
Problem setting
• Given
▫ a database S and a query Q
▫ a similarity range r = δ|Q|
• Task:
▫ retrieve all database strings X such that ED(Q, X) ≤ r
Existing work
• Global alignment [ED, Q-grams, RBE]
▫ full-sequence matching approach for global optimization
▫ identify the minimum number of changes that should be performed to one sequence so as to convert it to the other
▫ forcing the alignment to span the whole sequence
• Local alignment [SW, BLAST, RBSA]
▫ identify local regions within the compared sequences that are highly similar to each other, while they could be globally divergent
▫ may allow for gaps in the alignment
• Short-read sequencing [SOAP, MAQ, WHAM]
The Edit distance
• Measures how dissimilar two strings are
• Computes the minimum cost of edit operations (insertion, deletion, substitution) to convert one string to the other
• Auxiliary matrix α
▫ Cins: insertion cost
▫ Cdel: deletion cost
▫ Csub: substitution cost
• Example: A = ATC and B = ACTG
  A = A – T C
  B = A C T G
  ED(A, B) = 2
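The dynamic program over the auxiliary matrix α can be sketched as follows, with unit costs as defaults (the slides leave the costs unspecified, so the parameterization is an assumption).

```python
def edit_distance(a, b, c_ins=1, c_del=1, c_sub=1):
    """Minimum cost of insertions, deletions, and substitutions
    converting string a into string b."""
    n, m = len(a), len(b)
    # alpha[i][j] = cost of converting a[:i] into b[:j]
    alpha = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        alpha[i][0] = i * c_del   # delete everything
    for j in range(1, m + 1):
        alpha[0][j] = j * c_ins   # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else c_sub
            alpha[i][j] = min(alpha[i - 1][j] + c_del,
                              alpha[i][j - 1] + c_ins,
                              alpha[i - 1][j - 1] + sub)
    return alpha[n][m]

assert edit_distance("ATC", "ACTG") == 2  # the slide's example
```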
Our contributions
• DRESS: a novel filter-and-refine approximate method for speeding up similarity search under edit distance:
▫ no index training required
▫ can handle large query lengths and similarity ranges
• Efficient string representation in a new space, based on a set of query-specific codewords → distance computation between strings is significantly faster than in the original space
• Extensive experimental evaluation against the state-of-the-art on three large protein and two DNA sequence datasets
▫ DRESS outperforms competitors with very competitive accuracy and retrieval runtime
The DRESS framework
• A three-step framework
The mapping step: alphabet reduction
• This is an online query-sensitive step
• For each query Q, identify a set of codewords E = {E1, …, Et}
▫ E contains the t most frequent substrings of length l
▫ IMPORTANT: no pair in E has an overlapping prefix-suffix
▫ t and l are identified after appropriate training
The mapping step: example
• Let X = (babfcde) and E = {ab, cd}
• Sequences are traversed from left to right!
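The slide's example can be reproduced with a minimal sketch of the E-mapping (the function name is illustrative): scan X left to right and emit a codeword at each non-overlapping match; all codewords share the same length l.

```python
def e_mapping(X, E, l):
    """Map string X to its sequence of codewords from E (all of length l),
    scanning left to right with non-overlapping matches."""
    out, i = [], 0
    while i <= len(X) - l:
        w = X[i:i + l]
        if w in E:
            out.append(w)
            i += l   # skip past the matched codeword
        else:
            i += 1   # advance one symbol
    return out
```

For X = "babfcde" and E = {ab, cd}, the mapping yields ["ab", "cd"], matching the slide's example.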
The mapping step: faster than linear
• The process described above is linear in the database size
• Inverted index: precomputed for the database, where we store for each possible codeword of length l:
▫ all the sequence indices and positions where it occurs
▫ IND[e][i]: list of all positions where codeword e occurs in the ith database sequence
• NOTE: the index is built in linear time
The mapping step: in detail
• Let Xi be a database sequence and xi denote its E-mapping

The mapping step: complexity
• Complexity of the mapping step for a single DB sequence:
• Good news:
▫ typically the length of xi is much smaller than the length of Xi
▫ hence, faster than scanning Xi
The filter-and-refine steps
• The query is mapped to the E-space
• Each DB sequence is mapped to the E-space
• Brute-force search under Edit Distance in the E-space
• Identify a set of candidates that are within a factor of δ' of the query length
• Refine the candidates in the original space under Edit Distance
The matching range in the E-space
• The matching range δ' in the E-space depends on the original δ and is set to δ' = f·δ
• Clearly, δ' should be set higher than δ to account for the cases where the distance in the E-space is a higher percentage of the mapped query length
• The latter can easily happen because the mapped strings are drastically shorter than the original strings
• Hence: while we expect similar strings to have similar mappings, the loss incurred by the mapping can lead to higher distances as a percentage of the query length
Experimental setup
• Two real datasets
• Three methods
• Variable query sizes
• Several parameters

Experimental setup: Datasets
• UniProt [protein sequences]: http://www.ebi.ac.uk/uniprot/
▫ 530,264 strings
▫ 25-letter alphabet
• Human Chromosome 1 [DNA sequences]: ftp://ftp.ncbi.nlm.nih.go
▫ 249,250,621 bases available
▫ 4-letter alphabet
Experimental setup: Methods
• DRESS: our method
• RBE: an embedding-based method for sequence search
• Q-grams: uses inverted indices to find candidate matches
• Note that DRESS is an approximate method, while RBE and Q-grams are exact
Experimental results: retrieval cost
• Retrieval cost on UniProt [401, 800]
▫ DRESS achieves up to 29 times lower retrieval cost than RBE and up to 75 times lower than Q-grams
• Retrieval cost on UniProt [801, max]
▫ DRESS achieves up to 75 times lower retrieval cost than RBE and up to 461 times lower than Q-grams

Experimental results: runtime
• Protein datasets
• DNA datasets
Conclusions
• DRESS: a novel method for whole sequence matching
• Supports large query lengths and matching ranges
• Experiments demonstrate that for higher values of the search range, DRESS produces significantly lower costs and runtimes than the competitors
• Loss of guarantee of 100% recall
• This price can be an acceptable trade-off in several domains given the significant runtime savings
Future work
• Further improve the performance of DRESS by, e.g., implementing multiple filter steps
• Study the performance of DRESS on sequences from other domains, such as text
• Adapt DRESS for local alignment and short-read sequencing
• Explore alternative ways of producing codewords, e.g., learn synthetic words
Time Series classification
• Applications: finance, medicine, music
• 1-Nearest Neighbor
▫ Pros: accurate, robust, simple
▫ Cons: time and space complexity (lazy learning); results are not interpretable

Solution
• Shapelets:
▫ time series subsequence
▫ representative of a class
▫ discriminative from other classes
• Idea:
▫ use shapelets as "attributes" or "features" for splitting a node in the decision tree
[Figure: a time series with a highlighted shapelet]
Random Shapelet Tree
• Two steps:
▫ Learn a set of discriminative shapelets (typically a tree-like structure)
▫ Predict a class label for a previously unseen data series using the minimum distance to shapelets (following a path in the tree-structured model)
The Shapelet Classifier
[Figure: Shapelet Dictionary (shapelets I-VI) and the corresponding Shapelet Tree]

Method            Accuracy  Time
Shapelet          0.720     0.86
Nearest Neighbor  0.543     0.65
The utility of a Shapelet
• Collect all candidate shapelets in a pool
• For each candidate: arrange the time series objects based on the distance from the candidate shapelet
• Find the optimal split point (maximal information gain)
• Pick the candidate achieving the best utility as the shapelet
[Figure: time series objects in the database ordered by distance to a candidate, with the split point]
Extracting all Shapelets
• Number of candidates: Σ_{l=MINLEN}^{MAXLEN} Σ_{Ti∈D} (|Ti| − l + 1)
• DB: 200, Length: 275, Candidates: 7,480,200, Computation: 3 days
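The candidate count above can be checked numerically. The slide does not state MINLEN and MAXLEN, so the values 3 and 275 used below are an assumption, chosen because they reproduce the quoted total of 7,480,200 for 200 series of length 275.

```python
def num_candidates(lengths, minlen, maxlen):
    """Total number of shapelet candidates:
    sum over l in [minlen, maxlen] of sum over series Ti of (|Ti| - l + 1)."""
    return sum(ti - l + 1
               for l in range(minlen, maxlen + 1)
               for ti in lengths
               if ti >= l)  # a series shorter than l contributes nothing

# 200 series of length 275; MINLEN = 3 and MAXLEN = 275 are assumed here
print(num_candidates([275] * 200, 3, 275))
```

The combinatorial blow-up of this count is exactly why candidate generation is the main bottleneck discussed next.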
Speeding up candidate generation
• Main bottleneck: candidate generation
• Reduce the time in two ways
▫ Distance Early Abandon
    reduce the Euclidean distance computation time between two time series
▫ Admissible Entropy Pruning
    after calculating some training samples, use an upper bound of information gain to check against the best candidate shapelet
Alternative approaches
• Transformations + k-NN
▫ Improved subsequence searching and matching, using online normalization, early abandoning, and re-ordering
▫ Dimensionality reduction using symbolic aggregate approximation
• Synthetic shapelet generation
▫ Initialize using k-means clustering
▫ Learn synthetic shapelets by following a gradient ascent approach
Alternative approaches
• Feature-based
▫ Select the top k most informative shapelets
▫ Generate a new dataset D' of k columns and |D| rows, where each element (i, j) is the (minimum) distance from shapelet sj to data series di
▫ Learn any suitable classifier (e.g., SVM, Random Forest) using the transformed dataset
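The feature-based transform can be sketched as below (names are illustrative): each column j of D' holds the minimum sliding-window squared Euclidean distance from shapelet sj to each series di.

```python
def min_dist(series, shapelet):
    """Minimum squared Euclidean distance between a shapelet and any
    window of the series of the same length."""
    L = len(shapelet)
    return min(sum((series[p + q] - shapelet[q]) ** 2 for q in range(L))
               for p in range(len(series) - L + 1))

def shapelet_transform(D, shapelets):
    """Build D': |D| rows, k columns; entry (i, j) = min_dist(d_i, s_j)."""
    return [[min_dist(d, s) for s in shapelets] for d in D]
```

The resulting D' is an ordinary numeric feature matrix, so any suitable classifier (e.g., an SVM or a random forest) can be trained on it.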
Random Shapelet Forest
• A tree-structured ensemble based on the classic Random Forest and the Shapelet Tree classifier
• Learn:
▫ Build T random shapelet trees
▫ Each tree is built from a random (with replacement) sample of time series in the database D
▫ Inspect r random shapelets at each node
• Predict:
▫ Let each tree t1, …, tT vote for a class label
Experimental results
• Parameters:
▫ number of shapelets (r) = 100
▫ number of trees (T) = 500
▫ min shapelet length (l) = 2
▫ max shapelet length (u) = m (max time series length)
• Friedman's test: the observed differences in accuracy of RSF against the rest deviate significantly, i.e., p-value = 10^-11
Multi-variate time series
Future work
• Feature tweaking for shapelets:
▫ What is the minimum amount of changes we need to apply to a shapelet feature so that the classification result flips?
• Interval sequence features:
▫ Apply random shapelet forests for complex interval sequences
Thank you for your attention!
References: Subsequence matching
• "Embedding-based Subsequence Matching: a query-by-humming application". The VLDB Journal (VLDBJ), 24(4): 519-536, 2015
• "Hum-a-song: A Subsequence Matching with Gaps-Range-Tolerances Query-By-Humming System". Proceedings of the VLDB Endowment (PVLDB), Vol. 4, Issue 11, pages 1930-1933, 2012
• "Embedding-based subsequence matching in time-series databases". ACM Transactions on Database Systems (TODS), Vol. 36, Issue 3, Article 17, 2011
References: Time series classification
• "Generalized Random Shapelet Forests". Data Mining and Knowledge Discovery Journal (DAMI), 2016
• "Early-classification of Time Series". Proceedings of the International Conference on Discovery Science (DS), Best Paper Award, 2016
• "Query-sensitive Distance Measure Selection for Time Series Nearest Neighbor Classification". Intelligent Data Analysis Journal (IDAJ), 20(1): 5-27, 2016
References: Sequences of temporal intervals
• "Indexing Sequences of Temporal Intervals". Data Mining and Knowledge Discovery Journal (DAMI), accepted 2016
• "Finding the Longest Common Sub-Pattern in Sequences of Temporal Intervals". Data Mining and Knowledge Discovery Journal (DAMI), 29(5): 1178-1210, 2015
• "Mining Frequent Arrangements of Temporal Intervals". Knowledge and Information Systems (KAIS), Vol. 21, Issue 2, pages 133-171, 2010