Searching and mining sequential data - KTH
Searching and mining sequential data
Panagiotis Papapetrou, Professor, Stockholm University
Adjunct Professor, Aalto University
Disclaimer: some of the images in slides 62-69 have been taken from UCR and Prof. Eamonn Keogh.
Outline
• Sequences
▫ Examples and analysis tasks
• Searching large sequences
▫ EBSM: Embedding-based time series subsequence matching
▫ DRESS: Dimensionality reduction for event subsequence matching
• Learning from time series
▫ Shapelets and shapelet trees
▫ Random shapelet forests
• Current challenges
Sequences
• Distinct events: e.g., text, DNA (TCTAGGGCA)
• Variables over time: e.g., time series
• Interval-based events: e.g., sign language
[Figure: an example time series plotted along a time axis and a value axis]
Analysis Tasks
Similarity Search
Classification
Clustering
Outlier Detection
Frequent Pattern Mining
Similarity search
• Time series matching / stream monitoring: ECG of a patient → ischemia
• DNA sequence alignment: human genome, TCTAGGGCA → possibility of cancer
• Query-by-humming: music piece → sequence of notes
Sequence classification
• Time series classification: ECG of a patient → ischemia
• Gene classification: TCTAGGGCA → possibility of cancer
Two fundamental questions
• How to define an appropriate distance measure for the task at hand?
• How to speed up the search under that distance measure?
Time Series Similarity
• Given two time series
  X = (x1, x2, …, xn)   Y = (y1, y2, …, yn)
• Define and compute D(X, Y)
• Or better…
[Figure: a query X matched against a database via D(X, Y), 1-NN]
Time Series Similarity
• Given a time series database and a query X
• Find the best match of X in the database
• Why is that useful?

Examples
• Find companies with similar stock prices over a time interval
• Find products with similar sell cycles
• Find songs with similar music patterns
• Cluster users with similar credit card utilization
• Find similar subsequences in DNA sequences
• Find scenes in video streams
Elastic distance measures (images taken from Eamonn Keogh)
• Euclidean
▫ rigid
▫ D(X, Y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )
• Dynamic Time Warping (DTW)
▫ allows local scaling
• Longest Common SubSequence (LCSS)
▫ allows local scaling
▫ ignores outliers
• Warping path W:
▫ set of grid cells in the time warping matrix
• DTW finds an optimal warping path W:
▫ the path with the smallest matching score
Optimum warping path W (the best alignment): properties of a DTW legal path
I. Boundary conditions: W1 = (1, 1) and WK = (n, m)
II. Continuity: given Wk = (a, b), then Wk-1 = (c, d), where a − c ≤ 1 and b − d ≤ 1
III. Monotonicity: given Wk = (a, b), then Wk-1 = (c, d), where a − c ≥ 0 and b − d ≥ 0
Properties of DTW
[Figure: DTW alignment between time series X and Y]
Global Constraints
• Sakoe-Chiba Band and Itakura Parallelogram
• Slightly speed up the calculations and prevent pathological warpings
• A global constraint limits the indices of the warping path:
  wk = (i, j)k such that j − r ≤ i ≤ j + r
  where r is a term defining the allowed range of warping for a given point in a sequence
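The constrained DTW recurrence above can be sketched as follows; this is a minimal illustrative implementation (the function name and signature are not from the slides), with the Sakoe-Chiba band applied as a simple |i − j| ≤ r restriction on the warping matrix.

```python
import math

def dtw(x, y, band=None):
    """DTW matching score between sequences x and y.

    band: Sakoe-Chiba radius r; warping-path cells (i, j) are restricted
    to |i - j| <= r. band=None means no global constraint.
    """
    n, m = len(x), len(y)
    # cost[i][j] = best cumulative cost aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # boundary condition: path starts at (1, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell lies outside the global constraint
            d = (x[i - 1] - y[j - 1]) ** 2
            # continuity + monotonicity: the only legal predecessors are
            # (i-1, j), (i, j-1), and (i-1, j-1)
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]  # boundary condition: path ends at (n, m)
```

With band=0 only the diagonal is allowed, which degenerates to the rigid Euclidean alignment.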
Speeding up search under DTW
EBSM++: subsequence and full sequence matching for time series, ACM TODS 2011
EBESM: full matching of temporal interval sequences, DAMI 2017
…using embeddings
RBSM: alignment of event sequences, VLDB 2009
EBSM: subsequence matching for time series, SIGMOD 2008
Strategy: identify candidate matches
• Database X is organized in an indexing structure
• Given a query Q, the index returns candidates
• candidate: possible match
• Main principle: use the costly distance computation only to evaluate the candidates
Vector embedding
• Database X1, X2, …, X15 is mapped to a vector set
• The query Q1, Q2, …, Q5 is mapped to a query vector
• Embedding should be such that:
▫ the query vector is similar to the vector of the candidate match
• Using vectors we identify candidates
▫ much faster than brute-force search
Using reference sequences
• Define a set of reference sequences
• Apply your favorite distance measure (e.g., DTW) to compute the cost of the best match of a reference sequence R and a database sequence X
• Define FR(X) to be that cost
• FR is a 1D embedding
• Each FR(X) → single real number
• Apply the same matching computation against the query:
▫ matching cost of R and Q
• Define FR(Q) to be that cost
Intuition about this embedding
• Suppose Q appears exactly in X
▫ Then the warping paths are the same, so FR(Q) = FR(X)
• Suppose Q appears inexactly in X
▫ Then we expect FR(Q) to be similar to FR(X)
▫ Why? Little tweaks should affect FR(X) little
▫ No proof, but intuitive, and lots of empirical evidence
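The 1D embedding can be sketched in a few lines. Note an assumption: the slides compute the cost of the best subsequence match of R in X, whereas this sketch uses full-sequence DTW as the matching cost for brevity; the function names are illustrative.

```python
import math

def dtw_cost(a, b):
    # full-sequence DTW, used here as a stand-in for the matching cost
    n, m = len(a), len(b)
    c = [[math.inf] * (m + 1) for _ in range(n + 1)]
    c[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c[i][j] = (a[i - 1] - b[j - 1]) ** 2 + min(
                c[i - 1][j], c[i][j - 1], c[i - 1][j - 1])
    return c[n][m]

def embed(R, X):
    """F_R(X): map the whole sequence X to a single real number."""
    return dtw_cost(R, X)

# If Q appears exactly in X (modeled here as Q == X), F_R(Q) == F_R(X)
R = [0.0, 1.0, 0.0]
X = [0.0, 1.0, 0.5, 0.0]
Q = list(X)
```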
Multi-dimensional embedding
• one reference sequence → 1D embedding
• two reference sequences → 2D embedding (R1 and R2 are matched against both the database sequence X and the query Q)
• d reference sequences → d-dimensional embedding F
• Basic principle:
▫ if position j of X is the best match of Q, then F(Q) should (for most Q) be more similar to F(X, j) than to most F(X, t)
Filter-and-refine retrieval
Offline step:
• Compute FR(X) for each sequence X in the database
Online steps, given a query Q:
• Embedding step:
▫ compute FR(Q)
• Filter step:
▫ compare FR(Q) to all FR(X)
▫ select the p best matches → p candidate endpoints
• Refine step:
▫ use dynamic programming to evaluate each candidate
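The filter and refine steps can be sketched as below: rank everything cheaply in the embedded vector space, then run the expensive exact distance only on the top-p candidates. All names here are illustrative, and the exact distance is passed in as a callable rather than fixed to DTW.

```python
def filter_and_refine(query_vec, db_vecs, exact_dist, query, db, p=2):
    """Return the index of the best database match for `query`.

    query_vec / db_vecs: embedded representations F(Q) and F(X).
    exact_dist: the costly distance (e.g., DTW), used only on candidates.
    """
    # Filter: compare F(Q) to all F(X) with a cheap vector distance
    cheap = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    ranked = sorted(range(len(db)), key=lambda i: cheap(query_vec, db_vecs[i]))
    candidates = ranked[:p]  # p best matches -> p candidates
    # Refine: run the costly distance only on the candidates
    return min(candidates, key=lambda i: exact_dist(query, db[i]))
```

Accuracy depends on the correct match surviving the filter step, which is exactly the trade-off on p discussed next.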
Filter-and-refine performance
• Accuracy:
▫ the correct match must be among the p candidates, for most queries
• Trade-off:
▫ larger p → higher accuracy, lower efficiency
Embedding Optimization
• How many reference sequences to choose?
• Which ones to choose?
• Solution:
▫ extract the set of reference sequences with the highest distance variance in the dataset
▫ choose those top K that can guarantee the best trade-off between accuracy and p
▫ cross-validation (one-time pre-processing step)
▫ sample of queries (input to the algorithm)
▫ large pool of reference sequences
Performance
• 20 datasets: UCR Time Series Data Mining Repository
• Queries: 5,397
• Query lengths between 60 and 637
• 50% were used for embedding optimization
• EBSM can achieve over an order of magnitude speedup compared to brute-force search for ALL datasets (2% to 5% of the database is examined), with accuracy of 99%
• And twice as fast as LB_PAA+LB_Keogh / UCR_Suite
Application: Query-by-humming
• Competitive performance on real queries
• Demo available: implementation in Matlab
Similarity search: event sequences
• Similarity search in large sequence databases → a frequently occurring problem
• Plethora of string matching methods
• Still a high demand for new robust and scalable similarity search methods that can handle
▫ large query lengths
▫ large similarity ranges
• Our focus: Whole sequence matching
Motivation: large query lengths
TCTAGGGCA
• Expressed Sequence Tag (EST) databases:
▫ common for representing large genomes
▫ portions of genes expressed as mRNA
▫ sequence lengths of at least 500-800
Motivation: large query lengths
• Large-scale searches need to be performed against other genomic databases to determine locations of genes
• Searches can also target whole chromosomes, where the goal is to find chromosome similarities across different organisms
▫ High evolutionary divergence → identifying distantly related gene or protein domains by sequence search techniques is not always trivial
▫ Mutation process: intermediate sequences may possess features of many proteins and facilitate detection of remotely related proteins
▫ Large range queries, aka remote homology search, in bioinformatics can be highly beneficial for searching proteins and genomes
Problem setting
• Given
▫ a database S and a query Q
▫ a similarity range r = δ|Q|
• Task:
▫ retrieve all database strings X such that ED(Q, X) ≤ r
Existing work
• Global alignment [ED, Q-grams, RBE]
▫ full-sequence matching approach for global optimization
▫ identify the minimum number of changes that should be performed to one sequence so as to convert it to the other
▫ forcing the alignment to span the whole sequence
• Local alignment [SW, BLAST, RBSA]
▫ identify local regions within the compared sequences that are highly similar to each other, while they could be globally divergent
▫ may allow for gaps in the alignment
• Short-read sequencing [SOAP, MAQ, WHAM]
The Edit distance
• Measures how dissimilar two strings are
• Computes the minimum cost of edit operations (insertion, deletion, substitution) to convert one string to the other
• Auxiliary matrix α
▫ Cins: insertion cost
▫ Cdel: deletion cost
▫ Csub: substitution cost
• Example: A = ATC and B = ACTG
  A = A – T C
  B = A C T G
  ED(A, B) = 2
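The dynamic program over the auxiliary matrix α can be sketched as follows, with unit costs as defaults (the slides leave the costs unspecified, so the parameterization is an assumption).

```python
def edit_distance(a, b, c_ins=1, c_del=1, c_sub=1):
    """Minimum cost of insertions, deletions, and substitutions
    converting string a into string b."""
    n, m = len(a), len(b)
    # alpha[i][j] = cost of converting a[:i] into b[:j]
    alpha = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        alpha[i][0] = i * c_del   # delete everything
    for j in range(1, m + 1):
        alpha[0][j] = j * c_ins   # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else c_sub
            alpha[i][j] = min(alpha[i - 1][j] + c_del,
                              alpha[i][j - 1] + c_ins,
                              alpha[i - 1][j - 1] + sub)
    return alpha[n][m]

assert edit_distance("ATC", "ACTG") == 2  # the slide's example
```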
Our contributions
• DRESS: a novel filter-and-refine approximate method for speeding up similarity search under edit distance:
▫ no index training required
▫ can handle large query lengths and similarity ranges
• Efficient string representation in a new space, based on a set of query-specific codewords → distance computation between strings is significantly faster than in the original space
• Extensive experimental evaluation against the state-of-the-art on three large protein and two DNA sequence datasets
▫ DRESS outperforms competitors with very competitive accuracy and retrieval runtime
The DRESS framework
• A three-step framework
The mapping step: alphabet reduction
• This is an online query-sensitive step
• For each query Q, identify a set of codewords E = {E1, …, Et}
▫ E contains the t most frequent substrings of length l
▫ IMPORTANT: no pair in E has an overlapping prefix-suffix
▫ t and l are identified after appropriate training
The mapping step: example
• Let X = (babfcde) and E = {ab, cd}
• Sequences are traversed from left to right!
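The slide's example can be reproduced with a minimal sketch of the E-mapping (the function name is illustrative): scan X left to right and emit a codeword at each non-overlapping match; all codewords share the same length l.

```python
def e_mapping(X, E, l):
    """Map string X to its sequence of codewords from E (all of length l),
    scanning left to right with non-overlapping matches."""
    out, i = [], 0
    while i <= len(X) - l:
        w = X[i:i + l]
        if w in E:
            out.append(w)
            i += l   # skip past the matched codeword
        else:
            i += 1   # advance one symbol
    return out
```

For X = "babfcde" and E = {ab, cd}, the mapping yields ["ab", "cd"], matching the slide's example.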
The mapping step: faster than linear
• The process described above is linear in the database size
• Inverted index: precomputed for the database, where we store for each possible codeword of length l:
▫ all the sequence indices and positions where it occurs
▫ IND[e][i]: list of all positions where codeword e occurs in the ith database sequence
• NOTE: the index is built in linear time
The mapping step: in detail
• Let Xi be a database sequence and xi denote its E-mapping

The mapping step: complexity
• Complexity of the mapping step for a single DB sequence:
• Good news:
▫ typically the length of xi is much smaller than the length of Xi
▫ hence, faster than scanning Xi
The filter-and-refine steps
• The query is mapped to the E-space
• Each DB sequence is mapped to the E-space
• Brute-force search under Edit Distance in the E-space
• Identify a set of candidates that are within a factor of δ' of the query length
• Refine the candidates in the original space under Edit Distance
The matching range in the E-space
• The matching range δ' in the E-space depends on the original δ and is set to δ' = f·δ
• Clearly, δ' should be set higher than δ to account for the cases where the distance in the E-space is a higher percentage of the mapped query length
• The latter can easily happen because the mapped strings are drastically shorter than the original strings
• Hence: while we expect similar strings to have similar mappings, the loss incurred by the mapping can lead to higher distances as a percentage of the query length
Experimental setup
• Two real datasets
• Three methods
• Variable query sizes
• Several parameters

Experimental setup: Datasets
• UniProt [protein sequences]: http://www.ebi.ac.uk/uniprot/
▫ 530,264 strings
▫ 25-letter alphabet
• Human Chromosome 1 [DNA sequences]: ftp://ftp.ncbi.nlm.nih.go
▫ 249,250,621 bases available
▫ 4-letter alphabet
Experimental setup: Methods
• DRESS: our method
• RBE: an embedding-based method for sequence search
• Q-grams: uses inverted indices to find candidate matches
• Note that DRESS is an approximate method, while RBE and Q-grams are exact
Experimental results: retrieval cost
• Retrieval cost on UniProt [401, 800]
▫ DRESS achieves up to 29 times lower retrieval cost than RBE and up to 75 times lower than Q-grams
• Retrieval cost on UniProt [801, max]
▫ DRESS achieves up to 75 times lower retrieval cost than RBE and up to 461 times lower than Q-grams

Experimental results: runtime
• Protein datasets
• DNA datasets
Conclusions
• DRESS: a novel method for whole sequence matching
• Supports large query lengths and matching ranges
• Experiments demonstrate that for higher values of the search range, DRESS produces significantly lower costs and runtimes than the competitors
• Loss of guarantee of 100% recall
• This price can be an acceptable trade-off in several domains given the significant runtime savings
Future work
• Further improve the performance of DRESS by, e.g., implementing multiple filter steps
• Study the performance of DRESS on sequences from other domains, such as text
• Adapt DRESS for local alignment and short-read sequencing
• Explore alternative ways of producing codewords, e.g., learn synthetic words
Time Series classification
• Applications: finance, medicine, music
• 1-Nearest Neighbor
▫ Pros: accurate, robust, simple
▫ Cons: time and space complexity (lazy learning); results are not interpretable

Solution
• Shapelets:
▫ time series subsequence
▫ representative of a class
▫ discriminative from other classes
• Idea:
▫ use shapelets as "attributes" or "features" for splitting a node in the decision tree
[Figure: a time series with a highlighted shapelet]
Random Shapelet Tree
• Two steps:
▫ Learn a set of discriminative shapelets (typically a tree-like structure)
▫ Predict a class label for a previously unseen data series using the minimum distance to shapelets (following a path in the tree-structured model)
The Shapelet Classifier
[Figure: Shapelet Dictionary (shapelets I-VI) and the corresponding Shapelet Tree]

Method            Accuracy  Time
Shapelet          0.720     0.86
Nearest Neighbor  0.543     0.65
The utility of a Shapelet
• Collect all candidate shapelets in a pool
• For each candidate: arrange the time series objects based on the distance from the candidate shapelet
• Find the optimal split point (maximal information gain)
• Pick the candidate achieving the best utility as the shapelet
[Figure: time series objects in the database ordered by distance to a candidate, with the split point]
Extracting all Shapelets
• Number of candidates: Σ_{l=MINLEN}^{MAXLEN} Σ_{Ti∈D} (|Ti| − l + 1)
• DB: 200, Length: 275, Candidates: 7,480,200, Computation: 3 days
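The candidate count above can be checked numerically. The slide does not state MINLEN and MAXLEN, so the values 3 and 275 used below are an assumption, chosen because they reproduce the quoted total of 7,480,200 for 200 series of length 275.

```python
def num_candidates(lengths, minlen, maxlen):
    """Total number of shapelet candidates:
    sum over l in [minlen, maxlen] of sum over series Ti of (|Ti| - l + 1)."""
    return sum(ti - l + 1
               for l in range(minlen, maxlen + 1)
               for ti in lengths
               if ti >= l)  # a series shorter than l contributes nothing

# 200 series of length 275; MINLEN = 3 and MAXLEN = 275 are assumed here
print(num_candidates([275] * 200, 3, 275))
```

The combinatorial blow-up of this count is exactly why candidate generation is the main bottleneck discussed next.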
Speeding up candidate generation
• Main bottleneck: candidate generation
• Reduce the time in two ways
▫ Distance Early Abandon
    reduce the Euclidean distance computation time between two time series
▫ Admissible Entropy Pruning
    after calculating some training samples, use an upper bound of information gain to check against the best candidate shapelet
Alternative approaches
• Transformations + k-NN
▫ Improved subsequence searching and matching, using online normalization, early abandoning, and re-ordering
▫ Dimensionality reduction using symbolic aggregate approximation
• Synthetic shapelet generation
▫ Initialize using k-means clustering
▫ Learn synthetic shapelets by following a gradient ascent approach
Alternative approaches
• Feature-based
▫ Select the top k most informative shapelets
▫ Generate a new dataset D' of k columns and |D| rows, where each element (i, j) is the (minimum) distance from shapelet sj to data series di
▫ Learn any suitable classifier (e.g., SVM, Random Forest) using the transformed dataset
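The feature-based transform can be sketched as below (names are illustrative): each column j of D' holds the minimum sliding-window squared Euclidean distance from shapelet sj to each series di.

```python
def min_dist(series, shapelet):
    """Minimum squared Euclidean distance between a shapelet and any
    window of the series of the same length."""
    L = len(shapelet)
    return min(sum((series[p + q] - shapelet[q]) ** 2 for q in range(L))
               for p in range(len(series) - L + 1))

def shapelet_transform(D, shapelets):
    """Build D': |D| rows, k columns; entry (i, j) = min_dist(d_i, s_j)."""
    return [[min_dist(d, s) for s in shapelets] for d in D]
```

The resulting D' is an ordinary numeric feature matrix, so any suitable classifier (e.g., an SVM or a random forest) can be trained on it.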
Random Shapelet Forest
• A tree-structured ensemble based on the classic Random Forest and the Shapelet Tree classifier
• Learn:
▫ Build T random shapelet trees
▫ Each tree is built from a random (with replacement) sample of time series in the database D
▫ Inspect r random shapelets at each node
• Predict:
▫ Let each tree t1, …, tT vote for a class label
Experimental results
• Parameters:
▫ number of shapelets (r) = 100
▫ number of trees (T) = 500
▫ min shapelet length (l) = 2
▫ max shapelet length (u) = m (max time series length)
• Friedman's test: the observed differences in accuracy of RSF against the rest deviate significantly, i.e., p-value = 10^-11
Multi-variate time series
Future work
• Feature tweaking for shapelets:
▫ What is the minimum amount of changes we need to apply to a shapelet feature so that the classification result flips?
• Interval sequence features:
▫ Apply random shapelet forests for complex interval sequences
Thank you for your attention!
References: Subsequence matching
• "Embedding-based Subsequence Matching: a query-by-humming application". The VLDB Journal (VLDBJ), 24(4): 519-536, 2015
• "Hum-a-song: A Subsequence Matching with Gaps-Range-Tolerances Query-By-Humming System". Proceedings of the VLDB Endowment (PVLDB), Vol. 4, Issue 11, pages 1930-1933, 2012
• "Embedding-based subsequence matching in time-series databases". ACM Transactions on Database Systems (TODS), Vol. 36, Issue 3, Article 17, 2011
References: Time series classification
• "Generalized Random Shapelet Forests". Data Mining and Knowledge Discovery Journal (DAMI), 2016
• "Early-classification of Time Series". Proceedings of the International Conference on Discovery Science (DS), Best Paper Award, 2016
• "Query-sensitive Distance Measure Selection for Time Series Nearest Neighbor Classification". Intelligent Data Analysis Journal (IDAJ), 20(1): 5-27, 2016
References: Sequences of temporal intervals
• "Indexing Sequences of Temporal Intervals". Data Mining and Knowledge Discovery Journal (DAMI), accepted 2016
• "Finding the Longest Common Sub-Pattern in Sequences of Temporal Intervals". Data Mining and Knowledge Discovery Journal (DAMI), 29(5): 1178-1210, 2015
• "Mining Frequent Arrangements of Temporal Intervals". Knowledge and Information Systems (KAIS), Vol. 21, Issue 2, pages 133-171, 2010