Searching and mining sequential data - KTH

81
Searching and mining sequential data Panagiotis Papapetrou Professor, Stockholm University Adjunct Professor, Aalto University Disclaimer: some of the images in slides 62-69 have been taken from UCR and Prof. Eamonn Keogh.

Transcript of Searching and mining sequential data - KTH

Page 1: Searching and mining sequential data - KTH

Searching and mining sequential data

!

Panagiotis PapapetrouProfessor, Stockholm University

Adjunct Professor, Aalto University

Disclaimer: some of the images in slides 62-69 have been taken from UCR andProf. Eamonn Keogh.

Page 2: Searching and mining sequential data - KTH

Outline• Sequences▫ Examples and analysis tasks

• Searching large sequences▫ EBSM: Embedding-based time series subsequence

matching▫ DRESS: Dimensionality reduction for event subsequence

matching• Learning from time series▫ Shapelets and shapelet trees▫ Random shapelet forests

• Current challenges

2

Page 3: Searching and mining sequential data - KTH

• Distinct events: e.g., text, DNA• Variables over time: e.g., time series• Interval-based events: e.g., sign language

Sequences

TCTAGGGCA

0 50 100 150 200 250 300 350 400 450 50023

24

25

26

27

28

29

time axis

valueaxis

Page 4: Searching and mining sequential data - KTH

Analysis Tasks

4

Similarity Search

Classification

Clustering Outlier Detection

Frequent Pattern Mining

Page 5: Searching and mining sequential data - KTH

• Time series matching / stream monitoring

• DNA sequence alignment

Similarity search

TCTAGGGCA

ischemia

possibility of cancer

ECG of a patient

human genome

Page 6: Searching and mining sequential data - KTH

} Query-by-hummingmusic piece à sequence of notes

Similarity search

Page 7: Searching and mining sequential data - KTH

• Time series classification

• Gene classification

ischemia

TCTAGGGCA

possibility of cancer

ECG of a patient

Sequence classification

Page 8: Searching and mining sequential data - KTH

Two fundamental questions

• How to define an appropriate distance measure

for the task at hand?

• How to speed up the search under that distance

measure?

8

Page 9: Searching and mining sequential data - KTH

• Given two time series

X = (x1, x2, …, xn) Y = (y1, y2, …, yn)

• Define and compute D (X, Y)

• Or better…

Time Series Similarity

Page 10: Searching and mining sequential data - KTH

database

query X

D (X, Y) 1-NN

• Given a time series database and a query X• Find the best match of X in the database

• Why is that useful?

Time Series Similarity

Page 11: Searching and mining sequential data - KTH

Examples

• Find companies with similar stock prices over atime interval

• Find products with similar sell cycles• Find songs with similar music patterns• Cluster users with similar credit card utilization• Find similar subsequences in DNA sequences• Find scenes in video streams

Page 12: Searching and mining sequential data - KTH

Elastic distance measures(images taken from Eamonn Keogh)

• Euclidean▫ rigid

• Dynamic Time Warping (DTW)▫ allows local scaling

• Longest Common SubSequence (LCSS)▫ allows local scaling▫ ignores outliers

( ) ( )å -º=

n

iii yxYXD

1

2,

Page 13: Searching and mining sequential data - KTH

• Warping path W: ▫ set of grid cells in the time warping matrix

• DTW finds an optimal warping path W:▫ the path with the smallest matching score

OptimumwarpingpathW(thebestalignment) PropertiesofaDTWlegalpath

I. Boundaryconditions

W1=(1,1)andWK=(n,m)

II. ContinuityGivenWk =(a,b),thenWk-1 =(c,d),wherea-c≤ 1,b-d≤1

III. MonotonicityGivenWk =(a,b),thenWk-1 =(c,d),wherea-c≥0,b-d≥0

Properties of DTW

X

Y

Page 14: Searching and mining sequential data - KTH

14

Sakoe-ChibaBand ItakuraParallelogram

r =

GlobalConstraintsn Slightlyspeedupthecalculationsandpreventpathologicalwarpingsn Aglobalconstraintlimitstheindicesofthewarpingpath

wk =(i,j)k suchthatj-r £ i £ j+rn Wherer isatermdefiningallowedrangeofwarpingforagivenpointina

sequence

Page 15: Searching and mining sequential data - KTH

Speeding up search under DTW

15

EBSM++: subsequence and full sequence matching for time series, ACM TODS 2011

EBESM: full matching of temporal interval sequences, DAMI 2017

…using embeddings

RBSM: alignment of event sequences, VLDB 2009

EBSM: subsequence matching for time series, SIGMOD 2008

Page 16: Searching and mining sequential data - KTH

Strategy: identify candidate matchesDatabase X

Page 17: Searching and mining sequential data - KTH

indexing structure

Strategy: identify candidate matchesDatabase X

Page 18: Searching and mining sequential data - KTH

query Q

indexing structure

Strategy: identify candidate matchesDatabase X

Page 19: Searching and mining sequential data - KTH

candidates candidates

indexing structure

query Q

Strategy: identify candidate matchesDatabase X

candidate: possible matchMain principle: use costly distancecomputation only to evaluate thecandidates

Page 20: Searching and mining sequential data - KTH

Vector embedding

Database X2X1 X4X3 X6X5 X8X7 X10X9 X12X11 X14X13 X15

Page 21: Searching and mining sequential data - KTH

vector set

Vector embedding

X2X1 X4X3 X6X5 X8X7 X10X9 X12X11 X14X13 X15Database

Page 22: Searching and mining sequential data - KTH

query Q2Q1 Q4Q3 Q5

vector set

Vector embedding

X2X1 X4X3 X6X5 X8X7 X10X9 X12X11 X14X13 X15Database

Page 23: Searching and mining sequential data - KTH

query vector

query Q2Q1 Q4Q3 Q5

vector set

Vector embedding

X2X1 X4X3 X6X5 X8X7 X10X9 X12X11 X14X13 X15Database

Page 24: Searching and mining sequential data - KTH

• Embedding should be such that:– Query vector is similar to match vector

candidate match

query vector

query Q2Q1 Q4Q3 Q5

vector set

Vector embedding

X2X1 X4X3 X6X5 X8X7 X10X9 X12X11 X14X13 X15Database

Page 25: Searching and mining sequential data - KTH

• Using vectors we identify candidates– Much faster than brute-force search

query vector

query Q2Q1 Q4Q3 Q5

vector set

Vector embedding

X2X1 X4X3 X6X5 X8X7 X10X9 X12X11 X14X13 X15Database

Page 26: Searching and mining sequential data - KTH

Using reference sequences

n Apply your favorite distance measure (e.g., DTW) to compute the cost of the

best match of R and X

n Define FR(X) to be that cost

n FR is a 1D embedding

n Each FR (X) à single real number

database sequence X

ref.

Rrow |R|

q Define a set of reference sequences

Page 27: Searching and mining sequential data - KTH

n Apply the same matching computation against the query:

q Matching cost of match R with Q

n Define FR(Q) to be that cost

database sequence X

ref.

R

query Q

ref.

R

Using reference sequences

Page 28: Searching and mining sequential data - KTH

Intuition about this embeddingn Suppose Q appears exactly in X

n Then:

n Warping paths are the same

n FR(Q) = FR(X)

n Suppose Q appears inexactly in X

n Then:

n We expect FR(Q) to be similar to FR(X)

n Why? Little tweaks should affect FR(X) little

n No proof, but intuitive, and lots of empirical evidence

Page 29: Searching and mining sequential data - KTH

• one reference sequence à 1D embedding• two reference sequences à 2D embedding

database sequence X query Q

R2R2

Multi-dimensional embedding

database sequence X query Q

R1R1

Page 30: Searching and mining sequential data - KTH

database sequence X query Q

R2R2

Multi-dimensional embedding

database sequence X query Q

R1R1

• d reference sequences à d-dimensional embedding F• Basic principle:

• If Xj is the best match of Q • F(Q) should (for most Q) be more similar to F(X, j) than to most FR(X, t)

Page 31: Searching and mining sequential data - KTH

Filter-and-refine retrievalOffline step:

• Compute FR(X) for each sequence in X

Online steps - given a query Q:• Embedding step:

▫ Compute FR(Q)

• Filter step:

▫ Compare FR(Q) to all FR(X)

▫ Select p best matches à p candidate endpoints

• Refine step:

▫ Use dynamic programming to evaluate each candidate

Page 32: Searching and mining sequential data - KTH

Filter-and-refine performance

• Accuracy:

ücorrect match must be among p candidates, for most queries

• Trade-off:

ü larger p à higher accuracy, lower efficiency

Database X

candidates

Page 33: Searching and mining sequential data - KTH

Embedding Optimization

• Solution:▫ extract the set of reference sequences with the highest distance

variance in the dataset▫ choose those top K that can guarantee the best trade-off between

accuracy and p▫ cross-validation (one-time pre-processing step)▫ sample of queries (input to the algorithm)▫ large pool of reference sequences

• How many reference sequence to choose?

• Which ones to choose?

Page 34: Searching and mining sequential data - KTH

Performance• 20 datasets: UCR Time Series Data Mining Repository

• Queries: 5,397

• Query lengths between 60 and 637

• 50% were used for embedding optimization

• EBSM can achieve over an order of magnitude speedupcompared to brute-force search for ALL datasets!

(2% to 5% of the database is examined) with accuracy of 99%

• And twice as fast as LB_PAA+LB_Keogh / UCR_Suite

Page 35: Searching and mining sequential data - KTH

Application: Query-by-humming• Competitive performance on real queries• Demo available: implementation in Matlab

Page 36: Searching and mining sequential data - KTH

Similarity search: event sequences

• Similarity search in large sequence databases

à a frequently occurring problem

• Plethora of string matching methods

• Still a high demand for new robust, and

scalable similarity search methods that can

handle

▫ large query lengths

▫ large similarity ranges

• Our focus: Whole sequence matching

TCTAGGGCA

Page 37: Searching and mining sequential data - KTH

q Expressed Sequence Tag (EST) databases:

ü common for representing large genomes

ü portions of genes expressed as mRNA

ü sequence length at least 500 – 800

Motivation: large query lengths

q Large scale searches need to be performed against other genomic

databases to determine locations of genes

q Searches can also target whole chromosomes, where the goal is to

find chromosome similarities across different organisms

Page 38: Searching and mining sequential data - KTH

ü High evolutionary divergence à task of identifying distantly related

gene or protein domains by sequence search techniques not always

trivial

ü Mutation process: intermediate sequences may possess features of

many proteins and facilitate detection of remotely related proteins

ü Large range queries, aka remote homology search, in bioinformatics

can be highly beneficial for searching proteins and genomes

Motivation: large query lengths

Page 39: Searching and mining sequential data - KTH

¤ Given

ü a database S and a query Q

ü a similarity range r = δ|Q|

¤ Task:

ü retrieve all database strings X such that

ED (Q,X) ≤ r

Problem setting

Page 40: Searching and mining sequential data - KTH

¤ Global alignment [ED, Q-grams, RBE]

ü full-sequence matching approach for global optimization

ü identify the min number of changes that should be performed toone sequence so as to convert it to the other

ü forcing the alignment to span the whole sequence

¤ Local alignment [SW, BLAST, RBSA]

ü identify local regions within the compared sequences that arehighly similar to each other, while they could be globally divergent

ü may allow for gaps in the alignment

¤ Short-read sequencing [SOAP, MAQ, WHAM]

Existing work

Page 41: Searching and mining sequential data - KTH

¤ Measures how dissimilar two strings are

¤ Computes the minimum cost of edit operations (insertion, deletion, substitution) to convert one string to the other

¤ Auxiliary matrix α

ü Cins: insertion cost

ü Cdel: deletion cost

ü Csub: substitution cost

The Edit distance

Page 42: Searching and mining sequential data - KTH

¤ Example:

A = ATC and B = ACTG

The Edit distance

A = A – T C

B = A C T G

ED (A,B) = 2

Page 43: Searching and mining sequential data - KTH

¤ DRESS: a novel filter-and-refine approximate method for speeding up

similarity search under edit distance:

ü no index training required

ü can handle large query lengths and similarity ranges

¤ Efficient string representation in a new space, based on a set of

query-specific codewords à distance computation between strings is

significantly faster than in original space

¤ Extensive experimental evaluation against state-of-the-art on threelarge protein and two DNA sequence datasets

ü DRESS outperforms competitors with very competitive accuracyand retrieval runtime

Our contributions

Page 44: Searching and mining sequential data - KTH

¤ A three-step framework

The DRESS framework

Page 45: Searching and mining sequential data - KTH

¤ This is an online query-sensitive step

¤ For each query Q, identify a set of codewords

E = {E1, …, Et}

² E contains the t most frequent substrings of length l

² IMPORANT: no pair in E has an overlapping prefix-suffix

² t and l are identified after appropriate training

The mapping step: alphabet reduction

Page 46: Searching and mining sequential data - KTH

46

The mapping step: alphabet reduction

Page 47: Searching and mining sequential data - KTH

¤ Let X = (babfcde) and E = {ab, cd}

The mapping step: example

Sequences are traversedfrom left to right!

Page 48: Searching and mining sequential data - KTH

¤ The process described above is linear to the database size

¤ Inverted index: precomputed for the database, where we store for each possible code-word of length l:

ü all the sequence indices and positions where it occurs

ü IND[e][i]: list of all positions where codeword e occurs in the ith database sequence

q NOTE: The index is built in linear time

The mapping step: faster than linear

Page 49: Searching and mining sequential data - KTH

¤ Let Xi be a database sequence and xi denote its E-mapping

The mapping step: in detail

Page 50: Searching and mining sequential data - KTH

¤Complexity of the mapping step for a single DB

sequence:

¤Good news:

ü Typically the length of xi is much smaller than the

length of Xi

ü Hence, faster than scanning Xi

The mapping step: complexity

Page 51: Searching and mining sequential data - KTH

¤ Query is mapped to the E-space

¤ Each DB sequence is mapped to the E-space

¤ Brute-force search under Edit Distance on the E-space

¤ Identify a set of candidates that are within a factor of δ’ of

the query length

¤ Refine the candidates in the original space under Edit

Distance

The filter-and-refine steps

Page 52: Searching and mining sequential data - KTH

¤ The matching range δ’ in the E-space depends on the original δ and

is set to be δ’ = f δ

¤ Clearly, δ’ should be set higher than δ to account for the cases where

the distance in the E-space is a higher percentage of the mapped

query length

¤ The latter can easily happen because the mapped strings are

drastically shorter than the original strings

¤ Hence: while we expect similar strings to have similar mappings, the

loss incurred by the mapping can lead to higher distances as % of the

query length

The matching range in the E-space

Page 53: Searching and mining sequential data - KTH

¤ Two real datasets

¤ Three methods

¤ Variable query sizes

¤ Several parameters

Experimental setup

Page 54: Searching and mining sequential data - KTH

¤ UniProt [Protein sequences]: http://www.ebi.ac.uk/uniprot/

o 530,264 strings

o 25-letter alphabet

¤ Human Chromosome 1 [DNA sequences]: ftp://ftp.ncbi.nlm.nih.go

o 249,250,621 bases available

o 4-letter alphabet

Experimental setup: Datasets

Page 55: Searching and mining sequential data - KTH

Experimental setup: Datasets

Page 56: Searching and mining sequential data - KTH

¤ DRESS: our method

¤ RBE: an embedding-based method for sequence search

¤ Q-grams: uses inverted indices to find candidate matches

¤ Note that DRESS is an approximate method, while RBE and Q-

grams are exact

Experimental setup: Methods

Page 57: Searching and mining sequential data - KTH

¤ Retrieval cost on UniProt [401,800]

Experimental results: retrieval cost

v DRESS achieves up to 29 times lower retrieval cost against RBE and up to 75 times lower than Q-grams

Page 58: Searching and mining sequential data - KTH

¤ Retrieval cost on UniProt [801,max]

v DRESS achieves up to 75 times lower retrieval cost against RBE and up to 461 times lower than Q-grams

Experimental results: retrieval cost

Page 59: Searching and mining sequential data - KTH

¤ Protein datasets:

Experimental results: runtime

¤ DNA datasets:

Page 60: Searching and mining sequential data - KTH

¤ DRESS: novel method for whole sequence matching

¤ Supports large query lengths and matching ranges

¤ Experiments demonstrate that for higher values of search range

DRESS produces significantly lower costs and runtimes than the

competitors

¤ Loss of guarantee of 100% recall

¤ This price can be an acceptable trade-off in several domains given

the significant runtime savings

Conclusions

Page 61: Searching and mining sequential data - KTH

Future work

¤ Further improve the performance of DRESS by, e.g., implementing

multiple filter steps

¤ Study the performance of DRESS on sequences from other

domains, such as text

¤ Adapt DRESS for local alignment and short-read sequencing

¤ Explore alternative ways of producing codewords, e.g., learn

synthetic words

Page 62: Searching and mining sequential data - KTH

TimeSeriesclassification• Application:Finance,Medicine,Music

• 1-Nearest Neighbor▫ Pros: accurate, robust, simple▫ Cons: time and space complexity (lazy learning); results arenot interpretable

0 200 400 600 800 1000 1200

Page 63: Searching and mining sequential data - KTH

Solution

• Shapelets:▫ timeseriessubsequence▫ representativeofaclass▫ discriminativefromotherclasses

• Idea:▫ Useshapelets as“attributes”or“features”forsplittinganodeinthedecisiontree

shapelet

Page 64: Searching and mining sequential data - KTH

Two steps:o Learn a set of discriminative

shapelets (typically a tree-like

structure)

o Predict a class label for a previously

unseen data series using the

minimum distance to shapelets

(following a path in the tree-structure

model)

Random Shapelet Tree

Forgot \bar{s}

TheShapelet Classifier

Page 65: Searching and mining sequential data - KTH

100 200 3000

0.10.2

0.3

0.4

0.0

I

II

III

IV

V

VI

Shapelet Dictionary

2 4 0 1 3 6 5

I

II

III IV

V

VI

Shapelet Tree

Method Accuracy Time

Shapelet 0.720 0.86

NearestNeighbor 0.543 0.65

TheShapelet Classifier

Page 66: Searching and mining sequential data - KTH

• Collect all candidate shapeletes in a pool• For each candidate: arrange the time series objectso based on the distance from a candidate shapelet

• Find the optimal split point (maximal information gain)• Pick the candidate achieving best utility as the shapelet

Split Point

0

candidate

Time series objects in the database

TheutilityofaShapelet

Page 67: Searching and mining sequential data - KTH

CandidatesPool

(Ti − l +1)Ti∈D∑

l=MINLEN

MAXLEN

DB:200Length:275Candidates:7,480,200Computation:3days

ExtractingallShapelets

Page 68: Searching and mining sequential data - KTH

• Main bottleneck: candidate generation

• Reducethetimeintwoways▫ DistanceEarlyAbandon

o reducetheEuclideandistancecomputationtimebetweentwotimeseries

▫ AdmissibleEntropyPruningo aftercalculatingsometrainingsamples,useanupperboundofinformationgaintocheckagainstthebestcandidateshapelet

Split Point

0

candidate

Time series objects in the database

Speedingupcandidategeneration

Page 69: Searching and mining sequential data - KTH

100 200 3000

0.10.2

0.3

0.4

0.0

I

II

III

IV

V

VI

Shapelet Dictionary

2 4 0 1 3 6 5

I

II

III IV

V

VI

Shapelet Tree

Method Accuracy Time

Shapelet 0.720 0.86

NearestNeighbor 0.543 0.65

TheShapelet Classifier

Page 70: Searching and mining sequential data - KTH

• Transformations + k-NN

▫ Improved subsequence searching and matching, using online

normalization, early abandoning, and re-ordering

▫ Dimensionality reduction using symbolic aggregate approximation

• Synthetic shapelet generation

▫ Initialize using K-means clustering

▫ Learn synthetic Shapelets by following a Gradient Ascend approach

Alternativeapproaches

Page 71: Searching and mining sequential data - KTH

• Feature-based▫ Select the top k most informative shapelets▫ Generate a new dataset D’ of k columns and |D| rows, whereeach element (i,j) is the (minimum) distance from the shapeletsj to data series di▫ Learn any suitable classifier (e.g., SVM, Random Forest) usingthe transformed dataset

Alternativeapproaches

Page 72: Searching and mining sequential data - KTH

• A tree-structured ensemble based on the classicRandom Forest and the Shapelet Tree classifier

• Learn:▫ Build T random shapelet trees▫ Each tree is built from a random (with replacement)sample of time series in the database D▫ Inspect r random shapelets at each node

• Predict:▫ Let each tree t1, . . . , tT vote for a class label

RandomShapelet Forest

Page 73: Searching and mining sequential data - KTH

• Paremeters:

o number of shapelets (r) = 100

o number of trees (T) = 500

o min shapelet length (l) = 2

o max shapelet length (u) = m (max time series length)

Experimentalresults

Page 74: Searching and mining sequential data - KTH

• Friedman’s test: The observed differences in accuracy of RSF against the rest deviate significantly, i.e., p-value = 10-11

Page 75: Searching and mining sequential data - KTH

75

Multi-variate time series

Page 76: Searching and mining sequential data - KTH

Future work• Feature tweaking for shapelets:

▫ What is the minimum amount of changes we need

to apply to a shapelet feature so that the

classification result flips?

• Interval sequence features:

▫ Apply random shapelet forests for complex

interval sequences

76

Page 77: Searching and mining sequential data - KTH

Thank you for your attention!

Page 78: Searching and mining sequential data - KTH

• "Embedding-based Subsequence Matching: a query-by-humming application". In the Very Large DatabasesJournal (VLDBJ), 24(4): 519-536, 2015

• "Hum-a-song: A Subsequence Matching with Gaps-Range-Tolerances Query-By-Humming System". In Proceedings ofthe Very Large Databases Endowement (PVLDB), Volume4, Issue 11, pages 1930-1933, 2012

• "Embedding-based subsequence matching in time-seriesdatabases". In ACM Transactions on Database Systems (TODS),Vol. 36, Issue 3, Article 17, 2011

78

ReferencesSubsequence matching

Page 79: Searching and mining sequential data - KTH

• "Generalized Random Shapelet Forests". In the DataMining and Knowledge Discovery Journal (DAMI)2016

• "Early-classification of Time Series". In Proceedingsof the International Conference on DiscoveryScience (DS), Best Paper Award, 2016

• "Query-sensitive Distance Measure Selection forTime Series Nearest Neighbor Classification".Intelligent Data Analysis Journal (IDAJ), 20(1): 5-27, 2016

79

ReferencesTime series classification

Page 80: Searching and mining sequential data - KTH

• "Indexing Sequences of Temporal Intervals". In the DataMining and Knowledge Discovery Journal (DAMI),Accepted 2016

• "Finding the Longest Common Sub-Pattern in Sequences ofTemporal Intervals". In the Data Mining and KnowledgeDiscovery Journal (DAMI), 29(5): 1178-1210, 2015

• "Mining Frequent Arrangements of Temporal Intervals". InKnowledge and Information Systems (KAIS), Vol. 21, Issue2, pages 133–171, 2010

80

ReferencesSequences of temporal intervals

Page 81: Searching and mining sequential data - KTH

Thank you for your attention!

Panagiotis Papapetrou

[email protected]