Lecture 3: Dynamic Time Warping (DTW)
Yossi Keshet
October 29, 2012
A Very Simple Speech Recognizer
$$w^* = \arg\min_{w \in \text{vocab}} \text{distance}(A'_{\text{test}}, A'_w)$$
Signal processing: extracting features $A'$ from audio $A$.
e.g., MFCC with deltas and double deltas.
e.g., for a 1 s signal with a 10 ms frame rate $\Rightarrow$ $\sim 100 \times 40$ values in $A'$.
Dynamic time warping: handling time/rate variation in the distance measure.
The Problem
$$\text{distance}(A'_{\text{test}}, A'_w) \stackrel{?}{=} \sum_t \text{framedist}(A'_{\text{test},t}, A'_{w,t})$$
In general, samples won’t even be same length.
Problem Formulation
Have two audio samples; convert to feature vectors.
Each $x_t$, $y_t$ is a $\sim$40-dim vector, say.
$$X = (x_1, x_2, \ldots, x_{T_x})$$
$$Y = (y_1, y_2, \ldots, y_{T_y})$$
Compute $\text{distance}(X, Y)$.
Linear Time Normalization
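Idea: omit/duplicate frames uniformly in $Y$ so that it has the same length as $X$.
$$\text{distance}(X, Y) = \sum_{t=1}^{T_x} \text{framedist}\left(x_t, y_{t \times T_y / T_x}\right)$$
(The index $t \times T_y / T_x$ is rounded to a valid frame number in $Y$.)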
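A minimal Python sketch of linear time normalization, under stated assumptions: frames are lists of floats, `euclidean` stands in for framedist, and the fractional index is rounded up and clamped to 1..T_y (the slide does not say how to round). The names `euclidean` and `linear_distance` are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linear_distance(X, Y, framedist=euclidean):
    """Sum of frame distances after uniformly stretching/shrinking Y to len(X)."""
    Tx, Ty = len(X), len(Y)
    total = 0.0
    for t in range(1, Tx + 1):  # 1-based frame index, as on the slide
        # Assumption: round t * Ty / Tx up, then clamp to the range 1..Ty.
        s = min(Ty, max(1, math.ceil(t * Ty / Tx)))
        total += framedist(X[t - 1], Y[s - 1])
    return total

# Toy 1-dimensional "features": Y is a 2-frame version of the 4-frame X.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0.0], [3.0]]
print(linear_distance(X, Y))  # each frame of Y is reused for two frames of X
```

Every frame of Y is stretched by the same factor, which is exactly the limitation the next slide points out.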
What’s the Problem?
Handling silence.
silence CAT silence
silence CAT silence
Solution: endpointing.
Do vowels and consonants stretch equally in time?
Want a nonlinear alignment scheme!
Alignments and Warping Functions
Can specify an alignment between times in $X$ and $Y$ using warping functions $\tau_x(t)$, $\tau_y(t)$, $t = 1, \ldots, T$.
i.e., time $\tau_x(t)$ in $X$ aligns with time $\tau_y(t)$ in $Y$.
Total distance is the sum of distances between aligned vectors.
$$\text{distance}_{\tau_x,\tau_y}(X, Y) = \sum_{t=1}^{T} \text{framedist}\left(x_{\tau_x(t)}, y_{\tau_y(t)}\right)$$
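As a rough illustration of the sum above, the following Python sketch scores one given alignment. It assumes the warping functions are passed as two equal-length lists of 1-based frame indices; the name `alignment_distance` and the toy scalar features are hypothetical.

```python
def alignment_distance(X, Y, tau_x, tau_y, framedist):
    """Sum framedist over the aligned pairs (tau_x[t], tau_y[t]); indices are 1-based."""
    if len(tau_x) != len(tau_y):
        raise ValueError("tau_x and tau_y must have the same length T")
    return sum(framedist(X[i - 1], Y[j - 1]) for i, j in zip(tau_x, tau_y))

# Toy example with scalar features and absolute difference as the frame distance.
X = [1.0, 2.0, 3.0]
Y = [1.0, 3.0]
tau_x = [1, 2, 3]   # times in X
tau_y = [1, 1, 2]   # times in Y aligned to them
print(alignment_distance(X, Y, tau_x, tau_y, lambda a, b: abs(a - b)))
```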
[Figure: example alignment path; horizontal axis $\tau_1(t)$ with ticks 1–7, vertical axis $\tau_2(t)$ with ticks 1–5.]
Computing Frame Distances: framedist(x , y)
Let $x_d$, $y_d$ denote the $d$th dimension (this slide only).
Euclidean ($L^2$): $\sqrt{\sum_d (x_d - y_d)^2}$
$L^p$ norm: $\sqrt[p]{\sum_d |x_d - y_d|^p}$
Weighted $L^p$ norm: $\sqrt[p]{\sum_d w_d |x_d - y_d|^p}$
Itakura: $d_I(X, Y) = \log(a^T R_p a / G^2)$
Symmetrized Itakura: $d_I(X, Y) + d_I(Y, X)$
Weighting each feature vector component differently.
e.g., for variance normalization.
Called liftering when applied to cepstra.
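The norm-based distances above can be sketched in a few lines of Python. This is only an illustration: the function name `lp_dist` and the example vectors are hypothetical, and the Itakura distances are left out because they need quantities ($a$, $R_p$, $G$) that the slide does not define.

```python
def lp_dist(x, y, p=2.0, w=None):
    """Weighted L^p distance between feature vectors x and y.

    With w=None every weight is 1, so p=2 gives the Euclidean distance.
    """
    if w is None:
        w = [1.0] * len(x)
    total = sum(wd * abs(xd - yd) ** p for wd, xd, yd in zip(w, x, y))
    return total ** (1.0 / p)

x, y = [0.0, 3.0], [4.0, 0.0]
print(lp_dist(x, y))                 # Euclidean (L^2)
print(lp_dist(x, y, p=1))            # L^1
print(lp_dist(x, y, w=[1.0, 0.0]))   # weighted L^2 that ignores dimension 2
```

Choosing the weights per dimension (for instance, smaller weights for high-variance dimensions) is where variance normalization or liftering would enter.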
Another Example Alignment
[Figure: another example alignment path; horizontal axis $\tau_1(t)$ with ticks 1–7, vertical axis $\tau_2(t)$ with ticks 1–5.]
Constraining Warping Functions
Begin at the beginning; end at the end. (Any exceptions?)
$\tau_x(1) = 1$, $\tau_x(T) = T_x$; $\tau_y(1) = 1$, $\tau_y(T) = T_y$.
Don’t move backwards (monotonicity).
$\tau_x(t + 1) \ge \tau_x(t)$; $\tau_y(t + 1) \ge \tau_y(t)$.
Don’t move forwards too far (locality).
e.g., $\tau_x(t + 1) \le \tau_x(t) + 1$; $\tau_y(t + 1) \le \tau_y(t) + 1$.
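The three constraints can be checked mechanically. The sketch below assumes the warping functions are given as lists of 1-based indices and uses the example locality bound of one frame per step; the name `is_valid_alignment` is hypothetical.

```python
def is_valid_alignment(tau_x, tau_y, Tx, Ty):
    """Check the endpoint, monotonicity and locality constraints (1-based times)."""
    if not tau_x or len(tau_x) != len(tau_y):
        return False
    # Begin at the beginning; end at the end.
    if (tau_x[0], tau_y[0]) != (1, 1) or (tau_x[-1], tau_y[-1]) != (Tx, Ty):
        return False
    for t in range(len(tau_x) - 1):
        dx = tau_x[t + 1] - tau_x[t]
        dy = tau_y[t + 1] - tau_y[t]
        # Monotonicity (step >= 0) and locality (step <= 1) on both axes.
        # Note: as stated, these constraints still allow a (0, 0) step.
        if dx not in (0, 1) or dy not in (0, 1):
            return False
    return True

print(is_valid_alignment([1, 2, 3], [1, 1, 2], Tx=3, Ty=2))  # valid
print(is_valid_alignment([1, 3], [1, 2], Tx=3, Ty=2))        # skips a frame of X
```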
[Figure: example alignment path; horizontal axis $\tau_1(t)$ with ticks 1–7, vertical axis $\tau_2(t)$ with ticks 1–5.]
Local Paths
Can summarize/encode local alignment constraints by enumerating the legal ways to extend an alignment.
e.g., three possible extensions or local paths (writing $\tau_1$, $\tau_2$ for $\tau_x$, $\tau_y$):
$\tau_1(t + 1) = \tau_1(t) + 1$, $\tau_2(t + 1) = \tau_2(t) + 1$.
$\tau_1(t + 1) = \tau_1(t) + 1$, $\tau_2(t + 1) = \tau_2(t)$.
$\tau_1(t + 1) = \tau_1(t)$, $\tau_2(t + 1) = \tau_2(t) + 1$.
Which Alignment?
Given an alignment, easy to compute distance:
$$\text{distance}_{\tau_x,\tau_y}(X, Y) = \sum_{t=1}^{T} \text{framedist}\left(x_{\tau_x(t)}, y_{\tau_y(t)}\right)$$
Lots of possible alignments given $X$, $Y$.
Which one to use to calculate $\text{distance}(X, Y)$?
The best one!
$$\text{distance}(X, Y) = \min_{\text{valid } \tau_x, \tau_y} \left\{ \text{distance}_{\tau_x,\tau_y}(X, Y) \right\}$$
Which Alignment?
$$\text{distance}(X, Y) = \min_{\text{valid } \tau_x, \tau_y} \left\{ \text{distance}_{\tau_x,\tau_y}(X, Y) \right\}$$
Hey, there are many, many possible $\tau_x$, $\tau_y$.
Exponentially many in $T_x$ and $T_y$, in fact.
How in blazes are we going to compute this?
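To get a feel for "exponentially many", the Python sketch below counts alignments, assuming the three local paths from the Local Paths slide (diagonal, horizontal, vertical). The name `count_alignments` is hypothetical, and the count is only an illustration of growth, not part of the recognizer.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(Tx, Ty):
    """Number of paths from (1, 1) to (Tx, Ty) using steps (1,1), (1,0), (0,1)."""
    if Tx < 1 or Ty < 1:
        return 0
    if Tx == 1 and Ty == 1:
        return 1
    return (count_alignments(Tx - 1, Ty - 1)
            + count_alignments(Tx - 1, Ty)
            + count_alignments(Tx, Ty - 1))

# The count grows very quickly with the sequence lengths.
for T in (2, 3, 4, 10, 20):
    print(T, count_alignments(T, T))
```

Enumerating every one of these alignments and scoring it is hopeless for utterances with around 100 frames.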
Dynamic Programming
Solve problem by caching subproblem solutions rather than recomputing them.
Why the mysterious name?
The inventor (Richard Bellman) was trying to hide what he was working on from his boss.
For our purposes, we focus on shortest path problems.
Case Study
Let’s consider a small example: $T_x = 3$; $T_y = 2$.
Allow steps that advance $t_x$ and $t_y$ by at most 1.
Matrix of frame distances: $\text{framedist}(x_t, y_{t'})$.

|       | $x_1$ | $x_2$ | $x_3$ |
|-------|-------|-------|-------|
| $y_2$ | 4     | 1     | 2     |
| $y_1$ | 3     | 0     | 5     |

Let’s make a graph.
Label each arc with $\text{framedist}(x_t, y_{t'})$ at its destination.
Need a dummy arc to get the distance at the starting point.
Case Study
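[Figure: alignment graph for the example on a 3 × 2 grid (horizontal ticks 1–3, vertical ticks 1–2); arcs are labeled with frame distances; A marks the start point and B the end point.]
Each path from point A to B corresponds to an alignment.
Distance for an alignment = sum of scores along the path.
Goal: find the alignment with the smallest total distance.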
And Now For Something Completely Different
Let’s take a break and look at something totally unrelated.
Here’s a map with distances labeled between cities.
Let’s say we want to find the shortest route from point A to point B.
[Figure: map with cities as nodes and arcs labeled with distances, from point A to point B.]
Wait a Second!
I’m having déjà vu!
In fact, the length of the shortest route in the 2nd problem is the same as the distance for the best alignment in the 1st problem.
i.e., the problem of finding the best alignment is equivalent to the single-pair shortest path problem for directed acyclic graphs.
Has a well-known dynamic programming solution.
Will come up again for HMMs, finite-state machines.
Key Observation 1
Shortest distance $d(S)$ from start to state $S$ is $d(S') + \text{distance}(S', S)$ for some predecessor $S'$ of $S$.
If we know $d(S')$ for all predecessors of $S$, it is easy to compute $d(S)$.
$$d(S) = \min_{S' \to S} \left\{ d(S') + \text{distance}(S', S) \right\}$$
Proposed Algorithm
Loop through all states $S$ in some order.
For each state, compute distance to start state $d(S)$.
$d(S')$ must already be known for all predecessors $S'$.
Is there always an ordering on states such that if there is an arc $S' \to S$, $d(S')$ is computed before $d(S)$?
Key Observation 2
For all acyclic graphs, there is a topological sorting such that all arcs go from earlier to later states.
In many cases, the topological sorting is obvious by inspection.
Otherwise, it can be computed in time $O(\text{states} + \text{arcs})$.
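One standard way to compute such an ordering in $O(\text{states} + \text{arcs})$ time is Kahn's algorithm; the slides do not name a specific method, so the Python sketch below is just one possible choice. States are assumed to be numbered from 0, and the name `topological_order` is hypothetical.

```python
from collections import deque

def topological_order(num_states, arcs):
    """Kahn's algorithm; arcs is a list of (src, dst) pairs over states 0..num_states-1."""
    successors = [[] for _ in range(num_states)]
    indegree = [0] * num_states
    for src, dst in arcs:
        successors[src].append(dst)
        indegree[dst] += 1
    # States with no unprocessed predecessors are ready to be emitted.
    ready = deque(s for s in range(num_states) if indegree[s] == 0)
    order = []
    while ready:
        s = ready.popleft()
        order.append(s)
        for nxt in successors[s]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != num_states:
        raise ValueError("graph has a cycle; no topological order exists")
    return order

order = topological_order(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
print(order)
```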
The Shortest Path Algorithm
Sort states topologically: number from $1, \ldots, N$.
Start state = state 1; final state = state $N$.
$d(1) = 0$.
For $S = 2, \ldots, N$ do
    $d(S) = \min_{S' \to S} \left\{ d(S') + \text{distance}(S', S) \right\}$
Final answer: $d(N)$.
Total time: $O(\text{states} + \text{arcs})$.
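The loop above translates almost directly into Python. This sketch assumes states are already numbered 1..N in topological order and that arcs are given as `(src, dst, cost)` triples; the function name and the small example graph are hypothetical and are not the graph from the slide's figure.

```python
import math

def dag_shortest_path(num_states, arcs):
    """Shortest distance from state 1 to state num_states in a DAG.

    arcs: list of (src, dst, cost); states are numbered in topological order,
    so every arc satisfies src < dst.
    """
    incoming = {s: [] for s in range(1, num_states + 1)}
    for src, dst, cost in arcs:
        incoming[dst].append((src, cost))
    d = {1: 0.0}
    for s in range(2, num_states + 1):
        # All predecessors have smaller numbers, so d[p] is already known.
        d[s] = min((d[p] + c for p, c in incoming[s]), default=math.inf)
    return d[num_states]

arcs = [(1, 2, 1), (1, 3, 4), (2, 3, 2), (2, 4, 7), (3, 4, 1)]
print(dag_shortest_path(4, arcs))  # best route is 1 -> 2 -> 3 -> 4
```

Each arc is looked at exactly once, which is where the $O(\text{states} + \text{arcs})$ bound comes from.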
Shortest Path and DTW
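[Figure: the alignment graph from the case study on a 3 × 2 grid (horizontal ticks 1–3, vertical ticks 1–2); arcs are labeled with frame distances; A marks the start point and B the end point.]
What are the states of the graph?
What is a topological sorting for the states?
What are the predecessors for each state?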
Shortest Path and DTW
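$d(1, 1) = \text{framedist}(x_1, y_1)$.
For $t_1 = 1, \ldots, 3$ do
    For $t_2 = 1, \ldots, 2$ do (skip $(t_1, t_2) = (1, 1)$; a term whose index falls below 1 counts as $\infty$)
$$d(t_1, t_2) = \min \left\{ \begin{aligned} & d(t_1 - 1, t_2 - 1) + \text{framedist}(x_{t_1}, y_{t_2}), \\ & d(t_1 - 1, t_2) + \text{framedist}(x_{t_1}, y_{t_2}), \\ & d(t_1, t_2 - 1) + \text{framedist}(x_{t_1}, y_{t_2}) \end{aligned} \right\}$$
Final answer: $d(3, 2)$.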
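A minimal Python sketch of this recurrence, run on the frame-distance matrix from the case study. It assumes the matrix is passed as `D[i][j] = framedist(x_{i+1}, y_{j+1})` with 0-based indices; the function name `dtw` is hypothetical.

```python
import math

def dtw(D):
    """Fill the cumulative-distance table for frame-distance matrix D.

    D[i][j] is the frame distance between x_{i+1} and y_{j+1}.
    Returns (d(Tx, Ty), full table).
    """
    Tx, Ty = len(D), len(D[0])
    d = [[math.inf] * Ty for _ in range(Tx)]
    for i in range(Tx):
        for j in range(Ty):
            if i == 0 and j == 0:
                d[0][0] = D[0][0]          # base case d(1, 1)
                continue
            best = math.inf                # missing predecessors count as infinity
            if i > 0 and j > 0:
                best = min(best, d[i - 1][j - 1])
            if i > 0:
                best = min(best, d[i - 1][j])
            if j > 0:
                best = min(best, d[i][j - 1])
            d[i][j] = best + D[i][j]
    return d[Tx - 1][Ty - 1], d

# Rows are x1..x3, columns are y1, y2 (the case-study matrix).
D = [[3, 4],
     [0, 1],
     [5, 2]]
total, table = dtw(D)
print(total)   # d(3, 2)
print(table)
```

The row-by-row loop order is a valid topological sorting here, because every predecessor of a cell has a smaller row or column index.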
DTW: Example
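[Figure: the alignment graph from the case study on a 3 × 2 grid (horizontal ticks 1–3, vertical ticks 1–2); arcs are labeled with frame distances; A marks the start point and B the end point.]

Frame distances:

|       | $x_1$ | $x_2$ | $x_3$ |
|-------|-------|-------|-------|
| $y_2$ | 4     | 1     | 2     |
| $y_1$ | 3     | 0     | 5     |

Cumulative distances:

| $d(\cdot, \cdot)$ | 1 | 2 | 3 |
|-------------------|---|---|---|
| 2                 | 7 | 4 | 5 |
| 1                 | 3 | 3 | 8 |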
Normalizing Distances
Two-step path incurs two frame distances, while a one-step path incurs a single frame distance.
Biases against the two-step path.
Idea: use weights to correct for these types of biases.
Weighted Local Paths (Sakoe and Chiba)
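[Figure: the three weighted local paths; the arc weights shown are 2, 1, 2, 2, 1.]
$$d(t_1, t_2) = \min \left\{ \begin{aligned} & d(t_1 - 1, t_2 - 1) + 2 \times \text{framedist}(x_{t_1}, y_{t_2}), \\ & d(t_1 - 2, t_2 - 1) + 2 \times \text{framedist}(x_{t_1 - 1}, y_{t_2}) + \text{framedist}(x_{t_1}, y_{t_2}), \\ & d(t_1 - 1, t_2 - 2) + 2 \times \text{framedist}(x_{t_1}, y_{t_2 - 1}) + \text{framedist}(x_{t_1}, y_{t_2}) \end{aligned} \right\}$$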
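A hedged Python sketch of the weighted recurrence above. Two things are assumptions rather than something the slide states: the base case $d(1, 1) = \text{framedist}(x_1, y_1)$ is carried over from the unweighted recurrence, and predecessors that fall outside the grid count as infinity. The name `weighted_dtw` is hypothetical.

```python
import math

def weighted_dtw(D):
    """Weighted local-path recurrence on frame-distance matrix D (0-based)."""
    Tx, Ty = len(D), len(D[0])
    d = [[math.inf] * Ty for _ in range(Tx)]

    def get(i, j):
        # Cells outside the grid are unreachable.
        return d[i][j] if i >= 0 and j >= 0 else math.inf

    d[0][0] = D[0][0]  # assumed base case, not given on the slide
    for i in range(Tx):
        for j in range(Ty):
            if i == 0 and j == 0:
                continue
            # Diagonal step with weight 2.
            candidates = [get(i - 1, j - 1) + 2 * D[i][j]]
            if i >= 1:
                # Two frames of X for one frame of Y: weights 2 then 1.
                candidates.append(get(i - 2, j - 1) + 2 * D[i - 1][j] + D[i][j])
            if j >= 1:
                # One frame of X for two frames of Y: weights 2 then 1.
                candidates.append(get(i - 1, j - 2) + 2 * D[i][j - 1] + D[i][j])
            d[i][j] = min(candidates)
    return d[Tx - 1][Ty - 1]

# Case-study matrix: rows x1..x3, columns y1, y2.
D = [[3, 4],
     [0, 1],
     [5, 2]]
print(weighted_dtw(D))
```

With these local paths, purely horizontal or vertical runs are no longer possible, so the result is infinite when one sequence is more than about twice as long as the other.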
Speeding Things Up
What is the time complexity of DTW?
$O(\text{states} + \text{arcs})$, i.e., $O(T_x T_y)$ for the alignment grid.
For long utterances, can be expensive.
Idea: put a ceiling on the maximum amount of warping, e.g.,
$$|\tau_x(t) - \tau_y(t)| \le T_0$$
Another idea: beam pruning or rank pruning.
Can make computation linear.
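The warping ceiling can be sketched by only filling cells within $T_0$ of the diagonal. This Python sketch uses the simple three-way recurrence from the earlier slide and treats cells outside the band as unreachable; the name `banded_dtw` is hypothetical, and beam or rank pruning is not shown.

```python
import math

def banded_dtw(D, T0):
    """DTW restricted to cells with |t1 - t2| <= T0; D is 0-based."""
    Tx, Ty = len(D), len(D[0])
    d = [[math.inf] * Ty for _ in range(Tx)]
    for i in range(Tx):
        lo, hi = max(0, i - T0), min(Ty - 1, i + T0)
        for j in range(lo, hi + 1):        # only cells inside the band
            if i == 0 and j == 0:
                d[0][0] = D[0][0]
                continue
            best = min(
                d[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
                d[i - 1][j] if i > 0 else math.inf,
                d[i][j - 1] if j > 0 else math.inf,
            )
            d[i][j] = best + D[i][j]
    return d[Tx - 1][Ty - 1]

# Case-study matrix: rows x1..x3, columns y1, y2.
D = [[3, 4],
     [0, 1],
     [5, 2]]
print(banded_dtw(D, T0=1))   # band wide enough to contain the best path
print(banded_dtw(D, T0=0))   # band too narrow: the end point is unreachable
```

The work per row is at most $2 T_0 + 1$ cells, so for fixed $T_0$ the cost grows linearly with the utterance length.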
A DTW Recognizer: The Whole Damn Thing
$$w^* = \arg\min_{w \in \text{vocab}} \text{distance}(A'_{\text{test}}, A'_w)$$
Training: collect audio $A_w$ for each word $w$ in vocab.
Apply signal processing $\Rightarrow A'_w$.
Called the template for word $w$.
(Can collect multiple templates for each word.)
Test time: given audio $A_{\text{test}}$, convert to $A'_{\text{test}}$.
For each $w$, compute $\text{distance}(A'_{\text{test}}, A'_w)$ using DTW.
Return the $w$ with the smallest distance.
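Putting the pieces together, a template recognizer might look like the following Python sketch. Everything concrete here is an assumption for illustration: features are toy 1-dimensional frames rather than MFCCs, the frame distance is Euclidean, the local paths are the simple three-way set, and the names `framedist`, `dtw_distance` and `recognize` are hypothetical.

```python
import math

def framedist(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw_distance(X, Y):
    """DTW distance between non-empty sequences X and Y, keeping one row at a time."""
    prev = None
    for i, x in enumerate(X):
        cur = []
        for j, y in enumerate(Y):
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    prev[j - 1] if i > 0 and j > 0 else math.inf,
                    prev[j] if i > 0 else math.inf,
                    cur[j - 1] if j > 0 else math.inf,
                )
            cur.append(best + framedist(x, y))
        prev = cur
    return prev[-1]

def recognize(test, templates):
    """Return the word whose closest template has the smallest DTW distance.

    templates maps each word to a list of one or more feature sequences.
    """
    return min(templates,
               key=lambda w: min(dtw_distance(test, t) for t in templates[w]))

# Toy templates with 1-dimensional "features".
templates = {
    "cat": [[[0.0], [1.0], [2.0]]],
    "dog": [[[5.0], [5.0], [4.0]]],
}
test = [[0.0], [0.0], [1.0], [2.0], [2.0]]   # a slower rendition of "cat"
print(recognize(test, templates))
```

Taking the minimum over a word's templates is one simple way to use multiple templates per word; other combination rules are possible.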
Recap
DTW is effective for computing distance between signals using nonlinear time alignment.
Ad hoc selection of frame distance and local paths.
Finds best path in exponential space in quadratic time using dynamic programming.
Signal processing, DTW: all you need for simple ASR.
e.g., old-school cell phone name dialer.
Back to the Future
Some recent work has revisited DTW.
Can extend algorithm to connected speech.
Word models (as opposed to phonetic models).
Don’t average word instances together; keep separate!
Can model longer-distance acoustic dependencies as compared to conventional GMM/HMM systems.
Hasn’t gone anywhere yet.
References
R. Bellman, “Eye of the Hurricane”. World Scientific Publishing Company, Singapore, 1984.
H. Sakoe and S. Chiba, “Dynamic Programming Algorithm Optimization for Spoken Word Recognition”, IEEE Trans. on Acoustics Speech and Signal Processing, vol. ASSP-26, pp. 43–45, Feb. 1978.