The TUM Cumulative DTW Approach for the Mediaeval 2012 Spoken Web Search Task
The TUM Cumulative DTW Approach for the Spoken Web Search Task
Cyril Joder, Felix Weninger, Martin Wöllmer, Björn Schuller
Institute for Human-Machine Communication, Technische Universität München
Summary
• Not a "system"
• Low-level features only
• No ASR
• Little "engineering"
• Method of integrating discriminative training into DTW
Mediaeval 2012 Workshop 2
Cumulative DTW (CDTW)
• Limitations of DTW:
  – Only one local cost function (distance)
  – Usually manual parameter tuning
• Idea:
  – Use a different local cost function for each step
  – Learn these functions automatically as combinations of general features
From DTW to CDTW
• Local cost function:
  [Figure: cell (i, j) reached from its predecessors (i, j−1), (i−1, j) and (i−1, j−1), with step weights α1, α2, α3]
• Dynamic Programming:
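The recursion itself did not survive in the transcript. As a rough illustration of conventional DTW with one local distance and weighted steps, the sketch below assumes a Euclidean local distance and a hypothetical assignment of the three weights to the horizontal, vertical and diagonal steps; the function name `weighted_dtw` and the parameter `alphas` are illustrative, not taken from the slides.

```python
import math


def weighted_dtw(x, y, alphas=(1.0, 1.0, 1.0)):
    """Accumulated cost of the best alignment between sequences x and y.

    x, y: lists of equal-dimension feature vectors (tuples of floats).
    alphas: assumed weights for the steps from (i, j-1), (i-1, j) and
            (i-1, j-1); the mapping of weights to steps is an assumption.
    """
    a_h, a_v, a_d = alphas
    n, m = len(x), len(y)
    D = [[math.inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(x[i], y[j])  # single local cost function
            if i == 0 and j == 0:
                D[i][j] = d
                continue
            best = math.inf
            if j > 0:
                best = min(best, D[i][j - 1] + a_h * d)
            if i > 0:
                best = min(best, D[i - 1][j] + a_v * d)
            if i > 0 and j > 0:
                best = min(best, D[i - 1][j - 1] + a_d * d)
            D[i][j] = best
    return D[n - 1][m - 1]
```

The weights in `alphas` are exactly the kind of parameter that is usually tuned by hand in plain DTW, which is the limitation CDTW aims to remove.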
From DTW to CDTW
• Local step function:
  [Figure: cell (i, j) reached from (i, j−1), (i−1, j) and (i−1, j−1), each step with its own function s1(i, j), s2(i, j), s3(i, j)]
• Dynamic Programming:
Softmax?
+ Differentiable
  – Allows for an optimization of the […]
+ Combines several alignment paths
  – More robust to local changes
- Only gives a score (not the optimal path)
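A minimal sketch of how a soft maximum can replace the hard min/max in the recursion, assuming the accumulated quantity is a score to be maximized and that the soft maximum is the log-sum-exp of the three step candidates. The callable `step_scores(k, i, j)` stands in for the step functions s1, s2, s3 and is a hypothetical interface, not the authors' implementation.

```python
import math


def logsumexp(values):
    """Numerically stable soft maximum log(sum(exp(v)))."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def cdtw_score(step_scores, n, m):
    """Soft-max accumulated score S(n-1, m-1) over an n x m alignment grid.

    step_scores(k, i, j) returns the score of entering cell (i, j) with
    step k: 0 from (i, j-1), 1 from (i-1, j), 2 from (i-1, j-1).
    This step numbering is an assumption made for the sketch.
    """
    S = [[-math.inf] * m for _ in range(n)]
    S[0][0] = 0.0
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if j > 0:
                candidates.append(S[i][j - 1] + step_scores(0, i, j))
            if i > 0:
                candidates.append(S[i - 1][j] + step_scores(1, i, j))
            if i > 0 and j > 0:
                candidates.append(S[i - 1][j - 1] + step_scores(2, i, j))
            # Every path contributes, so no single best path is recovered.
            S[i][j] = logsumexp(candidates)
    return S[n - 1][m - 1]
```

Because every operation is smooth, the final score is differentiable with respect to whatever parameters define `step_scores`. With all step scores equal to zero, the result is the logarithm of the number of alignment paths, which shows that all paths are combined rather than only the best one.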
Features
• Acoustic descriptors: MFCC++ (D = 36)
  – HTK, 25 ms, CMN, global normalization
• Features (k = 1…D):
  – Local distance
  – "Local self-similarity"
  – Distance / product of the self-similarities
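The feature formulas are not preserved in the transcript, so the following is only one possible reading of the three feature types, computed per descriptor dimension k. It assumes an absolute difference as the local distance and the difference between consecutive frames of the same sequence as "local self-similarity"; both definitions, the `eps` term and the name `local_features` are assumptions for illustration.

```python
def local_features(x, y, i, j, eps=1e-6):
    """Per-dimension features for the frame pair (x[i], y[j]).

    Returns, for each dimension k, four values:
    distance, self-similarity of x, self-similarity of y,
    and distance / (product of the two self-similarities).
    All four definitions are assumed, not taken from the slides.
    """
    feats = []
    for k in range(len(x[i])):
        dist = abs(x[i][k] - y[j][k])
        # Assumed self-similarity: change relative to the previous frame.
        self_x = abs(x[i][k] - x[i - 1][k]) if i > 0 else 0.0
        self_y = abs(y[j][k] - y[j - 1][k]) if j > 0 else 0.0
        ratio = dist / (self_x * self_y + eps)  # eps avoids division by zero
        feats.extend([dist, self_x, self_y, ratio])
    return feats
```

A step function could then be formed as a learned linear combination of such features, in line with the idea of learning the cost functions as combinations of general features.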
Decision
• Are the two sequences instances of the same word/expression?
  – Decision based on the length-normalized score S(I, J) / (I + J)
• Learning of the parameters:
  – Backpropagation (stochastic gradient descent)
  – Training data: queries/utterances of the dev set
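As a sketch of how such a decision could be trained, the code below reads the decision quantity as the length-normalized score S(I, J) / (I + J), passes it through a logistic function and performs one stochastic gradient step on a log-loss. The logistic form and the parameters `scale` and `bias` are assumptions; the actual method backpropagates into the CDTW parameters, which is not shown here.

```python
import math


def decision_prob(score, I, J, scale=1.0, bias=0.0):
    """Assumed probability that the two sequences match."""
    z = scale * score / (I + J) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(params, score, I, J, label, lr=0.1):
    """One SGD update of (scale, bias) on the log-loss for one pair.

    label is 1 for a matching pair and 0 otherwise.
    """
    scale, bias = params
    p = decision_prob(score, I, J, scale, bias)
    err = p - label            # derivative of the log-loss w.r.t. z
    normalized = score / (I + J)
    return (scale - lr * err * normalized, bias - lr * err)
```

Dividing by I + J keeps long and short sequence pairs comparable, since the accumulated score grows with the number of steps in the alignment.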
Search Procedure
Given a query and an utterance:
1) Feature extraction
2) Candidate search in the utterance
3) CDTW comparison
4) Score post-processing
Candidate Search
• Align query with entire utterance
  – CDTW with backtracking
  – "Scores" for each point
• Extract potential starts and ends
  – Peak-picking of scores
• Filter by duration
  – Only allow warping factors < 2
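The last two steps can be sketched as follows, assuming per-frame start and end scores are already available from the alignment. The peak definition (higher than the left neighbour, not lower than the right one), the reading of "warping factor" as the ratio between candidate and query durations, and the function names are assumptions for illustration.

```python
def pick_peaks(scores):
    """Indices that are local maxima of a list of scores."""
    peaks = []
    for t, s in enumerate(scores):
        left = scores[t - 1] if t > 0 else float("-inf")
        right = scores[t + 1] if t + 1 < len(scores) else float("-inf")
        if s > left and s >= right:
            peaks.append(t)
    return peaks


def candidate_segments(start_scores, end_scores, query_len, max_warp=2.0):
    """Pair start and end peaks whose duration is compatible with the query."""
    candidates = []
    for s in pick_peaks(start_scores):
        for e in pick_peaks(end_scores):
            duration = e - s + 1
            if duration <= 0:
                continue
            # Warping factor: how much the segment stretches or shrinks
            # relative to the query, whichever direction is larger.
            warp = max(duration / query_len, query_len / duration)
            if warp < max_warp:
                candidates.append((s, e))
    return candidates
```

For a query of 4 frames, only segments longer than 2 and shorter than 8 frames pass the filter, so the more expensive CDTW comparison runs on a small set of candidates.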
CDTW Score Post-Processing
• Same decision function as for learning
  – Many false positives
  – Bias toward some queries
• Heuristic post-processing:
  – For each query, subtract a specific threshold
  – Threshold: 90th percentile of the CDTW scores for that query
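A minimal sketch of the per-query threshold subtraction, assuming a nearest-rank percentile (one of several common definitions; the slides do not say which is used) and a hypothetical dictionary `scores_by_query` mapping each query to its list of CDTW scores.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def normalize_scores(scores_by_query, q=90):
    """Subtract each query's own q-th percentile from its scores."""
    normalized = {}
    for query, scores in scores_by_query.items():
        threshold = percentile(scores, q)
        normalized[query] = [s - threshold for s in scores]
    return normalized
```

After this step a score above zero means "in the top 10% for this query", which removes the bias of queries whose raw scores are uniformly high or low.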
Results
run        devQ-devC   evalQ-devC   devQ-evalC   evalQ-evalC
P(miss)    55.6%       59.5%        60.2%        54.5%
P(FA)      1.18%       1.13%        1.17%        1.13%
ATWV       0.263       0.333        0.164        0.290

• Great improvement over naive DTW
  – ATWV = 0.065 on devQ-devC
• ATWV scores depend on the run
Results
• DET curves similar
• CDTW seems to generalize well
• Decision function has to be improved
Conclusion
• CDTW: promising results
  – Data-based approach with satisfactory results
  – Significantly outperforms (naive) DTW
  – Good generalization
• Future work:
  – Decision function
  – Acoustic descriptors
  – Integrate "hard" path constraints into search