Phoneme Similarity Matrices to Improve Long Audio Alignment for Automatic Subtitling
Transcript of the presentation
Slide 1
Phoneme Similarity Matrices to Improve Long Audio Alignment for Automatic Subtitling
Pablo Ruiz, Aitor Álvarez, Haritz Arzelus {pruiz,aalvarez,harzelus}@vicomtech.org
Vicomtech-IK4, Donostia/San Sebastián, Spain www.vicomtech.org
LREC, Reykjavik 28 May 2014
Slide 2
Introduction
THE NEED
Broadcasters face a huge subtitling demand for accessibility compliance, which makes automatic subtitling an attractive option.
TALK OUTLINE
• Alignment system for automatic subtitling
  – Phone decoder
  – Grapheme-to-Phoneme transcriber
  – Alignment algorithm
• Alignment results applied to subtitling
Slide 3
Alignment approach
• Long Audio Alignment: aligns the audio signal with a human-made transcript of that audio.
• System by Bordel et al. (2012):
  – Alignment with Hirschberg’s algorithm.
  – Simple system, but accuracy comparable to more complex approaches (cf. Moreno et al. 1999).
  – Uses a BINARY scoring matrix to evaluate alignment operations.
• Our system: Hirschberg’s algorithm with NON-BINARY scoring matrices, which improve over the binary baseline.
Slide 4
Speech-Text Alignment System
[Diagram] Pipeline: the Phone Decoder (from the audio) and the G2P transcriber (from the transcript) each output phone sequences; the decoder’s phones carry time-codes (TC). Hirschberg’s Algorithm, using the Similarity Matrices, aligns the two sequences into Aligned Phones + TC, which yield the Aligned Words and Subtitles.
Slide 5
Phone Decoder
Acoustic Model (AM): HTK – Monophone (18 MFCC + Δ + ΔΔ)
Bigram Phoneme Language Model (LM)
Sources for Corpora
• Spanish: Albayzin, Multext, SAVAS (AM), newspapers (LM)
• English: TIMIT (AM), newspapers (LM)
Language  | Train            | Test | PER
Spanish   | ~15H (LM: 45 M)  | ~5H  | 40.65%
English   | ~4H (LM: 369 M)  | ~1H  | 35.52%
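As an illustration of how a bigram phoneme language model can be estimated from a corpus, here is a minimal sketch. The slides do not specify the toolkit or smoothing method used; add-one smoothing and the function name below are assumptions for illustration only.

```python
# Minimal bigram phoneme LM sketch with add-one (Laplace) smoothing.
# Illustrative only: the actual LM toolkit and smoothing are not
# specified in the slides.

from collections import Counter

def train_bigram_lm(sequences):
    """Estimate P(next_phone | phone) from phoneme sequences."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            unigrams[a] += 1       # count of 'a' as a context
            bigrams[(a, b)] += 1   # count of 'a' followed by 'b'
    V = len(vocab)
    def prob(a, b):
        # Add-one smoothing so unseen bigrams keep non-zero probability
        return (bigrams[(a, b)] + 1) / (unigrams[a] + V)
    return prob

# Tiny toy corpus of phone sequences
corpus = [["k", "a", "s", "a"], ["k", "o", "s", "a"]]
p = train_bigram_lm(corpus)
print(p("k", "a"))  # P(a | k) = (1+1)/(2+6) = 0.25
```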
Slide 6
Grapheme-to-Phoneme (G2P) Transcribers
• Spanish: rule-based
• English: inferred from CMU Dictionary using Phonetisaurus (WFST)
• Phone-sets and details: https://sites.google.com/site/similaritymatrices
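To illustrate what a rule-based G2P looks like, here is a minimal sketch for Spanish. The rules and single-character phone symbols below are simplified assumptions, not the system’s actual rule set or phone-set (those are on the project site).

```python
# Illustrative rule-based Spanish G2P sketch. Rules are ordered
# context-sensitive rewrites; output phone symbols are assumed to be
# single characters so the result can be split character-wise.

import re

RULES = [
    (r"ch", "C"),            # affricate
    (r"ll", "L"),
    (r"qu", "k"),
    (r"c(?=[eiéí])", "T"),   # 'c' before front vowels
    (r"g(?=[eiéí])", "x"),   # 'g' before front vowels
    (r"ñ", "J"),
    (r"j", "x"),
    (r"h", ""),              # silent 'h'
    (r"v", "b"),
    (r"z", "T"),
    (r"c", "k"),             # remaining 'c' is /k/
]

def g2p_es(word: str) -> list[str]:
    s = word.lower()
    for pattern, phone in RULES:
        s = re.sub(pattern, phone, s)
    return list(s)  # remaining letters map one-to-one

print(g2p_es("coche"))  # → ['k', 'o', 'C', 'e']
```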
Slide 7
Alignment Algorithm
• Hirschberg’s algorithm for sequence alignment
  – Dynamic programming; a linear-space optimization of Needleman-Wunsch, applicable to longer sequences.
  – Four operations:
    • Insertions
    • Deletions
    • Substitutions
    • Matches
  – Each operation is evaluated with a scoring function (binary vs. non-binary).
• Goal: promote aligning identical or similar (easily confusable) phonemes; prevent aligning dissimilar phonemes.
[Figure: example scoring matrices — a BINARY matrix (gap penalty vs. similarity score) next to a NON-BINARY matrix.]
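The alignment idea above can be sketched as follows. For clarity this shows the quadratic-space Needleman-Wunsch recurrence with a pluggable scoring function; Hirschberg’s algorithm computes the same optimal alignment in linear space by divide and conquer. The scores are illustrative, not the system’s matrices.

```python
# Global sequence alignment (Needleman-Wunsch) with a pluggable,
# possibly non-binary scoring function. Illustrative sketch only.

def binary_score(p, q):
    """Binary scoring: reward matches, penalize substitutions."""
    return 1.0 if p == q else -1.0

def align(a, b, score, gap=-1.0):
    """Return (best score, aligned pairs); '-' marks a gap."""
    n, m = len(a), len(b)
    # S[i][j] = best score aligning prefixes a[:i] and b[:j]
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i-1][j-1] + score(a[i-1], b[j-1]),  # sub/match
                          S[i-1][j] + gap,                       # deletion
                          S[i][j-1] + gap)                       # insertion
    # Trace back from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i-1][j-1] + score(a[i-1], b[j-1]):
            pairs.append((a[i-1], b[j-1])); i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i-1][j] + gap:
            pairs.append((a[i-1], "-")); i -= 1
        else:
            pairs.append(("-", b[j-1])); j -= 1
    return S[n][m], pairs[::-1]

score, pairs = align(list("kasa"), list("kaza"), binary_score)
print(score, pairs)  # 2.0 [('k','k'), ('a','a'), ('s','z'), ('a','a')]
```

Swapping `binary_score` for a continuous phoneme-similarity function is all that changes between the binary baseline and the non-binary system.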
Slide 8
Speech-Text Alignment (recap)
[Diagram, repeated] Pipeline: the Phone Decoder (from the audio) and the G2P transcriber (from the transcript) each output phone sequences; the decoder’s phones carry time-codes (TC). Hirschberg’s Algorithm, using the Similarity Matrices, aligns the two sequences into Aligned Phones + TC, which yield the Aligned Words and Subtitles.
Slide 9
Similarity Function: Input Features and Values
• Phonological similarity: multivalued features weighted by salience (i.e. their impact on similarity).
[Tables omitted: feature values (English) and feature saliences.]
Slide 10
Similarity Function: Definition (cf. Kondrak 2002)
σ_sub(p, q) = (C_sub − δ(p, q) − V(p) − V(q)) / 100          (1)

  where the vowel penalty V(p) + V(q) is 0 if p = q;         (2)
  otherwise V(x) = 0 if x is a consonant, C_vwl otherwise

δ(p, q) = Σ_{f ∈ R} |diff(p, q, f)| × salience(f)            (3)

σ_skip(p) = |C_skip| / 100                                   (4)

Constants: C_sub = 3500, C_vwl = 1000, C_skip = 1000         (5)

σ_sub: score for substituting phoneme p with q
σ_skip: cost of skipping a phoneme
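A worked sketch of the substitution score σ_sub defined above (cf. Kondrak 2002). The feature set, diff values, and saliences below are illustrative placeholders, not the actual tables, which are on the project site.

```python
# Worked sketch of sigma_sub and sigma_skip from the slide. Feature
# values and saliences are made-up placeholders for illustration.

C_SUB, C_VWL, C_SKIP = 3500, 1000, 1000

# Illustrative multivalued features: feature -> {phone: value in [0, 1]}
FEATURES = {
    "place":  {"p": 1.0, "t": 0.85, "k": 0.7},
    "manner": {"p": 1.0, "t": 1.0, "k": 1.0},  # all stops
}
SALIENCE = {"place": 40, "manner": 50}
CONSONANTS = {"p", "t", "k"}

def V(x):
    """Vowel penalty term: 0 for consonants, C_vwl for vowels."""
    return 0 if x in CONSONANTS else C_VWL

def delta(p, q):
    """Weighted feature distance between phones p and q."""
    return sum(abs(FEATURES[f][p] - FEATURES[f][q]) * SALIENCE[f]
               for f in FEATURES)

def sigma_sub(p, q):
    """Score for substituting phoneme p with q."""
    vowel_penalty = 0 if p == q else V(p) + V(q)
    return (C_SUB - delta(p, q) - vowel_penalty) / 100

def sigma_skip(p):
    """Cost of skipping a phoneme (insertion/deletion)."""
    return abs(C_SKIP) / 100

print(sigma_sub("p", "t"))  # similar stops -> high score, near the max
```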
Slide 14
Evaluation: Method
• Alignment accuracy at word and subtitle level, compared to manual subtitles created by professional subtitlers.
Alignment Test Corpora
           | Spanish      | English
Words      | 8,774        | 4,732
Subtitles  | 1,249        | 471
Content    | Clean speech | Noisy speech
Note: some reference subtitles were missing.
Slide 15
Evaluation: Results at Word Level
Spanish – Word Level
seconds                  0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary baseline          14.17  57.71  72.65  76.21  79.02
Phonological similarity  23.01  82.21  93.50  95.50  96.83
Gain¹                    +8.8   +24    +20    +19    +18

English – Word Level
seconds                  0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary baseline          0.28   4.81   19.04  29.69  43.24
Phonological similarity  1.93   23.76  48.90  61.58  72.72
Gain¹                    +1.6   +19    +30    +32    +29

Percentage of words aligned within a given deviation range from the reference.
¹ Gain in percentage points.
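The evaluation metric above can be computed as follows: for each threshold, the percentage of words whose time-code deviation from the reference stays within that threshold. The deviation values and function name below are illustrative assumptions.

```python
# Sketch of the word-level evaluation metric: percentage of items
# whose |deviation| from the reference is within each threshold.
# Deviation values below are made-up examples.

def deviation_buckets(deviations, thresholds=(0.0, 0.1, 0.5, 1.0, 2.0)):
    """Return {threshold: % of items with |deviation| <= threshold}."""
    n = len(deviations)
    return {t: 100.0 * sum(abs(d) <= t for d in deviations) / n
            for t in thresholds}

# Example deviations in seconds between system and reference time-codes
devs = [0.0, 0.05, 0.3, 0.7, 1.5, 2.5]
print(deviation_buckets(devs))
```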
Slide 16
Evaluation: Results at Subtitle Level
Spanish – Subtitle Level
seconds                  0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary baseline          10.57  45.08  73.26  95.12  100
Phonological similarity  18.01  66.45  87.83  98.80  100
Gain¹                    +7.4   +21    +14    +3.5   =

English – Subtitle Level
seconds                  0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary baseline          0.21   4.25   37.15  84.29  100
Phonological similarity  0.42   8.28   46.07  86.84  100
Gain¹                    +0.2   +4     +9     +2.5   =

Percentage of subtitles aligned within a given deviation range from the reference.
¹ Gain in percentage points; "=" marks no change.
Slide 17
Conclusions and Further Work
• Continuous scoring matrices based on phonological similarity improve alignment compared to a binary matrix.
• Further work: Other similarity criteria can also improve results.
English – Subtitle Level
seconds                0     ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary                 0.21  4.03   36.94  84.08  100
Phonological sim.      0.42  8.28   43.10  85.99  100
Decoder-error based    0.42  11.46  48.83  88.32  100
Gain¹ (phonological)   +0.2  +4     +6     +2     =
Gain¹ (decoder-error)  +0.2  +7     +12    +4     =

¹ Gain in percentage points over the binary baseline; "=" marks no change.

See also: ICASSP (May 2014), Text, Speech and Dialogue (Sept 2014).
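One simple way a decoder-error-based criterion could be derived is from the decoder’s phone confusion statistics: phone pairs the decoder confuses often get higher substitution scores. This is a hedged sketch; the counts, normalization, and function name below are assumptions, not the system’s actual method.

```python
# Sketch of a decoder-error-based similarity score built from
# (reference, hypothesis) phone pairs. Illustrative assumptions only.

from collections import Counter

def confusion_similarity(ref_hyp_pairs):
    """Build P(hyp_phone | ref_phone) from aligned (ref, hyp) pairs."""
    pair_counts, ref_counts = Counter(), Counter()
    for ref, hyp in ref_hyp_pairs:
        pair_counts[(ref, hyp)] += 1
        ref_counts[ref] += 1
    def score(p, q):
        # How often the decoder outputs q when the reference phone is p
        if ref_counts[p] == 0:
            return 0.0
        return pair_counts[(p, q)] / ref_counts[p]
    return score

# Toy decoding errors: 's' is often confused with 'z'
pairs = [("s", "s"), ("s", "z"), ("s", "z"), ("t", "t")]
sim = confusion_similarity(pairs)
print(sim("s", "z"))  # P(z | s) = 2/3
```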
Slide 18
Thank you
https://sites.google.com/site/similaritymatrices