Phoneme Similarity Matrices to Improve Long Audio Alignment for Automatic Subtitling


Transcript of Phoneme Similarity Matrices to Improve Long Audio Alignment for Automatic Subtitling

1

Phoneme Similarity Matrices to Improve Long Audio Alignment for Automatic Subtitling

Pablo Ruiz, Aitor Álvarez, Haritz Arzelus {pruiz,aalvarez,harzelus}@vicomtech.org

Vicomtech-IK4, Donostia/San Sebastián, Spain www.vicomtech.org

LREC, Reykjavik 28 May 2014

2

Introduction

THE NEED

Broadcasters face a huge subtitling demand driven by accessibility compliance: automatic subtitling is an attractive option.

TALK OUTLINE

• Alignment system for automatic subtitling
  – Phone decoder
  – Grapheme-to-phoneme transcriber
  – Alignment algorithm
• Alignment results applied to subtitling

3

Alignment approach

• Long audio alignment: aligns the audio signal with a human transcript of the audio.
• System by Bordel et al. (2012):
  – Alignment with Hirschberg’s algorithm.
  – Simple system, but accuracy comparable to more complex approaches (cf. Moreno et al. 1999).
  – Uses a BINARY scoring matrix to evaluate alignment operations.
• Our system: Hirschberg’s algorithm with NON-BINARY scoring matrices, improving over the binary baseline.

4

Speech-Text Alignment System

[System diagram: the Phone Decoder produces phones with time-codes (TC) from the audio; the G2P transcriber produces phones from the transcript; Hirschberg’s Algorithm, using the Similarity Matrices, aligns the two phone sequences, yielding aligned phones + TC and, from these, aligned words and subtitles.]

5

Phone Decoder

Acoustic Model (AM): HTK – Monophone (18 MFCC + Δ + ΔΔ)

Bigram Phoneme Language Model (LM)

Sources for Corpora

• Spanish: Albayzin, Multext, SAVAS (AM), newspapers (LM)

• English: TIMIT (AM), newspapers (LM)

Language   Train              Test   PER
Spanish    ~15 h (LM: 45 M)   ~5 h   40.65%
English    ~4 h (LM: 369 M)   ~1 h   35.52%
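The phoneme error rates (PER) above are, by standard practice, the Levenshtein edit distance between the decoded and reference phone sequences, normalized by the reference length. A minimal sketch (not the authors' code) of how such figures are obtained:

```python
# Sketch: PER as Levenshtein distance over phone sequences, normalized
# by reference length. Not the authors' scoring code; standard practice.

def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def per(ref, hyp):
    """Phoneme error rate: edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(per(list("patata"), list("batat")))  # 2 edits over 6 phones, ~0.33
```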

6

Grapheme-to-Phoneme (G2P) Transcribers

• Spanish: rule-based

• English: inferred from CMU Dictionary using Phonetisaurus (WFST)

• Phone-sets and details: https://sites.google.com/site/similaritymatrices
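To illustrate what a rule-based G2P transcriber looks like, here is a toy sketch with a handful of hypothetical Castilian Spanish rules and ad-hoc phone labels; the system's actual rules and phone-sets are those documented at the URL above:

```python
# Toy rule-based Spanish G2P (hypothetical rules and phone labels, for
# illustration only). Multi-letter graphemes are tried before single
# letters at each position; unmatched letters map to themselves.
RULES = [
    ("qu", ["k"]),
    ("ch", ["tS"]),
    ("ll", ["L"]),
    ("rr", ["rr"]),
    ("ce", ["T", "e"]), ("ci", ["T", "i"]),
    ("ge", ["x", "e"]), ("gi", ["x", "i"]),
    ("h",  []),               # silent in Spanish
    ("ñ",  ["J"]),
    ("v",  ["b"]),
    ("z",  ["T"]),
]

def g2p_es(word):
    """Greedy left-to-right rule application over a lowercase word."""
    phones, i = [], 0
    while i < len(word):
        for graph, ph in RULES:
            if word.startswith(graph, i):
                phones.extend(ph)
                i += len(graph)
                break
        else:
            phones.append(word[i])  # identity fallback
            i += 1
    return phones

print(g2p_es("queso"))  # → ['k', 'e', 's', 'o']
```

Real systems need many more rules (stress, syllabification, dialectal variants), which is why the English side is instead learned from the CMU Dictionary with Phonetisaurus.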

7

Alignment Algorithm

• Hirschberg’s algorithm for sequence alignment
  – Dynamic programming; a space optimization of Needleman–Wunsch, applicable to longer sequences.
  – Four operations: insertions, deletions, substitutions, matches.
  – Each operation is evaluated with a scoring function (binary vs. non-binary).
• Goal: promote aligning identical or similar (easily confusable) phonemes; prevent aligning dissimilar phonemes.

[Figure: binary vs. non-binary scoring matrices, showing the gap penalty and similarity scores.]
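To make the scoring-function slot concrete, here is a sketch (assumed code, not the authors') of global sequence alignment with a pluggable score. Needleman–Wunsch with full-table traceback is shown for brevity; Hirschberg's algorithm computes the same optimal alignment in linear space by divide and conquer:

```python
# Sketch of global alignment with a pluggable scoring function.
# `score(p, q)` evaluates substitutions/matches; `gap` penalizes
# insertions and deletions (skips).

def align(a, b, score, gap):
    """Return (best score, list of aligned pairs, '-' marking gaps)."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,   # deletion
                           dp[i][j - 1] + gap)   # insertion
    # Traceback from the bottom-right corner.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1])):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((a[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", b[j - 1])); j -= 1
    return dp[m][n], pairs[::-1]

# Binary scoring: 1 for a match, 0 for any mismatch.
binary = lambda p, q: 1.0 if p == q else 0.0
```

Swapping `binary` for a graded phoneme-similarity score is exactly the change this talk proposes.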

8

Speech-Text Alignment (recap)

[System diagram repeated from slide 4: Phone Decoder and G2P feed phones and time-codes (TC) into Hirschberg’s Algorithm with the Similarity Matrices, yielding aligned phones + TC and then aligned words and subtitles.]

9

Similarity Function: Input Features and Values

• Phonological similarity: multivalued features weighted by salience (i.e., their impact on similarity).

[Tables: feature values (English) and feature saliences.]

10

Similarity Function: Definition (cf. Kondrak 2002)

σsub(p, q) = (Csub − δ(p, q) − V(p) − V(q)) / 100

    where V(p) = V(q) = 0 if p = q;
    otherwise V(x) = 0 if x is a consonant, Cvwl if x is a vowel

δ(p, q) = Σ f∈R |diff(p, q, f)| × salience(f)

σskip(p) = |Cskip / 100|

Constants: Csub = 3500, Cvwl = 1000, Cskip = 1000

σsub: score for substituting phoneme p with q
σskip: cost of skipping a phoneme
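The definition above can be sketched directly in code. The feature values and saliences below are toy placeholders (the real tables are on the project site); the constants are those from the slide:

```python
# Sketch of the Kondrak-style similarity scores, with TOY feature
# values and saliences (placeholders, not the system's real tables).
C_SUB, C_VWL, C_SKIP = 3500, 1000, 1000

FEATURES = {  # phone -> {feature: multivalued value in [0, 1]}
    "p": {"place": 1.0, "manner": 1.0, "voice": 0.0},
    "b": {"place": 1.0, "manner": 1.0, "voice": 1.0},
    "a": {"place": 0.2, "manner": 0.1, "voice": 1.0},
}
SALIENCE = {"place": 40, "manner": 50, "voice": 10}
VOWELS = {"a"}

def delta(p, q):
    """δ(p, q): salience-weighted sum of absolute feature differences."""
    return sum(abs(FEATURES[p][f] - FEATURES[q][f]) * SALIENCE[f]
               for f in SALIENCE)

def V(x):
    """Vowel penalty: 0 for consonants, C_vwl for vowels."""
    return C_VWL if x in VOWELS else 0

def sigma_sub(p, q):
    """σsub(p, q): similarity score for aligning phones p and q."""
    v = 0 if p == q else V(p) + V(q)
    return (C_SUB - delta(p, q) - v) / 100

def sigma_skip():
    """σskip: cost of skipping a phoneme."""
    return abs(C_SKIP / 100)

print(sigma_sub("p", "b"))  # similar consonants score higher than p vs. a
```

With these toy numbers, p/b (differing only in voicing) score close to the maximum 35, while p/a is heavily penalized by the vowel terms — the graded behavior the binary matrix cannot express.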

14

Evaluation: Method

• Alignment accuracy at word level and subtitle level compared to manual subtitles created by professional subtitlers

Alignment test corpora

            SPANISH        ENGLISH
Words       8,774          4,732
Subtitles   1,249          471
Content     Clean speech   Noisy speech

Some reference subtitles are missing.
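A sketch (assumed, not the authors' scoring script) of how the threshold-based accuracy figures reported on the following slides can be computed from reference and hypothesis time-codes:

```python
# Sketch: percentage of items whose time-code deviation from the
# reference falls within each threshold (the deviation ranges used
# in the evaluation slides).

def within_thresholds(ref_times, hyp_times,
                      thresholds=(0.0, 0.1, 0.5, 1.0, 2.0)):
    """Map each threshold (seconds) to the % of items with |hyp - ref| <= it."""
    devs = [abs(h - r) for r, h in zip(ref_times, hyp_times)]
    n = len(devs)
    return {t: 100.0 * sum(d <= t for d in devs) / n for t in thresholds}

# e.g. within_thresholds([0.0, 1.0, 2.0, 3.0], [0.0, 1.05, 2.6, 4.5])
# → {0.0: 25.0, 0.1: 50.0, 0.5: 50.0, 1.0: 75.0, 2.0: 100.0}
```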

15

Evaluation: Results at Word Level

Spanish – Word Level

seconds                   0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary baseline           14.17  57.71  72.65  76.21  79.02
Phonological similarity   23.01  82.21  93.50  95.50  96.83
Gain¹                     +8     +24    +20    +19    +18

English – Word Level

seconds                   0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary baseline           0.28   4.81   19.04  29.69  43.24
Phonological similarity   1.93   23.76  48.90  61.58  72.72
Gain¹                     +1.6   +19    +30    +32    +29

Percentage of words aligned within a given deviation range from the reference.
¹Gain in percentage points over the binary baseline.

16

Evaluation: Results at Subtitle Level

Spanish – Subtitle Level

seconds                   0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary baseline           10.57  45.08  73.26  95.12  100
Phonological similarity   18.01  66.45  87.83  98.80  100
Gain¹                     +7     +21    +14    +3.5   =

English – Subtitle Level

seconds                   0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary baseline           0.21   4.25   37.15  84.29  100
Phonological similarity   0.42   8.28   46.07  86.84  100
Gain¹                     +0.2   +4     +9     +2.5   =

Percentage of subtitles aligned within a given deviation range from the reference.
¹Gain in percentage points over the binary baseline.

17

Conclusions and Further Work

• Continuous scoring matrices based on phonological similarity improve alignment compared to a binary matrix.

• Further work: Other similarity criteria can also improve results.

English – Subtitle Level

seconds                   0      ≤0.1   ≤0.5   ≤1.0   ≤2.0
Binary                    0.21   4.03   36.94  84.08  100
Phonological similarity   0.42   8.28   43.10  85.99  100
  Gain¹                   +0.2   +4     +6     +2     =
Decoder-error based       0.42   11.46  48.83  88.32  100
  Gain¹                   +0.2   +7     +12    +4     =

¹Gain in percentage points over the binary baseline.

ICASSP (May 2014), Text, Speech and Dialogue (Sept 2014)

18

Thank you

https://sites.google.com/site/similaritymatrices