MIL Speech Seminar TRACHEOESOPHAGEAL SPEECH REPAIR Arantza del Pozo CUED Machine Intelligence...

MIL Speech Seminar

TRACHEOESOPHAGEAL SPEECH REPAIR

Arantza del Pozo, CUED Machine Intelligence Laboratory

November 20th 2006


OUTLINE

Speech repair

Tracheoesophageal (TE) speech
  Laryngectomy
  Acoustic properties
  Main limitations

Excitation repair
  Previous attempts
  Adopted approach
  Baseline system
  Enhanced system
  Results

Duration repair
  Preliminary experiments
  Regression tree modelling
  Improving TE recognition
  Fixing recognition artifacts
  Results

Conclusions and future work


SPEECH REPAIR

SPEECH REPAIR SYSTEM

  Speech model
  Deviant features
  Correction algorithms


TRACHEOESOPHAGEAL (TE) SPEECH

Laryngectomy

Laryngectomy is a surgical procedure which involves the removal of the larynx, i.e. vocal cords, epiglottis and tracheal rings

Speech rehabilitation after laryngectomy
  Esophageal speech
  TE speech
  Electrolaryngeal speech

TE speech is the most frequently used voice restoration technique after laryngectomy


Acoustic properties of TE speech

Voicing source highly variable and deviant
  Lower F0 (female) and higher jitter and shimmer
  Higher high-frequency noise and lower harmonic-to-noise ratio (HNR), glottal-to-noise excitation ratio (GNE) and band-energy difference (BED)

Some evidence of higher formant values in Spanish and Dutch TE speech

Shorter maximum phonation time, longer vowel duration and slower speaking rates
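
The jitter and shimmer figures mentioned above are cycle-to-cycle perturbation measures. As a minimal sketch, assuming the common "local" definition (mean absolute difference between consecutive cycles, divided by the mean value), they could be computed as below. The function name and the example values are illustrative and not taken from the talk.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to the mean.

    Applied to pitch periods this gives local jitter; applied to
    per-cycle peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values[:-1], values[1:])]
    return mean(diffs) / mean(values)


# Hypothetical pitch periods in ms: a steady voice and an irregular one.
steady_periods = [10.0, 10.0, 10.0, 10.0]
irregular_periods = [10.0, 12.0, 9.0, 11.0]

steady_jitter = local_perturbation(steady_periods)
irregular_jitter = local_perturbation(irregular_periods)
```

Higher values indicate a more irregular contour, which is the direction reported for TE speech.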


Main limitations of TE speech

Inability to properly control:

EXCITATION
  Deviant glottal waveforms
  Irregular pitch and amplitude contours
  Higher turbulence noise
  Spectral envelope deviations caused by coupling

DURATION
  Deviations caused by the disconnection between the lungs and the vocal tract
  More pauses
  Longer vowels
  Slower rates
  Rushes before breaks


EXCITATION REPAIR

Previous excitation repair attempts

Previous work
  Qi et al.: resynthesis of female TE words with a synthetic glottal waveform and with smoothed and raised F0
  Replacement of voice source and conversion of spectral envelopes

Limitations of previous repair attempts
  Only the most obvious deviant features have been tackled
  Evaluation limited to sustained vowels and words
  Only a small number of TE speakers and qualities have been tested
  Degree of perceptual enhancement has not been quantified


Adopted approach

DATA
  13 TE speakers (11 male, 2 female), patients of the Speech and Language Therapy Department of Addenbrooke's Hospital, Cambridge
  Control group of 11 normal speakers (8 male, 3 female)

BASELINE SYSTEM
  Glottal resynthesis
  Jitter and shimmer reduction

ENHANCED SYSTEM
  Spectral envelope smoothing and tilt reduction

Feature correction

Perceptual evaluation

DEVIANT FEATURES
  Voice source
  Jitter and shimmer
  Spectral envelope


Baseline system

Glottal resynthesis: breathiness reduction

Jitter and shimmer reduction: roughness reduction

[Diagram labels: vocal tract (VT), lip radiation]
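
The slide does not say how jitter and shimmer are reduced. One simple way to illustrate the idea, offered here only as an assumption, is to pull each pitch period (or cycle amplitude) towards a local moving average so that cycle-to-cycle deviations shrink. The function name and the `width` and `keep` parameters are hypothetical.

```python
from statistics import mean


def reduce_perturbation(values, width=5, keep=0.2):
    """Shrink cycle-to-cycle deviations around a moving-average trend.

    values: per-cycle pitch periods (jitter) or amplitudes (shimmer).
    width:  number of cycles in the averaging window (truncated at the edges).
    keep:   fraction of the original deviation retained (0 = fully smoothed,
            1 = unchanged).
    """
    half = width // 2
    smoothed = []
    for t, value in enumerate(values):
        window = values[max(0, t - half): t + half + 1]
        trend = mean(window)
        smoothed.append(trend + keep * (value - trend))
    return smoothed


# Hypothetical irregular pitch periods in ms.
periods = [10.0, 12.0, 9.0, 11.0]
repaired = reduce_perturbation(periods, width=3, keep=0.0)
```

Keeping a small non-zero `keep` leaves some natural variation in the contour instead of producing a perfectly regular excitation.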


Enhanced system (1/2)

Resynthesised speech still has a harsh quality caused by deviations in TE spectral envelopes (SE)

Spectral envelope analysis
  Higher std of formant gains, frequencies and bandwidths, and higher spectral distortion
  Lower relative gain difference between 1st and 3rd formants, and lower spectral tilt


Enhanced system (2/2)

Enhancement algorithm
  To reduce differences between estimated consecutive SE: LSF median smoothing
  To decrease spectral tilt: low-pass filtering
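
A minimal sketch of LSF median smoothing across frames, assuming each analysis frame is represented by a vector of line spectral frequencies (LSFs) and that each coefficient track is filtered independently over time. The window width and the example values are assumptions, not parameters reported in the talk.

```python
from statistics import median


def median_smooth_lsf(frames, width=5):
    """Median-filter each LSF coefficient track along the time axis.

    frames: list of LSF vectors (lists of floats), one per analysis frame.
    width:  window length in frames; windows are truncated at the edges.
    """
    half = width // 2
    n_frames = len(frames)
    order = len(frames[0])
    smoothed = []
    for t in range(n_frames):
        window = frames[max(0, t - half): min(n_frames, t + half + 1)]
        smoothed.append([median(f[k] for f in window) for k in range(order)])
    return smoothed


# Hypothetical 2nd-order LSF tracks with one outlier frame in the middle.
lsf_frames = [[0.1, 0.5], [0.1, 0.5], [0.3, 0.9], [0.1, 0.5], [0.1, 0.5]]
smoothed_frames = median_smooth_lsf(lsf_frames, width=3)
```

Because the median is taken per coefficient over the same set of frames, the ascending order of the LSFs within each frame is preserved, so the smoothed vectors still describe a stable filter.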


Results

Perceptual tests (conditions compared: original, baseline, enhanced)

  Judgement               Preference split
  "more breathy"          82.69%   17.31%
  "harsher"               73.72%   26.28%
  "more normal speaker"   58.33%   41.67%
                          38.78%   61.22%


DURATION REPAIR

Preliminary experiments

Duration deviations
  More pauses
  Longer vowels
  Slower rates
  Rushes before breaks

Possible duration repair approaches

Rule-based
  Reduce pauses, reduce vowels, increase speech rate, increase duration of phones before breaks, etc.
  Difficult to obtain adequate reduction/increase rates
  Breaks sentence rhythm

Transplantation of average normal phone durations
  Phone durations obtained with Forced Alignment (FA)
  Overall improvement which increased naturalness of TE sentences
  Sentence rhythm was preserved

The duration repair algorithm is an automation of the transplantation experiment
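
The transplantation experiment can be sketched as follows: each force-aligned TE phone is assigned the average duration of the same phone in normal speech, which yields a per-phone time-scaling factor. This is a minimal sketch under stated assumptions. The phone labels, mean durations and function name are hypothetical, and the time-scale modification that would apply the factors to the waveform is not shown.

```python
def transplant_durations(segments, normal_mean_ms):
    """Plan a duration transplantation for one sentence.

    segments:       list of (phone, start_ms, end_ms) from forced alignment.
    normal_mean_ms: dict mapping phone label to its mean duration in normal speech.
    Returns a list of (phone, original_ms, target_ms, scale) tuples.
    """
    plan = []
    for phone, start, end in segments:
        original = end - start
        if original <= 0:
            continue  # skip degenerate segments
        # Phones without a normal-speech mean keep their original duration.
        target = normal_mean_ms.get(phone, original)
        plan.append((phone, original, target, target / original))
    return plan


# Hypothetical mean durations of normal speakers and a TE alignment.
normal_means = {"k": 60.0, "ae": 90.0, "t": 50.0}
te_alignment = [("k", 0.0, 80.0), ("ae", 80.0, 260.0), ("t", 260.0, 310.0)]
plan = transplant_durations(te_alignment, normal_means)
```

In this example the long TE vowel is halved while the final stop is left unchanged, so the relative rhythm of the sentence follows the normal-speech averages.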


Regression tree modelling (1/2)

Classification and regression trees (CART) are widely used for duration modelling in TTS systems

Employed features are extracted from text
  Phone identity
  Identities of previous and next phones
  Position of syllable in word
  Position of word in sentence
  Number of syllables before/after a break
  Type of lexical stress
  Lexical stress type of previous and next syllables
  ...

A speech repair framework constrains the possible feature space to recognisable features
  For TE speech repair, it is assumed that only phone recognition is viable
  Features relying on word, syllable or lexical stress information cannot be used


Regression tree modelling (2/2)

Several CART trees were built with different features

Explored features
  Phone identity
  Identities of previous and next phones
  Positions of phones in the sentence
  Pitch and energy (as an attempt to incorporate some stress info)
  Short pauses (SP) not regarded as phones, modelled independently

Trees
  T1   F1: phone identity
  T2   F2: F1 + previous & next phone identities (broad class)
  T3   F3: F2 + position of phone in sentence
  T4   F4: F3 + pitch (positive/negative/no slope)
  T5   F5: F4 + energy (positive/negative/no slope)
  TSP  number of phones since previous sp & until next sp

Performance measured as Mean Squared Error (MSE) between normal mean durations used for transplantation and predicted values
  T3 > T2 > T1 > T5 > T4
  Substitution of T3+TSP predicted durations of TE sentences with FA phone segmentation almost indistinguishable from transplantation
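
As an illustration of the feature sets above, the sketch below derives F3-style features (phone identity, broad class of neighbours, position in sentence) and the TSP pause features from a phone label sequence, and computes an MSE between predicted and target durations. The broad-class table, the normalised position and the decision to skip `sp` when looking up neighbours are assumptions made for the example. The tree training itself is not shown.

```python
# Hypothetical broad-class table; a real system would cover the full phone set.
BROAD_CLASS = {"k": "stop", "t": "stop", "s": "fricative", "ae": "vowel"}


def phone_features(labels):
    """F3-style features for every phone, with short pauses (sp) excluded."""
    phones = [p for p in labels if p != "sp"]
    last = max(1, len(phones) - 1)
    feats = []
    for i, phone in enumerate(phones):
        prev_c = BROAD_CLASS.get(phones[i - 1], "other") if i > 0 else "none"
        next_c = BROAD_CLASS.get(phones[i + 1], "other") if i + 1 < len(phones) else "none"
        feats.append({"phone": phone, "prev": prev_c, "next": next_c,
                      "position": i / last})
    return feats


def pause_features(labels):
    """TSP features: phones since the previous sp and until the next sp."""
    sp_idx = [i for i, p in enumerate(labels) if p == "sp"]
    feats = []
    for n, i in enumerate(sp_idx):
        prev_i = sp_idx[n - 1] if n > 0 else -1
        next_i = sp_idx[n + 1] if n + 1 < len(sp_idx) else len(labels)
        feats.append({"since_prev_sp": i - prev_i - 1,
                      "until_next_sp": next_i - i - 1})
    return feats


def mse(predicted, target):
    """Mean squared error between two equal-length duration lists."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


labels = ["k", "ae", "t", "sp", "s", "ae", "t"]
f3 = phone_features(labels)
tsp = pause_features(labels)
```

All of these features can be derived from a phone-level recognition output alone, which is the constraint the repair framework imposes.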


Improving TE recognition (1/2)

Little work on automatic TE speech recognition
  Haderlein et al. (2004) adapted a speech recogniser trained on normal speech to single TE speakers by unsupervised HMM interpolation and obtained an average word accuracy of 36.4%

Focus on improving TE phone recognition
  Novel performance measures which take recognition (r), segmentation (s) and duration prediction (p) errors into account

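SPC = \frac{1}{NP} \sum_{i=1}^{NP} r(i)\, s(i)

[Equation: SPE, summing over i = 1..NP and s = 1..ES and comparing forced-alignment (FA) with recognised (REC) values; garbled in source]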


Improving TE recognition (2/2)

Explored systems
  Baseline (BL): monophone HMM trained on WSJCAM0
  R1: BL + CMN + CMLLR
  R2: R1 + MAP
  R3: R1 + bigram LM
  R4: R1 + trigram LM
  R5: CUHTK 2003 BN LVCSR + CMLLR, phone-level output

Results: R5 > R4 > R3 > R1 > R2

            BL       R1       R2       R3       R4       R5
  SPC [%]   0.1634   0.3129   0.3044   0.3249   0.3340   0.5148
  SPE [ms]  39.444   29.713   30.926   27.329   26.682   14.257


Fixing recognition artifacts

Use of best recognised labels for duration repair still produced artifacts

Method for robust duration modification (RM)
  Takes recognition confidence into account, computed from
    TE phone duration probability distributions
    recogniser confidence scores
  Takes phone confusions into account in R4

d_N = \lambda\, d_P + (1 - \lambda)\, d_O

P(d)
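
The exact form of the robust modification is not legible in the transcript. One plausible reading, shown here purely as an assumption, is a confidence-weighted interpolation: when the recogniser is confident in a phone label the predicted duration is used, and when it is not the original TE duration is largely kept. The function and parameter names are hypothetical.

```python
def robust_duration(original_ms, predicted_ms, confidence):
    """Interpolate between the original and the predicted phone duration.

    confidence: value in [0, 1]; it is clipped to that range so that the
    result always lies between the two durations.
    """
    weight = min(1.0, max(0.0, confidence))
    return weight * predicted_ms + (1.0 - weight) * original_ms


# A confidently recognised phone is fully repaired, a doubtful one only slightly.
confident = robust_duration(200.0, 100.0, 1.0)
doubtful = robust_duration(200.0, 100.0, 0.25)
```

Weighting by confidence limits the damage of a misrecognised label, since a wrong phone identity would otherwise impose an unrelated target duration.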


Results

Objective evaluation: MSE between repaired sentences and target transplanted durations
  R5+RM > R5 > R4+RM > R4 > original TE durations

Subjective evaluation: perceptual test

[Figure: RANK (1-5) for O, T and R]

PREFERENCE TEST
  R4   48%
  R5   52%

  Pair     >      =      <
  T - M    0.54   0.22   0.24
  M - O    0.52   0.31   0.17
  T - O    0.66   0.20   0.14


CONCLUSIONS AND FUTURE WORK

Deviant TE excitation and duration features have been identified and repaired

Synthetic quality of excitation repaired speech nullifies results in some cases

Future work
  Improve excitation resynthesis quality
  Improve TE speech recognition step
  Attempt text-based features for duration modelling
