Learning Long-Term Temporal Features

May 4, 2004 Speech Lunch Talk

Learning Long-Term Temporal Features

A Comparative Study

Barry Chen

Log-Critical Band Energies

Log-Critical Band Energies

Conventional Feature Extraction

Log-Critical Band Energies

TRAPS/HATS Feature Extraction

What is a TRAP? (Background Tangent)

• TRAPs were originally developed by our colleagues at OGI: Sharma, Jain (now at SRI), Hermansky and Sivadas (both now at IDIAP)

• Stands for TempRAl Pattern

• TRAP = a speech energy pattern in a narrow frequency band over a period of time (usually 0.5 – 1 second long)
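As a rough illustration of the definition above, the sketch below cuts one TRAP out of a matrix of log critical band energies. The function name `extract_trap`, the 25-frame context on each side (51 frames in total, matching the "15 Bands x 51 Frames" system named later in the talk), and the edge padding are assumptions made for this example; with an assumed 10 ms frame step, 51 frames span roughly half a second.

```python
import numpy as np

def extract_trap(log_energies, band, center, context=25):
    """Return the temporal pattern of one critical band around one frame.

    log_energies: array of shape (num_frames, num_bands)
    band:         index of the critical band to follow over time
    center:       index of the frame the pattern is centered on
    context:      frames kept on each side (assumed value: 25, i.e. 51 total)
    """
    # Repeat the first/last value so that windows near the edges stay full length.
    padded = np.pad(log_energies[:, band], context, mode="edge")
    # Original frame i sits at padded index i + context, so this slice covers
    # original frames center - context .. center + context.
    return padded[center:center + 2 * context + 1]

# Synthetic stand-in for real log critical band energies: 100 frames, 15 bands.
log_energies = np.arange(100 * 15, dtype=float).reshape(100, 15)
trap = extract_trap(log_energies, band=3, center=50)
edge_trap = extract_trap(log_energies, band=3, center=0)
```

Stacking the 15 per-band patterns for the same center frame would give the 15 x 51 input used by the one-stage system.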

Example of TRAPS

Mean Temporal Patterns for 45 phonemes at 500 Hz

TRAPS Motivation

• Psychoacoustic studies suggest that the human peripheral auditory system integrates information over a longer time scale

• Information measurements (joint mutual information) show that information still exists more than 100 ms away within a single critical band

• Potential robustness to speech degradations

Let’s Explore

• TRAPS and HATS are examples of a specific two-stage approach to learning long-term temporal features

• Is this constrained two-stage approach better than an unconstrained one-stage approach?

• Are the non-linear transformations of critical band trajectories, provided in different ways by TRAPS and HATS, actually necessary?

Learn Everything in One Step

Learn in Individual Bands

One-Stage Approach

2-Stage Linear Approaches

PCA/LDA Comments

• PCA on log critical band energy trajectories rotates and scales the dimensions along the directions of highest variance

• LDA projects onto directions that maximize class separability, measured as the ratio of between-class covariance to within-class covariance

• Keep top 40 dimensions for comparison with MLP-based approaches
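The two linear transforms described above can be sketched with plain NumPy. This is a minimal illustration on synthetic data, not the setup used in the talk: the function names, the unit-variance scaling in the PCA step, the random 51-frame trajectories, and the 46 class labels are all assumptions.

```python
import numpy as np

def pca_projection(X, k):
    """Top-k principal directions, scaled so projected features have unit variance."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]         # keep the k largest
    return mean, evecs[:, order] / np.sqrt(evals[order])

def lda_projection(X, y, k):
    """Top-k directions maximizing between-class over within-class scatter."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                       # within-class scatter
    Sb = np.zeros((d, d))                       # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Eigenvectors of Sw^-1 Sb; tiny imaginary parts from round-off are dropped.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1][:k]
    return mean, evecs[:, order].real

# Synthetic stand-in: 2000 trajectories of 51 frames with 46 hypothetical labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 51))
y = rng.integers(0, 46, size=2000)

pca_mean, pca_W = pca_projection(X, k=40)
lda_mean, lda_W = lda_projection(X, y, k=40)
pca_feats = (X - pca_mean) @ pca_W              # shape (2000, 40)
lda_feats = (X - lda_mean) @ lda_W              # shape (2000, 40)
```

Note that LDA yields at most one fewer useful direction than the number of classes, so keeping 40 dimensions only makes sense when more than 40 classes are present.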

2-Stage MLP-Based Approaches

MLP Comments

• As with the other 2-stage approaches, we first learn patterns independently in separate critical band trajectories, and then learn correlations among these discriminative trajectories

• Interpretation of various MLP layers:

1. Input to hidden weights – discriminant linear transformations

2. Hidden unit outputs – non-linear discriminant transforms

3. Before softmax – transforms hidden activation space to unnormalized phone probability space

4. Output activations – critical band phone probabilities
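The four interpretation points above correspond to four places where a single critical-band MLP can be tapped. The sketch below shows them in a toy forward pass; going by the system names used in the results, they plausibly map to "HATS Before Sigmoid", "HATS", "TRAPS Before Softmax", and "TRAPS" respectively. The layer sizes, the random weights, and the function name `band_mlp_taps` are assumptions for illustration only; a real system would use trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, N_HIDDEN, N_PHONES = 51, 40, 46       # assumed sizes

# Random weights stand in for a trained critical-band MLP.
W1 = rng.normal(scale=0.1, size=(N_FRAMES, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, N_PHONES))
b2 = np.zeros(N_PHONES)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def band_mlp_taps(trajectory):
    """Return the four candidate feature vectors for one band trajectory."""
    pre_sigmoid = trajectory @ W1 + b1           # 1. linear discriminant transform
    hidden = sigmoid(pre_sigmoid)                # 2. non-linear discriminant transform
    pre_softmax = hidden @ W2 + b2               # 3. unnormalized phone scores
    posteriors = softmax(pre_softmax)            # 4. critical band phone probabilities
    return {
        "before_sigmoid": pre_sigmoid,
        "hidden": hidden,
        "before_softmax": pre_softmax,
        "posteriors": posteriors,
    }

taps = band_mlp_taps(rng.normal(size=N_FRAMES))
```

In the two-stage setup, whichever tap is chosen would be collected from every band and passed to a second-stage merger that learns correlations across bands.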

Experimental Setup

• Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular

– 1/10 used as the cross-validation set for MLPs

• Testing: 2001 Hub-5 Evaluation Set (Eval2001)

– 2,255,609 frames and 62,890 words

• Back-end recognizer: SRI’s Decipher system. First-pass decoding using a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke for all his help)

Frame Accuracy Performance

[Bar chart: frame accuracy (vertical axis, 62.0% to 68.0%) for 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, and PLP 9 Frames]

Standalone Feature System

• Transform MLP outputs by:

1. Log transform to make features more Gaussian

2. PCA for decorrelation

• Same as Tandem setup introduced by Hermansky, Ellis, and Sharma

• Use transformed MLP outputs as front-end features for the SRI recognizer
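The two-step transform listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `tandem_features`, the small `eps` floor, the optional truncation `n_keep`, and the synthetic posteriors are all hypothetical, and in practice the PCA basis would be estimated on training data and then reused at test time.

```python
import numpy as np

def tandem_features(posteriors, n_keep=None, eps=1e-10):
    """Log-transform MLP posteriors, then decorrelate them with PCA.

    posteriors: array of shape (num_frames, num_classes), rows summing to 1
    n_keep:     number of leading principal components kept (None keeps all)
    """
    log_post = np.log(posteriors + eps)          # 1. log makes features more Gaussian
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]              # largest variance first
    if n_keep is not None:
        order = order[:n_keep]
    return centered @ evecs[:, order]            # 2. PCA for decorrelation

# Synthetic posteriors: softmax of random scores for 500 frames, 46 classes.
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 46))
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
posteriors = exp_scores / exp_scores.sum(axis=1, keepdims=True)

feats = tandem_features(posteriors, n_keep=25)
```

After this step the feature dimensions are uncorrelated on the data used to fit the PCA, which suits recognizers that model features with diagonal-covariance Gaussians.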

Standalone Features

[Bar chart: word error rate (vertical axis, 36.0% to 50.0%) for 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, and PLP 9 Frames]

Combination W/State-of-the-Art Front-End Feature

• SRI’s 2003 PLP front-end feature is 12th-order PLP with three deltas. Heteroscedastic linear discriminant analysis (HLDA) then transforms this 52-dimensional feature vector into the 39-dimensional HLDA(PLP+3d)

• Concatenate PCA-truncated MLP features to HLDA(PLP+3d) and use the result as an augmented front-end feature

– Similar to the Qualcomm-ICSI-OGI features in AURORA
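The concatenation step amounts to joining the two feature streams frame by frame. The sketch below assumes both streams are already frame-aligned; the function name `augment_front_end` and the truncation to 25 MLP dimensions are hypothetical, since the slides do not state how many PCA dimensions were kept.

```python
import numpy as np

def augment_front_end(hlda_plp, mlp_feats, n_mlp_keep=25):
    """Append the leading PCA dimensions of the MLP features to HLDA(PLP+3d).

    hlda_plp:   array of shape (num_frames, 39)
    mlp_feats:  PCA-ordered MLP features, shape (num_frames, num_dims)
    n_mlp_keep: number of leading MLP dimensions kept (assumed value)
    """
    if hlda_plp.shape[0] != mlp_feats.shape[0]:
        raise ValueError("both streams must have the same number of frames")
    return np.concatenate([hlda_plp, mlp_feats[:, :n_mlp_keep]], axis=1)

# Synthetic stand-ins for the two feature streams.
rng = np.random.default_rng(0)
hlda_plp = rng.normal(size=(100, 39))
mlp_feats = rng.normal(size=(100, 46))
augmented = augment_front_end(hlda_plp, mlp_feats)   # shape (100, 64)
```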

Combo W/PLP Baseline Features

[Bar chart: word error rate (vertical axis, 32.0% to 38.0%) for HLDA(PLP+3d), 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, PLP 9 Frames, and HATS + PLP 9 Frames]

Ranking Table

System                 Frame Acc.  Standalone  Combination
15 Bands x 51 Frames   6           6           6
PCA 40                 5           2           2
LDA 40                 4           3           2
HATS Before Sigmoid    3           4           2
HATS                   1           1           1
TRAPS Before Softmax   2           4           5
TRAPS                  7           7           7

Observations

• Across the three testing setups:

1. HATS is always #1

2. The one-stage 15 Bands x 51 Frames system is always #6, i.e. second to last

3. TRAPS is always last

4. PCA, LDA, HATS before sigmoid, and TRAPS before softmax flip flop in performance

Interpretation

• Learning constraints introduced by the 2-stage approach are helpful if done right.

• The non-linear discriminant transform of HATS is better than the linear discriminant transforms from LDA and HATS before sigmoid

• The further mapping from hidden activations to critical-band phone posteriors is not helpful

– Perhaps mapping to critical-band phones is too difficult and inherently noisy

• Finally, like TRAPS, HATS is complementary to the more conventional features and combines synergistically with PLP 9 Frames.

Frame Accuracy Performance

System                 Frame Acc.  Rel. Improvement
15 Bands x 51 Frames   64.7%       -
PCA 40                 65.5%       1.2%
LDA 40                 65.5%       1.2%
HATS Before Sigmoid    65.8%       1.7%
HATS                   66.9%       3.4%
TRAPS Before Softmax   65.9%       1.7%
TRAPS                  64.0%       -1.2%
PLP 9 Frames           67.6%       N/A

Standalone Features WER

System                 WER     Rel. Improvement
15 Bands x 51 Frames   48.0%   -
PCA 40                 45.3%   5.6%
LDA 40                 46.5%   3.1%
HATS Before Sigmoid    45.9%   4.4%
HATS                   44.5%   7.3%
TRAPS Before Softmax   45.9%   4.4%
TRAPS                  48.2%   -0.4%
PLP 9 Frames           41.2%   N/A
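The "Rel. Improvement" column in the standalone WER table is consistent with a relative WER reduction measured against the one-stage 15 Bands x 51 Frames system. The slides do not state the formula, so the sketch below is an assumption that happens to reproduce the listed values; the function name is hypothetical.

```python
def relative_improvement(baseline_wer, system_wer):
    """Relative WER reduction in percent (positive means fewer errors)."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

baseline = 48.0  # WER of the one-stage 15 Bands x 51 Frames system
standalone_wer = {
    "PCA 40": 45.3,
    "LDA 40": 46.5,
    "HATS Before Sigmoid": 45.9,
    "HATS": 44.5,
    "TRAPS Before Softmax": 45.9,
    "TRAPS": 48.2,
}

rel = {
    name: round(relative_improvement(baseline, wer), 1)
    for name, wer in standalone_wer.items()
}
# rel["HATS"] is 7.3 and rel["TRAPS"] is -0.4, matching the table.
```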

Combo W/PLP Baseline Features

System                                        WER     Rel. Improvement
HLDA(PLP+3d)                                  37.2%   -
15 Bands x 51 Frames                          37.1%   0.3%
PCA 40                                        36.8%   1.1%
LDA 40                                        36.8%   1.1%
HATS Before Sigmoid                           36.8%   1.1%
HATS                                          36.0%   3.2%
TRAPS Before Softmax                          36.9%   0.8%
TRAPS                                         37.2%   0.0%
PLP 9 Frames                                  36.1%   3.0%
HATS + PLP 9 Frames (Inverse Entropy Combo)   34.0%   8.6%
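The slides name an "Inverse Entropy Combo" for HATS + PLP 9 Frames without describing it. One common form of inverse entropy combination weights each stream's posteriors, frame by frame, by the reciprocal of their entropy, so that the more confident stream counts for more. The sketch below implements that generic scheme as an assumption; it is not confirmed to be the exact variant used in the talk, and the function name and example values are hypothetical.

```python
import numpy as np

def inverse_entropy_combine(streams, eps=1e-12):
    """Combine posterior streams with per-frame inverse-entropy weights.

    streams: list of arrays, each of shape (num_frames, num_classes)
    """
    P = np.stack(streams)                            # (streams, frames, classes)
    entropy = -(P * np.log(P + eps)).sum(axis=-1)    # (streams, frames)
    inv = 1.0 / (entropy + eps)
    weights = inv / inv.sum(axis=0, keepdims=True)   # weights sum to 1 per frame
    return (weights[..., None] * P).sum(axis=0)      # (frames, classes)

# One frame, three classes: a confident stream and an uninformative one.
confident = np.array([[0.9, 0.05, 0.05]])
uniform = np.array([[1 / 3, 1 / 3, 1 / 3]])
combined = inverse_entropy_combine([confident, uniform])
```

Because the uniform stream has the higher entropy, it receives the smaller weight, and the combined posterior stays closer to the confident stream than a plain average would.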