Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation


Page 1

Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation

• Authors: Naveen Parihar and Joseph Picone
Inst. for Signal and Info. Processing
Dept. of Electrical and Computer Eng.
Mississippi State University

• Contact Information:
Box 9571
Mississippi State University
Mississippi State, Mississippi 39762
Tel: 662-325-8335
Fax: 662-325-2298

• URL: http://www.isip.msstate.edu/publications/seminars/msstate_misc/2004/gsa/

• Email: {parihar,picone}@isip.msstate.edu

Page 2

INTRODUCTION: BLOCK DIAGRAM APPROACH

Core components:

• Transduction

• Feature extraction

• Acoustic modeling (hidden Markov models)

• Language modeling (statistical N-grams)

• Search (Viterbi beam)

• Knowledge sources
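These components compose into a single decoding pipeline. As a toy sketch of that composition (every name below is an illustrative placeholder, not the evaluation system's actual API):

```python
import numpy as np

# Toy rendering of the block diagram above; names are illustrative only.
def extract_features(samples, frame=160):
    """Front end stand-in: one log-energy value per 10 ms frame (160 samples @ 16 kHz)."""
    usable = len(samples) // frame * frame
    frames = samples[:usable].reshape(-1, frame)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def decode(features, acoustic_logprob, lm_logprob, vocab=("yes", "no")):
    """Search stand-in: pick the word whose combined acoustic + LM score is highest."""
    return max(vocab, key=lambda w: acoustic_logprob(features, w) + lm_logprob(w))

# Stub knowledge sources stand in for the HMMs and the N-gram model.
hyp = decode(extract_features(np.random.randn(16000)),
             acoustic_logprob=lambda feats, w: -np.var(feats) * len(w),
             lm_logprob=lambda w: np.log(0.5))
```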

Page 3

INTRODUCTION: AURORA EVALUATION OVERVIEW

• WSJ 5K (closed task) with seven (digitally-added) noise conditions

• Common ASR system

• Two participants: QIO (Qualcomm, ICSI, OGI) and MFA (Motorola, France Telecom, Alcatel)

• Client/server applications

• Evaluate robustness in noisy environments

• Propose a standard for LVCSR applications

Performance Summary (WER):

Site         Clean   Noise (Sennheiser)   Noise (multi-mic)
Base (TS1)   15%     59%                  75%
Base (TS2)   19%     33%                  50%
QIO (TS2)    17%     26%                  41%
MFA (TS2)    15%     26%                  40%
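Every figure in this summary (and the rest of the deck) is word error rate: (substitutions + deletions + insertions) divided by the number of reference words. A minimal scorer sketch, purely illustrative (the evaluation itself would have used the standard NIST scoring tools):

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i-1][j-1] + (r[i-1] != h[j-1]),  # substitution/match
                          d[i-1][j] + 1,                      # deletion
                          d[i][j-1] + 1)                      # insertion
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words = 0.33
```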

Page 4

INTRODUCTION: MOTIVATION

• The Aurora Large Vocabulary (ALV) evaluation goal was at least a 25% relative improvement over the baseline MFCC front end

• Is the 31% relative improvement (34.5% vs. 50.3%) operationally significant?

ALV Evaluation Results (WER):

Front End   Overall   8 kHz Avg   8 kHz TS1   8 kHz TS2   16 kHz Avg   16 kHz TS1   16 kHz TS2
MFCC        50.3%     49.6%       58.1%       41.0%       51.0%        62.2%        39.8%
QIO         37.5%     38.4%       43.2%       33.6%       36.5%        40.7%        32.4%
MFA         34.5%     34.5%       37.5%       31.4%       34.4%        37.2%        31.5%
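A quick check of the headline arithmetic from the table above:

```python
baseline, mfa = 50.3, 34.5                 # overall WER from the results table
rel_improvement = (baseline - mfa) / baseline
print(f"{rel_improvement:.1%}")            # 31.4% -- the ~31% relative gain cited
```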

• Generic baseline LVCSR system with no front end specific tuning

• Would front end specific tuning change the rankings?

Page 5

EVALUATION PARADIGM: THE AURORA-4 DATABASE

Acoustic Training:

• Derived from 5000-word WSJ0 task
• TS1 (clean) and TS2 (multi-condition)
• Clean plus 6 noise conditions
• Randomly chosen SNR between 10 and 20 dB
• 2 microphone conditions (Sennheiser and secondary)
• 2 sample frequencies: 16 kHz and 8 kHz
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz

Development and Evaluation Sets:

• Derived from WSJ0 Evaluation and Development sets
• 14 test sets for each
• 7 test sets recorded on Sennheiser; 7 on secondary
• Clean plus 6 noise conditions
• Randomly chosen SNR between 5 and 15 dB
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
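The noisy conditions are digitally added at a target SNR; a minimal mixing sketch (illustrative, not the actual Aurora-4 preparation scripts):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    noise = np.resize(noise, clean.shape)   # loop/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)          # stand-in for a speech waveform
noisy = add_noise_at_snr(clean, rng.standard_normal(8000),
                         snr_db=rng.uniform(10, 20))   # training-style SNR draw
```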

Page 6

EVALUATION PARADIGM: BASELINE LVCSR SYSTEM

Standard context-dependent cross-word HMM-based system:

• Acoustic models: state-tied 4-mixture cross-word triphones

• Language model: WSJ0 5K bigram

• Search: Viterbi one-best using lexical trees for N-gram cross-word decoding (a toy pruning sketch follows the training diagram below)

• Lexicon: based on CMUlex

• Real-time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium

[Training flow diagram: Training Data → Monophone Modeling → CD-Triphone Modeling → State-Tying → CD-Triphone Modeling → Mixture Modeling (2, 4)]
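The Viterbi decoding referenced in the bullet list relies on beam pruning to stay tractable. A toy single-HMM sketch, assuming per-frame state log-likelihoods are already computed (the actual lexical-tree, cross-word decoder is far more elaborate):

```python
import numpy as np

def viterbi_beam(log_obs, log_trans, beam=10.0):
    """One-best Viterbi over a single HMM with beam pruning.

    log_obs:   (T, S) per-frame state log-likelihoods
    log_trans: (S, S) log transition probabilities
    """
    T, S = log_obs.shape
    score = np.full(S, -np.inf)
    score[0] = log_obs[0, 0]                  # assume the model starts in state 0
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans     # cand[prev, cur]
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_obs[t]
        score[score < score.max() - beam] = -np.inf   # prune paths outside the beam
    path = [int(np.argmax(score))]            # trace back the best state sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```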

Page 7

EVALUATION PARADIGM: WI007 ETSI MFCC FRONT END

• Zero-mean debiasing

• 10 ms frame duration

• 25 ms Hamming window

• Absolute energy

• 12 cepstral coefficients

• First and second derivatives

[Block diagram: Input Speech → Zero-mean and Pre-emphasis → Fourier Transform Analysis → Cepstral Analysis, with energy computed in parallel]
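A rough approximation of this feature flow using librosa, assuming a 16 kHz input file named utt.wav (a placeholder); this is not the ETSI WI007 reference implementation:

```python
import numpy as np
import librosa

y, sr = librosa.load("utt.wav", sr=16000)     # "utt.wav" is a placeholder
y = y - np.mean(y)                            # zero-mean debiasing
y = librosa.effects.preemphasis(y)            # pre-emphasis

win, hop = int(0.025 * sr), int(0.010 * sr)   # 25 ms window, 10 ms frame rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512,
                            win_length=win, hop_length=hop, window="hamming")
cep = mfcc[1:13]                              # 12 cepstral coefficients (c1-c12)

frames = librosa.util.frame(y, frame_length=win, hop_length=hop)
log_e = np.log(np.sum(frames ** 2, axis=0) + 1e-10)   # absolute log-energy

n = min(cep.shape[1], log_e.shape[0])         # align the two frame counts
base = np.vstack([cep[:, :n], log_e[:n]])
feats = np.vstack([base,
                   librosa.feature.delta(base),            # first derivatives
                   librosa.feature.delta(base, order=2)])  # second derivatives
# feats: 39 rows (12 cepstra + energy, plus deltas and delta-deltas) per frame
```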

Page 8

FRONT END PROPOSALS: QIO FRONT END

• 10 msec frame duration

• 25 msec analysis window

• 15 RASTA-like filtered cepstral coefficients

• MLP-based VAD

• Mean and variance normalization

• First and second derivatives

[Block diagram: Input Speech → Fourier Transform → RASTA → Mel-scale Filter Bank → DCT → Mean/Variance Normalization, with an MLP-based VAD in parallel]
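Of these stages, the mean and variance normalization is easy to pin down; a generic per-utterance CMVN sketch (not the QIO proposal's exact code):

```python
import numpy as np

def mean_variance_normalize(feats, eps=1e-10):
    """Per-utterance CMVN: zero mean, unit variance for each coefficient.

    feats: (n_coeffs, n_frames) feature matrix.
    """
    mu = feats.mean(axis=1, keepdims=True)
    sigma = feats.std(axis=1, keepdims=True)
    return (feats - mu) / (sigma + eps)
```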

Page 9

FRONT END PROPOSALS: MFA FRONT END

• 10 msec frame duration

• 25 msec analysis window

• Mel-warped Wiener filter based noise reduction

• Energy-based VADNest

• Waveform processing to enhance SNR

• Weighted log-energy

• 12 cepstral coefficients

• Blind equalization (cepstral domain)

• VAD based on acceleration of various energy based measures

• First and second derivatives

[Block diagram: Input Speech → Noise Reduction → Waveform Processing → Cepstral Analysis → Blind Equalization → Feature Processing, with VADNest and VAD in parallel]
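The noise-reduction block is Wiener filtering. Below is the textbook frequency-domain Wiener gain as a sketch; the MFA proposal's mel-warped, two-stage filter is considerably more involved, and the noise PSD here is assumed to come from frames a VAD marks as speech-free:

```python
import numpy as np

def wiener_gain(power_spec, noise_psd, gain_floor=0.1):
    """Textbook Wiener gain per frequency bin: G = SNR / (1 + SNR).

    power_spec: noisy power spectrum of one frame
    noise_psd:  estimated noise power spectrum (e.g., averaged over
                frames a VAD labeled as non-speech)
    """
    snr = np.maximum(power_spec / (noise_psd + 1e-12) - 1.0, 0.0)
    gain = snr / (1.0 + snr)
    return np.maximum(gain, gain_floor)   # floor limits musical-noise artifacts

# Enhanced spectrum for one frame (conceptually):
# clean_est = wiener_gain(noisy_power, noise_psd) * noisy_fft_frame
```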

Page 10

EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING

• Pruning beams (word, phone and state) were opened during the tuning process to eliminate search errors.

• Tuning parameters (a score-combination sketch follows this list):

State-tying thresholds: solve the problem of training-data sparsity by sharing state distributions among phonetically similar states

Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)

Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
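As referenced above, the language model scale and word insertion penalty enter the decoder's path score roughly as follows (a sketch; exact sign conventions and log bases vary by decoder):

```python
def path_score(acoustic_logprob, lm_logprob, n_words,
               lm_scale=18.0, word_ins_penalty=10.0):
    # Higher lm_scale trusts the N-gram more relative to the HMMs; the
    # insertion penalty taxes each hypothesized word, trading insertions
    # against deletions. Values mirror the table on the next slide
    # (LM scale ~18-20, penalty ~5-10).
    return acoustic_logprob + lm_scale * lm_logprob - word_ins_penalty * n_words
```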

Page 11

EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING

• QIO FE - 7.5% relative improvement

• MFA FE - 9.4% relative improvement

• Ranking is still the same (14.9% vs. 12.5%)!

FE    Cond.   Tied States   Split   Merge   Occu.   LM Scale   Word Ins. Pen.   WER
QIO   Base    3209          165     165     840     18         10               16.1%
QIO   Tuned   3512          125     125     750     20         10               14.9%
MFA   Base    3208          165     165     840     18         10               13.8%
MFA   Tuned   4254          100     100     600     18         5                12.5%

(Split, Merge, and Occu. are the state-tying thresholds.)

Page 12

EXPERIMENTAL RESULTS: COMPARISON OF TUNING

Front End   Train Set   Tuned?   Avg. WER over 14 Test Sets
QIO         TS1         No       43.1%
QIO         TS2         No       33.6%
QIO         Avg.        No       38.4%
QIO         TS1         Yes      45.7%
QIO         TS2         Yes      35.3%
QIO         Avg.        Yes      40.5%
MFA         TS1         No       37.5%
MFA         TS2         No       31.8%
MFA         Avg.        No       34.7%
MFA         TS1         Yes      37.0%
MFA         TS2         Yes      31.1%
MFA         Avg.        Yes      34.1%

• Same ranking: relative performance gap increased from 9.6% to 15.8%

• On TS1, MFA FE significantly better on all 14 test sets (MAPSSWE p=0.1%)

• On TS2, MFA FE significantly better only on test sets 5 and 14

Page 13

EXPERIMENTAL RESULTS: MICROPHONE VARIATION

• Train on Sennheiser mic.; evaluate on secondary mic.

• Matched conditions result in optimal performance

• Significant degradation for all front ends on mismatched conditions

• Both QIO and MFA provide improved robustness relative to MFCC baseline

[Bar chart: WER (%) for the ETSI, QIO, and MFA front ends on Sennheiser vs. secondary microphone; y-axis 0-40%]

Page 14

EXPERIMENTAL RESULTS: ADDITIVE NOISE

[Bar chart: WER (%) on noisy test sets TS2-TS7 for ETSI, QIO, and MFA, systems trained on clean data; y-axis 0-70%]

• Performance degrades on noisy conditions when systems are trained only on clean data

• Both QIO and MFA deliver improved performance

[Bar chart: WER (%) on noisy test sets TS2-TS7 for ETSI, QIO, and MFA, systems trained on multi-condition data (TS2); y-axis 0-40%]

• Exposing systems to noise and microphone variations (TS2) improves performance

Page 15

SUMMARY AND CONCLUSIONS: WHAT HAVE WE LEARNED?

• Both QIO and MFA front ends achieved the ALV evaluation goal of improving performance by at least 25% relative to the ETSI baseline

• WER is still high (~35%) while human benchmarks report much lower error rates (~1%); the improvement in performance is not operationally significant

• Front-end-specific parameter tuning did not result in a significant change in overall performance (MFA still outperforms QIO)

• Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline

Page 16

APPENDIX: AVAILABLE RESOURCES

• Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit

• ETSI DSR Website: reports and front end standards

• Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and performance summary of the baseline MFCC front end