Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation
• Authors: Naveen Parihar and Joseph Picone
  Inst. for Signal and Info. Processing
  Dept. of Electrical and Computer Eng.
  Mississippi State University
• Contact Information:
  Box 9571
  Mississippi State University
  Mississippi State, Mississippi 39762
  Tel: 662-325-8335
  Fax: 662-325-2298
• URL: http://www.isip.msstate.edu/publications/seminars/msstate_misc/2004/gsa/
Email: {parihar,picone}@isip.msstate.edu
INTRODUCTION: BLOCK DIAGRAM APPROACH
Core components (a toy sketch of how they compose follows this list):
• Transduction
• Feature extraction
• Acoustic modeling (hidden Markov models)
• Language modeling (statistical N-grams)
• Search (Viterbi beam)
• Knowledge sources
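The decomposition above can be made concrete in code. The following toy Python sketch shows how feature extraction, acoustic scoring, language modeling, and search compose into a recognizer; every function body here is an invented stand-in for illustration, not the evaluation system:

```python
import math

def extract_features(samples, frame_len=200, frame_shift=80):
    """Feature extraction: slice the waveform into overlapping frames and
    reduce each frame to a feature vector (here, just log-energy)."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_shift):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) + 1e-10
        feats.append([math.log(energy)])
    return feats

def acoustic_score(feats, word):
    """Acoustic model: log P(O | W). A real system scores HMM state
    sequences; this stand-in returns a dummy value."""
    return -len(feats) * (1.0 + 0.01 * len(word))

def lm_score(history, word):
    """Language model: log P(word | history) from a statistical N-gram
    (here a uniform toy bigram)."""
    return -2.0

def decode(feats, vocab):
    """Search: pick the hypothesis maximizing the combined score."""
    return max(vocab, key=lambda w: acoustic_score(feats, w) + lm_score((), w))
```

A real decoder replaces this arg-max over isolated words with a Viterbi beam search over word sequences, combining the same acoustic and language model scores.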
INTRODUCTION: AURORA EVALUATION OVERVIEW
• WSJ 5K (closed task) with seven (digitally-added) noise conditions
• Common ASR system
• Two participants:
  QIO (Qualcomm, ICSI, OGI) and MFA (Motorola, France Telecom, Alcatel)
• Client/server applications
• Evaluate robustness in noisy environments
• Propose a standard for LVCSR applications
Performance Summary (WER)

Site (Train Set)   Clean   Noise (Sennheiser)   Noise (Multi-Mic)
Base (TS1)         15%     59%                  75%
Base (TS2)         19%     33%                  50%
QIO (TS2)          17%     26%                  41%
MFA (TS2)          15%     26%                  40%
• Is the 31% relative improvement (34.5% vs. 50.3% WER) operationally significant?
INTRODUCTION: MOTIVATION
• The Aurora Large Vocabulary (ALV) evaluation goal was at least a 25% relative improvement over the baseline MFCC front end
ALV Evaluation Results (WER)

Front End   Overall   8 kHz Avg.   8 kHz TS1   8 kHz TS2   16 kHz Avg.   16 kHz TS1   16 kHz TS2
MFCC        50.3%     49.6%        58.1%       41.0%       51.0%         62.2%        39.8%
QIO         37.5%     38.4%        43.2%       33.6%       36.5%         40.7%        32.4%
MFA         34.5%     34.5%        37.5%       31.4%       34.4%         37.2%        31.5%
• Generic baseline LVCSR system with no front-end-specific tuning
• Would front-end-specific tuning change the rankings?
EVALUATION PARADIGM: THE AURORA-4 DATABASE
Acoustic Training:
• Derived from the 5000-word WSJ0 task
• TS1 (clean) and TS2 (multi-condition)
• Clean plus 6 noise conditions
• Randomly chosen SNR between 10 and 20 dB
• 2 microphone conditions (Sennheiser and secondary)
• 2 sample frequencies: 16 kHz and 8 kHz
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
Development and Evaluation Sets:
• Derived from the WSJ0 Evaluation and Development sets
• 14 test sets each
• 7 test sets recorded on the Sennheiser mic; 7 on a secondary mic
• Clean plus 6 noise conditions
• Randomly chosen SNR between 5 and 15 dB (a mixing sketch follows below)
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
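To make the noisy-set construction concrete, here is a hedged NumPy sketch of digitally adding a noise recording to clean speech at a chosen SNR. The function name and all details are illustrative assumptions, not the official Aurora-4 preparation scripts:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db,
    then mix. Both arguments are float arrays at the same sample rate."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]        # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# Training-style condition: SNR drawn uniformly from 10-20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                   # stand-in clean utterance
noise = rng.standard_normal(8000)                     # stand-in noise recording
noisy = add_noise_at_snr(speech, noise, snr_db=rng.uniform(10.0, 20.0))
```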
EVALUATION PARADIGM: BASELINE LVCSR SYSTEM
Standard context-dependent cross-word HMM-based system:
• Acoustic models: state-tied 4-mixture cross-word triphones
• Language model: WSJ0 5K bigram
• Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
• Lexicon: based on CMUlex
• Real-time factor: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
[Diagram: acoustic training flow: training data, monophone modeling, CD-triphone modeling, state tying, CD-triphone modeling, mixture modeling (2 and 4 mixtures)]
EVALUATION PARADIGM: WI007 ETSI MFCC FRONT END
• Zero-mean debiasing
• 10 ms frame duration
• 25 ms Hamming window
• Absolute energy
• 12 cepstral coefficients
• First and second derivatives (a NumPy sketch of this front end follows the diagram)
[Diagram: WI007 MFCC front end: input speech, zero-mean and pre-emphasis, Fourier transform analysis, cepstral analysis, and energy]
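A minimal NumPy sketch of an MFCC front end along these lines is shown below. The frame/window sizes, 12 cepstra plus energy, and first/second derivatives match the bullets above; the 23-channel filterbank, the 0.97 pre-emphasis coefficient, and the +/-2-frame delta regression are assumptions for illustration, not the WI007 specification:

```python
import numpy as np

def _mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def _inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_ms=10, win_ms=25, n_filt=23, n_ceps=12):
    shift, win = int(fs * frame_ms / 1000), int(fs * win_ms / 1000)
    nfft = 1 << (win - 1).bit_length()
    signal = signal - signal.mean()                     # zero-mean debiasing
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    n_frames = 1 + (len(signal) - win) // shift
    idx = np.arange(win) + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)              # 25 ms Hamming window
    log_e = np.log(np.maximum((frames ** 2).sum(axis=1), 1e-10))  # absolute energy
    spec = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # Triangular mel filterbank spanning 0 to fs/2.
    edges = _inv_mel(np.linspace(_mel(0.0), _mel(fs / 2.0), n_filt + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(np.maximum(spec @ fbank.T, 1e-10))
    # DCT-II down to 12 cepstral coefficients (c1..c12), plus log-energy.
    chan = np.arange(n_filt) + 0.5
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), chan) / n_filt)
    static = np.hstack([logmel @ dct.T, log_e[:, None]])
    def delta(x, k=2):
        # Simple regression over +/- k frames (window size is an assumption).
        pad = np.pad(x, ((k, k), (0, 0)), mode="edge")
        num = sum(j * pad[k + j:len(x) + k + j] for j in range(-k, k + 1))
        return num / (2.0 * sum(j * j for j in range(1, k + 1)))
    return np.hstack([static, delta(static), delta(delta(static))])
```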
FRONT END PROPOSALS: QIO FRONT END
• 10 msec frame duration
• 25 msec analysis window
• 15 RASTA-like filtered cepstral coefficients
• MLP-based VAD
• Mean and variance normalization
• First and second derivatives (RASTA filtering and normalization are sketched after the diagram)
[Diagram: QIO front end: input speech, Fourier transform, mel-scale filter bank, RASTA, DCT, and mean/variance normalization, with an MLP-based VAD branch]
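Two of the QIO stages are easy to sketch. The snippet below applies a RASTA-style band-pass filter to each cepstral trajectory and then per-utterance mean/variance normalization; the filter coefficients are the classic Hermansky-Morgan RASTA filter, an assumption that may differ from the actual QIO filter:

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(ceps):
    """Band-pass each cepstral trajectory over time, suppressing slowly
    varying convolutional effects and very fast frame-to-frame noise.
    ceps: array of shape (n_frames, n_coeffs)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # classic RASTA numerator
    a = np.array([1.0, -0.98])                        # leaky integrator pole
    return lfilter(b, a, ceps, axis=0)

def mean_variance_normalize(feats, eps=1e-10):
    """Per-utterance normalization of each coefficient to zero mean and
    unit variance, reducing channel and level mismatch."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)
```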
FRONT END PROPOSALS: MFA FRONT END
• 10 msec frame duration
• 25 msec analysis window
• Mel-warped Wiener-filter-based noise reduction (a toy Wiener-gain sketch follows the diagram)
• Energy-based VADNest
• Waveform processing to enhance SNR
• Weighted log-energy
• 12 cepstral coefficients
• Blind equalization (cepstral domain)
• VAD based on acceleration of various energy-based measures
• First and second derivatives
[Diagram: MFA front end: input speech, noise reduction, waveform processing, cepstral analysis, blind equalization, and feature processing, with VADNest and VAD stages]
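The noise-reduction stage can be illustrated with a toy single-stage Wiener gain. The actual MFA proposal uses a two-stage mel-warped Wiener filter, so treat this purely as a sketch of the principle, with the noise spectrum estimated from the first frames (assumed speech-free):

```python
import numpy as np

def wiener_denoise(power_spec, n_noise_frames=10, gain_floor=0.1):
    """power_spec: (n_frames, n_bins) magnitude-squared spectra.
    Estimates the noise spectrum from the first frames (assumed to be
    speech-free) and applies a floored Wiener gain per bin."""
    noise_psd = power_spec[:n_noise_frames].mean(axis=0)
    snr_est = np.maximum(power_spec / (noise_psd + 1e-12) - 1.0, 0.0)
    gain = np.maximum(snr_est / (1.0 + snr_est), gain_floor)  # Wiener gain
    return power_spec * gain
```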
EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
• Pruning beams (word, phone and state) were opened during the tuning process to eliminate search errors.
• Tuning parameters:
  • State-tying thresholds: address the sparsity of training data by sharing state distributions among phonetically similar states
  • Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)
  • Word insertion penalty: balances insertions and deletions (always a concern in noisy environments); a scoring sketch follows below
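For intuition, here is a hedged sketch of how the last two parameters typically enter a hypothesis score during Viterbi search; sign conventions vary, and the defaults simply mirror the baseline settings in the table on the next slide:

```python
def hypothesis_score(acoustic_logprob, lm_logprob, n_words,
                     lm_scale=18.0, word_ins_penalty=10.0):
    """Combined log-domain path score used to rank hypotheses.
    lm_scale weights the language model against the acoustic model;
    word_ins_penalty taxes each hypothesized word, trading insertions
    against deletions."""
    return acoustic_logprob + lm_scale * lm_logprob - n_words * word_ins_penalty
```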
EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
• QIO FE - 7.5% relative improvement
• MFA FE - 9.4% relative improvement
• Ranking is still the same (14.9% vs. 12.5%)!
FE    Cond.   # of Tied   State-Tying Thresholds      LM      Word        WER
              States      Split   Merge   Occupancy   Scale   Ins. Pen.
QIO   Base    3209        165     165     840         18      10          16.1%
QIO   Tuned   3512        125     125     750         20      10          14.9%
MFA   Base    3208        165     165     840         18      10          13.8%
MFA   Tuned   4254        100     100     600         18      5           12.5%
EXPERIMENTAL RESULTS: COMPARISON OF TUNING
Front End   Train Set   Tuning   Average WER over 14 Test Sets
QIO         1           No       43.1%
QIO         2           No       38.1%
QIO         Avg.        No       38.4%
QIO         1           Yes      45.7%
QIO         2           Yes      35.3%
QIO         Avg.        Yes      40.5%
MFA         1           No       37.5%
MFA         2           No       31.8%
MFA         Avg.        No       34.7%
MFA         1           Yes      37.0%
MFA         2           Yes      31.1%
MFA         Avg.        Yes      34.1%
• Same ranking: relative performance gap increased from 9.6% to 15.8%
• On TS1, MFA FE significantly better on all 14 test sets (MAPSSWE p=0.1%)
• On TS2, MFA FE significantly better only on test sets 5 and 14 (a simplified significance-test sketch follows below)
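The cited test is NIST's MAPSSWE (Matched Pairs Sentence-Segment Word Error) test. As a rough illustration of the idea only, here is a simplified matched-pairs z-test over per-segment error-count differences; it is not NIST's exact implementation, which defines the segments more carefully:

```python
import numpy as np
from scipy.stats import norm

def matched_pairs_pvalue(errors_a, errors_b):
    """Two-sided p-value for the hypothesis that two systems have the
    same mean per-segment error count. errors_a and errors_b are arrays
    of word-error counts for the same segments under each system."""
    d = np.asarray(errors_a, float) - np.asarray(errors_b, float)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return 2.0 * norm.sf(abs(z))
```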
EXPERIMENTAL RESULTS: MICROPHONE VARIATION
• Train on Sennheiser mic.; evaluate on secondary mic.
• Matched conditions result in optimal performance
• Significant degradation for all front ends on mismatched conditions
• Both QIO and MFA provide improved robustness relative to MFCC baseline
[Chart: WER (%) by microphone (Sennheiser vs. secondary) for the ETSI, QIO, and MFA front ends; y-axis 0-40%]
EXPERIMENTAL RESULTS: ADDITIVE NOISE
[Chart: WER (%) on noisy test sets TS2-TS7 for the ETSI, QIO, and MFA front ends with clean-data training; y-axis 0-70%]
• Performance degrades on noisy conditions when systems are trained only on clean data
• Both QIO and MFA deliver improved performance
[Chart: WER (%) on noisy test sets TS2-TS7 with multi-condition (TS2) training; y-axis 0-40%]
• Exposing systems to noise and microphone variations (TS2) improves performance
SUMMARY AND CONCLUSIONS: WHAT HAVE WE LEARNED?
• Both QIO and MFA front ends achieved the ALV evaluation goal of at least a 25% relative improvement over the ETSI baseline
• WER is still high (~35%) while human benchmarks report error rates near 1%, so the improvement is not operationally significant
• Front-end-specific parameter tuning did not significantly change overall performance (MFA still outperforms QIO)
• Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline
APPENDIX: AVAILABLE RESOURCES
• Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
• ETSI DSR Website: reports and front end standards
• Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and performance summary of the baseline MFCC front end