PERFORMANCE ANALYSIS OF AURORA LARGE VOCABULARY BASELINE SYSTEM
Naveen Parihar and Joseph Picone, Center for Advanced Vehicular Systems,
Mississippi State University, {parihar, picone}@isip.msstate.edu
D. Pearce, Speech and MultiModal Group, Motorola Labs, [email protected]
H. G. Hirsch, Dept. of Electrical Engineering and Computer Science,
Niederrhein University, [email protected]
URL: www.isip.msstate.edu/projects/ies/publications/conferences/
Page 2 of 15: Performance Analysis of ALV Baseline System
Abstract
In this paper, we present the design and analysis of the baseline recognition system used for the ETSI Aurora large vocabulary (ALV) evaluation. The experimental paradigm is presented along with results from a number of experiments designed to minimize the computational requirements of the system. The ALV baseline system achieved a WER of 14.0% on the standard 5K Wall Street Journal task, and required 4 xRT for training and 15 xRT for decoding (on an 800 MHz Pentium processor). It is shown that increasing the sampling frequency from 8 kHz to 16 kHz improves performance significantly only for the noisy test conditions. Utterance detection resulted in significant improvements only on the noisy conditions for the mismatched training conditions. Use of the DSR standard VQ-based compression algorithm did not result in a significant degradation. Model mismatch and microphone mismatch resulted in relative increases in WER of 300% and 200%, respectively.
Motivation
• The ALV goal was at least a 25% relative improvement over the baseline MFCC front end
• Develop a generic baseline LVCSR system with no front-end-specific tuning
• Benchmark the baseline MFCC front end using the generic LVCSR system on six focus conditions: sampling frequency reduction, utterance detection, feature-vector compression, model mismatch, microphone variation, and additive noise
ALV Baseline System Development
Standard context-dependent cross-word HMM-based system:
• Acoustic models: state-tied, 16-mixture cross-word triphones
• Language model: WSJ0 5K bigram
• Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
• Lexicon: based on CMUlex
• Performance: 8.3% WER at 85 xRT
[Figure: acoustic model training flow: training data → monophone modeling → CD-triphone modeling → state tying → CD-triphone modeling → mixture modeling (16)]
The baseline HMM system used an ETSI standard MFCC-based front end:
• Zero-mean debiasing
• 10 ms frame duration
• 25 ms Hamming window
• Absolute energy
• 12 cepstral coefficients
• First and second derivatives
[Figure: ETSI WI007 front end: input speech → zero-mean and pre-emphasis → Fourier transform analysis → cepstral analysis, with a parallel energy measure]
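The analysis parameters above (25 ms Hamming window, 10 ms frame shift, an energy term) can be sketched as a short framing routine. This is a hedged illustration, not the ETSI reference code: the per-frame mean removal stands in for the standard's zero-mean debiasing, and the Fourier/cepstral stages that produce the 12 cepstral coefficients are omitted.

```python
import math

def frame_signal(signal, fs=16000, shift_ms=10, window_ms=25):
    """Slice speech into overlapping 25 ms Hamming-windowed frames
    every 10 ms, returning (windowed_frame, log_energy) pairs.
    Per-frame mean removal is an illustrative stand-in for the
    WI007 zero-mean debiasing stage."""
    shift = fs * shift_ms // 1000      # 160 samples at 16 kHz
    size = fs * window_ms // 1000      # 400 samples at 16 kHz
    win = [0.54 - 0.46 * math.cos(2 * math.pi * i / (size - 1))
           for i in range(size)]
    frames = []
    for start in range(0, len(signal) - size + 1, shift):
        chunk = signal[start:start + size]
        mean = sum(chunk) / size
        debiased = [x - mean for x in chunk]
        energy = math.log(sum(x * x for x in debiased) + 1e-10)
        frames.append(([x * w for x, w in zip(debiased, win)], energy))
    return frames
```

For one second of 16 kHz audio this yields 98 frames; each frame would then feed the Fourier and cepstral analysis stages, followed by the first- and second-derivative computation.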
Real-Time Reduction

Factor               WER     Relative Degrad.
Baseline system      8.3%    N/A
Terminal filtering   8.4%    1%
ETSI front end       9.6%    14%
Beam adj. (15 xRT)   11.8%   23%
16 to 4 mixtures     14.1%   20%
50% reduction        14.9%   6%
Endpointing          14.0%   -6%

(Each relative degradation is measured against the preceding configuration.)

• Derived from the ISIP WSJ0 system (with CMS)
• Aurora-4 database terminal filtering resulted in marginal degradation
• The ETSI WI007 front end is 14% worse (no CMS)
• ALV baseline system performance: 14.0% WER
• Real time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
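The Relative Degrad. column can be reproduced row by row with a quick check (the helper name is hypothetical):

```python
def rel_degradation(wer_new, wer_prev):
    """Relative WER change vs. the previous configuration, in percent
    (negative values mean an improvement)."""
    return round(100.0 * (wer_new - wer_prev) / wer_prev)

print(rel_degradation(9.6, 8.4))    # ETSI front end vs. terminal filtering: 14
print(rel_degradation(11.8, 9.6))   # beam adjustment: 23
print(rel_degradation(14.0, 14.9))  # endpointing: -6
```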
Aurora-4 Database

Acoustic training:
• Derived from the 5000-word WSJ0 task
• TrS1 (clean) and TrS2 (multi-condition)
• Clean plus 6 noise conditions
• Randomly chosen SNR between 10 and 20 dB
• 2 microphone conditions (Sennheiser and secondary)
• 2 sampling frequencies: 16 kHz and 8 kHz
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz

Development and evaluation sets:
• Derived from the WSJ0 development and evaluation sets
• 14 test sets for each
• 7 recorded on the Sennheiser mic.; 7 on a secondary mic.
• Clean plus 6 noise conditions
• Randomly chosen SNR between 5 and 15 dB
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
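The noisy utterances are built by mixing noise into clean speech at a randomly chosen SNR. A minimal sketch of SNR-controlled mixing (the helper name and the power-based gain rule are illustrative assumptions, not the Aurora-4 tooling):

```python
import math, random

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals the
    requested SNR (in dB), then add it to the speech."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

# e.g. a randomly chosen training SNR between 10 and 20 dB:
snr = random.uniform(10.0, 20.0)
```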
Sampling Frequency Reduction

• Perfectly-matched condition (TrS1 and TS1): no significant degradation
• Mismatched conditions (TrS1 and TS2-TS14): no clear trend
• Matched conditions (TrS2 and TS1-TS14): significant degradation on noisy conditions recorded on the Sennheiser mic. (TS3-TS8)

[Figure: WER (%) for TS1-TS7 at 16 kHz vs. 8 kHz sampling]
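The 16 kHz to 8 kHz reduction amounts to low-pass filtering followed by 2:1 decimation. A deliberately naive sketch (a real system would apply proper anti-aliasing consistent with the G.712 characteristic, not a two-sample average):

```python
def downsample_2x(signal):
    """Halve the sampling rate: crude two-sample-average low-pass,
    then keep every other (averaged) sample."""
    return [(signal[i] + signal[i + 1]) / 2.0
            for i in range(0, len(signal) - 1, 2)]
```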
Utterance Detection

• Perfectly-matched condition (TrS1 and TS1): no significant improvement
• Mismatched conditions (TrS1 and TS2-TS14): significant improvement due to a reduction in insertions
• Matched conditions (TrS2 and TS1-TS14): no significant improvement

Test Set           W/O Endpointing         With Endpointing
                   Sub.   Del.   Ins.      Sub.   Del.   Ins.
TS2 (Senn., Car)   41.4%  3.6%   20.1%     40.0%  3.6%   13.0%
TS9 (Sec., Car)    54.4%  12.3%  15.1%     49.1%  15.1%  10.1%
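Since WER is the sum of the substitution, deletion, and insertion rates, the endpointing effect is easy to read off the table: the gain comes mostly from insertions. A quick check using the TS2 without-endpointing numbers:

```python
def wer(sub, dele, ins):
    """Word error rate as the sum of the substitution, deletion, and
    insertion rates (all in percent)."""
    return round(sub + dele + ins, 1)

print(wer(41.4, 3.6, 20.1))  # TS2 without endpointing: 65.1
```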
Feature-vector Compression
• Sampling-frequency-specific codebooks (8 kHz and 16 kHz)
• Perfectly-matched condition (TrS1 and TS1): no significant degradation
• Mismatched conditions (TrS1 and TS2-TS14): no significant degradation
• Matched conditions (TrS2 and TS1-TS14): significant degradation on a few matched conditions (TS3, 8, 9, 10, and 12 at 16 kHz; TS7 and 12 at 8 kHz)
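The DSR compression scheme is a split vector quantizer: feature components are grouped, each group is matched against a codebook, and only the indices are transmitted. A toy sketch with a made-up two-dimensional codebook (the real standard uses trained, sampling-frequency-specific codebooks, as noted above):

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))

def compress(features, codebook):
    """Replace each feature group by a codebook index (what is sent)."""
    return [nearest(f, codebook) for f in features]

def decompress(indices, codebook):
    """Lossy reconstruction on the server side."""
    return [codebook[i] for i in indices]

# Toy usage: two feature pairs quantized against a 3-entry codebook.
cb = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
idx = compress([(0.1, 0.2), (1.9, 2.1)], cb)  # -> [0, 2]
```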
Model Mismatch
• Perfectly-matched condition (TrS1 and TS1): Best performance
• Mismatched conditions (TrS1 and TS2-TS14): Significant degradations
• Matched conditions (TrS2 and TS1-TS14): Better than mismatched conditions
[Figure: WER (%) for TS1 (clean) and TS2-TS7, TrS1 vs. TrS2 models]
Microphone Variation

• Train on the Sennheiser mic.; evaluate on the secondary mic.
• Perfectly-matched condition (TrS1 and TS1): optimal performance
• Mismatched condition (TrS1 and TS8): significant degradation
• Matched conditions: less severe degradation when samples of the secondary microphone are seen during training

[Figure: WER (%) on the Sennheiser vs. secondary microphone for TrS1 and TrS2 models]
Additive Noise

• Mismatched conditions: performance degrades on noisy conditions when systems are trained only on clean data
• Matched conditions: exposing systems to noise and microphone variations (TrS2) improves performance

[Figure: WER (%) for TS1 (clean) and TS2-TS7, TrS1 vs. TrS2 models]
Summary and Conclusions
• Presented a WSJ0-based LVCSR system that runs at 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
• Reduced benchmarking time from 1034 days to 203 days
• Increasing the sampling frequency from 8 kHz to 16 kHz results in a significant improvement only on matched noisy test conditions
• Utterance detection resulted in significant improvements only on the noisy conditions for the mismatched training conditions
• VQ-based compression is robust in the DSR environment
• Exposing models to varied noise and microphone conditions improves recognition performance in adverse conditions
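For reference, the xRT (times real time) figures quoted above are simply processing time divided by audio duration:

```python
def xrt(processing_seconds, audio_seconds):
    """Real-time factor: seconds of compute per second of audio."""
    return processing_seconds / audio_seconds

# 15 xRT decoding: one hour of audio takes 15 hours to decode.
print(xrt(15 * 3600, 3600))  # 15.0
```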
Available Resources

• Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and a performance summary of the baseline MFCC front end
• Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
• ETSI DSR Website: reports and front-end standards
Brief Bibliography
• N. Parihar, Performance Analysis of Advanced Front Ends, M.S. Dissertation, Mississippi State University, December 2003.
• N. Parihar and J. Picone, "An Analysis of the Aurora Large Vocabulary Evaluation," Eurospeech 2003, pp. 337-340, Geneva, Switzerland, September 2003.
• N. Parihar and J. Picone, “DSR Front End LVCSR Evaluation - AU/384/02,” Aurora Working Group, European Telecommunications Standards Institute, December 06, 2002.
• D. Pearce, “Overview of Evaluation Criteria for Advanced Distributed Speech Recognition,” ETSI STQ-Aurora DSR Working Group, October 2001.
• G. Hirsch, “Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends in a Large Vocabulary Task,” ETSI STQ-Aurora DSR Working Group, December 2002.
• “ETSI ES 201 108 v1.1.2 Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithm,” ETSI, April 2000.