Environmentally robust ASR front end for DNN-based acoustic models
• Do not compare results across different tables! – Configurations may differ
• Most results shown here can be found in:
Takuya Yoshioka and Mark J. F. Gales, “Environmentally robust ASR front-end for deep neural network acoustic models,” Computer Speech and Language, vol. 31, no. 1, pp. 65-86, May 2015
1. Motivation
2. Corpus
   • AMI meeting corpus
3. Baseline systems
   • SI and SAT set-ups
4. Assessment of environmental robustness of DNN acoustic models
5. Front-end techniques
6. Combined effects
Little investigation done
• Multi-party interaction
  – 4 participants in each meeting
• Multi-channel recordings
  – Distant microphones (only the first channel used)
  – Head-set & lapel microphones
• 2 recording set-ups
  – 70 h of scenario-based meetings
  – 30 h of real meetings
• Different rooms
• Multiple sources of distortion
  – Reverberation
  – Additive noise
  – Overlapping speech
• Moving speakers
• Many non-native speakers
• SI: speaker independent
  – For online transcription
  – DNN-HMM hybrid
• SAT: speaker adaptive training
  – For offline transcription
  – MLP tandem
• Manual segmentations used
• Overlapping segments ignored
State output distributions modelled with
– GMM or
– DNN

$$p(\mathbf{X} \mid \mathbf{q}) = P(q_0) \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, p(\mathbf{x}_t \mid q_t)$$

GMM:

$$p(\mathbf{x}_t \mid j) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(\mathbf{x}_t;\, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})$$

DNN (pseudo-likelihood):

$$p(\mathbf{x}_t \mid j) \propto \frac{p(j \mid \mathbf{x}_t)}{P(j)}$$
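In the hybrid case, decoding needs likelihoods rather than posteriors. A minimal numpy sketch of the pseudo-likelihood conversion p(x_t|j) ∝ p(j|x_t)/P(j); the function name and toy values are illustrative:

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, state_priors, floor=1e-10):
    """Convert DNN state posteriors p(j|x_t) into scaled likelihoods
    p(x_t|j) ∝ p(j|x_t) / P(j), as used in hybrid DNN-HMM decoding.

    posteriors   : (T, J) array, rows sum to 1
    state_priors : (J,) array, P(j) estimated from alignment counts
    """
    priors = np.maximum(state_priors, floor)
    # work in the log domain for numerical stability during decoding
    return np.log(np.maximum(posteriors, floor)) - np.log(priors)

# toy example: 2 frames, 3 states
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
prior = np.array([0.5, 0.3, 0.2])
loglik = posteriors_to_scaled_likelihoods(post, prior)
```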
• Discriminative pre-training
• Cross-entropy fine-tuning
• Trained on a Tesla K20 GPU
• cuBLAS 5.5 used
• Mini-batch size: 800 frames
• Learning rate: “newbob” scheduling
• 10% held-out data for cross-validation (CV)
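The “newbob” scheme holds the learning rate until the cross-validation improvement stalls, then halves it each epoch and stops when gains vanish. A sketch with illustrative threshold and rate values (the actual settings are not given in the slides):

```python
def newbob(init_lr=0.002, ramp_thresh=0.005, stop_thresh=0.0001):
    """'newbob' learning-rate scheduling sketch: keep the rate fixed
    until the relative CV-error improvement drops below ramp_thresh,
    then halve it every epoch; return 0.0 (stop) once the improvement
    falls below stop_thresh.  All values here are illustrative."""
    lr = init_lr
    ramping = False
    prev_err = None

    def step(cv_err):
        nonlocal lr, ramping, prev_err
        if prev_err is not None:
            rel_gain = (prev_err - cv_err) / prev_err
            if ramping:
                lr = 0.0 if rel_gain < stop_thresh else lr * 0.5
            elif rel_gain < ramp_thresh:
                ramping = True   # improvement stalled: start halving
                lr *= 0.5
        prev_err = cv_err
        return lr

    return step

sched = newbob()
rates = [sched(e) for e in [0.50, 0.45, 0.448, 0.447, 0.4469]]
```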
| System | Parameterisation | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| MPE GMM-HMM | HLDA | 54.7 | 55.6 | 55.2 |
| DNN-HMM hybrid | FBANK | 43.5 | 42.6 | 43.1 |
| This work | | 40.0 | 39.3 | 39.7 |
| Data set | Parameterisation | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| SDM | FBANK | 43.5 | 42.6 | 43.1 |
| IHM | FBANK | 28.2 | 24.6 | 26.4 |

• 39.2% of the errors caused by acoustic distortion
• DNN-HMMs not so robust
• Discriminative pre-training
• Cross-entropy fine-tuning
| Alignment | DNN input | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| SDM | IHM | 30.6 | 27.0 | 28.8 |
| IHM | SDM | 41.8 | 40.8 | 41.3 |
| IHM | SDM | 41.7 | 40.6 | 41.2 |

(The last row uses a 648–2,000×5–4,000 DNN.)

DNN training is more sensitive to noise than state alignment.
Speech enhancement
Feature transformation
Multi-stream features
Speech enhancement
Feature transformation
Multi-stream features
• Previous work
  – Beamforming yields gains
  – No investigation on single-microphone algorithms
• Based on linear, (almost) time-invariant filters
• Applied to complex-valued STFT coefficients
• The filters are automatically adjusted from the observations
  – WPE for 1-ch dereverberation (NTT’s work)
  – BeamformIt for denoising (ICSI’s work); 8 microphones used, dedicated to meetings
• Unlikely to produce irregular transitions
$$y_{f,t} = x_{f,t} - \sum_{k=T_0}^{T_1} g_{f,k}^{*}\, x_{f,t-k}$$
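Dereverberation of this kind reduces to applying a per-frequency linear filter to the STFT coefficients, y_{f,t} = x_{f,t} − Σ_k g*_{f,k} x_{f,t−k}. A sketch of the application step, assuming the taps G have already been estimated (the iterative estimation is the heart of WPE and is omitted; names are illustrative):

```python
import numpy as np

def apply_dereverb_filter(X, G, delay=3):
    """Apply a fixed per-frequency linear filter to STFT coefficients:
        y[f, t] = x[f, t] - sum_k conj(g[f, k]) * x[f, t - delay - k]
    X : (F, T) complex STFT of the observed signal
    G : (F, K) complex filter taps (assumed already estimated)
    The delay keeps early reflections intact, as in WPE.
    """
    F, T = X.shape
    K = G.shape[1]
    Y = X.copy()
    for k in range(K):
        d = delay + k
        # subtract the filtered, delayed observation frame by frame
        Y[:, d:] -= np.conj(G[:, k:k + 1]) * X[:, :T - d]
    return Y

# zero taps leave the signal untouched
X = np.arange(12, dtype=complex).reshape(2, 6)
Y = apply_dereverb_filter(X, np.zeros((2, 2), dtype=complex))
```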
| Alignment | Dev: SDM | Dev: +Derev | Dev: BFIt (8 mics) | Eval: SDM | Eval: +Derev | Eval: BFIt (8 mics) |
|---|---|---|---|---|---|---|
| MPE | 43.8 | 41.8 | 38.6 | 43.0 | 41.3 | 36.6 |
| Hybrid | 43.5 | 41.7 | 38.8 | 43.3 | 41.4 | 36.7 |

• Dereverberation helps even with a single microphone
• Multi-microphone beamforming works well
| DNN size | Context frames | Dev: SDM | Dev: +Derev | Eval: SDM | Eval: +Derev |
|---|---|---|---|---|---|
| 1,000 × 5 | 9 | 43.8 | 41.8 | 43.0 | 41.3 |
| 1,500 × 5 | 9 | 43.5 | 42.0 | 42.6 | 41.1 |
| 1,500 × 5 | 13 | 42.8 | 41.8 | 42.9 | 41.2 |
| 1,500 × 5 | 19 | 43.0 | 41.7 | 42.9 | 41.2 |
| 2,000 × 5 | 9 | 43.8 | 41.3 | 42.9 | 40.4 |

4.7% relative gain from 1-ch dereverberation
Speech enhancement
Feature transformation
Multi-stream features
No positive results reported previously
• Applied to magnitude spectra
• Cross terms (often) ignored
• Frame-by-frame modification
  – Harmful for DNNs?
• Noise estimated using long-term statistics
  – IMCRA (used here), minimum statistics, etc.
• Deltas computed from un-enhanced speech
  – Essential for obtaining gains

$$|y_{f,t}|^2 = |x_{f,t}|^2 + |n_{f,t}|^2$$
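Under the mismatch |y|² = |x|² + |n|², the enhancement step amounts to power-domain subtraction with flooring. A minimal sketch in which the noise PSD is assumed to be supplied by a long-term tracker such as IMCRA (the tracker itself is omitted, and the floor value is illustrative):

```python
import numpy as np

def spectral_subtraction(power_spec, noise_psd, floor=1e-3):
    """Power-domain spectral subtraction: |x|^2 ≈ |y|^2 - |n|^2,
    floored at a small fraction of the observed power so the result
    never goes negative.  noise_psd is assumed to come from a
    long-term tracker (IMCRA, minimum statistics, ...).
    power_spec : (F, T) observed power spectra
    noise_psd  : noise power estimate, broadcastable to power_spec
    """
    return np.maximum(power_spec - noise_psd, floor * power_spec)

# one frame, two bins: the second bin hits the floor
enhanced = spectral_subtraction(np.array([[4.0, 1.0]]),
                                np.array([[1.0, 2.0]]))
```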
• Applied to FBANK features
• The following mismatch function used:

$$\mathbf{y}_t = \mathbf{x}_t + \mathbf{h} + \log\left(1 + \exp(\mathbf{n}_t - \mathbf{x}_t - \mathbf{h})\right)$$

• Frame-by-frame modification
• Noise model estimated with EM
• Deltas computed from un-enhanced speech
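The mismatch function y = x + h + log(1 + exp(n − x − h)), with h the channel offset and n the noise in the log-filter-bank domain, can be evaluated stably in numpy as a quick sketch:

```python
import numpy as np

def mismatch(x, n, h):
    """VTS-style log-filter-bank mismatch function:
        y = x + h + log(1 + exp(n - x - h))
    x: clean feature, n: noise, h: channel offset (per bin).
    np.logaddexp(0, .) computes log(1 + exp(.)) without overflow."""
    return x + h + np.logaddexp(0.0, n - x - h)
```

When the noise is far below the speech the observation tends to x + h; when the noise dominates it tends to n, which the assertions below check at the extremes.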
| Spectrum enh. | Feature enh. | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| N | N | 42.0 | 41.1 | 41.6 |
| Y | N | 41.3 | 40.9 | 41.1 |
| N | Y | 41.4 | 40.5 | 41.0 |
| Y | Y | 42.0 | 41.0 | 41.5 |

• Small consistent gains
• Different methods should not be cascaded
| Spectrum enh. | Feature enh. | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| N | N | 42.0 | 41.1 | 41.6 |
| Y | N | 41.3 | 40.9 | 41.1 |
| N | Y | 41.4 | 40.5 | 41.0 |
| Y | Y | 42.0 | 41.0 | 41.5 |
| Y | Y (multi-stream) | 41.4 | 40.4 | 40.9 |
Speech enhancement
Feature transformation
Multi-stream features
• Frame level
  – FMPE, RDT, FE-CMLLR
  – Seems to be subsumed by DNNs
• Speaker (or environment) level
  – Global CMLLR, LIN, fDLR, VTLN
  – Multiple decoding passes required → SAT
• Utterance level
  – Single-pass decoding → SI
• Seems robust against supervision errors
• STC transform used to deal with correlations:

$$\mathbf{y}_t^{(s)} = \mathbf{A}^{(s)} \mathbf{L}\, \mathbf{x}_t + \mathbf{b}^{(s)}$$
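Applying the transform y_t = A^(s) L x_t + b^(s) is a single affine map per speaker. A minimal sketch (estimating A^(s) and b^(s) by maximising the CMLLR auxiliary function is omitted; names are illustrative):

```python
import numpy as np

def apply_cmllr(x, A_s, b_s, L):
    """y_t = A^(s) (L x_t) + b^(s): a global STC decorrelating
    transform L followed by the speaker-specific affine transform.
    x   : (T, D) features
    A_s : (D, D), b_s : (D,)  speaker-specific transform
    L   : (D, D)              global STC transform
    """
    return (x @ L.T) @ A_s.T + b_s
```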
| Form of speaker transform | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|
| None (SI) | 42.6 | 40.2 | 41.4 |
| Full | 37.4 | 37.4 | 37.4 |
| Block diagonal | 37.3 | 36.6 | 37.0 |

• ~10% relative gains obtained
• “Block diagonal” outperforms “full”
| Data set | Form of speaker transform | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| SDM | None (SI) | 42.6 | 40.2 | 41.4 |
| SDM | Full | 37.4 | 37.4 | 37.4 |
| SDM | Block diagonal | 37.3 | 36.6 | 37.0 |
| IHM | None (SI) | 27.8 | 24.2 | 26.0 |
| IHM | Full | 23.8 | 21.6 | 22.7 |
$$\mathbf{y}_t^{(c(u))} = \mathbf{A}^{(c(u))} \mathbf{L}\, \mathbf{x}_t + \mathbf{b}^{(c(u))}$$

where $c(u)$ denotes the cluster assigned to utterance $u$.

Clustering performed using:
– Utterance-specific iVectors
– K-means (a GMM yielded similar performance figures)
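A sketch of the clustering step: plain k-means over utterance iVectors, after which utterance u is decoded with the transform of its cluster c(u). iVector extraction itself is assumed done, and all names and sizes are illustrative:

```python
import numpy as np

def kmeans_ivectors(ivecs, n_clusters=32, n_iter=20, seed=0):
    """Cluster utterance-specific iVectors with plain k-means.
    ivecs : (U, D) array of iVectors, one row per utterance.
    Returns (assignments c(u), centroids)."""
    rng = np.random.default_rng(seed)
    centroids = ivecs[rng.choice(len(ivecs), n_clusters, replace=False)]
    c = np.zeros(len(ivecs), dtype=int)
    for _ in range(n_iter):
        # assign each utterance to its nearest centroid
        d = np.linalg.norm(ivecs[:, None, :] - centroids[None], axis=-1)
        c = d.argmin(axis=1)
        # re-estimate non-empty clusters
        for j in range(n_clusters):
            if np.any(c == j):
                centroids[j] = ivecs[c == j].mean(axis=0)
    return c, centroids
```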
Subspace representation of the deviation from the UBM:

$$\mathbf{m}^{(u)} = \mathbf{m}^{(0)} + \mathbf{T}\, \mathbf{w}^{(u)}$$

where $\mathbf{m}^{(0)}$ is the UBM mean supervector, $\mathbf{T}$ spans the variability subspace, and $\mathbf{w}^{(u)}$ is the utterance iVector.
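For illustration only, a least-squares point estimate of w in m(u) = m(0) + T w(u); the proper iVector is the posterior mean given Baum–Welch statistics, which this simplification ignores:

```python
import numpy as np

def ivector_point_estimate(m_u, m_0, T):
    """Least-squares fit of the supervector deviation m(u) - m(0)
    onto the columns of the total-variability matrix T, giving a
    crude point estimate of w(u).  (A real iVector extractor uses
    per-Gaussian occupancies and covariances, omitted here.)"""
    w, *_ = np.linalg.lstsq(T, m_u - m_0, rcond=None)
    return w
```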
| Data set | #Clusters | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| SDM | No QCMLLR | 41.9 | 40.9 | 41.4 |
| SDM | 64 | 41.0 | 40.4 | 40.7 |
| SDM | 32 | 41.0 | 40.0 | 40.5 |
| SDM | 16 | 41.5 | 40.5 | 41.0 |
| IHM | No QCMLLR | 27.8 | 24.2 | 26.0 |
| IHM | 32 | 26.9 | 23.5 | 25.2 |
• Using 32 clusters yielded the best performance
• Similar gains on both SDM and IHM
Speech enhancement
Feature transformation
Multi-stream features
• Originally proposed by Aachen for shallow MLP tandem configurations
• Exploits DNN’s insensitivity to the increase in input dimensionality
• (Hopefully) complement features masked by noise
• Allows multiple enhancement results to be combined
• Four types of auxiliary features investigated:
  – MFCC (+Δ/Δ²)
  – PLP
  – Gammatone cepstra
    • Different frequency warping
    • STFT not used
  – Intra-frame delta
    • Emphasises spectral peaks/dips
| Feature set | #features | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| FBANK+Δ+Δ² (baseline) | 72 | 41.9 | 40.9 | 41.4 |
| +PLP | 85 | 40.7 | 40.3 | 40.5 |
| +Gammatone | 88 | 40.8 | 40.0 | 40.4 |
| +MFCC | 85 | 41.1 | 39.7 | 40.4 |
| +MFCC+Δ+Δ² | 111 | 40.6 | 40.2 | 40.4 |
| +intra-frame Δ+Δ² | 120 | 40.9 | 39.8 | 40.4 |
| +MFCC+intra-frame Δ+Δ² | 133 | 40.4 | 39.8 | 40.1 |
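The multi-stream input is just a frame-wise concatenation of streams. A sketch producing the 85-dimensional FBANK(+Δ+Δ²)+MFCC configuration of the table; the delta window length and function names are assumptions:

```python
import numpy as np

def deltas(feats, window=2):
    """Standard regression deltas over +/-window frames.
    feats: (T, D) feature matrix."""
    feats = np.asarray(feats, dtype=float)
    T = len(feats)
    padded = np.pad(feats, ((window, window), (0, 0)), mode='edge')
    num = np.zeros_like(feats)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    for k in range(1, window + 1):
        num += k * (padded[window + k:window + k + T]
                    - padded[window - k:window - k + T])
    return num / denom

def multi_stream(fbank, mfcc):
    """Concatenate FBANK plus its dynamic features with an auxiliary
    MFCC stream, relying on the DNN's tolerance of high-dimensional
    inputs.  fbank: (T, 24), mfcc: (T, 13) -> (T, 85)."""
    return np.hstack([fbank, deltas(fbank), deltas(deltas(fbank)), mfcc])

feats = multi_stream(np.random.default_rng(0).normal(size=(10, 24)),
                     np.random.default_rng(1).normal(size=(10, 13)))
```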
• Speech enhancement
  – Linear filtering
  – Spectral/feature enhancement
• Feature transformation
  – Quantised CMLLR
  – (Global CMLLR for SAT)
• Multi-stream features
| Front-end | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|
| FBANK baseline | 43.1 | 42.4 | 42.8 |
| +WPE | 41.8 | 40.7 | 41.3 |
| +MFCC+intra-frame Δ+Δ² | 40.5 | 40.1 | 40.3 |
| +IMCRA+FE-VTS | 40.0 | 39.3 | 39.7 |
| +QCMLLR | 40.9 | 39.5 | 40.2 |

• Effects additive except for QCMLLR
• QCMLLR may work if applied to the entire feature set
| System | Parameterisation | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|---|
| SAT GMM-HMM (MPE trained) | HLDA | 48.8 | 50.2 | 49.5 |
| SAT tandem (MPE trained) | FBANK | 40.7 | 40.9 | 40.8 |
| SI hybrid | FBANK | 43.5 | 42.6 | 43.1 |

• Outperforms the SAT GMM-HMM
• Outperforms the SI hybrid
| Front-end | %WER (Dev) | %WER (Eval) | %WER (Avg) |
|---|---|---|---|
| FBANK baseline | 40.1 | 41.3 | 40.7 |
| +WPE | 38.9 | 39.3 | 39.1 |
| +MFCC | 38.5 | 38.5 | 38.5 |
| +IMCRA+FE-VTS | 38.4 | 38.7 | 38.6 |
| +CMLLR | 36.6 | 36.7 | 36.7 |
| +CMLLR | 36.9 | 37.0 | 37.0 |
| +CMLLR | 38.4 | 38.6 | 38.5 |

• Effects of WPE and CMLLR are additive
• Using auxiliary features yields small gains over CMLLR features
• Denoising subsumed by CMLLR (as expected)
• Front-end processing approaches yield gains over state-of-the-art DNN-based acoustic models
  – Linear filtering (WPE, BeamformIt)
  – Spectral/feature enhancement (IMCRA, FE-VTS)
  – Feature transformation (QCMLLR, CMLLR)
  – Multi-stream features
• Possible to combine different classes of approaches