Recent work on Language Identification
Pietro Laface
POLITECNICO di TORINO
Brno, 28-06-2009
Team
POLITECNICO di TORINO
Pietro Laface Professor
Fabio Castaldo Post-doc
Sandro Cumani PhD student
Ivano Dalmasso Thesis Student
LOQUENDO
Claudio Vair Senior Researcher
Daniele Colibro Researcher
Emanuele Dalmasso Post-doc
Our technology progress
1. Inter-speaker compensation in feature space for GLDS/SVM models and GMMs (ICASSP 2007)
2. SVM using GMM super-vectors (GMM-SVM), introduced by MIT-LL for speaker recognition
3. Fast discriminative training of GMMs: an alternative to MMIE, exploiting the GMM-SVM separation hyperplanes (cf. MIT discriminative GMMs)
4. Language factors
GMM super‑vectors
Appending the mean values of all the Gaussians in a single stream we get a super-vector:

s = [μ_11 … μ_1p, μ_21 … μ_2p, …, μ_N1 … μ_Np]^T   (N Gaussians with p-dimensional means)

We use GMM super-vectors:
- without normalization, for inter-speaker/channel variation compensation
- with Kullback-Leibler normalization, for training GMM-SVM models
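As a minimal sketch of the KL normalization mentioned above (function and variable names are illustrative, not the authors' code), each Gaussian mean is scaled by sqrt(w_m)/σ_m before stacking, so that the dot product of two supervectors approximates a KL-based distance between the GMMs:

```python
import numpy as np

def kl_supervector(means, weights, variances):
    """Stack GMM means into a supervector with KL normalization.

    means: (M, D) Gaussian means, weights: (M,), variances: (M, D)
    diagonal covariances.  Each mean is scaled by sqrt(w_m) / sigma_m.
    """
    scale = np.sqrt(weights)[:, None] / np.sqrt(variances)  # (M, D)
    return (scale * means).ravel()                          # (M*D,)

# Toy example: 3 Gaussians with 2-dimensional means
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 2))
weights = np.array([0.5, 0.3, 0.2])
variances = np.ones((3, 2))
sv = kl_supervector(means, weights, variances)   # 6-dimensional supervector
```

With unit variances the scaling reduces to sqrt(w_m), which makes the normalization easy to check on toy data.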
Training GMM-SVM models Training Discriminative GMMs
Using a UBM in LID
1. The frame-based inter-speaker variation compensation approach estimates the inter-speaker compensation factors using the UBM.
2. In the GMM-SVM approach, all language GMMs share the weights and variances of the UBM.
3. The UBM is used for fast selection of Gaussians.
Speaker/channel compensation
in feature space
U is a low-rank matrix (estimated offline) projecting the speaker/channel factor subspace into the supervector domain.
x(i) is a low-dimensional vector, estimated using the UBM, holding the speaker/channel factors for the current utterance i.
γ_m(t) is the occupation probability of the m-th Gaussian at frame t. Each frame is compensated as:

ô_t = o_t − Σ_m γ_m(t) U_m x(i)
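The frame-level compensation can be sketched as follows (a toy implementation with illustrative names; the real system estimates Γ and x from the UBM):

```python
import numpy as np

def compensate_features(O, Gamma, U, x):
    """Feature-domain nuisance compensation (sketch).

    O:     (T, D) observation frames o_t
    Gamma: (T, M) Gaussian occupation probabilities gamma_m(t)
    U:     (M, D, R) low-rank subspace, one D x R block per Gaussian
    x:     (R,) speaker/channel factors for the current utterance
    Returns o_t - sum_m gamma_m(t) U_m x for every frame.
    """
    shifts = U @ x             # (M, D): per-Gaussian offset U_m x
    return O - Gamma @ shifts  # (T, D)

T, D, M, R = 4, 2, 3, 2
rng = np.random.default_rng(1)
O = rng.normal(size=(T, D))
Gamma = np.full((T, M), 1.0 / M)   # uniform occupations for the toy case
U = rng.normal(size=(M, D, R))
x = rng.normal(size=R)
O_hat = compensate_features(O, Gamma, U, x)
```

With uniform occupations the subtracted term is simply the average of the per-Gaussian offsets, which makes the sketch easy to verify.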
Estimating the U matrix
- Speaker recognition: estimating the U matrix with a large set of differences between models generated using different utterances of the same speaker, we compensate the distortions due to inter-session variability.
- Language recognition: estimating the U matrix with a large set of differences between models generated using utterances of different speakers of the same language, we compensate the distortions due to inter-speaker/channel variability within the same language.
GMM-SVM weakness
GMM-SVM models perform very well with rather long test utterances, but it is difficult to estimate a robust GMM from a short test utterance.
Idea: exploit the discriminative information given by the GMM-SVM for fast estimation of discriminative GMMs.
SVM discriminative directions
[Figure: class-separation hyperplanes with normal vectors w1, w2, w3]

w · x + b = 0

w: normal vector to the class-separation hyperplane
GMM discriminative training
Shift each Gaussian of a language model along its discriminative direction, given by the vector normal to the class‑separation hyperplane in the KL space
[Figure: shift of Gaussian means along discriminative directions w_k, shown both in feature space and in KL space, for an utterance GMM and a language GMM]

μ̂_k = μ_k + α w_k   (shift in KL space; mapped back to the feature space through the inverse KL normalization)
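A minimal sketch of the mean shift above (alpha is a hypothetical step size, not a value from the slides; the inverse KL scaling σ_m/sqrt(w_m) is assumed to map the KL-space direction back into feature space):

```python
import numpy as np

def shift_means(means, weights, variances, w_dirs, alpha=0.1):
    """Shift each Gaussian mean along its SVM discriminative direction.

    w_dirs: (M, D) per-Gaussian blocks of the hyperplane normal w in
    KL space.  The inverse KL scaling sigma_m / sqrt(w_m) undoes the
    supervector normalization so the shift applies in feature space.
    """
    inv_scale = np.sqrt(variances) / np.sqrt(weights)[:, None]
    return means + alpha * inv_scale * w_dirs

# Toy usage: 2 Gaussians, 3-dimensional means
means = np.zeros((2, 3))
weights = np.array([0.5, 0.5])
variances = np.ones((2, 3))
w_dirs = np.ones((2, 3))
shifted = shift_means(means, weights, variances, w_dirs, alpha=1.0)
```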
Experiments with 2048 GMMs

Pooled EER (%) of Discriminative 2048 GMMs and GMM-SVM on the NIST LRE tasks; in parentheses, the average of the per-language EERs.

Year    3s              10s            30s
1996    11.71 (13.71)   3.62 (4.92)    1.01 (1.37)
2003    13.56 (14.40)   5.50 (6.02)    1.42 (1.64)
2005    16.94 (17.85)   9.73 (11.07)   4.67 (5.81)

For comparison, 256-Gaussian MMI GMMs (Brno University, 2006 IEEE Odyssey) on the 2005 task: 17.1 (3s), 8.6 (10s), 4.6 (30s).
Pushed GMMs (MIT-LL)
Language Factors
Eigenvoice modeling, and the use of speaker factors as input features to SVMs, has recently been demonstrated to give good results for speaker recognition compared to the standard GMM-SVM approach (Dehak et al., ICASSP 2009).
Analogy
Estimate an eigen-language space, and use the language factors as input features to SVM classifiers (Castaldo et al., submitted to Interspeech 2009).

s = μ_UBM + V y   (V: eigen-language matrix; y: language factors)
Language Factors: advantages
- Language factors are low-dimensional vectors.
- Training and evaluating SVMs with different kernels is easy and fast: it requires only the dot product of normalized language factors.
- Using a very large number of training examples is feasible.
- Small models give good performance.
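Because the factors are low-dimensional, the SVM kernel reduces to a plain dot product of length-normalized vectors. A sketch (illustrative names and toy data):

```python
import numpy as np

def normalize(Y):
    """Length-normalize language factors (one factor vector per row)."""
    return Y / np.linalg.norm(Y, axis=-1, keepdims=True)

def linear_kernel(Y_a, Y_b):
    """Kernel between normalized language factors: a single matrix
    product, so SVM training and scoring stay cheap even with many
    training examples."""
    return normalize(Y_a) @ normalize(Y_b).T

rng = np.random.default_rng(2)
Y_train = rng.normal(size=(100, 20))   # 100 training factors, 20 dimensions
Y_test = rng.normal(size=(5, 20))
K = linear_kernel(Y_test, Y_train)     # (5, 100) cosine similarities
```

The entries of K are cosine similarities, bounded by 1 in absolute value, which is what makes large training sets feasible.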
Toward an eigen-language space
After compensating the nuisances of a GMM adapted from the UBM using a single utterance, residual information about the channel and the speaker remains. Most of the undesired variation, however, is removed, as demonstrated by the improvements obtained with this technique.
Speaker compensated eigenvoices
First approach
Estimating the principal directions of the GMM supervectors of all the training segments before inter-speaker nuisance compensation would produce a set of language independent, “universal” eigenvoices.
After nuisance removal, however, the speaker contribution to the principal components is reduced to the benefit of language discrimination.
Eigen-language space
Second approach
Computing the differences between the GMM supervectors obtained from utterances of a polyglot speaker would compensate the speaker characteristics and enhance the acoustic components of one language with respect to the others.
Since we do not have labeled databases including polyglot speakers, we compute and collect the differences between GMM supervectors produced by utterances of speakers of two different languages, irrespective of the speaker identity, already compensated in the feature domain.
Eigen-language space
The number of these differences would grow with the square of the number of utterances in the training set. Instead, we perform Principal Component Analysis on the set of differences between the supervectors of a language and the average supervector of every other language.
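The PCA step can be sketched as follows (function and variable names are illustrative; real supervectors are far higher-dimensional):

```python
import numpy as np

def eigen_language_basis(supervectors, labels, n_components=5):
    """PCA on the differences between each language's supervectors and
    the average supervector of every other language (second approach)."""
    langs = sorted(set(labels))
    avg = {l: np.mean([s for s, y in zip(supervectors, labels) if y == l],
                      axis=0)
           for l in langs}
    diffs = [s - avg[l]
             for s, y in zip(supervectors, labels)
             for l in langs if l != y]
    D = np.asarray(diffs)
    D -= D.mean(axis=0)
    # Principal directions = leading right singular vectors
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:n_components]

rng = np.random.default_rng(3)
svs = rng.normal(size=(12, 10))     # 12 toy utterance supervectors
labels = ['en', 'fr', 'ru'] * 4     # 3 languages
basis = eigen_language_basis(svs, labels, n_components=4)
```

Note the set of differences grows only linearly in the number of utterances (one difference per utterance per other language), matching the motivation above.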
Training corpora
The same used for the LRE07 evaluation:
- All data of the 12 languages in the CallFriend corpus
- Half of the NIST LRE07 development corpus
- Half of the OHSU corpus provided by NIST for LRE05
- The Russian through Switched Telephone Network corpus (automatic segmentation)
LRE07 30s closed-set test
The language factors' minDCF is always better and more stable.
Pushed GMMs (MIT-LL)
Pushed eigen-language GMMs
g_positive = ( Σ_{i: α_i > 0} α_i g_i ) / ( Σ_{i: α_i > 0} α_i )
g_negative = ( Σ_{i: α_i < 0} |α_i| g_i ) / ( Σ_{i: α_i < 0} |α_i| )

For the eigen-language GMMs each supervector is g_i = μ_UBM + V y_i, so the pushed models are obtained directly from the language factors:

g_positive = μ_UBM + V ( Σ_{i: α_i > 0} α_i y_i ) / ( Σ_{i: α_i > 0} α_i )
g_negative = μ_UBM + V ( Σ_{i: α_i < 0} |α_i| y_i ) / ( Σ_{i: α_i < 0} |α_i| )

The same approach is used to obtain discriminative GMMs from the language factors.
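A toy sketch of the pushed eigen-language model (names and the exact support-vector weighting are assumptions based on the slide, not the authors' code):

```python
import numpy as np

def pushed_supervector(mu_ubm, V, Y_sv, alphas, positive=True):
    """Combine the language factors of the SVM support vectors with
    positive (or negative) weights alpha_i, then map the weighted
    average back through the eigen-language matrix V:
        g = mu_ubm + V @ y_bar
    """
    mask = alphas > 0 if positive else alphas < 0
    a = np.abs(alphas[mask])
    y_bar = (a[:, None] * Y_sv[mask]).sum(axis=0) / a.sum()
    return mu_ubm + V @ y_bar

rng = np.random.default_rng(4)
mu_ubm = np.zeros(8)
V = rng.normal(size=(8, 3))               # toy eigen-language matrix
Y_sv = rng.normal(size=(6, 3))            # support-vector language factors
alphas = np.array([0.5, -0.2, 0.3, -0.7, 0.1, -0.4])
g_pos = pushed_supervector(mu_ubm, V, Y_sv, alphas, positive=True)
```

Working in the factor space first and expanding through V at the end keeps the computation in low dimension until the final supervector is needed.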
Min DCFs and (% EER)

Models                                      30s            10s            3s
GMM-SVM (KL kernel)                         0.029 (3.43)   0.085 (9.12)   0.201 (21.3)
GMM-SVM (Identity kernel)                   0.031 (3.72)   0.087 (9.51)   0.200 (21.0)
LF-SVM (KL kernel)                          0.026 (3.13)   0.083 (9.02)   0.186 (20.4)
LF-SVM (Identity kernel)                    0.026 (3.11)   0.083 (9.13)   0.187 (20.4)
Discriminative GMMs                         0.021 (2.56)   0.069 (7.49)   0.174 (18.45)
LF-Discriminative GMMs (KL kernel)          0.025 (2.97)   0.084 (9.04)   0.186 (19.9)
LF-Discriminative GMMs (Identity kernel)    0.025 (3.05)   0.084 (9.05)   0.186 (20.0)
Loquendo-Polito LRE09 System

[Diagram: acoustic features feeding SVM-GMMs, Pushed GMMs, and MMIE GMMs; three phonetic transcribers, each producing n-gram counts fed to a TFLLR SVM; all streams feeding model training]

Acoustic subsystems: SVM-GMMs, Pushed GMMs, MMIE GMMs.
Phonetic subsystems: phonetic transcriber → n-gram counts → TFLLR SVM.
Phone transcribers
12 phone transcribers for French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English.
The statistics of the n-gram phone occurrences are collected from the best decoded string of each conversation segment.
ASR recognizer: phone-loop grammar with diphone transition constraints.
Phone transcribers
10 phone transcribers for Catalan, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English.
The statistics of the n-gram phone occurrences are collected from the expected counts over a lattice for each conversation segment.
ANN models: same phone-loop grammar, different engine.
Multigrams
Two different TFLLR kernels: trigrams, and pruned multigrams.
Multigrams can provide useful information about the language by capturing "word parts" within the string sequences.
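A sketch of the TFLLR weighting applied to the n-gram counts before the SVM (the exact normalization used by the system is not given in the slides; names are illustrative):

```python
import numpy as np

def tfllr_features(counts, background_probs):
    """TFLLR-weighted n-gram features.

    counts: raw n-gram counts for one segment
    background_probs: n-gram probabilities over the whole training set
    Each relative frequency is scaled by 1/sqrt(p_i), so frequent
    n-grams do not dominate the SVM dot product.
    """
    counts = np.asarray(counts, dtype=float)
    freqs = counts / counts.sum()
    return freqs / np.sqrt(np.asarray(background_probs, dtype=float))

feats = tfllr_features([2, 2], [0.5, 0.5])
```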
Scoring
The total number of models used for scoring an unknown segment is 34: 11 channel-dependent models (11 × 2 = 22) plus 12 single-channel models (2 telephone-only and 10 broadcast-only).
For the MMIE GMMs: 23 × 2 = 46 models (channel-independent, but gender-dependent M/F).
Calibration and fusion

[Diagram: five subsystems — Pushed GMMs, MMIE GMMs, 1-best 3-grams SVMs, 1-best n-grams SVMs, Lattice n-grams SVMs — each producing 34 scores (46 for the MMIE GMMs, reduced to 23 by taking the max of the channel-dependent scores); each stream feeds a Gaussian back-end; the back-end outputs are fused with Multi-class FoCal, followed by LLR max / lre_detection]
Language pair recognition
For the language-pair evaluation, only the back-ends have been re-trained, keeping the models of all the sub-systems unchanged.
Telephone development corpora
• CALLFRIEND: conversations split into slices of 150 s
• NIST 2003 and NIST 2005 evaluation data
• LRE07 development corpus
• Cantonese and Portuguese data in the OGI 22-language corpus
• RuSTeN: the Russian through Switched Telephone Network corpus
"Broadcast" development corpora
Incrementally created to include, as far as possible, the variability within a language due to channel, gender and speaker differences.
The development data, further split into training, calibration and test subsets, should cover the mentioned variability.
Problems with LRE09 dev data
• Often segments from the same speaker; scarcity of segments for some languages after filtering same-speaker segments
• Genders are not balanced
• Excluding French, the segments of each language are either telephone or broadcast
• No audited data available for Hindi, Russian, Spanish and Urdu on VOA3; only automatic segmentation was provided
• No segmentation was provided in the first release of the development data for Cantonese, Korean, Mandarin, and Vietnamese
• For these 8 missing languages only the language hypotheses provided by BUT were available for the VOA2 data
Additional "audited" data
For the 8 languages lacking broadcast data, segments have been generated by accessing the VOA site and looking for the original MP3 files.
Goal: collect ~300 broadcast segments per language, processed to detect narrowband fragments.
The candidates were checked to eliminate segments including music, bad channel distortions, and fragments of other languages.
Development data for bootstrap models
Telephone and audited/checked broadcast data were split into Training (50%), Development (25%) and Test (25%) sets, so that same-speaker segments were included in the same set. A set of acoustic (pushed GMMs) bootstrap models has been trained.
Additional not-audited data from VOA3
Preliminary tests with the bootstrap models indicated the need for additional data, selected from VOA3 to include new speakers in the training, calibration and test sets, assuming that the file label correctly identifies the corresponding language.
Speaker selection
Performed by means of a speaker recognizer
We process the audited segments before the others
A new speaker model is added to the current set of speaker models whenever the best recognition score obtained by a segment is less than a threshold
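The selection rule above amounts to greedy open-set speaker labeling. A toy sketch (score_fn is a stand-in for the actual speaker recognizer; names are illustrative):

```python
import numpy as np

def select_speakers(segments, score_fn, threshold):
    """Greedy open-set speaker labeling.

    A segment starts a new speaker model whenever its best recognition
    score against the current set of models is below the threshold;
    otherwise it is assigned to the best-scoring existing speaker.
    """
    models, labels = [], []
    for seg in segments:
        scores = [score_fn(m, seg) for m in models]
        if not scores or max(scores) < threshold:
            models.append(seg)                 # new speaker model
            labels.append(len(models) - 1)
        else:
            labels.append(int(np.argmax(scores)))
    return labels

# Toy usage: 1-D "segments", score = negative distance
labels = select_speakers([0.0, 0.1, 5.0, 5.05],
                         lambda m, s: -abs(m - s), threshold=-1.0)
```

On the toy data the first two segments collapse onto one speaker and the last two onto another.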
Enriching the training set
Language recognition has been performed using a system combining the acoustic bootstrap models and a phonetic system. A segment has been selected only if the 1-best language hypothesis of our system had a score greater than a given (rather high) threshold and matched the 1-best hypothesis provided by the BUT system.
Additional not-audited data from VOA2
Total number of segments for this evaluation:

Set              voa3_A  voa2_A  ftp_C  voa3_S  voa2_S  ftp_S
Train               529     116    316    1955     590     66
Extended train      114      22     65    2483     574    151
Development         396      85    329    1866     449     45

Suffixes: A = audited, C = checked, S = automatic segmentation
ftp: ftp://8475.ftp.storage.akadns.net/mp3/voa/
[Figures: Decision Cost Function (DCF) curves for Hausa and Hindi]
Results on the development set

Average minDCF × 100 on 30s test segments:

Test on                 Pushed GMMs  MMIE GMMs  3-grams  Multi-grams  Lattice  Fusion
Broadcast & telephone          1.48       1.70     1.09         1.12     1.06    0.86
Broadcast subset               1.54       1.69     1.24         1.26     1.14    0.91
Telephone subset               2.00       2.51     1.45         1.49     1.42    1.21
[Figure: Korean score cumulative distributions for the broadcast/telephone train-test conditions b-b, t-t, t-b, b-t]