JHU-HLTCOE system for VoxSRC 2019
Daniel Garcia-Romero, Alan McCree, David Snyder, and Gregory Sell
Human Language Technology Center Of Excellence
Johns Hopkins University
VoxSRC characteristics
• Unlike recent NIST SRE setups:
  • No explicit domain shift
  • Pair-wise comparison of relatively short segments
    • VoxCeleb1-E validation set has a median segment duration of 7 seconds
• These characteristics:
  • Minimize the need for an external classifier:
    • Not necessary to perform domain adaptation
    • No need to aggregate information from multiple long recordings (e.g., 5 minutes)
  • Open the door for end-to-end approaches:
    • DNNs that output direct pair-wise scores (trained with binary cross-entropy)
    • However, training is cumbersome (e.g., high sensitivity to sample harvesting and unstable convergence)
Our strategy
• Train embedding extractors using a classification loss
  • Categorical/multi-class cross-entropy
• Two modules:
  • Embedding extractor (x-vectors)
    • E-TDNN
    • F-TDNN
  • Classification head
    • Surrogate classification task to produce discriminative embeddings
    • We explored 4 alternatives: Softmax, Angular-Margin Softmax, PLDA Softmax, Meta-learning Softmax
[Figure: System overview. An x-vector extractor (1D conv layers (TDNN), a temporal statistics layer computing mean and variance, and a bottleneck) maps the input to a fixed-length embedding; a classification head maps the embedding to per-speaker scores s1 … sN.]
Three stages
[Figure: The x-vector extractor operates in three stages: (1) a 1D-conv stack (TDNN) maps the input feature sequence, layer by layer, to frame-level representations over time; (2) a temporal statistics layer pools these over time (mean, variance); (3) an affine + LeakyReLU bottleneck produces the embedding, followed by an affine + softmax layer yielding class scores s1 … sN. Two TDNN topologies were explored: Extended-TDNN (E-TDNN) and Factorized-TDNN (F-TDNN).]
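The temporal statistics pooling layer (mean, variance) can be sketched as follows; a minimal NumPy version, assuming frame-level TDNN outputs arrive as a (T, D) array (the slides say variance; Kaldi-style recipes typically concatenate mean and standard deviation, which is what this sketch does):

```python
import numpy as np

def stats_pooling(frames):
    """Pool frame-level TDNN outputs (T x D) into a fixed-length vector
    by concatenating the per-dimension mean and standard deviation."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# e.g. 400 frames of 1500-dim TDNN outputs -> 3000-dim pooled vector,
# matching the ~3000-dim input to the bottleneck in the slides
pooled = stats_pooling(np.random.randn(400, 1500))
```

Because the pooled vector has a fixed size regardless of T, the layers after it can be ordinary affine layers.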
Classification heads
[Figure: Both heads sit on top of the TDNN layers, a temporal pooling layer (mean, variance; ~3000 dim), and a 256-dim bottleneck. The traditional softmax head applies LeakyReLU then affine + softmax; the angular-margin head applies length-norm then angular-margin softmax, both over 256-dim embeddings.]

Traditional Softmax
• Affine layer (class weights)
• Inner products with the x-vector
• Use an external classifier (PLDA)

Angular Softmax with margin penalty
• Length-normalized linear layer
• Cosine scores with the x-vector as logits
• Cosine scoring or PLDA
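A minimal NumPy sketch of the angular-margin softmax logits (names and shapes are illustrative, not the authors' code): length-normalizing both the embeddings and the class weights turns inner products into cosines, the margin `m` is subtracted from the target-class cosine during training, and `s` rescales the logits before the softmax.

```python
import numpy as np

def am_softmax_logits(emb, weights, labels=None, m=0.3, s=30.0):
    """Cosine logits with an additive margin penalty on the target class.

    emb:     (batch, dim) embeddings
    weights: (dim, n_classes) class-weight matrix
    labels:  (batch,) target class indices (training only)
    """
    # length-normalize embeddings and class weights -> inner product = cosine
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    weights = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = emb @ weights
    if labels is not None:
        # penalize the target-class cosine by the margin m
        cos[np.arange(len(labels)), labels] -= m
    return s * cos  # scaled logits, fed to softmax + cross-entropy

logits = am_softmax_logits(np.random.randn(8, 256),
                           np.random.randn(256, 100),
                           labels=np.zeros(8, dtype=int))
```

Because the loss already operates on cosines, the resulting embeddings can be compared at test time with plain cosine scoring, as the slides note.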
Classification heads
[Figure: Both heads sit on top of the TDNN layers, a temporal pooling layer (mean, variance; ~3000 dim), and a 256-dim bottleneck. The PLDA softmax head applies LeakyReLU then PLDA + softmax; the meta-learning head applies length-norm then angular softmax over a subset of classes.]

PLDA Softmax
• Class weights (moving average)
• PLDA scoring for logits
• Use an internal classifier (PLDA)

Meta-learning Softmax
• Each minibatch contains pairs from a subset of classes
• One sample of each pair is used as the class weight
• Cosine scoring
Scoring
• External Gaussian PLDA classifier
  • Trained on x-vectors from concatenated segments of a video
  • Length-normalization applied to x-vectors
• Internal Gaussian PLDA classifier
  • Uses the parameters trained jointly with the x-vector extractor
  • Length-normalization applied to x-vectors
• Cosine scoring
  • Inner product of unit-length x-vectors
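The cosine-scoring backend is just the inner product of length-normalized x-vectors; a minimal sketch:

```python
import numpy as np

def cosine_score(xvec_a, xvec_b):
    """Pair-wise trial score: inner product of unit-length x-vectors."""
    a = xvec_a / np.linalg.norm(xvec_a)
    b = xvec_b / np.linalg.norm(xvec_b)
    return float(a @ b)

# identical direction -> score 1.0; the score lives in [-1, 1]
same = cosine_score(np.ones(256), np.ones(256))
```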
Data preparation
• Augmented VoxCeleb2-dev set for DNN training
  • 1 million original + 5 million augmented versions
  • Randomly pick augmentations from: reverb, noise, music, babble
  • Typical Kaldi setup
• VoxCeleb1-E (cleaned) as validation set
• 16 kHz processing
  • 80 Mel-filter-bank energies (20–7600 Hz)
  • 25 ms window every 10 ms
• No speech activity detection (SAD) used
  • We initially tried a SAD DNN (trained on open data) to differentiate between the OPEN and FIXED tracks, but it did not make any difference
DNN training in PyTorch
• Data parallelism (4 Nvidia 2080 RTX)
  • 128 samples per GPU (512 per mini-batch)
  • Batch norm was not synchronous (slightly faster)
• Batch construction
  • Augmented data is stored in Kaldi “arks” of 80 Mel-filter-bank energies
  • Uniformly sample (without replacement) a subset of classes
  • One audio sample is selected from each class
    • A 4-second chunk is extracted at a random offset
  • After all classes have been sampled, repeat until the end of training
• SGD with momentum optimizer
  • Learning rate fixed at 0.4 for 50K steps, then exponential decay every 10K steps
  • 150K steps to train the nets
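The batch-construction loop might be sketched like this (a sketch with illustrative names, not the authors' loader; 4 s corresponds to 400 frames at the 10 ms frame shift):

```python
import random

def batches(utt_frames_by_class, classes_per_batch, chunk=400, seed=0):
    """Yield batches of (class, utterance, frame offset): sample classes
    without replacement, pick one utterance per class, and cut a random
    fixed-length chunk (4 s = 400 frames at a 10 ms shift)."""
    rng = random.Random(seed)
    classes = list(utt_frames_by_class)
    rng.shuffle(classes)  # uniform sampling without replacement
    for i in range(0, len(classes) - classes_per_batch + 1, classes_per_batch):
        batch = []
        for c in classes[i:i + classes_per_batch]:
            utt, n_frames = rng.choice(utt_frames_by_class[c])
            offset = rng.randrange(0, n_frames - chunk + 1)  # random offset
            batch.append((c, utt, offset))
        yield batch
    # once every class has been sampled, the outer training loop
    # reshuffles and repeats until the step budget is exhausted

# toy example: 6 speakers, one 1000-frame utterance each, 2 classes per batch
data = {f"spk{k}": [(f"utt{k}", 1000)] for k in range(6)}
all_batches = list(batches(data, classes_per_batch=2))
```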
Results (VoxCeleb1-E): Size vs performance
• E-TDNN with AM-Softmax (m=0.3, s=30)

Layer size / Parameters (millions) | EER (%)
128 / 0.5                          | 3.28
256 / 1.5                          | 2.26
512 / 5                            | 1.77
1024 / 15.5                        | 1.61
2048 / 60                          | 1.51
Results (VoxCeleb1-E): Cosine vs PLDA classifier
• E-TDNN with AM-Softmax (m=0.3, s=30)

Layer size / Parameters (millions) | Cosine EER (%) | G-PLDA EER (%)
128 / 0.5                          | 3.28           | 3.08
256 / 1.5                          | 2.26           | 2.13
512 / 5                            | 1.77           | 1.73
1024 / 15.5                        | 1.61           | 1.61
2048 / 60                          | 1.51           | 1.53
Results: System comparison

System | Classification head      | Layer / Emb. dim | Params (millions) | Scoring | VoxCeleb1-E EER (%) | VoxSRC eval EER (%)
E-TDNN | AM-Softmax (m=0.3, s=30) | 1024/256         | 15.5              | Cosine  | 1.61                | 1.91
E-TDNN | AM-Softmax (m=0.1, s=50) | 1024/512         | 16                | G-PLDA  | 1.74                | 2.15
F-TDNN | Softmax                  | 1024/512         | 12.7              | G-PLDA  | 1.93                | 2.41
E-TDNN | PLDA-Softmax             | 1024/256         | 15.5              | G-PLDA  | 1.93                | 2.27
E-TDNN | Meta-learning Softmax    | 512/256          | 5                 | Cosine  | 2.43                | ----
Results: Fusion of 4 heterogeneous systems

System     | Classification head      | Layer / Emb. dim | Params (millions) | Scoring | VoxCeleb1-E EER (%) | VoxSRC eval EER (%)
E-TDNN     | AM-Softmax (m=0.3, s=30) | 1024/256         | 15.5              | Cosine  | 1.61                | 1.91
E-TDNN     | AM-Softmax (m=0.1, s=50) | 1024/512         | 16                | G-PLDA  | 1.74                | 2.15
F-TDNN     | Softmax                  | 1024/512         | 12.7              | G-PLDA  | 1.93                | 2.41
E-TDNN     | PLDA-Softmax             | 1024/256         | 15.5              | G-PLDA  | 1.93                | 2.27
Sum fusion | ----                     | ----             | 57                | ----    | 1.22                | 1.54
Results: Fusion vs single large DNN

System     | Classification head      | Layer / Emb. dim | Params (millions) | Scoring | VoxCeleb1-E EER (%) | VoxSRC eval EER (%)
E-TDNN     | AM-Softmax (m=0.3, s=30) | 2048/256         | 60                | Cosine  | 1.51                | 1.72
Sum fusion | ----                     | ----             | 57                | ----    | 1.22                | 1.54
• Fusing smaller heterogeneous systems outperforms a single large DNN with a similar number of parameters
Conclusions
• VoxSRC provides different challenges from those of recent NIST SREs
  • No explicit domain shift
  • Pair-wise comparison of relatively short segments
• We explored 2 x-vector topologies and 4 classification heads
  • Angular Softmax with a margin penalty produces strong systems that can use cosine distance for pair-wise comparisons
  • PLDA-Softmax was able to control its capacity and generalized well at test time
  • Our attempt at meta-learning was not very encouraging
• Fusion of 4 systems outperforms a single large network with a similar number of parameters