JHU-HLTCOE system for VoxSRC 2019
Daniel Garcia-Romero, Alan McCree, David Snyder, and Gregory Sell
Human Language Technology Center Of Excellence
Johns Hopkins University
VoxSRC characteristics
• Unlike recent NIST SRE setups:
  • No explicit domain shift
  • Pair-wise comparison of relatively short segments
    • VoxCeleb1-E validation set has a median segment duration of 7 seconds
• These characteristics:
  • Minimize the need for an external classifier:
    • Not necessary to perform domain adaptation
    • No need to aggregate information from multiple long recordings (e.g., 5 minutes)
  • Open the door for end-to-end approaches:
    • DNNs that output direct pair-wise scores (trained with binary cross-entropy)
    • However, training is cumbersome (e.g., high sensitivity to sample harvesting and unstable convergence)
Our strategy
• Train embedding extractors using a classification loss
  • Categorical/multi-class cross-entropy
• Two modules:
  • Embedding extractor (x-vectors)
    • E-TDNN
    • F-TDNN
  • Classification head
    • Surrogate classification task to produce discriminative embeddings
    • We explored 4 alternatives: Softmax, Angular-Margin Softmax, PLDA Softmax, Meta-learning Softmax
[Figure: System overview. An x-vector extractor (1D conv layers (TDNN), a temporal statistics layer computing mean and variance, and a bottleneck) maps the input to a fixed-length embedding; a classification head maps the embedding to per-speaker scores s1 … sN.]
Three stages
[Figure: The x-vector extractor operates in three stages: (1) a 1D-conv stack (TDNN) maps the input feature sequence, layer by layer, to frame-level representations over time; (2) a temporal statistics layer pools these over time (mean, variance); (3) an affine + LeakyReLU bottleneck produces the embedding, followed by an affine + softmax layer yielding class scores s1 … sN. Two TDNN topologies were explored: Extended-TDNN (E-TDNN) and Factorized-TDNN (F-TDNN).]
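The temporal statistics pooling layer (mean, variance) can be sketched as follows; a minimal NumPy version, assuming frame-level TDNN outputs arrive as a (T, D) array (the slides say variance; Kaldi-style recipes typically concatenate mean and standard deviation, which is what this sketch does):

```python
import numpy as np

def stats_pooling(frames):
    """Pool frame-level TDNN outputs (T x D) into a fixed-length vector
    by concatenating the per-dimension mean and standard deviation."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# e.g. 400 frames of 1500-dim TDNN outputs -> 3000-dim pooled vector,
# matching the ~3000-dim input to the bottleneck in the slides
pooled = stats_pooling(np.random.randn(400, 1500))
```

Because the pooled vector has a fixed size regardless of T, the layers after it can be ordinary affine layers.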
Classification heads
[Figure: Both heads sit on top of the TDNN layers, a temporal pooling layer (mean, variance; ~3000 dim), and a 256-dim bottleneck. The traditional softmax head applies LeakyReLU then affine + softmax; the angular-margin head applies length-norm then angular-margin softmax, both over 256-dim embeddings.]

Traditional Softmax
• Affine layer (class weights)
• Inner products with the x-vector
• Use an external classifier (PLDA)

Angular Softmax with margin penalty
• Length-normalized linear layer
• Cosine scores with the x-vector as logits
• Cosine scoring or PLDA
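A minimal NumPy sketch of the angular-margin softmax logits (names and shapes are illustrative, not the authors' code): length-normalizing both the embeddings and the class weights turns inner products into cosines, the margin `m` is subtracted from the target-class cosine during training, and `s` rescales the logits before the softmax.

```python
import numpy as np

def am_softmax_logits(emb, weights, labels=None, m=0.3, s=30.0):
    """Cosine logits with an additive margin penalty on the target class.

    emb:     (batch, dim) embeddings
    weights: (dim, n_classes) class-weight matrix
    labels:  (batch,) target class indices (training only)
    """
    # length-normalize embeddings and class weights -> inner product = cosine
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    weights = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = emb @ weights
    if labels is not None:
        # penalize the target-class cosine by the margin m
        cos[np.arange(len(labels)), labels] -= m
    return s * cos  # scaled logits, fed to softmax + cross-entropy

logits = am_softmax_logits(np.random.randn(8, 256),
                           np.random.randn(256, 100),
                           labels=np.zeros(8, dtype=int))
```

Because the loss already operates on cosines, the resulting embeddings can be compared at test time with plain cosine scoring, as the slides note.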
Classification heads
[Figure: Both heads sit on top of the TDNN layers, a temporal pooling layer (mean, variance; ~3000 dim), and a 256-dim bottleneck. The PLDA softmax head applies LeakyReLU then PLDA + softmax; the meta-learning head applies length-norm then angular softmax over a subset of classes.]

PLDA Softmax
• Class weights (moving average)
• PLDA scoring for logits
• Use an internal classifier (PLDA)

Meta-learning Softmax
• Each minibatch contains pairs from a subset of classes
• One sample of each pair is used as the class weight
• Cosine scoring
Scoring
• External Gaussian PLDA classifier
  • Trained on x-vectors from concatenated segments of a video
  • Length-normalization applied to x-vectors
• Internal Gaussian PLDA classifier
  • Uses the parameters trained jointly with the x-vector extractor
  • Length-normalization applied to x-vectors
• Cosine scoring
  • Inner product of unit-length x-vectors
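The cosine-scoring backend is just the inner product of length-normalized x-vectors; a minimal sketch:

```python
import numpy as np

def cosine_score(xvec_a, xvec_b):
    """Pair-wise trial score: inner product of unit-length x-vectors."""
    a = xvec_a / np.linalg.norm(xvec_a)
    b = xvec_b / np.linalg.norm(xvec_b)
    return float(a @ b)

# identical direction -> score 1.0; the score lives in [-1, 1]
same = cosine_score(np.ones(256), np.ones(256))
```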
Data preparation
• Augmented VoxCeleb2-dev set for DNN training
  • 1 million original + 5 million augmented versions
  • Randomly pick augmentations from: reverb, noise, music, babble
  • Typical Kaldi setup
• VoxCeleb1-E (cleaned) as validation set
• 16 kHz processing
  • 80 Mel-filter-bank energies (20–7600 Hz)
  • 25 ms window every 10 ms
• No speech activity detection (SAD) used
  • We initially tried a SAD DNN (trained on open data) to differentiate between the OPEN and FIXED tracks, but it did not make any difference
DNN training in PyTorch
• Data parallelism (4 Nvidia 2080 RTX)
  • 128 samples per GPU (512 per mini-batch)
  • Batch norm was not synchronous (slightly faster)
• Batch construction
  • Augmented data is stored in Kaldi “arks” of 80 Mel-filter-bank energies
  • Uniformly sample (without replacement) a subset of classes
  • One audio sample is selected from each class
    • A 4-second chunk is extracted at a random offset
  • After all classes have been sampled, repeat until the end of training
• SGD with momentum optimizer
  • Learning rate fixed at 0.4 for 50K steps, then exponential decay every 10K steps
  • 150K steps to train the nets
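The batch-construction loop might be sketched like this (a sketch with illustrative names, not the authors' loader; 4 s corresponds to 400 frames at the 10 ms frame shift):

```python
import random

def batches(utt_frames_by_class, classes_per_batch, chunk=400, seed=0):
    """Yield batches of (class, utterance, frame offset): sample classes
    without replacement, pick one utterance per class, and cut a random
    fixed-length chunk (4 s = 400 frames at a 10 ms shift)."""
    rng = random.Random(seed)
    classes = list(utt_frames_by_class)
    rng.shuffle(classes)  # uniform sampling without replacement
    for i in range(0, len(classes) - classes_per_batch + 1, classes_per_batch):
        batch = []
        for c in classes[i:i + classes_per_batch]:
            utt, n_frames = rng.choice(utt_frames_by_class[c])
            offset = rng.randrange(0, n_frames - chunk + 1)  # random offset
            batch.append((c, utt, offset))
        yield batch
    # once every class has been sampled, the outer training loop
    # reshuffles and repeats until the step budget is exhausted

# toy example: 6 speakers, one 1000-frame utterance each, 2 classes per batch
data = {f"spk{k}": [(f"utt{k}", 1000)] for k in range(6)}
all_batches = list(batches(data, classes_per_batch=2))
```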
Results (VoxCeleb1-E): Size vs performance
• E-TDNN with AM-Softmax (m=0.3, s=30)

Layer size / Parameters (millions) | EER (%)
128 / 0.5                          | 3.28
256 / 1.5                          | 2.26
512 / 5                            | 1.77
1024 / 15.5                        | 1.61
2048 / 60                          | 1.51
Results (VoxCeleb1-E): Cosine vs PLDA classifier
• E-TDNN with AM-Softmax (m=0.3, s=30)

Layer size / Parameters (millions) | Cosine EER (%) | G-PLDA EER (%)
128 / 0.5                          | 3.28           | 3.08
256 / 1.5                          | 2.26           | 2.13
512 / 5                            | 1.77           | 1.73
1024 / 15.5                        | 1.61           | 1.61
2048 / 60                          | 1.51           | 1.53
Results: System comparison

System | Classification head      | Layer / Emb. dim | Params (millions) | Scoring | VoxCeleb1-E EER (%) | VoxSRC eval EER (%)
E-TDNN | AM-Softmax (m=0.3, s=30) | 1024/256         | 15.5              | Cosine  | 1.61                | 1.91
E-TDNN | AM-Softmax (m=0.1, s=50) | 1024/512         | 16                | G-PLDA  | 1.74                | 2.15
F-TDNN | Softmax                  | 1024/512         | 12.7              | G-PLDA  | 1.93                | 2.41
E-TDNN | PLDA-Softmax             | 1024/256         | 15.5              | G-PLDA  | 1.93                | 2.27
E-TDNN | Meta-learning Softmax    | 512/256          | 5                 | Cosine  | 2.43                | ----
Results: Fusion of 4 heterogeneous systems

System     | Classification head      | Layer / Emb. dim | Params (millions) | Scoring | VoxCeleb1-E EER (%) | VoxSRC eval EER (%)
E-TDNN     | AM-Softmax (m=0.3, s=30) | 1024/256         | 15.5              | Cosine  | 1.61                | 1.91
E-TDNN     | AM-Softmax (m=0.1, s=50) | 1024/512         | 16                | G-PLDA  | 1.74                | 2.15
F-TDNN     | Softmax                  | 1024/512         | 12.7              | G-PLDA  | 1.93                | 2.41
E-TDNN     | PLDA-Softmax             | 1024/256         | 15.5              | G-PLDA  | 1.93                | 2.27
Sum fusion | ----                     | ----             | 57                | ----    | 1.22                | 1.54
Results: Fusion vs single large DNN

System     | Classification head      | Layer / Emb. dim | Params (millions) | Scoring | VoxCeleb1-E EER (%) | VoxSRC eval EER (%)
E-TDNN     | AM-Softmax (m=0.3, s=30) | 2048/256         | 60                | Cosine  | 1.51                | 1.72
Sum fusion | ----                     | ----             | 57                | ----    | 1.22                | 1.54
• Fusing smaller heterogeneous systems outperforms a single large DNN with a similar number of parameters
Conclusions
• VoxSRC provides different challenges from those of recent NIST SREs
  • No explicit domain shift
  • Pair-wise comparison of relatively short segments
• We explored 2 x-vector topologies and 4 classification heads
  • Angular Softmax with a margin penalty produces strong systems that can use cosine distance for pair-wise comparisons
  • PLDA-Softmax was able to control its capacity and generalized well at test time
  • Our attempt at meta-learning was not very encouraging
• Fusion of 4 systems outperforms a single large network with a similar number of parameters