
  • JHU-HLTCOE system for VoxSRC 2019

    Daniel Garcia-Romero, Alan McCree, David Snyder, and Gregory Sell

    Human Language Technology Center of Excellence

    Johns Hopkins University

  • VoxSRC characteristics

    • Unlike recent NIST SRE setups:
      • No explicit domain shift
      • Pair-wise comparison of relatively short segments
        • VoxCeleb1-E validation set has a median segment duration of 7 seconds

    • These characteristics:
      • Minimize the need for an external classifier:
        • Not necessary to perform domain adaptation
        • No need to aggregate information from multiple long recordings (e.g., 5 minutes)
      • Open the door for end-to-end approaches:
        • DNNs that output direct pair-wise scores (trained with binary cross-entropy)
        • However, training is cumbersome (e.g., high sensitivity to how samples are harvested, and unstable convergence)

  • Our strategy

    • Train embedding extractors using a classification loss
      • Categorical/multi-class cross-entropy

    • Two modules:
      • Embedding extractor (x-vectors)
        • E-TDNN
        • F-TDNN
      • Classification head
        • Surrogate classification task to produce discriminative embeddings
        • We have explored 4 alternatives: Softmax, Angular-Margin Softmax, PLDA Softmax, Meta-learning Softmax

    [Diagram: embedding extractor feeding a classification head with per-speaker outputs s1 s2 s3 s4 … sN]

  • X-vector extractor

    [Diagram: x-vector extractor in three stages: (1) 1D conv layers (TDNN) over the input frames, (2) a temporal statistics layer (mean, variance) followed by an affine + LeakyReLU bottleneck, and (3) a classification head (affine + softmax) over speakers s1 s2 s3 s4 … sN. A minimal PyTorch sketch follows below.]

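    The three-stage pipeline above maps directly onto a few lines of PyTorch (the toolkit used for training later in these slides). This is a minimal sketch only: the layer widths, kernel sizes, dilations, and class count are illustrative placeholders, not the exact E-TDNN or F-TDNN configurations.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class XVectorExtractor(nn.Module):
          """Minimal x-vector sketch: a 1D-conv (TDNN) stack over frame-level
          features, temporal statistics pooling (mean + std), an affine
          bottleneck producing the embedding, and a softmax head."""

          def __init__(self, feat_dim=80, hidden=512, emb_dim=256, num_classes=5994):
              super().__init__()
              self.tdnn = nn.Sequential(
                  nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.LeakyReLU(),
                  nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.LeakyReLU(),
                  nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.LeakyReLU(),
                  nn.Conv1d(hidden, 1500, kernel_size=1), nn.LeakyReLU(),
              )
              # Pooling concatenates mean and std, giving ~3000 dims as in the slides.
              self.bottleneck = nn.Linear(2 * 1500, emb_dim)
              self.head = nn.Linear(emb_dim, num_classes)

          def forward(self, feats):                  # feats: (batch, frames, feat_dim)
              h = self.tdnn(feats.transpose(1, 2))   # (batch, 1500, frames')
              stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
              emb = self.bottleneck(stats)           # the x-vector
              logits = self.head(F.leaky_relu(emb))
              return emb, logits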

  • Extended-TDNN (E-TDNN)

  • Factorized-TDNN (F-TDNN)

  • Classification heads

    [Diagram: two heads on the shared extractor. Both share TDNN layers, a temporal pooling layer (mean, variance; ~3000 dim), and a 256-dim bottleneck. Left: LeakyReLU, then affine + softmax. Right: length-norm, then angular-margin softmax. Embeddings are 256-dim.]

    • Traditional Softmax:
      • Affine layer (class weights)
      • Inner products with the x-vector
      • Use external classifier (PLDA)

    • Angular Softmax with margin penalty (sketched below):
      • Length-normalized linear layer
      • Cosine scores with the x-vector as logits
      • Cosine scoring or PLDA
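    The angular-margin variant is compact enough to sketch. Below is a minimal additive-margin (AM) softmax head of the kind the results slides report (m=0.3, s=30); the exact loss details of the systems are not given here, so treat this as one standard realization.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class AMSoftmaxHead(nn.Module):
          """Additive-margin softmax sketch: length-normalize both embeddings
          and class weights so the logits are cosine similarities, subtract a
          margin m from the target-class cosine, and scale by s before
          cross-entropy."""

          def __init__(self, emb_dim=256, num_classes=5994, m=0.3, s=30.0):
              super().__init__()
              self.weight = nn.Parameter(torch.randn(num_classes, emb_dim) * 0.01)
              self.m, self.s = m, s

          def forward(self, emb, labels):
              # Rows of `cosine` hold each x-vector's cosine score vs. every class weight.
              cosine = F.normalize(emb) @ F.normalize(self.weight).t()
              margin = F.one_hot(labels, num_classes=cosine.size(1)) * self.m
              return F.cross_entropy(self.s * (cosine - margin), labels)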

  • Classification heads

    [Diagram: two more heads on the same shared extractor (TDNN layers, temporal pooling with mean and variance at ~3000 dim, 256-dim bottleneck). Left: LeakyReLU, then PLDA + softmax. Right: length-norm, then angular softmax over a subset of classes. Embeddings are 256-dim.]

    • PLDA Softmax:
      • Class weights (moving average)
      • PLDA scoring for logits
      • Use internal classifier (PLDA)

    • Meta-learning Softmax (sketched below):
      • Each minibatch contains pairs from a subset of classes
      • One of each pair is used as the class weight
      • Cosine scoring
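    A minimal sketch of the meta-learning idea as described above, under the assumption that each minibatch holds exactly two utterances per sampled class: one embedding of each pair acts as its class weight, and the other is classified against all of them by scaled cosine similarity. The scale s is an assumed hyperparameter.

      import torch
      import torch.nn.functional as F

      def meta_softmax_loss(emb_a, emb_b, s=30.0):
          """emb_a, emb_b: (K, dim) tensors where row i of each holds one of
          the two embeddings drawn from class i. emb_a supplies the class
          weights; emb_b is classified against them with cosine logits."""
          weights = F.normalize(emb_a)                      # one weight per class
          logits = s * (F.normalize(emb_b) @ weights.t())   # (K, K) cosine scores
          labels = torch.arange(emb_b.size(0), device=emb_b.device)
          return F.cross_entropy(logits, labels)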

  • Scoring

    • External Gaussian PLDA classifier
      • Trained on x-vectors from concatenated segments of a video
      • Length-normalization applied to x-vectors

    • Internal Gaussian PLDA classifier
      • Use the parameters trained jointly with the x-vector extractor
      • Length-normalization applied to x-vectors

    • Cosine scoring (see the sketch below)
      • Inner product of unit-length x-vectors
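    Both trial-scoring rules can be written down compactly. Below is a sketch: cosine scoring exactly as stated (inner product of unit-length x-vectors), plus a standard two-covariance Gaussian PLDA log-likelihood ratio as one way to realize the G-PLDA scoring; the between/within covariances B and W are assumed to have been estimated on centered, length-normalized training x-vectors.

      import torch
      from torch.distributions import MultivariateNormal

      def cosine_score(x, y):
          """Inner product of unit-length x-vectors."""
          return torch.dot(x / x.norm(), y / y.norm())

      def gplda_llr(x, y, B, W):
          """Two-covariance Gaussian PLDA LLR for a trial (x, y).
          Same-speaker hypothesis: the pair is jointly Gaussian with
          between-speaker cross-covariance B; different-speaker hypothesis:
          the cross-covariance is zero. B, W: (d, d) between- and
          within-speaker covariance estimates."""
          d = x.numel()
          z = torch.cat([x, y])
          same = torch.zeros(2 * d, 2 * d)
          same[:d, :d] = B + W
          same[d:, d:] = B + W
          same[:d, d:] = B
          same[d:, :d] = B
          diff = torch.zeros(2 * d, 2 * d)
          diff[:d, :d] = B + W
          diff[d:, d:] = B + W
          return (MultivariateNormal(torch.zeros(2 * d), same).log_prob(z)
                  - MultivariateNormal(torch.zeros(2 * d), diff).log_prob(z))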

  • Data preparation

    • Augmented VoxCeleb2-dev set for DNN training
      • 1 million original + 5 million augmented versions
      • Randomly pick augmentations from: reverb, noise, music, babble (typical Kaldi setup)

    • VoxCeleb1-E (cleaned) as validation set

    • 16 kHz processing (a front-end sketch follows below)
      • 80 Mel-filter bank energies (20-7600 Hz)
      • 25 ms window every 10 ms

    • No speech activity detection (SAD) used
      • Initially we attempted to use a SAD DNN (trained on open data) to differentiate between the OPEN and FIXED tracks, but it did not make any difference
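    A hedged sketch of that front-end using torchaudio's Kaldi-compatibility routine, matching the "typical Kaldi setup": 80 log Mel-filter bank energies over 20-7600 Hz, 25 ms windows every 10 ms, at 16 kHz. The file name is a placeholder.

      import torchaudio
      import torchaudio.compliance.kaldi as kaldi

      waveform, sr = torchaudio.load("utterance.wav")  # assumes 16 kHz mono audio
      feats = kaldi.fbank(
          waveform,
          num_mel_bins=80,          # 80 Mel-filter bank energies
          frame_length=25.0,        # 25 ms window ...
          frame_shift=10.0,         # ... every 10 ms
          low_freq=20.0,            # 20-7600 Hz band
          high_freq=7600.0,
          sample_frequency=16000.0,
      )
      # feats: (num_frames, 80) log Mel-filter bank energies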

  • DNN training in PyTorch

    • Data parallelism (4 Nvidia RTX 2080)
      • 128 samples per GPU (512 mini-batch)
      • Batch-norm was not synchronous (slightly faster)

    • Batch construction (sketched below)
      • Augmented data is stored in Kaldi "arks" of 80 Mel-filter bank energies
      • Uniformly sample (without replacement) a subset of classes
      • An audio sample from each class is selected; a 4 s chunk is extracted at a random offset
      • After all classes have been sampled, we repeat until the end of training

    • SGD with momentum optimizer (schedule sketched below)
      • Learning rate of 0.4, fixed for 50K steps, then exponential decay every 10K steps
      • 150K steps to train the nets
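    A minimal sketch of both pieces. The decay factor gamma is an assumption (the slides only say "exponential decay"), and the sampler assumes an in-memory map from class to utterance IDs; extracting the 4 s chunk at a random offset is left to the feature loader.

      import random

      def learning_rate(step, base_lr=0.4, hold=50_000, interval=10_000, gamma=0.5):
          """Hold base_lr for the first 50K steps, then multiply by gamma
          every further 10K steps. gamma=0.5 is an assumed decay factor."""
          if step < hold:
              return base_lr
          return base_lr * gamma ** ((step - hold) // interval + 1)

      def epoch_batches(utts_by_class, classes_per_batch=512):
          """Yield batches of utterance IDs: shuffle the class list once
          (uniform sampling without replacement), take one random utterance
          per class, and move on until every class has been visited."""
          classes = list(utts_by_class)
          random.shuffle(classes)
          for i in range(0, len(classes), classes_per_batch):
              yield [random.choice(utts_by_class[c])
                     for c in classes[i:i + classes_per_batch]]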

  • Results (VoxCeleb1-E): Size vs performance

    • E-TDNN with AM-Softmax (m=0.3, s=30)

    [Bar chart reproduced as a table: EER (%) vs. layer size / parameters in millions]

      Layer size / Params (M)    EER (%)
      128  / 0.5                 3.28
      256  / 1.5                 2.26
      512  / 5                   1.77
      1024 / 15.5                1.61
      2048 / 60                  1.51

  • Results (VoxCeleb1-E): Cosine vs PLDA classifier

    • E-TDNN with AM-Softmax (m=0.3, s=30)

    [Bar chart reproduced as a table: EER (%) vs. layer size / parameters in millions, cosine vs. G-PLDA scoring]

      Layer size / Params (M)    Cosine EER (%)    G-PLDA EER (%)
      128  / 0.5                 3.28              3.08
      256  / 1.5                 2.26              2.13
      512  / 5                   1.77              1.73
      1024 / 15.5                1.61              1.61
      2048 / 60                  1.51              1.53

  • Results: System comparison

    System   Classification head        Layer/Emb. dim   Params (M)   Scoring   VoxCeleb1-E EER (%)   VoxSRC eval EER (%)
    E-TDNN   AM-Softmax (m=0.3, s=30)   1024/256         15.5         Cosine    1.61                  1.91
    E-TDNN   AM-Softmax (m=0.1, s=50)   1024/512         16           G-PLDA    1.74                  2.15
    F-TDNN   Softmax                    1024/512         12.7         G-PLDA    1.93                  2.41
    E-TDNN   PLDA-Softmax               1024/256         15.5         G-PLDA    1.93                  2.27
    E-TDNN   Meta-learning Softmax      512/256          5            Cosine    2.43                  ----

  • Results: Fusion of 4 heterogeneous systems

    System       Classification head        Layer/Emb. dim   Params (M)   Scoring   VoxCeleb1-E EER (%)   VoxSRC eval EER (%)
    E-TDNN       AM-Softmax (m=0.3, s=30)   1024/256         15.5         Cosine    1.61                  1.91
    E-TDNN       AM-Softmax (m=0.1, s=50)   1024/512         16           G-PLDA    1.74                  2.15
    F-TDNN       Softmax                    1024/512         12.7         G-PLDA    1.93                  2.41
    E-TDNN       PLDA-Softmax               1024/256         15.5         G-PLDA    1.93                  2.27
    Sum fusion   ----                       ----             57           ----      1.22                  1.54

  • Results: Fusion vs single large DNN

    System       Classification head        Layer/Emb. dim   Params (M)   Scoring   VoxCeleb1-E EER (%)   VoxSRC eval EER (%)
    E-TDNN       AM-Softmax (m=0.3, s=30)   2048/256         60           Cosine    1.51                  1.72
    Sum fusion   ----                       ----             57           ----      1.22                  1.54

    • Fusing smaller heterogeneous systems outperforms a large DNN with a similar number of parameters (fusion sketched below)
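    The sum fusion above is just an equal-weight, score-level combination; a trivial sketch, assuming the four systems' score arrays are aligned by trial:

      import numpy as np

      def sum_fusion(system_scores):
          """Equal-weight score-level fusion: element-wise sum of per-trial
          scores, one array per system, aligned by trial index."""
          return np.sum(np.stack(system_scores), axis=0)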

  • Conclusions

    • VoxSRC poses different challenges from those of recent NIST SREs
      • No explicit domain shift
      • Pair-wise comparison of relatively short segments

    • We explored 2 x-vector topologies and 4 classification heads
      • Angular Softmax with a margin penalty produces strong systems that can use cosine distance for pair-wise comparisons
      • PLDA-Softmax was able to control its capacity and generalized well at test time
      • Our attempt using meta-learning was not very encouraging

    • Fusion of 4 systems outperforms a single large network with a similar number of parameters