DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic...

22
NVIDIA NEMO DU-09886-001_v0.11 | July 2020 Best Practices Guide

Transcript of DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic...

Page 1: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

NVIDIA NEMO

DU-09886-001_v0.11 | July 2020

Best Practices Guide

Page 2: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | ii

TABLE OF CONTENTS

Chapter 1. Overview............................................................................................ 1Chapter 2. Why NeMo?..........................................................................................2Chapter 3. Training or Transfer learning workflow using NeMo........................................ 3Chapter 4. Using Kaldi Formatted Data..................................................................... 6Chapter 5. Using speech command recognition task for ASR models................................. 7Chapter 6. Training models for Mandarin Chinese language............................................ 8Chapter 7. Natural Language Processing (NLP) Fine Tuning BERT....................................10Chapter 8. Developing A Dialog State Tracking Model.................................................. 12Chapter 9. Efficient Training with NeMo.................................................................. 13

9.1. Using Mixed Precision...................................................................................139.2. Multi-GPU training.......................................................................................13

Chapter 10. Exporting models............................................................................... 15Chapter 11. Tips/Tricks/Areas for Optimization and FAQs............................................. 16Chapter 12. Resources.........................................................................................19

Page 3: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 1

Chapter 1.OVERVIEW

This Best Practices guide is intended for researchers and model developers to learn howto efficiently develop and train speech and language models using NeMo Toolkit. Thetoolkit is available open source as well as a docker container on ngc. This guide makesan assumption that the user has already installed using the getting started instructions.

The conversational AI pipeline consists of three major stages: Automatic SpeechRecognition (ASR), Natural Language Processing (NLP) or Natural LanguageUnderstanding (NLU), and Text-to-Speech (TTS) or voice synthesis. As you talk to acomputer, the ASR phase converts the audio signal into text, the NLP stage interpretsthe question and generates a smart response, and finally the TTS phase converts the textinto speech signals to generate audio for the user. The toolkit enables development andtraining of deep learning models involved in conversational AI and easily chain themtogether.

Page 4: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 2

Chapter 2.WHY NEMO?

NeMo Best Practices Guide

Deep learning model development for Conversational AI is complex, it involvesdefining, building and training several models in specific domains; experimentingseveral times to get high accuracy, fine tuning on multiple tasks and domain specificdata, ensuring training performance and making sure the models are ready fordeployment to inference applications.

Neural Modules are logical blocks of AI applications which take some typed inputs andproduce certain typed outputs. By separating a model into its essential components in abuilding block manner, NeMo helps researchers develop state of the art accuracy modelsfor domain specific data faster and easier.

Collections of modules for core tasks as well as specific to speech recognition, naturallanguage, speech synthesis help develop modular, flexible and reusable pipelines.

A Neural Module’s inputs/outputs have a Neural Type, that describes the semantics, theaxis order and meaning, and the dimensions of the input/output tensors. This typingallows Neural Modules to be safely chained together to build models for applications.

NeMo can be used to train new models or perform transfer learning on existingpretrained models. Pretrained weights per module(such as encoder, decoder) helpaccelerate model training for domain specific data.

ASR, NLP and TTS pretrained models are trained on multiple datasets(including somelanguages such as mandarin) and optimized for high accuracy. They can be used forTransfer learning.

NeMo supports developing models that work with Mandarin Chinese data. Tutorialshelp users train or fine tune models for conversational AI with the mandarin chineselanguage. The export method provided in NeMo makes it easy to transform a trainedmodel into inference ready format for deployment.

A key area of development in the toolkit is interoperability with other tools used byspeech researchers. Data layer for Kaldi compatibility is one such example.

Page 5: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 3

Chapter 3.TRAINING OR TRANSFER LEARNINGWORKFLOW USING NEMO

Software repository at ngc.nvidia.com contains models developed with NeMo trainedon multiple datasets. If you would like to transfer, learn and fine tune these modelsfor your domain specific data, follow the tutorial but download the model fromngc.nvidia.com beforehand.

The table below shows all pretrained models available to use. For BERT, use the modelweights provided for fine tuning specific tasks. If your domain specific data is large,use pretraining. Jasper, Quartznet base model pretrained weights are per module so useencoder or decoder weights as needed.

Models such as dialog state tracking and recognizing speech commands can be trainedusing tutorials as well, these are more experimental models.

For an easy to follow guide on transfer learning and building domain specific ASRmodels, you can follow this blog.

ASR/NLP/TTS Pretrained Model Datasets used fortraining

Tutorial

ASR Jasper 10x5 Librispeech https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/1_ASR_tutorial_using_NeMo.ipynb

ASR Multi-dataset Jasper10x5

LibriSpeech, MozillaCommon Voice, WSJ,Fisher, and Switchboard

https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/1_ASR_tutorial_using_NeMo.ipynb

ASR AI-shell2 Jasper 10x5 AI-shell2 Mandarinchinese

https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/1_ASR_tutorial_using_NeMo.ipynb

ASR Quartznet Librispeech with speedperturbation

Speech RecognitionTutorial

Page 6: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

Training or Transfer learning workflow using NeMo

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 4

ASR QuartzNetLibrispeechMCV Librispeech, Mozillacommon voice

Speech RecognitionTutorial

Jupyter Notebook

ASR Multi-dataset Quartznet LibriSpeech, MozillaCommon Voice, WSJ,Fisher, and Switchboard

Speech RecognitionTutorial

Jupyter Notebook

ASR WSJ-Quartznet Wall streetjournal,Librispeech,Mozilla common voice

Speech RecognitionTutorial

Jupyter Notebook

ASR AI-shell2 Quartznet AI-shell2 Mandarinchinese

Speech RecognitionTutorial

ASR Speech CommandsRecognition V1 Model3x1x1

Speech Command v1 (30classes) dataset

Speech CommandsTutorial

ASR Speech CommandsRecognition V1 Model3x2x1

Speech Command v1 (30classes) dataset

Speech CommandsTutorial

ASR Speech CommandsRecognition V2Model3x2x1

Speech Command v2 (35classes) dataset

Speech CommandsTutorial

ASR Speech CommandsRecognition V2Model3x1x1

Speech Command v2 (35classes) dataset

Speech CommandsTutorial

NLP BertLargeUncased Uncased Wikipediaand Bookcorpus on asequence length 512

Natural LanguageProcessing Tutorial

NLP BertBaseCased Wikipedia andBookcorpus on asequence length 512

Natural LanguageProcessing Tutorial

NLP https://ngc.nvidia.com/catalog/models/nvidia:bertbaseuncasedfornemo

Uncased Wikipediaand Bookcorpus on asequence length 512

Natural LanguageProcessing Tutorial

NLP BertBaseUncasedSquad1 Uncased questionanswering datasetSQuADv1.1

Question AnsweringTutorial

NLP BertBaseUncasedSquad2 Uncased questionanswering datasetSQuADv2.0

Question AnsweringTutorial

NLP BertLargeUncasedSquad1 Uncased questionanswering datasetSQuADv1.1

Question AnsweringTutorial

NLP BertLargeUncasedSquad2 Uncased questionanswering datasetSQuADv2.0

Question AnsweringTutorial

NLP TRADE Dialog StateTracking Model1

MultiWOZ 2.0 dataset Dialog State TrackingTutorial

Page 7: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

Training or Transfer learning workflow using NeMo

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 5

NLP TRADE Dialog StateTracking

Model2

MultiWOZ 2.1 dataset Dialog State TrackingTutorial

NMT Transformer-Big WikiText-2 Natural LanguageProcessing Tutorial

TTS Tacotron2 LJSpeech Speech SynthesisTutorial

TTS Waveglow LJSpeech Speech SynthesisTutorial

TTS FastSpeech LJSpeech Fastspeech Tutorial

Page 8: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 6

Chapter 4.USING KALDI FORMATTED DATA

The Kaldi Speech Recognition Toolkit project began in 2009 at Johns Hopkins University.It is a toolkit written in C++. If researchers have used Kaldi and have datasets that areformatted to be used with the toolkit; they can use NeMo to develop models based onthat data.

To load Kaldi-formatted data, you can simply use the KaldiFeatureDataLayer insteadof the AudioToTextDataLayer. The KaldiFeatureDataLayer takes in the argumentkaldi_dir instead of a manifest_filepath, and this argument should be set tothe directory that contains the files feats.scp and text. See the Kaldi Compatibilitysection of the NVIDIA Neural Modules Developer Guide for more information.

Page 9: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 7

Chapter 5.USING SPEECH COMMAND RECOGNITIONTASK FOR ASR MODELS

Speech Command Recognition is the task of classifying an input audio pattern into a setof discrete classes. It is a subset of Automatic Speech Recognition, sometimes referredto as ”Key Word Spotting”, in which a model is constantly analyzing speech patterns todetect certain "action" classes.

Upon detection of these “commands”, a specific action can be taken. An example Jupyternotebook provided in NeMo shows how to train a Quartznet model with a modifieddecoder head trained on a speech commands dataset.

It is preferred that you use absolute paths to data_dir when preprocessing thedataset.

Page 10: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 8

Chapter 6.TRAINING MODELS FOR MANDARINCHINESE LANGUAGE

Mandarin Chinese is a tonal language, it uses the pitch of a phoneme sound todistinguish word meaning. For the purpose of speech recognition, it is considered acharacter based language and an evaluation metric called CER(character error rate) isused to measure the accuracy of a deep learning model.

In NeMo, the speech recognition models Jasper and Quartznet were trained on a datasetknown as AISHELL-2. AISHELL-2 is a 1000-hour Mandarin Chinese Speech Corpus.The speech utterance contains 12 domains, including keywords, voice command, smarthome, autonomous driving, industrial production, etc.

To train models, use the tutorials provided in Chinese documentation.

If you would like to use the pretrained weights for fine tuning, download from ngcbefore starting to run the tutorial.

‣ AI-shell2 Jasper 10x5: jasper10x5dr.yaml file contains information about thestructure of Jasper model.

‣ AI-shell2 Quartznet: quartznet15x5.yaml file contains information about thestructure of Quartznet model.

‣ The character error rate (CER) accuracy achieved by the models is documented inthe overview.md file.

Dataset processing scripts are provided to easily prepare the datasets. For NaturalLanguage Processing, BERT Pretraining script has been modified by adding tokenizerselection to allow training on Chinese.

Pretrained weights for NLP chinese models have not been provided. For SpeechSynthesis and Text to Speech, data processing and training scripts are providedto easily build and train Tacotron2 on mandarin chinese dataset from data baker.

Page 11: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

Training models for Mandarin Chinese language

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 9

Waveglow is a universal vocoder and could be used with both English and Chinesedatasets.

Pretrained weights for TTS models are currently not provided.

The following table shows all the scripts available for mandarin chinese ASR, NLP andTTS model training

Script Description

get_aishell_data.py Download and process aishell1 dataset tomanifests

process_aishell2_data.py Process aishell2 dataset to manifests

jasper_aishell.py Train mandarin manifests for Jasper or Quartznet

jasper_aishell_infer.py Infer from mandarin manifests

https://github.com/NVIDIA/NeMo/blob/master/examples/nlp/language_modeling/bert_pretraining.py

Added tokenizer selection to allow training onChinese data

asr_postprocessor.py Post processing for ASR tested on asr generateddata samples

language_modeling_transformer.py Transformer language model tested on wiki_zhdataset

token_classification.py Named Entity Recognition tested on people’s dailydataset

.pytext_classification_with_bert Sentence classification tested on THUC NEWSdataset

machine_translation_tutorial.py Added tokenizer auto-select strategy for differentsrc and target language

process_wiki_zh.py Process the wiki_zh dataset

get_data_baker.py Download open source dataset & create manifests

tacotron_mandarin.yaml Configuration file for mandarin corpus

Page 12: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 10

Chapter 7.NATURAL LANGUAGE PROCESSING (NLP)FINE TUNING BERT

NeMo Best Practices Guide

BERT, or Bidirectional Encoder Representations from Transformers, is a neural approachto pre-train language representations which obtains near state-of-the-art resultson a wide array of Natural Language Processing (NLP) tasks, including the GLUEBenchmark and SQuAD Question Answering dataset.

BERT model checkpoints (BERT-large-uncased, BERT-base-uncased) provided can beused for either fine tuning BERT on your custom dataset, or fine tuning downstreamtasks, including GLUE benchmark tasks, question answering tasks e.g. SQuAD, jointintent and slot detection, punctuation and capitalization, named entity recognition, andspeech recognition post processing model to correct mistakes.

The following table shows all tasks supported with the BERT model.

Almost all NLP examples also support RoBERTa and AlBERT models fordownstream tasks fine tuning (see the list of all supported models by callingnemo.collections.nlp.nm.trainables.get_bert_models_list()). The user just needsto specify the name of the model desired while running the example scripts. Thecolumns with the github links point to the scripts with all these models supported inthe master branch.

Task Name Task Description Link to github

Intent Detection and Slot Filling Extract intent of the utteranceand relevant slots.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/intent_detection_slot_tagging

Text Classification Assigning a category to a text.For example, this task could beused to categorize sentencesbased on their sentiment.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/text_classification

Language Modelling The task to train a transformerbased language model.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/language_modeling

Page 13: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

Natural Language Processing (NLP) Fine Tuning BERT

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 11

Neural Machine Translation Translation of a text from asource language to the targetlanguage.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/neural_machine_translation.

Question Answering Supports SQuAD dataset. Thegoal of the task is to find asegment of text from thecorresponding reading passagethat answers the question.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/question_answering

Name Entity Recognition (NER) Detect and classify namedentities in a given text.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/token_classification

Punctuation and Capitalization Predict punctuation andcapitalization for each word in asentence, for example, in outputof end-to-end speech recognitionmodel.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/token_classification

GLUE Benchmark https://gluebenchmark.com/

Collection of NLP tasks fortraining and evaluation ofnatural language understandingsystems.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/glue_benchmark

State Tracking for Task-orientedDialogue Systems

Tracks the status of the ongoingconversation between a systemand a user.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/dialogue_state_tracking

ASR Post Processing with BERT Corrects mistakes in output ofend-to-end speech recognitionmodel.

https://github.com/NVIDIA/NeMo/tree/master/examples/nlp/asr_postprocessor

Page 14: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 12

Chapter 8.DEVELOPING A DIALOG STATE TRACKINGMODEL

A Dialog state tracking model tracks status of an ongoing conversation; typically asequence of utterances exchanged between dialog participants. This is accomplishedby capturing user goals and intentions and encoding them as a set of slots alongwith the corresponding values. A tutorial included in NeMo shows how to train theTRAnsferable Dialogue statE generator (TRADE) model on a MultiWoz dataset. TheMulti-Domain Wizard-of-Oz dataset (MultiWOZ) is a collection of human-to-humanconversations spanning over 7 distinct domains and containing over 10,000 dialogues.The advantage of NeMo implementation of this model is that it achieves better accuracythan the original implementation.

Page 15: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 13

Chapter 9.EFFICIENT TRAINING WITH NEMO

9.1. Using Mixed PrecisionMixed precision accelerates training speed while protecting against noticeableloss. TensorCores is a specific hardware unit that comes with the Volta and Turingarchitectures to accelerate large matrix to matrix multiply-add operations by operatingthem on half precision inputs and returning the result in full precision. Neural networkswhich usually use massive matrix multiplications can be significantly sped up withmixed precision and tensorcores. However, some neural network layers are numericallymore sensitive than others. Apex AMP is a NVIDIA library that maximizes the benefit ofmixed precision and TensorCore usage for a given network.

9.2. Multi-GPU training‣ Why is multi-GPU training preferred over other types of training?

Multi GPU training can reduce the total training time by distributing the workloadonto multiple compute instances. This is particularly important for large neuralnetworks which would otherwise take weeks to train until convergence. Since NeMosupports multi GPU training, no code change is needed to move from single tomulti-GPU training, only a slight change in your launch command is required.

‣ What are the advantages of mixed precision training?

Mixed precision accelerated training speed while protecting against noticeable lossin precision. TensorCores is a specific hardware unit that comes with the Volta andTuring architectures to accelerate large matrix multiply-add operations by operatingon half precision inputs and returning the result in full precision in order to preventloss in precision. Neural networks which usually use massive matrix multiplicationscan be significantly sped up with mixed precision and tensorcores. However, someneural network layers are numerically more sensitive than others. Apex AMP isa NVIDIA library that maximizes the benefit of mixed precision and TensorCoreusage for a given network.

Page 16: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

Efficient Training with NeMo

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 14

‣ What is the difference between multi-GPU and multi-node training?

Multi-node is an abstraction of multi-gpu training, which requires a distributedcompute cluster, where each node can have multiple GPUs. Multi-node training isneeded to scale training beyond a single node to large amounts of GPUs.

From the framework perspective nothing changes from moving to multi-nodetraining. However, a master address and port need to be set up for internodecommunication. Multi-GPU training will then be launched on each node with theseinformation passed. You might also consider the underlying inter node networktopology and type to achieve full performance, such as HPC-style hardware such asNVLink, InfiniBandnetworking, or Ethernet.

Page 17: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 15

Chapter 10.EXPORTING MODELS

NeMo Best Practices Guide

Modules fine-tuned or trained in NeMo can be exported for efficient deployment invariety of formats. The nemo.core.neural_factory.NeuralModuleFactory contains APIdeployment_export(module, output: str, d_format: nemo.core.neural_factory.DeploymentFormat, input_example=None, output_example=None)

which handles export. Currently, it provides several deployment format targets:

1. ONNX 2. TORCHSCRIPT 3. TRTONNX 4. PYTORCH 5. JARVIS

The recommended format to export your modules is ONNX. However, not all modulesmay support export to all formats. "PYTORCH" target will work for all modules and willsimply save the module's weights in a Pytorch checkpoint.

For core ASR models such as Jasper and QuartzNet all 4 formats will work. Examplescripts for export to Jarvis ASR service could be found under the scripts folder in theNeMo repository.

Here is an example of a “NeMo Model”:https://github.com/NVIDIA/NeMo/blob/v0.11.0/examples/asr/QuartzNetModel.ipynb For ASR, specifically, A NeMoModel is a kind ofNeuralModule which contains other neural modules inside it. NeMoModel can haveother NeuralModules inside and their mode, and topology of connections can dependon the mode in which NeMo model is used (training or evaluation).Exporting to .nemofile greatly simplifies the experience of deployment with Jarvis.

Page 18: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 16

Chapter 11.TIPS/TRICKS/AREAS FOR OPTIMIZATIONAND FAQS

Are there areas where performance can be increased?

‣ You should try using mixed precision for improved performance. Note that typicallywhen using mixed precision memory consumption is decreased and larger batchsizes could be used to further improve the performance.

‣ When fine-tuning ASR models on your data it is almost always possible to takeadvantage of NeMo's pre-trained modules. Even if you have different targetvocabulary, or even different language - you can still try starting with pre-trainedweights from Jasper/QuartzNet *encoder* and only adjust *decoder* for your needs.

What is the recommended sampling rate for ASR?

‣ -The released models are based on 16 KHz audio, so use the models with 16 KHzaudio. Reduced performance should be expected for any audio that is upsampledfrom a sampling frequency less than 16 KHz data.

How do we use this toolkit for audio with different type of compression andfrequency than the training domain for ASR?

‣ You have to match the compression and frequency.

How to replace the 6-gram out of the ASR model with custom language model? 2)what is the language format supported in NeMo?

‣ NeMo’s Beam Search decoder with LM neural module supports KenLM languagemodel.

‣ This Jupyter notebook tutorial under model improvements section shows usageof language model.

‣ Instructions on using KenLM:https://nvidia.github.io/NeMo/asr/tutorial.html#using-kenlm.

‣ You should re-train the KenLM language model on your own dataset. To do so,there are two options: see KenLM’s documentation OR look through this scriptto get an idea https://github.com/NVIDIA/NeMo/blob/master/scripts/build_6-gram_OpenSLR_lm.sh and then adjust to use your own dataset.

Page 19: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

Tips/Tricks/Areas for Optimization and FAQs

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 17

‣ If you want to use a different language model (other than KenLM); basically youwill need to implement a corresponding decoder module.

‣ Transformer-XL example is present in OS2S, it would need to be updated to workwith NeMo, here is the code:https://github.com/NVIDIA/OpenSeq2Seq/tree/master/external_lm_rescore.

How do I use Text to speech(TTS)?

‣ Obtain speech data ideally at 22050 Hz, or alternatively at a higher sample rate andthen downsampled to 22050 Hz

‣ If less than 22050 Hz and above 16000 Hz

‣ Retrain waveglow on your own dataset‣ Tweak the spectrogram generation parameters, namely the window_size

and the window_stride for their Fourier Transforms‣ For below 16000 Hz, look into obtaining new data

‣ In terms of bitrate / quantization, the general advice is the higher the better. We havenot experimented enough to state how much this impacts quality

‣ For amount of data, again the more the better, and the more diverse in terms ofphonemes the better

‣ Aim for around 20 hours of speech after filtering for silences and non-speechaudio

‣ Most open speech datasets are in ~10 second format so training Tacotron 2 on audiothat on the order of 10s - 20s per sample is known to work

‣ Additionally the longer the speech samples, the more difficult it will be to trainTacotron 2

‣ Audio files should also be clean.

‣ There should be little background noise or music.‣ Data recorded from a studio mic is likely to be easier to train compared to data

captured using a phone.‣ How to make sure pronunciation of words is correct?

‣ The technical challenge is related to dataset, text to phonetic spelling is required,use phonetic alphabet(notation) that has the name correctly pronounced.

‣ How should Tacotron2 be trained, what are some example parameters

‣ Tacotron2 requires a single speaker dataset‣ Use AMP level O0‣ Trim long silences in the beginning and end‣ optimizer="adam"‣ beta1 = 0.9‣ beta2 = 0.999‣ lr=0.001 (constant)‣ amp_opt_level="O0"‣ weight_decay=1e-6‣ batch_size=48 (per GPU)‣ trim_silence=True

Page 20: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

Tips/Tricks/Areas for Optimization and FAQs

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 18

Page 21: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

www.nvidia.comNVIDIA NeMo DU-09886-001_v0.11 | 19

Chapter 12.RESOURCES

Here is a list of resources:

Developer Blogs

‣ https://devblogs.nvidia.com/how-to-build-domain-specific-automatic-speech-recognition-models-on-gpus/

‣ https://devblogs.nvidia.com/develop-smaller-speech-recognition-models-with-nvidias-nemo-framework/

‣ https://devblogs.nvidia.com/neural-modules-for-speech-language-models/

Jupyter Notebooks

‣ https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/1_ASR_tutorial_using_NeMo.ipynb

‣ https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/2_Online_ASR_Microphone_Demo.ipynb

‣ https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/3_Speech_Commands_using_NeMo.ipynb

‣ https://github.com/NVIDIA/NeMo/blob/master/examples/nlp/language_modeling/BERTPretrainingTutorial.ipynb

‣ https://github.com/NVIDIA/NeMo/blob/master/examples/nlp/text_classification/text_classification_with_bert.ipynb

‣ https://github.com/NVIDIA/NeMo/blob/master/examples/nlp/token_classification/NERWithBERT.ipynb

‣ https://github.com/NVIDIA/NeMo/blob/master/examples/nlp/token_classification/PunctuationWithBERT.ipynb

Domain specific Transfer learning docker container with Jupyter notebooks:

https://ngc.nvidia.com/catalog/containers/nvidia:nemo_asr_app_img

Page 22: DU-09886-001 v0.10 | April 2020 NVIDIA NEMO Best Practices ... · It is a subset of Automatic Speech Recognition, sometimes referred ... used to measure the accuracy of a deep learning

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION

REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED,

STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY

DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A

PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever,

NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall

be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED,

MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE,

AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A

SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE

(INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER

LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS

FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR

IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for

any specified use without further testing or modification. Testing of all parameters of each product is not

necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and

fit for the application planned by customer and to do the necessary testing for the application in order

to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect

the quality and reliability of the NVIDIA product and may result in additional or different conditions and/

or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any

default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA

product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license,

either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information

in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without

alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, DGX Station,

GRID, Jetson, Kepler, NVIDIA GPU Cloud, Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, Tesla and Volta are

trademarks and/or registered trademarks of NVIDIA Corporation in the Unites States and other countries.

Other company and product names may be trademarks of the respective companies with which they are

associated.

Copyright

© 2020 NVIDIA Corporation. All rights reserved.

www.nvidia.com